python

January 11, 2018
 

The SAS® platform is now open to be accessed from open-source clients such as Python, Lua, Java, the R language, and REST APIs to leverage the capabilities of SAS® Viya® products and solutions. You can analyze your data in a cloud-enabled environment that handles large amounts of data in a variety of different formats. To find out more about SAS Viya, see the “SAS Viya: What's in it for me? The user.” article.

This blog post focuses on the openness of SAS® 9.4 and discusses clients to SAS such as the SASPy package, the SAS kernel for Jupyter Notebook, and more. Note: This blog post is relevant for all maintenance releases of SAS 9.4.

SASPy

The SASPy package enables you to connect to SAS 9.4 and run your analysis using the object-oriented methods and objects of the Python language, as well as the Python magic methods. SASPy translates those objects and methods into SAS code before executing it. To use SASPy, you must have SAS 9.4 and Python 3.x.
Note: SASPy is an open-source project that encourages your contributions.

After you have completed the installation and configuration of SASPy, you can import the SASPy package as demonstrated below:
Note: I used Jupyter Notebook to run the examples in this blog post.

1.   Import the SASPy package:


2.   Start a new session. The sas object is created as a result of starting a SAS session using a locally installed version of SAS under Microsoft Windows. After this session is successfully established, the following note is generated:

Adding Data

Now that the SAS session is started, you need to add some data to analyze. This example uses SASPy to read a CSV file that provides census data based on the ZIP Codes in Los Angeles County and create a SASdata object named tabl:

To view the attributes of this SASdata object named tabl, use Python's print() function, which shows the libref and the SAS data set name. It also shows that results are returned as Pandas, which is the default result output for tables.

Using Methods to Display and Analyze Data

This section provides some examples of how to use different methods to interact with SAS data via SASPy.

Head() Method

After loading the data, you can look at the first few records of the ZIP Code data, which is easy using the familiar head() method in Python. This example uses the head() method on the SASdata object tabl to display the first five records. The output is shown below:

Describe() Method

After verifying that the data is what you expected, you can now analyze the data. To generate a simple summary of the data, use the Python describe() method in conjunction with the index [1:3]. This combination generates a summary of all the numeric fields within the table and displays only the second and third records. The subscript works only when the result is set to Pandas and does not work if set to HTML or Text, which are also valid options.

Teach_me_SAS() Method

The SAS code generated from the object-oriented Python syntax can also be displayed using SASPy with the teach_me_SAS() method. When you pass the Boolean value True to this method, the SAS code is displayed without being executed:

ColumnInfo() Method

In the next cell, use the columnInfo() method to display the information about each variable in the SAS data set. Note: The SAS code is displayed rather than executed because teach_me_SAS() was set to True in the previous section:

Submit() Method

Then, use the submit() method to execute the PROC CONTENTS code that is displayed in the cell above directly from Python. The submit() method returns a dictionary with two keys, LST and LOG. The LST key contains the results, and the LOG key contains the SAS log. The results are returned as HTML, so the HTML package is imported to display them.
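The steps in this section can be gathered into one rough sketch. It is only an illustration under several assumptions: the CSV path is supplied by the caller, the table name "tabl" and the configuration name "winlocal" are placeholders, and the function needs a working SAS 9.4 connection, so it is defined here but not called.

```python
def saspy_walkthrough(csv_path, cfgname="winlocal"):
    """Sketch of the SASPy calls described above (requires a SAS 9.4 session)."""
    # Imported lazily so that defining the function does not require SASPy.
    import saspy
    from IPython.display import HTML

    # Start a session; cfgname must match an entry in your SASPy configuration.
    sas = saspy.SASsession(cfgname=cfgname)

    # Read the CSV file into a SAS data set and get a SASdata object back.
    # The table name "tabl" is a placeholder chosen for this sketch.
    tabl = sas.read_csv(csv_path, table="tabl")
    print(tabl)                  # libref, table name, and result format

    print(tabl.head())           # first five records
    print(tabl.describe()[1:3])  # second and third rows of the summary

    # Display the generated SAS code instead of running it.
    sas.teach_me_SAS(True)
    tabl.columnInfo()
    sas.teach_me_SAS(False)

    # Run SAS code directly; the result is a dict with 'LST' and 'LOG' keys.
    result = sas.submit("proc contents data=work.tabl; run;")
    return HTML(result["LST"]), result["LOG"]
```

In a notebook, calling saspy_walkthrough with the path to your own CSV file would return the rendered listing and the SAS log as a pair.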

The SAS Kernel Using Jupyter Notebook

Jupyter Notebook can run programs in various programming languages including SAS when you install and configure the SAS kernel. Using the SAS kernel is another way to run SAS interactively using a web-based program, which also enables you to save the analysis in a notebook. See the links above for details about installation and configuration of the SAS kernel. To verify that the SAS kernel installed successfully, you can run the following code: jupyter kernelspec list

From the command line, use the following command to start Jupyter Notebook: jupyter notebook. The screenshot below shows the Jupyter Notebook session that starts when you run the command. To execute SAS syntax from Jupyter Notebook, select SAS from the New drop-down list as shown below:

You can add SAS code to a cell in Jupyter Notebook and execute it. The following code adds a PRINT procedure and a SGPLOT procedure. The output is in HTML5 by default. However, you can specify a different output format if needed.

You can also use magics in the cell such as the %%python magic even though you are using the SAS kernel. You can do this for any kernel that you have installed.

Other SAS Goodness

SAS can interact with other languages in more ways as well. For example, you can use the GROOVY procedure to run Groovy statements on the Java Virtual Machine (JVM). You can also use the LUA procedure to run Lua code from SAS, along with the ability to call most SAS functions from Lua. For more information, see “Using Lua within your SAS programs.” Another very powerful feature is the DATA step JavaObject, which provides the ability to instantiate Java classes and access fields and methods. The DATA step JavaObject has been available since SAS® 9.2.

Resources

SASPy Documentation

Introducing SASPy: Use Python code to access SAS

Come on in, we're open: The openness of SAS® 9.4 was published on SAS Users.

December 22, 2017
 
In keras, we can visualize activation functions' geometric properties using backend functions over layers of a model.

We all know the exact form of popular activation functions such as 'sigmoid', 'tanh', 'relu', etc., and we can feed data to these functions to directly obtain their output. But how can we do that via keras without explicitly specifying their functional forms?

This can be done following the four steps below:

1. Define a simple MLP model with one-dimensional input data, a one-neuron dense layer as the hidden layer, and an output layer with a 'linear' activation function for one neuron.
2. Extract the layers' outputs from the model (fitted or not) by iterating through model.layers.
3. Use the backend function K.function() to obtain the calculated output for given input data.
4. Feed the desired data to the above functions to obtain the output from the appropriate activation function.

The code below is a demo:




from keras.layers import Dense
from keras.models import Sequential
import keras.backend as K
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# The font settings below follow http://blog.csdn.net/rumswell/article/details/6544377
# and are only needed when plot labels contain Chinese text.
plt.rcParams['font.sans-serif'] = ['SimHei']  # set the default font
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign from rendering as a box


def NNmodel(activationFunc='linear'):
    '''
    Define a neural network model. To define a different model, modify this function directly.
    '''
    if activationFunc in ('softplus', 'sigmoid'):
        winit = 'lecun_uniform'
    elif activationFunc == 'hard_sigmoid':
        winit = 'lecun_normal'
    else:
        winit = 'he_uniform'
    model = Sequential()
    model.add(Dense(1, input_shape=(1,), activation=activationFunc,
                    kernel_initializer=winit,
                    name='Hidden'))
    model.add(Dense(1, activation='linear', name='Output'))
    model.compile(loss='mse', optimizer='sgd')
    return model


def VisualActivation(activationFunc='relu', plot=True):
    # 100 evenly spaced inputs from -5.0 to 4.9
    x = (np.arange(100) - 50) / 10

    model = NNmodel(activationFunc=activationFunc)

    # Build a backend function that maps the model input to the hidden layer's output
    inX = model.input
    outputs = [layer.output for layer in model.layers if layer.name == 'Hidden']
    functions = [K.function([inX], [out]) for out in outputs]

    layer_outs = [func([x.reshape(-1, 1)]) for func in functions]
    activationLayer = layer_outs[0][0]

    activationDf = pd.DataFrame(activationLayer)
    result = pd.concat([pd.DataFrame(x), activationDf], axis=1)
    result.columns = ['X', 'Activated']
    result.set_index('X', inplace=True)
    if plot:
        # Use the activation name as the plot title
        result.plot(title=activationFunc)

    return result


# Now we can visualize them (assuming default settings):
actFuncs = ['linear', 'softmax', 'sigmoid', 'tanh', 'softsign', 'hard_sigmoid', 'softplus', 'selu', 'elu']

figure = plt.figure()
for i, f in enumerate(actFuncs):
    # Draw each activation function in its own subplot
    figure.add_subplot(3, 3, i + 1)
    out = VisualActivation(activationFunc=f, plot=False)
    plt.plot(out.index, out.Activated)
    plt.title('Activation function: ' + f)

This figure is the output from above code. As we can see, the geometric property of each activation function is well captured.
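For comparison, a few of these functions can also be evaluated directly from their closed forms, without keras. The sketch below uses only the standard library; the helper names and the ACTIVATIONS dictionary are illustrative choices, not part of keras. Note that the keras version passes x through a randomly initialized weight before the activation, so its curves can be stretched or mirrored relative to these.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def softplus(x):
    """Softplus: log(1 + e^x)."""
    return math.log1p(math.exp(x))


def softsign(x):
    """Softsign: x / (1 + |x|)."""
    return x / (1.0 + abs(x))


# Illustrative lookup table from name to function.
ACTIVATIONS = {
    "sigmoid": sigmoid,
    "tanh": math.tanh,
    "relu": relu,
    "softplus": softplus,
    "softsign": softsign,
}

# Same input grid as the keras demo: 100 points from -5.0 to 4.9.
xs = [(i - 50) / 10 for i in range(100)]
curves = {name: [fn(x) for x in xs] for name, fn in ACTIVATIONS.items()}

for name in ACTIVATIONS:
    print(name, round(curves[name][0], 4), round(curves[name][-1], 4))
```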

 Posted at 4:44 PM
June 29, 2017
 

One of the big benefits of the SAS Viya platform is how approachable it is for programmers of other languages. You don't have to learn SAS in order to become productive quickly. We've seen a lot of interest from people who code in Python, maybe because that language has become known for its application in machine learning. SAS has a new product called SAS Visual Data Mining and Machine Learning. And these days, you can't offer such a product without also offering something special to those Python enthusiasts.

Introducing Python SWAT

And so, SAS has published the Python SWAT project (where "SWAT" stands for the SAS Scripting Wrapper for Analytics Transfer). The project is a Python code library that SAS released using an open source model. That means that you can download it for free, make changes locally, and even contribute those changes back to the community (as some developers have already done!). You'll find it at github.com/sassoftware/python-swat.

SAS developer Kevin Smith is the main contributor on Python SWAT, and he's a big fan of Python. He's also an expert in SAS and in many programming languages. If you're a SAS user, you probably run Kevin's code every day; he was an original developer on the SAS Output Delivery System (ODS). Now he's a member of the cloud analytics team in SAS R&D. (He's also the author of more than a few conference papers and SAS books.)

Kevin enjoys the dynamic, fluid style that a scripting language like Python affords - versus the more formal "code-compile-build-execute" model of a compiled language. Watch this video (about 14 minutes) in which Kevin talks about what he likes in Python, and shows off how Python SWAT can drive SAS' machine learning capabilities.

New -- but familiar -- syntax for Python coders

The analytics engine behind the SAS Viya platform is called CAS, or SAS Cloud Analytic Services. You'll want to learn that term, because "CAS" is used throughout the SAS documentation and APIs. And while CAS might be new to you, the Python approach to CAS should feel very familiar for users of Python libraries, especially users of pandas, the Python Data Analysis Library.

CAS and SAS' Python SWAT extend these concepts to provide intuitive, high-performance analytics from SAS Viya in your favorite Python environment, whether that's a Jupyter notebook or a simple console. Watch the video to see Kevin's demo and discussion about how to get started. You'll learn:

  • How to connect your Python session to the CAS server
  • How to upload data from your client to the CAS server
  • How SWAT extends the concept of the DataFrame API in pandas to leverage CAS capabilities
  • How to coax CAS to provide descriptive statistics about your data, and then go beyond what's built into the traditional DataFrame methods.
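As a rough sketch of what those four steps can look like in code: the host, port, credentials, and CSV path below are parameters you would supply yourself, and the function assumes the swat package is installed and that the CAS server's "simple" action set (which provides the summary action) is available. It is defined here but not called, because it needs a running CAS server.

```python
def swat_walkthrough(host, port, username, password, csv_path):
    """Sketch of a basic Python SWAT session (requires a running CAS server)."""
    # Imported lazily so that defining the function does not require swat.
    import swat

    # 1. Connect the Python session to the CAS server.
    conn = swat.CAS(host, port, username, password)
    try:
        # 2. Upload a client-side file; the result is a CASTable object.
        tbl = conn.upload_file(csv_path)

        # 3. CASTable mirrors much of the pandas DataFrame API.
        print(tbl.head())
        print(tbl.describe())

        # 4. Go beyond the DataFrame methods by calling a CAS action directly
        #    (assumes the "simple" action set is loaded on the server).
        print(tbl.summary())
    finally:
        conn.close()
```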

Learn more about SAS Viya and Python

There are plenty of helpful resources to help you learn about using Python with SAS Viya:

And finally, what if you don't have SAS Viya yet, but you're interested in using Python with SAS 9.4? Check out the SASPy project, which allows you to access your traditional SAS features from a Jupyter notebook or Python console. It's another popular open source project from SAS R&D.

The post Using Python to work with SAS Viya and CAS appeared first on The SAS Dummy.

April 9, 2017
 

Thanks to a new open source project from SAS, Python coders can now bring the power of SAS into their Python scripts. The project is SASPy, and it's available on the SAS Software GitHub. It works with SAS 9.4 and higher, and requires Python 3.x.

I spoke with Jared Dean about the SASPy project. Jared is a Principal Data Scientist at SAS and one of the lead developers on SASPy and a related project called Pipefitter. Here's a video of our conversation, which includes an interactive demo. Jared is obviously pretty excited about the whole thing.

Use SAS like a Python coder

SASPy brings a "Python-ic" sensibility to this approach for using SAS. That means that all of your access to SAS data and methods is surfaced using objects and syntax that are familiar to Python users. This includes the ability to exchange data via pandas, the ubiquitous Python data analysis framework. And even the native SAS objects are accessed in a very "pandas-like" way.

import saspy
import pandas as pd
sas = saspy.SASsession(cfgname='winlocal')
cars = sas.sasdata("CARS","SASHELP")
cars.describe()

The output is what you expect from pandas...but with statistics that SAS users are accustomed to. PROC MEANS anyone?

In[3]: cars.describe()
Out[3]: 
       Variable Label    N  NMiss   Median          Mean        StdDev  
0         MSRP     .   428      0  27635.0  32774.855140  19431.716674   
1      Invoice     .   428      0  25294.5  30014.700935  17642.117750   
2   EngineSize     .   428      0      3.0      3.196729      1.108595   
3    Cylinders     .   426      2      6.0      5.807512      1.558443   
4   Horsepower     .   428      0    210.0    215.885514     71.836032   
5     MPG_City     .   428      0     19.0     20.060748      5.238218   
6  MPG_Highway     .   428      0     26.0     26.843458      5.741201   
7       Weight     .   428      0   3474.5   3577.953271    758.983215   
8    Wheelbase     .   428      0    107.0    108.154206      8.311813   
9       Length     .   428      0    187.0    186.362150     14.357991   

       Min       P25      P50      P75       Max  
0  10280.0  20329.50  27635.0  39215.0  192465.0  
1   9875.0  18851.00  25294.5  35732.5  173560.0  
2      1.3      2.35      3.0      3.9       8.3  
3      3.0      4.00      6.0      6.0      12.0  
4     73.0    165.00    210.0    255.0     500.0  
5     10.0     17.00     19.0     21.5      60.0  
6     12.0     24.00     26.0     29.0      66.0  
7   1850.0   3103.00   3474.5   3978.5    7190.0  
8     89.0    103.00    107.0    112.0     144.0  
9    143.0    178.00    187.0    194.0     238.0  

SASPy also provides high-level Python objects for the most popular and powerful SAS procedures. These are organized by SAS product, such as SAS/STAT, SAS/ETS and so on. To explore, issue a dir() command on your SAS session object. In this example, I've created a sasstat object and I used dot<TAB> to list the available SAS analyses:

SAS/STAT object in SASPy

The SAS Pipefitter project extends the SASPy project by providing access to advanced analytics and machine learning algorithms. In our video interview, Jared presents a cool example of a decision tree applied to the passenger survival factors on the Titanic. It's powered by PROC HPSPLIT behind the scenes, but Python users don't need to know all of that "inside baseball."

Installing SASPy and getting started

Like most things Python, installing the SASPy package is simple. You can use the pip installation manager to fetch the latest version:

pip install saspy

However, since you need to connect to a SAS session to get to the SAS goodness, you will need some additional files to broker that connection. Most notably, you need a few Java jar files that SAS provides. You can find these in the SAS Deployment Manager folder for your SAS installation:
../deploywiz/sas.svc.connection.jar
../deploywiz/log4j.jar
../deploywiz/sas.security.sspi.jar
../deploywiz/sas.core.jar

The jar files are compatible between Windows and Unix, so if you find them in a Unix SAS install you can still copy them to your Python Windows client. You'll need to modify the sascfg.py file (installed with the SASPy package) to point to where you've stashed these. If using local SAS on Windows, you also need to make sure that the sspiauth.dll is in your Windows system PATH. The easiest method is to add SASHOME\SASFoundation\9.4\core\sasexe to your system PATH variable.

All of this is documented in the "Installation and Configuration" section of the project documentation. The connectivity options support an impressively diverse set of SAS configs: Windows, Unix, SAS Grid Computing, and even SAS on the mainframe!

Download, comment, contribute

SASPy is an open source project, and all of the Python code is available for your inspection and improvement. The developers at SAS welcome you to give it a try and enter issues when you see something that needs to be improved. And if you're a hotshot Python coder, feel free to fork the project and issue a pull request with your suggested changes!

The post Introducing SASPy: Use Python code to access SAS appeared first on The SAS Dummy.

April 2, 2017
 

python ngram


# -*- coding: utf-8 -*-
# @DATE    : 2017/4/1 10:39
# @Author  : 
# @File    : ngram.py
from collections import defaultdict


def gen_n_gram(input, sep=" ", n=2):
    input = input.split(sep)
    output = {}
    for i in range(len(input) - n + 1):
        gram = "".join(input[i: i + n])
        output.setdefault(gram, 0)
        output[gram] += 1
    return output


def dict_sum(*dict):
    ret = defaultdict(int)
    for d in dict:
        for k, v in d.items():
            ret[k] += v
    return ret


def sum_n_gram(inputs, sep=" ", n=2):
    output_sum = defaultdict(int)
    for input in inputs:
        # Pass sep and n through so the caller's settings are honored
        output_sum = dict_sum(output_sum, gen_n_gram(input, sep=sep, n=n))
    output_sum = sorted(output_sum.items(), key=lambda x: x[1], reverse=True)
    return output_sum


if __name__ == "__main__":
    inputs = ["a a a j 9 3 h d e", "a j 9 3 h", "g g h 9 3"]
    print(gen_n_gram("a a a j 9 3 h d e"))
    output = sum_n_gram(inputs)
    print(output)
    output_file = "dict.txt"
    cnt = len(output)
    with open(output_file, "w") as out:
        for i, value in enumerate(output):
            if i + 1 < cnt:
                out.write("{}:{}\n".format(value[0], value[1]))
            else:
                out.write("{}:{}".format(value[0], value[1]))

Run log


{'aa': 2, 'de': 1, 'j9': 1, 'aj': 1, '3h': 1, '93': 1, 'hd': 1}
[('93', 3), ('aa', 2), ('aj', 2), ('j9', 2), ('3h', 2), ('de', 1), ('gg', 1), ('h9', 1), ('hd', 1), ('gh', 1)]

Process finished with exit code 0
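The same counting can be sketched more compactly with collections.Counter, which handles both the per-string tally and the merge across strings. This is an alternative sketch, not the script above; the function names count_ngrams and count_ngrams_many are made up for this example.

```python
from collections import Counter


def count_ngrams(text, sep=" ", n=2):
    """Count n-grams of sep-separated tokens, joined without a separator."""
    tokens = text.split(sep)
    return Counter("".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def count_ngrams_many(texts, sep=" ", n=2):
    """Merge the counts of several strings and rank them, most frequent first."""
    total = Counter()
    for text in texts:
        total.update(count_ngrams(text, sep=sep, n=n))
    return total.most_common()


inputs = ["a a a j 9 3 h d e", "a j 9 3 h", "g g h 9 3"]
ranked = count_ngrams_many(inputs)
print(ranked)
```

The counts match the run log above; the order among n-grams with equal counts may differ, because most_common() keeps ties in first-seen order.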

 
 Posted at 12:32 PM
February 10, 2017
 

# -*- coding: utf-8 -*-
# @DATE : 2017/2/10 10:47
# @File : collection_usage.py

import collections

# Counter initialization
print(collections.Counter(["a", "b", "a", "b", "a", "c"]))
print(collections.Counter({"a": 2, "b": 3, "c": 1}))
print(collections.Counter(a=2, b=3, c=1))

c = collections.Counter()
print(c)
c.update("abababc")
print(c)
c.update({"a": 1, "d": 5})
print(c)

# counter access
c = collections.Counter("ababac")
for letter in "abcde":
    print("{}: {}".format(letter, c[letter]))

# elements
c = collections.Counter("extremly")
print(c)
print(list(c.elements()))

# most common
c = collections.Counter()
with open("WordFilter.py", "r") as f:
    for line in f:
        c.update(line.strip().replace(" ", "").lower())

for letter, count in c.most_common(10):
    print("{}: {}".format(letter, count))

# arithmetic operations
c1 = collections.Counter(["a", "b", "c", "a", "b", "b"])
c2 = collections.Counter("alphabet")

print(c1)
print(c2)
print(c1 + c2)
print(c1 - c2)
print(c1 & c2)
print(c1 | c2)


# ordered dict
d = {}
d["a"] = "A"
d["b"] = "B"
d["c"] = "C"
d["d"] = "D"
d["e"] = "E"
for k, v in d.items():
    print("{}, {}".format(k, v))

print("Ordered Dict")
d = collections.OrderedDict()
d["a"] = "A"
d["b"] = "B"
d["c"] = "C"
d["d"] = "D"
d["e"] = "E"
for k, v in d.items():
    print("{}, {}".format(k, v))




Counter({'a': 3, 'b': 2, 'c': 1})
Counter({'b': 3, 'a': 2, 'c': 1})
Counter({'b': 3, 'a': 2, 'c': 1})
Counter()
Counter({'a': 3, 'b': 3, 'c': 1})
Counter({'d': 5, 'a': 4, 'b': 3, 'c': 1})
a: 3
b: 2
c: 1
d: 0
e: 0
Counter({'e': 2, 'm': 1, 'l': 1, 'r': 1, 't': 1, 'y': 1, 'x': 1})
['e', 'e', 'm', 'l', 'r', 't', 'y', 'x']
e: 58
r: 56
t: 55
i: 48
o: 48
s: 48
d: 39
f: 36
n: 35
l: 33
Counter({'b': 3, 'a': 2, 'c': 1})
Counter({'a': 2, 'b': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})
Counter({'a': 4, 'b': 4, 'c': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})
Counter({'b': 2, 'c': 1})
Counter({'a': 2, 'b': 1})
Counter({'b': 3, 'a': 2, 'c': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})
a, A
c, C
b, B
e, E
d, D
Ordered Dict
a, A
b, B
c, C
d, D
e, E
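The scrambled order of the plain dict in the output above reflects the Python version that produced it. Since Python 3.7, plain dicts keep insertion order as well, so the remaining differences are in behavior such as order-sensitive equality and reordering methods. A small sketch, assuming Python 3.7 or later:

```python
from collections import OrderedDict

plain = {}
ordered = OrderedDict()
for key in "abcde":
    plain[key] = key.upper()
    ordered[key] = key.upper()

# On Python 3.7+ both iterate in insertion order.
print(list(plain))
print(list(ordered))

# OrderedDict equality is order-sensitive; plain dict equality is not.
same_items_a = [("a", 1), ("b", 2)]
same_items_b = [("b", 2), ("a", 1)]
print(OrderedDict(same_items_a) == OrderedDict(same_items_b))
print(dict(same_items_a) == dict(same_items_b))

# OrderedDict can also reorder in place.
ordered.move_to_end("a")
print(list(ordered))
```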

 
 Posted at 9:13 PM
November 17, 2016
 



Implementing an ordered dictionary with OrderedDict


import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
from collections import OrderedDict
import pandas as pd

data = StringIO("""title;date
Event0;2016-01-03
Event1;2016-02-28
Event2;2016-06-19
Event3;2016-04-17
Event4;2015-11-12
""")

df = pd.read_csv(data, sep=";")
d = df.set_index("title").date.to_dict()
print(d)

od = OrderedDict(sorted(d.items(), key=lambda x: x[1], reverse=True))
for k, v in od.items():
    print("k: {0},v: {1}".format(k, v))

{'Event4': '2015-11-12', 'Event2': '2016-06-19', 'Event3': '2016-04-17', 'Event0': '2016-01-03', 'Event1': '2016-02-28'}
k: Event2,v: 2016-06-19
k: Event3,v: 2016-04-17
k: Event1,v: 2016-02-28
k: Event0,v: 2016-01-03
k: Event4,v: 2015-11-12

Process finished with exit code 0
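Sorting the date strings directly works here only because they are in ISO year-month-day form, where alphabetical order and chronological order coincide. For other formats the values need to be parsed first. A minimal sketch with made-up day/month/year data (the events dictionary and the parse helper are hypothetical):

```python
from collections import OrderedDict
from datetime import datetime

# Hypothetical events with day/month/year strings, which do not sort
# chronologically as plain text.
events = {
    "Event0": "03/01/2016",
    "Event1": "28/02/2016",
    "Event4": "12/11/2015",
}


def parse(value):
    """Turn a day/month/year string into a date object."""
    return datetime.strptime(value, "%d/%m/%Y").date()


# Sort by the parsed date, newest first, and keep that order.
by_date = OrderedDict(
    sorted(events.items(), key=lambda item: parse(item[1]), reverse=True)
)
for title, day in by_date.items():
    print("k: {0},v: {1}".format(title, day))
```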

 
 Posted at 9:27 PM
November 16, 2016
 


import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd


data = StringIO("""year;month;clickSource
2010;01;google, yahoo, google, google, facebook, facebook
2010;02;facebook, yahoo, google, google, facebook, facebook
2010;03;yahoo, yahoo, google, google, facebook, facebook
2010;04;google, yahoo, google, twitter, facebook, facebook
2010;05;facebook, yahoo, google, google, facebook, facebook
2010;06;twitter, yahoo, google, twitter, facebook, google
""")

df = pd.read_csv(data, sep=";")
print(df.shape)
print(df)

df_new = df.set_index(['year', 'month']).clickSource.str.split(', ').apply(pd.value_counts).fillna(0).astype(int).reset_index()
print(df_new)

(6, 3)
   year  month                                        clickSource
0  2010      1  google, yahoo, google, google, facebook, facebook
1  2010      2  facebook, yahoo, google, google, facebook, fac...
2  2010      3   yahoo, yahoo, google, google, facebook, facebook
3  2010      4  google, yahoo, google, twitter, facebook, face...
4  2010      5  facebook, yahoo, google, google, facebook, fac...
5  2010      6  twitter, yahoo, google, twitter, facebook, google
   year  month  facebook  google  twitter  yahoo
0  2010      1         2       3        0      1
1  2010      2         3       2        0      1
2  2010      3         2       2        0      2
3  2010      4         2       2        1      1
4  2010      5         3       2        0      1
5  2010      6         1       2        2      1

Process finished with exit code 0
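To make the one-line pandas chain easier to follow, here is a sketch of the same idea using only the standard library: split each clickSource string, count the pieces per row, and fill in zero for sources that do not appear. Only two of the six rows are reproduced, and the variable names are illustrative.

```python
from collections import Counter

# Two of the rows from the data above: (year, month, clickSource).
rows = [
    (2010, 1, "google, yahoo, google, google, facebook, facebook"),
    (2010, 6, "twitter, yahoo, google, twitter, facebook, google"),
]

# Every distinct source becomes a column, in alphabetical order.
sources = sorted({name for _, _, clicks in rows for name in clicks.split(", ")})

table = []
for year, month, clicks in rows:
    counts = Counter(clicks.split(", "))
    # A Counter returns 0 for missing keys, which plays the role of fillna(0).
    table.append([year, month] + [counts[name] for name in sources])

print(["year", "month"] + sources)
for line in table:
    print(line)
```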


 
 Posted at 8:13 PM
October 20, 2016
 

The study of social networks has gained importance over the years within social and behavioral research on HIV and AIDS. Social network research can show routes of potential viral transfer, and be used to understand the influence of peer norms and practices on the risk behaviors of individuals. This example analyzes the […]

Analyzing social networks using Python and SAS Viya was published on SAS Voices.

October 12, 2016
 

Recently, one of my sons came to me and asked about something called “The Monty Hall Paradox.” They had discussed it in school and he was having a hard time understanding it (as you often do with paradoxes).

For those of you who may not be familiar with the Monty Hall Paradox, it is named for the host of a popular TV game show called “Let’s Make a Deal.” On the show, a contestant would be selected and shown a valuable prize. Monty Hall would then explain that the prize was located just behind one of three doors and ask the contestant to pick a door. Once a door was selected, Monty would then tease the contestant with cash to get him/her to either abandon the game or switch to another door. Invariably, the contestant would stand firm and then Monty would proceed to show the contestant what was behind one of the other doors. Of course, it wouldn’t be any fun if the prize was behind the revealed door, so after showing the contestant an empty door Monty would then ply them with even more cash, in the hopes that they would abandon the game or switch to the remaining door.

Almost without fail, the contestant would stand firm in their belief that their chosen door was the winner and would not switch to the other door.

So where’s the paradox?

When left with two doors, most people assume that they've got a 50/50 chance at winning. However, the truth is that the contestant will double his/her chance of winning by switching to the other door.

After explaining this to my son, it occurred to me that this would be an excellent exercise for coding in Python and in SAS to see how the two languages compared. Like many of you reading this blog, I’ve been programming in SAS for years so the struggle for me was coding this in Python.

I kept it simple. I generated my data randomly and then applied simple logic to each row and compared the results.  The only difference between the two is in how the languages approach it.  Once we look at the two approaches then we can look at the answer.

First, let's look at SAS:

data choices (drop=max);
do i = 1 to 10000;
	u=ranuni(1);
	u2=ranuni(2);
	max=3;
	prize = ceil(max*u);
	choice = ceil(max*u2);
	output;
end;
run;

I started by generating two random numbers for each row in my data. The first random number will be used to randomize the prize door and the second will be used to randomize the choice that the contestant makes. The result is a dataset with 10,000 rows each with columns ‘prize’ and ‘choice’ to represent the doors.  They will be random integers between 1 and 3.  Our next task will be to determine which door will be revealed and determine a winner.

If our prize and choice are two different doors, then we must reveal the third door. If the prize and choice are the same, then we must choose a door to reveal. (Note: I realize that my logic in the reveal portion is somewhat flawed, but given that I am using an IF…ELSE IF and the fact that the choices are random and there isn’t any risk of introducing bias, this way of coding it was much simpler.)

data results;
set choices;
by i;
 
if prize in (1,2) and choice in (1,2) then reveal=3;
else if prize in (1,3) and choice in (1,3) then reveal=2;
else if prize in (2,3) and choice in (2,3) then reveal=1;

Once we reveal a door, we must now give the contestant the option to switch. Switch means they always switch, neverswitch means they never switch.

if reveal in (1,3) and choice in (1,3) then do;
        switch = 2; neverswitch = choice; 
end;
else if reveal in (2,3) and choice in (2,3) then do;
	switch = 1; neverswitch = choice; 
end;
else if reveal in (1,2) and choice in (1,2) then do;
	switch = 3; neverswitch = choice; 
end;

Now we create a column for the winner.  1=win 0=loss.

	switchwin = (switch=prize);
	neverswitchwin = (neverswitch=prize);
run;

Next, let’s start accumulating our results across all of our observations. We’ll keep a running tally of how many times the contestant who always switches wins, and the same for the contestant who never switches.

data cumstats;
set results;
format cumswitch cumnever comma8.;
format pctswitch pctnever percent8.2;
retain cumswitch cumnever;
if _N_ = 1 then do;
	cumswitch = 0; cumnever = 0;
end;
else do;
cumswitch = cumswitch+switchwin;
cumnever = cumnever+neverswitchwin;
end;
 
pctswitch = cumswitch/i;
pctnever = cumnever/i;
run;
 
proc means data=results n mean std;
var switchwin neverswitchwin;
run;
legend1
frame	;
symbol1 interpol=splines;
pattern1 value=ms;
axis1
	style=1
	width=1
	minor=none ;
axis2
	style=1
	width=1
	major=none
	minor=none ;
axis3
	style=1
	width=1
	minor=none ;
title;
title1 " Cumulative chances of winning on Let's Make a Deal ";
 
proc gplot data=work.cumstats;
	plot pctnever * i  /
	areas=1
frame	vaxis=axis1
	haxis=axis2
	lvref=1
	cvref=black
	vzero
	legend=legend1;
plot2 pctswitch * i  = 2 /
  	areas=1
	vaxis=axis3
	vzero
overlay 
 	legend=legend1 ;
run; quit; 


The output of PROC MEANS shows that people who always switch (switchwin) have a win percentage of nearly 67%, while the people who never switch (neverswitchwin) have a win percentage of only 33%. The area plot makes the point graphically, showing that the win percentage of switchers is well above that of the non-switchers.

Now let’s take a look at how I approached the problem in Python (keeping in mind that this language is new to me).

Now, let’s look at Python:

Copied from Jupyter Notebook

import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import accumulate
%matplotlib inline

First let's create a blank dataframe with 10,000 rows and 10 columns, then fill in the blanks with zeros.

rawdata = {'index': range(10000)}
df = pd.DataFrame(rawdata,columns=['index','prize','choice','reveal','switch','neverswitch','switchwin','neverswitchwin','cumswitch','cumnvrswt'])
df = df.fillna(0)

Now let's populate our columns. The prize column represents the door that contains the new car! The choice column represents the door that the contestant chose. We will populate them both with a random number between 1 and 3.

prize=[]
choice=[]
for row in df['index']:
    prize.append(random.randint(1,3))
    choice.append(random.randint(1,3))   
df['prize']=prize
df['choice']=choice

Now that Monty Hall has given the contestant their choice of door, he reveals the blank door that they did not choose.

reveal=[]
for i in range(len(df)):
    if (df['prize'][i] in (1,2) and df['choice'][i] in (1,2)):
        reveal.append(3)
    elif (df['prize'][i] in (1,3) and df['choice'][i] in (1,3)):
        reveal.append(2)
    elif (df['prize'][i] in (2,3) and df['choice'][i] in (2,3)):
        reveal.append(1) 
df['reveal']= reveal

Here's the rub. The contestant has chosen a door, Monty has revealed a blank door, and now he's given the contestant the option to switch to the other door. Most of the time the contestant will not switch even though they should. To prove this, we create a column called 'switch' that reflects a contestant that ALWAYS switches their choice. And, a column called 'neverswitch' that represents the opposite.

switch=[]
neverswitch=[]
for i in range(len(df)):
    if (df['reveal'][i] in (1,3) and df['choice'][i] in (1,3)):
        switch.append(2)
    elif (df['reveal'][i] in (1,2) and df['choice'][i] in (1,2)):
        switch.append(3)
    elif (df['reveal'][i] in (2,3) and df['choice'][i] in (2,3)):
        switch.append(1) 
    neverswitch = choice
df['switch']=switch
df['neverswitch']=neverswitch

Now let's create a flag for when the Always Switch contestant wins and a flag for when the Never Switch contestant wins.

switchwin=[]
neverswitchwin=[]
for i in range(len(df)):
    if (df['switch'][i]==df['prize'][i]):
        switchwin.append(1)
    else:
        switchwin.append(0)    
    if (df['neverswitch'][i]==df['prize'][i]):
        neverswitchwin.append(1)
    else:
        neverswitchwin.append(0)     
df['switchwin']=switchwin
df['neverswitchwin']=neverswitchwin

Now we accumulate the total number of wins for each contestant.

cumswitch=[]
cumnvrswt=[]
df['cumswitch']=list(accumulate(df['switchwin']))
df['cumnvrswt']=list(accumulate(df['neverswitchwin']))

…and divide by the number of observations for a win percentage.

# Column arithmetic in pandas is element-wise, so no row loop is needed.
df['pctswitch']=df['cumswitch']/(df['index']+1)
df['pctnever']=df['cumnvrswt']/(df['index']+1)
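
To see what the accumulate-then-divide steps produce, here is a tiny stand-alone sketch using four made-up win flags instead of the DataFrame. The variable names are assumptions for this example only.

```python
from itertools import accumulate

wins = [1, 0, 1, 1]                # made-up win flags for four contestants
cum_wins = list(accumulate(wins))  # running total of wins

# Divide each running total by the number of games played so far (1, 2, 3, ...).
win_pct = [total / games for games, total in enumerate(cum_wins, start=1)]

print(cum_wins)  # [1, 1, 2, 3]
print(win_pct)
```

The last value in win_pct is the overall win percentage, which is why the plotted lines settle down as the number of iterations grows.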

Now we are ready to plot the results. Green represents the win percentage of Always Switch, and blue represents the win percentage of Never Switch.

x=df['index']
y=df['pctswitch']
y2=df['pctnever']
fig, ax = plt.subplots(1, 1, figsize=(12, 9))
ax.plot(x,y,lw=3, label='Always', color='green')
ax.plot(x,y2,lw=3, label='Never',color='blue',alpha=0.5)
ax.fill_between(x,y2,y, facecolor='green',alpha=0.6)
ax.fill_between(x,0,y2, facecolor='blue',alpha=0.5)
ax.set_xlabel("Iterations",size=14)
ax.set_ylabel("Win Pct",size=14)
ax.legend(loc='best')
plt.title("Cumulative chances of winning on Let's Make a Deal", size=16)
plt.grid(True)

[Figure: cumulative win percentage by iteration, Always Switch (green) vs. Never Switch (blue)]

Why does it work?

Most people think that because there are two doors left (the door you chose and the door Monty didn’t show you) that there is a fifty-fifty chance that you’ve got the prize.  But we just proved that it’s not, so “what gives”?

Remember that the door you chose at first has a 1/3 chance of winning.  That means that the other two doors combined have a 2/3 chance of winning.  Even though Monty showed us what’s behind one of those two doors, the two of them together still have a 2/3 chance of winning.  Since you know one of them is empty, the door you didn’t pick MUST have a 2/3 chance of winning.  You should switch.  The green line in the Python graph (or the red line in the SAS graph) shows that after 10,000 contestants had run through the game, the people who always switched won 67% of the time, while the people who never switched won only 33% of the time.
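
The 1/3 versus 2/3 split can also be checked exactly, without any random numbers. The sketch below is not part of the original simulation; it simply lists all nine equally likely (prize, choice) combinations and counts the wins for each strategy. The variable names are assumptions for this example.

```python
from fractions import Fraction
from itertools import product

doors = (1, 2, 3)
cases = list(product(doors, doors))  # every (prize, choice) pair, equally likely

stay_wins = 0
switch_wins = 0
for prize, choice in cases:
    # Monty opens a blank door that is neither the prize nor the pick.
    # When prize == choice either blank door gives the same outcome.
    reveal = next(d for d in doors if d not in (prize, choice))
    # Switching means taking the one door that is left.
    switched = next(d for d in doors if d not in (choice, reveal))
    stay_wins += (choice == prize)
    switch_wins += (switched == prize)

p_stay = Fraction(stay_wins, len(cases))
p_switch = Fraction(switch_wins, len(cases))
print(p_stay, p_switch)  # 1/3 2/3
```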

My thoughts on SAS versus Python

In terms of number of lines of code required, SAS wins hands down.  I only needed 57 lines of code to get the result in SAS, compared to 74 lines in Python. I realize that experience has a lot to do with it, but I think there is an inherent verbosity to the Python code that is not necessarily there in SAS.

In terms of ease of use, I’m going to give the edge to Python.  I really liked how easy it was to generate a random number between two values.  In SAS, you have to actually perform arithmetic functions to do it, whereas in Python it’s a built-in function. It was exactly the same for accumulating totals of numbers.  In Python, it was the accumulate function. In SAS, it was a do loop that summed each of the previous values.
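
To make the random-number contrast concrete, here is a small sketch, written entirely in Python, of the two styles. The first uses the built-in random.randint. The second does the scaling by hand, which is roughly the kind of arithmetic described above (it is an illustration in Python, not the SAS code from this post). The variable names are assumptions for the example.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Built-in: a random integer from 1 to 3, both ends included.
door_builtin = random.randint(1, 3)

# By hand: scale a uniform value in [0, 1) to the range and shift it.
low, high = 1, 3
door_arith = low + int(random.random() * (high - low + 1))

print(door_builtin, door_arith)
```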

In terms of iterative ability and working “free style,” I give the edge to SAS.  With Python, it is easy to iterate, but I found myself having to start all over again to pre-define columns, packages, etc., in order to complete my analysis.  With SAS, I could just code.  I didn’t have to start over because I created a new column.  I didn’t have to start over because I needed to figure out which package I needed, find it on GitHub, install it and then import it.

In terms of tabular output, SAS wins.  Easy to read, easy to generate.

In terms of graphical output, Python edges SAS out.  Both are verbose and tedious to get working. Python wins because the output is cleaner and there are way more options.

In terms of speed, SAS wins.  On my laptop, I could change the number of rows from 10,000 to 100,000 without noticing much of a difference in speed (0.25 – 0.5 seconds).  In Python, anything over 10,000 got slow.  10,000 rows was 6 seconds, 100,000 rows was 2 minutes 20 seconds.

Of course, this speed has a resource cost.  In those terms, Python wins.  My Anaconda installation is under 2GB of disk space, while my particular deployment of SAS requires 50GB of disk space.

Finally, in terms of mathematics, they tied.  They both produce the same answer as expected.  Of course, I used extremely common packages that are well used and tested.  Newer or more sophisticated packages are often tested against SAS as the standard for accuracy.

But in the end, comparing the two as languages is limited.  Python is a much more versatile object-oriented language with capabilities that SAS doesn’t have, while SAS’ mature DATA step can do things to data that Python has difficulty with.  Most important, though, is the release of SAS Viya. Through Viya’s open APIs and microservices, SAS is transforming itself into something more than just a coding language: it aims to be the analytical platform that all data scientists can use to get their work done.


The Monty Hall Paradox - SAS vs. Python was published on SAS Users.