October 12, 2016
 

Recently, one of my sons came to me and asked about something called “The Monty Hall Paradox.” They had discussed it in school and he was having a hard time understanding it (as you often do with paradoxes).

For those of you who may not be familiar with the Monty Hall Paradox, it is named for the host of a popular TV game show called “Let’s Make a Deal.” On the show, a contestant would be selected and shown a valuable prize. Monty Hall would then explain that the prize was located behind one of three doors and ask the contestant to pick a door. Once a door was selected, Monty would tease the contestant with cash to get him/her to either abandon the game or switch to another door. Invariably, the contestant would stand firm, and Monty would proceed to show the contestant what was behind one of the other doors. Of course, it wouldn’t be any fun if the prize was behind the revealed door, so after showing the contestant an empty door Monty would ply them with even more cash, in the hope that they would abandon the game or switch to the remaining door.

Almost without fail, the contestant would stand firm in their belief that their chosen door was the winner and would not switch to the other door.

So where’s the paradox?

When left with two doors, most people assume that they've got a 50/50 chance at winning. However, the truth is that the contestant will double his/her chance of winning by switching to the other door.

After explaining this to my son, it occurred to me that this would be an excellent exercise for coding in Python and in SAS to see how the two languages compared. Like many of you reading this blog, I’ve been programming in SAS for years so the struggle for me was coding this in Python.

I kept it simple. I generated my data randomly and then applied simple logic to each row and compared the results.  The only difference between the two is in how the languages approach it.  Once we look at the two approaches then we can look at the answer.

First, let's look at SAS:

data choices (drop=max);
do i = 1 to 10000;
	u=ranuni(1);
	u2=ranuni(2);
	max=3;
	prize = ceil(max*u);
	choice = ceil(max*u2);
	output;
end;
run;

I started by generating two random numbers for each row in my data. The first random number will be used to randomize the prize door and the second will be used to randomize the choice that the contestant makes. The result is a dataset with 10,000 rows each with columns ‘prize’ and ‘choice’ to represent the doors.  They will be random integers between 1 and 3.  Our next task will be to determine which door will be revealed and determine a winner.
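The ceil(max*u) step is what turns a uniform random number into a door number. As a rough illustration of the same arithmetic in Python (the function name random_door and the seed are made up for this sketch, and 1 - random() is used so the uniform value is never exactly 0):

```python
import math
import random
from collections import Counter

def random_door(rng, max_door=3):
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and ceil() never returns 0
    u = 1.0 - rng.random()
    return math.ceil(max_door * u)

rng = random.Random(1)  # arbitrary seed, only for repeatability
counts = Counter(random_door(rng) for _ in range(10000))
print(sorted(counts.items()))  # each door should come up roughly a third of the time
```

Each third of the unit interval maps to one door, which is why the three doors are equally likely.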

If our prize and choice are two different doors, then we must reveal the third door. If the prize and choice are the same, then we must choose a door to reveal. (Note: I realize that my logic in the reveal portion is somewhat flawed, but given that I am using an IF…ELSE IF and the fact that the choices are random and there isn’t any risk of introducing bias, this way of coding it was much simpler.)

data results;
set choices;
by i;
 
if prize in (1,2) and choice in (1,2) then reveal=3;
else if prize in (1,3) and choice in (1,3) then reveal=2;
else if prize in (2,3) and choice in (2,3) then reveal=1;

Once we reveal a door, we must now give the contestant the option to switch. The switch column holds the door for a contestant who always switches; neverswitch holds the door for a contestant who never switches.

if reveal in (1,3) and choice in (1,3) then do;
        switch = 2; neverswitch = choice; 
end;
else if reveal in (2,3) and choice in (2,3) then do;
	switch = 1; neverswitch = choice; 
end;
else if reveal in (1,2) and choice in (1,2) then do;
	switch = 3; neverswitch = choice; 
end;

Now we create a column for the winner.  1=win 0=loss.

	switchwin = (switch=prize);
	neverswitchwin = (neverswitch=prize);
run;
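Because there are only nine possible prize/choice combinations, the IF…ELSE IF logic in this step can be checked exhaustively. The following is a minimal sketch in Python, not part of the original program: the function names reveal_door and switch_door are invented for illustration, and the "doors sum to 6" shortcut stands in for the second IF chain.

```python
from itertools import product

def reveal_door(prize, choice):
    # Same branch order as the IF ... ELSE IF chain in the DATA step
    if prize in (1, 2) and choice in (1, 2):
        return 3
    elif prize in (1, 3) and choice in (1, 3):
        return 2
    elif prize in (2, 3) and choice in (2, 3):
        return 1
    raise ValueError("doors must be 1, 2 or 3")

def switch_door(reveal, choice):
    # Doors 1 + 2 + 3 = 6, so the remaining door is whatever is left over
    return 6 - reveal - choice

switch_wins = 0
stay_wins = 0
for prize, choice in product((1, 2, 3), repeat=2):
    reveal = reveal_door(prize, choice)
    switch = switch_door(reveal, choice)
    assert reveal not in (prize, choice)   # the host never shows the prize or the pick
    assert switch not in (reveal, choice)  # switching always lands on the third door
    switch_wins += (switch == prize)
    stay_wins += (choice == prize)

print(switch_wins, stay_wins)  # 6 3
```

Out of the nine equally likely combinations, switching wins in six and staying wins in three, which is the 2-to-1 ratio the simulation is expected to approach.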

Next, let’s start accumulating our results across all of our observations.  We’ll keep a running tally of how many times the contestant who always switches wins, and the same for the contestant who never switches.

data cumstats;
set results;
format cumswitch cumnever comma8.;
format pctswitch pctnever percent8.2;
/* start both tallies at 0 so the first observation is counted as well */
retain cumswitch 0 cumnever 0;
cumswitch = cumswitch+switchwin;
cumnever = cumnever+neverswitchwin;
 
pctswitch = cumswitch/i;
pctnever = cumnever/i;
run;
 
proc means data=results n mean std;
var switchwin neverswitchwin;
run;
legend1
frame	;
symbol1 interpol=splines;
pattern1 value=ms;
axis1
	style=1
	width=1
	minor=none ;
axis2
	style=1
	width=1
	major=none
	minor=none ;
axis3
	style=1
	width=1
	minor=none ;
title;
title1 " Cumulative chances of winning on Let's Make a Deal ";
 
proc gplot data=work.cumstats;
	plot pctnever * i  /
	areas=1
frame	vaxis=axis1
	haxis=axis2
	lvref=1
	cvref=black
	vzero
	legend=legend1;
plot2 pctswitch * i  = 2 /
  	areas=1
	vaxis=axis3
	vzero
overlay 
 	legend=legend1 ;
run; quit; 

[Figure: SAS output, PROC MEANS table and area plot of cumulative win percentages]

The output of PROC MEANS shows that people who always switch (switchwin) have a win percentage of nearly 67%, while the people who never switch (neverswitchwin) have a win percentage of only 33%. The area plot makes the same point graphically, showing the win percentage of the switchers well above that of the non-switchers.

I approached the problem the same way in Python, keeping in mind that this language is new to me.

Now, let’s look at Python:

Copied from Jupyter Notebook

import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import accumulate
%matplotlib inline

First let's create a blank dataframe with 10,000 rows and 10 columns, then fill in the blanks with zeros.

rawdata = {'index': range(10000)}
df = pd.DataFrame(rawdata,columns=['index','prize','choice','reveal','switch','neverswitch','switchwin','neverswitchwin','cumswitch','cumnvrswt'])
df = df.fillna(0)

Now let's populate our columns. The prize column represents the door that contains the new car! The choice column represents the door that the contestant chose. We will populate them both with a random number between 1 and 3.

prize=[]
choice=[]
for row in df['index']:
    prize.append(random.randint(1,3))
    choice.append(random.randint(1,3))   
df['prize']=prize
df['choice']=choice

Now that Monty Hall has given the contestant their choice of door, he reveals the blank door that they did not choose.

reveal=[]
for i in range(len(df)):
    if (df['prize'][i] in (1,2) and df['choice'][i] in (1,2)):
        reveal.append(3)
    elif (df['prize'][i] in (1,3) and df['choice'][i] in (1,3)):
        reveal.append(2)
    elif (df['prize'][i] in (2,3) and df['choice'][i] in (2,3)):
        reveal.append(1) 
df['reveal']= reveal

Here's the rub. The contestant has chosen a door, Monty has revealed a blank door, and now he's given the contestant the option to switch to the other door. Most of the time the contestant will not switch even though they should. To show this, we create a column called 'switch' that reflects a contestant who ALWAYS switches their choice, and a column called 'neverswitch' that represents the opposite.

switch=[]
neverswitch=[]
for i in range(len(df)):
    if (df['reveal'][i] in (1,3) and df['choice'][i] in (1,3)):
        switch.append(2)
    elif (df['reveal'][i] in (1,2) and df['choice'][i] in (1,2)):
        switch.append(3)
    elif (df['reveal'][i] in (2,3) and df['choice'][i] in (2,3)):
        switch.append(1) 
# The Never Switch contestant keeps the original choice, so no loop is needed for it
df['switch']=switch
df['neverswitch']=df['choice']

Now let's create a flag for when the Always Switch contestant wins and a flag for when the Never Switch contestant wins.

switchwin=[]
neverswitchwin=[]
for i in range(len(df)):
    if (df['switch'][i]==df['prize'][i]):
        switchwin.append(1)
    else:
        switchwin.append(0)    
    if (df['neverswitch'][i]==df['prize'][i]):
        neverswitchwin.append(1)
    else:
        neverswitchwin.append(0)     
df['switchwin']=switchwin
df['neverswitchwin']=neverswitchwin

Now we accumulate the total number of wins for each contestant.

cumswitch=[]
cumnvrswt=[]
df['cumswitch']=list(accumulate(df['switchwin']))
df['cumnvrswt']=list(accumulate(df['neverswitchwin']))

…and divide by the number of observations for a win percentage.

# Each division works on the whole column at once, so no row loop is needed
df['pctswitch']=df['cumswitch']/(df['index']+1)
df['pctnever']=df['cumnvrswt']/(df['index']+1)
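To make the accumulate-then-divide step concrete, here is a tiny worked example with made-up win flags for five contestants (the data and variable names are purely illustrative):

```python
from itertools import accumulate

wins = [1, 0, 1, 1, 0]              # hypothetical win flags, 1 = win, 0 = loss
cum_wins = list(accumulate(wins))   # running total of wins
# divide each running total by the number of contestants seen so far
pct = [c / n for n, c in enumerate(cum_wins, start=1)]

print(cum_wins)  # [1, 1, 2, 3, 3]
print(pct)
```

After the fourth contestant the running win percentage is 3/4, and after the fifth it drops to 3/5. The dataframe columns follow the same pattern, just over 10,000 rows.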

Now we are ready to plot the results. Green represents the win percentage of Always Switch, blue represents the win percentage of Never Switch.

x=df['index']
y=df['pctswitch']
y2=df['pctnever']
fig, ax = plt.subplots(1, 1, figsize=(12, 9))
ax.plot(x,y,lw=3, label='Always', color='green')
ax.plot(x,y2,lw=3, label='Never',color='blue',alpha=0.5)
ax.fill_between(x,y2,y, facecolor='green',alpha=0.6)
ax.fill_between(x,0,y2, facecolor='blue',alpha=0.5)
ax.set_xlabel("Iterations",size=14)
ax.set_ylabel("Win Pct",size=14)
ax.legend(loc='best')
plt.title("Cumulative chances of winning on Let's Make a Deal", size=16)
plt.grid(True)

[Figure: Python plot of cumulative win percentages, Always Switch (green) vs. Never Switch (blue)]

Why does it work?

Most people think that because there are two doors left (the door you chose and the door Monty didn’t show you), there is a fifty-fifty chance that you’ve got the prize.  But the simulation just showed that this isn’t so. What gives?

Remember that the door you chose at first has a 1/3 chance of winning.  That means that the other two doors combined have a 2/3 chance of winning.  Even though Monty showed us what’s behind one of those two doors, the two of them together still have a 2/3 chance of winning.  Since you know one of them is empty, the door you didn’t pick MUST have a 2/3 chance of winning.  You should switch.  The green line in the Python graph (or the red line in the SAS graph) shows that after 10,000 contestants had run through the game, the people who always switched won 67% of the time while the people who never switched won only 33% of the time.
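The same argument can be checked exactly rather than by simulation. The sketch below enumerates every prize/choice combination with exact fractions; it assumes the host picks at random when two empty doors are available, which is an assumption of this sketch rather than something the programs above do.

```python
from fractions import Fraction
from itertools import product

doors = (1, 2, 3)
p_stay = Fraction(0)
p_switch = Fraction(0)

for prize, choice in product(doors, repeat=2):
    weight = Fraction(1, 9)  # prize and choice are independent and uniform
    # doors the host may open: neither the contestant's pick nor the prize
    openable = [d for d in doors if d not in (prize, choice)]
    for reveal in openable:
        w = weight / len(openable)  # assumed: host chooses uniformly among them
        switch = next(d for d in doors if d not in (choice, reveal))
        p_stay += w * (choice == prize)
        p_switch += w * (switch == prize)

print(p_stay, p_switch)  # 1/3 2/3
```

The result does not depend on how the host breaks ties, because switching wins exactly when the first pick was wrong, and that happens 2/3 of the time.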

My comparisons and thoughts between SAS and Python.

In terms of number of lines of code required, SAS wins hands down.  I only needed 57 lines of code to get the result in SAS, compared to 74 lines in Python. I realize that experience has a lot to do with it, but I think there is an inherent verbosity to the Python code that is not necessarily there in SAS.

In terms of ease of use, I’m going to give the edge to Python.  I really liked how easy it was to generate a random integer between two values.  In SAS, you have to do the arithmetic yourself, whereas in Python it’s a built-in function. It was exactly the same for accumulating totals of numbers.  In Python, it was the accumulate function. In SAS, it took a RETAIN statement and a running sum in the DATA step.

In terms of iterative ability and working “free style,” I give the edge to SAS.  With Python, it is easy to iterate, but I found myself starting all over again, pre-defining columns, packages, etc., in order to complete my analysis.  With SAS, I could just code.  I didn’t have to start over because I created a new column.  I didn’t have to start over because I needed to figure out which package I needed, find it on GitHub, install it and then import it.

In terms of tabular output, SAS wins.  Easy to read, easy to generate.

In terms of graphical output, Python edges SAS out.  Both are verbose and tedious to get working. Python wins because the output is cleaner and there are far more options.

In terms of speed, SAS wins.  On my laptop, I could change the number of rows from 10,000 to 100,000 without noticing much of a difference in speed (0.25 – 0.5 seconds).  In Python, anything over 10,000 got slow.  10,000 rows was 6 seconds, 100,000 rows was 2 minutes 20 seconds.
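Much of the Python run time here probably comes from looping over dataframe rows one at a time rather than from Python as such. As a sketch under that assumption, the whole simulation can be written with NumPy array operations. The function name simulate and the seed are invented for this example, and the tie-breaking rule for the reveal copies the fixed rule used in the loops above. No timings are claimed; results will vary by machine.

```python
import numpy as np

def simulate(n, seed=0):
    rng = np.random.default_rng(seed)
    prize = rng.integers(1, 4, size=n)   # upper bound is exclusive, so doors 1..3
    choice = rng.integers(1, 4, size=n)

    # If prize and choice differ, the third door is 6 - prize - choice.
    # If they match, use the same fixed rule as the if/elif chain:
    # doors 1 and 2 reveal door 3, door 3 reveals door 2.
    reveal = np.where(prize != choice,
                      6 - prize - choice,
                      np.where(choice == 3, 2, 3))
    switch = 6 - choice - reveal         # the one door that is left

    switchwin = (switch == prize)
    neverswitchwin = (choice == prize)

    trials = np.arange(1, n + 1)
    pctswitch = np.cumsum(switchwin) / trials     # running win percentage
    pctnever = np.cumsum(neverswitchwin) / trials
    return pctswitch, pctnever

pctswitch, pctnever = simulate(100_000)
print(pctswitch[-1], pctnever[-1])  # should be close to 2/3 and 1/3
```

The two returned arrays play the same role as the pctswitch and pctnever columns and can be passed to the plotting code unchanged.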

Of course, this speed has a resource cost.  In those terms, Python wins.  My Anaconda installation is under 2GB of disk space, while my particular deployment of SAS requires 50GB of disk space.

Finally, in terms of mathematics, they tied.  They both produce the same answer as expected.  Of course, I used extremely common packages that are well used and tested.  Newer or more sophisticated packages are often tested against SAS as the standard for accuracy.

But in the end, comparing the two as languages is of limited value.  Python is a much more versatile object-oriented language that has capabilities that SAS doesn’t have, while SAS’ mature DATA step can do things to data that Python has difficulty with.  Most important, though, is the release of SAS Viya. Through Viya’s open APIs and microservices, SAS is transforming itself into something more than just a coding language; it aims to be the analytical platform that all data scientists can use to get their work done.

tags: Python, SAS Programmers

The Monty Hall Paradox - SAS vs. Python was published on SAS Users.
