2月 052018
 
The Birthday Problem: The probability of a matching birthday in a room of that contains N people (N=2,3,...,60). You can simulate the birthday problem in SAS and compute a Monte Carlo estimate of the probability of a match.

This article simulates the birthday-matching problem in SAS. The birthday-matching problem (also called the birthday problem or birthday paradox) answers the following question: "if there are N people in a room, what is the probability that at least two people share a birthday?" The birthday problem is famous because the probability of duplicate birthdays is much higher than most people would guess: Among 23 people, the probability of a shared birthday is more than 50%.

If you assume a uniform distribution of birthdays, the birthday-matching problem can be solved exactly. In reality birthdays are not uniformly distributed, but as I show in Statistical Programming with SAS/IML Software (Wicklin 2010), you can use a Monte Carlo simulation to estimate the probability of a matching birthday.

This article uses ideas from Wicklin (2010) to simulate the birthday problem in the SAS/IML language. The simulation is simplest to understand if you assume uniformly distributed birthdays (p. 335–337), but it is only slightly harder to run a simulation that uses an empirical distribution of birthdays (p. 344–346). For both cases, the simulation requires only a few lines of code.

Monte Carlo estimates of the probabilities in the birthday problem

The birthday-matching problem readily lends itself to Monte Carlo simulation. Represent the birthdays by the integers {1, 2, ..., 365} and do the following:

  1. Choose a random sample of size N (with replacement) from the set of birthdays. That sample represents a "room" of N people.
  2. Determine whether any of the birthdays in the room match. There are N(N-1)/2 pairs of birthdays to check.
  3. Repeat this process for thousands of rooms and compute the proportion of rooms that contain a match. This is a Monte Carlo estimate of the probability of a shared birthday.
  4. If desired, simulate rooms of various sizes, such as N=2, 3, 4,..., 60, and plot the results.

How to simulate the birthday problem in SAS

Before running the full simulation, let's examine a small-scale simulation that simulates only 10 rooms, each containing a random sample of N = 23 people:

proc iml;
call randseed(12345);                 /* set seed for random number stream */
N = 23;                               /* number of people in room */
NumRooms = 10;                        /* number of simulated rooms */
bday = Sample(1:365, NumRooms||N);    /* simulate: each column is a room that contains N birthdays */
*print bday;                          /* optional: print the 23 x 10 matrix of birthdays */
match = N - countunique(bday, "col"); /* number of unique birthdays in each col */
print match[label="Num Matches (N=23)" c=("Rm1":"Rm10")];
 
/* estimate the probability of a match, based on 10 simulations */
probMatch = (match > 0)[:];          /* mean = proportion of matches */
print probMatch;
Simulate the birthday problem in SAS and estimate the probability of a match

The output shows the number of matches in 10 rooms, each with 23 people. The first room did not contain any people with matching birthdays, nor did rooms 3, 5, 6, 7, and 10. The second room contained one matching birthday, as did rooms 8 and 9. The fourth room contains two shared birthdays. In this small simulation, 4/10 of the rooms contain a shared birthday, so the Monte Carlo estimate of the probability is 0.4. You can compute the estimate by forming the binary indicator variable (match > 0) and computing the mean. (Recall that the mean of a binary vector is the proportion of 1s.)

Notice that the simulation requires only three SAS/IML vectorized statements. The SAMPLE function produces a matrix that has N rows. Each column of the matrix represents the birthdays of N people in a room. The COUNTUNIQUE function returns a row vector that contains the number of unique birthdays in each column. If you subtract that number from N, you obtain the number of shared birthdays. Finally, the estimate is obtained by computing the mean of an indicator variable.

Simulate the birthday problem in SAS

For convenience, the following SAS/IML statements define a function that simulates B rooms, each containing a random sample of N birthdays. The function returns a row vector of size B that contains the number of matching birthdays in each room. The program calls the function in a loop and graphs the results.

start BirthdaySim(N, B);
   bday = Sample(1:365, B||N);           /* each column is a room */
   match = N - countunique(bday, "col"); /* number of unique birthdays in each col */
   return ( match );
finish;
 
maxN = 60;                    /* consider rooms with 2, 3, ..., maxN people */
NumPeople = 2:maxN;
probMatch = j(MaxN, 1, 0);    /* allocate vector for results */
NumRooms = 10000;             /* number of simulated rooms */
 
do N = 1 to ncol(NumPeople);
   match =  BirthdaySim(NumPeople[N], NumRooms);
   probMatch[N] = (match > 0)[:];
end;
 
title "The Birthday Matching Problem";
call series(NumPeople, probMatch) grid={x y} 
     label={"Number of People in Room" "Probability of a Shared Birthday"};

The graph is shown at the top of this article. It shows the estimated probability of one or more matching birthdays in a room of size N for N=2, 3, ..., 60. The estimates are based on 10,000 Monte Carlo iterations. (Why did I use 10,000 rooms? See the Appendix.) As mentioned earlier, the probability exceeds 50% when a room contains 23 people. For 30 people, the probability of a match is 71%. If you put 50 people in a room, the chances that two people in the room share a birthday is about 97%.

Allow leap days and a nonuniform distribution of birthdays

The curious reader might wonder how this analysis changes if you account for people born on leap day (February 29th). The answer is that the probabilities of a match decrease very slightly because there are more birthdays, and you can calculate the probability analytically. If you are interested in running a simulation that accounts for leap day, use the following SAS/IML function:

start BirthdaySim(N, B);
   /* A four-year period has 3*365 + 366 = 1461 days. Assuming uniformly distributed 
      birthdays, the probability vector for randomly choosing a birthday is as follows: */
   p = j(366, 1, 4/1461);      /* most birthdays occur 4 times in 4 years */
   p[31+29] = 1/1461;          /* leap day occurs once in 4 years */
   /* sample with replacement using this probability of selection */
   bday = Sample(1:366, B||N, "replace", p);           /* each column is a room */
   match = N - countunique(bday, "col"); /* number of unique birthdays in each col */
   return ( match );
finish;

If you run the simulation, you will find that the estimated probability of sharing a birthday among N people does not change much if you include leap-day birthdays. The difference between the estimated probabilities is about 0.003 on average, which is much less than the standard error of the estimates. In practice, you can exclude February 29 without changing the conclusion that shared birthdays are expected to occur.

In a similar way, you can set the probability vector to be the empirical distribution of birthdays in the population. Wicklin (2010, p. 346) reports that the simulation-based estimates are slightly higher than the exact results that you obtain from assuming a uniform distribution. This agrees with theory: A. G. Munford (TAS, 1977) showed that any nonuniform distribution increases the likelihood of a matching birthday. In other words, in the real world, the probability of a shared birthday is slightly higher than in the idealized mathematical world.

Appendix: How many Monte Carlo simulations are necessary?

Why did I use 10,000 Monte Carlo simulations? Notice that the Monte Carlo estimate in this problem is an estimate of a proportion of a binary variable, which means that you can estimate the standard error. If p̂ is the estimate, then a 95% confidence interval has radius 1.96 sqrt(p̂(1-p̂)/B), where B is the number of Monte Carlo simulations (that is, the number of "rooms"). This quantity is largest when p̂ = 1/2, so if you want, say, two digits of accuracy you can choose a value of B which is large enough so that 1.96 sqrt(0.5(1-0.5)/B) ≤ 0.01. Solving the equation for B give B ≥ 10000. If you choose that many Monte Carlo iterations, then the conservative 95% confidence interval has a half-width no greater than 0.01.

This simple calculation shows that if you use B=10,000 for the number of simulated "room," you can expect the Monte Carlo estimates to be within 0.01 of the true probability 95% of the time. The standard error is largest where the curve is steepest (near N=23). You could add error bars to the graph, but the errors would be too small to see.

The post Simulate the birthday-matching problem appeared first on The DO Loop.

2月 022018
 

For the past few years, I've been hearing from many news sources that Detroit is the most violent large city in the US. If the news is saying this, then the data should corroborate it, eh? Well, of course I decided to check for myself... When comparing crime data across [...]

The post Does Detroit really have the most violent crime? appeared first on SAS Learning Post.

2月 022018
 

Do you know what the #1 fear in North America is? Most people say fear of public speaking or fear of death, but you may just want to consider this new fear upping the charts - the fear of writing a SAS Certification exam! The real test begins before even [...]

The post Demystifying SAS Certification (Part 2): Write it down appeared first on SAS Learning Post.

2月 012018
 

Groups and organizations lauding artificially intelligent solutions are popping up everywhere with promises to create the next battlefield advantage using next generation weapons, gear, or satellites.  The term artificial intelligence (AI) splashes the headlines with promises that we're moments away from revolutionizing the battlefield. As a former intelligence analyst, I [...]

13 ideas for AI in military intelligence was published on SAS Voices by Mary Beth Ainsworth

1月 312018
 

This article shows how to construct a "stacked band plot" in SAS, as shown to the right. (Click to enlarge.) You are probably familiar with a stacked bar chart in which the cumulative amount of some quantity is displayed by stacking the contributions of several groups. A canonical example is a bar chart of revenue for a company in which the revenues for several regions (North America, Europe, Asia, and so forth) are stacked to show the total revenue. If you plot the revenue for several years, you get a stacked bar chart that shows how the total revenue changes over time, as well as the changes in the regional revenues. A stacked band plot is similar, but it is used when the horizontal variable is continuous rather than discrete.

Create a stacked bar chart in SAS

A few years ago Sanjay and I each wrote about ways to create a stacked bar chart in SAS by using the SGPLOT procedure in SAS. Recently I created a stacked bar chart that spanned about 20 years of data. When I looked at it, I realized that as the number of categories (years) grows, the bars become a nuisance. The graph becomes cluttered with vertical bars that are not necessary to show the underlying trends in the data. At that point, it is better to switch to a continuous scale and use a stacked band plot.

This section shows how to create a stacked bar chart by using PROC SGPLOT. The next section shows how to create a stacked band chart.

For data, I decided to use the amount of carbon dioxide (CO2, in million metric tons) that is produced by several sources from 1973–2016. These data were plotted as a stacked bar chart by my friend Robert Allison. The data are originally in "wide form," with five variables that contain the data. For a stacked bar chart, you should convert the data to "long form," as follows. You can download the complete SAS program that creates the graphs in this article.

/* 1. Original data is in "wide form": CO2 emission for five sources for each year. */
data CO2;
informat Transportation Commercial Residential Industrial Electric COMMA5.;
input Year Transportation Commercial Residential Industrial Electric;
datalines;
2016	1,879	235	308	928	1,821
2015	1,845	240	321	942	1,913
2014	1,821	233	347	955	2,050
2013	1,803	223	333	952	2,050
   ... more lines ...   
;
 
/* 2. Convert from wide to long format */
proc sort data=CO2 out=Wide;
   by Year;                     /* sort X categories */
run;
 
proc transpose data=Wide 
   out=Long(rename=(Col1=Value))/* name of new column that contains values */
   name=Source;                 /* name of new column that contains categories */
   by Year;                     /* original X var */
   var Transportation Commercial Residential Industrial Electric; /* original Y vars */
run;

For data in the long form, you can use the VBAR statement and the GROUPDISPLAY=STACK option in PROC SGPLOT to create a stacked bar chart, as follows:

/* 3. create a stacked bar chart */
ods graphics / subpixel=on;
title "U.S. Carbon Dioxide Emissions from Energy Consumption";
title2 "Stacked Bar Chart";
proc sgplot data=Long;
   vbar Year / response=Value group=Source groupdisplay=stack grouporder=data;
   xaxis type=linear thresholdmin=0;   /* use TYPE=LINEAR so not every bar is labeled */ 
   yaxis grid values=(0 to 6000 by 1000);
   label Source = "Source of CO2" Value = "CO2 (mmt)";
run;

The graph is not terrible, but it can be improved. The vertical lines are a distraction. With 43 years of data, the horizontal axis begins to lose its discrete nature. All those bars and the gaps between bars are cluttering the display. The next section shows how to create two new variables and use those variables to create a stacked band plot.

Create a stacked band plot

To create a stacked bar chart with PROC SGPLOT, you must create two new variables. The cumValue variable will contain the cumulative amount of CO2 per year as you accumulate the amount for each group (Transportation, Commercial, Residential, and so forth). This will serve as the upper boundary of each band. The Previous variable will contain the lower boundary of each band. You can use the LAG function in Base SAS to easily obtain the previous cumulative value: Previous = lag(cumValue). When you create these variables, you need to initialize the value for the first group (Transportation). Use the BY YEAR statement and the FIRST.YEAR indicator variable to detect the beginning of each new value of the Year variable, as follows:

/* 4. Accumulate the contributions of each group in the 'cumValue' variable.
      Add the lag(cumValue) variable and call it 'Previous'. 
      Create a band plot for each group with LOWER=Previous and UPPER=cumValue.
*/
data Energy;   
   set Long;
   by Year;
   if first.Year then cumValue=0;   /* for each year, initialize cumulative amount to 0 */
   cumValue + Value;
   Previous = lag(cumValue);
   if first.Year then Previous=0;   /* for each year, initialize baseline value */
   label Source = "Source of CO2" cumValue = "CO2 (mmt)";
run;
 
title2 "Stacked Band Plot";
proc sgplot data=Energy;
   band x=Year lower=Previous upper=cumValue / group=Source;
   xaxis display=(nolabel) thresholdmin=0; 
   yaxis grid;
   keylegend / position=right sortorder=reverseauto; /* SAS 9.4M5: reverse legend order */
run;

The graph appears at the top of this article. The vertical lines are gone. The height of a band shows the amount of CO2 that is produced by the corresponding source. The top of the highest band (Electric) shows the cumulative amount of CO2 that is produced by all sources. This kind of display makes it easy to see the cumulative total. If you want to see the trends for each individual source, a time series plot would be a better choice.

The graph uses a new feature of SAS 9.4M5. The KEYLEGEND statement now supports the SORTORDER=REVERSEAUTO option, which reverses the order of the legend elements. This option makes the legend match the bottom-to-top progression of the stacked bar chart. I also used the POSTITION= option to move the legend to the right side so that the legend is next to the color bands.

Label the stacked bands directly

If you have very thin bands, a legend is probably the best way to associate colors with groups. However, for these data the bands are wide enough that you might want to display the name of each group on the bands. I tried several label positions (left, center, and right) and decided to display the group name near the left side of the graph. Since many people scan a graph from left to right, this causes the reader to see the labels early in the scanning process.

The following DATA step computes the positions for the labels. I use 1979 as the horizontal position and use the midpoint of the band as the vertical position. After you compute the positions for the labels, you can concatenate the data and the labels and use the TEXT statement to overlay the labels on the band plot, as follows:

data Labels;
   set Energy;
   where Year = 1979;  /* position labels at 1979 */
   Label = Source;
   XPos = Year;
   YPos = (cumValue + Previous) / 2;
   keep Label XPos YPos;
run;
 
data EnergyLabels;
   set Energy Labels;
run;
 
title2 "Stacked Band Plot";
proc sgplot data=EnergyLabels noautolegend;
band x=Year lower=Previous upper=cumValue / group=Source;
refline 1000 to 6000 by 1000 / axis=y lineattrs=GraphGridLines transparency=0.75;
text x=XPos y=Ypos text=Label;
xaxis display=(nolabel) values=(1973, 1980 to 2010 by 10, 2016)
      offsetmin=0 offsetmax=0;
yaxis grid values=(0 to 6000 by 1000) label="CO2 (mmt)"; 
run;

Summary

In summary, you can use PROC SGPLOT to create a stacked band plot in SAS. A stacked band plot is similar to a stacked bar chart but presumes that the positions of the bars represent a continuous variable on a linear scale. For this example, the bars represented years. You can let PROC SGPLOT automatically place the legend along the bottom, but if you have SAS 9.4M5 you might want to move the legend to the right and use the SORTORDER=REVERSEAUTO option to reverse the legend order. Alternatively, you can display labels on the bands directly if there is sufficient room.

The post Create a stacked band plot in SAS appeared first on The DO Loop.

1月 312018
 

Can you use a data visualization tool to display building maps, floor designs and other Esri data? With the recent addition of custom polygon support in SAS Visual Analytics 8.2, customers wondered if this feature can be utilized to render different types of regional overlays. A common request is to [...]

Building and visualizing custom polygons in SAS Visual Analytics was published on SAS Voices by Falko Schulz