February 14, 2018
 

When I first learned to program in SAS, I remember being confused about the difference between CLASS statements and BY statements. A novice SAS programmer recently asked when to use one instead of the other, so this article explains the difference between the CLASS statement and BY variables in SAS procedures.

The BY statement and the CLASS statement in SAS both enable you to specify one or more categorical variables whose levels define subgroups of the data. (For simplicity, we consider only a single categorical variable.) The primary difference is that the BY statement computes many analyses, each on a subset of the data, whereas the CLASS statement computes a single analysis of all the data. Specifically,

  • The BY statement repeats an analysis on every subgroup. The subgroups are treated as independent samples. If a BY variable defines k groups, the output will contain k copies of every table and graph: one copy for the first group, one copy for the second group, and so on.
  • The CLASS statement enables you to include a categorical variable as part of an analysis. Often the CLASS variable is used to compare the groups, such as in a t test or an ANOVA analysis. In regression models, the CLASS statement enables you to estimate parameters for the levels of a categorical variable, thereby estimating the effect of each level on the response. Another use of a CLASS variable is to define categories for a classification task, such as a discriminant analysis.

To illustrate the differences between an analysis that uses a BY statement and one that uses a CLASS statement, let's create a subset (called Cars) of the Sashelp.Cars data. The levels of the Origin variable indicate whether a vehicle is manufactured in "Asia", "Europe", or the "USA". For efficiency reasons, most classical SAS procedures require that you sort the data when you use a BY statement. Therefore, a call to PROC SORT creates a sorted version of the data called CarsSorted, which will be used for the BY-group analyses.

data Cars;
   set Sashelp.Cars;
   where cylinders in (4,6,8) and type ^= 'Hybrid'; 
run;
 
proc sort data=Cars out=CarsSorted; 
   by Origin; 
run;

Descriptive statistics for grouped data

When you generate descriptive statistics for groups of data, the univariate statistics are identical whether you use a CLASS statement or a BY statement. What changes is the way that the statistics are displayed. When you use the CLASS statement, you get one table that contains all statistics or one graph that shows the distribution of each subgroup. However, when you use the BY statement you get multiple tables and graphs.

The following statements use the CLASS statement to produce descriptive statistics. PROC UNIVARIATE displays one (paneled) graph that shows a comparative histogram for the vehicles that are made in Asia, Europe, and USA. PROC MEANS displays one table that contains descriptive statistics:

proc univariate data=Cars;
   class Origin;
   var Horsepower;
   histogram Horsepower / nrows=3; /* must use NROWS= to get panel */
   ods select histogram;
run;
 
proc means data=Cars N Mean Std;
   class Origin;
   var Horsepower Weight Mpg_Highway;
run;

In contrast, if you run a BY-group analysis on the levels of the Origin variable, you will see three times as many tables and graphs. Each analysis is preceded by a label that identifies the BY group. Notice that the BY-group analysis uses the sorted data.

proc means data=CarsSorted N Mean Std;
   by Origin;
   var Horsepower Weight Mpg_Highway;
run;

Always remember that the output from a BY statement is equivalent to the output from running the procedure multiple times on subsets of the data. For example, the previous statistics could also be generated by calling PROC MEANS three times, each call with a different WHERE clause, as follows:

proc means N Mean Std data=CarsSorted( where=(origin='Asia') );
   var Horsepower Weight Mpg_Highway;
run;
proc means N Mean Std data=CarsSorted( where=(origin='Europe') );
   var Horsepower Weight Mpg_Highway;
run;
proc means N Mean Std data=CarsSorted( where=(origin='USA') );
   var Horsepower Weight Mpg_Highway;
run;

In fact, if you ever find yourself repeating an analysis many times (perhaps by using a macro loop), you should consider whether you can rewrite your program to be more efficient by using a BY statement.
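For illustration, here is a hedged sketch of the kind of macro loop that a single BY statement replaces. The macro name and its structure are hypothetical; the BY-group call shown earlier produces the same statistics with one pass through the data:

%macro MeansByOrigin;              /* hypothetical macro: one PROC MEANS call per group */
   %let groups = Asia Europe USA;
   %do i = 1 %to 3;
      %let g = %scan(&groups, &i);
      proc means N Mean Std data=CarsSorted(where=(origin="&g"));
         var Horsepower Weight Mpg_Highway;
      run;
   %end;
%mend;
%MeansByOrigin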

Comparing groups: Use the CLASS statement

As a general rule, you should use a CLASS statement when you want to compare or contrast groups. For example, the following call to PROC GLM performs an ANOVA analysis on the horsepower (response variable) for the three groups defined by the Origin variable. The procedure automatically creates a graph that displays three boxplots, one for each group. The procedure also computes parameter estimates for the levels of the CLASS variable (not shown).

proc glm data=Cars; /* by default, create graph with side-by-side boxplots */
   class Origin;
   model Horsepower = Origin / solution;
run;

You can specify multiple variables on the CLASS statement to include multiple categorical variables in a model. Any variables that are not listed on the CLASS statement are assumed to be continuous. Thus the following call to PROC GLM analyzes a model that has one continuous and one classification variable. The procedure automatically produces a graph that overlays the three regression curves on the data:

ods graphics / antialias=on;
title "CLASS Variable Regression: One Model with Multiple Parameters";
proc glm data=Cars plots=FitPlot;
   class Origin;
   model Horsepower = Origin | Weight / solution;
   ods select ParameterEstimates ANCOVAPlot;
quit;

In contrast, if you use a BY statement, the Origin variable cannot be part of the model but is used only to subset the data. If you use a BY statement, you obtain three different models of the form Horsepower = Weight. You get three parameter estimates tables and three graphs, each showing one regression line overlaid on a subset of the data.
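For reference, here is a minimal sketch of the BY-group version of this regression; it is not shown in the original output, and remember that the BY statement requires the sorted data:

proc glm data=CarsSorted;
   by Origin;                           /* three separate models: Horsepower = Weight */
   model Horsepower = Weight / solution;
quit;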

Predicted values: CLASS variable versus BY variable

When you use a BY statement and fit three models of the form Horsepower = Weight, the procedure fits a total of six parameters. Notice that when you use the CLASS statement and fit the model Horsepower = Origin | Weight, you also fit six free parameters. It turns out that these two methods produce the same predicted values. In fact, you can combine the parameter estimates from the CLASS model (under the GLM parameterization) to obtain the parameter estimates from the BY-variable analysis, as shown below. Each parameter estimate for the BY-variable models is obtained as the sum of two estimates from the CLASS-variable analysis:
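For example, under the GLM parameterization (in which the last level, 'USA', serves as the reference level), the relationship can be sketched as follows. The ODS OUTPUT statement and the ClassEst data set name are illustrative additions, not part of the original post:

ods output ParameterEstimates=ClassEst;   /* capture the CLASS-model estimates */
proc glm data=Cars;
   class Origin;
   model Horsepower = Origin | Weight / solution;
quit;
/* Under the GLM parameterization (USA = reference level):
   Asia intercept = Intercept + (Origin Asia)
   Asia slope     = Weight    + (Weight*Origin Asia)
   USA  intercept = Intercept;   USA slope = Weight   */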

For many regression models, the predicted values for the BY-variable analyses are the same as for a particular model that uses a CLASS variable. As shown above, you can even see how the parameters are related when you use a GLM or reference parameterization. However, the CLASS variable formulation can fit models (such as the equal-slope model Horsepower = Origin Weight) that are not available when you use a BY variable to fit three separate models. Furthermore, the CLASS statement provides parameter estimates so that you can see the effect of the groups on the response variable. It is more difficult to compare the models that are produced by using the BY statement.
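For instance, a minimal sketch of that equal-slope (common-slope) model, which has no BY-group counterpart, looks like this:

proc glm data=Cars;
   class Origin;
   model Horsepower = Origin Weight / solution;   /* common slope, different intercepts */
quit;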

Other CLASS-like statements in SAS

Some SAS procedures use other syntax to analyze groups. In particular, the SGPLOT procedure calls classification variables "group variables." If you want to overlay graphs for multiple groups, you can use the GROUP= option on many SGPLOT statements. (Some statements support the CATEGORY= option, which is similar.) For example, to replicate the two-variable regression analysis from PROC GLM, you can use the following statements in PROC SGPLOT:

proc sgplot data=Cars;
   reg y=Horsepower x=Weight / group=Origin; /* Horsepower = Origin | Weight */
run;

Summary

In summary, use the BY statement in SAS procedures when you want to repeat an analysis for every level of one or more categorical variables. The variables define the subsets but are not otherwise part of the analysis. In classical SAS procedures, the data must be sorted by the BY variables. A BY-group analysis can produce many tables and graphs, so you might want to suppress the ODS output and write the results to a SAS data set.
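For example, a minimal sketch of this technique with PROC MEANS follows; the output data set name GroupStats is arbitrary:

proc means data=CarsSorted noprint;               /* suppress all displayed output */
   by Origin;
   var Horsepower Weight Mpg_Highway;
   output out=GroupStats mean= std= / autoname;   /* write the statistics to a data set */
run;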

Use the CLASS statement when you want to include a categorical variable in a model. A CLASS statement often enables you to compare or contrast subgroups. For example, in regression models you can evaluate the relative effect of each level on the response variable.

In some cases, the BY statement and the CLASS statement produce identical statistics. However, the CLASS statement enables you to fit a wider variety of models.

The post The difference between CLASS statements and BY statements in SAS appeared first on The DO Loop.

February 12, 2018
 

In the hype and excitement surrounding artificial intelligence and big data, most of us miss out on critical aspects related to collection, processing, handling and analyzing data. It's important for data science practitioners to understand these critical aspects and add a human touch to big data. What are these aspects? [...]

Why is it important to add a human touch to big data? was published on SAS Voices by Jay Paulson

February 12, 2018
 

Did you know that SAS can combine or "merge" a symbol and a line pattern into a single legend item, as shown below? This kind of legend is useful when you are overlaying a group of curves onto a scatter plot. It enables the reader to quickly associate values of a categorical variable with colors, line patterns, and marker symbols in a plot.

Merged legend that combines symbols and line patterns in SAS

Legend items for marker/curve overlays

When you use PROC SGPLOT and the GROUP= option to create a graph, the SGPLOT procedure automatically displays the group attributes (such as symbol, color, and line pattern) in a legend. If you overlay multiple plot types (such as a series plot on a scatter plot), the default behavior is to create a legend for the first plot statement. You can use the KEYLEGEND statement to control which plots contribute to the legend. In the following example, the KEYLEGEND statement creates a legend that shows the attributes for the scatter plot (the marker shapes) and also the series plot (line patterns):

data ScatCurve;    /* example data: scatter plot and curves */
call streaminit(1);
do Group = 1 to 2;
   do x = 0 to 5 by 0.2;
      Pred = Group + (1/Group)*x - (0.2*x)**2;
      y = Pred + rand("Normal",0,0.5);
      output;
   end;
end;
run;
 
ods graphics / antialias=on subpixel=on;
title "Legend Not Merged";
proc sgplot data=ScatCurve;
   scatter x=x y=y / group=Group name="obs" markerattrs=(symbol=CircleFilled);
   series x=x y=Pred / group=Group name="curves"; 
   keylegend "obs" "curves" / title="Group"; /* separate items for markers and lines */
run;
Legend that shows symbols and line patterns separately

The legend contains all the relevant information about symbols, colors, and line patterns, but it is longer than it needs to be. When you display curves and markers for the same groups, you can obtain a more compact representation by merging the symbols and line patterns into a single legend, as shown in the next sections.

The MERGEDLEGEND statement in GTL

If you are comfortable with the Graph Template Language (GTL) in SAS, then you can use the MERGEDLEGEND statement to create a merged legend. The following statements create a graph template that overlays a scatter plot and a series plot and creates a merged legend in which each item contains both a symbol and a line pattern:

proc template;
  define statgraph ScatCurveLegend;
  dynamic _X _Y _Curve _Group _Title;       /* dynamic variables */
  begingraph;
  entrytitle _Title;  
   layout overlay; 
     scatterplot x=_X y=_Y / group=_Group name="obs" markerattrs=(symbol=CircleFilled);
     seriesplot x=_X y=_Curve / group=_Group name="curves";
     /* specify exactly two names for the MERGEDLEGEND statement */
     mergedlegend "obs" "curves" / border=true title=_Group;
   endlayout;
  endgraph;
  end;
run;
 
proc sgrender data=ScatCurve template=ScatCurveLegend;
   dynamic _Title="Overlay Curves, Merged Legend"
           _X="x" _Y="y" _Curve="Pred" _Group="Group";
run;
Merged legend that combines symbols and line patterns in SAS

The graph is the same as before, but the legend is different. The MERGEDLEGEND statement is used in many graphs that are created by SAS regression procedures. As you can see, the legend shows the symbol and line pattern for each group.

The LEGENDITEM statement in PROC SGPLOT

Unfortunately, the MERGEDLEGEND statement is not supported by PROC SGPLOT. However, SAS 9.4M5 supports the LEGENDITEM statement in PROC SGPLOT. The LEGENDITEM statement enables you to construct a custom legend and gives you complete control over every item in a legend. The following example constructs a legend that uses the symbols, line patterns, and group values that are present in the graph. Notice that you have to specify these attributes manually, as shown in the following example:

title "Merged Legend by Using LEGENDITEM in SGPLOT";
proc sgplot data=ScatCurve;
   scatter x=x y=y / group=Group markerattrs=(symbol=CircleFilled);
   series x=x y=Pred / group=Group; 
   legenditem type=markerline name="I1" /
              label="1" lineattrs=GraphData1 markerattrs=GraphData1(symbol=CircleFilled);
   legenditem type=markerline name="I2" /
              label="2" lineattrs=GraphData2 markerattrs=GraphData2(symbol=CircleFilled);
   keylegend "I1" "I2" / title="Group"; 
run;

The graph and legend are identical to the previous graph and are not shown.

The advantage of the LEGENDITEM statement is that you can lay out the legend however you choose; the legend is not tied to the attributes in any previous graph component. However, this is also a disadvantage. If you change the marker attributes in the SCATTER statement, the legend will not reflect that change until you manually modify each LEGENDITEM statement. Although there is no denying the power of the LEGENDITEM statement, the MERGEDLEGEND statement in the GTL always faithfully and automatically reflects the attributes in the SCATTERPLOT and SERIESPLOT statements.

In summary, the SG procedures in SAS automatically create a legend. When you overlay multiple plots, you can use the KEYLEGEND statement to control which plots contribute to the legend. However, it is also possible to merge the symbols and line patterns into a single compact legend. In the GTL, you can use the MERGEDLEGEND statement. In SAS 9.4M5, PROC SGPLOT supports the LEGENDITEM statement to customize the items that appear in a legend.

The post Merged legends: Overlay a symbol and line in a legend item appeared first on The DO Loop.

February 9, 2018
 

As the old saying goes, the only constant is change. The corollary that I’ll add is this: and the pace of change feels like it's directly proportional to Moore’s Law. Utilities used to be favorites for conservative investors. Smooth. Steady. When there was change, it was incremental. But today, uncertainty [...]

Trends to watch for utilities in the analytics economy was published on SAS Voices by Mike F. Smith

February 9, 2018
 

Want to see my newly minted certified professional badge? Scroll down to take a peek. Yes, I managed to successfully complete the Base SAS Programmer certification exam… with, ahem, flying colors I might add. Here are my tips to tackle the Base SAS certification exam: 1.  Get clear on the [...]

The post Demystifying certification (Part 3): To the finish appeared first on SAS Learning Post.

February 8, 2018
 

In my last article, I worked with an example of using custom polygon data to create a regional geo map in SAS Visual Analytics 7.4. In this article, I will use almost the same example to illustrate the ease of implementing custom polygons to produce the same regional map in SAS Visual Analytics 8.2.

In this example, as in my last blog, the site has sales data for each sales region in the US and would like to display a geo map of the regions.

The six sales regions are:

Custom polygons in SAS Visual Analytics

We will again start with the MAPSGFK.US_STATES dataset, which contains the data required to overlay all states of the US on a VA region geomap and has these columns:
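If you want to list the columns yourself, a quick PROC CONTENTS step will do it:

proc contents data=mapsgfk.us_states varnum;
run;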

As in my last post, we will add the sales regions (REGION) column and values using DATA step code, and then use GREMOVE to remove the state boundaries, leaving the region boundary points. For a look at that code, see my previous blog.

The following DATA step adds the necessary columns and values to the polygon dataset so that the data has the form that VA expects. Note that the LAT and LONG columns are already in unprojected form, so we just assign those values to Y and X; our column names will then more closely match what we see in the VA interface when creating the geographic data item. We also create a SEQUENCE column, required by VA 8.2, using the values of the automatic variable _N_.

data mydata.regions;
   set mydata.regions;
   sequence = _n_;
   id = region;
   x = long;
   y = lat;
   keep ID SEQUENCE SEGMENT X Y;
run;

The polygon table, REGIONS, now has the following columns.

The REGIONSALES dataset, which contains the region and measure data, has these columns:

Both datasets should be loaded into memory. Sign in to SAS Visual Analytics – Explore and Visualize Data and create a new report with data source REGIONSALES.
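For example, a minimal sketch of loading the two tables into CAS with PROC CASUTIL follows. The caslib name Public and the libref for the source tables are assumptions; adjust them for your site:

cas mysession;                                        /* start a CAS session (name is arbitrary) */
proc casutil;
   load data=mydata.regions     outcaslib="Public" casout="REGIONS"     promote;
   load data=mydata.regionsales outcaslib="Public" casout="REGIONSALES" promote;
quit;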

Create a new Geography data item from REGION as shown below, also specifying a New Polygon Provider with values shown on the next several screen shots.  Give the new provider a name and label, and specify the CAS server, library, and table name.

Scroll down to add the ID, Sequence, Segment, latitude and longitude columns.

The new geography data item, after clicking OK:

Now create a Geo Map of type Regions as shown:

Creating a regional map with custom polygons in SAS Visual Analytics 8.2 was published on SAS Users.

February 8, 2018
 

By default, SAS Visual Analytics 7.4 supports country and state level polygons for regional geomaps. In SAS Visual Analytics 7.4, custom shape files are now supported, as well. This means that if a site has their own custom polygon data that defines custom regions, it’s possible to create a region geomap that displays those regions.

Implementing the process requires completing some preparatory steps, specifically the execution of some SAS code, but the steps are explained in Appendix 2 of the SAS Visual Analytics 7.4: Administration Guide. The SAS program that completes the steps is provided for download at http://support.sas.com/rnd/datavisualization/vageo/va74polygons.sas.

Two examples using the program are provided in Appendix 2 for US counties and German provinces. The instructions in Appendix 2 assume that the custom polygon data is provided in ESRI shape file format, which is likely the most common use-case. The site will need access to a SAS programming environment and SAS/GRAPH software, and whoever completes the process will need access to the SAS Visual Analytics configuration directory and the ability to restart services—so an administrator-type person will be required.

One common request is to provide a regional geomap, where the regions are site-defined groups of states or provinces of a country. In this example problem, the site has sales data for each sales region in the US and would like to display a geo map of the regions.

Custom regional map in SAS Visual Analytics

For this type of region/province example, you will likely be able to use one of the maps already provided by SAS in the MAPSGFK library to produce your region boundaries. For more information on the datasets in the MAPSGFK library, see this paper.

The MAPSGFK.US_STATES dataset contains the data required to overlay all states of the US on a VA region geomap and has these columns:

The highlighted columns, STATECODE, LONG, and LAT, will be particularly useful, but first, the sales region (REGION) column and values must be added using simple DATA step code. The unnecessary FIPS code (STATE) can be dropped in the same DATA step. Note that the region values are assigned in upper case, as these will later be converted to ID values, which VA expects to be in upper case.

data regions;
   length region $ 12;
   drop state;
   set mapsgfk.us_states;
      if statecode in ('AK','HI','PR') then delete;
      else if statecode in ('WA','MT','OR','ID','WY')
         then region='NORTHWEST';
      else if statecode in ('CA','NV','UT','AZ','CO','NM')
         then region='SOUTHWEST'; 
      else if statecode in ('ND','SD','NE','MN','WI','MI','IA','IL','IN')
         then region='NORTHCENTRAL'; 
      else if statecode in ('KS','OK','TX','MO','AR')
         then region='SOUTHCENTRAL'; 
      else if statecode in ('ME','NH','VT','MA','RI','CT','NY','PA','NJ','OH','DE','MD','DC')
         then region='NORTHEAST';
      else if statecode in ('KY','WV','VA','TN','NC','MS','AL','LA','GA','SC','FL')
         then region='SOUTHEAST';
      run;

The data is then sorted by the REGION values, a requirement of the SAS/GRAPH GREMOVE procedure, which is used to remove the internal state boundary data points, leaving the region boundary points only.

proc sort data=regions;
   by region;
run;
 
proc gremove data=regions out=mapscstm.regions1;
   by region;
   id statecode;
run;

To complete the process, since the LAT and LONG values are already in the form that VA needs (unprojected) and we are using a SAS dataset rather than the ESRI shape file format, we’ll only use a part of the code from the downloadable program mentioned at the beginning of the blog.

First, create a mapscstm directory under /SASHome/SASFoundation/9.4 to store the custom polygon dataset.  Make sure that the library is accessible to the SAS session by including a libname statement in the appserver_autoexec_usermods.sas file, found in config/Lev1/SASApp, and then restarting the Object Spawner.

Example:

libname MAPSCSTM "/SASHome/SASFoundation/9.4/mapscstm";

Tip: Be sure to back up the original ATTRLOOKUP and CENTLOOKUP datasets before running any additional code, as you will be modifying the originals.
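For example, a minimal sketch of one way to make those backups; the BACKUP libref and its path are hypothetical:

libname backup "/backup/location";            /* hypothetical backup location */
data backup.attrlookup;  set valib.attrlookup;  run;
data backup.centlookup;  set valib.centlookup;  run;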

To complete creation of the polygon dataset, you will need to execute only a part of the downloadable program to:
• Make sure that your polygon dataset has all of the columns expected by SAS Visual Analytics.
• Add the region attributes to the ATTRLOOKUP.
• Add the region center point locations to the CENTLOOKUP dataset.

%let REGION_LABEL=USRegions;            /* the label for the custom region */
%let REGION_PREFIX=R1;                  /* unique ISO 2-letter code */
%let REGION_ISO=000;                    /* unique ISO code */
%let REGION_DATASET=MAPSCSTM.REGIONS1;  /* polygon data set to be created - be sure to use suffix "1" */

Note that the downloadable program includes additional macro assignments and additional code, but since our data is already in the form of a SAS dataset, rather than ESRI shape file format, we won’t be using all of the code.

The following DATA step adds the necessary columns/values to the polygon dataset so that the form of the data is what is expected by VA. Note that the LAT and LONG columns are already in unprojected form, so we just assign the same values to X and Y. (VA doesn’t actually use the X,Y columns from the polygon dataset.)

data &REGION_DATASET.;
   set &REGION_DATASET.;
   where density <= 3; 
   id=region;
   idname=region;
   x=long;  
   y=lat;
   ISO = "&REGION_ISO.";
   RESOLUTION = 1;
   LAKE = 0;
   ISOALPHA2 = "&REGION_PREFIX.";
   AdminType = "regions";
   keep ID SEGMENT IDNAME LONG LAT X Y ISO DENSITY RESOLUTION LAKE ISOALPHA2 AdminType;
   run;

Then PROC SQL steps are executed to add rows relative to the custom polygons to the ATTRLOOKUP and CENTLOOKUP datasets:

This step adds the USRegions row to ATTRLOOKUP:

proc sql;
   insert into valib.attrlookup
      values ( 
         "&REGION_LABEL.",         /* IDLABEL=State/Province Label */
         "&REGION_PREFIX.",        /* ID=SAS Map ID Value */
         "&REGION_LABEL.",         /* IDNAME=State/Province Name */
         "",                       /* ID1NAME=Country Name */
         "",                       /* ID2NAME */
         "&REGION_ISO.",           /* ISO=Country ISO Numeric Code */
         "&REGION_LABEL.",         /* ISONAME */
         "&REGION_LABEL.",         /* KEY */
         "",                       /* ID1=Country ISO 2-Letter Code */
         "",                       /* ID2 */
         "",                       /* ID3 */
         "",                       /* ID3NAME */
         0                         /* LEVEL (0=country level, 1=state level) */
         );
quit;

This step adds a row to ATTRLOOKUP for each individual region:

proc sql;
   insert into valib.attrlookup
      select distinct 
         IDNAME,            /* IDLABEL=State/Province Label */
         ID,                /* ID=SAS Map ID Value */
         IDNAME,            /* IDNAME=State/Province Name */
 
         "&REGION_LABEL.",  /* ID1NAME=Country Name */
         "",                /* ID2NAME */
         "&REGION_ISO.",    /* ISO=Country ISO Numeric Code */
         "&REGION_LABEL.",  /* ISONAME */
         trim(IDNAME) || "|&REGION_LABEL.",  /* KEY */
         "&REGION_PREFIX.",   /* ID1=Country ISO 2-Letter Code */
         "",                  /* ID2 */
         "",                  /* ID3 */
         "",                  /* ID3NAME */
         1                    /* LEVEL (1=state level) */
   from &REGION_DATASET.;
quit;

This step calculates and adds the central location point for each of the regions to the CENTLOOKUP dataset.   The site data contains only the 48 contiguous states (no Alaska or Hawaii). If Alaska and Hawaii had been included, a different algorithm would need to be used to calculate the central location.

proc sql;
   /* Add custom region */
   insert into valib.centlookup
      select distinct
         "&REGION_DATASET." as mapname,
         "&REGION_PREFIX." as ID,
         avg(x) as x,
         avg(y) as y
      from &REGION_DATASET.;
 
   /* Add custom provinces */
   insert into valib.centlookup
      select distinct
         "&REGION_DATASET." as mapname,
         ID as ID,
         avg(x) as x,
         avg(y) as y
      from &REGION_DATASET.
         group by id;
quit;

After executing the code above, you will need to restart the Web Application server, so that SAS Visual Analytics has access to the new polygons.

Code is also included in the downloadable program to create a dataset for validating your results. The validation dataset includes columns for the ID and IDNAME of the regions, in addition to two randomly calculated measures. In our case, we will instead just use our original REGIONSALES dataset containing the regional sales data.

1. Sign into SAS Visual Analytics and create a new exploration with data source REGIONSALES.
2. Create a Geo data item from State: Right-click Regions, select Geography > Subdivision (State, Province) Names. From the Country or Region drop-down list, select the USRegions region label.
3. Create a geo map visualization. Select Regions for the map style, Regions for the Geography role, and salesamt for the Color role.

Your regions should display, similar to this:

You can also include the region data item in a hierarchy with the state data item to produce a drill-down region map:

Or a bubble or coordinate map:

I hope this example has been helpful to users of SAS Visual Analytics 7.4. In my next blog, you will see that this process is tremendously simplified by new mapping features in SAS Visual Analytics 8.2.

Creating a custom regional map in SAS Visual Analytics 7.4 was published on SAS Users.

February 7, 2018
 
Multiple-Birthday Problem: The distribution of the number of shared birthdays among 23 random people

If N random people are in a room, the classical birthday problem provides the probability that at least two people share a birthday. The birthday problem does not consider how many birthdays are in common. However, a generalization (sometimes called the Multiple-Birthday Problem) examines the distribution of the number of shared birthdays. Specifically, among N people, what is the probability that exactly k birthdays are shared (k = 1, 2, 3, ..., floor(N/2))? The bar chart at the right shows the distribution for N=23. The heights of the bars indicate the probability of 0 shared birthdays, 1 shared birthday, and so on.

This article uses simulation in SAS to examine the multiple-birthday problem. If you are interested in a theoretical treatment, see Fisher, Funk, and Sams (2013, p. 5-14).

The probability of multiple shared birthdays for N=23

You can explore the classical birthday problem by using probability theory or you can estimate the probabilities by using Monte Carlo simulation. The simulation in this article assumes 365 equally distributed birthdays, but see my previous article to extend the simulation to include leap days or a nonuniform distribution of birthdays.

The simulation-based approach enables you to investigate the multiple birthday problem. To begin, consider a room that contains N=23 people. The following SAS/IML statements are taken from my previous article. The Monte Carlo simulation generates one million rooms of size 23 and estimates the proportion of rooms in which the number of shared birthdays is 0, 1, 2, and so forth.

proc iml;
/* Function that simulates B rooms, each with N people, and counts the number of shared 
   (matching) birthdays in each room. The return value is a row vector that has B counts. */
start BirthdaySim(N, B);                 
   bday = Sample(1:365, B||N);             /* each column is a room */
   match = N - countunique(bday, "col");   /* number of unique birthdays in each col */
   return ( match );
finish;
 
call randseed(12345);                      /* set seed for random number stream */
NumPeople = 23;
NumRooms = 1e6;
match = BirthdaySim(NumPeople, NumRooms);  /* 1e6 counts */
 
call tabulate(k, freq, match);  /* summarize: k=number of shared birthdays, freq=counts */ 
prob = freq / NumRooms;         /* proportion of rooms that have 0, 1, 2,... shared birthdays */
print prob[F=BEST6. C=(char(k,2)) L="Estimate of Probability (N=23)"];
Multiple-Birthday Problem: The distribution of shared birthdays among 23 random people

The output summarizes the results. In less than half the rooms, no person shared a birthday with anyone else. In about 36% of the rooms, one birthday is shared by two or more people. In about 12% of the rooms, there were two birthdays that were shared by four or more people. About 2% of the rooms had three birthdays shared among six or more individuals, and so forth. In theory, there is a small probability of a room in which 8, 9, 10, or 11 birthdays are shared, but the probability of these events is very small. The probability estimates are plotted in the bar chart at the top of this article.

There is a second way to represent this probability distribution: you can create a stacked bar chart that shows the cumulative probability of k or fewer shared birthdays for k=1, 2, 3,.... You can see that the probability of 0 matching birthdays is less than 50%, the probability of 1 or fewer matches is 87%, the probability of 2 or fewer matches is 97%, and so forth. The probability for 5, 6, or 7 matches is not easily determined from either chart.

Multiple-Birthday Problem: The cumulative distribution of shared birthdays among 23 random people
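For instance, you can compute the cumulative probabilities directly from the prob vector in the earlier PROC IML session by using the CUSUM function (a minimal sketch, assuming that session is still active):

cumprob = cusum(prob);          /* probability of k or fewer shared birthdays */
print cumprob[F=BEST6. C=(char(k,2)) L="Cumulative Probability (N=23)"];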

Mathematically speaking, the information in the stacked bar chart is equivalent to the information in the regular bar chart. The regular bar chart shows the PMF (probability mass function) for the distribution whereas the stacked chart shows the CDF (cumulative distribution function).

The probability of shared birthdays among N people

The previous section showed the distribution of shared birthdays for N=23. You can download the SAS program that simulates the multiple birthday problem for rooms that contain N=2, 3, ..., and 60 people. For each simulation, the program computes estimates for the probabilities of k matching birthdays, where k ranges from 0 to 30.
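A minimal sketch of such a loop is shown below. It is not the downloadable program: it uses fewer rooms so that it runs quickly, it assumes the BirthdaySim module is defined in the same PROC IML session, and the variable name prob in the output data set is an assumption.

NumRooms = 1e5;                           /* fewer rooms than the article uses */
results = j(1, 3, .);                     /* columns: N, k, prob */
do N = 2 to 60;
   match = BirthdaySim(N, NumRooms);
   call tabulate(k, freq, match);         /* k = number of shared birthdays */
   results = results // ( j(ncol(k),1,N) || colvec(k) || colvec(freq/NumRooms) );
end;
results = results[2:nrow(results), ];     /* drop the missing first row */
create BDayProb from results[colname={"N" "k" "prob"}];
append from results;
close BDayProb;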

Multiple-Birthday Problem: Three distributions of the number of shared birthdays among N=25, 40, and 60 random people

There are a few ways to visualize those 59 distributions. One way is to plot 59 bar charts, either in a panel or by using an animation. A panel of three bar charts is shown to the right for rooms of size N=25, 40, and 60. You can see that the bar chart for N=25 is similar to the one shown earlier. For N=40, most rooms have between one and three shared birthdays. For rooms with N=60 people, most rooms have between three and six shared birthdays. The horizontal axes are truncated at k=12, even though a few simulations generated rooms that contain more than 12 matches.
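For a quick approximation of that panel, here is a minimal sketch with PROC SGPANEL, assuming the BDayProb data set created in the sketch above:

proc sgpanel data=BDayProb;
   where N in (25 40 60) and k <= 12;     /* truncate at k=12, as in the panel */
   panelby N / rows=3;
   vbar k / response=prob;
   colaxis label="Number of Shared Birthdays";
   rowaxis label="Probability";
run;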

Obviously plotting all 59 bar charts would require a lot of space. A different way to visualize these 59 distributions is to place 59 stacked bar charts side by side. This visualization is more compact. In fact, instead of 59 stacked bar charts, you might want to create a stacked band plot, which is what I show below.

The following stacked band chart is an attempt to visualize the distribution of shared birthdays for ALL room sizes N=2, 3, ..., 60. (Click to enlarge.)

Multiple-Birthday Problem: The cumulative distribution of the probability of k or fewer shared birthdays among N random people, N=2,3,...,60
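A minimal sketch of one way to construct such a stacked band chart from the simulated probabilities (again assuming the BDayProb data set from the earlier sketch; the derived data set name BDayCDF is arbitrary):

data BDayCDF;                    /* convert probabilities into stacked lower/upper bounds */
   set BDayProb;
   by N;
   retain upper;
   if first.N then upper = 0;
   lower = upper;
   upper = upper + prob;
run;
 
proc sgplot data=BDayCDF;
   band x=N lower=lower upper=upper / group=k;
   xaxis label="Number of People in the Room";
   yaxis label="Cumulative Probability";
run;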

How do you interpret this graph? A room that contains N=23 people corresponds to drawing a vertical line at N=23. That vertical line intersects the red, green, and brown bands near 0.5, 0.87, and 0.97. Therefore, these are the probabilities that a room with 23 people contains 0 shared birthdays, 1 or fewer shared birthdays, and 2 or fewer shared birthdays. The vertical heights of the bands qualitatively indicate the individual probabilities.

Next consider a room that contains N=40 people. A vertical line at N=40 intersects the colored bands at 0.11, 0.37, 0.66, 0.86, and 0.95. These are the cumulative probabilities P(k ≤ 0), P(k ≤ 1), P(k ≤ 2), and so forth. If you need the exact values, you can use PROC PRINT to display the actual probability estimates. For example:

proc print data=BDayProb noobs;
   where N=40 and k < 5;
run;
Multiple-Birthday Problem: The distribution of shared birthdays among N=40 random people

Summary

In summary, this article examines the multiple-birthday problem, which is a generalization of the classical birthday problem. If a room contains N random people, the multiple-birthday problem examines the probability that k birthdays are shared by two or more people for k = 0, 1, 2, ..., floor(N/2). This article shows how to use Monte Carlo simulation in SAS to estimate the probabilities. You can visualize the results by using a panel of bar charts or by using a stacked band plot to visualize the cumulative distribution. You can download the SAS program that simulates the multiple birthday problem, estimates the probabilities, and visualizes the results.

What do you think? Do you like the stacked band plot that summarizes the results in one graph, or do you prefer a series of traditional bar charts? Leave a comment.

The post The distribution of shared birthdays in the Birthday Problem appeared first on The DO Loop.