
April 19, 2017
 

Restricted cubic splines are a powerful technique for modeling nonlinear relationships by using linear regression models. I have attended multiple SAS Global Forum presentations that show how to use restricted cubic splines in SAS regression procedures. However, the presenters have all used the %RCSPLINE macro (Frank Harrell, 1988) to generate a SAS data set that contains new variables for the spline basis functions. They then use those basis variables in a SAS regression procedure.

Since SAS 9.3, many SAS regression procedures provide a native implementation of restricted cubic splines by using the EFFECT statement in SAS. This article provides examples of using splines in regression models. Because some older procedures (such as PROC REG) do not support the EFFECT statement, the article also shows how to use SAS procedures to generate the spline basis, just like the %RCSPLINE macro does.

If you are not familiar with splines and knots, read the overview article "Understanding splines in the EFFECT statement." You can also read the documentation for the EFFECT statement, which includes an overview of spline effects as well as a brief description of restricted cubic splines, which are also called "natural cubic splines." I think the fact that the SAS documentation refers to the restricted cubic splines as "natural cubic splines" has prevented some practitioners from realizing that SAS supports restricted cubic splines.

Regression with restricted cubic splines in SAS

This section provides an example of using splines in PROC GLMSELECT to fit a GLM regression model. Because the functionality is contained in the EFFECT statement, the syntax is the same for other procedures. For example, if you have a binary response you can use the EFFECT statement in PROC LOGISTIC. If you are fitting a generalized linear model or a mixed model, you can use PROC GLIMMIX.

This example fits the MPG_CITY variable as a function of the WEIGHT variable for vehicles in the Sashelp.Cars data set. The data and a scatter plot smoother are shown in the adjacent graph. (To produce the graph, the following statements sort the data, but that is not required.) The smoother is based on a restricted cubic spline basis with five knots. Notice that the SELECTION=NONE option in the MODEL statement turns off the variable selection features of PROC GLMSELECT, which makes the procedure fit one model just like PROC GLM.

/* create sample data; sort by independent variable for graphing */
proc sort data=sashelp.cars   
          out=cars(keep=mpg_city weight model);
by weight;
run;
 
/* Fit data by using restricted cubic splines.
   The EFFECT statement is supported by many procedures:
   GLIMMIX, GLMSELECT, LOGISTIC, PHREG, PLS, QUANTREG, ROBUSTREG, ... */
ods select ANOVA ParameterEstimates SplineKnots;
proc glmselect data=cars;
  effect spl = spline(weight / details naturalcubic basis=tpf(noint)                 
                               knotmethod=percentiles(5) );
   model mpg_city = spl / selection=none;         /* fit model by using spline effects */
   output out=SplineOut predicted=Fit;            /* output predicted values for graphing */
quit;
 
title "Restricted Cubic Spline Regression";
proc sgplot data=SplineOut noautolegend;
   scatter x=weight y=mpg_city;
   series x=weight y=Fit / lineattrs=(thickness=3 color=red);
run;

The EFFECT statement with the SPLINE option is used to generate spline effects from the WEIGHT variable. The effects are assigned the collective name 'spl'. Putting 'spl' on the MODEL statement says "use the spline effects that I created." You can specify multiple EFFECT statements. Spline effects depend on three quantities: the kind of spline, the number of knots, and the placement of the knots.

  • You specify restricted cubic splines by using the NATURALCUBIC BASIS=TPF(NOINT) options. Technically you do not need the NOINT suboption, but I recommend it because it makes the parameter estimates table simpler.
  • The KNOTMETHOD= option enables you to specify the number and placement of knots. In this example, PERCENTILES(5) places knots at five evenly spaced percentiles of the explanatory variable, which are the 16.7%, 33.3%, 50%, 66.7%, and 83.3% percentiles. Equivalently, the knots are placed at the 1/6, 2/6, 3/6, 4/6, and 5/6 quantiles.

The number and placement of knots for splines

Most researchers use a small number of knots, often three to five. The exact location of knots is not usually critical: if you change the positions slightly the predicted values of the regression change only slightly. A common scheme is to place the knots at fixed percentiles of the data, as shown above. Alternatively, Harrell (Regression Modeling Strategies, 2010 and 2015) suggests heuristic percentiles as shown below, and this scheme seems to be popular among biostatisticians.

Harrell's table of knot placement for restricted cubic splines

The KNOTMETHOD= option on the EFFECT statement provides several options for specifying the locations of knots. Try each of the following options and look at the "SplineKnots" table to see the positions of the knots:

  • KNOTMETHOD=PERCENTILES(5): Places knots at the percentiles 1/6, 2/6, ..., 5/6. An example is shown above.
  • KNOTMETHOD=EQUAL(5): Places knots evenly within the range of the data. If δ = (max-min)/6, then the knots are located at min + i δ for i=1, 2, ..., 5.
  • KNOTMETHOD=RANGEFRACTIONS(0.05 0.275 0.50 0.725 0.95): If you want knots placed unevenly within the range of the data, use this option. For example, the value 0.05 means "place a knot 5% along the data range" and 0.275 means "place a knot 27.5% along the data range." You can separate list values by using spaces or commas.
  • KNOTMETHOD=LIST(2513, 3174, 3474.5, 3903, 5000): This option enables you to list the locations (in data coordinates) of the knots. You can use this option to specify Harrell's schemes. For example, Harrell suggests the 5th, 27.5th, 50th, 72.5th, and 95th percentiles when you use five knots. You can use PROC UNIVARIATE to find these percentiles for the WEIGHT variable and then type the results into the KNOTMETHOD=LIST option.
Comparison of several knot-placement schemes for restricted cubic spline regression in SAS

The adjacent graph shows the predicted values for the four different knot placement methods. (Click to enlarge.) You can see that the general shape of the regression curves is similar regardless of the placement of knots. The differences can be understood by looking at the placement of the first and last knots. The slope of the natural cubic spline fit is linear past the extreme knots, so the placement of the extreme knots dictates where the predictions become linear. For the EQUAL and RANGEFRACTIONS methods, the largest knots are placed at WEIGHT=6300 and WEIGHT=6923, respectively. Consequently, the predictions for large values of WEIGHT look more cubic than linear. In contrast, the largest knot for the PERCENTILES method is placed at 4175 and the largest LIST knot is 5000. The predictions past those values are linear.

Automate Harrell's knot placement suggestions

In the previous section, I manually typed in the values that correspond to the 5th, 27.5th, 50th, 72.5th, and 95th percentiles of the WEIGHT variable, as suggested by Harrell's scheme. However, it is not difficult to automate that step. You can use PROC UNIVARIATE to write the percentile values to a SAS data set and then use a short DATA _NULL_ step to store those values into a macro variable. The macro variable can then be used as the argument to the KNOTMETHOD=LIST option, as follows:

proc univariate data=cars noprint;
   var weight;
   output out=percentiles pctlpts=5 27.5 50 72.5 95 pctlpre=p_;   /* specify the percentiles */
run;
 
data _null_;
set percentiles;
call symput('pctls',catx(',', of _numeric_));   /* put all values into a comma-separated list */
run;
 
%put &=pctls;                 /* optional: display the list of values */
...
   effect spl = spline(weight / ... knotmethod=list(&pctls) ); /* use the macro variable here */
...
PCTLS=2513,3174,3474.5,3903,5000

Write the spline basis functions to a SAS data set

As mentioned earlier, not every SAS procedure supports the EFFECT statement. However, the GLMSELECT, LOGISTIC, and GLIMMIX procedures all provide an OUTDESIGN= option, which enables you to output the design matrix that contains the spline basis functions. Furthermore, PROC LOGISTIC supports the OUTDESIGNONLY option and PROC GLIMMIX supports the NOFIT option so that the procedures do not fit the model but only output the design matrix.

You can use the variables in the design matrix to perform regression or other analyses. For example, the following example writes the restricted cubic basis functions to a data set, then uses PROC REG to fit the model:

/* create SplineBasis = data set that contains spline basis functions */
proc glmselect data=cars outdesign(addinputvars fullmodel)=SplineBasis;
   effect spl = spline(weight / naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
   model mpg_city = spl / selection=none;  
quit;
 
/* use design variables in other procedures */
proc reg data=SplineBasis;
   model mpg_city = spl_:;
   output out=out p=p;
run;

One last comment: the basis functions that are generated by the EFFECT statement are not equal to the basis functions created by the %RCSPLINE macro, but they are equivalent. The EFFECT statement uses the definition from The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman, 2009, 2nd Ed, pp. 144-146). The %RCSPLINE macro implements scaled versions of the basis functions. Thus the parameter estimates will be different, but the predicted values will be the same.
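If you want to verify this for yourself, one simple check is to save the predicted values from both fits and compare them. The following is only a sketch: it assumes that you have already created a data set from a %RCSPLINE-based fit that uses the same knots (the data set name RCSOut and its predicted-value variable PredRCS are placeholders), and that its rows are in the same order as the rows of SplineOut.

/* Sketch: compare predictions from the EFFECT-statement fit (Fit) with
   predictions from a %RCSPLINE-based fit (PredRCS is a placeholder name) */
data Both;
   merge SplineOut(keep=Fit) RCSOut(keep=PredRCS);   /* one-to-one merge; same row order assumed */
run;

proc compare base=Both criterion=1e-6;   /* ignore differences smaller than 1e-6 */
   var Fit;
   with PredRCS;
run;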

Summary

You can use the NATURALCUBIC BASIS=TPF(NOINT) option in the EFFECT statement in SAS to perform regression with restricted cubic splines, which are also called natural cubic splines. You can use the KNOTMETHOD= option to specify the number and placement of the knots. The EFFECT statement is supported by the GLMSELECT, LOGISTIC, and GLIMMIX procedures, among others. These procedures enable you to output a design matrix that contains the spline basis functions, which you can use in procedures that do not support the EFFECT statement.

This article was inspired by several talks that I heard at SAS Global Forum. For more information on this topic, see the following:

The post Regression with restricted cubic splines in SAS appeared first on The DO Loop.

March 13, 2017
 

It is time for Pi Day, 2017! Every year on March 14th (written 3/14 in the US), geeky mathematicians and their friends celebrate "all things pi-related" because 3.14 is the three-decimal approximation to pi. This year I use SAS software to show an amazing fact: you can find your birthday (or any other date) within the first 10 million digits of pi!

Patterns within the digits in pi

Mathematicians conjecture that the decimal expansion of pi exhibits many properties of a random sequence of digits. If so, you should be able to find any sequence of digits within the decimal digits of pi.

If you want to search for a particular date, such as your birthday, you need to choose a pattern of digits that represents the date. For example, Pi Day was first celebrated on 14MAR1988. You can represent that date in several ways. This article uses the MMDDYY representation, which is 031488. You could also use a representation such as 31488, which drops the leading zero for months or days less than 10. Or use the DDMMYY convention, which is 140399.
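If you want SAS to construct these digit strings for you, date formats such as MMDDYYN6. and DDMMYYN6. (the "N" means that no separators are written) do most of the work. The following DATA step is a small sketch of the idea; it is not part of the original program:

data _null_;
   d = '14MAR1988'd;
   mmddyy = put(d, mmddyyn6.);                        /* 031488 */
   ddmmyy = put(d, ddmmyyn6.);                        /* 140388 */
   short  = cats(month(d), day(d), put(d, year2.));   /* 31488: drop the leading zeros */
   put mmddyy= ddmmyy= short=;
run;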




In 2015 I showed how to use SAS software to download the first ten million digits of pi from an internet site. The program then uses PROC PRINT to print six consecutive digits of pi beginning at the 433,422nd digit:

/* read data over the internet from a URL */
filename rawurl url "http://www.cs.princeton.edu/introcs/data/pi-10million.txt"
                /* proxy='http://yourproxy.company.com:80' */ ;
data PiDigits;
   infile rawurl lrecl=10000000;
   input Digit 1. @@;
   Position = _n_;
run;
 
/* Pi Day "birthday" 03/14/88 represented as 031488 */
proc print noobs data=PiDigits(firstobs=433422 obs=433427);
   var Position Digit;
run;

Look at that! The six-digit pattern 031488 appears in the decimal digits of pi! This location also contains the alternative five-digit representation 31488, but you can find that five-digit sequence much earlier, at the 19,466th digit:

/* Alternative representation: Pi Day birthday = 31488 */
proc print noobs data=PiDigits(firstobs=19466 obs=19470);
   var Position Digit;
run;

How did I know where to look for these patterns? Read on to discover how to use SAS to find a particular pattern of digits within the decimal expansion of pi.

Finding patterns within the digits in pi

Last week I showed how to use SAS to search for a particular pattern within a long sequence of digits. Let's use that technique to search for the six-digit Pi Day "birthday," pattern 031488. The following call to PROC IML in SAS defines a function that implements the search algorithm. The program then reads in the first 10 million digits of pi and conducts the search for the pattern:

proc iml;
/* FindPattern: Finds a specified pattern within a long sequence of digits.
   Input: target : row vector of the target pattern, such as {0 3 1 4 8 8}
          digits : col vector of the digits in which to search
   Prints the number of times the pattern appears and the first location of the pattern.
   See https://blogs.sas.com/content/iml/2017/03/10/find-pattern-in-sequence-of-digits.html
*/
start FindPattern(target, digits);
   p = ncol(target);               /* length of target sequence */
   D = lag(digits, (p-1):0);       /* columns shift the digits */
   D = D[p:nrow(digits),];         /* delete first p rows */
   X = (D=target);                 /* binary matrix */
   /* sum across columns. Which rows contain all 1s? */
   b = (X[,+] = p);                /* b[i]=1 if i_th row matches target */
   NumRepl = sum(b);               /* how many times does target appear? */
   if NumRepl=0 then FirstLoc = 0;  else FirstLoc = loc(b)[1];
   result = NumRepl // FirstLoc;
   labl = "Pattern = "  + rowcat(char(target,1));  /* convert to string */
   print result[L=labl F=COMMA9. rowname={"Num Repl", "First Loc"}];
finish;
 
/* read in 10 million digits of pi */
use PiDigits;  read all var {"Digit"};  close;
 
target = {0 3  1 4  8 8};  /* six-digit "birthday" of Pi Day */
call FindPattern(target, Digit);
 
target = {3  1 4  8 8};    /* five-digit "birthday" */
call FindPattern(target, Digit);

Success! The program shows the starting location for each pattern within the digits of pi. The starting locations match the values of the FIRSTOBS= option that was used in PROC PRINT in the previous section.

Search for your birthday within the digits of pi

You can use this program to search for your birthday, your anniversary, or any other special date. (If you prefer to use the SAS DATA step, see the comments of my previous article.) If you don't have SAS, don't despair! I got the idea for this article from a nifty web page on PBS.org that contains an applet that you can use to find your birthday among the digits of pi.

The PBS applet does not require any special software. However, I noticed that it gives slightly different answers from the SAS program I wrote. One trivial difference is that the applet starts with the "3" digit of pi, whereas the SAS program starts with the "1" in the tenths decimal place. So the two programs should give locations that differ by one place. Another difference is that the applet appears to always represent months and days that are less than 10 as a one-digit value, so that the PBS applet represents 02JAN2003 as "1203" rather than "010203." However, I have observed (but cannot explain) that the PBS applet seems to consistently report a location that is three digits more than the SAS-reported location. For example, the applet reports 02JAN2003 (1203) as occurring at the 60,875th digit, whereas the SAS program reports the location as the 60,872nd digit.

Some unique dates within the digits of pi

We know that the Pi Day "birthday" date appears, but what about other dates? I wrote a SAS program that searches for all six-digit MMDDYY representations of dates from 01JAN1900 to 31DEC1999. I verified that all dates are contained in the first 10 million digits of pi except for one. The date 01DEC1954 (120154) is the only date that does not appear!
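The exhaustive-search program is available from the link at the end of this article. As a rough sketch of the idea (this is not the downloadable program), you can loop over the SAS date values for the century, convert each date into a six-digit target vector, and call a version of the FindPattern module that returns its results instead of printing them:

proc iml;
/* assume the FindPattern module from the previous section is defined here,
   modified to return (rather than print) the count and first location */
use PiDigits;  read all var {"Digit"};  close;

do d = '01JAN1900'd to '31DEC1999'd;
   s = putn(d, "MMDDYYN6.");              /* e.g., "031488" */
   target = j(1, 6, .);
   do k = 1 to 6;
      target[k] = num(substr(s, k, 1));   /* convert each character to a digit */
   end;
   /* ... call the modified module and accumulate the results ... */
end;
quit;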

I also discovered some other interesting properties while searching for dates (in the MMDDYY format) within the first 10 million digits of pi:

  1. First appearance: The first date to appear is 28JUN1962 (062862), which appears in the 71st decimal location.
  2. Latest (first) appearance: The date 23NOV1960 (112360) does not appear until the 9,982,545th location.
  3. Rarest: The date 01DEC1954 (120154) is the only date that does not appear. (But the five-digit representation (12154) does appear.)
  4. Second rarest: There are 15 dates that only appear one time.
  5. Most frequent: The date 22JUL1982 (072282) appears 25 times.
  6. Distribution of appearances: Most dates appear between seven and 12 times. The following graph shows the distribution of the number of times that each date appears.
Distribution of the number of times that each date MMDDYY appears in first 10M digits of pi

If you want to discover other awesome facts, you can explore the data yourself. You can download the results (in CSV format) of the exhaustive search. If you want to see how I searched the set of all MMDDYY patterns, you can download the SAS program that I used to create the analyses in this article.

The post Find your birthday in the digits of pi appeared first on The DO Loop.

February 22, 2017
 

Sometimes SAS programmers ask about how to analyze quantiles with SAS. Common questions include:

  • How can I compute 95% confidence intervals for a median in SAS?
  • How can I test whether the medians of two independent samples are significantly different?
  • How can I repeat the previous analyses with other percentiles, such as the 20th percentile or the 90th percentile?

Historically, PROC UNIVARIATE and PROC NPAR1WAY are two procedures in SAS that analysts used for univariate analysis. PROC UNIVARIATE performs standard parametric tests. In contrast, PROC NPAR1WAY performs nonparametric tests and distribution-free analyses. An internet search reveals many resources that describe how to use UNIVARIATE and NPAR1WAY for analyzing quantiles.

However, there is an alternative way to analyze univariate quantiles: PROC QUANTREG. Although QUANTREG is designed for quantile regression, the same procedure can easily analyze quantiles of univariate data. All you need to do is omit the regressors from the right side of the MODEL statement and the procedure will analyze the "response" variable.

Be aware that the QUANTREG procedure uses an optimization algorithm to perform its analysis. This can sometimes result in different estimates than a traditional computation. For example, if the data set has an even number of observations and the middle values are a and b, one estimate for the median is the average of the two middle values (a+b)/2. The QUANTREG procedure might provide a different estimate, which could be any value in [a, b]. This difference is most noticeable in small samples. (Don't let this bother you too much. There are many definitions for quantile estimates. SAS supports five different definitions for calculating quantiles.)
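You can see this behavior with a tiny example. In the following sketch (which is not part of the analysis below), the data contain four observations, so the traditional median estimate is the average of the two middle values, whereas QUANTREG might report either middle value:

data Tiny;
   input x @@;
   datalines;
1 2 3 4
;

proc univariate data=Tiny noprint;
   var x;
   output out=TradMedian median=Median;   /* traditional estimate: (2+3)/2 = 2.5 */
run;

proc quantreg data=Tiny;
   model x = / quantile=0.5;              /* optimization-based estimate: some value in [2, 3] */
run;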

Confidence intervals for percentiles

I have previously shown how to compute confidence intervals for percentiles in SAS by using PROC UNIVARIATE. The following statements compute the 20th, 50th, and 90th percentiles for the cholesterol levels of 5209 patients in a medical study, along with 95% confidence intervals for the quantiles. The computation is shown twice: first with PROC UNIVARIATE, then with PROC QUANTREG.

/* 1. Use PROC UNIVARIATE to get 95% CIs for 20th, 50th, and 90th pctls */
proc univariate data=Sashelp.Heart noprint;
   var Cholesterol;
   output out=pctl pctlpts=20 50 90 pctlpre=p
          cipctldf=(lowerpre=LCL upperpre=UCL);    /* 12.1 options (SAS 9.3m2) */
run;
 
data QUni;  /* rearrange the statistics into a table */
set pctl;
Quantile = 0.2; Estimate = p20; Lower = LCL20; Upper = UCL20; output;
Quantile = 0.5; Estimate = p50; Lower = LCL50; Upper = UCL50; output;
Quantile = 0.9; Estimate = p90; Lower = LCL90; Upper = UCL90; output;
keep Quantile Estimate Lower Upper;
run;
 
title "UNIVARIATE Results"; 
proc print noobs; run;
 
/**************************************/
/* 2. Alternative: Use PROC QUANTREG! */
ods select none; ods output ParameterEstimates=QReg ;
proc quantreg data=Sashelp.Heart;
   model Cholesterol = / quantile=0.2 0.5 .9;
run;
ods select all;
 
title "QUANTREG Results"; 
proc print noobs;
   var Quantile Estimate LowerCL UpperCL;
run;

The output shows that the confidence intervals (CIs) for the quantiles are similar, although the QUANTREG intervals are slightly wider. Although UNIVARIATE can produce CIs for these data, the situation changes if you add a weight variable. The UNIVARIATE procedure supports estimates for weighted quantiles, but does not produce confidence intervals. However, the QUANTREG procedure can provide CIs even for a weighted analysis.
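For the record, a weighted analysis is only a WEIGHT statement away. The following sketch uses hypothetical names; MyData, Y, and SamplingWeight are placeholders for your data set, analysis variable, and weight variable:

proc quantreg data=MyData;
   weight SamplingWeight;                 /* observation weights */
   model Y = / quantile=0.2 0.5 0.9;      /* weighted quantile estimates and CIs */
run;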

Test the difference of medians

In general, PROC QUANTREG can compute statistics for quantiles that UNIVARIATE cannot. For example, you can use the ESTIMATE statement in QUANTREG to get a confidence interval for the difference between medians in two independent samples. If the confidence interval does not contain 0, you can conclude that the medians are significantly different.

The adjacent box plot shows the distribution of diastolic blood pressure for male and female patients in a medical study. Reference lines are drawn at the median values for each gender. You might want to estimate the median difference in diastolic blood pressure between male and female patients and compute a confidence interval for the difference. The following call to PROC QUANTREG estimates those quantities:

ods select ParameterEstimates Estimates;
proc quantreg data=Sashelp.Heart;
   class sex;
   model diastolic = sex / quantile=0.5;
   estimate 'Diff in Medians' sex 1 -1 / CL;
run;

The syntax should look familiar to programmers who use PROC GLM to compare the means of groups. However, this computation compares medians of groups. The analysis indicates that female patients have a diastolic blood pressure that is 3 points lower than male patients. The 95% confidence interval for the difference does not include 0, therefore the difference is statistically significant. By changing the value on the QUANTILE= option, you can compare quantiles other than the median. No other SAS procedure provides that level of control over quantile estimation.
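For example, the following sketch repeats the analysis for the 90th percentile by changing only the QUANTILE= value and the label on the ESTIMATE statement:

proc quantreg data=Sashelp.Heart;
   class sex;
   model diastolic = sex / quantile=0.9;
   estimate 'Diff in 90th Percentiles' sex 1 -1 / CL;
run;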

Conclusions

PROC QUANTREG provides another tool for the SAS programmer who needs to analyze quantiles. Although QUANTREG was written for quantile regression, the procedure can also analyze univariate samples. You can use the ESTIMATE statement to compare quantiles across groups and to obtain confidence intervals for the parameters.

In general, SAS regression procedures enable you to conduct univariate analyses that are not built into any univariate procedure.

The post Quantile estimates and the difference of medians in SAS appeared first on The DO Loop.

February 20, 2017
 
Distribution of colors of M&M's

Many introductory courses in probability and statistics encourage students to collect and analyze real data. A popular experiment in categorical data analysis is to give students a bag of M&M® candies and ask them to estimate the proportion of colors in the population from the sample data. In some classes, the students are also asked to perform a chi-square analysis to test whether the colors are uniformly distributed or whether the colors match a hypothetical set of proportions.

M&M's® have a long history at SAS. SAS is the world’s largest corporate consumer of M&M's. Every Wednesday a SAS employee visits every breakroom on campus and fills two large containers full of M&M's. This article uses SAS software to analyze the classic "distribution of colors" experiment.

The proportion of colors for M&M's

The "plain" M&M candies (now called "milk chocolate M&M's") are produced by the Mars, Inc. company. The distribution of colors in M&M's has a long and colorful history. The colors and proportions occasionally change, and the distribution is different for peanut and other varieties. A few incidents from my lifetime that made the national news are:

  • 1976: Red M&M's are replaced by orange. This was a big story. "Red dye #2" had been discovered to be a carcinogen. Although Mars did not use this dye in their candies, the company changed colors to alleviate customer concerns.
  • 1986: Red M&M's are brought back. Orange stays.
  • 1995: The tan color is replaced by a more vivid color. In a promotional campaign, the public is asked to vote for the replacement color. Ten million vote; blue wins in a landslide.
  • Late 1990s: The M&M web site lists the distribution of colors. Circa 1997, the color distribution was
    30% brown, 20% yellow, 20% red, 10% orange, 10% green, and 10% blue.
    Statisticians and educators rejoice and publish many papers on the topic.
  • 2008: Mars changes the color distribution to
    24% blue, 20% orange, 16% green, 14% yellow, 13% red, 13% brown.
    Some time later, the proportions were removed from the web site and have not been restored.
  • 2017: What is the current distribution of colors? Read on for an interesting story!

Proportions and chi-square test

The breakroom containers at SAS are filled from two-pound bags. So as to not steal all the M&M's in the breakroom, I conducted this experiment over many weeks in late 2016 and early 2017, taking one scoop of M&M's each week. The following data set contains the cumulative counts for each of the six colors in a sample of size N = 712:

data MandMs;
input Color $7. Count;
datalines;
Red     108
Orange  133
Yellow  103
Green   139
Blue    133
Brown    96
;

A bar chart that shows the observed distribution of colors in M&M's is shown at the top of this article.

To estimate the proportion of colors in the population, simply divide each count by the total sample size, or use the FREQ procedure in SAS. PROC FREQ also enables you to run a chi-square test that compares the sample counts to the expected counts under a specified distribution. The most recent published distribution is from 2008, so let's test those proportions:

proc freq data = MandMs order=data;
   weight Count;
   tables Color / nocum chisq
   /* 2008 proportions:  red orange yellow green blue brown */
                  testp=(0.13  0.20  0.14  0.16  0.24  0.13);
run;
Chi-square test for distribution of colors in M&M's

The observed and expected proportions are shown in the table to the right. The chi-square test rejects the null hypothesis at the α = 0.05 significance level. In other words, the distribution of colors for M&M's in this 2017 sample does NOT appear to be the same as the color distribution from 2008! You can see this visually from the bar chart: the red and green bars are too tall and the blue bar is too short compared with the expected values.

You need a large sample to be confident that this empirical deviation is real. After collecting data for a few weeks, I did a preliminary analysis that analyzed about 300 candies. With that smaller sample, the difference between the observed and expected proportions could be attributed to sampling variability and so the chi-square test did not reject the null hypothesis. However, while running that test I noticed that the green and blue colors accounted for the majority of the difference between the observed and theoretical proportions, so I decided to collect more data.

Simultaneous confidence intervals for the M&M proportions

As I explained in a previous article, you can use the sample proportions to construct simultaneous confidence intervals for the population proportions. The following SAS/IML statements load and call the functions from the previous post:

%include "conint.sas";         /* define the MultCI and MultCIPrint modules */
proc iml; 
load module=(MultCI MultCIPrint);  
use MandMs; read all var {"Color" "Count"}; close;
 
alpha = 0.05;
call MultCIPrint(Color, Count, alpha, 2); /* construct CIs using Goodman's method */
Simultaneous confidence limits for proportions of colors in plain M&M's

The table indicates that the published 2008 proportion for blue (0.24) is far outside the 95% confidence interval, and the proportion for green (0.16) is just barely inside its interval. That by itself does not prove that the 2008 proportions are no longer valid (we might have gotten unlucky during sampling), but combined with the earlier chi-square test, it seems unlikely that the 2008 proportions are applicable to these data.

Calling the experts

The proportions published in 2008 for green and blue do not seem to match the 2017 sample proportions. For this large sample, the published proportion of blue is too large whereas the published proportion of green is too small.

From reading previous articles, I know that the Customer Care team at M&M/Mars is very friendly and responsive. Apparently they get asked about the distribution of colors quite often, so I sent them a note. The next day they sent a breakdown of the colors for all M&M candies.

Interestingly, plain (and peanut) M&M's are now produced at two different factories in the US, and the factories do not use the same mixture of colors! You need to look on the packaging for the manufacturing code, which is usually stamped inside a rectangle. In the middle of the code will be the letters HKP or CLV. For example, the code might read 632GCLV20.

  • CLV: The Cleveland plant uses the following proportion of colors for plain M&M's:
    Red=0.131, Orange=0.205, Yellow=0.135, Green=0.198, Blue=0.207, and Brown=0.124.
  • HKP: The Hackettstown, NJ, plant uses the following proportion of colors for plain M&M's:
    Red=0.125, Orange=0.25, Yellow=0.125, Green=0.125, Blue=0.25, and Brown=0.125.

Although I did not know about the manufacturing codes when I collected the data, I think it is clear that the bulk of my data came from the CLV plant. You can create a graph that shows the sample proportions, the 95% simultaneous confidence intervals, and vertical hash marks to indicate the CLV population parameters, as follows:
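The code that creates the graph is part of the complete program (available at the end of this article). A minimal sketch of one way to build such a plot is shown below; it assumes that you have assembled a data set (called CIplot here, a placeholder name) that contains the color, the sample proportion (Estimate), the confidence limits (Lower and Upper) from the MultCI module, and the CLV proportion (CLV):

proc sgplot data=CIplot;
   scatter x=Color y=Estimate / yerrorlower=Lower yerrorupper=Upper
           markerattrs=(symbol=CircleFilled) legendlabel="Sample proportion with 95% CI";
   scatter x=Color y=CLV / markerattrs=(symbol=DiamondFilled size=10)
           legendlabel="CLV proportion";
   yaxis label="Proportion" grid;
run;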

Observed and Expected proportions of M&M color (2017)

The graph shows that the observed proportions are close to the proportions from the CLV plant. All proportions are well within the 95% simultaneous confidence intervals from the data. If you rerun the PROC FREQ chi-square analysis with the CLV proportions, the test does not reject the null hypothesis.
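That follow-up test is a one-line change to the earlier PROC FREQ call: substitute the CLV proportions (in the same red-orange-yellow-green-blue-brown order) on the TESTP= option, as in the following sketch:

proc freq data=MandMs order=data;
   weight Count;
   tables Color / nocum chisq
   /* CLV proportions:   red  orange yellow green  blue  brown */
                  testp=(0.131 0.205 0.135 0.198 0.207 0.124);
run;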

Summary

The experimental evidence indicates that the colors of plain M&M's in 2017 do not match the proportions that were published in 2008.

After contacting the M&M/Mars Customer Care team, I was sent a new set of proportions for 2017. The color proportions now depend on where the candies were manufactured. My data matches the proportion of colors from the Cleveland plant (manufacturing code CLV).

If you are running this analysis yourself, be sure to record whether your candies came from the HKP or CLV plant. If you want to see my analysis, you can download the complete SAS program that analyzes these data.

Educators who use M&M's to teach probability and statistics need to record the manufacturing plant, but this is still a fun (and yummy!) experiment. What do you think? Do you prefer the almost-equal orange-blue-green distribution from the CLV plant? Or do you like the orange-blue dominance from the HKP plant? Or do you just enjoy the crunchy shell and melt-in-your-mouth goodness, regardless of what colors the candies are?

The post The distribution of colors for plain M&M candies appeared first on The DO Loop.

February 13, 2017
 

A common question on SAS discussion forums is how to repeat an analysis multiple times. Most programmers know that the most efficient way to analyze one model across many subsets of the data (perhaps each country or each state) is to sort the data and use a BY statement to repeat the analysis for each unique value of one or more categorical variables. But did you know that a BY-group analysis can sometimes be used to replace macro loops? This article shows how you can efficiently run hundreds or thousands of different regression models by restructuring the data.

One model: Many samples

As I've written before, BY-group analysis is also an efficient way to analyze simulated samples or bootstrapped samples. I like to tell people that you can choose "the slow way or the BY way" to analyze many samples.

In that phrase, "the slow way" refers to the act of writing a macro loop that calls a SAS procedure to analyze one sample. The statistics for all the samples are later aggregated, often by using PROC APPEND. As I (and others) have written, macro loops that call a procedure hundreds or thousands of times are relatively slow.

As a general rule, if you find yourself programming a macro loop that calls the same procedure many times, you should ask yourself whether the program can be restructured to take advantage of BY-group processing.




Many models: One sample

There is another application of BY-group processing, which can be incredibly useful when it is applicable. Suppose that you have wide data with many variables: Y, X1, X2, ..., X1000. Suppose further that you want to compute the 1000 single-variable regression models of the form Y = X_i, where i = 1 to 1000.

One way to run 1000 regressions would be to write a macro that contains a %DO loop that calls PROC REG 1000 times. The basic form of the macro would look like this:

%macro RunReg(DSName, NumVars);
...
%do i = 1 %to &NumVars;                    /* repeat for each x&i */
   proc reg data=&DSName noprint
            outest=PE(rename=(x&i=Value)); /* save parameter estimates */
   model Y = x&i;                          /* model Y = x_i */
   quit;
 
   /* ...then accumulate statistics... */
%end;
%mend;

The OUTEST= option saves the parameter estimates in a data set. You can aggregate the statistics by using PROC APPEND or the DATA step.

If you use a macro loop to do this computation, it will take a long time for all the reasons stated in the article "The slow way or the BY way." Fortunately, there is a more efficient alternative.

The BY way for many models

An alternative way to analyze those 1000 regression models is to transpose the data to long form and use a BY-group analysis. Whereas the macro loop might take a few minutes to run, the BY-group method might complete in less than a second. You can download a test program and compare the time required for each method by using the link at the end of this article.

To run a BY-group analysis:

  1. Transpose the data from wide to long form. As part of this process, you need to create a variable (the BY-group variable) that will be unique for each model.
  2. Sort the data by the BY-group variable.
  3. Run the SAS procedure, which uses the BY statement to specify each model.

1. Transpose the data

In the following code, the explanatory variables are read into an array X. The name of each variable is stored by using the VNAME function, which returns the name of the variable that is in the i_th element of the array X. If the original data had N observations and p explanatory variables, the LONG data set contains Np observations.

/* 1. transpose from wide (Y, X1 ,...,X100) to long (varNum VarName Y Value) */
data Long;
set Wide;                       /* <== specify data set name HERE         */
array x [*] x1-x&nCont;         /* <== specify explanatory variables HERE */
do varNum = 1 to dim(x);
   VarName = vname(x[varNum]);  /* variable name in char var */
   Value = x[varNum];           /* value for each variable for each obs */
   output;
end;
drop x:;
run;

2. Sort the data

In order to perform a BY-group analysis in SAS, sort the data by the BY-group variable. You can use the VARNUM variable if you want to preserve the order of the variables in the wide data. Or you can sort by the name of the variable, as done in the following call to PROC SORT:

/* 2. Sort by BY-group variable */
proc sort data=Long;  by VarName;  run;

3. Run the analyses

You can now call a SAS procedure one time to compute all regression models:

/* 3. Call PROC REG and use BY statement to compute all regressions */
proc reg data=Long noprint outest=PE;
by VarName;
model Y = Value;
quit;
 
/* Look at the results */
proc print data=PE(obs=5);
var VarName Intercept Value;
run;

The PE data set contains the parameter estimates for every single-variable regression of Y onto X_i. The table shows the parameter estimates for the first few models. Notice that the models are presented in the order of the BY-group variable, which for this example is the alphabetical order of the names of the explanatory variables.

Conclusions

You can download the complete SAS program that generates example data and runs many regressions. The program computes the regression estimates two ways: by using a macro loop (the SLOW way) and by transforming the data to long form and using BY-group analysis (the BY way).

This technique is applicable when the models all have a similar form. In this example, the models were of the form Y = X_i, but a similar result would work for GLM models such as Y = A|X_i, where A is a fixed classification variable, as sketched below. Of course, you could also use generalized linear models such as logistic regression.
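Here is a sketch of that extension. It assumes that a classification variable A was carried along (not dropped) when the data were transposed to long form; the variable name is a placeholder:

proc glm data=Long noprint outstat=Stats;   /* save statistics for all models */
   by VarName;
   class A;
   model Y = A|Value;
run;
quit;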

Can you think of other ways to use this trick? Leave a comment.

tags: Data Analysis, Getting Started, Statistical Programming

The post An easy way to run thousands of regressions in SAS appeared first on The DO Loop.

January 11, 2017
 

Last week I wrote about the 10 most popular articles from The DO Loop in 2016. The popular articles tend to be about elementary topics that appeal to a wide range of SAS programmers. Today I present an "editor's choice" list of technical articles that describe more advanced statistical methods in SAS.

I've grouped the articles into three categories: statistical graphics and visualization, statistical computations, and matrix computations. If you are a SAS statistical programmer, these articles deserve a second look.




Statistical graphics and visualization

An effect plot

SAS ODS graphics provides an easy way to create standard graphs for data analysis. The graphs in this list are more sophisticated:

Statistical computations

A nearest neighbor plot

These articles show helpful statistical techniques that you should know about:

Matrix computations


The SAS DATA step is awesome. For many programming tasks, it is an efficient and effective tool. However, advanced analytical algorithms and multivariate statistics often require matrix-vector computations, which means programming in the SAS/IML language.

There you have it, 10 articles from The DO Loop in 2016 that I think are worth a second look. Did I omit your favorite article? Leave a comment.

tags: Data Analysis

The post Ten posts from 2016 that deserve a second look appeared first on The DO Loop.

January 6, 2017
 
“La Quinta” is Spanish for “next to Denny’s.”
     -- Mitch Hedberg, comedian

Mitch Hedberg's joke resonates with travelers who drive on the US interstate system because many highway exits feature both a La Quinta Inn™ and a Denny's® restaurant within a short distance of each other. But does a statistical data analysis support this anecdotal evidence?

In 2014 John Reiser wrote a blog post that uses the Python language to scrape the web for the locations of La Quinta Inns and Denny's restaurants. He then analyzed the data to show that, yes, in general, a guest at a La Quinta Inn does not have far to travel if he wants to eat at a Denny's restaurant. This work inspired Colin Rundel and Mine Cetinkaya-Rundel to assign this analysis as a project for their students at Duke University and to write an article in CHANCE magazine (29(2), 2016) about the assignment.

Rundel and Cetinkaya-Rundel posted CSV files on the CHANCE web site that contain the longitude, latitude, and addresses of 851 La Quinta Inns and 1,634 Denny's restaurants in the contiguous US (as of Dec 2015). This article follows their presentation and shows how to analyze the La Quinta-Denny's spatial data in SAS. The analysis is a straightforward extension of the nearest-neighbor techniques shown in the article "Distances between observations in two groups." You can download the SAS program that imports the data and creates all the graphs and tables in this article.

Visualizing the locations of La Quinta Inns and Denny's restaurants

Locations of La Quinta Inns and Denny's restaurants in the contiguous US

You can use PROC HTTP in SAS to read CSV files directly from a URL. After importing the locations of La Quinta Inns and Denny's restaurants from the CHANCE web site, you can use PROC SGPLOT to plot the (unprojected) locations as longitudes and latitudes. To help visualize the locations, the adjacent graph overlays the data on an outline of the lower-48 states in the US. (Click the graph to enlarge.)
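The download step is in the full program; in outline it looks something like the following sketch, where the URL is a placeholder rather than the actual address on the CHANCE site:

filename lqcsv temp;
proc http url="https://example.com/laquinta.csv"   /* placeholder URL */
          method="GET" out=lqcsv;
run;

proc import datafile=lqcsv out=LaQuinta dbms=csv replace;
run;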

In the graph, the La Quinta Inns are represented by blue circles and the Denny's restaurants by red crosses. In the enlarged version you can see that many circles enclose a red cross, which indicates that the inn and restaurant are very close. On the other hand, there are a few La Quinta locations that seem to be far away from any Denny's. Montana, North Dakota, Louisiana, Kansas, Nevada, and southwest Texas are some geographic regions in which a blue circle is not close to a red cross.

The distance to the nearest Denny's for each La Quinta Inn

Imagine that a husband and wife are spending their retirement by crisscrossing the US. The wife wants to sleep each night at a La Quinta Inn. The husband wants to eat breakfast each morning at a Denny's restaurant. If they both get their wishes, how far will they need to travel to get breakfast each morning?

I've previously written about how to compute the distance to the nearest neighbor between observations that are in different groups. For these data, the La Quinta Inns form one group and the Denny's restaurants form a second group. Because the coordinates for these data are longitude and latitude, you need to use the GEODIST function in Base SAS to compute the distance (in kilometers, as the crow flies) between each hotel and the nearest Denny's.
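The full computation is in the downloadable program, but the core idea fits in a few lines of SAS/IML. In the following sketch, LaQuinta and Dennys are placeholder data set names that are assumed to contain Latitude and Longitude variables:

proc iml;
use LaQuinta;  read all var {Latitude Longitude} into LQ;  close;
use Dennys;    read all var {Latitude Longitude} into D;   close;

NearestKM = j(nrow(LQ), 1, .);
do i = 1 to nrow(LQ);
   /* distances (km) from the i_th inn to every Denny's; keep the smallest */
   dist = geodist(LQ[i,1], LQ[i,2], D[,1], D[,2], "DK");
   NearestKM[i] = min(dist);
end;

call qntl(q, NearestKM, {0.25 0.5 0.75});   /* quartiles of the nearest distances */
print q[rowname={"Q1" "Median" "Q3"}];
quit;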

For each of the 851 hotels, you can find the distance to the nearest Denny's. The following table summarizes the distribution of distances, in kilometers:

Summary statistics for the distance (km) from each La Quinta Inn to the nearest Denny's

The table shows that the closest Denny's is a mere 11 meters from the adjacent La Quinta Inn! For 25% of the inns, the nearest Denny's is within 1.3 km, which is a short walk. Fifty percent of the inns are within 5 km (an easy drive) of a Denny's, and 75% are within 17 km. It seems that the data supports a modified version of Hedberg's joke: “La Quinta” is Spanish for “often close to a Denny’s.”

The following histogram shows the distribution of the 851 distances from each La Quinta Inn to the nearest Denny's. The histogram shows that about 65% of the inns are within 10 km and about 77% are within 20 km. As long as the husband and wife stay at these inns, they will both be happy, well-rested, and well-fed!

Distribution of distances from each La Quinta Inn to the nearest Denny's

La Quinta Inns that are far from a Denny's

The 10 La Quinta Inns that are farthest from a Denny's restaurant

Clearly the couple can happily sleep at a La Quinta Inn and eat at a Denny's provided that they avoid the few La Quinta Inns that are not located near a Denny's. The scatter plot and map near the top of this article give some indication about where these inns are located, but we can identify these inns more precisely. For the sake of the couple's marital bliss, let's enumerate the inns that are farthest from a Denny's.

The adjacent table shows the 10 La Quinta Inns that are farthest from a Denny's restaurant. The farthest is the La Quinta Inn in Glendive, Montana, which is 282 km from the nearest Denny's. La Quinta Inns in Kansas, Nevada, Texas, and Louisiana are also more than 150 km away from a Denny's.

If possible, the couple should avoid these inns, but what if their travels bring them through these cities? In that case, they need to know that distance and location to the nearest Denny's. The next graph might be helpful.

The following graph shows all the La Quinta Inns. The inns shown in red are those for which the distance to the nearest Denny's is more than 80 km away. For each of these inns, an arrow is drawn from the La Quinta Inn to the location of the nearest Denny's. (I explained this nearest-neighbor plot in a previous article.) This helps the couple know what direction they need to drive in order to reach breakfast—or perhaps lunch! For some La Quinta locations (MT, SC, TN) the couple will need to drive into an adjacent state.

Nearest Denny's for La Quinta Inns That Are Far from a Denny's

Summary

Mitch Hedberg's joke is funny because it has an element of truth. For most La Quinta Inns, the nearest Denny's restaurant is a short walk or drive away. By using nearest-neighbor computations and the GEODIST function in SAS, you can compute the distances from each inn to the nearest Denny's. You can graph the set of distances and compute statistics such as quantiles. You can visualize the co-locations of the inns and restaurants, and even direct travelers to the nearest Denny's. I agree with Rundel and Cetinkaya-Rundel that this exercise provides a fun activity in data analysis for students.

For professional SAS programmers, the exercise demonstrates how to conduct a particular kind of co-location analysis. Instead of Denny's restaurants, the professional analyst might be interested in the distance to the nearest hospital, distribution center, or cell-phone tower. SAS statistical graphics and SAS/IML provide the tools for analyzing the distances between groups of spatial data.

If you want to examine or extend my analysis of these data, you can download the SAS program.

tags: Data Analysis, Spatial Data, Statistical Graphics

The post Is "La Quinta" Spanish for "Next to Denny's"? appeared first on The DO Loop.

January 4, 2017
 

I wrote 105 posts for The DO Loop blog in 2016. My most popular articles were about data analysis, SAS programming tips, and elementary statistics. Without further ado, here are the most popular articles from 2016.




Data Analysis and Visualization

Overlay histograms

Start with a juicy set of data and an interesting question. Mix in some SAS data analysis and a colorful graph that visualizes the data and tells a story. What have you got? A popular blog post! The following posts generated buzz on social media. They also show off the power of SAS analytics and visualization.

General SAS programming techniques


Everyone who uses SAS needs to know tricks and techniques for programming in the DATA step or for working with macros. No wonder these articles were so popular!

Statistical Techniques

Moving average in SAS

If you browse SAS discussion forums, you will see many questions about computing moving statistics, creating dummy variables, and running weighted analyses. I wrote some articles about these topics that resonated with readers in 2016:

Did you miss any of these popular posts? Here's your chance to read (or re-read!) one of these top 10 posts from 2016.

Next week: My picks for articles that did not make this list, but deserve a second look.

tags: Data Analysis

The post The top 10 posts from The DO Loop in 2016 appeared first on The DO Loop.

December 20, 2016
 

Data governance seems to be the hottest topic at data-related conferences this year, and the question I get asked most often is, “where do we start?” Followed closely by how do we do it, what skills do we need, how do we convince the rest of the organisation to get […]

4 steps to get started with data governance was published on SAS Voices.

December 5, 2016
 
Kepler's third law for planetary bodies

A recent issue of Astronomy magazine mentioned Kepler's third law of planetary motion, which states "the square of a planet's orbital period is proportional to the cube of its average distance from the Sun" (Astronomy, Dec 2016, p. 17). The article included a graph (shown at the right) that shows the period and distance for several planetary bodies. The graph is plotted on a log-log scale and shows that the planetary data falls on a straight line through the origin.

I sometimes see Kepler's third law illustrated by using a graph of the cubed distances versus the squared periods. In a cubed-versus-squared graph, the planets fall on a straight line with unit slope through the origin. Since power transformations and log transformations are different, I was puzzled. How can both graphs be correct?

After a moment's thought, I realized that the magazine had done something very clever. Although it is true that a graph of the squared-periods versus the cubed-distances will CONFIRM the relationship AFTER it has been discovered, the magazine's graph gives insight into how a researcher can actually DISCOVER a power-law relationship in the first place! To discover the values of the exponents in a power-law relationship, log-transform both variables and fit a regression line.




How to discover a power law

Suppose that you suspect that a measured quantity Y is related by a power law to another quantity X. That is, the quantities satisfy the relationship Y^m = A X^n for some integer values of the unknown parameters m and n and some constant A. If you have data for X and Y, how can you discover the values of m and n?

One way is to use linear regression on the log-transformed data. Take the logarithm of both sides to get m log(Y) = log(A) + n log(X), and divide by m to obtain log(Y) = C + (n/m) log(X), where C = log(A)/m is a constant. You can use ordinary least squares regression to estimate values of the constant C and the ratio n/m.

Planetary periods and distances

For example, let's examine how a modern statistician could quickly discover Kepler's third law by using logarithms and regression. A NASA site for teachers provides the period of revolution (in years) and the mean distance from the Sun (in astronomical units) for several planetary bodies. Some of the data (Uranus, Neptune, and Pluto) were not known to Kepler. The following SAS DATA step reads the data and computes the log (base 10) of the distances and periods:

data Kepler;
input Name $8. Period Distance;
logDistance = log10(Distance);
logPeriod   = log10(Period);
label logDistance="log(Mean Distance from Sun) (AU)"
      logPeriod  ="log(Orbital Period) (Years)";
datalines;
Mercury   0.241   0.387
Venus     0.616   0.723
Earth     1       1
Mars      1.88    1.524
Jupiter  11.9     5.203
Saturn   29.5     9.539
Uranus   84.0    19.191
Neptune  165.0   30.071
Pluto    248.0   39.457
;

The graph in Astronomy magazine plots distances on the vertical axis and periods horizontally, but it is equally valid to flip the axes. It seems more natural to compute a linear regression of the period as a function of the distance, and in fact this is how Kepler expressed his third law:

The proportion between the periodic times of any two planets is precisely one and a half times the proportion of the mean distances.

Consequently, the following call to PROC REG in SAS estimates the power relationship between the distance and period:

proc reg data=Kepler plots(only)=(FitPlot ResidualPlot);
   model logPeriod = logDistance;
run;
Parameter estimates for the regression of log(Period) onto log(Distance)

The linear fit is almost perfect. The R2 value (not shown) is about 1, and the root mean square error is 0.0005. The table of parameter estimates is shown. The intercept is statistically insignificant. The estimate for the ratio (n/m) is 1.49990, which looks suspiciously like a decimal approximation for 3/2. Thus a simple linear regression reveals the powers used in Kepler's third law: the second power of the orbital period is proportional to the third power of the average orbital distance.
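If you want a formal test that the slope equals 3/2, you can add a TEST statement to the PROC REG call. The following sketch (an addition, not part of the original program) performs an F test of that hypothesis:

proc reg data=Kepler;
   model logPeriod = logDistance;
   SlopeTest: test logDistance = 1.5;   /* F test of H0: slope = 3/2 */
run;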

A modern log-log plot of Kepler's third law

Not only can a modern statistician easily discover the power law, but it is easy to create a scatter plot that convincingly shows the nearly perfect fit. The following call to PROC SGPLOT in SAS creates the graph, which contains the same information as the graph in Astronomy magazine. Notice that I used custom tick labels for the log-scaled axes:

title "Kepler's Third Law";
title2 "The Squared Period Is Proportional to the Cubed Distance";
proc sgplot data=Kepler;
  scatter y=logPeriod x=logDistance / datalabel=Name datalabelpos=bottomright datalabelattrs=(size=12);
  lineparm x=0 y=0 slope=1.5 / legendlabel="log(Period) = 3/2 log(Distance)" name="line";
  yaxis grid values=(-1 0 1 2 3) valuesdisplay=("0.1" "1" "10" "100" "1000") offsetmax=0;
  xaxis grid values=(-1 0 1 2) valuesdisplay=("0.1" "1" "10" "100");
  keylegend "line" / location=inside opaque;
run;
Kepler's Third Law: For a planetary body, the square of the orbital period is proportional to the cube of the mean distance to the sun

Remark on the history of Kepler's third law

This article shows how a modern statistician can discover Kepler's third law using linear regression on log-transformed data. The regression line fits the planetary data to high accuracy, as shown by the scatter plot on a log-log scale.

It is impressive that Kepler discovered the third law without having access to these modern tools. After publishing his first two laws of planetary motion in 1609, Kepler spent more than a decade trying to find the third law. Kepler said that the third law "appeared in [his] head" in 1618.

Kepler did not have the benefit of drawing a scatter plot on a log-log scale because Descartes did not create the Cartesian coordinate system until 1637. Kepler could not perform linear regression because Galton did not describe it until the 1880s.

However, Kepler did know about logarithms, which John Napier published in 1614. According to Kevin Brown, (Reflections on Relativity, 2016) "Kepler was immediately enthusiastic about logarithms" when he read Napier's work in 1616. Although historians cannot be sure, it is plausible that Kepler used logarithms to discover his third law. For more information about Kepler, Napier, and logarithms, read Brown's historical commentary.

tags: Data Analysis, History

The post Discover power laws by log-transforming data appeared first on The DO Loop.