data analysis

3月 232016

My previous blog post shows how to use PROC LOGISTIC and spline effects to predict the probability that an NBA player scores from various locations on a court. The LOGISTIC procedure fits parametric models, which means that the procedure estimates parameters for every explanatory effect in the model. Spline bases enable you to fit complex models, but it is easy to generate many spline effects, which means that you need to be careful not to overfit the data.

In contrast, modern nonparametric models enable you to balance the complexity of a model with the goodness of fit, thus reducing the likelihood of overfitting the data. SAS provides several procedures that fit nonparametric regression models for a binary response variable. Options include:

  • Use variable selection techniques in PROC LOGISTIC or PROC HPGENSELECT to allow the data to select the effects that best model the data. Variable selection creates a hybrid analysis that has properties of nonparametric models while preserving the interpretability of parametric models.
  • Use the GAMPL procedure in SAS/STAT 14.1 (SAS 9.4m3) to fit the data. The GAMPL procedure uses penalized likelihood (PL) methods to fit generalized additive models (GAM).

Other choices in SAS/STAT software include the ADAPTIVEREG procedure, which combines splines with variable selection techniques, and the HPSPLIT procedure, which is a tree-based classification procedure. Both procedures were introduced in SAS/STAT 12.1.

Generalized additive models in SAS

Generalized additive models use spline effects to model nonlinear relationships in data. A smoothing penalty is applied to each spline term in an attempt to model nonlinear features without overfitting the data. For details and examples, you can read the GAMPL documentation or watch a video about PROC GAMPL.

The syntax for the GAMPL procedure is similar to the familiar syntax for PROC LOGISTIC or PROC GENMOD. You can specify spline effects and the distribution of the response variable. The following statement uses a two-dimensional thin-plate spline to model the probability of Stephen Curry scoring from various shooting locations. The data are from Robert Allison's blog "How to graph NBA data with SAS." You can download the complete SAS program that produces the graphs in this post.

proc gampl data=Curry;
   where Shot_Distance <= 30;
   model Shot_Made(event='Made') = Spline(X Y / maxdf=40) / dist=binary;
   id X Y Shot_Made;
   output out=GamPLOut;

The OUTPUT statement saves the predicted probabilities to a data set. The option MAXDF=40 tells the procedure to consider up to 40 degrees of freedom for the spline effect and to choose the smoothing parameter that provides the best tradeoff between model complexity and goodness of fit. For the Stephen Curry data, the optimal smoothing parameter results in 14.7 degrees of freedom.

GAMPL Analysis of Curry Data

You can use the graph template language (GTL) to create a contour plot of the predicted probabilities. The contour map is qualitatively similar to the probabilities that were predicted by the PROC LOGISTIC analysis in my previous post. There is an area of high probability near the basket at (0,0). The probabilities on the right side of the graph are lower than on the left. There is a "hot spot" on the left side of the graph, which corresponds to a high probability that Curry will score from that region.

Verify: The fundamental principle of nonparametric analysis

I initially view the results of any nonparametric analyses with skepticism. I trust the mathematics behind the methods, but I need to be convinced that a qualitative feature in the predicted values is real and not merely an artifact of some complicated nonparametric witchcraft.

There are many statistical techniques that enable you to evaluate whether a model fits data well, but it is wise to perform a basic "sanity check" by using a different nonparametric procedure to analyze the same data. If the two analyses reveal the same qualitative features in the data, that is evidence that the features are truly present. Conversely, if two models produce different qualitative features, then I question whether either model is accurate. I call this sanity check the fundamental principle of nonparametric analysis: Trust, but verify.

Let's apply the fundamental principle to the NBA data by running PROC ADAPTIVEREG:

proc adaptivereg data=Curry plots;
   where Shot_Distance <= 30;
   model Shot_Made(event='Made') = X Y / dist=binary;
   output out=AdaptiveOut p(ilink);
ADAPTIVEREG Analysis of Curry Data

The PROC ADAPTIVEREG analysis is shown to the left. The contour plot shows the same qualitative features that were apparent from the LOGISTIC and GAMPL analyses. Namely, the probability of scoring is high under the basket, low to the right, average up the middle, and high on the left. Seeing these features appear in several analyses gives me confidence that these features of the data are real. After verifying that the models are qualitatively similar, you can investigate which model is better, perhaps by splitting the data into subsets for model training, validation, and testing.

The fundamental principle of nonparametric analysis: trust but verify. #StatWisdom
Click To Tweet


This article briefly introduced two nonparametric procedures in SAS that can analyze binary response variables and other response distributions. The two analyses produced qualitatively similar predictions on sample data. The fundamental principle of nonparametric analysis is a meta-theorem that says that you should verify the qualitative predictions of a nonparametric model. Reproducibility is a necessary (but not sufficient) condition to believe that a feature is real and not spurious. For this example, all analyses agree that Stephen Curry shoots better from one side of the court than from the other.

tags: 14.1, Data Analysis

The post Nonparametric regression for binary response data in SAS appeared first on The DO Loop.

3月 212016

Last week Robert Allison showed how to download NBA data into SAS and create graphs such as the location where Stephen Curry took shots in the 2015-16 season to date. The graph at left shows the kind of graphs that Robert created. I've reversed the colors from Robert's version, so that red indicates "good" (a basket was scored) and blue indicates "bad" (a missed shot). The location of the NBA three-point line is evident by the many markers that form an arc in the scatter plot.

When I saw the scatter plot, I knew that I wanted to add some statistical analysis. In particular, I wanted to use SAS to construct a statistical model that estimates the probability that Curry scores from any position on the court.

This article focuses on the results of the analysis. You can download the SAS program that generates the analyses and graphics. Although this article analyzes Stephen Curry, you can modify the SAS program to analyze Kevin Durant, Lebron James, or any other player.

Probability as a function of distance

The first analysis estimates the probability that Curry makes a basket solely as a function of his distance from the basket. Curry is known for his consistent ability to make three-point shots. A three-point shot in the NBA requires that a player shoot from at least 22 feet away (when near the baseline) or 23.9 feet away (when further up the court).


You can use logistic regression to model the probability of making a shot as a function of the distance to the basket. The adjacent plot shows the result of a logistic regression analysis in SAS. The model predicts a probability of 0.7 that Curry will make a shot from under the basket, a probability of 0.5 from 20 feet away, and a probability of 0.46 from the three-point arc, indicated by the vertical gray line at 23.9 feet. Recall that a probability of 0.46 is equivalent to predicting that Curry will sink 46% of shots from the three-point arc.

Almost all (98.3%) of Curry's shots were taken from 30 feet or closer, and the shots taken from beyond 30 feet were end-of-quarter "Hail Mary" heaves. Therefore, the remaining analyses restrict to shots that were from 30 feet or closer.

Probability as a function of angle and distance

The previous analysis only considers the distance from the basket. It ignores position of the shot relative to the basket. In general, the probability of scoring depends on the location from which the shot was launched.

For consistency, let's agree that "right" and "left" means the portion of the court as seen by a fan sitting behind the backboard. This is, of course, opposite of what Curry would see when coming down the court toward the basket. Our "right" is Curry's left.


One way to model the positional dependence in the model is incorporate the angle relative to the backboard. The diagram shows one way to assign an angle to each position on the court. In the diagram, 90 degrees indicates a perpendicular shot to the basket, such as from the top of the key. An angle of 0 indicates a "baseline shot" from the right side of the court. Similarly, an angle of 180 degrees means a baseline shot from the left side of the court.

The following panel of graphs is the result of a logistic regression analysis that includes the interaction between angle and distance. The vertical lines in some plots indicate the distance to the sideline at particular angles. For 0 and 180 degress, the distance from the basket to the sideline is 25 feet.


The panel of plots show that Curry is most accurate when he shoots from the left side of the court. (The left side corresponds to angles greater than 90 degrees, which are on the left side of the panel.) Remarkably, the model estimates that Curry's probability of making a shot from the left side barely depends on the distance from the basket! He is a solid shooter (probability 0.5, which is 50%) from the left baseline (Angle = 180) and from a slight angle (Angle = 150). The previous scatter plot shows that he shoots many shots from the 120 degree angle. This analysis shows that he is uncannily accurate from 20 and even 30 feet away, although the probability of scoring decreases as the distance increases.

Does Stephen Curry shoot better from his right? #Statistics can tell you! #NBA #Analytics
Click To Tweet

On the right side of the court (angles less than 90 degrees), Curry's probability of making a shot depends more strongly on the distance to the basket. Near the basket, the model predicts a scoring probability of 0.6 or more. However, the probability drops dramatically as the distance increases. On the right side of the court, Curry is less accurate from 20 or more feet than for the same distance on the other side. At three-point range, Curry's probability of making a shot on the right (his left) drops to "only" 0.4. The probability drops off most dramatically when Curry shoots from the baseline (Angle = 0).

Probability as a function of position

A logistic analysis is a parametric model, which means that the analyst must specify the explanatory variables in the model and also the way that those variables interact with each other. This often leads to simplistic models, such as a linear or quadratic model. A simple model is often not appropriate for modeling the scoring probability as a function of the Cartesian X and Y positions on the court because a simple model cannot capture local spatial variations in the data.

SAS provides several possibilities for nonparametric modeling of data, but let's stick with logistic regression for now. Many SAS regression procedures, including PROC LOGISTIC, support using an EFFECT statement to generate spline effects for a variable. A spline effect expands a variable into spline bases. Spline effects enable you to model complex nonlinear behavior without specifying an explicit form for the nonlinear effects. The following graph visualizes such a model.


The image shows a scatter plot of the location of shots overlaid on a heat map that shows the predicted probability of Curry sinking a basket from various locations on the court. To better show the shot locations, the image has been stretched vertically. As mentioned previously, the location with the highest predicted probability is under the basket. From farther away, the predicted probability varies according to direction: to the left of the basket the probability is about 0.5, whereas a 15-foot jumper in front of the basket has probability 0.6. Notice the relative abundance of blue color (low probability) for shots on the right side. The lowest probability (about 0.3) occurs just beyond the three-point line at about a 60 degree angle, which agrees with the previous analysis. The same distance on the left side of the court is a much lighter shade of whitish-blue that corresponds almost 0.5 probability.

Statisticians will wonder about how well the model fits the data. The Pearson goodness-of-fit test indicates that the spline fit is not great, which is not surprising for a parametric fit to this kind of spatial data. In a follow-up post I will present an alternative nonparametric analysis.


SAS programmers will appreciate the fact that "effect plots" in this article were generated automatically by PROC LOGISTIC. By using the EFFECT statement and the EFFECTPLOT statement, it is simple to create graphs that visualize the predictions for a logistic regression model.

These graphs show that in general Stephen Curry is a phenomenal shooter who has a high probability of scoring from even a long distance. Logistic regression was used to model the probability that Curry makes a shot from various angles and locations on the court. The analysis indicates that Curry shoots better from his right side, especially from three-point range.

tags: Data Analysis

The post A statistical analysis of Stephen Curry's shooting appeared first on The DO Loop.

3月 072016

Most SAS regression procedures support the "stars and bars" operators, which enable you to create models that include main effects and all higher-order interaction effects. You can also easily create models that include all n-way interactions up to a specified value of n. However, it can be a challenge to specify models that include many—but not all!—higher-order interactions. This article describes a little-known trick: you can use COLLECTION effects to specify interaction terms.

Stars and Bars: Building models with interaction terms in SAS

Many of the regression procedures in SAS (such as GLM, GENMOD, LOGISTIC, MIXED,...) support the bar operator (|) to specify all interactions between effects. For example, the following MODEL statement specifies that the model should include all main effects and all higher-order interactions:

proc logistic;
   model Y = x1 | x2 | x3 | x4;   /* all main effects and interactions */

The previous MODEL statement includes all two-way, three-way, and four-way interaction effects. The statement is equivalent to the following statement that uses the star operator (*) to explicitly specify each interaction term:

model Y = x1 x2 x3 x4                         /* all main effects */
          x1*x2 x1*x3 x1*x4 x2*x3 x2*x4 x3*x4 /* all two-way interactions */
          x1*x2*x3 x1*x2*x4 x1*x3*x4 x2*x3*x4 /* all three-way interactions */
          x1*x2*x3*x4;                        /* all four-way interactions */

Fitting a model with so many effects will lead to overfitting, so in practice an analyst might restrict the model to two-way interactions. Again, SAS supplies an easy syntax. You can use the "at" operator (@) to specify the highest interaction terms in the model. For example, the following syntax specifies that the model contains only main effects and two-way interactions:

model Y = x1 | x2 | x3 | x4 @2;   /* main effects and two-way interactions */

Specifying many, but not all, interaction terms

Unfortunately, there is no simple syntax for constructing many, but not all, interaction effects. This can be frustrating when there is a structure to the interaction terms. A common structure is that there are two lists of variables and you want to build all interactions that involve one effect from the first list and one effect from the second list.

For example, suppose you want to create the following interaction effects:
c1*x1 c1*x2 c2*x1 c2*x2
The interaction terms are the pairwise combinations of the variables {c1 c2} with the variables {x1 x2}. Note, however, that within-list interactions are not desired: there are no terms for c1*c2 or x1*x2.

It would be great to have some kind of shorthand notation that tells SAS to "cross all elements in the first list with all elements in the second list." A natural syntax would be
(c1 c2) | (x1 x2)
but unfortunately that syntax is not supported.

Some SAS programmers might use the macro language to generate all pairwise interactions between two lists of variables, but COLLECTION effects offer an easier way.


More than a dozen regression procedures in SAS support the EFFECT statement. According the documentation, the EFFECT statement generates "special collections of columns for design matrices." In particular, the so-called COLLECTION effect enables you to specify multiple effects that are "considered as a unit." A complete explanation of collection effects is beyond the scope of this article, but if V and W are two collection effects, then V*W contains all pairwise interactions of the individual effects in V with the individual effects in W. Similarly, V | W contains all main effects and the pairwise interaction effects.

As an example of using COLLECTION effects, the following model uses two classification variables and four continuous variables in the SasHelp.BWeight data. Here is the model specified in the usual way:

proc logistic data=Sashelp.Heart;
   class BP_Status Sex;
   model Status = BP_Status Sex Cholesterol Height Weight MRW
         BP_Status*Cholesterol BP_Status*Height BP_Status*Weight BP_Status*MRW
               Sex*Cholesterol       Sex*Height       Sex*Weight       Sex*MRW;
   ods select ParameterEstimates;
   ods output ParameterEstimates = Parm1;

Manually enumerating all those interaction terms requires a lot of typing. More importantly, the enumeration does not make it clear that the interaction terms are the pairwise interactions between the classification variables and the continuous variables. In contrast, the following statements use COLLECTION effects to define two sets of variables. The MODEL statement uses the familiar bar operator to form all main effects and pairwise interactions between the effects.

proc logistic data=Sashelp.Heart;
   class BP_Status Sex;
   effect V = collection(BP_Status Sex);                     /* one list     */ 
   effect W = collection(Cholesterol Height Weight MRW);     /* another list */ 
   model Status = V | W;      /* vars and interactions between the var lists */
   ods select ParameterEstimates;
   ods output ParameterEstimates = Parm2;

The second model statement is more concise. The two models produce equivalent predictions, but the second is much easier to type and to interpret.

You can use COLLECTION effects to specify interaction terms in regression models. #sastip
Click To Tweet

You can use PROC COMPARE to show that the parameter estimates are the same (to eight decimal places), and therefore the predicted values will be the same. Because the order of the parameters differs between models, the parameter estimates are sorted before running the comparison.

proc sort data=Parm1; by Estimate; run;
proc sort data=Parm2; by Estimate; run;
proc compare brief method=absolute criterion=1e-8
             base   =Parm1(drop=Variable)
             compare=Parm2(drop=Variable ClassVal:);
NOTE: All values compared are within the equality criterion used.

This use of the COLLECTION effect is somewhat nonstandard. SAS introduced COLLECTION effects for variable selection routines such as the "group LASSO" as a way to specify that all variables in the collection should be included in the model, or all should be excluded. The variables enter or leave the model "as a unit."

Although most tables and statistics from PROC LOGISTICS are the same for the two models, there are differences. One difference is the "Type 3 Analysis of Effects," which tests whether all the parameters associated with an effect are zero. The first call to PROC LOGISTIC analyzes 14 effects; the second call analyzes three (collection) effects.

In summary, the EFFECT statement provides a way to treat sets of variables "as a unit." This leads to a simple syntax for forming specific interaction effects. The example in this article showed how to create pairwise interactions, but the COLLECTION effects can also be used to specify higher-order interactions.

tags: Data Analysis

The post How to use COLLECTION effects to specify pairwise interations in SAS appeared first on The DO Loop.

3月 022016

Last week I showed how to create dummy variables in SAS by using the GLMMOD procedure. The procedure enables you to create design matrices that encode continuous variables, categorical variables, and their interactions. You can use dummy variables to replace categorical variables in procedures that do not support a CLASS statement. You can use other procedures to create design matrices for other parameterizations.

SAS/IML programmers can use two built-in functions to create dummy variables. The DESIGN function generates dummy variables for the GLM parameterization. The DESIGNF function generates dummy variables for the EFFECT encoding. You can use the HDIR function to create interaction effects from the main-effect dummy variables.

The following DATA step creates sample data. The PROC IML statements read the data into vectors and use the DESIGN and DESIGNF function to create dummy variables. Note the use of the ODS LAYOUT GRIDDED statement to print SAS/IML matrices across the page.

data Patients;
   keep Cholesterol Sex BP_Status;
   set sashelp.heart;
   if 18 <= _N_ <= 27;
proc iml;
use Patients;  
read all var {Cholesterol Sex BP_Status};  
close Patients;
Dummy_GLM = design( BP_Status );      /* dummy vars, GLM encoding */
Dummy_Effect = designf( BP_Status );  /* dummy vars, EFFECT encoding */
ods layout gridded columns=3 advance=table; /* create gridded layout in HTML */
print BP_Status, Dummy_GLM, Dummy_Effect;
ods layout end;

You can see that the DESIGN matrix creates k binary dummy variables for a categorical variable that contains k levels. The first column represents the first level (in alphabetical order), which for this data is "High." The first column has the value 1 for each row for which BP_Status="High." Similarly, the second column contains a 1 for each row for which BP_Status="Normal." The third column contains a 1 for each row for which BP_Status="Optimal."

In contrast, the DESIGNF creates a design matrix that has k–1 columns. The matrix has the EFFECT encoding, with the last category ("Optimal") serving as the reference level. The first column has the value 1 for rows for which BP_Status="High," the value –1 for rows for which BP_Status is the reference level, and 0 otherwise. The second column is similar, except that 1 indicates rows for which BP_Status="Normal."

Linear regression with dummy variables

Dummy variables convert character variables (and other categorical variables) into numerical variables with a specified encoding. As such they enable you to use matrix computations to perform a statistical analysis such as linear regression.

For example, the following SAS/IML statements perform a regression analysis that models the Cholesterol variable as a linear function of the Sex and BP_Status variables. The statements use the DESIGNF statement to form the dummy variables for each categorical variable. These columns (and an intercept column) are concatenated horizontally to form the design matrix. Because the DESIGNF statement is a nonsingular parameterization, you can use the SOLVE function to solve the normal equations and obtain the least squares solution, as follows:

Y = Cholesterol;                 /* response variable */
Intercept = j(nrow(Y), 1, 1);
X1 = designf( Sex );
X2 = designf( BP_Status );
X = Intercept || X1 || X2;      /* design matrix with EFFECT parameterization */
/* Matrix formulation of one-way ANOVA (cell means model): Y = X*beta + epsilon
   See       */
b = solve( X`*X, X`*Y );       /* solve normal equations */
print b[rowname={"Intercept" "Sex:Female" "BP_Status:High" "BP_Status:Normal"}];

The interpretation of the parameter estimates for this linear example is somewhat complicated; see Lewis (2007) if you are interested. However, for comparison, the following call to PROC GENMOD creates parameter estimates for the same linear model. The PARAM=EFFECT option is used so that the procedure uses the EFFECT parameterization.

proc genmod data=Patients;
   class Sex BP_Status / param=effect;
   model Cholesterol = Sex BP_Status / noscale;
   ods select ParameterEstimates;

Strictly speaking, PROC GENMOD uses maximum likelihood estimation whereas the PROC IML code is a least squares estimate, but you can see that the estimates are identical to four decimal places.

REFERENCE encoding and the GLM parameter estimates

Although SAS/IML does not provide a built-in function for generating a design matrix that uses the REFERENCE encoding, you can easily create such a function. The REFERENCE encoding is similar to the GLM encoding, but with the (redundant) last column dropped:

/* design matrix for reference encoding */
start designr(x); 
   A = design(x);             /* get design matrix with GLM encoding */
   return( A[,1:ncol(A)-1] ); /* drop last column */

If you use the REFERENCE encoding to create the X matrix as in the previous section, then the SOLVE function returns the same parameter estimates that are provided by the GLM procedure. (The GLM procedure sets the parameters for the last dummy columns to zero.)

Interactions of dummy variables

You can use the HDIR function to create interaction effects. For example, the following statements create columns that indicate the interaction between the Sex and BP_Status variables. The printed output shows the results for the EFFECT parameterization, but the same SAS/IML statement will produce the interaction effects for other parameterizations:

X1X2 = hdir(X1, X2);   /* dummy variables for interaction term */
print X1X2[c={"Female High" "Female Normal"}];

By using the tips in this article, you can create design matrices for ANOVA and regression models that contain categorical variables. In this way, you can use SAS/IML to reproduce the parameter estimates in many SAS linear regression procedures.

tags: Data Analysis, Matrix Computations

The post Dummy variables in SAS/IML appeared first on The DO Loop.

2月 242016

SAS programmers sometimes ask, "How do I create a design matrix in SAS?" A design matrix is a numerical matrix that represents the explanatory variables in regression models. In simple models, the design matrix contains one column for each continuous variable and multiple columns (called dummy variables) for each classification variable.

I previously wrote about how to create dummy variables in SAS by using the GLMMOD procedure to create binary indicator variables for each categorical variable. But PROC GLMMOD is not the only way to generate design matrices in SAS. This article demonstrate four SAS procedures that create design matrices: GLMMOD, LOGISTIC, TRANSREG, and GLIMMIX. Of the four, the LOGISTIC procedure provides the most flexibility for creating design matrices and also supports an easy-to-use syntax.

How categorical variables are represented in a design matrix in SAS

The CLASS statement in a SAS procedure specifies categorical variables that should be replaced by dummy variables when forming the design matrix. The process of forming columns in a design matrix is called parameterization or encoding. The three most popular parameterizations are the GLM encoding, the EFFECT encoding, and the REFERENCE encoding. For a detailed explanation of these encodings, see the section "Parameterization of Model Effects" in the SAS/STAT documentation. For applications and interpretation of different parameterizations, see Pasta (2005).

The following DATA step create an example data set with 10 observations. It has three fixed effects: one continuous variable (Cholesterol) and two categorical variables. One categorical variable (Sex) has two levels and the other (BP_Status) has three levels. It also has a categorical variable (HospitalID) that will be used as a random effect.

data Patients;
   HospitalID = mod(_N_, 4);
   keep HospitalID Cholesterol Sex BP_Status;
   set sashelp.heart;
   if 18 <= _N_ <= 27;
proc print; run;
Example data set for creating design matrices in SAS

PROC GLMMOD: Design matrices that use the GLM encoding

The simplest way to create dummy variables is by using the GLMMOD procedure, which can produce a basic design matrix with GLM encoding. The GLM encoding is a singular parameterization in which each categorical variable is represented by k binary variables, where k is the number of levels in the variable. There is also an intercept column that has all 1s. The GLMMOD procedure uses a syntax that is identical to the MODEL statement in PROC GLM, so it is very easy to create interaction effects. See my previous article for an example of how to use PROC GLMMOD to create a design matrix and how the singular parameterization affects parameter estimates in regression.

PROC LOGISTIC: Design matrices for any parameterization

You can also create a design matrix in SAS by using the LOGISTIC procedure. The PROC LOGISTIC statement supports a DESIGNONLY option, which prevents the procedure from running the analysis. Instead it only forms the design matrix and writes it to a data set. By default, PROC LOGISTIC uses the EFFECT encoding for classification variables, but you can use the PARAM= option on the CLASS statement to specify any parameterization.

A drawback of using PROC LOGISTIC is that you must supply a binary response variable on the MODEL statement, which might require you to run an extra DATA step. The following DATA step creates a view that contains a variable that has the constant value 0. This variable is used on the left-hand side of the MODEL statement in PROC LOGISTIC, but is dropped from the design matrix:

data Temp / view=Temp;
   set Patients;
   FakeY = 0;
proc logistic data=Temp outdesign=EffectDesign(drop=FakeY) outdesignonly;
   class sex BP_Status / param=effect; /* also supports REFERENCE & GLM encoding */
   model FakeY = Cholesterol Sex BP_Status;
proc print data=EffectDesign; run;
Design matrix in SAS with effect encoding

The design matrix shows the effect encoding, which uses –1 to indicate the reference level, which by default is the last level in alphabetical order. The name of a dummy variable is the conatenation of the original variable name and a level. For example, the Sex variable is replaced by the dummy variable named SexFemale, which has the value 1 to represent females and –1 to represent the reference level ("Male"). The BP_Status variable is replaced by two variables. The BP_StatusHigh variable contains 1 for patients that have high blood pressure, –1 for the reference level ("Optimal"), and 0 otherwise. Similarly, the BP_StatusNormal dummy variable has the value 1 for patients with normal blood pressure, –1 for the reference level ("Optimal"), and 0 otherwise.

The effect encoding produces k-1 columns for a categorical variable that has k levels. This results in a nonsingular design matrix.

You can use the REF= option after each classification variable to specify the reference level. You can also use the PARAM= option on the CLASS statement to specify a different parameterization. For example, the following statements create a design matrix that uses the REFERENCE parameterization. The reference level for the Sex variable is set to "Female" and the reference level for the BP_Status variable is set to "Normal."

proc logistic data=Temp outdesign=RefDesign(drop=FakeY) outdesignonly;
   class sex(ref="Female") BP_Status(ref="Normal") / param=reference; 
   model FakeY = Sex BP_Status;
proc print data=RefDesign; run;

Parameterizations affect the way that parameter estimates are interpreted in a regression analysis. For the reference encoding, parameter estimates of main effects indicate the difference of each level as compared to the effect of the reference level. For the effect encoding, the comparison is to the average effect over all levels.

PROC TRANSREG: Design matrices and a macro for variable names

Using PROC LOGISTIC is very flexible, but it has two drawbacks: You have to create a fake response variable, and you have to look at the output data set to discover the names of the dummy variables. In contrast, PROC TRANSREG does not require that you specify a response variable when you generate the design matrix. Furthermore, the procedure creates a macro variable (&_TRGIND, for "TRANSREG indicator" variables) that contains the names of the columns of the design matrix. Another nice feature is that the output data set contains the original variables, and you can use the ID variable to output additional variables.

However, the syntax for the TRANSREG procedure is different from most other SAS regression procedures. Instead of a CLASS statement, you specify classification effects in a CLASS() transformation list. You can use the ZERO= option to control reference levels, and the procedure supports the GLM and EFFECT parameterization, but not the REFERENCE encoding. The following statements show an example that generates a design matrix with the effect encoding:

proc transreg data=Patients design;
   model identity(Cholesterol) 
         class(Sex BP_Status / EFFECT zero="Female" "Normal");
   output out=B;
proc print data=B; 
   var Intercept &_TrgInd; 

The output is not shown because it is identical to the EffectDesign data set in the previous section. Notice that the output is displayed by using the &_TRGIND macro variable. For details about generating design matrices, see the TRANSREG documentation section "Using the DESIGN Output Option."

PROC GLIMMIX: Design matrices for fixed and random effects

PROC GLIMMIX enables you to construct two design matrices: one for the fixed effects and another for the random effects. The PROC GLIMMIX statement supports an OUTDESIGN= option that you can use to specify the output data set and a NOFIT option that ensures that the procedure will not try to fit the model.

The following statements create an output data set that contains two design matrices:

proc glimmix data=Patients outdesign(names novar)=MixedDesign nofit;
   class sex BP_Status HospitalID;
   model Cholesterol = Sex BP_Status;
   random HospitalID;
   ods select ColumnNames;
proc print data=MixedDesign; run;
Design matrix in SAS for fixed and random effects

Dummy variables for the fixed effects are prefixed by "_X" and dummy variables for the random effects are prefixed by "_Z." Two additional tables (not shown) associate the levels of the original variables with the columns of the design matrices.

The GLIMMIX procedure uses only the GLM parameterization. Consequently, there is little advantage to using PROC GLIMMIX instead of PROC GLMMOD. You can generate the same designs by calling PROC GLMMOD twice, once for the fixed effects and once for the random effects.


In summary, SAS provides four procedures that you can use to generate design matrices for continuous variables, classification variables, and their interactions. The GLMMOD procedure is ideal for creating design matrices that use the GLM encoding. PROC LOGISTIC supports all encodings in SAS and provides an easy-to-use syntax for specifying interactions. PROC TRANSREG supports fewer parameterizations, but does not require that you manufacture a response variable. Lastly, the GLIMMIX procedure produces design matrices for both fixed and random effects.

tags: Data Analysis, Tips and Techniques

The post Four ways to create a design matrix in SAS appeared first on The DO Loop.

2月 032016

Last week I showed how to use PROC EXPAND to compute moving averages and other rolling statistics in SAS. Unfortunately, PROC EXPAND is part of SAS/ETS software and not every SAS site has a license for SAS/ETS. For simple moving averages, you can write a DATA step program, as discussed in previous post. However, for complex rolling statistics, the SAS/IML language, which enables you to easily access previous observations (and even future ones!), is a more powerful tool for computing rolling statistics.

This article shows how to implement various rolling statistics in SAS/IML. To keep the explanations and programs simple, the functions assume that there are no missing values in the data. The article "What is a moving average" explains the mathematical formulas used in this post.

A simple moving average function

The key to computing most rolling statistics is to define a rolling window of observations. At each time point, you extract the observations in the rolling window and use them to compute the statistic. You then move on to the next time point and repeat the computation. You might need to perform special computations at the beginning of the time series.

The following SAS/IML program implements a simple moving average. The arguments to the MA function are a column vector, Y, of time series values and a scalar value, k, which indicates the number of values to use for each moving average computation. The program loops over elements in Y. For each element, the program computes the mean of the current element and previous k-1 values.

proc iml;
/* Simple moving average of k data values.
   First k-1 values are assigned the mean of all previous values.
   Inputs:     y     column vector of length N >= k
               k     number of data points to use to construct each average
start MA(y, k);
   MA = j(nrow(y), 1, .);
   do i = 1 to nrow(y);
      idx = max(1,(i-k+1)):i;   /* rolling window of data points */
      MA[i] = mean( y[idx] );   /* compute average */
   return ( MA );

The first k-1 values require special handling because these values have fewer than k-1 prior observations to average. You could handle these special values by using a separate loop. However, I chose to use the expression max(1, (i-k+1)) to select the first element for the rolling mean computation. When i is less than k, this expression returns 1 for the first element, and the program computes the mean of the first i values. Otherwise, this expression returns i minus k-1 (which is i-k+1) for the first element, and the program computes the mean of k values.

The most important part of this computation is enumerating the time points to use in the computation (for example, idx = (i-k+1):i;) followed by extracting the associated data (for example, y[idx]). With these two expressions, you can compute any rolling statistic. For example, by changing the function call from MEAN to STD, you can compute a rolling standard deviation. The rolling median, rolling minimum, and rolling maximum are also trivial to implement. By changing the time points, you can compute rolling statistics for centered windows. If your data contain several variables, you can compute a rolling correlation.

A weighted moving average function

The following function computes a weighted moving average. The arguments to the WMA function are a column data vector, Y, and a vector of weights that has k elements. For each time point, wk (the last weight) is the weight for current data value, wk-1 is for the previous data value, and so forth. The function internally standardizes the weights so that they sum to unity. (This ordering was chosen so that the WMA function uses the same syntax as PROC EXPAND.) This function handles the first few values in a separate loop:

/* Weighted moving average of k data values.
   First k-1 values are assigned the weighted mean of all preceding values.
   Inputs:     y     column vector of length N >= k
               wt    column vector of weights. w[k] is weight for most recent 
                      data; wt[1] is for most distant data value.  The function 
                     internally standardizes the weights so that sum(wt)=1.
   Example call: WMA  = WMA(y, 1:5);
start WMA(y, wt);
   w = colvec(wt) / sum(wt);       /* standardize weights so that sum(w)=1 */
   k = nrow(w);
   MA = j(nrow(y), 1, .);
   /* handle first k values separately */
   do i = 1 to k-1;
      wIdx = k-i+1:k;                 /* index for previous i weights */
      tIdx = 1:i;                     /* index for previous i data values */
      MA[i] = sum(wt[wIdx]#y[tIdx]) / sum(wt[wIdx]);  /* weighted average */
   /* main computation: average of current and previous k-1 data values */
   do i = k to nrow(y);
      idx = (i-k+1):i;               /* rolling window of k data points */
      MA[i] = sum( w#y[idx] );       /* weighted sum of k data values */
   return ( MA );

Notice that the function requires computing a weighted mean, which is described in a previous article.

An exponentially weighted moving average function

An exponentially weighted moving average is defined recursively. The average at time t is a weighted average of the data point at time t and the average from time t-1. The relative weights are determined by the smoothing parameter, α. The following function implements that definition:

/* Exponentially weighted moving average (EWMA) with smoothing parameter alpha.
   Inputs:      y     column vector of length N
                alpha scalar value 0 < alpha < 1
start EWMA(y, alpha);
   MA = j(nrow(y), 1, .);
   MA[1] = y[1];              /* initialize first value of smoother */
   do i = 2 to nrow(y);
      MA[i] = alpha*y[i] + (1-alpha)*MA[i-1];
   return ( MA );

The three moving average functions are now defined. You can read the time series data into a vector and call the functions. If necessary, you can write the rolling statistics to a SAS data set:

/* read time series data */
use Sashelp.Air;  
   read all var "date" into t;
   read all var "air" into y;
MA   = MA(y, 5);           /* moving average, k=5 */
WMA  = WMA(y, 1:5);        /* weighted moving average */
EWMA = EWMA(y, 0.3);       /* exponentially WMA, alpha=0.3 */
create out var{t y MA WMA EWMA};  append;  close out;

You can use the SGPLOT procedure to visualize the rolling statistics, as shown in my previous article.

Vectorizing time series computations

The experienced SAS/IML programmer will notice that these functions are not heavily vectorized. The MA and WMA computations use vectors of length k to compute the means and weighted means, respectively. It is possible to write these functions by using a matrix operation, but if the time series has N data points, the transformation matrix is an N x N lower-triangular banded matrix, which requires a lot of memory for large values of N.

Notice that the EWMA function uses scalar quantities inside a loop. For time series computations that use lagged data values, you can sometimes vectorize the time series computations. However, for operations that are defined recursively, such as the EWMA, the effort required to vectorize the computation might exceed the benefit you gain from the vectorization. In many cases, a function that uses a simple loop is fast and easy to maintain.


This article presents functions for computing rolling statistics in SAS/IML. Examples included a simple moving average (MA), a weighted moving average (WMA), and an exponentially weighted moving average (EWMA). The article describes how to modify these function to compute other rolling statistics in SAS.

Computations of rolling statistics might not be easy to vectorize. Even when you can vectorize a computation, a simpler approach might run faster.

tags: Data Analysis, Statistical Programming, vectorization

The post Rolling statistics in SAS/IML appeared first on The DO Loop.

1月 272016

A common question on SAS discussion forums is how to compute a moving average in SAS. This article shows how to use PROC EXPAND and contains links to articles that use the DATA step or macros to compute moving averages in SAS.

Moving average in SAS

In a previous post, I explained how to define a moving average and provided an example, which is shown here. The graph is a scatter plot of the monthly closing price for IBM stock over a 20-year period. The three curves are moving averages. The "MA" curve is a five-point (trailing) moving average. The "WMA" curve is a weighted moving average with weights 1 through 5. (When computing the weighted moving average at time t, the value yt has weight 5, the value yt-1 has weight 4, the value yt-2 has weight 3, and so forth.) The "EWMA" curve is an exponentially weighted moving average with smoothing factor α = 0.3.

This article shows how to use the EXPAND procedure in SAS/ETS software to compute a simple moving average, a weighted moving average, and an exponentially weighted moving average in SAS. For an overview of PROC EXPAND and its many capabilities, I recommend reading the short paper "Stupid Human Tricks with PROC EXPAND" by David Cassell (2010).

Because not every SAS customer has a license for SAS/ETS software, there are links at the end of this article that show how to compute a simple moving average in SAS by using the DATA step.

Create an example time series

Before you can compute a moving average in SAS, you need data. The following call to PROC SORT creates an example time series with 233 observations. There are no missing values. The data are sorted by the time variable, T. The variable Y contains the monthly closing price of IBM stock during a 20-year period.

/* create example data: IBM stock price */
title "Monthly IBM Stock Price";
proc sort data=sashelp.stocks(where=(STOCK='IBM') rename=(Date=t Close=y)) 
  by t;

Compute a moving average in SAS by using PROC EXPAND

PROC EXPAND computes many kinds of moving averages and other rolling statistics, such as rolling standard deviations, correlations, and cumulative sums of squares.

In the procedure, the ID statement identifies the time variable, T. The data should be sorted by the ID variable. The CONVERT statement specifies the names of the input and output variables. The TRANSFORMOUT= option specifies the method and parameters that are used to compute the rolling statistics.

/* create three moving average curves */
proc expand data=Series out=out method=none;
   id t;
   convert y = MA   / transout=(movave 5);
   convert y = WMA  / transout=(movave(1 2 3 4 5)); 
   convert y = EWMA / transout=(ewma 0.3);

The example uses three CONVERT statements:

  • The first specifies that MA is an output variable that is computed as a (backward) moving average that uses five data values (k=5).
  • The second CONVERT statement specifies that WMA is an output variable that is a weighted moving average. The weights are automatically standardized by the procedure, so the formula is WMA(t) = (5yt + 4yt-1 + 3yt-2 + 2yt-3 + 1yt-4) / 15.
  • The third CONVERT statement specifies that EWMA is an output variable that is an exponentially weighted moving average with parameter 0.3.

Notice the METHOD=NONE option on the PROC EXPAND statement. By default, the EXPAND procedure fits cubic spline curves to the nonmissing values of variables. The METHOD=NONE options ensures that the raw data points are used to compute the moving averages, rather than interpolated values.

Visualizing moving averages

An important use of a moving average is to overlay a curve on a scatter plot of the raw data. This enables you to visualize short-term trends in the data. The following call to PROC SGPOT creates the graph at the top of this article:

proc sgplot data=out cycleattrs;
   series x=t y=MA   / name='MA'   legendlabel="MA(5)";
   series x=t y=WMA  / name='WMA'  legendlabel="WMA(1,2,3,4,5)";
   series x=t y=EWMA / name='EWMA' legendlabel="EWMA(0.3)";
   scatter x=t y=y;
   keylegend 'MA' 'WMA' 'EWMA';
   xaxis display=(nolabel) grid;
   yaxis label="Closing Price" grid;

To keep this article as simple as possible, I have not discussed how to handle missing data when computing moving averages. See the documentation for PROC EXPAND for various issues related to missing data. In particular, you can use the METHOD= option to specify how to interpolate missing values. You can also use transformation options to control how moving averages are defined for the first few data points.

Create a moving average in SAS by using the DATA step

If you do not have SAS/ETS software, the following references show how to use the SAS DATA step to compute simple moving averages by using the LAG function:

The DATA step, which is designed to handle one observation at a time, is not the best tool for time series computations, which naturally require multiple observations (lags and leads). In a future blog post, I will show how to write SAS/IML functions that compute simple, weighted, and exponentially weighted moving averages. The matrix language in PROC IML is easier to work with for computations that require accessing multiple time points.

tags: Data Analysis, Tips and Techniques

The post Compute a moving average in SAS appeared first on The DO Loop.

1月 252016

A moving average (also called a rolling average) is a satistical technique that is used to smooth a time series. Moving averages are used in finance, economics, and quality control. You can overlay a moving average curve on a time series to visualize how each value compares to a rolling average of previous values. For example, the following graph shows the monthly closing price of IBM stock over a 20-year period. Three kinds of moving averages are overlaid on a scatter plot of the data.

Moving average of stock price

The IBM stock price increased in some time periods and decreased in others. The moving-average curves help to visualize these trends and identify these time periods. For a simple moving average, the smoothness of a curve is determined by the number of time points, k, that is used to compute the moving average. Small values of k result in curves that reflect the short-term ups and downs of the data; large values of k undulate less. For stock charts that show daily prices, the 30-day moving average and the 5-day moving average are popular choices.

How do you define a moving average?

The most common moving averages are the simple moving average (MA), the weighted moving average (WMA), and the exponentially weighted moving average (EWMA). The following list provides a brief description and mathematical formula for these kinds of moving averages. See the Wikipedia article on moving averages for additional details.

Let {y0, y1, ..., yt, ...} be the time series that you want to smooth, where yt is the value of the response at time t.

  • The simple moving average at time t is the arithmetic mean of the series at yt and the previous k-1 time points. In symbols,
          MA(t; k) = (1/k) Σ yi
    where the summation is over the k values {yt-k+1, ..., yt}.
  • The weighted moving average (WMA) at time t is a weighted average of the series at yt and the previous k-1 time points. Typically the weights monotonically decrease so that data from "long ago" contribute less to the average than recent data. If the weights sum to unity (Σ wi = 1) then
          WMA(t; k) = Σ wi yi
    If the weights do not sum to unity, then divide that expression by Σ wi.
  • The exponentially weighted moving average (EWMA) does not use a finite rolling window. Instead of the parameter k, the EWMA uses a decay parameter α, where 0 < α < 1. The smoothed value at time t is defined recursively as
          EWMA(t; α) = α yt + (1 - α) EWMA(t-1; α)
    You can "unwind" this equation to obtain the EWMA as a WMA where the weights decrease geometrically. The choice of α determines the smoothness of the EWMA. A value of α ≈ 1 implies that older data contribute very little to the average. Conversely, small values of α imply that older data contribute to the moving average almost as much as newer data.

Each of these definitions contains an ambiguity for the first few values of the moving average. For example, if t < k, then there are fewer than k previous values in the MA and WMA methods. Some practitioners assign missing values to the first k-1 values, whereas others average the values even when fewer than k previous data points exist. For the EWMA, the recursive definition requires a value for EWMA(0; α), which is often chosen to be y0.

My next blog post shows how to compute various moving averages in SAS. The article shows how to create the IBM stock price example, which is a time series plot overlaid with MA, WMA, and EWMA curves.

tags: Data Analysis, Getting Started, Math

The post What is a moving average? appeared first on The DO Loop.

1月 062016

Weighted averages are all around us. Teachers use weighted averages to assign a test more weight than a quiz. Schools use weighted averages to compute grade-point averages. Financial companies compute the return on a portfolio as a weighted average of the component assets. Financial charts show (linearly) weighted moving averages or exponentially-weighted moving averages for stock prices.

The weighted average (or weighted mean, as statisticians like to call it) is easy to compute in SAS by using either PROC MEANS or PROC UNIVARIATE. Use the WEIGHT statement to specify a weight variable (w), and use the VAR statement as usual to specify the measurement variable (x). The formula for the weighted mean is the ratio of sums Σ wixi /  Σ wi. The following example computes the numerator (weighted sum), the denominator (sum of weights), and the weighted mean for a set of eight data points. For these data and weights, the weighted sum is 0.325:

data Wt;
input x wt;
-2   1
-1.5 0.8
-1.2 0.5
-0.5 1
 0   1
 0.8 1.5
 1.4 2.3
 2.0 1.5
proc means data=Wt sum sumwgt mean;
   weight wt;
   var x;

The WEIGHT statement is supported in many SAS procedures. By convention, weights are positive values, so any observations that contain missing or nonpositive weights are excluded from the computation.

Weighted means in SAS/IML software

The computation of the weighted mean is easy to program in SAS/IML software. Recall that the elementwise multiplication operator (#) computes the elementwise product of two vectors. If there are no missing values in the data and all the weights are positive, then the SAS/IML statement m = WtMean = sum(x#w) / sum(w) computes the weighted mean of the X values weighted by W.

For consistency with the rest of SAS, the following function excludes observations for which the X value is missing or for which the weight variable is not positive. Consequently, the following function duplicates the computation is used by PROC MEANS and PROC UNIVARIATE:

proc iml;
start WtMean(x, w);
   idx = loc(x^=. & w>0);                /* use only valid observations */
   if ncol(idx)=0 then return(.);        /* no valid obs; return missing */
   m = sum(x[idx]#w[idx]) / sum(w[idx]); /* compute weighted mean */
   return( m );
use Wt;   read all var {x wt};   close Wt;  /* read the example data */
WtMean = WtMean(x, wt);                     /* test the function */
print WtMean;
call symputx("xbar", WtMean);            /* store value in macro var for later */

The result (not shown) is the same as reported by PROC MEANS. The SYMPUTX call creates a macro variable xbar that contains the value of the weighted mean for this example. This macro variable is used in the next section.

Visualizing a weighted mean

Weighted distributions are not always easy to visualize, and for this reason PROC UNIVARIATE does not support creating graphs of weighted analyses. However, weighted means have a simple physical interpretation.

For the usual unweighted mean, imagine placing N identical point masses at the locations x1, x2, ..., xN along a massless rod. (An idealized point mass has no extent; the mass is concentrated at a single mathematical point.) The mean value of the X values is the center of mass for the point masses: the location at which the rod is perfectly balanced. In a similar way, the weighted mean is the location of the center of mass for a system of N point masses in which the mass wi is placed at the locations xi.

You can use a bubble plot to depict the physical arrangement of masses for this example. Instead of an idealized point mass, the bubble plot enables you to represent each mass by a circle whose size is related to the mass. The SIZE= option for the BUBBLE statement in PROC SGPLOT determines the diameter of the bubbles, but mass is proportional to area (actually volume, but I'm going to use a 2-D picture), so I use the square root of the weight to determine the size of each bubble. This trick ensures that the area of the bubbles is proportional to the weight.

The following DATA step computes the square root of each weight and adds a horizontal coordinate, y=0. The call to PROC SGPLOT creates the bubble plot. The REFLINE statement displays the massless rod. A drop line is shown at the center of mass for this system; the horizontal position is the valueof the xbar macro variable that was previously computed. (You can imagine that the system is perfectly balanced on the tip of a needle.) Finally, the TEXT statement (added in SAS 9.4m2) displays the weight of each mass. For earlier releases of SAS, you can use the MARKERCHAR= option in the SCATTER statement to display the weights.

data Bubble;
set Wt;
y = 0;
radius = sqrt(Wt);
ods graphics / width = 400px height=200px;
proc sgplot data=Bubble noautolegend;
   refline 0 / axis=y;
   dropline x=&xbar y=0 / dropto=x;
   bubble x=x y=y size=radius;
   text x=x y=y text=wt / strip;   /* or   scatter x=x y=y / markerchar=wt; */
   yaxis display=none;

In the graph, the five small masses to the left of the center of mass are balanced by the three larger masses to the right of the center of mass.

Although this example is one-dimensional, you can use the weighted mean computation to compute the center of mass for a two-dimensional collection of point masses: the X coordinates of the points are used to compute the X coordinate of the center of mass, and he Y coordinate for the center of mass is computed similarly. The bubble plot is easily modified to represent the two-dimensional arrangement.

In summary, the weighted mean is easy to compute and fun to visualize in SAS. Have you needed to compute a weighted mean? What did the weights represent? Leave a comment.

tags: Data Analysis, Getting Started

The post Compute a weighted mean in SAS appeared first on The DO Loop.

1月 042016

I wrote 114 posts for The DO Loop blog in 2015. Which were the most popular with readers?

In general, highly technical articles appeal to only a small group of readers, whereas less technical articles appeal to a larger audience. Consequently, many of my popular articles were related to data analysis or general SAS programming. These are topics that are relevant to all SAS programmers. However, it is gratifying to see that a few of my articles about computational statistics also attracted many readers.

General SAS techniques

Although I mostly write about statistical programming, several articles about how to do things in Base SAS were very popular:

Statistical Graphics and Data Visualization

Everyone likes to learn better ways to visualize data in SAS! The following posts generated some interesting discussions.

Statistical Techniques

No list would be complete without including articles about how to implement statistical techniques, computations, simulations, and visualizations, which are the bread and butter of many SAS programmers. Here are a few topics that resonated with readers in 2015:

Did you make a New Year's resolution to improve your SAS skills this year? Start your new year by (re-)reading one of these 12 popular posts from 2015.

Do you remember a favorite article that didn't make the list? Name your favorite in the comments.

tags: Data Analysis

The post Popular posts from The DO Loop in 2015 appeared first on The DO Loop.