rick wicklin

8月 132018
 

My colleague, Robert Allison, recently published an interesting visualization of the relationship between chess ratings and age. His post was inspired by the article "Age vs Elo — Your battle against time," which was published on the chess.com website. ("Elo" is one of the rating systems in chess.) Robert Allison's article indicates how to download the 2014 Elo ratings for 70,963 active players into a SAS data set. You can use PROC SGPLOT to create a scatter plot of the rating and age of each player. Because the graph displays so many observations, I followed Robert's advice and used semi-transparency to bring out hidden features in the data. I also colored each marker according to whether the player is male (blue) or female (light red). The results are shown below:

Elo rating in chess versus age for 70,000+ active chess players in 2014.

As pointed out in the previous articles, ratings depend on age in a nonlinear fashion. The ratings for young players tend to be less than for young adults. Ratings tend to be highest for players in their mid to late twenties. After age 30, ratings tend to decrease with age. The purpose of this article is to use statistical regression to predict ratings from age.

Predict percentiles of Elo rating as a function of age

The chess.com article shows a line plot that is formed by binning the players' ages into five-year age ranges, computing the average rating in each bin, and then connecting the means in each age group. This bin-and-connect-the-means method is less powerful than regression, but approximates the nonlinear relationship between age and the average rating for that age. A more powerful statistical technique is quantile regression. In quantile regression, you model the percentiles of the response variable, such as the 25th, 50th, or 90th percentile of the distribution of ratings for a given age. In other words, quantile regression enables you to compute separate predictions for the ratings of poor players, for good players, and for great players. A previous article compares quantile regression with the simpler binning technique.

I used the QUANTREG procedure in SAS to performs quantile regression to model the 10th, 25th, 50th, 75th, and 90th percentiles of rating as a function of age. (Quantiles are numbers between 0 and 1, whereas percentiles are between 0 and 100, but otherwise the terms are interchangeable.) I ignored the gender of the players. The following graph is the default plot that is created by PROC QUANTREG. The lowest curve shows the predicted rating for the 10th percentile of players (the weak players) as a function of age. The middle curve shows the predicted rating for the 50th percentile of players (the average players). The highest curve shows the predicted rating for the 90th percentile, who are excellent chess players.

Elo rating in chess versus age. The 10th, 25th, 50th, 75th, and 90th percentiles are shown for active players.

For the shown percentiles, the predicted chess ratings increase quickly for young players, then peak for players in their mid-twenties. This happens for all percentiles, from the 10th to the 90th. For example, the median player (50th percentile) at age 29 has a rating of about 2050. In contrast, the 90th percentile for a 29-year-old is about 2356.

This plot should remind you of a children's weight and height charts in a doctor's office. Just as the mother of a young child might use the doctor's chart to estimate her child's percentile for weight and height, so too can a chess player use this chart to estimate his percentile for rating. If you are a chess player, find your age on the horizontal axis. Then trace a line straight up until you reach your personal Elo rating. Use the relative position of the surrounding curves to estimate your rating percentile. For example, a 40-year-old with a 1900 rating could conclude from the graph that he is between the 25th and 50th percentile, but closer to the 50th.

As I said, this graph is created by default, but we can do better. In the following section, I add separate curves for women and men. I also add grid lines to make it easier to estimate coordinates in the graph.

Adding gender to the predict percentiles

Chess is enjoyed by both men and women. If you add gender to the quantile regression model, you obtain separate curves for men and women for each percentile. The following graph shows the predicted percentiles for the 10th, 50th, and 90th percentiles, for both men and women. Notice that the predicted male ratings are higher than the predicted female ratings at every age and for each quantile. Notice also that the predicted scores for females tend to peak later in life: about 30 years for females versus 25 for males. After peaking, the female curves drop off more quickly than their male counterparts.

Percentile curves for mean and women. Shown are the 10th, 50th, and 90th percentiles

Percentiles for Elo chess ratings by gender

The previous section overlays the male and female percentile curves. This makes it easy to compare males and females, but even with only three quantiles, the graph is crowded. If you want to display five or more percentile curves, it makes sense to separate the graph into a panel that contains two graphs, one for males and one for females. Again, this is similar to the height-weight percentile charts in the pediatrician offices, which often have separate charts for boys and girls.

The following panel shows side-by-side predictions for quantiles of ratings as a function of age. If you are a chess player, you can use these charts to estimate your rating percentile. For your current age and gender, what is your percentile?

In summary, I used data for Elo ratings in chess to build a quantile regression model that predicts percentiles of ratings for a given age and gender. Using these graphs, you can see the predicted Elo ratings for poor, good, and great chess players for every combination of age and gender. If you are a chess player yourself, you can locate your age-score value in one of the plots and estimate your percentile among your age-gender peer group.

If you are a SAS user and are interested in the details, you can download the SAS program that creates the analysis and graphs. It uses PROC QUANTREG to predict the quantile curves, and uses tips and techniques from the articles "How to score and graph a quantile regression model in SAS" and "Plot curves for levels of two categorical variables in SAS."

The post A quantile regression analysis of chess ratings by age appeared first on The DO Loop.

8月 082018
 
Use the GROUP= option in PROC SGPLOT to plot lines for TWO categorical variables

The SGPLOT procedure in SAS makes it easy to create graphs that overlay various groups in the data. Many statements support the GROUP= option, which specifies that the graph should overlay group information. For example, you can create side-by-side bar charts and box plots, and you can overlay multiple scatter plots and series plots in the same graph. However, the GROUP= option takes only a single grouping variable! What can you do if you need to visualize combinations of two (or even three!) categorical variables? This article shows how to construct a new group variable that combines the levels of two or more existing categorical variables.

For concreteness, I will show how to overlay multiple series plots (line plots), as shown to the right. (Click to enlarge.) By using this technique, you can overlay curves that describe combinations of gender and race. Or you can plot response curves for control-vs-experimental groups and the severity of a disease. You can use this technique for other plot types, such as creating box plots that visualize a two-way ANOVA.

What is the problem? Why can't you use the GROUP= option?

To motivate the discussion, let's first see why the GROUP= option in the SERIES statement does not work for overlaying two categorical variables. The following DATA step creates two categorical variables. The G1 variable has the values 1, 2, and 3. The G2 variable has the values 'A' and 'B'. For each of the six combinations of (G1, G2), the DATA step creates a curve (actually a line) of (X, Y) values. Let's see what happens if you try to plot the lines by using a SERIES statement with the GROUP=G1 option:

data TwoGroups;
do G1=1 to 3;
   do G2='A', 'B';
      do X = 1 to 10;
         Y = 10 + G1 + 0.5*(G2='A') + G1*X/20;
         output;
      end;
   end;
end;
run;
 
title "First Attempt: Does Not Work";
proc sgplot data=TwoGroups;
   series x=x y=y / group=G1 curvelabel;
run;

The attempt is a failure. The graph does not show six individual curves. Instead, it shows three curves (the number of categories in the G1 variables) and each "curve" is Z-shaped because the graph traces the curve for G2='A' on the range [1, 10] and then draws the curve for G2='B' without "picking up the pen." This happens because the data are sorted by G1, then by G2, then by X.

There are two ways to handle this situation. One way is to try to convert the data from "long form" to "wide form." For these data, which have identical X values for every curve, you can create two new variables Y_A and Y_B that contain the coordinates of the three curves for G2='A' and G2='B', respectively. You can then use two SERIES statements, each using the GROUP=G1 option. Each statement will draw three curves for a total of six. You can use the NOCYCLEATTRS option to make sure that each statement uses the same line colors and patterns. However, this approach becomes complicated if each curve is evaluated at a different set of X values. In that case, it is better to keep the data in long form.

Forming a new group variable by concatenation

The problem would be solved if there were one categorical variable that had six levels instead of two categorical variables that have six joint levels, so let's write some SAS code to make that happen. First sort the data by the categorical variables and then by the X variable. (For the current example, the data are already sorted correctly.) Then write a DATA step that does either of the following options:

  • Option 1: Use the CATT (or CATX) function to concatenate the values of the existing group variables. The new categorical variable will have values that derived from the original variables.
  • Option 2: Use the FIRST.variable syntax to create a new group variable that has the values 1–6. Then use PROC FORMAT to assign a meaningful value to each level of the new categorical variable.

The first option (the CATT function) is automated and less prone to error, but for the sake of completeness both options are shown below:

/* create a new group variable by concatenating the two existing variables */
proc sort data=TwoGroups;
   by G1 G2 X;
run;
 
data Make2Groups;
set TwoGroups;
by G1 G2;
/* Option 1: automatically create joint levels from original levels */
Label = catt("G1=", G1) || "; " || catt("G2=", G2);
/* Option 2: create new categorical variable and use PROC FORMAT to assign values */
if first.G2 then 
   GroupID + 1;        /* GroupID = 1, 2, 3, ... numGroups */
run;
 
/* For Option 2: use PROC FORMAT to form labels that encode the two groups. See
https://blogs.sas.com/content/iml/2017/04/24/two-way-anova-viz.html
*/
proc format;                 
  value GroupFmt 1 = "G1=1; G2='A'"   2 = "G1=1; G2='B'"   3 = "G1=2; G2='A'"
                 4 = "G1=2; G2='B'"   5 = "G1=3; G2='A'"   6 = "G1=3; G2='B'";
run;

The following call to PROC SGPLOT creates a series plot for the Label variable, which corresponds to the joint levels of the original grouping variables. The GROUPLC= option (supported in SAS 9.4M2) colors the lines according to the value of the G2 variable.

title "Two Groups, One SERIES Statement";
proc sgplot data=Make2Groups;
/* format GroupID GroupFmt.; */            /* for Option 2 */
series x=x y=y / group=Label grouplc=G2    /* or group=GroupID for Option 2 */
                 lineattrs=(pattern=solid) curvelabel;
run;

The graph is shown at the top of this article. The new categorical variable has values that correspond to joint levels of the original two variables. The GROUP= option creates six curves when you specify the Label variable. The GROUPLC= option sets the colors of the lines according to the values of the G2 variable.

This technique generalizes to three categorical variables, but I will leave the details to the reader. You might want to use the GROUPLP= option to set the line patterns according to the value of a third categorical variable. Beyond three variables the display will begin to resemble a spaghetti plot. For many categorical variables, you might want to use panels and BY groups to visualize the curves.

This technique also generalizes to other plot types, such as box plots and scatter plots.

The post Plot curves for levels of two categorical variables in SAS appeared first on The DO Loop.

8月 062018
 
Graph of quantile regression curves in SAS

This article shows how to score (evaluate) a quantile regression model on new data. SAS supports several procedures for quantile regression, including the QUANTREG, QUANTSELECT, and HPQUANTSELECT procedures. The first two procedures do not support any of the modern methods for scoring regression models, so you must use the "missing value trick" to score the model. (HPQUANTSELECT supports the CODE statement for scoring.) You can use this technique to construct a "sliced fit plot" that visualizes the model, as shown to the right. (Click to enlarge.)

The easy way to create a fit plot

The following DATA step creates the example data as a subset of the Sashelp.BWeight data set, which contains information about the weights of live births in the US in 1997 and information about the mother during pregnancy. The following call to PROC QUANTREG models the conditional quantiles of the baby's weight as a function of the mother's weight gain. The weight gain is centered according to the formula MomWtGain = "Actual Gain" – 30 pounds. Because the quantiles might depend nonlinearly on the mother's weight gain, the EFFECT statement generates a spline basis for the independent variable. The resulting model can flexibly fit a wide range of shapes.

Although this article shows how to create a fit plot, you can also get a fit plot directly from PROC QUANTREG. As shown below, the PLOT=FITPLOT option creates a fit plot when the model contains one continuous independent variable.

data Orig;   /* restrict to 5000 births; exclude extreme weight gains */
set Sashelp.BWeight(obs=5000 where=(MomWtGain<=40));
run;
 
proc quantreg data=Orig
              algorithm=IPM            /* use IPM algorithm for splines and binned data */
              ci=none plot(maxpoints=none)=fitplot; /* or fitplot(nodata) */
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );  /* 9 knots, equally spaced */
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
run;
Fit plot of quantile regression curves from PROC QUANTREG in SAS

The graph enables you to visualize curves that predict the 10th, 25th, 50th, 75th, and 90th percentiles of the baby's weight based on the mother's weight gain during pregnancy. Because the data contains 5,000 observations, the fit plot suffers from overplotting and the curves are hard to see. You can use the PLOT=FITPLOT(NODATA) option to exclude the data from the plot, thus showing the quantile curves more clearly.

Score a SAS procedure by using the missing value trick

Although PROC QUANTREG can produce a fit plot when there is one continuous regressor, it does not support the EFFECTPLOT statement so you have to create more complicated graphs manually. To create a graph that shows the predicted values, you need to score the model on a new set of independent values. To use the missing value trick, do the following:

  1. Create a SAS data set (the scoring data) that contains the values of the independent variables at which you want to evaluate the model. Set the response variable to missing for each observation.
  2. Append the scoring data to the original data that are used to fit the model. Include a binary indicator variable that has the value 0 for the original data and the value 1 for the scoring data.
  3. Run the regression procedure on the combined data set. Use the OUTPUT statement to output the predicted values for the scoring data. Of course, you can also output residuals and other observation-wise statistics, if necessary.

This general technique is implemented by using the following SAS statements. The scoring data consists of evenly spaced values of the MomWtGain variable. The binary indicator variable is named ScoreData. The result of these computations is a data set named QRegOut that contains a variable named Pred that contains the predicted values for each observation in the scoring data.

/* 1. Create score data set */
data Score;
/* Optionally define additional covariates here. See the example
   "Create a sliced fit plot manually by using the missing value trick"
   https://blogs.sas.com/content/iml/2017/12/20/create-sliced-fit-plot-sas.html
*/
Weight = .;               /* set response (Y) variable to missing */
do MomWtGain = -30 to 40; /* uniform spacing in the independent (X) variable */
   output;
end;
run;
 
/* 2. Append the score data to the original data. Use a binary variable to 
      indicate which observations are the scoring data */
data Combined;
set Orig                     /* original data */
    Score(in=_ScoreData);    /* scoring data  */
ScoreData = _ScoreData;      /* binary indicator variable. ScoreData=1 for scoring data */
run;
 
/* 3. Run the procedure on the combined (original + scoring) data */
ods select ModelInfo NObs;
proc quantreg data=Combined algorithm=IPM ci=none;
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
   output out=QRegOut(where=(ScoreData=1)) /* output predicted values for the scoring data */
              P=Pred / columnwise;         /* COLUMWISE option supports multiple quantiles */
run;

This technique can be used for any SAS regression procedure. In this case, the COLUMNWISE option specifies that the output data set should be written in "long form": A QUANTILE variable specifies the quantile and the variable PRED contains the predicted values for each quantile. If you omit the COLUMNWISE option, the output data is in "wide form": The predicted values for the five quantiles are contained in the variables Pred1, Pred2, ..., Pred5.

Using predicted values to create a sliced fit plot

You can use the predicted values of the scoring data to construct a fit plot. You merely need to sort the data by any categorical variables and by the X variable (in this case, MomWtGain). You can then plot the predicted curves. If desired, you can also append the original data and the predicted values and create a graph that overlays the data and predicted curves. You can use transparency to address the overplotting issue and also modify other features of the fit plot, such as the title, axes labels, tick positions, and so forth:

/* 4. If you want a fit plot, sort by the independent variable for each curve.
      Put QUANTILE and other covariates first, then the X variable. */
proc sort data=QRegOut out=ScoreData;
   by Quantile MomWtGain;
run;
 
/* 5. (optional) If you want to overlay the plots, it's easiest to define separate variables
      for original data and scoring data */
data All;
set Orig                                      /* original data */
    ScoreData(rename=(MomWtGain=X Pred=Y));   /* scoring data  */
run;
 
title "Quantile Regression Curves";
footnote J=C "Gain is centered: MomWtGain = Actual_Gain - 30";
proc sgplot data=All;
   scatter x=MomWtGain y=Weight / markerattrs=(symbol=CircleFilled) transparency=0.92;
   series x=X y=Y / group=Quantile lineattrs=(thickness=2) nomissinggroup name="p";
   keylegend "p" / position=right sortorder=reverseauto title="Quantile";
   xaxis values=(-20 to 40 by 10) valueshint grid  label="Mother's Relative Weight Gain (lbs)";;
   yaxis values=(1500 to 4500 by 500) valueshint grid label="Predicted Weight of Child (g)";
run;

The fit plot is shown at the top of this article.

In summary, this article shows how to use the missing value trick to evaluate a regression model in SAS. You can use this technique for any regression procedure, although newer procedures often support syntax that makes it easier to score a model.

As shown in this example, if you score the model on an equally spaced set of points for one of the continuous variables in the model, you can create a sliced fit plot. For a more complicated example, see the article "How to create a sliced fit plot in SAS."

The post How to score and graph a quantile regression model in SAS appeared first on The DO Loop.

8月 062018
 
Graph of quantile regression curves in SAS

This article shows how to score (evaluate) a quantile regression model on new data. SAS supports several procedures for quantile regression, including the QUANTREG, QUANTSELECT, and HPQUANTSELECT procedures. The first two procedures do not support any of the modern methods for scoring regression models, so you must use the "missing value trick" to score the model. (HPQUANTSELECT supports the CODE statement for scoring.) You can use this technique to construct a "sliced fit plot" that visualizes the model, as shown to the right. (Click to enlarge.)

The easy way to create a fit plot

The following DATA step creates the example data as a subset of the Sashelp.BWeight data set, which contains information about the weights of live births in the US in 1997 and information about the mother during pregnancy. The following call to PROC QUANTREG models the conditional quantiles of the baby's weight as a function of the mother's weight gain. The weight gain is centered according to the formula MomWtGain = "Actual Gain" – 30 pounds. Because the quantiles might depend nonlinearly on the mother's weight gain, the EFFECT statement generates a spline basis for the independent variable. The resulting model can flexibly fit a wide range of shapes.

Although this article shows how to create a fit plot, you can also get a fit plot directly from PROC QUANTREG. As shown below, the PLOT=FITPLOT option creates a fit plot when the model contains one continuous independent variable.

data Orig;   /* restrict to 5000 births; exclude extreme weight gains */
set Sashelp.BWeight(obs=5000 where=(MomWtGain<=40));
run;
 
proc quantreg data=Orig
              algorithm=IPM            /* use IPM algorithm for splines and binned data */
              ci=none plot(maxpoints=none)=fitplot; /* or fitplot(nodata) */
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );  /* 9 knots, equally spaced */
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
run;
Fit plot of quantile regression curves from PROC QUANTREG in SAS

The graph enables you to visualize curves that predict the 10th, 25th, 50th, 75th, and 90th percentiles of the baby's weight based on the mother's weight gain during pregnancy. Because the data contains 5,000 observations, the fit plot suffers from overplotting and the curves are hard to see. You can use the PLOT=FITPLOT(NODATA) option to exclude the data from the plot, thus showing the quantile curves more clearly.

Score a SAS procedure by using the missing value trick

Although PROC QUANTREG can produce a fit plot when there is one continuous regressor, it does not support the EFFECTPLOT statement so you have to create more complicated graphs manually. To create a graph that shows the predicted values, you need to score the model on a new set of independent values. To use the missing value trick, do the following:

  1. Create a SAS data set (the scoring data) that contains the values of the independent variables at which you want to evaluate the model. Set the response variable to missing for each observation.
  2. Append the scoring data to the original data that are used to fit the model. Include a binary indicator variable that has the value 0 for the original data and the value 1 for the scoring data.
  3. Run the regression procedure on the combined data set. Use the OUTPUT statement to output the predicted values for the scoring data. Of course, you can also output residuals and other observation-wise statistics, if necessary.

This general technique is implemented by using the following SAS statements. The scoring data consists of evenly spaced values of the MomWtGain variable. The binary indicator variable is named ScoreData. The result of these computations is a data set named QRegOut that contains a variable named Pred that contains the predicted values for each observation in the scoring data.

/* 1. Create score data set */
data Score;
/* Optionally define additional covariates here. See the example
   "Create a sliced fit plot manually by using the missing value trick"
   https://blogs.sas.com/content/iml/2017/12/20/create-sliced-fit-plot-sas.html
*/
Weight = .;               /* set response (Y) variable to missing */
do MomWtGain = -30 to 40; /* uniform spacing in the independent (X) variable */
   output;
end;
run;
 
/* 2. Append the score data to the original data. Use a binary variable to 
      indicate which observations are the scoring data */
data Combined;
set Orig                     /* original data */
    Score(in=_ScoreData);    /* scoring data  */
ScoreData = _ScoreData;      /* binary indicator variable. ScoreData=1 for scoring data */
run;
 
/* 3. Run the procedure on the combined (original + scoring) data */
ods select ModelInfo NObs;
proc quantreg data=Combined algorithm=IPM ci=none;
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
   output out=QRegOut(where=(ScoreData=1)) /* output predicted values for the scoring data */
              P=Pred / columnwise;         /* COLUMWISE option supports multiple quantiles */
run;

This technique can be used for any SAS regression procedure. In this case, the COLUMNWISE option specifies that the output data set should be written in "long form": A QUANTILE variable specifies the quantile and the variable PRED contains the predicted values for each quantile. If you omit the COLUMNWISE option, the output data is in "wide form": The predicted values for the five quantiles are contained in the variables Pred1, Pred2, ..., Pred5.

Using predicted values to create a sliced fit plot

You can use the predicted values of the scoring data to construct a fit plot. You merely need to sort the data by any categorical variables and by the X variable (in this case, MomWtGain). You can then plot the predicted curves. If desired, you can also append the original data and the predicted values and create a graph that overlays the data and predicted curves. You can use transparency to address the overplotting issue and also modify other features of the fit plot, such as the title, axes labels, tick positions, and so forth:

/* 4. If you want a fit plot, sort by the independent variable for each curve.
      Put QUANTILE and other covariates first, then the X variable. */
proc sort data=QRegOut out=ScoreData;
   by Quantile MomWtGain;
run;
 
/* 5. (optional) If you want to overlay the plots, it's easiest to define separate variables
      for original data and scoring data */
data All;
set Orig                                      /* original data */
    ScoreData(rename=(MomWtGain=X Pred=Y));   /* scoring data  */
run;
 
title "Quantile Regression Curves";
footnote J=C "Gain is centered: MomWtGain = Actual_Gain - 30";
proc sgplot data=All;
   scatter x=MomWtGain y=Weight / markerattrs=(symbol=CircleFilled) transparency=0.92;
   series x=X y=Y / group=Quantile lineattrs=(thickness=2) nomissinggroup name="p";
   keylegend "p" / position=right sortorder=reverseauto title="Quantile";
   xaxis values=(-20 to 40 by 10) valueshint grid  label="Mother's Relative Weight Gain (lbs)";;
   yaxis values=(1500 to 4500 by 500) valueshint grid label="Predicted Weight of Child (g)";
run;

The fit plot is shown at the top of this article.

In summary, this article shows how to use the missing value trick to evaluate a regression model in SAS. You can use this technique for any regression procedure, although newer procedures often support syntax that makes it easier to score a model.

As shown in this example, if you score the model on an equally spaced set of points for one of the continuous variables in the model, you can create a sliced fit plot. For a more complicated example, see the article "How to create a sliced fit plot in SAS."

The post How to score and graph a quantile regression model in SAS appeared first on The DO Loop.

8月 012018
 

When you use a regression procedure in SAS that supports variable selection (GLMSELECT or QUANTSELECT), did you know that the procedures automatically produce a macro variable that contains the names of the selected variables? This article provides examples and details. A previous article provides an overview of the 'SELECT' procedures in SAS for building statistical models.

The final model in PROC GLMSELECT

PROC GLMSELECT uses variable selection techniques such as LAR and LASSO to fit a parsimonious linear model from a large number of potential regressors. The following call to PROC GLMSELECT is adapted from the "Getting Started" example from the documentation, which models the log-transformed salaries of baseball players by using on-the-field statistics, player characteristics, and team attributes. When PROC GLMSELECT runs, it creates the _GLSIND macro variable, which contains the names of the effects in the final model.

/* GLMSELECT creates &_GLSInd for selected model */
proc glmselect data=Sashelp.Baseball;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                  yrMajor crAtBat crHits crHome crRuns crRbi
                  crBB league division nOuts nAssts nError / selection=lasso;
quit;
 
%put &_GLSInd;      /* display names of effects for selected model */
 
/* Use _GLSInd to call PROC GLM on final selected model */
proc glmselect data=Sashelp.Baseball;
   class league division;
   model logSalary = &_GLSInd / solution;
quit;
--- SAS Log ---
%put &_GLSInd;
   nHits nRBI nBB YrMajor CrHits CrRuns CrRbi

The MODEL statement in PROC GLMSELECT includes 18 independent variables, but the final LASSO model contains only seven variables. The _GLSInd macro contains the name of the selected variables. As shown in the example, the macro can be used in subsequent analyses. The example uses the macro on the MODEL statement of PROC GLM. The _GLSIND variable contains the original variable names, which means that if you use the CLASS statement (or EFFECT statement) in the GLMSELECT model, you should specify the same statement in subsequent procedures.

If you want to use the selected variables in a procedure that does not support the CLASS statement (or EFFECT statement), you should use the OUTDESIGN= option in PROC GLMSELECT to generate a design matrix and use the _GLSMOD macro variable to specify the names of the dummy variables in the final model. An example is given at the end of this article.

The final model in PROC QUANTSELECT

Similarly, PROC QUANTSELECT creates a macro variable (named _QSLInd) that names the independent variables in the final selected model. For example, the following example models the conditional 90th percentile of the LogSalary variable for the same data set:

/* QUANTSELECT creates _QRSInd for selected model */
proc quantselect data=sashelp.baseball;
class league division;
model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                  yrMajor crAtBat crHits crHome crRuns crRbi
                  crBB league division nOuts nAssts nError / 
                  selection=lasso quantile=0.9;
run;
 
%put &_QRSInd;
--- SAS Log ---
%put &_QRSInd;
   nRBI CrHome nBB nRuns

When modeling the top 10% of salaries, the selected model includes two factors (career home runs and the number of runs scored in the previous year) that did not appear when predicting the conditional mean of the salaries. You can use the _QRSInd macro to build a predictive model in PROC QUANTREG.

PROC HPQUANTSELECT also creates a macro variable that contains the final selected independent variables. The macro is named _HPQRSIND.

Macros created by other selection procedures in SAS

The HPGENSELECT and LOGISTIC procedures, which can perform variable selection for generalized linear models, do not create a macro variable that contains the selected variables. However, the STEPDISC procedure creates a macro variable named _STDVar that contains the names of the quantitative variables that best discriminate among the classes in a discriminant analysis.

What happens if you use splines or split classification variables?

You might wonder what happens if you use the EFFECT statement to generate spline effects and/or use the SPLIT option on the CLASS statement to enable individual levels of a classification variable to enter/leave the model independently of other levels. (For the LASSO, LAR, and ELASTICNET methods, all spline effects and classification effects are split.) In this case, the final model is likely to contain only certain levels of a categorical variable or only certain basis functions for a spline effect. Consequently, probably you will need to use the design matrix and _GLSMOD macro variable in subsequent procedures.

For example, the following example creates a spline effect with five internal knots and splits the levels of two categorical variables as part of selecting a LASSO model. The final model contains the "Type='Sedan'" level, the "Origin='Asia'" level, and the first and fourth spline basis for the Weight variable:

proc glmselect data=Sashelp.Cars(where=(Type^="Hybrid"))
               outdesign=GLSSplitDesign;      /* create design matrix */
  class origin type / split;                  /* LASSO will force the SPLIT option */
  effect splWt  = spline(weight / details naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
  model mpg_city = origin | type | splWt @ 2  /  selection=Lasso;
quit;
 
%put &_GLSInd;    /* model refers to effects that are not part of the input data */
%put &_GLSMod;    /* model refers to dummy variables that are in the OUTDESIGN= data set */
 
/* for split variables, use the _GLSMOD macro and the OUTDESIGN= data set */
proc glm data=GLSSplitDesign;
   model mpg_city = &_GLSMod / solution;
quit;
--- SAS Log ---
%put &_GLSInd;
   Type_Sedan Type_Sports Origin_Asia*Type_Sedan splWt:1 splWt:4 splWt:4*Origin_USA
%put &_GLSMod;
   Type_Sedan Type_Sports Origin_Asia_Type_Sedan splWt_1 splWt_4 splWt_4_Origin_USA

Notice that the _GLSIND macro contains names such as "splWt:1" that are not in the input data set. However, the names in the _GLSMOD macro are all valid columns in the GLSSplitDesign data set, which was created by using the OUTDESIGN= option. So if you are using the LASSO, LAR, or ELASTICNET methods of variable selection or if you manually specify the SPLIT option, you will want to use the _GLSMOD macro and the design data set in subsequent analyses.

The post Which variables are in the final selected model? appeared first on The DO Loop.

8月 012018
 

When you use a regression procedure in SAS that supports variable selection (GLMSELECT or QUANTSELECT), did you know that the procedures automatically produce a macro variable that contains the names of the selected variables? This article provides examples and details. A previous article provides an overview of the 'SELECT' procedures in SAS for building statistical models.

The final model in PROC GLMSELECT

PROC GLMSELECT uses variable selection techniques such as LAR and LASSO to fit a parsimonious linear model from a large number of potential regressors. The following call to PROC GLMSELECT is adapted from the "Getting Started" example from the documentation, which models the log-transformed salaries of baseball players by using on-the-field statistics, player characteristics, and team attributes. When PROC GLMSELECT runs, it creates the _GLSIND macro variable, which contains the names of the effects in the final model.

/* GLMSELECT creates &_GLSInd for selected model */
proc glmselect data=Sashelp.Baseball;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                  yrMajor crAtBat crHits crHome crRuns crRbi
                  crBB league division nOuts nAssts nError / selection=lasso;
quit;
 
%put &_GLSInd;      /* display names of effects for selected model */
 
/* Use _GLSInd to call PROC GLM on final selected model */
proc glmselect data=Sashelp.Baseball;
   class league division;
   model logSalary = &_GLSInd / solution;
quit;
--- SAS Log ---
%put &_GLSInd;
   nHits nRBI nBB YrMajor CrHits CrRuns CrRbi

The MODEL statement in PROC GLMSELECT includes 18 independent variables, but the final LASSO model contains only seven variables. The _GLSInd macro contains the name of the selected variables. As shown in the example, the macro can be used in subsequent analyses. The example uses the macro on the MODEL statement of PROC GLM. The _GLSIND variable contains the original variable names, which means that if you use the CLASS statement (or EFFECT statement) in the GLMSELECT model, you should specify the same statement in subsequent procedures.

If you want to use the selected variables in a procedure that does not support the CLASS statement (or EFFECT statement), you should use the OUTDESIGN= option in PROC GLMSELECT to generate a design matrix and use the _GLSMOD macro variable to specify the names of the dummy variables in the final model. An example is given at the end of this article.

The final model in PROC QUANTSELECT

Similarly, PROC QUANTSELECT creates a macro variable (named _QSLInd) that names the independent variables in the final selected model. For example, the following example models the conditional 90th percentile of the LogSalary variable for the same data set:

/* QUANTSELECT creates _QRSInd for selected model */
proc quantselect data=sashelp.baseball;
class league division;
model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                  yrMajor crAtBat crHits crHome crRuns crRbi
                  crBB league division nOuts nAssts nError / 
                  selection=lasso quantile=0.9;
run;
 
%put &_QRSInd;
--- SAS Log ---
%put &_QRSInd;
   nRBI CrHome nBB nRuns

When modeling the top 10% of salaries, the selected model includes two factors (career home runs and the number of runs scored in the previous year) that did not appear when predicting the conditional mean of the salaries. You can use the _QRSInd macro to build a predictive model in PROC QUANTREG.

PROC HPQUANTSELECT also creates a macro variable that contains the final selected independent variables. The macro is named _HPQRSIND.

Macros created by other selection procedures in SAS

The HPGENSELECT and LOGISTIC procedures, which can perform variable selection for generalized linear models, do not create a macro variable that contains the selected variables. However, the STEPDISC procedure creates a macro variable named _STDVar that contains the names of the quantitative variables that best discriminate among the classes in a discriminant analysis.

What happens if you use splines or split classification variables?

You might wonder what happens if you use the EFFECT statement to generate spline effects and/or use the SPLIT option on the CLASS statement to enable individual levels of a classification variable to enter/leave the model independently of other levels. (For the LASSO, LAR, and ELASTICNET methods, all spline effects and classification effects are split.) In this case, the final model is likely to contain only certain levels of a categorical variable or only certain basis functions for a spline effect. Consequently, probably you will need to use the design matrix and _GLSMOD macro variable in subsequent procedures.

For example, the following example creates a spline effect with five internal knots and splits the levels of two categorical variables as part of selecting a LASSO model. The final model contains the "Type='Sedan'" level, the "Origin='Asia'" level, and the first and fourth spline basis for the Weight variable:

proc glmselect data=Sashelp.Cars(where=(Type^="Hybrid"))
               outdesign=GLSSplitDesign;      /* create design matrix */
  class origin type / split;                  /* LASSO will force the SPLIT option */
  effect splWt  = spline(weight / details naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
  model mpg_city = origin | type | splWt @ 2  /  selection=Lasso;
quit;
 
%put &_GLSInd;    /* model refers to effects that are not part of the input data */
%put &_GLSMod;    /* model refers to dummy variables that are in the OUTDESIGN= data set */
 
/* for split variables, use the _GLSMOD macro and the OUTDESIGN= data set */
proc glm data=GLSSplitDesign;
   model mpg_city = &_GLSMod / solution;
quit;
--- SAS Log ---
%put &_GLSInd;
   Type_Sedan Type_Sports Origin_Asia*Type_Sedan splWt:1 splWt:4 splWt:4*Origin_USA
%put &_GLSMod;
   Type_Sedan Type_Sports Origin_Asia_Type_Sedan splWt_1 splWt_4 splWt_4_Origin_USA

Notice that the _GLSIND macro contains names such as "splWt:1" that are not in the input data set. However, the names in the _GLSMOD macro are all valid columns in the GLSSplitDesign data set, which was created by using the OUTDESIGN= option. So if you are using the LASSO, LAR, or ELASTICNET methods of variable selection or if you manually specify the SPLIT option, you will want to use the _GLSMOD macro and the design data set in subsequent analyses.

The post Which variables are in the final selected model? appeared first on The DO Loop.

7月 302018
 

A programmer recently asked a question on a SAS discussion forum about design matrices for categorical variables. He had generated a design matrix by using PROC GLMMOD and wanted to use the design columns in a subsequent procedure. However, the columns were named COL1, COL2, COL3,..., so he couldn't tell which dummy variables correspond to each categorical variable. The following example illustrates his situation for the Weight_Status and Smoking_Status variables in the Sashelp.Heart data set:

proc glmmod data=Sashelp.Heart outdesign=GLMDesign;
   class Weight_Status Smoking_Status;
   model Cholesterol = Weight_Status Smoking_Status;
   ods select Parameters;
run;
 
proc contents data=GLMDesign varnum short; run;
The association betwen columns in a design matrix and levels of the original categorical variables

The "Parameters" table shows the association between columns of the design matrix and levels of the categorical variables in the model. The output from PROC CONTENTS shows that the columns of the design matrix (as stored in the GLMDesign data set) are named COL1, COL2, and so forth. On the discussion forum, I showed how to save the "Parameters" table to a SAS data and use a DATA _NULL_ step to form macro variables that you can use to associate the design columns to the original variables.

In retrospect, I missed a golden opportunity to mention that the GLMMOD procedure (which always produces a singular design matrix) is not as friendly or powerful as other procedures in SAS that can generate design matrices. In particular, the GLMSELECT and TRANSREG procedures can create design matrices for many different parameterizations of the classification variables. Furthermore, these procedures automatically create macro variables that tell you the names of the columns in the design matrix.

As a reminder, a major reason to create a design matrix is to perform an analysis with a SAS procedure that does not support a CLASS statement. For example, the documentation of the MCMC procedure shows how to use PROC TRANSREG to create a design matrix as preparation for performing a Bayesian regression analysis. For details of the many ways to generate design matrices in SAS, see my previous article, "Four ways to create a design matrix in SAS."

GLMSELECT: An easier way to associate columns of a design matrix

The GLMSELECT procedure supports the OUTDESIGN= option, which enables you to output a design matrix for the variables in a regression model. The GLMSELECT procedure has the following advantages of the GLMMOD procedure:

  • The procedure supports the EFFECT statement, which you can use to define spline effects, collection effects, and more. The OUTDESIGN= option creates columns for any effect that you can define.
  • The procedure supports nonsingular parameterizations such as the 'effect' or 'reference' parameterizations (PARAM=EFFECT or PARAM=REF, respectively).
  • The dummy variables have meaningful names of the form VarName_Level, where VarName is the name of a categorical variable and Level is one of its values.
  • The procedure automatically creates a macro variable (called _GLSMod) that contains the names of the columns of the design matrix.

The following statements create a design matrix for the Weight_Status and Smoking_Status variables. The design matrix uses the 'effect coding' parameterization and is written to the GLSDesign data set. The value of the _GLSMod macro is displayed in the SAS log and can be used in a subsequent procedure call:

proc glmselect data=Sashelp.Heart outdesign(fullmodel)=GLSDesign noprint;
   class Weight_Status(ref='Normal')
         Smoking_Status(ref='Non-smoker') / param=effect;
   model Cholesterol = Weight_Status Smoking_Status / selection=none; /* include ALL dummy variables */
run;
 
ods select Position;
proc contents data=GLSDesign varnum; run;   /* display all variables in OUTDESIGN= data set */
 
%put &_GLSMod;                              /* names of columns in design matrix */
--- SAS Log ---
Weight_Status_Overweight Weight_Status_Underweight
Smoking_Status_Heavy__16_25_ Smoking_Status_Light__1_5_
Smoking_Status_Moderate__6_15_ Smoking_Status_Very_Heavy____25_

The output from PROC CONTENTS show the variables in the GLSDesign data set, which are INTERCEPT, CHOLESTEROL (the response variable in the model), and the dummy variables listed in the _GLSMod macro variable. By default, the dummy variables have the pattern VarName_Level, although you can use the OUTDESIGN(PREFIX=prefix) option to generate column names of the form prefix1, prefix2, and so forth, where prefix is a user-specified prefix. You can see that values such as "Heavy (16-25)" are transformed into valid names for SAS variables.

In this example, the _GLSMOD macro contains the names of ALL dummy variables because the SELECTION=NONE option is used. If you use an option (such as SELECTION=LASSO) to perform variable selection, the _GLSMOD variable will contain the names of the dummy variables that are selected in the final model. This means that you can use the _GLSMOD variable to perform additional analyses. For example, if you want to use POC REG to output collinearity diagnostics for the variables in the final model, you could execute the following:

/* use the variables in the final model in a procedure that does not support a CLASS statement */
proc reg data=GLSDesign plots=none; 
   model Cholesterol = &_GLSMod / collin;
run;

TRANSREG: Dummy variables and transformed variable

In a similar way, the TRANSREG procedure supports the DESIGN option, which adds dummy variables to the output data set, as shown in the following call. The syntax is different, but the dummy variables in the TRGDesign data set are formed by the same 'effect coding' as in the previous example. The procedure automatically creates a macro variable (called _TRGInd) that contains the names of the columns of the design matrix.

proc transreg data=Sashelp.Heart design;
   model class(Weight_Status Smoking_Status / EFFECTS 
          zero="Normal"      "Non-smoker");
   output out=TRGDesign;
run;
 
%put &_TRGInd;
Weight_StatusOverweight Weight_StatusUnderweight Smoking_StatusHeavy__16_25_ Smoking_StatusLight__1_5_
Smoking_StatusModerate__6_15_ Smoking_StatusVery_Heavy____25_

The dummy variables contain the same values as in the previous example, but the two procedures construct slightly different names. The GLMSELECT names contain an extra underscore. For example, a dummy variable in the GLSDesign data set is Weight_Status_Overweight whereas the corresponding variable in the TRGDesign data set is Weight_StatusOverweight.

It is worth noting the PROC TRANSREG also supports macro variables that specify the names of dependent variables and transformed variables. You can even use the MACRO option to specify the name of the macro variables that are created. For details, see the documentation for the PROC TRANSREG statement.

In summary, when you generate a design matrix by using the GLMSELECT or TRANSREG procedures, the procedures create dummy variables that have meaningful names of the form VarName_Level or VarNameLevel. The procedures each create a macro variable that contains the names of the dummy variables. The _GLSMOD macro that is created by PROC GLMSELECT contains the names of dummy variables in the final selected model, so use SELECTION=NONE if you want all names. These macros make it easy for a programmer to refer to columns of the design matrix.

The post Meaningful names for columns of a design matrix appeared first on The DO Loop.

7月 252018
 

Back in SAS 9.3M2 (SAS/STAT 12.1), PROC FREQ introduced mosaic plots to visualize the joint frequencies in a contingency table. By default, the cells in a mosaic plot are colored according to levels of one of the categorical variables in the analysis. However, in 2013 I showed how you can use the output from PROC FREQ and the MOSAICPARM statement in the Graph Template Language (GTL) to color the cells by a statistic such as the standardized residuals in the chi-square model for independence.

I only recently learned that PROC FREQ in SAS/STAT 13.1 introduced built-in support for coloring cells in a mosaic plot. In other words, you can now automatically generate the graph that once required using GTL. For example, in my previous article, I wrote a program that orders the levels of the blood pressure and weight categories in the Sashelp.Heart data set. The following call to PROC FREQ creates a mosaic plot of the data in that program and specifies the COLORSTAT=STDRES suboption. Whereas the cell sizes are proportional to the frequency of the joint levels, the colors indicate the magnitude of the standardized residuals in a model that assumes independence between the two variables:

proc freq data=heart;
tables BP_Cat*Weight_Cat / norow cellchi2 expected stdres crosslist
                           missing plots=MosaicPlot(colorstat=StdRes);
run;

The mosaic plot visualizes the patterns of association between the weights of patients (categorized into underweight, normal, and overweight) and their blood pressure (categorized into optimal, normal, and high ranges). The size of the cells indicates that most patients in the study are overweight and about 35% are both overweight and have high blood pressure. The red colors indicate pairs of characteristics that occur more often in the data than would be expected if these measurements were independent. The blue colors indicate conditions that appear less often than would be expected. In particular:

  • There are more overweight people with high blood pressure than would be expected under independence.
  • There are fewer overweight people with optimal blood pressure than would be expected.
  • There are more normal-weight patients with optimal blood pressure than would be expected.
  • There are fewer normal-weight patients with high blood pressure than would be expected.

The mosaic plot indicates why the chi-square test for independence rejects the null hypothesis of independence and shows which categories of weight and blood pressure are strongly associated with each other. When you use the PLOTS=MOSAICPLOT(COLORSTAT=STDRES) option on the TABLES statement, PROC FREQ creates a mosaic plot that visualizes a chi-square test for independence.

The post Color cells in a mosaic plot by deviation from independence appeared first on The DO Loop.

7月 232018
 

Since the late 1990s, SAS has supplied macros for basic bootstrap and jackknife analyses. This article provides an example that shows how to use the %BOOT and %BOOTCI macros. The %BOOT macro generates a bootstrap distribution and computes basic statistics about the bootstrap distribution, including estimates of bias, standard error, and a confidence interval that is suitable when the sampling distribution is normally distributed. Because bootstrap methods are often used when you do not want to assume a statistic is normally distributed, the %BOOTCI macro supports several additional confidence intervals, such as percentile-based and bias-adjusted intervals.

You can download the macros for free from the SAS Support website. The website includes additional examples, documentation, and a discussion of the capabilities of the macros.

The %BOOT macro uses simple uniform random sampling (with replacement) or balanced bootstrap sampling to generate the bootstrap samples. It then calls a user-supplied %ANALYZE macro to compute the bootstrap distribution of your statistic.

How to install and use the %BOOT and %BOOTCI macros

To use the macros, do the following:

  1. Download the source file for the macros and save it in a directory that is accessible to SAS. For this example, I saved the source file to C:\Temp\jackboot.sas.
  2. Define a macro named %ANALYZE that computes the bootstrap statistic from a bootstrap sample. The next section provides an example.
  3. Call the %BOOT macro. The %BOOT macro creates three primary data sets:
    • BootData is a data set view that contains B bootstrap samples of the data. For this example, I use B=5000.
    • BootDist is a data set that contains the bootstrap distribution. It is created when the %BOOT macro internally calls the %ANALYZE macro on the BootData data set.
    • BootStat is a data set that contains statistics about the bootstrap distribution. For example, the BootStat data set contains the mean and standard deviation of the bootstrap distribution, among other statistics.
  4. If you want confidence inervals, use the %BOOTCI macro to compute up to six different interval estimates. The %BOOTCI macro creates a data set named BootCI that contains the statistics that are used to construct the confidence interval. (You can also generate multiple interval estimates by using the %ALLCI macro.)

An example of calling the %BOOT macro

This section shows how to call the %BOOT macro. The example was previously analyzed in an article that shows how to compute a bootstrap percentile confidence interval in SAS. The statistic of interest is the skewness of the SepalWidth variable for 50 iris flowers of the species Iris virginica. The following SAS statements define the sample data and compute the skewness statistic on the original data.

%include "C:\Temp\jackboot.sas";         /* define the %BOOT and %BOOTCI macros */
 
data sample(keep=x);                     /* data are sepal widths for 50 Iris virginica flowers */
   set Sashelp.Iris(where=(Species="Virginica") rename=(SepalWidth=x));
run;
 
/* compute value of the statistic on original data: Skewness = 0.366 */
title 'Skewness for Petal Widths (Iris virginica)';
proc means data=sample nolabels skewness;
   var x;
   output out=SkewOut skew=Skewness;   /* three output variables: _type_ _freq_ and Skewness */
run;

The skewness statistic (not shown) is 0.366. The call to PROC MEANS is not necessary, but it shows how to create an output data set (SkewOut) that contains the Skewness statistic. By default, the %BOOT macro will analyze all numeric variables in the output data set, so the next step defines the %ANALYZE macro and uses the DROP= data set option to omit some unimportant variables that PROC MEANS automatically generates.

When you define the %ANALYZE macro, be sure to use the NOPRINT option or otherwise suppress ODS output during the bootstrap process. Include the %BYSTMT macro, which will tell the %BOOT macro to use a BY statement to efficiently implement the bootstrap analysis. The %ANALYZE macro is basically the same as the previous call to PROC MEANS, except for the addition of the NOPRINT, %BYSTMT, and DROP= options:

%macro analyze(data=,out=);
   proc means noprint data=&data;   
      %bystmt;
      var x;
      output out=&out(drop=_type_ _freq_) skew=Skewness;
   run;
%mend;

Although the DROP= statement is not essential, it reduces the size of the data that are read and written during the bootstrap analysis. Do NOT use a KEEP= statement in the %ANALYZE macro because the %BOOT macro will generate several other variables (called _SAMPLE_ and _OBS_) as part of the resampling process.

You can now use the %BOOT macro to generate bootstrap samples and compute basic descriptive statistics about the bootstrap distribution:

/* creates GootData, BootDist, and BootStat data sets */
title2 'Bootstrap Analysis of Skewness';
%boot(data=sample,       /* data set that contains the original data */
      samples=5000,      /* number of bootstrap samples */
      random=12345,      /* random number seed for resampling */
      chart=0,           /* do not display the old PROC CHART histograms */
      stat=Skewness,     /* list of output variables to analyze (default=_NUMERIC_) */
      alpha=0.05,        /* significance level for CI (default=0.05) */
      print=1);          /* print descriptive stats (default=1)*/
 
proc print data=bootstat noobs;  /* or use LABEL option to get labels as column headers */
   id method n;
   var value bootmean bias stderr biasco alcl aucl;
run;

I recommend that you specify the first four options. The last three options are shown in case you want to override their default values. Although the %BOOT macro prints a table of descriptive statistics, the table contains 14 columns and is very wide. To shorten the output, I used PROC PRINT to display the most important results. The table shows the estimate of the skewness statistic on the original data (VALUE), the mean of the bootstrap distribution (BOOTMEAN), the estimate for the standard error of the statistic (STDERR), and lower and upper confidence limits (ALCL and AUCL) for an approximate confidence interval under the assumption that the statistic is normally distributed. (The limits are b ± z1-α * stderr, where z1-α is the (1 - α)th normal quantile and b = value - bias is a bias-corrected estimate.)

The data for the bootstrap distribution is in the BootDist data set, so you can use PROC SGPLOT to display a histogram of the bootstrap statistics. I like to assign some of the descriptive statistics into macro variables so that I can display them on the histogram, as follows:

/* OPTIONAL: Store bootstrap statistic in a macro variable */
proc sql noprint;
select value, alcl,     aucl 
 into :Stat, :LowerCL, :UpperCL
 from BootStat;
quit;
 
proc sgplot data=BootDist;      /* <== this data set contains the bootstrap distribution */
   histogram Skewness;
   refline &Stat / axis=x lineattrs=(color=red);
   refline &LowerCL &UpperCL / axis=x;
run;
Bootstrap distribution for skewness of Iris verginica petal widths

An example of calling the %BOOTCI macro

The %BOOTCI macro enables you to compute several confidence intervals (CIs) for the statistic that you are bootstrapping. The following statements display a percentile-based CI and a bias-adjusted and corrected CI.

title2 'Percentile-Based Confidence Interval';
%bootci(PCTL);    /* creates BootCI data set for Pctl CI */

The percentile-based CI is about the same width as the normal-based CI, but it is shifted to the left. The default output from the %BOOTCI macro is very wide, so sometimes I prefer to use the PRINT=0 option to suppress the output. The estimates are written to a data set named BootCI, so it is easy to use PROC PRINT to display only the statistics that you want to see, as shown in the following call that computes a bias-corrected and adjusted interval estimate:

title2 'Bias-Adjusted and Corrected Bootstrap Confidence Interval';
%bootci(BCa, print=0);       /* creates BootCI data set for BCa CI */
proc print data=BootCI noobs label;
   id method n;
   var value alcl aucl;
run;

Notice that each call to the %BOOTCI macro creates a data set named BootCI. In particular, the second call overwrites the data set that was created by the first call. If you want to compare the estimates, be sure to make a copy of the first BootCI data set before you overwrite it.

The %ALLCI macro

If you want to compare multiple CIs, you can use the %ALLCI macro, which computes multiple definitions of the CIs and concatenates them into a data set named AllCI, as shown by the following:

title2 'Comparison of Bootstrap Confidence Intervals';
%allci(print=0); 
proc print data=AllCI(drop=_LABEL_) noobs label;
   id method n;
   var value alcl aucl;
run;

The output (not shown) contains interval estimates for five bootstrap CIs and a jackknife CI.

Be aware the when you run the %ALLCI macro you will see several warnings in the SAS log, such as the following:

WARNING: Variable _lo was not found on DATA file.
WARNING: Variable bootmean was not found on BASE file. The variable will
         not be added to the BASE file.

These warnings are coming from PROC APPEND and can be ignored. To suppress these warnings, you can edit the jackboot.sas file, search for the word 'force' on the PROC APPEND statements, and add the NOWARN option to those PROC APPEND statements. For example:
proc append data=bootci&keep base=ALLCI force nowarn; run;

Pros and cons of using the %BOOT macro

The %BOOT, %BOOTCI, and %ALLCI macros can be a time-saver when you want to perform a basic bootstrap in SAS. However, in my opinion, they are not a substitute for understanding how to implement a bootstrap computation manually in SAS. Here are a few advantages and disadvantages of the macros:

  • Advantage: The macros encapsulate the tedious steps of the bootstrap analysis.
  • Advantage: The macros generate SAS data sets that you can use for additional analyses or for graphing the results.
  • Advantage: The macros handle the most common sampling schemes such as simple uniform sampling (with replacement), balanced bootstrap sampling, and residual sampling in regression models.
  • Advantage: The %BOOTCI macro supports many popular confidence intervals for parameters.
  • Disadvantage: The macros do not provide the same flexibility as writing your own analysis. For example, the macros do not support the stratified resampling scheme that is used for a bootstrap analysis of the difference of means in a t test.
  • Disadvantage: There are only a few examples of using the macros. When I first used them, I made several mistakes and had to look at the underlying source code to understand what the macros were doing.

Summary

The %BOOT and %BOOTCI macros provide a convenient way to perform simple bootstrap analyses in SAS. The macros support several common resampling schemes and estimates for confidence intervals. Although the macros are not a replacement for understanding how to program a general, efficient, bootstrap analysis, they can be a useful tool for data analysts who want compact code to create a bootstrap analysis in SAS.

The post How to use the %BOOT and %BOOTCI macros in SAS appeared first on The DO Loop.

7月 182018
 

This article shows how to implement balanced bootstrap sampling in SAS. The basic bootstrap samples with replacement from the original data (N observations) to obtain B new samples. This is called "uniform" resampling because each observation has a uniform probability of 1/N of being selected at each step of the resampling process. Within the union of the B bootstrap samples, each observation has an expected value of appearing B times.

Balanced bootstrap resampling (Davison, Hinkley, and Schechtman, 1986) is an alternative process in which each observation appears exactly B times in the union of the B bootstrap samples of size N. This has some practical benefits for estimating certain inferential statistics such as the bias and quantiles of the sampling distribution (Hall, 1990).

It is easy to implement a balanced bootstrap resampling scheme: Concatenate B copies of the data, randomly permute the B*N observations, and then use the first N observations for the first bootstrap sample, the next B for the second sample, and so forth. (Other algorithms are also possible, as discussed by Gleason, 1988). This article shows how to implement balanced bootstrap sampling in SAS.

Balanced bootstrap samples in SAS

To illustrate the idea, consider the following data set that has N=6 observations. Five observations are clustered near x=0 and the sixth is a large outlier (x=10). The sample skewness for these data is skew=2.316 because of the influence of the outlier.

data Sample(keep=x);
input x @@;
datalines;
-1 -0.2 0 0.2 1 10
;
 
proc means data=Sample skewness;
run;
%let ObsStat = 2.3163714;

You can use the bootstrap to approximate the sampling distribution for the skewness statistic for these data. I have previously shown how to use SAS to bootstrap the skewness statistic: Use PROC SURVEYSELECT to form bootstrap samples, use PROC MEANS with a BY statement to analyze the samples, and use PROC UNIVARIATE to analyze the bootstrap distribution of skewness values. In that previous article, PROC SURVEYSELECT is used to perform uniform sampling (sampling with replacement).

It is straightforward to modify the previous program to perform balanced bootstrap sampling. The following program is based on a SAS paper by Nils Penard at PhUSE 2012. It does the following:

  1. Use PROC SURVEYSEELCT to concatenate B copies of the input data.
  2. Use the DATA step to generate a uniform random number for each observation.
  3. Use PROC SORT to sort the data by the random values. After this step, the N*B observations are in random order.
  4. Generate a variable that indicates the bootstrap sample for each observation. Alternatively, reuse the REPLICATE variable from PROC SURVEYSELECT, as shown below.
/* balanced bootstrap computation */
proc surveyselect data=Sample out=DupData noprint
                  reps=5000              /* duplicate data B times */
                  method=SRS samprate=1; /* sample w/o replacement */
run;
 
data Permute;  
   set DupData;
   call streaminit(12345);
   u = rand("uniform");    /* generate a uniform random number for each obs */
run;
 
proc sort data=Permute; by u; run;  /* sort in random order */
 
data BalancedBoot;
   merge DupData(drop=x) Permute(keep=x);  /* reuse REPLICATE variable */
run;

You can use the BalancedBoot data set to perform subsequent bootstrap analyses. If you perform a bootstrap analysis, you obtain the following approximate bootstrap distribution for the skewness statistic. The observed statistic is indicated by a red vertical line. For reference, the mean of the bootstrap distribution is indicated by a gray vertical line. You can see that the sampling distribution for this tiny data set is highly nonnormal. Many bootstrap samples that contain the outlier (exactly one-sixth of the samples in a balanced bootstrap) will have a large skewness value.

Bootstrap distribution for balanced resampling method

To assure yourself that each of the original six observations appears exactly B times in the union of the bootstrap sample, you can run PROC FREQ, as follows:

proc freq data=BalancedBoot;   /* OPTIONAL: Show that each obs appears B times */
   tables x / nocum;
run;

Balanced bootstrap samples in SAS/IML

As shown in the article "Bootstrap estimates in SAS/IML," you can perform bootstrap computations in the SAS/IML language. For uniform sampling, the SAMPLE function samples with replacement from the original data. However, you can modify the sampling scheme to support balanced bootstrap resampling:

  1. Use the REPEAT function to duplicate the data B times.
  2. Use the SAMPLE function with the "WOR" option to sample without replacement. The resulting vector is a permutation of the B*N observations.
  3. Use the SHAPE function to reshape the permuted data into an N x B matrix for which each column is a bootstrap sample. This form is useful for implementing vectorized computations on the columns.

The following SAS/IML program modifies the program in the previous post to perform balanced bootstrap sampling:

/* balanced bootstrap computation in SAS/IML */
proc iml;
use Sample; read all var "x"; close;
call randseed(12345);
 
/* Return a row vector of statistics, one for each column. */
start EvalStat(M);
   return skewness(M);               /* <== put your computation here */
finish;
Est = EvalStat(x);                   /* 1. observed statistic for data */
 
/* balanced bootstrap resampling */
B = 5000;                            /* B = number of bootstrap samples */
allX = repeat(x, B);                 /* replicate the data B times */
s = sample(allX, nrow(allX), "WOR"); /* 2. sample without replacement (=permute) */
s = shape(s, nrow(x), B);            /*    reshape to (N x B) */
 
/* use the balanced bootstrap samples in subsequent computations */
bStat = T( EvalStat(s) );            /* 3. compute the statistic for each bootstrap sample */
bootEst = mean(bStat);               /* 4. summarize bootstrap distrib such as mean */
bias = Est - bootEst;                /*    Estimate of bias */
RBal = Est || BootEst || Bias;       /* combine results for printing */
print RBal[format=8.4 c={"Obs" "BootEst" "Bias"}];

As shown in the previous histogram, the bias estimate (the difference between the observed statistic and the mean of the bootstrap distribution) is sizeable.

It is worth mentioning that the SAS-supplied %BOOT macro performs balanced bootstrap sampling by default. To generate balanced bootstrap samples with the %BOOT macro, set the BALANCED=1 option, as follows:
%boot(data=Sample, samples=5000, balanced=1) /* or omit BALANCED= option */
If you want uniform (unbalanced) samples, call the macro as follows:
%boot(data=Sample, samples=5000, balanced=0).

In conclusion, it is easy to generate balanced bootstrap samples. Balanced sampling can improve the efficiency of certain bootstrap estimates and inferences. For details, see the previous references of Appendix II of Hall (1992).

The post Balanced bootstrap resampling in SAS appeared first on The DO Loop.