Statistical Programming

8月 082018
 
Use the GROUP= option in PROC SGPLOT to plot lines for TWO categorical variables

The SGPLOT procedure in SAS makes it easy to create graphs that overlay various groups in the data. Many statements support the GROUP= option, which specifies that the graph should overlay group information. For example, you can create side-by-side bar charts and box plots, and you can overlay multiple scatter plots and series plots in the same graph. However, the GROUP= option takes only a single grouping variable! What can you do if you need to visualize combinations of two (or even three!) categorical variables? This article shows how to construct a new group variable that combines the levels of two or more existing categorical variables.

For concreteness, I will show how to overlay multiple series plots (line plots), as shown to the right. (Click to enlarge.) By using this technique, you can overlay curves that describe combinations of gender and race. Or you can plot response curves for control-vs-experimental groups and the severity of a disease. You can use this technique for other plot types, such as creating box plots that visualize a two-way ANOVA.

What is the problem? Why can't you use the GROUP= option?

To motivate the discussion, let's first see why the GROUP= option in the SERIES statement does not work for overlaying two categorical variables. The following DATA step creates two categorical variables. The G1 variable has the values 1, 2, and 3. The G2 variable has the values 'A' and 'B'. For each of the six combinations of (G1, G2), the DATA step creates a curve (actually a line) of (X, Y) values. Let's see what happens if you try to plot the lines by using a SERIES statement with the GROUP=G1 option:

data TwoGroups;
do G1=1 to 3;
   do G2='A', 'B';
      do X = 1 to 10;
         Y = 10 + G1 + 0.5*(G2='A') + G1*X/20;
         output;
      end;
   end;
end;
run;
 
title "First Attempt: Does Not Work";
proc sgplot data=TwoGroups;
   series x=x y=y / group=G1 curvelabel;
run;

The attempt is a failure. The graph does not show six individual curves. Instead, it shows three curves (the number of categories in the G1 variables) and each "curve" is Z-shaped because the graph traces the curve for G2='A' on the range [1, 10] and then draws the curve for G2='B' without "picking up the pen." This happens because the data are sorted by G1, then by G2, then by X.

There are two ways to handle this situation. One way is to try to convert the data from "long form" to "wide form." For these data, which have identical X values for every curve, you can create two new variables Y_A and Y_B that contain the coordinates of the three curves for G2='A' and G2='B', respectively. You can then use two SERIES statements, each using the GROUP=G1 option. Each statement will draw three curves for a total of six. You can use the NOCYCLEATTRS option to make sure that each statement uses the same line colors and patterns. However, this approach becomes complicated if each curve is evaluated at a different set of X values. In that case, it is better to keep the data in long form.

Forming a new group variable by concatenation

The problem would be solved if there were one categorical variable that had six levels instead of two categorical variables that have six joint levels, so let's write some SAS code to make that happen. First sort the data by the categorical variables and then by the X variable. (For the current example, the data are already sorted correctly.) Then write a DATA step that does either of the following options:

  • Option 1: Use the CATT (or CATX) function to concatenate the values of the existing group variables. The new categorical variable will have values that derived from the original variables.
  • Option 2: Use the FIRST.variable syntax to create a new group variable that has the values 1–6. Then use PROC FORMAT to assign a meaningful value to each level of the new categorical variable.

The first option (the CATT function) is automated and less prone to error, but for the sake of completeness both options are shown below:

/* create a new group variable by concatenating the two existing variables */
proc sort data=TwoGroups;
   by G1 G2 X;
run;
 
data Make2Groups;
set TwoGroups;
by G1 G2;
/* Option 1: automatically create joint levels from original levels */
Label = catt("G1=", G1) || "; " || catt("G2=", G2);
/* Option 2: create new categorical variable and use PROC FORMAT to assign values */
if first.G2 then 
   GroupID + 1;        /* GroupID = 1, 2, 3, ... numGroups */
run;
 
/* For Option 2: use PROC FORMAT to form labels that encode the two groups. See
https://blogs.sas.com/content/iml/2017/04/24/two-way-anova-viz.html
*/
proc format;                 
  value GroupFmt 1 = "G1=1; G2='A'"   2 = "G1=1; G2='B'"   3 = "G1=2; G2='A'"
                 4 = "G1=2; G2='B'"   5 = "G1=3; G2='A'"   6 = "G1=3; G2='B'";
run;

The following call to PROC SGPLOT creates a series plot for the Label variable, which corresponds to the joint levels of the original grouping variables. The GROUPLC= option (supported in SAS 9.4M2) colors the lines according to the value of the G2 variable.

title "Two Groups, One SERIES Statement";
proc sgplot data=Make2Groups;
/* format GroupID GroupFmt.; */            /* for Option 2 */
series x=x y=y / group=Label grouplc=G2    /* or group=GroupID for Option 2 */
                 lineattrs=(pattern=solid) curvelabel;
run;

The graph is shown at the top of this article. The new categorical variable has values that correspond to joint levels of the original two variables. The GROUP= option creates six curves when you specify the Label variable. The GROUPLC= option sets the colors of the lines according to the values of the G2 variable.

This technique generalizes to three categorical variables, but I will leave the details to the reader. You might want to use the GROUPLP= option to set the line patterns according to the value of a third categorical variable. Beyond three variables the display will begin to resemble a spaghetti plot. For many categorical variables, you might want to use panels and BY groups to visualize the curves.

This technique also generalizes to other plot types, such as box plots and scatter plots.

The post Plot curves for levels of two categorical variables in SAS appeared first on The DO Loop.

8月 062018
 
Graph of quantile regression curves in SAS

This article shows how to score (evaluate) a quantile regression model on new data. SAS supports several procedures for quantile regression, including the QUANTREG, QUANTSELECT, and HPQUANTSELECT procedures. The first two procedures do not support any of the modern methods for scoring regression models, so you must use the "missing value trick" to score the model. (HPQUANTSELECT supports the CODE statement for scoring.) You can use this technique to construct a "sliced fit plot" that visualizes the model, as shown to the right. (Click to enlarge.)

The easy way to create a fit plot

The following DATA step creates the example data as a subset of the Sashelp.BWeight data set, which contains information about the weights of live births in the US in 1997 and information about the mother during pregnancy. The following call to PROC QUANTREG models the conditional quantiles of the baby's weight as a function of the mother's weight gain. The weight gain is centered according to the formula MomWtGain = "Actual Gain" – 30 pounds. Because the quantiles might depend nonlinearly on the mother's weight gain, the EFFECT statement generates a spline basis for the independent variable. The resulting model can flexibly fit a wide range of shapes.

Although this article shows how to create a fit plot, you can also get a fit plot directly from PROC QUANTREG. As shown below, the PLOT=FITPLOT option creates a fit plot when the model contains one continuous independent variable.

data Orig;   /* restrict to 5000 births; exclude extreme weight gains */
set Sashelp.BWeight(obs=5000 where=(MomWtGain<=40));
run;
 
proc quantreg data=Orig
              algorithm=IPM            /* use IPM algorithm for splines and binned data */
              ci=none plot(maxpoints=none)=fitplot; /* or fitplot(nodata) */
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );  /* 9 knots, equally spaced */
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
run;
Fit plot of quantile regression curves from PROC QUANTREG in SAS

The graph enables you to visualize curves that predict the 10th, 25th, 50th, 75th, and 90th percentiles of the baby's weight based on the mother's weight gain during pregnancy. Because the data contains 5,000 observations, the fit plot suffers from overplotting and the curves are hard to see. You can use the PLOT=FITPLOT(NODATA) option to exclude the data from the plot, thus showing the quantile curves more clearly.

Score a SAS procedure by using the missing value trick

Although PROC QUANTREG can produce a fit plot when there is one continuous regressor, it does not support the EFFECTPLOT statement so you have to create more complicated graphs manually. To create a graph that shows the predicted values, you need to score the model on a new set of independent values. To use the missing value trick, do the following:

  1. Create a SAS data set (the scoring data) that contains the values of the independent variables at which you want to evaluate the model. Set the response variable to missing for each observation.
  2. Append the scoring data to the original data that are used to fit the model. Include a binary indicator variable that has the value 0 for the original data and the value 1 for the scoring data.
  3. Run the regression procedure on the combined data set. Use the OUTPUT statement to output the predicted values for the scoring data. Of course, you can also output residuals and other observation-wise statistics, if necessary.

This general technique is implemented by using the following SAS statements. The scoring data consists of evenly spaced values of the MomWtGain variable. The binary indicator variable is named ScoreData. The result of these computations is a data set named QRegOut that contains a variable named Pred that contains the predicted values for each observation in the scoring data.

/* 1. Create score data set */
data Score;
/* Optionally define additional covariates here. See the example
   "Create a sliced fit plot manually by using the missing value trick"
   https://blogs.sas.com/content/iml/2017/12/20/create-sliced-fit-plot-sas.html
*/
Weight = .;               /* set response (Y) variable to missing */
do MomWtGain = -30 to 40; /* uniform spacing in the independent (X) variable */
   output;
end;
run;
 
/* 2. Append the score data to the original data. Use a binary variable to 
      indicate which observations are the scoring data */
data Combined;
set Orig                     /* original data */
    Score(in=_ScoreData);    /* scoring data  */
ScoreData = _ScoreData;      /* binary indicator variable. ScoreData=1 for scoring data */
run;
 
/* 3. Run the procedure on the combined (original + scoring) data */
ods select ModelInfo NObs;
proc quantreg data=Combined algorithm=IPM ci=none;
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
   output out=QRegOut(where=(ScoreData=1)) /* output predicted values for the scoring data */
              P=Pred / columnwise;         /* COLUMWISE option supports multiple quantiles */
run;

This technique can be used for any SAS regression procedure. In this case, the COLUMNWISE option specifies that the output data set should be written in "long form": A QUANTILE variable specifies the quantile and the variable PRED contains the predicted values for each quantile. If you omit the COLUMNWISE option, the output data is in "wide form": The predicted values for the five quantiles are contained in the variables Pred1, Pred2, ..., Pred5.

Using predicted values to create a sliced fit plot

You can use the predicted values of the scoring data to construct a fit plot. You merely need to sort the data by any categorical variables and by the X variable (in this case, MomWtGain). You can then plot the predicted curves. If desired, you can also append the original data and the predicted values and create a graph that overlays the data and predicted curves. You can use transparency to address the overplotting issue and also modify other features of the fit plot, such as the title, axes labels, tick positions, and so forth:

/* 4. If you want a fit plot, sort by the independent variable for each curve.
      Put QUANTILE and other covariates first, then the X variable. */
proc sort data=QRegOut out=ScoreData;
   by Quantile MomWtGain;
run;
 
/* 5. (optional) If you want to overlay the plots, it's easiest to define separate variables
      for original data and scoring data */
data All;
set Orig                                      /* original data */
    ScoreData(rename=(MomWtGain=X Pred=Y));   /* scoring data  */
run;
 
title "Quantile Regression Curves";
footnote J=C "Gain is centered: MomWtGain = Actual_Gain - 30";
proc sgplot data=All;
   scatter x=MomWtGain y=Weight / markerattrs=(symbol=CircleFilled) transparency=0.92;
   series x=X y=Y / group=Quantile lineattrs=(thickness=2) nomissinggroup name="p";
   keylegend "p" / position=right sortorder=reverseauto title="Quantile";
   xaxis values=(-20 to 40 by 10) valueshint grid  label="Mother's Relative Weight Gain (lbs)";;
   yaxis values=(1500 to 4500 by 500) valueshint grid label="Predicted Weight of Child (g)";
run;

The fit plot is shown at the top of this article.

In summary, this article shows how to use the missing value trick to evaluate a regression model in SAS. You can use this technique for any regression procedure, although newer procedures often support syntax that makes it easier to score a model.

As shown in this example, if you score the model on an equally spaced set of points for one of the continuous variables in the model, you can create a sliced fit plot. For a more complicated example, see the article "How to create a sliced fit plot in SAS."

The post How to score and graph a quantile regression model in SAS appeared first on The DO Loop.

8月 062018
 
Graph of quantile regression curves in SAS

This article shows how to score (evaluate) a quantile regression model on new data. SAS supports several procedures for quantile regression, including the QUANTREG, QUANTSELECT, and HPQUANTSELECT procedures. The first two procedures do not support any of the modern methods for scoring regression models, so you must use the "missing value trick" to score the model. (HPQUANTSELECT supports the CODE statement for scoring.) You can use this technique to construct a "sliced fit plot" that visualizes the model, as shown to the right. (Click to enlarge.)

The easy way to create a fit plot

The following DATA step creates the example data as a subset of the Sashelp.BWeight data set, which contains information about the weights of live births in the US in 1997 and information about the mother during pregnancy. The following call to PROC QUANTREG models the conditional quantiles of the baby's weight as a function of the mother's weight gain. The weight gain is centered according to the formula MomWtGain = "Actual Gain" – 30 pounds. Because the quantiles might depend nonlinearly on the mother's weight gain, the EFFECT statement generates a spline basis for the independent variable. The resulting model can flexibly fit a wide range of shapes.

Although this article shows how to create a fit plot, you can also get a fit plot directly from PROC QUANTREG. As shown below, the PLOT=FITPLOT option creates a fit plot when the model contains one continuous independent variable.

data Orig;   /* restrict to 5000 births; exclude extreme weight gains */
set Sashelp.BWeight(obs=5000 where=(MomWtGain<=40));
run;
 
proc quantreg data=Orig
              algorithm=IPM            /* use IPM algorithm for splines and binned data */
              ci=none plot(maxpoints=none)=fitplot; /* or fitplot(nodata) */
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );  /* 9 knots, equally spaced */
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
run;
Fit plot of quantile regression curves from PROC QUANTREG in SAS

The graph enables you to visualize curves that predict the 10th, 25th, 50th, 75th, and 90th percentiles of the baby's weight based on the mother's weight gain during pregnancy. Because the data contains 5,000 observations, the fit plot suffers from overplotting and the curves are hard to see. You can use the PLOT=FITPLOT(NODATA) option to exclude the data from the plot, thus showing the quantile curves more clearly.

Score a SAS procedure by using the missing value trick

Although PROC QUANTREG can produce a fit plot when there is one continuous regressor, it does not support the EFFECTPLOT statement so you have to create more complicated graphs manually. To create a graph that shows the predicted values, you need to score the model on a new set of independent values. To use the missing value trick, do the following:

  1. Create a SAS data set (the scoring data) that contains the values of the independent variables at which you want to evaluate the model. Set the response variable to missing for each observation.
  2. Append the scoring data to the original data that are used to fit the model. Include a binary indicator variable that has the value 0 for the original data and the value 1 for the scoring data.
  3. Run the regression procedure on the combined data set. Use the OUTPUT statement to output the predicted values for the scoring data. Of course, you can also output residuals and other observation-wise statistics, if necessary.

This general technique is implemented by using the following SAS statements. The scoring data consists of evenly spaced values of the MomWtGain variable. The binary indicator variable is named ScoreData. The result of these computations is a data set named QRegOut that contains a variable named Pred that contains the predicted values for each observation in the scoring data.

/* 1. Create score data set */
data Score;
/* Optionally define additional covariates here. See the example
   "Create a sliced fit plot manually by using the missing value trick"
   https://blogs.sas.com/content/iml/2017/12/20/create-sliced-fit-plot-sas.html
*/
Weight = .;               /* set response (Y) variable to missing */
do MomWtGain = -30 to 40; /* uniform spacing in the independent (X) variable */
   output;
end;
run;
 
/* 2. Append the score data to the original data. Use a binary variable to 
      indicate which observations are the scoring data */
data Combined;
set Orig                     /* original data */
    Score(in=_ScoreData);    /* scoring data  */
ScoreData = _ScoreData;      /* binary indicator variable. ScoreData=1 for scoring data */
run;
 
/* 3. Run the procedure on the combined (original + scoring) data */
ods select ModelInfo NObs;
proc quantreg data=Combined algorithm=IPM ci=none;
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
   output out=QRegOut(where=(ScoreData=1)) /* output predicted values for the scoring data */
              P=Pred / columnwise;         /* COLUMWISE option supports multiple quantiles */
run;

This technique can be used for any SAS regression procedure. In this case, the COLUMNWISE option specifies that the output data set should be written in "long form": A QUANTILE variable specifies the quantile and the variable PRED contains the predicted values for each quantile. If you omit the COLUMNWISE option, the output data is in "wide form": The predicted values for the five quantiles are contained in the variables Pred1, Pred2, ..., Pred5.

Using predicted values to create a sliced fit plot

You can use the predicted values of the scoring data to construct a fit plot. You merely need to sort the data by any categorical variables and by the X variable (in this case, MomWtGain). You can then plot the predicted curves. If desired, you can also append the original data and the predicted values and create a graph that overlays the data and predicted curves. You can use transparency to address the overplotting issue and also modify other features of the fit plot, such as the title, axes labels, tick positions, and so forth:

/* 4. If you want a fit plot, sort by the independent variable for each curve.
      Put QUANTILE and other covariates first, then the X variable. */
proc sort data=QRegOut out=ScoreData;
   by Quantile MomWtGain;
run;
 
/* 5. (optional) If you want to overlay the plots, it's easiest to define separate variables
      for original data and scoring data */
data All;
set Orig                                      /* original data */
    ScoreData(rename=(MomWtGain=X Pred=Y));   /* scoring data  */
run;
 
title "Quantile Regression Curves";
footnote J=C "Gain is centered: MomWtGain = Actual_Gain - 30";
proc sgplot data=All;
   scatter x=MomWtGain y=Weight / markerattrs=(symbol=CircleFilled) transparency=0.92;
   series x=X y=Y / group=Quantile lineattrs=(thickness=2) nomissinggroup name="p";
   keylegend "p" / position=right sortorder=reverseauto title="Quantile";
   xaxis values=(-20 to 40 by 10) valueshint grid  label="Mother's Relative Weight Gain (lbs)";;
   yaxis values=(1500 to 4500 by 500) valueshint grid label="Predicted Weight of Child (g)";
run;

The fit plot is shown at the top of this article.

In summary, this article shows how to use the missing value trick to evaluate a regression model in SAS. You can use this technique for any regression procedure, although newer procedures often support syntax that makes it easier to score a model.

As shown in this example, if you score the model on an equally spaced set of points for one of the continuous variables in the model, you can create a sliced fit plot. For a more complicated example, see the article "How to create a sliced fit plot in SAS."

The post How to score and graph a quantile regression model in SAS appeared first on The DO Loop.

8月 012018
 

When you use a regression procedure in SAS that supports variable selection (GLMSELECT or QUANTSELECT), did you know that the procedures automatically produce a macro variable that contains the names of the selected variables? This article provides examples and details. A previous article provides an overview of the 'SELECT' procedures in SAS for building statistical models.

The final model in PROC GLMSELECT

PROC GLMSELECT uses variable selection techniques such as LAR and LASSO to fit a parsimonious linear model from a large number of potential regressors. The following call to PROC GLMSELECT is adapted from the "Getting Started" example from the documentation, which models the log-transformed salaries of baseball players by using on-the-field statistics, player characteristics, and team attributes. When PROC GLMSELECT runs, it creates the _GLSIND macro variable, which contains the names of the effects in the final model.

/* GLMSELECT creates &_GLSInd for selected model */
proc glmselect data=Sashelp.Baseball;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                  yrMajor crAtBat crHits crHome crRuns crRbi
                  crBB league division nOuts nAssts nError / selection=lasso;
quit;
 
%put &_GLSInd;      /* display names of effects for selected model */
 
/* Use _GLSInd to call PROC GLM on final selected model */
proc glmselect data=Sashelp.Baseball;
   class league division;
   model logSalary = &_GLSInd / solution;
quit;
--- SAS Log ---
%put &_GLSInd;
   nHits nRBI nBB YrMajor CrHits CrRuns CrRbi

The MODEL statement in PROC GLMSELECT includes 18 independent variables, but the final LASSO model contains only seven variables. The _GLSInd macro contains the name of the selected variables. As shown in the example, the macro can be used in subsequent analyses. The example uses the macro on the MODEL statement of PROC GLM. The _GLSIND variable contains the original variable names, which means that if you use the CLASS statement (or EFFECT statement) in the GLMSELECT model, you should specify the same statement in subsequent procedures.

If you want to use the selected variables in a procedure that does not support the CLASS statement (or EFFECT statement), you should use the OUTDESIGN= option in PROC GLMSELECT to generate a design matrix and use the _GLSMOD macro variable to specify the names of the dummy variables in the final model. An example is given at the end of this article.

The final model in PROC QUANTSELECT

Similarly, PROC QUANTSELECT creates a macro variable (named _QSLInd) that names the independent variables in the final selected model. For example, the following example models the conditional 90th percentile of the LogSalary variable for the same data set:

/* QUANTSELECT creates _QRSInd for selected model */
proc quantselect data=sashelp.baseball;
class league division;
model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                  yrMajor crAtBat crHits crHome crRuns crRbi
                  crBB league division nOuts nAssts nError / 
                  selection=lasso quantile=0.9;
run;
 
%put &_QRSInd;
--- SAS Log ---
%put &_QRSInd;
   nRBI CrHome nBB nRuns

When modeling the top 10% of salaries, the selected model includes two factors (career home runs and the number of runs scored in the previous year) that did not appear when predicting the conditional mean of the salaries. You can use the _QRSInd macro to build a predictive model in PROC QUANTREG.

PROC HPQUANTSELECT also creates a macro variable that contains the final selected independent variables. The macro is named _HPQRSIND.

Macros created by other selection procedures in SAS

The HPGENSELECT and LOGISTIC procedures, which can perform variable selection for generalized linear models, do not create a macro variable that contains the selected variables. However, the STEPDISC procedure creates a macro variable named _STDVar that contains the names of the quantitative variables that best discriminate among the classes in a discriminant analysis.

What happens if you use splines or split classification variables?

You might wonder what happens if you use the EFFECT statement to generate spline effects and/or use the SPLIT option on the CLASS statement to enable individual levels of a classification variable to enter/leave the model independently of other levels. (For the LASSO, LAR, and ELASTICNET methods, all spline effects and classification effects are split.) In this case, the final model is likely to contain only certain levels of a categorical variable or only certain basis functions for a spline effect. Consequently, probably you will need to use the design matrix and _GLSMOD macro variable in subsequent procedures.

For example, the following example creates a spline effect with five internal knots and splits the levels of two categorical variables as part of selecting a LASSO model. The final model contains the "Type='Sedan'" level, the "Origin='Asia'" level, and the first and fourth spline basis for the Weight variable:

proc glmselect data=Sashelp.Cars(where=(Type^="Hybrid"))
               outdesign=GLSSplitDesign;      /* create design matrix */
  class origin type / split;                  /* LASSO will force the SPLIT option */
  effect splWt  = spline(weight / details naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
  model mpg_city = origin | type | splWt @ 2  /  selection=Lasso;
quit;
 
%put &_GLSInd;    /* model refers to effects that are not part of the input data */
%put &_GLSMod;    /* model refers to dummy variables that are in the OUTDESIGN= data set */
 
/* for split variables, use the _GLSMOD macro and the OUTDESIGN= data set */
proc glm data=GLSSplitDesign;
   model mpg_city = &_GLSMod / solution;
quit;
--- SAS Log ---
%put &_GLSInd;
   Type_Sedan Type_Sports Origin_Asia*Type_Sedan splWt:1 splWt:4 splWt:4*Origin_USA
%put &_GLSMod;
   Type_Sedan Type_Sports Origin_Asia_Type_Sedan splWt_1 splWt_4 splWt_4_Origin_USA

Notice that the _GLSIND macro contains names such as "splWt:1" that are not in the input data set. However, the names in the _GLSMOD macro are all valid columns in the GLSSplitDesign data set, which was created by using the OUTDESIGN= option. So if you are using the LASSO, LAR, or ELASTICNET methods of variable selection or if you manually specify the SPLIT option, you will want to use the _GLSMOD macro and the design data set in subsequent analyses.

The post Which variables are in the final selected model? appeared first on The DO Loop.

8月 012018
 

When you use a regression procedure in SAS that supports variable selection (GLMSELECT or QUANTSELECT), did you know that the procedures automatically produce a macro variable that contains the names of the selected variables? This article provides examples and details. A previous article provides an overview of the 'SELECT' procedures in SAS for building statistical models.

The final model in PROC GLMSELECT

PROC GLMSELECT uses variable selection techniques such as LAR and LASSO to fit a parsimonious linear model from a large number of potential regressors. The following call to PROC GLMSELECT is adapted from the "Getting Started" example from the documentation, which models the log-transformed salaries of baseball players by using on-the-field statistics, player characteristics, and team attributes. When PROC GLMSELECT runs, it creates the _GLSIND macro variable, which contains the names of the effects in the final model.

/* GLMSELECT creates &_GLSInd for selected model */
proc glmselect data=Sashelp.Baseball;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                  yrMajor crAtBat crHits crHome crRuns crRbi
                  crBB league division nOuts nAssts nError / selection=lasso;
quit;
 
%put &_GLSInd;      /* display names of effects for selected model */
 
/* Use _GLSInd to call PROC GLM on final selected model */
proc glmselect data=Sashelp.Baseball;
   class league division;
   model logSalary = &_GLSInd / solution;
quit;
--- SAS Log ---
%put &_GLSInd;
   nHits nRBI nBB YrMajor CrHits CrRuns CrRbi

The MODEL statement in PROC GLMSELECT includes 18 independent variables, but the final LASSO model contains only seven variables. The _GLSInd macro contains the name of the selected variables. As shown in the example, the macro can be used in subsequent analyses. The example uses the macro on the MODEL statement of PROC GLM. The _GLSIND variable contains the original variable names, which means that if you use the CLASS statement (or EFFECT statement) in the GLMSELECT model, you should specify the same statement in subsequent procedures.

If you want to use the selected variables in a procedure that does not support the CLASS statement (or EFFECT statement), you should use the OUTDESIGN= option in PROC GLMSELECT to generate a design matrix and use the _GLSMOD macro variable to specify the names of the dummy variables in the final model. An example is given at the end of this article.

The final model in PROC QUANTSELECT

Similarly, PROC QUANTSELECT creates a macro variable (named _QSLInd) that names the independent variables in the final selected model. For example, the following example models the conditional 90th percentile of the LogSalary variable for the same data set:

/* QUANTSELECT creates _QRSInd for selected model */
proc quantselect data=sashelp.baseball;
class league division;
model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                  yrMajor crAtBat crHits crHome crRuns crRbi
                  crBB league division nOuts nAssts nError / 
                  selection=lasso quantile=0.9;
run;
 
%put &_QRSInd;
--- SAS Log ---
%put &_QRSInd;
   nRBI CrHome nBB nRuns

When modeling the top 10% of salaries, the selected model includes two factors (career home runs and the number of runs scored in the previous year) that did not appear when predicting the conditional mean of the salaries. You can use the _QRSInd macro to build a predictive model in PROC QUANTREG.

PROC HPQUANTSELECT also creates a macro variable that contains the final selected independent variables. The macro is named _HPQRSIND.

Macros created by other selection procedures in SAS

The HPGENSELECT and LOGISTIC procedures, which can perform variable selection for generalized linear models, do not create a macro variable that contains the selected variables. However, the STEPDISC procedure creates a macro variable named _STDVar that contains the names of the quantitative variables that best discriminate among the classes in a discriminant analysis.

What happens if you use splines or split classification variables?

You might wonder what happens if you use the EFFECT statement to generate spline effects and/or use the SPLIT option on the CLASS statement to enable individual levels of a classification variable to enter/leave the model independently of other levels. (For the LASSO, LAR, and ELASTICNET methods, all spline effects and classification effects are split.) In this case, the final model is likely to contain only certain levels of a categorical variable or only certain basis functions for a spline effect. Consequently, probably you will need to use the design matrix and _GLSMOD macro variable in subsequent procedures.

For example, the following example creates a spline effect with five internal knots and splits the levels of two categorical variables as part of selecting a LASSO model. The final model contains the "Type='Sedan'" level, the "Origin='Asia'" level, and the first and fourth spline basis for the Weight variable:

proc glmselect data=Sashelp.Cars(where=(Type^="Hybrid"))
               outdesign=GLSSplitDesign;      /* create design matrix */
  class origin type / split;                  /* LASSO will force the SPLIT option */
  effect splWt  = spline(weight / details naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
  model mpg_city = origin | type | splWt @ 2  /  selection=Lasso;
quit;
 
%put &_GLSInd;    /* model refers to effects that are not part of the input data */
%put &_GLSMod;    /* model refers to dummy variables that are in the OUTDESIGN= data set */
 
/* for split variables, use the _GLSMOD macro and the OUTDESIGN= data set */
proc glm data=GLSSplitDesign;
   model mpg_city = &_GLSMod / solution;
quit;
--- SAS Log ---
%put &_GLSInd;
   Type_Sedan Type_Sports Origin_Asia*Type_Sedan splWt:1 splWt:4 splWt:4*Origin_USA
%put &_GLSMod;
   Type_Sedan Type_Sports Origin_Asia_Type_Sedan splWt_1 splWt_4 splWt_4_Origin_USA

Notice that the _GLSIND macro contains names such as "splWt:1" that are not in the input data set. However, the names in the _GLSMOD macro are all valid columns in the GLSSplitDesign data set, which was created by using the OUTDESIGN= option. So if you are using the LASSO, LAR, or ELASTICNET methods of variable selection or if you manually specify the SPLIT option, you will want to use the _GLSMOD macro and the design data set in subsequent analyses.

The post Which variables are in the final selected model? appeared first on The DO Loop.

6月 272018
 

This article describes how to obtain an initial guess for nonlinear regression models, especially nonlinear mixed models. The technique is to first fit a simpler fixed-effects model by replacing the random effects with their expected values. The parameter estimates for the fixed-effects model are often good initial guesses for the parameters in the full model.

Background: initial guesses for regression parameters

For linear regression models, you don't usually need to provide an initial guess for the parameters. Least squares methods directly solve for the parameter estimates by using the data. For iterative methods, such as logistic regression model, software usually starts with a pre-determined initial guess of the slope parameters and iteratively refine the guess by using an optimization technique. For example, PROC LOGISTIC in SAS begins by guessing that all slopes are zero. The "slopes all zero" model, which is better known as an intercept-only model, is a reduced model that is related to the full logistic model.

Several other regression procedures also fit a related (simpler) model and then use the estimates for the simpler model as an initial guess for the full model. For example, the "M method" in robust regression starts with a least squares solution and then iteratively refines the coefficients until a robust fit is obtained. Similarly, one way to fit a generalized linear mixed models is to first solve a linear mixed model and use those estimates as starting values for the generalized mixed model.

Fit a fixed-effect model to obtain estimates for a mixed model

The previous paragraphs named procedures that do not require the user to provide an initial guess for parameters. However, procedures such as PROC NLIN and PROC NLMIXED in SAS enable you to fit arbitrary nonlinear regression models. The cost for this generality is that the user must provide a good initial guess for the parameters.

This can be a challenge. To help, the NLIN and NLMIXED procedures enable you to specify a grid of initial values, which can be a valuable technique. For some maximum-likelihood estimates (MLE), you can use the method of moments to choose initial parameters. Although it is difficult to choose good initial parameters for a nonlinear mixed model, the following two-step process can be effective:

  • Substitute the expected value for every random effect in the model. You obtain a fixed-effect model that is related to the original mixed model, but has fewer parameters. Estimate the parameters for the fixed-effect model. (Most random effects are assumed to be normally distributed with zero mean, so in practice you "drop" the random effects.)
  • Use the estimates from the fixed-effect model as an initial guess for the full model. Hopefully, the initial guesses are close enough and the full model will converge quickly.

The following example shows how this two-step process works. The PROC NLMIXED documentation contains an example of data from an experiment that fed two different diets to pregnant rats and observed the survival of pups in the litters. The following SAS statements define the data and show the full nonlinear mixed model in the example:

data rats;
input trt $ m x @@;
x1 = (trt='c');
x2 = (trt='t');
litter = _n_;
datalines;
c 13 13   c 12 12   c  9  9   c  9  9   c  8  8   c  8  8   c 13 12   c 12 11
c 10  9   c 10  9   c  9  8   c 13 11   c  5  4   c  7  5   c 10  7   c 10  7
t 12 12   t 11 11   t 10 10   t  9  9   t 11 10   t 10  9   t 10  9   t  9  8
t  9  8   t  5  4   t  9  7   t  7  4   t 10  5   t  6  3   t 10  3   t  7  0
;
 
title "Full nonlinear mixed model";
proc nlmixed data=rats;
   parms  t1=2 t2=2 s1=1 s2=1;     /* <== documentation values. How to guess these??? */
   bounds s1 s2 > 0;
   eta = x1*t1 + x2*t2 + alpha;
   p   = probnorm(eta);
   model x ~ binomial(m,p);
   random alpha ~ normal(0,x1*s1*s1+x2*s2*s2) subject=litter;
run;
Estimate nonlinear mixed model in SAS by using PROC NLMIXED

The model converges in about 10 iterations for the initial guesses on the PARMS statement. But if you deviate from those values, you might encounter problems where the model does not converge.

Fit a fixed-effect model to obtain initial estimates for the mixed model

The full mixed model has one random effect that involves two variance parameters, s1 and s2. If you replace the random effect with its expected value (0), you obtain the following two-parameter fixed-effect model:

title "Reduced model: Fixed effects only";
proc nlmixed data=rats;
   parms  t1=2 t2=2;
   eta = x1*t1 + x2*t2;
   p   = probnorm(eta);
   model x ~ binomial(m,p);
   ods output ParameterEstimates = FixedEstimates;
run;
Estimate fixed-effect parameters in a reduced model in SAS by using PROC NLMIXED

Notice that the estimates for t1 and t2 in the two-parameter model are close to the values in the full model. However, the two-parameter model is much easier to work with, and you can use grids and other visualization techniques to guide you in estimating the parameters. Notice that the estimates for t1 and t2 are written to a SAS data set called FixedEstimates. You can read those estimates on the PARMS statement by using a second call to PROC NLMIXED. This second call fits a four-parameter model, but the t1 and t2 parameters are hopefully good guesses so you can focus on finding good guesses for s1 and s2 in the second call.

/* full random-effects model. The starting values for the fixed effects
   are the estimates from the fixed-effect-only model */
title "Full model: Estimates from a reduced model";
proc nlmixed data=rats;
   parms s1=0.1 s2=1 / data=FixedEstimates;   /* read in initial values for t1, t2 */
   bounds s1 s2 > 0;
   eta = x1*t1 + x2*t2 + alpha;
   p   = probnorm(eta);       /* standard normal quantile for eta */
   model x ~ binomial(m,p);
   random alpha ~ normal(0,x1*s1*s1+x2*s2*s2) subject=litter;
run;
Use estimates in the reduced (fixed-effect-only) model to estimate a full nonlinear mixed model in SAS by using PROC NLMIXED

Combining a grid search and a reduced (fixed-effect) model

You can combine the previous technique with a grid search for the random-effect parameters. You can use PROC TRANSPOSE to convert the FixedEstimates data set from long form (with the variables 'Parameter' and 'Estimate') to wide form (with variables 't1' and 't2'). You can then use a DATA step to add a grid of values for the s1 and s2 parameters that are used to estimate the random effect, as follows:

/* transpose fixed-effect estimates from long form to wide form */
proc transpose data=FixedEstimates(keep=Parameter Estimate) out=PE(drop=_NAME_);
   id Parameter;
   var Estimate;
run;
 
/* add grid of guesses for the parameters in the random effect */
data AllEstimates;
set PE;
do s1 = 0.5 to 1.5 by 0.5;     /* s1 in [0.5, 1.5] */
   do s2 = 0.5 to 2 by 0.5;    /* s2 in [0.5, 2.0] */
      output;
   end;
end;
run;
 
title "Full model: Estimates from a reduced model and from a grid search";
proc nlmixed data=rats;
   parms / data=AllEstimates;   /* use values from reduced model for t1, t2; grid for s1, s2  */
   bounds s1 s2 > 0;
   eta = x1*t1 + x2*t2 + alpha;
   p   = probnorm(eta);
   model x ~ binomial(m,p);
   random alpha ~ normal(0,x1*s1*s1+x2*s2*s2) subject=litter;
run;

The output is the same as for the initial model and is not shown.

In summary, you can often use a two-step process to help choose initial values for the parameters in a many-parameter mixed model. The first step is to estimate the parameters in a smaller and simpler fixed-effect model by replacing the random effects with their expected values. You can then use those fixed-effect estimates in the full model. You can combine this reduced-model technique with a grid search.

I do not have a reference for this technique. It appears to be "folklore" that "everybody knows," but I was unable to locate an example nor a proof that indicates the conditions under which this technique to work. If you know a good reference for this technique, please leave a comment.

The post Reduced models: A way to choose initial parameters for a mixed model appeared first on The DO Loop.

6月 252018
 

When you fit nonlinear fixed-effect or mixed models, it is difficult to guess the model parameters that fit the data. Yet, most nonlinear regression procedures (such as PROC NLIN and PROC NLMIXED in SAS) require that you provide a good guess! If your guess is not good, the fitting algorithm, which is often a nonlinear optimization algorithm, might not converge. This seems to be a Catch-22 paradox: You can't estimate the parameters until you fit the model, but you can't fit the model until you provide an initial guess!

Fortunately, SAS provides a way to circumvent this paradox: you can specify multiple initial parameter values for the nonlinear regression models in PROC NLIN and PROC NLMIXED. The procedure will evaluate the model on a grid that results from all combinations of the specified values and initialize the optimization by using the value in the grid that provides the best fit. This article shows how to specify a grid of initial values for a nonlinear regression model.

An example that specifies a grid of initial values

The following data and model is from the documentation of the NLMIXED procedure. The data describe the number of failures and operating hours for 10 pumps. The pumps are divided into two groups: those that are operated continuously (group=1) and those operated intermittently (group=2). The number of failures is modeled by a Poisson distribution whose parameter (lambda) is fit by a linear regression with separate slopes and intercepts for each group. The example in the documentation specifies a single initial value (0 or 1) for each of the five parameters. In contrast, the following call to PROC NLMIXED specifies multiple initial values for each parameter:

data pump;
  input y t group;q
  pump = _n_;
  logtstd = log(t) - 2.4564900;
  datalines;
 5  94.320 1
 1  15.720 2
 5  62.880 1
14 125.760 1
 3   5.240 2
19  31.440 1
 1   1.048 2
 1   1.048 2
 4   2.096 2
22  10.480 2
;
 
proc nlmixed data=pump;
   parms logsig 0 1                 /* multiple initial guesses for each parameter */
         beta1 -1 to 1 by 0.5 
         beta2 -1 to 1 by 0.5
         alpha1 1 5 10
         alpha2 1 5;                /* use BEST=10 option to see top 10 choices */
   if (group = 1) then eta = alpha1 + beta1*logtstd + e;
                  else eta = alpha2 + beta2*logtstd + e;
   lambda = exp(eta);
   model y ~ poisson(lambda);
   random e ~ normal(0,exp(2*logsig)) subject=pump;
run;

On the PARMS statement, several guesses are provided for each parameter. Two or three guesses are listed for the logsig, alpha1, and alpha2 parameters. An arithmetic sequence of values is provided for the beta1 and beta2 parameters. The syntax for an arithmetic sequence is start TO stop BY increment, although you can omit the BY clause if the increment is 1. In this example, the arithmetic sequence is equivalent to the list -1, -0.5, 0, 0.5, 1.

The procedure will form the Cartesian product of the guesses for each parameter. Thus the previous syntax specifies a uniform grid of points in parameter space that contains 2 x 5 x 5 x 3 x 2 = 300 initial guesses. As this example indicates, the grid size grows geometrically with the number of values for each parameter. Therefore, resist the urge to use a fine grid for each parameter. That will result in many unnecessary evaluations of the log-likelihood function.

The following table shows a few of the 300 initial parameter values, along with the negative log-likelihood function evaluated at each set of parameters. From among the 300 initial guesses, the procedure will find the combination that results in the smallest value of the negative log-likelihood function and use that value to initialize the optimization algorithm.

If you want to visualize the range of values for the log-likelihood function, you can graph the negative log-likelihood versus the row number. In the following graph, a star indicates the parameter value in the grid that has the smallest value of the negative log-likelihood. I used the IMAGEMAP option on the ODS GRAPHICS statement to turn on tool tips for the graph, so that the parameter values for each point appears when you hover the mouse pointer over the point.

The graph shows that about 10 parameter values have roughly equivalent log-likelihood values. If you want to see those values in a table, you can add the BEST=10 option on the PARMS statement. In this example, there is little reason to choose the smallest parameter value over some of the other values that have a similar log-likelihood.

Create a file for the initial parameters

The ability to specify lists and arithmetic sequences is powerful, but sometimes you might want to use parameter values from a previous experiment or study or from a simplified model for the data. Alternatively, you might want to generate initial parameter values randomly from a distribution such as the uniform, normal, or exponential distributions. The PARMS statement in PROC NLMIXED supports a DATA= option that enables you to specify a data set for which each row is a set of parameter values. (In PROC NLIN and other SAS procedures, use the PDATA= option.) For example, the following DATA step specifies two or three values for the logsig, alpha1, and alpha2 parameters. It specifies random values in the interval (-1, 1) for the beta1 and beta2 parameters. The ParamData data set has 12 rows. The NLMIXED procedure evaluates the model on these 12 guesses, prints the top 5, and then uses the best value to initialize the optimization algorithm:

data ParamData;
call streaminit(54321);
do logsig = 0, 1;
   do alpha1 = 1, 5, 10;
      do alpha2 = 1, 5; 
         beta1 = -1 + 2*rand("uniform");  /* beta1 ~ U(-1, 1) */ 
         beta2 = -1 + 2*rand("uniform");  /* beta2 ~ U(-1, 1) */ 
         output;
      end;
   end;
end;
run;
 
proc nlmixed data=pump;
   parms / DATA=ParamData BEST=5;        /* read guesses for parameters from data set */
   if (group = 1) then 
        eta = alpha1 + beta1*logtstd + e;
   else eta = alpha2 + beta2*logtstd + e;
   lambda = exp(eta);
   model y ~ poisson(lambda);
   random e ~ normal(0,exp(2*logsig)) subject=pump;
run;

As mentioned, the values for the parameters can come from anywhere. Sometimes random values are more effective than values on a regular grid. If you use a BOUNDS statement to specify valid values for the parameters, any infeasible parameters are discarded and only feasible parameter values are used.

Summary

In summary, this article provides an example of a syntax to specify a grid of initial parameters. SAS procedures that support a grid search include NLIN, NLMIXED, MIXED and GLIMMIX (for covariance parameters), SPP, and MODEL. You can also put multiple guesses into a "wide form" data set: the names of the parameters are the columns and each row contains a different combination of the parameters.

It is worth emphasizing that if your model does not converge, it might not be the parameter search that is at fault. It is more likely that the model does not fit your data. Although the technique in this article is useful for finding parameters so that the optimization algorithm converges, no technique cannot compensate for an incorrect model.

The post Use a grid search to find initial parameter values for regression models in SAS appeared first on The DO Loop.

6月 252018
 

When you fit nonlinear fixed-effect or mixed models, it is difficult to guess the model parameters that fit the data. Yet, most nonlinear regression procedures (such as PROC NLIN and PROC NLMIXED in SAS) require that you provide a good guess! If your guess is not good, the fitting algorithm, which is often a nonlinear optimization algorithm, might not converge. This seems to be a Catch-22 paradox: You can't estimate the parameters until you fit the model, but you can't fit the model until you provide an initial guess!

Fortunately, SAS provides a way to circumvent this paradox: you can specify multiple initial parameter values for the nonlinear regression models in PROC NLIN and PROC NLMIXED. The procedure will evaluate the model on a grid that results from all combinations of the specified values and initialize the optimization by using the value in the grid that provides the best fit. This article shows how to specify a grid of initial values for a nonlinear regression model.

An example that specifies a grid of initial values

The following data and model is from the documentation of the NLMIXED procedure. The data describe the number of failures and operating hours for 10 pumps. The pumps are divided into two groups: those that are operated continuously (group=1) and those operated intermittently (group=2). The number of failures is modeled by a Poisson distribution whose parameter (lambda) is fit by a linear regression with separate slopes and intercepts for each group. The example in the documentation specifies a single initial value (0 or 1) for each of the five parameters. In contrast, the following call to PROC NLMIXED specifies multiple initial values for each parameter:

data pump;
  input y t group;q
  pump = _n_;
  logtstd = log(t) - 2.4564900;
  datalines;
 5  94.320 1
 1  15.720 2
 5  62.880 1
14 125.760 1
 3   5.240 2
19  31.440 1
 1   1.048 2
 1   1.048 2
 4   2.096 2
22  10.480 2
;
 
proc nlmixed data=pump;
   parms logsig 0 1                 /* multiple initial guesses for each parameter */
         beta1 -1 to 1 by 0.5 
         beta2 -1 to 1 by 0.5
         alpha1 1 5 10
         alpha2 1 5;                /* use BEST=10 option to see top 10 choices */
   if (group = 1) then eta = alpha1 + beta1*logtstd + e;
                  else eta = alpha2 + beta2*logtstd + e;
   lambda = exp(eta);
   model y ~ poisson(lambda);
   random e ~ normal(0,exp(2*logsig)) subject=pump;
run;

On the PARMS statement, several guesses are provided for each parameter. Two or three guesses are listed for the logsig, alpha1, and alpha2 parameters. An arithmetic sequence of values is provided for the beta1 and beta2 parameters. The syntax for an arithmetic sequence is start TO stop BY increment, although you can omit the BY clause if the increment is 1. In this example, the arithmetic sequence is equivalent to the list -1, -0.5, 0, 0.5, 1.

The procedure will form the Cartesian product of the guesses for each parameter. Thus the previous syntax specifies a uniform grid of points in parameter space that contains 2 x 5 x 5 x 3 x 2 = 300 initial guesses. As this example indicates, the grid size grows geometrically with the number of values for each parameter. Therefore, resist the urge to use a fine grid for each parameter. That will result in many unnecessary evaluations of the log-likelihood function.

The following table shows a few of the 300 initial parameter values, along with the negative log-likelihood function evaluated at each set of parameters. From among the 300 initial guesses, the procedure will find the combination that results in the smallest value of the negative log-likelihood function and use that value to initialize the optimization algorithm.

If you want to visualize the range of values for the log-likelihood function, you can graph the negative log-likelihood versus the row number. In the following graph, a star indicates the parameter value in the grid that has the smallest value of the negative log-likelihood. I used the IMAGEMAP option on the ODS GRAPHICS statement to turn on tool tips for the graph, so that the parameter values for each point appears when you hover the mouse pointer over the point.

The graph shows that about 10 parameter values have roughly equivalent log-likelihood values. If you want to see those values in a table, you can add the BEST=10 option on the PARMS statement. In this example, there is little reason to choose the smallest parameter value over some of the other values that have a similar log-likelihood.

Create a file for the initial parameters

The ability to specify lists and arithmetic sequences is powerful, but sometimes you might want to use parameter values from a previous experiment or study or from a simplified model for the data. Alternatively, you might want to generate initial parameter values randomly from a distribution such as the uniform, normal, or exponential distributions. The PARMS statement in PROC NLMIXED supports a DATA= option that enables you to specify a data set for which each row is a set of parameter values. (In PROC NLIN and other SAS procedures, use the PDATA= option.) For example, the following DATA step specifies two or three values for the logsig, alpha1, and alpha2 parameters. It specifies random values in the interval (-1, 1) for the beta1 and beta2 parameters. The ParamData data set has 12 rows. The NLMIXED procedure evaluates the model on these 12 guesses, prints the top 5, and then uses the best value to initialize the optimization algorithm:

data ParamData;
call streaminit(54321);
do logsig = 0, 1;
   do alpha1 = 1, 5, 10;
      do alpha2 = 1, 5; 
         beta1 = -1 + 2*rand("uniform");  /* beta1 ~ U(-1, 1) */ 
         beta2 = -1 + 2*rand("uniform");  /* beta2 ~ U(-1, 1) */ 
         output;
      end;
   end;
end;
run;
 
proc nlmixed data=pump;
   parms / DATA=ParamData BEST=5;        /* read guesses for parameters from data set */
   if (group = 1) then 
        eta = alpha1 + beta1*logtstd + e;
   else eta = alpha2 + beta2*logtstd + e;
   lambda = exp(eta);
   model y ~ poisson(lambda);
   random e ~ normal(0,exp(2*logsig)) subject=pump;
run;

As mentioned, the values for the parameters can come from anywhere. Sometimes random values are more effective than values on a regular grid. If you use a BOUNDS statement to specify valid values for the parameters, any infeasible parameters are discarded and only feasible parameter values are used.

Summary

In summary, this article provides an example of a syntax to specify a grid of initial parameters. SAS procedures that support a grid search include NLIN, NLMIXED, MIXED and GLIMMIX (for covariance parameters), SPP, and MODEL. You can also put multiple guesses into a "wide form" data set: the names of the parameters are the columns and each row contains a different combination of the parameters.

It is worth emphasizing that if your model does not converge, it might not be the parameter search that is at fault. It is more likely that the model does not fit your data. Although the technique in this article is useful for finding parameters so that the optimization algorithm converges, no technique cannot compensate for an incorrect model.

The post Use a grid search to find initial parameter values for regression models in SAS appeared first on The DO Loop.

6月 202018
 

A previous article provides an example of using the BOOTSTRAP statement in PROC TTEST to compute bootstrap estimates of statistics in a two-sample t test. The BOOTSTRAP statement is new in SAS/STAT 14.3 (SAS 9.4M5). However, you can perform the same bootstrap analysis in earlier releases of SAS by using procedures in Base SAS and SAS/STAT. This article gives an example of how to bootstrap in SAS.

The main steps of the bootstrap method in SAS

A previous article describes how to construct a bootstrap confidence interval in SAS. The major steps of a bootstrap analysis follow:

  1. Compute the statistic of interest for the original data
  2. Resample B times (with replacement) from the data to form B bootstrap samples. The resampling process should respect the structure of the analysis and the null hypothesis. In SAS it is most efficient to use the DATA step or PROC SURVEYSELECT to put all B random bootstrap samples into a single data set.
  3. Use BY-group processing to compute the statistic of interest on each bootstrap sample. The BY-group approach is much faster than using macro loops. The union of the statistic is the bootstrap distribution, which approximates the sampling distribution of the statistic under the null hypothesis.
  4. Use the bootstrap distribution to obtain bootstrap estimates of bias, standard errors, and confidence intervals.

Compute the statistic of interest

This article uses the same bootstrap example as the previous article. The following SAS DATA step subsets the Sashelp.Cars data to create a data set that contains two groups: SUV" and "Sedan". There are 60 SUVs and 262 sedans. The statistic of interest is the difference of means between the two groups. A call to PROC TTEST computes the difference between group means for the data:

data Sample;    /* create the sample data. The two groups are "SUV" and "Sedan" */
set Sashelp.Cars(keep=Type MPG_City);
if Type in ('Sedan' 'SUV');
run;
 
/* 1. Compute statistic (difference of means) for data */
proc ttest data=Sample;
   class Type;
   var MPG_City;
   ods output Statistics=SampleStats;   /* save statistic in SAS data set */
run;
 
/* 1b. OPTIONAL: Store sample statistic in a macro variable for later use */
proc sql noprint;
select Mean into :Statistic
       from SampleStats where Method="Satterthwaite";
quit;
%put &=Statistic;
STATISTIC= -4.9840

The point estimate for the difference of means between groups is -4.98. The TTEST procedure produces a graph (not shown) that indicates that the MPG_City variable is moderately skewed for the "Sedan" group. Therefore you might question the usefulness of the classical parametric estimates for the standard error and confidence interval for the difference of means. The following bootstrap analysis provides a nonparametric estimate about the accuracy of the difference of means.

Resample from the data

For many resampling schemes, PROC SURVEYSELECT is the simplest way to generate bootstrap samples. The documentation for PROC TTEST states, "In a bootstrap for a two-sample design, random draws of size n1 and n2 are taken with replacement from the first and second groups, respectively, and combined to produce a single bootstrap sample." One way to carry out this sampling scheme is to use the STRATA statement in PROC SURVEYSELECT to sample (with replacement) from the "SUV" and "Sedan" groups. To perform stratified sampling, sort the data by the STRATA variable. The following statements sort the data and generate 10,000 bootstrap samples by drawing random samples (with replacement) from each group:

/* 2. Sample with replacement from each stratum. First sort by the STRATA variable. */
proc sort data=Sample; by Type; run;
 
/* Then perform stratified sampling with replacement */
proc surveyselect data=Sample out=BootSamples noprint seed=123 
                  method=urs              /* with replacement */
                  /* OUTHITS */           /* use OUTHITS option when you do not want a frequency variable */
                  samprate=1 reps=10000;  /* 10,000 resamples */
   strata Type;   /* sample N1 from first group and N2 from second */
run;

The BootSamples data set contains 10,000 random resamples. Each sample contains 60 SUVs and 262 sedans, just like the original data. The BootSamples data contains a variable named NumberHits that contains the frequency with which each original observation appears in the resample. If you prefer to use duplicated observations, specify the OUTHITS option in the PROC SURVEYSELECT statement. The different samples are identified by the values of the Replicate variable.

BY-group analysis of bootstrap samples

Recall that a BY-group analysis is an efficient way to process 10,000 bootstrap samples. Recall also that it is efficient to suppress output when you perform a large BY-group analysis. The following macros encapsulate the commands that suppress ODS objects prior to a simulation or bootstrap analysis and then permit the objects to appear after the analysis is complete:

/* Define useful macros */
%macro ODSOff(); /* Call prior to BY-group processing */
ods graphics off;  ods exclude all;  ods noresults;
%mend;
 
%macro ODSOn(); /* Call after BY-group processing */
ods graphics on;  ods exclude none;  ods results;
%mend;

With these definitions, the following call to PROC TTEST computes the Satterthwaite test statistic for each bootstrap sample. Notice that you need to sort the data by the Replicate variable because the BootSamples data are ordered by the values of the Type variable. Note also that the NumberHits variable is used as a FREQ variable.

/* 3. Compute statistics */
proc sort data = BootSamples; by Replicate Type; run;
 
%ODSOff                          /* suppress output */
proc ttest data=BootSamples;
   by Replicate;
   class Type;
   var MPG_City;
   freq NumberHits;              /* Use FREQ variable in analysis (or use OUTHITS option) */
   ods output ConfLimits=BootDist(where=(method="Satterthwaite")
              keep=Replicate Variable Class Method Mean rename=(Mean=DiffMeans)); 
run; 
%ODSOn                           /* enable output   */

Obtain estimates from the bootstrap distribution

At this point in the bootstrap example, the data set BootDist contains the bootstrap distribution in the variable DiffMeans. You can use this variable to compute various bootstrap statistics. For example, the bootstrap estimate of the standard error is the standard deviation of the DiffMeans variable. The estimate of bias is the difference between the mean of the bootstrap estimates and the original statistic. The percentiles of the DiffMeans variable can be used to construct a confidence interval. (Or you can use a different interval estimate, such as the bias-adjusted and corrected interval.) You might also want to graph the bootstrap distribution. The following statements use PROC UNIVARIATE to compute these estimates:

/* 4. Plot sampling distribution of difference of sample means. Write stats to BootStats data set */ 
proc univariate data=BootDist; /* use NOPRINT option to suppress output and graphs */
   var DiffMeans;
   histogram DiffMeans;      /* OPTIONAL */
   output out=BootStats pctlpts =2.5  97.5  pctlname=P025 P975
                  pctlpre =Mean_ mean=BootMean std=BootStdErr;
run;
 
/* use original sample statistic to compute bias */
data BootStats;
set BootStats;
Bias = BootMean - &Statistic;
label Mean_P025="Lower 95% CL"  Mean_P975="Upper 95% CL";
run;
 
proc print data=BootStats noobs; 
   var BootMean BootStdErr Bias Mean_P025 Mean_P975;
run;

The results are shown. The bootstrap distribution appears to be normally distributed. This indicates that the bootstrap estimates will probably be similar to the classical parametric estimates. For this problem, the classical estimate of the standard error is 0.448 and a 95% confidence interval for the difference of means is [-5.87, -4.10]. In comparison, the bootstrap estimates are 0.444 and [-5.87, -4.13]. In spite of the skewness of the MPG_City variable for the "Sedan" group, the two-sample Satterthwaite t provides similar estimates regarding the accuracy of the point estimate for the difference of means. The bootstrap statistics also are similar to the statistics that you can obtain by using the BOOTSTRAP statement in PROC TTEST in SAS/STAT 14.3.

In summary, you can use Base SAS and SAS/STAT procedures to compute a bootstrap analysis of a two-sample t test. Although the "manual" bootstrap requires more programming effort than using the BOOTSTRAP statement in PROC TTEST, the example in this article generalizes to other statistics for which a built-in bootstrap option is not supported. This article also shows how to use PROC SURVEYSELECT to perform stratified sampling as part of a bootstrap analysis that involves sampling from multiple groups.

The post The bootstrap method in SAS: A t test example appeared first on The DO Loop.

6月 082018
 

I recently recorded a short video about the new syntax for specifying and manipulating lists in SAS/IML 14.3. This is a video of my Super Demo at SAS Global Forum 2018. The new syntax supports dynamic arrays, associative arrays ("named lists"), and hierarchical data structures such as lists of lists.


If your browser does not support embedded video, you can go directly to the video on YouTube.

If you prefer to read about lists, the following references provide more information and additional examples of ways to use lists:

The post Video: A new syntax for lists in SAS/IML appeared first on The DO Loop.