Regression

July 29, 2020
 

When you write a program that simulates data from a statistical model, you should always check that the simulation code is correct. One way to do this is to generate a large simulated sample, estimate the parameters in the simulated data, and make sure that the estimates are close to the parameters in the simulation. For regression models that include CLASS variables, this task requires a bit of thought. The problem is that categorical variables can have different parameterizations, and the parameterization determines the parameter estimates.

Recently I read a published paper that included a simulation in SAS of data from a logistic model. I noticed that the parameters in the simulation code did not match the parameter estimates in the example. I was initially confused, although I figured it out eventually. This article describes the "parameterization problem" and shows how to ensure that the parameters in the simulation code match the parameter estimates for the analysis.

I have previously written about how to incorporate classification variables into simulations of regression data. If you haven't read that article, it is a good place to begin. You can also read about how to simulate data from a general linear regression model, such as a logistic regression model.

Simulated explanatory variables with a classification variable

Before demonstrating the problem, let's simulate the explanatory variables. Each simulation will use the same data and the same model but change the way that the model is parameterized. The following SAS DATA step simulates two normal variables and a categorical "treatment" variable that has three levels (Trt=1, 2, or 3). The design is balanced in the treatment variable, which means that each treatment group has the same number of observations.

data SimExplanatory;
call streaminit(54321);
do i = 1 to 2000;                               /* number in each treatment group */
   do Trt = 1 to 3;                             /* Trt = 1, 2, and 3 */
      x1 = rand('Normal');                      /* x1 ~ N(0,1) */
      x2 = rand('Normal');                      /* x2 ~ N(0,1) */
      LinCont = -1.5 + 0.4*x1 - 0.7*x2;         /* offset = -1.5; params are 0.4 and -0.7 for x1 and x2 */
      output;
   end;
end;
drop i;
run;

The data set contains a variable called LinCont. The LinCont variable is a linear combination of the x1 and x2 variables and a constant offset. All models will use the same offset (-1.5) and the same parameters for x1 and x2 (0.4 and -0.7, respectively).
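
The effect parameterization discussed later relies on the design being balanced. If you want to verify the balance, a quick check (not part of the original program) is to tabulate the Trt variable:

proc freq data=SimExplanatory;
   tables Trt / nocum nopercent;   /* each of the 3 levels should have 2,000 observations */
run;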

Data for a logistic regression model with a CLASS variable

When writing a simulation of data that satisfy a regression model, most programmers assign coefficients for each level of the treatment variable. For example, you might use the coefficients {3, 2, 1}, which means that the response for the group Trt=1 gets 3 additional units of response, the group Trt=2 gets 2 additional units, and the group Trt=3 gets only 1 additional unit. This is a valid way to specify the model, but when you compute the regression estimates, you will discover that they are not close to the parameters in the model. The following DATA step simulates logistic data and uses PROC LOGISTIC to fit a model to the simulated data.

data LogiSim_Attempt;
call streaminit(123);
set SimExplanatory;                      /* read the explanatory variables */
array TrtCoef[3] _temporary_ (3, 2, 1);  /* treatment effects */
/* a straightforward way to generate Y data from a logistic model */
eta = LinCont + TrtCoef[1]*(Trt=1) + TrtCoef[2]*(Trt=2) + TrtCoef[3]*(Trt=3);
p = logistic(eta);                       /* P(Y | explanatory vars) */
y = rand("Bernoulli", p);                /* random realization of Y */
drop eta p;
run;
 
title "Parameter Estimates Do Not Match Simulation Code";
ods select ParameterEstimates;
proc logistic data=LogiSim_Attempt plots=none;
   class Trt; 
   model y(event='1') = Trt x1 x2;
run;

Notice that the simulation uses the expressions (Trt=1), (Trt=2), and (Trt=3). These are binary expressions: for any value of Trt, exactly one of the terms is 1 and the other two are 0. For example, when Trt=2, the expression (Trt=2) equals 1, whereas (Trt=1) and (Trt=3) equal 0.

Notice that the parameter estimates for the levels of the Trt variable do not match the parameters in the model. The model specified parameters {3, 2, 1}, but the estimates are close to {1, 0, ???}, where '???' is used to indicate that the ParameterEstimates table does not include a row for the Trt=3 level. Furthermore, the estimate for the Intercept is not close to the parameter value, which is -1.5.

I see many simulations like this. The problem is that the parameterization used by the statistical analysis procedure (here, PROC LOGISTIC) is different from the model's parameterization. The next section shows how to write a simulation so that the parameters in the simulation code and the parameter estimates agree.

The 'reference' parameterization

You can't estimate all three coefficients of the categorical variable. However, you can estimate the differences between the levels. For example, you can estimate the average difference between the Trt=1 group and the Trt=3 group. One parameterization of regression models is the "reference parameterization." In the reference parameterization, one of the treatment levels is used as a reference level and you estimate the difference between the non-reference effects and the reference effect. The Intercept parameter is the sum of the offset value and the reference effect. The following simulation code simulates data by using a reference parameterization:

data LogiSim_Ref;
call streaminit(123);
set SimExplanatory;
/* assume reference or GLM parameterization */
array TrtCoef[3] _temporary_ (3, 2, 1);  /* treatment effects */
RefLevel = TrtCoef[3];                   /* use last level as reference */
eta = LinCont + RefLevel                 /* constant term is (Intercept + RefLevel) */
              + (TrtCoef[1]-RefLevel)*(Trt=1)   /* param = +2 */
              + (TrtCoef[2]-RefLevel)*(Trt=2)   /* param = +1 */
              + (TrtCoef[3]-RefLevel)*(Trt=3);  /* param =  0 */
p = logistic(eta);                       /* P(Y | explanatory vars) */
y = rand("Bernoulli", p);                /* random realization of Y */
drop eta p;
run;
 
title "Reference Parameterization";
ods select ParameterEstimates;
proc logistic data=LogiSim_Ref plots=none;
   class Trt / param=reference ref=LAST; /* reference parameterization */
   model y(event='1') = Trt x1 x2;
run;

In this simulation and analysis, the constant term in the model is "Intercept + RefLevel," which is -0.5. The parameter for the Trt=1 effect is +2, which is "TrtCoef[1]-RefLevel." Similarly, the parameter for the Trt=2 effect is +1, and the parameter for Trt=3 is 0. Notice that the parameter estimates from PROC LOGISTIC agree with these values because I specified the PARAM=REFERENCE option, which tells the procedure to use the same parameterization as the simulation code.

Notice that this model is the same as the previous model because we are simply adding the RefLevel to the constant term and subtracting it from each Trt effect: TrtCoef[i] = RefLevel + (TrtCoef[i] - RefLevel) for each level i.

The 'effect' parameterization

In a similar way, instead of subtracting a reference level, you can subtract the mean of the coefficients for the treatment effects. In general, you need to compute a weighted mean, where the weights are the number of observations for each level, but for the case of balanced treatments, you can use the ordinary arithmetic mean of the coefficients, as shown in the sketch after this paragraph.
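
For the curious, here is one way (a sketch, not part of the original program) to compute the weighted mean for an unbalanced design. It weights each coefficient in {3, 2, 1} by the number of observations in the corresponding treatment group:

/* sketch: weighted mean of treatment coefficients for an unbalanced design */
proc sql noprint;
   select count(*) into :n1-:n3         /* counts for Trt=1, 2, 3 */
   from SimExplanatory
   group by Trt;
quit;
%let MeanTrt = %sysevalf( (3*&n1 + 2*&n2 + 1*&n3) / (&n1 + &n2 + &n3) );
%put Weighted mean of treatment coefficients: &MeanTrt;

For the balanced design in this article, the computation returns 2, which agrees with the arithmetic mean.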

For these data, the mean of the treatment effects is (3+2+1)/3 = 2. The following simulation adds the mean to the constant term and subtracts it from the Trt effects. This is the "effect parameterization":

data LogiSim_Effect;
call streaminit(123);
set SimExplanatory;
array TrtCoef[3] _temporary_ (3, 2, 1);  /* treatment effects */
/* assume effect parameterization, which is equivalent to using deviations from the mean:
array TrtDev[3] _temporary_ (+1, 0, -1);
*/
 
MeanTrt = 2;                             /* for balanced design, mean=(3+2+1)/3 */
eta = LinCont + MeanTrt                  /* est intercept = (Intercept + MeanTrt) */
              + (TrtCoef[1]-MeanTrt)*(Trt=1)   /* param = +1 */
              + (TrtCoef[2]-MeanTrt)*(Trt=2)   /* param =  0 */
              + (TrtCoef[3]-MeanTrt)*(Trt=3);  /* param = -1 */
p = logistic(eta);                       /* P(Y | explanatory vars) */
y = rand("Bernoulli", p);                /* random realization of Y */
drop eta p;
run;
 
title "Effect Parameterization";
ods select ParameterEstimates;
proc logistic data=LogiSim_Effect plots=none;
   class Trt / param=effect ref=LAST; 
   model y(event='1') = Trt x1 x2;
run;

In this simulation and analysis, the constant term in the model is "Intercept + MeanTrt," which is 0.5. The parameters for the Trt levels are {+1, 0, -1}. Notice that the PROC LOGISTIC call uses the PARAM=EFFECT option, which tells the procedure to use the same effect parameterization. Consequently, the parameter estimates are close to the parameter values. (These are the same parameter estimates as in the first example, since PROC LOGISTIC uses the effect parameterization by default.)

If you use 0 as an offset value, then the Intercept parameter equals the mean effect of the treatment variable, which leads to a nice interpretation when there is only one classification variable and no interactions.

Summary

This article shows how to write the simulation code (the equation of the linear predictor) so that it matches the parameterization used to analyze the data. This article uses PROC LOGISTIC, but you can use the same trick for other procedures and analyses. When the simulation code and the analysis use the same parameterization of classification effects, it is easier to feel confident that your simulation code correctly simulates data from the model that you are analyzing.

The post Simulate regression models that incorporate CLASS parameterizations appeared first on The DO Loop.

June 8, 2020
 

A SAS customer asked how to specify interaction effects between a classification variable and a spline effect in a SAS regression procedure. There are at least two ways to do this. If the SAS procedure supports the EFFECT statement, you can build the interaction term in the MODEL statement. For procedures that do not support the EFFECT statement, you can generate the interaction effects yourself. Because the customer's question was specifically about the GAMPL procedure, which fits generalized additive models, I also show a third technique, which enables you to model interactions in the GAMPL procedure.

To illustrate the problem, consider the following SAS DATA step, which simulates data from two different response functions:

data Have;
call streaminit(1);
Group="A"; 
do x = 1 to 10 by .1;
   y =    2*x + sin(x) + rand("Normal"); output;
end;
Group="B";
do x = 1 to 10 by .1;
   y = 10 - x - sin(x) + rand("Normal"); output;
end;
run;

The graph shows the data and overlays a regression model. The model predicts the response for each level of the Group variable. Notice that the two curves have different shapes, which means that there is an interaction between the Group variable and the x variable. The goal of this article is to show how to use a spline effect to predict the response for each group.

Use the grammar of the procedure to form interactions

Many SAS regression procedures support the EFFECT statement, the CLASS statement, and enable you to specify interactions on the MODEL statement. Examples include the GLIMMIX, GLMSELECT, LOGISTIC, QUANTREG, and ROBUSTREG procedures. For example, the following call to PROC GLMSELECT specifies several model effects by using the "stars and bars" syntax:

  • The syntax Group | x includes the classification effect (Group), a linear effect (x), and an interaction effect (Group*x).
  • The syntax Group * spl includes an interaction effect between the classification variable and the spline.
title "GLMSELECT Model with Spline and Interaction Effects";
ods select none;
proc glmselect data=Have
        outdesign(addinputvars fullmodel)=SplineBasis; /* optional: write design matrix to data set */
   class Group;
   effect spl = spline(x);
   model y = Group|x  Group*spl / selection=none;
   output out=SplineOut predicted=p;            /* output predicted values */
quit;
ods select all;
 
%macro PlotInteraction(dsname);
   proc sgplot data=&dsname;
      scatter x=x y=y / group=Group;
      series x=x y=p / group=Group curvelabel;
      keylegend / location=inside position=E;
   run;
%mend;
%PlotInteraction(SplineOut);

The predicted curves are shown in the previous section. I wrapped the call to PROC SGPLOT in a macro so that I can create the same plot for other models.

Output the design matrix for the interactions

Some SAS procedures do not support the EFFECT statement. For these procedures, you cannot form effects like Group*spline(x) within the procedure. However, the GLMSELECT procedure supports the OUTDESIGN= option, which writes the design matrix to a SAS data set. I used the option in the previous section to create the SplineBasis data set. The data set includes variables for the spline basis and for the interaction between the Group variable and the spline basis. Therefore, you can use the splines and interactions with splines in other SAS procedures.

For example, the GLM procedure does not support the EFFECT statement, so you cannot generate a spline basis within the procedure. However, you can read the design matrix from the previous call to PROC GLMSELECT. The interaction effects are named spl_Group_1_A, spl_Group_1_B, spl_Group_2_A, ..., where the suffix "_i_j" indicates the interaction effect between the i_th spline basis and the j_th level of the Group variable. You can type the names of the design variables, or you can use the "colon notation" to match all variables that begin with the prefix "spl_Group". The following call to PROC GLM computes the same model as in the previous section:

proc glm data=SplineBasis;
   class Group;
   model y = Group|x  spl_Group: ;           /* spl_Group: => spl_Group_1_A,  spl_Group_1_B, ... */
   output out=GLMOut predicted=p;            /* output predicted values */
quit;
 
/* show that the predicted values are the same */
data Comp;
   merge SplineOut GLMOut(keep=p rename=(p=p2));
   diff = p - p2;
run;
proc means data=comp Mean Min Max;  var diff;  run;

The output of PROC MEANS shows that the two models are the same. There is no difference between the predicted values from PROC GLM (which reads the design matrix) and the values from PROC GLMSELECT (which reads the raw data).

The splines of the interactions versus the interactions of the splines

Some nonparametric regression procedures, such as the GAMPL procedure, have their own syntax to generate spline effects. In fact, PROC GAMPL uses thin-plate splines, which are different from the splines that are supported by the EFFECT statement. Recently, a SAS customer asked how he could model interaction terms in PROC GAMPL.

The GAMPL procedure supports semiparametric models. In the parametric portion of the model, it is easy to specify interactions by using the standard "stars and bars" notation in SAS. However, the SPLINE() transformation in PROC GAMPL does not support interactions between a classification variable and a continuous variable. To be specific, consider the following semiparametric model:

title "No Interaction Between Classification and Spline Effects";
proc gampl data=Have; *plots=components;
   class Group;
   model Y = param(Group | x)     /* parametric terms */
             spline(x);           /* nonparametric, but no interaction */
   output out=GamPLOut pred=p r=r;
   id Y X Group;
run;
 
%PlotInteraction(GamPLOut);

The output shows that the model does not capture the nonlinear features of the data. This is because the spline term is computed for all values of x, not separately for each group. When the groups are combined, the spline effect is essentially linear, which is probably not the best model for these data.

Unfortunately, in SAS/STAT 15.1, PROC GAMPL does not support a syntax such as "spline(Group*x)", which would enable the shape of the nonparametric curve to vary between levels of the Group variable. In fact, the 15.1 documentation states, "only continuous variables (not classification variables) can be specified in spline-effects."

However, if your goal is to use a flexible model to fit the data, there is a model that might work. Instead of the interaction between the Group variable and spline(x), you could use the spline effects for the interaction between the Group variable and x. These are not the same models: the interaction with the splines is not equal to the spline of the interactions! However, let's see how to fit this model to these data.

The idea is to form the interaction effect Group*x in a separate step. You can use the design matrix (SplineBasis) that was created in the first section, but to make the process as transparent as possible, I am going to manually generate the dummy variables for the interaction effect Group*x:

data Interact; 
set Have; 
x_Group_A = x*(Group='A'); 
x_Group_B = x*(Group='B');
run;

You can use the spline effects for the variable x_Group_A and x_Group_B in regression models. The following model is not the same as the previous models, but is a flexible model that seems to fit these data:

title "Interaction Between Classification and Spline Effects";
proc gampl data=Interact plots=components;
   class Group;
   model Y = param(Group | x) 
             spline(x_Group_A) spline(x_Group_B);  /* splines of interactions */
   output out=GamPLOut pred=p;
   id Y X Group;
run;
 
title "GAMPL with Interaction Effects";
%PlotInteraction(GamPLOut);

The graph shows that (visually) the model seems to capture the undulations in the data as well as the earlier model. Although PROC GAMPL does not currently support interactions between classification variables and spline effects, this trick provides a way to model the data.

The post Interactions with spline effects in regression models appeared first on The DO Loop.

May 13, 2020
 

This article shows how to find local minima and maxima on a regression curve, which means finding points where the slope of the curve is zero. An example appears at the right, which shows locations where the loess smoother in a scatter plot has local minima and maxima. Except for simple cases like quadratic regression, you need to use numerical techniques to locate these values.

In a previous article, I showed how to use SAS to evaluate the slope of a regression curve at specific points. The present article applies that technique by scoring the regression curve on a fine grid of points. You can use finite differences to approximate the slope of the curve at each point on the grid. You can then estimate the locations where the slope is zero.

In this article, I use the LOESS procedure to demonstrate the technique, but the method applies equally well to any one-dimensional regression curve. There are several ways to score a regression model:

  • Most parametric regression procedures in SAS (GLM, GLIMMIX, MIXED, ...) support the STORE statement. The STORE statement saves a representation of the model in a SAS item store. You can use PROC PLM to score a model from an item store.
  • Some nonparametric regression procedures in SAS do not support the STORE statement but support a SCORE statement. PROC ADAPTIVEREG and PROC LOESS are two examples. This article shows how to use the SCORE statement to find points where the regression curve has zero slope.
  • Some regression procedures in SAS do not support either the STORE or SCORE statements. For those procedures, you need to use the missing value trick to score the model.

The technique in this article will not detect stationary inflection points. A stationary inflection point is a location where the curve has zero slope but does not have a local min or max. Consequently, this article is really about "how to find a point where a regression curve has a local extremum," but I will use the slightly inaccurate phrase "find points where the slope is zero."

How to find locations where the slope of the curve is zero?

For convenience, I assume the explanatory variable is named X and the response variable is named Y. The goal is to find locations where a nonparametric curve (x, f(x)) has zero slope, where f(x) is the regression model. The general outline follows:

  1. Create a grid of points in the range of the explanatory variable. The grid does not have to be evenly spaced, but it is in this example.
  2. Score the model at the grid locations.
  3. Use finite differencing to approximate the slope of the regression curve at the grid points. If the slope changes sign between consecutive grid points, estimate the location between the grid points where the slope is exactly zero. Use linear interpolation to approximate the response at that location.
  4. Optionally, graph the original data, the regression curve, and the point along the curve where the slope is zero.

Example data

SAS distributes the ENSO data set in the SASHelp library. You can create a DATA step view that renames the explanatory and response variables to X and Y, respectively, so that it is easier to follow the logic of the program:

/* Create VIEW where x is the independent variable and y is the response */
data Have / view=Have;
set Sashelp.Enso(rename=(Month=x Pressure=y));
keep x y;
run;

Create a grid of points

After the data set is created, you can use PROC SQL to find the minimum and maximum values of the explanatory variable. You can create an evenly spaced grid of points for the range of the explanatory variable.

/* Put min and max into macro variables */
proc sql noprint;
  select min(x), max(x) into :min_x, :max_x 
  from Have;
quit;
 
/* Evaluate the model and estimate derivatives at these points */
data Grid;
dx = (&max_x - &min_x)/201;    /* choose the step size wisely */
do x = &min_x to &max_x by dx;   
   output;
end;
drop dx;
run;

Score the model at the grid locations

This is the step that will vary from procedure to procedure. You have to know how to use the procedure to score the regression model on the points in the Grid data set. The LOESS procedure supports a SCORE statement, so the following call fits the model and scores it on the Grid data set:

/* Score the model on the grid */
ods select none;    /* do not display the tables */
proc loess data=Have plots=none;
   model y = x;
   score data=Grid;  /* PROC LOESS does not support an OUT= option */
   /* Most procedures support an OUT= option to save the scored values.
      PROC LOESS displays the scored values in a table, so use ODS to
      save the table to an output data set */
   ods output ScoreResults=ScoreOut;
run;
ods select all;

If a procedure supports the STORE statement, you can use PROC PLM to score the model on the data. The SAS program that accompanies this article includes an example that uses the GAMPL procedure. The GAMPL procedure does not support the STORE or SCORE statements, but you can use the missing value trick to find zero derivatives.

Find the locations where the slope is zero

This is the mathematical portion of the computation. You can use a backward difference scheme to estimate the derivative (slope) of the curve. If (x0, y0) and (x1, y1) are two consecutive points along the curve (in the ScoreOut data set), then the slope at (x1, y1) is approximately m = (y1 - y0) / (x1 - x0). When the slope changes sign between consecutive points, it indicates that the slope changed from positive to negative (or vice versa) between the points. If the slope is continuous, it must have been exactly zero somewhere on the interval. You can use a linear approximation to find the point, t, where the slope is zero. You can then use linear interpolation to approximate the point (t, f(t)) at which the curve is a local min or max.
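
Written out explicitly, the two formulas are:

t0    = x0 - m0 * (x1 - x0) / (m1 - m0)            /* zero of the linear slope model */
f(t0) ≈ y0 + (y1 - y0) / (x1 - x0) * (t0 - x0)     /* linear interpolation of the curve */

where m0 and m1 are the finite-difference slopes at (x0, y0) and (x1, y1), respectively. The DATA step below implements these formulas.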

You can use the following SAS DATA step to process the scoring data, approximate the slope, and estimate where the slope of the curve is zero:

/* Compute slope by using finite difference formula. */
data Deriv0;
set ScoreOut;
Slope = dif(p_y) / dif(x);      /* (f(x) - f(x-dx)) / dx */
xPrev = lag(x);  yPrev = lag(p_y);  SlopePrev = lag(Slope);
if n(SlopePrev) AND sign(SlopePrev) ^= sign(Slope) then do;
   /* The slope changes sign between this obs and the previous.
      Assuming linearity on the interval, find (t, f(t))
      where slope is exactly zero */
   t0 = xPrev - SlopePrev * (x - xPrev)/(Slope - SlopePrev); 
   /* use linear interpolation to find the corresponding y value:
      f(t) ~ y0 + (y1-y0)/(x1-x0) * (t - x0)       */
   f_t0 = yPrev + (p_y - yPrev)/(x - xPrev) * (t0 - xPrev);
   if sign(SlopePrev) > 0 then _Type_ = "Max";
   else _Type_ = "Min";
   output; 
end;
keep t0 f_t0 _Type_;
label f_t0 = "f(t0)";
run;
 
proc print data=Deriv0 label;
run;

The table shows that there are seven points at which the slope of the loess regression curve is zero; at these points, the curve has a local min or max.

Graph the results

If you want to display the local extrema on the graph of the regression curve, you can merge the original data, the regression curve, and the local extrema. You can then use PROC SGPLOT to overlay the three layers. The resulting graph is shown at the top of this article.

data Combine;
merge Have                           /* data   : (x, y)     */
      ScoreOut(rename=(x=t p_y=p_t)) /* curve  : (t, p_t)   */
      Deriv0;                        /* extrema: (t0, f_t0) */
run;
 
title "Loess Smoother";
title2 "Red Markers Indicate Zero Slope for Smoother";
proc sgplot data=Combine noautolegend;
   scatter x=x y=y;
   series x=t y=p_t / lineattrs=GraphData2;
   scatter x=t0 y=f_t0 / markerattrs=(symbol=circlefilled color=red);
   yaxis grid; 
run;

Summary

In summary, if you can evaluate a regression curve on a grid of points, you can approximate the slope at each point along the curve. By looking for when the slope changes sign, you can find local minima and maxima. You can then use a simple linear estimator on the interval to estimate where the slope is exactly zero.

You can download the SAS program that performs the computations in this article.

The post Find points where a regression curve has zero slope appeared first on The DO Loop.

January 13, 2020
 

Did you add "learn something new" to your list of New Year's resolutions? Last week, I wrote about the most popular articles from The DO Loop in 2019. The most popular articles are about elementary topics in SAS programming or univariate statistics because those topics have broad appeal.

Advanced topics and multivariate statistics are less popular but no less important. If you want to learn something new, check out this "Editor's Choice" list of articles that will broaden your statistical knowledge and enhance your SAS programming skills. I've grouped the articles into three broad categories.

Regression statistics

[Figure: Schematic diagram of outliers in bivariate normal data. The point 'A' has large univariate z scores but a small Mahalanobis distance. The point 'B' has a large Mahalanobis distance. Only 'B' is a multivariate outlier.]

High-dimensional analyses and visualization

Low-dimensional data visualization

The tips and techniques in these articles are useful, so read a few articles today and teach yourself something new in this New Year!

Do you have a favorite article from 2019 that did not make either list? Share it in a comment!

The post 10 posts from 2019 that deserve a second look appeared first on The DO Loop.

January 8, 2020
 

Many SAS procedures can automatically create a graph that overlays multiple prediction curves and their prediction limits. This graph (sometimes called a "fit plot" or a "sliced fit plot") is useful when you want to visualize a model in which a continuous response variable depends on one continuous explanatory variable and one categorical (classification) variable. You can use the EFFECTPLOT statement in PROC PLM to create similar visualizations of other kinds of regression models. This article shows three ways to create a sliced fit plot: direct from the procedure, by using the EFFECTPLOT statement in PROC PLM, and by writing the predictions to a data set and using PROC SGPLOT to graph the results.

The data for this example are from the PROC LOGISTIC documentation. The response variable is the presence or absence of pain in seniors who are split into two treatment groups (A and B) and a placebo group (P).

Data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   datalines;
P F 68  1 No  B M 74 16 No  P F 67 30 No  P M 66 26 Yes B F 67 28 No  B F 77 16 No
A F 71 12 No  B F 72 50 No  B F 76  9 Yes A M 71 17 Yes A F 63 27 No  A F 69 18 Yes
B F 66 12 No  A M 62 42 No  P F 64  1 Yes A F 64 17 No  P M 74  4 No  A F 72 25 No
P M 70  1 Yes B M 66 19 No  B M 59 29 No  A F 64 30 No  A M 70 28 No  A M 69  1 No
B F 78  1 No  P M 83  1 Yes B F 69 42 No  B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No  A F 69 12 No  B F 65 14 No  B M 70  1 No  B M 67 23 No  A M 76 25 Yes
P M 78 12 Yes B M 77  1 Yes B F 69 24 No  P M 66  4 Yes P F 65 29 No  P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No  P F 72 27 No  P F 70 13 Yes A M 75  6 Yes
B F 65  7 No  P F 68 27 Yes P M 68 11 Yes P M 67 17 Yes B M 70 22 No  A M 65 15 No
P F 67  1 Yes A M 67 10 No  P F 72 11 Yes A F 74  1 No  B M 80 21 Yes A F 69  3 No
;

Automatically create a sliced fit plot

Many SAS regression procedures support the PLOTS= option on the PROC statement. For PROC LOGISTIC, the option that creates a sliced fit plot is the PLOTS=EFFECTPLOT option, and you can add prediction limits to the graph by using the CLBAND suboption, as follows:

proc logistic data=Neuralgia alpha=0.2 plots(only)=effectplot(clband);
   class Treatment;
   model Pain(Event='Yes')= Treatment Age;
run;

That was easy! The procedure automatically creates a title, legend, axis labels, and so forth. By using the PLOTS= option, you get a very nice plot that shows the predictions and prediction limits for the model.

Create a sliced fit plot by using PROC PLM

One of the nice things about the STORE statement in SAS regression procedures is that it enables you to create graphs and perform other post-fitting analyses without rerunning the procedure. Maybe you intend to examine many models before deciding on the best model. You can run goodness-of-fit statistics for the models and then use PROC PLM to create a sliced fit plot for only the final model. To do this, use the STORE statement in the regression procedure and then "restore" the model in PROC PLM, which can perform several post-fitting analyses, including creating a sliced fit plot, as follows:

proc logistic data=Neuralgia alpha=0.2 noprint;
   class Treatment;
   model Pain(Event='Yes')= Treatment Age;
   store PainModel / label='Neuralgia Study';
run;
 
proc plm restore=PainModel noinfo;
   effectplot slicefit(x=Age sliceby=Treatment) / clm;
run;

The sliced fit plot is identical to the one that is produced by PROC LOGISTIC and is not shown.

Create a sliced fit plot manually

For many situations, the statistical graphics that are automatically produced are adequate. However, at times you might want to customize the graph by changing the title, the placement of the legend, the colors, and so forth. Sometimes companies mandate color-schemes and fonts that every report must use. For this purpose, SAS supports ODS styles and templates, which you can use to permanently change the output of SAS procedures. However, in many situations, you just want to make a small one-time modification. In that situation, it is usually simplest to write the predictions to a SAS data set and then use PROC SGPLOT to create the graph.

It is not hard to create a sliced fit plot. For these data, you can perform three steps:

  1. Write the predicted values and upper/lower prediction limits to a SAS data set.
  2. Sort the data by the classification variable and by the continuous variable.
  3. Use the BAND statement with the TRANSPARENCY= option to plot the confidence bands. Use the SERIES statement to plot the predicted values.

You can use the full power of PROC SGPLOT to modify the plot. For example, the following statements label the curves, move the legend, and change the title and Y axis label:

proc logistic data=Neuralgia alpha=0.2 noprint;
   class Treatment;
   model Pain(Event='Yes')= Treatment Age;
   /* 1. Use a procedure or DATA step to write Pred, Lower, and Upper limits */
   output out=LogiOut pred=Pred lower=Lower upper=Upper;
run;
 
/* 2. Be sure to SORT! */
proc sort data=LogiOut;
   by Treatment Age;
run;
 
/* 3. Use a BAND statement. If more than one band, use transparency */
title "Predicted Probabilities with 80% Confidence Limits";
title2 "Three Treatment Groups";
proc sgplot data=LogiOut;
   band x=Age lower=Lower upper=Upper / group=Treatment transparency=0.75 name="L";
   series x=Age y=Pred / group=Treatment curvelabel;
   xaxis grid;
   yaxis grid label="Predicted Probability of Pain" max=1;
   keylegend "L" / location=inside position=NW title="Treatment" across=1 opaque;
run;

The graph is shown at the top of this article. It has a customized title, label, and legend. In addition, the curves on this graph differ from the curves on the previous graph. The OUTPUT statement evaluates the model at the observed values of Age and Treatment. Notice that the A and B treatment groups do not have any patients who are over the age of 80; therefore, those prediction curves do not extend to the right-hand side of the graph.

In summary, there are three ways to visualize predictions and confidence bands for a regression model in SAS. This example used PROC LOGISTIC, but many other regression procedures support similar options. In most SAS/STAT procedures, you can use the PLOTS= option to obtain a fit plot or a sliced fit plot. More than a dozen procedures support the STORE statement, which enables you to use PROC PLM to create the visualization. Lastly, all regression procedures support some way to output predicted values to a SAS data set. You can sort the data, then use the BAND statement (with transparency) and the SERIES statement to create the sliced fit plot.

The post 3 ways to add confidence limits to regression curves in SAS appeared first on The DO Loop.

November 20, 2019
 

In a linear regression model, the predicted values are on the same scale as the response variable. You can plot the observed and predicted responses to visualize how well the model agrees with the data. However, for generalized linear models, there is a potential source of confusion. Recall that a generalized linear model (GLIM) has two components: a linear predictor, which is a linear combination of the explanatory variables, and a transformation (called the inverse link function) that maps the linear predictor onto the scale of the data. Consequently, SAS regression procedures support two types of predicted values and prediction limits. In the SAS documentation, the first type is called "predictions on the linear scale" whereas the second type is called "predictions on the data scale."

For many SAS procedures, the default is to compute predicted values on the linear scale. However, for GLIMs that model nonnormal response variables, it is more intuitive to predict on the data scale. The ILINK option, which is shorthand for "apply the inverse link transformation," converts the predicted values to the data scale. This article shows how the ILINK option works by providing an example for a logistic regression model, which is the most familiar generalized linear model.

Review of generalized linear models

The SAS documentation provides an overview of GLIMs and link functions. The documentation for PROC GENMOD provides a list of link functions for common regression models, including logistic regression, Poisson regression, and negative binomial regression.

Briefly, the linear predictor is
η = X*β
where X is the design matrix and β is the vector of regression coefficients. The link function (g) is a monotonic function that relates the linear predictor to the conditional mean of the response. Sometimes the symbol μ is used to denote the conditional mean of the response (μ = E[Y|x]), which leads to the formula
g(μ) = X*β

In SAS, you will often see options and variable names (in output data sets) that contain the substring 'XBETA'. When you see 'XBETA', it indicates that the statistic or variable is related to the LINEAR predictor. Because the link function, g, is monotonic, it has an inverse, g^(-1). For generalized linear models, the inverse link function maps the linear-scale predictions to data-scale predictions: if η = xβ is a predicted value on the linear scale, then g^(-1)(η) is the predicted value for x on the data scale.

When the response variable is binary, the GLIM is the logistic model. If you use the convention that Y=1 indicates an event and Y=0 indicates the absence of an event, then the "data scale" is [0, 1] and the GLIM predicts the probability that the event occurs. For the logistic GLIM, the link function is the logit function:
g(μ) = logit(μ) = log( μ / (1 - μ) )
The inverse of the logit function is called the logistic function:
g^(-1)(η) = logistic(η) = 1 / (1 + exp(-η))
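
Both transformations are available as DATA step functions in SAS (LOGIT and LOGISTIC), so you can verify numerically that they are inverses of each other. The following check is mine, not part of the original example:

data CheckLink;
   do mu = 0.1 to 0.9 by 0.2;
      eta = logit(mu);            /* linear scale: log( mu/(1-mu) ) */
      muBack = logistic(eta);     /* back to the data scale; equals mu */
      output;
   end;
run;
proc print data=CheckLink noobs; run;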

To demonstrate the ILINK option, the next sections perform the following tasks:

  1. Use PROC LOGISTIC to fit a logistic model to data. You can use the STORE statement to store the model to an item store.
  2. Use the SCORE statement in PROC PLM to score new data. This example scores data by using the ILINK option.
  3. Score the data again, but this time do not use the ILINK option. Apply the logistic transformation to the linear estimates to demonstrate the relationship between the linear scale and the data scale.

The transformation between the linear scale and the data scale is illustrated by the following graph:

Fit a logistic model

The following data are from the documentation for PROC LOGISTIC. The model predicts the probability of Pain="Yes" (the event) for patients in a study, based on the patients' sex, age, and treatment method ('A', 'B', or 'P'). The STORE statement in PROC LOGISTIC creates an item store that can be used for many purposes, including scoring new observations.

data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   DROP Duration;
   datalines;
P F 68  1 No  B M 74 16 No  P F 67 30 No  P M 66 26 Yes B F 67 28 No  B F 77 16 No
A F 71 12 No  B F 72 50 No  B F 76  9 Yes A M 71 17 Yes A F 63 27 No  A F 69 18 Yes
B F 66 12 No  A M 62 42 No  P F 64  1 Yes A F 64 17 No  P M 74  4 No  A F 72 25 No
P M 70  1 Yes B M 66 19 No  B M 59 29 No  A F 64 30 No  A M 70 28 No  A M 69  1 No
B F 78  1 No  P M 83  1 Yes B F 69 42 No  B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No  A F 69 12 No  B F 65 14 No  B M 70  1 No  B M 67 23 No  A M 76 25 Yes
P M 78 12 Yes B M 77  1 Yes B F 69 24 No  P M 66  4 Yes P F 65 29 No  P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No  P F 72 27 No  P F 70 13 Yes A M 75  6 Yes
B F 65  7 No  P F 68 27 Yes P M 68 11 Yes P M 67 17 Yes B M 70 22 No  A M 65 15 No
P F 67  1 Yes A M 67 10 No  P F 72 11 Yes A F 74  1 No  B M 80 21 Yes A F 69  3 No
;
 
title 'Logistic Model on Neuralgia';
proc logistic data=Neuralgia;
   class Sex Treatment;
   model Pain(Event='Yes')= Sex Age Treatment;
   store PainModel / label='Neuralgia Study';  /* store model for post-fit analysis */
run;

Score new data by using the ILINK option

There are many reasons to use PROC PLM, but an important purpose of PROC PLM is to score new observations. Given information about new patients, you can use PROC PLM to predict the probability of pain if these patients are given a specific treatment. The following DATA step defines the characteristics of four patients who will receive Treatment B. The call to PROC PLM scores the new data and uses the ILINK option to predict the probability of pain:

/* Use PLM to score new patients */
data NewPatients;
   input Treatment $ Sex $ Age Duration;
   DROP Duration;
   datalines;
B M 67 15 
B F 73  5 
B M 74 12 
B F 79 16 
;
 
/* predictions on the DATA scale */
proc plm restore=PainModel noprint;
   score data=NewPatients out=ScoreILink
         predicted lclm uclm / ilink; /* ILINK gives probabilities */
run;
 
proc print data=ScoreILink; run;

The Predicted column contains probabilities in the interval [0, 1]. The 95% prediction limits for the predictions are given by the LCLM and UCLM columns. For example, the prediction interval for the 67-year-old man is approximately [0.03, 0.48].

These values and intervals are transformations of analogous quantities on the linear scale. The logit transformation maps the predicted probabilities to the linear estimates. The inverse logit (logistic) transformation maps the linear estimates to the predicted probabilities.

Linear estimates and the logistic transformation

The linear scale is important because effects are additive on this scale. If you are testing the difference of means between groups, the tests are performed on the linear scale. For example, the ESTIMATE, LSMEANS, and LSMESTIMATE statements in SAS perform hypothesis testing on the linear estimates. Each of these statements supports the ILINK option, which enables you to display predicted values on the data scale.
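
For example, the following sketch (not part of the original analysis) requests least-squares means for the Treatment variable on the probability scale. Note that the LSMEANS statement in PROC LOGISTIC requires the GLM parameterization for the CLASS variables:

proc logistic data=Neuralgia;
   class Sex Treatment / param=glm;     /* LSMEANS requires PARAM=GLM */
   model Pain(Event='Yes')= Sex Age Treatment;
   lsmeans Treatment / ilink cl;        /* means and limits on the data scale */
run;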

To demonstrate the connection between the predicted values on the linear and data scale, the following call to PROC PLM scores the same data according to the same model. However, this time the ILINK option is omitted, so the predictions are on the linear scale.

/* predictions on the LINEAR scale */
proc plm restore=PainModel noprint;
   score data=NewPatients out=ScoreXBeta(
         rename=(Predicted=XBeta LCLM=LowerXB UCLM=UpperXB))
         predicted lclm uclm;         /* ILINK not used, so linear predictor */
run;
 
proc print data=ScoreXBeta; run;

I have renamed the variables that PROC PLM creates for the estimates on the linear scale. The XBeta column shows the predicted values. The LowerXB and UpperXB columns show the prediction interval for each patient. The XBeta column shows the values you would obtain if you use the parameter estimates table from PROC LOGISTIC and apply those estimates to the observations in the NewPatients data.

To demonstrate that the linear estimates are related to the estimates in the previous section, the following SAS DATA step uses the logistic (inverse logit) transformation to convert the linear estimates onto the predicted probabilities:

/* Use the logistic (inverse logit) to transform the linear estimates
     (XBeta) into probability estimates in [0,1], which is the data scale.
   You can use the logit transformation to go the other way. */
data LinearToData;
   set ScoreXBeta;   /* predictions on linear scale */
   PredProb   = logistic(XBeta);
   LCLProb    = logistic(LowerXB);
   UCLProb    = logistic(UpperXB); 
run;
 
proc print data=LinearToData;
   var Treatment Sex Age PredProb LCLProb UCLProb;
run;

The transformation of the linear estimates gives the same values as the estimates that were obtained by using the ILINK option in the previous section.

Summary

In summary, there are two different scales for predicting values for a generalized linear model. When you report predicted values, it is important to specify the scale you are using. The data scale makes intuitive sense because it is the same scale as the response variable. You can use the ILINK option in SAS to get predicted values and prediction intervals on the data scale.

The post Predicted values in generalized linear models: The ILINK option in SAS appeared first on The DO Loop.

June 26, 2019
 

SAS/STAT software contains a number of so-called HP procedures for training and evaluating predictive models. ("HP" stands for "high performance.") A popular HP procedure is HPLOGISTIC, which enables you to fit logistic models on Big Data. A goal of the HP procedures is to fit models quickly. Inferential statistics such as standard errors, hypothesis tests, and p-values are less important for Big Data because, if the sample is big enough, all effects are significant and all hypothesis tests are rejected! Accordingly, many of the HP procedures do not support the same statistical tests as their non-HP cousins (for example, PROC LOGISTIC).

A SAS programmer recently posted an interesting question on the SAS Support Community. She is using PROC HPLOGISTIC for variable selection on a large data set. She obtained a final model, but she also wanted to estimate the covariance matrix for the parameters. PROC HPLOGISTIC does not support that statistic, but PROC LOGISTIC does (the COVB option on the MODEL statement). She asked whether it is possible to get the covariance matrix for the final model from PROC LOGISTIC, given that PROC HPLOGISTIC has already fit the model. Furthermore, PROC LOGISTIC supports computing and graphing odds ratios, so is it possible to get those statistics, too?

It is an intriguing question. The answer is "yes," although PROC LOGISTIC still has to perform some work. The main idea is that you can tell PROC LOGISTIC to use the parameter estimates found by PROC HPLOGISTIC.

What portion of a logistic regression takes the most time?

The main computational burden in logistic regression is threefold:

  • Reading the data and levelizing the CLASS variables in order to form a design matrix. This is done once.
  • Forming the sum of squares and crossproducts matrix (SSCP, also called the X`X matrix). This is done once.
  • Maximizing the likelihood function. The parameter estimates require running a nonlinear optimization. During the optimization, you need to evaluate the likelihood function, its gradient, and its Hessian many times. This is an expensive operation, and it must be done for each iteration of the optimization method.

The amount of time required by each step depends on the number of observations, the number of effects in the model, the number of levels in the classification variables, and the number of computational threads available on your system. You can use the Timing table in the PROC HPLOGISTIC output to determine how much time is spent during each step for your data/model combination. For example, the following results are for a simulated data set that has 500,000 observations, 30 continuous variables, four CLASS variables (each with 3 levels), and a main-effects model. It is running on a desktop PC with four cores. The iteration history is not shown, but this model converged in six iterations, which is relatively fast.
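
The DATA step that simulates the data is not shown in this article. The following sketch is one way to create a data set that is consistent with the description; the coefficient values are arbitrary choices, not taken from the original program:

%let numCont  = 30;          /* number of continuous variables */
%let numClass = 4;           /* number of CLASS variables      */
data simLogi;
   call streaminit(1);
   array x[&numCont];
   array c[&numClass];
   do i = 1 to 500000;
      eta = -1;                                  /* arbitrary intercept */
      do j = 1 to &numCont;
         x[j] = rand("Normal");
         eta = eta + 0.1*x[j];                   /* arbitrary linear effects */
      end;
      do j = 1 to &numClass;
         c[j] = rand("Integer", 1, 3);           /* 3 levels per CLASS variable */
         eta = eta + 0.2*(c[j]=1) - 0.2*(c[j]=3);
      end;
      y = rand("Bernoulli", logistic(eta));
      output;
   end;
   drop i j eta;
run;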

ods select PerformanceInfo IterHistory Timing;
proc hplogistic data=simLogi OUTEST;   /* OUTEST adds the ParmName variable to the ParameterEstimates table */
   class c1-c&numClass;                /* &numClass and &numCont are macro variables defined in the full program */
   model y(event='1') = x1-x&numCont c1-c&numClass;
   ods output ParameterEstimates=PE;   /* save the estimates for later use with INEST= */
   performance details;                /* request the Timing table */
run;

For these data and model, the Timing table tells you that the HPLOGISTIC procedure fit the model in 3.68 seconds. Of that time, about 17% was spent reading the data, levelizing the CLASS variables, and accumulating the SSCP matrix. The bulk of the time was spent evaluating the loglikelihood function and its derivatives during the six iterations of the optimization process.

Reduce time in an optimization by improving the initial guess

If you want to improve the performance of the model fitting, about the only thing you can control is the number of iterations required to fit the model. Both HPLOGISTIC and PROC LOGISTIC support an INEST= option that enables you to provide an initial guess for the parameter estimates. If the guess is close to the final estimates, the optimization method will require fewer iterations to converge.

Here's the key idea: If you provide the parameter estimates from a previous run, then you don't need to run the optimization algorithm at all! You still need to read the data, form the SSCP matrix, and evaluate the loglikelihood function at the (final) parameter estimates. But you don't need any iterations of the optimization method because you know that the estimates are optimal.

How much time can you save by using this trick? Let's find out. Notice that the previous call used the ODS OUTPUT statement to output the final parameter estimates. (It also used the OUTEST option to add the ParmName variable to the output.) You can run PROC TRANSPOSE to convert the ParameterEstimates table into a data set that can be read by the INEST= option on PROC HPLOGISTIC or PROC LOGISTIC. For either procedure, set the option MAXITER=0 to bypass the optimization process, as follows:

/* create INEST= data set from ParameterEstimates */
proc transpose data=PE out=inest(type=EST) label=_TYPE_;
   label Estimate=PARMS;   /* the label 'PARMS' becomes the value of the _TYPE_ variable */
   var Estimate;
   id ParmName;            /* the parameter names become the variable names */
run;
 
ods select PerformanceInfo IterHistory Timing;
proc hplogistic data=simLogi INEST=inest MAXITER=0;   /* start at the previous estimates and skip the optimization */
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass;
   performance details;
run;

The time required to read the data, levelize, and form the SSCP is unchanged. However, the time spent evaluating the loglikelihood and its derivatives is about 1/(1+NumIters) of the time from the first run, where NumIters is the number of optimization iterations (with six iterations, about 1/7 of the original time). The second call to the HPLOGISTIC procedure still has to evaluate the loglikelihood function to form statistics like the AIC and BIC, and it needs to evaluate the Hessian to compute standard errors of the estimates. But the total time is greatly decreased by skipping the optimization step.

Use PROC HPLOGISTIC estimates to jump-start PROC LOGISTIC

You can use the same trick with PROC LOGISTIC. You can use the INEST= and MAXITER=0 options to greatly reduce the total time for the procedure. At the same time, you can request statistics such as COVB and ODDSRATIO that are not available in PROC HPLOGISTIC, as shown in the following example:

proc logistic data=simLogi INEST=inest                /* use the estimates from PROC HPLOGISTIC */
   plots(only MAXPOINTS=NONE)=oddsratio(range=clip);
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass / MAXITER=0 COVB;   /* skip the optimization; request COVB */
   oddsratio c1;
run;

The PROC LOGISTIC step takes about 4.5 seconds. It produces odds ratios and plots for the model effects and displays the covariance matrix of the betas (COVB). By using the parameter estimates that were obtained by PROC HPLOGISTIC, it was able to avoid the expensive optimization iterations.

You can also use the STORE statement in PROC LOGISTIC to save the model to an item store. You can then use PROC PLM to create additional graphs and to run additional post-fit analyses.
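
For example, the following is a minimal sketch of that workflow. It reuses the data and model from above; the item-store name LogiModel and the particular EFFECTPLOT request are illustrative choices, not part of the original program:

proc logistic data=simLogi INEST=inest;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass / MAXITER=0;
   store LogiModel;          /* save the fitted model to an item store */
run;
 
proc plm restore=LogiModel;
   effectplot fit(x=x1);     /* one example of a post-fit graph */
run;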

In summary, PROC LOGISTIC can compute statistics and hypothesis tests that are not available in PROC HPLOGISTIC. It is possible to fit a model by using PROC HPLOGISTIC and then use the INEST= and MAXITER=0 options to pass the parameter estimates to PROC LOGISTIC. This enables PROC LOGISTIC to skip the optimization iterations, which saves substantial computational time.

You can download the SAS program that creates the tables and graphs in this article. The program contains a DATA step that simulates logistic data of any size.

The post Jump-start PROC LOGISTIC by using parameter estimates from PROC HPLOGISTIC appeared first on The DO Loop.

October 27, 2011
 
The background is from An Introduction to Generalized Linear Models by A. J. Dobson; the example is described as follows:

The data are from an experiment to promote the recovery of stroke patients. There were three experimental groups:

A was a new occupational therapy intervention;
B was the existing stroke rehabilitation program conducted in the same hospital where A was conducted;
C was the usual care regime for stroke patients provided in a different hospital.

There were eight patients in each experimental group. The response variable was a measure of functional ability, the Barthel index; higher scores correspond to better outcomes, and the maximum score is 100. Each patient was assessed weekly over the eight weeks of the study.

data table11_1;
  input Subject    Group $     week1-week8;
datalines;
1     A     45    45    45    45    80    80    80    90
2     A     20    25    25    25    30    35    30    50
3     A     50    50    55    70    70    75    90    90
4     A     25    25    35    40    60    60    70    80
5     A     100   100   100   100   100   100   100   100
6     A     20    20    30    50    50    60    85    95
7     A     30    35    35    40    50    60    75    85
8     A     30    35    45    50    55    65    65    70
9     B     40    55    60    70    80    85    90    90
10    B     65    65    70    70    80    80    80    80
11    B     30    30    40    45    65    85    85    85
12    B     25    35    35    35    40    45    45    45
13    B     45    45    80    80    80    80    80    80
14    B     15    15    10    10    10    20    20    20
15    B     35    35    35    45    45    45    50    50
16    B     40    40    40    55    55    55    60    65
17    C     20    20    30    30    30    30    30    30
18    C     35    35    35    40    40    40    40    40
19    C     35    35    35    40    40    40    45    45
20    C     45    65    65    65    80    85    95    100
21    C     45    65    70    90    90    95    95    100
22    C     25    30    30    35    40    40    40    40
23    C     25    25    30    30    30    30    35    40
24    C     15    35    35    35    40    50    65    65
;
run;

Here a linear model is used to study the relation between Group and Time. The model is:
                                                Y_{i,j,k} = \alpha_{i} + \beta_{i} * t_{k} + e_{i,j,k}
so that E(Y_{i,j,k}) = \alpha_{i} + \beta_{i} * t_{k}. Here Y_{i,j,k} is the score at time t_{k} (k=1 to 8) for patient j (j=1 to 8) in group i (i=1 for Group A, 2 for Group B, and 3 for Group C).

For convenience in the linear regression, we first rearrange the data from the wide format to the long format:

data table11_1long;
      set table11_1;
      array week(8) week1-week8;
      do time=1 to 8;
            score=week(time);
            output;
      end;
      drop week1-week8;
run;

We also need to convert the categorical variable into dummy variables:

data table11_3a;
      set table11_1long;
      if group='B' then do; a2=1; a3=0; end;
      if group='C' then do; a2=0; a3=1; end;
      if group='A' then do; a2=0; a3=0; end;
run;

At first, we ignore the dependence among scores measured at different times on the same patient and assume the observations are independent. Including the interaction of Time and Group, the model can be fit as follows:

proc glm data=table11_3a;
      model score = time a2 a3 time*a2 time*a3 / solution ss3;
run;
quit;

The above model includes the interaction of Group and Time. It is a saturated model, equivalent to regressing Score on Time separately within each Group. Therefore, the same fits can be obtained by:

proc reg data=table11_1long;
      by group;
      model score = time;
run; 

Sometimes an exploratory analysis, also called data reduction or data summary, is required. It consists of summarizing the response profile of each subject by a small number of descriptive statistics, computed under the assumption that measurements on the same subject are independent. Here the intercept and slope from a per-subject regression are used as the summary statistics:

proc reg data = table11_1long;
      by subject;
      model score = time;
      ods output parameterestimates = table11_4 (keep = subject variable estimate stderr);
run;
quit;

proc tabulate data=table11_4;
      var estimate stderr;
      class variable subject / order=data;
      table subject='Subject'*sum='', (variable='')*(estimate stderr)*f=6.3 / rts=20 row=float;
run;

Then we can get the following summary table:

------------------------------------------------
|                  |  Intercept  |    time     |
|                  |-------------+-------------|
|                  |Param-|      |Param-|      |
|                  | eter |Stand-| eter |Stand-|
|                  |Estim-| ard  |Estim-| ard  |
|                  | ate  |Error | ate  |Error |
|------------------+------+------+------+------|
|Subject           |      |      |      |      |
|------------------|      |      |      |      |
|1                 |30.000| 7.289| 7.500| 1.443|
|------------------+------+------+------+------|
|2                 |15.536| 4.099| 3.214| 0.812|
|------------------+------+------+------+------|
|3                 |39.821| 3.209| 6.429| 0.636|
|------------------+------+------+------+------|
|4                 |11.607| 3.387| 8.393| 0.671|
|------------------+------+------+------+------|
|5                 |100.00| 0.000| 0.000| 0.000|
|------------------+------+------+------+------|
|6                 | 0.893| 5.304|11.190| 1.050|
|------------------+------+------+------+------|
|7                 |15.357| 4.669| 7.976| 0.925|
|------------------+------+------+------+------|
|8                 |25.357| 1.971| 5.893| 0.390|
|------------------+------+------+------+------|
|9                 |38.571| 3.522| 7.262| 0.698|
|------------------+------+------+------+------|
|10                |61.964| 2.236| 2.619| 0.443|
|------------------+------+------+------+------|
|11                |14.464| 5.893| 9.702| 1.167|
|------------------+------+------+------+------|
|12                |26.071| 2.147| 2.679| 0.425|
|------------------+------+------+------+------|
|13                |48.750| 8.927| 5.000| 1.768|
|------------------+------+------+------+------|
|14                |10.179| 3.209| 1.071| 0.636|
|------------------+------+------+------+------|
|15                |31.250| 1.948| 2.500| 0.386|
|------------------+------+------+------+------|
|16                |34.107| 2.809| 3.810| 0.556|
|------------------+------+------+------+------|
|17                |21.071| 2.551| 1.429| 0.505|
|------------------+------+------+------+------|
|18                |34.107| 1.164| 0.893| 0.231|
|------------------+------+------+------+------|
|19                |32.143| 1.164| 1.607| 0.231|
|------------------+------+------+------+------|
|20                |42.321| 3.698| 7.262| 0.732|
|------------------+------+------+------+------|
|21                |48.571| 6.140| 7.262| 1.216|
|------------------+------+------+------+------|
|22                |24.821| 1.885| 2.262| 0.373|
|------------------+------+------+------+------|
|23                |22.321| 1.709| 1.845| 0.338|
|------------------+------+------+------+------|
|24                |13.036| 4.492| 6.548| 0.890|
------------------------------------------------

The last part is an analysis of variance for the intercepts and slopes. Table11_4 contains only four variables (Subject, Variable, Estimate, and StdErr), whereas table11_1long contains the Time and Group variables, so we need to combine the two data sets. (Use DISTINCT so that the join does not replicate each subject's row eight times.)

proc sql;
      create table table11_5 as
      select distinct table11_4.*, table11_1long.group   /* DISTINCT prevents duplicating each subject's row 8 times */
      from  table11_4, table11_1long
      where table11_4.subject = table11_1long.subject;
quit;

Then the analysis of the intercepts can be done by:

proc genmod data=table11_5;
      class group (ref='A' param=ref);
      where variable='Intercept';
      model estimate = group ;
run;
quit;

The analysis of the slopes can be done by:

proc genmod data=table11_5;
      class group (ref='A' param=ref);
      where variable='time';
      model estimate = group;
run;
quit;

Here we use PROC GENMOD, which allows us to use the categorical variable directly and gives us the choice of the reference level. We do not use PROC GLM because it does not allow you to choose the reference level in the regression.

In the book, the author uses PROC REG for this step. With PROC REG, we must use the dummy variables rather than the categorical variable, so first we combine table11_4 with table11_3a:

proc sql;
      create table table11_5 as
      select distinct table11_4.*, x.a2 as a2, x.a3 as a3
      from table11_4, table11_3a as x
      where table11_4.subject = x.subject;
quit;

proc reg data=table11_5;
      where variable='Intercept';
      model estimate = a2 a3;
run;
quit;

proc reg data=table11_5;
      where variable='time';
      model estimate = a2 a3;
run;
quit;

It gives the same results as PROC GENMOD.


As we know, the independence assumption for these repeated measures is usually not appropriate. A more suitable model for these data should account for the dependence among the scores at the eight time points; a Generalized Estimating Equations (GEE) model is more reasonable.
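
For reference, a GEE analysis can be carried out in PROC GENMOD by adding a REPEATED statement. The following is a minimal sketch, not part of Dobson's worked example; the exchangeable working correlation structure is an assumption that should be examined:

proc genmod data=table11_1long;
      class subject group (ref='A' param=ref);
      model score = group time group*time;
      repeated subject=subject / type=exch corrw;   /* exchangeable working correlation (an assumption) */
run;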