Regression

December 14, 2020
 

A segmented regression model is a piecewise regression model that has two or more sub-models, each defined on a separate domain for the explanatory variables. For simplicity, assume the model has one continuous explanatory variable, X. The simplest segmented regression model assumes that the response is modeled by one parametric model when X is less than some threshold value and by a different parametric model when X is greater than the threshold value. The threshold value is also called a breakpoint, a cutpoint, or a join point.

A previous article shows that for simple piecewise polynomial models, you can use the EFFECT statement in SAS regression procedures to construct a spline that fits some segmented regression models. The method relies on two assumptions. First, it assumes that you know the location of the breakpoint. Second, it assumes that the model has the same parametric form on each interval. For example, the model might be piecewise constant, piecewise linear, or piecewise quadratic.
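
For example, the following sketch fits a continuous piecewise linear model when the breakpoint is known. (The data set Have2, the variables x and y, and the knot location 250 are hypothetical placeholders.)

proc glmselect data=Have2;
   effect spl = spline(x / degree=1 basis=tpf(noint)
                           knotmethod=list(250));   /* known breakpoint at x=250 */
   model y = spl / selection=none;                  /* fit the model; no selection */
run;

The TPF(NOINT) option requests the truncated power function basis, so the basis functions are x and (x-250)+, and the coefficient of the second term is the change in slope at the breakpoint.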

If you need to estimate the location of the breakpoint from the data, or if you are modeling the response differently on each segment, you cannot use a single spline effect. Instead, you need to use a SAS procedure such as PROC NLIN that enables you to specify the model on each segment and to estimate the breakpoint. For simplicity, this article shows how to estimate a model that is quadratic on the first segment and constant on the second segment. (This is called a plateau model.) The model also estimates the location of the breakpoint.

An example of a segmented model

A SAS customer recently asked about segmented models on the SAS Support Communities. Suppose that a surgeon wants to model how long it takes her to perform a certain procedure over time. When the surgeon first started performing the procedure, it took about 3 hours (180 minutes). As she refined her technique, that time decreased. The surgeon wants to predict how long the surgery now takes, and she wants to estimate when the time reached its current plateau. The data are shown below and visualized in a scatter plot. The length of the procedure (in minutes) is recorded for 25 surgeries over a 16-month period.

data Have;
  input SurgeryNo Date :mmddyy10. duration @@;
  format Date mmddyy10.;
datalines;
1       3/20/2019       182   2       5/16/2019       150
3       5/30/2019       223   4       6/6/2019        142
5       6/11/2019       111   6       7/11/2019       164
7       7/26/2019        83   8       8/22/2019       144
9       8/29/2019       162  10       9/19/2019        83
11      10/10/2019       70  12       10/17/2019      114
13      10/31/2019      113  14       11/7/2019        97
15      11/21/2019       83  16       12/5/2019       111
17      12/5/2019        73  18       12/12/2019       87
19      12/19/2019       86  20       1/9/2020        102
21      1/16/2020       124  22       1/23/2020        95
23      1/30/2020       134  24       3/5/2020        121
25      6/4/2020         60
;
 
title "Time to Perform a Surgery";
proc sgplot data=Have;
  scatter x=Date y=Duration;
run;

From the graph, it looks like the duration for the procedure decreased until maybe November or December 2019. The goal of a segmented model is to find the breakpoint and to model the duration before and after the breakpoint. For this purpose, you can use a segmented plateau model that is a quadratic model prior to the breakpoint and a constant model after the breakpoint.

Formulate a segmented regression model

A segmented plateau model is one of the examples in the PROC NLIN documentation. The documentation shows how to use constraints in the problem to eliminate one or more parameters. For example, a common assumption is that the two segments join smoothly at the breakpoint location, x0. If f(x) is the predictive model to the left of the breakpoint and g(x) is the model to the right, then continuity dictates that f(x0) = g(x0) and smoothness dictates that f′(x0) = g′(x0). In many cases (such as when the models are low-degree polynomials), the two constraints enable you to reparameterize the models to eliminate two of the parameters.

The PROC NLIN documentation shows the details. Suppose f(x) is a quadratic function f(x) = α + β x + γ x² and g(x) is a constant function g(x) = c. Then you can use the constraints to reparameterize the problem so that (α, β, γ) are free parameters, and the other two parameters are determined by the formulas:
x0 = −β / (2 γ)
c = α − β² / (4 γ)
You can use the ESTIMATE statement in PROC NLIN to obtain estimates and standard errors for the x0 and c parameters.
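
For the record, here is the short derivation of those formulas. Because g(x) = c is constant, the smoothness condition reduces to f′(x0) = 0, which determines the breakpoint:
f′(x0) = β + 2 γ x0 = 0, so x0 = −β / (2 γ).
The continuity condition c = f(x0) then determines the plateau value:
c = α + β x0 + γ x0² = α − β² / (2 γ) + β² / (4 γ) = α − β² / (4 γ).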

Provide initial guesses for the parameters

As is often the case, the hard part is to guess initial values for the parameters. You must supply an initial guess on the PARMS statement in PROC NLIN. One way to create a guess is to use a related "reduced model" to provide the estimates. For example, you can use PROC GLM to fit a global quadratic model to the data, as follows:

/* rescale by using x="days since start" as the variable */
%let RefDate = '20MAR2019'd;
data New;
set Have;
rename duration=y;
x = date - &RefDate;  /* days since first record */
run;
 
proc glm data=New;
   model y = x x*x;
run;

These parameter estimates are used in the next section to specify the initial parameter values for the segmented model.

Fit a segmented regression model in SAS

Recall that a SAS date is represented by the number of days since 01JAN1960. Thus, for these data, the Date values are approximately 22,000. Numerically speaking, it is often better to use smaller numbers in a regression problem, so I will rescale the explanatory variable to be the number of days since the first surgery. You can use the parameter estimates from the quadratic model as the initial values for the segmented model:

title 'Segmented Model with Plateau';
proc nlin data=New plots=fit noitprint;
   parms alpha=184 beta= -0.5 gamma= 0.001;
 
   x0 = -0.5*beta / gamma;
   if (x < x0) then
        mean = alpha + beta*x  + gamma*x*x;    /* quadratic model for x < x0 */
   else mean = alpha + beta*x0 + gamma*x0*x0;  /* constant model for x >= x0 */
   model y = mean;
 
   estimate 'plateau'    alpha + beta*x0 + gamma*x0*x0;
   estimate 'breakpoint' -0.5*beta / gamma;
   output out=NLinOut predicted=Pred L95M=Lower U95M=Upper;
   ods select ParameterEstimates AdditionalEstimates FitPlot;
run;

The output from PROC NLIN includes estimates for the quadratic regression coefficients and for the breakpoint and the plateau value. According to the second table, the surgeon can now perform this surgical procedure in about 98 minutes, on average. The 95% confidence interval [77, 119] suggests that the surgeon might want to schedule two hours for this procedure so that she is not late for her next appointment.

According to the estimate for the breakpoint (x0), she achieved her plateau after about 287 days of practice. However, the confidence interval is quite large, so there is considerable uncertainty in this estimate.

The output from PROC NLIN includes a plot that overlays the predictions on the observed data. However, that graph is in terms of the number of days since the first surgery. If you want to return to the original scale, you can graph the predicted values versus the Date. You can also add reference lines that indicate the plateau value and the estimated breakpoint, as follows:

/* convert x back to original scale (Date) */
data _NULL_;
   plateauDate = 287 + &RefDate;    /* translate back to Date scale */
   call symputx('x0', plateauDate);  /* store breakpoint in macro variable; SYMPUTX trims blanks */
   put plateauDate DATE10.;
run;
 
/* plot the predicted values and CLM on the original scale */
proc sgplot data=NLinOut noautolegend;
   band x=Date lower=Lower upper=Upper;
   refline 97.97  / axis=y label="Plateau"    labelpos=max;
   refline &x0 / axis=x label="01JAN2020" labelpos=max;
   scatter x=Date y=y;
   series  x=Date y=Pred;
   yaxis label='Duration';
run;

The graph shows the breakpoint estimate, which happens to be 01JAN2020. It also shows the variability in the data and the wide prediction limits.

Summary

This article shows how to fit a simple segmented model in SAS. The model has one breakpoint, which is estimated from the data. The model is quadratic before the breakpoint, constant after the breakpoint, and joins smoothly at the breakpoint. The constraints on continuity and smoothness reduce the problem from five parameters to three free parameters. The article shows how to use PROC NLIN in SAS to fit segmented models and how to visualize the results.

You can use a segmented model for many other data sets. For example, if you are a runner and routinely run a 5k distance, you can use a segmented model to monitor your times. Are your times decreasing, or did you reach a plateau? When did you reach the plateau?

The post Segmented regression models in SAS appeared first on The DO Loop.

December 2, 2020
 

A SAS customer asked a great question: "I have parameter estimates for a logistic regression model that I computed by using multiple imputations. How do I use these parameter estimates to score new observations and to visualize the model? PROC LOGISTIC can do the computation I want, but how do I tell it to use the externally computed estimates?"

This article presents a solution for PROC LOGISTIC. At the end of this article, I present a few tips for other SAS procedures.

Here's the main idea: PROC LOGISTIC supports an INEST= option that you can use to specify initial values of the parameters. It also supports the MAXITER=0 option on the MODEL statement, which tells the procedure not to perform any iterations to try to improve the parameter estimates. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want. Furthermore, you can use the STORE statement to store the model and use PROC PLM to perform scoring, visualization, and other post-fitting analyses.

I have used this technique previously to compute parameter estimates in PROC HPLOGISTIC and use them in PROC LOGISTIC to estimate odds ratios, the covariance matrix of the parameters, and other inferential quantities that are not available in PROC HPLOGISTIC. In a similar way, PROC LOGISTIC can construct ROC curves for predictions that were made outside of PROC LOGISTIC.
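
For example, if a data set already contains externally computed probabilities, a minimal sketch of the ROC trick looks like this. (The data set PredData and the variables y and PredProb are hypothetical.)

proc logistic data=PredData plots=roc;
   model y(event='1') = PredProb / nofit;   /* evaluate the predictions; do not fit */
   roc 'External model' pred=PredProb;
run;

The NOFIT option suppresses the model fit, and the ROC statement computes the ROC curve directly from the supplied probabilities.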

Produce parameter estimates by using PROC MIANALYZE

As a motivating example, let's create parameter estimates by using multiple imputations. The documentation for PROC MIANALYZE has an example of using PROC MI and PROC MIANALYZE to estimate the parameters for a logistic model. The following data and analysis are from that example. The data are lengths and widths of two species of fish (perch and parkki). Missing values are artificially introduced. A scatter plot of the data is shown.

data Fish2;
   title 'Fish Measurement Data';
   input Species $ Length Width @@;
   datalines;
Parkki  16.5  2.3265    Parkki  17.4  2.3142    .      19.8   .
Parkki  21.3  2.9181    Parkki  22.4  3.2928    .      23.2  3.2944
Parkki  23.2  3.4104    Parkki  24.1  3.1571    .      25.8  3.6636
Parkki  28.0  4.1440    Parkki  29.0  4.2340    Perch   8.8  1.4080
.       14.7  1.9992    Perch   16.0  2.4320    Perch  17.2  2.6316
Perch   18.5  2.9415    Perch   19.2  3.3216    .      19.4   .
Perch   20.2  3.0502    Perch   20.8  3.0368    Perch  21.0  2.7720
Perch   22.5  3.5550    Perch   22.5  3.3075    .      22.5   .
Perch   22.8  3.5340    .       23.5   .        Perch  23.5  3.5250
Perch   23.5  3.5250    Perch   23.5  3.5250    Perch  23.5  3.9950
.       24.0   .        Perch   24.0  3.6240    Perch  24.2  3.6300
Perch   24.5  3.6260    Perch   25.0  3.7250    .      25.5  3.7230
Perch   25.5  3.8250    Perch   26.2  4.1658    Perch  26.5  3.6835
.       27.0  4.2390    Perch   28.0  4.1440    Perch  28.7  5.1373
.       28.9  4.3350    .       28.9   .        .      28.9  4.5662
Perch   29.4  4.2042    Perch   30.1  4.6354    Perch  31.6  4.7716
Perch   34.0  6.0180    .       36.5  6.3875    .      37.3  7.7957
.       39.0   .        .       38.3   .        Perch  39.4  6.2646
Perch   39.3  6.3666    Perch   41.4  7.4934    Perch  41.4  6.0030
Perch   41.3  7.3514    .       42.3   .        Perch  42.5  7.2250
Perch   42.4  7.4624    Perch   42.5  6.6300    Perch  44.6  6.8684
Perch   45.2  7.2772    Perch   45.5  7.4165    Perch  46.0  8.1420
Perch   46.6  7.5958
;
 
proc format;
value $FishFmt
    " " = "Unknown";
run;
 
proc sgplot data=Fish2;
   format Species $FishFmt.;
   styleattrs DATACONTRASTCOLORS=(DarkRed LightPink DarkBlue);
   scatter x=Length y=Width / group=Species markerattrs=(symbol=CircleFilled);
run;

The analyst wants to use PROC LOGISTIC to create a model that uses Length and Width to predict whether a fish is perch or parkki. The scatter plot shows that the parkki (dark red) tend to be less wide than the perch of the same length. For some fish in the graph, the species is not known.

Because the data contains missing values, the analyst uses PROC MI to create 25 imputed data sets, uses PROC LOGISTIC to produce 25 sets of parameter estimates, and uses PROC MIANALYZE to combine the estimates into a single set of parameter estimates. See the documentation for a discussion.

/* Example from the MIANALYZE documentation 
   "Reading Logistic Model Results from a PARMS= Data Set"  https://bit.ly/394VlI7
*/
proc mi data=Fish2 seed=1305417 out=outfish2;
   class Species;
   monotone logistic( Species= Length Width);
   var Length Width Species;
run;
 
ods select none; options nonotes;
proc logistic data=outfish2;
   class Species;
   model Species= Length Width / covb;
   by _Imputation_;
   ods output ParameterEstimates=lgsparms;
run;
ods select all; options notes;
 
proc mianalyze parms=lgsparms;
   modeleffects Intercept Length Width;
   ods output ParameterEstimates=MI_PE;
run;
 
proc print data=MI_PE noobs; 
  var Parm Estimate;
run;

The parameter estimates from PROC MIANALYZE are shown. The question is: How can you use PROC LOGISTIC and PROC PLM to score and visualize this model, given that the estimates are produced outside of PROC LOGISTIC?

Get PROC LOGISTIC to use external estimates

As mentioned earlier, a solution to this problem is to use the INEST= option on the PROC LOGISTIC statement in conjunction with the MAXITER=0 option on the MODEL statement. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want, and you can use the STORE statement to create an item store that can be read by PROC PLM to perform scoring and visualization.

You can create the INEST= data set by hand, but it is easier to use PROC LOGISTIC to create an OUTEST= data set and then merely change the values for the parameter estimates, as done in the following example:

/* 1. Use PROC LOGISTIC to create an OUTEST= data set */
proc logistic data=Fish2 outest=OutEst noprint;
   class Species;
   model Species= Length Width;
run;
 
/* 2. replace the values of the parameter estimates with different values */
data inEst;
set outEst;
Intercept = -0.130560;
Length = 1.169782;
Width = -8.284998;
run;
 
/* 3. Use the INEST= data set and MAXITER=0 to get PROC LOGISTIC to create
      a model. Use the STORE statement to write an item store.
      https://blogs.sas.com/content/iml/2019/06/26/logistic-estimates-from-hplogistic.html
*/
proc logistic data=Fish2 inest=InEst;        /* read in external model */
   model Species= Length Width / maxiter=0;  /* do not refine model fit */
   effectplot contour / ilink;
   store LogiModel;
run;

The contour plot is part of the output from PROC LOGISTIC. You could also request an ROC curve, odds ratios, and other statistics. The contour plot visualizes the regression model. For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red).

Scoring the model

Because PROC LOGISTIC writes an item store for the model, you can use PROC PLM to perform a variety of scoring tasks, visualization, and hypothesis tests. The following statements create a scoring data set and use PROC PLM to score the model and estimate the probability that each fish is a parkki:

/* 4. create a score data set */
data NewFish;
input Length Width;
datalines;
17.0  2.7
18.1  2.1
21.3  2.9
22.4  3.0
29.1  4.3
;
 
/* 5. predictions on the DATA scale */
proc plm restore=LogiModel noprint;
   score data=NewFish out=ScoreILink
         predicted lclm uclm / ilink; /* ILINK gives probabilities */
run;
 
proc print data=ScoreILink; run;

According to the model, the first and fifth fish are probably perch. The second, third, and fourth fish are predicted to be parkki, although the 95% confidence intervals indicate that you should not be too confident in the predictions for the third and fourth observations.

Thoughts on other regression procedures

Unfortunately, not every regression procedure in SAS is as flexible as PROC LOGISTIC. In many cases, it might be difficult or impossible to "trick" a SAS regression procedure into analyzing a model that was produced externally. Here are a few thoughts from me and from one of my colleagues. I didn't have time to fully investigate these ideas, so caveat emptor!

  • For least squares models, the venerable PROC SCORE can handle the scoring. It can read an OUTEST/INEST-style data set, just like in the PROC LOGISTIC example. If you have CLASS variables or other constructed effects (for example, spline effects), you will have to use columns of a design matrix as variables.
  • Many SAS regression procedures support the CODE statement, which writes DATA step code to score the model. Because the CODE statement writes a text file, you can edit the text file and replace the parameter estimates in the file with different estimates. However, the CODE statement does not handle constructed effects. (A sketch of this approach appears after this list.)
  • The MODEL statement of PROC GENMOD supports the INITIAL= and INTERCEPT= options. Therefore, you ought to be able to specify the initial values for the parameter estimates. PROC GENMOD also supports the MAXITER=0 option. Be aware that the values on the INITIAL= option must match the order in which the estimates appear in the ParameterEstimates table.
  • Some nonlinear modeling procedures (such as PROC NLIN and PROC NLMIXED) support ways to specify the initial values for parameters. If you specify TECH=NONE, then the procedure does not perform any optimization. These procedures also support the MAXITER= option.
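
To make the CODE-statement idea concrete, here is a minimal sketch. (The file name score.sas is a hypothetical path; the estimates that you paste into the file would come from your external analysis.)

/* 1. Have the procedure write DATA step scoring code to a text file */
proc logistic data=Fish2 noprint;
   class Species;
   model Species= Length Width;
   code file='score.sas';    /* writes editable scoring code */
run;
 
/* 2. Edit score.sas and replace the parameter estimates with the
      externally computed values. */
/* 3. Score new observations by including the edited file in a DATA step */
data ScoredNew;
   set NewFish;
   %include 'score.sas';
run;

Because the scoring code is an ordinary text file, this trick applies to any procedure that supports the CODE statement, subject to the constructed-effects caveat above.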

Summary

This article shows how to score parametric regression models when the parameter estimates are not fit by the usual procedures. For example, multiple imputations can produce a set of parameter estimates. In PROC LOGISTIC, you can use an INEST= data set to read the estimates and use the MAXITER=0 option to suppress fitting. You can use the STORE statement to store the model and use PROC PLM to perform scoring and visualization. Other procedures have similar options, but there is not a single method that works for all SAS regression procedures.

If you use any of the ideas in this article, let me know how they work by leaving a comment. If you have an alternate way to trick SAS regression procedures into using externally supplied estimates, let me know that as well.

The post How to score a logistic regression model that was not fit by PROC LOGISTIC appeared first on The DO Loop.

12月 022020
 

A SAS customer asked a great question: "I have parameter estimates for a logistic regression model that I computed by using multiple imputations. How do I use these parameter estimates to score new observations and to visualize the model? PROC LOGISTIC can do the computation I want, but how do I tell it to use the externally computed estimates?"

This article presents a solution for PROC LOGISTIC. At the end of this article, I present a few tips for other SAS procedures.

Here's the main idea: PROC LOGISTIC supports an INEST= option that you can use to specify initial values of the parameters. It also supports the MAXITER=0 option on the MODEL statement, which tells the procedure not to perform any iterations to try to improve the parameter estimates. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want. Furthermore, you can use the STORE statement to store the model and use PROC PLM to perform scoring, visualization, and other post-fitting analyses.

I have used this technique previously to compute parameter estimates in PROC HPLOGISTIC and use them in PROC LOGISTIC to estimate odds ratios, the covariance matrix of the parameters, and other inferential quantities that are not available in PROC HPLOGISTIC. In a similar way, PROC LOGISTIC can construct ROC curves for predictions that were made outside of PROC LOGISTIC.

Produce parameter estimates by using PROC MIANALYZE

As a motivating example, let's create parameter estimates by using multiple imputations. The documentation for PROC MIANALYZE has an example of using PROC MI and PROC MIANALYZE to estimate the parameters for a logistic model. The following data and analysis are from that example. The data are lengths and widths of two species of fish (perch and parkki). Missing values are artificially introduced. A scatter plot of the data is shown.

data Fish2;
   title 'Fish Measurement Data';
   input Species $ Length Width @@;
   datalines;
Parkki  16.5  2.3265    Parkki  17.4  2.3142    .      19.8   .
Parkki  21.3  2.9181    Parkki  22.4  3.2928    .      23.2  3.2944
Parkki  23.2  3.4104    Parkki  24.1  3.1571    .      25.8  3.6636
Parkki  28.0  4.1440    Parkki  29.0  4.2340    Perch   8.8  1.4080
.       14.7  1.9992    Perch   16.0  2.4320    Perch  17.2  2.6316
Perch   18.5  2.9415    Perch   19.2  3.3216    .      19.4   .
Perch   20.2  3.0502    Perch   20.8  3.0368    Perch  21.0  2.7720
Perch   22.5  3.5550    Perch   22.5  3.3075    .      22.5   .
Perch   22.8  3.5340    .       23.5   .        Perch  23.5  3.5250
Perch   23.5  3.5250    Perch   23.5  3.5250    Perch  23.5  3.9950
.       24.0   .        Perch   24.0  3.6240    Perch  24.2  3.6300
Perch   24.5  3.6260    Perch   25.0  3.7250    .      25.5  3.7230
Perch   25.5  3.8250    Perch   26.2  4.1658    Perch  26.5  3.6835
.       27.0  4.2390    Perch   28.0  4.1440    Perch  28.7  5.1373
.       28.9  4.3350    .       28.9   .        .      28.9  4.5662
Perch   29.4  4.2042    Perch   30.1  4.6354    Perch  31.6  4.7716
Perch   34.0  6.0180    .       36.5  6.3875    .      37.3  7.7957
.       39.0   .        .       38.3   .        Perch  39.4  6.2646
Perch   39.3  6.3666    Perch   41.4  7.4934    Perch  41.4  6.0030
Perch   41.3  7.3514    .       42.3   .        Perch  42.5  7.2250
Perch   42.4  7.4624    Perch   42.5  6.6300    Perch  44.6  6.8684
Perch   45.2  7.2772    Perch   45.5  7.4165    Perch  46.0  8.1420
Perch   46.6  7.5958
;
 
proc format;
value $FishFmt
    " " = "Unknown";
run;
 
proc sgplot data=Fish2;
   format Species $FishFmt.;
   styleattrs DATACONTRASTCOLORS=(DarkRed LightPink DarkBlue);
   scatter x=Length y=Width / group=Species markerattrs=(symbol=CircleFilled);
run;

The analyst wants to use PROC LOGISTIC to create a model that uses Length and Width to predict whether a fish is perch or parkki. The scatter plot shows that the parkki (dark red) tend to be less wide than the perch of the same length For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red). For some fish in the graph, the species is not known.

Because the data contains missing values, the analyst uses PROC MI to run 25 missing value imputations, uses PROC LOGISTIC to produce 25 sets of parameter estimates, and uses PROC MI to combine the estimates into a single set of parameter estimates. See the documentation for a discussion.

/* Example from the MIANALYZE documentation 
   "Reading Logistic Model Results from a PARMS= Data Set"  https://bit.ly/394VlI7
*/
proc mi data=Fish2 seed=1305417 out=outfish2;
   class Species;
   monotone logistic( Species= Length Width);
   var Length Width Species;
run;
 
ods select none; options nonotes;
proc logistic data=outfish2;
   class Species;
   model Species= Length Width / covb;
   by _Imputation_;
   ods output ParameterEstimates=lgsparms;
run;
ods select all; options notes;
 
proc mianalyze parms=lgsparms;
   modeleffects Intercept Length Width;
   ods output ParameterEstimates=MI_PE;
run;
 
proc print data=MI_PE noobs; 
  var Parm Estimate;
run;

The parameter estimates from PROC MIANALYZE are shown. The question is: How can you use PROC LOGISTIC and PROC PLM to score and visualize this model, given that the estimates are produced outside of PROC LOGISTIC?

Get PROC LOGISTIC to use external estimates

As mentioned earlier, a solution to this problem is to use the INEST= option on the PROC LOGISTIC statement in conjunction with the MAXITER=0 option on the MODEL statement. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want, and you can use the STORE statement to create an item store that can be read by PROC PLM to perform scoring and visualization.

You can create the INEST= data set by hand, but it is easier to use PROC LOGISTIC to create an OUTEST= data set and then merely change the values for the parameter estimates, as done in the following example:

/* 1. Use PROC LOGISTIC to create an OUTEST= data set */
proc logistic data=Fish2 outest=OutEst noprint;
   class Species;
   model Species= Length Width;
run;
 
/* 2. replace the values of the parameter estimates with different values */
data inEst;
set outEst;
Intercept = -0.130560;
Length = 1.169782;
Width = -8.284998;
run;
 
/* 3. Use the INEST= data set and MAXITER=0 to get PROC LOGISTIC to create
      a model. Use the STORE statement to write an item store.
      https://blogs.sas.com/content/iml/2019/06/26/logistic-estimates-from-hplogistic.html
*/
proc logistic data=Fish2 inest=InEst;        /* read in extermal model */
   model Species= Length Width / maxiter=0;  /* do not refine model fit */
   effectplot contour / ilink;
   store LogiModel;
run;

The contour plot is part of the output from PROC LOGISTIC. You could also request an ROC curve, odds ratios, and other statistics. The contour plot visualizes the regression model. For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red).

Scoring the model

Because PROC LOGISTIC writes an item store for the model, you can use PROC PLM to perform a variety of scoring tasks, visualization, and hypothesis tests. The following statements create a scoring data set and use PROC PLM to score the model and estimate the probability that each fish is a parkki:

/* 4. create a score data set */
data NewFish;
input Length Width;
datalines;
17.0  2.7
18.1  2.1
21.3  2.9
22.4  3.0
29.1  4.3
;
 
/* 5. predictions on the DATA scale */
proc plm restore=LogiModel noprint;
   score data=NewFish out=ScoreILink
         predicted lclm uclm / ilink; /* ILINK gives probabilities */
run;
 
proc print data=ScoreILink; run;

According to the model, the first and fifth fish are probably perch. The second, third, and fourth fish are predicted to be parkki, although the 95% confidence intervals indicate that you should not be too confident in the predictions for the third and fourth observations.

Thoughts on other regression procedures

Unfortunately, not every regression procedure in SAS is as flexible as PROC LOGISTIC. In many cases, it might be difficult or impossible to "trick" a SAS regression procedure into analyzing a model that was produced externally. Here are a few thoughts from me and from one of my colleagues. I didn't have time to fully investigate these ideas, so caveat emptor!

  • For least squares models, the venerable PROC SCORE can handle the scoring. It can read an OUTEST/INEST-style data set, just like in the PROC LOGISTIC example. If you have CLASS variables or other constructed effects (for example, spline effects), you will have to use columns of a design matrix as variables.
  • Many SAS regression procedures support the CODE statement, which writes DATA step code to score the model. Because the CODE statement writes a text file, you can edit the text file and replace the parameter estimates in the file with different estimates. However, the CODE statement does not handle constructed effects.
  • The MODEL statement of PROC GENMOD supports the INITIAL= and INTERCEPT= options. Therefore, you ought to be able to specify the initial values for parameter estimates. PROC GENMOD also supports the MAXITER=0 option. Be aware that the values on the INITIAL= option must match the order that the estimates appear in the ParameterEstimates table.
  • Some nonlinear modeling procedures (such as PROC NLIN and PROC NLMIXED) support ways to specify the initial values for parameters. If you specify TECH=NONE, then the procedure does not perform any optimization. These procedures also support the MAXITER= option.

Summary

This article shows how to score parametric regression models when the parameter estimates are not fit by the usual procedures. For example, multiple imputations can produce a set of parameter estimates. In PROC LOGISTIC, you can use an INEST= data set to read the estimates and use the MAXITER=0 option to suppress fitting. You can use the STORE statement to store the model and use PROC PLM to perform scoring and visualization. Other procedures have similar options, but there is not a single method that works for all SAS regression procedures.

If you use any of the ideas in this article, let me know how they work by leaving a comment. If you have an alternate way to trick SAS regression procedures into using externally supplied estimates, let me know that as well.

The post How to score a logistic regression model that was not fit by PROC LOGISTIC appeared first on The DO Loop.

12月 022020
 

A SAS customer asked a great question: "I have parameter estimates for a logistic regression model that I computed by using multiple imputations. How do I use these parameter estimates to score new observations and to visualize the model? PROC LOGISTIC can do the computation I want, but how do I tell it to use the externally computed estimates?"

This article presents a solution for PROC LOGISTIC. At the end of this article, I present a few tips for other SAS procedures.

Here's the main idea: PROC LOGISTIC supports an INEST= option that you can use to specify initial values of the parameters. It also supports the MAXITER=0 option on the MODEL statement, which tells the procedure not to perform any iterations to try to improve the parameter estimates. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want. Furthermore, you can use the STORE statement to store the model and use PROC PLM to perform scoring, visualization, and other post-fitting analyses.

I have used this technique previously to compute parameter estimates in PROC HPLOGISTIC and use them in PROC LOGISTIC to estimate odds ratios, the covariance matrix of the parameters, and other inferential quantities that are not available in PROC HPLOGISTIC. In a similar way, PROC LOGISTIC can construct ROC curves for predictions that were made outside of PROC LOGISTIC.

Produce parameter estimates by using PROC MIANALYZE

As a motivating example, let's create parameter estimates by using multiple imputations. The documentation for PROC MIANALYZE has an example of using PROC MI and PROC MIANALYZE to estimate the parameters for a logistic model. The following data and analysis are from that example. The data are lengths and widths of two species of fish (perch and parkki). Missing values are artificially introduced. A scatter plot of the data is shown.

data Fish2;
   title 'Fish Measurement Data';
   input Species $ Length Width @@;
   datalines;
Parkki  16.5  2.3265    Parkki  17.4  2.3142    .      19.8   .
Parkki  21.3  2.9181    Parkki  22.4  3.2928    .      23.2  3.2944
Parkki  23.2  3.4104    Parkki  24.1  3.1571    .      25.8  3.6636
Parkki  28.0  4.1440    Parkki  29.0  4.2340    Perch   8.8  1.4080
.       14.7  1.9992    Perch   16.0  2.4320    Perch  17.2  2.6316
Perch   18.5  2.9415    Perch   19.2  3.3216    .      19.4   .
Perch   20.2  3.0502    Perch   20.8  3.0368    Perch  21.0  2.7720
Perch   22.5  3.5550    Perch   22.5  3.3075    .      22.5   .
Perch   22.8  3.5340    .       23.5   .        Perch  23.5  3.5250
Perch   23.5  3.5250    Perch   23.5  3.5250    Perch  23.5  3.9950
.       24.0   .        Perch   24.0  3.6240    Perch  24.2  3.6300
Perch   24.5  3.6260    Perch   25.0  3.7250    .      25.5  3.7230
Perch   25.5  3.8250    Perch   26.2  4.1658    Perch  26.5  3.6835
.       27.0  4.2390    Perch   28.0  4.1440    Perch  28.7  5.1373
.       28.9  4.3350    .       28.9   .        .      28.9  4.5662
Perch   29.4  4.2042    Perch   30.1  4.6354    Perch  31.6  4.7716
Perch   34.0  6.0180    .       36.5  6.3875    .      37.3  7.7957
.       39.0   .        .       38.3   .        Perch  39.4  6.2646
Perch   39.3  6.3666    Perch   41.4  7.4934    Perch  41.4  6.0030
Perch   41.3  7.3514    .       42.3   .        Perch  42.5  7.2250
Perch   42.4  7.4624    Perch   42.5  6.6300    Perch  44.6  6.8684
Perch   45.2  7.2772    Perch   45.5  7.4165    Perch  46.0  8.1420
Perch   46.6  7.5958
;
 
proc format;
value $FishFmt
    " " = "Unknown";
run;
 
proc sgplot data=Fish2;
   format Species $FishFmt.;
   styleattrs DATACONTRASTCOLORS=(DarkRed LightPink DarkBlue);
   scatter x=Length y=Width / group=Species markerattrs=(symbol=CircleFilled);
run;

The analyst wants to use PROC LOGISTIC to create a model that uses Length and Width to predict whether a fish is perch or parkki. The scatter plot shows that the parkki (dark red) tend to be less wide than the perch of the same length For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red). For some fish in the graph, the species is not known.

Because the data contains missing values, the analyst uses PROC MI to run 25 missing value imputations, uses PROC LOGISTIC to produce 25 sets of parameter estimates, and uses PROC MI to combine the estimates into a single set of parameter estimates. See the documentation for a discussion.

/* Example from the MIANALYZE documentation 
   "Reading Logistic Model Results from a PARMS= Data Set"  https://bit.ly/394VlI7
*/
proc mi data=Fish2 seed=1305417 out=outfish2;
   class Species;
   monotone logistic( Species= Length Width);
   var Length Width Species;
run;
 
ods select none; options nonotes;
proc logistic data=outfish2;
   class Species;
   model Species= Length Width / covb;
   by _Imputation_;
   ods output ParameterEstimates=lgsparms;
run;
ods select all; options notes;
 
proc mianalyze parms=lgsparms;
   modeleffects Intercept Length Width;
   ods output ParameterEstimates=MI_PE;
run;
 
proc print data=MI_PE noobs; 
  var Parm Estimate;
run;

The parameter estimates from PROC MIANALYZE are shown. The question is: How can you use PROC LOGISTIC and PROC PLM to score and visualize this model, given that the estimates are produced outside of PROC LOGISTIC?

Get PROC LOGISTIC to use external estimates

As mentioned earlier, a solution to this problem is to use the INEST= option on the PROC LOGISTIC statement in conjunction with the MAXITER=0 option on the MODEL statement. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want, and you can use the STORE statement to create an item store that can be read by PROC PLM to perform scoring and visualization.

You can create the INEST= data set by hand, but it is easier to use PROC LOGISTIC to create an OUTEST= data set and then merely change the values for the parameter estimates, as done in the following example:

/* 1. Use PROC LOGISTIC to create an OUTEST= data set */
proc logistic data=Fish2 outest=OutEst noprint;
   class Species;
   model Species= Length Width;
run;
 
/* 2. replace the values of the parameter estimates with different values */
data inEst;
set outEst;
Intercept = -0.130560;
Length = 1.169782;
Width = -8.284998;
run;
 
/* 3. Use the INEST= data set and MAXITER=0 to get PROC LOGISTIC to create
      a model. Use the STORE statement to write an item store.
      https://blogs.sas.com/content/iml/2019/06/26/logistic-estimates-from-hplogistic.html
*/
proc logistic data=Fish2 inest=InEst;        /* read in extermal model */
   model Species= Length Width / maxiter=0;  /* do not refine model fit */
   effectplot contour / ilink;
   store LogiModel;
run;

The contour plot is part of the output from PROC LOGISTIC. You could also request an ROC curve, odds ratios, and other statistics. The contour plot visualizes the regression model. For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red).

Scoring the model

Because PROC LOGISTIC writes an item store for the model, you can use PROC PLM to perform a variety of scoring tasks, visualization, and hypothesis tests. The following statements create a scoring data set and use PROC PLM to score the model and estimate the probability that each fish is a parkki:

/* 4. create a score data set */
data NewFish;
input Length Width;
datalines;
17.0  2.7
18.1  2.1
21.3  2.9
22.4  3.0
29.1  4.3
;
 
/* 5. predictions on the DATA scale */
proc plm restore=LogiModel noprint;
   score data=NewFish out=ScoreILink
         predicted lclm uclm / ilink; /* ILINK gives probabilities */
run;
 
proc print data=ScoreILink; run;

According to the model, the first and fifth fish are probably perch. The second, third, and fourth fish are predicted to be parkki, although the 95% confidence intervals indicate that you should not be too confident in the predictions for the third and fourth observations.

Thoughts on other regression procedures

Unfortunately, not every regression procedure in SAS is as flexible as PROC LOGISTIC. In many cases, it might be difficult or impossible to "trick" a SAS regression procedure into analyzing a model that was produced externally. Here are a few thoughts from me and from one of my colleagues. I didn't have time to fully investigate these ideas, so caveat emptor!

  • For least squares models, the venerable PROC SCORE can handle the scoring. It can read an OUTEST/INEST-style data set, just like in the PROC LOGISTIC example. If you have CLASS variables or other constructed effects (for example, spline effects), you will have to use columns of a design matrix as variables.
  • Many SAS regression procedures support the CODE statement, which writes DATA step code to score the model. Because the CODE statement writes a text file, you can edit the text file and replace the parameter estimates in the file with different estimates. However, the CODE statement does not handle constructed effects.
  • The MODEL statement of PROC GENMOD supports the INITIAL= and INTERCEPT= options. Therefore, you ought to be able to specify the initial values for parameter estimates. PROC GENMOD also supports the MAXITER=0 option. Be aware that the values on the INITIAL= option must match the order that the estimates appear in the ParameterEstimates table.
  • Some nonlinear modeling procedures (such as PROC NLIN and PROC NLMIXED) support ways to specify the initial values for parameters. If you specify TECH=NONE, then the procedure does not perform any optimization. These procedures also support the MAXITER= option.

Summary

This article shows how to score parametric regression models when the parameter estimates are not fit by the usual procedures. For example, multiple imputations can produce a set of parameter estimates. In PROC LOGISTIC, you can use an INEST= data set to read the estimates and use the MAXITER=0 option to suppress fitting. You can use the STORE statement to store the model and use PROC PLM to perform scoring and visualization. Other procedures have similar options, but there is not a single method that works for all SAS regression procedures.

If you use any of the ideas in this article, let me know how they work by leaving a comment. If you have an alternate way to trick SAS regression procedures into using externally supplied estimates, let me know that as well.

The post How to score a logistic regression model that was not fit by PROC LOGISTIC appeared first on The DO Loop.

12月 022020
 

A SAS customer asked a great question: "I have parameter estimates for a logistic regression model that I computed by using multiple imputations. How do I use these parameter estimates to score new observations and to visualize the model? PROC LOGISTIC can do the computation I want, but how do I tell it to use the externally computed estimates?"

This article presents a solution for PROC LOGISTIC. At the end of this article, I present a few tips for other SAS procedures.

Here's the main idea: PROC LOGISTIC supports an INEST= option that you can use to specify initial values of the parameters. It also supports the MAXITER=0 option on the MODEL statement, which tells the procedure not to perform any iterations to try to improve the parameter estimates. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want. Furthermore, you can use the STORE statement to store the model and use PROC PLM to perform scoring, visualization, and other post-fitting analyses.

I have used this technique previously to compute parameter estimates in PROC HPLOGISTIC and use them in PROC LOGISTIC to estimate odds ratios, the covariance matrix of the parameters, and other inferential quantities that are not available in PROC HPLOGISTIC. In a similar way, PROC LOGISTIC can construct ROC curves for predictions that were made outside of PROC LOGISTIC.

Produce parameter estimates by using PROC MIANALYZE

As a motivating example, let's create parameter estimates by using multiple imputations. The documentation for PROC MIANALYZE has an example of using PROC MI and PROC MIANALYZE to estimate the parameters for a logistic model. The following data and analysis are from that example. The data are lengths and widths of two species of fish (perch and parkki). Missing values are artificially introduced. A scatter plot of the data is shown.

data Fish2;
   title 'Fish Measurement Data';
   input Species $ Length Width @@;
   datalines;
Parkki  16.5  2.3265    Parkki  17.4  2.3142    .      19.8   .
Parkki  21.3  2.9181    Parkki  22.4  3.2928    .      23.2  3.2944
Parkki  23.2  3.4104    Parkki  24.1  3.1571    .      25.8  3.6636
Parkki  28.0  4.1440    Parkki  29.0  4.2340    Perch   8.8  1.4080
.       14.7  1.9992    Perch   16.0  2.4320    Perch  17.2  2.6316
Perch   18.5  2.9415    Perch   19.2  3.3216    .      19.4   .
Perch   20.2  3.0502    Perch   20.8  3.0368    Perch  21.0  2.7720
Perch   22.5  3.5550    Perch   22.5  3.3075    .      22.5   .
Perch   22.8  3.5340    .       23.5   .        Perch  23.5  3.5250
Perch   23.5  3.5250    Perch   23.5  3.5250    Perch  23.5  3.9950
.       24.0   .        Perch   24.0  3.6240    Perch  24.2  3.6300
Perch   24.5  3.6260    Perch   25.0  3.7250    .      25.5  3.7230
Perch   25.5  3.8250    Perch   26.2  4.1658    Perch  26.5  3.6835
.       27.0  4.2390    Perch   28.0  4.1440    Perch  28.7  5.1373
.       28.9  4.3350    .       28.9   .        .      28.9  4.5662
Perch   29.4  4.2042    Perch   30.1  4.6354    Perch  31.6  4.7716
Perch   34.0  6.0180    .       36.5  6.3875    .      37.3  7.7957
.       39.0   .        .       38.3   .        Perch  39.4  6.2646
Perch   39.3  6.3666    Perch   41.4  7.4934    Perch  41.4  6.0030
Perch   41.3  7.3514    .       42.3   .        Perch  42.5  7.2250
Perch   42.4  7.4624    Perch   42.5  6.6300    Perch  44.6  6.8684
Perch   45.2  7.2772    Perch   45.5  7.4165    Perch  46.0  8.1420
Perch   46.6  7.5958
;
 
proc format;
value $FishFmt
    " " = "Unknown";
run;
 
proc sgplot data=Fish2;
   format Species $FishFmt.;
   styleattrs DATACONTRASTCOLORS=(DarkRed LightPink DarkBlue);
   scatter x=Length y=Width / group=Species markerattrs=(symbol=CircleFilled);
run;

The analyst wants to use PROC LOGISTIC to create a model that uses Length and Width to predict whether a fish is perch or parkki. The scatter plot shows that the parkki (dark red) tend to be less wide than the perch of the same length For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red). For some fish in the graph, the species is not known.

Because the data contains missing values, the analyst uses PROC MI to run 25 missing value imputations, uses PROC LOGISTIC to produce 25 sets of parameter estimates, and uses PROC MI to combine the estimates into a single set of parameter estimates. See the documentation for a discussion.

/* Example from the MIANALYZE documentation 
   "Reading Logistic Model Results from a PARMS= Data Set"  https://bit.ly/394VlI7
*/
proc mi data=Fish2 seed=1305417 out=outfish2;
   class Species;
   monotone logistic( Species= Length Width);
   var Length Width Species;
run;
 
ods select none; options nonotes;
proc logistic data=outfish2;
   class Species;
   model Species= Length Width / covb;
   by _Imputation_;
   ods output ParameterEstimates=lgsparms;
run;
ods select all; options notes;
 
proc mianalyze parms=lgsparms;
   modeleffects Intercept Length Width;
   ods output ParameterEstimates=MI_PE;
run;
 
proc print data=MI_PE noobs; 
  var Parm Estimate;
run;

The parameter estimates from PROC MIANALYZE are shown. The question is: How can you use PROC LOGISTIC and PROC PLM to score and visualize this model, given that the estimates are produced outside of PROC LOGISTIC?

Get PROC LOGISTIC to use external estimates

As mentioned earlier, a solution to this problem is to use the INEST= option on the PROC LOGISTIC statement in conjunction with the MAXITER=0 option on the MODEL statement. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want, and you can use the STORE statement to create an item store that can be read by PROC PLM to perform scoring and visualization.

You can create the INEST= data set by hand, but it is easier to use PROC LOGISTIC to create an OUTEST= data set and then merely change the values for the parameter estimates, as done in the following example:

/* 1. Use PROC LOGISTIC to create an OUTEST= data set */
proc logistic data=Fish2 outest=OutEst noprint;
   class Species;
   model Species= Length Width;
run;
 
/* 2. replace the values of the parameter estimates with different values */
data inEst;
set outEst;
Intercept = -0.130560;
Length = 1.169782;
Width = -8.284998;
run;
 
/* 3. Use the INEST= data set and MAXITER=0 to get PROC LOGISTIC to create
      a model. Use the STORE statement to write an item store.
      https://blogs.sas.com/content/iml/2019/06/26/logistic-estimates-from-hplogistic.html
*/
proc logistic data=Fish2 inest=InEst;        /* read in extermal model */
   model Species= Length Width / maxiter=0;  /* do not refine model fit */
   effectplot contour / ilink;
   store LogiModel;
run;

The contour plot is part of the output from PROC LOGISTIC. You could also request an ROC curve, odds ratios, and other statistics. The contour plot visualizes the regression model. For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red).

Scoring the model

Because PROC LOGISTIC writes an item store for the model, you can use PROC PLM to perform a variety of scoring tasks, visualization, and hypothesis tests. The following statements create a scoring data set and use PROC PLM to score the model and estimate the probability that each fish is a parkki:

/* 4. create a score data set */
data NewFish;
input Length Width;
datalines;
17.0  2.7
18.1  2.1
21.3  2.9
22.4  3.0
29.1  4.3
;
 
/* 5. predictions on the DATA scale */
proc plm restore=LogiModel noprint;
   score data=NewFish out=ScoreILink
         predicted lclm uclm / ilink; /* ILINK gives probabilities */
run;
 
proc print data=ScoreILink; run;

According to the model, the first and fifth fish are probably perch. The second, third, and fourth fish are predicted to be parkki, although the 95% confidence intervals indicate that you should not be too confident in the predictions for the third and fourth observations.

Thoughts on other regression procedures

Unfortunately, not every regression procedure in SAS is as flexible as PROC LOGISTIC. In many cases, it might be difficult or impossible to "trick" a SAS regression procedure into analyzing a model that was produced externally. Here are a few thoughts from me and from one of my colleagues. I didn't have time to fully investigate these ideas, so caveat emptor!

  • For least squares models, the venerable PROC SCORE can handle the scoring. It can read an OUTEST/INEST-style data set, just like in the PROC LOGISTIC example. If you have CLASS variables or other constructed effects (for example, spline effects), you will have to use columns of a design matrix as variables.
  • Many SAS regression procedures support the CODE statement, which writes DATA step code to score the model. Because the CODE statement writes a text file, you can edit the text file and replace the parameter estimates in the file with different estimates. However, the CODE statement does not handle constructed effects.
  • The MODEL statement of PROC GENMOD supports the INITIAL= and INTERCEPT= options. Therefore, you ought to be able to specify the initial values for parameter estimates. PROC GENMOD also supports the MAXITER=0 option. Be aware that the values on the INITIAL= option must match the order that the estimates appear in the ParameterEstimates table.
  • Some nonlinear modeling procedures (such as PROC NLIN and PROC NLMIXED) support ways to specify the initial values for parameters. If you specify TECH=NONE, then the procedure does not perform any optimization. These procedures also support the MAXITER= option.

Summary

This article shows how to score parametric regression models when the parameter estimates are not fit by the usual procedures. For example, multiple imputations can produce a set of parameter estimates. In PROC LOGISTIC, you can use an INEST= data set to read the estimates and use the MAXITER=0 option to suppress fitting. You can use the STORE statement to store the model and use PROC PLM to perform scoring and visualization. Other procedures have similar options, but there is not a single method that works for all SAS regression procedures.

If you use any of the ideas in this article, let me know how they work by leaving a comment. If you have an alternate way to trick SAS regression procedures into using externally supplied estimates, let me know that as well.

The post How to score a logistic regression model that was not fit by PROC LOGISTIC appeared first on The DO Loop.

12月 022020
 

A SAS customer asked a great question: "I have parameter estimates for a logistic regression model that I computed by using multiple imputations. How do I use these parameter estimates to score new observations and to visualize the model? PROC LOGISTIC can do the computation I want, but how do I tell it to use the externally computed estimates?"

This article presents a solution for PROC LOGISTIC. At the end of this article, I present a few tips for other SAS procedures.

Here's the main idea: PROC LOGISTIC supports an INEST= option that you can use to specify initial values of the parameters. It also supports the MAXITER=0 option on the MODEL statement, which tells the procedure not to perform any iterations to try to improve the parameter estimates. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want. Furthermore, you can use the STORE statement to store the model and use PROC PLM to perform scoring, visualization, and other post-fitting analyses.

I have used this technique previously to compute parameter estimates in PROC HPLOGISTIC and use them in PROC LOGISTIC to estimate odds ratios, the covariance matrix of the parameters, and other inferential quantities that are not available in PROC HPLOGISTIC. In a similar way, PROC LOGISTIC can construct ROC curves for predictions that were made outside of PROC LOGISTIC.

Produce parameter estimates by using PROC MIANALYZE

As a motivating example, let's create parameter estimates by using multiple imputations. The documentation for PROC MIANALYZE has an example of using PROC MI and PROC MIANALYZE to estimate the parameters for a logistic model. The following data and analysis are from that example. The data are lengths and widths of two species of fish (perch and parkki). Missing values are artificially introduced. A scatter plot of the data is shown.

data Fish2;
   title 'Fish Measurement Data';
   input Species $ Length Width @@;
   datalines;
Parkki  16.5  2.3265    Parkki  17.4  2.3142    .      19.8   .
Parkki  21.3  2.9181    Parkki  22.4  3.2928    .      23.2  3.2944
Parkki  23.2  3.4104    Parkki  24.1  3.1571    .      25.8  3.6636
Parkki  28.0  4.1440    Parkki  29.0  4.2340    Perch   8.8  1.4080
.       14.7  1.9992    Perch   16.0  2.4320    Perch  17.2  2.6316
Perch   18.5  2.9415    Perch   19.2  3.3216    .      19.4   .
Perch   20.2  3.0502    Perch   20.8  3.0368    Perch  21.0  2.7720
Perch   22.5  3.5550    Perch   22.5  3.3075    .      22.5   .
Perch   22.8  3.5340    .       23.5   .        Perch  23.5  3.5250
Perch   23.5  3.5250    Perch   23.5  3.5250    Perch  23.5  3.9950
.       24.0   .        Perch   24.0  3.6240    Perch  24.2  3.6300
Perch   24.5  3.6260    Perch   25.0  3.7250    .      25.5  3.7230
Perch   25.5  3.8250    Perch   26.2  4.1658    Perch  26.5  3.6835
.       27.0  4.2390    Perch   28.0  4.1440    Perch  28.7  5.1373
.       28.9  4.3350    .       28.9   .        .      28.9  4.5662
Perch   29.4  4.2042    Perch   30.1  4.6354    Perch  31.6  4.7716
Perch   34.0  6.0180    .       36.5  6.3875    .      37.3  7.7957
.       39.0   .        .       38.3   .        Perch  39.4  6.2646
Perch   39.3  6.3666    Perch   41.4  7.4934    Perch  41.4  6.0030
Perch   41.3  7.3514    .       42.3   .        Perch  42.5  7.2250
Perch   42.4  7.4624    Perch   42.5  6.6300    Perch  44.6  6.8684
Perch   45.2  7.2772    Perch   45.5  7.4165    Perch  46.0  8.1420
Perch   46.6  7.5958
;
 
proc format;
value $FishFmt
    " " = "Unknown";
run;
 
proc sgplot data=Fish2;
   format Species $FishFmt.;
   styleattrs DATACONTRASTCOLORS=(DarkRed LightPink DarkBlue);
   scatter x=Length y=Width / group=Species markerattrs=(symbol=CircleFilled);
run;

The analyst wants to use PROC LOGISTIC to create a model that uses Length and Width to predict whether a fish is perch or parkki. The scatter plot shows that the parkki (dark red) tend to be less wide than the perch of the same length For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red). For some fish in the graph, the species is not known.

Because the data contains missing values, the analyst uses PROC MI to run 25 missing value imputations, uses PROC LOGISTIC to produce 25 sets of parameter estimates, and uses PROC MI to combine the estimates into a single set of parameter estimates. See the documentation for a discussion.

/* Example from the MIANALYZE documentation 
   "Reading Logistic Model Results from a PARMS= Data Set"  https://bit.ly/394VlI7
*/
proc mi data=Fish2 seed=1305417 out=outfish2;
   class Species;
   monotone logistic( Species= Length Width);
   var Length Width Species;
run;
 
ods select none; options nonotes;
proc logistic data=outfish2;
   class Species;
   model Species= Length Width / covb;
   by _Imputation_;
   ods output ParameterEstimates=lgsparms;
run;
ods select all; options notes;
 
proc mianalyze parms=lgsparms;
   modeleffects Intercept Length Width;
   ods output ParameterEstimates=MI_PE;
run;
 
proc print data=MI_PE noobs; 
  var Parm Estimate;
run;

The parameter estimates from PROC MIANALYZE are shown. The question is: How can you use PROC LOGISTIC and PROC PLM to score and visualize this model, given that the estimates are produced outside of PROC LOGISTIC?

Get PROC LOGISTIC to use external estimates

As mentioned earlier, a solution to this problem is to use the INEST= option on the PROC LOGISTIC statement in conjunction with the MAXITER=0 option on the MODEL statement. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want, and you can use the STORE statement to create an item store that can be read by PROC PLM to perform scoring and visualization.

You can create the INEST= data set by hand, but it is easier to use PROC LOGISTIC to create an OUTEST= data set and then merely change the values for the parameter estimates, as done in the following example:

/* 1. Use PROC LOGISTIC to create an OUTEST= data set */
proc logistic data=Fish2 outest=OutEst noprint;
   class Species;
   model Species= Length Width;
run;
 
/* 2. replace the values of the parameter estimates with different values */
data inEst;
set outEst;
Intercept = -0.130560;
Length = 1.169782;
Width = -8.284998;
run;
 
/* 3. Use the INEST= data set and MAXITER=0 to get PROC LOGISTIC to create
      a model. Use the STORE statement to write an item store.
      https://blogs.sas.com/content/iml/2019/06/26/logistic-estimates-from-hplogistic.html
*/
proc logistic data=Fish2 inest=InEst;        /* read in external model */
   model Species= Length Width / maxiter=0;  /* do not refine model fit */
   effectplot contour / ilink;
   store LogiModel;
run;

The contour plot is part of the output from PROC LOGISTIC. You could also request an ROC curve, odds ratios, and other statistics. The contour plot visualizes the regression model. For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red).
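
For example, here is a minimal sketch (not from the original post) that requests an ROC curve and a customized odds ratio for the same externally supplied model:

proc logistic data=Fish2 inest=InEst plots(only)=roc;
   model Species= Length Width / maxiter=0;  /* evaluate the supplied estimates; do not refit */
   oddsratio Length;                         /* odds ratio computed from the supplied estimates */
run;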

Scoring the model

Because PROC LOGISTIC writes an item store for the model, you can use PROC PLM to perform a variety of scoring tasks, visualization, and hypothesis tests. The following statements create a scoring data set and use PROC PLM to score the model and estimate the probability that each fish is a parkki:

/* 4. create a score data set */
data NewFish;
input Length Width;
datalines;
17.0  2.7
18.1  2.1
21.3  2.9
22.4  3.0
29.1  4.3
;
 
/* 5. predictions on the DATA scale */
proc plm restore=LogiModel noprint;
   score data=NewFish out=ScoreILink
         predicted lclm uclm / ilink; /* ILINK gives probabilities */
run;
 
proc print data=ScoreILink; run;

According to the model, the first and fifth fish are probably perch. The second, third, and fourth fish are predicted to be parkki, although the 95% confidence intervals indicate that you should not be too confident in the predictions for the third and fourth observations.
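
If you prefer predictions on the scale of the linear predictor, omit the ILINK option. Here is a minimal sketch, assuming the LogiModel item store from step 3:

/* 6. predictions on the LINK (logit) scale: omit the ILINK option */
proc plm restore=LogiModel noprint;
   score data=NewFish out=ScoreLink predicted lclm uclm;
run;
 
proc print data=ScoreLink; run;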

Thoughts on other regression procedures

Unfortunately, not every regression procedure in SAS is as flexible as PROC LOGISTIC. In many cases, it might be difficult or impossible to "trick" a SAS regression procedure into analyzing a model that was produced externally. Here are a few thoughts from me and from one of my colleagues. I didn't have time to fully investigate these ideas, so caveat emptor!

  • For least squares models, the venerable PROC SCORE can handle the scoring. It can read an OUTEST/INEST-style data set, just like in the PROC LOGISTIC example; see the sketch after this list. If you have CLASS variables or other constructed effects (for example, spline effects), you will have to use columns of a design matrix as variables.
  • Many SAS regression procedures support the CODE statement, which writes DATA step code to score the model. Because the CODE statement writes a text file, you can edit the text file and replace the parameter estimates in the file with different estimates. However, the CODE statement does not handle constructed effects.
  • The MODEL statement of PROC GENMOD supports the INITIAL= and INTERCEPT= options. Therefore, you ought to be able to specify the initial values for parameter estimates. PROC GENMOD also supports the MAXITER=0 option. Be aware that the values on the INITIAL= option must match the order that the estimates appear in the ParameterEstimates table.
  • Some nonlinear modeling procedures (such as PROC NLIN and PROC NLMIXED) support ways to specify the initial values for parameters. If you specify TECH=NONE, then the procedure does not perform any optimization. These procedures also support the MAXITER= option.
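
To illustrate the first bullet, here is a hypothetical sketch of the PROC SCORE approach for a least squares model. The data sets (Have, NewData) and the replacement estimates are placeholders, not values from a real analysis:

/* Hypothetical sketch: score a linear model by using external estimates */
proc reg data=Have outest=RegEst noprint;   /* 1. create an OUTEST-style data set */
   model Y = x1 x2;
quit;
 
data RegEst;                                /* 2. overwrite the estimates */
set RegEst;
Intercept = 1.0;  x1 = 2.0;  x2 = -3.0;     /* hypothetical external estimates */
run;
 
proc score data=NewData score=RegEst type=parms out=Scored;  /* 3. score new data */
   var x1 x2;                               /* variables that appear in the model */
run;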

Summary

This article shows how to score parametric regression models when the parameter estimates are not fit by the usual procedures. For example, multiple imputations can produce a set of parameter estimates. In PROC LOGISTIC, you can use an INEST= data set to read the estimates and use the MAXITER=0 option to suppress fitting. You can use the STORE statement to store the model and use PROC PLM to perform scoring and visualization. Other procedures have similar options, but there is not a single method that works for all SAS regression procedures.

If you use any of the ideas in this article, let me know how they work by leaving a comment. If you have an alternate way to trick SAS regression procedures into using externally supplied estimates, let me know that as well.

The post How to score a logistic regression model that was not fit by PROC LOGISTIC appeared first on The DO Loop.

9月 212020
 

A previous article discussed how to solve regression problems in which the parameters are constrained to be a specified constant (such as B1 = 1) or are restricted to obey a linear equation such as B4 = –2*B2. In SAS, you can use the RESTRICT statement in PROC REG to solve restricted least squares problems. However, if a constraint is an INEQUALITY instead of an EQUALITY, then (in general) a least squares method does not solve the problem. Therefore, you cannot use PROC REG to solve problems that have inequality constraints.

This article shows how to use PROC NLIN to solve linear regression problems whose regression coefficients are constrained by a linear inequality. Examples are B1 ≥ 3 or B1 + B2 ≥ 6.

PROC NLIN and constrained regression problems

Before solving a problem that has inequality constraints, let's see how PROC NLIN solves a problem that has linear equality constraints. The following statements use the data and model from the previous article. The model is
      Y = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X4 + ε
where B3 = B1 and B4 = –2*B2. In PROC NLIN, you can replace the B3 and B4 parameters, thus leaving only three parameters in the model. If desired, you can use the ESTIMATE statement to recover the value of B4, as follows:

/* You can solve the restricted problem by using PROC NLIN */
proc nlin data=RegSim noitprint;      /* use NLINMEASURES to estimate MSE */
   parameters b0=0 b1=3 b2=1;         /* specify names of parameters and initial guess */
   model Y = b0 + b1*x1 +   b2*x2 +
                  b1*x3 - 2*b2*x4;   /* replace B3 = B1 and B4 = -2*B2 */
   estimate 'b4' -2*b2;              /* estimate B4 */
run;

The parameter estimates are the same as those produced by PROC REG in the previous article.

The syntax for PROC NLIN is natural and straightforward. The PARAMETERS statement specifies the names of the parameters in the problem; the other symbols (X1-X4, Y) refer to data set variables. You must specify an initial guess for each parameter. For nonlinear regressions, this can be a challenge, but there are some tricks you can use to choose a good initial guess. For LINEAR regressions, the initial guess shouldn't matter: you can use all zeros or all ones, if you want.

Boundary constraints

Suppose that you want to force the regression coefficients to satisfy certain inequality constraints. You can use the BOUNDS statement in PROC NLIN to specify the constraints. The simplest type of constraint is to restrict a coefficient to a half-interval or interval. For example, the following call to PROC NLIN restricts B1 ≥ 3 and restricts B2 to the interval [0, 4].

/* INEQUALITY constraints on the regression parameters */
proc nlin data=RegSim;
   parameters b0=0 b1=3 b2=1;         /* initial guess */
   bounds b1 >= 3, 
          0<= b2 <= 4;
   model Y = b0 + b1*x1 + b2*x2 + b1*x3 - 2*b2*x4;
   ods select ParameterEstimates;
run;

The solution that minimizes the residual sum of squares (subject to the constraints) places the B1 parameter on the constraint B1=3. Therefore, a standard error is not available for that parameter estimate.

You can also use the BOUNDS statement to enforce simple relationships between parameters. For example, you can specify
     bounds b1 >= b2;
to specify that the B1 coefficient must be greater than or equal to the B2 parameter.
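
In context, a minimal sketch of that constraint (same data and model as before) looks like this:

proc nlin data=RegSim;
   parameters b0=0 b1=3 b2=1;       /* initial guess */
   bounds b1 >= b2;                 /* B1 must be at least as large as B2 */
   model Y = b0 + b1*x1 + b2*x2 + b1*x3 - 2*b2*x4;
   ods select ParameterEstimates;
run;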

More general linear constraints

The BOUNDS statement is for simple relationships. For more complicated relationships, you might need to reparameterize the model. For example, the BOUNDS statement does not accept the following syntax:
     bounds b1 >= 2*b2; /* INVALID SYNTAX! */
However, you can reparameterize the problem by introducing a new parameter C = 2*B2. You can then systematically substitute (C/2) everywhere that B2 appears. For example, the following statements show the constrained model in terms of the new parameter, C. You can use the ESTIMATE statement to find the estimates for the original parameter.

/* let c = 2*b2 be the new parameter. Then substitute c/2 for b2 in the MODEL statement: */
proc nlin data=RegSim;
   parameters b0=0 b1=3 c=2;         /* initial guess */
   bounds b1 >= c;
   model Y = b0 + b1*x1 + (c/2)*x2 + b1*x3 - c*x4;
   estimate 'b2' c/2;                /* estimate original parameter */
run;

The output from this procedure is not shown.

You can use a similar trick to handle linear constraints such as B1 + B2 ≥ 6. Move the B2 term to the right side of the inequality and define C = 6 – B2. The constraint becomes B1 ≥ C and on the MODEL statement you can substitute (6–C) everywhere that B2 appears, as follows:

/* Define c = 6 - b2 so that b2=6-c */
proc nlin data=RegSim;
   parameters b0=0 b1=5 c=5;
   bounds b1 >= c;
   model Y = b0 + b1*x1 + (6-c)*x2 + b1*x3 - 2*(6-c)*x4;
   estimate 'b2' 6-c;              /* estimate original parameter */
   ods select ParameterEstimates AdditionalEstimates;
run;

Summary

In summary, you can use the NLIN procedure to solve linear regression problems that have linear constraints among the coefficients. Each equality constraint enables you to eliminate one parameter in the MODEL statement. You can use the BOUNDS statement to specify simple inequality constraints. For more complicated constraints, you can reparameterize the model and again use the BOUNDS statement to specify the constraints. If desired, you can use the ESTIMATE statement to estimate the parameters in the original model.

The post Regression with inequality constraints on parameters appeared first on The DO Loop.

7月 292020
 

When you write a program that simulates data from a statistical model, you should always check that the simulation code is correct. One way to do this is to generate a large simulated sample, estimate the parameters in the simulated data, and make sure that the estimates are close to the parameters in the simulation. For regression models that include CLASS variables, this task requires a bit of thought. The problem is that categorical variables can have different parameterizations, and the parameterization determines the parameter estimates.

Recently I read a published paper that included a simulation in SAS of data from a logistic model. I noticed that the parameters in the simulation code did not match the parameter estimates in the example. I was initially confused, although I figured it out eventually. This article describes the "parameterization problem" and shows how to ensure that the parameters in the simulation code match the parameter estimates for the analysis.

I have previously written about how to incorporate classification variables into simulations of regression data. If you haven't read that article, it is a good place to begin. You can also read about how to simulate data from a general linear regression model, such as a logistic regression model.

Simulated explanatory variables with a classification variable

Before demonstrating the problem, let's simulate the explanatory variables. Each simulation will use the same data and the same model but change the way that the model is parameterized. The following SAS DATA step simulates two normal variables and a categorical "treatment" variable that has three levels (Trt=1, 2, or 3). The design is balanced in the treatment variable, which means that each treatment group has the same number of observations.

data SimExplanatory;
call streaminit(54321);
do i = 1 to 2000;                               /* number in each treatment group */
   do Trt = 1 to 3;                             /* Trt = 1, 2, and 3 */
      x1 = rand('Normal');                      /* x1 ~ N(0,1) */
      x2 = rand('Normal');                      /* x2 ~ N(0,1) */
      LinCont = -1.5 + 0.4*x1 - 0.7*x2;         /* offset = -1.5; params are 0.4 and -0.7 for x1 and x2 */
      output;
   end;
end;
drop i;
run;

The data set contains a variable called LinCont. The LinCont variable is a linear combination of the x1 and x2 variables and a constant offset. All models will use the same offset (-1.5) and the same parameters for x1 and x2 (0.4 and -0.7, respectively).

Data for a logistic regression model with a CLASS variable

When writing a simulation of data that satisfy a regression model, most programmers assign coefficients for each level of the treatment variable. For example, you might use the coefficients {3, 2, 1}, which means that the response for the group Trt=1 gets 3 additional units of response, that the group Trt=2 gets 2 additional units, and that the group Trt=3 gets only 1 additional unit. This is a valid way to specify the model, but when you compute the regression estimates, you will discover that they are not close to the parameters in the model. The following DATA step simulates logistic data and uses PROC LOGISTIC to fit a model to the simulated data.

data LogiSim_Attempt;
call streaminit(123);
set SimExplanatory;                      /* read the explanatory variables */
array TrtCoef[3] _temporary_ (3, 2, 1);  /* treatment effects */
/* a straightforward way to generate Y data from a logistic model */
eta = LinCont + TrtCoef[1]*(Trt=1) + TrtCoef[2]*(Trt=2) + TrtCoef[3]*(Trt=3);
p = logistic(eta);                       /* P(Y | explanatory vars) */
y = rand("Bernoulli", p);                /* random realization of Y */
drop eta p;
run;
 
title "Parameter Estimates Do Not Match Simulation Code";
ods select ParameterEstimates;
proc logistic data=LogiSim_Attempt plots=none;
   class Trt; 
   model y(event='1') = Trt x1 x2;
run;

Notice that the simulation uses the binary expressions (Trt=1), (Trt=2), and (Trt=3). For any value of Trt, exactly one of these expressions is 1 and the others are 0.

Notice that the parameter estimates for the levels of the Trt variable do not match the parameters in the model. The model specified parameters {3, 2, 1}, but the estimates are close to {1, 0, ???}, where '???' is used to indicate that the ParameterEstimates table does not include a row for the Trt=3 level. Furthermore, the estimate for the Intercept is not close to the parameter value, which is -1.5.

I see many simulations like this. The problem is that the parameterization used by the statistical analysis procedure (here, PROC LOGISTIC) is different from the model's parameterization. The next section shows how to write a simulation so that the parameters in the simulation code and the parameter estimates agree.

The 'reference' parameterization

You can't estimate all three coefficients of the categorical variable. However, you can estimate the differences between the levels. For example, you can estimate the average difference between the Trt=1 group and the Trt=3 group. One parameterization of regression models is the "reference parameterization." In the reference parameterization, one of the treatment levels is used as a reference level and you estimate the difference between the non-reference effects and the reference effect. The Intercept parameter is the sum of the offset value and the reference effect. The following simulation code simulates data by using a reference parameterization:

data LogiSim_Ref;
call streaminit(123);
set SimExplanatory;
/* assume reference or GLM parameterization */
array TrtCoef[3] _temporary_ (3, 2, 1);  /* treatment effects */
RefLevel = TrtCoef[3];                   /* use last level as reference */
eta = LinCont + RefLevel                 /* constant term is (Intercept + RefLevel) */
              + (TrtCoef[1]-RefLevel)*(Trt=1)   /* param = +2 */
              + (TrtCoef[2]-RefLevel)*(Trt=2)   /* param = +1 */
              + (TrtCoef[3]-RefLevel)*(Trt=3);  /* param =  0 */
p = logistic(eta);                       /* P(Y | explanatory vars) */
y = rand("Bernoulli", p);                /* random realization of Y */
drop eta p;
run;
 
title "Reference Parameterization";
ods select ParameterEstimates;
proc logistic data=LogiSim_Ref plots=none;
   class Trt / param=reference ref=LAST; /* reference parameterization */
   model y(event='1') = Trt x1 x2;
run;

In this simulation and analysis, the constant term in the model is "Intercept + RefLevel," which is -0.5. The parameter for the Trt=1 effect is +2, which is "TrtCoef[1]-RefLevel." Similarly, the parameter for the Trt=2 effect is +1, and the parameter for Trt=3 is 0. Notice that the parameter estimates from PROC LOGISTIC agree with these values because I specified the PARAM=REFERENCE option, which tells the procedure to use the same parameterization as the simulation code.

Notice that this model is the same as the previous model because we are simply adding the RefLevel to the constant term and subtracting it from the Trt effect.
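
A quick algebraic check (not in the original post) shows why. Write ci = TrtCoef[i]. Because exactly one indicator is 1, the sum (Trt=1) + (Trt=2) + (Trt=3) is identically 1, so
      RefLevel + (c1-RefLevel)*(Trt=1) + (c2-RefLevel)*(Trt=2) + (c3-RefLevel)*(Trt=3)
    = c1*(Trt=1) + c2*(Trt=2) + c3*(Trt=3) + RefLevel*(1 - (Trt=1) - (Trt=2) - (Trt=3))
    = c1*(Trt=1) + c2*(Trt=2) + c3*(Trt=3)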

The 'effect' parameterization

In a similar way, instead of subtracting a reference level, you can subtract the mean of the coefficients for the treatment effects. In general, you need to compute a weighted mean, where the weights are the number of observations for each level, but for the case of balanced treatments, you can just use the ordinary arithmetic mean of the coefficients.
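
For an unbalanced design, the following hedged sketch (not in the original post) shows one way to compute the weighted mean; the group sizes are hypothetical:

data _null_;
array TrtCoef[3] (3, 2, 1);            /* treatment coefficients */
array n[3] (2000, 1500, 500);          /* hypothetical group sizes */
wsum = 0;  ntot = 0;
do i = 1 to 3;
   wsum = wsum + n[i]*TrtCoef[i];      /* weight each coefficient by its group size */
   ntot = ntot + n[i];
end;
WeightedMean = wsum / ntot;
put WeightedMean=;                     /* use this value instead of the arithmetic mean */
run;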

For these data, the mean of the treatment effects is (3+2+1)/3 = 2. The following simulation adds the mean to the constant term and subtracts it from the Trt effects. This is the "effect parameterization":

data LogiSim_Effect;
call streaminit(123);
set SimExplanatory;
array TrtCoef[3] _temporary_ (3, 2, 1);  /* treatment effects */
/* assume effect parameterization, which is equivalent to using deviations from the mean:
array TrtDev[3] _temporary_ (+1, 0, -1);
*/
 
MeanTrt = 2;                             /* for balanced design, mean=(3+2+1)/3 */
eta = LinCont + MeanTrt                  /* est intercept = (Intercept + MeanTrt) */
              + (TrtCoef[1]-MeanTrt)*(Trt=1)   /* param = +1 */
              + (TrtCoef[2]-MeanTrt)*(Trt=2)   /* param =  0 */
              + (TrtCoef[3]-MeanTrt)*(Trt=3);  /* param = -1 */
p = logistic(eta);                       /* P(Y | explanatory vars) */
y = rand("Bernoulli", p);                /* random realization of Y */
drop eta p;
run;
 
title "Effect Parameterization";
ods select ParameterEstimates;
proc logistic data=LogiSim_effect plots=none;
   class Trt / param=effect ref=LAST; 
   model y(event='1') = Trt x1 x2;
run;

In this simulation and analysis, the constant term in the model is "Intercept + MeanTrt," which is 0.5. The parameters for the Trt levels are {+1, 0, -1}. Notice that the PROC LOGISTIC call uses the PARAM=EFFECT option, which tells the procedure to use the same effect parameterization. Consequently, the parameter estimates are close to the parameter values. (These are the same parameter estimates as in the first example, since PROC LOGISTIC uses the effect parameterization by default.)

If you use 0 as an offset value, then the Intercept parameter equals the mean effect of the treatment variable, which leads to a nice interpretation when there is only one classification variable and no interactions.

Summary

This article shows how to write the simulation code (the equation of the linear predictor) so that it matches the parameterization used to analyze the data. This article uses PROC LOGISTIC, but you can use the same trick for other procedures and analyses. When the simulation code and the analysis use the same parameterization of classification effects, it is easier to feel confident that your simulation code correctly simulates data from the model that you are analyzing.

The post Simulate regression models that incorporate CLASS parameterizations appeared first on The DO Loop.

6月 082020
 

A SAS customer asked how to specify interaction effects between a classification variable and a spline effect in a SAS regression procedure. There are at least two ways to do this. If the SAS procedure supports the EFFECT statement, you can build the interaction term in the MODEL statement. For procedures that do not support the EFFECT statement, you can generate the interaction effects yourself. Because the customer's question was specifically about the GAMPL procedure, which fits generalized additive models, I also show a third technique, which enables you to model interactions in the GAMPL procedure.

To illustrate the problem, consider the following SAS DATA step, which simulates data from two different response functions:

data Have;
call streaminit(1);
Group="A"; 
do x = 1 to 10 by .1;
   y =    2*x + sin(x) + rand("Normal"); output;
end;
Group="B";
do x = 1 to 10 by .1;
   y = 10 - x - sin(x) + rand("Normal"); output;
end;
run;

The graph shows the data and overlays a regression model. The model predicts the response for each level of the Group variable. Notice that the two curves have different shapes, which means that there is an interaction between the Group variable and the x variable. The goal of this article is to show how to use a spline effect to predict the response for each group.

Use the grammar of the procedure to form interactions

Many SAS regression procedures support the EFFECT statement and the CLASS statement, and they enable you to specify interactions on the MODEL statement. Examples include the GLIMMIX, GLMSELECT, LOGISTIC, QUANTREG, and ROBUSTREG procedures. For example, the following call to PROC GLMSELECT specifies several model effects by using the "stars and bars" syntax:

  • The syntax Group | x includes the classification effect (Group), a linear effect (x), and an interaction effect (Group*x).
  • The syntax Group * spl includes an interaction effect between the classification variable and the spline.
title "GLMSELECT Model with Spline and Interaction Effects";
ods select none;
proc glmselect data=Have
        outdesign(addinputvars fullmodel)=SplineBasis; /* optional: write design matrix to data set */
   class Group;
   effect spl = spline(x);
   model y = Group|x  Group*spl / selection=none;
   output out=SplineOut predicted=p;            /* output predicted values */
quit;
ods select all;
 
%macro PlotInteraction(dsname);
   proc sgplot data=&dsname;
      scatter x=x y=y / group=Group;
      series x=x y=p / group=Group curvelabel;
      keylegend / location=inside position=E;
   run;
%mend;
%PlotInteraction(SplineOut);

The predicted curves are shown in the previous section. I wrapped the call to PROC SGPLOT in a macro so that I can create the same plot for other models.

Output the design matrix for the interactions

Some SAS procedures do not support the EFFECT statement. For these procedures, you cannot form effects like Group*spline(x) within the procedure. However, the GLMSELECT procedure supports the OUTDESIGN= option, which writes the design matrix to a SAS data set. I used the option in the previous section to create the SplineBasis data set. The data set includes variables for the spline basis and for the interaction between the Group variable and the spline basis. Therefore, you can use the splines and interactions with splines in other SAS procedures.

For example, the GLM procedure does not support the EFFECT statement, so you cannot generate a spline basis within the procedure. However, you can read the design matrix from the previous call to PROC GLMSELECT. The interaction effects are named spl_Group_1_A, spl_Group_1_B, spl_Group_2_A, ..., where the suffix "_i_j" indicates the interaction effect between the i_th spline basis and the j_th level of the Group variable. You can type the names of the design variables, or you can use the "colon notation" to match all variables that begin with the prefix "spl_Group". The following call to PROC GLM computes the same model as in the previous section:

proc glm data=SplineBasis;
   class Group;
   model y = Group|x  spl_Group: ;           /* spl_Group: => spl_Group_1_A,  spl_Group_1_B, ... */
   output out=GLMOut predicted=p;            /* output predicted values */
quit;
 
/* show that the predicted values are the same */
data Comp;
   merge SplineOut GLMOut(keep=p rename=(p=p2));
   diff = p - p2;
run;
proc means data=comp Mean Min Max;  var diff;  run;

The output of PROC MEANS shows that the two models are the same. There is no difference between the predicted values from PROC GLM (which reads the design matrix) and the values from PROC GLMSELECT (which reads the raw data).

The splines of the interactions versus the interactions of the splines

Some nonparametric regression procedures, such as the GAMPL procedure, have their own syntax to generate spline effects. In fact, PROC GAMPL uses thin-plate splines, which are different from the splines that are supported by the EFFECT statement. Recently, a SAS customer asked how he could model interaction terms in PROC GAMPL.

The GAMPL procedure supports semiparametric models. In the parametric portion of the model, it is easy to specify interactions by using the standard "stars and bars" notation in SAS. However, the SPLINE() transformation in PROC GAMPL does not support interactions between a classification variable and a continuous variable. To be specific, consider the following semiparametric model:

title "No Interaction Between Classification and Spline Effects";
proc gampl data=Have; *plots=components;
   class Group;
   model Y = param(Group | x)     /* parametric terms */
             spline(x);           /* nonparametric, but no interaction */
   output out=GamPLOut pred=p r=r;
   id Y X Group;
run;
 
%PlotInteraction(GamPLOut);

The output shows that the model does not capture the nonlinear features of the data. This is because the spline term is computed for all values of x, not separately for each group. When the groups are combined, the spline effect is essentially linear, which is probably not the best model for these data.

Unfortunately, in SAS/STAT 15.1, PROC GAMPL does not support a syntax such as "spline(Group*x)", which would enable the shape of the nonparametric curve to vary between levels of the Group variable. In fact, the 15.1 documentation states, "only continuous variables (not classification variables) can be specified in spline-effects."

However, if your goal is to use a flexible model to fit the data, there is a model that might work. Instead of the interaction between the Group variable and spline(x), you could use the spline effects for the interaction between the Group variable and x. These are not the same models: the interaction with the splines is not equal to the spline of the interactions! However, let's see how to fit this model to these data.

The idea is to form the interaction effect Group*x in a separate step. You can use the design matrix (SplineBasis) that was created in the first section, but to make the process as transparent as possible, I am going to manually generate the variables for the interaction effect Group*x:

data Interact; 
set Have; 
x_Group_A = x*(Group='A'); 
x_Group_B = x*(Group='B');
run;

You can use the spline effects for the variables x_Group_A and x_Group_B in regression models. The following model is not the same as the previous models, but it is a flexible model that seems to fit these data:

title "Interaction Between Classification and Spline Effects";
proc gampl data=Interact plots=components;
   class Group;
   model Y = param(Group | x) 
             spline(x_Group_A) spline(x_Group_B);  /* splines of interactions */
   output out=GamPLOut pred=p;
   id Y X Group;
run;
 
title "GAMPL with Interaction Effects";
%PlotInteraction(GamPLOut);

The graph shows that (visually) the model seems to capture the undulations in the data as well as the earlier model. Although PROC GAMPL does not currently support interactions between classification variables and spline effects, this trick provides a way to model the data.

The post Interactions with spline effects in regression models appeared first on The DO Loop.