Rick Wicklin

December 2, 2020
 

A SAS customer asked a great question: "I have parameter estimates for a logistic regression model that I computed by using multiple imputations. How do I use these parameter estimates to score new observations and to visualize the model? PROC LOGISTIC can do the computation I want, but how do I tell it to use the externally computed estimates?"

This article presents a solution for PROC LOGISTIC. At the end of this article, I present a few tips for other SAS procedures.

Here's the main idea: PROC LOGISTIC supports an INEST= option that you can use to specify initial values of the parameters. It also supports the MAXITER=0 option on the MODEL statement, which tells the procedure not to perform any iterations to try to improve the parameter estimates. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want. Furthermore, you can use the STORE statement to store the model and use PROC PLM to perform scoring, visualization, and other post-fitting analyses.

I have used this technique previously to compute parameter estimates in PROC HPLOGISTIC and use them in PROC LOGISTIC to estimate odds ratios, the covariance matrix of the parameters, and other inferential quantities that are not available in PROC HPLOGISTIC. In a similar way, PROC LOGISTIC can construct ROC curves for predictions that were made outside of PROC LOGISTIC.

Produce parameter estimates by using PROC MIANALYZE

As a motivating example, let's create parameter estimates by using multiple imputations. The documentation for PROC MIANALYZE has an example of using PROC MI and PROC MIANALYZE to estimate the parameters for a logistic model. The following data and analysis are from that example. The data are lengths and widths of two species of fish (perch and parkki). Missing values are artificially introduced. A scatter plot of the data is shown.

data Fish2;
   title 'Fish Measurement Data';
   input Species $ Length Width @@;
   datalines;
Parkki  16.5  2.3265    Parkki  17.4  2.3142    .      19.8   .
Parkki  21.3  2.9181    Parkki  22.4  3.2928    .      23.2  3.2944
Parkki  23.2  3.4104    Parkki  24.1  3.1571    .      25.8  3.6636
Parkki  28.0  4.1440    Parkki  29.0  4.2340    Perch   8.8  1.4080
.       14.7  1.9992    Perch   16.0  2.4320    Perch  17.2  2.6316
Perch   18.5  2.9415    Perch   19.2  3.3216    .      19.4   .
Perch   20.2  3.0502    Perch   20.8  3.0368    Perch  21.0  2.7720
Perch   22.5  3.5550    Perch   22.5  3.3075    .      22.5   .
Perch   22.8  3.5340    .       23.5   .        Perch  23.5  3.5250
Perch   23.5  3.5250    Perch   23.5  3.5250    Perch  23.5  3.9950
.       24.0   .        Perch   24.0  3.6240    Perch  24.2  3.6300
Perch   24.5  3.6260    Perch   25.0  3.7250    .      25.5  3.7230
Perch   25.5  3.8250    Perch   26.2  4.1658    Perch  26.5  3.6835
.       27.0  4.2390    Perch   28.0  4.1440    Perch  28.7  5.1373
.       28.9  4.3350    .       28.9   .        .      28.9  4.5662
Perch   29.4  4.2042    Perch   30.1  4.6354    Perch  31.6  4.7716
Perch   34.0  6.0180    .       36.5  6.3875    .      37.3  7.7957
.       39.0   .        .       38.3   .        Perch  39.4  6.2646
Perch   39.3  6.3666    Perch   41.4  7.4934    Perch  41.4  6.0030
Perch   41.3  7.3514    .       42.3   .        Perch  42.5  7.2250
Perch   42.4  7.4624    Perch   42.5  6.6300    Perch  44.6  6.8684
Perch   45.2  7.2772    Perch   45.5  7.4165    Perch  46.0  8.1420
Perch   46.6  7.5958
;
 
proc format;
value $FishFmt
    " " = "Unknown";
run;
 
proc sgplot data=Fish2;
   format Species $FishFmt.;
   styleattrs DATACONTRASTCOLORS=(DarkRed LightPink DarkBlue);
   scatter x=Length y=Width / group=Species markerattrs=(symbol=CircleFilled);
run;

The analyst wants to use PROC LOGISTIC to create a model that uses Length and Width to predict whether a fish is a perch or a parkki. The scatter plot shows that the parkki (dark red) tend to be less wide than the perch of the same length. For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red). For some fish in the graph, the species is not known.

Because the data contains missing values, the analyst uses PROC MI to create 25 imputations of the missing values, uses PROC LOGISTIC to produce 25 sets of parameter estimates, and uses PROC MIANALYZE to combine the estimates into a single set of parameter estimates. See the documentation for a discussion.

/* Example from the MIANALYZE documentation 
   "Reading Logistic Model Results from a PARMS= Data Set"  https://bit.ly/394VlI7
*/
proc mi data=Fish2 seed=1305417 out=outfish2;
   class Species;
   monotone logistic( Species= Length Width);
   var Length Width Species;
run;
 
ods select none; options nonotes;
proc logistic data=outfish2;
   class Species;
   model Species= Length Width / covb;
   by _Imputation_;
   ods output ParameterEstimates=lgsparms;
run;
ods select all; options notes;
 
proc mianalyze parms=lgsparms;
   modeleffects Intercept Length Width;
   ods output ParameterEstimates=MI_PE;
run;
 
proc print data=MI_PE noobs; 
  var Parm Estimate;
run;

The parameter estimates from PROC MIANALYZE are shown. The question is: How can you use PROC LOGISTIC and PROC PLM to score and visualize this model, given that the estimates are produced outside of PROC LOGISTIC?
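For reference, PROC MIANALYZE combines the per-imputation estimates by using Rubin's rules: the pooled estimate is the mean of the individual estimates, and the total variance combines the within-imputation and between-imputation variances. The following Python sketch shows the combining formulas on made-up numbers (these are not the estimates from this example):

```python
import math

def pool(estimates, std_errors):
    """Combine m estimates of one parameter by Rubin's rules."""
    m = len(estimates)
    qbar = sum(estimates) / m                            # pooled point estimate
    ubar = sum(se**2 for se in std_errors) / m           # within-imputation variance
    b = sum((q - qbar)**2 for q in estimates) / (m - 1)  # between-imputation variance
    t = ubar + (1 + 1/m) * b                             # total variance
    return qbar, math.sqrt(t)

# hypothetical estimates and standard errors from m=3 imputations
est, se = pool([1.15, 1.21, 1.18], [0.20, 0.22, 0.21])
print(est, se)
```

The pooled standard error is always at least as large as the average within-imputation standard error, which reflects the extra uncertainty due to the missing values.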

Get PROC LOGISTIC to use external estimates

As mentioned earlier, a solution to this problem is to use the INEST= option on the PROC LOGISTIC statement in conjunction with the MAXITER=0 option on the MODEL statement. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want, and you can use the STORE statement to create an item store that can be read by PROC PLM to perform scoring and visualization.

You can create the INEST= data set by hand, but it is easier to use PROC LOGISTIC to create an OUTEST= data set and then merely change the values for the parameter estimates, as done in the following example:

/* 1. Use PROC LOGISTIC to create an OUTEST= data set */
proc logistic data=Fish2 outest=OutEst noprint;
   class Species;
   model Species= Length Width;
run;
 
/* 2. replace the values of the parameter estimates with different values */
data inEst;
set outEst;
Intercept = -0.130560;
Length = 1.169782;
Width = -8.284998;
run;
 
/* 3. Use the INEST= data set and MAXITER=0 to get PROC LOGISTIC to create
      a model. Use the STORE statement to write an item store.
      https://blogs.sas.com/content/iml/2019/06/26/logistic-estimates-from-hplogistic.html
*/
proc logistic data=Fish2 inest=InEst;        /* read in external model */
   model Species= Length Width / maxiter=0;  /* do not refine model fit */
   effectplot contour / ilink;
   store LogiModel;
run;

The contour plot is part of the output from PROC LOGISTIC. You could also request an ROC curve, odds ratios, and other statistics. The contour plot visualizes the regression model. For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red).

Scoring the model

Because PROC LOGISTIC writes an item store for the model, you can use PROC PLM to perform a variety of scoring tasks, visualization, and hypothesis tests. The following statements create a scoring data set and use PROC PLM to score the model and estimate the probability that each fish is a parkki:

/* 4. create a score data set */
data NewFish;
input Length Width;
datalines;
17.0  2.7
18.1  2.1
21.3  2.9
22.4  3.0
29.1  4.3
;
 
/* 5. predictions on the DATA scale */
proc plm restore=LogiModel noprint;
   score data=NewFish out=ScoreILink
         predicted lclm uclm / ilink; /* ILINK gives probabilities */
run;
 
proc print data=ScoreILink; run;

According to the model, the first and fifth fish are probably perch. The second, third, and fourth fish are predicted to be parkki, although the 95% confidence intervals indicate that you should not be too confident in the predictions for the third and fourth observations.
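As a sanity check, you can evaluate the logistic formula directly. The following Python sketch hard-codes the pooled estimates from the InEst data set and applies the inverse logit to the NewFish observations; it assumes (as the scoring output indicates) that the model predicts the probability of the first ordered response level, which is Parkki:

```python
import math

# pooled parameter estimates from the InEst data set
b0, bL, bW = -0.130560, 1.169782, -8.284998

new_fish = [(17.0, 2.7), (18.1, 2.1), (21.3, 2.9), (22.4, 3.0), (29.1, 4.3)]

for length, width in new_fish:
    eta = b0 + bL*length + bW*width        # linear predictor (link scale)
    p_parkki = 1 / (1 + math.exp(-eta))    # inverse link gives a probability
    print(f"Length={length:4.1f} Width={width:3.1f}  P(Parkki)={p_parkki:.3f}")
```

The first and fifth fish get small probabilities of being parkki (so they are probably perch), the second gets a large probability, and the third and fourth fall in a middle range, which is consistent with the wide confidence intervals for those observations.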

Thoughts on other regression procedures

Unfortunately, not every regression procedure in SAS is as flexible as PROC LOGISTIC. In many cases, it might be difficult or impossible to "trick" a SAS regression procedure into analyzing a model that was produced externally. Here are a few thoughts from me and from one of my colleagues. I didn't have time to fully investigate these ideas, so caveat emptor!

  • For least squares models, the venerable PROC SCORE can handle the scoring. It can read an OUTEST/INEST-style data set, just like in the PROC LOGISTIC example. If you have CLASS variables or other constructed effects (for example, spline effects), you will have to use columns of a design matrix as variables.
  • Many SAS regression procedures support the CODE statement, which writes DATA step code to score the model. Because the CODE statement writes a text file, you can edit the text file and replace the parameter estimates in the file with different estimates. However, the CODE statement does not handle constructed effects.
  • The MODEL statement of PROC GENMOD supports the INITIAL= and INTERCEPT= options. Therefore, you ought to be able to specify the initial values for the parameter estimates. PROC GENMOD also supports the MAXITER=0 option. Be aware that the values on the INITIAL= option must match the order in which the estimates appear in the ParameterEstimates table.
  • Some nonlinear modeling procedures (such as PROC NLIN and PROC NLMIXED) support ways to specify the initial values for parameters. If you specify TECH=NONE, then the procedure does not perform any optimization. These procedures also support the MAXITER= option.

Summary

This article shows how to score parametric regression models when the parameter estimates are not fit by the usual procedures. For example, multiple imputations can produce a set of parameter estimates. In PROC LOGISTIC, you can use an INEST= data set to read the estimates and use the MAXITER=0 option to suppress fitting. You can use the STORE statement to store the model and use PROC PLM to perform scoring and visualization. Other procedures have similar options, but there is not a single method that works for all SAS regression procedures.

If you use any of the ideas in this article, let me know how they work by leaving a comment. If you have an alternate way to trick SAS regression procedures into using externally supplied estimates, let me know that as well.

The post How to score a logistic regression model that was not fit by PROC LOGISTIC appeared first on The DO Loop.

12月 022020
 

A SAS customer asked a great question: "I have parameter estimates for a logistic regression model that I computed by using multiple imputations. How do I use these parameter estimates to score new observations and to visualize the model? PROC LOGISTIC can do the computation I want, but how do I tell it to use the externally computed estimates?"

This article presents a solution for PROC LOGISTIC. At the end of this article, I present a few tips for other SAS procedures.

Here's the main idea: PROC LOGISTIC supports an INEST= option that you can use to specify initial values of the parameters. It also supports the MAXITER=0 option on the MODEL statement, which tells the procedure not to perform any iterations to try to improve the parameter estimates. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want. Furthermore, you can use the STORE statement to store the model and use PROC PLM to perform scoring, visualization, and other post-fitting analyses.

I have used this technique previously to compute parameter estimates in PROC HPLOGISTIC and use them in PROC LOGISTIC to estimate odds ratios, the covariance matrix of the parameters, and other inferential quantities that are not available in PROC HPLOGISTIC. In a similar way, PROC LOGISTIC can construct ROC curves for predictions that were made outside of PROC LOGISTIC.

Produce parameter estimates by using PROC MIANALYZE

As a motivating example, let's create parameter estimates by using multiple imputations. The documentation for PROC MIANALYZE has an example of using PROC MI and PROC MIANALYZE to estimate the parameters for a logistic model. The following data and analysis are from that example. The data are lengths and widths of two species of fish (perch and parkki). Missing values are artificially introduced. A scatter plot of the data is shown.

data Fish2;
   title 'Fish Measurement Data';
   input Species $ Length Width @@;
   datalines;
Parkki  16.5  2.3265    Parkki  17.4  2.3142    .      19.8   .
Parkki  21.3  2.9181    Parkki  22.4  3.2928    .      23.2  3.2944
Parkki  23.2  3.4104    Parkki  24.1  3.1571    .      25.8  3.6636
Parkki  28.0  4.1440    Parkki  29.0  4.2340    Perch   8.8  1.4080
.       14.7  1.9992    Perch   16.0  2.4320    Perch  17.2  2.6316
Perch   18.5  2.9415    Perch   19.2  3.3216    .      19.4   .
Perch   20.2  3.0502    Perch   20.8  3.0368    Perch  21.0  2.7720
Perch   22.5  3.5550    Perch   22.5  3.3075    .      22.5   .
Perch   22.8  3.5340    .       23.5   .        Perch  23.5  3.5250
Perch   23.5  3.5250    Perch   23.5  3.5250    Perch  23.5  3.9950
.       24.0   .        Perch   24.0  3.6240    Perch  24.2  3.6300
Perch   24.5  3.6260    Perch   25.0  3.7250    .      25.5  3.7230
Perch   25.5  3.8250    Perch   26.2  4.1658    Perch  26.5  3.6835
.       27.0  4.2390    Perch   28.0  4.1440    Perch  28.7  5.1373
.       28.9  4.3350    .       28.9   .        .      28.9  4.5662
Perch   29.4  4.2042    Perch   30.1  4.6354    Perch  31.6  4.7716
Perch   34.0  6.0180    .       36.5  6.3875    .      37.3  7.7957
.       39.0   .        .       38.3   .        Perch  39.4  6.2646
Perch   39.3  6.3666    Perch   41.4  7.4934    Perch  41.4  6.0030
Perch   41.3  7.3514    .       42.3   .        Perch  42.5  7.2250
Perch   42.4  7.4624    Perch   42.5  6.6300    Perch  44.6  6.8684
Perch   45.2  7.2772    Perch   45.5  7.4165    Perch  46.0  8.1420
Perch   46.6  7.5958
;
 
proc format;
value $FishFmt
    " " = "Unknown";
run;
 
proc sgplot data=Fish2;
   format Species $FishFmt.;
   styleattrs DATACONTRASTCOLORS=(DarkRed LightPink DarkBlue);
   scatter x=Length y=Width / group=Species markerattrs=(symbol=CircleFilled);
run;

The analyst wants to use PROC LOGISTIC to create a model that uses Length and Width to predict whether a fish is perch or parkki. The scatter plot shows that the parkki (dark red) tend to be less wide than the perch of the same length For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red). For some fish in the graph, the species is not known.

Because the data contains missing values, the analyst uses PROC MI to run 25 missing value imputations, uses PROC LOGISTIC to produce 25 sets of parameter estimates, and uses PROC MI to combine the estimates into a single set of parameter estimates. See the documentation for a discussion.

/* Example from the MIANALYZE documentation 
   "Reading Logistic Model Results from a PARMS= Data Set"  https://bit.ly/394VlI7
*/
proc mi data=Fish2 seed=1305417 out=outfish2;
   class Species;
   monotone logistic( Species= Length Width);
   var Length Width Species;
run;
 
ods select none; options nonotes;
proc logistic data=outfish2;
   class Species;
   model Species= Length Width / covb;
   by _Imputation_;
   ods output ParameterEstimates=lgsparms;
run;
ods select all; options notes;
 
proc mianalyze parms=lgsparms;
   modeleffects Intercept Length Width;
   ods output ParameterEstimates=MI_PE;
run;
 
proc print data=MI_PE noobs; 
  var Parm Estimate;
run;

The parameter estimates from PROC MIANALYZE are shown. The question is: How can you use PROC LOGISTIC and PROC PLM to score and visualize this model, given that the estimates are produced outside of PROC LOGISTIC?

Get PROC LOGISTIC to use external estimates

As mentioned earlier, a solution to this problem is to use the INEST= option on the PROC LOGISTIC statement in conjunction with the MAXITER=0 option on the MODEL statement. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want, and you can use the STORE statement to create an item store that can be read by PROC PLM to perform scoring and visualization.

You can create the INEST= data set by hand, but it is easier to use PROC LOGISTIC to create an OUTEST= data set and then merely change the values for the parameter estimates, as done in the following example:

/* 1. Use PROC LOGISTIC to create an OUTEST= data set */
proc logistic data=Fish2 outest=OutEst noprint;
   class Species;
   model Species= Length Width;
run;
 
/* 2. replace the values of the parameter estimates with different values */
data inEst;
set outEst;
Intercept = -0.130560;
Length = 1.169782;
Width = -8.284998;
run;
 
/* 3. Use the INEST= data set and MAXITER=0 to get PROC LOGISTIC to create
      a model. Use the STORE statement to write an item store.
      https://blogs.sas.com/content/iml/2019/06/26/logistic-estimates-from-hplogistic.html
*/
proc logistic data=Fish2 inest=InEst;        /* read in extermal model */
   model Species= Length Width / maxiter=0;  /* do not refine model fit */
   effectplot contour / ilink;
   store LogiModel;
run;

The contour plot is part of the output from PROC LOGISTIC. You could also request an ROC curve, odds ratios, and other statistics. The contour plot visualizes the regression model. For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red).

Scoring the model

Because PROC LOGISTIC writes an item store for the model, you can use PROC PLM to perform a variety of scoring tasks, visualization, and hypothesis tests. The following statements create a scoring data set and use PROC PLM to score the model and estimate the probability that each fish is a parkki:

/* 4. create a score data set */
data NewFish;
input Length Width;
datalines;
17.0  2.7
18.1  2.1
21.3  2.9
22.4  3.0
29.1  4.3
;
 
/* 5. predictions on the DATA scale */
proc plm restore=LogiModel noprint;
   score data=NewFish out=ScoreILink
         predicted lclm uclm / ilink; /* ILINK gives probabilities */
run;
 
proc print data=ScoreILink; run;

According to the model, the first and fifth fish are probably perch. The second, third, and fourth fish are predicted to be parkki, although the 95% confidence intervals indicate that you should not be too confident in the predictions for the third and fourth observations.

Thoughts on other regression procedures

Unfortunately, not every regression procedure in SAS is as flexible as PROC LOGISTIC. In many cases, it might be difficult or impossible to "trick" a SAS regression procedure into analyzing a model that was produced externally. Here are a few thoughts from me and from one of my colleagues. I didn't have time to fully investigate these ideas, so caveat emptor!

  • For least squares models, the venerable PROC SCORE can handle the scoring. It can read an OUTEST/INEST-style data set, just like in the PROC LOGISTIC example. If you have CLASS variables or other constructed effects (for example, spline effects), you will have to use columns of a design matrix as variables.
  • Many SAS regression procedures support the CODE statement, which writes DATA step code to score the model. Because the CODE statement writes a text file, you can edit the text file and replace the parameter estimates in the file with different estimates. However, the CODE statement does not handle constructed effects.
  • The MODEL statement of PROC GENMOD supports the INITIAL= and INTERCEPT= options. Therefore, you ought to be able to specify the initial values for parameter estimates. PROC GENMOD also supports the MAXITER=0 option. Be aware that the values on the INITIAL= option must match the order that the estimates appear in the ParameterEstimates table.
  • Some nonlinear modeling procedures (such as PROC NLIN and PROC NLMIXED) support ways to specify the initial values for parameters. If you specify TECH=NONE, then the procedure does not perform any optimization. These procedures also support the MAXITER= option.

Summary

This article shows how to score parametric regression models when the parameter estimates are not fit by the usual procedures. For example, multiple imputations can produce a set of parameter estimates. In PROC LOGISTIC, you can use an INEST= data set to read the estimates and use the MAXITER=0 option to suppress fitting. You can use the STORE statement to store the model and use PROC PLM to perform scoring and visualization. Other procedures have similar options, but there is not a single method that works for all SAS regression procedures.

If you use any of the ideas in this article, let me know how they work by leaving a comment. If you have an alternate way to trick SAS regression procedures into using externally supplied estimates, let me know that as well.

The post How to score a logistic regression model that was not fit by PROC LOGISTIC appeared first on The DO Loop.

12月 022020
 

A SAS customer asked a great question: "I have parameter estimates for a logistic regression model that I computed by using multiple imputations. How do I use these parameter estimates to score new observations and to visualize the model? PROC LOGISTIC can do the computation I want, but how do I tell it to use the externally computed estimates?"

This article presents a solution for PROC LOGISTIC. At the end of this article, I present a few tips for other SAS procedures.

Here's the main idea: PROC LOGISTIC supports an INEST= option that you can use to specify initial values of the parameters. It also supports the MAXITER=0 option on the MODEL statement, which tells the procedure not to perform any iterations to try to improve the parameter estimates. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want. Furthermore, you can use the STORE statement to store the model and use PROC PLM to perform scoring, visualization, and other post-fitting analyses.

I have used this technique previously to compute parameter estimates in PROC HPLOGISTIC and use them in PROC LOGISTIC to estimate odds ratios, the covariance matrix of the parameters, and other inferential quantities that are not available in PROC HPLOGISTIC. In a similar way, PROC LOGISTIC can construct ROC curves for predictions that were made outside of PROC LOGISTIC.

Produce parameter estimates by using PROC MIANALYZE

As a motivating example, let's create parameter estimates by using multiple imputations. The documentation for PROC MIANALYZE has an example of using PROC MI and PROC MIANALYZE to estimate the parameters for a logistic model. The following data and analysis are from that example. The data are lengths and widths of two species of fish (perch and parkki). Missing values are artificially introduced. A scatter plot of the data is shown.

data Fish2;
   title 'Fish Measurement Data';
   input Species $ Length Width @@;
   datalines;
Parkki  16.5  2.3265    Parkki  17.4  2.3142    .      19.8   .
Parkki  21.3  2.9181    Parkki  22.4  3.2928    .      23.2  3.2944
Parkki  23.2  3.4104    Parkki  24.1  3.1571    .      25.8  3.6636
Parkki  28.0  4.1440    Parkki  29.0  4.2340    Perch   8.8  1.4080
.       14.7  1.9992    Perch   16.0  2.4320    Perch  17.2  2.6316
Perch   18.5  2.9415    Perch   19.2  3.3216    .      19.4   .
Perch   20.2  3.0502    Perch   20.8  3.0368    Perch  21.0  2.7720
Perch   22.5  3.5550    Perch   22.5  3.3075    .      22.5   .
Perch   22.8  3.5340    .       23.5   .        Perch  23.5  3.5250
Perch   23.5  3.5250    Perch   23.5  3.5250    Perch  23.5  3.9950
.       24.0   .        Perch   24.0  3.6240    Perch  24.2  3.6300
Perch   24.5  3.6260    Perch   25.0  3.7250    .      25.5  3.7230
Perch   25.5  3.8250    Perch   26.2  4.1658    Perch  26.5  3.6835
.       27.0  4.2390    Perch   28.0  4.1440    Perch  28.7  5.1373
.       28.9  4.3350    .       28.9   .        .      28.9  4.5662
Perch   29.4  4.2042    Perch   30.1  4.6354    Perch  31.6  4.7716
Perch   34.0  6.0180    .       36.5  6.3875    .      37.3  7.7957
.       39.0   .        .       38.3   .        Perch  39.4  6.2646
Perch   39.3  6.3666    Perch   41.4  7.4934    Perch  41.4  6.0030
Perch   41.3  7.3514    .       42.3   .        Perch  42.5  7.2250
Perch   42.4  7.4624    Perch   42.5  6.6300    Perch  44.6  6.8684
Perch   45.2  7.2772    Perch   45.5  7.4165    Perch  46.0  8.1420
Perch   46.6  7.5958
;
 
proc format;
value $FishFmt
    " " = "Unknown";
run;
 
proc sgplot data=Fish2;
   format Species $FishFmt.;
   styleattrs DATACONTRASTCOLORS=(DarkRed LightPink DarkBlue);
   scatter x=Length y=Width / group=Species markerattrs=(symbol=CircleFilled);
run;

The analyst wants to use PROC LOGISTIC to create a model that uses Length and Width to predict whether a fish is perch or parkki. The scatter plot shows that the parkki (dark red) tend to be less wide than the perch of the same length For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red). For some fish in the graph, the species is not known.

Because the data contains missing values, the analyst uses PROC MI to run 25 missing value imputations, uses PROC LOGISTIC to produce 25 sets of parameter estimates, and uses PROC MI to combine the estimates into a single set of parameter estimates. See the documentation for a discussion.

/* Example from the MIANALYZE documentation 
   "Reading Logistic Model Results from a PARMS= Data Set"  https://bit.ly/394VlI7
*/
proc mi data=Fish2 seed=1305417 out=outfish2;
   class Species;
   monotone logistic( Species= Length Width);
   var Length Width Species;
run;
 
ods select none; options nonotes;
proc logistic data=outfish2;
   class Species;
   model Species= Length Width / covb;
   by _Imputation_;
   ods output ParameterEstimates=lgsparms;
run;
ods select all; options notes;
 
proc mianalyze parms=lgsparms;
   modeleffects Intercept Length Width;
   ods output ParameterEstimates=MI_PE;
run;
 
proc print data=MI_PE noobs; 
  var Parm Estimate;
run;

The parameter estimates from PROC MIANALYZE are shown. The question is: How can you use PROC LOGISTIC and PROC PLM to score and visualize this model, given that the estimates are produced outside of PROC LOGISTIC?

Get PROC LOGISTIC to use external estimates

As mentioned earlier, a solution to this problem is to use the INEST= option on the PROC LOGISTIC statement in conjunction with the MAXITER=0 option on the MODEL statement. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want, and you can use the STORE statement to create an item store that can be read by PROC PLM to perform scoring and visualization.

You can create the INEST= data set by hand, but it is easier to use PROC LOGISTIC to create an OUTEST= data set and then merely change the values for the parameter estimates, as done in the following example:

/* 1. Use PROC LOGISTIC to create an OUTEST= data set */
proc logistic data=Fish2 outest=OutEst noprint;
   class Species;
   model Species= Length Width;
run;
 
/* 2. replace the values of the parameter estimates with different values */
data inEst;
set outEst;
Intercept = -0.130560;
Length = 1.169782;
Width = -8.284998;
run;
 
/* 3. Use the INEST= data set and MAXITER=0 to get PROC LOGISTIC to create
      a model. Use the STORE statement to write an item store.
      https://blogs.sas.com/content/iml/2019/06/26/logistic-estimates-from-hplogistic.html
*/
proc logistic data=Fish2 inest=InEst;        /* read in extermal model */
   model Species= Length Width / maxiter=0;  /* do not refine model fit */
   effectplot contour / ilink;
   store LogiModel;
run;

The contour plot is part of the output from PROC LOGISTIC. You could also request an ROC curve, odds ratios, and other statistics. The contour plot visualizes the regression model. For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red).

Scoring the model

Because PROC LOGISTIC writes an item store for the model, you can use PROC PLM to perform a variety of scoring tasks, visualization, and hypothesis tests. The following statements create a scoring data set and use PROC PLM to score the model and estimate the probability that each fish is a parkki:

/* 4. create a score data set */
data NewFish;
input Length Width;
datalines;
17.0  2.7
18.1  2.1
21.3  2.9
22.4  3.0
29.1  4.3
;
 
/* 5. predictions on the DATA scale */
proc plm restore=LogiModel noprint;
   score data=NewFish out=ScoreILink
         predicted lclm uclm / ilink; /* ILINK gives probabilities */
run;
 
proc print data=ScoreILink; run;

According to the model, the first and fifth fish are probably perch. The second, third, and fourth fish are predicted to be parkki, although the 95% confidence intervals indicate that you should not be too confident in the predictions for the third and fourth observations.

Thoughts on other regression procedures

Unfortunately, not every regression procedure in SAS is as flexible as PROC LOGISTIC. In many cases, it might be difficult or impossible to "trick" a SAS regression procedure into analyzing a model that was produced externally. Here are a few thoughts from me and from one of my colleagues. I didn't have time to fully investigate these ideas, so caveat emptor!

  • For least squares models, the venerable PROC SCORE can handle the scoring. It can read an OUTEST/INEST-style data set, just like in the PROC LOGISTIC example. If you have CLASS variables or other constructed effects (for example, spline effects), you will have to use columns of a design matrix as variables.
  • Many SAS regression procedures support the CODE statement, which writes DATA step code to score the model. Because the CODE statement writes a text file, you can edit the text file and replace the parameter estimates in the file with different estimates. However, the CODE statement does not handle constructed effects.
  • The MODEL statement of PROC GENMOD supports the INITIAL= and INTERCEPT= options. Therefore, you ought to be able to specify the initial values for parameter estimates. PROC GENMOD also supports the MAXITER=0 option. Be aware that the values on the INITIAL= option must match the order that the estimates appear in the ParameterEstimates table.
  • Some nonlinear modeling procedures (such as PROC NLIN and PROC NLMIXED) support ways to specify the initial values for parameters. If you specify TECH=NONE, then the procedure does not perform any optimization. These procedures also support the MAXITER= option.

Summary

This article shows how to score parametric regression models when the parameter estimates are not fit by the usual procedures. For example, multiple imputations can produce a set of parameter estimates. In PROC LOGISTIC, you can use an INEST= data set to read the estimates and use the MAXITER=0 option to suppress fitting. You can use the STORE statement to store the model and use PROC PLM to perform scoring and visualization. Other procedures have similar options, but there is not a single method that works for all SAS regression procedures.

If you use any of the ideas in this article, let me know how they work by leaving a comment. If you have an alternate way to trick SAS regression procedures into using externally supplied estimates, let me know that as well.

The post How to score a logistic regression model that was not fit by PROC LOGISTIC appeared first on The DO Loop.

11月 302020
 

Most games of skill are transitive. If Player A wins against Player B and Player B wins against Player C, then you expect Player A to win against Player C, should they play. Because of this, you can rank the players: A > B > C.

Interestingly, not all games are transitive. The game "Rock, Paper, Scissors" is an intransitive game. In this game, Rock beats Scissors, Scissors beats Paper, and Paper beats Rock. You cannot rank the shapes, nor can you declare that one shape wins more often than another.

In "Rock, Paper, Scissors," both players display their shapes simultaneously. If not, the second player could win 100% of the time by knowing the shape that is shown by the first player.

Intransitive dice

Recently, I saw an intransitive game that involves special six-sided dice. In this game, there are four dice: A, B, C, and D. Two players each choose a die and roll it. The higher number wins. If the players choose their die sequentially, the second player is at a huge advantage: No matter which die the first player chooses, the second player can choose a die such that his probability of winning the game is 2/3.

Clearly, the numbers on the faces of the dice are important, since this advantage does not exist for ordinary six-sided dice. One set of intransitive dice has the following faces:
A = {0, 0, 4, 4, 4, 4}
B = {3, 3, 3, 3, 3, 3}
C = {2, 2, 2, 2, 6, 6}
D = {1, 1, 1, 5, 5, 5}

For these dice:

  • A wins against B with probability 2/3.
  • B wins against C with probability 2/3.
  • C wins against D with probability 2/3.
  • D wins against A with probability 2/3.

The probabilities of winning for games that involve die B are easy to compute because that die has a 3 on every face. By inspection, die A wins against B whenever A shows a 4, which occurs 2/3 of the time. Similarly, die B wins against die C whenever C shows a 2, which occurs 2/3 of the time.

The remaining probabilities can be computed by using a tree diagram for conditional probability. A tree diagram for the dice C and D is shown below. If C shows a 6 (which happens 1/3 of the time), then C wins. If C shows a 2 (which happens 2/3 of the time), then the game depends on the roll of D. If D rolls a 1 (which happens 1/2 of the time), then C wins. But if D rolls a 5 (which happens 1/2 of the time), then C loses. Since the rolls are independent, you multiply the individual probabilities to obtain the total probability of each pair of events. For C versus D, C wins with probability 2/3.

You can construct a similar diagram for the D versus A game. It is also interesting to compute the probability of A versus C and B versus D for these dice.
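Because each matchup has only 36 equally likely face pairs, you can also compute every probability exactly by enumeration. The following Python sketch (offered as a cross-check outside SAS) confirms the 2/3 probabilities and also answers the A-versus-C and B-versus-D questions:

```python
from fractions import Fraction
from itertools import product

def p_win(x, y):
    """Exact probability that a roll of die x beats a roll of die y."""
    wins = sum(1 for a, b in product(x, y) if a > b)
    return Fraction(wins, len(x) * len(y))

# Efron's dice, as defined above
A = [0, 0, 4, 4, 4, 4]
B = [3, 3, 3, 3, 3, 3]
C = [2, 2, 2, 2, 6, 6]
D = [1, 1, 1, 5, 5, 5]

print(p_win(A, B), p_win(B, C), p_win(C, D), p_win(D, A))  # each is 2/3
print(p_win(A, C), p_win(B, D))  # 4/9 and 1/2
```

The enumeration shows that the A-versus-C game actually favors C (A wins only 4/9 of the time), and the B-versus-D game is fair, so there is truly no "best" die.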

Simulate many games

Because conditional probability can be tricky, I always like to check my computations by running a simulation. The following SAS/IML program uses the SAMPLE function to simulate 10,000 rolls of each die. The estimated probability for winning is the number of times that one die showed the greater number, divided by 10,000, which is the total number of simulated games. For each of the four games considered, the estimated probability is approximately 2/3.

proc iml;
/* define the dice */
A = {0, 0, 4, 4, 4, 4};
B = {3, 3, 3, 3, 3, 3};
C = {2, 2, 2, 2, 6, 6};
D = {1, 1, 1, 5, 5, 5}; 
 
call randseed(1234);
N = 10000;
/* simulate N rolls for each die */
sA = sample(A, N);
sB = sample(B, N);
sC = sample(C, N);
sD = sample(D, N);
 
/* estimate the probability of 'A vs B', 'B vs C', etc */
AvsB = sum(sA > sB) / N;
BvsC = sum(sB > sC) / N;
CvsD = sum(sC > sD) / N;
DvsA = sum(sD > sA) / N;
print AvsB BvsC CvsD DvsA;

Further reading

There are other examples of intransitive dice. The ones in this article are called "Efron's Dice" because they were discovered by Brad Efron in 1970. Yes, the same Brad Efron who also invented bootstrap resampling methods! The following articles provide more examples and more explanation.

  • "Nontransitive dice", Wikipedia article with lots of examples and a section devoted to Efron's dice, as used in this article.
  • Grime, J. (2010) "Curious Dice," Plus Magazine. A wonderful exposition.

The post Intransitive dice appeared first on The DO Loop.

11月 232020
 

I previously showed how to create a decile calibration plot for a logistic regression model in SAS. A decile calibration plot (or "decile plot," for short) is used in some fields to visualize agreement between the data and a regression model. It can be used to diagnose an incorrectly specified model.

In principle, the decile plot can be applied to any regression model that has a continuous regressor. In this article, I show how to create two decile plots for an ordinary least squares regression model. The first overlays a decile plot on a fit plot, as shown to the right. The second decile plot shows the relationship between the observed and predicted responses.

Although this article shows how to create a decile plot in SAS, I do not necessarily recommend them. In many cases, it is better and simpler to use a loess fit, as I discuss at the end of this article.

A misspecified model

A decile plot can be used to identify an incorrectly specified model. Suppose, for example, that a response variable (Y) depends on an explanatory variable (X) in a nonlinear manner. If you model Y as a linear function of X, then you have specified an incorrect model. The following SAS DATA step simulates a response variable (Y) that depends strongly on a linear effect and weakly on quadratic and cubic effects for X. Because the nonlinear dependence is weak, a traditional fit plot does not indicate that the model is misspecified:

/* Y = b0 + b1*x + b2*x**2 + b3*x**3 + error, where b2 and b3 are small */
%let N = 200;                      /* sample size */ 
data Have(keep= Y x);
call streaminit(1234);             /* set the random number seed */
do i = 1 to &N;                    /* for each observation in the sample  */
   x = rand("lognormal", 0, 0.5);  /* x = explanatory variable */
   y = 20 + 5*x - 0.5*(x-3)**3 +   /* include weak cubic effect */
            rand("Normal", 0, 4); 
   output;
end;
run;
 
proc reg data=Have;               /* Optional: plots=residuals(smooth); */
   model y = x;
quit;

The usual p-values and diagnostic plots (not shown) do not reveal that the model is misspecified. In particular, the fit plot, which shows the observed and predicted response versus the explanatory variable, does not indicate that the model is misspecified. At the end of this article, I discuss how to get PROC REG to overlay additional information about the fit on its graphs.

Overlay a decile plot on a fit plot

This section shows how to overlay a decile plot on the fit plot. The fit plot shows the observed values of the response (Y) for each value of the explanatory variable (X). You can create a decile fit plot as follows:

  1. Bin the X values into 10 bins by using the deciles of the X variable as cut points. In SAS, you can use PROC RANK for this step.
  2. For each bin, compute the mean Y value and a confidence interval for the mean. At the same time, compute the mean X value of each bin, which will serve as the horizontal plotting position for the bin. In SAS, you can use PROC MEANS for this step.
  3. Overlay the fit plot with the 10 confidence intervals and mean values for Y. You can use PROC SGPLOT for this step.

/* 1. Assign each obs to a decile for X */
proc rank data=Have out=Decile groups=10 ties=high;
   var x;           /* variable on which to group */
   ranks Decile;    /* name of variable to contain groups 0,1,...,k-1 */
run;
 
/* 2. Compute the mean X and the mean response (and CI) for each decile */
proc means data=Decile alpha=0.1 noprint;  /* 90% limits, to match the REG band */
   class Decile;
   types Decile;
   var y x;
   output out=DecileOut mean=YMean XMean 
          lclm=yLower uclm=yUpper;  /* only compute limits for Y */
run;
 
/* 3. Create a fit plot and overlay the deciles. */
data All;
   set Have DecileOut; 
run; 
 
title "Decile Plot";
title2 "Observed and Predicted Response versus Explanatory";
proc sgplot data=All noautolegend;
   reg x=x y=y / nomarkers CLM alpha=0.1 name="reg";
   scatter x=xMean y=YMean / yerrorlower=yLower yerrorupper=yUpper 
           name="obs" legendlabel="Mean Y and 90% CI" markerattrs=GraphData2;
   fringe x / transparency=0.5;
   xaxis label="x" values=(0 to 4.5 by 0.5) valueshint;
   yaxis label="y" offsetmin=0.05;
   keylegend "reg" "obs" / position=NW location=inside across=1;
run;

The graph is shown at the top of this article. I use the FRINGE statement to display the locations of the X values at the bottom of the plot. The graph indicates that the model might be misspecified. The mean Y values for the first three deciles are all above the predicted value from the model. For the next six deciles, the mean Y values are all below the model's predictions. This kind of systematic deviation from the model can indicate that the model is missing an important component. In this case, we know that the true response is cubic whereas the model is linear.
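The decile computation itself (steps 1 and 2: bin by deciles of X, then average within bins) does not depend on SAS. As a minimal pure-Python sketch of the same recipe, the following code bins sorted (x, y) pairs into 10 groups and computes per-bin means; the simulated data and the normal-approximation 95% interval are illustrative assumptions, not an exact reproduction of the SAS example:

```python
import math
import random
import statistics

def decile_summary(xs, ys):
    """Bin (x, y) pairs into 10 groups by deciles of x. For each bin,
    return (mean x, mean y, lower, upper), where the limits are an
    approximate 95% CI for the mean y (normal approximation)."""
    pairs = sorted(zip(xs, ys))          # sort by x
    n = len(pairs)
    out = []
    for i in range(10):
        chunk = pairs[i * n // 10 : (i + 1) * n // 10]
        bx = [p[0] for p in chunk]
        by = [p[1] for p in chunk]
        m = statistics.fmean(by)
        half = 1.96 * statistics.stdev(by) / math.sqrt(len(by))
        out.append((statistics.fmean(bx), m, m - half, m + half))
    return out

# simulated data similar in spirit to the SAS example (assumed, not identical)
random.seed(1234)
xs = [random.lognormvariate(0, 0.5) for _ in range(200)]
ys = [20 + 5*x - 0.5*(x - 3)**3 + random.gauss(0, 4) for x in xs]
summary = decile_summary(xs, ys)
```

Plotting the ten (mean x, mean y) points with their intervals over the fitted line reproduces the decile plot; the systematic runs of points above and below the line are what flag a misspecified model.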

A decile plot of observed versus predicted response

You can create a fit plot when the model contains a single continuous explanatory variable. If you have multiple explanatory variables, you can assess the model by plotting the observed responses against the predicted responses. You can overlay a decile plot by computing the average observed response for each decile of the predicted response. When you hear someone use the term "decile plot," they are probably referring to this second plot, which is applicable to most regression models.

The SAS program is almost identical to the program in the previous section. The main difference is that you need to create an output data set (FIT) that contains the predicted values (Pred). You then use the Pred variable instead of the X variable throughout the program, as follows:

/* 1. Compute predicted values */
proc reg data=Have noprint;
   model y = x;
   output out=Fit Pred=Pred;
quit;
 
/* 2. Assign each obs to a decile for Pred */
proc rank data=Fit out=Decile groups=10 ties=high;
   var Pred;        /* variable on which to group */
   ranks Decile;    /* name of variable to contain groups 0,1,...,k-1 */
run;
 
/* 3. Compute the mean predicted value and the mean observed response (and CI) for each decile */
proc means data=Decile noprint;
   class Decile;
   types Decile;
   var y Pred;
   output out=DecileOut mean=YMean PredMean 
          lclm=yLower uclm=yUpper;  /* only compute limits for Y */
run;
 
/* 4. Create an Observed vs Predicted plot and overlay the deciles. */
data All;
   set Fit DecileOut; 
run; 
 
title "Decile Plot";
title2 "Observed Response versus Predicted Response";
proc sgplot data=All noautolegend;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash) clip; /* add identity line */
   scatter x=PredMean y=YMean / yerrorlower=yLower yerrorupper=yUpper 
           name="obs" legendlabel="Mean Y and 95% CI" markerattrs=GraphData2;
   fringe Pred;
   xaxis label="Predicted Response";
   yaxis label="Observed Response" offsetmin=0.05;
   keylegend "obs" / position=NW location=inside across=1;
run;

The decile plot includes a diagonal line, which is the line of perfect agreement between the model and the data. In a well-specified model, the 10 empirical means of the deciles should be close to the line, and they should vary randomly above and below the line. The decile plot for the misspecified model shows a systematic trend rather than random variation. For the lower 30% of the predicted values, the observed response is higher (on average) than the predictions. When the predicted values are in the range [29, 32] (the fourth through ninth deciles), the observed response is lower (on average) than the predictions. Thus, the decile plot indicates that the linear model might be missing an important nonlinear effect.

The loess alternative

A general principle in statistics is that you should not discretize a continuous variable unless absolutely necessary. You can often get better insight by analyzing the continuous variable. In accordance with that principle, this section shows how to identify misspecified models by using a loess smoother rather than by discretizing a continuous variable into deciles.

In addition to the "general principle," a decile plot requires three steps: compute the deciles, compute the means for the deciles, and merge the decile information with the original data so that you can plot it. As I discuss at the end of my previous article, it is often easier to overlay a loess smoother.

The following statements use the LOESS statement in PROC SGPLOT to add a loess smoother to a fit plot. If the loess fit is not approximately linear, it could indicate that the linear model is not appropriate.

title "Linear Regression Fit";
proc sgplot data=Have;
   reg x=x y=y / clm;              /* fit linear model y = b0 + b1*x */
   loess x=x y=y / nomarkers;      /* use loess fit instead of decile plot */
run;

This plot is the continuous analog to the first decile plot. It tells you the same information: For very small and very large values of X, the predicted values are too low. The loess curve suggests that the linear model is lacking an effect.

The previous plot is useful for single-variable regressions. But even if you have multiple regressors, you can add a loess smoother to the "observed versus predicted" plot. If the FIT data set contains the observed and predicted response, you can create the graph as follows:

title "Observed versus Predicted";
proc sgplot data=Fit;
   loess x=Pred y=y;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash) clip legendlabel="Identity Line";
run;

This plot is the continuous analog to the second decile plot. The loess curve indicates that the observed response is higher than predicted for small and large predicted values. When the model predicts values of Y in the range [29.5, 32], the predicted values tend to be larger than the observed values. This is the same information that the decile plot provides, but the loess curve is easier to use and gives more information because it does not discretize a continuous variable.

Finally, you can add a loess smoother to a residual plot by using an option in PROC REG. In a well-specified model, the residual plot should not show any systematic trends. You can use the PLOTS=RESIDUALS(SMOOTH) options on the PROC REG statement to add a loess smoother to the residual plot, as follows:
proc reg data=Have plots=residuals(smooth);
   model y = x;
   ods select ResidualPlot;
quit;

The loess smoother on the residual plot shows that the residuals seem to have a systematic component that is not captured by the linear model. When you see the loess fit, you might be motivated to fit models that are more complex. You can use this option even for models that have multiple explanatory variables.
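Loess itself involves locally weighted regression and is beyond a few lines of code, but the core idea of any local smoother is simple: at each point, summarize the responses of nearby observations. The following crude moving-window sketch (mine; it is NOT loess, just an unweighted local mean) illustrates that idea:

```python
def local_mean_smooth(x, y, halfwidth):
    """Crude local smoother: at each x[i], average y over all points within
    `halfwidth` of x[i]. Not loess (no kernel weights, no local regression),
    but it captures the idea of summarizing nearby responses."""
    out = []
    for xi in x:
        nbr = [yj for xj, yj in zip(x, y) if abs(xj - xi) <= halfwidth]
        out.append(sum(nbr) / len(nbr))
    return out

# on linear data, a local mean tracks the trend except near the endpoints
smooth = local_mean_smooth([0, 1, 2, 3, 4], [0.0, 2.0, 4.0, 6.0, 8.0], 1)
```

A systematic bend in such a smoothed curve of the residuals is exactly the signal that the loess overlay in the residual plot makes visible.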

Summary

In summary, this article shows how to create two decile plots for least-square regression models in SAS. The first decile plot adds 10 empirical means to a fit plot. You can create this plot when you have a single continuous regressor. The second decile plot is applicable to models that have multiple continuous regressors. It adds 10 empirical means to the "observed versus predicted" plot. Lastly, although you can create decile plots in SAS, in many cases it is easier and more informative to use a loess curve to assess the model.

The post Decile plots in SAS appeared first on The DO Loop.

11月 182020
 
Predicted probabilities for a logistic regression model

To help visualize regression models, SAS provides the EFFECTPLOT statement in several regression procedures and in PROC PLM, which is a general-purpose procedure for post-fitting analysis of linear models. When scoring and visualizing a model, it is important to use reasonable combinations of the explanatory variables for the visualization. When the explanatory variables are correlated, the standard visualizations might evaluate the model at unreasonable (or even impossible!) combinations of the explanatory variables. This article describes how to generate points in an ellipsoid and score (evaluate) a regression model at those points. The ellipsoid contains typical combinations of the explanatory variables when the explanatory variables are multivariate normal. An example is shown in the graph to the right, which shows the predicted probabilities for a logistic regression model.

This technique can help prevent nonsensical predicted values. For example, in a previous article, I show an example of a quadratic regression surface that appears to predict negative values for a quantity that can never be negative (such as length, mass, or volume). The reason for the negative prediction is that the model is being evaluated at atypical points that are far from the explanatory values in the experiment.

Example: Visualizing the predicted probabilities for a logistic model

Let's look at an example of a regression model for which the regressors are correlated. Suppose you want to predict the gender of the 19 students in the Sashelp.Class data based on their height and weight. (Apologies to my international readers, but the measurements are in pounds and inches.) The following call to PROC LOGISTIC fits a logistic model to the data and uses the EFFECTPLOT statement to create a contour plot of the predicted probability that a student is female:

proc logistic data=Sashelp.Class;
   model Sex = Height Weight;
   effectplot contour / ilink;      /* contour plot of pred prob */
   store out=LogiModel;             /* store model for later */
run;

The contour plot shows the predicted probability of being female according to the model. The red regions have a high probability of being female and the blue regions have a low probability. For a given height, lighter students are more likely to be female.
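The ILINK option reports predictions on the probability scale: the logistic model estimates a linear predictor eta = b0 + b1*Height + b2*Weight, and the predicted probability is the inverse logit 1/(1 + exp(-eta)). The following Python sketch shows the computation; the coefficients are made-up values for illustration, not the estimates from this PROC LOGISTIC run:

```python
import math

def predicted_prob(height, weight, b0, b_height, b_weight):
    """Inverse logit of the linear predictor: the quantity that ILINK reports."""
    eta = b0 + b_height * height + b_weight * weight
    return 1.0 / (1.0 + math.exp(-eta))

# hypothetical coefficients, chosen only to illustrate the computation
p = predicted_prob(62.3, 100.0, b0=40.0, b_height=-0.7, b_weight=0.04)
```

With these made-up coefficients, a positive weight coefficient means that, for a fixed height, heavier students get a higher predicted probability; the sign pattern of the real estimates determines the red-to-blue gradient in the contour plot.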

This graph is a great summary of the model. However, because height and weight are correlated, the data values do not fill the rectangular region in the graph. There are no students in the upper-left corner near (height, weight)=(51, 150). Such a student would be severely obese. Similarly, there are no students in the lower-right corner near (height, weight)=(72, 50). Such a student would be severely underweight.

You can graph just the explanatory variables and overlay a 95% prediction ellipse for bivariate normal data, as shown below:

title "Weight versus Height";
proc sgplot data=Sashelp.Class;
   scatter x=Height y=Weight / jitter transparency=0.25 group=Sex markerattrs=(size=10 symbol=CircleFilled);
   ellipse x=Height y=Weight;    /* 95% prediction ellipse for bivariate normal data */
run;

The heights and weights of most students will fall into the prediction ellipse, assuming that the variables are multivariate normal with a mean vector that is close to (62.3, 100), which is the mean of the data.

If you want to score the model at "representative" values of the data, it makes sense to generate points inside the ellipse. The following sections show how to generate a regular grid of values inside the ellipse and how to score the logistic model on those points.

Generate data in an ellipsoid

I can think of several ways to generate a grid of values inside the ellipse, but the easiest way is to use the Mahalanobis distance (MD). The Mahalanobis distance is a metric that accounts for the correlation in the data. Accordingly, you can generate a regular grid of points and then keep only the points whose Mahalanobis distance to the mean vector is less than a certain cutoff value. In this section, I choose the cutoff value to be the maximum Mahalanobis distance in the original data. Alternatively, you can use probability theory and the fact that the distribution of the squared MD values is asymptotically chi-squared.

The following SAS/IML program reads the explanatory variables and computes the sample mean and covariance matrix. You can use the MAHALANOBIS function to compute the Mahalanobis distance between each observation and the sample mean. The maximum MD for these data is about 2.23. You can use the ExpandGrid function to generate a regular grid of points, then keep only those points whose MD is less than 2.23. The resulting points are inside an ellipse and represent "typical" values for the variables.

proc iml;
varNames = {'Height' 'Weight'};
use Sashelp.Class; read all var varNames into X;  close;
 
S = cov(X);
m = mean(X);
MD_Data = mahalanobis(X, m, S);
MaxMD = max(MD_Data);
print MaxMD;
 
minX = 50;  maxX = 72;           /* or use min(X[,1]) and max(X[,1]) */
minY = 50;  maxY = 150;          /* or use min(X[,2]) and max(X[,2]) */
xRng = minX:maxX;                /* step by 1 in X */
yRng = do(minY, maxY, 5);        /* step by 5 in Y */
 
u = ExpandGrid( xRng, yRng );    /* regular grid of points */
MD = mahalanobis(u, m, S);       /* MD to center for each point */
u = u[ loc(MD < MaxMD), ];       /* keep only points where MD < cutoff */
ods graphics / width=300px height=300px;
call scatter(u[,1], u[,2]);
 
create ScoreIn from u[c=varNames];   /* write score data set */
   append from u;
close;
QUIT;
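For readers who do not use SAS/IML, the same filter is only a few lines of code in any language. For two variables, the Mahalanobis distance sqrt((x-m)' S^{-1} (x-m)) can be computed with an explicit 2x2 matrix inverse. In the Python sketch below, the covariance values are made up for illustration (they are not the sample covariance of Sashelp.Class); the mean vector and cutoff are the approximate values quoted in this article:

```python
import math

def mahalanobis2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point from `mean`, using the explicit
    2x2 inverse of `cov`: inv(S) = [[d,-b],[-c,a]] / det(S)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det  # dx' inv(S) dx
    return math.sqrt(q)

mean = (62.3, 100.0)                   # approximate sample means, per the article
cov = ((26.0, 95.0), (95.0, 500.0))    # hypothetical covariance matrix
cutoff = 2.23                          # approximate max MD of the data, per the article

# regular grid (like ExpandGrid), keeping only points with MD < cutoff
grid = [(h, w) for h in range(50, 73) for w in range(50, 151, 5)]
inside = [p for p in grid if mahalanobis2d(p, mean, cov) < cutoff]
```

The surviving points fill an ellipse that is aligned with the correlation structure of the data, which is what the call to the SCATTER subroutine displays.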

Score data in an ellipsoid

You can use PROC PLM to score the logistic model at the points in the ellipsoid. In the earlier call to PROC LOGISTIC, I used the STORE statement to create an item store named LogiModel. The following call to PROC PLM scores the model and uses the ILINK option to output the predicted probabilities. You can use the HEATMAPPARM statement in PROC SGPLOT to visualize the predicted probabilities. So that the color ramp always visualizes probabilities on the interval [0,1], I add two fake observations (0 and 1) to the predicted values.

proc PLM restore=LogiModel noprint;
   score data=ScoreIn out=ScoreOut / ilink;      /* predicted probability */
run;
 
data Fake;           /* Optional trick: Add 0 and 1 so colorramp shows range [0,1] */
Predicted=0; output;
Predicted=1; output;
run;
 
data Score;          /* concatenate the real and fake data */
   set ScoreOut Fake;
run;
 
title "Predicted Values of Sex='F' for Logistic Model";
proc sgplot data=Score;
   styleattrs wallcolor=CxF8F8F8;
   heatmapparm  x=Height y=Weight colorresponse=Predicted;
   xaxis grid; yaxis grid;
run;

The graph is shown at the top of this article. It shows the predicted probabilities for representative combinations of the height and weight variables. Students whose measurements appear in the lower portion of the ellipse (red) are more likely to be female. Students in the upper portion of the ellipse (blue) are more likely to be male. It is not likely that students have measurements outside the ellipse, so the model is not evaluated outside the ellipse.

Summary

This article shows how you can use the Mahalanobis distance to generate scoring data that lie in an ellipsoid centered at the sample mean vector. Such data are appropriate for scoring "typical" combinations of variables in data that are approximately multivariate normal. Although the example used a logistic regression model, the idea applies to any model that you want to score. This technique can help to prevent scoring the model at unrealistic values of the explanatory variables.

The post Create scoring data when regressors are correlated appeared first on The DO Loop.

11月 162020
 
Graph of Tan(x) with points of discontinuity

I have previously written about how to plot a discontinuous function in SAS. That article shows how to use the GROUP= option on the SERIES statement to graph a discontinuous function. An alternative approach is to place a missing value for the Y variable at the locations at which the graph is not continuous. You can also display reference lines at the points of discontinuity.

An example of a graph of a discontinuous function is shown at the right. This function is the tangent function, which has discontinuities at odd multiples of π/2, as indicated by the gray vertical lines. This article describes the following techniques for graphing discontinuous functions in SAS:

  • Tip #1: Use a GROUP variable
  • Tip #2: Use missing values and the BREAK option
  • Tip #3: Use reference lines to indicate the points of discontinuity

An important special case of a discontinuous function is a piecewise-constant step function, which is discussed in a separate article. In statistics, you can use a step function to visualize the empirical cumulative distribution of data.

Tip #1: Use a GROUP variable

A previous article shows how to add a grouping variable to the data for a discontinuous function. The grouping variable identifies intervals on which the function is continuous. The SERIES statement in PROC SGPLOT will automatically draw each curve on each interval without connecting points across a discontinuity.

For example, consider the following piecewise polynomial, which is defined on three intervals:

   f(x) = 0.5 - x^2   for -1 ≤ x < 0
   f(x) = 1 - x       for  0 ≤ x ≤ 1
   f(x) = x^2 - 0.5   for  1 < x ≤ 2

The following SAS DATA step evaluates the function on three intervals. On each interval, the function is continuous. The DOMAIN variable is a grouping variable that identifies each interval. The graph is visualized by using the SERIES statement in PROC SGPLOT. When you use the GROUP= option, the individual segments are drawn without connecting points in different segments.

/* Method 1: Create a grouping variable that identifies the continuous segments 
   https://blogs.sas.com/content/iml/2013/03/25/plot-a-discontinuous-function.html
*/
data Func;
Domain = 1;
do x = -1 to 0 by 0.1;
   y = 0.5-x**2; output;
end;
Domain = 2;
do x = 0 to 1 by 0.1;
   y = 1-x; output;
end;
Domain = 3;
do x = 1 to 2 by 0.1;
   y = x**2 - 0.5; output;
end;
run;
 
title "A Discontinuous Function";
title2 "The GROUP= Option";
proc sgplot data=Func noautolegend;
  series x=x y=y / group=Domain lineattrs=GraphDataDefault;
run;

By default, each curve segment will appear in a different color. If you want them to be the same color, you can use the LINEATTRS= option to specify the line attributes, as shown in the previous program.

Tip #2: Use missing values and the BREAK option

The BREAK option on the SERIES statement tells the SGPLOT procedure not to connect consecutive points along a curve if the Y value is missing. Therefore, you can visualize a discontinuous function by specifying a missing value for Y at the points of discontinuity.

The following program contains three DATA steps. The first evaluates the discontinuous function at an evenly spaced set of points. The second defines the points of discontinuity. The third concatenates the two data sets. The data are then sorted by using PROC SORT. The BREAK option on the SERIES statement in PROC SGPLOT creates the visualization. The REFLINE statement displays vertical reference lines at the points of discontinuity.

/* Method 2: Add missing value at the break points and use the BREAK option.
   The BREAK option does not connect across a missing value.
*/
data Func2;
do x = -1 to 2 by 0.01;  /* use smaller step size */
   if x < 0 then
      y = 0.5-x**2;
   else if x <= 1 then 
      y = 1-x; 
   else
      y = x**2 - 0.5;
   output;
end;
run;
 
data BreakPoints;
   retain y .;
   x = 0; output;
   x = 1; output;
run;
 
data DisconFunc;
   set Func2 BreakPoints;
run;
proc sort data=DisconFunc; by x; run;   /* important to sort the data by x */
 
title "Plot a Discontinuous Function";
title2 "The BREAK Option";
proc sgplot data=DisconFunc;
  series x=x y=y / break; 
  refline 0 1 / axis=x;   /* optional: plot the locations of the break points */
run;
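The same trick works in any plotting system that breaks a line at missing values: evaluate the function on a grid, append a sentinel point with a missing Y at each break, and sort by X. The following Python sketch (mine, not part of the article) shows the bookkeeping for the piecewise polynomial, using None as the missing value:

```python
def f(x):
    """Piecewise polynomial from the example; discontinuous at x=0 and x=1."""
    if x < 0:
        return 0.5 - x**2
    if x <= 1:
        return 1 - x
    return x**2 - 0.5

xs = [i / 100 for i in range(-100, 201)]     # evenly spaced grid on [-1, 2]
series = [(x, f(x)) for x in xs]
series += [(b, None) for b in (0.0, 1.0)]    # missing Y at each break point
series.sort(key=lambda p: p[0])              # like PROC SORT: sort by x
```

A plotting routine that refuses to connect across a None value (the analog of the BREAK option) then draws three separate curve segments.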

Tip #3: Use reference lines to indicate the points of discontinuity

The previous example uses a REFLINE statement to indicate the points of discontinuity. The locations of the reference lines are specified manually, even though that information was already available in the BreakPoints data set. In some situations, it is more convenient to directly read the points from the BreakPoints data. A previous article describes how to use a data set to specify the location of reference lines.

A good example is the tangent function, which is discontinuous at odd multiples of π/2. The following DATA steps evaluate the tangent function and compute an indicator variable that is constant on the intervals (-π/2, π/2), (π/2, 3π/2), and so forth. They also create a data set that contains the points of discontinuity and merge that data set with the original data. Finally, the program plots the data and overlays reference lines at the points of discontinuity.

/* You can put the location of the breaks in a separate data set, which 
   is useful for periodic functions like tan, cot, sec, csc, etc.
   Tip: Use the same number of points in each domain such as (-pi/2, pi/2).
   Choose a step size of the form pi/k.
*/
data Tan;
pi = constant('pi');
do x = -2*pi to 2*pi by (4*pi)/(4*51); 
   y = tan(x);
   Domain = floor( (x + pi/2) / pi ); /* constant on (-pi/2, pi/2) and translations of this interval */
   output;
end;
drop pi;
run;
 
data BreakPoints;
pi = constant('pi');
do k = -3 to 3 by 2;  /* odd multiples of pi/2 */
   z = k*pi/2;
   output;
end;
keep z;
run;
 
data PlotTan;
   merge Tan BreakPoints;
run;
 
title "The Tangent Function";
title2 "Reference Lines at Points of Discontinuity";
proc sgplot data=PlotTan noautolegend;
   refline 0 / axis=y;        /* X axis */
   series x=x y=y / group=Domain lineattrs=GraphDataDefault;
   yaxis min=-10 max=10;      /* truncate range */
   refline z / axis=x;        /* the locations of the reference lines come from a data set */
run;
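As a quick sanity check on the Domain formula (this check is mine, not part of the article), you can verify in Python that floor((x + π/2)/π) is constant on (-π/2, π/2) and increments by 1 each time x crosses an odd multiple of π/2:

```python
import math

def tan_domain(x):
    """Index of the continuity interval of tan(x): the formula is constant on
    (-pi/2, pi/2) and on each translate of that interval by a multiple of pi."""
    return math.floor((x + math.pi / 2) / math.pi)

# points strictly inside (-pi/2, pi/2); all should map to interval 0
samples = [k / 10 for k in range(-15, 16)]
```

Because the SERIES statement groups by this indicator, each branch of the tangent curve is drawn separately and no segment is drawn across an asymptote.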

The graph is shown at the top of this article. By reading the points of discontinuity from a data set, you do not have to type numbers like 1.5707963268 (=π/2) on the REFLINE statement.

Summary

In summary, this article shows three tips for graphing discontinuous functions. To prevent the SERIES statement in PROC SGPLOT from connecting points across a discontinuity, you can either use a GROUP variable or you can use missing values and the BREAK option. To show the locations of the discontinuities, you can use the REFLINE statement. You can either specify the locations manually, or you can read them from a data set.

The post Three tips for plotting discontinuous functions in SAS appeared first on The DO Loop.

11月 112020
 

The REFLINE statement in PROC SGPLOT is one of my favorite ways to augment statistical graphics such as scatter plots, series plots, and histograms. The REFLINE statement overlays a vertical or horizontal reference line on a graph. You can specify the location of the reference lines on the REFLINE statement. For example, if you want reference lines at x=0, x=3.14, and x=7.2, you can specify those values as follows:

   refline 0 3.14 7.2 / axis=x;
Histogram with five reference lines

But did you know that you can also specify the location of the reference lines by referring to a variable in the data? This enables you to automate the placement of reference lines. Of particular interest is visualizing statistics such as mean, medians, and percentiles on a graph.

For example, suppose you want to create a histogram of a variable and overlay the 10th, 25th, 50th, 75th, and 90th percentiles, as shown on the histogram to the right. One way is to use PROC MEANS to display those statistics, then copy and paste them into the REFLINE statement. However, if this is a graph that you need to create for data that changes daily or weekly, cutting and pasting is not practical. It is better to automate the process.

The following list outlines the main steps in automating the placement of reference lines:

  1. Create a SAS data set in which each row contains a value at which to place a reference line. Optionally, the data set can contain a second column that contains labels for the reference lines.
  2. Merge the original data set and the data set that contains the locations of the reference lines.
  3. Plot the original data and use the REFLINE statement to overlay the reference lines.

The following sections show each step for overlaying the 10th, 25th, 50th, 75th, and 90th percentiles of a variable.

Step 1: Get the reference values in a data set

Although "get the reference values into a data set" is conceptually one step, in practice this might be a multi-step process, depending on what statistics you are computing and what procedure you use to compute them. The reference values need to be in "long form," so you might need to transpose the statistics if they are output in "wide form."

To demonstrate, suppose you want to output five percentiles of the SepalLength variable in the Sashelp.Iris data. You can use PROC MEANS to compute the percentiles, but the output is wide, so you will have to use PROC TRANSPOSE or another DATA step to transpose the data from wide to long, as follows:

%let DSName = sashelp.Iris;
%let VarName = SepalLength;
proc means data=&DSName;
   var &VarName;
   output out=RefWide P10=P10 P25=P25 median=P50 P75=P75 P90=P90;
run;
 
/* Transpose the data set so that each column becomes an observation. */
proc transpose data=RefWide(keep=P:) 
               out=Ref(rename=(Col1=Value)) name=Label;
run;

The transposed data are shown. Note that this example includes one column for the reference values and another column for a label. If you have multiple reference values, you will want to keep the labels short to avoid collisions when you display the labels. For this example, the reference values are stored in the Value variable and the labels are stored in the Label variable.
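The wide-to-long reshaping is simple enough to sketch outside of SAS: compute the five percentiles and emit one (Label, Value) pair per statistic, which is exactly the long-form layout that the REFLINE statement needs. The Python sketch below (mine) uses a nearest-rank percentile, a common textbook definition that can differ slightly from the default percentile definition in PROC MEANS:

```python
import math

def pctl(data, p):
    """Nearest-rank percentile: the smallest value with at least p% of the
    data at or below it. (May differ slightly from PROC MEANS' default.)"""
    vals = sorted(data)
    k = max(1, math.ceil(p / 100 * len(vals)))
    return vals[k - 1]

data = list(range(1, 101))   # stand-in for the SepalLength values
ref = [(f"P{p}", pctl(data, p)) for p in (10, 25, 50, 75, 90)]
# ref is the long-form table: one (Label, Value) row per reference line
```

Each row of `ref` corresponds to one observation in the transposed Ref data set.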

Step 2: Merge the original data and the reference values

You can use the SET statement to concatenate the original data and the values for the statistics:

data All;
   set &DSName Ref(keep=Label Value);
run;

Step 3: Overlay the reference lines

The rest of the program is easy: merely specify the name of the variables (Value and Label) in the REFLINE statement. By default, the reference lines are a grey color, but you can use the LINEATTRS= option to change the attributes of the reference lines. For example, the following example displays dark red vertical reference lines.

title "Distribution of &VarName and Percentiles";
proc sgplot data=All;
   histogram &VarName;
   refline Value / axis=x label=Label lineattrs=(color=DarkRed);
run;

The graph is shown at the top of this article. By default, the labels appear outside the graph area and at the top of the plot. You can use the LABELPOS=MIN option to display the labels at the bottom of the graph. You can use the LABELLOC=INSIDE option to move the labels inside the graph area. If you moved them inside, you might also want to use the OFFSETMAX= option on the YAXIS statement to make room for the labels, as shown in the following statements:

proc sgplot data=All;
   histogram &VarName;
   refline Value / axis=x label=Label labelloc=inside
                   lineattrs=(color=DarkRed);
   yaxis offsetmax=0.1;
run;

Summary

In summary, you can automate the placement of reference lines in PROC SGPLOT by writing the values of the reference lines to a SAS data set and merging that information with the original data. This article demonstrates the method by overlaying five percentile values onto a histogram.

Obviously, you can use statistics other than percentiles. For example, another common application is to plot the mean and ±1 or ±2 standard deviations from the mean. Writing the SAS code is left as an exercise.

The post Automate the placement of reference lines in PROC SGPLOT appeared first on The DO Loop.