Regression

May 17, 2021
 

I previously wrote about how to understand standardized regression coefficients in PROC REG in SAS. You can obtain the standardized estimates by using the STB option on the MODEL statement in PROC REG. Several readers have written to ask whether I could write a similar article about the STDCOEF option on the MODEL statement in PROC GLIMMIX. One reader wrote:

Dear Dr. Wicklin: I get different results when I request standardized regression coefficients from PROC REG and from PROC GLIMMIX. I don't understand. Isn't the STB option in PROC REG equivalent to the STDCOEF option in PROC GLIMMIX (for DIST=NORMAL LINK=IDENTITY)? Can you explain what STDCOEF does?

The two options are not equivalent because the procedures use two different methods to standardize the regression coefficients. This article explains the STDCOEF option in PROC GLIMMIX and how it differs from the STB option in PROC REG.

How to interpret standardized regression coefficients

Briefly, researchers look at standardized regression coefficients when the explanatory variables are measured in different units and on different scales. If you do not standardize the data, then a regression coefficient estimates the change in the response variable when you make a unit change in the explanatory variable, holding the other variables constant. As explained in my previous article, the STB option in PROC REG enables you to interpret the regression in terms of standard deviations. For the standardized coefficients in PROC REG, you can interpret the regression coefficient as the number of standard deviations that the response changes for one "standard deviation" of change in the explanatory variable, holding the other variables constant.

The standardized coefficients in PROC REG have an alternative interpretation: They are the regression estimates that you obtain if you standardize all variables (explanatory AND response) by using the usual standardization formula. The usual standardization is Z* = (Z - mean(Z))/std(Z), where mean(Z) is the sample mean and std(Z) is the sample standard deviation.
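You can verify this interpretation with a short sketch (my own, using the Sashelp.Class data that is analyzed later in this article): standardize all the variables, rerun the regression, and compare the estimates to the STB output.

/* Sketch: standardize all variables to mean 0 and unit standard
   deviation, then refit. The parameter estimates should match the
   STB column from PROC REG on the original data. */
proc stdize data=Sashelp.Class out=StdClass method=std;
   var Weight Height Age;
run;
 
proc reg data=StdClass;
   model Weight = Height Age;
quit;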

If you adopt the second interpretation, then you quickly realize that there are many ways to produce standardized regression coefficients. If you standardize the explanatory variables by any method, you immediately obtain a corresponding set of standardized regression coefficients.

There is no reason why you must include the response variable among the set of variables that you standardize. You can compare the effect of explanatory variables that have different scales without standardizing the response variable (J. Bring, 1994, "How to Standardize Regression Coefficients", Eqn 2.5).

Other standardizations

The STDCOEF option in PROC GLIMMIX produces standardized regression coefficients that are different from those in PROC REG because it uses a different standardization of the variables. First, PROC GLIMMIX does not standardize the response variable. This makes sense for a procedure that supports many different distributions of response variables, such as binary, multinomial, Poisson, and more. Second, instead of dividing each explanatory variable by its standard deviation, PROC GLIMMIX divides by the square root of the corrected sum of squares (CSS): sqrt( Σi (xi − m)² ), where m is the sample mean.

Notice that sqrt(CSS) is almost the same as the standard deviation, except that it lacks the divisor sqrt(N−1), where N is the number of nonmissing observations. In ANOVA computations, the CSS is sometimes called the total sum of squares (TSS). In regression, the CSS of the response is the residual sum of squares (RSS) for an intercept-only model. Whatever you call it, it represents the deviation of the data from the mean, in squared units. Taking the square root gives a statistic that represents the total deviation of the data from the mean in the original units. The documentation for PROC GLIMMIX includes this formula and states "If you specify the STDCOEF option, fixed-effects parameter estimates and their standard errors are reported in terms of the standardized (scaled and centered) coefficients in addition to the usual results in noncentered form."
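You can see the relationship numerically with a quick sketch (mine, not from the article): for each variable, sqrt(CSS) = std * sqrt(N−1).

/* Sketch: compare the CSS and the standard deviation.
   For each variable, sqrt(CSS) equals std * sqrt(N-1). */
proc means data=Sashelp.Class css std n;
   var Height Age;
run;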

That, then, explains the difference between the standardized regression estimates in PROC REG and the estimates in PROC GLIMMIX. The procedures use different standardizations, and PROC REG also standardizes the response variable whereas PROC GLIMMIX does not.

The denominator, sqrt(CSS), measures the spread of the data much like the standard deviation does, but it is a "total spread" instead of an "average spread." Interpreting a standardized coefficient in PROC GLIMMIX is not very intuitive. You could say that the coefficient estimates the change in the response variable when you make a change of s units in the explanatory variable (holding the other variables constant), where s is a measure of the total spread of the data. Although that sentence is long and wordy, remember that the purpose of the standardized coefficients is to enable you to compare the relative effects of explanatory variables that are measured on different scales. The PROC GLIMMIX standardization accomplishes that purpose.

A SAS program to compute the standardized regression coefficients

In my previous article, I showed a SAS program that produces the same standardized coefficients as the STB option in PROC REG. The program is straightforward: Create a new data set that contains the standardized variables and run an ordinary regression with those variables. You can do the same thing to reproduce the STDCOEF option in PROC GLIMMIX. For simplicity, I will consider a model that has only continuous variables; to handle classification variables, you can generate a design matrix for the explanatory variables and standardize the columns of the design matrix.

Let's use the same example data as in the previous article, which is the Sashelp.Class data set. It contains only 19 observations and has three continuous variables. Let's use Weight as the response variable and Height and Age as the explanatory variables. The following statements use PROC GLIMMIX to fit a fixed-effects model to the data. The STDCOEF option requests standardized regression estimates, which are shown below:

/* run PROC GLIMMIX on the original data; use STDCOEF option */
proc glimmix data=Sashelp.Class noitprint;
   model Weight = Height Age / ddfm=kr s STDCOEF; /* request standardization of estimates */
   ods select StandardizedCoefficients;
run;

According to the discussion in the previous section, you can obtain these same estimates by creating a new data set that contains centered and scaled versions of the explanatory variables. The variables are centered by using their sample means. The variables are scaled by using the square root of their CSS. The mean and CSS statistics are available from PROC MEANS, so you can write these statistics to a SAS data set and use the METHOD=IN option in PROC STDIZE to standardize the variables, as shown in a previous article. For simplicity, I have replaced the standardization code by a macro call so that the main idea is apparent:

/* Standardize 2 variables (Height and Age) in the Sashelp.Class data.
   Center by the mean; Scale by the sqrt(CSS).
   Write the standardized variables to the Want data set. */
%StdizeMeanSqrtCSS(Height Age, 2, Sashelp.Class, Want);   /* encapsulates the standardization */
 
/* run PROC GLIMMIX on the standardized data */
proc glimmix data=Want noitprint;          
   model Weight = Height Age / ddfm=kr s;
   ods select ParameterEstimates;
run;

Voila! As expected, you get the same estimates by using the standardized data as you get when using the original data and the STDCOEF option. You can download the complete SAS program, which includes the details of the standardization.
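If you prefer to see the standardization inline, here is a minimal sketch of one way to implement it (my own illustration; the downloadable program may differ in its details). It writes the mean and CSS from PROC MEANS to a data set and uses the METHOD=IN option in PROC STDIZE:

/* Sketch of the standardization: center by the mean, scale by sqrt(CSS).
   PROC STDIZE METHOD=IN expects a data set that contains
   _TYPE_='LOCATION' and _TYPE_='SCALE' observations. */
proc means data=Sashelp.Class noprint;
   var Height Age;
   output out=Stats(drop=_TYPE_ _FREQ_)
          mean=mHeight mAge css=cssHeight cssAge;
run;
 
data LocScale;
   length _TYPE_ $14;
   set Stats;
   _TYPE_='LOCATION'; Height=mHeight;          Age=mAge;          output;
   _TYPE_='SCALE';    Height=sqrt(cssHeight);  Age=sqrt(cssAge);  output;
   keep _TYPE_ Height Age;
run;
 
proc stdize data=Sashelp.Class out=Want method=in(LocScale);
   var Height Age;
run;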

Summary

This article explains the STDCOEF option in PROC GLIMMIX and how it differs from the STB option in PROC REG. PROC REG standardizes both the explanatory and response variables by using the formula (X − mean(X))/std(X). In contrast, PROC GLIMMIX does not standardize the response variable, and it standardizes the explanatory variables by using the formula (X − mean(X))/sqrt(CSS(X)).

The post Standardized regression coefficients in PROC GLIMMIX appeared first on The DO Loop.

March 8, 2021
 

I recently learned about a new feature in PROC QUANTREG that was added in SAS/STAT 15.1 (part of SAS 9.4M6). Recall that PROC QUANTREG enables you to perform quantile regression in SAS. (If you are not familiar with quantile regression, see an earlier article that describes quantile regression and provides introductory references.) Whereas ordinary least-squares regression predicts the conditional mean of a continuous response variable, quantile regression predicts one or more conditional quantiles, such as the median or deciles.

The CONDDIST statement in PROC QUANTREG enables you to estimate and visualize the conditional distribution of a response variable at any X value in regression data. Furthermore, it enables you to compare the conditional distribution to the "unconditional" distribution of the response (that is, to the marginal distribution). This article shows how to use the CONDDIST statement in PROC QUANTREG.

The main idea

Before discussing quantile regression, let's recall what we mean by a conditional and marginal distribution. The following SAS DATA step simulates 250 observations in a regression model where the response variable (Y) depends on an explanatory variable (X) and random noise. A scatter plot shows the data and overlays three sets of reference lines that will be used later in this article.

data Have;
   call streaminit(123);
   subj=1; x=2; y=12; output;
   subj=2; x=8; y=30; output;
   do subj = 3 to 250;
      x = rand('uniform', 1, 10);             /* x in interval (1, 10) */
      err = rand('normal', 0, 2+log(3*x));
      y = 1 + 4*x + err;                      /* Y = response variable */
      output;
   end;
run;
 
proc sql;
   select mean(x) into :xavg from Have; 
quit;
%let xavg = %sysfunc(round(&xavg,0.1));  /* round to nearest tenth */
%put &=xavg;
 
title "Joint Distribution of X and Y";
proc sgplot data=Have;
   scatter x=x y=y;
   refline 2 &xavg 8 / axis=X lineattrs=(thickness=2) label;
   xaxis grid; yaxis grid;
run;

The scatter plot shows that the conditional mean of Y at any value X is close to 1 + 4*X, which is the form of the underlying model. The data are simulated, so we know that the exact conditional distributions are normal distributions. In a subsequent section, we will use the CONDDIST statement in PROC QUANTREG to estimate the conditional distribution at X=2, X=8, and at the average X value, which is X=5.7. We will compare those distributions to the overall (marginal) distribution of Y.

The graph overlays three sets of drop lines to guide the discussion of conditional distributions. The following are visual estimates:

  • From the line at X=2, it looks like the conditional distribution of Y is centered near Y=9 and most of the probability density is on the interval [2, 18].
  • From the line at X=5.7, it looks like the conditional distribution of Y is centered near Y=24 and most of the probability density is on [13, 33].
  • From the line at X=8, it looks like the conditional distribution of Y is centered near Y=33 and most of the probability density is on [23, 43].

Empirical estimates for the marginal and conditional distributions

Before using PROC QUANTREG to estimate the marginal (unconditional) and conditional densities, let's estimate them by using a simpler method. The marginal distribution of Y is easy to estimate: Use the CDFPLOT statement in PROC UNIVARIATE to obtain the empirical cumulative distribution function (ECDF). You can use the HISTOGRAM statement and the KERNEL option to estimate the marginal density function (PDF).

/* The marginal distribution of Y does not consider values of X */
proc univariate data=Have;
   var y;
   cdfplot y / vscale=proportion    odstitle="Empirical Marginal CDF";
   histogram y / kernel outkernel=K odstitle="Empirical Marginal PDF";
   ods select CDFPlot Histogram;
run;

These graphs estimate the PDF and CDF of the Y variable by using the observed data, so they are empirical estimates. The kernel density estimate shows two small humps in the density near Y=15 and Y=30. The OUTKERNEL= option creates a SAS data set (K) that contains the kernel density estimate.

It is more difficult to estimate the conditional distribution of Y at some value of X. For definiteness, suppose we want to estimate the conditional distribution of Y at the mean of X, which is X=5.7. The simplest estimate is to choose all observations that are close to 5.7. There is no precise definition for "close to." In the following procedure, the WHERE clause limits the data to observations for which X is within 0.5 units of 5.7. I call this the "poor man's" estimation procedure because it doesn't use any analytics. As shown in the next section, quantile regression provides a more elegant estimate.

%let Threshold = 0.5;
proc univariate data=Have;
   WHERE abs(x-&xavg) < &Threshold;
   var y;
   cdfplot y / vscale=proportion       odstitle="Empirical Conditional CDF at X=5.7";
   histogram y / kernel outkernel=KAvg odstitle="Empirical Conditional PDF at X=5.7" ;
   ods select CDFPlot Histogram;
run;

Only the PDF is shown because it is easier to interpret. According to the "poor man's" estimate, the conditional distribution at X=5.7 looks approximately normal but perhaps with some positive skewness. The peak is near Y=23 and it looks like the ±3σ range is [13, 36]. These values are very close to the estimates that we guessed based on the scatter plot in the previous section.

Compare the conditional and marginal distributions

I used the OUTKERNEL= option to output the kernel density estimates to SAS data sets. To compare the marginal and conditional distributions, you can combine those data sets and overlay the estimates on a single graph, as follows:

/* Compare the marginal distribution of Y to the distribution of Y|(x is near average) */
data All;
label Dist = "Distribution";
set k(in=marginal) kAvg;
if marginal then Dist = "Marginal   ";
            else Dist = "Conditional";
run;
 
title "Empirical Marginal and ""Poor-Man's"" Conditional Distributions";
title2 "Approximation to the Conditional Distribution at x=&xavg";
proc sgplot data=All;
   series x=_Value_ y=_Density_ / group=Dist;
   xaxis grid; yaxis grid;
run;

This graph enables you to compare the overall distribution of the Y values (in blue) to the conditional distribution of the Y values near X=5.7 (in red). You can see that the conditional distribution is narrower and approximately normal.

To obtain this graph, I ran PROC UNIVARIATE twice, merged the outputs, and called PROC SGPLOT to overlay the curves. By using the CONDDIST statement in PROC QUANTREG, you can create a similar graph with one line of SAS code.

Estimate the conditional distribution at the mean of the explanatory variables

So far, the analyses have been empirical and heuristic. This section uses quantile regression to create a model of the conditional quantiles. The CONDDIST statement fits a quantile regression model for many quantiles. You can invert the quantile estimates to obtain a model-based estimate of the conditional distribution.

Let's start by visualizing several quantile regression models for these data. The following call to PROC QUANTREG fits nine models to the data. The models are for the quantiles 0.1, 0.2, ..., 0.9. The PLOT=FITPLOT option displays the nine quantile regression curves on a scatter plot of the data.

/* Quantile regression models */
proc quantreg data=Have ci=none;
   model y = x / plot=fitplot quantile=(0.1 to 0.9 by 0.1);
run;

The graph shows the nine quantile models. If you draw a vertical line at any X value (such as X=5.7), the places where the vertical line crosses the regression curves are the conditional quantiles for Y at X. You can think about plotting those Y values against the quantile values (0.1, 0.2, ..., 0.9) to obtain a model-based cumulative distribution curve.

But why stop at nine quantile models? Why not fit 100 models for the quantiles 0.005, 0.015, ..., 0.995? Then the scatter plot has 100 regression curves overlaid on it. If you draw a vertical line at X=5.7, that line will intersect the 100 regression lines. If the ordered pairs of the intersection points are (y1,0.005), (y2,0.015), ..., (y100, 0.995), then those 100 points estimate a model-based conditional distribution of Y at X=5.7.
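If you want to carry out that construction explicitly, the following sketch (my own illustration, not part of the original article) fits 100 quantile models, scores them at X=5.7, and plots a do-it-yourself conditional CDF:

/* Sketch: a hand-rolled version of the CONDDIST construction.
   Append a scoring observation that has a missing response: it is
   excluded from the fit but still receives predicted values. */
data ScoreX;
   x = 5.7;  y = .;
run;
data Both;
   set Have ScoreX;
run;
 
proc quantreg data=Both ci=none;
   model y = x / quantile=(0.005 to 0.995 by 0.01);
   /* assumption to verify: with multiple quantiles, the PREDICTED=
      name is suffixed by the quantile index (PredY1-PredY100) */
   output out=Pred predicted=PredY;
run;
 
/* extract the 100 predictions for the scoring observation */
proc transpose data=Pred(where=(missing(y))) out=CondCDF(rename=(Col1=Qy));
   var PredY:;
run;
data CondCDF;
   set CondCDF;
   tau = 0.005 + 0.01*(_N_ - 1);   /* quantile level for each model */
run;
 
title "Model-Based Conditional CDF at X=5.7";
proc sgplot data=CondCDF;
   series x=Qy y=tau;
   xaxis grid label="y"; yaxis grid label="Cumulative Probability";
run;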

That is essentially what the CONDDIST statement in PROC QUANTREG does. The documentation for the QUANTREG procedure contains a more mathematical description. The following call to the QUANTREG procedure uses the SHOWAVG option on the CONDDIST statement to create a CDF plot and a PDF plot. Each plot overlays the observed marginal distribution of Y and the conditional distribution at the average of the explanatory variables, which is X=5.7. To make the comparison easier, the marginal distribution is evaluated at the same quantiles as the conditional distribution.

/* The conditional distribution of the response at the average of X is
   created automatically by using the CONDDIST statement in PROC QUANTREG.
   The SHOWAVG option displays the conditional distribution of Y at the mean of X. */
%let HideAll = HideDropLines HideDropNumbers HideObsDots HideObsLabels;
proc quantreg data=Have ci=none;
   model y = x;
   conddist plot(&HideAll ShowGrids)=(cdfplot pdfplot) showavg; 
run;

The CONDDIST statement creates two plots, but only the PDF plot is shown here. Notice the similarity to the earlier graph that compared the kernel density estimates from PROC UNIVARIATE. In both graphs, the blue curve represents the marginal density estimate for Y. The red curve estimates the conditional density at X=5.7. In the previous graph, the estimate was crude and was obtained by throwing out a lot of the data and using only observations near X=5.7. In the current graph, the model-based estimate is formed by using all the data and by fitting 100 different quantile regression models. In both graphs, the conditional PDF has a peak near Y=23, is approximately normal, and the bulk of the density is supported on [13, 36].

Estimate the conditional distribution at a data value

The SHOWAVG option on the CONDDIST statement estimates the conditional distribution of Y at the average value of the explanatory variables. However, you can use the OBS= option to specify observations in the data at which to estimate the conditional distribution. When I created the simulated data for this example, I intentionally hard-coded the first two observations as (2, 12) and (8, 30). The following call to PROC QUANTREG estimates the conditional distribution at X=2 and X=8:

/* The conditional distribution of the response at the first two
   observations, requested by using the OBS= option on the CONDDIST
   statement in PROC QUANTREG */
proc quantreg data=Have ci=none;
   model y = x;
   conddist plot(ShowGrids)=(pdfplot) obs=(1 2) ;
run;

Again, the marginal distribution of Y is overlaid as a comparison. To help interpret the curves, PROC QUANTREG adds drop lines to the graph (by default). The vertical drop line indicates the Y value for the observation (or the mean of the Y values for the marginal density, which is shown in blue).

  • The estimate of the conditional distribution at X=2 is shown in red. The Y value of the data point is Y=12, which is larger than the most likely value (which is approximately Y=10). The distribution appears to be normally distributed with the bulk of the density on [0,18].
  • The estimate of the conditional distribution at X=8 is shown in green. The Y value of the data point is Y=30, which is smaller than the most likely value (which is approximately Y=34). The distribution appears to be approximately normal, but perhaps with some positive skewness. The bulk of the density is on [20,46].

Why is this useful?

Suppose you are measuring the blood pressure of patients. Some of the patients are overweight and others are not. If you set Y=BloodPressure and X=BMI (body-mass index), then a graph like this enables you to compare the distribution of blood pressures for all patients (the marginal distribution), for healthy-weight individuals (BMI=20), and for obese individuals (BMI=30). In fact, for ANY value of the BMI, you can compare the conditional distribution to the marginal distribution and/or to other conditional distributions.

And although I have presented an example with only one continuous variable, you can do the same thing for models that contain multiple explanatory variables, both continuous and CLASS variables.

I have left out a lot of details, but I hope I have whetted your appetite to learn more about the CONDDIST statement in PROC QUANTREG. The documentation for the QUANTREG procedure has additional information and additional examples. It also shows how you can fit the regression on one set of data (the training data) and apply the models to estimate "counterfactual" distributions for new data (the test data).

The post The conditional distribution of a response variable appeared first on The DO Loop.

December 14, 2020
 

A segmented regression model is a piecewise regression model that has two or more sub-models, each defined on a separate domain for the explanatory variables. For simplicity, assume the model has one continuous explanatory variable, X. The simplest segmented regression model assumes that the response is modeled by one parametric model when X is less than some threshold value and by a different parametric model when X is greater than the threshold value. The threshold value is also called a breakpoint, a cutpoint, or a join point.

A previous article shows that for simple piecewise polynomial models, you can use the EFFECT statement in SAS regression procedures to fit some segmented regression models with splines. The method relies on two assumptions. First, it assumes that you know the location of the breakpoint. Second, it assumes that the model has the same parametric form on each interval. For example, the model might be piecewise constant, piecewise linear, or piecewise quadratic.

If you need to estimate the location of the breakpoint from the data, or if you are modeling the response differently on each segment, you cannot use a single spline effect. Instead, you need to use a SAS procedure such as PROC NLIN that enables you to specify the model on each segment and to estimate the breakpoint. For simplicity, this article shows how to estimate a model that is quadratic on the first segment and constant on the second segment. (This is called a plateau model.) The model also estimates the location of the breakpoint.

An example of a segmented model

A SAS customer recently asked about segmented models on the SAS Support Communities. Suppose that a surgeon wants to model how long it takes her to perform a certain procedure over time. When the surgeon first started performing the procedure, it took about 3 hours (180 minutes). As she refined her technique, that time decreased. The surgeon wants to predict how long the surgery takes now, and she wants to estimate when the time reached its current plateau. The data are shown below and visualized in a scatter plot. The length of the procedure (in minutes) is recorded for 25 surgeries over a 16-month period.

data Have;
  input SurgeryNo Date :mmddyy10. duration @@;
  format Date mmddyy10.;
datalines;
1       3/20/2019       182   2       5/16/2019       150
3       5/30/2019       223   4       6/6/2019        142
5       6/11/2019       111   6       7/11/2019       164
7       7/26/2019        83   8       8/22/2019       144
9       8/29/2019       162  10       9/19/2019        83
11      10/10/2019       70  12       10/17/2019      114
13      10/31/2019      113  14       11/7/2019        97
15      11/21/2019       83  16       12/5/2019       111
17      12/5/2019        73  18       12/12/2019       87
19      12/19/2019       86  20       1/9/2020        102
21      1/16/2020       124  22       1/23/2020        95
23      1/30/2020       134  24       3/5/2020        121
25      6/4/2020         60
;
 
title "Time to Perform a Surgery";
proc sgplot data=Have;
  scatter x=Date y=Duration;
run;

From the graph, it looks like the duration for the procedure decreased until maybe November or December 2019. The goal of a segmented model is to find the breakpoint and to model the duration before and after the breakpoint. For this purpose, you can use a segmented plateau model that is a quadratic model prior to the breakpoint and a constant model after the breakpoint.

Formulate a segmented regression model

A segmented plateau model is one of the examples in the PROC NLIN documentation. The documentation shows how to use constraints in the problem to eliminate one or more parameters. For example, a common assumption is that the two segments join smoothly at the breakpoint location, x0. If f(x) is the predictive model to the left of the breakpoint and g(x) is the model to the right, then continuity dictates that f(x0) = g(x0) and smoothness dictates that f′(x0) = g′(x0). In many cases (such as when the models are low-degree polynomials), the two constraints enable you to reparameterize the models to eliminate two of the parameters.

The PROC NLIN documentation shows the details. Suppose f(x) is the quadratic function f(x) = α + βx + γx² and g(x) is the constant function g(x) = c. Because g is constant, the smoothness condition becomes f′(x0) = β + 2γx0 = 0, which determines x0. The continuity condition then gives c = f(x0) = α + βx0 + γx0². Consequently, (α, β, γ) are free parameters, and the other two parameters are determined by the formulas:
x0 = −β / (2γ)
c = α − β² / (4γ)
You can use the ESTIMATE statement in PROC NLIN to obtain estimates and standard errors for the x0 and c parameters.

Provide initial guesses for the parameters

As is often the case, the hard part is to guess initial values for the parameters. You must supply an initial guess on the PARMS statement in PROC NLIN. One way to create a guess is to use a related "reduced model" to provide the estimates. For example, you can use PROC GLM to fit a global quadratic model to the data, as follows:

/* rescale by using x="days since start" as the variable */
%let RefDate = '20MAR2019'd;
data New;
set Have;
rename duration=y;
x = date - &RefDate;  /* days since first record */
run;
 
proc glm data=New;
   model y = x x*x;
run;

These parameter estimates are used in the next section to specify the initial parameter values for the segmented model.

Fit a segmented regression model in SAS

Recall that a SAS date is represented by the number of days since 01JAN1960. Thus, for these data, the Date values are approximately 22,000. Numerically speaking, it is often better to use smaller numbers in a regression problem, so I will rescale the explanatory variable to be the number of days since the first surgery. You can use the parameter estimates from the quadratic model as the initial values for the segmented model:

title 'Segmented Model with Plateau';
proc nlin data=New plots=fit noitprint;
   parms alpha=184 beta= -0.5 gamma= 0.001;
 
   x0 = -0.5*beta / gamma;
   if (x < x0) then
        mean = alpha + beta*x  + gamma*x*x;    /* quadratic model for x < x0 */
   else mean = alpha + beta*x0 + gamma*x0*x0;  /* constant model for x >= x0 */
   model y = mean;
 
   estimate 'plateau'    alpha + beta*x0 + gamma*x0*x0;
   estimate 'breakpoint' -0.5*beta / gamma;
   output out=NLinOut predicted=Pred L95M=Lower U95M=Upper;
   ods select ParameterEstimates AdditionalEstimates FitPlot;
run;

The output from PROC NLIN includes estimates for the quadratic regression coefficients and for the breakpoint and the plateau value. According to the second table, the surgeon can now perform this surgical procedure in about 98 minutes, on average. The 95% confidence interval [77, 119] suggests that the surgeon might want to schedule two hours for this procedure so that she is not late for her next appointment.

According to the estimate for the breakpoint (x0), she achieved her plateau after about 287 days of practice. However, the confidence interval is quite large, so there is considerable uncertainty in this estimate.

The output from PROC NLIN includes a plot that overlays the predictions on the observed data. However, that graph is in terms of the number of days since the first surgery. If you want to return to the original scale, you can graph the predicted values versus the Date. You can also add reference lines that indicate the plateau value and the estimated breakpoint, as follows:

/* convert x back to original scale (Date) */
data _NULL_;
   plateauDate = 287 + &RefDate;    /* translate back to Date scale */
   call symput('x0', plateauDate);  /* store breakpoint in macro variable */
   put plateauDate DATE10.;
run;
 
/* plot the predicted values and CLM on the original scale */
proc sgplot data=NLinOut noautolegend;
   band x=Date lower=Lower upper=Upper;
   refline 97.97  / axis=y label="Plateau"    labelpos=max;
   refline &x0 / axis=x label="01JAN2020" labelpos=max;
   scatter x=Date y=y;
   series  x=Date y=Pred;
   yaxis label='Duration';
run;

The graph shows the breakpoint estimate, which happens to be 01JAN2020. It also shows the variability in the data and the wide prediction limits.

Summary

This article shows how to fit a simple segmented model in SAS. The model has one breakpoint, which is estimated from the data. The model is quadratic before the breakpoint, constant after the breakpoint, and joins smoothly at the breakpoint. The constraints on continuity and smoothness reduce the problem from five parameters to three free parameters. The article shows how to use PROC NLIN in SAS to solve segmented models and how to visualize the result.

You can use a segmented model for many other data sets. For example, if you are a runner and routinely run a 5k distance, you can use a segmented model to monitor your times. Are your times decreasing, or did you reach a plateau? When did you reach the plateau?

The post Segmented regression models in SAS appeared first on The DO Loop.

December 2, 2020
 

A SAS customer asked a great question: "I have parameter estimates for a logistic regression model that I computed by using multiple imputations. How do I use these parameter estimates to score new observations and to visualize the model? PROC LOGISTIC can do the computation I want, but how do I tell it to use the externally computed estimates?"

This article presents a solution for PROC LOGISTIC. At the end of this article, I present a few tips for other SAS procedures.

Here's the main idea: PROC LOGISTIC supports an INEST= option that you can use to specify initial values of the parameters. It also supports the MAXITER=0 option on the MODEL statement, which tells the procedure not to perform any iterations to try to improve the parameter estimates. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want. Furthermore, you can use the STORE statement to store the model and use PROC PLM to perform scoring, visualization, and other post-fitting analyses.

I have used this technique previously to compute parameter estimates in PROC HPLOGISTIC and use them in PROC LOGISTIC to estimate odds ratios, the covariance matrix of the parameters, and other inferential quantities that are not available in PROC HPLOGISTIC. In a similar way, PROC LOGISTIC can construct ROC curves for predictions that were made outside of PROC LOGISTIC.

Produce parameter estimates by using PROC MIANALYZE

As a motivating example, let's create parameter estimates by using multiple imputations. The documentation for PROC MIANALYZE has an example of using PROC MI and PROC MIANALYZE to estimate the parameters for a logistic model. The following data and analysis are from that example. The data are lengths and widths of two species of fish (perch and parkki). Missing values are artificially introduced. A scatter plot of the data is shown.

data Fish2;
   title 'Fish Measurement Data';
   input Species $ Length Width @@;
   datalines;
Parkki  16.5  2.3265    Parkki  17.4  2.3142    .      19.8   .
Parkki  21.3  2.9181    Parkki  22.4  3.2928    .      23.2  3.2944
Parkki  23.2  3.4104    Parkki  24.1  3.1571    .      25.8  3.6636
Parkki  28.0  4.1440    Parkki  29.0  4.2340    Perch   8.8  1.4080
.       14.7  1.9992    Perch   16.0  2.4320    Perch  17.2  2.6316
Perch   18.5  2.9415    Perch   19.2  3.3216    .      19.4   .
Perch   20.2  3.0502    Perch   20.8  3.0368    Perch  21.0  2.7720
Perch   22.5  3.5550    Perch   22.5  3.3075    .      22.5   .
Perch   22.8  3.5340    .       23.5   .        Perch  23.5  3.5250
Perch   23.5  3.5250    Perch   23.5  3.5250    Perch  23.5  3.9950
.       24.0   .        Perch   24.0  3.6240    Perch  24.2  3.6300
Perch   24.5  3.6260    Perch   25.0  3.7250    .      25.5  3.7230
Perch   25.5  3.8250    Perch   26.2  4.1658    Perch  26.5  3.6835
.       27.0  4.2390    Perch   28.0  4.1440    Perch  28.7  5.1373
.       28.9  4.3350    .       28.9   .        .      28.9  4.5662
Perch   29.4  4.2042    Perch   30.1  4.6354    Perch  31.6  4.7716
Perch   34.0  6.0180    .       36.5  6.3875    .      37.3  7.7957
.       39.0   .        .       38.3   .        Perch  39.4  6.2646
Perch   39.3  6.3666    Perch   41.4  7.4934    Perch  41.4  6.0030
Perch   41.3  7.3514    .       42.3   .        Perch  42.5  7.2250
Perch   42.4  7.4624    Perch   42.5  6.6300    Perch  44.6  6.8684
Perch   45.2  7.2772    Perch   45.5  7.4165    Perch  46.0  8.1420
Perch   46.6  7.5958
;
 
proc format;
value $FishFmt
    " " = "Unknown";
run;
 
proc sgplot data=Fish2;
   format Species $FishFmt.;
   styleattrs DATACONTRASTCOLORS=(DarkRed LightPink DarkBlue);
   scatter x=Length y=Width / group=Species markerattrs=(symbol=CircleFilled);
run;

The analyst wants to use PROC LOGISTIC to create a model that uses Length and Width to predict whether a fish is perch or parkki. The scatter plot shows that the parkki (dark red) tend to be less wide than perch of the same length. For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red). For some fish in the graph, the species is not known.

Because the data contain missing values, the analyst uses PROC MI to run 25 imputations of the missing values, uses PROC LOGISTIC to produce 25 sets of parameter estimates, and uses PROC MIANALYZE to combine the estimates into a single set of parameter estimates. See the documentation for a discussion.

/* Example from the MIANALYZE documentation 
   "Reading Logistic Model Results from a PARMS= Data Set"  https://bit.ly/394VlI7
*/
proc mi data=Fish2 seed=1305417 out=outfish2;
   class Species;
   monotone logistic( Species= Length Width);
   var Length Width Species;
run;
 
ods select none; options nonotes;
proc logistic data=outfish2;
   class Species;
   model Species= Length Width / covb;
   by _Imputation_;
   ods output ParameterEstimates=lgsparms;
run;
ods select all; options notes;
 
proc mianalyze parms=lgsparms;
   modeleffects Intercept Length Width;
   ods output ParameterEstimates=MI_PE;
run;
 
proc print data=MI_PE noobs; 
  var Parm Estimate;
run;

The parameter estimates from PROC MIANALYZE are shown. The question is: How can you use PROC LOGISTIC and PROC PLM to score and visualize this model, given that the estimates are produced outside of PROC LOGISTIC?

Get PROC LOGISTIC to use external estimates

As mentioned earlier, a solution to this problem is to use the INEST= option on the PROC LOGISTIC statement in conjunction with the MAXITER=0 option on the MODEL statement. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want, and you can use the STORE statement to create an item store that can be read by PROC PLM to perform scoring and visualization.

You can create the INEST= data set by hand, but it is easier to use PROC LOGISTIC to create an OUTEST= data set and then merely change the values for the parameter estimates, as done in the following example:

/* 1. Use PROC LOGISTIC to create an OUTEST= data set */
proc logistic data=Fish2 outest=OutEst noprint;
   class Species;
   model Species= Length Width;
run;
 
/* 2. replace the values of the parameter estimates with different values */
data inEst;
set outEst;
Intercept = -0.130560;
Length = 1.169782;
Width = -8.284998;
run;
 
/* 3. Use the INEST= data set and MAXITER=0 to get PROC LOGISTIC to create
      a model. Use the STORE statement to write an item store.
      https://blogs.sas.com/content/iml/2019/06/26/logistic-estimates-from-hplogistic.html
*/
proc logistic data=Fish2 inest=InEst;        /* read in external model */
   model Species= Length Width / maxiter=0;  /* do not refine model fit */
   effectplot contour / ilink;
   store LogiModel;
run;

The contour plot is part of the output from PROC LOGISTIC. You could also request an ROC curve, odds ratios, and other statistics. The contour plot visualizes the regression model. For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red).

Scoring the model

Because PROC LOGISTIC writes an item store for the model, you can use PROC PLM to perform a variety of scoring tasks, visualization, and hypothesis tests. The following statements create a scoring data set and use PROC PLM to score the model and estimate the probability that each fish is a parkki:

/* 4. create a score data set */
data NewFish;
input Length Width;
datalines;
17.0  2.7
18.1  2.1
21.3  2.9
22.4  3.0
29.1  4.3
;
 
/* 5. predictions on the DATA scale */
proc plm restore=LogiModel noprint;
   score data=NewFish out=ScoreILink
         predicted lclm uclm / ilink; /* ILINK gives probabilities */
run;
 
proc print data=ScoreILink; run;

According to the model, the first and fifth fish are probably perch. The second, third, and fourth fish are predicted to be parkki, although the 95% confidence intervals indicate that you should not be too confident in the predictions for the third and fourth observations.

Thoughts on other regression procedures

Unfortunately, not every regression procedure in SAS is as flexible as PROC LOGISTIC. In many cases, it might be difficult or impossible to "trick" a SAS regression procedure into analyzing a model that was produced externally. Here are a few thoughts from me and from one of my colleagues. I didn't have time to fully investigate these ideas, so caveat emptor!

  • For least squares models, the venerable PROC SCORE can handle the scoring. It can read an OUTEST/INEST-style data set, just like in the PROC LOGISTIC example. If you have CLASS variables or other constructed effects (for example, spline effects), you will have to use columns of a design matrix as variables.
  • Many SAS regression procedures support the CODE statement, which writes DATA step code to score the model. Because the CODE statement writes a text file, you can edit the text file and replace the parameter estimates in the file with different estimates. However, the CODE statement does not handle constructed effects.
  • The MODEL statement of PROC GENMOD supports the INITIAL= and INTERCEPT= options. Therefore, you ought to be able to specify the initial values for the parameter estimates. PROC GENMOD also supports the MAXITER=0 option. Be aware that the values on the INITIAL= option must match the order in which the estimates appear in the ParameterEstimates table. (A sketch appears after this list.)
  • Some nonlinear modeling procedures (such as PROC NLIN and PROC NLMIXED) support ways to specify the initial values for parameters. If you specify TECH=NONE, then the procedure does not perform any optimization. These procedures also support the MAXITER= option.
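For the PROC GENMOD idea, a minimal sketch might look like the following. This is my illustration, not tested code from the article: verify the INITIAL=, INTERCEPT=, and MAXITER= syntax against the documentation, and check that PROC GENMOD models the same event level as PROC LOGISTIC before you trust the predictions.

/* Sketch (untested assumption): evaluate externally supplied estimates
   in PROC GENMOD by fixing the initial values and suppressing iteration.
   The INITIAL= values must be listed in the order of the
   ParameterEstimates table. */
proc genmod data=Fish2;
   model Species = Length Width / dist=binomial link=logit
         intercept=-0.130560
         initial=1.169782 -8.284998
         maxiter=0;
run;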

Summary

This article shows how to score parametric regression models when the parameter estimates are not fit by the usual procedures. For example, multiple imputations can produce a set of parameter estimates. In PROC LOGISTIC, you can use an INEST= data set to read the estimates and use the MAXITER=0 option to suppress fitting. You can use the STORE statement to store the model and use PROC PLM to perform scoring and visualization. Other procedures have similar options, but there is not a single method that works for all SAS regression procedures.

If you use any of the ideas in this article, let me know how they work by leaving a comment. If you have an alternate way to trick SAS regression procedures into using externally supplied estimates, let me know that as well.

The post How to score a logistic regression model that was not fit by PROC LOGISTIC appeared first on The DO Loop.

12月 022020
 

A SAS customer asked a great question: "I have parameter estimates for a logistic regression model that I computed by using multiple imputations. How do I use these parameter estimates to score new observations and to visualize the model? PROC LOGISTIC can do the computation I want, but how do I tell it to use the externally computed estimates?"

This article presents a solution for PROC LOGISTIC. At the end of this article, I present a few tips for other SAS procedures.

Here's the main idea: PROC LOGISTIC supports an INEST= option that you can use to specify initial values of the parameters. It also supports the MAXITER=0 option on the MODEL statement, which tells the procedure not to perform any iterations to try to improve the parameter estimates. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want. Furthermore, you can use the STORE statement to store the model and use PROC PLM to perform scoring, visualization, and other post-fitting analyses.

I have used this technique previously to compute parameter estimates in PROC HPLOGISTIC and use them in PROC LOGISTIC to estimate odds ratios, the covariance matrix of the parameters, and other inferential quantities that are not available in PROC HPLOGISTIC. In a similar way, PROC LOGISTIC can construct ROC curves for predictions that were made outside of PROC LOGISTIC.

Produce parameter estimates by using PROC MIANALYZE

As a motivating example, let's create parameter estimates by using multiple imputations. The documentation for PROC MIANALYZE has an example of using PROC MI and PROC MIANALYZE to estimate the parameters for a logistic model. The following data and analysis are from that example. The data are lengths and widths of two species of fish (perch and parkki). Missing values are artificially introduced. A scatter plot of the data is shown.

data Fish2;
   title 'Fish Measurement Data';
   input Species $ Length Width @@;
   datalines;
Parkki  16.5  2.3265    Parkki  17.4  2.3142    .      19.8   .
Parkki  21.3  2.9181    Parkki  22.4  3.2928    .      23.2  3.2944
Parkki  23.2  3.4104    Parkki  24.1  3.1571    .      25.8  3.6636
Parkki  28.0  4.1440    Parkki  29.0  4.2340    Perch   8.8  1.4080
.       14.7  1.9992    Perch   16.0  2.4320    Perch  17.2  2.6316
Perch   18.5  2.9415    Perch   19.2  3.3216    .      19.4   .
Perch   20.2  3.0502    Perch   20.8  3.0368    Perch  21.0  2.7720
Perch   22.5  3.5550    Perch   22.5  3.3075    .      22.5   .
Perch   22.8  3.5340    .       23.5   .        Perch  23.5  3.5250
Perch   23.5  3.5250    Perch   23.5  3.5250    Perch  23.5  3.9950
.       24.0   .        Perch   24.0  3.6240    Perch  24.2  3.6300
Perch   24.5  3.6260    Perch   25.0  3.7250    .      25.5  3.7230
Perch   25.5  3.8250    Perch   26.2  4.1658    Perch  26.5  3.6835
.       27.0  4.2390    Perch   28.0  4.1440    Perch  28.7  5.1373
.       28.9  4.3350    .       28.9   .        .      28.9  4.5662
Perch   29.4  4.2042    Perch   30.1  4.6354    Perch  31.6  4.7716
Perch   34.0  6.0180    .       36.5  6.3875    .      37.3  7.7957
.       39.0   .        .       38.3   .        Perch  39.4  6.2646
Perch   39.3  6.3666    Perch   41.4  7.4934    Perch  41.4  6.0030
Perch   41.3  7.3514    .       42.3   .        Perch  42.5  7.2250
Perch   42.4  7.4624    Perch   42.5  6.6300    Perch  44.6  6.8684
Perch   45.2  7.2772    Perch   45.5  7.4165    Perch  46.0  8.1420
Perch   46.6  7.5958
;
 
proc format;
value $FishFmt
    " " = "Unknown";
run;
 
proc sgplot data=Fish2;
   format Species $FishFmt.;
   styleattrs DATACONTRASTCOLORS=(DarkRed LightPink DarkBlue);
   scatter x=Length y=Width / group=Species markerattrs=(symbol=CircleFilled);
run;

The analyst wants to use PROC LOGISTIC to create a model that uses Length and Width to predict whether a fish is perch or parkki. The scatter plot shows that the parkki (dark red) tend to be less wide than the perch of the same length For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red). For some fish in the graph, the species is not known.

Because the data contains missing values, the analyst uses PROC MI to run 25 missing value imputations, uses PROC LOGISTIC to produce 25 sets of parameter estimates, and uses PROC MI to combine the estimates into a single set of parameter estimates. See the documentation for a discussion.

/* Example from the MIANALYZE documentation 
   "Reading Logistic Model Results from a PARMS= Data Set"  https://bit.ly/394VlI7
*/
proc mi data=Fish2 seed=1305417 out=outfish2;
   class Species;
   monotone logistic( Species= Length Width);
   var Length Width Species;
run;
 
ods select none; options nonotes;
proc logistic data=outfish2;
   class Species;
   model Species= Length Width / covb;
   by _Imputation_;
   ods output ParameterEstimates=lgsparms;
run;
ods select all; options notes;
 
proc mianalyze parms=lgsparms;
   modeleffects Intercept Length Width;
   ods output ParameterEstimates=MI_PE;
run;
 
proc print data=MI_PE noobs; 
  var Parm Estimate;
run;

The parameter estimates from PROC MIANALYZE are shown. The question is: How can you use PROC LOGISTIC and PROC PLM to score and visualize this model, given that the estimates are produced outside of PROC LOGISTIC?

Get PROC LOGISTIC to use external estimates

As mentioned earlier, a solution to this problem is to use the INEST= option on the PROC LOGISTIC statement in conjunction with the MAXITER=0 option on the MODEL statement. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want, and you can use the STORE statement to create an item store that can be read by PROC PLM to perform scoring and visualization.

You can create the INEST= data set by hand, but it is easier to use PROC LOGISTIC to create an OUTEST= data set and then merely change the values for the parameter estimates, as done in the following example:

/* 1. Use PROC LOGISTIC to create an OUTEST= data set */
proc logistic data=Fish2 outest=OutEst noprint;
   class Species;
   model Species= Length Width;
run;
 
/* 2. replace the values of the parameter estimates with different values */
data inEst;
set outEst;
Intercept = -0.130560;
Length = 1.169782;
Width = -8.284998;
run;
 
/* 3. Use the INEST= data set and MAXITER=0 to get PROC LOGISTIC to create
      a model. Use the STORE statement to write an item store.
      https://blogs.sas.com/content/iml/2019/06/26/logistic-estimates-from-hplogistic.html
*/
proc logistic data=Fish2 inest=InEst;        /* read in extermal model */
   model Species= Length Width / maxiter=0;  /* do not refine model fit */
   effectplot contour / ilink;
   store LogiModel;
run;

The contour plot is part of the output from PROC LOGISTIC. You could also request an ROC curve, odds ratios, and other statistics. The contour plot visualizes the regression model. For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red).

Scoring the model

Because PROC LOGISTIC writes an item store for the model, you can use PROC PLM to perform a variety of scoring tasks, visualization, and hypothesis tests. The following statements create a scoring data set and use PROC PLM to score the model and estimate the probability that each fish is a parkki:

/* 4. create a score data set */
data NewFish;
input Length Width;
datalines;
17.0  2.7
18.1  2.1
21.3  2.9
22.4  3.0
29.1  4.3
;
 
/* 5. predictions on the DATA scale */
proc plm restore=LogiModel noprint;
   score data=NewFish out=ScoreILink
         predicted lclm uclm / ilink; /* ILINK gives probabilities */
run;
 
proc print data=ScoreILink; run;

According to the model, the first and fifth fish are probably perch. The second, third, and fourth fish are predicted to be parkki, although the 95% confidence intervals indicate that you should not be too confident in the predictions for the third and fourth observations.

Thoughts on other regression procedures

Unfortunately, not every regression procedure in SAS is as flexible as PROC LOGISTIC. In many cases, it might be difficult or impossible to "trick" a SAS regression procedure into analyzing a model that was produced externally. Here are a few thoughts from me and from one of my colleagues. I didn't have time to fully investigate these ideas, so caveat emptor!

  • For least squares models, the venerable PROC SCORE can handle the scoring. It can read an OUTEST/INEST-style data set, just like in the PROC LOGISTIC example. If you have CLASS variables or other constructed effects (for example, spline effects), you will have to use columns of a design matrix as variables.
  • Many SAS regression procedures support the CODE statement, which writes DATA step code to score the model. Because the CODE statement writes a text file, you can edit the text file and replace the parameter estimates in the file with different estimates. However, the CODE statement does not handle constructed effects.
  • The MODEL statement of PROC GENMOD supports the INITIAL= and INTERCEPT= options. Therefore, you ought to be able to specify the initial values for parameter estimates. PROC GENMOD also supports the MAXITER=0 option. Be aware that the values on the INITIAL= option must match the order that the estimates appear in the ParameterEstimates table.
  • Some nonlinear modeling procedures (such as PROC NLIN and PROC NLMIXED) support ways to specify the initial values for parameters. If you specify TECH=NONE, then the procedure does not perform any optimization. These procedures also support the MAXITER= option.

Summary

This article shows how to score parametric regression models when the parameter estimates are not fit by the usual procedures. For example, multiple imputations can produce a set of parameter estimates. In PROC LOGISTIC, you can use an INEST= data set to read the estimates and use the MAXITER=0 option to suppress fitting. You can use the STORE statement to store the model and use PROC PLM to perform scoring and visualization. Other procedures have similar options, but there is not a single method that works for all SAS regression procedures.

If you use any of the ideas in this article, let me know how they work by leaving a comment. If you have an alternate way to trick SAS regression procedures into using externally supplied estimates, let me know that as well.

The post How to score a logistic regression model that was not fit by PROC LOGISTIC appeared first on The DO Loop.

12月 022020
 

A SAS customer asked a great question: "I have parameter estimates for a logistic regression model that I computed by using multiple imputations. How do I use these parameter estimates to score new observations and to visualize the model? PROC LOGISTIC can do the computation I want, but how do I tell it to use the externally computed estimates?"

This article presents a solution for PROC LOGISTIC. At the end of this article, I present a few tips for other SAS procedures.

Here's the main idea: PROC LOGISTIC supports an INEST= option that you can use to specify initial values of the parameters. It also supports the MAXITER=0 option on the MODEL statement, which tells the procedure not to perform any iterations to try to improve the parameter estimates. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want. Furthermore, you can use the STORE statement to store the model and use PROC PLM to perform scoring, visualization, and other post-fitting analyses.

I have used this technique previously to compute parameter estimates in PROC HPLOGISTIC and use them in PROC LOGISTIC to estimate odds ratios, the covariance matrix of the parameters, and other inferential quantities that are not available in PROC HPLOGISTIC. In a similar way, PROC LOGISTIC can construct ROC curves for predictions that were made outside of PROC LOGISTIC.

Produce parameter estimates by using PROC MIANALYZE

As a motivating example, let's create parameter estimates by using multiple imputations. The documentation for PROC MIANALYZE has an example of using PROC MI and PROC MIANALYZE to estimate the parameters for a logistic model. The following data and analysis are from that example. The data are lengths and widths of two species of fish (perch and parkki). Missing values are artificially introduced. A scatter plot of the data is shown.

data Fish2;
   title 'Fish Measurement Data';
   input Species $ Length Width @@;
   datalines;
Parkki  16.5  2.3265    Parkki  17.4  2.3142    .      19.8   .
Parkki  21.3  2.9181    Parkki  22.4  3.2928    .      23.2  3.2944
Parkki  23.2  3.4104    Parkki  24.1  3.1571    .      25.8  3.6636
Parkki  28.0  4.1440    Parkki  29.0  4.2340    Perch   8.8  1.4080
.       14.7  1.9992    Perch   16.0  2.4320    Perch  17.2  2.6316
Perch   18.5  2.9415    Perch   19.2  3.3216    .      19.4   .
Perch   20.2  3.0502    Perch   20.8  3.0368    Perch  21.0  2.7720
Perch   22.5  3.5550    Perch   22.5  3.3075    .      22.5   .
Perch   22.8  3.5340    .       23.5   .        Perch  23.5  3.5250
Perch   23.5  3.5250    Perch   23.5  3.5250    Perch  23.5  3.9950
.       24.0   .        Perch   24.0  3.6240    Perch  24.2  3.6300
Perch   24.5  3.6260    Perch   25.0  3.7250    .      25.5  3.7230
Perch   25.5  3.8250    Perch   26.2  4.1658    Perch  26.5  3.6835
.       27.0  4.2390    Perch   28.0  4.1440    Perch  28.7  5.1373
.       28.9  4.3350    .       28.9   .        .      28.9  4.5662
Perch   29.4  4.2042    Perch   30.1  4.6354    Perch  31.6  4.7716
Perch   34.0  6.0180    .       36.5  6.3875    .      37.3  7.7957
.       39.0   .        .       38.3   .        Perch  39.4  6.2646
Perch   39.3  6.3666    Perch   41.4  7.4934    Perch  41.4  6.0030
Perch   41.3  7.3514    .       42.3   .        Perch  42.5  7.2250
Perch   42.4  7.4624    Perch   42.5  6.6300    Perch  44.6  6.8684
Perch   45.2  7.2772    Perch   45.5  7.4165    Perch  46.0  8.1420
Perch   46.6  7.5958
;
 
proc format;
value $FishFmt
    " " = "Unknown";
run;
 
proc sgplot data=Fish2;
   format Species $FishFmt.;
   styleattrs DATACONTRASTCOLORS=(DarkRed LightPink DarkBlue);
   scatter x=Length y=Width / group=Species markerattrs=(symbol=CircleFilled);
run;

The analyst wants to use PROC LOGISTIC to create a model that uses Length and Width to predict whether a fish is perch or parkki. The scatter plot shows that the parkki (dark red) tend to be less wide than the perch of the same length For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red). For some fish in the graph, the species is not known.

Because the data contains missing values, the analyst uses PROC MI to run 25 missing value imputations, uses PROC LOGISTIC to produce 25 sets of parameter estimates, and uses PROC MI to combine the estimates into a single set of parameter estimates. See the documentation for a discussion.

/* Example from the MIANALYZE documentation 
   "Reading Logistic Model Results from a PARMS= Data Set"  https://bit.ly/394VlI7
*/
proc mi data=Fish2 seed=1305417 out=outfish2;
   class Species;
   monotone logistic( Species= Length Width);
   var Length Width Species;
run;
 
ods select none; options nonotes;
proc logistic data=outfish2;
   class Species;
   model Species= Length Width / covb;
   by _Imputation_;
   ods output ParameterEstimates=lgsparms;
run;
ods select all; options notes;
 
proc mianalyze parms=lgsparms;
   modeleffects Intercept Length Width;
   ods output ParameterEstimates=MI_PE;
run;
 
proc print data=MI_PE noobs; 
  var Parm Estimate;
run;

The parameter estimates from PROC MIANALYZE are shown. The question is: How can you use PROC LOGISTIC and PROC PLM to score and visualize this model, given that the estimates are produced outside of PROC LOGISTIC?

Get PROC LOGISTIC to use external estimates

As mentioned earlier, a solution to this problem is to use the INEST= option on the PROC LOGISTIC statement in conjunction with the MAXITER=0 option on the MODEL statement. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want, and you can use the STORE statement to create an item store that can be read by PROC PLM to perform scoring and visualization.

You can create the INEST= data set by hand, but it is easier to use PROC LOGISTIC to create an OUTEST= data set and then merely change the values for the parameter estimates, as done in the following example:

/* 1. Use PROC LOGISTIC to create an OUTEST= data set */
proc logistic data=Fish2 outest=OutEst noprint;
   class Species;
   model Species= Length Width;
run;
 
/* 2. replace the values of the parameter estimates with different values */
data inEst;
set outEst;
Intercept = -0.130560;
Length = 1.169782;
Width = -8.284998;
run;
 
/* 3. Use the INEST= data set and MAXITER=0 to get PROC LOGISTIC to create
      a model. Use the STORE statement to write an item store.
      https://blogs.sas.com/content/iml/2019/06/26/logistic-estimates-from-hplogistic.html
*/
proc logistic data=Fish2 inest=InEst;        /* read in external model */
   model Species= Length Width / maxiter=0;  /* do not refine model fit */
   effectplot contour / ilink;
   store LogiModel;
run;

The contour plot is part of the output from PROC LOGISTIC. You could also request an ROC curve, odds ratios, and other statistics. The contour plot visualizes the regression model. For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red).
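
For example, the following sketch (output not shown) requests an ROC plot for the externally specified model by adding the PLOTS(ONLY)=ROC option. Because MAXITER=0 suppresses refitting, the ROC curve reflects the supplied estimates:

/* a minimal sketch: ROC curve for the externally specified model */
proc logistic data=Fish2 inest=InEst plots(only)=roc;
   model Species= Length Width / maxiter=0;  /* evaluate; do not refit */
run;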

Scoring the model

Because PROC LOGISTIC writes an item store for the model, you can use PROC PLM to perform a variety of scoring tasks, visualization, and hypothesis tests. The following statements create a scoring data set and use PROC PLM to score the model and estimate the probability that each fish is a parkki:

/* 4. create a score data set */
data NewFish;
input Length Width;
datalines;
17.0  2.7
18.1  2.1
21.3  2.9
22.4  3.0
29.1  4.3
;
 
/* 5. predictions on the DATA scale */
proc plm restore=LogiModel noprint;
   score data=NewFish out=ScoreILink
         predicted lclm uclm / ilink; /* ILINK gives probabilities */
run;
 
proc print data=ScoreILink; run;

According to the model, the first and fifth fish are probably perch. The second, third, and fourth fish are predicted to be parkki, although the 95% confidence intervals indicate that you should not be too confident in the predictions for the third and fourth observations.
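
As an aside, it is the ILINK option that converts the predictions to the probability scale. Here is a minimal sketch of the alternative: if you omit ILINK, PROC PLM returns predictions on the scale of the linear predictor (the log-odds scale):

/* sketch: predictions on the LINK (log-odds) scale; omit the ILINK option */
proc plm restore=LogiModel noprint;
   score data=NewFish out=ScoreLink predicted lclm uclm;
run;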

Thoughts on other regression procedures

Unfortunately, not every regression procedure in SAS is as flexible as PROC LOGISTIC. In many cases, it might be difficult or impossible to "trick" a SAS regression procedure into analyzing a model that was produced externally. Here are a few thoughts from me and from one of my colleagues. I didn't have time to fully investigate these ideas, so caveat emptor!

  • For least squares models, the venerable PROC SCORE can handle the scoring. It can read an OUTEST/INEST-style data set, just like in the PROC LOGISTIC example. If you have CLASS variables or other constructed effects (for example, spline effects), you will have to use columns of a design matrix as variables. (See the first sketch after this list.)
  • Many SAS regression procedures support the CODE statement, which writes DATA step code to score the model. Because the CODE statement writes a text file, you can edit the text file and replace the parameter estimates in the file with different estimates. However, the CODE statement does not handle constructed effects.
  • The MODEL statement of PROC GENMOD supports the INITIAL= and INTERCEPT= options, so you ought to be able to specify initial values for the parameter estimates. PROC GENMOD also supports the MAXITER=0 option. Be aware that the values on the INITIAL= option must be listed in the order in which the estimates appear in the ParameterEstimates table. (See the second sketch after this list.)
  • Some nonlinear modeling procedures (such as PROC NLIN and PROC NLMIXED) support ways to specify the initial values for parameters. If you specify TECH=NONE, then the procedure does not perform any optimization. These procedures also support the MAXITER= option.
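
The following two sketches illustrate the first and third ideas. They are untested outlines rather than verified solutions: the data set and variable names in the first sketch are hypothetical, and in the second sketch you should verify that PROC GENMOD models the same event level (here, Parkki) as PROC LOGISTIC did.

/* Sketch 1: use PROC SCORE with external estimates for a least squares model.
   The data sets (Train, New) and the replacement coefficients are hypothetical. */
proc reg data=Train outest=RegEst noprint;   /* template OUTEST= data set */
   model Y = x1 x2;
quit;
 
data MyEst;                 /* overwrite the estimates with external values */
   set RegEst;
   Intercept = 1.5;  x1 = 2;  x2 = -3;
run;
 
proc score data=New score=MyEst type=parms predict out=Pred;
   var x1 x2;               /* variables that have scoring coefficients */
run;
 
/* Sketch 2: evaluate a logistic model in PROC GENMOD at external estimates.
   The INITIAL= values must follow the order of the ParameterEstimates table;
   check that GENMOD models the same event (Parkki) as PROC LOGISTIC did. */
proc genmod data=Fish2;
   model Species = Length Width / dist=binomial link=logit
         intercept=-0.130560            /* external intercept */
         initial=1.169782 -8.284998     /* external slopes for Length and Width */
         maxiter=0;                     /* keep the supplied values */
run;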

Summary

This article shows how to score parametric regression models when the parameter estimates are produced outside of the usual fitting procedures. For example, multiple imputation can produce a combined set of parameter estimates. In PROC LOGISTIC, you can use an INEST= data set to read the estimates and use the MAXITER=0 option to suppress fitting. You can use the STORE statement to store the model and use PROC PLM to perform scoring and visualization. Other procedures have similar options, but there is not a single method that works for all SAS regression procedures.

If you use any of the ideas in this article, let me know how they work by leaving a comment. If you have an alternate way to trick SAS regression procedures into using externally supplied estimates, let me know that as well.

The post How to score a logistic regression model that was not fit by PROC LOGISTIC appeared first on The DO Loop.

9月 212020
 

A previous article discussed how to solve regression problems in which the parameters are constrained to be a specified constant (such as B1 = 1) or are restricted to obey a linear equation such as B4 = –2*B2. In SAS, you can use the RESTRICT statement in PROC REG to solve restricted least squares problems. However, if a constraint is an INEQUALITY instead of an EQUALITY, then (in general) a least squares method does not solve the problem. Therefore, you cannot use PROC REG to solve problems that have inequality constraints.

This article shows how to use PROC NLIN to solve linear regression problems whose regression coefficients are constrained by a linear inequality. Examples are B1 ≥ 3 or B1 + B2 ≥ 6.

PROC NLIN and constrained regression problems

Before solving a problem that has inequality constraints, let's see how PROC NLIN solves a problem that has linear equality constraints. The following statements use the data and model from the previous article. The model is
      Y = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X4 + ε
where B3 = B1 and B4 = –2*B2. In PROC NLIN, you can replace the B3 and B4 parameters, thus leaving only three parameters in the model. If desired, you can use the ESTIMATE statement to recover the value of B4, as follows:

/* You can solve the restricted problem by using PROC NLIN */
proc nlin data=RegSim noitprint;      /* use NLINMEASURES to estimate MSE */
   parameters b0=0 b1=3 b2=1;         /* specify names of parameters and initial guess */
   model Y = b0 + b1*x1 +   b2*x2 +
                  b1*x3 - 2*b2*x4;   /* replace B3 = B1 and B4 = -2*B2 */
   estimate 'b4' -2*b2;              /* estimate B4 */
run;

The parameter estimates are the same as those produced by PROC REG in the previous article.

The syntax for PROC NLIN is natural and straightforward. The PARAMETERS statement specifies the names of the parameters in the problem; the other symbols (X1-X4, Y) refer to data set variables. You must specify an initial guess for each parameter. For nonlinear regressions, this can be a challenge, but there are some tricks you can use to choose a good initial guess. For LINEAR regressions, the initial guess shouldn't matter: you can use all zeros or all ones, if you want.

Boundary constraints

Suppose that you want to force the regression coefficients to satisfy certain inequality constraints. You can use the BOUNDS statement in PROC NLIN to specify the constraints. The simplest type of constraint is to restrict a coefficient to a half-interval or interval. For example, the following call to PROC NLIN restricts B1 ≥ 3 and restricts B2 to the interval [0, 4].

/* INEQUALITY constraints on the regression parameters */
proc nlin data=RegSim;
   parameters b0=0 b1=3 b2=1;         /* initial guess */
   bounds b1 >= 3,
          0 <= b2 <= 4;
   model Y = b0 + b1*x1 + b2*x2 + b1*x3 - 2*b2*x4;
   ods select ParameterEstimates;
run;

The solution that minimizes the residual sum of squares (subject to the constraints) places the B1 parameter on the constraint B1=3. Therefore, a standard error is not available for that parameter estimate.

You can also use the BOUNDS statement to enforce simple relationships between parameters. For example, you can specify
     bounds b1 >= b2;
to specify that the B1 coefficient must be greater than or equal to the B2 parameter.
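
For example, the following sketch adds that BOUNDS statement to the earlier model:

/* sketch: require that B1 is greater than or equal to B2 */
proc nlin data=RegSim;
   parameters b0=0 b1=3 b2=1;   /* initial guess */
   bounds b1 >= b2;             /* simple relationship between parameters */
   model Y = b0 + b1*x1 + b2*x2 + b1*x3 - 2*b2*x4;
run;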

More general linear constraints

The BOUNDS statement is for simple relationships. For more complicated relationships, you might need to reparameterize the model. For example, the BOUNDS statement does not accept the following syntax:
     bounds b1 >= 2*b2; /* INVALID SYNTAX! */
However, you can reparameterize the problem by introducing a new parameter C = 2*B2. You can then systematically substitute (C/2) everywhere that B2 appears. For example, the following statements show the constrained model in terms of the new parameter, C. You can use the ESTIMATE statement to find the estimates for the original parameter.

/* let c = 2*b2 be the new parameter. Then substitute c/2 for b2 in the MODEL statement: */
proc nlin data=RegSim;
   parameters b0=0 b1=3 c=2;         /* initial guess */
   bounds b1 >= c;
   model Y = b0 + b1*x1 + (c/2)*x2 + b1*x3 - c*x4;
   estimate 'b2' c/2;                /* estimate original parameter */
run;

The output from this procedure is not shown.

You can use a similar trick to handle linear constraints such as B1 + B2 ≥ 6. Move the B2 term to the right side of the inequality and define C = 6 – B2. The constraint becomes B1 ≥ C and on the MODEL statement you can substitute (6–C) everywhere that B2 appears, as follows:

/* Define c = 6 - b2 so that b2=6-c */
proc nlin data=RegSim;
   parameters b0=0 b1=5 c=5;
   bounds b1 >= c;
   model Y = b0 + b1*x1 + (6-c)*x2 + b1*x3 - 2*(6-c)*x4;
   estimate 'b2' 6-c;              /* estimate original parameter */
   ods select ParameterEstimates AdditionalEstimates;
run;

Summary

In summary, you can use the NLIN procedure to solve linear regression problems that have linear constraints among the coefficients. Each equality constraint enables you to eliminate one parameter in the MODEL statement. You can use the BOUNDS statement to specify simple inequality constraints. For more complicated constraints, you can reparameterize the model and again use the BOUNDS statement to specify the constraints. If desired, you can use the ESTIMATE statement to estimate the parameters in the original model.

The post Regression with inequality constraints on parameters appeared first on The DO Loop.

7月 292020
 

When you write a program that simulates data from a statistical model, you should always check that the simulation code is correct. One way to do this is to generate a large simulated sample, estimate the parameters in the simulated data, and make sure that the estimates are close to the parameters in the simulation. For regression models that include CLASS variables, this task requires a bit of thought. The problem is that categorical variables can have different parameterizations, and the parameterization determines the parameter estimates.

Recently I read a published paper that included a simulation in SAS of data from a logistic model. I noticed that the parameters in the simulation code did not match the parameter estimates in the example. I was initially confused, although I figured it out eventually. This article describes the "parameterization problem" and shows how to ensure that the parameters in the simulation code match the parameter estimates for the analysis.

I have previously written about how to incorporate classification variables into simulations of regression data. If you haven't read that article, it is a good place to begin. You can also read about how to simulate data from a general linear regression model, such as a logistic regression model.

Simulated explanatory variables with a classification variable

Before demonstrating the problem, let's simulate the explanatory variables. Each simulation will use the same data and the same model but change the way that the model is parameterized. The following SAS DATA step simulates two normal variables and a categorical "treatment" variable that has three levels (Trt=1, 2, or 3). The design is balanced in the treatment variable, which means that each treatment group has the same number of observations.

data SimExplanatory;
call streaminit(54321);
do i = 1 to 2000;                               /* number in each treatment group */
   do Trt = 1 to 3;                             /* Trt = 1, 2, and 3 */
      x1 = rand('Normal');                      /* x1 ~ N(0,1) */
      x2 = rand('Normal');                      /* x2 ~ N(0,1) */
      LinCont = -1.5 + 0.4*x1 - 0.7*x2;         /* offset = -1.5; params are 0.4 and -0.7 for x1 and x2 */
      output;
   end;
end;
drop i;
run;

The data set contains a variable called LinCont. The LinCont variable is a linear combination of the x1 and x2 variables and a constant offset. All models will use the same offset (-1.5) and the same parameters for x1 and x2 (0.4 and -0.7, respectively).

Data for a logistic regression model with a CLASS variable

When writing a simulation of data that satisfy a regression model, most programmers assign coefficients for each level of the treatment variable. For example, you might use the coefficients {3, 2, 1}, which means that the group Trt=1 gets 3 additional units of response, the group Trt=2 gets 2 additional units, and the group Trt=3 gets only 1 additional unit. This is a valid way to specify the model, but when you compute the regression estimates, you will discover that they are not close to the parameters in the model. The following DATA step simulates logistic data and uses PROC LOGISTIC to fit a model to the simulated data.

data LogiSim_Attempt;
call streaminit(123);
set SimExplanatory;                      /* read the explanatory variables */
array TrtCoef[3] _temporary_ (3, 2, 1);  /* treatment effects */
/* a straightforward way to generate Y data from a logistic model */
eta = LinCont + TrtCoef[1]*(Trt=1) + TrtCoef[2]*(Trt=2) + TrtCoef[3]*(Trt=3);
p = logistic(eta);                       /* P(Y | explanatory vars) */
y = rand("Bernoulli", p);                /* random realization of Y */
drop eta p;
run;
 
title "Parameter Estimates Do Not Match Simulation Code";
ods select ParameterEstimates;
proc logistic data=LogiSim_Attempt plots=none;
   class Trt; 
   model y(event='1') = Trt x1 x2;
run;

Notice that the simulation uses the expression (Trt=1) and (Trt=2) and (Trt=3). These are binary expressions. For any value of Trt, one of the terms is unity and the others are zero.

Notice that the parameter estimates for the levels of the Trt variable do not match the parameters in the model. The model specified parameters {3, 2, 1}, but the estimates are close to {1, 0, ???}, where '???' is used to indicate that the ParameterEstimates table does not include a row for the Trt=3 level. Furthermore, the estimate for the Intercept is not close to the parameter value, which is -1.5.

I see many simulations like this. The problem is that the parameterization used by the statistical analysis procedure (here, PROC LOGISTIC) is different from the model's parameterization. The next section shows how to write a simulation so that the parameters in the simulation code and the parameter estimates agree.

The 'reference' parameterization

You can't estimate all three coefficients of the categorical variable. However, you can estimate the differences between the levels. For example, you can estimate the average difference between the Trt=1 group and the Trt=3 group. One parameterization of regression models is the "reference parameterization." In the reference parameterization, one of the treatment levels is used as a reference level and you estimate the difference between the non-reference effects and the reference effect. The Intercept parameter is the sum of the offset value and the reference effect. The following simulation code simulates data by using a reference parameterization:

data LogiSim_Ref;
call streaminit(123);
set SimExplanatory;
/* assume reference or GLM parameterization */
array TrtCoef[3] _temporary_ (3, 2, 1);  /* treatment effects */
RefLevel = TrtCoef[3];                   /* use last level as reference */
eta = LinCont + RefLevel                 /* constant term is (Intercept + RefLevel) */
              + (TrtCoef[1]-RefLevel)*(Trt=1)   /* param = +2 */
              + (TrtCoef[2]-RefLevel)*(Trt=2)   /* param = +1 */
              + (TrtCoef[3]-RefLevel)*(Trt=3);  /* param =  0 */
p = logistic(eta);                       /* P(Y | explanatory vars) */
y = rand("Bernoulli", p);                /* random realization of Y */
drop eta p;
run;
 
title "Reference Parameterization";
ods select ParameterEstimates;
proc logistic data=LogiSim_Ref plots=none;
   class Trt / param=reference ref=LAST; /* reference parameterization */
   model y(event='1') = Trt x1 x2;
run;

In this simulation and analysis, the constant term in the model is "Intercept + RefLevel," which is -0.5. The parameter for the Trt=1 effect is +2, which is "TrtCoef[1]-RefLevel." Similarly, the parameter for the Trt=2 effect is +1, and the parameter for Trt=3 is 0. Notice that the parameter estimates from PROC LOGISTIC agree with these values because I specified the PARAM=REFERENCE option, which tells the procedure to use the same parameterization as the simulation code.

Notice that this model is the same as the previous model because we are simply adding the RefLevel to the constant term and subtracting it from the Trt effect.

The 'effect' parameterization

In a similar way, instead of subtracting a reference level, you can subtract the mean of the coefficients for the treatment effects. In general, you need to compute a weighted mean, where the weights are the number of observations for each level, but for the case of balanced treatments, you can just use the ordinary arithmetic mean of the coefficients.
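
For a hypothetical unbalanced design, the weighted mean is easy to compute in a DATA step. The group sizes below are made up for illustration:

/* sketch: weighted mean of treatment coefficients for an unbalanced design */
data _null_;
   array coef[3] _temporary_ (3, 2, 1);      /* treatment effects */
   array n[3]    _temporary_ (100, 50, 50);  /* hypothetical group sizes */
   wMean = (coef[1]*n[1] + coef[2]*n[2] + coef[3]*n[3]) / (n[1]+n[2]+n[3]);
   put "Weighted mean of treatment effects: " wMean;   /* 2.25 for these values */
run;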

For these data, the mean of the treatment effects is (3+2+1)/3 = 2. The following simulation adds the mean to the constant term and subtracts it from the Trt effects. This is the "effect parameterization":

data LogiSim_Effect;
call streaminit(123);
set SimExplanatory;
array TrtCoef[3] _temporary_ (3, 2, 1);  /* treatment effects */
/* assume effect parameterization, which is equivalent to using deviations from the mean:
array TrtDev[3] _temporary_ (+1, 0, -1);
*/
 
MeanTrt = 2;                             /* for balanced design, mean=(3+2+1)/3 */
eta = LinCont + MeanTrt                  /* est intercept = (Intercept + MeanTrt) */
              + (TrtCoef[1]-MeanTrt)*(Trt=1)   /* param = +1 */
              + (TrtCoef[2]-MeanTrt)*(Trt=2)   /* param =  0 */
              + (TrtCoef[3]-MeanTrt)*(Trt=3);  /* param = -1 */
p = logistic(eta);                       /* P(Y | explanatory vars) */
y = rand("Bernoulli", p);                /* random realization of Y */
drop eta p;
run;
 
title "Effect Parameterization";
ods select ParameterEstimates;
proc logistic data=LogiSim_effect plots=none;
   class Trt / param=effect ref=LAST; 
   model y(event='1') = Trt x1 x2;
run;

In this simulation and analysis, the constant term in the model is "Intercept + MeanTrt," which is 0.5. The parameters for the Trt levels are {+1, 0, -1}. Notice that the PROC LOGISTIC call uses the PARAM=EFFECT option, which tells the procedure to use the same effect parameterization. Consequently, the parameter estimates are close to the parameter values. (These are the same parameter estimates as in the first example, since PROC LOGISTIC uses the effect parameterization by default.)

If you use 0 as an offset value, then the Intercept parameter equals the mean effect of the treatment variable, which leads to a nice interpretation when there is only one classification variable and no interactions.

Summary

This article shows how to write the simulation code (the equation of the linear predictor) so that it matches the parameterization used to analyze the data. This article uses PROC LOGISTIC, but you can use the same trick for other procedures and analyses. When the simulation code and the analysis use the same parameterization of classification effects, it is easier to feel confident that your simulation code correctly simulates data from the model that you are analyzing.

The post Simulate regression models that incorporate CLASS parameterizations appeared first on The DO Loop.