Regression

June 8, 2020
 

A SAS customer asked how to specify interaction effects between a classification variable and a spline effect in a SAS regression procedure. There are at least two ways to do this. If the SAS procedure supports the EFFECT statement, you can build the interaction term in the MODEL statement. For procedures that do not support the EFFECT statement, you can generate the interaction effects yourself. Because the customer's question was specifically about the GAMPL procedure, which fits generalized additive models, I also show a third technique, which enables you to model interactions in the GAMPL procedure.

To illustrate the problem, consider the following SAS DATA step, which simulates data from two different response functions:

data Have;
call streaminit(1);
Group="A"; 
do x = 1 to 10 by .1;
   y =    2*x + sin(x) + rand("Normal"); output;
end;
Group="B";
do x = 1 to 10 by .1;
   y = 10 - x - sin(x) + rand("Normal"); output;
end;
run;

The graph shows the data and overlays a regression model. The model predicts the response for each level of the Group variable. Notice that the two curves have different shapes, which means that there is an interaction between the Group variable and the x variable. The goal of this article is to show how to use a spline effect to predict the response for each group.

Use the grammar of the procedure to form interactions

Many SAS regression procedures support the EFFECT statement and the CLASS statement, and they enable you to specify interactions on the MODEL statement. Examples include the GLIMMIX, GLMSELECT, LOGISTIC, QUANTREG, and ROBUSTREG procedures. For example, the following call to PROC GLMSELECT specifies several model effects by using the "stars and bars" syntax:

  • The syntax Group | x includes the classification effect (Group), a linear effect (x), and an interaction effect (Group*x).
  • The syntax Group * spl includes an interaction effect between the classification variable and the spline.
title "GLMSELECT Model with Spline and Interaction Effects";
ods select none;
proc glmselect data=Have
        outdesign(addinputvars fullmodel)=SplineBasis; /* optional: write design matrix to data set */
   class Group;
   effect spl = spline(x);
   model y = Group|x  Group*spl / selection=none;
   output out=SplineOut predicted=p;            /* output predicted values */
quit;
ods select all;
 
%macro PlotInteraction(dsname);
   proc sgplot data=&dsname;
      scatter x=x y=y / group=Group;
      series x=x y=p / group=Group curvelabel;
      keylegend / location=inside position=E;
   run;
%mend;
%PlotInteraction(SplineOut);

The predicted curves are shown in the previous section. I wrapped the call to PROC SGPLOT in a macro so that I can create the same plot for other models.

Output the design matrix for the interactions

Some SAS procedures do not support the EFFECT statement. For these procedures, you cannot form effects like Group*spline(x) within the procedure. However, the GLMSELECT procedure supports the OUTDESIGN= option, which writes the design matrix to a SAS data set. I used the option in the previous section to create the SplineBasis data set. The data set includes variables for the spline basis and for the interaction between the Group variable and the spline basis. Therefore, you can use the splines and interactions with splines in other SAS procedures.

For example, the GLM procedure does not support the EFFECT statement, so you cannot generate a spline basis within the procedure. However, you can read the design matrix from the previous call to PROC GLMSELECT. The interaction effects are named spl_Group_1_A, spl_Group_1_B, spl_Group_2_A, ..., where the suffix "_i_j" indicates the interaction effect between the i_th spline basis and the j_th level of the Group variable. You can type the names of the design variables, or you can use the "colon notation" to match all variables that begin with the prefix "spl_Group". The following call to PROC GLM computes the same model as in the previous section:

proc glm data=SplineBasis;
   class Group;
   model y = Group|x  spl_Group: ;           /* spl_Group: => spl_Group_1_A,  spl_Group_1_B, ... */
   output out=GLMOut predicted=p;            /* output predicted values */
quit;
 
/* show that the predicted values are the same */
data Comp;
   merge SplineOut GLMOut(keep=p rename=(p=p2));
   diff = p - p2;
run;
proc means data=comp Mean Min Max;  var diff;  run;

The output of PROC MEANS shows that the two models are the same. There is no difference between the predicted values from PROC GLM (which reads the design matrix) and the values from PROC GLMSELECT (which reads the raw data).

The splines of the interactions versus the interactions of the splines

Some nonparametric regression procedures, such as the GAMPL procedure, have their own syntax to generate spline effects. In fact, PROC GAMPL uses thin-plate splines, which are different from the splines that are supported by the EFFECT statement. Recently, a SAS customer asked how he could model interaction terms in PROC GAMPL.

The GAMPL procedure supports semiparametric models. In the parametric portion of the model, it is easy to specify interactions by using the standard "stars and bars" notation in SAS. However, the SPLINE() transformation in PROC GAMPL does not support interactions between a classification variable and a continuous variable. To be specific, consider the following semiparametric model:

title "No Interaction Between Classification and Spline Effects";
proc gampl data=Have; *plots=components;
   class Group;
   model Y = param(Group | x)     /* parametric terms */
             spline(x);           /* nonparametric, but no interaction */
   output out=GamPLOut pred=p r=r;
   id Y X Group;
run;
 
%PlotInteraction(GamPLOut);

The output shows that the model does not capture the nonlinear features of the data. This is because the spline term is computed for all values of x, not separately for each group. When the groups are combined, the spline effect is essentially linear, which is probably not the best model for these data.

Unfortunately, in SAS/STAT 15.1, PROC GAMPL does not support a syntax such as "spline(Group*x)", which would enable the shape of the nonparametric curve to vary between levels of the Group variable. In fact, the 15.1 documentation states, "only continuous variables (not classification variables) can be specified in spline-effects."

However, if your goal is to use a flexible model to fit the data, there is a model that might work. Instead of the interaction between the Group variable and spline(x), you could use the spline effects for the interaction between the Group variable and x. These are not the same models: the interaction with the splines is not equal to the spline of the interactions! However, let's see how to fit this model to these data.

The idea is to form the interaction effect Group*x in a separate step. You can use the design matrix (SplineBasis) that was created in the first section, but to make the process as transparent as possible, I am going to manually generate the dummy variables for the interaction effect Group*x:

data Interact; 
set Have; 
x_Group_A = x*(Group='A'); 
x_Group_B = x*(Group='B');
run;

You can use spline effects for the variables x_Group_A and x_Group_B in regression models. The following model is not the same as the previous models, but it is a flexible model that seems to fit these data:

title "Interaction Between Classification and Spline Effects";
proc gampl data=Interact plots=components;
   class Group;
   model Y = param(Group | x) 
             spline(x_Group_A) spline(x_Group_B);  /* splines of interactions */
   output out=GamPLOut pred=p;
   id Y X Group;
run;
 
title "GAMPL with Interaction Effects";
%PlotInteraction(GamPLOut);

The graph shows that (visually) the model seems to capture the undulations in the data as well as the earlier model. Although PROC GAMPL does not currently support interactions between classification variables and spline effects, this trick provides a way to model the data.

The post Interactions with spline effects in regression models appeared first on The DO Loop.

May 13, 2020
 

This article shows how to find local minima and maxima on a regression curve, which means finding points where the slope of the curve is zero. An example appears at the right, which shows locations where the loess smoother in a scatter plot has local minima and maxima. Except for simple cases like quadratic regression, you need to use numerical techniques to locate these values.

In a previous article, I showed how to use SAS to evaluate the slope of a regression curve at specific points. The present article applies that technique by scoring the regression curve on a fine grid of points. You can use finite differences to approximate the slope of the curve at each point on the grid. You can then estimate the locations where the slope is zero.

In this article, I use the LOESS procedure to demonstrate the technique, but the method applies equally well to any one-dimensional regression curve. There are several ways to score a regression model:

  • Most parametric regression procedures in SAS (GLM, GLIMMIX, MIXED, ...) support the STORE statement. The STORE statement saves a representation of the model in a SAS item store. You can use PROC PLM to score a model from an item store.
  • Some nonparametric regression procedures in SAS do not support the STORE statement but support a SCORE statement. PROC ADAPTIVEREG and PROC LOESS are two examples. This article shows how to use the SCORE statement to find points where the regression curve has zero slope.
  • Some regression procedures in SAS do not support either the STORE or SCORE statements. For those procedures, you need to use the missing value trick to score the model.

The technique in this article will not detect stationary inflection points, which are locations where the curve has zero slope but does not have a local min or max. Consequently, this article is really about "how to find a point where a regression curve has a local extremum," but I will use the slightly inaccurate phrase "find points where the slope is zero."

How do you find locations where the slope of the curve is zero?

For convenience, I assume the explanatory variable is named X and the response variable is named Y. The goal is to find locations where a nonparametric curve (x, f(x)) has zero slope, where f(x) is the regression model. The general outline follows:

  1. Create a grid of points in the range of the explanatory variable. The grid does not have to be evenly spaced, but it is in this example.
  2. Score the model at the grid locations.
  3. Use finite differencing to approximate the slope of the regression curve at the grid points. If the slope changes sign between consecutive grid points, estimate the location between the grid points where the slope is exactly zero. Use linear interpolation to approximate the response at that location.
  4. Optionally, graph the original data, the regression curve, and the point along the curve where the slope is zero.

Example data

SAS distributes the ENSO data set in the SASHelp library. You can create a DATA step view that renames the explanatory and response variables to X and Y, respectively, so that it is easier to follow the logic of the program:

/* Create VIEW where x is the independent variable and y is the response */
data Have / view=Have;
set Sashelp.Enso(rename=(Month=x Pressure=y));
keep x y;
run;

Create a grid of points

After the data set is created, you can use PROC SQL to find the minimum and maximum values of the explanatory variable. You can create an evenly spaced grid of points for the range of the explanatory variable.

/* Put min and max into macro variables */
proc sql noprint;
  select min(x), max(x) into :min_x, :max_x 
  from Have;
quit;
 
/* Evaluate the model and estimate derivatives at these points */
data Grid;
dx = (&max_x - &min_x)/201;    /* choose the step size wisely */
do x = &min_x to &max_x by dx;   
   output;
end;
drop dx;
run;

Score the model at the grid locations

This is the step that will vary from procedure to procedure. You have to know how to use the procedure to score the regression model on the points in the Grid data set. The LOESS procedure supports a SCORE statement, so the following call fits the model and scores it on the Grid data set:

/* Score the model on the grid */
ods select none;    /* do not display the tables */
proc loess data=Have plots=none;
   model y = x;
   score data=Grid;  /* PROC LOESS does not support an OUT= option */
   /* Most procedures support an OUT= option to save the scored values.
      PROC LOESS displays the scored values in a table, so use ODS to
      save the table to an output data set */
   ods output ScoreResults=ScoreOut;
run;
ods select all;

If a procedure supports the STORE statement, you can use PROC PLM to score the model on the data. The SAS program that accompanies this article includes an example that uses the GAMPL procedure. The GAMPL procedure does not support the STORE or SCORE statements, but you can use the missing value trick to find zero derivatives.

Find the locations where the slope is zero

This is the mathematical portion of the computation. You can use a backward difference scheme to estimate the derivative (slope) of the curve. If (x0, y0) and (x1, y1) are two consecutive points along the curve (in the ScoreOut data set), then the slope at (x1, y1) is approximately m = (y1 - y0) / (x1 - x0). When the slope changes sign between consecutive points, it indicates that the slope changed from positive to negative (or vice versa) between the points. If the slope is continuous, it must have been exactly zero somewhere on the interval. You can use a linear approximation to find the point, t, where the slope is zero. You can then use linear interpolation to approximate the point (t, f(t)) at which the curve is a local min or max.
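
If the estimated slope is m0 at the grid point x0 and m1 at the next grid point x1, you can assume that the slope varies linearly on the interval:
m(t) ≈ m0 + (m1 - m0) * (t - x0) / (x1 - x0)
Setting m(t) = 0 and solving for t gives the estimated location of the extremum,
t0 = x0 - m0 * (x1 - x0) / (m1 - m0)
and linear interpolation of the curve gives the corresponding response,
f(t0) ≈ y0 + (y1 - y0) / (x1 - x0) * (t0 - x0).
These are the formulas that the following DATA step implements.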

You can use the following SAS DATA step to process the scoring data, approximate the slope, and estimate where the slope of the curve is zero:

/* Compute slope by using finite difference formula. */
data Deriv0;
set ScoreOut;
Slope = dif(p_y) / dif(x);      /* (f(x) - f(x-dx)) / dx */
xPrev = lag(x);  yPrev = lag(p_y);  SlopePrev = lag(Slope);
if n(SlopePrev) AND sign(SlopePrev) ^= sign(Slope) then do;
   /* The slope changes sign between this obs and the previous.
      Assuming linearity on the interval, find (t, f(t))
      where slope is exactly zero */
   t0 = xPrev - SlopePrev * (x - xPrev)/(Slope - SlopePrev); 
   /* use linear interpolation to find the corresponding y value:
      f(t) ~ y0 + (y1-y0)/(x1-x0) * (t - x0)       */
   f_t0 = yPrev + (p_y - yPrev)/(x - xPrev) * (t0 - xPrev);
   if sign(SlopePrev) > 0 then _Type_ = "Max";
   else _Type_ = "Min";
   output; 
end;
keep t0 f_t0 _Type_;
label f_t0 = "f(t0)";
run;
 
proc print data=Deriv0 label;
run;

The table shows that there are seven points at which the slope of the loess regression curve is zero; these are the local minima and maxima of the curve.

Graph the results

If you want to display the local extrema on the graph of the regression curve, you can merge the original data, the regression curve, and the local extrema. You can then use PROC SGPLOT to overlay the three layers. The resulting graph is shown at the top of this article.

data Combine;
merge Have                           /* data   : (x, y)     */
      ScoreOut(rename=(x=t p_y=p_t)) /* curve  : (t, p_t)   */
      Deriv0;                        /* extrema: (t0, f_t0) */
run;
 
title "Loess Smoother";
title2 "Red Markers Indicate Zero Slope for Smoother";
proc sgplot data=Combine noautolegend;
   scatter x=x y=y;
   series x=t y=p_t / lineattrs=GraphData2;
   scatter x=t0 y=f_t0 / markerattrs=(symbol=circlefilled color=red);
   yaxis grid; 
run;

Summary

In summary, if you can evaluate a regression curve on a grid of points, you can approximate the slope at each point along the curve. By looking for when the slope changes sign, you can find local minima and maxima. You can then use a simple linear estimator on the interval to estimate where the slope is exactly zero.

You can download the SAS program that performs the computations in this article.

The post Find points where a regression curve has zero slope appeared first on The DO Loop.

January 13, 2020
 

Did you add "learn something new" to your list of New Year's resolutions? Last week, I wrote about the most popular articles from The DO Loop in 2019. The most popular articles are about elementary topics in SAS programming or univariate statistics because those topics have broad appeal.

Advanced topics and multivariate statistics are less popular but no less important. If you want to learn something new, check out this "Editor's Choice" list of articles that will broaden your statistical knowledge and enhance your SAS programming skills. I've grouped the articles into three broad categories.

Regression statistics

Schematic diagram of outliers in bivariate normal data. The point 'A' has large univariate z scores but a small Mahalanobis distance. The point 'B' has a large Mahalanobis distance. Only 'B' is a multivariate outlier.

High-dimensional analyses and visualization:

Low-dimensional data visualization

The tips and techniques in these articles are useful, so read a few articles today and teach yourself something new in this New Year!

Do you have a favorite article from 2019 that did not make either list? Share it in a comment!

The post 10 posts from 2019 that deserve a second look appeared first on The DO Loop.

January 8, 2020
 

Many SAS procedures can automatically create a graph that overlays multiple prediction curves and their prediction limits. This graph (sometimes called a "fit plot" or a "sliced fit plot") is useful when you want to visualize a model in which a continuous response variable depends on one continuous explanatory variable and one categorical (classification) variable. You can use the EFFECTPLOT statement in PROC PLM to create similar visualizations of other kinds of regression models. This article shows three ways to create a sliced fit plot: direct from the procedure, by using the EFFECTPLOT statement in PROC PLM, and by writing the predictions to a data set and using PROC SGPLOT to graph the results.

The data for this example are from the PROC LOGISTIC documentation. The response variable is the presence or absence of pain in seniors who are split into two treatment groups (A and B) and a placebo group (P).

Data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   datalines;
P F 68  1 No  B M 74 16 No  P F 67 30 No  P M 66 26 Yes B F 67 28 No  B F 77 16 No
A F 71 12 No  B F 72 50 No  B F 76  9 Yes A M 71 17 Yes A F 63 27 No  A F 69 18 Yes
B F 66 12 No  A M 62 42 No  P F 64  1 Yes A F 64 17 No  P M 74  4 No  A F 72 25 No
P M 70  1 Yes B M 66 19 No  B M 59 29 No  A F 64 30 No  A M 70 28 No  A M 69  1 No
B F 78  1 No  P M 83  1 Yes B F 69 42 No  B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No  A F 69 12 No  B F 65 14 No  B M 70  1 No  B M 67 23 No  A M 76 25 Yes
P M 78 12 Yes B M 77  1 Yes B F 69 24 No  P M 66  4 Yes P F 65 29 No  P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No  P F 72 27 No  P F 70 13 Yes A M 75  6 Yes
B F 65  7 No  P F 68 27 Yes P M 68 11 Yes P M 67 17 Yes B M 70 22 No  A M 65 15 No
P F 67  1 Yes A M 67 10 No  P F 72 11 Yes A F 74  1 No  B M 80 21 Yes A F 69  3 No
;

Automatically create a sliced fit plot

Many SAS regression procedures support the PLOTS= option on the PROC statement. For PROC LOGISTIC, the option that creates a sliced fit plot is the PLOTS=EFFECTPLOT option, and you can add prediction limits to the graph by using the CLBAND suboption, as follows:

proc logistic data=Neuralgia alpha=0.2 plots(only)=effectplot(clband);
   class Treatment;
   model Pain(Event='Yes')= Treatment Age;
run;

That was easy! The procedure automatically creates a title, legend, axis labels, and so forth. By using the PLOTS= option, you get a very nice plot that shows the predictions and prediction limits for the model.

Create a sliced fit plot by using PROC PLM

One of the nice things about the STORE statement in SAS regression procedures is that it enables you to create graphs and perform other post-fitting analyses without rerunning the procedure. Maybe you intend to examine many models before deciding on the best model. You can run goodness-of-fit statistics for the models and then use PROC PLM to create a sliced fit plot for only the final model. To do this, use the STORE statement in the regression procedure and then "restore" the model in PROC PLM, which can perform several post-fitting analyses, including creating a sliced fit plot, as follows:

proc logistic data=Neuralgia alpha=0.2 noprint;
   class Treatment;
   model Pain(Event='Yes')= Treatment Age;
   store PainModel / label='Neuralgia Study';
run;
 
proc plm restore=PainModel noinfo;
   effectplot slicefit(x=Age sliceby=Treatment) / clm;
run;

The sliced fit plot is identical to the one that is produced by PROC LOGISTIC and is not shown.

Create a sliced fit plot manually

For many situations, the statistical graphics that are automatically produced are adequate. However, at times you might want to customize the graph by changing the title, the placement of the legend, the colors, and so forth. Sometimes companies mandate color-schemes and fonts that every report must use. For this purpose, SAS supports ODS styles and templates, which you can use to permanently change the output of SAS procedures. However, in many situations, you just want to make a small one-time modification. In that situation, it is usually simplest to write the predictions to a SAS data set and then use PROC SGPLOT to create the graph.

It is not hard to create a sliced fit plot. For these data, you can perform three steps:

  1. Write the predicted values and upper/lower prediction limits to a SAS data set.
  2. Sort the data by the classification variable and by the continuous variable.
  3. Use the BAND statement with the TRANSPARENCY= option to plot the confidence bands. Use the SERIES statement to plot the predicted values.

You can use the full power of PROC SGPLOT to modify the plot. For example, the following statements label the curves, move the legend, and change the title and Y axis label:

proc logistic data=Neuralgia alpha=0.2 noprint;
   class Treatment;
   model Pain(Event='Yes')= Treatment Age;
   /* 1. Use a procedure or DATA step to write Pred, Lower, and Upper limits */
   output out=LogiOut pred=Pred lower=Lower upper=Upper;
run;
 
/* 2. Be sure to SORT! */
proc sort data=LogiOut;
   by Treatment Age;
run;
 
/* 3. Use a BAND statement. If more than one band, use transparency */
title "Predicted Probabilities with 80% Confidence Limits";
title2 "Three Treatment Groups";
proc sgplot data=LogiOut;
   band x=Age lower=Lower upper=Upper / group=Treatment transparency=0.75 name="L";
   series x=Age y=Pred / group=Treatment curvelabel;
   xaxis grid;
   yaxis grid label="Predicted Probability of Pain" max=1;
   keylegend "L" / location=inside position=NW title="Treatment" across=1 opaque;
run;

The graph is shown at the top of this article. It has a customized title, label, and legend. In addition, the curves on this graph differ from the curves on the previous graph. The OUTPUT statement evaluates the model at the observed values of Age and Treatment. Notice that the A and B treatment groups do not have any patients who are over the age of 80; therefore, those prediction curves do not extend to the right-hand side of the graph.

In summary, there are three ways to visualize predictions and confidence bands for a regression model in SAS. This example used PROC LOGISTIC, but many other regression procedures support similar options. In most SAS/STAT procedures, you can use the PLOTS= option to obtain a fit plot or a sliced fit plot. More than a dozen procedures support the STORE statement, which enables you to use PROC PLM to create the visualization. Lastly, all regression procedures support some way to output predicted values to a SAS data set. You can sort the data, then use the BAND statement (with transparency) and the SERIES statement to create the sliced fit plot.

The post 3 ways to add confidence limits to regression curves in SAS appeared first on The DO Loop.

November 20, 2019
 

In a linear regression model, the predicted values are on the same scale as the response variable. You can plot the observed and predicted responses to visualize how well the model agrees with the data. However, for generalized linear models, there is a potential source of confusion. Recall that a generalized linear model (GLIM) has two components: a linear predictor, which is a linear combination of the explanatory variables, and a transformation (called the inverse link function) that maps the linear predictor onto the scale of the data. Consequently, SAS regression procedures support two types of predicted values and prediction limits. In the SAS documentation, the first type is called "predictions on the linear scale" whereas the second type is called "predictions on the data scale."

For many SAS procedures, the default is to compute predicted values on the linear scale. However, for GLIMs that model nonnormal response variables, it is more intuitive to predict on the data scale. The ILINK option, which is shorthand for "apply the inverse link transformation," converts the predicted values to the data scale. This article shows how the ILINK option works by providing an example for a logistic regression model, which is the most familiar generalized linear model.

Review of generalized linear models

The SAS documentation provides an overview of GLIMs and link functions. The documentation for PROC GENMOD provides a list of link functions for common regression models, including logistic regression, Poisson regression, and negative binomial regression.

Briefly, the linear predictor is
η = X*β
where X is the design matrix and β is the vector of regression coefficients. The link function (g) is a monotonic function that relates the linear predictor to the conditional mean of the response. Sometimes the symbol μ is used to denote the conditional mean of the response (μ = E[Y|x]), which leads to the formula
g(μ) = X*β

In SAS, you will often see options and variable names (in output data sets) that contain the substring 'XBETA'. When you see 'XBETA', it indicates that the statistic or variable is related to the LINEAR predictor. Because the link function, g, is monotonic, it has an inverse, g^{-1}. For generalized linear models, the inverse link function maps linear-scale predictions to data-scale predictions: if η = x β is a predicted value on the linear scale, then g^{-1}(η) is the predicted value for x on the data scale.

When the response variable is binary, the GLIM is the logistic model. If you use the convention that Y=1 indicates an event and Y=0 indicates the absence of an event, then the "data scale" is [0, 1] and the GLIM predicts the probability that the event occurs. For the logistic GLIM, the link function is the logit function:
g(μ) = logit(μ) = log( μ / (1 - μ) )
The inverse of the logit function is called the logistic function:
g^{-1}(η) = logistic(η) = 1 / (1 + exp(-η))

To demonstrate the ILINK option, the next sections perform the following tasks:

  1. Use PROC LOGISTIC to fit a logistic model to data. You can use the STORE statement to store the model to an item store.
  2. Use the SCORE statement in PROC PLM to score new data. This example scores data by using the ILINK option.
  3. Score the data again, but this time do not use the ILINK option. Apply the logistic transformation to the linear estimates to demonstrate the relationship between the linear scale and the data scale.

The transformation between the linear scale and the data scale is illustrated by the following graph:

Fit a logistic model

The following data are from the documentation for PROC LOGISTIC. The model predicts the probability of Pain="Yes" (the event) for patients in a study, based on the patients' sex, age, and treatment method ('A', 'B', or 'P'). The STORE statement in PROC LOGISTIC creates an item store that can be used for many purposes, including scoring new observations.

data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   DROP Duration;
   datalines;
P F 68  1 No  B M 74 16 No  P F 67 30 No  P M 66 26 Yes B F 67 28 No  B F 77 16 No
A F 71 12 No  B F 72 50 No  B F 76  9 Yes A M 71 17 Yes A F 63 27 No  A F 69 18 Yes
B F 66 12 No  A M 62 42 No  P F 64  1 Yes A F 64 17 No  P M 74  4 No  A F 72 25 No
P M 70  1 Yes B M 66 19 No  B M 59 29 No  A F 64 30 No  A M 70 28 No  A M 69  1 No
B F 78  1 No  P M 83  1 Yes B F 69 42 No  B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No  A F 69 12 No  B F 65 14 No  B M 70  1 No  B M 67 23 No  A M 76 25 Yes
P M 78 12 Yes B M 77  1 Yes B F 69 24 No  P M 66  4 Yes P F 65 29 No  P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No  P F 72 27 No  P F 70 13 Yes A M 75  6 Yes
B F 65  7 No  P F 68 27 Yes P M 68 11 Yes P M 67 17 Yes B M 70 22 No  A M 65 15 No
P F 67  1 Yes A M 67 10 No  P F 72 11 Yes A F 74  1 No  B M 80 21 Yes A F 69  3 No
;
 
title 'Logistic Model on Neuralgia';
proc logistic data=Neuralgia;
   class Sex Treatment;
   model Pain(Event='Yes')= Sex Age Treatment;
   store PainModel / label='Neuralgia Study';  /* store model for post-fit analysis */
run;

Score new data by using the ILINK option

There are many reasons to use PROC PLM, but an important purpose of PROC PLM is to score new observations. Given information about new patients, you can use PROC PLM to predict the probability of pain if these patients are given a specific treatment. The following DATA step defines the characteristics of four patients who will receive Treatment B. The call to PROC PLM scores the new data and uses the ILINK option to predict the probability of pain:

/* Use PLM to score new patients */
data NewPatients;
   input Treatment $ Sex $ Age Duration;
   DROP Duration;
   datalines;
B M 67 15 
B F 73  5 
B M 74 12 
B F 79 16 
;
 
/* predictions on the DATA scale */
proc plm restore=PainModel noprint;
   score data=NewPatients out=ScoreILink
         predicted lclm uclm / ilink; /* ILINK gives probabilities */
run;
 
proc print data=ScoreILink; run;

The Predicted column contains probabilities in the interval [0, 1]. The 95% prediction limits for the predictions are given by the LCLM and UCLM columns. For example, the prediction interval for the 67-year-old man is approximately [0.03, 0.48].

These values and intervals are transformations of analogous quantities on the linear scale. The logit transformation maps the predicted probabilities to the linear estimates. The inverse logit (logistic) transformation maps the linear estimates to the predicted probabilities.

Linear estimates and the logistic transformation

The linear scale is important because effects are additive on this scale. If you are testing the difference of means between groups, the tests are performed on the linear scale. For example, the ESTIMATE, LSMEANS, and LSMESTIMATE statements in SAS perform hypothesis testing on the linear estimates. Each of these statements supports the ILINK option, which enables you to display predicted values on the data scale.
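
For example, here is a minimal sketch that uses the LSMEANS statement in PROC PLM. (It refits and stores the model with PARAM=GLM in the CLASS statement because the LSMEANS statement requires GLM parameterization; PainGLM is an arbitrary item-store name.)

proc logistic data=Neuralgia noprint;
   class Sex Treatment / param=glm;
   model Pain(Event='Yes') = Sex Age Treatment;
   store PainGLM;                        /* item store that uses GLM parameterization */
run;
 
proc plm restore=PainGLM;
   lsmeans Treatment / diff cl ilink;    /* differences are tested on the logit scale; ILINK also reports the means as probabilities */
run;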

To demonstrate the connection between the predicted values on the linear and data scale, the following call to PROC PLM scores the same data according to the same model. However, this time the ILINK option is omitted, so the predictions are on the linear scale.

/* predictions on the LINEAR scale */
proc plm restore=PainModel noprint;
   score data=NewPatients out=ScoreXBeta(
         rename=(Predicted=XBeta LCLM=LowerXB UCLM=UpperXB))
         predicted lclm uclm;         /* ILINK not used, so linear predictor */
run;
 
proc print data=ScoreXBeta; run;

I have renamed the variables that PROC PLM creates for the estimates on the linear scale. The XBeta column shows the predicted values. The LowerXB and UpperXB columns show the prediction interval for each patient. The XBeta column shows the values you would obtain if you use the parameter estimates table from PROC LOGISTIC and apply those estimates to the observations in the NewPatients data.

To demonstrate that the linear estimates are related to the estimates in the previous section, the following SAS DATA step uses the logistic (inverse logit) transformation to convert the linear estimates onto the predicted probabilities:

/* Use the logistic (inverse logit) to transform the linear estimates
     (XBeta) into probability estimates in [0,1], which is the data scale.
   You can use the logit transformation to go the other way. */
data LinearToData;
   set ScoreXBeta;   /* predictions on linear scale */
   PredProb   = logistic(XBeta);
   LCLProb    = logistic(LowerXB);
   UCLProb    = logistic(UpperXB); 
run;
 
proc print data=LinearToData;
   var Treatment Sex Age PredProb LCLProb UCLProb;
run;

The transformation of the linear estimates gives the same values as the estimates that were obtained by using the ILINK option in the previous section.

Summary

In summary, there are two different scales for predicting values for a generalized linear model. When you report predicted values, it is important to specify the scale you are using. The data scale makes intuitive sense because it is the same scale as the response variable. You can use the ILINK option in SAS to get predicted values and prediction intervals on the data scale.

The post Predicted values in generalized linear models: The ILINK option in SAS appeared first on The DO Loop.

June 26, 2019
 

SAS/STAT software contains a number of so-called HP procedures for training and evaluating predictive models. ("HP" stands for "high performance.") A popular HP procedure is HPLOGISTIC, which enables you to fit logistic models on Big Data. A goal of the HP procedures is to fit models quickly. Inferential statistics such as standard errors, hypothesis tests, and p-values are less important for Big Data because, if the sample is big enough, all effects are significant and all hypothesis tests are rejected! Accordingly, many of the HP procedures do not support the same statistical tests as their non-HP cousins (for example, PROC LOGISTIC).

A SAS programmer recently posted an interesting question on the SAS Support Community. She is using PROC HPLOGISTIC for variable selection on a large data set. She obtained a final model, but she also wanted to estimate the covariance matrix of the parameters. PROC HPLOGISTIC does not support that statistic, but PROC LOGISTIC does (the COVB option on the MODEL statement). She asked whether it is possible to get the covariance matrix for the final model from PROC LOGISTIC, given that PROC HPLOGISTIC has already fit the model. Furthermore, PROC LOGISTIC supports computing and graphing odds ratios, so is it possible to get those statistics, too?

It is an intriguing question. The answer is "yes," although PROC LOGISTIC still has to perform some work. The main idea is that you can tell PROC LOGISTIC to use the parameter estimates found by PROC HPLOGISTIC.

What portion of a logistic regression takes the most time?

The main computational burden in logistic regression is threefold:

  • Reading the data and levelizing the CLASS variables in order to form a design matrix. This is done once.
  • Forming the sum of squares and crossproducts matrix (SSCP, also called the X`X matrix). This is done once.
  • Maximizing the likelihood function. The parameter estimates require running a nonlinear optimization. During the optimization, you need to evaluate the likelihood function, its gradient, and its Hessian many times. This is an expensive operation, and it must be done for each iteration of the optimization method.

The amount of time required by each step depends on the number of observations, the number of effects in the model, the number of levels in the classification variables, and the number of computational threads available on your system. You can use the Timing table in the PROC HPLOGISTIC output to determine how much time is spent during each step for your data/model combination. For example, the following results are for a simulated data set that has 500,000 observations, 30 continuous variables, four CLASS variables (each with 3 levels), and a main-effects model. It is running on a desktop PC with four cores. The iteration history is not shown, but this model converged in six iterations, which is relatively fast.

ods select PerformanceInfo IterHistory Timing;
proc hplogistic data=simLogi OUTEST;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass;
   ods output ParameterEstimates=PE;
   performance details;
run;

For these data and model, the Timing table tells you that the HPLOGISTIC procedure fit the model in 3.68 seconds. Of that time, about 17% was spent reading the data, levelizing the CLASS variables, and accumulating the SSCP matrix. The bulk of the time was spent evaluating the loglikelihood function and its derivatives during the six iterations of the optimization process.

Reduce time in an optimization by improving the initial guess

If you want to improve the performance of the model fitting, about the only thing you can control is the number of iterations required to fit the model. Both HPLOGISTIC and PROC LOGISTIC support an INEST= option that enables you to provide an initial guess for the parameter estimates. If the guess is close to the final estimates, the optimization method will require fewer iterations to converge.

Here's the key idea: If you provide the parameter estimates from a previous run, then you don't need to run the optimization algorithm at all! You still need to read the data, form the SSCP matrix, and evaluate the loglikelihood function at the (final) parameter estimates. But you don't need any iterations of the optimization method because you know that the estimates are optimal.

How much time can you save by using this trick? Let's find out. Notice that the previous call used the ODS OUTPUT statement to output the final parameter estimates. (It also used the OUTEST option to add the ParmName variable to the output.) You can run PROC TRANSPOSE to convert the ParameterEstimates table into a data set that can be read by the INEST= option on PROC HPLOGISTIC or PROC LOGISTIC. For either procedure, set the option MAXITER=0 to bypass the optimization process, as follows:

/* create INEST= data set from ParameterEstimates */
proc transpose data=PE out=inest(type=EST) label=_TYPE_;
   label Estimate=PARMS;
   var Estimate;
   id ParmName;
run;
 
ods select PerformanceInfo IterHistory Timing;
proc hplogistic data=simLogi INEST=inest MAXITER=0;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass;
   performance details;
run;

The time required to read the data, levelize, and form the SSCP is unchanged. However, the time spent evaluating the loglikelihood and its derivatives is about 1/(1+NumIters) of the time from the first run, where NumIters is the number of optimization iterations (6, for these data). The second call to the HPLOGISTIC procedure still has to evaluate the loglikelihood function to form statistics like the AIC and BIC. It needs to evaluate the Hessian to compute standard errors of the estimates. But the total time is greatly decreased by skipping the optimization step.

Use PROC HPLOGISTIC estimates to jump-start PROC LOGISTIC

You can use the same trick with PROC LOGISTIC. You can use the INEST= and MAXITER=0 options to greatly reduce the total time for the procedure. At the same time, you can request statistics such as COVB and ODDSRATIO that are not available in PROC HPLOGISTIC, as shown in the following example:

proc logistic data=simLogi INEST=inest 
   plots(only MAXPOINTS=NONE)=oddsratio(range=clip);
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass / MAXITER=0 COVB;
   oddsratio c1;
run;

The PROC LOGISTIC step takes about 4.5 seconds. It produces odds ratios and plots for the model effects and displays the covariance matrix of the betas (COVB). By using the parameter estimates that were obtained by PROC HPLOGISTIC, it was able to avoid the expensive optimization iterations.

You can also use the STORE statement in PROC LOGISTIC to save the model to an item store. You can then use PROC PLM to create additional graphs and to run additional post-fit analyses.
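
For example, a minimal sketch of that workflow (the item-store name LogiModel and the choice of effect plot are arbitrary):

proc logistic data=simLogi INEST=inest noprint;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass / MAXITER=0;
   store LogiModel;                          /* save the fitted model to an item store */
run;
 
proc plm restore=LogiModel;
   effectplot slicefit(x=x1 sliceby=c1) / clm;   /* one example of a post-fit graph */
run;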

In summary, PROC LOGISTIC can compute statistics and hypothesis tests that are not available in PROC HPLOGISTIC. It is possible to fit a model by using PROC HPLOGISTIC and then use the INEST= and MAXITER=0 options to pass the parameter estimates to PROC LOGISTIC. This enables PROC LOGISTIC to skip the optimization iterations, which saves substantial computational time.

You can download the SAS program that creates the tables and graphs in this article. The program contains a DATA step that simulates logistic data of any size.

The post Jump-start PROC LOGISTIC by using parameter estimates from PROC HPLOGISTIC appeared first on The DO Loop.

October 27, 2011
 
The background is from An Introduction to Generalized Linear Models by A. J. Dobson. It is described below:

The data are from an experiment to promote the recovery of stroke patients. There were three experimental groups:

A was a new occupational therapy intervention;
B was the existing stroke rehabilitation program conducted in the same hospital where A was conducted;
C was the usual care regime for stroke patients provided in a different hospital.

There were eight patients in each experimental group. The response variable was a measure of functional ability, the Barthel index; higher scores correspond to better outcomes, and the maximum score is 100. Each patient was assessed weekly over the eight weeks of the study.

data table11_1;
  input Subject    Group $     week1-week8;
datalines;
1     A     45    45    45    45    80    80    80    90
2     A     20    25    25    25    30    35    30    50
3     A     50    50    55    70    70    75    90    90
4     A     25    25    35    40    60    60    70    80
5     A     100   100   100   100   100   100   100   100
6     A     20    20    30    50    50    60    85    95
7     A     30    35    35    40    50    60    75    85
8     A     30    35    45    50    55    65    65    70
9     B     40    55    60    70    80    85    90    90
10    B     65    65    70    70    80    80    80    80
11    B     30    30    40    45    65    85    85    85
12    B     25    35    35    35    40    45    45    45
13    B     45    45    80    80    80    80    80    80
14    B     15    15    10    10    10    20    20    20
15    B     35    35    35    45    45    45    50    50
16    B     40    40    40    55    55    55    60    65
17    C     20    20    30    30    30    30    30    30
18    C     35    35    35    40    40    40    40    40
19    C     35    35    35    40    40    40    45    45
20    C     45    65    65    65    80    85    95    100
21    C     45    65    70    90    90    95    95    100
22    C     25    30    30    35    40    40    40    40
23    C     25    25    30    30    30    30    35    40
24    C     15    35    35    35    40    50    65    65
;
run;

Here a linear model is used to study the relation between Group and Time. The model used is
                                                Y_{i,j,k} = \alpha_{i} + \beta_{i} * t_{k} + e_{i,j,k}
where Y_{i,j,k} is the score at time t_{k} (k=1 to 8) for patient j (j=1 to 8) in group i (i=1 for Group A, 2 for B, and 3 for C).

For linear regression convenience, we need to rearrange the data from wide to long:

data table11_1long;
      set table11_1;
      array week(8) week1-week8;
      do time=1 to 8;
            score=week(time);
            output;
      end;
      drop week1-week8;
run;

Also, we need to convert the categorical variable to dummy variables:

data table11_3a;
      set table11_1long;
      if group='B' then do; a2=1; a3=0; end;
      if group='C' then do; a2=0; a3=1; end;
      if group='A' then do; a2=0; a3=0; end;
run;

Without considering the dependence among the scores at different times, we first assume they are independent. Considering the interaction of Time and Group, the model can be expressed as:


proc glm data=table11_3a;
      model score = time a2 a3 time*a2 time*a3 / solution ss3;
run;
quit;

The above model considers the interaction of Group and Time. It is a saturated model, and it is the same as regressing Score on Time separately for each Group. So, it can be equally achieved by:


proc reg data=table11_1long;
      by group;
      model score = time;
run; 

Sometimes an exploratory analysis is required, also called data reduction or data summary. It consists of summarizing the response profiles for each subject by a small number of descriptive statistics, based on the assumption that measurements on the same subject are independent. Here the intercept and slope are used as the summary statistics:

proc reg data = table11_1long;
      by subject;
      model score = time;
      ods output parameterestimates = table11_4 (keep = subject variable estimate stderr);
run;
quit;

proc tabulate data=table11_4;
      var estimate stderr;
      class variable subject / order=data;
      table subject='Subject'*sum='', (variable='')*(estimate stderr)*f=6.3 / rts=20 row=float;
run;

Then we can get the following summary table:

------------------------------------------------
|                  |  Intercept  |    time     |
|                  |-------------+-------------|
|                  |Param-|      |Param-|      |
|                  | eter |Stand-| eter |Stand-|
|                  |Estim-| ard  |Estim-| ard  |
|                  | ate  |Error | ate  |Error |
|------------------+------+------+------+------|
|Subject           |      |      |      |      |
|------------------|      |      |      |      |
|1                 |30.000| 7.289| 7.500| 1.443|
|------------------+------+------+------+------|
|2                 |15.536| 4.099| 3.214| 0.812|
|------------------+------+------+------+------|
|3                 |39.821| 3.209| 6.429| 0.636|
|------------------+------+------+------+------|
|4                 |11.607| 3.387| 8.393| 0.671|
|------------------+------+------+------+------|
|5                 |100.00| 0.000| 0.000| 0.000|
|------------------+------+------+------+------|
|6                 | 0.893| 5.304|11.190| 1.050|
|------------------+------+------+------+------|
|7                 |15.357| 4.669| 7.976| 0.925|
|------------------+------+------+------+------|
|8                 |25.357| 1.971| 5.893| 0.390|
|------------------+------+------+------+------|
|9                 |38.571| 3.522| 7.262| 0.698|
|------------------+------+------+------+------|
|10                |61.964| 2.236| 2.619| 0.443|
|------------------+------+------+------+------|
|11                |14.464| 5.893| 9.702| 1.167|
|------------------+------+------+------+------|
|12                |26.071| 2.147| 2.679| 0.425|
|------------------+------+------+------+------|
|13                |48.750| 8.927| 5.000| 1.768|
|------------------+------+------+------+------|
|14                |10.179| 3.209| 1.071| 0.636|
|------------------+------+------+------+------|
|15                |31.250| 1.948| 2.500| 0.386|
|------------------+------+------+------+------|
|16                |34.107| 2.809| 3.810| 0.556|
|------------------+------+------+------+------|
|17                |21.071| 2.551| 1.429| 0.505|
|------------------+------+------+------+------|
|18                |34.107| 1.164| 0.893| 0.231|
|------------------+------+------+------+------|
|19                |32.143| 1.164| 1.607| 0.231|
|------------------+------+------+------+------|
|20                |42.321| 3.698| 7.262| 0.732|
|------------------+------+------+------+------|
|21                |48.571| 6.140| 7.262| 1.216|
|------------------+------+------+------+------|
|22                |24.821| 1.885| 2.262| 0.373|
|------------------+------+------+------+------|
|23                |22.321| 1.709| 1.845| 0.338|
|------------------+------+------+------+------|
|24                |13.036| 4.492| 6.548| 0.890|
------------------------------------------------

The last part is to do an analysis of variance for the intercepts and slopes. The data set table11_4 contains only four variables: subject, variable, estimate, and stderr, whereas table11_1long contains the variables time and group. We need to combine these two data sets:



proc sql;
      create table table11_5 as
      select distinct table11_4.*, table11_1long.group
      from  table11_4, table11_1long
      where table11_4.subject = table11_1long.subject;
quit;

Then the analysis of Intercepts can be done by:

proc genmod data=table11_5;
      class group (ref='A' param=ref);
      where variable='Intercept';
      model estimate = group ;
run;
quit;

The analysis of Slope can be done by: 

proc genmod data=table11_5;
      class group (ref='A' param=ref);
      where variable='time';
      model estimate = group   ;
run;
quit;

Here we use PROC GENMOD, which allows us to use categorical variables directly and lets us choose the reference level. We don't use PROC GLM because it provides no choice of reference level in the regression.

In the book, the author uses PROC REG. If we use PROC REG, we need dummy variables rather than the categorical variable. So, first we combine table11_4 with table11_3a:

proc sql;
      create table table11_5 as
      select distinct table11_4.*, x.a2 as a2, x.a3 as a3
      from table11_4, table11_3a as x
      where table11_4.subject = x.subject;
quit;

proc reg data=table11_5;
      where variable='Intercept';
      model estimate = a2 a3;
run;
quit;

proc reg data=table11_5;
      where variable='time';
      model estimate = a2 a3;
run;
quit;

It gives the same result as PROC GENMOD.


As we know, the independence assumption used here for the repeated measures is usually not appropriate. A more suitable model for these data should account for the dependence among the scores at the eight time points. A Generalized Estimating Equations (GEE) model would be more reasonable, as sketched below.
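
A minimal sketch of such a GEE analysis with PROC GENMOD (the exchangeable working correlation structure is just one plausible choice, not the book's analysis) might look like this:

proc genmod data=table11_1long;
   class subject group;
   model score = group time group*time;          /* same mean model as before */
   repeated subject=subject / type=exch corrw;   /* GEE with exchangeable working correlation */
run;
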
October 6, 2011
 
Linear Regression with Categorical Predictors and Their Interactions

The data set we use is elemapi2; the variable mealcat is the percentage of free meals grouped into 3 categories (mealcat=1, 2, 3), and collcat is another variable with three categories. They are both categorical variables.


1: Regression with only one categorical predictor

Since mealcat is categorical, we convert it to dummy variables for linear regression, as sketched below. That is, mealcat1=1 if mealcat=1, else mealcat1=0; mealcat2=1 if mealcat=2, else mealcat2=0; mealcat3=1 if mealcat=3, else mealcat3=0.
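
A minimal sketch of that step (assuming the source data set is sasreg.elemapi2; a fuller version of elemdum, including the interaction dummies, is created in section 3):

data elemdum;
   set sasreg.elemapi2 (keep=api00 mealcat collcat);
   mealcat1 = (mealcat=1);   /* dummy variables for mealcat */
   mealcat2 = (mealcat=2);
   mealcat3 = (mealcat=3);
   collcat1 = (collcat=1);   /* dummy variables for collcat, used in section 2 */
   collcat2 = (collcat=2);
   collcat3 = (collcat=3);
run;
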
The regression of api00 on mealcat2 and mealcat3 is (mealcat1 is the reference):
proc reg data=elemdum;
       model api00 = mealcat2  mealcat3;
       plot predicted.*mealcat;
run;
The output is:

The intercept is 805.72. Since mealcat=1 is the reference group:
mean of api00 at mealcat=1 is 805.72;
the coefficient for mealcat2 is the difference between mealcat=2 and mealcat=1;
Mean of api00 at mealcat=2 is 805.72-166.32=639.40;
The coefficient for mealcat3 is the difference between mealcat=3 and mealcat=1;
Mean of api00 at mealcat=3 is 805.72-301.34=504.38.

The result can be verified by:
proc means data=elemdum;
       var api00;
       class mealcat;
run;

mealcat          1        2        3
mean of api00    805.72   639.40   504.38

It is the same as the ANOVA result.

Since in PROC GLM the last category is automatically chosen as the reference, it shows the same information as above.


2: Regression with two Categorical Variables

1)      Here we consider two categorical variables, mealcat and collcat. The regression is:

proc reg data=elemdum;
       model api00 = mealcat1  mealcat2 collcat1 collcat2 ;
run;
quit;



Look at the following graph:
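
(The original graph is not reproduced here; based on the description below, the cells are arranged as follows, with cell 9 as the reference cell.)

             mealcat=1   mealcat=2   mealcat=3
collcat=1     cell 1      cell 2      cell 3
collcat=2     cell 4      cell 5      cell 6
collcat=3     cell 7      cell 8      cell 9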

With respect to mealcat, mealcat=3 is the reference; with respect to collcat, collcat=3 is the reference. So cell 9 is the reference cell, and the intercept is the predicted value for this cell.

The coefficient for mealcat2 is the difference between cell 8 and cell 9; it's also the difference between cell 5 and cell 6, and the difference between cell 2 and cell 3; in the same way we can explain mealcat1.
The coefficient for collcat2 is the difference between cell 6 and cell 9; it’s also the difference between cell 5 and cell 8, and the difference between cell 4 and cell 7; in the same way we can explain collcat1.

But if you calculate the mean for each cell, it will be different from the predicted value here. Why? Because here we only have main effects, the model imposes the restriction that the differences for certain pairs of cells are equal, as stated above.

2)      Next we can use anova to get the same result:

proc glm data=elemdum;
       class mealcat collcat;
       model api00 = mealcat collcat / solution ss3;
run;


The result is the same as we use dummy variables.


3: Regression with Interactions of Categorical Variables

1)      As is shown above, it’s assumed that the differences of some cells are the same. To remove this restriction, now we consider interactions of the two categorical variables.
First we need to create the dummy interactions, that is, the interaction of variable mealcat and collcat. SAS code is:
data elemdum;
       set sasreg.elemapi2 (keep=api00 some_col mealcat collcat);
       array mealdum(3) mealcat1-mealcat3;
       array colldum(3) collcat1-collcat3;
       array mxc(3,3) m1xc1-m1xc3 m2xc1-m2xc3 m3xc1-m3xc3 ;
       do i=1 to 3;
              do j=1 to 3;
                     mealdum(i)=(mealcat=i);
                     colldum(j)=(collcat=j);
                     mxc(i,j)=mealdum(i)*colldum(j);
              end;
       end;
       drop i j;
run;

We can check the indicators by proc freq in SAS:
proc freq data=elemdum;
       table mealcat*collcat * m1xc1*m1xc2*m1xc3*m2xc1*m2xc2*m2xc3*m3xc1*m3xc2*m3xc3 / nopercent nocum list;
run;

The option list displays two-way to n-way tables in a list format rather than as crosstabulation tables.

Now we will add the dummy variables and interactions into the regression. We also set mealcat=3 and collcat=3 as reference.
proc reg data=elemdum;
       model api00 = mealcat1 mealcat2 mealcat3 collcat1 collcat2 collcat3 m1xc1--m3xc3;
run;
The output is:

From the output we can see:
a: some variables have no coefficient estimate since they are at the reference level;
b: the interactions m2xc1 and m2xc2 are not significant.

One important thing is: what does the interaction mean? In the case without interaction, the coefficient Bmealcat1 means the difference between the group mealcat=1 and the reference group mealcat=3. Here, with interactions, it is more complicated. The presence of interaction implies that the difference between mealcat=1 and mealcat=3 depends on the level of collcat. For example, the interaction term Bm1xc1 measures the extent to which the difference between mealcat=1 and mealcat=3 changes when collcat=1 compared to collcat=3. That is, Bm1xc1=(c1-c7)-(c3-c9).

To be more detailed, I will list the items in each cell as below:
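
(The original table is not reproduced here; based on the cell numbering above and the reference levels mealcat=3 and collcat=3, the predicted value in each cell is:)

cell 1 (mealcat=1, collcat=1): Intercept + Bmealcat1 + Bcollcat1 + Bm1xc1
cell 2 (mealcat=2, collcat=1): Intercept + Bmealcat2 + Bcollcat1 + Bm2xc1
cell 3 (mealcat=3, collcat=1): Intercept + Bcollcat1
cell 4 (mealcat=1, collcat=2): Intercept + Bmealcat1 + Bcollcat2 + Bm1xc2
cell 5 (mealcat=2, collcat=2): Intercept + Bmealcat2 + Bcollcat2 + Bm2xc2
cell 6 (mealcat=3, collcat=2): Intercept + Bcollcat2
cell 7 (mealcat=1, collcat=3): Intercept + Bmealcat1
cell 8 (mealcat=2, collcat=3): Intercept + Bmealcat2
cell 9 (mealcat=3, collcat=3): Intercept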

So, if you want to perform a test of the effect of collcat=1 versus collcat=3 when mealcat=1, you need to test whether (Intercept + Bmealcat1 + Bcollcat1 + Bm1xc1) - (Intercept + Bmealcat1) = Bcollcat1 + Bm1xc1 is zero or not. That is:
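
The code for this test is not shown above; a minimal sketch that uses the TEST statement in PROC REG (with the same dummy variables as the regression above) would be:

proc reg data=elemdum;
   model api00 = mealcat1 mealcat2 mealcat3 collcat1 collcat2 collcat3 m1xc1--m3xc3;
   collcat1_vs_3_at_meal1: test collcat1 + m1xc1 = 0;   /* H0: no collcat=1 vs collcat=3 difference when mealcat=1 */
run;
quit;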

The test is significant, indicating that the difference between collcat=1 and collcat=3 is significant for the mealcat=1 group.

2)      The same result can be obtained by ANOVA as below:
proc glm data = elemdum;
       class mealcat collcat;
       model api00 = mealcat collcat mealcat*collcat / solution;
       lsmeans mealcat*collcat / slice=collcat;
run;


At last, we can verify our result by calculating the average of each cell with PROC TABULATE:
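
The PROC TABULATE code is not shown above; a minimal sketch would be:

proc tabulate data=elemdum;
   class mealcat collcat;
   var api00;
   table mealcat, collcat*api00*mean;   /* mean of api00 for each mealcat-by-collcat cell */
run;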

Or we can get it by proc sql from SAS:
proc sql;
       create table temp as
       select mean(api00) as avg, collcat, mealcat
       from elemdum
       group by mealcat, collcat
       order by mealcat, collcat;
run;
quit;
October 3, 2011
 
These questions are from the web book 'Regression with Stata' from UCLA ATS. The original questions require Stata to answer them; here I try to use SAS to answer these questions.

Regression Diagnostics

Question 1:  The following data set (davis) consists of the measured weight, measured height, reported weight, and reported height of some 200 people. We tried to build a model to predict measured weight from reported weight, reported height, and measured height. Draw a leverage vs. residual-square plot after the regression. Explain what you see in the graph and try to use other SAS commands to identify the problematic observation(s). What do you think the problem is, and what is your solution?

Answer 1:  First we do the linear regression in SAS:
proc reg data=sasreg.davis;
      model measwt = measht reptwt reptht ;
      output out=reg1davis h=lev r=r rstudent=rstudent p=p cookd=cookd dffits=dffits;
run;

1)    Then we draw the graph the question asks for: leverage vs. normalized residual squared.
data rsquare;
      set reg1davis;
      rsquare=r**2;
run;

proc means data=rsquare;
      var rsquare;
      output out=rsquare1 mean=means std=stds;
run;

data davis1;
      set rsquare1;                            /* one observation: mean and std of the squared residuals */
      do until (eof);
            set rsquare end=eof nobs=ntotal;   /* read every squared residual */
            rsquarestd=(rsquare-means)/stds;   /* standardize the squared residual */
            output;
      end;
run;

goptions reset=all;
axis1 label=(r=0 a=90);
symbol1 pointlabel=("#measwt") v=plus i=none c=blue height=1;
proc gplot data=davis1;
      plot lev*rsquarestd / vaxis=axis1;
run;
quit;
The corresponding graph is:

From the graph above we can see there is one point (measwt=166) that is far away from the other points. We will study it further.

2)    Next we draw the scatter plots of measwt, measht, reptwt and reptht.
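The plotting code is not shown above; one minimal sketch is a scatter plot matrix with PROC SGSCATTER (assuming SAS 9.2 or later):
proc sgscatter data=sasreg.davis;
      matrix measwt measht reptwt reptht;   /* scatter plot matrix of all four variables */
run;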

Also we can see the measwt=166 point stands out.

3)      Usually, when the studentized residual is greater than 3, the corresponding observation merits further study.
proc univariate data=reg1davis;
      var rstudent;
run;
               
We can see there is one studentized residual greater than 3. We print the corresponding measwt:
proc print data=reg1davis;
      var measwt rstudent;
      where abs(rstudent) > 3;
run;
                           
It also shows that measwt=166 is an outlier.

4)      Leverage: generally, a point with leverage greater than (2k+2)/n should be carefully examined, where k is the number of predictors and n is the number of observations. In our case this equals (2*3+2)/181 = 0.0442.
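The listing of high-leverage points is not shown above; a minimal sketch, following the same pattern as the other PROC PRINT steps:
proc print data=reg1davis;
      var measwt measht reptwt reptht lev;
      where lev > (2*3+2)/181;    /* leverage cut-off (2k+2)/n */
run;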
         

5)  Cook’s D: the higher the Cook's D is, the more influential the point is. The conventional cut-off point is 4/n. We list the points with Cook’s D above the cut-off point:
proc print data=reg1davis;
      var   measwt   measht reptwt reptht cookd;
      where cookd > (4/181);
run;
                                   
6)  DFBETA: We can use a more specific measure of influence by assessing how much a coefficient changes when a data point is deleted. This is more computationally intensive than summary statistics such as Cook's D. In SAS, we need to use an ODS OUTPUT OutputStatistics= statement, together with the INFLUENCE option, to produce the DFBETAs for each of the predictors. A DFBETA value in excess of 2/sqrt(n) merits further investigation.
proc reg data=sasreg.davis ;
      model measwt =  measht reptwt reptht / influence;
      ods output outputstatistics=measwtdfbetas;
      id measwt;
run;
quit;

proc print data=measwtdfbetas  (obs=12);
      var  measwt  dfb_measht dfb_reptwt dfb_reptht;
run;
        
Based on the information above, we conclude that measwt=166 is an outlier. We probably need to delete this point and refit the regression model.
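Refitting without that observation is not shown above; a minimal sketch, assuming we simply exclude the record with measwt=166:
proc reg data=sasreg.davis;
      model measwt = measht reptwt reptht;
      where measwt ne 166;    /* exclude the outlying observation */
run;
quit;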

Question 2:  Using the data from the last exercise, what measure would you use if you want to know how much change an observation would make in the coefficient for a predictor? For example, how much would the coefficient of the predictor reptht change if we omitted observation 12 from our regression analysis? What are the other measures that you would use to assess the influence of an observation on the regression? What are the cut-off values for them?
Answer 2: DFBETA measures how much change an observation makes in a coefficient for a predictor. The output above shows the DFBETAs for observation 12.
The value of dfb_reptht is 24.2546, which means that by being included in the analysis (as compared to being excluded), this observation increases the coefficient of reptht by 24.2546 standard errors. That is, the parameter increases by 24.2546*.04197, or about 1.018.
A DFBETA value in excess of 2/sqrt(n) merits further investigation.
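As an illustration of applying this cut-off, here is a minimal sketch that lists observations whose DFBETA for any predictor exceeds it; the value 2/sqrt(181) is an assumption based on the n=181 used earlier:
proc print data=measwtdfbetas;
      var measwt dfb_measht dfb_reptwt dfb_reptht;
      where abs(dfb_measht) > 2/sqrt(181)
         or abs(dfb_reptwt) > 2/sqrt(181)
         or abs(dfb_reptht) > 2/sqrt(181);
run;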

Question 3: The following data file is called bbwt and it is from Weisberg's Applied Regression Analysis. It consists of the body weights and brain weights of some 60 animals. We want to predict the brain weight by body weight, that is, a simple linear regression of brain weight against body weight. Show what you have to do to verify the linearity assumption. If you think that it violates the linearity assumption, show some possible remedies that you would consider.
Answer 3:  1) First we check the linearity assumption. Usually we use scatter plots to study the relation. Here we draw two plots.
proc reg data=sasreg.bbwt;
      model brainwt = bodywt;
      plot residual.*predicted.;
      plot brainwt*bodywt;
run;
quit;
The first is the residual vs. predicted plot:

The second is the scatter plot of brainwt vs. bodywt.

2) Another way to examine the relationship is to fit a lowess regression, as below. SAS will output 4 graphs, one for each of the 4 lowess smoothing parameters.
proc loess data=sasreg.bbwt;
      model brainwt = bodywt / smooth = .1 .2 .3 .4;
      ods output outputstatistics=results;
run;

proc sort data=results;
      by smoothingparameter bodywt;
run;

goptions reset=all;
symbol1 v=dot i=none c=black;
symbol2 v=none i=join c=blue;
symbol3 v=none i=r c=red;
proc gplot data=results;
      by smoothingparameter;
      plot depvar*bodywt pred*bodywt depvar*bodywt / overlay;
run;
quit;

Here we show the graph with smoothingparameter=.1. The graph shows that the relation is not linear.

3) Both graphs show that the linearity assumption does not hold, so we need to transform the data. Here we use a log transformation for both brainwt and bodywt.
data lbbwt;
      set sasreg.bbwt;
      lbrainwt=log(brainwt);
      lbodywt=log(bodywt);
run;

proc reg data=lbbwt;
      model lbrainwt = lbodywt;
      plot residual.*predicted.;
      plot lbrainwt*lbodywt;
run;
quit;



Now we can see that after the transformation the relationship is much closer to linear.

Question 4:  We did a regression analysis using the data file elemapi2. Continuing with that analysis, draw an added-variable plot (avplot) here. Explain what an avplot is and what type of information you can get from it. If the variable full were put in the model, would it be a significant predictor?
Answer: A group of points can be jointly influential. An added-variable plot is an attractive graphical method for displaying the influence of multiple points on a predictor. What we look for in an added-variable plot are points that can exert substantial change on the regression line.
From the avplot for the variable full we can see that the observation with school number 211 is very low on the left. Deleting it would flatten the regression line considerably; in other words, it would decrease the coefficient for the variable full a lot.
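The plotting code is not shown above; partial regression (added-variable) plots can be requested in PROC REG with the PARTIAL option. The following is a minimal sketch, where the predictor list (meals, ell, emer) is an assumption about the earlier elemapi2 model:
proc reg data=sasreg.elemapi2;
      model api00 = meals ell emer full / partial;   /* partial regression leverage plots */
run;
quit;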


Question 5: The data set wage is from a national sample of 6000 households with a male head earning less than $15,000 annually in 1966. The data were classified into 39 demographic groups for analysis. We tried to predict the average hours worked by the average age of respondent and the average yearly non-earned income (NEIN). Both predictors are significant. Now if we add ASSET to our predictor list, neither NEIN nor ASSET is significant. Can you explain why?
Answer 5: The reason is that NEIN and ASSET are highly correlated, which is why neither is significant when both are in the model. The problem here is multicollinearity.
proc corr data=sasreg.wage;
      var age nein asset;
run;


We can also use the VIF values to double-check for multicollinearity.
proc reg data=sasreg.wage;
      model hrs = age nein asset / vif;
run;


Question 6:  Continue to use the previous data set. This time we want to predict the average hourly wage by average percent of white respondents. Carry out the regression analysis and list the SAS commands that you can use to check for heteroscedasticity. Explain the result of your test(s).
Now we want to build another model to predict the average percent of white respondents by the average hours worked. Repeat the analysis you performed on the previous regression model. Explain your results.
Answer 6:  1)  First we do the linear regression and then draw the plot of residuals vs. predicted values. The graph does not show any particular pattern in the residuals.
proc reg data=sasreg.wage;
      model rate = race / spec;
      output out=reg1wage r=r p=p rstudent=rstudent;
run;

goptions reset=all;
proc gplot data=reg1wage;
      plot r*p;
run;
quit;

Next we can do the White test, a test for heteroscedasticity. The White test tests the null hypothesis that the variance of the residuals is homogeneous. Therefore, if the p-value is very small, we would have to reject the null hypothesis and accept the alternative hypothesis that the variance is not homogeneous. We use the / spec option on the MODEL statement to obtain the White test.

The p-value is greater than .05, so we cannot reject the null hypothesis. That is, we cannot reject the hypothesis that the variance is homogeneous.

2) For the second model, the residual plot shows the residuals fanning out, and the White test confirms this since the p-value is less than .05. So we conclude that the variance is not homogeneous.
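The code for this second regression is analogous to part 1; a minimal sketch, assuming hrs and race are the relevant variables in sasreg.wage:
proc reg data=sasreg.wage;
      model race = hrs / spec;        /* White test via the SPEC option */
      output out=reg2wage r=r p=p;
run;

goptions reset=all;
proc gplot data=reg2wage;
      plot r*p;                       /* residuals vs. predicted values */
run;
quit;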




Question 7:  We have a data set (tree) that consists of volume, diameter and height of some objects. Someone did a regression of volume on diameter and height. Explain what tests you can use to detect model specification errors and if there is any, your solution to correct it.
Answer 7: A model specification error can occur when one or more relevant variables are omitted from the model or one or more irrelevant variables are included in the model. If relevant variables are omitted from the model, the common variance they share with included variables may be wrongly attributed to those variables, and the error term is inflated. On the other hand, if irrelevant variables are included in the model, the common variance they share with included variables may be wrongly attributed to them. Model specification errors can substantially affect the estimate of regression coefficients.
The link test is used to check model specification. If a model is well specified, one should not be able to find additional independent variables that are significant. A common method is to fit a new regression of y on the fitted value (fv) and the squared fitted value (fv2). For a well specified model, fv should be significant and fv2 should not. We show the result below:
proc reg data=sasreg.tree;
      model vol = dia height;
      output out=reg1tree (keep= vol dia height fv) predicted=fv;
run;

data reg1square;
      set reg1tree;
      fv2=fv**2;
run;

proc reg data=reg1square;
      model vol = fv fv2;
run;
quit;


From the output above we can see that both fv and fv2 are significant. That is, our model vol ~ dia + height is not well specified.
This is reasonable, since we know the volume should be proportional to the square of the diameter times the height. So we next add an independent variable, dia2, the square of the diameter, and repeat the link test (see the sketch below).
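The refit and the repeated link test are not shown above; a minimal sketch, assuming dia2 = dia**2 and the same link-test pattern as before:
data tree2;
      set sasreg.tree;
      dia2=dia**2;                   /* square of diameter */
run;

proc reg data=tree2;
      model vol = dia2 height;
      output out=reg2tree (keep=vol dia2 height fv) predicted=fv;
run;

data reg2square;
      set reg2tree;
      fv2=fv**2;
run;

proc reg data=reg2square;
      model vol = fv fv2;            /* link test: fv2 should not be significant */
run;
quit;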


Now we can see the model is well specified.