
June 24, 2019

When fitting a least squares regression model to data, it is often useful to create diagnostic plots of the residuals versus the explanatory variables. If the model fits the data well, the plots of the residuals should not display any patterns. Systematic patterns can indicate that you need to include additional explanatory effects to model the data. Sometimes it is difficult to spot patterns in a seemingly random cloud of points, so some analysts like to add a scatter plot smoother to the residual plots. You can use the SMOOTH suboption of the PLOTS=RESIDUALS option in many SAS regression procedures to generate a panel of residual plots that contain loess smoothers. For SAS procedures that do not support the PLOTS=RESIDUALS option, you can use PROC SGPLOT to manually create a residual plot with a smoother.

Residual plots with loess smoothers

Many SAS linear regression procedures, such as PROC REG and PROC GLM, support the PLOTS=RESIDUALS(SMOOTH) option on the PROC statement. For example, the following call to PROC GLM automatically creates a panel of scatter plots in which the residuals are plotted against each regressor. The model is a two-regressor model for the MPG_City variable in the Sashelp.Cars data.

/* residual plots with loess smoother */
ods graphics on;
proc glm data=Sashelp.Cars plots(only) = Residuals(smooth);
   where Type in ('SUV', 'Truck');
   model MPG_City = EngineSize Weight;
run; quit;

The loess smoothers can sometimes reveal patterns in the residuals that would not otherwise be perceived. In this case, it looks like there is a quadratic pattern to the residuals-versus-EngineSize graph (and perhaps for the Weight variable as well). This indicates that you might need to include a quadratic effect in the model. Because the EngineSize and Weight variables are highly correlated (ρ = 0.81), the following statements add only a quadratic effect for EngineSize:

proc glm data=Sashelp.Cars plots(only) = Residuals(smooth);
   where Type in ('SUV', 'Truck');
   model MPG_City = EngineSize Weight
                    EngineSize*EngineSize ;
run; quit;

After adding the quadratic effect, the residual plots do not reveal any obvious systematic trends. Also, the residual plot for Weight no longer shows any quadratic pattern.

How to use PROC SGPLOT to create a residual plot with a smoother

If you use a SAS procedure that does not support the PLOTS=RESIDUALS(SMOOTH) option, you can output the residual values to a SAS data set and use PROC SGPLOT to create the residual plots. Even when a procedure DOES support the PLOTS=RESIDUALS(SMOOTH) option, you might want to customize the plot by adding legends, by changing attributes of the markers or curve, or by specifying a value for the smoothing parameter.

An example is shown below. If you use the same model for MPG_City, but use all observations in the data set, the residual plot for EngineSize looks very strange. For these data, the smoothing parameter for the loess curve is very small and therefore the loess curve overfits the residuals:

proc glm data=Sashelp.Cars plots(only) = Residuals(smooth);
   model MPG_City = EngineSize Weight;
   output out=RegOut predicted=Pred residual=Residual;
run; quit;

Yuck! The loess curve for the plot on the left clearly overfits the residuals-versus-EngineSize data! Unfortunately, you cannot change the smoothing parameter from the PROC GLM syntax. However, you can change the default smoothing parameter in PROC SGPLOT and you can make other modifications to the plot as well. Notice in the previous call to PROC GLM that the OUTPUT statement creates a data set named RegOut that contains the residual values and the original variables. Therefore, you can create a residual plot and add a loess smoother by using PROC SGPLOT, as follows:

ods graphics / attrpriority=NONE;
title "Residuals for Model";
proc sgplot data=RegOut ;
   scatter x=EngineSize y=Residual / group=Origin;
   loess x=EngineSize y=Residual / nomarkers smooth=0.5;
   refline 0 / axis=y;
   xaxis grid; yaxis grid;
run;

The smoothing parameter was manually set to 0.5, but you can use PROC LOESS if you want to choose a smoothing parameter that optimizes some information criterion such as the AICC statistic. Notice that you can use additional SGPLOT statements to add a reference grid and to change marker attributes. If you prefer, you could add a different kind of smoother such as a penalized B-spline by using the PBSPLINE statement.
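For example, here is a minimal sketch that uses the same RegOut data set as above but smooths the residuals with a penalized B-spline instead of a loess curve:

title "Residuals with a Penalized B-Spline Smoother";
proc sgplot data=RegOut;
   scatter x=EngineSize y=Residual;
   pbspline x=EngineSize y=Residual / nomarkers;  /* penalized B-spline smoother */
   refline 0 / axis=y;
   xaxis grid; yaxis grid;
run;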

You might wonder why the smoothing parameter for the residual plot of EngineSize is so small. The parameter is chosen to optimize a criterion such as the AICC statistic, so why does it overfit the data? An example in the PROC LOESS documentation provides an explanation. The chosen value for the smoothing parameter is one that corresponds to a local minimum of an objective function that involves the AICC statistic. Unfortunately, a set of data can have multiple local minima, and this is the case for the residuals of the EngineSize variable. When the smoothing parameter is 0.534, the AICC criterion reaches a local minimum. However, there are smaller values of the smoothing parameter for which the AICC criterion is even smaller. The minimum value of the AICC occurs when the smoothing parameter is 0.015, which leads to the "jagged" loess curve that is seen in the panel of residual plots shown earlier in this section. If you want to see this phenomenon yourself, run the following PROC LOESS code and look at the criterion plot.

ods select CriterionPlot SmoothingCriterion FitPlot;
proc loess data=RegOut;
   model Residual = EngineSize / select=AICC(global) ;
run;

Because a data set can be smoothed at multiple scales, the "optimal" smoothing parameter that is chosen automatically by the PLOTS=RESIDUALS(SMOOTH) option might not enable you to see the general trend of the residuals. If you experience this phenomenon, output the residuals and use PROC SGPLOT or PROC LOESS to compute a more useful smoother.

In summary, SAS provides the PLOTS=RESIDUALS(SMOOTH) option to automatically create residual-versus-regressor plots. Although this panel usually provides a useful indication of patterns in the residuals, you can also output the residuals to a data set and use PROC SGPLOT or PROC LOESS to create a customized residual plot.

The post Add loess smoothers to residual plots appeared first on The DO Loop.

June 19, 2019

A previous article describes the DFBETAS statistics for detecting influential observations, where "influential" means that if you delete the observation and refit the model, the estimates for the regression coefficients change substantially. Of course, there are other statistics that you could use to measure influence. Two popular ones are the DFFITS statistic and Cook's distance, which is also known as Cook's D statistic. Both statistics measure the change in predicted values that occurs when you delete an observation and refit the model. This article describes the DFFITS and Cook's D statistics and shows how to compute and graph them in SAS.

DFFITS: How the predicted value changes if an observation is excluded

If you exclude an observation from a model and refit, the predicted values will change. The DFFITS statistic is a measure of how the predicted value at the i_th observation changes when the i_th observation is deleted. High-leverage points tend to pull the regression surface towards the response at that point, so the change in the predicted value at that point is a good indication of how influential the observation is. So that the DFFITS values are independent of the scale of the data, the change in predicted values is scaled by dividing by the standard error of the predicted value at that point. The exact formula is given in the documentation for PROC REG. The book Regression Diagnostics by Belsley, Kuh, and Welsch (1980) suggests that an observation is influential if the magnitude of its DFFITS value exceeds 2*sqrt(p/n), where p is the number of effects in the model and n is the sample size.
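For the model that is fit below (p = 4 effects, including the intercept, and n = 84 observations), that cutoff is 2*sqrt(4/84) ≈ 0.44.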

PROC REG provides three ways to generate the DFFITS statistics for each observation:

  • You can create a graph of the DFFITS statistics by using the PLOTS=DFFITS option.
  • You can also display a table of the DFFITS (and other influence statistics) by using the INFLUENCE option in the MODEL statement.
  • You can write the DFFITS statistics to a data set by using the DFFITS= option in the OUTPUT statement.

The following DATA step extracts a subset of n = 84 vehicles from the Sashelp.Cars data, creates a short ID variable for labeling observations, and sorts the data by the response variable, MPG_City. The data are sorted because the DFFITS statistic is graphed against the observation number, which is an arbitrary quantity. By sorting the data, you know that small observation numbers correspond to low values of the response and so forth. If you have a short ID variable, you can label the influential observations by using the LABEL suboption, as follows:

/* Create sample data */
data cars;
set sashelp.cars;
where Type in ('SUV', 'Truck');
/* make short ID label from Make and Model values */
length IDMakeMod $20;
IDMakeMod = cats(substr(Make,1,4), ":", substr(Model,1,5));
run;
 
/* Optional but helpful: Sort by response variable */
proc sort data=cars;
   by MPG_City;
run;
 
proc reg data=Cars plots(only) = DFFITS(label); 
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
run; quit;

The DFFITS graph shows that three observations have a large positive DFFITS value. The observations are the Ford Excursion, the Ford Ranger, and the Mazda B2300. For these observations, the predicted value (at the observation) is higher when the observation is included in the model than when it is excluded. Thus, these observations "pull the regression up." There are four observations that have large negative DFFITS values, which means that these observations "pull the regression down." They include the Land Rover Discovery and the Volvo XC90.

Cook's D: A distance measure for the change in regression estimates

When you estimate a vector of regression coefficients, there is uncertainty. The confidence region for the parameter estimates is an ellipsoid in k-dimensional space, where k is the number of effects that you are estimating (including the intercept). Cook (1977) defined a distance that measures how far the estimates move within the confidence ellipsoid when the i_th point is deleted. Equivalently, Cook showed that the statistic is proportional to the squared studentized residual for the i_th observation. The documentation for PROC REG provides a formula in terms of the studentized residuals.
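Concretely, the statistic can be written as Di = ri² hi / (p (1 − hi)), where ri is the (internally) studentized residual, hi is the leverage of the i_th observation, and p is the number of effects (including the intercept).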

By default, PROC REG creates a plot of Cook's D statistic as part of the panel of diagnostic plots. (Cook's D is in the second row and third column.) You can create a larger stand-alone plot by using the PLOTS=COOKSD option. Optionally, you can label the influential points (those whose Cook's D statistic exceeds 4/n) by using the LABEL suboption, as shown below:

/* create multiple plots and label influential points */
proc reg data=Cars plots(only) = (CooksD(label) DFFits(label));   
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
   output out=RegOut pred=Pred rstudent=RStudent dffits=DFFits cookd=CooksD; /* optional: output statistics */
run; quit;

In many ways, the plot of Cook's D looks similar to a plot of the squared DFFITS statistics. Both measure a change in the predicted value at the i_th observation when the i_th observation is excluded from the analysis. The formula for Cook's D statistic squares a residual-like quantity, so it does not show the direction of the change, whereas the DFFITS statistics do show the direction. Otherwise, the observations that are "very influential" are often the same for both statistics, as seen in this example.
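If you want to customize these displays (for example, to use a different cutoff value), you can build the graphs yourself from the RegOut data set that the OUTPUT statement creates. The following is a minimal sketch of a DFFITS needle plot; the variable names match the OUTPUT statement above, and the cutoff lines use 2*sqrt(p/n) = 2*sqrt(4/84) ≈ 0.436:

data RegOutNum;
   set RegOut;
   ObsNum = _N_;           /* observation number for the x-axis */
run;

proc sgplot data=RegOutNum;
   needle x=ObsNum y=DFFits;
   refline 0.436 -0.436 / axis=y lineattrs=(pattern=dash);  /* +/- 2*sqrt(p/n) */
   xaxis grid label="Observation Number"; yaxis grid;
run;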

The post Influential observations in a linear regression model: The DFFITS and Cook's D statistics appeared first on The DO Loop.

June 17, 2019

My article about deletion diagnostics investigated how influential an observation is to a least squares regression model. In other words, if you delete the i_th observation and refit the model, what happens to the statistics for the model? SAS regression procedures provide many tables and graphs that enable you to examine the influence of deleting an observation. For example:

  • The DFBETAS are statistics that indicate the effect that deleting each observation has on the estimates for the regression coefficients.
  • The DFFITS and Cook's D statistics indicate the effect that deleting each observation has on the predicted values of the model.
  • The COVRATIO statistics indicate the effect that deleting each observation has on the variance-covariance matrix of the estimates.

These observation-wise statistics are typically used for smaller data sets (n ≤ 1000) because the influence of any single observation diminishes as the sample size increases. You can get a table of these (and other) deletion diagnostics by using the INFLUENCE option on the MODEL statement of PROC REG in SAS. However, because there is one statistic per observation, these statistics are usually graphed. PROC REG can automatically generate needle plots of these statistics (with heuristic cutoff values) by using the PLOTS= option on the PROC REG statement.

This article describes the DFBETAS statistic and shows how to create graphs of the DFBETAS in PROC REG in SAS. The next article discusses the DFFITS and Cook's D statistics. The COVRATIO statistic is not as popular, so I won't say more about that statistic.

DFBETAS: How the coefficient estimates change if an observation is excluded

The documentation for PROC REG has a section that describes the influence statistics, which are based on the book Regression Diagnostics by Belsley, Kuh, and Welsch (1980, pp. 13-14). Among these, the DFBETAS statistics are perhaps the easiest to understand. If you exclude an observation from the data and refit the model, you will get new parameter estimates. How much do the estimates change? Notice that you get one statistic for each combination of observation and regressor (including the intercept). Thus if you have n observations and k regressors, you get nk statistics.
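For example, the model that is fit below has n = 84 observations and k = 4 coefficients (three regressors plus the intercept), so there are 84 × 4 = 336 DFBETAS statistics.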

Typically, these statistics are shown in a panel of k plots, with the DFBETAS for each regressor plotted against the observation number. Because "observation number" is an arbitrary number, I like to sort the data by the response variable. Then I know that the small observation numbers correspond to low values of the response variable and large observation numbers correspond to high values of the response variable. The following DATA step extracts a subset of n = 84 vehicles from the Sashelp.Cars data, creates a short ID variable for labeling observations, and sorts the data by the response variable, MPG_City:

data cars;
set sashelp.cars;
where Type in ('SUV', 'Truck');
/* make short ID label from Make and Model values */
length IDMakeMod $20;
IDMakeMod = cats(substr(Make,1,4), ":", substr(Model,1,5));
run;
 
proc sort data=cars;
   by MPG_City;
run;
 
proc print data=cars(obs=5) noobs;
   var Make Model IDMakeMod MPG_City;
run;

The first few observations are shown. Notice that the first observations correspond to small values of the MPG_City variable. Notice also that the short IDMakeMod label identifies each vehicle.

There are two ways to generate the DFBETAS statistics: You can use the INFLUENCE option on the MODEL statement to generate a table of statistics, or you can use the PLOTS=DFBETAS option in the PROC REG statement to generate a panel of graphs. The following call to PROC REG generates a panel of graphs. The IMAGEMAP=ON option on the ODS GRAPHICS statement enables you to hover the mouse pointer over an observation and obtain a brief description of the observation:

ods graphics on / imagemap=on;              /* enable data tips (tooltips) */
proc reg data=Cars plots(only) = DFBetas; 
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
run; quit;
ods graphics / imagemap=off;

The panel shows the influence of each observation on the estimates of the four regression coefficients. The statistics are standardized so that all graphs can use the same vertical scale. Horizontal lines are drawn at ±2/sqrt(n) ≈ 0.22. Observations are called influential if they have a DFBETA statistic that exceeds that value. The graph shows a tool tip for one of the observations in the EngineSize graph, which shows that the influential point is observation 4, the Land Rover Discovery.

Each graph reveals a few influential observations:

  • For the intercept estimate, the most influential observations are numbers 1, 35, 83, and 84.
  • For the EngineSize estimates, the most influential observations are numbers 4, 35, and 38.
  • For the Horsepower estimates, the most influential observations are numbers 1, 4, and 38.
  • For the Weight estimates, the most influential observations are numbers 1, 24, 35, and 38.

Notice that several observations (such as 1, 35, and 38) are influential for more than one estimate. Excluding those observations causes several parameter estimates to change substantially.

Labeling the influential observations

For me, the panel of graphs is too small. I found it difficult to hover the mouse pointer exactly over the tip of a needle in an attempt to discover the observation number and name of the vehicle. Fortunately, if you want details like that, PROC REG supplies options that make the process easier. If you don't have too many observations, you can add labels to the DFBETAS plots by using the LABEL suboption. To plot each graph individually (instead of in a panel), use the UNPACK suboption, as follows:

proc reg data=Cars plots(only) = DFBetas(label unpack); 
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
run; quit;

The REG procedure creates four plots, but only the graph for the Weight variable is shown here. In this graph, the influential observations are labeled by the IDMakeMod variable, which enables you to identify vehicles rather than observation numbers. For example, some of the influential observations for the Weight variable are the Ford Excursion (1), the Toyota Tundra (24), the Mazda B4000 (35), and the Volvo XC90 (38).

A table of influential observations

If you want a table that displays the most influential observations, you can use the INFLUENCE option to generate the OutputStatistics table, which contains the DFBETAS for all regressors. You can write that table to a SAS data set and exclude any that do not have a large DFBETAS statistic, where "large" means the magnitude of the statistic exceeds 2/sqrt(n), where n is the sample size. The following DATA step filters the observations and prints only the influential ones.

ods exclude all;
proc reg data=Cars plots=NONE; 
   model MPG_City = EngineSize HorsePower Weight / influence;
   id IDMakeMod;
   ods output OutputStatistics=OutputStats;      /* save influence statistics */
run; quit;
ods exclude none;
 
data Influential;
set OutputStats nobs=n;
array DFB[*] DFB_:;
cutoff = 2 / sqrt(n);
ObsNum = _N_;
influential = 0;
DFBInd = '0000';                   /* binary string indicator */
do i = 1 to dim(DFB);
   if abs(DFB[i])>cutoff then do;  /* this obs is influential for i_th regressor */
      substr(DFBInd,i,1) = '1';
      influential = 1;
   end;
end;
if influential;                    /* output only influential obs */
run;
 
proc print data=Influential noobs;
   var ObsNum IDMakeMod DFBInd cutoff DFB_:;
run;

The DFBInd variable is a four-character binary string that indicates which parameter estimates are influenced by each observation. Some observations are influential only for one coefficient; others (1, 4, 35, and 38) are influential for many variables. Creating a binary string for each observation is a useful trick.

By the way, did you notice that the name of the statistic ("DFBETAS") has an S at the end? Until I researched this article, I assumed it was to make the word plural since there is more than one "DFBETA" statistic. But, no, it turns out that the S stands for "scaled." You can define the DFBETA statistic (without the S) to be the change in the parameter estimates, b − b(i), but that statistic depends on the scale of the variables. To standardize the statistic, divide by the standard error of the parameter estimates. That scaling is the reason for the S at the end of DFBETAS. The same is true for the DFFITS statistic: S stands for "scaled."
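In symbols (following Belsley, Kuh, and Welsch), the statistic for the i_th observation and j_th coefficient is DFBETASij = (bj − bj(i)) / ( s(i) sqrt( [(X`X)^{-1}]jj ) ), where b(i) and s(i) are the coefficient estimates and root mean squared error that are computed from the data that exclude the i_th observation.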

The next article describes how to create similar graphs for the DFFITS and Cook's D statistics.

---------------

DFFITS: How the predicted values change if an observation is excluded

The DFFITS statistic measures, for each observation, how the predicted value at that observation changes if you exclude the observation and refit the model.

Cook's D: How the predicted values change, in aggregate, if an observation is excluded

Cook's distance (D) statistic measures, for each observation, the aggregate change in the predicted values (a sum of squared differences, taken over all observations) if you exclude the observation and refit the model.
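In symbols, Di = Σj (ŷj − ŷj(i))² / (p s²), where ŷj(i) is the j_th predicted value for the model that excludes the i_th observation, p is the number of effects (including the intercept), and s² is the mean squared error.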

The post Influential observations in a linear regression model: The DFBETAS statistics appeared first on The DO Loop.

June 12, 2019

For linear regression models, there is a class of statistics that I call deletion diagnostics or leave-one-out statistics. These observation-wise statistics address the question, "If I delete the i_th observation and refit the model, what happens to the statistics for the model?" For example:

  • The PRESS statistic is similar to the residual sum of squares statistic but is based on fitting n different models, where n is the sample size and the i_th model excludes the i_th observation.
  • Cook's D statistic measures the influence of the i_th observation on the fit.
  • The DFBETAS statistics measure how the regression estimates change if you delete the i_th observation.

Although most references define these statistics in terms of deleting an observation and refitting the model, you can use a mathematical trick to compute the statistics without ever refitting the model! For example, the Wikipedia page on the PRESS statistic states, "each observation in turn is removed and the model is refitted using the remaining observations. The out-of-sample predicted value is calculated for the omitted observation in each case, and the PRESS statistic is calculated as the sum of the squares of all the resulting prediction errors." Although this paragraph is conceptually correct, the SAS/STAT documentation for PROC GLMSELECT states that the PRESS statistic "can be efficiently obtained without refitting the model n times."
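The classical identity behind that efficiency is that the i_th "deleted" prediction error equals ri / (1 − hi), where ri is the ordinary residual and hi is the leverage of the i_th observation, so PRESS = Σi ( ri / (1 − hi) )² can be computed from a single fit.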

A rank-1 update to the inverse of a matrix

Recall that you can use the "normal equations" to obtain the least squares estimate for the regression problem with design matrix X and observed responses Y. The solution of the normal equations is b = (X`X)^{-1}(X`Y), where X`X is known as the sum of squares and crossproducts (SSCP) matrix and b is the least squares estimate of the regression coefficients. For data sets with many observations (very large n), the process of reading the data and forming the SSCP matrix is a relatively expensive part of fitting a regression model. Therefore, if you want the PRESS statistic, it is better to avoid rebuilding the SSCP matrix and computing its inverse n times. Fortunately, there is a beautiful result in linear algebra that relates the inverse of the full SSCP matrix to the inverse when a row of X is deleted. The result is known as the Sherman-Morrison formula for rank-1 updates.

The key insight is that one way to compute the SSCP matrix is as a sum of outer products of the rows of X. Therefore if xi is the i_th row of X, the SSCP matrix for data where xi is excluded is equal to X`X - xi`xi. You have to invert this matrix to find the least squares estimates after excluding xi.
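In symbols, X`X = Σi xi`xi, where the sum runs over the n rows of X, so deleting the i_th row changes the SSCP matrix by a single rank-1 term.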

The Sherman-Morrison formula enables you to compute the inverse of X`X - xi`xi when you already know the inverse of X`X. For brevity, set A = X`X. The Sherman-Morrison formula for deleting a row vector xi` is
(A − xi`xi)^{-1} = A^{-1} + A^{-1} xi`xi A^{-1} / (1 − xi A^{-1} xi`)

Implement the Sherman-Morrison formula in SAS

The formula shows how to compute the inverse of the updated SSCP by using a matrix-vector multiplication and an outer product. Let's use a matrix language to demonstrate the update method. The following SAS/IML program reads in a small data set, forms the SSCP matrix (X`X), and computes its inverse:

proc iml;
use Sashelp.Class;   /* read data into design matrix X */
read all var _NUM_ into X[c=varNames];  
close;
XpX = X`*X;          /* form SSCP */
XpXinv = inv(XpX);   /* compute the inverse */

Suppose you want to compute a leave-one-out statistic such as PRESS. For each observation, you need to estimate the parameters that result if you delete that observation. For simplicity, let's just look at deleting the first row of the X matrix. The following program creates a new design matrix (Z) that excludes the row, forms the new SSCP matrix, and finds its inverse:

/* Inefficient: Manually delete the row from the X matrix 
   and recompute the inverse */
n = nrow(X);
Z = X[2:n, ];       /* delete first row */
ZpZ = Z`*Z;         /* reform the SSCP matrix */
ZpZinv = inv(ZpZ);  /* recompute the inverse */
print ZpZinv[c=varNames r=varNames L="Inverse of SSCP After Deleting First Row"];

The previous statements essentially repeat the entire least squares computation. To compute a leave-one-out statistic, you would perform a total of n similar computations.

In contrast, it is much cheaper to apply the Sherman-Morrison formula to update the inverse of the original SSCP. The following statements apply the Sherman-Morrison formula as it is written:

/* Alternative: Do not change X or recompute the inverse. 
   Use the Sherman-Morrison rank-1 update formula.
   https://en.wikipedia.org/wiki/Sherman–Morrison_formula */
r = X[1, ];          /* first row */
rpr = r`*r;          /* outer product */
/* apply Sherman-Morrison formula */
NewInv = XpXinv + XPXinv*rpr*XPXinv / (1 - r*XpXinv*r`);
print NewInv[c=varNames r=varNames L="Inverse from Sherman-Morrison Formula"];

These statements compute the new inverse by using the old inverse, an outer product, and a few matrix multiplications. Notice that the denominator of the Sherman-Morrison formula includes the expression r*(X`X)^{-1}*r`, which is the leverage statistic for the i_th row.
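As a quick sanity check (continuing the same SAS/IML session), you can verify that the updated inverse agrees with the inverse that was recomputed directly:

diff = max(abs(NewInv - ZpZinv));   /* compare the rank-1 update with the direct recomputation */
print diff;                         /* expect a value near machine precision */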

The INVUPDT function in SAS/IML

Because it is important to be able to update an inverse matrix quickly when an observation is deleted (or added!), the SAS/IML language supports the INVUPDT function, which implements the Sherman-Morrison formula. You merely specify the inverse matrix to update, the vector (as a column vector) to use for the rank-one update, and an optional scalar value, which is usually +1 if you are adding a new observation and -1 if you are deleting an observation. For example, the following statements are the easiest way to implement the Sherman-Morrison formula in SAS for a leave-one-out statistic:

NewInv2 = invupdt(XpXinv, r`, -1);
print NewInv2[c=varNames r=varNames L="Inverse from INVUPDT Function"];

The output is not displayed because the matrix NewInv2 is the same as the matrix NewInv in the previous section. The documentation includes additional examples.

The general Sherman-Morrison-Woodbury formula

The Sherman-Morrison formula shows how to perform a rank-1 update of an inverse matrix. There is a more general formula, called the Sherman-Morrison-Woodbury formula, which enables you to update an inverse for any rank-k modification of the original matrix. The general formula (Golub and van Loan, p. 51 of 2nd ed. or p. 65 of 4th ed.) shows how to find the inverse of a rank-k modification of a nonsingular matrix, A, in terms of the inverse of A. The general formula is
(A + U V^T)^{-1} = A^{-1} − A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}
where U and V are p x k matrices and all inverses are assumed to exist. When k = 1, the matrices U and V become vectors and the k x k identity matrix becomes the scalar value 1. In the previous section, U equals −xi^T and V equals xi^T.
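Here is a minimal sketch (continuing the SAS/IML session from the previous sections) that deletes the first two rows of X in a single rank-2 update and verifies the result against a direct computation:

/* delete rows 1 and 2 of X in one rank-2 Woodbury update */
U = -X[1:2, ]`;                                  /* p x 2 */
V =  X[1:2, ]`;                                  /* p x 2 */
W = XpXinv - XpXinv*U*inv(I(2) + V`*XpXinv*U)*V`*XpXinv;
Z2 = X[3:n, ];                                   /* direct computation for comparison */
print (max(abs(W - inv(Z2`*Z2))));               /* expect a value near machine precision */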

The Sherman-Morrison-Woodbury formula is one of my favorite results in linear algebra. It shows that a rank-k modification of a matrix results in a rank-k modification of its inverse. It is not only a beautiful theoretical result, but it has practical applications to leave-one-out statistics because you can use the formula to quickly compute the linear regression model that results by dropping an observation from the data. In this way, you can study the influence of each observation on the model fit (Cook's D, DFBETAS,...) and perform leave-one-out cross-validation techniques, such as the PRESS statistic.

The post Leave-one-out statistics and a formula to update a matrix inverse appeared first on The DO Loop.

June 4, 2014
One of the most common tasks in chemistry is to determine the concentration of a chemical in an aqueous solution (i.e., the chemical is dissolved in water, with other chemicals possibly in the solution). A common way to accomplish this task is to create a calibration curve by measuring the […]