linear regression

October 14, 2020
 

A previous article discusses the confidence band for the mean predicted value in a regression model. The article shows a "graded confidence band plot," which I saw in Claus O. Wilke's online book, Fundamentals of Data Visualization (Section 16.3). It communicates uncertainty in the predictions. A graded band plot is shown to the right for the 95% (outermost), 90%, 80%, 70% and 50% (innermost) confidence levels.

The graded band plot to the right has five confidence levels, but that was an arbitrary choice. Why not use 10 levels? Or 100?

When you start using many levels, it is no longer feasible to identify the individual confidence bands. However, you can use a heat map to visualize the uncertainty in the mean. Given any (x,y) value in the graph, you can find the value of α such that the 100(1-α)% confidence limit passes through (x,y). You can then color the point (x,y) by the value of α and use a gradient legend to indicate the α values (or, equivalently, the confidence levels). An example is shown below. In this article, I show how to create this visualization, which I call a continuous band plot. This article was inspired by a question and graph that Michael Friendly posted on Twitter.

A review: The formula for confidence limits for the mean predicted value

Most statistics textbooks provide a formula for the 100(1-α)% confidence limits for the predicted mean of a regression model. You can see the formula in the SAS documentation. To keep the math simple, I will restrict the discussion to one-dimensional linear regression of the form Y = b0 + b1*x + ε, where ε ~ N(0, σ). Suppose that the sample contains n observations. Given a significance level (α) and values for the explanatory variable (x), the upper and lower confidence limits are given by
    CLM(x) = pred(x) ± tα SEM(x)
where

  • tα = quantile("t", 1-α/2, n-2) is a quantile of the t distribution with n-2 degrees of freedom
  • SEM(x) = sqrt(MSE * h(x)) is the standard error of the predicted mean at x
  • MSE is the mean squared error, which is the sum of the squared residuals divided by n-2.
  • h(x) is the leverage statistic at x. The SAS documentation provides a matrix equation for the leverage statistic at x in terms of a quadratic form that involves the inverse of the crossproduct matrix. For a one-variable regression, you can explicitly write the leverage in terms of the elements of the inverse matrix. If M = (X`X)⁻¹ is the inverse matrix, then the leverage is h(x) = M[1,1] + 2*M[1,2]*x + M[2,2]*x².

The SAS/IML documentation includes a Getting Started example about computing statistics for linear regression. The following program computes these quantities for α=0.05 and for a sequence of x values:

data Have;
input x y;
datalines;
21 45
24 40
26 32
27 47
30 55
30 60
33 42
34 67
36 50
39 54
;
 
proc iml;
/* Use IML to produce regression estimates */
use Have;  read all var {x y};  close;
 
X = j(nrow(x),1,1) || x;        /* add intercept col to design matrix */
n = nrow(x);                    /* sample size                        */
xpxi = inv(X`*X);               /* inverse of X'X                     */
b = xpxi * (X`*y);              /* parameter estimates of coefficients */
yHat = X*b;                     /* predicted values for data          */
dfe = n - ncol(X);              /* error DF                           */
mse = ssq(y-yHat)/dfe;          /* MSE = SSQ(resid)/dfe               */
 
/* Given M = inv(X`*X), compute leverage at x. In general, the leverage is 
   h = vecdiag(x*M*x`), but for a single regressor this simplifies. */
start leverage(x, M);
   return ( M[1,1] + 2*M[1,2]*x + M[2,2]*x##2 );
finish;
 
alpha = 0.05;
t_a= quantile("t", 1-alpha/2, dfe);
 
/* evaluate CLM for a sequence of x values */
z = T(do(20,40,1));             /* sequence of x values in [20,40] */
h = leverage(z, xpxi);          /* evaluate h(x) */
SEM = sqrt(mse * h);            /* SEM(x) = standard error of predicted mean */
Pred = b[1] + b[2]*z;           /* evaluate pred(x) */
LowerCLM = Pred - t_a * SEM;
UpperCLM = Pred + t_a * SEM;
 
print z Pred LowerCLM UpperCLM;

The program computes the predicted values and the lower and upper 95% confidence limits for the mean for a sequence of x values in the interval [20, 40]. If you overlay these curves on a scatter plot of the data, you obtain the usual regression fit plot that is produced automatically by regression procedures in SAS.

Invert the confidence limit formula

The previous section implements formulas that are typically presented in a first course on linear regression. Given α and x, the formulas enable you to find the upper and lower CLM at x. But here is a cool thing that I haven't seen before: You can INVERT that formula. That is, given a value for the explanatory variable (x) and the response (y), you can find the significance level (α) such that the 100(1-α)% confidence limit passes through (x, y). This enables you to compute a heat map that visualizes the uncertainty in the predicted mean.

It is not hard to invert the CLM formula. You can use algebra to find
    tα = |y - pred(x)| / SEM(x)
You can then apply the CDF function to both sides to find
    1 - α/2 = CDF("t", |y - pred(x)| / SEM(x), n-2)
which you can solve for α(x,y). The following program carries out these computations:

/* Given (x,y), find alpha(x,y) so that the 100(1-alpha)% confidence
   limit passes through (x,y). */
/* Vectorize this process and compute alpha(x,y) for a grid of (x,y) values */
xx = do(20, 40, 0.25);          /* x in [20,40] */
yy = do(26, 73, 0.2);           /* y in [26,73] */
xy = ExpandGrid(yy, xx);        /* generate (x,y) pairs where x changes fastest */
x = xy[,2]; y = xy[,1];         /* vectors of x & y */
 
h = leverage(x, xpxi);          /* h(x)  */
SEM = sqrt(mse * h);            /* SEM(x) = standard error of predicted mean */
Pred = b[1] + b[2]*x;           /* best prediction of mean at x */
LHS = abs(y - Pred)/SEM;
p = cdf("t", LHS, dfe);         /* = 1 - alpha/2 */
alpha = 2*(1 - p);
ConfidenceLevel = 100*(1-alpha);
 
/* write the data to a SAS data set */
create Heat var{'x' 'y' 'alpha' 'ConfidenceLevel'}; append; close;
QUIT;

For each (x,y) value in a grid, the program computes the α value such that the 100(1-α)% confidence limit passes through (x,y). These values are written to a SAS data set and are used in the next section.

A continuous band plot

The following statements visualize the uncertainty in the predicted value by creating a heat map of the confidence bands for a range of confidence values:

data ContBand;
   set Have Heat(rename=(x=xx y=yy)); /* concatenate with orig data */
   label xx="x" yy="y";
run;
 
title 'Continuous Band Plot';
title2 'Confidence for Mean Predicted Value';
proc sgplot data=ContBand noautolegend;
   heatmapparm x=xx y=yy colorresponse=ConfidenceLevel / 
       colormodel=(VeryDarkGray Gray White) name="prob";
   reg x=x y=y/ markerattrs=(symbol=CircleFilled size=12);
   gradlegend "prob";
   yaxis min=30 max=68;
run;

The graph is shown in the first section of this article. The graph is similar to Wilke's graded confidence band plot. However, instead of overlaying a small number of confidence bands, it visualizes "infinitely many" confidence bands. Basically, it inverts the usual process (limits as a function of confidence level) to produce confidence as a function of y, for each x.

The gradient legend associates a color to a confidence level. In theory, you can associate each (x,y) point with a confidence level. In practice, it is hard to visually distinguish gray scales. If it is important to distinguish, say, the 80% confidence region, you can replace the color ramp by using the COLORMODEL= option. For example, you could use colormodel=(black red yellow cyan white) if you want vibrant colors that emphasize the 25%, 50%, and 75% levels.
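The following call sketches that variation: it is the earlier PROC SGPLOT step with only the COLORMODEL= value (and the title) changed, and it reuses the ContBand data set.

title 'Continuous Band Plot (Vibrant Color Ramp)';
proc sgplot data=ContBand noautolegend;
   heatmapparm x=xx y=yy colorresponse=ConfidenceLevel / 
       colormodel=(black red yellow cyan white) name="prob";
   reg x=x y=y / markerattrs=(symbol=CircleFilled size=12);
   gradlegend "prob";
   yaxis min=30 max=68;
run;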

A profile plot for the confidence band

It is enlightening to take a vertical slice of the continuous band plot. A vertical slice provides a "profile plot," which shows how the confidence level changes as a function of y for a fixed value of x. Here is the profile plot at x=30:

title 'Profile Plot of Confidence Level';
title2 'x = 30';
proc sgplot data=Heat noautolegend;
   where x = 30;
   series x=ConfidenceLevel y=y;
   refline 49.2 / axis=y;
   yaxis min=30 max=68 grid;
   xaxis grid;
run;

When x=30, the predicted mean is 49.2, which is marked by a horizontal reference line. That point prediction is the "0% confidence interval," since it does not account for sampling variability. For ConfidenceLevel > 0, the two branches of the curve give the lower and upper confidence limits for the predicted mean. The width of the interval is small and increases almost linearly with the confidence level until about the 70% level. The lower and upper 80% CLMs are 45.1 and 53.3, respectively. The lower and upper 95% CLMs are much wider at 42.4 and 56.0, respectively. The 99% CLMs are 39.3 and 59.1. As the confidence level approaches 100%, the width of the interval increases without bound.
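If you want to reproduce those numbers, the following statements are a minimal sketch. They assume that you submit them inside the earlier PROC IML session (before the QUIT statement) so that the LEVERAGE module and the quantities b, mse, xpxi, and dfe are still defined; the printed limits should approximately match the values quoted above.

/* evaluate the CLM at x=30 for several confidence levels */
x0    = 30;
h0    = leverage(x0, xpxi);             /* leverage at x=30 */
SEM0  = sqrt(mse * h0);                 /* standard error of the mean at x=30 */
Pred0 = b[1] + b[2]*x0;                 /* predicted mean, approximately 49.2 */
alphas = {0.20, 0.05, 0.01};            /* 80%, 95%, and 99% confidence levels */
t_a   = quantile("t", 1 - alphas/2, dfe);
CLM   = (Pred0 - t_a#SEM0) || (Pred0 + t_a#SEM0);
print alphas CLM[c={"Lower" "Upper"} format=6.1];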

Summary

In summary, this article shows formulas for the lower and upper 100(1-α)% confidence limits for the predicted mean at a point x. You can invert the formulas: for any (x,y), you can find the value of α for which the 100(1-α)% confidence limits pass through the point (x,y). You can use a heat map to visualize the resulting surface, which is a continuous version of the graded confidence band plot. The continuous band plot is one way to visualize the uncertainty of the predicted mean in a regression model.

The post A continuous band plot for visualizing uncertainty in regression predictions appeared first on The DO Loop.

October 12, 2020
 
Regression fit plot with 95% confidence limits for the mean

You've probably seen many graphs that are similar to the one at the right. This plot shows a regression line overlaid on a scatter plot of some data. Given a value for the independent variable (x), the regression line gives the best prediction for the mean of the response variable (y). The light blue band shows a 95% confidence band for the conditional mean.

This article is about how to understand the confidence band. The band conveys uncertainty in the location of the conditional mean. You can think of the confidence band as being a bunch of vertical confidence intervals, one at each possible value of x. For example, when x=30, the graph predicts 49.2 for the conditional mean of y and shows that the confidence interval for the mean of y at x=30 is [42.4, 56.0]. The confidence intervals for x=20 and x=40 are wider, which indicates that there is more uncertainty in the prediction when x is extreme than when x is near the center of the data. (The confidence interval of the conditional mean is sometimes called the confidence interval of the prediction. No matter what you call it, the intervals in this article are for the MEAN, not for individual responses.)

Statistics have uncertainty because they are based on a random sample from the population. If you were to choose a different random sample of (x,y) values from the same population, you would get a different regression line. If you choose a third random sample, you would get yet another regression line. In this article, I discuss some intuitive ways to think about the confidence band without using any formulas. I show that the confidence band is related to repeatedly choosing a random sample and fitting many regression lines. I show a couple of alternative ways to visualize uncertainty in the predicted values.

Fit the example data

You can use a simulation to demonstrate the relationship between a confidence interval and repeated random sampling. The best way is to use a (known) model to repeatedly generate random samples. However, I am going to use a slightly different approach, which is discussed in Section 13.3 of Simulating Data with SAS (Wicklin, 2013). When you deal with real data, you don't know the true underlying relationship between the response and explanatory variables, but you can always simulate from the fitted regression model. This means that you fit the data and then use the parameter estimates as if they were the true values of the parameters.

Let's see how this works. The following DATA step defines a toy example that has 10 observations. The call to PROC REG fits a linear model to the data.

data Have;
input x y;
datalines;
21 45
24 40
26 32
27 47
30 55
30 60
33 42
34 67
36 50
39 54
;
 
proc reg data=Have outest=PE alpha=0.05 plots(only)=fitplot(nocli);
   model y = x;
quit;
 
proc print data=PE noobs label;
   label Intercept="b0" x="b1" _RMSE_="s";
   var Intercept x _RMSE_;
run;
Parameter estimates for a regression

The REG procedure creates a fit plot that is similar to the one at the top of this article. The model is Y = b0 + b1*x + eps, where eps ~ N(0, s). The call to PROC PRINT shows the three parameter estimates: the intercept term (b0), the coefficient of the linear term (b1), and the "root mean square error," which is the estimate of the standard deviation of the error term (s).

Visualizing uncertainty by using graded confidence bands

The conditional mean is assumed to be normally distributed inside the confidence band. For a particular value of x, the probability of the conditional mean is high near the regression line and lower near the upper and lower limits. You can visualize the distribution by overlaying confidence bands for various confidence levels. If you make the bands semi-transparent, the union of the bands is darkest near the regression line and lightest far away from the line.

In SAS, you can use the REG statement (with the CLM option) in PROC SGPLOT to overlay several semi-transparent confidence bands that have different levels of confidence. The following graph overlays the 95%, 90%, 80%, 70% and 50% confidence bands. To save typing, I use a SAS macro to create each REG statement.

%macro CLM(alpha=, thickness=);
reg x=x y=y / nomarkers alpha=&alpha lineattrs=GraphFit(thickness=&thickness)
              clm clmattrs=(clmfillattrs=(color=gray transparency=0.8));
%mend;
 
title "Graded Confidence Band Plot";
title2 "alpha = 0.05, 0.1, 0.2, 0.3, 0.5";
proc sgplot data=Have noautolegend;
   %CLM(alpha=0.05, thickness=0);
   %CLM(alpha=0.10, thickness=0);
   %CLM(alpha=0.20, thickness=0);
   %CLM(alpha=0.30, thickness=0);
   %CLM(alpha=0.50, thickness=2);
   scatter x=x y=y/ markerattrs=(symbol=CircleFilled size=12);
run;
Graded confidence band plot, which overlays several semi-transparent confidence bands

Claus O. Wilke calls this a "graded confidence band plot" in his book _Fundamentals of Data Visualization_ (Section 16.3). It communicates that the location of the conditional mean is uncertain, but it is most likely to be found near the regression line.

Simulate from the fitted model

The confidence band visualizes the uncertainty in the prediction due to the fact that the data are a random sample from some unknown population. Although we don't know the underlying true relationship between the (x,y) pairs, we can simulate random samples from the fitted linear model by using the parameter estimates as if they were the true parameters. The following SAS DATA step simulates 500 random samples from the regression model. The x values are fixed. The y values are simulated according to Y = b0 + b1*x + eps, where eps ~ N(0, s).

/* simulate many times from model, using parameter estimates as the true model */
%let numSim = 500;
data RegSim;
call streaminit(1234);
b0 = 21.1014; b1 = 0.93662; RMSE = 9.33046;
set Have;                     /* for each X value in the original data */
do SampleID = 1 to &numSim;   /* simulate Y = b0 + b1*x + eps, eps ~ N(0,RMSE) */
   YSim = b0 + b1*x + rand("Normal", 0, RMSE);
   output;
end;   
run;
 
/* use BY-group processing to fit a regression model to each simulated data */
proc sort data=RegSim; by SampleID; run;
proc reg data=RegSim outest=PESim alpha=0.05 noprint;
   by SampleID;
   model YSim = x;
quit;

The output from PROC REG is 500 pairs of parameter estimates (b0 and b1). Each estimate represents a regression line for a random sample from the same linear model. Let's see what happens if you overlay all those regression lines on a single plot:

/* two points determine a line, so score regression on [min(x), max(x)] */
data Viz;
set PESim(rename=(Intercept=b0 x=b1));
/* min(x)=21; max(x)=39. Evaluate fit b0 + b1*x for each simulated sample */
xx = 21; yy = b0 + b1*xx;  output;
xx = 39; yy = b0 + b1*xx;  output;
keep SampleID xx yy;
run;
 
/* overlay the fits on the original data */
data Combine;
   set Have Viz;
run;
 
title "Overlay of &numSim Regression Lines";
title2 "Y = b0 + b1*x + eps, eps ~ N(0,RMSE)";
ods graphics / antialias=off GROUPMAX=10000;
proc sgplot data=Combine noautolegend;
   reg x=x y=y / nomarkers alpha=0.05 clm clmattrs=(clmfillattrs=(transparency=0.5));
   series x=xx y=yy / group=SampleId lineattrs=(color=gray pattern=solid) transparency=0.9;
   scatter x=x y=y;
   reg x=x y=y / nomarkers;
run;
Overlay of regression lines for 500 random samples from the same model

Notice that most of the regression lines are in the interior of the confidence band. In fact, if you fix a value of x (such as x=30) and evaluate all the regression lines at x, then about 95% of the conditional means will be contained within the original confidence band.

This is the intuitive meaning of the confidence band. If you imagine obtaining a different random sample from the same population and fitting a regression line, the conditional mean will be contained in the band for 95% of the random samples.
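You can check this claim empirically. The following statements are a rough sketch that assumes the PESim data set from the simulation: evaluate each of the 500 fitted lines at x=30 and count how many conditional means fall inside the original 95% confidence limits, [42.4, 56.0].

data CheckCoverage;
   set PESim(rename=(Intercept=b0 x=b1));
   mean30 = b0 + b1*30;                          /* conditional mean at x=30 */
   inBand = (mean30 >= 42.4 & mean30 <= 56.0);   /* inside the original 95% CLM? */
   keep SampleID mean30 inBand;
run;
 
proc freq data=CheckCoverage;
   tables inBand / nocum;       /* the percentage of 1s should be close to 95% */
run;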

The distribution of the conditional mean

As I said earlier, you should think of the confidence bands as a bunch of vertical confidence intervals. The distribution of the conditional mean within each vertical slice is assumed to be normally distributed with mean b0 + b1*x. The following histograms show the distribution of the predicted means for x=21 and x=39 (respectively) over all 500 regression lines. You should compare the 2.5th and 97.5th percentiles for these histograms to the vertical limits (at the same x values) of the confidence band in the graph at the top of this article.

title "Distribution of Conditional Means";
proc sgpanel data=Combine(where=(xx^=.));
   label yy="y" xx="x0";
   panelby xx / layout=rowlattice columns=1;
   histogram yy;
run;
Panel of histograms: The conditional distribution of the conditional mean for two values of the independent (x) variable
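The following statements, which assume the Viz data set from the previous section, compute those percentiles for each value of xx. The output data set name (SimPctl) is arbitrary.

proc univariate data=Viz noprint;
   class xx;                                        /* xx = 21 or xx = 39 */
   var yy;
   output out=SimPctl pctlpts=2.5 97.5 pctlpre=P;   /* creates P2_5 and P97_5 */
run;
 
proc print data=SimPctl noobs; run;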

Summary

In summary, this article visualizes a few facts about the confidence band for a regression line. The confidence band conveys uncertainty about the location of the conditional mean. The visualization can be improved by overlaying several semi-transparent bands, a graph that Wilke calls a "graded confidence band plot." You can use simulation to relate the confidence band to sampling variability. If you simulate many random samples and overlay the regression lines, they form a confidence band. The limits of the confidence band are related to the quantiles of the conditional distribution of the means.

The post Visualize uncertainty in regression predictions appeared first on The DO Loop.

September 21, 2020
 

A previous article discussed how to solve regression problems in which the parameters are constrained to be a specified constant (such as B1 = 1) or are restricted to obey a linear equation such as B4 = –2*B2. In SAS, you can use the RESTRICT statement in PROC REG to solve restricted least squares problems. However, if a constraint is an INEQUALITY instead of an EQUALITY, then (in general) a least squares method does not solve the problem. Therefore, you cannot use PROC REG to solve problems that have inequality constraints.

This article shows how to use PROC NLIN to solve linear regression problems whose regression coefficients are constrained by a linear inequality. Examples are B1 ≥ 3 or B1 + B2 ≥ 6.

PROC NLIN and constrained regression problems

Before solving a problem that has inequality constraints, let's see how PROC NLIN solves a problem that has linear equality constraints. The following statements use the data and model from the previous article. The model is
      Y = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X4 + ε
where B3 = B1 and B4 = –2*B2. In PROC NLIN, you can replace the B3 and B4 parameters, thus leaving only three parameters in the model. If desired, you can use the ESTIMATE statement to recover the value of B4, as follows:

/* You can solve the restricted problem by using PROC NLIN */
proc nlin data=RegSim noitprint;      /* use NLINMEASURES to estimate MSE */
   parameters b0=0 b1=3 b2=1;         /* specify names of parameters and initial guess */
   model Y = b0 + b1*x1 +   b2*x2 +
                  b1*x3 - 2*b2*x4;   /* replace B3 = B1 and B4 = -2*B2 */
   estimate 'b4' -2*b2;              /* estimate B4 */
run;

The parameter estimates are the same as those produced by PROC REG in the previous article.

The syntax for PROC NLIN is natural and straightforward. The PARAMETERS statement specifies the names of the parameters in the problem; the other symbols (X1-X4, Y) refer to data set variables. You must specify an initial guess for each parameter. For nonlinear regressions, this can be a challenge, but there are some tricks you can use to choose a good initial guess. For LINEAR regressions, the initial guess shouldn't matter: you can use all zeros or all ones, if you want.

Boundary constraints

Suppose that you want to force the regression coefficients to satisfy certain inequality constraints. You can use the BOUNDS statement in PROC NLIN to specify the constraints. The simplest type of constraint is to restrict a coefficient to a half-interval or interval. For example, the following call to PROC NLIN restricts B1 ≥ 3 and restricts B2 to the interval [0, 4].

/* INEQUALITY constraints on the regression parameters */
proc nlin data=RegSim;
   parameters b0=0 b1=3 b2=1;         /* initial guess */
   bounds b1 >= 3, 
          0<= b2 <= 4;
   model Y = b0 + b1*x1 + b2*x2 + b1*x3 - 2*b2*x4;
   ods select ParameterEstimates;
run;

The solution that minimizes the residual sum of squares (subject to the constraints) places the B1 parameter on the constraint B1=3. Therefore, a standard error is not available for that parameter estimate.

You can also use the BOUNDS statement to enforce simple relationships between parameters. For example, you can specify
     bounds b1 >= b2;
to specify that the B1 coefficient must be greater than or equal to the B2 parameter.
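The following call is a minimal sketch that shows the BOUNDS statement in context; it reuses the RegSim data, the model, and the initial guesses from the earlier PROC NLIN call.

proc nlin data=RegSim;
   parameters b0=0 b1=3 b2=1;         /* initial guess */
   bounds b1 >= b2;                   /* require B1 to be at least as large as B2 */
   model Y = b0 + b1*x1 + b2*x2 + b1*x3 - 2*b2*x4;
   ods select ParameterEstimates;
run;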

More general linear constraints

The BOUNDS statement is for simple relationships. For more complicated relationships, you might need to reparameterize the model. For example, the BOUNDS statement does not accept the following syntax:
     bounds b1 >= 2*b2; /* INVALID SYNTAX! */
However, you can reparameterize the problem by introducing a new parameter C = 2*B2. You can then systematically substitute (C/2) everywhere that B2 appears. For example, the following statements show the constrained model in terms of the new parameter, C. You can use the ESTIMATE statement to find the estimates for the original parameter.

/* let c = 2*b2 be the new parameter. Then substitute c/2 for b2 in the MODEL statement: */
proc nlin data=RegSim;
   parameters b0=0 b1=3 c=2;         /* initial guess */
   bounds b1 >= c;
   model Y = b0 + b1*x1 + (c/2)*x2 + b1*x3 - c*x4;
   estimate 'b2' c/2;                /* estimate original parameter */
run;

The output from this procedure is not shown.

You can use a similar trick to handle linear constraints such as B1 + B2 ≥ 6. Move the B2 term to the right side of the inequality and define C = 6 – B2. The constraint becomes B1 ≥ C and on the MODEL statement you can substitute (6–C) everywhere that B2 appears, as follows:

/* Define c = 6 - b2 so that b2=6-c */
proc nlin data=RegSim;
   parameters b0=0 b1=5 c=5;
   bounds b1 >= c;
   model Y = b0 + b1*x1 + (6-c)*x2 + b1*x3 - 2*(6-c)*x4;
   estimate 'b2' 6-c;              /* estimate original parameter */
   ods select ParameterEstimates AdditionalEstimates;
run;

Summary

In summary, you can use the NLIN procedure to solve linear regression problems that have linear constraints among the coefficients. Each equality constraint enables you to eliminate one parameter in the MODEL statement. You can use the BOUNDS statement to specify simple inequality constraints. For more complicated constraints, you can reparametrize the model and again use the BOUNDS statement to specify the constraints. If desired, you can use the ESTIMATE statement to estimate the parameters in the original model.

The post Regression with inequality constraints on parameters appeared first on The DO Loop.

September 16, 2020
 

A data analyst recently asked a question about restricted least square regression in SAS. Recall that a restricted regression puts linear constraints on the coefficients in the model. Examples include forcing a coefficient to be 1 or forcing two coefficients to equal each other. Each of these problems can be solved by using PROC REG in SAS. The analyst's question was, "Why can I use PROC REG to solve a restricted regression problem when the restriction involves EQUALITY constraints, but PROC REG can't solve a regression problem that involves an INEQUALITY constraint?"

This article shows how to use the RESTRICT statement in PROC REG to solve a restricted regression problem that has linear constraints on the parameters. I also use geometry and linear algebra to show that solving an equality-constrained regression problem is mathematically similar to solving an unrestricted least squares system of equations. This explains why PROC REG can solve equality-constrained regression problems. In a future article, I will show how to solve regression problems that involve inequality constraints on the parameters.

The geometry of restricted regression

The linear algebra for restricted least squares regression gets messy, but the geometry is easy to picture. A schematic depiction of restricted regression is shown to the right.

Geometrically, ordinary least-squares (OLS) regression is the orthogonal projection of the observed response (Y) onto the column space of the design matrix. (For continuous regressors, this is the span of the X variables, plus an "intercept column.") If you introduce equality constraints among the parameters, you are restricting the solution to a linear subspace of the span (shown in green). But the geometry doesn't change. The restricted least squares (RLS) solution is still a projection of the observed response, but this time onto the restricted subspace.

In terms of linear algebra, the challenge is to write down the projection matrix for the restricted problem. As shown in the diagram, you can obtain the RLS solution in two ways:

  • Starting from Y: You can directly project the Y vector onto the restricted subspace. The linear algebra for this method is straightforward and is often formulated in terms of Lagrange multipliers.
  • Starting from the OLS solution: If you already know the OLS solution, you can project that solution vector onto the restricted subspace. Writing down the projection matrix is complicated, so I refer you to the online textbook Practical Econometrics and Data Science by A. Buteikis.

Restricted models and the TEST statement in PROC REG

Let's start with an example, which is based on an example in the online textbook by A. Buteikis. The following SAS DATA step simulates data from a regression model
      Y = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X4 + ε
where B3 = B1 and B4 = –2*B2 and ε ~ N(0, 3) is a random error term.

/* Model: Y = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X4
   where B3 = B1 and B4 = -2*B2
*/
data RegSim;
array beta[0:4] (-5, 2, 3, 2, -6);  /* parameter values for regression coefficients */
N = 100;                            /* sample size */
call streaminit(123);
do i = 1 to N;
   x1 = rand("Normal", 5, 2);       /* X1 ~ N(5,2) */
   x2 = rand("Integer", 1, 50);     /* X2 ~ random integer 1:50 */
   x3 = (i-1)*10/(N-1);             /* X3 = evenly spaced values from 0 to 10 */
   x4 = rand("Normal", 10, 3);      /* X4 ~ N(10, 3) */
   eps = rand("Normal", 0, 3);      /* RMSE = 3 */
   Y = beta[0] + beta[1]*x1 + beta[2]*x2 + beta[3]*x3 + beta[4]*x4 + eps;
   output;
end;
run;

Although there are five parameters in the model (B0-B4), there are only three free parameters because the values of B3 and B4 are determined by the values of B1 and B2, respectively.

If you suspect that a parameter in your model is a known constant or is equal to another parameter, you can use the RESTRICT statement in PROC REG to incorporate that restriction into the model. Before you do, however, it is usually a good idea to perform a statistical test to see whether the data support the restriction. For this example, you can use the TEST statement in PROC REG to test the hypotheses that B3 = B1 and B4 = –2*B2. Recall that the syntax for the TEST statement uses the variable names (X1-X4) to represent the coefficients of the corresponding variables. The following call to PROC REG carries out this analysis:

proc reg data=RegSim plots=none;
   model Y = x1 - x4;
   test x3 = x1, x4 = -2*x2;
run;

The output shows that the parameter estimates for the simulated data are close to the parameter values used to generate the data. Specifically, the root MSE is close to 3, which is the standard deviation of the error term. The Intercept estimate is close to -5. The coefficients for the X1 and X3 terms are each approximately 2. The coefficient for X4 is approximately –2 times the coefficient for X2. The F test for the hypotheses has a large p-value, so you should not reject the hypotheses that B3 = B1 and B4 = –2*B2.

The RESTRICT statement in PROC REG

If the hypothesis on the TEST statement fails to reject, then you can use the RESTRICT statement to recompute the parameter estimates subject to the constraints. You could rerun the entire analysis or, if you are running SAS Studio in interactive mode, you can simply submit a RESTRICT statement and PROC REG will compute the new parameter estimates without rereading the data:

   restrict x3 = x1, x4 = -2*x2; /* restricted least squares (RLS) */
   print;
quit;

The new parameter estimates enforce the restrictions among the regression coefficients. In the new model, the coefficients for X1 and X3 are exactly equal, and the coefficient for X4 is exactly –2 times the coefficient for X2.

Notice that the ParameterEstimates table is augmented by two rows labeled "RESTRICT". These rows have negative degrees of freedom because they represent constraints. The "Parameter Estimate" column shows the values of the Lagrange multipliers that are used to enforce equality constraints while solving the least squares system. Usually, a data analyst does not care about the actual values in these rows, although the Wikipedia article on Lagrange multipliers discusses ways to interpret the values.

The linear algebra of restricted regression

This section shows the linear algebra behind the restricted least squares solution by using SAS/IML. Recall that the usual way to compute the unrestricted OLS solution is to solve the "normal equations" (X`*X)*b = X`*Y for the parameter estimates, b. Although textbooks often solve this equation by using the inverse of the X`X matrix, it is more efficient to use the SOLVE function in SAS/IML to solve the equation, as follows:

proc iml;
varNames = {x1 x2 x3 x4};
use RegSim;                  /* read the data */
   read all var varNames into X;
   read all var "Y";
close;
X = j(nrow(X), 1, 1) || X;   /* add intercept column to form design matrix */
XpX = X`*X;
 
/* Solve the unrestricted OLS system */
b_ols = solve(XpX, X`*Y);    /* normal equations X`*X * b = X`*Y */
ParamNames = "Intercept" || varNames;
print b_ols[r=ParamNames F=10.5];

The OLS solution is equal to the first (unrestricted) solution from PROC REG.

If you want to require that the solution satisfies B3 = B1 and B4 = –2*B2, then you can augment the X`X matrix to enforce these constraints. If L*b = c is the matrix equation for the linear constraints on the parameters, then the augmented system is
\( \begin{pmatrix} X^{\prime}X & L^{\prime} \\ L & 0 \end{pmatrix} \begin{pmatrix} b_{\small\mathrm{RLS}} \\ \lambda \end{pmatrix} = \begin{pmatrix} X^{\prime}Y \\ c \end{pmatrix} \)

The following statements solve the augmented system:

/* Direct method: Find b_RLS by solving augmented system */
/* Int X1 X2 X3 X4 */
L = {0  1  0 -1  0,     /*    X1 - X3 = 0 */
     0  0 -2  0 -1};    /* -2*X2 - X4 = 0 */
c = {0, 0};
 
zero = j(nrow(L), nrow(L), 0);
A = (XpX || L`   ) //
    (L   || zero );
z = X`*Y // c;
b_rls = solve(A, z);
rNames = ParamNames||{'Lambda1' 'Lambda2'};
print b_rls[r=rNames F=10.5];

The parameter estimates are the same as are produced by using PROC REG. You can usually ignore the last two rows, which are the values of the Lagrange multipliers that enforce the constraints.

By slogging through some complicated linear algebra, you can also obtain the restricted least squares solution from the OLS solution and the constraint matrix (L). The following equations use the formulas in the online textbook by A. Buteikis to compute the restricted solution:

/* Adjust the OLS solution to obtain the RLS solution.
   b_RLS = b_OLS - b_adjustment
   Follows
   http://web.vu.lt/mif/a.buteikis/wp-content/uploads/PE_Book/4-4-Multiple-RLS.html
*/
RA_1 = solve(XpX, L`);
RA_3 = solve(L*RA_1, L*b_ols - c);
b_adj = RA_1 * RA_3;
b_rls = b_ols - b_adj;
print b_rls[r=ParamNames F=10.5];

This answer agrees with the previous answer but does not compute the Lagrange multipliers. Instead, it computes the RLS solution as the projection of the OLS solution onto the restricted linear subspace.

Summary

This article shows how to use the RESTRICT statement in PROC REG to solve a restricted regression problem that involves linear constraints on the parameters. Solving an equality-constrained regression problem is very similar to solving an unrestricted least squares system of equations. Geometrically, they are both projections onto a linear subspace. Algebraically, you can solve the restricted problem directly or as the projection of the OLS solution.

Although changing one of the equality constraints into an INEQUALITY seems like a minor change, it means that the restricted region is no longer a linear subspace. Ordinary least squares can no longer solve the problem because the solution might not be an orthogonal projection. The next article discusses ways to solve problems where the regression coefficients are related by linear inequalities.

The post Restricted least squares regression in SAS appeared first on The DO Loop.

February 17, 2020
 

A previous article shows how to interpret the collinearity diagnostics that are produced by PROC REG in SAS. The process involves scanning down numbers in a table in order to find extreme values. This can be a tedious and error-prone process. Friendly and Kwan (2009) compare this task to a popular picture book called Where's Waldo? in which children try to find one particular individual (Waldo) in a crowded scene that contains hundreds of people. The game is fun for children, but less fun for a practicing analyst who is trying to discover whether a regression model suffers from severe collinearities in the data.

Friendly and Kwan suggest using visualization to turn a dense table of numbers into an easy-to-read graph that clearly displays the collinearities, if they exist. Friendly and Kwan (henceforth, F&K) suggest several different useful graphs. I decided to implement a simple graph (a discrete heat map) that is easy to create and enables the analyst to determine whether there are collinearities in the data. One version of the collinearity diagnostic heat map is shown below. (Click to enlarge.) For comparison, the table from my previous article is shown below it. The highlighted cells in the table were added by me; they are not part of the output from PROC REG.


Visualization principles

There are two important sets of elements in a collinearity diagnostics table. The first is the set of condition indices, which are displayed in the leftmost column of the heat map. The second is the set of cells that show the proportion of variance explained by each row. (However, only the rows that have a large condition index are important.) F&K make several excellent points about the collinearity diagnostic table:

  • Display order: In a table, the important information is in the bottom rows. It is better to reverse-sort the table so that the largest condition indices (the important ones) are at the top.
  • Condition indices: A condition number between 20 and 30 is starting to get large (F&K use 10-30). An index over 30 is generally considered large and an index that exceeds 100 is "a sign of potential disaster in estimation" (p. 58). F&K suggest using "traffic lighting" (green, yellow, and red) to color the condition indices by the severity of the collinearity. I modified their suggestion to include an orange category.
  • Proportion of variance: F&K note that "the variance proportions corresponding to small condition numbers are completely irrelevant" (p. 58) and also that tables print too many decimals. "Do we really care that [a] variance proportion is 0.00006088?" Of course not! Therefore we should only display the large proportions. F&K also suggest displaying a percentage (instead of a proportion) and rounding the percentage to the nearest integer.

A discrete heat map to visualize collinearity diagnostics

There are many ways to visualize the Collinearity Diagnostics table. F&K use traffic lighting for the condition numbers and a bubble plot for the proportion of variance entries. Another choice would be to use a panel of bar charts for the proportion of variance. However, I decided to use a simple discrete heat map. The following list describes the main steps to create the plot. You can download the complete SAS program that creates the plot and modify it (if desired) to use with your own data. For each step, I link to a previous article that describes more details about how to perform the step.

  1. Use the ODS OUTPUT statement to save the Collinearity Diagnostics table to a data set.
  2. Use PROC FORMAT to define a format. The format converts the table values into discrete values. The condition indices are in the range [1, ∞) whereas the values for the proportion of variance are in the range [0, 1). Therefore you can use a single format that maps these values into 'low', 'medium', and 'high' values. (A minimal sketch of such a format appears after this list.)
  3. The HEATMAPPARM statement in PROC SGPLOT is designed to work with data in "long format." Therefore convert the Collinearity Diagnostics data set from wide form to long form.
  4. Create a discrete attribute map that maps categories to colors.
  5. Use the HEATMAPPARM statement in PROC SGPLOT to create a discrete heat map that visualizes the collinearity diagnostics. Overlay (rounded) values for the condition indices and the important (relatively large) values of the proportion of variance.
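To illustrate Step 2, the following PROC FORMAT call is a minimal sketch. Because the variance proportions lie in [0, 1) and the condition indices lie in [1, ∞), a single format can classify both kinds of values. The format name and the cutoff values (0.5 for a "large" proportion; 20 and 30 for the condition indices) are illustrative assumptions, not necessarily the values used in the downloadable program.

proc format;
   value CollinFmt              /* hypothetical format name and cutoffs */
         0   -< 0.5  = 'Low'    /* small variance proportions   */
         0.5 -< 1    = 'High'   /* large variance proportions   */
         1   -< 20   = 'Low'    /* small condition indices      */
         20  -< 30   = 'Medium' /* borderline condition indices */
         30  -  high = 'High';  /* large condition indices      */
run;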

The discrete heat map enables you to draw the same conclusions as the original collinearity diagnostics table. However, whereas using the table is akin to playing "Where's Waldo," the heat map makes it apparent that the most severe collinearity (top row; red condition index) is between the RunPulse and MaxPulse variables. The second most severe collinearity (second row from top; orange condition index) is between the Intercept and the Age variable. None of the remaining rows have two or more large cells for the proportion of variance.

You can download the SAS program that creates the collinearity plot. It would not be hard to turn it into a SAS macro, if you intend to use it regularly.

References

Friendly, M., & Kwan, E. (2009). "Where's Waldo? Visualizing collinearity diagnostics." The American Statistician, 63(1), 56-65. https://doi.org/10.1198/tast.2009.0012

The post Visualize collinearity diagnostics appeared first on The DO Loop.

February 5, 2020
 

A SAS programmer wanted to create a graph that illustrates how Deming regression differs from ordinary least squares regression. The main idea is shown in the panel of graphs below.

  • The first graph shows the geometry of least squares regression when we regress Y onto X. ("Regress Y onto X" means "use values of X to predict Y.") The residuals for the model are displayed as vectors that show how the observations are projected onto the regression line. The projection is vertical when we regress Y onto X.
  • The second graph shows the geometry when we regress X onto Y. The projection is horizontal.
  • The third graph shows the perpendicular projection of both X and Y onto the identity line. This is the geometry of Deming regression.

This article answers the following two questions:

  1. Given any line and any point in the plane, how do you find the location on the line that is closest to the point? This location is the perpendicular projection of the point onto the line.
  2. How do you use the SGPLOT procedure in SAS to create the graphs that show the projections of points onto lines?

The data for the examples are shown below:

data Have;
input x y @@;
datalines;
0.5 0.6   0.6 1.4   1.4 3.0   1.7 1.4   2.2 1.7
2.4 2.1   2.4 2.4   3.0 3.3   3.1 2.5 
;

The projection of a point onto a line

Assume that you know the slope and intercept of a line: y = m*x + b. You can use calculus to show that the projection of the point (x0, y0) onto the line is the point (xL, yL) where
xL = (x0 + m*(y0 – b)) / (1 + m²) and yL = m*xL + b.

To derive this formula, you need to solve for the point on the line that minimizes the distance from (x0, y0) to the line. Let (x, m*x + b) be any point on the line. We want to find a value of x so that the distance from (x0, y0) to (x, m*x + b) is minimized. The solution that minimizes the distance also minimizes the squared distance, so define the squared-distance function
f(x) = (x - x0)² + (m*x + b - y0)².
To find the location of the minimum for this function, set the derivative equal to zero and solve for the value of x:

  • f`(x) = 2(x - x0) + 2 m*(m*x + b - y0)
  • Set f`(x)=0 and solve for x. The solution is the value xL = (x0 + m*(y0 – b)) / (1 + m²), which minimizes the distance from the point to the line.
  • Plug xL into the formula for the line to find the corresponding vertical coordinate on the line: yL = m * xL + b.

You can use the previous formulas to write a simple SAS DATA step that projects each observation onto a specified line. (For convenience, I put the value of the slope (m) and intercept (b) into macro variables.) The following DATA step projects a set of points onto the line y = m*x + b. You can use PROC SGPLOT to create a scatter plot of the observations. Use the VECTOR statement to draw the projections of the points onto the line.

/* projection onto general line of the form y = &m*x + &b */
%let b = 0.4;
%let m = 0.8;
data Want;
set Have;
xL = (x + &m *(y - &b)) / (1 + &m**2);
yL = &m * xL + &b;
run;
 
title "Projection onto Line y=&m x + &b";
proc sgplot data=Want aspect=1 noautolegend;
   scatter x=x y=y;
   vector x=xL y=yL / xorigin=x yorigin=y; /* use the NOARROWHEADS option to suppress the arrow heads */
   lineparm x=0 y=&b slope=&m / lineattrs=(color=black);
   xaxis grid; yaxis grid;
run;

You can get the graph for Deming regression by setting b=0 and m=1 in the previous formulas and program.
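That is, you only need to change the two macro variables and resubmit the rest of the program:

/* identity line: y = 1*x + 0 gives the Deming-regression projections */
%let b = 0;
%let m = 1;
/* ... resubmit the Want DATA step and the PROC SGPLOT call from above ... */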

In summary, you can use that math you learned in high school to find the perpendicular projection of a point onto a line. You can then use the VECTOR statement in PROC SGPLOT in SAS to create a graph that illustrates the projection. Such a graph is useful for comparing different kinds of regressions, such as comparing least-squares and Deming regression.

The post Visualize residual projections for linear regression appeared first on The DO Loop.

January 29, 2020
 

In a previous article, I showed how to perform collinearity diagnostics in SAS by using the COLLIN option in the MODEL statement in PROC REG. For models that contain an intercept term, I noted that there has been considerable debate about whether the data vectors should be mean-centered prior to performing the collinearity diagnostics. In other words, if X is the design matrix used for the regression, should you use X to analyze collinearity or should you use the centered data X – mean(X)? The REG procedure provides options for either choice. The COLLIN option uses the X matrix to assess collinearity; the COLLINOINT option uses the centered data.

As Belsley (1984, p. 76) states, "centering will typically seem to improve the conditioning." However, he argues that running collinearity diagnostics on centered data "gives us information about the wrong problem." He goes on to say, "mean-centering typically removes from the data the interpretability that makes conditioning diagnostics meaningful."

This article looks at how centering the data affects the collinearity diagnostics. Throughout this article, when I say "collinearity diagnostics," I am referring to the variance-decomposition algorithm that is implemented by the COLLIN option in PROC REG, which was described in the previous article. Nothing in this article applies to the VIF or TOL options in PROC REG, which provide alternative diagnostics.

The article has two main sections:

  • The mathematics behind the COLLIN (and COLLINOINT) options in PROC REG.
  • An example of an ill-conditioned linear system that becomes perfectly conditioned if you center the data.

The arguments in this article are taken from the references at the end of this article. This article assumes that you have already read my previous article about collinearity diagnostics.

The mathematics of the COLLIN option

The COLLIN option implements the regression-coefficient variance decomposition due to Belsley and presented in Belsley, Kuh, and Welsch (1980), henceforth, BKW. The collinearity diagnostics algorithm (also known as an analysis of structure) performs the following steps:

  1. Let X be the data matrix. If the model includes an intercept, X has a column of ones. BKW recommend that you NOT center X, but if you choose to center X, do it at this step. As a reminder, the COLLIN option in PROC REG does not center the data whereas the COLLINOINT option centers the data.
  2. Scale X so that each column has unit length (unit variance).
  3. Compute the singular value decomposition of X = UDV`.
  4. From the diagonal matrix, D, compute the eigenvalues and condition indices of X`X.
  5. Compute P, the matrix of variance-decomposition proportions as described in BKW, p. 105-107.
  6. From this information, you can determine whether the regression model suffers from harmful collinearity.

To make sure that I completely understand an algorithm, I like to implement it in the SAS/IML matrix language. The following SAS/IML statements implement the analysis-of-structure method. You can run the program on the same Fitness data that were used in the previous article. The results are the same as from PROC REG.

proc iml;
start CollinStruct(lambda, cond, P,         /* output variables */
                   XVar, opt);              /* input variables  */
 
   /* 1. optionally center the data */
   if upcase(opt) = "NOINT" then 
      X = XVar - mean(XVar);                /* COLLINOINT: do not add intercept, center */
   else
      X = j(nrow(XVar), 1, 1) || XVar;      /* COLLIN: add intercept, do not center */
 
   /* 2. Scale X to have unit column length (unit sum of squares) */
   Z = X / sqrt(X[##, ]);
   /* 3. Obtain the SVD of X and calculate condition indices and the P matrix */
   call svd(U, D, V, Z);
   /* 4. compute the eigenvalues and condition indices of X`X */
   lambda = D##2;                           /* eigenvalues are square of singular values */
   cond = sqrt(lambda[1] / lambda);         /* condition indices */
 
   /* 5. Compute P = matrix of variance-decomposition proportions */
   phi = V##2 / lambda`;          /* divide squared columns by eigenvalues (proportions of each PC) */
   phi_k = phi[,+];               /* for each component, sum across columns */
   P = T( phi / phi_k );          /* create proportions and transpose the result */
finish;
 
/* Perform Regression-Coefficient Variance Decomposition of BKW */
varNames = {RunTime Age Weight RunPulse MaxPulse RestPulse};
use fitness;
   read all var varNames into XVar;
close;
 
/* perform COLLIN analysis */
call CollinStruct(lambda, cond, P,  XVar, "INT");
print "----- Do Not Center the Data (COLLIN) -----", lambda cond;
 
/* perform COLLINOINT analysis */
call CollinStruct(lambda0, cond0, P0,  XVar, "NOINT");
print "----- Center the Data (COLLINOINT) -----", lambda0 cond0;

The first table shows the eigenvalues and (more importantly) the condition indices for the original (uncentered) data. You can see that there are three condition indices that exceed 30, which indicates that there might be as many as three sets of nearly collinear relationships among the variables. My previous analysis showed two important sets of relationships:

  • Age is moderately collinear with the intercept term.
  • RunPulse is strongly collinear with MaxPulse.

In the second table, which analyzes the structure of the centered data, none of the condition indices are large. An interpretation of the second table is that the variables are not collinear. This contradicts the first analysis.

Why does centering the data change the condition indices so much? This phenomenon was studied by Belsley who showed that "centering will typically seem to improve the conditioning," sometimes by a large factor (Belsley, 1984, p. 76). Belsley says that the second table "gives us information about the wrong problem; namely, it tells us about the sensitivity of the LS solution... to numerically small relative changes in the centered data. And since the magnitude of the centered changes is usually uninterpretable," so also are the condition indices for the centered data.

Ill-conditioned data that becomes perfectly conditioned by centering

Belsley (1984) presents a small data set (N=20) for which the original variables are highly collinear (maximum condition index is 1,242) whereas the centered data are perfectly conditioned (all condition indices are 1). Belsley could have used a much smaller example, as shown in Chennamaneni et al. (2008). I used their ideas to construct the following example.

Suppose that the (uncentered) data matrix is
X = A + ε B
where A is any N x k matrix that has constant columns, B is a centered orthogonal matrix, and ε > 0 is a small number, such as 0.001. Clearly, X is a small perturbation of a rank-deficient and ill-conditioned matrix (A). The condition indices for X can be made arbitrarily large by making ε arbitrarily small. I think everyone would agree that the columns of X are highly collinear. As shown below, the analysis-of-structure algorithm on X reveals the collinearities.

But what happens if you center the data? Because A has constant columns, the mean-centered version of A is the zero matrix. Centering B does not change it because the columns are already centered. Therefore, the centered version of X is ε B, which is a perfectly conditioned orthogonal matrix! This construction is valid for arbitrarily large data, but the following statements implement this construction for a small 3 x 2 matrix.

A = { 1  2,         /* linearly dependent ==> infinite condition index */
      1  2,
      1  2};
B = {-1  1,         /* orthogonal columns ==> perfect condition number (1) */
      0 -2,
      1  1};
eps = 0.001;        /* the smaller eps is, the more ill-conditioned X is */
X = A + eps * B;    /* small perturbation of a rank deficient matrix */
 
/* The columns of X are highly collinear. The columns X - mean(X) are perfectly conditioned. */
Xc = X - mean(X);
print X, Xc;

This example reveals how "mean-centering can remove from the data the information needed to assess conditioning correctly" (Belsley, 1984, p. 74). As expected, if you run the analysis-of-structure diagnostics on this small example, the collinearity is detected in the original data. However, if you center the data prior to running the diagnostics, the results do not indicate that the data are collinear:

/* The columns of the X matrix are highly collinear, but only 
   the analysis of the uncentered data reveals the collinearity */
call CollinStruct(lambda, cond, P, X, "INT");
print lambda cond P[c={"Intercept" "X1" "X2"}];  /* as ill-conditioned as you want */
 
call CollinStruct(lambda0, cond0, P0, X, "NOINT");
print lambda0 cond0 P0[c={"X1" "X2"}];           /* perfectly conditioned */

In the first table (which is equivalent to the COLLIN option in PROC REG), the strong collinearities are apparent. In the second table (which is equivalent to the COLLINOINT option), the collinearities are not apparent. As Belsley (1984, p. 75) says, an example like this "demonstrates that it matters very much in what form the data are taken in order to assess meaningfully the conditioning of a LS problem, and that centered data are not usually the correct form."
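
If you want to verify numerically that the conditioning of the uncentered data depends on ε, you can reuse the CollinStruct module and the A and B matrices from the previous statements. The following loop is my addition (it is not part of the original analysis); it prints the largest condition index for a sequence of decreasing values of ε:

epsVals = {0.1, 0.01, 0.001, 0.0001};
maxCond = j(nrow(epsVals), 1, .);
do i = 1 to nrow(epsVals);
   Xeps = A + epsVals[i]*B;                      /* same construction as above */
   call CollinStruct(lambda_i, cond_i, P_i, Xeps, "INT");
   maxCond[i] = max(cond_i);                     /* largest condition index    */
end;
print epsVals[L="eps"] maxCond[L="Largest condition index"];

The largest condition index grows without bound as ε shrinks, which is consistent with the claim that the collinearity in X can be made as severe as you like.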

Geometrically, the situation is similar to the following diagram, which is part of Figure 2 on p. 7 of Chennamaneni, et al. (2008). The figure shows two highly collinear vectors (U and V). However, the mean-centered vectors are not close to being collinear and can even be orthogonal (perfectly conditioned).

Figure: U and V are highly collinear, but the mean-centered vectors are orthogonal.

Summary

In summary, this article has presented Belsley's arguments about why collinearity diagnostics should be performed on the original data. As Belsley (1984, p. 75) says, there are situations in which centering the data is useful, but "assessing conditioning is not one of them." The example in the second section demonstrates why Belsley's arguments are compelling: the data are clearly strongly collinear, yet if you apply the algorithm to the mean-centered data, you do not get any indication that the problem exists. The analysis of the fitness data shows that the same situation can occur in real data.

These examples convince me that the analysis-of-structure algorithm reveals collinearity only when you apply it to the original data. If you agree, then you should use the COLLIN option in PROC REG to perform collinearity diagnostics.

However, not everyone agrees with Belsley. If you are not convinced, you can use the COLLINOINT option in PROC REG to perform collinearity diagnostics on the centered data. However, be aware that the parameter estimates for the centered data still have inflated variances and are still sensitive to small changes in the data (Belsley, 1984, p. 74), even if this diagnostic procedure does not reveal that fact.

Further reading

The post Collinearity diagnostics: Should the data be centered? appeared first on The DO Loop.

January 23, 2020
 

I was recently asked about how to interpret the output from the COLLIN (or COLLINOINT) option on the MODEL statement in PROC REG in SAS. The example in the documentation for PROC REG is correct but is somewhat terse regarding how to use the output to diagnose collinearity and how to determine which variables are collinear. This article uses the same data but goes into more detail about how to interpret the results of the COLLIN and COLLINOINT options.

An overview of collinearity in regression

Collinearity (sometimes called multicollinearity) involves only the explanatory variables. It occurs when a variable is nearly a linear combination of other variables in the model. Equivalently, there is a set of explanatory variables that is linearly dependent in the sense of linear algebra. (Another equivalent statement is that the design matrix and the X`X matrix are not full rank.)

For example, suppose a model contains five regressor variables and the variables are related by X3 = 3*X1 - X2 and X5 = 2*X4. In this case, there are two sets of linear relationships among the regressors, one relationship that involves the variables X1, X2, and X3, and another that involves the variables X4 and X5. In practice, collinearity means that a set of variables are almost linear combinations of each other. For example, the vectors u = X3 - 3*X1 + X2 and v = X5 - 2*X4 are close to the zero vector.
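
If you want to see how this looks in practice, the following DATA step is a quick sketch (mine, not from the original example) that simulates 100 observations with those two near-dependencies; the variable names, noise level, and seed are arbitrary. If you run PROC REG with the COLLIN option on the simulated data, the diagnostics should flag two sets of nearly collinear variables:

data SimCollin;
   call streaminit(123);
   do i = 1 to 100;
      x1 = rand("Normal");   x2 = rand("Normal");   x4 = rand("Normal");
      x3 = 3*x1 - x2 + 0.01*rand("Normal");   /* X3 is nearly 3*X1 - X2 */
      x5 = 2*x4      + 0.01*rand("Normal");   /* X5 is nearly 2*X4      */
      y  = x1 + x2 + x4 + rand("Normal");     /* an arbitrary response  */
      output;
   end;
   drop i;
run;
 
proc reg data=SimCollin plots=none;
   model y = x1-x5 / collin;
   ods select CollinDiag;
quit;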

Unfortunately, the words "almost" and "close to" are difficult to quantify. The COLLIN option on the MODEL statement in PROC REG provides a way to analyze the design matrix for potentially harmful collinearities.

Why should you avoid collinearity in regression?

The assumptions of ordinary least squares (OLS) regression are not violated if there is collinearity among the independent variables. OLS regression still provides the best linear unbiased estimates of the regression coefficients.

The problem is not the estimates themselves but rather the variance of the estimates. One problem caused by collinearity is that the standard errors of those estimates will be big. This means that the predicted values, although the "best possible," will have wide prediction limits. In other words, you get predictions, but you can't really trust them.

A second problem concerns interpretability. The sign and magnitude of a parameter estimate indicate how the dependent variable changes due to a unit change of the independent variable when the other variables are held constant. However, if X1 is nearly collinear with X2 and X3, it does not make sense to discuss "holding constant" the other variables (X2 and X3) while changing X1. The variables necessarily must change together. Collinearities can even cause some parameter estimates to have "wrong signs" that conflict with your intuitive notion about how the dependent variable should depend on an independent variable.

A third problem with collinearities is numerical rather than statistical. Strong collinearities cause the cross-product matrix (X`X) to become ill-conditioned. Computing the least squares estimates requires solving a linear system that involves the cross-product matrix. Solving an ill-conditioned system can result in relatively large numerical errors. However, in practice, the statistical issues usually outweigh the numerical one. A matrix must be extremely ill-conditioned before the numerical errors become important, whereas the statistical issues are problematic for moderate to large collinearities.
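
The following SAS/IML sketch is my illustration (it is not part of the original article) of the numerical issue on a toy example in which one regressor is nearly a copy of another; the data values and the 0.01 noise level are arbitrary. Solving the normal equations for the clean response recovers coefficients near (3, 2, 0), but solving them for a response that differs only by a small amount of noise should produce very different coefficients for x1 and x2, even though the fitted values barely change:

proc iml;
call randseed(1);
x1 = T(1:10);
e = j(10, 1);   call randgen(e, "Normal");
x2 = x1 + 1e-4*e;                     /* x2 is nearly identical to x1            */
X = j(10, 1, 1) || x1 || x2;          /* design matrix: intercept, x1, x2        */
y = 3 + 2*x1;                         /* exact response: coefficients {3, 2, 0}  */
call randgen(e, "Normal");
b1 = solve(X`*X, X`*y);               /* normal equations for the clean y        */
b2 = solve(X`*X, X`*(y + 0.01*e));    /* normal equations for a slightly noisy y */
print (b1 || b2)[c={"b" "b_noisy"} r={"Intercept" "x1" "x2"}];
quit;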

How to interpret the output of the COLLIN option?

The following example is from the "Collinearity Diagnostics" section of the PROC REG documentation. Various health and fitness measurements were recorded for 31 men, such as time to run 1.5 miles, the resting pulse, the average pulse rate while running, and the maximum pulse rate while running. These measurements are used to predict the oxygen intake rate, which is a measurement of fitness but is difficult to measure directly.

data fitness;
   input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
   datalines;
44 89.47 44.609 11.37 62 178 182   40 75.07 45.313 10.07 62 185 185
44 85.84 54.297  8.65 45 156 168   42 68.15 59.571  8.17 40 166 172
38 89.02 49.874  9.22 55 178 180   47 77.45 44.811 11.63 58 176 176
40 75.98 45.681 11.95 70 176 180   43 81.19 49.091 10.85 64 162 170
44 81.42 39.442 13.08 63 174 176   38 81.87 60.055  8.63 48 170 186
44 73.03 50.541 10.13 45 168 168   45 87.66 37.388 14.03 56 186 192
45 66.45 44.754 11.12 51 176 176   47 79.15 47.273 10.60 47 162 164
54 83.12 51.855 10.33 50 166 170   49 81.42 49.156  8.95 44 180 185
51 69.63 40.836 10.95 57 168 172   51 77.91 46.672 10.00 48 162 168
48 91.63 46.774 10.25 48 162 164   49 73.37 50.388 10.08 67 168 168
57 73.37 39.407 12.63 58 174 176   54 79.38 46.080 11.17 62 156 165
52 76.32 45.441  9.63 48 164 166   50 70.87 54.625  8.92 48 146 155
51 67.25 45.118 11.08 48 172 172   54 91.63 39.203 12.88 44 168 172
51 73.71 45.790 10.47 59 186 188   57 59.08 50.545  9.93 49 148 155
49 76.32 48.673  9.40 56 186 188   48 61.24 47.920 11.50 52 170 176
52 82.78 47.467 10.50 53 170 172
;
 
proc reg data=fitness plots=none;
   model Oxygen = RunTime Age Weight RunPulse MaxPulse RestPulse / collin;
   ods select ParameterEstimates CollinDiag;
   ods output CollinDiag = CollinReg;
quit;

The output from the COLLIN option is shown. I have added some colored rectangles to the output to emphasize how to interpret the table. To determine collinearity from the output, do the following:

  • Look at the "Condition Index" column. Large values in this column indicate potential collinearities. Many authors use 30 as a number that warrants further investigation. Other researchers suggest 100. Most researchers agree that no single number can handle all situations.
  • For each row that has a large condition index, look across the columns in the "Proportion of Variation" section of the table. Identify cells that have a value of 0.5 or greater. The columns of these cells indicate which variables contribute to the collinearity. Notice that at least two variables are involved in each collinearity, so look for at least two cells with large values in each row. However, there could be three or more cells that have large values. "Large" is relative to the value 1, which is the sum of each column.

Let's apply these rules to the output for the example:

  • If you use 30 as a cutoff value, there are three rows (marked in red) whose condition numbers exceed the cutoff value. They are rows 5, 6, and 7.
  • For the 5th row (condition index=33.8), there are no cells that exceed 0.5. The two largest cells (in the Weight and RestPulse columns) indicate a small near-collinearity between the Weight and RestPulse measurements. The relationship is not strong enough to worry about.
  • For the 6th row (condition index=82.6), there are two cells that are 0.5 or greater (rounded to four decimals). The cells are in the Intercept and Age columns. This indicates that the Age and Intercept terms are nearly collinear. Collinearities with the intercept term can be hard to interpret. See the comments at the end of this article.
  • For the 7th row (condition index=196.8), there are two cells that are greater than 0.5. The cells are in the RunPulse and MaxPulse columns, which indicates a very strong linear relationship between these two variables.
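
If you want to automate these two rules, you can post-process the CollinReg data set that was created by the ODS OUTPUT statement in the earlier PROC REG call. The following DATA step is a sketch (mine, not from the original article); it assumes that the columns of CollinReg are named Number and CondIndex and that each proportion column is named after its regressor, so run PROC CONTENTS on CollinReg to confirm the variable names before you rely on it:

data CollinFlags;
   set CollinReg;
   where CondIndex > 30;                              /* rule 1: large condition index */
   length Culprits $ 80;
   array p[*] Intercept RunTime Age Weight RunPulse MaxPulse RestPulse;
   do i = 1 to dim(p);
      if p[i] >= 0.5 then
         Culprits = catx(' ', Culprits, vname(p[i])); /* rule 2: proportion >= 0.5 */
   end;
   drop i;
run;
 
proc print data=CollinFlags noobs;
   var Number CondIndex Culprits;
run;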

Your model has collinearities. Now what?

After you identify the cause of the collinearities, what should you do? That is a difficult and controversial question that has many possible answers.

  • Perhaps the simplest solution is to use domain knowledge to omit the "redundant" variables. For example, you might want to drop MaxPulse from the model and refit. However, in this era of Big Data and machine learning, some analysts want an automated solution.
  • You can use dimensionality reduction and an (incomplete) principal component regression.
  • You can use a biased estimation technique such as ridge regression, which allows bias but reduces the variance of the estimates. (A short sketch follows this list.)
  • Some practitioners use variable selection techniques to let the data decide which variables to omit from the model. However, be aware that different variable-selection methods might choose different variables from among the set of nearly collinear variables.
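
As a concrete illustration of the ridge-regression option in the preceding list, the following statements are a sketch (mine, not from the original article) that uses the RIDGE= option in PROC REG on the Fitness data. The grid of ridge parameters is arbitrary, and the PROC PRINT step assumes the usual OUTEST= conventions (a _RIDGE_ variable and rows with _TYPE_="RIDGE"); consult the PROC REG documentation if you also want the ridge trace plots.

proc reg data=fitness outest=RidgeEst ridge=0 to 0.1 by 0.01 plots=none;
   model Oxygen = RunTime Age Weight RunPulse MaxPulse RestPulse;
run; quit;
 
proc print data=RidgeEst(where=(_TYPE_="RIDGE")) noobs;
   var _RIDGE_ Intercept RunTime Age Weight RunPulse MaxPulse RestPulse;
run;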

Keep the intercept or not?

Equally controversial is the question of whether to include the intercept term in a collinearity diagnostic. The COLLIN option in PROC REG includes the intercept term among the variables to be analyzed for collinearity. The COLLINOINT option excludes the intercept term. Which should you use? Here are two opinions that I found:

  1. Use the intercept term: Belsley, Kuh, and Welsch (Regression Diagnostics, 1980, p. 120) state that omitting the intercept term is "inappropriate in the event that [the design matrix] contains a constant column." They go on to say (p. 157) that "centering the data [when the model has an intercept term] can mask the role of the constant in any near dependencies and produce misleading diagnostic results." These quotes strongly favor using the COLLIN option when the model contains an intercept term.
  2. Do not use the intercept term if it is outside of the data range: Freund and Littell (SAS System for Regression, 3rd Ed, 2000) argue that including the intercept term in the collinearity analysis is not always appropriate. "The intercept is an estimate of the response at the origin, that is, where all independent variables are zero. .... [F]or most applications the intercept represents an extrapolation far beyond the reach of the data. For this reason, the inclusion of the intercept in the study of multicollinearity can be useful only if the intercept has some physical interpretation and is within reach of the actual data space." For the example data, it is impossible for a person to have zero age, weight, or pulse rate, therefore I suspect Freund and Littell would recommend using the COLLINOINT option instead of the COLLIN option for these data.

So what should you do if the experts disagree? I usually defer to the math. Although I am reluctant to contradict Freund and Littell (both widely published experts and Fellows of the American Statistical Association), the mathematics of the collinearity analysis (which I will discuss in a separate article) seems to favor the opinion of Belsley, Kuh, and Welsch. I use the COLLIN option when my model includes an intercept term.

Do you have an opinion on this matter? Leave a comment.

The post Collinearity in regression: The COLLIN option in PROC REG appeared first on The DO Loop.

June 24, 2019
 

When fitting a least squares regression model to data, it is often useful to create diagnostic plots of the residuals versus the explanatory variables. If the model fits the data well, the plots of the residuals should not display any patterns. Systematic patterns can indicate that you need to include additional explanatory effects to model the data. Sometimes it is difficult to spot patterns in a seemingly random cloud of points, so some analysts like to add a scatter plot smoother to the residual plots. You can use the SMOOTH suboption to the PLOTS=RESIDUALS option in many SAS regression procedures to generate a panel of residual plots that contain loess smoothers. For SAS procedures that do not support the PLOTS=RESIDUALS option, you can use PROC SGPLOT to manually create a residual plot with a smoother.

Residual plots with loess smoothers

Many SAS linear regression procedures such as PROC REG and PROC GLM support the PLOTS=RESIDUALS(SMOOTH) option on the PROC statement. For example, the following call to PROC GLM automatically creates a panel of scatter plots where the residuals are plotted against each regressor. The model is a two-variable regression of the MPG_City variable in the Sashelp.Cars data.

/* residual plots with loess smoother */
ods graphics on;
proc glm data=Sashelp.Cars plots(only) = Residuals(smooth);
   where Type in ('SUV', 'Truck');
   model MPG_City = EngineSize Weight;
run; quit;

The loess smoothers can sometimes reveal patterns in the residuals that would not otherwise be perceived. In this case, it looks like there is a quadratic pattern to the residuals-versus-EngineSize graph (and perhaps for the Weight variable as well). This indicates that you might need to include a quadratic effect in the model. Because the EngineSize and Weight variables are highly correlated (ρ = 0.81), the following statements add only a quadratic effect for EngineSize:

proc glm data=Sashelp.Cars plots(only) = Residuals(smooth);
   where Type in ('SUV', 'Truck');
   model MPG_City = EngineSize Weight
                    EngineSize*EngineSize ;
quit;

After adding the quadratic effect, the residual plots do not reveal any obvious systematic trends. Also, the residual plot for Weight no longer shows any quadratic pattern.

How to use PROC SGPLOT to create a residual plot with a smoother

If you use a SAS procedure that does not support the PLOTS=RESIDUALS(SMOOTH) option, you can output the residual values to a SAS data set and use PROC SGPLOT to create the residual plots. Even when a procedure DOES support the PLOTS=RESIDUALS(SMOOTH) option, you might want to customize the plot by adding legends, by changing attributes of the markers or curve, or by specifying a value for the smoothing parameter.

An example is shown below. If you use the same model for MPG_City, but use all observations in the data set, the residual plot for EngineSize looks very strange. For these data, the smoothing parameter for the loess curve is very small and therefore the loess curve overfits the residuals:

proc glm data=Sashelp.Cars plots(only) = Residuals(smooth);
   model MPG_City = EngineSize Weight;
   output out=RegOut predicted=Pred residual=Residual;
run; quit;

Yuck! The loess curve for the plot on the left clearly overfits the residuals-versus-EngineSize data! Unfortunately, you cannot change the smoothing parameter from the PROC GLM syntax. However, you can change the default smoothing parameter in PROC SGPLOT and you can make other modifications to the plot as well. Notice in the previous call to PROC GLM that the OUTPUT statement creates a data set named RegOut that contains the residual values and the original variables. Therefore, you can create a residual plot and add a loess smoother by using PROC SGPLOT, as follows:

ods graphics / attrpriority=NONE;
title "Residuals for Model";
proc sgplot data=RegOut ;
   scatter x=EngineSize y=Residual / group=Origin;
   loess x=EngineSize y=Residual / nomarkers smooth=0.5;
   refline 0 / axis=y;
   xaxis grid; yaxis grid;
run;

The smoothing parameter was manually set to 0.5, but you can use PROC LOESS if you want to choose a smoothing parameter that optimizes some information criterion such as the AICC statistic. Notice that you can use additional SGPLOT statements to add a reference grid and to change marker attributes. If you prefer, you could add a different kind of smoother such as a penalized B-spline by using the PBSPLINE statement.
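
For example, the following call is a minimal sketch of the penalized B-spline alternative mentioned above; it reuses the RegOut data set from the OUTPUT statement and simply replaces the LOESS statement with a PBSPLINE statement:

title "Residuals for Model (penalized B-spline smoother)";
proc sgplot data=RegOut;
   scatter x=EngineSize y=Residual / group=Origin;
   pbspline x=EngineSize y=Residual / nomarkers;   /* penalized B-spline smoother */
   refline 0 / axis=y;
   xaxis grid; yaxis grid;
run;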

You might wonder why the smoothing parameter for the residual plot of EngineSize is so small. The parameter is chosen to optimize a criterion such as the AICC statistic, so why does it overfit the data? An example in the PROC LOESS documentation provides an explanation. The chosen value for the smoothing parameter is one that corresponds to a local minimum of an objective function that involves the AICC statistic. Unfortunately, the objective function for a set of data can have multiple local minima, and this is the case for the residuals of the EngineSize variable. When the smoothing parameter is 0.534, the AICC criterion reaches a local minimum. However, there are smaller values of the smoothing parameter for which the AICC criterion is even smaller. The minimum value of the AICC occurs when the smoothing parameter is 0.015, which leads to the "jagged" loess curve that is seen in the panel of residual plots shown earlier in this section. If you want to see this phenomenon yourself, run the following PROC LOESS code and look at the criterion plot.

ods select CriterionPlot SmoothingCriterion FitPlot;
proc loess data=RegOut;
   model Residual = EngineSize / select=AICC(global) ;
run;

Because a data set can be smoothed at multiple scales, the "optimal" smoothing parameter that is chosen automatically by the PLOTS=RESIDUALS(SMOOTH) option might not enable you to see the general trend of the residuals. If you experience this phenomenon, output the residuals and use PROC SGPLOT or PROC LOESS to compute a more useful smoother.

In summary, SAS provides the PLOTS=RESIDUALS(SMOOTH) option to automatically create residual-versus-regressor plots. Although this panel usually provides a useful indication of patterns in the residuals, you can also output the residuals to a data set and use PROC SGPLOT or PROC LOESS to create a customized residual plot.

The post Add loess smoothers to residual plots appeared first on The DO Loop.