I've previously written about how to generate all pairwise interactions for a regression model in SAS. For a model that contains continuous effects, the easiest way is to use the EFFECT statement in PROC GLMSELECT to generate second-degree "polynomial effects." However, a SAS programmer was running a simulation study and wanted to be able to generate all pairwise interaction effects within SAS/IML. One way to solve the programmer's problem is to use the HDIR function in SAS/IML. This article introduces the HDIR function and shows how to use it to generate pairwise (quadratic) interactions for continuous effects.

Generate pairwise interactions by using the EFFECT statement

Before trying to solve a problem with a new tool, I like to use an existing method to solve the problem. Then I can compare the old and new answers and make sure they are the same.

For sample data, let's use the numerical variables in the Sashelp.Class data. So that the techniques are clearer, I will rename the variables to X1, X2, and X3. We want to generate all six pairwise interactions, including the "pure quadratic" terms where a variable interacts with itself. The names of the interaction effects will be X1_2, X1_X2, X1_X3, X2_2, X2_X3, and X3_2. The names with "_2" at the end are pure quadratic effects; the others are interactions.

One way to generate a design matrix that has all pairwise interactions is to use the EFFECT statement in PROC GLMSELECT. The interactions are written to the OUTDESIGN= data set. The columns of the design matrix have interpretable names, which are stored in the _GLSMOD macro variable.

data Have;                 /* create the example data */
set sashelp.class(rename=(Age=X1 Height=X2 Weight=X3));
if _N_=3 then X1=.;        /* add a missing value */
keep _NUMERIC_;
run;

/* Use PROC GLMSELECT to generate all pairwise interactions.
   Step 1: Add a fake response variable */
%let DSIn  = Have;         /* the name of the input data set */
%let DSOut = Want;         /* the name of the output data set */
data AddFakeY / view=AddFakeY;
set &DSIn;
_Y = 0;                    /* add a fake response variable */
run;

/* Step 2: Use the EFFECT statement to generate the degree-2 effects, as shown in
   https://blogs.sas.com/content/iml/2017/09/07/polynomial-effects-regression-sas.html */
proc glmselect data=AddFakeY NOPRINT outdesign(addinputvars)=&DSOut(drop=_Y);
   effect poly2 = polynomial( X1 X2 X3 / degree=2 );
   model _Y = poly2 / noint selection=none;  /* equiv to X1*X1 X1*X2 X1*X3 X2*X2 X2*X3 X3*X3 */
run;
%put &=_GLSMOD;   /* look at names of the interaction effects in the OUTDESIGN= data */

proc print data=&DSOut(obs=5);
   var X1_2 X1_X2 X1_X3 X2_2 X2_X3 X3_2;   /* names are generated automatically */
run;

The output from PROC GLMSELECT contains the "pure quadratic" interactions (X1_2, X2_2, and X3_2) and the cross-variable interactions (X1_X2, X1_X3, and X2_X3). If you have k variables, there will be k pure quadratic terms and "k choose 2" cross-variable terms. Hence, the total number of quadratic interactions is "(k+1) choose 2," which is (k+1)*k/2. Here, k=3 and there are six quadratic interactions.
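The counting formula is easy to verify directly. Here is a quick check (in Python, since the arithmetic is language-agnostic):

```python
from math import comb

k = 3
n_quadratic = comb(k + 1, 2)           # "(k+1) choose 2" total quadratic terms
assert n_quadratic == k + comb(k, 2)   # k pure quadratic + "k choose 2" crossed terms
print(n_quadratic)                     # prints 6 for k=3
```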

Notice how PROC GLMSELECT handles the missing value in the third observation: because the X1 value is missing, the procedure puts a missing value into all interaction effects.

The horizontal direct product between matrices

The horizontal direct product between matrices A and B is formed by the elementwise multiplication of their columns. The operation is most often used to form interactions between dummy variables for two categorical variables. If A is the design matrix for the categorical variable C1 and B is the design matrix for the categorical variable C2, then HDIR(A,B) is the design matrix for the interaction effect C1*C2.

The following simple example shows how the HDIR function works. The HDIR function returns a matrix whose columns are formed by elementwise multiplication of the columns of A and the matrix B. The first set of columns is the product A[,1]#B, the second set is A[,2]#B, and so forth. If A has k1 columns and B has k2 columns, then HDIR(A,B) has k1*k2 columns.

proc iml;
/* horizontal direct product multiplies each column of A by each column of B */
A = {1 2,
     2 3,
     4 5};
B = {0  1 2,
     1  1 3,
     2 -1 4};
C = hdir(A, B);
print C[c={'A1_B1' 'A1_B2' 'A1_B3' 'A2_B1' 'A2_B2' 'A2_B3'} L='hdir(A,B)'];
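If you want to check the HDIR computation outside of SAS, the operation is easy to sketch in Python with numpy. The `hdir` helper below is my own name for a hypothetical function, not part of any library:

```python
import numpy as np

def hdir(A, B):
    """Horizontal direct product: columns A[:,i]#B[:,j], with i varying slowest."""
    return np.hstack([A[:, [i]] * B for i in range(A.shape[1])])

A = np.array([[1, 2], [2, 3], [4, 5]], dtype=float)
B = np.array([[0, 1, 2], [1, 1, 3], [2, -1, 4]], dtype=float)
C = hdir(A, B)   # 3 x 6 matrix: A1#B1, A1#B2, A1#B3, A2#B1, A2#B2, A2#B3
```

The result matches the SAS/IML output: the first three columns are A[,1]#B and the last three are A[,2]#B.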

Interactions of continuous variables

The previous section shows that if X is a data matrix of continuous variables, the function call HDIR(X,X) generates all pairwise combinations of columns of X. Unfortunately, the matrix HDIR(X,X) contains more columns than we need. If X contains k columns, we need the "(k+1) choose 2" = (k+1)k/2 columns of interactions, but the matrix HDIR(X,X) contains k*k columns. The problem is that HDIR(X,X) contains columns for both X1*X2 and X2*X1, even though those columns are identical. The same holds for other crossed terms such as X1*X3 and X3*X1, X2*X3 and X3*X2, and so forth.

If you want to use the HDIR function to generate all pairwise interactions, you have a choice: You can generate ALL products of pairs and delete the redundant ones, or you can compute only the unique pairs. I will show the latter because it is more efficient.

varNames = 'X1':'X3';
use Have;  read all var varNames into X;  close;

/* Compute only the columns we need */
A1 = HDIR(X[,1], X[,1:3]);   /* interaction of X1 with X1, X2, X3 */
A2 = HDIR(X[,2], X[,2:3]);   /* interaction of X2 with X2, X3 */
A3 = HDIR(X[,3], X[,3]);     /* interaction of X3 with X3 */
A = A1 || A2 || A3;

/* get the HEAD module from
   https://blogs.sas.com/content/iml/2021/02/10/print-top-rows-of-data-sas.html */
load module=(Head);
run Head(A) colname={'X1_2' 'X1_X2' 'X1_X3' 'X2_2' 'X2_X3' 'X3_2'};

The HEAD module displays only the top five rows of the matrix of interactions. This output is almost the same as the output from PROC GLMSELECT. A difference is how missing values propagate. For the third row, X1 is missing. PROC GLMSELECT puts a missing value in all columns of the pairwise interactions. In contrast, the SAS/IML output only has missing values in the first three columns because those are the only columns that involve the X1 variable. If your data contain missing values, you might want to use the CompleteCases function to delete the rows that have missing values before performing additional analyses.

A function that computes interactions of continuous variables

The previous section computes all quadratic interactions for a matrix that has three variables. It is straightforward to generalize the idea to k variables. You simply compute the interactions HDIR(X[,i],X[,i:k]) for i=1,2,...,k. This is done in the following program. During the computation, it is convenient to also generate names for the interactions. The following program generates the same names as PROC GLMSELECT. To make the code easier to understand, I encapsulate the logic for names into a helper function called GetNamePairs. The helper function is called as part of the InteractPairs module, which returns both the matrix of interactions and the character vector that names the columns:

/* Helper function: Form names of interaction effects.
   Input:  s is a scalar name. t is a character vector of names.
   Return: row vector of pairs of names "s_2" or "s_t[i]"
   Example: GetNamePairs('X1', 'X1':'X3') returns {'X1_2' 'X1_X2' 'X1_X3'} */
start GetNamePairs(s, t);
   k = nrow(t)*ncol(t);
   b = blankstr(nleng(s)+nleng(t)+1);   /* max length of interaction name */
   pairs = j(1, k, b);
   do i = 1 to k;
      if s=t[i] then pairs[i] = strip(s) + "_2";
      else           pairs[i] = catx("_", strip(s), strip(t[i]));
   end;
   return pairs;
finish;

/* Generate all quadratic interactions for continuous variables.
   Input:  design matrix X and a vector of column names.
   Output: OutX: a matrix whose columns are the pairwise interactions
           OutNames: the names of interactions, such as Age_2 and Height_Age */
start InteractPairs(OutX, OutNames, X, Names);
   k = ncol(X);
   numInteract = comb(k+1, 2);        /* = k + comb(k,2) */
   OutX = j(nrow(X), numInteract, .);
   OutNames = j(1, numInteract, BlankStr(2*nleng(Names)+1));
   col = 1;                           /* initial column to fill */
   do i = 1 to k;
      m = k-i+1;                      /* number of interactions for this loop */
      OutX[,col:col+m-1] = HDIR(X[,i], X[,i:k]);
      OutNames[,col:col+m-1] = GetNamePairs(Names[i], Names[i:k]);
      col = col + m;
   end;
finish;

run InteractPairs(OutX, OutNames, X, varNames);
run Head(OutX) colname=OutNames;

The output is identical to the earlier output. The input arguments to this module can have an arbitrary number of columns. In this example, the design matrix does not include an intercept column (a column that is all ones). Consequently, the output is the set of all quadratic interactions. If your design matrix includes an intercept column, the output will contain an intercept column and all main effects in addition to the quadratic effects.

Technically, you don't need to use the HDIR function in the InteractPairs module. Instead, you can use the familiar '#' operator to perform the elementwise multiplication. However, if you try to generalize the module to handle the interaction between categorical variables and continuous variables, the HDIR function will be useful.
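For readers outside the SAS ecosystem, the same unique-pairs logic can be sketched in Python with numpy. The function name `interact_pairs` and the elementwise products are my own; the naming scheme mimics the GLMSELECT names:

```python
import numpy as np

def interact_pairs(X, names):
    """All quadratic interactions X_i*X_j for i <= j, with GLMSELECT-style names."""
    k = X.shape[1]
    cols, out_names = [], []
    for i in range(k):
        for j in range(i, k):
            cols.append(X[:, i] * X[:, j])      # elementwise product of columns
            out_names.append(names[i] + "_2" if i == j
                             else names[i] + "_" + names[j])
    return np.column_stack(cols), out_names

X = np.array([[1., 2., 3.], [4., 5., 6.]])
OutX, OutNames = interact_pairs(X, ["X1", "X2", "X3"])
```

As in the IML module, the output has comb(k+1, 2) columns rather than the k*k columns of the full horizontal direct product.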

Summary

This article introduces the HDIR function in SAS/IML, which computes the horizontal direct product of matrices. You can use the HDIR function to generate interaction effects in regression models. This article shows how to generate all quadratic interactions among a set of continuous variables.

The post Generate all quadratic interactions in a regression model appeared first on The DO Loop.

I previously showed how to create a decile calibration plot for a logistic regression model in SAS. A decile calibration plot (or "decile plot," for short) is used in some fields to visualize agreement between the data and a regression model. It can be used to diagnose an incorrectly specified model.

In principle, the decile plot can be applied to any regression model that has a continuous regressor. In this article, I show how to create two decile plots for an ordinary least squares regression model. The first overlays a decile plot on a fit plot, as shown to the right. The second decile plot shows the relationship between the observed and predicted responses.

Although this article shows how to create a decile plot in SAS, I do not necessarily recommend them. In many cases, it is better and simpler to use a loess fit, as I discuss at the end of this article.

A misspecified model

A decile plot can be used to identify an incorrectly specified model. Suppose, for example, that a response variable (Y) depends on an explanatory variable (X) in a nonlinear manner. If you model Y as a linear function of X, then you have specified an incorrect model. The following SAS DATA step simulates a response variable (Y) that depends strongly on a linear effect and weakly on quadratic and cubic effects for X. Because the nonlinear dependence is weak, a traditional fit plot does not indicate that the model is misspecified:

/* Y = b0 + b1*x + b2*x**2 + b3*x**3 + error, where b2 and b3 are small */
%let N = 200;                        /* sample size */
data Have(keep= Y x);
call streaminit(1234);               /* set the random number seed */
do i = 1 to &N;                      /* for each observation in the sample */
   x = rand("lognormal", 0, 0.5);    /* x = explanatory variable */
   y = 20 + 5*x - 0.5*(x-3)**3 +     /* include weak cubic effect */
       rand("Normal", 0, 4);
   output;
end;
run;

proc reg data=Have;   /* Optional: plots=residuals(smooth); */
   model y = x;
quit;

The usual p-values and diagnostic plots (not shown) do not reveal that the model is misspecified. In particular, the fit plot, which shows the observed and predicted response versus the explanatory variable, does not indicate that the model is misspecified. At the end of this article, I discuss how to get PROC REG to overlay additional information about the fit on its graphs.

Overlay a decile plot on a fit plot

This section shows how to overlay a decile plot on the fit plot. The fit plot shows the observed values of the response (Y) for each value of the explanatory variable (X). You can create a decile fit plot as follows:

1. Bin the X values into 10 bins by using the deciles of the X variable as cut points. In SAS, you can use PROC RANK for this step.
2. For each bin, compute the mean Y value and a confidence interval for the mean. At the same time, compute the mean of each bin, which will be the horizontal plotting position for the bin. In SAS, you can use PROC MEANS for this step.
3. Overlay the fit plot with the 10 confidence intervals and mean values for Y. You can use PROC SGPLOT for this step.
/* 1. Assign each obs to a decile for X */
proc rank data=Have out=Decile groups=10 ties=high;
   var x;           /* variable on which to group */
   ranks Decile;    /* name of variable to contain groups 0,1,...,k-1 */
run;

/* 2. Compute the mean and confidence interval for each decile */
proc means data=Decile noprint;
   class Decile;
   types Decile;
   var y x;
   output out=DecileOut mean=YMean XMean
          lclm=yLower uclm=yUpper;   /* only compute limits for Y */
run;

/* 3. Create a fit plot and overlay the deciles. */
data All;
set Have DecileOut;
run;

title "Decile Plot";
title2 "Observed and Predicted Response versus Explanatory";
proc sgplot data=All noautolegend;
   reg x=x y=y / nomarkers CLM alpha=0.1 name="reg";
   scatter x=xMean y=YMean / yerrorlower=yLower yerrorupper=yUpper
           name="obs" legendlabel="Mean Y and 90% CI" markerattrs=GraphData2;
   fringe x / transparency=0.5;
   xaxis label="x" values=(0 to 4.5 by 0.5) valueshint;
   yaxis label="y" offsetmin=0.05;
   keylegend "reg" "obs" / position=NW location=inside across=1;
run;

The graph is shown at the top of this article. I use the FRINGE statement to display the locations of the X values at the bottom of the plot. The graph indicates that the model might be misspecified. The mean Y values for the first three deciles are all above the predicted value from the model. For the next six deciles, the mean Y values are all below the model's predictions. This kind of systematic deviation from the model can indicate that the model is missing an important component. In this case, we know that the true response is cubic whereas the model is linear.
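The binning-and-averaging logic (steps 1 and 2) can also be sketched outside of SAS. The Python sketch below is an assumption-laden illustration, not a replacement for PROC RANK and PROC MEANS: it uses a normal quantile in place of the t quantile that PROC MEANS uses for the confidence limits.

```python
import numpy as np
from statistics import NormalDist

def decile_means(x, y, alpha=0.10):
    """Bin x into deciles; return per-bin (xMean, yMean, yLower, yUpper).
    Uses a normal quantile in place of the t quantile, for brevity."""
    cuts = np.quantile(x, np.linspace(0, 1, 11)[1:-1])  # 9 interior cut points
    decile = np.digitize(x, cuts)                       # bin ids 0..9
    z = NormalDist().inv_cdf(1 - alpha/2)
    rows = []
    for d in range(10):
        xd, yd = x[decile == d], y[decile == d]
        half = z * yd.std(ddof=1) / np.sqrt(len(yd))    # half-width of the CI
        rows.append((xd.mean(), yd.mean(), yd.mean() - half, yd.mean() + half))
    return np.array(rows)

# simulate data similar to the DATA step above
rng = np.random.default_rng(1234)
x = rng.lognormal(0, 0.5, 200)
y = 20 + 5*x - 0.5*(x - 3)**3 + rng.normal(0, 4, 200)
tbl = decile_means(x, y)
```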

A decile plot of observed versus predicted response

You can create a fit plot when the model contains a single continuous explanatory variable. If you have multiple explanatory variables, you can assess the model by plotting the observed responses against the predicted responses. You can overlay a decile plot by computing the average observed response for each decile of the predicted response. When you hear someone use the term "decile plot," they are probably referring to this second plot, which is applicable to most regression models.

The SAS program is almost identical to the program in the previous section. The main difference is that you need to create an output data set (FIT) that contains the predicted values (Pred). You then use the Pred variable instead of the X variable throughout the program, as follows:

/* 1. Compute predicted values */
proc reg data=Have noprint;
   model y = x;
   output out=Fit Pred=Pred;
quit;

/* 2. Assign each obs to a decile for Pred */
proc rank data=Fit out=Decile groups=10 ties=high;
   var Pred;        /* variable on which to group */
   ranks Decile;    /* name of variable to contain groups 0,1,...,k-1 */
run;

/* 3. Compute the mean and confidence interval for each decile */
proc means data=Decile noprint;
   class Decile;
   types Decile;
   var y Pred;
   output out=DecileOut mean=YMean PredMean
          lclm=yLower uclm=yUpper;   /* only compute limits for Y */
run;

/* 4. Create an Observed vs Predicted plot and overlay the deciles. */
data All;
set Fit DecileOut;
run;

title "Decile Plot";
title2 "Observed Response versus Predicted Response";
proc sgplot data=All noautolegend;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash) clip;  /* add identity line */
   scatter x=PredMean y=YMean / yerrorlower=yLower yerrorupper=yUpper
           name="obs" legendlabel="Mean Y and 95% CI" markerattrs=GraphData2;
   fringe Pred;
   xaxis label="Predicted Response";
   yaxis label="Observed Response" offsetmin=0.05;
   keylegend "obs" / position=NW location=inside across=1;
run;

The decile plot includes a diagonal line, which is the line of perfect agreement between the model and the data. In a well-specified model, the 10 empirical means of the deciles should be close to the line, and they should vary randomly above and below the line. The decile plot for the misspecified model shows a systematic trend rather than random variation. For the lower 30% of the predicted values, the observed response is higher (on average) than the predictions. When the predicted values are in the range [29, 32] (the fourth through ninth deciles), the observed response is lower (on average) than the predictions. Thus, the decile plot indicates that the linear model might be missing an important nonlinear effect.

The loess alternative

A general principle in statistics is that you should not discretize a continuous variable unless absolutely necessary. You can often get better insight by analyzing the continuous variable. In accordance with that principle, this section shows how to identify misspecified models by using a loess smoother rather than by discretizing a continuous variable into deciles.

In addition to the "general principle," a decile plot requires three steps: compute the deciles, compute the means for the deciles, and merge the decile information with the original data so that you can plot it. As I discuss at the end of my previous article, it is often easier to overlay a loess smoother.

The following statements use the LOESS statement in PROC SGPLOT to add a loess smoother to a fit plot. If the loess fit is not approximately linear, it could indicate that the linear model is not appropriate.

title "Linear Regression Fit";
proc sgplot data=Have;
   reg x=x y=y / clm;           /* fit linear model y = b0 + b1*x */
   loess x=x y=y / nomarkers;   /* use loess fit instead of decile plot */
run;

This plot is the continuous analog to the first decile plot. It tells you the same information: For very small and very large values of X, the predicted values are too low. The loess curve suggests that the linear model is lacking an effect.

The previous plot is useful for single-variable regressions. But even if you have multiple regressors, you can add a loess smoother to the "observed versus predicted" plot. If the FIT data set contains the observed and predicted response, you can create the graph as follows:

title "Observed versus Predicted";
proc sgplot data=Fit;
   loess x=Pred y=y;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash) clip
            legendlabel="Identity Line";
run;

This plot is the continuous analog to the second decile plot. The loess curve indicates that the observed response is higher than predicted for small and large predicted values. When the model predicts values of Y in the range [29.5, 32], the predicted values tend to be smaller than the observed values. This is the same information that the decile plot provides, but the loess curve is easier to use and gives more information because it does not discretize a continuous variable.

Finally, you can add a loess smoother to a residual plot by using an option in PROC REG. In a well-specified model, the residual plot should not show any systematic trends. You can use the PLOTS=RESIDUALS(SMOOTH) option on the PROC REG statement to add a loess smoother to the residual plot, as follows:
proc reg data=Have plots=residuals(smooth);
   model y = x;
   ods select ResidualPlot;
quit;

The loess smoother on the residual plot shows that the residuals seem to have a systematic component that is not captured by the linear model. When you see the loess fit, you might be motivated to fit models that are more complex. You can use this option even for models that have multiple explanatory variables.

Summary

In summary, this article shows how to create two decile plots for least-squares regression models in SAS. The first decile plot adds 10 empirical means to a fit plot. You can create this plot when you have a single continuous regressor. The second decile plot is applicable to models that have multiple continuous regressors. It adds 10 empirical means to the "observed versus predicted" plot. Lastly, although you can create decile plots in SAS, in many cases it is easier and more informative to use a loess curve to assess the model.

The post Decile plots in SAS appeared first on The DO Loop.

A previous article discusses the confidence band for the mean predicted value in a regression model. The article shows a "graded confidence band plot," which I saw in Claus O. Wilke's online book, Fundamentals of Data Visualization (Section 16.3). It communicates uncertainty in the predictions. A graded band plot is shown to the right for the 95% (outermost), 90%, 80%, 70% and 50% (innermost) confidence levels.

The graded band plot to the right has five confidence levels, but that was an arbitrary choice. Why not use 10 levels? Or 100?

When you start using many levels, it is no longer feasible to identify the individual confidence bands. However, you can use a heat map to visualize the uncertainty in the mean. Given any (x,y) value in the graph, you can find the value of α such that the 100(1-α)% confidence limit passes through (x,y). You can then color the point (x,y) by the value of α and use a gradient legend to indicate the α values (or, equivalently, the confidence levels). An example is shown below. In this article, I show how to create this visualization, which I call a continuous band plot. This article was inspired by a question and graph that Michael Friendly posted on Twitter.

A review: The formula for confidence limits for the mean predicted value

Most statistics textbooks provide a formula for the 100(1-α)% confidence limits for the predicted mean of a regression model. You can see the formula in the SAS documentation. To keep the math simple, I will restrict the discussion to one-dimensional linear regression of the form Y = b0 + b1*x + ε, where ε ~ N(0, σ). Suppose that the sample contains n observations. Given a significance level (α) and values for the explanatory variable (x), the upper and lower confidence limits are given by
CLM(x) = pred(x) ± tα SEM(x)
where

• tα = quantile("t", 1-α/2, n-2) is a quantile of the t distribution with n-2 degrees of freedom
• SEM(x) = sqrt(MSE * h(x)) is the standard error of the predicted mean at x
• MSE is the mean squared error, which is the sum of the squared residuals divided by n-2.
• h(x) is the leverage statistic at x. The SAS documentation provides a matrix equation for the leverage statistic at x in terms of a quadratic form that involves the inverse of the crossproduct matrix. For a one-variable regression, you can explicitly write the leverage in terms of the elements of the inverse matrix. If M = (X'X)^-1 is the inverse of the crossproduct matrix, then the leverage is h(x) = M[1,1] + 2*M[1,2]*x + M[2,2]*x^2.

The SAS/IML documentation includes a Getting Started example about computing statistics for linear regression. The following program computes these quantities for α=0.05 and for a sequence of x values:

data Have;
input x y;
datalines;
21 45
24 40
26 32
27 47
30 55
30 60
33 42
34 67
36 50
39 54
;

proc iml;    /* Use IML to produce regression estimates */
use Have;  read all var {x y};  close;

X = j(nrow(x),1,1) || x;   /* add intercept col to design matrix */
n = nrow(x);               /* sample size */
xpxi = inv(X`*X);          /* inverse of X'X */
b = xpxi * (X`*y);         /* parameter estimates of coefficients */
yHat = X*b;                /* predicted values for data */
dfe = n - ncol(X);         /* error DF */
mse = ssq(y-yHat)/dfe;     /* MSE = SSQ(resid)/dfe */

/* Given M = inv(X`*X), compute leverage at x. In general, the leverage is
   h = vecdiag(X*M*X`), but for a single regressor this simplifies. */
start leverage(x, M);
   return ( M[1,1] + 2*M[1,2]*x + M[2,2]*x##2 );
finish;

alpha = 0.05;
t_a = quantile("t", 1-alpha/2, dfe);

/* evaluate CLM for a sequence of x values */
z = T(do(20,40,1));        /* sequence of x values in [20,40] */
h = leverage(z, xpxi);     /* evaluate h(x) */
SEM = sqrt(mse * h);       /* SEM(x) = standard error of predicted mean */
Pred = b[1] + b[2]*z;      /* evaluate pred(x) */
LowerCLM = Pred - t_a * SEM;
UpperCLM = Pred + t_a * SEM;

print z Pred LowerCLM UpperCLM;

The program computes the predicted values and the lower and upper 95% confidence limits for the mean for a sequence of x values in the interval [20, 40]. If you overlay these curves on a scatter plot of the data, you obtain the usual regression fit plot that is produced automatically by regression procedures in SAS.
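As a cross-check of the formulas (not of the IML program itself), the same computation can be sketched in Python with numpy. To keep the sketch free of non-standard dependencies, the t quantile quantile("t", 0.975, 8) ≈ 2.306 is hard-coded rather than computed:

```python
import numpy as np

x = np.array([21, 24, 26, 27, 30, 30, 33, 34, 36, 39], dtype=float)
y = np.array([45, 40, 32, 47, 55, 60, 42, 67, 50, 54], dtype=float)

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept column
M = np.linalg.inv(X.T @ X)                  # inverse of the crossproduct matrix
b = M @ (X.T @ y)                           # parameter estimates (b0, b1)
dfe = len(x) - 2                            # error degrees of freedom
mse = np.sum((y - X @ b)**2) / dfe          # mean squared error
t_a = 2.306                                 # quantile("t", 0.975, 8), hard-coded

z = np.arange(20.0, 41.0)                   # x values at which to evaluate CLM
h = M[0, 0] + 2*M[0, 1]*z + M[1, 1]*z**2    # leverage h(z)
SEM = np.sqrt(mse * h)                      # standard error of the predicted mean
Pred = b[0] + b[1]*z
LowerCLM, UpperCLM = Pred - t_a*SEM, Pred + t_a*SEM
```

Because the regression line passes through the means of the data, the predicted value at x=30 is exactly the mean response, 49.2.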

Invert the confidence limit formula

The previous section implements formulas that are typically presented in a first course on linear regression. Given α and x, the formulas enable you to find the upper and lower CLM at x. But here is a cool thing that I haven't seen before: You can INVERT that formula. That is, given a value for the explanatory variable (x) and the response (y), you can find the significance level (α) such that the 100(1-α)% confidence limit passes through (x, y). This enables you to compute a heat map that visualizes the uncertainty in the predicted mean.

It is not hard to invert the CLM formula. You can use algebra to find
tα = |y - pred(x)| / SEM(x)
You can then apply the CDF function to both sides to find
1 - α/2 = CDF("t", |y - pred(x)| / SEM(x), n-2)
which you can solve for α(x,y). The following program carries out these computations:

/* Given (x,y), find alpha(x,y) so that the 100(1-alpha)% confidence
   limit passes through (x,y). */
/* Vectorize this process and compute alpha(x,y) for a grid of (x,y) values */
xx = do(20, 40, 0.25);       /* x in [20,40] */
yy = do(26, 73, 0.2);        /* y in [26,73] */
xy = ExpandGrid(yy, xx);     /* generate (x,y) pairs where x changes fastest */
x = xy[,2];  y = xy[,1];     /* vectors of x & y */

h = leverage(x, xpxi);       /* h(x) */
SEM = sqrt(mse * h);         /* SEM(x) = standard error of predicted mean */
Pred = b[1] + b[2]*x;        /* best prediction of mean at x */
LHS = abs(y - Pred)/SEM;
p = cdf("t", LHS, dfe);      /* = 1 - alpha/2 */
alpha = 2*(1 - p);
ConfidenceLevel = 100*(1-alpha);

/* write the data to a SAS data set */
create Heat var{'x' 'y' 'alpha' 'ConfidenceLevel'};
append;
close;
QUIT;

For each (x,y) value in a grid, the program computes the α value such that the 100(1-α)% confidence limit passes through (x,y). These values are written to a SAS data set and are used in the next section.
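The inversion can also be sketched in Python. The sketch below uses a normal approximation to the t CDF so that it needs only the standard library and numpy; the SAS program uses the exact t distribution. A round trip confirms the algebra: a point constructed to lie on the (approximate) 90% upper limit returns α = 0.1.

```python
import numpy as np
from statistics import NormalDist

# same toy data and fit as before, recomputed so this sketch is self-contained
x = np.array([21, 24, 26, 27, 30, 30, 33, 34, 36, 39], dtype=float)
y = np.array([45, 40, 32, 47, 55, 60, 42, 67, 50, 54], dtype=float)
X = np.column_stack([np.ones_like(x), x])
M = np.linalg.inv(X.T @ X)
b = M @ (X.T @ y)
mse = np.sum((y - X @ b)**2) / (len(x) - 2)

def alpha_at(xx, yy):
    """alpha such that the 100(1-alpha)% limit passes through (xx, yy).
    Uses a normal approximation to the t CDF; the SAS program uses the exact t."""
    h = M[0, 0] + 2*M[0, 1]*xx + M[1, 1]*xx**2   # leverage h(xx)
    sem = np.sqrt(mse * h)                       # SEM(xx)
    pred = b[0] + b[1]*xx
    p = NormalDist().cdf(abs(yy - pred) / sem)   # = 1 - alpha/2
    return 2*(1 - p)

# round trip: a point on the (approximate) 90% upper limit returns alpha = 0.1
z90 = NormalDist().inv_cdf(0.95)
h30 = M[0, 0] + 2*M[0, 1]*30 + M[1, 1]*30**2
y_band = (b[0] + b[1]*30) + z90*np.sqrt(mse*h30)
a = alpha_at(30.0, y_band)
```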

A continuous band plot

The following statements visualize the uncertainty in the predicted value by creating a heat map of the confidence bands for a range of confidence values:

data ContBand;
set Have Heat(rename=(x=xx y=yy));   /* concatenate with orig data */
label xx="x" yy="y";
run;

title 'Continuous Band Plot';
title2 'Confidence for Mean Predicted Value';
proc sgplot data=ContBand noautolegend;
   heatmapparm x=xx y=yy colorresponse=ConfidenceLevel /
               colormodel=(VeryDarkGray Gray White) name="prob";
   reg x=x y=y / markerattrs=(symbol=CircleFilled size=12);
   gradlegend "prob";
   yaxis min=30 max=68;
run;

The graph is shown in the first section of this article. The graph is similar to Wilke's graded confidence band plot. However, instead of overlaying a small number of confidence bands, it visualizes "infinitely many" confidence bands. Basically, it inverts the usual process (limits as a function of confidence level) to produce confidence as a function of y, for each x.

The gradient legend associates a color to a confidence level. In theory, you can associate each (x,y) point with a confidence level. In practice, it is hard to visually distinguish grey scales. If it is important to distinguish, say, the 80% confidence region, you can replace the color ramp for the COLORMODEL= option. For example, you could use colormodel=(black red yellow cyan white) if you are looking for vibrant colors that emphasize the 25%, 50%, and 75% levels.

A profile plot for the confidence band

It is enlightening to take a vertical slice of the continuous band plot. A vertical slice provides a "profile plot," which shows how the confidence level changes as a function of y for a fixed value of x. Here is the profile plot at x=30:

title 'Profile Plot of Confidence Level';
title2 'x = 30';
proc sgplot data=Heat noautolegend;
   where x = 30;
   series x=ConfidenceLevel y=y;
   refline 49.2 / axis=y;
   yaxis min=30 max=68 grid;
   xaxis grid;
run;

When x=30, the predicted mean is 49.2, which is marked by a horizontal reference line. That point prediction is the "0% confidence interval," since it does not account for sampling variability. For ConfidenceLevel > 0, the curves give the width of the confidence interval for the predicted mean. The width is small and increases almost linearly with the confidence level until about the 70% level. The lower and upper 80% CLMs are 45.1 and 53.3, respectively. The lower and upper 95% CLMs are much bigger at 42.4 and 56.0, respectively. The 99% CLMs are 39.3 and 59.1. As the confidence level approaches 100%, the width of the interval increases without bound.

Summary

In summary, this article shows formulas for the lower and upper 100(1-α)% confidence limits for the predicted mean at a point x. You can invert the formulas: for any (x,y), you can find the value of α for which the 100(1-α)% confidence limits pass through the point (x,y). You can use a heat map to visualize the resulting surface, which is a continuous version of the graded confidence band plot. The continuous band plot is one way to visualize the uncertainty of the predicted mean in a regression model.

The post A continuous band plot for visualizing uncertainty in regression predictions appeared first on The DO Loop.

You've probably seen many graphs that are similar to the one at the right. This plot shows a regression line overlaid on a scatter plot of some data. Given a value for the independent variable (x), the regression line gives the best prediction for the mean of the response variable (y). The light blue band shows a 95% confidence band for the conditional mean.

This article is about how to understand the confidence band. The band conveys uncertainty in the location of the conditional mean. You can think of the confidence band as being a bunch of vertical confidence intervals, one at each possible value of x. For example, when x=30, the graph predicts 49.2 for the conditional mean of y and shows that the confidence interval for the mean of y at x=30 is [42.4, 56.0]. The confidence intervals for x=20 and x=40 are wider, which indicates that there is more uncertainty in the prediction when x is extreme than when x is near the center of the data. (The confidence interval of the conditional mean is sometimes called the confidence interval of the prediction. No matter what you call it, the intervals in this article are for the MEAN, not for individual responses.)

Statistics have uncertainty because they are based on a random sample from the population. If you were to choose a different random sample of (x,y) values from the same population, you would get a different regression line. If you choose a third random sample, you would get yet another regression line. In this article, I discuss some intuitive ways to think about the confidence band without using any formulas. I show that the confidence band is related to repeatedly choosing a random sample and fitting many regression lines. I show a couple of alternative ways to visualize uncertainty in the predicted values.

Fit the example data

You can use a simulation to demonstrate the relationship between a confidence interval and repeated random sampling. The best way is to use a (known) model to repeatedly generate the random sample. However, I am going to use a slightly different approach, which is discussed in Section 13.3 of Simulating Data with SAS, Wicklin (2013). When you deal with real data, you don't know the real underlying relationship between the response and explanatory variables, but you can always simulate from the fitted regression model. This means that you fit the data and then use the parameter estimates as if they were the true values of the parameters.

Let's see how this works. The following DATA step defines a toy example that has 10 observations. The call to PROC REG fits a linear model to the data.

data Have; input x y; datalines; 21 45 24 40 26 32 27 47 30 55 30 60 33 42 34 67 36 50 39 54 ;   proc reg data=Have outest=PE alpha=0.05 plots(only)=fitplot(nocli); model y = x; quit;   proc print data=PE noobs label; label Intercept="b0" x="b1" _RMSE_="s"; var Intercept x _RMSE_; run;

The REG procedure creates a fit plot that is similar to the one at the top of this article. The model is Y = b0 + b1*x + eps, where eps ~ N(0, s). The call to PROC PRINT shows the three parameter estimates: the intercept term (b0), the coefficient of the linear term (b1), and the "root mean square error," which is the estimate of the standard deviation of the error term (s).

Visualizing uncertainty by using graded confidence bands

The conditional mean is assumed to be normally distributed inside the confidence band. For a particular value of x, the probability of the conditional mean is high near the regression line and lower near the upper and lower limits. You can visualize the distribution by overlaying confidence bands for various confidence levels. If you make the bands semi-transparent, the union of the bands is darkest near the regression line and lightest far away from the line.

In SAS, you can use the REG statement (with the CLM option) in PROC SGPLOT to overlay several semi-transparent confidence bands that have different levels of confidence. The following graph overlays the 95%, 90%, 80%, 70% and 50% confidence bands. To save typing, I use a SAS macro to create each REG statement.

%macro CLM(alpha=, thickness=); reg x=x y=y / nomarkers alpha=&alpha lineattrs=GraphFit(thickness=&thickness) clm clmattrs=(clmfillattrs=(color=gray transparency=0.8)); %mend;   title "Graded Confidence Band Plot"; title2 "alpha = 0.05, 0.1, 0.2, 0.3, 0.5"; proc sgplot data=Have noautolegend; %CLM(alpha=0.05, thickness=0); %CLM(alpha=0.10, thickness=0); %CLM(alpha=0.20, thickness=0); %CLM(alpha=0.30, thickness=0); %CLM(alpha=0.50, thickness=2); scatter x=x y=y/ markerattrs=(symbol=CircleFilled size=12); run;

Claus O. Wilke calls this a "graded confidence band plot" in his book _Fundamentals of Data Visualization_ (Section 16.3). It communicates that the location of the conditional mean is uncertain, but it is most likely to be found near the regression line.

Simulate from the fitted model

The confidence band visualizes the uncertainty in the prediction due to the fact that the data are a random sample from some unknown population. Although we don't know the underlying true relationship between the (x,y) pairs, we can simulate random samples from the fitted linear model by using the parameter estimates as if they were the true parameters. The following SAS DATA step simulates 500 random samples from the regression model. The x values are fixed. The y values are simulated according to Y = b0 + b1*x + eps, where eps ~ N(0, s).

/* simulate many times from model, using parameter estimates as the true model */ %let numSim = 500; data RegSim; call streaminit(1234); b0 = 21.1014; b1 = 0.93662; RMSE = 9.33046; set Have; /* for each X value in the original data */ do SampleID = 1 to &numSim; /* simulate Y = b0 + b1*x + eps, eps ~ N(0,RMSE) */ YSim = b0 + b1*x + rand("Normal", 0, RMSE); output; end; run;   /* use BY-group processing to fit a regression model to each simulated data */ proc sort data=RegSim; by SampleID; run; proc reg data=RegSim outest=PESim alpha=0.05 noprint; by SampleID; model YSim = x; quit;

The output from PROC REG is 500 pairs of parameter estimates (b0 and b1). Each estimate represents a regression line for a random sample from the same linear model. Let's see what happens if you overlay all those regression lines on a single plot:

/* two points determine a line, so score regression on [min(x), max(x)] */ data Viz; set PESim(rename=(Intercept=b0 x=b1)); /* min(x)=21; max(x)=39. Evaluate fit b0 + b1*x for each simulated sample */ xx = 21; yy = b0 + b1*xx; output; xx = 39; yy = b0 + b1*xx; output; keep SampleID xx yy; run;   /* overlay the fits on the original data */ data Combine; set Have Viz; run;   title "Overlay of &numSim Regression Lines"; title2 "Y = b0 + b1*x + eps, eps ~ N(0,RMSE)"; ods graphics / antialias=off GROUPMAX=10000; proc sgplot data=Combine noautolegend; reg x=x y=y / nomarkers alpha=0.05 clm clmattrs=(clmfillattrs=(transparency=0.5)); series x=xx y=yy / group=SampleId lineattrs=(color=gray pattern=solid) transparency=0.9; scatter x=x y=y; reg x=x y=y / nomarkers; run;

Notice that most of the regression lines are in the interior of the confidence band. In fact, if you fix a value of x (such as x=30) and evaluate all the regression lines at x, then about 95% of the conditional means will be contained within the original confidence band.

This is the intuitive meaning of the confidence band. If you imagine obtaining a different random sample from the same population and fitting a regression line, the conditional mean will be contained in the band for 95% of the random samples.

The distribution of the conditional mean

As I said earlier, you should think of the confidence bands as a bunch of vertical confidence intervals. The distribution of the conditional mean within each vertical slice is assumed to be normally distributed with mean b0 + b1*x. The following histograms show the distribution of the predicted means for x=21 and x=39 (respectively) over all 500 regression lines. You should compare the 2.5th and 97.5th percentiles for these histograms to the vertical limits (at the same x values) of the confidence band in the graph at the top of this article.

title "Distribution of Conditional Means"; proc sgpanel data=Combine(where=(xx^=.)); label yy="y" xx="x0"; panelby xx / layout=rowlattice columns=1; histogram yy; run;

Summary

In summary, this article visualizes a few facts about the confidence band for a regression line. The confidence band conveys uncertainty about the location of the conditional mean. The visualization can be improved by overlaying several semi-transparent bands, a graph that Wilke calls a "graded confidence band plot." You can use simulation to relate the confidence band to sampling variability. If you simulate many random samples and overlay the regression lines, they form a confidence band. The limits of the confidence band are related to the quantiles of the conditional distribution of the means.

The post Visualize uncertainty in regression predictions appeared first on The DO Loop.

You've probably seen many graphs that are similar to the one at the right. This plot shows a regression line overlaid on a scatter plot of some data. Given a value for the independent variable (x), the regression line gives the best prediction for the mean of the response variable (y). The light blue band shows a 95% confidence band for the conditional mean.

This article is about how to understand the confidence band. The band conveys uncertainty in the location of the conditional mean. You can think of the confidence band as being a bunch of vertical confidence intervals, one at each possible value of x. For example, when x=30, the graph predicts 49.2 for the conditional mean of y and shows that the confidence interval for the mean of y at x=30 is [42.4, 56.0]. The confidence intervals for x=20 and x=40 are wider, which indicates that there is more uncertainty in the prediction when x is extreme than when x is near the center of the data. (The confidence interval of the conditional mean is sometimes called the confidence interval of the prediction. No matter what you call it, the intervals in this article are for the MEAN, not for individual responses.)

Statistics have uncertainty because they are based on a random sample from the population. If you were to choose a different random sample of (x,y) values from the same population, you would get a different regression line. If you choose a third random sample, you would get yet another regression line. In this article, I discuss some intuitive ways to think about the confidence band without using any formulas. I show that the confidence band is related to repeatedly choosing a random sample and fitting many regression lines. I show a couple of alternative ways to visualize uncertainty in the predicted values.

Fit the example data

You can use a simulation to demonstrate the relationship between a confidence interval and repeated random sampling. The best way is to use a (known) model to repeatedly generate the random samples. However, I am going to use a slightly different approach, which is discussed in Section 13.3 of Simulating Data with SAS (Wicklin, 2013). When you deal with real data, you don't know the true underlying relationship between the response and explanatory variables, but you can always simulate from the fitted regression model. This means that you fit the data and then use the parameter estimates as if they were the true values of the parameters.

Let's see how this works. The following DATA step defines a toy example that has 10 observations. The call to PROC REG fits a linear model to the data.

data Have;
input x y;
datalines;
21 45
24 40
26 32
27 47
30 55
30 60
33 42
34 67
36 50
39 54
;

proc reg data=Have outest=PE alpha=0.05 plots(only)=fitplot(nocli);
   model y = x;
quit;

proc print data=PE noobs label;
   label Intercept="b0" x="b1" _RMSE_="s";
   var Intercept x _RMSE_;
run;

The REG procedure creates a fit plot that is similar to the one at the top of this article. The model is Y = b0 + b1*x + eps, where eps ~ N(0, s). The call to PROC PRINT shows the three parameter estimates: the intercept term (b0), the coefficient of the linear term (b1), and the "root mean square error," which is the estimate of the standard deviation of the error term (s).
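The estimates that PROC PRINT displays can be verified with a short computation outside SAS. The following Python sketch (numpy only; the t critical value t0.975,8 ≈ 2.306 is hardcoded to keep the example self-contained) fits the same 10 points by least squares and also reproduces the vertical confidence interval [42.4, 56.0] at x=30 that was quoted earlier:

```python
import numpy as np

# the 10 (x, y) pairs from the DATA step
x = np.array([21, 24, 26, 27, 30, 30, 33, 34, 36, 39], dtype=float)
y = np.array([45, 40, 32, 47, 55, 60, 42, 67, 50, 54], dtype=float)
n = len(x)

# ordinary least squares: b1 = Sxy/Sxx, b0 = ybar - b1*xbar
xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar)**2)
b1 = np.sum((x - xbar) * (y - ybar)) / Sxx
b0 = ybar - b1 * xbar

# root mean square error: s^2 = SSE/(n-2)
resid = y - (b0 + b1*x)
s = np.sqrt(np.sum(resid**2) / (n - 2))
print(b0, b1, s)        # approx 21.1014, 0.93662, 9.33046

# 95% CI for the conditional mean at x0=30; t quantile t(0.975, df=8) = 2.306
x0 = 30
se_mean = s * np.sqrt(1/n + (x0 - xbar)**2 / Sxx)
ci = (b0 + b1*x0 - 2.306*se_mean, b0 + b1*x0 + 2.306*se_mean)
print(ci)               # approx (42.4, 56.0)
```

Because x0=30 is the mean of the x values, the predicted conditional mean is exactly ybar = 49.2, and the interval is at its narrowest there.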

Visualizing uncertainty by using graded confidence bands

The conditional mean is assumed to be normally distributed inside the confidence band. For a particular value of x, the probability density for the conditional mean is highest near the regression line and lower near the upper and lower limits. You can visualize the distribution by overlaying confidence bands for various confidence levels. If you make the bands semi-transparent, the union of the bands is darkest near the regression line and lightest far away from the line.

In SAS, you can use the REG statement (with the CLM option) in PROC SGPLOT to overlay several semi-transparent confidence bands that have different levels of confidence. The following graph overlays the 95%, 90%, 80%, 70% and 50% confidence bands. To save typing, I use a SAS macro to create each REG statement.

%macro CLM(alpha=, thickness=);
   reg x=x y=y / nomarkers alpha=&alpha lineattrs=GraphFit(thickness=&thickness)
                 clm clmattrs=(clmfillattrs=(color=gray transparency=0.8));
%mend;

title "Graded Confidence Band Plot";
title2 "alpha = 0.05, 0.1, 0.2, 0.3, 0.5";
proc sgplot data=Have noautolegend;
   %CLM(alpha=0.05, thickness=0);
   %CLM(alpha=0.10, thickness=0);
   %CLM(alpha=0.20, thickness=0);
   %CLM(alpha=0.30, thickness=0);
   %CLM(alpha=0.50, thickness=2);
   scatter x=x y=y / markerattrs=(symbol=CircleFilled size=12);
run;

Claus O. Wilke calls this a "graded confidence band plot" in his book _Fundamentals of Data Visualization_ (Section 16.3). It communicates that the location of the conditional mean is uncertain, but it is most likely to be found near the regression line.

Simulate from the fitted model

The confidence band visualizes the uncertainty in the prediction due to the fact that the data are a random sample from some unknown population. Although we don't know the underlying true relationship between the (x,y) pairs, we can simulate random samples from the fitted linear model by using the parameter estimates as if they were the true parameters. The following SAS DATA step simulates 500 random samples from the regression model. The x values are fixed. The y values are simulated according to Y = b0 + b1*x + eps, where eps ~ N(0, s).

/* simulate many times from model, using parameter estimates as the true model */
%let numSim = 500;
data RegSim;
call streaminit(1234);
b0 = 21.1014;  b1 = 0.93662;  RMSE = 9.33046;
set Have;                      /* for each X value in the original data */
do SampleID = 1 to &numSim;
   /* simulate Y = b0 + b1*x + eps, eps ~ N(0,RMSE) */
   YSim = b0 + b1*x + rand("Normal", 0, RMSE);
   output;
end;
run;

/* use BY-group processing to fit a regression model to each simulated sample */
proc sort data=RegSim;  by SampleID;  run;

proc reg data=RegSim outest=PESim alpha=0.05 noprint;
   by SampleID;
   model YSim = x;
quit;

The output from PROC REG is 500 pairs of parameter estimates (b0 and b1). Each estimate represents a regression line for a random sample from the same linear model. Let's see what happens if you overlay all those regression lines on a single plot:

/* two points determine a line, so score regression on [min(x), max(x)] */
data Viz;
set PESim(rename=(Intercept=b0 x=b1));
/* min(x)=21; max(x)=39. Evaluate fit b0 + b1*x for each simulated sample */
xx = 21;  yy = b0 + b1*xx;  output;
xx = 39;  yy = b0 + b1*xx;  output;
keep SampleID xx yy;
run;

/* overlay the fits on the original data */
data Combine;
set Have Viz;
run;

title "Overlay of &numSim Regression Lines";
title2 "Y = b0 + b1*x + eps, eps ~ N(0,RMSE)";
ods graphics / antialias=off GROUPMAX=10000;
proc sgplot data=Combine noautolegend;
   reg x=x y=y / nomarkers alpha=0.05 clm clmattrs=(clmfillattrs=(transparency=0.5));
   series x=xx y=yy / group=SampleID lineattrs=(color=gray pattern=solid) transparency=0.9;
   scatter x=x y=y;
   reg x=x y=y / nomarkers;
run;

Notice that most of the regression lines are in the interior of the confidence band. In fact, if you fix a value of x (such as x=30) and evaluate all the regression lines at x, then about 95% of the conditional means will be contained within the original confidence band.

This is the intuitive meaning of the confidence band. If you imagine obtaining a different random sample from the same population and fitting a regression line, the conditional mean will be contained in the band for 95% of the random samples.
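This claim is easy to check numerically. The sketch below (Python/numpy, not part of the SAS program; the band limits [42.4, 56.0] at x=30 are taken from the original fit discussed earlier) repeats the simulation and counts how many of the 500 fitted lines predict a conditional mean at x=30 that falls inside the original band:

```python
import numpy as np

rng = np.random.default_rng(1234)

# fixed x values and fitted parameters, treated as if they were the truth
x = np.array([21, 24, 26, 27, 30, 30, 33, 34, 36, 39], dtype=float)
b0, b1, s = 21.1014, 0.93662, 9.33046

# 95% band limits at x=30, read off the original fit plot
lower, upper = 42.4, 56.0

num_sim = 500
inside = 0
for _ in range(num_sim):
    y = b0 + b1*x + rng.normal(0, s, size=x.size)  # simulate a new sample
    bb1, bb0 = np.polyfit(x, y, 1)                 # refit the regression line
    yhat30 = bb0 + bb1*30                          # conditional mean at x=30
    if lower <= yhat30 <= upper:
        inside += 1

coverage = inside / num_sim
print(coverage)
```

The observed fraction is close to (in fact slightly above) 0.95; the excess occurs because the band uses a t quantile while the simulated estimates are evaluated against fixed "true" parameters.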

The distribution of the conditional mean

As I said earlier, you should think of the confidence band as a bunch of vertical confidence intervals. The distribution of the conditional mean within each vertical slice is assumed to be normal with mean b0 + b1*x. The following histograms show the distribution of the predicted means for x=21 and x=39 (respectively) over all 500 regression lines. You should compare the 2.5th and 97.5th percentiles of these histograms to the vertical limits (at the same x values) of the confidence band in the graph at the top of this article.

title "Distribution of Conditional Means";
proc sgpanel data=Combine(where=(xx^=.));
   label yy="y" xx="x0";
   panelby xx / layout=rowlattice columns=1;
   histogram yy;
run;

Summary

In summary, this article visualizes a few facts about the confidence band for a regression line. The confidence band conveys uncertainty about the location of the conditional mean. The visualization can be improved by overlaying several semi-transparent bands, a graph that Wilke calls a "graded confidence band plot." You can use simulation to relate the confidence band to sampling variability. If you simulate many random samples and overlay the regression lines, they form a confidence band. The limits of the confidence band are related to the quantiles of the conditional distribution of the means.

The post Visualize uncertainty in regression predictions appeared first on The DO Loop.

A previous article discussed how to solve regression problems in which the parameters are constrained to be a specified constant (such as B1 = 1) or are restricted to obey a linear equation such as B4 = –2*B2. In SAS, you can use the RESTRICT statement in PROC REG to solve restricted least squares problems. However, if a constraint is an INEQUALITY instead of an EQUALITY, then (in general) a least squares method does not solve the problem. Therefore, you cannot use PROC REG to solve problems that have inequality constraints.

This article shows how to use PROC NLIN to solve linear regression problems whose regression coefficients are constrained by a linear inequality. Examples are B1 ≥ 3 or B1 + B2 ≥ 6.

PROC NLIN and constrained regression problems

Before solving a problem that has inequality constraints, let's see how PROC NLIN solves a problem that has linear equality constraints. The following statements use the data and model from the previous article. The model is
Y = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X4 + ε
where B3 = B1 and B4 = –2*B2. In PROC NLIN, you can replace the B3 and B4 parameters, thus leaving only three parameters in the model. If desired, you can use the ESTIMATE statement to recover the value of B4, as follows:

/* You can solve the restricted problem by using PROC NLIN */
proc nlin data=RegSim noitprint;   /* use NLINMEASURES to estimate MSE */
   parameters b0=0 b1=3 b2=1;      /* specify names of parameters and initial guess */
   model Y = b0 + b1*x1 + b2*x2 + b1*x3 - 2*b2*x4;  /* replace B3 = B1 and B4 = -2*B2 */
   estimate 'b4' -2*b2;            /* estimate B4 */
run;

The parameter estimates are the same as those produced by PROC REG in the previous article.

The syntax for PROC NLIN is natural and straightforward. The PARAMETERS statement specifies the names of the parameters in the problem; the other symbols (X1-X4, Y) refer to data set variables. You must specify an initial guess for each parameter. For nonlinear regressions, this can be a challenge, but there are some tricks you can use to choose a good initial guess. For LINEAR regressions, the initial guess shouldn't matter: you can use all zeros or all ones, if you want.

Boundary constraints

Suppose that you want to force the regression coefficients to satisfy certain inequality constraints. You can use the BOUNDS statement in PROC NLIN to specify the constraints. The simplest type of constraint is to restrict a coefficient to a half-interval or interval. For example, the following call to PROC NLIN restricts B1 ≥ 3 and restricts B2 to the interval [0, 4].

/* INEQUALITY constraints on the regression parameters */
proc nlin data=RegSim;
   parameters b0=0 b1=3 b2=1;      /* initial guess */
   bounds b1 >= 3, 0 <= b2 <= 4;
   model Y = b0 + b1*x1 + b2*x2 + b1*x3 - 2*b2*x4;
   ods select ParameterEstimates;
run;

The solution that minimizes the residual sum of squares (subject to the constraints) places the B1 parameter on the constraint B1=3. Therefore, a standard error is not available for that parameter estimate.
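The geometry behind this behavior is easiest to see in one dimension. For a model with a single coefficient, the residual sum of squares is a convex parabola in that coefficient, so if the unconstrained minimizer violates a bound, the constrained minimizer sits exactly on the bound. The following Python sketch (an illustration with made-up data, not the PROC NLIN computation) demonstrates this clamping behavior:

```python
import numpy as np

rng = np.random.default_rng(0)

# one-parameter model y = b*x; the true slope 2.5 violates the bound b >= 3
x = rng.uniform(1, 10, size=50)
y = 2.5*x + rng.normal(0, 1, size=50)

b_ols = np.dot(x, y) / np.dot(x, x)   # unconstrained least squares slope (~2.5)
b_con = max(b_ols, 3.0)               # convex 1-D problem: clamp to the bound

# sanity check: grid search over feasible slopes finds the same minimizer
grid = np.linspace(3.0, 6.0, 3001)
rss = [np.sum((y - b*x)**2) for b in grid]
b_grid = grid[np.argmin(rss)]
print(b_ols, b_con, b_grid)
```

Because the constrained solution lands exactly on the boundary, the estimate is not an interior stationary point, which is why no standard error is reported for it.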

You can also use the BOUNDS statement to enforce simple relationships between parameters. For example, you can specify
bounds b1 >= b2;
to specify that the B1 coefficient must be greater than or equal to the B2 parameter.

More general linear constraints

The BOUNDS statement is for simple relationships. For more complicated relationships, you might need to reparameterize the model. For example, the BOUNDS statement does not accept the following syntax:
bounds b1 >= 2*b2; /* INVALID SYNTAX! */
However, you can reparameterize the problem by introducing a new parameter C = 2*B2. You can then systematically substitute (C/2) everywhere that B2 appears. For example, the following statements show the constrained model in terms of the new parameter, C. You can use the ESTIMATE statement to find the estimates for the original parameter.

/* let c = 2*b2 be the new parameter. Then substitute c/2 for b2 in the MODEL statement: */
proc nlin data=RegSim;
   parameters b0=0 b1=3 c=2;       /* initial guess */
   bounds b1 >= c;
   model Y = b0 + b1*x1 + (c/2)*x2 + b1*x3 - c*x4;
   estimate 'b2' c/2;              /* estimate original parameter */
run;

The output from this procedure is not shown.

You can use a similar trick to handle linear constraints such as B1 + B2 ≥ 6. Move the B2 term to the right side of the inequality and define C = 6 – B2. The constraint becomes B1 ≥ C and on the MODEL statement you can substitute (6–C) everywhere that B2 appears, as follows:

/* Define c = 6 - b2 so that b2 = 6 - c */
proc nlin data=RegSim;
   parameters b0=0 b1=5 c=5;
   bounds b1 >= c;
   model Y = b0 + b1*x1 + (6-c)*x2 + b1*x3 - 2*(6-c)*x4;
   estimate 'b2' 6-c;              /* estimate original parameter */
   ods select ParameterEstimates AdditionalEstimates;
run;
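For a single linear inequality and a least squares objective, there is a useful fact behind these reparameterizations: if the unconstrained solution violates the inequality, the constrained solution lies exactly on the boundary, where it can be found by solving the corresponding equality-constrained problem with a Lagrange multiplier. The following Python sketch (made-up data, numpy only, no intercept for brevity; it is not the PROC NLIN computation) illustrates the idea for the constraint b1 + b2 ≥ 6:

```python
import numpy as np

rng = np.random.default_rng(42)

# simulate y = 2*x1 + 3*x2 + eps, so the unconstrained fit has b1+b2 near 5 (< 6)
n = 200
X = rng.normal(size=(n, 2))
y = 2*X[:, 0] + 3*X[:, 1] + rng.normal(0, 1, size=n)

XtX, Xty = X.T @ X, X.T @ y
b_ols = np.linalg.solve(XtX, Xty)      # unconstrained; violates b1 + b2 >= 6

# the constraint is violated, so solve on the boundary b1 + b2 = 6
# via the augmented (KKT) system [[X'X, L'], [L, 0]] [b; lambda] = [X'y; c]
L = np.array([[1.0, 1.0]])
c = np.array([6.0])
A = np.block([[XtX, L.T], [L, np.zeros((1, 1))]])
z = np.concatenate([Xty, c])
b_con = np.linalg.solve(A, z)[:2]
print(b_ols.sum(), b_con.sum())        # ~5 vs exactly 6
```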

Summary

In summary, you can use the NLIN procedure to solve linear regression problems that have linear constraints among the coefficients. Each equality constraint enables you to eliminate one parameter in the MODEL statement. You can use the BOUNDS statement to specify simple inequality constraints. For more complicated constraints, you can reparametrize the model and again use the BOUNDS statement to specify the constraints. If desired, you can use the ESTIMATE statement to estimate the parameters in the original model.

The post Regression with inequality constraints on parameters appeared first on The DO Loop.

A data analyst recently asked a question about restricted least square regression in SAS. Recall that a restricted regression puts linear constraints on the coefficients in the model. Examples include forcing a coefficient to be 1 or forcing two coefficients to equal each other. Each of these problems can be solved by using PROC REG in SAS. The analyst's question was, "Why can I use PROC REG to solve a restricted regression problem when the restriction involves EQUALITY constraints, but PROC REG can't solve a regression problem that involves an INEQUALITY constraint?"

This article shows how to use the RESTRICT statement in PROC REG to solve a restricted regression problem that has linear constraints on the parameters. I also use geometry and linear algebra to show that solving an equality-constrained regression problem is mathematically similar to solving an unrestricted least squares system of equations. This explains why PROC REG can solve equality-constrained regression problems. In a future article, I will show how to solve regression problems that involve inequality constraints on the parameters.

The geometry of restricted regression

The linear algebra for restricted least squares regression gets messy, but the geometry is easy to picture. A schematic depiction of restricted regression is shown to the right.

Geometrically, ordinary least-squares (OLS) regression is the orthogonal projection of the observed response (Y) onto the column space of the design matrix. (For continuous regressors, this is the span of the X variables, plus an "intercept column.") If you introduce equality constraints among the parameters, you are restricting the solution to a linear subspace of the span (shown in green). But the geometry doesn't change. The restricted least squares (RLS) solution is still a projection of the observed response, but this time onto the restricted subspace.

In terms of linear algebra, the challenge is to write down the projection matrix for the restricted problem. As shown in the diagram, you can obtain the RLS solution in two ways:

• Starting from Y: You can directly project the Y vector onto the restricted subspace. The linear algebra for this method is straightforward and is often formulated in terms of Lagrange multipliers.
• Starting from the OLS solution: If you already know the OLS solution, you can project that solution vector onto the restricted subspace. Writing down the projection matrix is complicated, so I refer you to the online textbook Practical Econometrics and Data Science by A. Buteikis.

Restricted models and the TEST statement in PROC REG

Let's start with an example, which is based on an example in the online textbook by A. Buteikis. The following SAS DATA step simulates data from a regression model
Y = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X4 + ε
where B3 = B1 and B4 = –2*B2 and ε ~ N(0, 3) is a random error term.

/* Model: Y = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X4 where B3 = B1 and B4 = -2*B2 */
data RegSim;
array beta[0:4] (-5, 2, 3, 2, -6);  /* parameter values for regression coefficients */
N = 100;                            /* sample size */
call streaminit(123);
do i = 1 to N;
   x1 = rand("Normal", 5, 2);       /* X1 ~ N(5,2) */
   x2 = rand("Integer", 1, 50);     /* X2 ~ random integer 1:50 */
   x3 = (i-1)*10/(N-1);             /* X3 = evenly spaced values in [0, 10] */
   x4 = rand("Normal", 10, 3);      /* X4 ~ N(10,3) */
   eps = rand("Normal", 0, 3);      /* RMSE = 3 */
   Y = beta[0] + beta[1]*x1 + beta[2]*x2 + beta[3]*x3 + beta[4]*x4 + eps;
   output;
end;
run;

Although there are five parameters in the model (B0-B4), there are only three free parameters because the values of B3 and B4 are determined by the values of B1 and B2, respectively.

If you suspect that a parameter in your model is a known constant or is equal to another parameter, you can use the RESTRICT statement in PROC REG to incorporate that restriction into the model. Before you do, however, it is usually a good idea to perform a statistical test to see if the data support the model. For this example, you can use the TEST statement in PROC REG to hypothesize that B3 = B1 and B4 = –2*B2. Recall that the syntax for the TEST statement uses the variable names (X1-X4) to represent the coefficients of the variables. The following call to PROC REG carries out this analysis:

proc reg data=RegSim plots=none;
   model Y = x1 - x4;
   test x3 = x1, x4 = -2*x2;
run;

The output shows that the parameter estimates for the simulated data are close to the parameter values used to generate the data. Specifically, the root MSE is close to 3, which is the standard deviation for the error term. The Intercept estimate is close to -5. The coefficients for the X1 and X3 terms are each approximately 2. The coefficient for X4 is approximately –2 times the coefficient of X2. The F test for the test of hypotheses has a large p-value, so you should not reject the hypotheses that B3 = B1 and B4 = –2*B2.

The RESTRICT statement in PROC REG

If the TEST statement fails to reject the hypotheses, you can use the RESTRICT statement to recompute the parameter estimates subject to the constraints. You could rerun the entire analysis or, if you are running SAS Studio in interactive mode, you can simply submit a RESTRICT statement and PROC REG will compute the new parameter estimates without rereading the data:

   restrict x3 = x1, x4 = -2*x2;   /* restricted least squares (RLS) */
   print;
quit;

The new parameter estimates enforce the restrictions among the regression coefficients. In the new model, the coefficients for X1 and X3 are exactly equal, and the coefficient for X4 is exactly –2 times the coefficient for X2.

Notice that the ParameterEstimates table is augmented by two rows labeled "RESTRICT". These rows have negative degrees of freedom because they represent constraints. The "Parameter Estimate" column shows the values of the Lagrange multipliers that are used to enforce equality constraints while solving the least squares system. Usually, a data analyst does not care about the actual values in these rows, although the Wikipedia article on Lagrange multipliers discusses ways to interpret the values.

The linear algebra of restricted regression

This section shows the linear algebra behind the restricted least squares solution by using SAS/IML. Recall that the usual way to compute the unrestricted OLS solution is to solve the "normal equations" (X`*X)*b = X`*Y for the parameter estimates, b. Although textbooks often solve this equation by using the inverse of the X`X matrix, it is more efficient to use the SOLVE function in SAS/IML to solve the equation, as follows:

proc iml;
varNames = {x1 x2 x3 x4};
use RegSim;                        /* read the data */
read all var varNames into X;
read all var "Y";
close;
X = j(nrow(X), 1, 1) || X;         /* add intercept column to form design matrix */
XpX = X`*X;

/* Solve the unrestricted OLS system */
b_ols = solve(XpX, X`*Y);          /* normal equations (X`*X)*b = X`*Y */
ParamNames = "Intercept" || varNames;
print b_ols[r=ParamNames F=10.5];
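Outside SAS/IML, the same normal-equations computation takes a few lines of numpy. This sketch (with made-up data whose coefficients mimic the example) solves X'X b = X'y directly, without forming an explicit inverse, and confirms that the result agrees with a standard least squares routine:

```python
import numpy as np

rng = np.random.default_rng(123)

# design matrix with an explicit intercept column, as in the IML code
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
beta = np.array([-5.0, 2.0, 3.0, 2.0, -6.0])
y = X @ beta + rng.normal(0, 3, size=n)

# solve the normal equations X'X b = X'y (no explicit inverse)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# cross-check against numpy's least squares solver
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.max(np.abs(b_ols - b_lstsq)))   # tiny
```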

The OLS solution is equal to the first (unrestricted) solution from PROC REG.

If you want to require that the solution satisfies B3 = B1 and B4 = –2*B2, then you can augment the X`X matrix to enforce these constraints. If L*b = c is the matrix equation for the linear constraints and λ is the vector of Lagrange multipliers, then the augmented system is
$\begin{pmatrix} X^{\prime}X & L^{\prime} \\ L & 0 \end{pmatrix} \begin{pmatrix} b_{\small\mathrm{RLS}} \\ \lambda \end{pmatrix} = \begin{pmatrix} X^{\prime}Y \\ c \end{pmatrix}$

The following statements solve the augmented system:

/* Direct method: Find b_RLS by solving the augmented system */
/*   Int X1  X2  X3  X4 */
L = {0   1   0  -1   0,            /*    X1 - X3 = 0 */
     0   0  -2   0  -1};           /* -2*X2 - X4 = 0 */
c = {0, 0};

zero = j(nrow(L), nrow(L), 0);
A = (XpX || L`  ) //
    (L   || zero);
z = X`*Y // c;
b_rls = solve(A, z);
rNames = ParamNames || {'Lambda1' 'Lambda2'};
print b_rls[r=rNames F=10.5];

The parameter estimates are the same as are produced by using PROC REG. You can usually ignore the last two rows, which are the values of the Lagrange multipliers that enforce the constraints.

By slogging through some complicated linear algebra, you can also obtain the restricted least squares solution from the OLS solution and the constraint matrix (L). The following equations use the formulas in the online textbook by A. Buteikis to compute the restricted solution:

/* Adjust the OLS solution to obtain the RLS solution:
      b_RLS = b_OLS - b_adjustment
   Follows http://web.vu.lt/mif/a.buteikis/wp-content/uploads/PE_Book/4-4-Multiple-RLS.html */
RA_1 = solve(XpX, L`);             /* (X`X)^{-1} * L` */
RA_3 = solve(L*RA_1, L*b_ols - c);
b_adj = RA_1 * RA_3;
b_rls = b_ols - b_adj;
print b_rls[r=ParamNames F=10.5];
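Both IML computations can be mirrored in numpy to verify that they agree. The sketch below (made-up data that imposes the same structure, B3 = B1 and B4 = –2*B2, on the true coefficients) solves the augmented system and also applies the adjustment formula to the OLS solution; the two restricted solutions coincide and satisfy L*b = c:

```python
import numpy as np

rng = np.random.default_rng(123)

# design matrix: intercept + X1-X4; true coefficients satisfy the constraints
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = X @ np.array([-5.0, 2.0, 3.0, 2.0, -6.0]) + rng.normal(0, 3, size=n)
XtX, Xty = X.T @ X, X.T @ y
b_ols = np.linalg.solve(XtX, Xty)

# constraints L b = c:  X1 - X3 = 0  and  -2*X2 - X4 = 0
L = np.array([[0., 1., 0., -1., 0.],
              [0., 0., -2., 0., -1.]])
c = np.zeros(2)

# method 1: solve the augmented (KKT) system [[X'X, L'], [L, 0]]
A = np.block([[XtX, L.T], [L, np.zeros((2, 2))]])
b_rls1 = np.linalg.solve(A, np.concatenate([Xty, c]))[:5]

# method 2: project the OLS solution onto the constraint set
RA1 = np.linalg.solve(XtX, L.T)                 # (X'X)^{-1} L'
RA3 = np.linalg.solve(L @ RA1, L @ b_ols - c)
b_rls2 = b_ols - RA1 @ RA3

print(np.max(np.abs(b_rls1 - b_rls2)))   # tiny
print(L @ b_rls1)                        # ~ [0, 0]
```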

This answer agrees with the previous answer but does not compute the Lagrange multipliers. Instead, it computes the RLS solution as the projection of the OLS solution onto the restricted linear subspace.

Summary

This article shows how to use the RESTRICT statement in PROC REG to solve a restricted regression problem that involves linear constraints on the parameters. Solving an equality-constrained regression problem is very similar to solving an unrestricted least squares system of equations. Geometrically, they are both projections onto a linear subspace. Algebraically, you can solve the restricted problem directly or as the projection of the OLS solution.

Although changing one of the equality constraints into an INEQUALITY seems like a minor change, it means that the restricted region is no longer a linear subspace. Ordinary least squares can no longer solve the problem because the solution might not be an orthogonal projection. The next article discusses ways to solve problems where the regression coefficients are related by linear inequalities.

The post Restricted least squares regression in SAS appeared first on The DO Loop.

A previous article shows how to interpret the collinearity diagnostics that are produced by PROC REG in SAS. The process involves scanning down numbers in a table in order to find extreme values. This can be a tedious and error-prone process. Friendly and Kwan (2009) compare this task to a popular picture book called Where's Waldo? in which children try to find one particular individual (Waldo) in a crowded scene that involves hundreds of people. The game is fun for children, but less fun for a practicing analyst who is trying to discover whether a regression model suffers from severe collinearities in the data.

Friendly and Kwan suggest using visualization to turn a dense table of numbers into an easy-to-read graph that clearly displays the collinearities, if they exist. Friendly and Kwan (henceforth, F&K) suggest several different useful graphs. I decided to implement a simple graph (a discrete heat map) that is easy to create and enables the analyst to determine whether there are collinearities in the data. One version of the collinearity diagnostic heat map is shown below. (Click to enlarge.) For comparison, the table from my previous article is shown below it. The highlighted cells in the table were added by me; they are not part of the output from PROC REG.

Visualization principles

There are two important sets of elements in a collinearity diagnostics table. The first is the set of condition indices, which are displayed in the leftmost column of the heat map. The second is the set of cells that show the proportion of variance explained by each row. (However, only the rows that have a large condition index are important.) F&K make several excellent points about the collinearity diagnostic table:

• Display order: In a table, the important information is in the bottom rows. It is better to reverse-sort the table so that the largest condition indices (the important ones) are at the top.
• Condition indices: A condition number between 20 and 30 is starting to get large (F&K use 10-30). An index over 30 is generally considered large and an index that exceeds 100 is "a sign of potential disaster in estimation" (p. 58). F&K suggest using "traffic lighting" (green, yellow, and red) to color the condition indices by the severity of the collinearity. I modified their suggestion to include an orange category.
• Proportion of variance: F&K note that "the variance proportions corresponding to small condition numbers are completely irrelevant" (p. 58) and also that tables print too many decimals. "Do we really care that [a] variance proportion is 0.00006088?" Of course not! Therefore we should only display the large proportions. F&K also suggest displaying a percentage (instead of a proportion) and rounding the percentage to the nearest integer.
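These two suggestions are easy to prototype outside SAS. The following Python sketch bins a condition index into traffic-light categories and formats a variance proportion for display. The cutoff values (20, 30, 100) and the 0.5 display threshold are my assumptions based on the discussion above, not F&K's exact choices:

```python
def severity(cond_index):
    """Map a condition index to a traffic-light category.
       Bin edges (20, 30, 100) are assumptions based on the ranges above."""
    if cond_index < 20:
        return "green"    # not a concern
    elif cond_index < 30:
        return "yellow"   # starting to get large
    elif cond_index < 100:
        return "orange"   # large; collinearity likely harmful
    else:
        return "red"      # "a sign of potential disaster in estimation"

def display_proportion(p, cutoff=0.5):
    """Show a variance proportion as a rounded percentage, or blank if it is
       too small to matter. The 0.5 cutoff is an assumption."""
    return f"{round(100 * p)}" if p >= cutoff else ""
```

For example, `display_proportion(0.00006088)` produces an empty cell, while `display_proportion(0.87)` produces the label "87".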

A discrete heat map to visualize collinearity diagnostics

There are many ways to visualize the Collinearity Diagnostics table. F&K use traffic lighting for the condition numbers and a bubble plot for the proportion of variance entries. Another choice would be to use a panel of bar charts for the proportion of variance. However, I decided to use a simple discrete heat map. The following list describes the main steps to create the plot. You can download the complete SAS program that creates the plot and modify it (if desired) to use with your own data. For each step, I link to a previous article that describes more details about how to perform the step.

1. Use the ODS OUTPUT statement to save the Collinearity Diagnostics table to a data set.
2. Use PROC FORMAT to define a format. The format converts the table values into discrete values. The condition indices are in the range [1, ∞) whereas the values for the proportion of variance are in the range [0, 1). Therefore you can use a single format that maps these values into 'low', 'medium', and 'high' values.
3. The HEATMAPPARM statement in PROC SGPLOT is designed to work with data in "long format." Therefore convert the Collinearity Diagnostics data set from wide form to long form.
4. Create a discrete attribute map that maps categories to colors.
5. Use the HEATMAPPARM statement in PROC SGPLOT to create a discrete heat map that visualizes the collinearity diagnostics. Overlay (rounded) values for the condition indices and the important (relatively large) values of the proportion of variance.
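Step 3, the wide-to-long conversion, is worth a closer look because the HEATMAPPARM statement requires one row per cell. Outside SAS, the same reshaping can be sketched with the pandas `melt` function; the column names and numeric values below are hypothetical stand-ins for the Collinearity Diagnostics table:

```python
import pandas as pd

# Hypothetical wide-form diagnostics: one row per principal component,
# one column per variable's proportion of variance.
wide = pd.DataFrame({
    "CondIndex": [1.0, 8.4, 33.8],
    "Intercept": [0.00, 0.70, 0.29],
    "Age":       [0.00, 0.71, 0.28],
})

# Convert to long form: one row per (component, variable) cell.
long = wide.melt(id_vars="CondIndex", var_name="Variable", value_name="Proportion")

# Reverse-sort so the largest condition indices appear at the top of the plot.
long = long.sort_values("CondIndex", ascending=False).reset_index(drop=True)
```

The long-form table has one row per heat-map cell, which is exactly the layout that HEATMAPPARM (or any cell-based plotting routine) expects.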

The discrete heat map enables you to draw the same conclusions as the original collinearity diagnostics table. However, whereas using the table is akin to playing "Where's Waldo," the heat map makes it apparent that the most severe collinearity (top row; red condition index) is between the RunPulse and MaxPulse variables. The second most severe collinearity (second row from top; orange condition index) is between the Intercept and the Age variable. None of the remaining rows have two or more large cells for the proportion of variance.

You can download the SAS program that creates the collinearity plot. It would not be hard to turn it into a SAS macro, if you intend to use it regularly.

References

Friendly, M., & Kwan, E. (2009). "Where's Waldo? Visualizing collinearity diagnostics." The American Statistician, 63(1), 56-65. https://doi.org/10.1198/tast.2009.0012

The post Visualize collinearity diagnostics appeared first on The DO Loop.

A SAS programmer wanted to create a graph that illustrates how Deming regression differs from ordinary least squares regression. The main idea is shown in the panel of graphs below.

• The first graph shows the geometry of least squares regression when we regress Y onto X. ("Regress Y onto X" means "use values of X to predict Y.") The residuals for the model are displayed as vectors that show how the observations are projected onto the regression line. The projection is vertical when we regress Y onto X.
• The second graph shows the geometry when we regress X onto Y. The projection is horizontal.
• The third graph shows the perpendicular projection of both X and Y onto the identity line. This is the geometry of Deming regression.

To create these graphs, you need to answer two questions:

1. Given any line and any point in the plane, how do you find the location on the line that is closest to the point? This location is the perpendicular projection of the point onto the line.
2. How do you use the SGPLOT procedure in SAS to create the graphs that show the projections of points onto lines?

The data for the examples are shown below:

data Have;
input x y @@;
datalines;
0.5 0.6  0.6 1.4  1.4 3.0  1.7 1.4  2.2 1.7
2.4 2.1  2.4 2.4  3.0 3.3  3.1 2.5
;

The projection of a point onto a line

Assume that you know the slope and intercept of a line: y = m*x + b. You can use calculus to show that the projection of the point (x0, y0) onto the line is the point (xL, yL) where
xL = (x0 + m*(y0 − b)) / (1 + m²) and yL = m*xL + b.

To derive this formula, you need to solve for the point on the line that minimizes the distance from (x0, y0) to the line. Let (x, m*x + b) be any point on the line. We want to find a value of x so that the distance from (x0, y0) to (x, m*x + b) is minimized. The solution that minimizes the distance also minimizes the squared distance, so define the squared-distance function
f(x) = (x − x0)² + (m*x + b − y0)².
To find the location of the minimum for this function, set the derivative equal to zero and solve for the value of x:

• f′(x) = 2(x − x0) + 2m*(m*x + b − y0)
• Set f′(x) = 0 and solve for x. The solution is the value xL = (x0 + m*(y0 − b)) / (1 + m²), which minimizes the distance from the point to the line.
• Plug xL into the formula for the line to find the corresponding vertical coordinate on the line: yL = m * xL + b.
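The formula is easy to verify numerically. The following Python sketch (my own check, not part of the article's SAS code) projects one of the data points onto the line y = 0.8x + 0.4 and confirms that the result lies on the line and that the residual vector is perpendicular to the line's direction vector (1, m):

```python
def project_to_line(x0, y0, m, b):
    """Return the point on the line y = m*x + b that is closest to (x0, y0),
       using the formula derived from setting f'(x) = 0."""
    xL = (x0 + m * (y0 - b)) / (1 + m**2)
    yL = m * xL + b
    return xL, yL

# Project the data point (1.4, 3.0) onto the line y = 0.8*x + 0.4
xL, yL = project_to_line(1.4, 3.0, 0.8, 0.4)

# The projected point lies on the line ...
assert abs(yL - (0.8 * xL + 0.4)) < 1e-12
# ... and the residual vector (x0-xL, y0-yL) is perpendicular to the line's
# direction vector (1, m), so their dot product is zero.
assert abs((1.4 - xL) * 1 + (3.0 - yL) * 0.8) < 1e-12
```

Setting m=1 and b=0 gives the special case of projecting onto the identity line, which is the geometry of Deming regression.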

You can use the previous formulas to write a simple SAS DATA step that projects each observation onto a specified line. (For convenience, I put the value of the slope (m) and intercept (b) into macro variables.) The following DATA step projects a set of points onto the line y = m*x + b. You can use PROC SGPLOT to create a scatter plot of the observations. Use the VECTOR statement to draw the projections of the points onto the line.

/* projection onto general line of the form y = &m*x + &b */
%let b = 0.4;
%let m = 0.8;
data Want;
set Have;
xL = (x + &m *(y - &b)) / (1 + &m**2);
yL = &m * xL + &b;
run;

title "Projection onto Line y=&m x + &b";
proc sgplot data=Want aspect=1 noautolegend;
scatter x=x y=y;
vector x=xL y=yL / xorigin=x yorigin=y;  /* use the NOARROWHEADS option to suppress the arrow heads */
lineparm x=0 y=&b slope=&m / lineattrs=(color=black);
xaxis grid; yaxis grid;
run;

You can get the graph for Deming regression by setting b=0 and m=1 in the previous formulas and program.

In summary, you can use that math you learned in high school to find the perpendicular projection of a point onto a line. You can then use the VECTOR statement in PROC SGPLOT in SAS to create a graph that illustrates the projection. Such a graph is useful for comparing different kinds of regressions, such as comparing least-squares and Deming regression.

The post Visualize residual projections for linear regression appeared first on The DO Loop.

In a previous article, I showed how to perform collinearity diagnostics in SAS by using the COLLIN option in the MODEL statement in PROC REG. For models that contain an intercept term, I noted that there has been considerable debate about whether the data vectors should be mean-centered prior to performing the collinearity diagnostics. In other words, if X is the design matrix used for the regression, should you use X to analyze collinearity or should you use the centered data X – mean(X)? The REG procedure provides options for either choice. The COLLIN option uses the X matrix to assess collinearity; the COLLINOINT option uses the centered data.

As Belsley (1984, p. 76) states, "centering will typically seem to improve the conditioning." However, he argues that running collinearity diagnostics on centered data "gives us information about the wrong problem." He goes on to say, "mean-centering typically removes from the data the interpretability that makes conditioning diagnostics meaningful."

This article looks at how centering the data affects the collinearity diagnostics. Throughout this article, when I say "collinearity diagnostics," I am referring to the variance-decomposition algorithm that is implemented by the COLLIN option in PROC REG, which was described in the previous article. Nothing in this article applies to the VIF or TOL options in PROC REG, which provide alternative diagnostics.

The article has two main sections:

• The mathematics behind the COLLIN (and COLLINOINT) options in PROC REG.
• An example of an ill-conditioned linear system that becomes perfectly conditioned if you center the data.

The mathematics of the COLLIN option

The COLLIN option implements the regression-coefficient variance decomposition due to Belsley and presented in Belsley, Kuh, and Welsch (1980), henceforth, BKW. The collinearity diagnostics algorithm (also known as an analysis of structure) performs the following steps:

1. Let X be the data matrix. If the model includes an intercept, X has a column of ones. BKW recommend that you NOT center X, but if you choose to center X, do it at this step. As a reminder, the COLLIN option in PROC REG does not center the data whereas the COLLINOINT option centers the data.
2. Scale X so that each column has unit length (unit variance).
3. Compute the singular value decomposition of X: X = UDV`.
4. From the diagonal matrix, D, compute the eigenvalues and condition indices of X`X.
5. Compute P, the matrix of variance-decomposition proportions as described in BKW, p. 105-107.
6. From this information, you can determine whether the regression model suffers from harmful collinearity.

To make sure that I completely understand an algorithm, I like to implement it in the SAS/IML matrix language. The following SAS/IML statements implement the analysis-of-structure method. You can run the program on the same Fitness data that were used in the previous article. The results are the same as from PROC REG.

proc iml;
start CollinStruct(lambda, cond, P,      /* output variables */
                   XVar, opt);           /* input variables */
   /* 1. optionally center the data */
   if upcase(opt) = "NOINT" then
      X = XVar - mean(XVar);             /* COLLINOINT: do not add intercept, center */
   else
      X = j(nrow(XVar), 1, 1) || XVar;   /* COLLIN: add intercept, do not center */

   /* 2. Scale X to have unit column length (unit variance) */
   Z = X / sqrt(X[##, ]);
   /* 3. Obtain the SVD of the scaled matrix, Z = U*diag(D)*V` */
   call svd(U, D, V, Z);
   /* 4. compute the eigenvalues and condition indices of Z`Z */
   lambda = D##2;                        /* eigenvalues are squares of the singular values */
   cond = sqrt(lambda[1] / lambda);      /* condition indices */

   /* 5. Compute P = matrix of variance-decomposition proportions */
   phi = V##2 / lambda`;       /* divide squared columns by eigenvalues (proportions of each PC) */
   phi_k = phi[,+];            /* for each variable, sum across components */
   P = T( phi / phi_k );       /* create proportions and transpose the result */
finish;

/* Perform Regression-Coefficient Variance Decomposition of BKW */
varNames = {RunTime Age Weight RunPulse MaxPulse RestPulse};
use fitness;
read all var varNames into XVar;
close;

/* perform COLLIN analysis */
call CollinStruct(lambda, cond, P, XVar, "INT");
print "----- Do Not Center the Data (COLLIN) -----", lambda cond;

/* perform COLLINOINT analysis */
call CollinStruct(lambda0, cond0, P0, XVar, "NOINT");
print "----- Center the Data (COLLINOINT) -----", lambda0 cond0;

The first table shows the eigenvalues and (more importantly) the condition indices for the original (uncentered) data. You can see that there are three condition indices that exceed 30, which indicates that there might be as many as three sets of nearly collinear relationships among the variables. My previous analysis showed two important sets of relationships:

• Age is moderately collinear with the intercept term.
• RunPulse is strongly collinear with MaxPulse.

In the second table, which analyzes the structure of the centered data, none of the condition indices are large. An interpretation of the second table is that the variables are not collinear. This contradicts the first analysis.

Why does centering the data change the condition indices so much? This phenomenon was studied by Belsley who showed that "centering will typically seem to improve the conditioning," sometimes by a large factor (Belsley, 1984, p. 76). Belsley says that the second table "gives us information about the wrong problem; namely, it tells us about the sensitivity of the LS solution... to numerically small relative changes in the centered data. And since the magnitude of the centered changes is usually uninterpretable," so also are the condition indices for the centered data.

Ill-conditioned data that becomes perfectly conditioned by centering

Belsley (1984) presents a small data set (N=20) for which the original variables are highly collinear (maximum condition index is 1,242) whereas the centered data is perfectly conditioned (all condition indices are 1). Belsley could have used a much smaller example, as shown in Chennamaneni et al. (2008). I used their ideas to construct the following example.

Suppose that the (uncentered) data matrix is
X = A + ε B
where A is any N x k matrix that has constant columns, B is a centered orthogonal matrix, and ε > 0 is a small number, such as 0.001. Clearly, X is a small perturbation of a rank-deficient and ill-conditioned matrix (A). The condition indices for X can be made arbitrarily large by making ε arbitrarily small. I think everyone would agree that the columns of X are highly collinear. As shown below, the analysis-of-structure algorithm on X reveals the collinearities.

But what happens if you center the data? Because A has constant columns, the mean-centered version of A is the zero matrix. Centering B does not change it because the columns are already centered. Therefore, the centered version of X is ε B, which is a perfectly conditioned orthogonal matrix! This construction is valid for arbitrarily large data, but the following statements implement this construction for a small 3 x 2 matrix.

A = { 1  2,       /* linearly dependent ==> infinite condition index */
      1  2,
      1  2};
B = {-1  1,       /* orthogonal columns ==> perfect condition number (1) */
      0 -2,
      1  1};
eps = 0.001;      /* the smaller eps is, the more ill-conditioned X is */
X = A + eps * B;  /* small perturbation of a rank-deficient matrix */

/* The columns of X are highly collinear.
   The columns of X - mean(X) are perfectly conditioned. */
Xc = X - mean(X);
print X, Xc;
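You can verify the claim numerically outside SAS. The following Python sketch (my own check, using the same A, B, and ε) computes BKW-style condition indices, with unit-length column scaling but no centering, for X and for the centered data:

```python
import numpy as np

# Same construction as above: X = A + eps*B, where A has constant columns
# (rank deficient) and B is a centered matrix with orthogonal columns.
A = np.array([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
B = np.array([[-1.0, 1.0], [0.0, -2.0], [1.0, 1.0]])
eps = 0.001
X = A + eps * B

def condition_indices(M):
    """Condition indices of M after scaling its columns to unit length."""
    Z = M / np.sqrt((M**2).sum(axis=0))
    d = np.linalg.svd(Z, compute_uv=False)
    return d.max() / d          # the largest index is the condition number

print(condition_indices(X))             # the largest index is huge: X is nearly rank 1
Xc = X - X.mean(axis=0)                 # centering recovers eps*B, whose columns are orthogonal
print(condition_indices(Xc))           # both indices equal 1: "perfectly conditioned"
```

Because the centered matrix is exactly ε B, whose scaled columns are orthonormal, its condition indices are exactly 1 no matter how small ε is, even though X itself is arbitrarily close to a rank-deficient matrix.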

This example reveals how "mean-centering can remove from the data the information needed to assess conditioning correctly" (Belsley, 1984, p. 74). As expected, if you run the analysis-of-structure diagnostics on this small example, the collinearity is detected in the original data. However, if you center the data prior to running the diagnostics, the results do not indicate that the data are collinear:

/* The columns of the X matrix are highly collinear, but only the
   analysis of the uncentered data reveals the collinearity */
call CollinStruct(lambda, cond, P, X, "INT");
print lambda cond P[c={"Intercept" "X1" "X2"}];   /* as ill-conditioned as you want */

call CollinStruct(lambda0, cond0, P0, X, "NOINT");
print lambda0 cond0 P0[c={"X1" "X2"}];            /* perfectly conditioned */

In the first table (which is equivalent to the COLLIN option in PROC REG), the strong collinearities are apparent. In the second table (which is equivalent to the COLLINOINT option), the collinearities are not apparent. As Belsley (1984, p. 75) says, an example like this "demonstrates that it matters very much in what form the data are taken in order to assess meaningfully the conditioning of a LS problem, and that centered data are not usually the correct form."

Geometrically, the situation is similar to the following diagram, which is part of Figure 2 on p. 7 of Chennamaneni, et al. (2008). The figure shows two highly collinear vectors (U and V). However, the mean-centered vectors are not close to being collinear and can even be orthogonal (perfectly conditioned).

U and V are highly collinear. The mean centered vectors are orthogonal.

Summary

In summary, this article has presented Belsley's arguments about why collinearity diagnostics should be performed on the original data. As Belsley (1984, p. 75) says, there are situations in which centering the data is useful, but "assessing conditioning is not one of them." The example in the second section demonstrates why Belsley's arguments are compelling: the data are clearly strongly collinear, yet if you apply the algorithm to the mean-centered data, you do not get any indication that the problem exists. The analysis of the fitness data shows that the same situation can occur in real data.

These examples convince me that the analysis-of-structure algorithm reveals collinearity only when you apply it to the original data. If you agree, then you should use the COLLIN option in PROC REG to perform collinearity diagnostics.

However, not everyone agrees with Belsley. If you are not convinced, you can use the COLLINOINT option in PROC REG to perform collinearity diagnostics on the centered data. However, be aware that the estimates for the centered data are still subject to inflated variances and sensitive parameter estimates (Belsley, 1984, p. 74), even if this diagnostic procedure does not reveal that fact.