Statistical Thinking

4月 172019

I think every course in exploratory data analysis should begin by studying Anscombe's quartet. Anscombe's quartet is a set of four data sets (N=11) that have nearly identical descriptive statistics but graphical properties. They are a great reminder of why you should graph your data. You can read about Anscombe's quartet on Wikipedia, or check out a quick visual summary by my colleague Robert Allison. Anscombe's first two examples are shown below:

The Wikipedia article states, "It is not known how Anscombe created his data sets." Really? Creating different data sets that have the same statistics sounds like a fun challenge! As a tribute to Anscombe, I decided to generate my own versions of the two data sets shown in the previous scatter plots. The first data set is linear with normal errors. The second is quadratic (without errors) and has the exact same linear fit and correlation coefficient as the first data.

Generating your own version of Anscombe's data

The Wikipedia article notes that there are "several methods to generate similar data sets with identical statistics and dissimilar graphics," but I did not look at the modern papers. I wanted to try it on my own. If you want to solve the problem on your own, stop reading now!

I used the following approach to construct the first two data sets:

  1. Use a simulation to create linear data with random normal errors: Y1 = 3 + 0.5 X + ε, where ε ~ N(0,1).
  2. Compute the regression estimates (b0 and b1) and the sample correlation for the linear data. These are the target statistics. They define three equations that the second data set must match.
  3. The response variable for the second data set is of the form Y2 = β0 + β1 X + β2 X2. There are three equations and three unknowns parameters, so we can solve a system of nonlinear equations to find β.

From geometric reasoning, there are three different solution for the β parameter: One with β2 > 0 (a parabola that opens up), one with β2 = 0 (a straight line), and one with β2 < 0 (a parabola that opens down). Since Anscombe used a downward-pointing parabola, I will make the same choice.

Construct the first data set

You can construct the first data set in a number of ways, but I choose to construct it randomly. The following SAS/IML statements construct the data, defines a helper function (LinearReg). The program computes the target values, which are the parameter estimates for a linear regression and the sample correlation for the data:

proc iml;
call randseed(12345);
x = T( do(4, 14, 0.2) );                              /* evenly spaced X */
eps = round( randfun(nrow(x), "Normal", 0, 1), 0.01); /* normal error */
y = 3 + 0.5*x + eps;                                  /* linear Y + error */
/* Helper function. Return paremater estimates for linear regression. Args are col vectors */
start LinearReg(Y, tX);
   X = j(nrow(tX), 1, 1) || tX;
   b = solve(X`*X, X`*Y);       /* solve normal equation */
   return b;
targetB = LinearReg(y, x);          /* compute regression estimates */
targetCorr = corr(y||x)[2];         /* compute sample correlation */
print (targetB`||targetCorr)[c={'b0' 'b1' 'corr'} F=5.3 L="Target"];

You can use these values as the target values. The next step is to find a parameter vector β such that Y2 = β0 + β1 X + β2 X2 has the same regression line and corr(Y2, X) has the same sample correlation. For uniqueness, set β2 < 0.

Construct the second data set

You can formulate the problem as a system of equations and use the NLPHQN subroutine in SAS/IML to solve it. (SAS supports multiple ways to solve a system of equations.) The following SAS/IML statements define two functions. Given any value for the β parameter, the first function returns the regression estimates and sample correlation between Y2 and X. The second function is the objective function for an optimization. It subtracts the target values from the estimates. The NLPHQN subroutine implements a hybrid quasi-Newton optimization routine that uses least squares techniques to find the β parameter that generates quadratic data that tries to match the target statistics.

/* Define system of simultaneous equations: */
/* This function returns linear regression estimates (b0, b1) and correlation for a choice of beta */
start LinearFitCorr(beta) global(x);
   y2 = beta[1] + beta[2]*x + beta[3]*x##2;    /* construct quadratic Y */
   b = LinearReg(y2, x);      /* linear fit */
   corr = corr(y2||x)[2];     /* sample corr */
   return ( b` || corr);      /* return row vector */
/* This function returns the vector quantity (beta - target). 
   Find value that minimizes Sum | F_i(beta)-Target+i |^2 */
start Func(beta) global(targetB, targetCorr);
   target = rowvec(targetB) || targetCorr;
   G = LinearFitCorr(beta) - target;
   return( G );              /* return row vector */
/* now let's solve for quadratic parameters so that same 
   linear fit and correlation occur */
beta0 = {-5 1 -0.1};         /* initial guess */
con = {.  .  .,              /* constraint matrix */
       0  .  0};             /* quadratic term is negative */
optn = ncol(beta0) || 0;     /* LS with 3 components || amount of printing */
/* minimize sum( beta[i] - target[i])**2 */
call nlphqn(rc, beta, "Func", beta0, optn) blc=con;  /* LS solution */
print beta[L="Optimal beta"];
/* How nearly does the solution solve the problem? Did we match the target values? */
Y2Stats = LinearFitCorr(beta);
print Y2Stats[c={'b0' 'b1' 'corr'} F=5.3];

The first output shows that the linear fit and correlation statistics for the linear and quadratic data are identical (to 3 decimal places). Anscombe would be proud! The second output shows the parameters for the quadratic response: Y2 = 4.955 + 2.566*X - 0.118*X2. The following statements create scatter plots of the new Anscombe-like data:

y2 = beta[1] + beta[2]*x + beta[3]*x##2;
create Anscombe2 var {x y y2};  append;  close;
ods layout gridded columns=2 advance=table;
proc sgplot data=Anscombe2 noautolegend;
   scatter x=X y=y;    lineparm x=0 y=3.561 slope=0.447 / clip;
proc sgplot data=Anscombe2 noautolegend;
   scatter x=x y=y2;   lineparm x=0 y=3.561 slope=0.447 / clip;
ods layout end;

Notice that the construction of the second data set depends only on the statistics for the first data. If you modify the first data set, and the second will automatically adapt. For example, you could choose the errors manually instead of randomly, and the statistics for the second data set should still match.

What about the other data sets?

You can create the other data sets similarly. For example:

  • For the data set that consist of points along a line except for one outlier, there are three free parameters. Most of the points fall along the line Y3 = a + b*X and one point is at the height Y3 = c. Therefore, you can run an optimization to solve for the values of (a, b, c) that match the target statistics.
  • I'm not sure how to formulate the requirements for the fourth data set. It looks like all but one point have coordinates (X, Y4), where X is a fixed value and the vertical mean of the cluster is c. The outlier has coordinate (a, b). I'm not sure whether the variance of the cluster is important.


In summary, you can solve a system of equations to construct data similar to Anscombe's quartet. By using this technique, you can create your own data sets that share descriptive statistics but look very different graphically.

To be fair, the technique I've presented does not enable you to reproduce Anscombe's quartet in its entirety. My data share a linear fit and sample correlation, whereas Anscombe's data share seven statistics!

Anscombe was a pioneer (along with Tukey) in using computation to assist in statistical computations. He was also enamored with the APL language. He introduced computers into the Yale statistics department in the mid-1960s. Since he published his quartet in 1973, it is possible that he used computers to assist his creation of the Anscombe quartet. However he did it, Anscombe's quartet is truly remarkable!

You can download the SAS program that creates the results in this article.

The post Create your own version of Anscombe's quartet: Dissimilar data that have similar statistics appeared first on The DO Loop.

3月 252019

An important concept in multivariate statistical analysis is the Mahalanobis distance. The Mahalanobis distance provides a way to measure how far away an observation is from the center of a sample while accounting for correlations in the data. The Mahalanobis distance is a good way to detect outliers in multivariate normal data. It is better than looking at the univariate z-scores of each coordinate because a multivariate outlier does not necessarily have extreme coordinate values.

The geometry of multivariate outliers

In classical statistics, a univariate outlier is an observation that is far from the sample mean. (Modern statistics use robust statistics to determine outliers; the mean is not a robust statistic.) You might assume that an observation that is extreme in every coordinate is also a multivariate outlier, and that is often true. However, the converse is not true: when variables are correlated, you can have a multivariate outlier that is not extreme in any coordinate!

The following schematic diagram gives the geometry of multivariate normal data. The middle of the diagram represents the center of a bivariate sample.

  • The orange elliptical region indicates a region that contains most of the observations. Because the variables are correlated, the ellipse is tilted relative to the coordinate axes.
  • For observations inside the ellipse, their Mahalanobis distance to the sample mean is smaller than some cutoff value. For observations outside the ellipse, their Mahalanobis distance to the sample mean is larger than the cutoff.
  • The green rectangle at the left and right indicate regions where the X1 coordinate is far from the X1 mean.
  • The blue rectangle at the top and bottom indicate regions where the X2 coordinate is far from the X2 mean.
Geometry of multivariate outliers showing the relationship between Mahalanobis distance and univariate outliers. The point 'A' has large univariate z scores but a small Mahalanobis distance. The point 'B' has a large Mahalanobis distance. Only 'B' is a multivariate outlier.

The diagram displays two observations, labeled "A" and "B":

  • The observation "A" is inside the ellipse. Therefore, the Mahalanobis distance from "A" to the center is "small." Accordingly, "A" is not identified as a multivariate outlier. However, notice that "A" is a univariate outlier for both the X1 and X2 coordinates!
  • The observation "B" is outside the ellipse. Therefore, the Mahalanobis distance from "B" to the center is relatively large. The observation is classified as a multivariate outlier. However, notice that "B" is not a univariate outlier for either the X1 or X2 coordinates; neither coordinate is far from its univariate mean.

The main point is this: An observation can be a multivariate outlier even though none of its coordinate values are extreme. It is the combination of values which makes an outlier unusual. In terms of Mahalanobis distance, the diagram illustrates that an observation "A" can have high univariate z scores but not have an extremely high Mahalanobis distance. Similarly, an observation "B" can have a higher Mahalanobis distance than "A" even though its z scores are relatively small.

Applications to real data

This article was motivated by a question from a SAS customer. In his data, one observation had a large Mahalanobis distance score but relatively small univariate z scores. Another observation had large z scores but a smaller Mahalanobis distance. He wondered how that was possible. His data contained four variables, but the following two-variable example illustrates his situation:

Geometry of multivariate outliers. The point 'A' has large univariate z scores but the Mahalanobis distance is only about 2.5. The point 'B' has a Mahalanobis distance of 5 and is a multivariate outlier.

The blue markers were simulated from a bivariate normal distribution with μ = (0, 0) and covariance matrix Σ = {16 32.4, 32.4 81}. The red markers were added manually. The observation marked 'B' is a multivariate outlier. The Mahalanobis distance (MD) from 'B' to the center of the sample is about 5 units. (The center is approximately at (0,0).) In contrast, the observation marked 'A' is not a multivariate outlier even though it has extreme values for both coordinates. In fact, the MD from 'A' to the center of the sample is about 2.5, or approximately half the MD of 'B'. The coordinates (x1, x2), standardized coordinates (z1, z2), and MD for both points are shown below:

You can connect the Mahalanobis distance to the probability of a multivariate random normal variable. The squared MD for multivariate normal data is distributed according to a chi-square distribution. For bivariate normal data, the probability that an observation is within t MD units of the center is 1 - exp(-t2/2). Observations like 'A' are not highly unusual. Observations that have MD ≥ 2.5 occur in exp(-2) = 4.4% of random variates from the bivariate normal distribution. In contrast, observations like 'B' are extremely rare. Observations that have MD ≥ 5 occur with probability exp(-25/2) = 3.73E-6. Yes, if you measure in Euclidean distance, 'A' is farther from the center than 'B' is, but the correlation between the variables makes 'A' much more probable. The Mahalanobis distance incorporates the correlation into the calculation of "distance."

Summary and further reading

In summary, things are not always as they appear. For univariate data, an outlier is an extreme observation. It is far from the center of the data. In higher dimensions, we need to account for correlations among variables when we measure distance. The Mahalanobis distance does that, and the examples in this post show that an observation can be "far from the center" (as measured by the Mahalanobis distance) even if none of its individual coordinates are extreme.

The following articles provide more information about Mahalanobis distance and multivariate outliers:

You can download the SAS program that generates the examples and images in this article.

The post The geometry of multivariate versus univariate outliers appeared first on The DO Loop.

3月 182019

Statisticians often emphasize the dangers of extrapolating from a univariate regression model. A common exercise in introductory statistics is to ask students to compute a model of population growth and predict the population far in the future. The students learn that extrapolating from a model can result in a nonsensical prediction, such as trillions of people or a negative number of people! The lesson is that you should be careful when you evaluate a model far beyond the range of the training data.

The same dangers exist for multivariate regression models, but they are emphasized less often. Perhaps the reason is that it is much harder to know when you are extrapolating a multivariate model. Interpolation occurs when you evaluate the model inside the convex hull of the training data. Anything else is an extrapolation. In particular, you might be extrapolating even if you score the model at a point inside the bounding box of the training data. This differs from the univariate case in which the convex hull equals the bounding box (range) of the data. In general, the convex hull of a set of points is smaller than the bounding box.

You can use a bivariate example to illustrate the difference between the convex hull of the data and the bounding box for the data, which is the rectangle [Xmin, Xmax] x [Ymin, Ymax]. The following SAS DATA step defines two explanatory variables (X and Y) and one response variable (Z). The SGPLOT procedure shows the distribution of the (X, Y) variables and colors each marker according to the response value:

data Sample;
input X Y Z @@;
10 90 22  22 76 13  22 75  7 
24 78 14  24 76 10  25 63  5
26 62 10  26 94 20  26 63 15
27 94 16  27 95 14  29 66  7
30 69  8  30 74  8
title "Response Values for Bivariate Data";
proc sgplot data=Sample;
scatter x=x y=y / markerattrs=(size=12 symbol=CircleFilled)
        colorresponse=Z colormodel=AltThreeColorRamp;
xaxis grid; yaxis grid;
Bivariate data. The data are not uniformly distributed in the bounding box of the data.

The data are observed in a region that is approximately triangular. No observations are near the lower-left corner of the plot. If you fit a response surface to this data, it is likely that you would visualize the model by using a contour plot or a surface plot on the rectangular domain [10, 30] x [62, 95]. For such a model, predicted values near the lower-left corner are not very reliable because the corner is far from the data.

In general, you should expect less accuracy when you predict the model "outside" the data (for example, (10, 60)) as opposed to points that are "inside" the data (for example, (25, 70)). This concept is sometimes discussed in courses about the design of experiments. For a nice exposition, see the course notes of Professor Rafi Hafka (2012, p. 49–59) at the University of Florida.

The convex hull of bivariate points

You can use SAS to visualize the convex hull of the bivariate observations. The convex hull is the smallest convex set that contains the observations. The SAS/IML language supports the CVEXHULL function, which computes the convex hull for a set of planar points.

You can represent the points by using an N x 2 matrix, where each row is a 2-D point. When you call the CVEXHULL function, you obtain a vector of N integers. The first few integers are positive and represent the rows of the matrix that comprise the convex hull. The (absolute value of) the negative integers represents the rows that are interior to the convex hull. This is illustrated for the sample data:

proc iml;
use Sample;  read all var {x y} into points;  close;
/* get indices of points in the convex hull in counter-clockwise order */
indices = cvexhull( points );
print (indices`)[L="indices"];  /* positive indices are boundary; negative indices are inside */

The output shows that the observation numbers (indices) that form the convex hull are {1, 6, 7, 12, 13, 14, 11}. The other observations are in the interior. You can visualize the interior and boundary points by forming a binary indicator vector that has the value 1 for points on the boundary and 0 for points in the interior. To get the indicator vector in the order of the data, you need to use the SORTNDX subroutine to compute the anti-rank of the indices, as follows:

b = (indices > 0);               /* binary indicator variable for sorted vertices */
call sortndx(ndx, abs(indices)); /* get anti-rank, which is sort index that "reverses" the order */
onBoundary = b[ndx];             /* binary indicator data in original order */
title "Convex Hull of Bivariate Data";
call scatter(points[,1], points[,2]) group=onBoundary 
             option="markerattrs=(size=12 symbol=CircleFilled)";
Convex hull: Blue points are on the convex hull and red points are in the interior

The blue points are the boundary of the convex hull whereas the red points are in the interior.

Visualize the convex hull as a polygon

You can visualize the convex hull by forming the polygon that connects the first, sixth, seventh, ..., eleventh observations. You can do this manually by using the POLYGON statement in PROC SGPLOT, which I show in the Appendix section. However, there is an easier way to visualize the convex hull. I previously wrote about SAS/IML packages and showed how to install the polygon package. The polygon package contains a module called PolyDraw, which enables you to draw polygons and overlay a scatter plot.

The following SAS/IML statements extract the positive indices and use them to get the points on the boundary of the convex hull. If the polygon package is installed, you can load the polygon package and visualize the convex hull and data:

hullNdx = indices[loc(b)];         /* get positive indices */
convexHull = points[hullNdx, ];    /* extract the convex hull, in CC order */
/* In SAS/IML 14.1, you can use the polygon package to visualize the convex hull: */
package load polygon;    /* assumes package is installed */
run PolyDraw(convexHull, points||onBoundary) grid={x y}
    markerattrs="size=12 symbol=CircleFilled";

The graph shows the convex hull of the data. You can see that it primarily occupies the upper-right portion of the rectangle. The convex hull shows the interpolation region for regression models. If you evaluate a model outside the convex hull, you are extrapolating. In particular, even though points in the lower left corner of the plot are within the bounding box of the data, they are far from the data.

Of course, if you have 5, 10 or 100 explanatory variables, you will not be able to visualize the convex hull of the data. Nevertheless, the same lesson applies. Namely, when you evaluate the model inside the bounding box of the data, you might be extrapolating rather than interpolating. Just as in the univariate case, the model might predict nonsensical data when you extrapolate far from the data.


Packages are supported in SAS/IML 14.1. If you are running an earlier version of SAS, you create the same graph by writing the polygon data and the binary indicator variable to a SAS data set, as follows:

hullNdx = indices[loc(b)];         /* get positive indices */
convexHull = points[hullNdx, ];    /* extract the convex hull, in CC order */
/* Write the data and polygon to SAS data sets. Use the POLYGON statement in PROC SGPLOT. */
p = points || onBoundary;       
poly = j(nrow(convexHull), 1, 1) || convexHull;
create TheData from p[colname={x y "onBoundary"}]; append from p;    close;
create Hull from poly[colname={ID cX cY}];         append from poly; close;
data All; set TheData Hull; run;    /* combine the data and convex hull polygon */
proc sgplot data=All noautolegend;
   polygon x=cX y=cY ID=id / fill;
   scatter x=x y=y / group=onBoundary markerattrs=(size=12 symbol=CircleFilled); 
   xaxis grid; yaxis grid;

The resulting graph is similar to the one produced by the PolyDraw modules and is not shown.

The post Interpolation vs extrapolation: the convex hull of multivariate data appeared first on The DO Loop.

2月 062019

Feature generation (also known as feature creation) is the process of creating new features to use for training machine learning models. This article focuses on regression models. The new features (which statisticians call variables) are typically nonlinear transformations of existing variables or combinations of two or more existing variables. This article argues that a naive approach to feature generation can lead to many correlated features (variables) that increase the cost of fitting a model without adding much to the model's predictive power.

Feature generation in traditional statistics

Feature generation is not new. Classical regression often uses transformations of the original variables. In the past, I have written about applying a logarithmic transformation when a variable's values span several orders of magnitude. Statisticians generate spline effects from explanatory variables to handle general nonlinear relationships. Polynomial effects can model quadratic dependence and interactions between variables. Other classical transformations in statistics include the square-root and inverse transformations.

SAS/STAT procedures provide several ways to generate new features from existing variables, including the EFFECT statement and the "stars and bars" notation. However, an undisciplined approach to feature generation can lead to a geometric explosion of features. For example, if you generate all pairwise quadratic interactions of N continuous variables, you obtain "N choose 2" or N*(N-1)/2 new features. For N=100 variables, this leads to 4950 pairwise quadratic effects!

Generated features might be highly correlated

In addition to the sheer number of features that you can generate, another problem with generating features willy-nilly is that some of the generated effects might be highly correlated with each other. This can lead to difficulties if you use automated model-building methods to select the "important" features from among thousands of candidates.

I was reminded of this fact recently when I wrote an article about model building with PROC GLMSELECT in SAS. The data were simulated: X from a uniform distribution on [-3, 3] and Y from a cubic function of X (plus random noise). I generated the polynomial effects x, x^2, ..., x^7, and the procedure had to build a regression model from these candidates. The stepwise selection method added the x^7 effect first (after the intercept). Later it added the x^5 effect. Of course, the polynomials x^7 and x^5 have a similar shape to x^3, but I was surprised that those effects entered the model before x^3 because the data were simulated from a cubic formula.

After thinking about it, I realized that the odd-degree polynomial effects are highly correlated with each other and have high correlations with the target (response) variable. The same is true for the even-degree polynomial effects. Here is the DATA step that generates 1000 observations from a cubic regression model, along with the correlations between the effects (x1-x7) and the target variable (y):

%let d = 7;
%let xMean = 0;
data Poly;
call streaminit(54321);
array x[&d];
do i = 1 to 1000;
   x[1] = rand("Normal", &xMean);  /* x1 ~ U(-3, 3] */
   do j = 2 to &d;
      x[j] = x[j-1] * x[1];        /* x[i] = x1**i, i = 2..7 */
   y = 2 - 1.105*x1 - 0.2*x2 + 0.5*x3 + rand("Normal");  /* response is cubic function of x1 */
drop i j;
proc corr data=Poly nosimple noprob;
   var y;
   with x:;
Correlations between the target variable and polynimal effects

You can see from the output that the x^5 and x^7 effects have the highest correlations with the response variable. Because the squared correlations are the R-square values for the regression of Y onto each effect, it makes intuitive sense that these effects are added to the model early in the process. Towards the end of the model-building process, the x^3 effect enters the model and the x^5 and x^7 effects are removed. To the model-building algorithm, these effects have similar predictive power because they are highly correlated with each other, as shown in the following correlation matrix:

proc corr data=Poly nosimple noprob plots(MAXPOINTS=NONE)=matrix(NVAR=ALL);
   var x:;
Correlations among polynomial effects

I've highlighted certain cells in the lower triangular correlation matrix to emphasize the large correlations. Notice that the correlations between the even- and odd-degree effects are close to zero and are not highlighted. The table is a little hard to read, but you can use PROC IML to generate a heat map for which cells are shaded according to the pairwise correlations:

Heat map of correlations among polynomial effects

This checkerboard pattern shows the large correlations that occur between the polynomial effects in this problem. This image shows that many of the generated features do not add much new information to the set of explanatory variables. This can also happen with other transformations, so think carefully before you generate thousands of new features. Are you producing new effects or redundant ones?

Generated features are not independent

Notice that the generated effects are not statistically independent. They are all generated from the same X variable, so they are functionally dependent. In fact, a scatter plot matrix of the data will show that the pairwise relationships are restricted to parametric curves in the plane. The graph of (x, x^2) is quadratic, the graph of (x, x^3) is cubic, the graph of (x^2, x^3) is an algebraic cusp, and so forth. The PROC CORR statement in the previous section created a scatter plot matrix, which is shown below: Again, I have highlighted cells that have highly correlated variables.

Scatter plot matrix that shows the statistical dependencies between polynomial effects

I like this scatter plot matrix for two reasons. First, it is soooo pretty! Second, it visually demonstrates that two variables that have low correlation are not necessarily independent.


Feature generation is important in many areas of machine learning. It is easy to create thousands of effects by applying transformations and generating interactions between all variables. However, the example in this article demonstrates that you should be a little cautious of using a brute-force approach. When possible, you should use domain knowledge to guide your feature generation. Every feature you generate adds complexity during model fitting, so try to avoid adding a bunch of redundant highly-correlated features.

For more on this topic, see the article "Automate your feature engineering" by my colleague, Funda Gunes. She writes: "The process [of generating new features]involves thinking about structures in the data, the underlying form of the problem, and how best to expose these features to predictive modeling algorithms. The success of this tedious human-driven process depends heavily on the domain and statistical expertise of the data scientist." I agree completely. Funda goes on to discuss SAS tools that can help data scientists with this process, such as Model Studio in SAS Visual Data Mining and Machine Learning. She shows an example of feature generation in her article and in a companion article about feature generation. These tools can help you generate features in a thoughtful, principled, and problem-driven manner, rather than relying on a brute-force approach.

The post Feature generation and correlations among features in machine learning appeared first on The DO Loop.

8月 272018

A frequent topic on SAS discussion forums is how to check the assumptions of an ordinary least squares linear regression model. Some posts indicate misconceptions about the assumptions of linear regression. In particular, I see incorrect statements such as the following:

  • Help! A histogram of my variables shows that they are not normal! I need to apply a normalizing transformation before I can run a regression....
  • Before I run a linear regression, I need to test that my response variable is normal....

Let me be perfectly clear: The variables in a least squares regression model do not have to be normally distributed. I'm not sure where this misconception came from, but perhaps people are (mis)remembering an assumption about the errors in an ordinary least squares (OLS) regression model. If the errors are normally distributed, you can prove theorems about inferential statistics such as confidence intervals and hypothesis tests for the regression coefficients. However, the normality-of-errors assumption is not required for the validity of the parameter estimates in OLS. For the details, the Wikipedia article on ordinary least squares regression lists four required assumptions; the normality of errors is listed as an optional fifth assumption.

In practice, analysts often "check the assumptions" by running the regression and then examining diagnostic plots and statistics. Diagnostic plots help you to determine whether the data reveal any deviations from the assumptions for linear regression. Consequently, this article provides a "getting started" example that demonstrates the following:

  1. The variables in a linear regression do not need to be normal for the regression to be valid.
  2. You can use the diagnostic plots that are produced automatically by PROC REG in SAS to check whether the data seem to satisfy some of the linear regression assumptions.

By the way, don't feel too bad if you misremember some of the assumptions of linear regression. Williams, Grajales, and Kurkiewicz (2013) point out that even professional statisticians sometimes get confused.

An example of nonnormal data in regression

Consider this thought experiment: Take any explanatory variable, X, and define Y = X. A linear regression model perfectly fits the data with zero error. The fit does not depend on the distribution of X or Y, which demonstrates that normality is not a requirement for linear regression.

For a numerical example, you can simulate data such that the explanatory variable is binary or is clustered close to two values. The following data shows an X variable that has 20 values near X=5 and 20 values near X=10. The response variable, Y, is approximately five times each X value. (This example is modified from an example in Williams, Grajales, and Kurkiewicz, 2013.) Neither variable is normally distributed, as shown by the output from PROC UNIVARIATE:

/* For n=1..20, X ~ N(5, 1). For n=21..40, X ~ N(10, 1).
   Y = 5*X + e, where e ~ N(0,1) */
data Have;
input X Y @@;
 3.60 16.85  4.30 21.30  4.45 23.30  4.50 21.50  4.65 23.20 
 4.90 25.30  4.95 24.95  5.00 25.45  5.05 25.80  5.05 26.05 
 5.10 25.00  5.15 26.45  5.20 26.10  5.40 26.85  5.45 27.90 
 5.70 28.70  5.70 29.35  5.90 28.05  5.90 30.50  6.60 33.05 
 8.30 42.50  9.00 45.50  9.35 46.45  9.50 48.40  9.70 48.30 
 9.90 49.80 10.00 48.60 10.05 50.25 10.10 50.65 10.30 51.20 
10.35 49.80 10.50 53.30 10.55 52.15 10.85 56.10 11.05 55.15 
11.35 55.95 11.35 57.90 11.40 57.25 11.60 57.95 11.75 61.15 
proc univariate data=Have;
   var x y;
   histogram x y / normal;

There is no need to "normalize" these data prior to performing an OLS regression, although it is always a good idea to create a scatter plot to check whether the variables appear to be linearly related. When you regress Y onto X, you can assess the fit by using the many diagnostic plots and statistics that are available in your statistical software. In SAS, PROC REG automatically produces a diagnostic panel of graphs and a table of fit statistics (such as R-squared):

/* by default, PROC REG creates a FitPlot, ResidualPlot, and a Diagnostics panel */
ods graphics on;
proc reg data=Have;
   model Y = X;

The R-squared value for the model is 0.9961, which is almost a perfect fit, as seen in the fit plot of Y versus X.

Using diagnostic plots to check the assumptions of linear regression

You can use the graphs in the diagnostics panel to investigate whether the data appears to satisfy the assumptions of least squares linear regression. The panel is shown below (click to enlarge).

The first column in the panel shows graphs of the residuals for the model. For these data and for this model, the graphs show the following:

  • The top-left graph shows a plot of the residuals versus the predicted values. You can use this graph to check several assumptions: whether the model is specified correctly, whether the residual values appear to be independent, and whether the errors have constant variance (homoscedastic). The graph for this model does not show any misspecification, autocorrelation, or heteroscedasticity.
  • The middle-left and bottom-left graphs indicate whether the residuals are normally distributed. The middle plot is a normal quantile-quantile plot. The bottom plot is a histogram of the residuals overlaid with a normal curve. Both these graphs indicate that the residuals are normally distributed. This is evidence that you can trust the p-values for significance and the confidence intervals for the parameters.

In summary, I wrote this article to addresses two points:

  1. To dispel the myth that variables in a regression need to be normal. They do not. However, you should check whether the residuals of the model are approximately normal because normality is important for the accuracy of the inferential portions of linear regression such as confidence intervals and hypothesis tests for parameters. (A colleague mentioned to me that standard errors and hypothesis tests tend to be robust to this assumption, so a modest departure from normality is often acceptable.)
  2. To show that the SAS regression procedures automatically provide many graphical diagnostic plots that you can use to assess the fit of the model and check some assumptions for least squares regression. In particular, you can use the plots to check the independence of errors, the constant variance of errors, and the normality of errors.


There have been many excellent books and papers that describe the various assumptions of linear regression. I don't feel a need to rehash what has already been written, In addition to the Wikipedia article about ordinary linear regression, I recommend the following:

The post On the assumptions (and misconceptions) of linear regression appeared first on The DO Loop.

8月 222018

A SAS programmer recently asked how to interpret the "standardized regression coefficients" as computed by the STB option on the MODEL statement in PROC REG and other SAS regression procedures. The SAS documentation for the STB option states, "a standardized regression coefficient is computed by dividing a parameter estimate by the ratio of the sample standard deviation of the dependent variable to the sample standard deviation of the regressor." Although correct, this definition does not provide an intuitive feeling for how to interpret the standardized regression estimates. This article uses SAS to demonstrate how parameter estimates for the original variables are related to parameter estimates for standardized variables. It also derives how regression coefficients change after a linear transformation of the variables.

Proof by example

One of my college physics professors used to smile and say "I will now prove this by example" when he wanted to demonstrate a fact without proving it mathematically. This section uses PROC STDIZE and PROC REG to "prove by example" that the standardized regression estimates for data are equal to the estimates that you obtain by standardizing the data. The following example uses continuous response and explanatory variables, but there is a SAS Usage Note that describes how to standardize classification variables.

The following call to PROC REG uses the STB option to compute the standardized parameter estimates for a model that predicts the weights of 19 students from heights and ages:

proc reg data=Sashelp.Class plots=none;
   Orig: model Weight = Height Age / stb;
   ods select ParameterEstimates;

The last column is the result of the STB option on the MODEL statement. You can get the same numbers by first standardizing the data and then performing a regression on the standardized variables, as follows:

/* Put original and standardized variables into the output data set.
   Standardized variables have the names 'StdX' where X was the name of the original variable. 
   METHOD=STD standardizes variables according to StdX = (X - mean(X)) / std(X) */
proc stdize data=Sashelp.Class out=ClassStd method=std OPREFIX SPREFIX=Std;
proc reg data=ClassStd plots=none;
   Std: model StdWeight = StdHeight StdAge;
   ods select ParameterEstimates;

The parameter estimates for the standardized data are equal to the STB estimates for the original data. Furthermore, the t values and p-values for the slope parameters are equivalent because these statistics are scale- and translation-invariant. Notice, however, that scale-dependent statistics such as standard errors and covariance of the betas will not be the same for the two analyses.

Linear transformations of random variables

Mathematically, you can use a algebra to understand how linear transformations affect the relationship between two linearly dependent random variables. Suppose X is a random variable and Y = b0 + b1*X for some constants b0 and b1. What happens to this linear relationship if you apply linear transformations to X and Y?

Define new variables U = (Y - cY) / sY and V = (X - cX) / sX. If you solve for X and Y and plug the expressions into the equation for the linear relationship, you find that the new random variables are related by
U = (b0 + b1*cX - cY) / sY + b1*(sX/sY)*V.
If you define B0 = (b0 + b1*cX - cY) / sY and B1 = b1*(sX/sY), then U = B0 + B1*V, which shows that the transformed variables (U and V) are linearly related.

The physical significance of this result is that linear relationships persist no matter what units you choose to measure the variables. For example, if X is measured in inches and Y is measured in pounds, then the quantities remain linearly related if you measure X in centimeters and measure Y in "kilograms from 20 kg."

The effect of standardizing variables on regression estimates

The analysis in the previous section holds for any linear transformation of linearly related random variables. But suppose, in addition, that

  • U and V are standardized versions of Y and X, respectively. That is, cY and cX are the sample means and sY and sX are the sample standard deviations.
  • The parameters b0 and b1 are the regression estimates for a simple linear regression model.

For simple linear regression, the intercept estimate is b0 = cY - b1*cY, which implies that B0 = 0. Furthermore, the coefficient B1 = b1*(sX/sY) is the original parameter estimate "divided by the ratio of the sample standard deviation of the dependent variable to the sample standard deviation of the regressor," just as the PROC REG documentation states and just as we saw in the PROC REG output in the previous section. Thus the STB option on the MODEL statement does not need to standardize any data! It produces the standardized estimates by setting the intercept term to 0 and dividing the parameter estimates by the ratio of standard deviations, as noted in the documentation. (A similar proof handles multiple explanatory variables.)

Interpretation of the regression coefficients

For the original (unstandardized) data, the intercept estimate predicts the value of the response when the explanatory variables are all zero. The regression coefficients predict the change in the response for one unit change in an explanatory variable. The "change in response" depends on the units for the data, such as kilograms per centimeter.

The standardized coefficients predict the number of standard deviations that the response will change for one STANDARD DEVIATION of change in an explanatory variable. The "change in response" is a unitless quantity. The fact that the standardized intercept is 0 indicates that the predicted value of the (centered) response is 0 when the model is evaluated at the mean values of the explanatory variables.

In summary, standardized coefficients are the parameter estimates that you would obtain if you standardize the response and explanatory variables by centering and scaling the data. A standardized parameter estimate predicts the change in the response variable (in standard deviations) for one standard deviation of change in the explanatory variable.

The post Standardized regression coefficients appeared first on The DO Loop.

7月 112018

In a previous article, I showed how to find the intersection (if it exists) between two line segments in the plane. There are some fun problems in probability theory that involve intersections of line segments. One is "What is the probability that two randomly chosen chords of a circle intersect?" This article shows how to create a simulation in SAS to estimate the probability.

An exact answer

For this problem, a "random chord" is defined as the line segment that joins two points chosen at random (with uniform probability) on the circle. The probability that two random chords intersect can be derived by using a simple counting argument. Suppose that you pick four points at random on the circle. Label the points according to their polar angle as p1, p2, p3, and p4. As illustrated by the following graphic, the points are arranged on the circle in one of the following three ways. Consequently, the probability that two random chords intersect is 1/3 because the chords intersect in only one of the three possible arrangements.

Possible arrangements of two random chords in the circle

A simulation in SAS

You can create a simulation to estimate the probability that two random chords intersect. The intersection of two segments can be detected by using either of the two SAS/IML modules in my article about the intersection of line segments. The following simulation generates four angles chosen uniformly at random in the interval (0, 2π). It converts those points to (x,y) coordinates on the unit circle. It then computes whether the chord between the first two points intersects the chord between the third and fourth points. It repeats this process 100,000 times and reports the proportion of times that the chords intersect.

proc iml;
/* Find the intersection between 2D line segments [p1,p2] and [q1,q2].
   This function assumes that the line segments have different slopes (A is nonsingular) */
start IntersectSegsSimple(p1, p2, q1, q2);
   b = colvec(q1 - p1); 
   A = colvec(p2-p1) || colvec(q1-q2); /* nonsingular when segments have different slopes */
   x = solve(A, b);                    /* x = (s,t) */
   if all(0<=x && x<=1) then           /* if x is in [0,1] x [0,1] */
      return (1-x[1])*p1 + x[1]*p2;    /* return intersection */
   else                                /* otherwise, segments do not intersect */
      return ({. .});                  /* return missing values */
/* Generate two random chords on the unit circle.
   Simulate the probability that they intersect  */
N = 1e5;
theta = j(N, 4);
call randseed(123456);
call randgen(theta, "uniform", 0, 2*constant('pi'));
intersect = j(N,1,0);
do i = 1 to N;
   t = theta[i,]`;                 /* 4 random U(0, 2*pi) */
   pts = cos(t) || sin(t);         /* 4 pts on unit circle */
   p1 = pts[1,];    p2 = pts[2,];
   q1 = pts[3,];    q2 = pts[4,];
   intersect[i] = all(IntersectSegsSimple(p1, p2, q1, q2) ^= .);
prob = mean(intersect);
print prob;

This simulation produces an estimate that is close to the exact probability of 1/3.

Connection to Bertrand's Paradox

This problem has an interesting connection to Bertrand's Paradox. Bertrand's paradox shows the necessity of specifying the process that is used to define the random variables in a probability problem. It turns out that there are multiple ways to define "random chords" in a circle, and the different definitions can lead to different answers to probability questions. See the Wikipedia article for an example.

For the definition of "random chords" in this problem, the density of the endpoints is uniform on the circle. After you make that choice, other distributions are determined. For example, the distribution of the lengths of 1,000 random chords is shown below. The lengths are NOT uniformly distributed! The theoretical density of the chord lengths is overlaid on the distribution of the sample. If you change the process by which chords are randomly chosen (for example, you force the lengths to be uniformly distributed), you might also change the answer to the problem, as shown in Bertrand's Paradox.

Distribution of lengths of random chords in the unit circle, generated by choosing uniformly distributed endpoints

The post The probability that two random chords of a circle intersect appeared first on The DO Loop.

3月 282018

Suppose you want to find observations in multivariate data that are closest to a numerical target value. For example, for the students in the Sashelp.Class data set, you might want to find the students whose (Age, Height, Weight) values are closest to the triplet (13, 62, 100). The way to do this is to compute a distance from each observation to the target. Unfortunately, there are many definitions of distance! Which distance should you use? This article describes and compares a few common distances and shows how to compute them in SAS.

Euclidean and other distances

The two most widely used distances are the Euclidean distance (called the L2 distance by mathematicians) and the "sum of absolute differences" distance, which is better known as the L1 distance or occasionally the taxicab distance. In one dimension, these distances are equal. However, when you have multiple coordinates, the distances are different. The Euclidean and L1 distances between (x1, x2, ..., xn) and a target vector (t1, t2, ..., tn) are defined as follows:
L1 distance: |x1-t1| + |x2-t2| + ... + |xn-tn|
Euclidean distance: sqrt( (x1-t1)2 + (x2-t2)2 + ... + (xn-tn)2 )

Both of these distances are supported in the SAS DATA step. You can use the EUCLID function to compute Euclidean distance and use the SUBABS function to compute the L1 distance. For example, the following DATA step computes the distance from each observation to the target value (Age, Height, Weight) = (13, 62, 100):

data Closest;
/* target (Age, Height, Weight) = (13, 62, 100) */
set Sashelp.Class;
EuclidDist = euclid(Age-13, Height-62, Weight-100);
L1Dist     = sumabs(Age-13, Height-62, Weight-100);
/* sort by Euclidean distance */
proc sort data=Closest out=Euclid; by EuclidDist; run;
/* plot the distances for each observation */
title "Distance to Target Values";
proc sgplot data=Euclid;
   series x=Name y=EuclidDist / curvelabel="Euclidean";  *datalabel=Weight;
   series x=Name y=L1Dist     / curvelabel="L1";         *datalabel=Weight;
   yaxis grid label="Distance";
   xaxis grid discreteorder=data;
Euclidean and L1 distances between observations and a target value

The graph shows the Euclidean and L1 distances from each student's data to the target value. The X axis is ordered by the Euclidean distance from each observation to the target value. If you add the DATALABEL=Weight or DATALABEL=Height options to the SERIES statements, you can see that the students who appear near the left side of the X axis have heights and weights that are close to the target values (Height, Weight) = (62, 100). The students who appear to the right are much taller/heavier or shorter/lighter than the target values. In particular, Joyce is the smallest student and Philip is the largest.

If you order the students by their L1 distances, you will obtain a different ordering. For example, in an L1 ranking, John and Henry would switch positions and Jeffery would be ranked 7th instead of 12th.

Distances that account for scale

There is a problem with the computations in the previous section: the variables are measured in different units, but the computations do not account for these differences. In particular, Age is an important factor in the distance computations because all student ages are within three years of the target age. In contrast, the heaviest student (Phillip) is 88 pounds more than the target weight.

The distance formula needs to account for amount of variation within each variable. If you run PROC MEANS on the data, you will discover that the standard deviation of the Age variable is about 1.5, whereas the standard deviations for the Height and Weight variables are about 5.1 and 22.8, respectively. It follows that one year should be treated as a substantial difference, whereas one pound of weight should be treated as a small difference.

A way to correct for differences among the scales of variables is to standardize the variables. In SAS, you can use PROC STDIZE to standardize the variables in a variety of ways. By default, the procedure centers each variable by subtracting the sample mean; it scales by dividing by the standard deviations. The following procedure all also writes the mean and standard deviations to a data set, which is displayed by using PROC PRINT:

proc stdize data=Sashelp.Class out=StdClass outstat=StdIn method=STD;
   var Age Height Weight;
proc print data=StdIn(obs=2);

Notice that target value is not part of the standardization! Only the data are used. However, you must convert the target value to the new coordinates before you can compute the standardized distances. You can standardize the target value by using the METHOD=IN option in PROC STDIZE to tell the procedure to transform the target value by using the location and scale values in the StdIn data set, as follows:

data Target;
   Age=13; Height=62; Weight=100;
proc stdize data=Target out=StdTarget method=in(StdIn);
   var Age Height Weight;
proc print data=StdTarget; run;

The data and the target values are now in standardized coordinates. You can therefore repeat the earlier DATA step and compute the Euclidean and L1 distances in the new coordinates. The graph of the standardized distances are shown below:

In the standardized coordinates "one unit" corresponds to one standard deviation from the mean in each variable. Thus Age is measured in units of 1.5 years, Height in units of 5.1 inches, and Weight in units of 22.8 pounds. Notice that some of the students have changed positions. Carol is still very similar to the target value, but Jeffrey is now more similar to the target value than previously thought. The smallest student (Joyce) and largest student (Philip) are still dissimilar (very distant) from the target.

Distances that account for correlation

In the previous section, different scales in the data are handled by using distances in a standardized coordinate system. This is an improvement over the unstandardized computations. However, there is one more commonly used distance computation, called the Mahalanobis distance. The Mahalanobis distance is similar to the standardized L2 distance but also accounts for correlations between the variables. I have previously discussed the meaning of Mahalanobis distance and have described how you can use the inverse Cholesky decomposition to uncorrelate variables. I won't repeat the arguments, but the following graph displays the Mahalanobis distance between each observation and the target value. Some students are in a different order than they were for the standardized distances:

If you would like to examine or modify any of my computations, you can download the SAS program that computes the various distances.

In summary, there are several reasonable definitions of distance between multivariate observations. The simplest is the raw Euclidean distance, which is appropriate when all variables are measured on the same scale. The next simplest is the standardized distance, which is appropriate if the scales of the variables are vastly different and the variables are uncorrelated. The third is the Mahalanobis distance, which becomes important if you want to measure distances in a way that accounts for correlation between variables.

For other articles about Mahalanobis distance or distance computations in SAS, see:

The post Find the distance between observations and a target value appeared first on The DO Loop.

1月 152018

Last week I got the following message:

Dear Rick: How can I create a normal distribution within a specified range (min and max)? I need to simulate a normal distribution that fits within a specified range. I realize that a normal distribution is by definition infinite... Are there any alternatives, such as a distribution that has shape and properties similar to a normal distribution, but for which I can restrict the range? - Carol

Mathematically, this request doesn't make sense, as Carol (not her real name) acknowledges. However, if you are working with a client who is not statistically savvy, you might be asked to do the impossible! How can Carol make sense of this request?

I provide a two-part answer. First I discuss general ways to choose a distribution that models a data sample. You can use these methods when you have access to the data. Then I address the specific question and present five distributions that you can use to model data that looks like a "bounded" normal distribution. These distributions can be used when you don't have access to the data.

How to choose a distribution that matches the data?

In general, statisticians are often asked to choose a model that seems to fit an observed set of data. In simulation studies, the model is used to simulate samples that have the same distributional properties as the real data. In the book Simulating Data with SAS (Chapter 16), I discuss many strategies, including the following:

  • Use domain-specific knowledge to guide you. There might be physical or biological reasons to prefer one probability distribution over another.
  • Sample from the empirical distribution of the data, which is equivalent to bootstrap resampling. For more information, see the article about how to bootstrap in SAS or Chapter 15 of Simulating Data with SAS.
  • Use a well-known “named” parametric distribution to model the data, such as the normal, gamma, and beta distributions. Typically you will fit several candidate distributions to the data to see which fits best. (In SAS, you can use PROC UNIVARIATE, PROC SEVERITY, or PROC NLMIXED to fit distributions.) After you choose a distribution that fits the data, you can draw random samples from the fitted distribution.
  • For univariate data, you can choose a flexible system of distributions such as the Pearson system or the Johnson system. The Johnson system is supported by PROC UNIVARIATE (Wicklin, p. 112–114).
  • Use a graphical tool, called the moment-ratio diagram, to help select candidate distributions for the data. Traditionally the moment-ratio diagram is a theoretical tool for understanding the relationships of families and systems of distributions, but Chapter 16 shows how to use the moment-ratio diagram as a tool to organize simulation studies.

These ideas (and others) are illustrated in the following diagram, which is also from Wicklin (2013, p. 298):

The flowchart shows a few paths that a researcher can follow to model data. Most of the paths are also applicable to multivariate data.

Simulate data from the "eyeball distribution"

Now let's discuss the specific question that I was asked: how to simulate a "bounded" normal distribution. If Carol has access to the data, she could use the techniques in the previous section. Consequently, I assume that Carol does not have access to the data. This can happen in several ways, such as when you are trying to reproduce the results in a published article but the article includes only summary statistics or a graph of the data. If you cannot obtain the original data from the authors, you might be forced to simulate fake data based only on the graph and the summary statistics.

I call this using the "eyeball distribution." For non-native speakers of English, "to eyeball" means to look at or observe something. When used as an adjective, “eyeball” indicates that a quantity was obtained by visual inspection, rather than through formal measurements. Thus an "eyeball distribution" is one that is chosen heuristically because it looks similar to the histogram of the data. (You can extend these ideas to multivariate data.)

From Carol's description, the histogram of her data is symmetric, unimodal, and "bell-shaped." The data are in the interval [a, b]. I can think of several "eyeball distributions" that could model data like this:

  1. If you know the sample mean (m) and standard deviation (s), then draw random samples from N(m, s).
  2. If you know sample quantiles of the data, you can approximate the CDF and use inverse probability sampling for the simulation.
  3. Theory tells us that 99.7% of the normal distribution is within three standard deviations of the mean. For symmetric distributions, the mean equals the midrange m = (a + b)/2. Thus you could use the three-sigma rule to approximate s ≈ (b - a)/6 and sample from N(m, s).
  4. If you have domain knowledge that guarantees that the data are truly bounded on [a, b], you could use the truncated normal distribution, which samples from N(m, s) but discards any random variates that are outside of the interval.
  5. Alternatively, you can sample from a symmetric bounded distribution such as the Beta(α, α) distribution. For example, Beta(5, 5) is approximately bell-shaped on the interval [0,1].

The first option is often the most reasonable choice. Even though the original data are in the interval [a, b], the simulated data do not have to be in the same interval. If you believe that the original data are normally distributed, then simulate from that model and accept that you might obtain simulated data that are beyond the bounds of the original sample. The same fact holds for the third option, which estimates the standard deviation from a histogram.

The following SAS DATA step implements the last three options. The sample distributions are displayed by using a comparative histogram in SAS:

%let N = 500;      /* sample size */
%let min = 5;
%let max = 15;
proc format;
value MethodFmt 1 = "Normal, 3-Sigma Rule"
                2 = "Truncated Normal"
                3 = "Beta(5, 5), Scaled";
data BellSim(keep=x Method
format method MethodFmt.;
call streaminit(54321);  /* set random number seed */
numSD = 3;               /* use 3-sigma rule to estiamte std dev */
mu = (&min + &max)/2;
sigma = (&max - &min)/ (2*numSD);
method = 1;        /* Normal distribution N(mu, sigma) */
   do i = 1 to &N;
      x = rand("Normal", mu, sigma);  output;
method = 2;        /* truncated normal distribution TN(mu, sigma; a, b) */
   i = 0;
   do until (i = &N);
      x = rand("Normal", mu, sigma);
      if &min < x < &max then do;
         i + 1;
method = 3;        /* symmetric beta distribution, scaled into [a, b] */
   alpha = 5;
   do i = 1 to &N;
      x = &min + (&max - &min)*rand("beta", alpha, alpha);
ods graphics / width=480px height=480px;
title "Simulate Bell-Shaped Data on [&min, &max]";
proc sgpanel data=BellSim;
   panelby method / columns=1 novarname;
   histogram x;
   refline &min &max / axis=x;
   colaxis values=(&min to &max) valueshint;

The histograms show a simulated sample of size 500 for each of the three models. The top sample is from the normal distribution. It contains simulated observations that are outside of the interval [a,b]. In many situations, that is fine. The second sample is from the truncated normal distribution on [a,b]. The third is from a Beta(5,5) distribution, which has been scaled onto the interval [a,b].

Although the preceding SAS program answers the question Carol asked, let me emphasize that if you (and Carol) have access to the data, you should use it to choose a model. When the data are unobtainable, you can use an "eyeball distribution," which is basically a guess. However, if your goal is to generate fake data so that you can test a computational method, an eyeball distribution might be sufficient.

The post Data unavailable? Use the "eyeball distribution" to simulate appeared first on The DO Loop.

10月 252017

This article describes the advantages and disadvantages of principal component regression (PCR). This article also presents alternative techniques to PCR.

In a previous article, I showed how to compute a principal component regression in SAS. Recall that principal component regression is a technique for handling near collinearities among the regression variables in a linear regression. The PCR algorithm in most statistical software is more correctly called "incomplete" PCR because it uses only a subset of the principal components. Incomplete PCR means that you compute the principal components for the explanatory variables, keep only the first k principal components (which explain most of the variance among the regressors), and regress the response variable onto those k components.

The principal components that are dropped correspond to the near collinearities. Consequently, the standard errors of the parameter estimates are reduced, although the tradeoff is that the estimates are biased, and "the bias increases as more [principal components]are droppped" (Jackson, p. 276).

Some arguments in this article are from J. E. Jackson's excellent book, A User's Guide to Principal Components, (1991, pp. 271–278). Jackson introduces PCR and then immediately cautions against using it (p. 271). He writes that PCR "is a widely used technique," but "it also has some serious drawbacks." Let's examine the advantages and disadvantages of principal component regression.

Advantages of principal component regression

Principal component regression is a popular and widely used method. Advantages of PCR include the following:

  • PCR can perform regression when the explanatory variables are highly correlated or even collinear.
  • PCR is intuitive: you replace the basis {X1, X2, ..., Xp} with an orthogonal basis of principal components, drop the components that do not explain much variance, and regress the response onto the remaining components.
  • PCR is automatic: The only decision you need to make is how many principal components to keep.
  • The principal components that are dropped give insight into which linear combinations of variables are responsible for the collinearities.
  • PCR has a discrete parameter, namely the number of components kept. This parameter is very interpretable in terms of geometry (linear dimensions kept) and in terms of linear algebra (low-rank approximations).
  • You can run PCR when there are more variables than observations (wide data).

Drawbacks of principal component regression

The algorithm that is currently known as PCR is actually a misinterpretation of the original ideas behind PCR (Jolliffe, 1982, p. 201). When Kendall and Hotelling first proposed PCR in the 1950s, they proposed "complete" PCR, which means replacing the original variables by all the principal components, thereby stabilizing the numerical computations. Which principal components are included in the final model is determined by looking at the significance of the parameter estimates. By the early 1980s, the term PCR had changed to mean "incomplete PCR."

The primary argument against using (incomplete) principal component regression can be summarized in a single sentence: Principal component regression does not consider the response variable when deciding which principal components to drop. The decision to drop components is based only on the magnitude of the variance of the components.

There is no a priori reason to believe that the principal components with the largest variance are the components that best predict the response. In fact, it is trivial to construct an artificial example in which the best predictor is the last component, which will surely be dropped from the analysis. (Just define the response to be the last principal component!) More damning, Jolliffe (1982, p. 302) presents four examples from published papers that advocate PCR, and he shows that some of the low-variance components (which were dropped) have greater predictive power than the high-variance components that were kept. Jolliffe concludes that "it is not necessary to find obscure or bizarre data in order for the last few principal components to be important in principal component regression. Rather it seems that such examples may be rather common in practice."

There is a hybrid version of PCR that enables you to use cross validation and the predicted residual sum of squares (PRESS) criterion to select how many components to keep. (In SAS, the syntax is proc pls method=PCR cv=one cvtest(stat=press).) Although this partially addresses the issue by including the response variable in the selection of components, it is still the case that the first k components are selected and the last p – k are dropped. The method never keeps the first, third, and sixth components, for example.

Alternatives to principal component regression

Some alternatives to principal component regression include the following:

  • Ridge regression: In ridge regression, a diagonal matrix is added to the X`X matrix so that it becomes better conditioned. This results in biased parameter estimates. You can read an explanation of ridge regression and how to compute it by using PROC REG in SAS.
  • Complete PCR: As mentioned previously, use the PCs as the variables and keep the components whose parameter estimates are significant.
  • Complete PCR with variable selection: Use the PCs as the variables and use the variable-selection techniques to decide which components to retain. However, if your primary goal is variable reduction, then use variable-selection techniques on the original variables.
  • Partial Least Squares (PLS): Partial least square regression is similar to PCR in that both select components that explain the most variance in the model. The difference is that PLS incorporates the response variable. That is, the components that are produced are those that explain the most variance in the explanatory AND response variables. In SAS, you can compute a PLS regression by using PROC PLS with METHOD=PLS or METHOD=SIMPLS. You will probably also want to use the CV and CVTEST options.


In summary, principal component regression is a technique for computing regressions when the explanatory variables are highly correlated. It has several advantages, but the main drawback of PCR is that the decision about how many principal components to keep does not depend on the response variable. Consequently, some of the variables that you keep might not be strong predictors of the response, and some of the components that you drop might be excellent predictors. A good alternative is partial least squares regression, which I recommend. In SAS, you can run partial least squares regression by using PROC PLS with METHOD=PLS.


The post Should you use principal component regression? appeared first on The DO Loop.