
January 11, 2021
 

On The DO Loop blog, I write about a diverse set of topics, including statistical data analysis, machine learning, statistical programming, data visualization, simulation, numerical analysis, and matrix computations. In a previous article, I presented some of my most popular blog posts from 2020. The most popular articles often deal with elementary or familiar topics that are useful to almost every data analyst.

However, among last year's 100+ articles are many that discuss advanced topics. Did you make a New Year's resolution to learn something new this year? Here is your chance! The following articles were fun to write and deserve a second look.

Machine learning concepts

Relationship between a threshold value and true/false negatives and positives

Statistical smoothers

Bilinear interpolation of 12 data values

I write a lot about scatter plot smoothers, which are typically parametric or nonparametric regression models. But a SAS customer wanted to know how to get SAS to perform various classical interpolation schemes such as linear and cubic interpolations:

SAS Viya and parallel computing

SAS is devoting tremendous resources to SAS Viya, which offers a modern analytic platform that runs in the cloud. One of the advantages of SAS Viya is the opportunity to take advantage of distributed computational resources. In 2020, I wrote a series of articles that demonstrate how to use the iml action in Viya 3.5 to implement custom parallel algorithms that use multiple nodes and threads on a cluster of machines. Whereas many actions in SAS Viya perform one and only one task, the iml action supports a general framework for custom, user-written, parallel computations:

The map-reduce functionality in the iml action

  • The map-reduce paradigm is a two-step process for distributing a computation. Every thread runs a function and produces a result for the data that it sees. The results are aggregated and returned. The iml action supports the MAPREDUCE function, which implements the map-reduce paradigm.
  • The parallel-tasks paradigm is a way to run independent computations concurrently. The iml action supports the PARTASKS function, which implements the parallel-tasks paradigm.

Simulation and visualization

Decomposition of a convex polygon into triangles

Generate random points in a polygon

Your turn

Did I omit one of your favorite blog posts from The DO Loop in 2020? If so, leave a comment and tell me what topic you found interesting or useful. And if you missed some of these articles when they were first published, consider subscribing to The DO Loop in 2021.

The post Blog posts from 2020 that deserve a second look appeared first on The DO Loop.

January 4, 2021
 

Last year, I wrote more than 100 posts for The DO Loop blog. In previous years, the most popular articles were about SAS programming tips, statistical analysis, and data visualization. But not in 2020. In 2020, when the world was ravaged by the coronavirus pandemic, the most-read articles were related to analyzing and visualizing the tragic loss and suffering of the pandemic. Here are some of the most popular articles from 2020 in several categories.

The coronavirus pandemic

Relationship between new cases and cumulative cases in COVID-19 infections

Statistical analysis: Regression

Visualization of collinearity diagnostics

Other statistical analyses

Data visualization

Many articles in the previous sections included data visualization, but two popular articles are specifically about data visualization:

Reference lines at values of statistical estimates

Many people claim they want to forget 2020, but these articles provide a few tips and techniques that you might want to remember. So, read (or re-read!) these popular articles from 2020. And if you made a resolution to learn something new this year, consider subscribing to The DO Loop so you don't miss a single article!

The post Top posts from The DO Loop in 2020 appeared first on The DO Loop.

December 21, 2020
 

When you perform a linear regression, you can examine the R-square value, which is a goodness-of-fit statistic that indicates how well the response variable can be represented as a linear combination of the explanatory variables. But did you know that you can also go the other direction? Given a set of explanatory variables and an R-square statistic, you can create a response variable, Y, such that a linear regression of Y on the explanatory variables produces exactly that R-square value.

The geometry of correlation and least-squares regression

In a previous article, I showed how to compute a vector that has a specified correlation with another vector. You can generalize that situation to obtain a vector that has a specified relationship with a linear subspace that is spanned by multiple vectors.

Recall that the correlation is related to the angle between two vectors by the formula cos(θ) = ρ, where θ is the angle between the vectors and ρ is the correlation coefficient. Therefore, correlation and "angle between" measure similar quantities. It makes sense to define the angle between a vector and a linear subspace as the smallest angle the vector makes with any vector in the subspace. Equivalently, it is the angle between the vector and its (orthogonal) projection onto the subspace.
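
In symbols, if x and y denote the mean-centered data vectors, the Pearson correlation is exactly the cosine of the angle between them: \( \rho = \cos(\theta) = \langle \mathbf{x}, \mathbf{y} \rangle / (\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert) \).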

This is shown graphically in the following figure. The vector z is not in the span of the explanatory variables. The vector w is the projection of z onto the linear subspace. As explained in the previous article, you can find a vector y such that the angle between y and w is θ, where cos(θ) = ρ. Equivalently, the correlation between y and w is ρ.

Correlation between a response vector and a predicted vector

There is a connection between this geometry and the geometry of least-squares regression. In least-squares regression, the predicted response is the projection of an observed response vector onto the span of the explanatory variables. Consequently, the previous article shows how to simulate an "observed" response vector that has a specified correlation with the predicted response.

For simple linear regression (one explanatory variable), textbooks often point out that the R-square statistic is the square of the correlation between the independent variable, X, and the response variable, Y. So, the previous article enables you to create a response variable that has a specified R-square value with one explanatory variable.

The generalization to multivariate linear regression is that the R-square statistic is the square of the correlation between the predicted response and the observed response. Therefore, you can use the technique in this article to create a response variable that has a specified R-square value in a linear regression model.
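
In symbols, for a least-squares model that includes an intercept, \( R^2 = \operatorname{corr}(Y, \hat{Y})^2 \). In the special case of one regressor, \( \operatorname{corr}(Y, \hat{Y}) = \lvert \operatorname{corr}(X, Y) \rvert \), which recovers the familiar fact that \( R^2 = \operatorname{corr}(X, Y)^2 \).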

To be explicit, suppose you are given explanatory variables X1, X2, ..., Xk, and a correlation coefficient, ρ. The following steps generate a response variable, Y, such that the R-square statistic for the regression of Y onto the explanatory variables is ρ²:

  1. Start with an arbitrary guess, z.
  2. Use least-squares regression to find w = \(\hat{\mathbf{z}}\), which is the projection of z onto the subspace spanned by the explanatory variables and the 1 vector.
  3. Use the technique in the previous article to find Y such that corr(Y, w) = ρ

Create a response variable that has a specified R-square value in SAS

The following program shows how to carry out this algorithm in the SAS/IML language:

proc iml;
/* Define or load the modules from 
   https://blogs.sas.com/content/iml/2020/12/17/generate-correlated-vector.html
*/
load module=_all_;
/* read some data X1, X2, ... into columns of a matrix, X */
use sashelp.class;
   read all var {"Height" "Weight" "Age"} into X;  /* read data into (X1,X2,X3) */
close;
 
/* Least-squares fit = Project Y onto span(1,X1,X2,...,Xk) */
start OLSPred(y, _x);
   X = j(nrow(_x), 1, 1) || _x;
   b = solve(X`*X, X`*y);
   yhat = X*b;
   return yhat;
finish;
 
/* specify the desired correlation between Y and \hat{Y}. Equiv: R-square = rho^2 */
rho = 0.543;        
 
call randseed(123);
guess = randfun(nrow(X), "Normal");   /* 1. make random guess */
w = OLSPred(guess, X);                /* 2. w is in Span(1,X1,X2,...) */
Y = CorrVec1(w, rho, guess);          /* 3. Find Y such that corr(Y,w) = rho   */
/* optional: you can scale Y any way you want ... */
/* in regression, R-square is squared correlation between Y and YHat */
corr = corr(Y||w)[2];
R2 = rho**2;                          
PRINT rho corr R2;

The program uses a random guess to generate a vector Y such that the correlation between Y and the least-squares prediction for Y is exactly 0.543. In other words, if you run a regression model where Y is the response and (X1, X2, X3) are the explanatory variables, the R-square statistic for the model will be ρ² = 0.2948. Let's write the Y variable to a SAS data set and run PROC REG to verify this fact:

/* Write to a data set, then call PROC REG */
Z = Y || X;
create SimCorr from Z[c={Y X1 X2 X3}];
append from Z;
close;
QUIT;
 
proc reg data=SimCorr plots=none;
   model Y = X1 X2 X3;
   ods select FitStatistics ParameterEstimates;
quit;

The "FitStatistics" table that is created by using PROC REG verifies that the R-square statistic is 0.2948, which is the square of the ρ value that was specified in the SAS/IML program. The ParameterEstimates table from PROC REG shows the vector in the subspace that has correlation ρ with Y. It is -1.26382 + 0.04910*X1 - 0.00197*X2 - 0.12016 *X3.

Summary

Many textbooks point out that the R-square statistic in multivariable regression has a geometric interpretation: It is the squared correlation between the response vector and the projection of that vector onto the linear subspace of the explanatory variables (which is the predicted response vector). You can use the program in this article to solve the inverse problem: Given a set of explanatory variables and correlation, you can find a response variable for which the R-square statistic is exactly the squared correlation.

You can download the SAS program that computes the results in this article.

The post Create a response variable that has a specified R-square value appeared first on The DO Loop.

December 14, 2020
 

A segmented regression model is a piecewise regression model that has two or more sub-models, each defined on a separate domain for the explanatory variables. For simplicity, assume the model has one continuous explanatory variable, X. The simplest segmented regression model assumes that the response is modeled by one parametric model when X is less than some threshold value and by a different parametric model when X is greater than the threshold value. The threshold value is also called a breakpoint, a cutpoint, or a join point.
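
In symbols, a model with one breakpoint, x0, specifies the mean response piecewise: \( E[Y \mid x] = f(x) \) for \( x < x_0 \) and \( E[Y \mid x] = g(x) \) for \( x \ge x_0 \), where f and g are the two sub-models and the breakpoint can be known in advance or estimated from the data.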

A previous article shows that for simple piecewise polynomial models, you can use the EFFECT statement in SAS regression procedures to use a spline to fit some segmented regression models. The method relies on two assumptions. First, it assumes that you know the location of the breakpoint. Second, it assumes that each model has the same parametric form on each interval. For example, the model might be piecewise constant, piecewise linear, or piecewise quadratic.

If you need to estimate the location of the breakpoint from the data, or if you are modeling the response differently on each segment, you cannot use a single spline effect. Instead, you need to use a SAS procedure such as PROC NLIN that enables you to specify the model on each segment and to estimate the breakpoint. For simplicity, this article shows how to estimate a model that is quadratic on the first segment and constant on the second segment. (This is called a plateau model.) The model also estimates the location of the breakpoint.

An example of a segmented model

A SAS customer recently asked about segmented models on the SAS Support Communities. Suppose that a surgeon wants to model how long it takes her to perform a certain procedure over time. When the surgeon first started performing the procedure, it took about 3 hours (180 minutes). As she refined her technique, that time decreased. The surgeon wants to predict how long this surgery now takes, and she wants to estimate when the time reached its current plateau. The data are shown below and visualized in a scatter plot. The length of the procedure (in minutes) is recorded for 25 surgeries over a 16-month period.

data Have;
  input SurgeryNo Date :mmddyy10. duration @@;
  format Date mmddyy10.;
datalines;
1       3/20/2019       182   2       5/16/2019       150
3       5/30/2019       223   4       6/6/2019        142
5       6/11/2019       111   6       7/11/2019       164
7       7/26/2019        83   8       8/22/2019       144
9       8/29/2019       162  10       9/19/2019        83
11      10/10/2019       70  12       10/17/2019      114
13      10/31/2019      113  14       11/7/2019        97
15      11/21/2019       83  16       12/5/2019       111
17      12/5/2019        73  18       12/12/2019       87
19      12/19/2019       86  20       1/9/2020        102
21      1/16/2020       124  22       1/23/2020        95
23      1/30/2020       134  24       3/5/2020        121
25      6/4/2020         60
;
 
title "Time to Perform a Surgery";
proc sgplot data=Have;
  scatter x=Date y=Duration;
run;

From the graph, it looks like the duration for the procedure decreased until maybe November or December 2019. The goal of a segmented model is to find the breakpoint and to model the duration before and after the breakpoint. For this purpose, you can use a segmented plateau model that is a quadratic model prior to the breakpoint and a constant model after the breakpoint.

Formulate a segmented regression model

A segmented plateau model is one of the examples in the PROC NLIN documentation. The documentation shows how to use constraints in the problem to eliminate one or more parameters. For example, a common assumption is that the two segments are joined smoothly at the breakpoint location, x0. If f(x) is the predictive model to the left of the breakpoint and g(x) is the model to the right, then continuity dictates that f(x0) = g(x0) and smoothness dictates that f′(x0) = g′(x0). In many cases (such as when the models are low-degree polynomials), the two constraints enable you to reparameterize the models to eliminate two of the parameters.

The PROC NLIN documentation shows the details. Suppose f(x) is a quadratic function f(x) = α + βx + γx² and g(x) is a constant function g(x) = c. Then you can use the constraints to reparameterize the problem so that (α, β, γ) are free parameters, and the other two parameters are determined by the formulas:
x0 = -β / (2γ)
c = α − β² / (4γ)
You can use the ESTIMATE statement in PROC NLIN to obtain estimates and standard error for the x0 and c parameters.
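
For completeness, here is the short algebra behind those two formulas. The smooth-join condition forces the vertex of the quadratic to coincide with the breakpoint: \( f'(x_0) = \beta + 2\gamma x_0 = 0 \Rightarrow x_0 = -\beta/(2\gamma) \). Continuity then gives the plateau value: \( c = f(x_0) = \alpha + \beta x_0 + \gamma x_0^2 = \alpha - \beta^2/(4\gamma) \).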

Provide initial guesses for the parameters

As is often the case, the hard part is to guess initial values for the parameters. You must supply an initial guess on the PARMS statement in PROC NLIN. One way to create a guess is to use a related "reduced model" to provide the estimates. For example, you can use PROC GLM to fit a global quadratic model to the data, as follows:

/* rescale by using x="days since start" as the variable */
%let RefDate = '20MAR2019'd;
data New;
set Have;
rename duration=y;
x = date - &RefDate;  /* days since first record */
run;
 
proc glm data=New;
   model y = x x*x;
run;

These parameter estimates are used in the next section to specify the initial parameter values for the segmented model.

Fit a segmented regression model in SAS

Recall that a SAS date is represented by the number of days since 01JAN1960. Thus, for these data, the Date values are approximately 22,000. Numerically speaking, it is often better to use smaller numbers in a regression problem, so I will rescale the explanatory variable to be the number of days since the first surgery. You can use the parameter estimates from the quadratic model as the initial values for the segmented model:

title 'Segmented Model with Plateau';
proc nlin data=New plots=fit noitprint;
   parms alpha=184 beta= -0.5 gamma= 0.001;
 
   x0 = -0.5*beta / gamma;
   if (x < x0) then
        mean = alpha + beta*x  + gamma*x*x;    /* quadratic model for x < x0 */
   else mean = alpha + beta*x0 + gamma*x0*x0;  /* constant model for x >= x0 */
   model y = mean;
 
   estimate 'plateau'    alpha + beta*x0 + gamma*x0*x0;
   estimate 'breakpoint' -0.5*beta / gamma;
   output out=NLinOut predicted=Pred L95M=Lower U95M=Upper;
   ods select ParameterEstimates AdditionalEstimates FitPlot;
run;

The output from PROC NLIN includes estimates for the quadratic regression coefficients and for the breakpoint and the plateau value. According to the second table, the surgeon can now perform this surgical procedure in about 98 minutes, on average. The 95% confidence interval [77, 119] suggests that the surgeon might want to schedule two hours for this procedure so that she is not late for her next appointment.

According to the estimate for the breakpoint (x0), she achieved her plateau after about 287 days of practice. However, the confidence interval is quite large, so there is considerable uncertainty in this estimate.

The output from PROC NLIN includes a plot that overlays the predictions on the observed data. However, that graph is in terms of the number of days since the first surgery. If you want to return to the original scale, you can graph the predicted values versus the Date. You can also add reference lines that indicate the plateau value and the estimated breakpoint, as follows:

/* convert x back to original scale (Date) */
data _NULL_;
   plateauDate = 287 + &RefDate;    /* translate back to Date scale */
   call symput('x0', plateauDate);  /* store breakpoint in macro variable */
   put plateauDate DATE10.;
run;
 
/* plot the predicted values and CLM on the original scale */
proc sgplot data=NLinOut noautolegend;
   band x=Date lower=Lower upper=Upper;
   refline 97.97  / axis=y label="Plateau"    labelpos=max;
   refline &x0 / axis=x label="01JAN2020" labelpos=max;
   scatter x=Date y=y;
   series  x=Date y=Pred;
   yaxis label='Duration';
run;

The graph shows the breakpoint estimate, which happens to be 01JAN2020. It also shows the variability in the data and the wide prediction limits.

Summary

This article shows how to fit a simple segmented model in SAS. The model has one breakpoint, which is estimated from the data. The model is quadratic before the breakpoint, constant after the breakpoint, and joins smoothly at the breakpoint. The constraints on continuity and smoothness reduce the problem from five parameters to three free parameters. The article shows how to use PROC NLIN in SAS to solve segmented models and how to visualize the result.

You can use a segmented model for many other data sets. For example, if you are a runner and routinely run a 5k distance, you can use a segmented model to monitor your times. Are your times decreasing, or did you reach a plateau? When did you reach the plateau?

The post Segmented regression models in SAS appeared first on The DO Loop.

December 9, 2020
 

One purpose of principal component analysis (PCA) is to reduce the number of important variables in a data analysis. Thus, PCA is known as a dimension-reduction algorithm. I have written about four simple rules for deciding how many principal components (PCs) to keep.

There are other methods for deciding how many PCs to keep. Recently a SAS customer asked about a method known as Horn's method (Horn, 1965), also called parallel analysis. This is a simulation-based method for deciding how many PCs to keep. If the original data consists of N observations and p variables, Horn's method is as follows:

  • Generate B sets of random data with N observations and p variables. The variables should be normally distributed and uncorrelated. That is, the data matrix, X, is a random sample from a multivariate normal (MVN) distribution with an identity correlation parameter: X ~ MVN(0, I(p)).
  • Compute the corresponding B correlation matrices and the eigenvalues for each correlation matrix.
  • Estimate the 95th percentile of each eigenvalue distribution. That is, estimate the 95th percentile of the largest eigenvalue, the 95th percentile of the second largest eigenvalue, and so forth.
  • Compare the observed eigenvalues to the 95th percentiles of the simulated eigenvalues. If the observed eigenvalue is larger, keep it. Otherwise, discard it. (A sketch that implements these steps directly appears after this list.)
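
If you want to carry out these steps literally, the following SAS/IML sketch simulates the multivariate normal samples with the RANDNORMAL function. The values of N, p, and the number of simulations are chosen to match the Crime example that is analyzed later in this article; the implementation shown later uses a more efficient shortcut that skips the data simulation entirely.

proc iml;
/* Direct simulation for Horn's method: generate MVN(0, I(p)) samples and
   collect the eigenvalues of each sample correlation matrix.
   (A sketch only; N, p, and nSim are chosen to match the Crime example.) */
N = 50;  p = 7;  nSim = 1000;
call randseed(1);
simEigen = j(nSim, p);                     /* eigenvalues, one row per sample */
do i = 1 to nSim;
   X = randnormal(N, j(1,p,0), I(p));      /* N x p sample of uncorrelated normals */
   simEigen[i,] = T( eigval( corr(X) ) );  /* eigenvalues of the sample correlation */
end;
call qntl(crit, simEigen, 0.95);           /* 95th percentile of each eigenvalue */
print crit[L="95th pctl of simulated eigenvalues"];
quit;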

I do not know why the adjective "parallel" is used for Horn's analysis. Nothing in the analysis is geometrically parallel to anything else. Although you can use parallel computations to perform a simulation study, I doubt Horn was thinking about that in 1965. My best guess is that Horn's method is a secondary analysis that is performed "off to the side" or "in parallel" to the primary principal component analysis.

Parallel analysis in SAS

You do not need to write your own simulation method to use Horn's method (parallel analysis). Horn's parallel analysis is implemented in SAS (as of SAS/STAT 14.3 in SAS 9.4M5) by using the PARALLEL option in PROC FACTOR. The following call to PROC FACTOR uses data about US crime rates. The data are from the Getting Started example in PROC PRINCOMP.

proc factor data=Crime method=Principal 
   parallel(nsims=1000 seed=54321) nfactors=parallel plots=parallel;
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
run;

The PLOTS=PARALLEL option creates the visualization of the parallel analysis. The solid line shows the eigenvalues for the observed correlation matrix. The dotted line shows the 95th percentile of the simulated data. When the observed eigenvalue is greater than the corresponding 95th percentile, you keep the factor. Otherwise, you discard the factor. The graph shows that only one principal component would be kept according to Horn's method. This graph is a variation of the scree plot, which is a plot of the observed eigenvalues.

The same information is presented in tabular form in the "ParallelAnalysis" table. The first row is the only row for which the observed eigenvalue is greater than the 95th percentile (the "critical value") of the simulated eigenvalues.

Interpretation of the parallel analysis

Statisticians often use statistical tests based on a null hypothesis. In Horn's method, the simulation provides the "null distribution" of the eigenvalues of the correlation matrix under the hypothesis that the variables are uncorrelated. Horn's method says that we should only accept a factor as important if it explains more variance than would be expected from uncorrelated data.

Although the PARALLEL option is supported in PROC FACTOR, some researchers suggest that parallel analysis is valid only for PCA. Saccenti and Timmerman (2017) write, "Because Horn’s parallel analysis is associated with PCA, rather than [common factor analysis], its use to indicate the number of common factors is inconsistent (Ford, MacCallum, & Tait, 1986; Humphreys, 1975)." I am not an expert in factor analysis, but a basic principle of simulation is to ensure that the "null distribution" is appropriate to the analysis. For PCA, the null distribution in Horn's method (eigenvalues of a sample correlation matrix) is appropriate. However, in some common factor models, the important matrix is a "reduced correlation matrix," which does not have 1s on the diagonal.

The advantages of knowing how to write a simulation

Although the PARALLEL option in PROC FACTOR runs a simulation and summarizes the results, there are several advantages to implementing a parallel analysis yourself. For example, you can perform the analysis on the covariance (rather than correlation) matrix. Or you can substitute a robust correlation matrix as part of a robust principal component analysis.

I decided to run my own simulation because I was curious about the distribution of the eigenvalues. The graph that PROC FACTOR creates shows only the upper 95th percentiles of the eigenvalue distribution. I wanted to overlay a confidence band that indicates the distribution of the eigenvalues. The band would visualize the uncertainty in the eigenvalues of the simulated data. How wide is the band? Would you get different results if you use the median eigenvalue instead of the 95th percentile?

Such a graph is shown to the right. The confidence band was created by using a technique similar to the one I used to visualize uncertainty in predictions for linear regression models. The graph shows the distribution of each eigenvalue and connects the distributions with straight lines. The confidence band fits well with the existing graph, even though the X axis is discrete.

Implement Horn's simulation method

Here's an interesting fact about the simulation in Horn's method. Most implementations generate B random samples, X ~ MVN(0, I(p)), but you don't actually NEED the random samples! All you need are the correlation matrices for the random samples. It turns out that you can simulate the correlation matrices directly by using the Wishart distribution. SAS/IML software includes the RANDWISHART function, which simulates matrices from the Wishart distribution. You can transform those matrices into correlation matrices, find the eigenvalues, and compute the quantiles in just a few lines of PROC IML:

/* Parallel Analysis (Horn 1965) */
proc iml;
/* 1. Read the data and compute the observed eigenvalues of the correlation */
varNames = {"Murder" "Rape" "Robbery" "Assault" "Burglary" "Larceny" "Auto_Theft"};
use Crime;
   read all var varNames into X;
   read all var "State";
close;
p = ncol(X);
N = nrow(X);
 
m = corr(X);                  /* observed correlation matrix */
Eigenvalue = eigval(m);       /* observed eigenvalues */
 
/* 2. Generate random correlation matrices from MVN(0,I(p)) data
      and compute the eigenvalues. Each row of W is a p x p scatter 
      matrix for a random sample of size N where X ~ MVN(0, I(p)) */
nSim = 1000;
call randseed(12345);
W = RandWishart(nSim, N-1, I(p));  /* each row stores a p x p matrix */
S = W / (N-1);                /* rescale to form covariance matrix */
simEigen = j(nSim, p);        /* store eigenvalues in rows */
do i = 1 to nSim;
   cov = shape(S[i,], p, p);  /* reshape the i_th row into p x p */
   R = cov2corr(cov);         /* i_th correlation matrix */
   simEigen[i,] = T(eigval(R));  
end;
 
/* 3. find 95th percentile for each eigenvalue */
alpha = 0.05;
call qntl(crit, simEigen, 1-alpha);
 
results = T(1:nrow(Eigenvalue)) || Eigenvalue || crit`;
print results[c={"Number" "Observed Eigenvalue" "Simul Crit Val"} F=BestD8.];

The table is qualitatively the same as the one produced by PROC FACTOR. Both tables are the results of simulations, so you should expect to see small differences in the third column, which shows the 95th percentile of the distributions of the eigenvalues.

Visualize the distribution of eigenvalues

The eigenvalues are stored as rows of the simEigen matrix, so you can estimate the 5th, 10th, ..., 95th percentiles and overlay band plots on the eigenvalue (scree) plot, as follows:

/* 4. Create a graph that illustrates Horn's method: 
   Factor Number vs distribution of eigenvalues. Write results in long form. */
/* 4a. Write the observed eigenvalues and the 95th percentile */
create Horn from results[c={"Factor" "Eigenvalue" "SimulCritVal"}];
   append from results;
close;
 
/* 4b. Visualize the uncertainty in the simulated eigenvalues. For details, see
   https://blogs.sas.com/content/iml/2020/10/12/visualize-uncertainty-regression-predictions.html 
*/
a = do(0.05, 0.45, 0.05);          /* significance levels */
call qntl(Lower, simEigen, a);     /* lower qntls         */
call qntl(Upper, simEigen, 1-a);   /* upper qntls         */
Factor = col(Lower);               /* 1,2,3,...,1,2,3,... */
alpha = repeat(a`, 1, p);             
create EigenQntls var {"Factor" "alpha" "Lower" "Upper"};
   append;
close;
QUIT;
 
proc sort data=EigenQntls;
   by alpha Factor;
run;
data All;
   set Horn EigenQntls;
run;
 
title "Horn's Method (1965) for Choosing the Number of Factors";
title2 "Also called Parallel Analysis";
proc sgplot data=All noautolegend;
   band x=Factor lower=Lower upper=Upper/ group=alpha fillattrs=(color=gray) transparency=0.9;
   series x=Factor y=Eigenvalue / markers name='Eigen' legendlabel='Observed Eigenvalue';
   series x=Factor y=SimulCritVal / markers lineattrs=(pattern=dot) 
          name='Sim' legendlabel='Simulated Crit Value';
   keylegend 'Eigen' 'Sim' / across=1;
run;

The graph is shown in the previous section. The darkest part of the band shows the median eigenvalue. You can see that the "null distribution" of eigenvalues is rather narrow, even though the data contain only 50 observations. I thought perhaps it would be wider. Because the band is narrow, it doesn't matter much whether you choose the 95th percentile as a critical value or some other value (90th percentile, 80th percentile, and so forth). For these data, any reasonable choice for a percentile will still lead to rejecting the second factor and keeping only one principal component. Because the band is narrow, the results will not be unduly affected by whether you use few or many Monte Carlo simulations. In this article, both simulations used B=1000 simulations.

Summary

In summary, PROC FACTOR supports the PARALLEL and PLOTS=PARALLEL options for performing a "parallel analysis," which is Horn's method for choosing the number of principal components to retain. PROC FACTOR creates a table and graph that summarize Horn's method. You can also run the simulation yourself. If you use SAS/IML, you can simulate the correlation matrices directly, which is more efficient than simulating the data. If you run the simulation yourself, you can add additional features to the scree plot, such as a confidence band that shows the null distribution of the eigenvalues.

The post Horn's method: A simulation-based method for retaining principal components appeared first on The DO Loop.

December 7, 2020
 
Spruce (Picea glauca) branches

Image of spruce branch (Picea glauca) by Aleksandrs Balodis, licensed under the Creative Commons Attribution-Share Alike 4.0 International license (CC-BY-SA-4.0).

"O Christmas tree, O Christmas tree, how lovely are your branches!" The idealized image of a Christmas tree is a perfectly straight conical tree with lush branches and no bare spots. Although this ideal exists only on Christmas cards, forest researchers are always trying to develop trees that approach the ideal. And they use serious science and statistics to do it!

Bert Cregg, a forest researcher at Michigan State University, is a Christmas tree researcher who has been featured in Wired magazine. Cregg and his colleagues have published many papers on best practices for growing Christmas trees. One paper that caught my eye is Gooch, Nzokou, and Cregg (2009), "Effect of Indoor Exposure on the Cold Hardiness and Physiology of Containerized Christmas Trees." In this paper, Cregg and his colleagues investigate whether you can buy a live Christmas tree, keep it in the house for the holidays, and then transplant it in your yard. The authors use a statistical technique known as the median lethal dose, or LD50, to describe how bringing a potted Christmas tree into the house can affect its hardiness to freezing temperatures.

This blog post uses Cregg's data and shows how to use SAS to estimate the LD50 statistic. If you are not interested in the details, the last section of this article summarizes the results. Spoiler: An evergreen kept indoors has a reduced ability to withstand freezing temperatures after it is transplanted. In Cregg's experiment, all the transplanted trees died within six months.

The problem and the experiment

Containerized (potted) Christmas trees are popular among consumers who want a live Christmas tree but do not want to kill the tree by cutting it down. The idea is to bring the tree indoors during the holidays, then plant it on your property. However, there is a problem with bringing a tree into a house during the winter. Evergreens naturally go dormant in the winter, which enables their needles and buds to withstand freezing temperatures. When you bring a tree into a heated space, it "wakes up." If you later move the tree outside into freezing temperatures, the needles and buds can be damaged by the cold. This damage can kill the tree.

Cregg and his colleagues set up an experiment to understand how the length of time spent indoors affects a Christmas tree's ability to withstand freezing temperatures. Trees were brought indoors for 0, 3, 7, 14, and 20 days. Cuttings from the trees were then exposed to freezing conditions: -3, -6, -9, ..., -30 degrees Celsius. The goal is to estimate the median "lethal dose" for temperature. That is, to estimate the temperature at which half of the cuttings are damaged by the cold. In pharmacological terms, the time spent indoors is a treatment and the temperature is a "dose." The trees that were never brought indoors (0 days) are a control group.

Cregg and his colleagues studied three species of Christmas trees and studied damage to both buds and needles. For brevity, I will only look at the results for buds on the Black Hills Spruce (Picea glauca).

The data and the graphical method

In Table 1 (p. 75), the authors show data for the bud mortality and 50% lethality (LD50) for each treatment group (days indoors) as a function of decreasing temperatures. The data for the spruce are shown in the following SAS DATA step. Each cell in the table represents the percentage of 60 cuttings that showed damage after being exposed to a freezing temperature. By knowing there were 60 cuttings, you can compute how many cuttings were damaged.

/* Bud mortality and LD50. Data from Table 1 of Gooch, Nzokou, & Cregg (2009). */
data SpruceBudDamage;
Ntrials = 60;
input TimeIndoors @;
do Temp = -3 to -30 by -3;
   input BudMort @;
   NEvents = int(NTrials * BudMort / 100);  /* compute NEvents from mortality */
   output;
end;
label Temp = "Temperature (C)";
/* Bud Mortality (percent) as a function of temperature for each treatment */
/* Days    | --------  Temperature (C)  -------- */   
/* Indoors |-3 -6 -9 -12 -15 -18 -21 -24 -27 -30 */
datalines;
   0         0  0  0   0   0  6.7  0   0  20 100
   3         0  0  0   0   0  0    0  30  80 100
   7         0  0  0   0   0  0   10  40 100 100
  14         0  0  0   0   0  0    0  80 100 100
  20         0  0  0   0  20 40  100 100 100 100
;

The paper says that the LD50 "was determined graphically using a pairwise plot of the exposure temperature and the percentage of bud mortality ... for each species." I don't know what that means. It sounds like they did not estimate the LD50 statistically but instead graphed the bud mortality versus the temperature and used linear interpolation (or a smoother) to estimate the temperature at which the mortality is 50%. Here is a graph of the data connected by lines:

title "Bud Mortality in Black Hills Spruce";
proc sgplot data=SpruceBudDamage;
   series x=Temp y=BudMort / markers group=TimeIndoors lineattrs=(thickness=2);
   refline 50 / axis=y lineattrs=(color=red);
   xaxis grid; yaxis grid;
run;

The horizontal red line in the graph is the line of 50% mortality. For each curve, a crude estimate of LD50 is the temperature at which the curve crosses that line. The graphical estimates and the estimates in the paper are shown in the following table. The estimates in the authors' paper are greater than the estimates from my graph, but I do not know why. If you use something other than linear interpolation (for example, a loess smoother), you will get different curves and different estimates.
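
If you want to compute those crude estimates programmatically instead of reading them off the graph, the following DATA step is one way to do it. This is my own sketch of the graphical (linear interpolation) method, not code from the paper:

/* Crude LD50: linearly interpolate where each mortality curve crosses 50% */
data LD50Crude(keep=TimeIndoors LD50);
   set SpruceBudDamage;
   by TimeIndoors;                  /* data are already grouped by treatment */
   retain prevTemp prevMort;
   if first.TimeIndoors then call missing(prevTemp, prevMort);
   else if prevMort < 50 & BudMort >= 50 then do;  /* curve crosses the 50% line */
      LD50 = prevTemp + (50 - prevMort) / (BudMort - prevMort) * (Temp - prevTemp);
      output;
   end;
   prevTemp = Temp;  prevMort = BudMort;
run;
 
proc print data=LD50Crude noobs; run;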

Although these graphical estimates differ slightly from the published results, the numbers tell the same story. On average, a spruce Christmas tree that is not brought indoors is hardy to about -28C. If you bring a tree indoors for 3 days, it is hardy only to about -25C. The longer a tree is indoors, the more it loses hardiness. For trees that are indoors for 20 days, the median lethal temperature is -18C, or about 10 degrees warmer than for the control group.

Better estimates of LD50

The graphical estimates are crude. They are based on linear interpolation between two consecutive data points: one for which the mortality is below 50% and the next for which the mortality is above 50%. The estimates ignore all other data. Furthermore, the estimates assume that the mortality is linear between those two points, which is not usually the case. The mortality curve is typically a sigmoid (or S-shaped) curve.

Fortunately, we can use statistics to address these concerns. The usual way to estimate LD50 in SAS is to use PROC PROBIT. For these data, we will perform a separate analysis for each value of the TimeIndoors variable. The INVERSECL option on the MODEL statement estimates the Temperature (and confidence limits) for a range of probability values. You can use the ODS OUTPUT statement to write the statistics to a SAS data set so that you can use PROC SGPLOT to overlay all five curves on a single graph, as follows:

proc probit data=SpruceBudDamage plots=(predpplot);
   by TimeIndoors;
   model NEvents / NTrials = Temp / InverseCL;
   ods exclude ProbitAnalysis;
   ods output ProbitAnalysis=ProbitOut;
run;
 
proc sgplot data=ProbitOut;
   band y=Probability lower=LowerCL upper=UpperCL / group=TimeIndoors transparency=0.5;
   series y=Probability x=Variable / group=TimeIndoors;
   refline 0.50 / axis=y lineattrs=(color=brown);
   xaxis grid values=(-30 to -3 by 3); yaxis grid;
run;

For each treatment group (time spent indoors), the graph shows probability curves for bud mortality as a function of the outdoor temperature. The median lethal temperature is where these curves and inverse confidence intervals intersect the line of 0.5 probability. (They are inverse confidence limits because they are for the value of X that produces a given Y value.) You can use PROC PRINT to display the estimates for LD50 for each treatment group:

The estimates for LD50 use all the data to model the bud mortality. This method also produces 95% (inverse) confidence limits for the LD50 parameter. For one of the treatment groups (TimeIndoors=14), the confidence limits could not be produced. The documentation for PROC PROBIT discusses why this can happen.
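
For context, here is the calculation behind the inverse estimate, assuming the single-regressor probit model that is fit above (the default link, with no natural-response parameter). The fitted model is \( \Pr(\text{bud damage at temperature } x) = \Phi(\beta_0 + \beta_1 x) \), where Φ is the standard normal CDF. Setting the probability to 0.5 and using \( \Phi^{-1}(0.5) = 0 \) gives \( \beta_0 + \beta_1\,\mathrm{LD50} = 0 \), so \( \mathrm{LD50} = -\beta_0 / \beta_1 \).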

If you want, you can use this table to estimate the differences between the LD50 values.

Summary

In summary, growing beautiful Christmas trees requires a lot of science and statistics. This article analyzes data from an experiment in which potted Christmas trees were brought indoors and later returned outdoors where they can experience freezing temperatures. Trees that are brought indoors lose some of their hardiness and can be damaged by freezing temperatures. This article shows how to use PROC PROBIT in SAS to compute the median lethal temperature (LD50), which is the temperature at which half of the trees would be damaged. After 20 days indoors, a spruce tree will lose about 10 degrees (C) of resistance to freezing temperatures.

If you are thinking about getting a containerized (live) Christmas tree, Cregg (2020) wrote an article about how to take care of it and how to prepare it for transplanting after Christmas. He suggests waiting until spring. In the 2009 experiment, 100% of the trees that had been brought indoors for 10 or more days died within six months of transplanting, as compared to 0% of the control group. This happened even though the outdoor temperature never approached the LD50 level. This experiment was done in Michigan, so the results might not apply to trees in warmer regions.

The post Can you transplant an indoor Christmas tree? appeared first on The DO Loop.

November 23, 2020
 

I previously showed how to create a decile calibration plot for a logistic regression model in SAS. A decile calibration plot (or "decile plot," for short) is used in some fields to visualize agreement between the data and a regression model. It can be used to diagnose an incorrectly specified model.

In principle, the decile plot can be applied to any regression model that has a continuous regressor. In this article, I show how to create two decile plots for an ordinary least squares regression model. The first overlays a decile plot on a fit plot, as shown to the right. The second decile plot shows the relationship between the observed and predicted responses.

Although this article shows how to create a decile plot in SAS, I do not necessarily recommend them. In many cases, it is better and simpler to use a loess fit, as I discuss at the end of this article.

A misspecified model

A decile plot can be used to identify an incorrectly specified model. Suppose, for example, that a response variable (Y) depends on an explanatory variable (X) in a nonlinear manner. If you model Y as a linear function of X, then you have specified an incorrect model. The following SAS DATA step simulates a response variable (Y) that depends strongly on a linear effect and weakly on quadratic and cubic effects for X. Because the nonlinear dependence is weak, a traditional fit plot does not indicate that the model is misspecified:

/* Y = b0 + b1*x + b2*x**2 + b3*x**3 + error, where b2 and b3 are small */
%let N = 200;                      /* sample size */ 
data Have(keep= Y x);
call streaminit(1234);             /* set the random number seed */
do i = 1 to &N;                    /* for each observation in the sample  */
   x = rand("lognormal", 0, 0.5);  /* x = explanatory variable */
   y = 20 + 5*x - 0.5*(x-3)**3 +   /* include weak cubic effect */
            rand("Normal", 0, 4); 
   output;
end;
run;
 
proc reg data=Have;               /* Optional: plots=residuals(smooth); */
   model y = x;
quit;

The usual p-values and diagnostic plots (not shown) do not reveal that the model is misspecified. In particular, the fit plot, which shows the observed and predicted response versus the explanatory variable, does not indicate that the model is misspecified. At the end of this article, I discuss how to get PROC REG to overlay additional information about the fit on its graphs.

Overlay a decile plot on a fit plot

This section shows how to overlay a decile plot on the fit plot. The fit plot shows the observed values of the response (Y) for each value of the explanatory variable (X). You can create a decile fit plot as follows:

  1. Bin the X values into 10 bins by using the deciles of the X variable as cut points. In SAS, you can use PROC RANK for this step.
  2. For each bin, compute the mean Y value and a confidence interval for the mean. At the same time, compute the mean of each bin, which will be the horizontal plotting position for the bin. In SAS, you can use PROC MEANS for this step.
  3. Overlay the fit plot with the 10 confidence intervals and mean values for Y. You can use PROC SGPLOT for this step.
/* 1. Assign each obs to a decile for X */
proc rank data=Have out=Decile groups=10 ties=high;
   var x;           /* variable on which to group */
   ranks Decile;    /* name of variable to contain groups 0,1,...,k-1 */
run;
 
/* 2. Compute the mean and the empirical proportions (and CI) for each decile */
proc means data=Decile noprint;
   class Decile;
   types Decile;
   var y x;
   output out=DecileOut mean=YMean XMean 
          lclm=yLower uclm=yUpper;  /* only compute limits for Y */
run;
 
/* 3. Create a fit plot and overlay the deciles. */
data All;
   set Have DecileOut; 
run; 
 
title "Decile Plot";
title2 "Observed and Predicted Response versus Explanatory";
proc sgplot data=All noautolegend;
   reg x=x y=y / nomarkers CLM alpha=0.1 name="reg";
   scatter x=xMean y=YMean / yerrorlower=yLower yerrorupper=yUpper 
           name="obs" legendlabel="Mean Y and 90% CI" markerattrs=GraphData2;
   fringe x / transparency=0.5;
   xaxis label="x" values=(0 to 4.5 by 0.5) valueshint;
   yaxis label="y" offsetmin=0.05;
   keylegend "reg" "obs" / position=NW location=inside across=1;
run;

The graph is shown at the top of this article. I use the FRINGE statement to display the locations of the X values at the bottom of the plot. The graph indicates that the model might be misspecified. The mean Y values for the first three deciles are all above the predicted value from the model. For the next six deciles, the mean Y values are all below the model's predictions. This kind of systematic deviation from the model can indicate that the model is missing an important component. In this case, we know that the true response is cubic whereas the model is linear.

A decile plot of observed versus predicted response

You can create a fit plot when the model contains a single continuous explanatory variable. If you have multiple explanatory variables, you can assess the model by plotting the observed responses against the predicted responses. You can overlay a decile plot by computing the average observed response for each decile of the predicted response. When you hear someone use the term "decile plot," they are probably referring to this second plot, which is applicable to most regression models.

The SAS program is almost identical to the program in the previous section. The main difference is that you need to create an output data set (FIT) that contains the predicted values (Pred). You then use the Pred variable instead of the X variable throughout the program, as follows:

/* 1. Compute predicted values */
proc reg data=Have noprint;
   model y = x;
   output out=Fit Pred=Pred;
quit;
 
/* 2. Assign each obs to a decile for Pred */
proc rank data=Fit out=Decile groups=10 ties=high;
   var Pred;        /* variable on which to group */
   ranks Decile;    /* name of variable to contain groups 0,1,...,k-1 */
run;
 
/* 3. Compute the mean and the empirical proportions (and CI) for each decile */
proc means data=Decile noprint;
   class Decile;
   types Decile;
   var y Pred;
   output out=DecileOut mean=YMean PredMean 
          lclm=yLower uclm=yUpper;  /* only compute limits for Y */
run;
 
/* 4. Create an Observed vs Predicted plot and overlay the deciles. */
data All;
   set Fit DecileOut; 
run; 
 
title "Decile Plot";
title2 "Observed Response verus Predicted Response";
proc sgplot data=All noautolegend;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash) clip; /* add identity line */
   scatter x=PredMean y=YMean / yerrorlower=yLower yerrorupper=yUpper 
           name="obs" legendlabel="Mean Y and 95% CI" markerattrs=GraphData2;
   fringe Pred;
   xaxis label="Predicted Response";
   yaxis label="Observed Response" offsetmin=0.05;
   keylegend "obs" / position=NW location=inside across=1;
run;

The decile plot includes a diagonal line, which is the line of perfect agreement between the model and the data. In a well-specified model, the 10 empirical means of the deciles should be close to the line, and they should vary randomly above and below the line. The decile plot for the misspecified model shows a systematic trend rather than random variation. For the lower 30% of the predicted values, the observed response is higher (on average) than the predictions. When the predicted values are in the range [29, 32] (the fourth through ninth deciles), the observed response is lower (on average) than the predictions. Thus, the decile plot indicates that the linear model might be missing an important nonlinear effect.

The loess alternative

A general principle in statistics is that you should not discretize a continuous variable unless absolutely necessary. You can often get better insight by analyzing the continuous variable. In accordance with that principle, this section shows how to identify misspecified models by using a loess smoother rather than by discretizing a continuous variable into deciles.

Besides this general principle, there is a practical consideration: a decile plot requires three steps (compute the deciles, compute the means within each decile, and merge the decile information with the original data so that you can plot it). As I discuss at the end of my previous article, it is often easier to overlay a loess smoother.

The following statements use the LOESS statement in PROC SGPLOT to add a loess smoother to a fit plot. If the loess fit is not approximately linear, it could indicate that the linear model is not appropriate.

title "Linear Regression Fit";
proc sgplot data=Have;
   reg x=x y=y / clm;              /* fit linear model y = b0 + b1*x */
   loess x=x y=y / nomarkers;      /* use loess fit instead of decile plot */
run;

This plot is the continuous analog to the first decile plot. It tells you the same information: For very small and very large values of X, the predicted values are too low. The loess curve suggests that the linear model is lacking an effect.

The previous plot is useful for single-variable regressions. But even if you have multiple regressors, you can add a loess smoother to the "observed versus predicted" plot. If the FIT data set contains the observed and predicted response, you can create the graph as follows:

title "Observed versus Predicted";
proc sgplot data=Fit;
   loess x=Pred y=y;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash) clip legendlabel="Identity Line";
run;

This plot is the continuous analog to the second decile plot. The loess curve indicates that the observed response is higher than predicted for small and large predicted values. When the model predicts values of Y in the range [29.5, 32], the predicted values tend to be smaller than the observed values. This is the same information that the decile plot provides, but the loess curve is easier to use and gives more information because it does not discretize a continuous variable.

Finally, you can add a loess smoother to a residual plot by using an option in PROC REG. In a well-specified model, the residual plot should not show any systematic trends. You can use the PLOTS=RESIDUALS(SMOOTH) option on the PROC REG statement to add a loess smoother to the residual plot, as follows:
proc reg data=Have plots=residuals(smooth);
   model y = x;
   ods select ResidualPlot;
quit;

The loess smoother on the residual plot shows that the residuals seem to have a systematic component that is not captured by the linear model. When you see the loess fit, you might be motivated to fit models that are more complex. You can use this option even for models that have multiple explanatory variables.
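
For example, because the true response in this example is cubic, you might refit the data with a polynomial model and recheck the smoothed residual plot. The following statements are a sketch, not part of the original analysis; they assume the Have data set from earlier and create the quadratic and cubic terms in a DATA step:

data HaveCubic;
   set Have;
   x2 = x**2;                  /* quadratic term */
   x3 = x**3;                  /* cubic term */
run;
 
proc reg data=HaveCubic plots=residuals(smooth);
   model y = x x2 x3;          /* cubic model */
   ods select ResidualPlot;
quit;

If the cubic model is adequate, the loess smoother on the new residual plot should be approximately flat.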

Summary

In summary, this article shows how to create two decile plots for least-squares regression models in SAS. The first decile plot adds 10 empirical means to a fit plot. You can create this plot when you have a single continuous regressor. The second decile plot is applicable to models that have multiple continuous regressors. It adds 10 empirical means to the "observed versus predicted" plot. Lastly, although you can create decile plots in SAS, in many cases it is easier and more informative to use a loess curve to assess the model.

The post Decile plots in SAS appeared first on The DO Loop.

11月 182020
 
Predicted probabilities for a logistic regression model

To help visualize regression models, SAS provides the EFFECTPLOT statement in several regression procedures and in PROC PLM, which is a general-purpose procedure for post-fitting analysis of linear models. When scoring and visualizing a model, it is important to use reasonable combinations of the explanatory variables for the visualization. When the explanatory variables are correlated, the standard visualizations might evaluate the model at unreasonable (or even impossible!) combinations of the explanatory variables. This article describes how to generate points in an ellipsoid and score (evaluate) a regression model at those points. The ellipsoid contains typical combinations of the explanatory variables when the explanatory variables are multivariate normal. An example is shown in the graph to the right, which shows the predicted probabilities for a logistic regression model.

This technique can help prevent nonsensical predicted values. For example, in a previous article, I show an example of a quadratic regression surface that appears to predict negative values for a quantity that can never be negative (such as length, mass, or volume). The reason for the negative prediction is that the model is being evaluated at atypical points that are far from the explanatory values in the experiment.

Example: Visualizing the predicted probabilities for a logistic model

Let's look at an example of a regression model for which the regressors are correlated. Suppose you want to predict the gender of the 19 students in the Sashelp.Class data based on their height and weight. (Apologies to my international readers, but the measurements are in pounds and inches.) The following call to PROC LOGISTIC fits a logistic model to the data and uses the EFFECTPLOT statement to create a contour plot of the predicted probability that a student is female:

proc logistic data=Sashelp.Class;
   model Sex = Height Weight;
   effectplot contour / ilink;      /* contour plot of pred prob */
   store out=LogiModel;             /* store model for later */
run;

The contour plot shows the predicted probability of being female according to the model. The red regions have a high probability of being female and the blue regions have a low probability. For a given height, lighter students are more likely to be female.

This graph is a great summary of the model. However, because height and weight are correlated, the data values do not fill the rectangular region in the graph. There are no students in the upper-left corner near (height, weight)=(51, 150). Such a student would be severely obese. Similarly, there are no students in the lower-right corner near (height, weight)=(72, 50). Such a student would be severely underweight.

You can graph just the explanatory variables and overlay a 95% prediction ellipse for bivariate normal data, as shown below:

title "Weight versus Height";
proc sgplot data=Sashelp.Class;
   scatter x=Height y=Weight / jitter transparency=0.25 group=Sex markerattrs=(size=10 symbol=CircleFilled);
   ellipse x=Height y=Weight;    /* 95% prediction ellipse for bivariate normal data */
run;

The heights and weights of most students will fall into the prediction ellipse, assuming that the variables are multivariate normal with a mean vector that is close to (62.3, 100), which is the mean of the data.

If you want to score the model at "representative" values of the data, it makes sense to generate points inside the ellipse. The following sections show how to generate a regular grid of values inside the ellipse and how to score the logistic model on those points.

Generate data in an ellipsoid

I can think of several ways to generate a grid of values inside the ellipse, but the easiest way is to use the Mahalanobis distance (MD). The Mahalanobis distance is a metric that accounts for the correlation in data. Accordingly, you can generate a regular grid of points and then keep only the points whose Mahalanobis distance to the mean vector is less than a certain cutoff value. In this section, I choose the cutoff value to be the maximum Mahalanobis distance in the original data. Alternatively, you can use probability theory and the fact that the distribution of the squared MD values is asymptotically chi-squared.

The following SAS/IML program reads the explanatory variables and computes the sample mean and covariance matrix. You can use the MAHALANOBIS function to compute the Mahalanobis distance between each observation and the sample mean. The maximum MD for these data is about 2.23. You can use the ExpandGrid function to generate a regular grid of points, then keep only those points whose MD is less than 2.23. The resulting points are inside an ellipse and represent "typical" values for the variables.

proc iml;
varNames = {'Height' 'Weight'};
use Sashelp.Class; read all var varNames into X;  close;
 
S = cov(X);
m = mean(X);
MD_Data = mahalanobis(X, m, S);
MaxMD = max(MD_Data);
print MaxMD;
 
minX = 50;  maxX = 72;           /* or use min(X[,1]) and max(X[,1]) */
minY = 50;  maxY = 150;          /* or use min(X[,2]) and max(X[,2]) */
xRng = minX:maxX;                /* step by 1 in X */
yRng = do(minY, maxY, 5);        /* step by 5 in Y */
 
u = ExpandGrid( xRng, yRng );    /* regular grid of points */
MD = mahalanobis(u, m, S);       /* MD to center for each point */
u = u[ loc(MD < MaxMD), ];       /* keep only points where MD < cutoff */
ods graphics / width=300px height=300px;
call scatter(u[,1], u[,2]);
 
create ScoreIn from u[c=varNames];   /* write score data set */
   append from u;
close;
QUIT;
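
As mentioned earlier, an alternative to using the maximum MD in the data is to choose the cutoff from the asymptotic chi-square distribution of the squared distances. The following statements are a sketch of that alternative, not part of the original program. For bivariate data, the squared MD is approximately chi-square with two degrees of freedom, so the cutoff for the MD is the square root of a chi-square quantile:

proc iml;
varNames = {'Height' 'Weight'};
use Sashelp.Class; read all var varNames into X;  close;
S = cov(X);
m = mean(X);
cutoff = sqrt( quantile("ChiSq", 0.95, ncol(X)) );  /* about 2.45 for 2 DF */
 
u = ExpandGrid( 50:72, do(50, 150, 5) );  /* same grid as before */
MD = mahalanobis(u, m, S);                /* MD to center for each point */
u = u[ loc(MD < cutoff), ];               /* keep points inside the 95% ellipse */
QUIT;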

Score data in an ellipsoid

You can use PROC PLM to score the logistic model at the points in the ellipsoid. In the earlier call to PROC LOGISTIC, I used the STORE statement to create an item store named LogiModel. The following call to PROC PLM scores the model and uses the ILINK option to output the predicted probabilities. You can use the HEATMAPPARM statement in PROC SGPLOT to visualize the predicted probabilities. So that the color ramp always visualizes probabilities on the interval [0,1], I add two fake observations (0 and 1) to the predicted values.

proc PLM restore=LogiModel noprint;
   score data=ScoreIn out=ScoreOut / ilink;      /* predicted probability */
run;
 
data Fake;           /* Optional trick: Add 0 and 1 so colorramp shows range [0,1] */
Predicted=0; output;
Predicted=1; output;
run;
 
data Score;          /* concatenate the real and fake data */
   set ScoreOut Fake;
run;
 
title "Predicted Values of Sex='F' for Logistic Model";
proc sgplot data=Score;
   styleattrs wallcolor=CxF8F8F8;
   heatmapparm  x=Height y=Weight colorresponse=Predicted;
   xaxis grid; yaxis grid;
run;

The graph is shown at the top of this article. It shows the predicted probabilities for representative combinations of the height and weight variables. Students whose measurements appear in the lower portion of the ellipse (red) are more likely to be female. Students in the upper portion of the ellipse (blue) are more likely to be male. It is not likely that students have measurements outside the ellipse, so the model is not evaluated outside the ellipse.

Summary

This article shows how you can use the Mahalanobis distance to generate scoring data that are in an ellipsoid that is centered at the sample mean vector. Such data are appropriate for scoring "typical" combinations of variables in data that are approximately multivariate normal. Although the example used a logistic regression model, the idea applies to any model that you want to score. This technique can help to prevent scoring the model at unrealistic values of the explanatory variables.

The post Create scoring data when regressors are correlated appeared first on The DO Loop.

11月 092020
 

Intuitively, the skewness of a unimodal distribution indicates whether a distribution is symmetric or not. If the right tail has more mass than the left tail, the distribution is "right skewed." If the left tail has more mass, the distribution is "left skewed." Thus, estimating skewness requires estimating the relative sizes of the left and right tails of a distribution.

The standard definition of skewness is called the moment coefficient of skewness because it is based on the third central moment. The moment coefficient of skewness is a biased estimator and is also not robust to outliers in the data. This article discusses an estimator proposed by Hogg (1974) that is robust and less biased. The ideas in this article are based on Bono, et al. (2020). Bono et al. provide a SAS macro that uses SAS/IML to compute the estimates. This article provides SAS/IML functions that compute the estimates more efficiently and more succinctly.

Early robust estimators of skewness

Hogg's estimate of skewness is merely one of many robust measures of skewness that have been proposed since the 1920s. I have written about the Bowley-Galton skewness, which is based on quantiles. The Bowley-Galton estimator defines the length of the right tails as Q3-Q2 and the length of the left tail as Q2-Q1. (Q1 is the first quartile, Q2 is the median, and Q3 is the third quartile.) It then defines the quantile skewness as half the relative difference between the lengths, which is BG = ((Q3-Q2)-(Q2-Q1)) / (Q3-Q1).

Although "the tail" does not have a universally accepted definition, some people would argue that a tail should include portions of the distribution that are more extreme than the Q1 and Q3 quartiles. You could generalize the Bowley-Galton estimator to use more extreme quantiles. For example, you could use the upper and lower 20th percentiles. If P20 is the 20th percentile and P80 is the 80th percentile, you can define an estimator to be ((P80-Q2)-(Q2-P20)) / (P80-P20). The choice of percentiles to use is somewhat arbitrary.

Hogg's estimator of skewness

Hogg's measure of skewness (1974, p. 918) is a ratio of the right tail length to the left tail length. The right tail length is estimated as U(0.05) – M25, where U(0.05) is the average of the largest 5% of the data and M25 is the 25% trimmed mean of the data. The left tail length is estimated as M25 – L(0.05), where L(0.05) is the average of the smallest 5% of the data. The Hogg estimator is
SkewH = (U(0.05) – M25) / (M25 - L(0.05))

Note that the 25% trimmed mean is the mean of the middle 50% of the data. Thus, the estimator is built from the means of various subsets of the data, where the subsets are defined by quantiles.

Compute the mean of the lower and upper tails

I have written about how to visualize and compute the average value in a tail. However, I did not discuss how to make sense of phrases like "5% of the data" in a sample of size N when 0.05*N is not an integer. There are many ways to estimate quantiles, but Bono et al. (2020) use an interpolation method (QNTLDEF=1 in SAS) to compute a weighted mean of the data based on the estimated quantiles.

In Bono et al., the "mean of the lower p_th quantile" is computed by finding k = floor(p*N), which is the largest integer that does not exceed p*N, and the fractional remainder, r = p*N – k. After sorting the data from lowest to highest, you sum the first k values and a fraction of the next value. You then compute the average as (sum(x[1:k]) + r*x[k+1]) / (k+r), as shown in the following program. The mean of the upper tail is computed similarly.

proc iml;
/* ASSUME x is sorted in ascending order.
   Compute the mean of the lower p_th percentile of the data.
   Use linear interpolation if p*N is not an integer. Assume 0<= p <= 1. 
*/
start meanLowerPctl(p, x);
   N = nrow(x);  
   k = floor(p*N);                 /* index less than p*N */
   r = p*N - k;                    /* r >= 0 */
   if k<1 then m = x[1];           /* use x[1] as the mean */
   else if k>=N then m = mean(x);  /* use mean of all data */
   else do;                        /* use interpolation */
      sum = sum(x[1:k]) + r*x[k+1];
      m = sum / (k+r);
   end;
   return( m );
finish;
 
/* ASSUME x is sorted in ascending order.
   Compute the mean of the upper p_th percentile of the data.
*/
start meanUpperPctl(p, x);
   q = 1-p; 
   N = nrow(x); 
   k = ceil(q*N);                  /* smallest integer >= q*N */
   r = k - q*N;                    /* r >= 0 */
   if k<=1 then m = mean(x);       /* use mean of all data */
   else if k>=N then m = x[N];     /* use x[N] as the mean */
   else do;                        /* use interpolation */
      sum = sum(x[k+1:N]) + r*x[k];
      m = sum / (N-k+r);
   end;
   return( m );
finish;

Let's run a small example so that we can check that the functions work. The following statements define an ordered sample of size N=10 and compute the mean of the lower and upper p_th quantiles for p=0.1, 0.2, ..., 1.

x = {2, 4, 5, 7, 8, 8, 9, 9, 12, 16};  /* NOTE: sorted in increasing order */
obsNum = T(1:nrow(x));
Pctl = obsNum / nrow(x);
meanLow = j(nrow(Pctl), 1, .);
meanUp = j(nrow(Pctl), 1, .);
do i = 1 to nrow(Pctl);
   meanLow[i] = meanLowerPctl(Pctl[i], x);
   meanUp[i] = meanUpperPctl(Pctl[i], x);
end;
print Pctl meanLow meanUp;

Because there are 10 observations and the program uses deciles to define the tails, each row gives the mean of the smallest and largest k observations for k=1, 2, ..., 10. For example, the second row indicates that the mean of the two smallest values is 3, and the mean of the two largest values is 14.

If you plot the estimated means as a function of the percentile, you can see how the functions interpolate between the data values. The following statements compute and graph the mean of the lower p_th quantile as a function of p. The graph of the mean of the upper tail is computed in an analogous fashion:

/* plot the mean of the lower tail as a function of p */
p = do(0, 1, 0.01);
mLow = j(1, ncol(p), .);
do i = 1 to ncol(p);
   mLow[i] = meanLowerPctl(p[i], x);
end;
 
title "Mean of Lower p Percentile of Data";
call series(p, mLow) grid={x y} xvalues=do(0,1,0.1) label={"Percentile" "Mean"};

Compute Hogg's skewness in SAS

With these functions, you can compute Hogg's estimate of skewness. The following SAS/IML function implements the formula. The function is then tested on a large random sample from the exponential distribution.

/* estimate of Hogg skewness */
start HoggSkew(_x);
   x = _x;
   call sort(x);
   L05 = meanLowerPctl(0.05, x);
   U05 = meanUpperPctl(0.05, x);
   m25 = mean(x, 'trim', 0.25);
   Skew_Hogg = (U05 - m25) / (m25 - L05); /* Right Skewness: Hogg (1974, p. 918) */
   return Skew_Hogg;
finish;
 
/* generate random sample from the exponential distribution */
call randseed(12345, 1);
N = 1E5;
x = randfun(N, "Expon");
Skew_Hogg = HoggSkew(x);  /* find the Hogg skewness for exponential data */
print Skew_Hogg;

The sample estimate of Hogg's skewness for these data is 4.558. The value of Hogg's skewness for the exponential distribution is 4.569, so there is close agreement between the parameter and its estimate from the sample.

Compute Hogg's kurtosis in SAS

In the same paper, Hogg (1974) proposed a robust measure of kurtosis, which is
KurtH = (U(0.2) – L(0.2)) / (U(0.5) – L(0.5))
where U(0.2) and L(0.2) are the means of the upper and lower 20% of the data (respectively) and U(0.5) and L(0.5) are the means of the upper and lower 50% of the data (respectively). You can use the same SAS/IML functions to compute Hogg's kurtosis for the data:

/* estimate of Hogg kurtosis */
start HoggKurt(_x);
   x = _x;
   call sort(x);
   L20 = meanLowerPctl(0.20, x);
   U20 = meanUpperPctl(0.20, x);
   L50 = meanLowerPctl(0.50, x);
   U50 = meanUpperPctl(0.50, x);
   Kurt_Hogg = (U20 - L20) / (U50 - L50); /* Kurtosis: Hogg (1974, p. 913) */
   return Kurt_Hogg;
finish;
 
Kurt_Hogg = HoggKurt(x);  /* find the Hogg kurtosis for exponential data */
print Kurt_Hogg;

For comparison, the exact value of the Hogg kurtosis for the exponential distribution is 1.805.
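
If you want to compare these robust estimates with the usual moment-based estimates, one option (a sketch, not part of the original program) is to write the simulated sample to a data set and call PROC MEANS. The statements assume that the PROC IML session from the previous sections is still active:

create ExpSample from x[colname="x"];   /* write the simulated exponential sample */
append from x;
close;
QUIT;
 
proc means data=ExpSample n mean skewness kurtosis;
   var x;
run;

Keep in mind that the kurtosis reported by PROC MEANS is the excess kurtosis (the normal distribution has excess kurtosis 0), so it is on a different scale than Hogg's measure.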

Summary

This article shows how to compute Hogg's robust measures of skewness and kurtosis. The article was inspired by Bono et al. (2020), who explore the bias and accuracy of Hogg's measures of skewness and kurtosis as compared to the usual moment-based skewness and kurtosis. They conclude that Hogg’s estimators are less biased and more accurate. Bono et al. provide a SAS macro that computes Hogg's estimators. In this article, I rewrote the computations as SAS/IML functions to make them more efficient and more understandable. These functions enable researchers to compute Hogg's skewness and kurtosis for data analysis and in simulation studies.

You can download the complete SAS program that defines the Hogg measures of skewness and kurtosis. The program includes computations that show how to compute these values for a distribution.

The post Robust statistics for skewness and kurtosis appeared first on The DO Loop.