data analysis

7月 102019

I recently showed how to create an annotation data set that will overlay cell counts or percentages on a mosaic plot. A mosaic plot is a visual representation of a cross-tabulation of observed frequencies for two categorical variables. The mosaic plot with cell counts is shown to the right. The previous article focused on how to create the mosaic plot and how to apply an annotation, assuming that the centers of each cell were provided. This article shows how to compute the center of the cells from the output of PROC FREQ.

A motivating example

In a two-way mosaic plot, the widths of the bars represent the proportions of the levels of the categorical variables. It makes sense, therefore, to use [0, 100] as the coordinate system for the mosaic plots. For example, suppose the horizontal variable has three levels (A, B, and C) and the proportion of A is 20%, the proportion of B is 30%, and the proportion of C is 50%. Then the widths of the bars in the coordinate system are 20, 30, and 50, respectively. The first bar covers the range [0, 20], the second bar covers [20, 50], and the third bar covers [50, 100]. The widths of the bars will occupy 20%, 30%, and 50% of the width of the mosaic plot. In general, if the widths of the bars are w1, w2, and w3, the bars cover the intervals [0, w1], [w1, w1+w2], and [w1+w2, 100]. The centers of the bars are at w1/2, w1 + w2/2, and w1 + w2 + w3/2, respectively.

Each bar is composed of smaller vertical bars. If you adopt a [0, 100] coordinate system for the vertical dimension, the heights of the sub-bars are the conditional proportions of the vertical variable within each level of the horizontal variable.

Obtain the proportions and conditional proportions from PROC FREQ

You can use PROC FREQ to obtain the proportions of each level and joint level. You can use the proportions to compute the center of each cell in a [0, 100] x [0, 100] coordinate system. I recommend using the CROSSLIST and SPARSE options, as shown in the following example, which shows a mosaic plot for the Origin and Type variables in the Sashelp.Cars data. For portability (and, hopefully, clarity), I have defined two macros that contain the name of the horizontal and vertical variables.

%let hVar = Type;             /* name of horizontal variable */
%let vVar = Origin;           /* name of vertical variable */
proc freq data=Sashelp.Cars;
   where Type ^= 'Hybrid';
   tables &vVar * &hVar / crosslist sparse 
   ods output CrossList=FreqList;    /* output CROSSLIST table */

The CROSSLIST option creates a data set where the formatted values of the horizontal and vertical variables (Origin and Type, respectively) begin with the prefix "F_", which stands for "formatted values." These are always character variables. You can use WHERE clauses to subset the data that you need to compute the centers of the mosaic cells:

  • When F_Origin equals "Total," the Frequency and Percent variables contain the information you need to compute the width of the mosaic bars.
  • The other rows contain the information needed to compute the heights of the bars for each stacked bar. The ColPercent variable contains the conditional proportions for each bar.

The following PROC PRINT statements show the relevant information for this example:

title "Horizontal Percentages: &hVar";
proc print data=FreqList;
   where F_&vVar='Total' & F_&hVar^='Total';
   var F_&vVar F_&hVar Frequency Percent;
title "Vertical Percentages: &vVar";
proc print data=FreqList;
   where F_&vVar ^= 'Total' & F_&hVar ^= 'Total';
   by F_&vVar notsorted;
   var F_&vVar F_&hVar Frequency Percent ColPercent;

Find the centers of each cell

As shown in the previous section, the FreqList data set contains all the information you need to find the center of each cell in the mosaic plot. Mathematically, it is not hard to convert that information into the coordinates of the cell centers. Programmatically, the process is complicated. The following SAS/IML program computes the centers by using the following steps:

  1. Read the percentages for the levels of the horizontal variable.
  2. Use these and the CUSUM function to find the horizontal centers of the bars.
  3. For each level of the horizontal variable, read the percentages for the levels of the vertical variable.
  4. Use these to find the vertical centers of each stacked bar.
  5. Write this information to a SAS data set in the long form. Include the observed frequencies and percentages for each cell.
/* Read CROSSLIST data set and write data set that contains centers of cells in mosaic plot.
  Main idea: If a categorical variable has three levels and observed proportions are
  20, 30, and 50, then the midpoints of the bars are
       10 = 20/2
       35 = 20 + 30/2
       75 = 20 + 30 + 50/2
proc iml;
/* 1. read percentages for the horizontal variable */
use FreqList;
read all var {Percent F_&hVar F_&vVar}
         where (F_&vVar='Total' & F_&hVar^='Total');
hPos = Percent;            
nH = nrow(hPos);
hLabel = F_&hVar;
/* 2. horizontal centers
   h1/2,  h1 + h2/2, h1 + h2 + h3/2, h1 + h2 + h3 + h4/2, ... */
halfW = hPos / 2;
hCenter = halfW + cusum(0 // hPos[1:(nH-1)]);
print (hCenter`)[c=hLabel L="hCenter" F=5.2];
/* 3. For each column, read cell percentages and frequencies */
read all var {Frequency Percent ColPercent F_&vVar F_&hVar}
     where (F_&vVar^='Total' &  F_&hVar ^= 'Total');
FhVar = F_&hVar;              /* levels of horiz var */
FvVar = F_&vVar;              /* lavels of vert var */
*print FhVar fVVar Frequency Percent ColPercent;
/* 4. Get the counts and percentages for each cell.
   Vertical centers are 
   v1/2,  v1 + v2/2, v1 + v2 + v3/2, ... */
vLabel = shape( FvVar, 0, nH )[,1];
vPos = shape( ColPercent, 0, nH );
nV = nrow(vPos);
halfW = vPos / 2;
vCenters = j(nrow(vPos), ncol(vPos), .);
do i = 1 to nH;
   vCenters[,i] = halfW[,i] + cusum(0 // vPos[1:(nV-1),i]);
print vCenters[r=vLabel c=hLabel F=5.2];
/* 5. convert to a long format: (hPos, vPos, Freq, Pct) 
   and write to SAS data set */
hCenters = repeat(hCenter, nV);
CellFreq = shape( Frequency, 0, nH );
CellPct = shape( Percent, 0, nH );
result = colvec(hCenters) || colvec(vCenters) || 
         colvec(CellFreq) || colvec(CellPct);
*print FhVar FvVar result[c={"hCenter" "vCenter" "Freq" "Pct"}];
/* Optional: You might want to get rid of any labels that account to fewer 
   than 2% (or 1%) of the total. The criterion is up to you. For example: 
   idx = loc( result[,4] > 2 );  * keep if greater than 2% of total;
   result = result[idx, ];
/* write character and numeric vars separately, then merge together */
/* Character vars: pairwise labels of horiz and vertical categories */
create labels var {FhVar FvVar}; append; close;
/* Numeric vars: centers of cells, counts, and percentages */
create centers from result[c={"hCenter" "vCenter" "Freq" "Pct"}];
   append from result;
data annoData;
   merge labels centers;

Convert centers into an annotation data set

The variables and values in an SG annotation data set must use special names so that the annotation facility knows how to interpret the data. For an introduction to GTL annotation, see the GTL documentation or Warren Kuhfeld's free e-book Advanced ODS Graphics Examples. To overlay text, you need to include the following information:

  • The Label variable specifies the text to display.
  • The x1 and y1 variables specify the coordinates of the label (in the LAYOUTPERCENT drawing space).
  • The Width variable specifies the width of the label and the Anchor variable specifies how the text is anchored (left, right, centered,...) at the (x1, y1) location.

However, as discussed in my previous article, a "region plot" such as the mosaic plot does not support the GRAPHPERCENT or WALLPERCENT drawing areas. You have to use the LAYOUTPERCENT drawing area, which includes space for the axes. Therefore, you cannot merely use the centers as the (x1, x2) coordinates. Instead, you need to linearly transform the centers so that they correctly align with the mosaic cells.

I do not know how to do this step in a general way that will accommodate all situations. You need to look at the graph (including its physical dimensions, font sizes, the length of the labels, etc.) and make a guess about how to linearly transform the center data. In the following program, I estimate that the axis area is about 10% of the horizontal and vertical portion of the layout drawing region. Therefore, I shrink the centers by 10% (that is, multiply by 0.9) and translate the result by 10%. You might need to use different values for your data.

The following statements create the annotation data step. For completeness, I've also included the statements that define the mosaic template and create the graph with the annotation.

/* If we could use the WALLPERCENT drawing space, we could
   use (hCenter, vCenter) as the drawing coordinates:
   x1 = hCenter;
   y1 = vCenter;
   Unfortunately, we have to use LAYOUTPERCENT, so perform a linear
   transformation from "wall coordinates" to layout coordinates.
data anno;
set AnnoData;
length label $12;
/* use RETAIN stmt to define values that are constant */
retain function 'text' 
       y1space 'layoutpercent' x1space 'layoutpercent'
       width 4        /* make larger if you are plotting percentages */
       anchor 'center';
/* Guess a linear transform to LAYOUTPERCENT coordinates.
   Need to move inside graph area, so shrink and 
   translate to correct for left and bottom axes areas */
x1 = 0.9*hCenter + 10;
y1 = 0.9*vCenter + 10;   
label = put(Freq, 5.);
/* Note: PROC FREQ reverses Y axis b/c it sorts the FreqOut data in descending order. */
proc template;
  define statgraph mosaicPlotParm;
      entrytitle _TITLE;
      layout region;    /* REGION layout, so can't overlay text! */
      MosaicPlotParm category=(_HORZVAR _VERTVAR) count=_FREQ / 
             colorgroup=_VERTVAR name="mosaic";
      annotate;         /* required for annotation */
proc sgrender data=FreqOut template=mosaicPlotParm sganno=anno;
dynamic _VERTVAR="Origin" _HORZVAR="Type" _FREQ="Count"
        _TITLE="Basic Mosaic Plot with Counts";

The mosaic plot with text annotations is shown at the top of this program.

In the previous article, I showed how you can use the GTL annotation facility to overlay frequency counts or percentages on a mosaic plot in SAS. In this article, I show how to write a program that computes the centers of the cells from the percentages that PROC FREQ provides when you use the CROSSLIST SPARSE option. Unfortunately, after you compute the centers, you cannot use them directly because the mosaic plot does not support the WALLPERCENT drawing space. Instead, you must use the LAYOUTPERCENT drawing space, which means you need to linearly transform the cell centers.

You can download the SAS program that computes the centers of mosaic cells and uses those coordinates to annotate the mosaic plot.

The post Find the center of each cell in a mosaic plot appeared first on The DO Loop.

6月 262019

SAS/STAT software contains a number of so-called HP procedures for training and evaluating predictive models. ("HP" stands for "high performance.") A popular HP procedure is HPLOGISTIC, which enables you to fit logistic models on Big Data. A goal of the HP procedures is to fit models quickly. Inferential statistics such as standard errors, hypothesis tests, and p-values are less important for Big Data because, if the sample is big enough, all effects are significant and all hypothesis tests are rejected! Accordingly, many of the HP procedures do not support the same statistical tests as their non-HP cousins (for example, PROC LOGISTIC).

A SAS programmer recently posted an interesting question on the SAS Support Community. She is using PROC HPLOGISTIC for variable selection on a large data set. She obtained a final model, but she also wanted to estimate the covariance matrix for the parameters. PROC HPLOGISTIC does not support that statistic but PROC LOGISTIC does (the COVB option on the MODEL statement). She asks whether it is possible to get the covariance matrix for the final model from PROC LOGISTIC, seeing as PROC HPLOGISTIC has already fit the model? Furthermore, PROC LOGISTIC supports computing and graphing odds ratios, so is it possible to get those statistics, too?

It is an intriguing question. The answer is "yes," although PROC LOGISTIC still has to perform some work. The main idea is that you can tell PROC LOGISTIC to use the parameter estimates found by PROC HPLOGISTIC.

What portion of a logistic regression takes the most time?

The main computational burden in logistic regression is threefold:

  • Reading the data and levelizing the CLASS variables in order to form a design matrix. This is done once.
  • Forming the sum of squares and crossproducts matrix (SSCP, also called the X`X matrix). This is done once.
  • Maximizing the likelihood function. The parameter estimates require running a nonlinear optimization. During the optimization, you need to evaluate the likelihood function, its gradient, and its Hessian many times. This is an expensive operation, and it must be done for each iteration of the optimization method.

The amount of time spent required by each step depends on the number of observations, the number of effects in the model, the number of levels in the classification variables, and the number of computational threads available on your system. You can use the Timing table in the PROC HPLOGISTIC output to determine how much time is spent during each step for your data/model combination. For example, the following results are for a simulated data set that has 500,000 observations, 30 continuous variables, four CLASS variables (each with 3 levels), and a main-effects model. It is running on a desktop PC with four cores. The iteration history is not shown, but this model converged in six iterations, which is relatively fast.

ods select PerformanceInfo IterHistory Timing;
proc hplogistic data=simLogi OUTEST;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass;
   ods output ParameterEstimates=PE;
   performance details;

For these data and model, the Timing table tells you that the HPLOGISITC procedure fit the model in 3.68 seconds. Of that time, about 17% was spent reading the data, levelizing the CLASS variables, and accumulating the SSCP matrix. The bulk of the time was spent evaluating the loglikelihood function and its derivatives during the six iterations of the optimization process.

Reduce time in an optimization by improving the initial guess

If you want to improve the performance of the model fitting, about the only thing you can control is the number of iterations required to fit the model. Both HPLOGISTIC and PROC LOGISTIC support an INEST= option that enables you to provide an initial guess for the parameter estimates. If the guess is close to the final estimates, the optimization method will require fewer iterations to converge.

Here's the key idea: If you provide the parameter estimates from a previous run, then you don't need to run the optimization algorithm at all! You still need to read the data, form the SSCP matrix, and evaluate the loglikelihood function at the (final) parameter estimates. But you don't need any iterations of the optimization method because you know that the estimates are optimal.

How much time can you save by using this trick? Let's find out. Notice that the previous call used the ODS OUTPUT statement to output the final parameter estimates. (It also used the OUTEST option to add the ParmName variable to the output.) You can run PROC TRANSPOSE to convert the ParameterEstimates table into a data set that can be read by the INEST= option on PROC HPLOGISTIC or PROC LOGISTIC. For either procedure, set the option MAXITER=0 to bypass the optimization process, as follows:

/* create INEST= data set from ParameterEstimates */
proc transpose data=PE out=inest(type=EST) label=_TYPE_;
   label Estimate=PARMS;
   var Estimate;
   id ParmName;
ods select PerformanceInfo IterHistory Timing;
proc hplogistic data=simLogi INEST=inest MAXITER=0;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass;
   performance details;
The time required to read the data, levelize, and form the SSCP is unchanged. However, the time spent evaluating the loglikelihood and its derivatives is about 1/(1+NumIters) of the time from the first run, where NumIters is the number of optimization iterations (6, for these data). The second call to the HPLOGISTIC procedure still has to fit the loglikelihood function to form statistics like the AIC and BIC. It needs to evaluate the Hessian to compute standard errors of the estimates. But the total time is greatly decreased by skipping the optimization step.

Use PROC HPLOGISTIC estimates to jump-start PROC LOGISTIC

You can use the same trick with PROC LOGISTIC. You can use the INEST= and MAXITER=0 options to greatly reduce the total time for the procedure. At the same time, you can request statistics such as COVB and ODDSRATIO that are not available in PROC HPLOGISTIC, as shown in the following example:

proc logistic data=simLogi INEST=inest 
   plots(only MAXPOINTS=NONE)=oddsratio(range=clip);
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass / MAXITER=0 COVB;
   oddsratio c1;

The PROC LOGISTIC step takes about 4.5 seconds. It produces odds ratios and plots for the model effects and displays the covariance matrix of the betas (COVB). By using the parameter estimates that were obtained by PROC HPLOGISTIC, it was able to avoid the expensive optimization iterations.

You can also use the STORE statement in PROC LOGISTIC to save the model to an item store. You can then use PROC PLM to create additional graphs and to run additional post-fit analyses.

In summary, PROC LOGISTIC can compute statistics and hypothesis tests that are not available in PROC HPLOGISTIC. it is possible to fit a model by using PROC HPLOGISTIC and then use the INEST= and MAXITER=0 options to pass the parameter estimates to PROC LOGISTIC. This enables PROC LOGISTIC to skip the optimization iterations, which saves substantial computational time.

You can download the SAS program that creates the tables and graphs in this article. The program contains a DATA step that simulates logistic data of any size.

The post Jump-start PROC LOGISTIC by using parameter estimates from PROC HPLOGISTIC appeared first on The DO Loop.

6月 242019

When fitting a least squares regression model to data, it is often useful to create diagnostic plots of the residuals versus the explanatory variables. If the model fits the data well, the plots of the residuals should not display any patterns. Systematic patterns can indicate that you need to include additional explanatory effects to model the data. Sometimes it is difficult to spot patterns in a seemingly random cloud of points, so some analysts like to add a scatter plot smoother to the residual plots. You can use the SMOOTH suboption to the PLOTS=RESIDUALS option in many SAS regression procedures to generate a panel of residual plots that contain loess smoothers. For SAS procedures that do not support the PLOTS=RESIDUALS option, you can use PROC SGPLOT to manually create a residual plot with a smoother.

Residual plots with loess smoothers

Many SAS linear regression procedures such as PROC REG and PROC GLM support the PLOTS=RESIDUAL(SMOOTH) option on the PROC statement. For example, the following call to PROC GLM automatically creates a panel of scatter plots where the residuals are plotted against each regressor. The model is a two-variable regression of the MPG_City variable in the Sashelp.Cars data.

/* residual plots with loess smoother */
ods graphics on;
proc glm data=Sashelp.Cars plots(only) = Residuals(smooth);
   where Type in ('SUV', 'Truck');
   model MPG_City = EngineSize Weight;
run; quit;

The loess smoothers can sometimes reveal patterns in the residuals that would not otherwise be perceived. In this case, it looks like there is a quadratic pattern to the residuals-versus-EngineSize graph (and perhaps for the Weight variable as well). This indicates that you might need to include a quadratic effect in the model. Because the EngineSize and Weight variables are highly correlated (ρ = 0.81), the following statements add only a quadratic effect for EngineSize:

proc glm data=Sashelp.Cars plots(only) = Residuals(smooth);
   where Type in ('SUV', 'Truck');
   model MPG_City = EngineSize Weight
                    EngineSize*EngineSize ;

After adding the quadratic effect, the residual plots do not reveal any obvious systematic trends. Also, the residual plot for Weight no longer shows any quadratic pattern.

How to use PROC SGPLOT to create a residual plot with a smoother

If you use a SAS procedure that does not support the PLOTS=RESIDUALS(SMOOTH) option, you can output the residual values to a SAS data set and use PROC SGPLOT to create the residual plots. Even when a procedure DOES support the PLOTS=RESIDUALS(SMOOTH) option, you might want to customize the plot by adding legends, by changing attributes of the markers or curve, or by specifying a value for the smoothing parameter.

An example is shown below. If you use the same model for MPG_City, but use all observations in the data set, the residual plot for EngineSize looks very strange. For these data, the smoothing parameter for the loess curve is very small and therefore the loess curve overfits the residuals:

proc glm data=Sashelp.Cars plots(only) = Residuals(smooth);
   model MPG_City = EngineSize Weight;
   output out=RegOut predicted=Pred residual=Residual;
run; quit;

Yuck! The loess curve for the plot on the left clearly overfits the residuals-versus-EngineSize data! Unfortunately, you cannot change the smoothing parameter from the PROC GLM syntax. However, you can change the default smoothing parameter in PROC SGPLOT and you can make other modifications to the plot as well. Notice in the previous call to PROC GLM that the OUTPUT statement creates a data set named RegOut that contains the residual values and the original variables. Therefore, you can create a residual plot and add a loess smoother by using PROC SGPLOT, as follows:

ods graphics / attrpriority=NONE;
title "Residuals for Model";
proc sgplot data=RegOut ;
   scatter x=EngineSize y=Residual / group=Origin;
   loess x=EngineSize y=Residual / nomarkers smooth=0.5;
   refline 0 / axis=y;
   xaxis grid; yaxis grid;

The smoothing parameter was manually set to 0.5, but you can use PROC LOESS if you want to choose a smoothing parameter that optimizes some information criterion such as the AICC statistic. Notice that you can use additional SGPLOT statements to add a reference grid and to change marker attributes. If you prefer, you could add a different kind of smoother such as a penalized B-spline by using the PBSPLINE statement.

You might wonder why the smoother in the residual plot for EngineSize is so small. The parameter is chosen to optimize a criterion such as the AICC statistic, so why does it overfit the data? An example in the PROC LOESS documentation provides an explanation. The chosen value for the smoothing parameter is one that corresponds to a local minimum of an objective function that involves the AICC statistic. Unfortunately, a set of data can have multiple local minima, and this is the case for the residuals of the EngineSize variable. When the smoothing parameter is 0.534, the AICC criterion reaches a local minimum. However, there are smaller values of the smoothing parameter for which the AICC criterion is even smaller. The minimum value of the AICC occurs when the smoothing parameter is 0.015, which leads to the "jagged" loess curve that is seen in the panel of residual plots shown earlier in this section. If you want to see this phenomenon yourself, run the following PROC LOESS code and look at the criterion plot.

ods select CriterionPlot SmoothingCriterion FitPlot;
proc loess data=RegOut;
   model Residual = EngineSize / select=AICC(global) ;

Because a data set can be smoothed at multiple scales, the "optimal" smoothing parameter that is chosen automatically by the PLOTS=RESIDUALS(SMOOTH) option might not enable you to see the general trend of the residuals. If you experience this phenomenon, output the residuals and use PROC SGPLOT or PROC LOESS to compute a more useful smoother.

In summary, SAS provides the PLOTS=RESIDUALS(SMOOTH) option to automatically create residual-versus-regressor plots. Although this panel usually provides a useful indication of patterns in the residuals, you can also output the residuals to a data set and use PROC SGPLOT or PROC LOESS to create a customized residual plot.

The post Add loess smoothers to residual plots appeared first on The DO Loop.

6月 192019

A previous article describes the DFBETAS statistics for detecting influential observations, where "influential" means that if you delete the observation and refit the model, the estimates for the regression coefficients change substantially. Of course, there are other statistics that you could use to measure influence. Two popular ones are the DFFTIS and Cook's distance, which is also known as Cook's D statistic. Both statistics measure the change in predicted values that occurs when you delete an observation and refit the model. This article describes the DFFITS and Cook's D statistics and shows how to compute and graph them in SAS.

DFFITS: How the predicted value changes if an observation is excluded

If you exclude an observation from a model and refit, the predicted values will change. The DFFITS statistic is a measure of how the predicted value at the i_th observation changes when the i_th observation is deleted. High-leverage points tend to pull the regression surface towards the response at that point, so the change in the predicted value at that point is a good indication of how influential the observation is. So that the DFFITS values are independent of the scale of the data, the change in predicted values is scaled by dividing by the standard error of the predicted value at that point. The exact formula is given in the documentation for PROC REG.. The book Regression Diagnostics by Belsley, Kuh, and Welsch (1980) suggests that an observation is influential if the magnitude of its DFFITS value exceeds 2*sqrt(p/n), where p is the number of effects in the model and n is the sample size.

PROC REG provides three ways to generate the DFFITS statistics for each observation:

  • You can create a graph of the DFFITS statistics by using the PLOTS=DFFITS option.
  • You can also display a table of the DFFITS (and other influence statistics) by using the INFLUENCE option in the MODEL statement.
  • You can write the DFFITS statistics to a data set by using the DFFITS= option in the OUTPUT statement.

The following DATA step extracts a subset of n = 84 vehicles from the Sashelp.Cars data, creates a short ID variable for labeling observations, and sorts the data by the response variable, MPG_City. The data are sorted because the DFFITS statistic is graphed against the observation number, which is an arbitrary quantity. By sorting the data, you know that small observation numbers correspond to low values of the response and so forth. If you have a short ID variable, you can label the influential observations by using the LABEL suboption, as follows:

/* Create sample data */
data cars;
where Type in ('SUV', 'Truck');
/* make short ID label from Make and Model values */
length IDMakeMod $20;
IDMakeMod = cats(substr(Make,1,4), ":", substr(Model,1,5));
/* Optional but helpful: Sort by response variable */
proc sort data=cars;
   by MPG_City;
proc reg data=Cars plots(only) = DFFITS(label); 
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
run; quit;

The DFFITS graph shows that three observations have a large positive DFFITS value. The observations are the Ford Excursion, the Ford Ranger, and the Madza BB230. For these observations, the predicted value (at the observation) is higher with the observation included in the model than if it were excluded. Thus, these observations "pull the regression up." There are four observations that have large negative DFFITS, which means that these observations "pull the regression down." They include the Land Rover Discovery and the Volvo XC90.

Cook's D: A distance measure for the change in regression estimates

When you estimate a vector of regression coefficients, there is uncertainty. The confidence regions for the parameter estimate is an ellipsoid in k-dimensional space, where k is the number of effects that you are estimating (including the intercept). Cook (1977) defines a distance that the estimates move within the confidence ellipse when the i_th point is deleted. Equivalently, Cook shows that the statistic is proportional to the squared studentized residual for the i_th observation. The documentation for PROC REG provides a formula in terms of the studentized residuals.

By default, PROC REG creates a plot of Cook's D statistic as part of the panel of diagnostic plots. (Cook's D is the second row and third column.) You can create a larger stand-alone plot by using the PLOTS=DFFITS option. Optionally, you can label the influential points (those whose Cook's D statistic exceeds 4/sqrt(n)) by using the LABEL suboption, as shown below:
/* create multiple plots and label influential points */
proc reg data=Cars plots(only) = (CooksD(label) DFFits(label));   
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
   output out=RegOut pred=Pred rstudent=RStudent dffits=DFFits cookd=CooksD; /* optional: output statistics */
run; quit;

In many ways, the plot of Cook's D looks similar to a plot of the squared DFFITS statistics. Both measure a change in the predicted value at the i_th observation when the i_th observation is excluded from the analysis. The formula for Cook's D statistic squares a residual-like quantity, so it does not show the direction of the change, whereas the DFFITS statistics do show the direction. Otherwise, the observations that are "very influential" are often the same for both statistics, as seen in this example.

The post Influential observations in a linear regression model: The DFFITS and Cook's D statistics appeared first on The DO Loop.

6月 172019

My article about deletion diagnostics investigated how influential an observation is to a least squares regression model. In other words, if you delete the i_th observation and refit the model, what happens to the statistics for the model? SAS regression procedures provide many tables and graphs that enable you to examine the influence of deleting an observation. For example:

  • The DFBETAS are statistics that indicate the effect that deleting each observation has on the estimates for the regression coefficients.
  • The DFFITS and Cook's D statistics indicate the effect that deleting each observation has on the predicted values of the model.
  • The COVRATIO statistics indicate the effect that deleting each observation has on the variance-covariance matrix of the estimates.

These observation-wise statistics are typically used for smaller data sets (n ≤ 1000) because the influence of any single observation diminishes as the sample size increases. You can get a table of these (and other) deletion diagnostics by using the INFLUENCE option on the MODEL statement of PROC REG in SAS. However, because there is one statistic per observation, these statistics are usually graphed. PROC REG can automatically generate needle plots of these statistics (with heuristic cutoff values) by using the PLOTS= option on the PROC REG statement.

This article describes the DFBETAS statistic and shows how to create graphs of the DFBETAS in PROC REG in SAS. The next article discusses the DFFITS and Cook's D statistics. The COVRATIO statistic is not as popular, so I won't say more about that statistic.

DFBETAS: How the coefficient estimates change if an observation is excluded

The documentation for PROC REG has a section that describes the influence statistics, which is based on the book Regression Diagnostics by Belsley, Kuh, and Welsch (1980, p. 13-14). Among these, the DFBETAS statistics are perhaps the easiest to understand. If you exclude an observation from the data and refit the model, you will get new parameter estimates. How much do the estimates change? Notice that you get one statistic for each observation and also one for each regressor (including the intercept). Thus if you have n observations and k regressors, you get nk statistics.

Typically, these statistics are shown in a panel of k plots, with the DFBETAS for each regressor plotted against the observation number. Because "observation number" is an arbitrary number, I like to sort the data by the response variable. Then I know that the small observation numbers correspond to low values of the response variable and large observation numbers correspond to high values of the response variable. The following DATA step extracts a subset of n = 84 vehicles from the Sashelp.Cars data, creates a short ID variable for labeling observations, and sorts the data by the response variable, MPG_City:

data cars;
where Type in ('SUV', 'Truck');
/* make short ID label from Make and Model values */
length IDMakeMod $20;
IDMakeMod = cats(substr(Make,1,4), ":", substr(Model,1,5));
proc sort data=cars;
   by MPG_City;
proc print data=cars(obs=5) noobs;
   var Make Model IDMakeMod MPG_City;

The first few observations are shown. Notice that the first observations correspond to small values of the MPG_City variable. Notice also a short label (IDMakeMod) identifies each vehicle.

There are two ways to generate the DFBETAS statistics: You can use the INFLUENCE option on the MODEL statement to generate a table of statistics, or you can use the PLOTS=DFBETAS option in the PROC REG statement to generate a panel of graphs. The following call to PROC REG generates a panel of graphs. The IMAGEMAP=ON option on the ODS GRAPHICS statement enables you to hover the mouse pointer over an observation and obtain a brief description of the observation:

ods graphics on / imagemap=on;              /* enable data tips (tooltips) */
proc reg data=Cars plots(only) = DFBetas; 
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
run; quit;
ods graphics / imagemap=off;

The panel shows the influence of each observation on the estimates of the four regression coefficients. The statistics are standardized so that all graphs can use the same vertical scale. Horizontal lines are drawn at ±2/sqrt(n) ≈ 0.22. Observations are called influential if they have a DFBETA statistic that exceeds that value. The graph shows a tool tip for one of the observations in the EngineSize graph, which shows that the influential point is observation 4, the Land Rover Discovery.

Each graph reveals a few influential observations:

  • For the intercept estimate, the most influential observations are numbers 1, 35, 83, and 84.
  • For the EngineSize estimates, the most influential observations are numbers 4, 35, and 38.
  • For the Horsepower estimates, the most influential observations are numbers 1, 4, and 38.
  • For the Weight estimates, the most influential observations are numbers 1, 24, 35, and 38.

Notice that several observations (such as 1, 35, and 38) are influential for more than one estimate. Excluding those observations causes several parameter estimates to change substantially.

Labeing the influential observations

For me, the panel of graphs is too small. I found it difficult to hover the mouse pointer exactly over the tip of a needle in an attempt to discover the observation number and name of the vehicle. Fortunately, if you want details like that, PROC REG supplies options that make the process easier. If you don't have too many observations, you can add labels to the DFBETAS plots by using the LABEL suboption. To plot each graph individually (instead of in a panel), use the UNPACK suboption, as follows:

proc reg data=Cars plots(only) = DFBetas(label unpack); 
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;

The REG procedure creates four plots, but only the graph for the Weight variable is shown here. In this graph, the influential observations are labeled by the IDMakeMod variable, which enables you to identify vehicles rather than observation numbers. For example, some of the influential observations for the Weight variable are the Ford Excursion (1), the Toyota Tundra (24), the Mazda B400 (35), and the Volvo XC90 (38).

A table of influential observations

If you want a table that displays the most influential observations, you can use the INFLUENCE option to generate the OutputStatistics table, which contains the DFBETAS for all regressors. You can write that table to a SAS data set and exclude any that do not have a large DFBETAS statistic, where "large" means the magnitude of the statistic exceeds 2/sqrt(n), where n is the sample size. The following DATA step filters the observations and prints only the influential ones.

ods exclude all;
proc reg data=Cars plots=NONE; 
   model MPG_City = EngineSize HorsePower Weight / influence;
   id IDMakeMod;
   ods output OutputStatistics=OutputStats;      /* save influence statistics */
run; quit;
ods exclude none;
data Influential;
set OutputStats nobs=n;
array DFB[*] DFB_:;
cutoff = 2 / sqrt(n);
ObsNum = _N_;
influential = 0;
DFBInd = '0000';                   /* binary string indicator */
do i = 1 to dim(DFB);
   if abs(DFB[i])>cutoff then do;  /* this obs is influential for i_th regressor */
      substr(DFBInd,i,1) = '1';
      influential = 1;
if influential;                    /* output only influential obs */
proc print data=Influential noobs;
   var ObsNum IDMakeMod DFBInd cutoff DFB_:;

The DFBInd variable is a four-character binary string that indicates which parameter estimates are influenced by each observation. Some observations are influential only for one coefficient; others (1, 3, 35, and 38) are influential for many variables. Creating a binary string for each observation is a useful trick.

By the way, did you notice that the name of the statistic ("DFBETAS") has a large S at the end? Until I researched this article, I assumed it was to make the word plural since there is more than one "DFBETA" statistic. But, no, it turns out that the S stands for "scaled." You can define the DFBETA statistic (without the S) to be the change in parameter estimates bb(i), but that statistic depends on the scale of the variables. To standardize the statistic, divide by the standard error of the parameter estimates. That scaling is the reason for the S as the end of DFBETAS. The same is true for the DFFITS statistic: S stands for "scaled."

The next article describes how to create similar graphs for the DFFITS and Cook's D statistics.


DFFITS: How the predicted values change if an observation is excluded

The DFFITS statistic measures, for each observation, how the predicted value at that observation changes if you exclude the observation and refit the model.

Cook's D: How the sum of the predicted values change if an observation is excluded

Cook's distance (D) statistic measures, for each observation, the sum of the differences in the predicted values (summed over all observations) if you exclude the observation and refit the model.

The post Influential observations in a linear regression model: The DFBETAS statistics appeared first on The DO Loop.

6月 122019

For linear regression models, there is a class of statistics that I call deletion diagnostics or leave-one-out statistics. These observation-wise statistics address the question, "If I delete the i_th observation and refit the model, what happens to the statistics for the model?" For example:

  • The PRESS statistic is similar to the residual sum of squares statistic but is based on fitting n different models, where n is the sample size and the i_th model excludes the i_th observation.
  • Cook's D statistic measures the influence of the i_th observation on the fit.
  • The DFBETAS statistics measure how the regression estimates change if you delete the i_th observation.

Although most references define these statistics in terms of deleting an observation and refitting the model, you can use a mathematical trick to compute the statistics without ever refitting the model! For example, the Wikipedia page on the PRESS statistic states, "each observation in turn is removed and the model is refitted using the remaining observations. The out-of-sample predicted value is calculated for the omitted observation in each case, and the PRESS statistic is calculated as the sum of the squares of all the resulting prediction errors." Although this paragraph is conceptually correct, theSAS/STAT documentation for PROC GLMSELECT states that the PRESS statistic "can be efficiently obtained without refitting the model n times."

A rank-1 update to the inverse of a matrix

Recall that you can use the "normal equations" to obtain the least squares estimate for the regression problem with design matrix X and observed responses Y. The normal equations are b = (X`X)-1(X`Y), where X`X is known as the sum of squares and crossproducts (SSCP) matrix and b is the least squares estimate of the regression coefficients. For data sets with many observations (very large n), the process of reading the data and forming the SSCP is a relatively expensive part of fitting a regression model. Therefore, if you want the PRESS statistic, it is better to avoid rebuilding the SSCP matrix and computing its inverse n times. Fortunately, there is a beautiful result in linear algebra that relates the inverse of the full SSCP matrix to the inverse when a row of X is deleted. The result is known as the Sherman-Morrison formula for rank-1 updates.

The key insight is that one way to compute the SSCP matrix is as a sum of outer products of the rows of X. Therefore if xi is the i_th row of X, the SCCP matrix for data where xi is excluded is equal to X`X - xi`xi. You have to invert this matrix to find the least squares estimates after excluding xi.

The Sherman-Morrison formula enables you to compute the inverse of X`X - xi`xi when you already know the inverse of X`X. For brevity, set A = X`X. The Sherman-Morrison formula for deleting a row vector xi` is
(A – xi`xi)-1 = A-1 + A-1 xi`xi A-1 / (1 – xiA-1xi`)

Implement the Sherman-Morrison formula in SAS

The formula shows how to compute the inverse of the updated SSCP by using a matrix-vector multiplication and an outer product. Let's use a matrix language to demonstrate the update method. The following SAS/IML program reads in a small data set, forms the SSCP matrix (X`X), and computes its inverse:

proc iml;
use Sashelp.Class;   /* read data into design matrix X */
read all var _NUM_ into X[c=varNames];  
XpX = X`*X;          /* form SSCP */
XpXinv = inv(XpX);   /* compute the inverse */

Suppose you want to compute a leave-one-out statistic such as PRESS. For each observation, you need to estimate the parameters that result if you delete that observation. For simplicity, let's just look at deleting the first row of the X matrix. The following program creates a new design matrix (Z) that excludes the row, forms the new SSCP matrix, and finds its inverse:

/* Inefficient: Manually delete the row from the X matrix 
   and recompute the inverse */
n = nrow(X);
Z = X[2:n, ];       /* delete first row */
ZpZ = Z`*Z;         /* reform the SSCP matrix */
ZpZinv = inv(ZpZ);  /* recompute the inverse */
print ZpZinv[c=varNames r=varNames L="Inverse of SSCP After Deleting First Row"];

The previous statements essentially repeat the entire least squares computation. To compute a leave-one-out statistic, you would perform a total of n similar computations.

In contrast, it is much cheaper to apply the Sherman-Morrison formula to update the inverse of the original SSCP. The following statements apply the Sherman-Morrison formula as it is written:

/* Alternative: Do not change X or recompute the inverse. 
   Use the Sherman-Morrison rank-1 update formula.–Morrison_formula */
r = X[1, ];          /* first row */
rpr = r`*r;          /* outer product */
/* apply Sherman-Morrison formula */
NewInv = XpXinv + XPXinv*rpr*XPXinv / (1 - r*XpXinv*r`);
print NewInv[c=varNames r=varNames L="Inverse from Sherman-Morrison Formula"];

These statements compute the new inverse by using the old inverse, an outer product, and a few matrix multiplications. Notice that the denominator of the Sherman-Morrison formula includes the expression r*(X`X)-1*r`, which is the leverage statistic for the i_th row.

The INVUPDT function in SAS/IML

Because it is important to be able to update an inverse matrix quickly when an observation is deleted (or added!), the SAS/IML language supports the IMVUPDT function, which implements the Sherman-Morrison formula. You merely specify the inverse matrix to update, the vector (as a column vector) to use for the rank-one update, and an optional scalar value, which is usually +1 if you are adding a new observation and -1 if you are deleting an observation. For example, the following statements are the easiest way to implement the Sherman-Morrison formula in SAS for a leave-one-out statistic:

NewInv2 = invupdt(XpXinv, r`, -1);
print NewInv2[c=varNames r=varNames L="Inverse from INVUPDT Function"];

The output is not displayed because the matrix NewInv2 is the same as the matrix NewInv in the previous section. The documentation includes additional examples.

The general Sherman-Morrison-Woodbury formula

The Sherman-Morrison formula shows how to perform a rank-1 update of an inverse matrix. There is a more general formula, called the Sherman-Morrison-Woodbury formula, which enables you to update an inverse for any rank-k modification of the original matrix. The general formula (Golub and van Loan, p. 51 of 2nd ed. or p. 65 of 4th ed.) shows how to find the matrix of a rank-k modification to a nonsingular matrix, A, in terms of the inverse of A. The general formula is
(A + U VT)-1 = A-1 – A-1 U (I + VT A-1 U) VT A-1
where U and V are p x k and all inverses are assumed to exist. When k = 1, the matrices U and V become vectors and the k x k identify matrix becomes the scalar value 1. In the previous section, U equals -xiT and V equals xiT.

The Sherman-Morrison-Woodbury formula is one of my favorite results in linear algebra. It shows that a rank-k modification of a matrix results in a rank-k modification of its inverse. It is not only a beautiful theoretical result, but it has practical applications to leave-one-out statistics because you can use the formula to quickly compute the linear regression model that results by dropping an observation from the data. In this way, you can study the influence of each observation on the model fit (Cook's D, DFBETAS,...) and perform leave-one-out cross-validation techniques, such as the PRESS statistic.

The post Leave-one-out statistics and a formula to update a matrix inverse appeared first on The DO Loop.

6月 102019

Recoding variables can be tedious, but it is often a necessary part of data analysis. Almost every SAS programmer has written a DATA step that uses IF-THEN/ELSE logic or the SELECT-WHEN statements to recode variables. Although creating a new variable is effective, it is also inefficient because you have to create a new data set that contains the new variable. For large data sets, this is wasteful: most of the data remain the same; only the recoded variables are different.

There is an alternative approach: You can use PROC FORMAT in Base SAS to define a custom SAS format. When you use PROC FORMAT, the data are never changed, but all the SAS reports and analyses can display the formatted values instead of the raw data values. You can use the same format for multiple data sets. You can even define multiple formats to analyze the same variable in multiple ways.

An example of using a format to recode a variable

In the simplest situation, a recoding of a variable converts each raw value to an easier-to-interpret value. For example, suppose that the gender variable for patients is recorded as a binary 0/1 variable. This is a terrible choice because it is not clear whether 0 represents males or females. The following example shows the typical IF-THEN logic for recoding 0 as "Female" and 1 as "Male" by creating a new data set and a new variable:

/* original data: Gender is binary variable, which is hard to understand! */
data Have;              
input Gender @@;
1 0 0 0 1 1 0 0 . 1 1 0 0 0 0 1 1 1 . 1 1 
/* Recode by using IF-THEN or SELECT-WHEN. This can be inefficient. */
data HaveRecode;
set Have;
/* use IF-THEN logic to recode gender */
length Gender_Recode $6;
if      Gender=0 then Gender_Recode = "Female";
else if Gender=1 then Gender_Recode = "Male";
else Gender_Recode = " ";
proc freq data=HaveRecode;
   tables Gender_Recode Gender;
Recode variables

The table for the Gender_Recode variable is shown. The data, which was originally coded as a binary indicator variable, has been duplicated by creating a character variable that contains the same information but is more understandable. Of course, now you have to use the new variable name to analyze the recoded data. If you have already written programs that refer to the Gender variable, you have to update the programs to use the new variable name. Yuck!

A more efficient choice is to use a custom-defined format. The beauty of using a format is that you do not have to change the data. Instead, you simply define a format that changes the way that the data are used and displayed in SAS procedures. (A data view is a third alternative, but formats have additional advantages.)

You can define the following format (called GenderFmt.), which displays the gender data as "Female" and "Male" without modifying the data set:

/* use a format to recode gender */
proc format;
value GenderFmt
      0 = "Female"
      1 = "Male" 
      other = " ";
/* apply the format to original data; no need to create new data set */
proc freq data=Have;
   format Gender GenderFmt.;    /* the name of the format includes a period */
   tables Gender;
Use PROC FORMAT to recode variables in SAS

Notice that the analysis is run on the original data and use the original variable name. No additional data sets, views, or variables are created.

Use a format to recode a character variable

Did you know that you can use PROC FORMAT to define formats for character variables? Formats for character variables are used less often than formats for numeric variables, but the syntax is similar. The main difference is that the name of a character format starts with the '$' symbol.

In addition to recoding the values of a categorical variable, formats are useful because they enable you to merge or combine categories by defining a many-to-one mapping. For example, the following character format recodes values of the TYPE variable and also combines the 'SUV' and 'Wagon' categories into a single category. Although it is not needed for this example, notice that the format also includes an 'Other' category, which can be used to combine small groups. The 'Other' category will also handle invalid data.

/* Create sample data from Sashelp.Cars. Exclude hybrids. Optionally sort the data */
proc sort^='Hybrid')) out=Cars; 
   by MPG_City; 
proc format;
value $CarTypeFmt
      'Sedan' = 'Family Car'
      'Sports' = 'Sports Car'
      'SUV','Wagon' = 'Big Car'
      'Truck' = 'Truck'
      Other = 'Other';
proc freq data=Cars;
   format Type $CarTypeFmt.;   /* the name the format includes a period at the end */
   tables Type;
Use PROC FORMAT to recode variables in SAS

Using a format enables you to analyze the original data (omit the FORMAT statement) or apply the format (include the FORMAT statement). You can even define multiple formats if you want to slice and dice the data in various ways.

Use a format to bin numeric variables

One of my favorite SAS tricks is to use a format to bin numeric variables into categories. In the following example, the MPG_City variable is used to group vehicles into four categories based on how fuel-efficient the vehicles are. You can use this format to perform any computation that requires a classification variable. The example shows a two-way frequency analysis of the two variables for which we have defined custom formats:

proc format;
value MPGFmt  
      low -<  15   = "Gas Guzzler"    /* < 15      */
       15 -<  20   = "Not Good"       /* [ 15, 20) */
       20 -<  25   = "Good"           /* [ 20, 25) */
       25 -   high = "Great";         /* > 25      */
proc freq data=Cars order=data;
   format MPG_City MPGFmt. Type $CarTypeFmt.;
   tables MPG_City * Type / nocol norow nopercent;
Use PROC FORMAT to recode variables in SAS

Store and retrieve formats

Formats are stored in a catalog, which is stored separately from the data. By default, SAS stores the formats in a catalog named WORK.FORMATS. Like everything else stored in WORK, that catalog will vanish when you end the SAS session. Therefore, you need to store the formats to a permanent libref if you want to reuse the formats across SAS session.

SAS supports several features that help you to maintain a permanent library of formats. Here are two facts about format catalogs:

  • You can use the LIBRARY= option on the PROC FORMAT statement to specify a libref in which to store the format catalog. By default, the catalog will be named FORMATS.
  • SAS maintains a list of librefs to search through to find formats. By default, it looks in WORK and a special libref named LIBRARY.

These facts imply that you can do two simple things to create a permanent library of formats. First, define a permanent libref named LIBRARY (the name is important!) that will contain your catalog of formats. Second, specify the LIBRARY=LIBRARY option when you define the format, as follows:

libname library "C:/MyFormats";   /* the libref 'LIBRARY' has special significance! */
proc format library=library;      /* adds format to the permanent catalog LIBRARY.FORMATS */
value $CarTypeFmt
      'Sedan' = 'Family Car'    'Sports' = 'Sports Car'
      'SUV','Wagon' = 'Big Car' 'Truck' = 'Truck'       Other = 'Other';

When you start a new SAS session, you will need to define the LIBRARY libref again if you want to access the formats. For convenience, many people put the LIBNAME statement in their file. Because SAS searches for formats in the LIBRARY.FORMATS catalog, SAS will automatically find the $CarTypeFmt. format.

SAS provides many other options for storing formats and for specifying the search locations for formats. For details, see the SAS usage note "How can I permanently store and use formats that I have created?" or John Ladd's 2012 paper, "Yes, We Can... Save SAS Formats."


In summary, if you need to recode data, custom-defined formats provide an easy alternative to physically changing the data. This article discusses five advantages to using formats to recode data:

  • The data do not change. You can use the original variable names in the analyses.
  • You can apply formats to both character and numerical variables.
  • You can use formats to merge categories and to bin numeric variables.
  • You apply a format to multiple variables in multiple data sets.
  • You can save formats in a permanent libref and use them across SAS sessions.

Do you maintain a library of SAS formats at your workplace? Leave a comment to share your experience and your best practices.

The post 5 reasons to use PROC FORMAT to recode variables in SAS appeared first on The DO Loop.

6月 032019

Statistical programmers and analysts often use two kinds of rectangular data sets, popularly known as wide data and long data. Some analytical procedures require that the data be in wide form; others require long form. (The "long format" is sometimes called "narrow" or "tall" data.) Fortunately, the statistical graphics procedures in SAS (notably, PROC SGPLOT) can usually accommodate either format. You can use multiple statements to create graphs of wide data. You can use a single statement and the GROUP= option to create graphs of long data.

Example: Overlay line plots for multiple response variables

Suppose you have four variables (with N observations) and you want to overlay line plots of three of the variables graphed against the fourth. There are two natural ways to arrange the 4*N data values. The first (and most natural) is a data set that has N rows and 4 variables (call them X, Y1, Y2, and Y3). This is the "wide form" of the data. The "long form" data set has three variables and 3*N rows, as shown to the right. The first column (VarName) specifies the name of the three response variables. The second column (X) indicates the value of the independent variable and the third column (Y) represents the value of the dependent variable that is specified in the VarName column. Some people will additionally sort the long data by the VarName variable, but that is not usually necessary. In general, if you want to stack k variables, the long form data will contain k*N observations.

PROC SGPLOT enables you to plot either set of data. For the wide data, you can use three SERIES statements to plot X vs Y1, X vs Y2, and X vs Y3, as follows. Notice that you can independently set the attributes of each line, such as color, symbol, line style. In the following program, the line thickness is set to the same value for all lines, but you could make that attribute vary, if you prefer.

data Wide;
input X Y1 Y2 Y3;
10 2 3 4
15 0 4 6
20 1 4 5
title "Wide Form: Use k Statements to Plot k Variables";
proc sgplot data=Wide;
   series x=X y=Y1 / markers lineattrs=(thickness=2);
   series x=X y=Y2 / markers lineattrs=(thickness=2);
   series x=X y=Y3 / markers lineattrs=(thickness=2);

You can use PROC TRANSPOSE or the SAS DATA step to convert the data from wide form to long form. When the data are in the long format, you use a single SERIES statement and the GROUP=VarName option to plot the three groups of lines. In addition, you can set the attributes for all the lines by using a single statement.

/* convert data from Wide to Long form */
data Long;
set Wide;
VarName='Y1'; Value=Y1; output;
VarName='Y2'; Value=Y2; output;
VarName='Y3'; Value=Y3; output;
drop Y1-Y3;
title "Long Form: Use GROUP= Option to Plot k Variables";
proc sgplot data=Long;
   series x=X y=Value / group=VarName markers lineattrs=(thickness=2);

Advantages and disadvantages of wide and long formats

The two formats contain the same information, but sometimes one form is more convenient than the other. Here are a few reasons to consider wide-form and long-form data:

Use the wide form when...

  • You want to run a fixed-effect regression analysis. Many SAS procedures require data to be in wide form, including ANOVA, REG, GLM, LOGISTIC, and GENMOD.
  • You want to run a multivariate analysis. Multivariate analyses include principal components (PRINCOMP), clustering (FASTCLUS), discriminant analysis (DISCRIM), and most matrix-based computations (PROC IML).
  • You want to create a plot that overlays graphs of several distinct variables. With wide data, you can easily and independently control the attributes of each overlay.

Use the long form when...

  • You want to run a mixed model regression analysis for repeated measurements. PROC MIXED and GLIMMIX require the long format. In general, the long format is useful for many kinds of longitudinal analysis, where the same subject is measured at multiple time points.
  • The measurements were taken at different values of the X variable. For example, in the previous section, the wide format is applicable because Y1, Y2, and Y3 were all measured at the same three values of X. However, the long form enables Y1 to be measured at different values than Y2. In fact, Y1 could be measured at only three points whereas Y2 could be measured at more points.

The last bullet point is important. The long form is more flexible, so you can use it to plot quantities that are measured at different times or positions. Line plots of this type are sometimes called spaghetti plots. The following DATA step defines long-form data for which the response variables are measured at different values of X:

data Long2;
infile datalines truncover;
length VarName $2;
input VarName X Value @;
do until( Value=. );
   output;   input X Value @;
Y1 10 2 15 0 20 1
Y2 10 3 12 4 13 5 16 4 17 3 18 3 20 4
Y3 9 3 11 4 14 6 18 4 19 5
title "Long Form: Different Number of Measurements per Subject";
proc sgplot data=Long2;
   series x=X y=Value / group=VarName markers lineattrs=(thickness=2);
   xaxis grid; yaxis grid;

In summary, you can use PROC SGPLOT to create graphs regardless of whether the data are in wide form or long form. I've presented a few common situations in which you might want to use each kind of data representation. Can you think of other situations in which the long format is preferable to the wide format? Or vice versa? Let me know by leaving a comment.

The post Graph wide data and long data in SAS appeared first on The DO Loop.

5月 282019

Modern statistical software provides many options for computing robust statistics. For example, SAS can compute robust univariate statistics by using PROC UNIVARIATE, robust linear regression by using PROC ROBUSTREG, and robust multivariate statistics such as robust principal component analysis. Much of the research on robust regression was conducted in the 1970s, so I was surprised to learn that a robust version of simple (one variable) linear regression was developed way back in 1950! This early robust regression method uses many of the same techniques that are found in today's "modern" robust regression methods. This article describes and implements a robust estimator for simple linear regression that was developed by Theil (1950) and extended by Sen (1968).

The Theil-Sen robust estimator

I had not heard of the Theil-Sen robust regression method until recently, perhaps because it applies only to one-variable regression. The Wikipedia article about the Theil-Sen estimator states that the method is "the most popular nonparametric technique for estimating a linear trend" in the applied sciences due to its "simplicity in computation, ... robustness to outliers," and "[limited assumptions]regarding measurement errors."

The idea behind the estimator is simple. If the data contains N pairs of (x, y) values, compute all the slopes between pairs of points and choose the median as the estimate of the regression slope. Using that slope, pass a line through each pair of (x,y) values to obtain N intercepts. Choose the median of the intercepts as the estimate of the regression intercept.

That's it. You compute "N choose 2" (which is N*(N-1)/2) slopes and take the median. Then compute N intercepts and take the median. The slope estimate is unbiased and the process is resistant to outliers.

The adjacent scatter plot shows the Theil-Sen regression line for nine data points. The seven data points that appear to fall along the regression line were used by Sen (1968). I added two outliers. The plot shows that the Theil-Sen regression line ignores the outliers and passes close to the other data points. The slope of the Theil-Sen line is slightly less than 4. In contrast, the least squares line through these data has a slope of only 2.4 because of the influence of the two outliers.

Implement the Theil-Sen estimator in SAS

You can easily implement Theil-Sen regression in SAS/IML, as follows:

  1. Use the ALLCOMB function to generate all pairs of the values {1, 2, ..., N}. Or, for large N, use the RANDCOMB function to sample pairs of values.
  2. Use subscript operations to extract the pairs of points. Compute all slopes between the pairs of points.
  3. Use the MEDIAN function to compute the median slope and the median intercept.

The following SAS/IML program implements this algorithm. The program is very compact (six statements) because it is vectorized. There are no explicit loops.

proc iml;
XY = {1  9,  2 15,  3 19, 4 20, 10 45,  12 55, 18 78, /* 7 data points used by Sen (1968) */
     12.5 30,  4.5 50};                               /* 2 outliers (not considered by Sen) */
/* Theil uses all "N choose 2" combinations of slopes of segments.
   Assume that the first coordinates (X) are distinct */
c = allcomb(nrow(XY), 2);         /* all "N choose 2" combinations of pairs */
Pt1 = XY[c[,1],];                 /* extract first point of line segments */
Pt2 = XY[c[,2],];                 /* extract second point of line segments */
slope = (Pt1[,2] - Pt2[,2]) / (Pt1[,1] - Pt2[,1]); /* Careful! Assumes x1 ^= x2 */
m = median(slope);
b = median( XY[,2] - m*XY[,1] );  /* median(y-mx) */
print (b||m)[c={'Intercept' 'Slope'} L="Method=Theil Combs=All"];

As stated earlier, the Theil-Sen estimate has a slope of 3.97. That value is the median of the slopes among the 36 line segments that connect pairs of points. The following graphs display the 36 line segments between pairs of points and a histogram of the distribution of the slopes. The histogram shows that the value 3.97 is the median value of the distribution of slopes.

Handling repeated X values: Sen's extension

The observant reader might object that the slopes of the line segments will be undefined if any of the data have identical X values. One way to deal with that situation is to replace the undefined slopes by large positive or negative values, depending on the sign of the difference between the Y values. Since the median is a robust estimator, adding a few high and low values will not affect the computation of the median slope. Alternatively, Sen (1968) proved that you can omit the pairs that have identical X values and still obtain an unbiased estimate. In the following SAS/IML program, I modified the X values of the two outliers so that only seven of the nine X values are unique. The LOC function finds all pairs that have different X values, and only those pairs are used to compute the robust regression estimates.

/* Sen (1968) handles repeated X coords by using only pairs with distinct X */
XY = {1  9,  2 15,  3 19, 4 20, 10 45,  12 55, 18 78,
     12 30,  4 50};  /* last two obs are repeated X values */
c = allcomb(nrow(XY), 2);        /* all "N choose 2" combinations of pairs */
Pt1 = XY[c[,1],];                /* first point of line segments */
Pt2 = XY[c[,2],];                /* second point of line segments */
idx = loc(Pt1[,1]-Pt2[,1]^=0);   /* find pairs with same X value */
Pt1 = Pt1[idx,];                 /* keep only pairs with different X values */
Pt2 = Pt2[idx,];
slope = (Pt1[,2] - Pt2[,2]) / (Pt1[,1] - Pt2[,1]);
m = median(slope);
b = median( XY[,2] - m*XY[,1] );  /* median(y-mx) */
print (b||m)[c={'Intercept' 'Slope'} L="Method=Sen Combs=All"];

A function to compute the Theil-Sen estimator

The following function defines a SAS/IML function that implements the Theil-Sen regression estimator. I added two options. You can use the METHOD argument to specify how to handle pairs of points that have the same X values. You can use the NUMPAIRS option to specify whether to use the slopes of all pairs of points or whether to use the slopes of K randomly generated pairs of points.

proc iml;
/* Return (intercept, slope) for Theil-Sen robust estimate of a regression line.
   XY is N x 2 matrix. The other arguments are:
      If method="Theil" and a pair of points have the same X coordinate, 
         assign a large positive value instead of +Infinity and a large negative 
         value instead of -Infinity. 
      If method="Sen", omit any pairs of points that have the same first coordinate. 
      If numPairs="All", generate all "N choose 2" combinations of the N points.
      If numPairs=K (positive integer), generate K random pairs of points. 
start TheilSenEst(XY, method="SEN", numPairs="ALL");
   Infinity = 1e99;             /* big value for slope instead of +/- infinity */
   if type(numPairs)='N' then
      c = rancomb(nrow(XY), 2, numPairs);  /* random combinations of pairs */
   else if upcase(numPairs)="ALL" then 
      c = allcomb(nrow(XY), 2);            /* all "N choose 2" combinations of pairs */
   else stop "ERROR: The numPairs option must be 'ALL' or a postive integer";
   Pt1 = XY[c[,1],];                       /* first points for slopes */
   Pt2 = XY[c[,2],];                       /* second points for slopes */
   dy = Pt1[,2] - Pt2[,2];                 /* change in Y */
   dx = Pt1[,1] - Pt2[,1];                 /* change in X */ 
   idx = loc( dx ^= 0 );  
   if upcase(method) = "SEN" then do;      /* exclude pairs with same X value */
      slope = dy[idx] / dx[idx];           /* slopes of line segments */
   else do;                        /* assign big slopes for pairs with same X value */
      slope = j(nrow(Pt1), 1, .);  /* if slope calculation is 0/0, assign missing */
      /* Case 1: x1 ^= x2. Do the usual slope computation */
      slope[idx] = dy[idx] / dx[idx];
      /* Case 2: x1 = x2. Assign +Infinity if sign(y1-y2) > 0, else assign -Infinity */
      jdx = loc( dx = 0 & sign(dy)>0 );
      if ncol(jdx)>0 then 
         slope[jdx] = Infinity;
      jdx = loc( dx = 0 & sign(dy)<0 );
      if ncol(jdx)>0 then 
         slope[jdx] = -Infinity;
   m = median(slope);
   b = median( XY[,2] - m*XY[,1] );  /* median(y-mx) */
   return( b || m );
/* Test all four calls */
XY = {1  9,  2 15,  3 19, 4 20, 10 45,  12 55, 18 78,
     18 30,  4 50};  /* last two obs are outliers not considered by Sen */
est = TheilSenEst(XY, "Theil", "All");
print est[c={'Intercept' 'Slope'} L="Method=Theil; Pairs=All"];
est = TheilSenEst(XY, "Sen", "All");
print est[c={'Intercept' 'Slope'} L="Method=Sen; Pairs=All"];
call randseed(123, 1);
est = TheilSenEst(XY, "Theil", 200);
print est[c={'Intercept' 'Slope'} L="Method=Theil; Pairs=200"];
call randseed(123, 1);
est = TheilSenEst(XY, "Sen", 200);
print est[c={'Intercept' 'Slope'} L="Method=Sen; Pairs=200"];

For these data, the estimates are the same whether you exclude pairs of points that have identical X coordinates or whether you replace the undefined slopes with large values. For this small data set, there is no reason to use the randomly chosen pairs of points, but that syntax is shown for completeness. Of course, if you run the analysis with a different random number seed, you will get a different estimate.


You can download the SAS program that creates the analysis and graphs in this article.

Although Theil published the main ideas for this method in 1950, it contains many of the features of modern robust statistical estimates. Specifically, a theme in modern robust statistics is to exhaustively or randomly choose many small subsets of the data. You compute a (classical) estimate on each subset and then use the many estimates to obtain a robust estimate. I did not realize that Theil had introduced these basic ideas almost seventy years ago!

Theil and Sen also included confidence intervals for the estimates, but I have not included them in this brief article.


The post The Theil-Sen robust estimator for simple linear regression appeared first on The DO Loop.

5月 152019

Have you ever run a statistical test to determine whether data are normally distributed? If so, you have probably used Kolmogorov's D statistic. Kolmogorov's D statistic (also called the Kolmogorov-Smirnov statistic) enables you to test whether the empirical distribution of data is different than a reference distribution. The reference distribution can be a probability distribution or the empirical distribution of a second sample. Most statistical software packages provide built-in methods for computing the D statistic and using it in hypothesis tests. This article discusses the geometric meaning of the Kolmogorov D statistic, how you can compute it in SAS, and how you can compute it "by hand."

This article focuses on the case where the reference distribution is a continuous probability distribution, such as a normal, lognormal, or gamma distribution. The D statistic can help you determine whether a sample of data appears to be from the reference distribution. Throughout this article, the word "distribution" refers to the cumulative distribution.

What is the Kolmogorov D statistic?

The geometry of Kolmogorov's D statistic, which measures the maximum vertical distance between an emirical distribution and a reference distribution.

The letter "D" stands for "distance." Geometrically, D measures the maximum vertical distance between the empirical cumulative distribution function (ECDF) of the sample and the cumulative distribution function (CDF) of the reference distribution. As shown in the adjacent image, you can split the computation of D into two parts:

  • When the reference distribution is above the ECDF, compute the statistic D as the maximal difference between the reference distribution and the ECDF. Because the reference distribution is monotonically increasing and because the empirical distribution is a piecewise-constant step function that changes at the data values, the maximal value of D will always occur at a right-hand endpoint of the ECDF.
  • When the reference distribution is below the ECDF, compute the statistic D+ as the maximal difference between the ECDF and the reference distribution. The maximal value of D+ will always occur at a left-hand endpoint of the ECDF.

The D statistic is simply the maximum of D and D+.

You can compare the statistic D to critical values of the D distribution, which appear in tables. If the statistic is greater than the critical value, you reject the null hypothesis that the sample came from the reference distribution.

How to compute Kolmogorov's D statistic in SAS?

In SAS, you can use the HISTOGRAM statement in PROC UNIVARIATE to compute Kolmogorov's D statistic for many continuous reference distributions, such as the normal, beta, or gamma distributions. For example, the following SAS statements simulate 30 observations from a N(59, 5) distribution and compute the D statistic as the maximal distance between the ECDF of the data and the CDF of the N(59, 5) distribution:

/* parameters of reference distribution: F = cdf("Normal", x, &mu, &sigma) */
%let mu    = 59;
%let sigma =  5;
%let N     = 30;
/* simulate a random sample of size N from the reference distribution */
data Test;
call streaminit(73);
do i = 1 to &N;
   x = rand("Normal", &mu, &sigma);
drop i;
proc univariate data=Test;
   ods select Moments CDFPlot GoodnessOfFit;
   histogram x / normal(mu=&mu sigma=&sigma) vscale=proportion;  /* compute Kolmogov D statistic (and others) */
   cdfplot x / normal(mu=&mu sigma=&sigma) vscale=proportion;    /* plot ECDF and reference distribution */
   ods output CDFPlot=ECDF(drop=VarName CDFPlot);                /* for later use, save values of ECDF */

The computation shows that D = 0.131 for this simulated data. The plot at the top of this article shows that the maximum occurs for a datum that has the value 54.75. For that observation, the ECDF is farther away from the normal CDF than at any other location.

The critical value of the D distribution when N=30 and α=0.05 is 0.2417. Since D < 0.2417, you should not reject the null hypothesis. It is reasonable to conclude that the sample comes from the N(59, 5) distribution.

Although the computation is not discussed in this article, you can use PROC NPAR1WAY to compute the statistic when you have two samples and want to determine if they are from the same distribution. In this two-sample test, the geometry is the same, but the computation is slightly different because the reference distribution is itself an ECDF, which is a step function.

Compute Kolmogorov's D statistic manually

Although PROC UNIVARIATE in SAS computes the D statistic (and other goodness-of-fit statistics) automatically, it is not difficult to compute the statistic from first principles in a vector language such as SAS/IML.

The key is to recall that the ECDF is a piecewise constant function that changes heights at the value of the observations. If you sort the data, the height at the i_th sorted observation is i / N, where N is the sample size. The height of the ECDF an infinitesimal distance before the i_th sorted observation is (i – 1)/ N. These facts enable you to compute D and D+ efficiently.

The following algorithm computes the D statistic:

  1. Sort the data in increasing order.
  2. Evaluate the reference distribution at the data values.
  3. Compute the pairwise differences between the reference distribution and the ECDF.
  4. Compute D as the maximum value of the pairwise differences.

The following SAS/IML program implements the algorithm:

/* Compute the Kolmogorov D statistic manually */
proc iml;
use Test;  read all var "x";  close;
N = nrow(x);                         /* sample size */
call sort(x);                        /* sort data */
F = cdf("Normal", x, &mu, &sigma);   /* CDF of reference distribution */
i = T( 1:N );                        /* ranks */
Dminus = F - (i-1)/N;
Dplus = i/N - F;
D = max(Dminus, Dplus);              /* Kolmogorov's D statistic */
print D;

The D statistic matches the computation from PROC UNIVARIATE.

The SAS/IML implementation is very compact because it is vectorized. By computing the statistic "by hand," you can perform additional computations. For example, you can find the observation for which the ECDF and the reference distribution are farthest away. The following statements find the index of the maximum for the Dminus and Dplus vectors. You can use that information to find the value of the observation at which the maximum occurs, as well as the heights of the ECDF and reference distribution. You can use these values to create the plot at the top of this article, which shows the geometry of the Kolmogorov D statistic:

/* compute locations of the maximum D+ and D-, then plot with a highlow plot */
i1 = Dminus[<:>];  i2 = Dplus [<:>];              /* indices of maxima */
/* Compute location of max, value of max, upper and lower curve values */
x1=x[i1];  v1=DMinus[i1];  Low1=(i[i1]-1)/N;  High1=F[i1]; 
x2=x[i2];  v2=Dplus [i2];  Low2=F[i2];        High2=i[i2]/N; 
Result = (x1 || v1 || Low1 || High1) // (x2 || v2 || Low2 || High2);
print Result[c={'x' 'Value' 'Low' 'High'} r={'D-' 'D+'} L='Kolmogorov D'];

The observations that maximize the D and D+ statistics are x=54.75 and x=61.86, respectively. The value of D is the larger value, so that is the value of Kolmogorov's D.

For completeness, the following statement show how to create the graph at the top of this article. The HIGHLOW statement in PROC SGPLOT is used to plot the vertical line segments that represent the D and D+ statistics.

create KolmogorovD var {x x1 Low1 High1 x2 Low2 High2}; append; close;
data ALL;
   set KolmogorovD ECDF;   /* combine the ECDF, reference curve, and the D- and D+ line segments */
title "Kolmogorov's D Statistic";
proc sgplot data=All;
   label CDFy = "Reference CDF" ECDFy="ECDF";
   xaxis grid label="x";
   yaxis grid label="Cumulative Probability" offsetmin=0.08;
   fringe x;
   series x=CDFx Y=CDFy / lineattrs=GraphReference(thickness=2) name="F";
   step x=ECDFx Y=ECDFy / lineattrs=GraphData1 name="ECDF";
   highlow x=x1 low=Low1 high=High1 / 
           lineattrs=GraphData2(thickness=4) name="Dm" legendlabel="D-";
   highlow x=x2 low=Low2 high=High2 / 
           lineattrs=GraphData3(thickness=4) name="Dp" legendlabel="D+";
   keylegend "F" "ECDF" "Dm" "Dp";

Concluding thoughts

Why would anyone want to compute the D statistic by hand if PROC UNIVARIATE can compute it automatically? There are several reasons:

  • Understanding: When you compute a statistic manually, you are forced to understand what the statistic means and how it works.
  • Efficiency: When you run PROC UNIVARIATE, it computes many statistics (mean, standard deviation, skewness, kurtosis,...), many goodness-of-fit tests (Kolmogorov-Smirnov, Anderson-Darling, and Cramér–von Mises), a histogram, and more. That's a lot of work! If you are interested only in the D statistic, why needlessly compute the others?
  • Generalizability: You can compute quantities such as the location of the maximum value of D and D+. You can use the knowledge and experience you've gained to compute related statistics.

The post What is Kolmogorov's D statistic? appeared first on The DO Loop.