Rick Wicklin

June 19, 2019
 

A previous article describes the DFBETAS statistics for detecting influential observations, where "influential" means that if you delete the observation and refit the model, the estimates for the regression coefficients change substantially. Of course, there are other statistics that you could use to measure influence. Two popular ones are the DFFITS statistic and Cook's distance, which is also known as Cook's D statistic. Both statistics measure the change in predicted values that occurs when you delete an observation and refit the model. This article describes the DFFITS and Cook's D statistics and shows how to compute and graph them in SAS.

DFFITS: How the predicted value changes if an observation is excluded

If you exclude an observation from a model and refit, the predicted values will change. The DFFITS statistic is a measure of how the predicted value at the i_th observation changes when the i_th observation is deleted. High-leverage points tend to pull the regression surface towards the response at that point, so the change in the predicted value at that point is a good indication of how influential the observation is. So that the DFFITS values are independent of the scale of the data, the change in predicted values is scaled by dividing by the standard error of the predicted value at that point. The exact formula is given in the documentation for PROC REG. The book Regression Diagnostics by Belsley, Kuh, and Welsch (1980) suggests that an observation is influential if the magnitude of its DFFITS value exceeds 2*sqrt(p/n), where p is the number of effects in the model and n is the sample size.

PROC REG provides three ways to generate the DFFITS statistics for each observation:

  • You can create a graph of the DFFITS statistics by using the PLOTS=DFFITS option.
  • You can also display a table of the DFFITS (and other influence statistics) by using the INFLUENCE option in the MODEL statement.
  • You can write the DFFITS statistics to a data set by using the DFFITS= option in the OUTPUT statement.

The following DATA step extracts a subset of n = 84 vehicles from the Sashelp.Cars data, creates a short ID variable for labeling observations, and sorts the data by the response variable, MPG_City. The data are sorted because the DFFITS statistic is graphed against the observation number, which is an arbitrary quantity. By sorting the data, you know that small observation numbers correspond to low values of the response and so forth. If you have a short ID variable, you can label the influential observations by using the LABEL suboption, as follows:

/* Create sample data */
data cars;
set sashelp.cars;
where Type in ('SUV', 'Truck');
/* make short ID label from Make and Model values */
length IDMakeMod $20;
IDMakeMod = cats(substr(Make,1,4), ":", substr(Model,1,5));
run;
 
/* Optional but helpful: Sort by response variable */
proc sort data=cars;
   by MPG_City;
run;
 
proc reg data=Cars plots(only) = DFFITS(label); 
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
run; quit;

The DFFITS graph shows that three observations have a large positive DFFITS value. The observations are the Ford Excursion, the Ford Ranger, and the Mazda B2300. For these observations, the predicted value (at the observation) is higher with the observation included in the model than if it were excluded. Thus, these observations "pull the regression up." There are four observations that have large negative DFFITS, which means that these observations "pull the regression down." They include the Land Rover Discovery and the Volvo XC90.

Cook's D: A distance measure for the change in regression estimates

When you estimate a vector of regression coefficients, there is uncertainty. The confidence region for the parameter estimates is an ellipsoid in k-dimensional space, where k is the number of effects that you are estimating (including the intercept). Cook (1977) defined a distance that measures how far the estimates move within the confidence ellipsoid when the i_th observation is deleted. Equivalently, Cook showed that the statistic is proportional to the squared studentized residual for the i_th observation. The documentation for PROC REG provides a formula in terms of the studentized residuals.
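For reference, one standard way to write the statistic (a sketch of the formula; the PROC REG documentation gives the precise definition and notation) is D_i = (1/p) * r_i^2 * h_i / (1 – h_i), where r_i is the (internally) studentized residual for the i_th observation, h_i is its leverage, and p is the number of effects in the model, including the intercept. The formula shows that an observation has a large Cook's D when it has a large studentized residual, a large leverage, or both.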

By default, PROC REG creates a plot of Cook's D statistic as part of the panel of diagnostic plots. (Cook's D is in the second row and third column.) You can create a larger stand-alone plot by using the PLOTS=COOKSD option. Optionally, you can label the influential points (those whose Cook's D statistic exceeds 4/n) by using the LABEL suboption, as shown below:
/* create multiple plots and label influential points */
proc reg data=Cars plots(only) = (CooksD(label) DFFits(label));   
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
   output out=RegOut pred=Pred rstudent=RStudent dffits=DFFits cookd=CooksD; /* optional: output statistics */
run; quit;
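If you also want a table of the most influential observations, you can post-process the RegOut data set that the OUTPUT statement created. The following DATA step is a sketch that applies the rules of thumb from this article (for this model, there are p = 4 effects, including the intercept, and the NOBS= option supplies n):

/* Sketch: flag observations that exceed the DFFITS or Cook's D cutoffs.
   Assumes the RegOut data set was created by the OUTPUT statement above. */
data InfluentialFit;
set RegOut nobs=n;
p = 4;                            /* number of effects in the model, including intercept */
DFFitsCutoff = 2*sqrt(p/n);       /* Belsley, Kuh, and Welsch rule of thumb */
CooksDCutoff = 4/n;               /* conventional cutoff for Cook's D */
if abs(DFFits) > DFFitsCutoff or CooksD > CooksDCutoff;   /* keep only influential obs */
run;
 
proc print data=InfluentialFit noobs;
   var IDMakeMod MPG_City Pred DFFits CooksD;
run;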

In many ways, the plot of Cook's D looks similar to a plot of the squared DFFITS statistics. Both measure a change in the predicted value at the i_th observation when the i_th observation is excluded from the analysis. The formula for Cook's D statistic squares a residual-like quantity, so it does not show the direction of the change, whereas the DFFITS statistics do show the direction. Otherwise, the observations that are "very influential" are often the same for both statistics, as seen in this example.
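In fact, a standard identity relates the two statistics: D_i = (DFFITS_i)^2 * (s_(i)/s)^2 / p, where s is the root mean squared error for the full model, s_(i) is the same quantity when the i_th observation is deleted, and p is the number of effects. Because the ratio s_(i)/s is close to 1 for most observations, a plot of Cook's D usually looks like a scaled plot of the squared DFFITS values.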

The post Influential observations in a linear regression model: The DFFITS and Cook's D statistics appeared first on The DO Loop.

June 17, 2019
 

My article about deletion diagnostics investigated how influential an observation is to a least squares regression model. In other words, if you delete the i_th observation and refit the model, what happens to the statistics for the model? SAS regression procedures provide many tables and graphs that enable you to examine the influence of deleting an observation. For example:

  • The DFBETAS are statistics that indicate the effect that deleting each observation has on the estimates for the regression coefficients.
  • The DFFITS and Cook's D statistics indicate the effect that deleting each observation has on the predicted values of the model.
  • The COVRATIO statistics indicate the effect that deleting each observation has on the variance-covariance matrix of the estimates.

These observation-wise statistics are typically used for smaller data sets (n ≤ 1000) because the influence of any single observation diminishes as the sample size increases. You can get a table of these (and other) deletion diagnostics by using the INFLUENCE option on the MODEL statement of PROC REG in SAS. However, because there is one statistic per observation, these statistics are usually graphed. PROC REG can automatically generate needle plots of these statistics (with heuristic cutoff values) by using the PLOTS= option on the PROC REG statement.

This article describes the DFBETAS statistic and shows how to create graphs of the DFBETAS in PROC REG in SAS. The next article discusses the DFFITS and Cook's D statistics. The COVRATIO statistic is not as popular, so I won't say more about that statistic.

DFBETAS: How the coefficient estimates change if an observation is excluded

The documentation for PROC REG has a section that describes the influence statistics, which is based on the book Regression Diagnostics by Belsley, Kuh, and Welsch (1980, p. 13-14). Among these, the DFBETAS statistics are perhaps the easiest to understand. If you exclude an observation from the data and refit the model, you will get new parameter estimates. How much do the estimates change? Notice that you get one statistic for each combination of observation and regressor (including the intercept). Thus if you have n observations and k regressors, you get nk statistics.
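For reference, the definition (a sketch; see Belsley, Kuh, and Welsch or the PROC REG documentation for the precise notation) is DFBETAS_ij = (b_j – b_j(i)) / ( s_(i) * sqrt( (X`X)^(-1)_jj ) ), where b_j is the estimate of the j_th coefficient from the full data, b_j(i) is the estimate when you delete the i_th observation, s_(i) is the root mean squared error without the i_th observation, and (X`X)^(-1)_jj is the j_th diagonal element of the inverse crossproducts matrix. The denominator is what makes the statistic "scaled," a point that is discussed at the end of this article.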

Typically, these statistics are shown in a panel of k plots, with the DFBETAS for each regressor plotted against the observation number. Because "observation number" is an arbitrary number, I like to sort the data by the response variable. Then I know that the small observation numbers correspond to low values of the response variable and large observation numbers correspond to high values of the response variable. The following DATA step extracts a subset of n = 84 vehicles from the Sashelp.Cars data, creates a short ID variable for labeling observations, and sorts the data by the response variable, MPG_City:

data cars;
set sashelp.cars;
where Type in ('SUV', 'Truck');
/* make short ID label from Make and Model values */
length IDMakeMod $20;
IDMakeMod = cats(substr(Make,1,4), ":", substr(Model,1,5));
run;
 
proc sort data=cars;
   by MPG_City;
run;
 
proc print data=cars(obs=5) noobs;
   var Make Model IDMakeMod MPG_City;
run;

The first few observations are shown. Notice that the first observations correspond to small values of the MPG_City variable. Notice also that a short label (IDMakeMod) identifies each vehicle.

There are two ways to generate the DFBETAS statistics: You can use the INFLUENCE option on the MODEL statement to generate a table of statistics, or you can use the PLOTS=DFBETAS option in the PROC REG statement to generate a panel of graphs. The following call to PROC REG generates a panel of graphs. The IMAGEMAP=ON option on the ODS GRAPHICS statement enables you to hover the mouse pointer over an observation and obtain a brief description of the observation:

ods graphics on / imagemap=on;              /* enable data tips (tooltips) */
proc reg data=Cars plots(only) = DFBetas; 
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
run; quit;
ods graphics / imagemap=off;

The panel shows the influence of each observation on the estimates of the four regression coefficients. The statistics are standardized so that all graphs can use the same vertical scale. Horizontal lines are drawn at ±2/sqrt(n) ≈ 0.22. Observations are called influential if they have a DFBETA statistic that exceeds that value. The graph shows a tool tip for one of the observations in the EngineSize graph, which shows that the influential point is observation 4, the Land Rover Discovery.

Each graph reveals a few influential observations:

  • For the intercept estimate, the most influential observations are numbers 1, 35, 83, and 84.
  • For the EngineSize estimates, the most influential observations are numbers 4, 35, and 38.
  • For the Horsepower estimates, the most influential observations are numbers 1, 4, and 38.
  • For the Weight estimates, the most influential observations are numbers 1, 24, 35, and 38.

Notice that several observations (such as 1, 35, and 38) are influential for more than one estimate. Excluding those observations causes several parameter estimates to change substantially.

Labeling the influential observations

For me, the panel of graphs is too small. I found it difficult to hover the mouse pointer exactly over the tip of a needle in an attempt to discover the observation number and name of the vehicle. Fortunately, if you want details like that, PROC REG supplies options that make the process easier. If you don't have too many observations, you can add labels to the DFBETAS plots by using the LABEL suboption. To plot each graph individually (instead of in a panel), use the UNPACK suboption, as follows:

proc reg data=Cars plots(only) = DFBetas(label unpack); 
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
run; quit;

The REG procedure creates four plots, but only the graph for the Weight variable is shown here. In this graph, the influential observations are labeled by the IDMakeMod variable, which enables you to identify vehicles rather than observation numbers. For example, some of the influential observations for the Weight variable are the Ford Excursion (1), the Toyota Tundra (24), the Mazda B2300 (35), and the Volvo XC90 (38).

A table of influential observations

If you want a table that displays the most influential observations, you can use the INFLUENCE option to generate the OutputStatistics table, which contains the DFBETAS for all regressors. You can write that table to a SAS data set and exclude any that do not have a large DFBETAS statistic, where "large" means the magnitude of the statistic exceeds 2/sqrt(n), where n is the sample size. The following DATA step filters the observations and prints only the influential ones.

ods exclude all;
proc reg data=Cars plots=NONE; 
   model MPG_City = EngineSize HorsePower Weight / influence;
   id IDMakeMod;
   ods output OutputStatistics=OutputStats;      /* save influence statistics */
run; quit;
ods exclude none;
 
data Influential;
set OutputStats nobs=n;
array DFB[*] DFB_:;
cutoff = 2 / sqrt(n);
ObsNum = _N_;
influential = 0;
DFBInd = '0000';                   /* binary string indicator */
do i = 1 to dim(DFB);
   if abs(DFB[i])>cutoff then do;  /* this obs is influential for i_th regressor */
      substr(DFBInd,i,1) = '1';
      influential = 1;
   end;
end;
if influential;                    /* output only influential obs */
run;
 
proc print data=Influential noobs;
   var ObsNum IDMakeMod DFBInd cutoff DFB_:;
run;

The DFBInd variable is a four-character binary string that indicates which parameter estimates are influenced by each observation. Some observations are influential only for one coefficient; others (1, 3, 35, and 38) are influential for many variables. Creating a binary string for each observation is a useful trick.

By the way, did you notice that the name of the statistic ("DFBETAS") has a large S at the end? Until I researched this article, I assumed it was to make the word plural since there is more than one "DFBETA" statistic. But, no, it turns out that the S stands for "scaled." You can define the DFBETA statistic (without the S) to be the change in the parameter estimates, b – b(i), when you delete the i_th observation, but that statistic depends on the scale of the variables. To standardize the statistic, divide by the standard error of the parameter estimates. That scaling is the reason for the S at the end of DFBETAS. The same is true for the DFFITS statistic: the S stands for "scaled."

The next article describes how to create similar graphs for the DFFITS and Cook's D statistics.

---------------

DFFITS: How the predicted values change if an observation is excluded

The DFFITS statistic measures, for each observation, how the predicted value at that observation changes if you exclude the observation and refit the model.

Cook's D: How all the predicted values change if an observation is excluded

Cook's distance (D) statistic measures, for each observation, the scaled sum of the squared differences in the predicted values (summed over all observations) if you exclude the observation and refit the model.

The post Influential observations in a linear regression model: The DFBETAS statistics appeared first on The DO Loop.

June 12, 2019
 

For linear regression models, there is a class of statistics that I call deletion diagnostics or leave-one-out statistics. These observation-wise statistics address the question, "If I delete the i_th observation and refit the model, what happens to the statistics for the model?" For example:

  • The PRESS statistic is similar to the residual sum of squares statistic but is based on fitting n different models, where n is the sample size and the i_th model excludes the i_th observation.
  • Cook's D statistic measures the influence of the i_th observation on the fit.
  • The DFBETAS statistics measure how the regression estimates change if you delete the i_th observation.

Although most references define these statistics in terms of deleting an observation and refitting the model, you can use a mathematical trick to compute the statistics without ever refitting the model! For example, the Wikipedia page on the PRESS statistic states, "each observation in turn is removed and the model is refitted using the remaining observations. The out-of-sample predicted value is calculated for the omitted observation in each case, and the PRESS statistic is calculated as the sum of the squares of all the resulting prediction errors." Although this paragraph is conceptually correct, the SAS/STAT documentation for PROC GLMSELECT states that the PRESS statistic "can be efficiently obtained without refitting the model n times."

A rank-1 update to the inverse of a matrix

Recall that you can use the "normal equations" to obtain the least squares estimate for the regression problem with design matrix X and observed responses Y. The solution of the normal equations is b = (X`X)^(-1)(X`Y), where X`X is known as the sum of squares and crossproducts (SSCP) matrix and b is the least squares estimate of the regression coefficients. For data sets with many observations (very large n), the process of reading the data and forming the SSCP is a relatively expensive part of fitting a regression model. Therefore, if you want the PRESS statistic, it is better to avoid rebuilding the SSCP matrix and computing its inverse n times. Fortunately, there is a beautiful result in linear algebra that relates the inverse of the full SSCP matrix to the inverse when a row of X is deleted. The result is known as the Sherman-Morrison formula for rank-1 updates.

The key insight is that one way to compute the SSCP matrix is as a sum of outer products of the rows of X. Therefore, if xi is the i_th row of X, the SSCP matrix for the data where xi is excluded is equal to X`X – xi`xi. You have to invert this matrix to find the least squares estimates after excluding xi.
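As a quick sanity check of that identity, the following short SAS/IML program is a sketch (it uses a small made-up matrix rather than regression data) that verifies that the crossproduct X`X equals the sum of the outer products of the rows:

proc iml;
X = {1 2,  3 4,  5 6};                   /* any small numeric matrix */
SSCP = X`*X;                             /* crossproduct form of the SSCP matrix */
OuterSum = j(ncol(X), ncol(X), 0);
do i = 1 to nrow(X);
   OuterSum = OuterSum + X[i,]`*X[i,];   /* add the outer product of the i_th row */
end;
print (max(abs(SSCP - OuterSum)))[L="Max Difference"];   /* 0, up to rounding error */
quit;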

The Sherman-Morrison formula enables you to compute the inverse of X`X - xi`xi when you already know the inverse of X`X. For brevity, set A = X`X. The Sherman-Morrison formula for deleting a row vector xi` is
(A – xi`xi)^(-1) = A^(-1) + A^(-1) xi`xi A^(-1) / (1 – xi A^(-1) xi`)

Implement the Sherman-Morrison formula in SAS

The formula shows how to compute the inverse of the updated SSCP by using a matrix-vector multiplication and an outer product. Let's use a matrix language to demonstrate the update method. The following SAS/IML program reads in a small data set, forms the SSCP matrix (X`X), and computes its inverse:

proc iml;
use Sashelp.Class;   /* read data into design matrix X */
read all var _NUM_ into X[c=varNames];  
close;
XpX = X`*X;          /* form SSCP */
XpXinv = inv(XpX);   /* compute the inverse */

Suppose you want to compute a leave-one-out statistic such as PRESS. For each observation, you need to estimate the parameters that result if you delete that observation. For simplicity, let's just look at deleting the first row of the X matrix. The following program creates a new design matrix (Z) that excludes the row, forms the new SSCP matrix, and finds its inverse:

/* Inefficient: Manually delete the row from the X matrix 
   and recompute the inverse */
n = nrow(X);
Z = X[2:n, ];       /* delete first row */
ZpZ = Z`*Z;         /* reform the SSCP matrix */
ZpZinv = inv(ZpZ);  /* recompute the inverse */
print ZpZinv[c=varNames r=varNames L="Inverse of SSCP After Deleting First Row"];

The previous statements essentially repeat the entire least squares computation. To compute a leave-one-out statistic, you would perform a total of n similar computations.

In contrast, it is much cheaper to apply the Sherman-Morrison formula to update the inverse of the original SSCP. The following statements apply the Sherman-Morrison formula as it is written:

/* Alternative: Do not change X or recompute the inverse. 
   Use the Sherman-Morrison rank-1 update formula.
   https://en.wikipedia.org/wiki/Sherman–Morrison_formula */
r = X[1, ];          /* first row */
rpr = r`*r;          /* outer product */
/* apply Sherman-Morrison formula */
NewInv = XpXinv + XPXinv*rpr*XPXinv / (1 - r*XpXinv*r`);
print NewInv[c=varNames r=varNames L="Inverse from Sherman-Morrison Formula"];

These statements compute the new inverse by using the old inverse, an outer product, and a few matrix multiplications. Notice that the denominator of the Sherman-Morrison formula includes the expression r*(X`X)^(-1)*r`, which is the leverage statistic for the i_th row.
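If you want to convince yourself that the update formula reproduces the brute-force result, you can compare the two matrices that were computed above. The following statements are a sketch that assumes ZpZinv and NewInv are still defined in the PROC IML session:

maxDiff = max(abs(NewInv - ZpZinv));        /* compare rank-1 update to the recomputed inverse */
print maxDiff[L="Max |NewInv - ZpZinv|"];   /* should be zero, up to rounding error */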

The INVUPDT function in SAS/IML

Because it is important to be able to update an inverse matrix quickly when an observation is deleted (or added!), the SAS/IML language supports the INVUPDT function, which implements the Sherman-Morrison formula. You merely specify the inverse matrix to update, the vector (as a column vector) to use for the rank-one update, and an optional scalar value, which is usually +1 if you are adding a new observation and -1 if you are deleting an observation. For example, the following statements are the easiest way to implement the Sherman-Morrison formula in SAS for a leave-one-out statistic:

NewInv2 = invupdt(XpXinv, r`, -1);
print NewInv2[c=varNames r=varNames L="Inverse from INVUPDT Function"];

The output is not displayed because the matrix NewInv2 is the same as the matrix NewInv in the previous section. The documentation includes additional examples.

The general Sherman-Morrison-Woodbury formula

The Sherman-Morrison formula shows how to perform a rank-1 update of an inverse matrix. There is a more general formula, called the Sherman-Morrison-Woodbury formula, which enables you to update an inverse for any rank-k modification of the original matrix. The general formula (Golub and van Loan, p. 51 of 2nd ed. or p. 65 of 4th ed.) shows how to find the inverse of a rank-k modification of a nonsingular matrix, A, in terms of the inverse of A. The general formula is
(A + U V`)^(-1) = A^(-1) – A^(-1) U (I + V` A^(-1) U)^(-1) V` A^(-1)
where U and V are p x k matrices and all inverses are assumed to exist. When k = 1, the matrices U and V become column vectors and the k x k identity matrix becomes the scalar value 1. In the previous section, U equals -xi` and V equals xi`.
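To connect the general formula to the earlier computation, the following SAS/IML statements are a sketch that applies the Sherman-Morrison-Woodbury formula with k = 1. The sketch assumes that XpXinv, r, and NewInv are still defined from the earlier program:

U = -r`;   V = r`;                          /* for k=1, U and V are column vectors */
M = I(1) + V`*XpXinv*U;                     /* 1 x 1 matrix */
WoodburyInv = XpXinv - XpXinv*U*inv(M)*V`*XpXinv;
print (max(abs(WoodburyInv - NewInv)))[L="Max Difference from Sherman-Morrison"];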

The Sherman-Morrison-Woodbury formula is one of my favorite results in linear algebra. It shows that a rank-k modification of a matrix results in a rank-k modification of its inverse. It is not only a beautiful theoretical result, but it has practical applications to leave-one-out statistics because you can use the formula to quickly compute the linear regression model that results by dropping an observation from the data. In this way, you can study the influence of each observation on the model fit (Cook's D, DFBETAS,...) and perform leave-one-out cross-validation techniques, such as the PRESS statistic.

The post Leave-one-out statistics and a formula to update a matrix inverse appeared first on The DO Loop.

June 10, 2019
 

Recoding variables can be tedious, but it is often a necessary part of data analysis. Almost every SAS programmer has written a DATA step that uses IF-THEN/ELSE logic or the SELECT-WHEN statements to recode variables. Although creating a new variable is effective, it is also inefficient because you have to create a new data set that contains the new variable. For large data sets, this is wasteful: most of the data remain the same; only the recoded variables are different.

There is an alternative approach: You can use PROC FORMAT in Base SAS to define a custom SAS format. When you use PROC FORMAT, the data are never changed, but all the SAS reports and analyses can display the formatted values instead of the raw data values. You can use the same format for multiple data sets. You can even define multiple formats to analyze the same variable in multiple ways.

An example of using a format to recode a variable

In the simplest situation, a recoding of a variable converts each raw value to an easier-to-interpret value. For example, suppose that the gender variable for patients is recorded as a binary 0/1 variable. This is a terrible choice because it is not clear whether 0 represents males or females. The following example shows the typical IF-THEN logic for recoding 0 as "Female" and 1 as "Male" by creating a new data set and a new variable:

/* original data: Gender is binary variable, which is hard to understand! */
data Have;              
input Gender @@;
datalines;
1 0 0 0 1 1 0 0 . 1 1 0 0 0 0 1 1 1 . 1 1 
;
 
/* Recode by using IF-THEN or SELECT-WHEN. This can be inefficient. */
data HaveRecode;
set Have;
/* use IF-THEN logic to recode gender */
length Gender_Recode $6;
if      Gender=0 then Gender_Recode = "Female";
else if Gender=1 then Gender_Recode = "Male";
else Gender_Recode = " ";
run;
 
proc freq data=HaveRecode;
   tables Gender_Recode Gender;
run;
Recode variables

The table for the Gender_Recode variable is shown. The data, which were originally coded as a binary indicator variable, have been duplicated by creating a character variable that contains the same information but is more understandable. Of course, now you have to use the new variable name to analyze the recoded data. If you have already written programs that refer to the Gender variable, you have to update the programs to use the new variable name. Yuck!

A more efficient choice is to use a custom-defined format. The beauty of using a format is that you do not have to change the data. Instead, you simply define a format that changes the way that the data are used and displayed in SAS procedures. (A data view is a third alternative, but formats have additional advantages.)

You can define the following format (called GenderFmt.), which displays the gender data as "Female" and "Male" without modifying the data set:

/* use a format to recode gender */
proc format;
value GenderFmt
      0 = "Female"
      1 = "Male" 
      other = " ";
run;
 
/* apply the format to original data; no need to create new data set */
proc freq data=Have;
   format Gender GenderFmt.;    /* the name of the format includes a period */
   tables Gender;
run;
Use PROC FORMAT to recode variables in SAS

Notice that the analysis is run on the original data and uses the original variable name. No additional data sets, views, or variables are created.

Use a format to recode a character variable

Did you know that you can use PROC FORMAT to define formats for character variables? Formats for character variables are used less often than formats for numeric variables, but the syntax is similar. The main difference is that the name of a character format starts with the '$' symbol.

In addition to recoding the values of a categorical variable, formats are useful because they enable you to merge or combine categories by defining a many-to-one mapping. For example, the following character format recodes values of the TYPE variable and also combines the 'SUV' and 'Wagon' categories into a single category. Although it is not needed for this example, notice that the format also includes an 'Other' category, which can be used to combine small groups. The 'Other' category will also handle invalid data.

/* Create sample data from Sashelp.Cars. Exclude hybrids. Optionally sort the data */
proc sort data=sashelp.cars(where=(Type^='Hybrid')) out=Cars; 
   by MPG_City; 
run;
 
proc format;
value $CarTypeFmt
      'Sedan' = 'Family Car'
      'Sports' = 'Sports Car'
      'SUV','Wagon' = 'Big Car'
      'Truck' = 'Truck'
      Other = 'Other';
run;
 
proc freq data=Cars;
   format Type $CarTypeFmt.;   /* the name of the format includes a period at the end */
   tables Type;
run;
Use PROC FORMAT to recode variables in SAS

Using a format enables you to analyze the original data (omit the FORMAT statement) or apply the format (include the FORMAT statement). You can even define multiple formats if you want to slice and dice the data in various ways.
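For example, the following statements sketch a second, hypothetical format (the format name and the groupings are merely illustrative) that groups the same Type variable in a different way:

/* a second, hypothetical format for the same variable */
proc format;
value $CarType2Fmt
      'Sedan','Wagon' = 'Family'
      'Sports'        = 'Sporty'
      'SUV','Truck'   = 'Utility'
      Other           = 'Other';
run;
 
proc freq data=Cars;
   format Type $CarType2Fmt.;   /* same data, same variable, different view */
   tables Type;
run;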

Use a format to bin numeric variables

One of my favorite SAS tricks is to use a format to bin numeric variables into categories. In the following example, the MPG_City variable is used to group vehicles into four categories based on how fuel-efficient the vehicles are. You can use this format to perform any computation that requires a classification variable. The example shows a two-way frequency analysis of the two variables for which we have defined custom formats:

proc format;
value MPGFmt  
      low -<  15   = "Gas Guzzler"    /* < 15      */
       15 -<  20   = "Not Good"       /* [ 15, 20) */
       20 -<  25   = "Good"           /* [ 20, 25) */
       25 -   high = "Great";         /* >= 25     */
run;
 
proc freq data=Cars order=data;
   format MPG_City MPGFmt. Type $CarTypeFmt.;
   tables MPG_City * Type / nocol norow nopercent;
run;
Use PROC FORMAT to recode variables in SAS

Store and retrieve formats

Formats are stored in a catalog, which is stored separately from the data. By default, SAS stores the formats in a catalog named WORK.FORMATS. Like everything else stored in WORK, that catalog will vanish when you end the SAS session. Therefore, you need to store the formats in a permanent libref if you want to reuse the formats across SAS sessions.

SAS supports several features that help you to maintain a permanent library of formats. Here are two facts about format catalogs:

  • You can use the LIBRARY= option on the PROC FORMAT statement to specify a libref in which to store the format catalog. By default, the catalog will be named FORMATS.
  • SAS maintains a list of librefs to search through to find formats. By default, it looks in WORK and a special libref named LIBRARY.

These facts imply that you can do two simple things to create a permanent library of formats. First, define a permanent libref named LIBRARY (the name is important!) that will contain your catalog of formats. Second, specify the LIBRARY=LIBRARY option when you define the format, as follows:

libname library "C:/MyFormats";   /* the libref 'LIBRARY' has special significance! */
proc format library=library;      /* adds format to the permanent catalog LIBRARY.FORMATS */
value $CarTypeFmt
      'Sedan' = 'Family Car'    'Sports' = 'Sports Car'
      'SUV','Wagon' = 'Big Car' 'Truck' = 'Truck'       Other = 'Other';
run;

When you start a new SAS session, you will need to define the LIBRARY libref again if you want to access the formats. For convenience, many people put the LIBNAME statement in their AutoExec.sas file. Because SAS searches for formats in the LIBRARY.FORMATS catalog, SAS will automatically find the $CarTypeFmt. format.

SAS provides many other options for storing formats and for specifying the search locations for formats. For details, see the SAS usage note "How can I permanently store and use formats that I have created?" or John Ladd's 2012 paper, "Yes, We Can... Save SAS Formats."
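For example, one relevant system option is FMTSEARCH=, which controls the order in which SAS searches format catalogs. The following statements are a sketch (the libref name and path are assumptions, not part of the original example):

libname MyFmts "C:/MyFormats";             /* hypothetical libref that contains a format catalog */
options fmtsearch=(MyFmts work library);   /* search MyFmts.FORMATS, then WORK, then LIBRARY */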

Summary

In summary, if you need to recode data, custom-defined formats provide an easy alternative to physically changing the data. This article discusses five advantages to using formats to recode data:

  • The data do not change. You can use the original variable names in the analyses.
  • You can apply formats to both character and numerical variables.
  • You can use formats to merge categories and to bin numeric variables.
  • You can apply a format to multiple variables in multiple data sets.
  • You can save formats in a permanent libref and use them across SAS sessions.

Do you maintain a library of SAS formats at your workplace? Leave a comment to share your experience and your best practices.

The post 5 reasons to use PROC FORMAT to recode variables in SAS appeared first on The DO Loop.

June 5, 2019
 

A family of curves is generated by an equation that has one or more parameters. To visualize the family, you might want to display a graph that overlays four or five curves that have different parameter values, as shown to the right. The graph shows members of a family of exponential transformations of the form
f(x; α) = (1 – exp(-α x)) / (1 – exp(-α))
for α > 0 and x ∈ [0, 1]. This graph enables you to see how the parameter affects the shape of the curve. For example, for small values of the parameter, α, the transformation is close to the identity transformation. For larger values of α, the nonlinear transformation stretches intervals near x=0 and compresses intervals near x=1.
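To see why small values of α give a transformation that is close to the identity, expand the exponentials for small α: 1 – exp(-αx) ≈ αx and 1 – exp(-α) ≈ α, so f(x; α) ≈ (αx)/α = x.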

Here's a tip for creating a graph like this in SAS. Generate the data in "long format" and use the GROUP= option on the SERIES statement in PROC SGPLOT to plot the curves and control their attributes. The long format and the GROUP= option make it easy to visualize the family of curves.

A family of exponential transformations

I recently read a technical article that used the exponential family given above. The authors introduced the family and stated that they would use α = 8 in their paper. Although I could determine in my head that the function is monotonically increasing on [0, 1] and f(0)=0 and f(1)=1, I had no idea what the transformation looked like for α = 8. However, it is easy to use SAS to generate members of the family for different values of α and overlay the curves:

data ExpTransform;
do alpha = 1 to 7 by 2;                             /* parameters in the outer loop */
   do x = 0 to 1 by 0.01;                           /* domain of function    */
      y = (1-exp(-alpha*x)) / (1 - exp(-alpha));    /* f(x; alpha) on domain */
      output;
   end;
end;
run;
 
/* Use ODS GRAPHICS / ATTRPRIORITY=NONE 
   if you want to force the line attributes to vary in the HTML destination. */
ods graphics / width=400px height=400px;
title "Exponential Family of Transformations";
proc sgplot data=ExpTransform;
   series x=x y=y / group=alpha lineattrs=(thickness=2);
   keylegend / location=inside position=E across=1 opaque sortorder=reverseauto;
   xaxis grid;  yaxis grid;
run;

The graph is shown at the top of this article. The best way to create this graph is to generate the points in the long-data format because:

  • The outer loop controls the values of the parameters and how many curves are drawn. You can use a DO loop to generate evenly spaced parameters or specify an arbitrary sequence of parameters by using the syntax
    DO alpha = 1, 3, 6, 10;
  • The domain of the curve might depend on the parameter value. As shown in the next section, you might want to use a different set of points for each curve.
  • You can use the GROUP= option and the KEYLEGEND statement in PROC SGPLOT to visualize the family of curves.

Visualize a two-parameter family of curves

You can use the same ideas and syntax to plot a two-parameter family of curves. For example, you might want to visualize the density of the Beta distribution for representative values of the shape parameters, a and b. The Wikipedia article about the Beta distribution uses five pairs of (a, b) values; I've used the same values in the following SAS program:

data BetaDist;
array alpha[5] _temporary_ (0.5 5 1 2 2);
array beta [5] _temporary_ (0.5 1 3 2 5);
do i = 1 to dim(alpha);                       /* parameters in the outer loop */
   a = alpha[i]; b = beta[i];
   Params = catt("a=", a, "; b=", b);         /* concatenate parameters */
   do x = 0 to 0.99 by 0.01;
      pdf = pdf("Beta", x, a, b);             /* evaluate the Beta(x; a, b) density */
      if pdf < 2.5 then output;               /* exclude large values */
   end;
end;
run;
 
ods graphics / reset;
title "Probability Density of the Beta(a, b) Distribution";
proc sgplot data=BetaDist;
   label pdf="Density";
   series x=x y=pdf / group=Params lineattrs=(thickness=2);
   keylegend / position=right;
   xaxis grid;  yaxis grid;
run;

The resulting graph gives a good overview of how the parameters in the Beta distribution affect the shape of the probability density function. The program uses a few tricks:

  • The parameters are stored in arrays. The program loops over the number of parameters.
  • A SAS concatenation function concatenates the parameters into a string that identifies each curve. The CAT, CATS, CATT, and CATX functions are powerful and useful!
  • For this family, several curves are unbounded. The program caps the maximum vertical value of the graph at 2.5.
  • Although it is not obvious, some of the curves are drawn by using 100 points whereas others use fewer points. This is an advantage of using the long format.

In summary, you can use PROC SGPLOT to visualize a family of curves. The task is easiest when you generate the points along each curve in the "long format." The long format is easier to work with than the "wide format" in which each curve is stored in a separate Y variable. When the curve values are in long form, you can use the GROUP= option on the SERIES statement to create an effective visualization by using a small number of statements.

The post Plot a family of curves in SAS appeared first on The DO Loop.

June 3, 2019
 

Statistical programmers and analysts often use two kinds of rectangular data sets, popularly known as wide data and long data. Some analytical procedures require that the data be in wide form; others require long form. (The "long format" is sometimes called "narrow" or "tall" data.) Fortunately, the statistical graphics procedures in SAS (notably, PROC SGPLOT) can usually accommodate either format. You can use multiple statements to create graphs of wide data. You can use a single statement and the GROUP= option to create graphs of long data.

Example: Overlay line plots for multiple response variables

Suppose you have four variables (with N observations) and you want to overlay line plots of three of the variables graphed against the fourth. There are two natural ways to arrange the 4*N data values. The first (and most natural) is a data set that has N rows and 4 variables (call them X, Y1, Y2, and Y3). This is the "wide form" of the data. The "long form" data set has three variables and 3*N rows, as shown to the right. The first column (VarName) specifies which of the three response variables the row refers to. The second column (X) indicates the value of the independent variable and the third column (Y) represents the value of the dependent variable that is specified in the VarName column. Some people will additionally sort the long data by the VarName variable, but that is not usually necessary. In general, if you want to stack k variables, the long form data will contain k*N observations.

PROC SGPLOT enables you to plot either set of data. For the wide data, you can use three SERIES statements to plot X vs Y1, X vs Y2, and X vs Y3, as follows. Notice that you can independently set the attributes of each line, such as color, symbol, line style. In the following program, the line thickness is set to the same value for all lines, but you could make that attribute vary, if you prefer.

data Wide;
input X Y1 Y2 Y3;
datalines;
10 2 3 4
15 0 4 6
20 1 4 5
;
 
title "Wide Form: Use k Statements to Plot k Variables";
proc sgplot data=Wide;
   series x=X y=Y1 / markers lineattrs=(thickness=2);
   series x=X y=Y2 / markers lineattrs=(thickness=2);
   series x=X y=Y3 / markers lineattrs=(thickness=2);
run;

You can use PROC TRANSPOSE or the SAS DATA step to convert the data from wide form to long form. When the data are in the long format, you use a single SERIES statement and the GROUP=VarName option to plot the three groups of lines. In addition, you can set the attributes for all the lines by using a single statement.

/* convert data from Wide to Long form */
data Long;
set Wide;
VarName='Y1'; Value=Y1; output;
VarName='Y2'; Value=Y2; output;
VarName='Y3'; Value=Y3; output;
drop Y1-Y3;
run;
 
title "Long Form: Use GROUP= Option to Plot k Variables";
proc sgplot data=Long;
   series x=X y=Value / group=VarName markers lineattrs=(thickness=2);
run;
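As mentioned earlier, PROC TRANSPOSE provides an alternative to the DATA step for converting from wide form to long form. The following call is a sketch (the data must be sorted by X, and the output data set and variable names are merely illustrative):

/* Alternative: convert from wide form to long form by using PROC TRANSPOSE */
proc transpose data=Wide out=LongT(rename=(COL1=Value)) name=VarName;
   by X;                        /* one BY group for each value of X */
   var Y1-Y3;                   /* the variables to stack */
run;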

Advantages and disadvantages of wide and long formats

The two formats contain the same information, but sometimes one form is more convenient than the other. Here are a few reasons to consider wide-form and long-form data:

Use the wide form when...

  • You want to run a fixed-effect regression analysis. Many SAS procedures require data to be in wide form, including ANOVA, REG, GLM, LOGISTIC, and GENMOD.
  • You want to run a multivariate analysis. Multivariate analyses include principal components (PRINCOMP), clustering (FASTCLUS), discriminant analysis (DISCRIM), and most matrix-based computations (PROC IML).
  • You want to create a plot that overlays graphs of several distinct variables. With wide data, you can easily and independently control the attributes of each overlay.

Use the long form when...

  • You want to run a mixed model regression analysis for repeated measurements. PROC MIXED and GLIMMIX require the long format. In general, the long format is useful for many kinds of longitudinal analysis, where the same subject is measured at multiple time points.
  • The measurements were taken at different values of the X variable. For example, in the previous section, the wide format is applicable because Y1, Y2, and Y3 were all measured at the same three values of X. However, the long form enables Y1 to be measured at different values than Y2. In fact, Y1 could be measured at only three points whereas Y2 could be measured at more points.

The last bullet point is important. The long form is more flexible, so you can use it to plot quantities that are measured at different times or positions. Line plots of this type are sometimes called spaghetti plots. The following DATA step defines long-form data for which the response variables are measured at different values of X:

data Long2;
infile datalines truncover;
length VarName $2;
input VarName X Value @;
do until( Value=. );
   output;   input X Value @;
end;
datalines;
Y1 10 2 15 0 20 1
Y2 10 3 12 4 13 5 16 4 17 3 18 3 20 4
Y3 9 3 11 4 14 6 18 4 19 5
;
 
title "Long Form: Different Number of Measurements per Subject";
proc sgplot data=Long2;
   series x=X y=Value / group=VarName markers lineattrs=(thickness=2);
   xaxis grid; yaxis grid;
run;

In summary, you can use PROC SGPLOT to create graphs regardless of whether the data are in wide form or long form. I've presented a few common situations in which you might want to use each kind of data representation. Can you think of other situations in which the long format is preferable to the wide format? Or vice versa? Let me know by leaving a comment.

The post Graph wide data and long data in SAS appeared first on The DO Loop.

May 30, 2019
 

Knowing how to visualize a regression model is a valuable skill. A good visualization can help you to interpret a model and understand how its predictions depend on explanatory factors in the model. Visualization is especially important in understanding interactions between factors. Recently I read about work by Jacob A. Long who created a package in R for visualizing interaction effects in regression models. His graphs inspired me to discuss how to visualize interaction effects in regression models in SAS.

There are many ways to explore the interactions in a regression model, but this article describes how to use the EFFECTPLOT statement in SAS. The emphasis is on creating a plot that shows how the response depends on two regressors that might interact. Depending on the type of regressors (continuous or categorical), you can create the following plots:

  • Both regressors are continuous: Use the CONTOUR option to create a contour plot or the SLICEFIT option to display curves that show the predicted response as a function of the first regressor while fixing the second regressor at a sequence of values (often low, medium, and high values).
  • One regressor is categorical and the other is continuous: Use the SLICEFIT option to overlay a curve for the predicted response for each value of the categorical regressor.
  • Both regressors are categorical: Use the INTERACTION option to create a plot that shows the group means for each joint level of the regressors. Alternatively, you can use the BOX option to draw box plots for each pair of levels.

For an introduction to the EFFECTPLOT statement, see my 2016 article "Use the EFFECTPLOT statement to visualize regression models in SAS." The EFFECTPLOT statement and the PLM procedure were both introduced in SAS 9.22 in 2010.

This article uses the Sashelp.Cars data to demonstrate the visualizations. The response variable is MPG_City, which is the average miles per gallon for each vehicle during city driving. The regressors are Weight (mass of the vehicle, in pounds), Horsepower, Origin (place of manufacture: 'Asia', 'Europe', or 'USA'), and Type of vehicle. For simplicity, the example uses only four values of the Type variable: 'SUV', 'Sedan', 'Sports', or 'Wagon'.

Although this article shows only two-regressor models, the EFFECTPLOT statement supports arbitrarily many regressors. By default, the additional continuous explanatory variables are set to their mean values; the additional categorical regressors are set to their reference level. You can change this default behavior by using the AT keyword.

Interaction between two continuous variables

Suppose you want to visualize the interaction between two continuous regressors. The following call to PROC GLM creates a contour plot automatically. It also creates an item store which saves information about the model.

proc glm data=Sashelp.Cars;
   model MPG_City = Horsepower | Weight / solution;
   ods select ParameterEstimates ContourFit;
   store GLMModel;
run;

From the contour plot, you can see that the Horsepower and Weight variables interact. For low values of Weight, the predicted response has a negative slope with respect to Horsepower. In contrast, for high values of Weight, the predicted response has a positive slope with respect to Horsepower.

This fact is easier to see if you "slice" the contour plot at low, medium, and high values of the Weight variable. You can use PROC PLM to create a SLICEFIT plot. By default, the "slicing" variable is fixed at five values: its minimum value, first quartile value, median value, third quartile value, and maximum value.

/* Graph response vs X1. By default, X2 fixed at Min, Q1, Median, Q3, and Max values */
proc plm restore=GLMModel noinfo;
   effectplot slicefit(x=Horsepower sliceby=Weight) / clm;
run;

Because the slopes of the lines depend on the value of Weight, the graph indicates an interaction.

Changing the slicing levels for continuous variables

As shown in the previous section, the SLICEFIT statement uses quantiles to fix the value of the second continuous regressor. When I read Jacob Long's web page, I noticed that his functions slice the second regressor at its mean and one standard deviation away from the mean (in both directions). In SAS, you can specify arbitrary values by using the SLICEBY= option. The following statements use PROC MEANS to compute the mean and standard deviation of the Weight variable, then use those values to specify slicing values for the Weight variable:

proc means data=Sashelp.Cars Mean StdDev ndec=0;
   var Weight;
run;
 
proc plm restore=GLMModel noinfo;
   effectplot slicefit(x=Horsepower sliceby=Weight=2819 3578 4337) / clm;
run;

Again, the slope of the predicted response changes with values of Weight, which indicates an interaction effect. Notice that when Weight = 4337, which is one standard deviation above the mean, the predicted response is essentially flat (the slope is close to zero).

Of course, you can automate this process if you don't want to compute the slicing values in your head. You can use the DATA step or PROC SQL to compute the slicing values, then create macro variables for the sample mean and standard deviation. You can then use the macro variables to specify the slicing values:

proc sql;
  select mean(Weight) as mean, std(Weight) as std
  into :mean, :std             /* put Mean and StdDev into macro variables */
  from Sashelp.cars;
quit;
 
/* slice the Weight variable at mean - StdDev, mean, and mean + StdDev */
proc plm restore=GLMModel noinfo;
   effectplot slicefit(x=Horsepower 
      sliceby=Weight=%sysevalf(&mean-&std) &mean %sysevalf(&mean+&std)) / clm;
run;

The graph is similar to the previous graph and is not shown.

Interactions between a continuous and a categorical regressor

If one of the regressors is categorical and the other is continuous, it is easy to visualize the interaction because you can plot the predicted response versus the continuous regressor for each level of the categorical regressor. In fact, this plot is created automatically by many SAS procedures, so often you don't need to use the EFFECTPLOT statement. For example, the following call to PROC GLM overlays three regression curves on a scatter plot of the data:

ods graphics on;
proc glm data=Sashelp.Cars;
   class Origin(ref='Europe');
   model mpg_city = Horsepower | Origin / solution; /* one continuous, one categorical */
   store GLMModel2;
run;

The slopes of the lines change with the levels of the Origin variable, so there appears to be an interaction effect between those two regressors.

The GLM procedure has access to the original data, so the lines are overlaid on a scatter plot. If you create the same plot in PROC PLM, you obtain the lines (and, optionally, confidence bands), but the plot does not include a scatter plot because the data are not part of the saved item store. For completeness, the following call to PROC PLM creates a similar visualization of the Horsepower-Origin interaction:

proc plm restore=GLMModel2 noinfo;
   effectplot slicefit(x=Horsepower sliceby=Origin) / clm;
run;

Interactions between two categorical regressors

I have previously written about how to create an "interaction plot" for two categorical predictors. Many SAS procedures produce this kind of plot automatically. You can use the EFFECTPLOT BOX or EFFECTPLOT INTERACTION statement inside many regression procedures. Alternatively, you can call PROC PLM and create an interaction plot from an item store. Again, the main difference is that the regression procedures can overlay observed data values, whereas PROC PLM visualizes only the model, not the data.

The following example creates a model that has two categorical variables. By default, PROC GLM creates an interaction plot and overlays the observed data values:

proc glm data=Sashelp.Cars(where=(Type in ('SUV' 'Sedan' 'Sports' 'Wagon')));
   class Origin(ref='Europe') Type;
   model mpg_city = Origin | Type;
   store GLMModel3;
run;

This model does not exhibit much (if any) interaction between the regressors. For a vehicle of a specific type (such as 'SUV'), Asian-built vehicles tend to have a higher MPG_City than USA-built vehicles, which tend to have a higher MPG than European-built vehicles. These trend lines have similar slopes as you vary the Type variable. If you check the ParameterEstimates table, you will see that the interaction effects are not statistically significant (α = 0.05).

As mentioned, you can create the same plot (without the data markers) by using PROC PLM. If you request confidence intervals, you get a slightly different graph. You can choose whether or not to connect the means of the response for each level. The following statements create a plot for which the means are not connected:

proc plm restore=GLMModel3 noinfo;
   effectplot interaction(x=Origin sliceby=Type) / clm;
/* effectplot interaction(x=Origin sliceby=Type) / clm connect; */ /* or connect the means */
run;

Lastly, you can use the EFFECTPLOT BOX statement in regression procedures. The information is the same as for the "interaction plot," but box plots are used to show the observed distribution of the response for each level of the first categorical regressor.

proc genmod data=Sashelp.Cars(where=(Type in ('SUV' 'Sedan' 'Sports' 'Wagon')));
   class Origin(ref='Europe') Type;
   model mpg_city = Origin | Type;
   effectplot box(x=Origin sliceby=Type) / nolabeloutlier;
run;

Summary

In summary, you can use the EFFECTPLOT statement to visualize the interactions between regressors in a regression model. In general, when the slopes of the response curves depend on the values of a second regressor, that indicates an interaction effect. For a continuous-continuous interaction, you can choose the values at which you slice the second regressor. By default, the regressor is sliced at quantiles, but you can modify that to, for example, slice the variable at its mean and at (plus or minus) one standard deviation away from the mean. If you use the EFFECTPLOT statement inside a regression procedure, you can overlay the model on the observed responses. In PROC PLM, the EFFECTPLOT statement visualizes only the model.

The post Visualize interaction effects in regression models appeared first on The DO Loop.

May 28, 2019
 

Modern statistical software provides many options for computing robust statistics. For example, SAS can compute robust univariate statistics by using PROC UNIVARIATE, robust linear regression by using PROC ROBUSTREG, and robust multivariate statistics such as robust principal component analysis. Much of the research on robust regression was conducted in the 1970s, so I was surprised to learn that a robust version of simple (one variable) linear regression was developed way back in 1950! This early robust regression method uses many of the same techniques that are found in today's "modern" robust regression methods. This article describes and implements a robust estimator for simple linear regression that was developed by Theil (1950) and extended by Sen (1968).

The Theil-Sen robust estimator

I had not heard of the Theil-Sen robust regression method until recently, perhaps because it applies only to one-variable regression. The Wikipedia article about the Theil-Sen estimator states that the method is "the most popular nonparametric technique for estimating a linear trend" in the applied sciences due to its "simplicity in computation, ... robustness to outliers," and "[limited assumptions] regarding measurement errors."

The idea behind the estimator is simple. If the data contains N pairs of (x, y) values, compute all the slopes between pairs of points and choose the median as the estimate of the regression slope. Using that slope, pass a line through each pair of (x,y) values to obtain N intercepts. Choose the median of the intercepts as the estimate of the regression intercept.

That's it. You compute "N choose 2" (which is N*(N-1)/2) slopes and take the median. Then compute N intercepts and take the median. The slope estimate is unbiased and the process is resistant to outliers.

The adjacent scatter plot shows the Theil-Sen regression line for nine data points. The seven data points that appear to fall along the regression line were used by Sen (1968). I added two outliers. The plot shows that the Theil-Sen regression line ignores the outliers and passes close to the other data points. The slope of the Theil-Sen line is slightly less than 4. In contrast, the least squares line through these data has a slope of only 2.4 because of the influence of the two outliers.

Implement the Theil-Sen estimator in SAS

You can easily implement Theil-Sen regression in SAS/IML, as follows:

  1. Use the ALLCOMB function to generate all pairs of the values {1, 2, ..., N}. Or, for large N, use the RANDCOMB function to sample pairs of values.
  2. Use subscript operations to extract the pairs of points. Compute all slopes between the pairs of points.
  3. Use the MEDIAN function to compute the median slope and the median intercept.

The following SAS/IML program implements this algorithm. The program is very compact (six statements) because it is vectorized. There are no explicit loops.

proc iml;
XY = {1  9,  2 15,  3 19, 4 20, 10 45,  12 55, 18 78, /* 7 data points used by Sen (1968) */
     12.5 30,  4.5 50};                               /* 2 outliers (not considered by Sen) */
 
/* Theil uses all "N choose 2" combinations of slopes of segments.
   Assume that the first coordinates (X) are distinct */
c = allcomb(nrow(XY), 2);         /* all "N choose 2" combinations of pairs */
Pt1 = XY[c[,1],];                 /* extract first point of line segments */
Pt2 = XY[c[,2],];                 /* extract second point of line segments */
slope = (Pt1[,2] - Pt2[,2]) / (Pt1[,1] - Pt2[,1]); /* Careful! Assumes x1 ^= x2 */
m = median(slope);
b = median( XY[,2] - m*XY[,1] );  /* median(y-mx) */
print (b||m)[c={'Intercept' 'Slope'} L="Method=Theil Combs=All"];

As stated earlier, the Theil-Sen estimate has a slope of 3.97. That value is the median of the slopes among the 36 line segments that connect pairs of points. The following graphs display the 36 line segments between pairs of points and a histogram of the distribution of the slopes. The histogram shows that the value 3.97 is the median value of the distribution of slopes.


Handling repeated X values: Sen's extension

The observant reader might object that the slopes of the line segments will be undefined if any of the data have identical X values. One way to deal with that situation is to replace the undefined slopes by large positive or negative values, depending on the sign of the difference between the Y values. Since the median is a robust estimator, adding a few high and low values will not affect the computation of the median slope. Alternatively, Sen (1968) proved that you can omit the pairs that have identical X values and still obtain an unbiased estimate. In the following SAS/IML program, I modified the X values of the two outliers so that only seven of the nine X values are unique. The LOC function finds all pairs that have different X values, and only those pairs are used to compute the robust regression estimates.

/* Sen (1968) handles repeated X coords by using only pairs with distinct X */
XY = {1  9,  2 15,  3 19, 4 20, 10 45,  12 55, 18 78,
     12 30,  4 50};  /* last two obs are repeated X values */
c = allcomb(nrow(XY), 2);        /* all "N choose 2" combinations of pairs */
Pt1 = XY[c[,1],];                /* first point of line segments */
Pt2 = XY[c[,2],];                /* second point of line segments */
idx = loc(Pt1[,1]-Pt2[,1]^=0);   /* find pairs whose X values differ */
Pt1 = Pt1[idx,];                 /* keep only pairs with different X values */
Pt2 = Pt2[idx,];
 
slope = (Pt1[,2] - Pt2[,2]) / (Pt1[,1] - Pt2[,1]);
m = median(slope);
b = median( XY[,2] - m*XY[,1] );  /* median(y-mx) */
print (b||m)[c={'Intercept' 'Slope'} L="Method=Sen Combs=All"];

A function to compute the Theil-Sen estimator

The following program defines a SAS/IML function that implements the Theil-Sen regression estimator. I added two options. You can use the METHOD argument to specify how to handle pairs of points that have the same X values. You can use the NUMPAIRS option to specify whether to use the slopes of all pairs of points or whether to use the slopes of K randomly generated pairs of points.

proc iml;
/* Return (intercept, slope) for Theil-Sen robust estimate of a regression line.
   XY is N x 2 matrix. The other arguments are:
   METHOD: 
      If method="Theil" and a pair of points have the same X coordinate, 
         assign a large positive value instead of +Infinity and a large negative 
         value instead of -Infinity. 
      If method="Sen", omit any pairs of points that have the same first coordinate. 
   NUMPAIRS:
      If numPairs="All", generate all "N choose 2" combinations of the N points.
      If numPairs=K (positive integer), generate K random pairs of points. 
*/
start TheilSenEst(XY, method="SEN", numPairs="ALL");
   Infinity = 1e99;             /* big value for slope instead of +/- infinity */
   if type(numPairs)='N' then
      c = rancomb(nrow(XY), 2, numPairs);  /* random combinations of pairs */
   else if upcase(numPairs)="ALL" then 
      c = allcomb(nrow(XY), 2);            /* all "N choose 2" combinations of pairs */
   else stop "ERROR: The numPairs option must be 'ALL' or a postive integer";
 
   Pt1 = XY[c[,1],];                       /* first points for slopes */
   Pt2 = XY[c[,2],];                       /* second points for slopes */
   dy = Pt1[,2] - Pt2[,2];                 /* change in Y */
   dx = Pt1[,1] - Pt2[,1];                 /* change in X */ 
   idx = loc( dx ^= 0 );  
   if upcase(method) = "SEN" then do;      /* exclude pairs with same X value */
      slope = dy[idx] / dx[idx];           /* slopes of line segments */
   end;
   else do;                        /* assign big slopes for pairs with same X value */
      slope = j(nrow(Pt1), 1, .);  /* if slope calculation is 0/0, assign missing */
      /* Case 1: x1 ^= x2. Do the usual slope computation */
      slope[idx] = dy[idx] / dx[idx];
      /* Case 2: x1 = x2. Assign +Infinity if sign(y1-y2) > 0, else assign -Infinity */
      jdx = loc( dx = 0 & sign(dy)>0 );
      if ncol(jdx)>0 then 
         slope[jdx] = Infinity;
      jdx = loc( dx = 0 & sign(dy)<0 );
      if ncol(jdx)>0 then 
         slope[jdx] = -Infinity;
   end;
   m = median(slope);
   b = median( XY[,2] - m*XY[,1] );  /* median(y-mx) */
   return( b || m );
finish;
 
/* Test all four calls */
XY = {1  9,  2 15,  3 19, 4 20, 10 45,  12 55, 18 78,
     18 30,  4 50};  /* last two obs are outliers; their X values repeat earlier X values */
 
est = TheilSenEst(XY, "Theil", "All");
print est[c={'Intercept' 'Slope'} L="Method=Theil; Pairs=All"];
 
est = TheilSenEst(XY, "Sen", "All");
print est[c={'Intercept' 'Slope'} L="Method=Sen; Pairs=All"];
 
call randseed(123, 1);
est = TheilSenEst(XY, "Theil", 200);
print est[c={'Intercept' 'Slope'} L="Method=Theil; Pairs=200"];
 
call randseed(123, 1);
est = TheilSenEst(XY, "Sen", 200);
print est[c={'Intercept' 'Slope'} L="Method=Sen; Pairs=200"];
QUIT;

For these data, the estimates are the same whether you exclude pairs of points that have identical X coordinates or whether you replace the undefined slopes with large values. For this small data set, there is no reason to use the randomly chosen pairs of points, but that syntax is shown for completeness. Of course, if you run the analysis with a different random number seed, you will get a different estimate.
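
If your data are stored in a SAS data set, you can read the two variables into an N x 2 matrix and pass the matrix to the function. The following statements are a sketch: the data set name Have and the variable names x and y are hypothetical, and the statements must run in the same PROC IML session in which the TheilSenEst module is defined (that is, before the QUIT statement).

/* sketch: 'Have', 'x', and 'y' are hypothetical names */
use Have;
read all var {x y} into XY;       /* form the N x 2 matrix of (x, y) pairs */
close Have;
 
est = TheilSenEst(XY, "Sen", "All");
print est[c={'Intercept' 'Slope'} L="Theil-Sen Estimates"];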

Summary

You can download the SAS program that creates the analysis and graphs in this article.

Although Theil published the main ideas for this method in 1950, the method contains many of the features of modern robust statistical estimation. Specifically, a theme in modern robust statistics is to exhaustively or randomly choose many small subsets of the data. You compute a (classical) estimate on each subset and then use the many estimates to obtain a robust estimate. I did not realize that Theil had introduced these basic ideas almost seventy years ago!

Theil and Sen also included confidence intervals for the estimates, but I have not included them in this brief article.

References

  • Sen, P. K. (1968), "Estimates of the regression coefficient based on Kendall's tau," Journal of the American Statistical Association, 63, 1379–1389.
  • Theil, H. (1950), "A rank-invariant method of linear and polynomial regression analysis," I, II, and III, Proceedings of the Koninklijke Nederlandse Akademie van Wetenschappen, 53.

The post The Theil-Sen robust estimator for simple linear regression appeared first on The DO Loop.

May 22, 2019
 

The eigenvalues of a matrix are not easy to compute. It is remarkable, therefore, that with relatively simple mental arithmetic, you can obtain bounds for the eigenvalues of a matrix of any size. The bounds are provided by using a marvelous mathematical result known as Gershgorin's Disc Theorem. For certain matrices, you can use Gershgorin's theorem to quickly determine that the matrix is nonsingular or positive definite.

The Gershgorin Disc Theorem appears in Golub and van Loan (p. 357, 4th Ed; p. 320, 3rd Ed), where it is called the Gershgorin Circle Theorem. The theorem states that the eigenvalues of any N x N matrix, A, are contained in the union of N discs in the complex plane. The center of the i_th disc is the i_th diagonal element of A. The radius of the i_th disc is the sum of the absolute values of the off-diagonal elements in the i_th row. In symbols,
D_i = { z ∈ C : |z − A_ii| ≤ r_i }
where r_i = Σ_{j ≠ i} |A_ij|. Although the theorem holds for matrices with complex values, this article only uses real-valued matrices.

An example of Gershgorin discs is shown to the right. The discs are shown for the following 4 x 4 symmetric matrix, which is the same matrix that is used in the SAS/IML program later in this article:

A = { 200  30 -15  5,
       30 100   5  5,
      -15   5  55  0, 
        5   5   0 15};

At first glance, it seems inconceivable that we can know anything about the eigenvalues without actually computing them. However, two mathematical theorems tell us quite a lot about the eigenvalues of this matrix, just by inspection. First, because the matrix is real and symmetric, the Spectral Theorem tells us that all eigenvalues are real. Second, the Gershgorin Disc Theorem says that the four eigenvalues are contained in the union of the following discs:

  • The first row produces a disc centered at x = 200. The disc has radius |30| + |-15| + |5| = 50.
  • The second row produces a disc centered at x = 100 with radius |30| + |5| + |5| = 40.
  • The third row produces a disc centered at x = 55 with radius |-15| + |5| + |0| = 20.
  • The last row produces a disc centered at x = 15 with radius |5| + |5| + |0| = 10.

Although the eigenvalues for this matrix are real, the Gershgorin discs are in the complex plane. The discs are visualized in the graph at the top of this article. The true eigenvalues of the matrix are shown inside the discs.

For this example, each disc contains an eigenvalue, but that is not true in general. (For example, the matrix A = {1 −1, 2 −1} does not have any eigenvalues in the disc centered at x=1.) What is true, however, is that a disjoint union of discs must contain as many eigenvalues as the number of discs in that union. For this matrix, the discs centered at x=15 and x=200 are each disjoint from all the other discs, so each of those discs contains exactly one eigenvalue. The union of the other two discs must contain two eigenvalues, but, in general, the eigenvalues can be anywhere in the union of the discs.

The visualization shows that the eigenvalues for this matrix are all positive. That means that the matrix is not only symmetric but also positive definite. You can predict that fact from the Gershgorin discs because every disc lies strictly to the right of the origin: the leftmost point of any disc is x = 5.

Of course, you don't have to perform the disc calculations in your head. You can write a program that computes the centers and radii of the Gershgorin discs, as shown by the following SAS/IML program, which also computes the eigenvalues for the matrix:

proc iml;
A = { 200  30 -15  5,
       30 100   5  5,
      -15   5  55  0, 
        5   5   0 15};
 
evals = eigval(A);                 /* compute the eigenvalues */
center = vecdiag(A);               /* centers = diagonal elements */
radius = abs(A)[,+] - abs(center); /* sum of abs values of off-diagonal elements of each row */
discs = center || radius || round(evals,0.01);
print discs[c={"Center" "Radius" "Eigenvalue"} L="Gershgorin Discs"];

Diagonally dominant matrices

For this example, the matrix is strictly diagonally dominant. A strictly diagonally dominant matrix is one for which the magnitude of each diagonal element exceeds the sum of the magnitudes of the other elements in the row. In symbols, |A_ii| > Σ_{j ≠ i} |A_ij| for each i. Geometrically, this means that no Gershgorin disc contains the origin, which implies that the matrix is nonsingular. So, by inspection, you can determine that this matrix is nonsingular.
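
You can also let the computer perform the check. The following sketch reuses the center and radius vectors from the earlier SAS/IML program; the ALL function returns 1 only if every row satisfies the strict inequality.

/* sketch: assumes center and radius were computed as in the earlier program */
isDominant = all( abs(center) > radius );  /* is |A_ii| > sum of |off-diagonal| for every row? */
print isDominant;                          /* 1 if strictly diagonally dominant; 0 otherwise */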

Gershgorin discs for correlation matrices

The Gershgorin theorem is most useful when the diagonal elements are distinct. For repeated diagonal elements, it might not tell you much about the location of the eigenvalues. For example, all diagonal elements for a correlation matrix are 1. Consequently, all Gershgorin discs are centered at (1, 0) in the complex plane. The following graph shows the Gershgorin discs and the eigenvalues for a 10 x 10 correlation matrix. The eigenvalues of any 10 x 10 correlation matrix must be real and in the interval [0, 10], so the only new information from the Gershgorin discs is a smaller upper bound on the maximum eigenvalue.
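
The following sketch illustrates the computation for a small correlation matrix. The matrix values are made up for illustration; the point is that every disc is centered at 1, so the discs differ only in their radii, and the quantity max(center + radius) bounds the largest eigenvalue.

proc iml;
/* sketch: a made-up 3 x 3 correlation matrix */
R = {1.0  0.3  0.1,
     0.3  1.0 -0.2,
     0.1 -0.2  1.0};
center = vecdiag(R);                   /* every disc is centered at 1 */
radius = abs(R)[,+] - abs(center);     /* sum of |off-diagonal| elements in each row */
upperBound = max(center + radius);     /* Gershgorin bound for the largest eigenvalue */
print radius upperBound, (eigval(R))[L="Eigenvalues of R"];
QUIT;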

Gershgorin discs for unsymmetric matrices

Gershgorin's theorem can be useful for unsymmetric matrices, which can have complex eigenvalues. The SAS/IML documentation contains the following 8 x 8 block-diagonal matrix, which has two pairs of complex eigenvalues:

A = {-1  2  0       0       0       0       0  0,
     -2 -1  0       0       0       0       0  0,
      0  0  0.2379  0.5145  0.1201  0.1275  0  0,
      0  0  0.1943  0.4954  0.1230  0.1873  0  0,
      0  0  0.1827  0.4955  0.1350  0.1868  0  0,
      0  0  0.1084  0.4218  0.1045  0.3653  0  0,
      0  0  0       0       0       0       2  2,
      0  0  0       0       0       0      -2  0 };
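
The following statements are a sketch that you can run after the matrix definition inside a PROC IML session to compute the disc centers, the radii, and the complex eigenvalues. For a matrix that has complex eigenvalues, the EIGVAL function returns a two-column matrix that contains the real and imaginary parts of the eigenvalues.

center = vecdiag(A);                 /* disc centers = diagonal elements */
radius = abs(A)[,+] - abs(center);   /* radii = row sums of |off-diagonal| elements */
evals  = eigval(A);                  /* two-column matrix: Re(lambda) and Im(lambda) */
print center radius, evals[c={"Re" "Im"} L="Eigenvalues"];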

The matrix has four smaller Gershgorin discs and three larger discs (radius 2) that are centered at (-1,0), (2,0), and (0,0), respectively. The discs and the actual eigenvalues of this matrix are shown in the following graph. Not only does the Gershgorin theorem bound the magnitude of the real part of the eigenvalues, but it is clear that the imaginary part cannot exceed 2. In fact, this matrix has eigenvalues -1 ± 2 i, which are on the boundary of one of the discs, which shows that the Gershgorin bound is tight.

Conclusions

In summary, the Gershgorin Disc Theorem provides a way to visualize the possible location of eigenvalues in the complex plane. You can use the theorem to provide bounds for the largest and smallest eigenvalues.

I was never taught this theorem in school. I learned it from a talented mathematical friend at SAS. I use this theorem to create examples of matrices that have particular properties, which can be very useful for developing and testing software.

This theorem also helped me to understand the geometry behind "ridging", which is a statistical technique in which positive multiples of the identity are added to a nearly singular X`X matrix. The Gershgorin Disc Theorem shows that the effect of ridging a matrix is to translate all of the Gershgorin discs to the right, which moves the eigenvalues away from zero while preserving their relative positions.
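
A short sketch makes the geometry concrete. The following statements assume that A is the 4 x 4 symmetric matrix from the first example and that the center and radius vectors were computed as in the earlier program. Adding λI shifts every disc center to the right by λ and leaves the radii unchanged.

/* sketch: ridging translates every Gershgorin disc to the right by lambda */
lambda  = 10;
B       = A + lambda*I(nrow(A));
centerB = vecdiag(B);                  /* equals center + lambda */
radiusB = abs(B)[,+] - abs(centerB);   /* equals radius: off-diagonal elements are unchanged */
print (center || centerB)[c={"center(A)" "center(A+10I)"}],
      (radius || radiusB)[c={"radius(A)" "radius(A+10I)"}];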

You can download the SAS program that I used to create the images in this article.

Further reading

There are several papers on the internet about Gershgorin discs. It is a favorite topic for advanced undergraduate projects in mathematics.

The post Gershgorin discs and the location of eigenvalues appeared first on The DO Loop.

May 20, 2019
 

Recently I wrote about how to compute the Kolmogorov D statistic, which is used to determine whether a sample has a particular distribution. One of the beautiful facts about modern computational statistics is that if you can compute a statistic, you can use simulation to estimate the sampling distribution of that statistic. That means that instead of looking up critical values of the D statistic in a table, you can estimate the critical value by using empirical quantiles from the simulation.

This is a wonderfully liberating result! No longer are we statisticians constrained by the entries in a table in the appendix of a textbook. In fact, you could claim that modern computation has essentially killed the standard statistical table.

Obtain critical values by using simulation

Before we compute anything, let's recall a little statistical theory. If you get a headache thinking about null hypotheses and sampling distributions, you might want to skip the next two paragraphs!

When you run a hypothesis test, you compare a statistic (computed from data) to a hypothetical distribution (called the null distribution). If the observed statistic is way out in a tail of the null distribution, you reject the hypothesis that the statistic came from that distribution. In other words, the data does not seem to have the characteristic that you are testing for. Statistical tables use "critical values" to designate when a statistic is in the extreme tail. A critical value is a quantile of the null distribution; if the observed statistic is greater than the critical value, then the statistic is in the tail. (Technically, I've described a one-tailed test.)

One of the uses for simulation is to approximate the sampling distribution of a statistic when the true distribution is not known or is known only asymptotically. You can generate a large number of samples from the null hypothesis and compute the statistic on each sample. The collection of statistics approximates the true sampling distribution (under the null hypothesis), so you can use the empirical quantiles to estimate the critical values of the null distribution.

Critical values of the Kolmogorov D distribution

You can use simulation to estimate the critical value for the Kolmogorov-Smirnov statistical test for normality. For the data in my previous article, the null hypothesis is that the sample data follow a N(59, 5) distribution. The alternative hypothesis is that they do not. The previous article computed a test statistic of D = 0.131 for the data (N = 30). If the null hypothesis is true, is that an unusual value to observe? Let's simulate 40,000 samples of size N = 30 from N(59,5) and compute the D statistic for each. Rather than use PROC UNIVARIATE, which computes dozens of statistics for each sample, you can use the SAS/IML computation from the previous article, which is very fast. The following simulation runs in a fraction of a second.

/* parameters of reference distribution: F = cdf("Normal", x, &mu, &sigma) */
%let mu    = 59;
%let sigma =  5;
%let N     = 30;
%let NumSamples = 40000;
 
proc iml;
call randseed(73);
N = &N;
i = T( 1:N );                           /* ranks */
u = i/N;                                /* ECDF height at right-hand endpoints */
um1 = (i-1)/N;                          /* ECDF height at left-hand endpoints  */
 
y = j(N, &NumSamples, .);               /* columns of Y are samples of size N */
call randgen(y, "Normal", &mu, &sigma); /* fill with random N(mu, sigma)      */
D = j(&NumSamples, 1, .);               /* allocate vector for results        */
 
do k = 1 to ncol(y);                    /* for each sample:                   */
   x = y[,k];                           /*    get sample x ~ N(mu, sigma)     */
   call sort(x);                        /*    sort sample                     */
   F = cdf("Normal", x, &mu, &sigma);   /*    CDF of reference distribution   */
   D[k] = max( F - um1, u - F );        /*    D = max( D_minus, D_plus )      */
end;
 
title "Monte Carlo Estimate of Sampling Distribution of Kolmogorov's D Statistic";
title2 "N = 30; N_MC = &NumSamples";
call histogram(D) other=
     "refline 0.131 / axis=x label='Sample D' labelloc=inside lineattrs=(color=red);";

The test statistic is right smack dab in the middle of the null distribution, so there is no reason to doubt that the sample is distributed as N(59, 5).

How big would the test statistic need to be to be considered extreme? To test the hypothesis at the α significance level, you can compute the 1 – α quantile of the null distribution. The following statements compute the critical value for α = 0.05 and N = 30:

/* estimate critical value as the 1 - alpha quantile */
alpha = 0.05;
call qntl(Dcrit_MC, D, 1-alpha);
print Dcrit_MC;

The estimated critical value for a sample of size 30 is 0.242. This compares favorably with the exact critical value from a statistical table, which gives Dcrit = 0.2417 for N = 30.

You can also use the null distribution to compute a p value for an observed statistic. The p value is estimated as the proportion of statistics in the simulation that exceed the observed value. For example, if you observe data that has a D statistic of 0.28, the estimated p value is obtained by the following statements:

Dobs = 0.28;                        /* hypothetical observed statistic */
pValue = sum(D >= Dobs) / nrow(D);  /* proportion of null distribution values that exceed Dobs */
print Dobs pValue;

This same technique works for any sample size, N, although most tables list critical values only for N ≤ 30. For N > 35, you can use an asymptotic approximation, developed by Smirnov (1948), which depends only on α. A standard form of the approximation is D_crit(α) ≈ sqrt( −ln(α/2) / (2N) ); for example, D_crit ≈ 1.36/sqrt(N) for α = 0.05 and D_crit ≈ 1.63/sqrt(N) for α = 0.01.
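
As a quick check, the following statements (a sketch) evaluate the approximation for α = 0.05 and N = 30. Even though the approximation is intended for larger samples, the value is close to the exact critical value for N = 30.

proc iml;
/* sketch: asymptotic approximation D_crit(alpha) = sqrt( -log(alpha/2) / (2*N) ) */
alpha = 0.05;   N = 30;
Dcrit_asym = sqrt( -log(alpha/2) / (2*N) );
print Dcrit_asym;   /* about 0.248; the exact value for N = 30 is 0.2417 */
QUIT;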

The Kolmogorov D statistic does not depend on the reference distribution

It is reasonable to assume that the results of this article apply only to a normal reference distribution. However, Kolmogorov proved that the sampling distribution of the D statistic is actually independent of the reference distribution. In other words, the distribution (and critical values) are the same regardless of the continuous reference distribution: beta, exponential, gamma, lognormal, normal, and so forth. That is a surprising result, which explains why there is only one statistical table for the critical values of the Kolmogorov D statistic, as opposed to having different tables for different reference distributions.
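
You can check this claim by rerunning the simulation with a different continuous reference distribution. The following sketch assumes that the vectors u and um1, the scalar alpha, and the macro variables from the earlier program are still defined; it uses a standard exponential reference distribution (the choice is arbitrary). The estimated critical value should be close to the value 0.242 from the normal-reference simulation.

/* sketch: repeat the simulation with an Exponential(1) reference distribution */
y = j(N, &NumSamples, .);
call randgen(y, "Exponential");          /* fill with random Exponential(1) values  */
D2 = j(&NumSamples, 1, .);
do k = 1 to ncol(y);
   x = y[,k];
   call sort(x);
   F = cdf("Exponential", x);            /* CDF of the same reference distribution  */
   D2[k] = max( F - um1, u - F );
end;
call qntl(Dcrit2, D2, 1-alpha);          /* compare with the normal-reference value */
print Dcrit2;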

In summary, you can use simulation to estimate the critical values for the Kolmogorov D statistic. In a vectorized language such as SAS/IML, the entire simulation requires only about a dozen statements and runs extremely fast.

The post Critical values of the Kolmogorov-Smirnov test appeared first on The DO Loop.