data analysis

7月 312013
 

Do you have dozens (or even hundreds) of SAS data sets that you want to read into SAS/IML matrices? In a previous blog post, I showed how to iterate over a series of data sets and analyze each one. Inside the loop, I read each data set into a matrix X. I computed some quantity, and then reused X for the next data set. This technique works well, but it assumes that you only need to analyze one matrix at a time. How can you read the data sets into different matrices so that all the matrices reside in memory concurrently?

The trick is to use the SAS/IML VALSET routine to assign the contents of each data set to a unique matrix name. As I showed in a previous article, the VALSET call can dynamically create matrices whose names are specified at run time. This enables you to automate the reading of data sets into SAS/IML matrices.

The following program creates three data sets and reads the data into SAS/IML matrices whose names are identical to the data set names:

/* create square matrices in data sets */
data A;
  x=1; y=2; output;  x=2; y=3; output;
run;
data B;  set A;  x = x + 2; run;
data C;  set A;  y = y + 1; run;
 
/* read data into SAS/IML matrices of the same name */
proc iml;
dsNames = {A B C};                /* specify names of data set */
MatNames = dsNames;               /* specify names of matrices */ 
do i = 1 to ncol(dsNames);
   use (dsNames[i]);
   read all var _NUM_ into X;    /* read data into X */
   call valset(MatNames[i], X);  /* copy data to matrix with specified name */
   close (dsNames[i]);
end;
free i X;
show names;

The READ statement reads all numerical data into the matrix X. The VALSET call copies data from X into a matrix whose name is stored in MatNames[i]. The SHOW NAMES statement displays the names of all matrices that have values. From the output, you can see that the program creates matrices named A, B, and C.

For this example, the names of the matrices are the same as the names of the data sets. Of course, the matrices could have different names. You can assign MatName to be any array of valid names. For example, the following statement specifies the matrix names as X1, X2, and X3:

MatNames = "x1":("x"+strip(char(ncol(dsNames))));

You can use these ideas to read hundreds of data sets into SAS/IML matrices. For example, suppose that you need to create SAS/IML matrices for all of the numerical data in a library. You can use the DATASETS function to obtain the names of all data sets in a library. There are two special cases that you potentially need to handle:

  • Some data sets might not have any numeric variables. To protect against that case, use the _ALL_ keyword to read the data. If a data set contains even one numerical variable, it will be read into the X matrix. If not, the matrix X will be a character matrix, and you can use the TYPE function to check whether X is numeric before you call the VALSET subroutine.
  • Some data sets might be empty. To protect against this case, use the FREE statement to free X matrix you read each data set. If the next data set is empty, then the X matrix will be undefined. Again, the TYPE function can check for this situation.

The following SAS/IML program reads all of the data sets in the SASHELP library. On my version of SAS, there are 129 data sets in the library. Two are empty and 10 others do not contain any numerical variables. Consequently, the program defines 117 numerical matrices:

proc iml;
libref = "Sashelp";                /* read from this library */
dsNames = T( datasets(libref) );   /* 129 data sets */
MatNames = dsNames;                /* names of matrices */ 
do i = 1 to ncol(dsNames);
   ds = libref + "." + strip(dsNames[i]);
   use (ds);                       /* open sashelp.filename */
   read all var _ALL_ into X;      /* read data into X */
   if type(X)="N" then             /* exclude data with zero rows or all char */
      call valset(MatNames[i], X); /* copy data to matrix with specified name */
   close (ds);
   free X;                         /* in case of empty data sets */
end;
show names;                        /* 117 numerical matrices defined */

The SHOW NAMES statement shows the names of a few of the matrices that were created from the numerical data in the Sashelp library. These matrices reside in memory concurrently. This technique would be useful if you have a SAS program that creates a bunch of matrices (as data sets) that you want to multiply together.

tags: Data Analysis
7月 172013
 

In a previous article I discussed how to bin univariate observations by using the BIN function, which was added to the SAS/IML language in SAS/IML 9.3. You can generalize that example and bin bivariate or multivariate data. Over two years ago I wrote a blog post on 2D binning in the SAS/IML language, but that post was written before the SAS/IML 9.3 release, so it does not take advantage of the BIN function.

The image to the left shows a scatter plot of data from the Sashelp.BWeight data set. For 50,000 pregnancies, the birth weight of the baby (in grams) is plotted against the weight gain of the mother (in pounds), adjusted so that the median weight gain is centered at zero. There are 11 subintervals in the horizontal direction and seven subintervals in the vertical direction. Consequently, there are 77 cells, which you can enumerate in row-major order beginning with the upper-left cell.

A two-dimensional binning of these data enables you to find the bin for each observation. Many statistics use binned data. For example, you can do a chi-square test for association, estimate density, and test for spatial randomness. The following SAS/IML statements associates a bin number to each observation. The Bin2D function calls the BIN function twice.

proc iml;
use sashelp.bweight;
read all var {m_wtgain weight} into Z;
close;
 
/* define two-dimensional binning function */
start Bin2D(u, cutX, cutY);
   bX = bin(u[,1], cutX);   /* bins in X direction: 1,2,...,kx */
   bY = bin(u[,2], cutY);   /* bins in Y direction: 1,2,...,ky */
   bin = bX + (ncol(cutX)-1)*(bY-1);     /* bins 1,2,...,kx*ky */
   return(bin);
finish;
 
cutX = do(-35, 75, 10);            /* cut points in X direction */
cutY = do(0, 7000, 1000);          /* cut points in Y direction */
b = Bin2D(Z, cutX, cutY);          /* bin numbers 1-77 */

I wrote the Bin2D function to return an index, but you can convert the index into a subscript if you prefer. (For example, the index 77 corresponds to the subscript (11, 7).) The index is useful for tabulating the number of observations in each bin. The TABULATE call counts the number of observations in each bin, which you can visualize by forming a frequency table:

call tabulate(binNumber, Freq, b);           /* count in each bin */
Count = j(ncol(cutY)-1, ncol(cutX)-1, 0);    /* allocate matrix */
Count[binNumber] = Freq;                     /* fill nonzero counts */
print Count[c=('bX1':'bX11') r=('bY1':'bY7')];

Notice that the Y axis is reversed on the scatter plot. This makes it easy to compare the scatter plot and the tabular frequencies.

If you want to overlay the frequencies on the scatter plot, you can write the cell counts to a SAS data set and use PROC SGPLOT. The result is shown in the following figure. You can download the program used to create the figures in this blog post.

In summary, you can use the BIN function in the SAS/IML language to bin data in many dimensions. The BIN function is simpler than using the DATA step or PROC FORMAT. It handles unevenly spaced intervals and semi-infinite intervals. It's also fast. So next time you you want to group continuous data into intervals—or higher-dimensional cells!— give it a try.

tags: Data Analysis
7月 172013
 

In a previous article I discussed how to bin univariate observations by using the BIN function, which was added to the SAS/IML language in SAS/IML 9.3. You can generalize that example and bin bivariate or multivariate data. Over two years ago I wrote a blog post on 2D binning in the SAS/IML language, but that post was written before the SAS/IML 9.3 release, so it does not take advantage of the BIN function.

The image to the left shows a scatter plot of data from the Sashelp.BWeight data set. For 50,000 pregnancies, the birth weight of the baby (in grams) is plotted against the weight gain of the mother (in pounds), adjusted so that the median weight gain is centered at zero. There are 11 subintervals in the horizontal direction and seven subintervals in the vertical direction. Consequently, there are 77 cells, which you can enumerate in row-major order beginning with the upper-left cell.

A two-dimensional binning of these data enables you to find the bin for each observation. Many statistics use binned data. For example, you can do a chi-square test for association, estimate density, and test for spatial randomness. The following SAS/IML statements associates a bin number to each observation. The Bin2D function calls the BIN function twice.

proc iml;
use sashelp.bweight;
read all var {m_wtgain weight} into Z;
close;
 
/* define two-dimensional binning function */
start Bin2D(u, cutX, cutY);
   bX = bin(u[,1], cutX);   /* bins in X direction: 1,2,...,kx */
   bY = bin(u[,2], cutY);   /* bins in Y direction: 1,2,...,ky */
   bin = bX + (ncol(cutX)-1)*(bY-1);     /* bins 1,2,...,kx*ky */
   return(bin);
finish;
 
cutX = do(-35, 75, 10);            /* cut points in X direction */
cutY = do(0, 7000, 1000);          /* cut points in Y direction */
b = Bin2D(Z, cutX, cutY);          /* bin numbers 1-77 */

I wrote the Bin2D function to return an index, but you can convert the index into a subscript if you prefer. (For example, the index 77 corresponds to the subscript (11, 7).) The index is useful for tabulating the number of observations in each bin. The TABULATE call counts the number of observations in each bin, which you can visualize by forming a frequency table:

call tabulate(binNumber, Freq, b);           /* count in each bin */
Count = j(ncol(cutY)-1, ncol(cutX)-1, 0);    /* allocate matrix */
Count[binNumber] = Freq;                     /* fill nonzero counts */
print Count[c=('bX1':'bX11') r=('bY1':'bY7')];

Notice that the Y axis is reversed on the scatter plot. This makes it easy to compare the scatter plot and the tabular frequencies.

If you want to overlay the frequencies on the scatter plot, you can write the cell counts to a SAS data set and use PROC SGPLOT. The result is shown in the following figure. You can download the program used to create the figures in this blog post.

In summary, you can use the BIN function in the SAS/IML language to bin data in many dimensions. The BIN function is simpler than using the DATA step or PROC FORMAT. It handles unevenly spaced intervals and semi-infinite intervals. It's also fast. So next time you you want to group continuous data into intervals—or higher-dimensional cells!— give it a try.

tags: Data Analysis
7月 152013
 

It is often useful to partition observations for a continuous variable into a small number of intervals, called bins. This familiar process occurs every time that you create a histogram, such as the one on the left. In SAS you can create this histogram by calling the UNIVARIATE procedure. Optionally, if you want to create a data set that contains the count of the observations in each bin, you can use the OUTHIST= option as shown in the following statements:

ods select Histogram;
proc univariate data=sashelp.cars;
   var Weight;
   histogram Weight / endpoints=(0 to 8000 by 1000) 
                      barlabel=count outhist=BinCount;
run;
/* proc print data=BinCount; run; */ /* optional print */

The end points of the bins are called the cut points. In the histogram, the cut points are evenly spaced: 0, 1000, 2000, ..., 8000. The nine cut points define eight bins of equal width. For the Sashelp.Cars data, no vehicle weighs less than 1000 pounds, so the first bin is empty.

Two related tasks are not supported by the UNIVARIATE procedure: finding the bin number for each observation, and using unequally spaced bins. You could write a DATA step for both these tasks, but SAS/IML language provides the BIN function, which is simpler to use. (The BIN function was introduced in SAS/IML 9.3.) The following example uses the same Sashelp.Cars data:

proc iml;
use sashelp.cars;
read all var {Weight} into x;
close sashelp.cars;
 
cutpts = do(0, 8000, 1000);
b = bin(x, cutpts);         /* i_th element is 1-8 to indicate bin */

The b vector indicates to which bin each observation belongs. The value of b[i] is j when x[i] is in the jth bin. If desired, you can use that information to tabulate the counts in each bin, thus creating a tabular version of the histogram:

call tabulate(BinNumber, Freq, b);
print Freq[colname=(char(BinNumber,2))];

The HISTOGRAM statement in PROC UNIVARIATE does not permit you to use unevenly spaced bins, but the BIN function does. In fact, the BIN function supports two special missing values. The special SAS missing value .M is interpreted as "negative infinity" and the missing value .I is interpreted as "positive infinity." An interval of the form (.M 3000) means "all observations less than 3000" and an interval of the form [4000 .I) means "all observations greater than or equal to 4000." For example, the following cut points define two semi-infinite intervals and three other unevenly spaced bins. A call to the BIN function assigns each observation to the correct bin.

cutpts = {.M 3000 3400 3600 4000 .I};
r = bin(x, cutpts);
 
call tabulate(BinNumber, Freq, r);
lbls = {"< 3000" "3000-3400" "3400-3600" "3600-4000" "> 4000"};
print Freq[colname=lbls];

As I have shown previously, you can use the BIN function in conjunction with the QNTL function to group observations based on quantiles.

tags: Data Analysis, Getting Started
6月 262013
 

The CLUSTER procedure in SAS/STAT software creates a dendogram automatically. The black-and-white dendogram is nice, but plain. A SAS customer wanted to know whether it is possible to add color to the dendogram to emphasize certain clusters. For example, the plot at the left emphasizes a four-cluster scenario for clustering cities based on the distances between cities. The question is, how can you construct such a colored dendogram?

One solution, which was proposed by my colleague Warren Kuhfeld, is to use the output of the PROC CLUSTER procedure in conjunction with the Graphics Template Language (GTL) to overlay a dendogram and a block plot. A block plot is simply a series of colored strips that span the length of a graph. You can create a block plot by using the GTL BLOCKPLOT statement.

To demonstrate how to construct a basic version of the colored dendogram, let's start with the output from the CLUSTER procedure when applied to the Sashelp.Mileages data set, which contains distances between 10 major US cities:

proc cluster data=Sashelp.Mileages(type=distance) method=average pseudo out=Tree;
   id City;
run;

From the procedure output, you can see that the City variable is used to construct the horizontal axis. To overlay a block plot, you can create a simple data set that identifies each city with a colored block. The following DATA step specifies the cities in alphabetical order, along with their "cluster number." This information is merged with a sorted version of the Tree data set. The Tree2 data set contains the information needed to construct the colored dendogram:

data Clusters;
input City $ 1-15 Cluster;
datalines;
Atlanta         1
Chicago         1
Denver          3
Houston         3
Los Angeles     4
Miami           2
New York        1
San Francisco   4
Seattle         4
Washington D.C. 1
;
 
proc sort data=tree; by city; run;
data tree2;
   merge Tree Clusters;
   by City;
run;

Now that the data are properly prepared, you can define a GTL template that overlays a dendogram and a block plot. The SGRENDER procedure is used to create the graph that is shown at the beginning of this article:

proc template;
define statgraph Dendrogram;
   begingraph; 
      entrytitle "Color by Clusters";
      layout overlay / yaxisopts=(discreteopts=(tickvaluefitpolicy=none));
         dendrogram nodeid=_name_ parentid=_parent_ clusterheight=_height_;
         blockplot x=City block=Cluster / datatransparency=0.75 display=(fill);
         endlayout;
      endgraph;
   end;
run;
 
proc sgrender data=tree2 template=dendrogram;
run;

This graphical technique enables you to emphasize the grouping of the cities into four clusters. If you want to emphasize a different number of clusters, you need to re-create the Clusters data set, re-merge the data, and call PROC SGRENDER again. This can be tedious, so Warren Kuhfeld wrote a macro that automates this process. The result is the %ClusterGroups macro, which is available for download from support.sas.com. The macro enables you to create colored dendograms easily. For example, after running the CLUSTER procedure, the following statement creates the colored dendogram in this article:

%clustergroups(nclusters=4, data=Tree, id=City)

The macro takes care of generating the auxiliary data set, merging it with the output from PROC CLUSTER, writing the GTL template, and calling PROC SGRENDER. You can specify the number of clusters that you want to emphasize, and it computes the appropriate range for the colored blocks.

Give it a try and let me know what you think. I'll pass on your comments to Warren Kuhfeld, who deserves all the credit.

tags: Data Analysis, Statistical Graphics
6月 262013
 

The CLUSTER procedure in SAS/STAT software creates a dendrogram automatically. The black-and-white dendrogram is nice, but plain. A SAS customer wanted to know whether it is possible to add color to the dendrogram to emphasize certain clusters. For example, the plot at the left emphasizes a four-cluster scenario for clustering cities based on the distances between cities. The question is, how can you construct such a colored dendrogram?

One solution, which was proposed by my colleague Warren Kuhfeld, is to use the output of the PROC CLUSTER procedure in conjunction with the Graphics Template Language (GTL) to overlay a dendrogram and a block plot. A block plot is simply a series of colored strips that span the length of a graph. You can create a block plot by using the GTL BLOCKPLOT statement.

To demonstrate how to construct a basic version of the colored dendrogram, let's start with the output from the CLUSTER procedure when applied to the Sashelp.Mileages data set, which contains distances between 10 major US cities:

proc cluster data=Sashelp.Mileages(type=distance) method=average pseudo out=Tree;
   id City;
run;

From the procedure output, you can see that the City variable is used to construct the horizontal axis. To overlay a block plot, you can create a simple data set that identifies each city with a colored block. The following DATA step specifies the cities in alphabetical order, along with their "cluster number." This information is merged with a sorted version of the Tree data set. The Tree2 data set contains the information needed to construct the colored dendrogram:

data Clusters;
input City $ 1-15 Cluster;
datalines;
Atlanta         1
Chicago         1
Denver          3
Houston         3
Los Angeles     4
Miami           2
New York        1
San Francisco   4
Seattle         4
Washington D.C. 1
;
 
proc sort data=tree; by city; run;
data tree2;
   merge Tree Clusters;
   by City;
run;

Now that the data are properly prepared, you can define a GTL template that overlays a dendrogram and a block plot. The SGRENDER procedure is used to create the graph that is shown at the beginning of this article:

proc template;
define statgraph Dendrogram;
   begingraph; 
      entrytitle "Color by Clusters";
      layout overlay / yaxisopts=(discreteopts=(tickvaluefitpolicy=none));
         dendrogram nodeid=_name_ parentid=_parent_ clusterheight=_height_;
         blockplot x=City block=Cluster / datatransparency=0.75 display=(fill);
         endlayout;
      endgraph;
   end;
run;
 
proc sgrender data=tree2 template=dendrogram;
run;

This graphical technique enables you to emphasize the grouping of the cities into four clusters. If you want to emphasize a different number of clusters, you need to re-create the Clusters data set, re-merge the data, and call PROC SGRENDER again. This can be tedious, so Warren Kuhfeld wrote a macro that automates this process. The result is the %ClusterGroups macro, which is available for download from support.sas.com. The macro enables you to create colored dendrograms easily. For example, after running the CLUSTER procedure, the following statement creates the colored dendrogram in this article:

%clustergroups(nclusters=4, data=Tree, id=City)

The macro takes care of generating the auxiliary data set, merging it with the output from PROC CLUSTER, writing the GTL template, and calling PROC SGRENDER. You can specify the number of clusters that you want to emphasize, and it computes the appropriate range for the colored blocks.

Give it a try and let me know what you think. I'll pass on your comments to Warren Kuhfeld, who deserves all the credit.

tags: Data Analysis, Statistical Graphics
6月 172013
 

A regular reader noticed my post on initializing vectors by using repetition factors and asked whether that technique would be useful to expand data that are given in value-frequency pairs. The short answer is "no." Repetition factors are useful for defining (static) matrix literals. However, if you want to expand data dynamically, I recommend that you use the REPEAT function or the technique in the article on expanding data by using frequencies.

For example, the following SAS/IML statement defines a vector for which the value 2.2 is repeated two times and the value 3.3 is repeated three times. The resulting vector has five elements:

proc iml;
x = {[2] 2.2  [3] 3.3};

This vector is constructed as a matrix literal. If instead you have the values and frequencies in separate vectors, then use the ExpandFreq module in my previous post:

values = {2.2 3.3};
freq   = {2   3};
load module=(ExpandFreq); /* define or load ExpandFreg module here */
y = ExpandFreq(values, freq);

This is probably a good time to remind everyone about the SAS/IML Support Community. You can post your SAS/IML questions there 24 hours a day. That is always better than sending me direct email. There are lots of experienced SAS/IML experts out there, so use the SAS/IML Support Community to tap into that knowledge.

tags: Data Analysis
6月 122013
 

In a previous blog post, I described how to use a spread plot to compare the distributions of several variables. Each spread plot is a graph of centered data values plotted against the estimated cumulative probability. Thus, spread plots are similar to a (rotated) plot of the empirical cumulative distribution function. Users of the SAS regression procedures will recognize the spread plots as one of the plots that are created automatically by procedures such as PROC REG. The spread plots of the fitted and residual values appear in the middle column of the third row of the regression diagnostics panel.

In the SAS documentation, the residual-fit spread plot is also called an "RF plot." This article describes how to interpret the R-F spread plot.

The residual-fit spread plot in SAS output

When I first saw the R-F spread plot in the PROC REG diagnostics panel, there were two things that I found confusing:

  • The title of the left plot is "Fit–Mean." I read the title as "fit hyphen mean," and I didn't know what that meant. However, the correct way to read the title is "fit minus mean," which is equivalent to "centered fit."
  • The label for the horizontal axis is "Proportion Less." I didn't know what that meant, either. I now know that scatter plot shows empirical quantiles versus their plotting positions. Recall that the pth empirical quantile is the data value that is greater than (or equal to) a proportion p of the data values. Therefore, if a point on the scatter plot has coordinates (pi, qi), it means that the vertical coordinate is the ith quantile, and approximately pi of the other data values are less than that proportion.

History of the residual-fit spread plot

The spread plot is a graph of the centered data versus the corresponding plotting position. Essentially, it is a plot of the sorted data against the corresponding rank, except that using the plotting position instead of ranks makes it possible to compare variables that have different numbers of nonmissing observations. Also, using centered data instead of raw values enables you to compare the spread of variables that have different means.

William S. Cleveland featured the residual-fit spread plot in his book Visualizing Data (1993). He describes how to create a quantile plot on pp. 17–20, and describes quantile plots for fitted and residual values on p. 35–38. Then he says (p. 40):

It is informative to study how influential the [explanatory] variable is in explaining the variation in the [response variable]. The fitted values and the residuals are two sets of values each of which has a distribution. If the spread of the fitted-value distribution is large compared with the spread of the residual distribution, then the [explanatory] variable is influential. If it is small, the [explanatory] variable is not as influential.... Since it is the spreads of the distributions that are of interest, the fitted values minus their overall mean are graphed.... This residual-fit spread plot, or r-f spread plot, shows [whether] the spreads of the residuals and fit values are comparable.

Cleveland goes on to use the R-F spread plot about 20 times in multiple examples.

The residual-fit spread plot as a regression diagnostic

Following Cleveland's examples, the residual-fit spread plot can be used to assess the fit of a regression as follows:

  • Compare the spread of the fit to the spread of the residuals. This is the main idea. If the left side of the plot (the centered fitted values) is taller than the right side (the residual values), then you conclude that the spread of the residual values is small relative to the spread of the fitted values.
  • Examine the distribution of the residual values. The quantile plot of the residual values contains all of the information that a box plot does—and more. If the distribution does not appear to be normally distributed, the model might not fit the data.
  • Are there extreme values for the distribution of the residual values? These indicate outliers: observations for which the observed value is far from the fitted value.
  • Are there extreme values for the distribution of the fitted values? These might indicate influential observations that have high leverage in the model. They need to be investigated.

Some examples on simulated data

The best way to practice interpreting the R-F spread plots are to view some examples for which the true model is known. The following DATA step simulates two response variables:

data RegData(drop=i);
call streaminit(12345);
do i = 1 to 100;
   x = rand("Normal");
   y1 = 2 + 4*x    + rand("Normal", 0, 0.25); /* small error */
   y2 = 2 + 4*x**2 + rand("Normal", 0, 1);    /* not linear in x */
   output;
end;
run;

For a real regression analysis, I would look at several diagnostic plots, but in the subsequent examples I will only interpret the residual-fit spread plots. I use the DIAGNOSTICS(UNPACK) option on the PLOTS= option to extract the R-F spread plot from the diagnostics panel.

Example 1: Examining the residual variation in a model

The y1 variable has a small error term. The following statements display the R-F spread plot:

title "y = 2 + 4*x + eps, eps ~ N(0,0.25)";
ods select RFPlot;
proc reg data = RegData plots=diagnostics(unpack);
   model y1 = x;
quit;

Notice that the left plot (the centered fitted values) is "taller" than the right plot (the residual values), which indicates that the residual values have a smaller spread. In terms of the model, the x variable accounts for a significant portion of the variation in the model, with only a little residual variation.

You can change the standard deviation of the error term to 5 and rerun the program. The new R-F spread plot (not shown), shows that the spread of the residual values is larger than the spread of the fitted values. The interpretation would be that considerable variation remains after accounting for the effect of the x variable.

Example 2: A misspecified model

In the previous example, the model was correctly specified. In this second example, the true model is quadratic in the x variable, but the fitted model is linear in x.

title "y = 2 + 4*x**3 + eps, eps ~ N(0,0.25)";
title2 "Model is y = x";
ods select RFPlot;
proc reg data = RegData plots(only)=diagnostics(unpack);
   model y2 = x;
quit;

In the R-F spread plot for the (misspecified) model, the right-hand plot is taller than the left-hand plot. This shows that there is a lot of variation that is not explained by the model. Furthermore, the residual distribution does not appear to be normally distributed. The right tail of the residual distribution is long, and the distribution is skewed. If I saw a plot like this in real data, I would look at other plots (such as the plot of residuals versus the predicted values) to see if the residuals exhibit a pattern that can be modeled.

Closing Comments

Several SAS regression procedures produce a regression diagnostics panel automatically. Each graph reveals information about the regression model and whether it fits the data well. This article has described how to interpret a residual-fit plot, which is located in the last row of the diagnostics panel. The residual-fit spread plot, which was featured prominently in Cleveland's book, Visualizing Data, is one tool in the arsenal of regression diagnostic plots.

tags: Data Analysis
6月 102013
 

Suppose that you have several data distributions that you want to compare. Questions you might ask include "Which variable has the largest spread?" and "Which variables exhibit skewness?" More generally, you might be interested in visualizing how the distribution of one variable differs from the distribution of other variables.

The usual way to compare data distributions is to use histograms. One technique is to display a panel of histograms, which are known as comparative histograms. I have used this approach to compare salaries between two categories of workers. The comparative histogram is produced automatically by PROC UNIVARIATE when the analysis includes a classification variable.

A related approach is to use transparency to overlay two histograms on the same axis. This method works well for two distributions. However, for three or more distributions, an overlay of histograms can be difficult to read.

A problem of scale

If one of the variables has a range that is an order of magnitude greater than the range of another variable, the comparative histogram can lose its effectiveness. To illustrate this, consider the following comparative histogram of three widely varying quantities:

The variable in the top histogram has a range that is 10 times the range of the variable in the lower histogram. Consequently, all data in the bottom panel is inside of a single histogram bin. This plot does not reveal anything about the distribution of the third variable.

A different way to compare distributions is to plot a panel of the empirical cumulative distribution functions (CDF). The following call to PROC UNIVARIATE creates these "comparative CDF plots," as well as the comparative histograms, for simulated data:

/* simulate three variables with different distributions */
data Wide;
call streaminit(123);
do i = 1 to 100;
   Uniform = rand("Uniform");
   Normal = rand("Normal");
   Gamma = rand("Gamma", 4);
   output;
end;
run;
 
/* transpose from wide to long data format to create a CLASS variable */
proc transpose data=Wide name=ID out=Long(drop=i);  by i;  run;
data Long; 
   set Long(rename=(Col1=x)); 
   label ID=;
run;
 
/* create comparative histograms and CDF plots */
ods select Histogram CDFPlot;
proc univariate data=Long;
   class ID;
   histogram x / nrows=3;
   cdfplot x / nrows=3;
run;

The comparative CDF plot shows the empirical distributions. The horizontal axis indicates the range of the data: [0, 11] for the Gamma variable, [-3,3] for the Normal variable, and [-0.5, 0.5] for the Uniform variable. However, the plot is far from perfect. Although the CDF plot reveals the approximate center and scale of the data, I would find it difficult to conclude from the comparative CDF plot that the Gamma variable is skewed whereas the Normal variable is symmetric.

The spread plot for distributions

In the classic books Graphical Methods for Data Analysis (Chambers, et al., 1983) and Visualizing Data (Cleveland, 1993), the authors recommend using scatter plots of quantiles to visualize and compare distributions.

Consider rotating the CDF panel 90 degrees in the counter-clockwise direction. If you also center the data so that each variable has zero mean, then the result called a spread plot. The spread plot, as its name implies, is used to compare the spread (range) of distributions, as well as to give some indication of the tail behavior (long tails, outliers, and so forth) in the data.

I have previously blogged about how to compute empirical quantiles. The following SAS/IML program centers each variable and estimates the empirical cumulative probabilities. The SGPANEL procedure is then used to visualize the results:

proc iml;
varNames = {"Gamma" "Normal" "Uniform"};
use Wide;  read all var varNames into x;  close Wide;
 
/* assume nonmissing values */
x = x - mean(x);          /* 1. Center the variables */
N = nrow(x);
F = j(nrow(x), ncol(x));
do i = 1 to ncol(x);      /* 2. Compute quantile of each variable */
   F[,i] = (rank(X[,i])-0.375)/(N+0.25); /* Blom (1958) fractional rank */
end;
ID = repeat(varNames, N); /* 3. ID variables */
create Spread var {ID x F};  append;  close Spread;
quit;
 
title "Comparative Spread Plot";
proc sgpanel data=Spread;
   panelby ID / rows=1 novarname;
   scatter x=f y=x;
   refline 0 / axis=y;
   colaxis label="Proportion of Data";
   rowaxis display=(nolabel);
run;

There does not appear to be a standard term for the quantity on the horizontal axis, which I have labeled "proportion of data." I call it the estimated cumulative probability, but it is also called the estimated proportion, the fractional rank, the estimate for cumulative proportion, or simply plotting positions. In Chambers et al. (1983) it is called the fraction of data. In Cleveland (1993) it is called the f-value, which is short for fractional value. In some SAS output it is labeled "proportion less [than the value shown]." (There is also many different ways to compute the plotting positions. Chambers and Cleveland each use the simpler fractional rank given by ri–0.5)/(N+1), where ri is the rank of the ith observations. I have used Blom's 1958 formula, which is used heavily in SAS software.)

You could create a similar plot by using PROC RANK and the DATA step. The nice thing about the spread plot is that the range of the data is evident. The range for the centered Gamma variable is [-3,8]. Furthermore, notice that the mean value is greater than the median value, which is a standard rule of thumb for continuous skewed distributions. In contrast, the Normal and Uniform variables are both visually symmetric. The Normal variable clearly has two moderate tails, whereas the Uniform variable appears to be a bounded distribution.

A plot that uses quantiles to compare distributions is more powerful than the technique of comparing histograms. However, the quantile plot requires more skill to interpret.

Although the spread plot is not a well-known display, it appears prominently in the output of several popular SAS procedures. My next blog post will discuss how to interpret a spread plot when it appears in the output of SAS regression procedures.

tags: Data Analysis, Statistical Graphics