data analysis

June 10, 2013
 

Suppose that you have several data distributions that you want to compare. Questions you might ask include "Which variable has the largest spread?" and "Which variables exhibit skewness?" More generally, you might be interested in visualizing how the distribution of one variable differs from the distribution of other variables.

The usual way to compare data distributions is to use histograms. One technique is to display a panel of histograms, which is known as a comparative histogram. I have used this approach to compare salaries between two categories of workers. The comparative histogram is produced automatically by PROC UNIVARIATE when the analysis includes a classification variable.

A related approach is to use transparency to overlay two histograms on the same axis. This method works well for two distributions. However, for three or more distributions, an overlay of histograms can be difficult to read.

A problem of scale

If one of the variables has a range that is an order of magnitude greater than the range of another variable, the comparative histogram can lose its effectiveness. To illustrate this, consider the following comparative histogram of three widely varying quantities:

The variable in the top histogram has a range that is 10 times the range of the variable in the lower histogram. Consequently, all of the data in the bottom panel are contained in a single histogram bin. This plot does not reveal anything about the distribution of the third variable.

A different way to compare distributions is to plot a panel of the empirical cumulative distribution functions (CDF). The following call to PROC UNIVARIATE creates these "comparative CDF plots," as well as the comparative histograms, for simulated data:

/* simulate three variables with different distributions */
data Wide;
call streaminit(123);
do i = 1 to 100;
   Uniform = rand("Uniform");
   Normal = rand("Normal");
   Gamma = rand("Gamma", 4);
   output;
end;
run;
 
/* transpose from wide to long data format to create a CLASS variable */
proc transpose data=Wide name=ID out=Long(drop=i);  by i;  run;
data Long; 
   set Long(rename=(Col1=x)); 
   label ID=;
run;
 
/* create comparative histograms and CDF plots */
ods select Histogram CDFPlot;
proc univariate data=Long;
   class ID;
   histogram x / nrows=3;
   cdfplot x / nrows=3;
run;

The comparative CDF plot shows the empirical distributions. The horizontal axis indicates the range of the data: [0, 11] for the Gamma variable, [-3, 3] for the Normal variable, and [0, 1] for the Uniform variable. However, the plot is far from perfect. Although the CDF plot reveals the approximate center and scale of the data, I would find it difficult to conclude from the comparative CDF plot that the Gamma variable is skewed whereas the Normal variable is symmetric.

The spread plot for distributions

In the classic books Graphical Methods for Data Analysis (Chambers, et al., 1983) and Visualizing Data (Cleveland, 1993), the authors recommend using scatter plots of quantiles to visualize and compare distributions.

Consider rotating the CDF panel 90 degrees in the counterclockwise direction. If you also center the data so that each variable has zero mean, then the result is called a spread plot. The spread plot, as its name implies, is used to compare the spread (range) of distributions, as well as to give some indication of the tail behavior (long tails, outliers, and so forth) in the data.

I have previously blogged about how to compute empirical quantiles. The following SAS/IML program centers each variable and estimates the empirical cumulative probabilities. The SGPANEL procedure is then used to visualize the results:

proc iml;
varNames = {"Gamma" "Normal" "Uniform"};
use Wide;  read all var varNames into x;  close Wide;
 
/* assume nonmissing values */
x = x - mean(x);          /* 1. Center the variables */
N = nrow(x);
F = j(nrow(x), ncol(x));
do i = 1 to ncol(x);      /* 2. Compute quantile of each variable */
   F[,i] = (rank(X[,i])-0.375)/(N+0.25); /* Blom (1958) fractional rank */
end;
ID = repeat(varNames, N); /* 3. ID variables */
create Spread var {ID x F};  append;  close Spread;
quit;
 
title "Comparative Spread Plot";
proc sgpanel data=Spread;
   panelby ID / rows=1 novarname;
   scatter x=f y=x;
   refline 0 / axis=y;
   colaxis label="Proportion of Data";
   rowaxis display=(nolabel);
run;

There does not appear to be a standard term for the quantity on the horizontal axis, which I have labeled "proportion of data." I call it the estimated cumulative probability, but it is also called the estimated proportion, the fractional rank, the estimate of the cumulative proportion, or simply the plotting positions. In Chambers et al. (1983) it is called the fraction of data. In Cleveland (1993) it is called the f-value, which is short for fractional value. In some SAS output it is labeled "proportion less [than the value shown]." (There are also many different ways to compute the plotting positions. Chambers and Cleveland each use the simpler fractional rank given by (ri - 0.5)/N, where ri is the rank of the ith observation. I have used Blom's 1958 formula, which is used heavily in SAS software.)
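If you want to experiment with these formulas outside of SAS, the following Python snippet (purely illustrative; the function names are mine) computes both families of plotting positions:

```python
def blom_positions(n):
    """Blom (1958) plotting positions: (r - 3/8) / (n + 1/4) for ranks r = 1..n."""
    return [(r - 0.375) / (n + 0.25) for r in range(1, n + 1)]

def simple_positions(n):
    """The simpler (r - 0.5) / n positions used by Chambers and Cleveland."""
    return [(r - 0.5) / n for r in range(1, n + 1)]

# For n = 5 the two formulas give similar, but not identical, positions:
print(simple_positions(5))   # 0.1, 0.3, 0.5, 0.7, 0.9
print(blom_positions(5))     # roughly 0.119, 0.310, 0.5, 0.690, 0.881
```

Both formulas place the middle rank at 0.5 and keep every position strictly inside (0, 1), which is why either works as an estimate of cumulative probability.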

You could create a similar plot by using PROC RANK and the DATA step. The nice thing about the spread plot is that the range of the data is evident. The range for the centered Gamma variable is [-3,8]. Furthermore, notice that the mean value is greater than the median value, which is a standard rule of thumb for continuous skewed distributions. In contrast, the Normal and Uniform variables are both visually symmetric. The Normal variable clearly has two moderate tails, whereas the Uniform variable appears to be a bounded distribution.
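As a quick numerical illustration of that rule of thumb, the following Python snippet (illustrative only, using a Gamma(4) sample similar to the simulated data) shows the mean exceeding the median for right-skewed data:

```python
import random
import statistics

# Right-skewed data: the sample mean typically exceeds the sample median.
random.seed(123)
sample = [random.gammavariate(4, 1) for _ in range(10000)]

mean = statistics.fmean(sample)
median = statistics.median(sample)
print(mean > median)   # expected True: the long right tail pulls the mean up
```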

A plot that uses quantiles to compare distributions is more powerful than the technique of comparing histograms. However, the quantile plot requires more skill to interpret.

Although the spread plot is not a well-known display, it appears prominently in the output of several popular SAS procedures. My next blog post will discuss how to interpret a spread plot when it appears in the output of SAS regression procedures.

tags: Data Analysis, Statistical Graphics
May 28, 2013
 

Has anyone noticed that the REG procedure in SAS/STAT 12.1 produces heat maps instead of scatter plots for fit plots and residual plots when the regression involves more than 5,000 observations? I wasn't aware of the change until a colleague informed me, although the change is discussed in the "Details" section of the PROC REG documentation for SAS/STAT 12.1.

Here is how the fit plot looks for fewer than 5,000 observations:

/* simulate data from a linear regression model */
data RegData;
call streaminit(1);
do i = 1 to 6000;
   x = rand("Uniform");
   y = 1 + 2*x + 3*rand("Normal");
   if i < 20 then y = y + 8*rand("Normal"); /* add a few outliers */
   output;
end;
run;
 
ods select FitPlot(persist);
proc reg data=RegData(obs=4000) plots(only)=FitPlot;
   model y = x;
quit;

With fewer than 5,000 observations, I get the usual fit plot that consists of a scatter plot overlaid with a curve of predicted value, a band for the confidence interval for the mean, and dashed lines that indicate the confidence intervals for individual predictions. (The confidence limits are barely visible. Click the graph to enlarge it.) However, watch what happens when I use more than 5,000 observations:

proc reg data=RegData plots(only)=FitPlot;
   model y = x;
quit;
ods select all;

The scatter plot is gone, replaced by a heat map that shows the density of the data. The curve of predicted values is still present, although it is drawn in a different graphical style, which results in a red line. The confidence intervals are gone.

Overall, this is a nice feature and I think that the change is a good idea. The reason for the change is easy to understand: scatter plots suffer from overplotting when there are many points, so it is more useful to visualize the density of the observations than the individual observations. Furthermore, although scatter plots are very fast to construct, when there are many points a heat map (which bins observations) is faster to compute and render than a scatter plot.
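If you want to see the binning idea in miniature, the following Python sketch (illustrative only; PROC REG's internal binning rules may differ) reduces 6,000 simulated points to a fixed 40 x 25 grid of counts, which is what a heat map renders instead of 6,000 individual markers:

```python
import random

# Simulate data similar to the regression example: x uniform, y linear plus noise.
random.seed(1)
pts = [(x, 1 + 2 * x + 3 * random.gauss(0, 1))
       for x in (random.random() for _ in range(6000))]

nx, ny = 40, 25   # grid resolution is fixed, regardless of sample size
xmin = min(p[0] for p in pts); xmax = max(p[0] for p in pts)
ymin = min(p[1] for p in pts); ymax = max(p[1] for p in pts)

counts = [[0] * ny for _ in range(nx)]
for x, y in pts:
    i = min(int((x - xmin) / (xmax - xmin) * nx), nx - 1)
    j = min(int((y - ymin) / (ymax - ymin) * ny), ny - 1)
    counts[i][j] += 1   # the heat map colors each cell by this count

total = sum(sum(row) for row in counts)
print(total)   # 6000: every observation lands in exactly one of the 1,000 cells
```

Rendering 1,000 cells is cheap and stays cheap as the sample grows, which is the performance argument made above.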

The plot does not currently include confidence intervals, but there is no reason why these can't be added in a future release. However, the confidence interval for the mean predictions will usually be tiny for large data sets—already it is barely visible in the plot of 4,000 points.

Controlling the appearance of the heat map

Prior to SAS/STAT 12.1, the REG procedure created a fit plot as a scatter plot for small data sets (fewer than 5,000 points). For larger sample sizes, the procedure suppressed the fit plot. The behavior was controlled by using the MAXPOINTS= suboption of the PLOTS= option on the PROC REG statement.

In SAS/STAT 12.1, the MAXPOINTS= option accepts two arguments, and the default values are MAXPOINTS=5000 150000. The first argument specifies the data size for which heat maps are used instead of scatter plots. The documentation of the MAXPOINTS= option states that "when the number of points exceeds [the first number] but does not exceed [the second number] divided by the number of independent variables, heat maps are displayed instead of scatter plots for the fit and residual plots." In other words, if you have a regression with k explanatory variables, heat maps are used when the number of observations is between 5,000 and 150,000/k. Of course, you can use the MAXPOINTS= option to change either or both of those values.
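The documented rule can be sketched as a small helper function (written here in Python; the function and its name are hypothetical, not part of SAS):

```python
def uses_heat_map(n_obs, n_vars, maxpoints=(5000, 150000)):
    """Sketch of the documented rule: heat maps replace scatter plots when the
    number of observations exceeds maxpoints[0] but does not exceed
    maxpoints[1] divided by the number of independent variables."""
    low, high = maxpoints
    return low < n_obs <= high / n_vars

print(uses_heat_map(4000, 1))    # False: scatter plot
print(uses_heat_map(6000, 1))    # True: heat map
print(uses_heat_map(20000, 10))  # False: exceeds 150,000/10
```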

Any comments on this new behavior in PROC REG?

tags: Data Analysis
May 13, 2013
 

I've conducted a lot of univariate analyses in SAS, yet I'm always surprised when the best way to carry out the analysis uses a SAS regression procedure. I always think, "This is a univariate analysis! Why am I using a regression procedure? Doesn't a regression require at least two variables?"

Then it dawns on me. In the SAS regression procedures, a MODEL statement that does not contain explanatory variables simply performs a univariate analysis on the response variable. For example, when there are no explanatory variables, a classical regression analysis produces sample statistics, such as the mean, the variance, and the standard error of the mean, as shown in the following output:

/* estimates of mean, variance, and std err of mean */
ods select FitStatistics ParameterEstimates ;
proc reg data=sashelp.cars;
   model MPG_City= ;
quit;

The estimates in this PROC REG output are the same as those produced by the following call to PROC MEANS:

proc means data=sashelp.cars mean std stderr cv;
 var MPG_City;
run;
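To see why an intercept-only fit reproduces these univariate statistics, here is a quick numerical check (in Python, not SAS): the least-squares solution of the model y = b0 + error is the sample mean, and the usual standard error of that estimate is the standard deviation divided by the square root of n.

```python
import math
import statistics

# Toy data: the least-squares intercept is the mean, and the standard error
# of the intercept is the standard error of the mean.
y = [23, 19, 26, 17, 20]

intercept = sum(y) / len(y)                       # least-squares fit of y = b0 + error
stderr = statistics.stdev(y) / math.sqrt(len(y))  # standard error of the mean

print(intercept)         # 21.0, identical to the sample mean
print(round(stderr, 4))  # 1.5811
```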

The ODS graphics that are produced by PROC REG also include a histogram of the centered data and a normal Q-Q plot.

Here are some other instances in which a SAS regression procedure can be used to carry out a univariate analysis:

  • Robust estimates of scale and location. This univariate analysis is usually performed by using PROC UNIVARIATE with the ROBUSTSCALE option. However, you can also use the ROBUSTREG procedure to estimate robust statistics. The ROBUSTREG procedure provides four different estimation methods, which you can control by using the METHOD= option. For example, the following statements display robust estimates of location and scale:
    ods select ParameterEstimates;
    proc robustreg data=sashelp.cars method=M;
       model MPG_City= ;
    run;
  • Detection of univariate outliers. How can you detect univariate outliers in SAS? One way is to call the ROBUSTREG procedure! Again, there are four estimation methods that you can use.
    ods select DiagSummary;
    proc robustreg data=sashelp.cars method=LTS;
       model MPG_City= ;
       output out=out outlier=outliers;
    run;
     
    proc print data=out(where=(outliers=1));
       var make model type mpg_city;
    run;
  • Estimation of quantiles, with confidence intervals. Last week I showed how to use PROC UNIVARIATE to compute sample quantiles and confidence intervals. An alternative approach is to use the QUANTREG procedure, as follows:
    /* estimate univariate quantiles, CIs, and std errors */
    ods output ParameterEstimates=QntlCI;
    proc quantreg data=sashelp.cars;
       model MPG_City= / quantiles=0.025 0.2 0.8 0.975;
    run;
     
    proc print data=QntlCI; run;
  • Fit discrete parametric models to univariate data. I've previously shown how to use the GENMOD procedure to fit a Poisson model to data, and the same technique can be used to fit other discrete distributions, including the binomial, geometric, multinomial, negative binomial, and some zero-inflated distributions.

  • Fit parameters for a mixed density model to univariate data. I've previously demonstrated how to use the FMM procedure to fit a finite mixture distribution to data.

There are other examples, but I hope you see that the SAS regression procedures are useful for computing univariate statistics and analyses.

Do you have a favorite univariate analysis that can be accomplished by using a SAS regression procedure? Let me know about it by leaving a comment.

tags: Data Analysis, SAS Programming
May 8, 2013
 

At a recent conference, I talked with a SAS customer who told me that he was using an R package to create a three-panel visualization of a distribution. Unfortunately, he couldn't remember the name of the package, and he has not returned my e-mails, so the purpose of today's article is to discuss some ideas related to this visualization and to solicit critiques of my implementation.

The customer wanted to create a paneled display in SAS that includes three graphs: a histogram, a box plot, and a normal quantile-quantile (Q-Q) plot. We sketched out an idea, and the plot at the left is my implementation of our sketch. (Click to enlarge.)

One question I asked was, "Do you want the usual Q-Q plot, or should we flip it?" The usual Q-Q plot is a scatter plot of the ordered data values (on the vertical axis) plotted against the corresponding quantiles of a normal distribution (on the horizontal axis). The purpose of a Q-Q plot is to see whether the points fall along a straight line, which would indicate that the data are normally distributed. I remarked that if we flip the plot so that the data values are displayed horizontally, then the histogram, boxplot, and Q-Q plot can all share a common horizontal axis. The customer said that this seemed like a good idea.

In SAS, you can create this kind of paneled layout by using the Graph Template Language (GTL) and the SGRENDER procedure.

A GTL template for a three-panel display

When I returned from the conference, I created a GTL template that defines a three-panel display. The top panel, which occupies 40% of the height of the display, is a histogram of the data overlaid with a normal curve and a kernel density estimate. The second panel, which occupies 10% of the height, is a horizontal box plot. The third panel, which occupies the remaining 50%, is a normal Q-Q plot that is flipped so that the normal quantiles are plotted on the vertical axis. A diagonal reference line is added to the Q-Q plot. Normally distributed data should fall near the reference line.

The threepanel template takes five dynamic variables. The data and the normal quantiles are referenced by the dynamic variables _X and _QUANTILE, respectively. The title is supplied by using the _Title variable. Lastly, the parameter estimates for the normal curve that best fits the data are supplied by using the _mu and _sigma dynamic variables. The template definition follows:

/* define 'threepanel' template that displays a histogram, box plot, and Q-Q plot */
proc template;
define statgraph threepanel;
dynamic _X _QUANTILE _Title _mu _sigma;
begingraph;
   entrytitle halign=center _Title;
   layout lattice / rowdatarange=data columndatarange=union 
      columns=1 rowgutter=5 rowweights=(0.4 0.10 0.5);
      layout overlay;
         histogram   _X / name='histogram' binaxis=false;
         densityplot _X / name='Normal' normal();
         densityplot _X / name='Kernel' kernel() lineattrs=GraphData2(thickness=2 );
         discretelegend 'Normal' 'Kernel' / border=true halign=right valign=top location=inside across=1;
      endlayout;
      layout overlay;
         boxplot y=_X / boxwidth=0.8 orient=horizontal;
      endlayout;
      layout overlay;
         scatterplot x=_X y=_QUANTILE;
         lineparm x=_mu y=0.0 slope=eval(1./_sigma) / extend=true clip=true;
      endlayout;
      columnaxes;
         columnaxis;
      endcolumnaxes;
   endlayout;
endgraph;
end;
run;

You can download the %ThreePanel macro, which creates a three-panel display for any variable in any data set. If you want to learn more about how to write GTL templates, I recommend the book Statistical Graphics in SAS: An Introduction to the Graph Template Language and the Statistical Graphics Procedures by my colleague, Warren Kuhfeld.

The macro calls PROC UNIVARIATE to compute the normal parameter estimates and the quantiles. That information is then used by PROC SGRENDER to create the plot according to the specifications in the threepanel template. The macro uses a cool trick: I get the data for the Q-Q plot by using an ODS OUTPUT statement on a graph that is created by PROC UNIVARIATE.

The image at the top of this post shows how the template renders the MPG_City variable in the Sashelp.Cars data set. The image was created as follows:

ods graphics on;
%ThreePanel(Sashelp.Cars, MPG_City)

The MPG_City variable is not normally distributed, as is evident by looking at the poor fit of the data in the Q-Q plot (lower panel). In contrast, the distribution of the SepalLength variable in the Sashelp.Iris data set appears to be more normal, as shown below:

%ThreePanel(Sashelp.Iris, SepalLength)

Discussion

What do you think? Try it out on your own data and let me know if you have suggestions to improve it. Is this a useful display? Leave a comment.

tags: Data Analysis, Statistical Graphics
May 6, 2013
 

PROC UNIVARIATE has provided confidence intervals for standard percentiles (quartiles) for eons. However, in SAS 9.3M2 (featuring the 12.1 analytical procedures) you can use a new feature in PROC UNIVARIATE to compute confidence intervals for a specified list of percentiles.

To be clear, percentiles and quantiles are essentially the same thing. For example, the median value of a set of data is the 0.5 quantile, which is also the 50th percentile. In general, the pth quantile is the (100p)th percentile.
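If you want a concrete check of that correspondence, here is a small Python illustration (not SAS) in which the 0.5 quantile and the 50th percentile are computed separately and agree:

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]

median = statistics.median(data)                  # the 0.5 quantile
pctl_50 = statistics.quantiles(data, n=100)[49]   # the 50th of 99 cut points
print(median == pctl_50)   # True: the same number under either name
```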

The CIPCTLDF option on the PROC UNIVARIATE statement produces distribution-free confidence intervals for the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles as shown in the following example:

/* CI for standard percentiles: 1, 5, 10, 25, 50, 75, 90, 95, 99 */
ods select Quantiles;
proc univariate data=Sashelp.Cars cipctldf;
   var MPG_City;
run;

However, prior to the 12.1 release of the analytics procedures, there was not an easy way to obtain confidence intervals for arbitrary percentiles. (Recall that you can specify nonstandard percentiles by using the PCTLPTS= option on the OUTPUT statement.)

I am happy to report that the OUTPUT statement in the UNIVARIATE procedure now supports the CIPCTLDF= option, which you can use as follows:

proc univariate data=sashelp.cars noprint;
   var MPG_City;
   output out=pctl pctlpts=2.5 20 80 97.5 pctlpre=p
          cipctldf=(lowerpre=LCL upperpre=UCL);    /* 12.1 options (SAS 9.3m2) */
run;
 
proc print data=pctl noobs; run;

The CIPCTLDF= option computes distribution-free confidence intervals for the percentiles that are specified on the PCTLPTS= option. The LOWERPRE= option specifies the prefix to use for lower confidence limits; the UPPERPRE= option specifies the prefix to use for upper confidence limits.

If your data are normally distributed, you can use the CIPCTLNORMAL= option on the OUTPUT statement to compute confidence limits. However, if your data are not normally distributed, the CIPCTLNORMAL= option might produce inaccurate results. For example, on the MPG_City data, which is highly skewed, the confidence intervals for large percentiles (like the 99th percentile) do not contain the corresponding point estimate. For this reason, I prefer the distribution-free intervals for most analyses.
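For the curious, the construction behind distribution-free intervals is to bracket the percentile between two order statistics whose coverage probability, computed from the Binomial(n, p) distribution, is at least the confidence level. The following Python sketch (stdlib only) illustrates the idea with a simplified symmetric search; it is not PROC UNIVARIATE's exact algorithm:

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def distfree_ci(x, p, conf=0.95):
    """Bracket the p-th quantile with order statistics X_(r), X_(s) whose
    coverage P(r <= B <= s-1), B ~ Binomial(n, p), is at least `conf`.
    A simplified symmetric search, not PROC UNIVARIATE's exact algorithm."""
    xs = sorted(x)
    n = len(xs)
    r = max(min(int(n * p), n - 1), 1)
    s = r + 1
    def coverage(r, s):
        return binom_cdf(s - 1, n, p) - binom_cdf(r - 1, n, p)
    while coverage(r, s) < conf and (r > 1 or s < n):
        r, s = max(r - 1, 1), min(s + 1, n)
    return xs[r - 1], xs[s - 1]      # 1-based ranks -> list indices

# For the median of 10 observations, this reproduces the textbook 95% interval,
# which uses the 2nd and 9th order statistics (coverage about 97.9%):
print(distfree_ci(list(range(1, 11)), 0.5))   # (2, 9)
```

No distributional assumption is made about the data, which is why these intervals remain valid for skewed variables such as MPG_City.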

tags: Data Analysis
April 17, 2013
 

I often see variations of the following question posted on statistical discussion forums:

I want to bin the X variable into a small number of values. For each bin, I want to draw the quartiles of the Y variable for that bin. Then I want to connect the corresponding quartile points from bin to bin.

In other words, the question asks how to create a plot like the following:

The plot attempts to show how the first quartile, second quartile (median), and third quartile of the response variable vary as the explanatory variable changes. However, I do not recommend using this plot when the explanatory variable is continuous. When I see a question like this, I sometimes respond by saying "you might want to look at quantile regression."

This article takes a quick look at quantile regression. What is quantile regression? How can you perform quantile regression in SAS? And how does it relate to the "binned quantile plot" that is shown above?

Start with the familiar: Standard regression

Before discussing quantile regression, let's introduce some data and think about a typical analysis. The data are salaries (in 1991) for 459 statistics professors in U.S. colleges and universities. A professor's salary (the response) is assumed to be related to his or her years of service (the explanatory variable). For these data, the explanatory variable is already binned into 25 discrete values. These data and this analysis are based on an example in the documentation for the QUANTREG procedure. You can download the data and the SAS statements that are used in this article.

A traditional regression analysis predicts the mean salary for a professor, given the years of service. In SAS, there are many regression procedures, such as the parametric GLM procedure and the nonparametric LOESS procedure. You can also compute these regression fits directly from the SGPLOT procedure. For example, the following statements add a loess curve and a cubic regression curve to the data:

title "Salary by Experience";
proc sgplot data=salary;
   loess x=Year y=Salaries / smooth=0.5;       /* nonparametric regression */
   reg x=Year y=Salaries / degree=3 nomarkers legendLabel="Cubic Regression";
run;

For the loess model, salaries appear to increase with experience for the first six years and for years 12–18; the salaries appear to be flat for the other intervals. For the cubic regression model, the salaries appear to increase gradually, although they increase more quickly for the first 10 years.

These regression curves smooth the data. For a parametric model, you can sometimes interpret the parameters of the model in terms of physical quantities. Do you believe that the mean salary of a professor depends smoothly on the number of years of service? These models reflect that assumption.

Now think for a moment about what we did not do. We did not compute the sample mean for each year and "connect the dots." That would result in a jagged line, which does not smooth the data. "Connecting the dots" is not a model; it does not provide insight into the relationship between salary and years of service, nor does it accommodate random errors in the data.

Moving from means to quantiles

Given the number of years of service, each of the previous curves predicts the mean salary. Statistically speaking, the regression curves model the "conditional mean" of the response variable. However, you can also design a regression procedure to model the conditional median of the response. In fact, you can model other quantiles, such as the upper quartile (75th percentile), the lower decile (10th percentile), or other values.
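Under the hood, the conditional quantile is defined through an asymmetric loss function: the tau-th quantile is the value c that minimizes the summed "check loss" rho_tau(u) = u(tau - I(u < 0)), with a regression function taking the place of the constant c in quantile regression. The following Python snippet (a toy grid search, not PROC QUANTREG's algorithm) illustrates this for the constant model:

```python
def check_loss(c, y, tau):
    """Summed check loss: rho_tau(v - c) over the data, for a constant fit c."""
    return sum((v - c) * (tau - (1 if v < c else 0)) for v in y)

y = [1.0, 2.0, 3.0, 4.0, 100.0]          # one large outlier
grid = [v / 10 for v in range(0, 1001)]  # candidate constants 0.0 .. 100.0

best = min(grid, key=lambda c: check_loss(c, y, 0.5))
print(best)   # 3.0: the sample median; the mean (22.0) chases the outlier
```

For tau = 0.5 the check loss reduces to absolute error, so the minimizer is the median; other values of tau recover other quantiles, which is why quantile regression is robust to outliers in the response.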

A model for a conditional quantile is known as quantile regression. In SAS, quantile regression is computed by using the QUANTREG procedure, the QUANTSELECT procedure (which supports variable selection), or the QUANTLIFE procedure (which supports censored observations).

What is quantile regression?

Several introductory papers have been written on quantile regression.

I don't intend to duplicate these papers. Instead, let's use PROC QUANTREG to compute a quantile regression and compare it with the binned quantile plot:

title "Quantile Regression of Salary vs. Year";
ods graphics on;
proc quantreg data=salary ci=sparsity;
   model salaries = year year*year year*year*year /
                    quantile=0.25 0.5 0.75 plot=fitplot(showlimits);
run;

For these data, the model for the lower quartile increases for about 10 years, then levels off. The model for the conditional median is qualitatively similar to the cubic model for the conditional mean. The model for the upper quartile indicates a higher growth rate than for the median. The quantile curves enable you to estimate how the interquartile range (the gap between the upper and lower quartiles) grows with time. For newly hired professors, only about $7,000 separates the relatively high salaries from the relatively low salaries. However, for professors with twenty or more years of experience, that gap has widened to more than $10,000.

When compared with the binned quantile plot at the beginning of this article, this quantile regression plot has several advantages:

  1. The curves smooth the data. Given the number of years of service, you can read off the predicted quartiles of salary from the plot.
  2. You can use the CI= option on the PROC QUANTREG statement to obtain confidence intervals (shown as shaded bands) for the predicted quartiles. See the PROC QUANTREG documentation for details.
  3. For parametric models (especially linear models), you might be able to interpret the parameters to gain insight into the process that generates the data. You can also compute nonparametric models by using the EFFECT statement to create spline effects.
  4. Quantile regression extends easily to multiple explanatory variables, whereas binning data gets harder as the dimension increases, and you often get bins for which there are no data.

So reach for quantile regression when you want to investigate how quartiles, quintiles, or deciles of the response variable change with covariates. The QUANTREG procedure is easy to run, and the results are superior to ad-hoc methods such as binning the data and connecting the sample quantiles.

tags: Data Analysis, Statistical Thinking
4月 172013
 

I often see variations of the following question posted on statistical discussion forums:

I want to bin the X variable into a small number of values. For each bin, I want to draw the quartiles of the Y variable for that bin. Then I want to connect the corresponding quartile points from bin to bin.

In other words, the question asks how to create a plot like the following:

The plot attempts to show how the first quartile, second quartile (median), and third quartile of the response variable vary as the explanatory variable changes. However, I do not recommend using this plot when the explanatory variable is continuous. When I see a question like this, I sometimes respond by saying "you might want to look at quantile regression."
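To make the question concrete, here is a minimal sketch (in Python, not SAS, with made-up data and an arbitrary unit bin width) of the binning computation that the question describes:

```python
import statistics

# toy (x, y) pairs; bin the x variable into unit-width bins (data are made up)
pts = [(0.1, 2), (0.4, 5), (0.9, 3), (1.2, 7), (1.5, 4), (1.8, 9),
       (2.3, 6), (2.6, 8), (2.9, 10)]
bins = {}
for x, y in pts:
    bins.setdefault(int(x), []).append(y)   # bin ID = floor of x

# per-bin quartiles of y; the plot connects corresponding quartiles across bins
quartiles = {b: statistics.quantiles(ys, n=4) for b, ys in sorted(bins.items())}
```

Connecting these per-bin values from bin to bin produces the jagged curves in the plot above, which is exactly the construction this article argues against.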

This article takes a quick look at quantile regression. What is quantile regression? How can you perform quantile regression in SAS? And how does it relate to the "binned quantile plot" that is shown above?

Start with the familiar: Standard regression

Before discussing quantile regression, let's introduce some data and think about a typical analysis. The data are salaries (in 1991) for 459 statistics professors in U.S. colleges and universities. A professor's salary (the response) is assumed to be related to his or her years of service (the explanatory variable). For these data, the explanatory variable is already binned into 25 discrete values. These data and this analysis are based on an example in the documentation for the QUANTREG procedure. You can download the data and the SAS statements that are used in this article.

A traditional regression analysis predicts the mean salary for a professor, given the years of service. In SAS, there are many regression procedures, such as the parametric GLM procedure and the nonparametric LOESS procedure. For simple models like these, you can also fit the regression directly from the SGPLOT procedure. For example, the following statements overlay a loess curve and a cubic regression curve on a scatter plot of the data:

title "Salary by Experience";
proc sgplot data=salary;
   loess x=Year y=Salaries / smooth=0.5;       /* nonparametric regression */
   reg x=Year y=Salaries / degree=3 nomarkers legendLabel="Cubic Regression";
run;

For the loess model, salaries appear to increase with experience for the first six years and for years 12–18; the salaries appear to be flat for the other intervals. For the cubic regression model, the salaries appear to increase gradually, although they increase more quickly for the first 10 years.

These regression curves smooth the data. For a parametric model, you can sometimes interpret the parameters of the model in terms of physical quantities. Do you believe that the mean salary of a professor depends smoothly on the number of years of service? These models reflect that assumption.

Now think for a moment about what we did not do. We did not compute the sample mean for each year and "connect the dots." That would result in a jagged line, which does not smooth the data. "Connecting the dots" is not a model; it does not provide insight into the relationship between salary and years of service, nor does it accommodate random errors in the data.

Moving from means to quantiles

Given the number of years of service, each of the previous curves predicts the mean salary. Statistically speaking, the regression curves model the "conditional mean" of the response variable. However, you can also design a regression procedure to model the conditional median of the response. In fact, you can model other quantiles, such as the upper quartile (75th percentile), the lower decile (10th percentile), or other values.

A model for a conditional quantile is known as quantile regression. In SAS, quantile regression is computed by using the QUANTREG procedure, the QUANTSELECT procedure (which supports variable selection), or the QUANTLIFE procedure (which supports censored observations).
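The key idea behind these procedures is the loss function: least squares regression minimizes squared error, whereas quantile regression minimizes an asymmetric "check" (pinball) loss. The following sketch (in Python, with made-up data) shows that minimizing the check loss over a constant recovers a sample quantile:

```python
def check_loss(q, data, tau):
    """Asymmetric check loss: overshoots are weighted tau, undershoots 1 - tau."""
    return sum(tau * (y - q) if y >= q else (1 - tau) * (q - y) for y in data)

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
tau = 0.75                              # target the upper quartile
# a minimizer of the check loss always occurs at a data value, so search those
best = min(sorted(set(data)), key=lambda q: check_loss(q, data, tau))
```

Replacing the constant q with a function of the covariates (a line, a cubic, a spline) yields quantile regression, just as replacing a constant mean with such a function yields ordinary regression.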

What is quantile regression?

There have been several introductory papers written on quantile regression:

I don't intend to duplicate these papers. Instead, let's use PROC QUANTREG to compute a quantile regression and compare it with the binned quantile plot:

title "Quantile Regression of Salary vs. Year";
ods graphics on;
proc quantreg data=salary ci=sparsity;
   model salaries = year year*year year*year*year /
                    quantile=0.25 0.5 0.75 plot=fitplot(showlimits);
run;

For these data, the model for the lower quartile increases for about 10 years, then levels off. The model for the conditional median is qualitatively similar to the cubic model for the conditional mean. The model for the upper quartile indicates a higher growth rate than for the median. The quantile curves enable you to estimate how the inter-quartile range (the gap between the upper and lower quartiles) grows with time. For newly hired professors, only about $7,000 separates the relatively high salaries from the relatively low salaries. However, for professors with twenty or more years of experience, that gap has widened to more than $10,000.

When compared with the binned quantile plot at the beginning of this article, this quantile regression plot has several advantages:

  1. The curves smooth the data. Given the number of years of service, you can read off the predicted quartiles of salary from the plot.
  2. You can use the CI= option on the PROC QUANTREG statement to obtain confidence intervals (shown as shaded bands) for the predicted quartiles. See the PROC QUANTREG documentation for details.
  3. For parametric models (especially linear models), you might be able to interpret the parameters to gain insight into the process that generates the data. You can also compute nonparametric models by using the EFFECT statement to create spline effects.
  4. Quantile regression extends easily to multiple explanatory variables, whereas binning data gets harder as the dimension increases, and you often get bins for which there are no data.

So reach for quantile regression when you want to investigate how quartiles, quintiles, or deciles of the response variable change with covariates. The QUANTREG procedure is easy to run, and the results are superior to ad-hoc methods such as binning the data and connecting the sample quantiles.

tags: Data Analysis, Statistical Thinking
4月 032013
 

I was recently asked how to compute the difference between two density estimates in SAS. The person who asked the question sent me a link to a paper from The Review of Economics and Statistics that contains several examples of this technique (for example, see Figure 3 on p. 16 and Figure 4 on p. 17). The author of the paper uses the technique to demonstrate that the wage of a Mexican citizen influences his decision to emigrate to the US. Briefly, the paper plots a kernel density estimate of the income distribution of Mexicans who emigrated and those who did not, and shows that the incomes of those who decided to emigrate tend to be less than for those who decided to stay.

In SAS, you can form the difference between two density estimates by doing the following:

  1. Use PROC KDE to compute kernel density estimates of the two groups. Use the GRIDL= and GRIDU= options so that the two kernel densities are evaluated on the same grid of points. Use the OUT= option on the UNIVAR statement to write the density estimates to a SAS data set.
  2. Form the difference of the densities by subtracting one from the other.
  3. Use the SGPLOT procedure to plot the difference.
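Those three steps translate directly to other environments as well. As an illustrative sketch (plain Python with a hand-rolled Gaussian kernel and made-up data, not the PROC KDE call), evaluating both estimates on one shared grid makes the subtraction trivial:

```python
import math

def kde(data, grid, bw):
    """Gaussian kernel density estimate of `data`, evaluated at each grid point."""
    c = 1.0 / (len(data) * bw * math.sqrt(2 * math.pi))
    return [c * sum(math.exp(-0.5 * ((g - x) / bw) ** 2) for x in data)
            for g in grid]

grid = [0.5 * i for i in range(-8, 9)]          # shared grid, like GRIDL=/GRIDU=
group_a = [-1.2, -0.5, 0.0, 0.3, 0.8]           # made-up data for two groups
group_b = [0.1, 0.6, 1.1, 1.5, 2.0]
dens_a = kde(group_a, grid, bw=0.7)
dens_b = kde(group_b, grid, bw=0.7)
diff = [a - b for a, b in zip(dens_a, dens_b)]  # step 2: difference of densities
```

Because each density integrates to 1, the difference curve has equal positive and negative area, which is why it is not itself a density.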

However, just because you can do something doesn't mean that you should do it! After I produce some graphs, I present some questions about the technique.

The difference of densities for subpopulations

For the sake of an example, I will use data in the Sashelp.Cars data set. I will compare the MPG_CITY for cars that are manufactured in the USA with the MPG_CITY for cars that are manufactured in Asia. The cars from the USA tend to get fewer miles per gallon than the cars from Asia, so these data are a convenient proxy for the Mexican incomes that are shown in the paper.

The following DATA step creates two MPG variables. The call to PROC KDE overlays both densities on a single plot and writes the kernel density estimates to the KerOut data set.

/* create example data: MPG_City for cars built in USA vs. Asia */
data cars;
 keep MPG_Asia MPG_USA;
 merge sashelp.cars(where=(origin="Asia") rename=(MPG_City=MPG_Asia))
       sashelp.cars(where=(origin="USA") rename=(MPG_City=MPG_USA));
run;
 
/* optional: use BWM= option to adjust kernel bandwidth */
proc kde data=cars;
univar MPG_Asia MPG_USA / plots=DensityOverlay out=KerOut GridL=5 GridU=65;
run;

The GRIDL= and GRIDU= options are necessary. Without them, the density estimate would not be evaluated at a common set of points. Then you would need to use an interpolation scheme to obtain the KDEs at a common set of locations. I have previously shown how to use SAS/IML functions for linear interpolation of data.
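For readers outside SAS, a one-dimensional linear interpolation is all that would be needed to align two mismatched grids. The following sketch (Python, illustrative only; not the SAS/IML code linked above) shows the idea:

```python
def lin_interp(x, xs, ys):
    """Linearly interpolate the points (xs, ys) at x; xs must be increasing."""
    if x <= xs[0]:
        return ys[0]
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return ys[-1]

# re-evaluate a density known at the grid [1, 2, 3] onto a new grid
new_grid = [1.5, 2.5]
vals = [lin_interp(g, [1, 2, 3], [10, 20, 30]) for g in new_grid]
```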

However, because we used the GRIDL= and GRIDU= options, no interpolation is necessary. You can just use the DATA step to merge the two KDEs and to compute the difference:

data Diff(drop=Var Count);
merge KerOut(where=(Var="MPG_Asia") rename=(Density=Density_Asia))
      KerOut(where=(Var="MPG_USA")  rename=(Density=Density_USA));
by Value;
Diff_Density = Density_USA - Density_Asia;
run;
 
/* display difference of group densities */
proc sgplot data=Diff;
   title "Difference between Densities";
   series x=Value y=Diff_Density;
   refline 0 / axis=y;
   xaxis grid values=(5 to 65 by 5) label="MPG (City)";
   yaxis label="Difference in Density Estimates (USA - Asia)";
run;

You can interpret the graph as follows. Given that a vehicle was made in the USA, its fuel efficiency is generally less than for vehicles that were manufactured in Asia. The "difference plot" shows the extent of the difference in densities for the two subpopulations. This plot is interesting because American cars that get approximately 27 mpg seem to be an exception to the general rule.

Notice that the difference of densities is not itself a density. Rather, the difference plot is a way to visualize when the density of one distribution is greater than for a second distribution.

A critique of this method

I have never seen this "difference in density" technique before, and I have some questions about it. I am not familiar with a theoretical reference, so my criticisms are based on ignorance rather than on a careful study of the method.

One concern I have is that the scale of the "density difference plot" is clearly important, but the method seems to ignore it. Suppose that I have two normal populations, N(0, 1) and N(ε, 1). When I compute the difference in densities, the difference plot will look like the following plot.

The difference plot shows that the N(0, 1) distribution is to the left of the other distribution, but the plot doesn't warn the reader that the two distributions are essentially identical. No matter how small ε is, the difference plot will look similar...except for the scale of the vertical axis. Can the method be extended to provide confidence limits? I don't know. Perhaps the integral of the absolute value of the difference (or the square of the difference) is an appropriate measure of how close the densities are.
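To make the scale concern concrete, here is a small numerical sketch (Python, illustrative only) that integrates the absolute difference between the N(0,1) and N(ε,1) densities. The shape of the difference curve is the same for every ε, but this L1 summary shrinks linearly with ε:

```python
import math

def normal_pdf(x, mu=0.0):
    """Standard-deviation-1 normal density with mean mu."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def l1_density_diff(eps, lo=-6.0, hi=6.0, n=1200):
    """Midpoint-rule approximation of the integral of |phi(x) - phi(x - eps)|."""
    h = (hi - lo) / n
    return h * sum(abs(normal_pdf(lo + (i + 0.5) * h) -
                       normal_pdf(lo + (i + 0.5) * h, eps)) for i in range(n))

gaps = {eps: l1_density_diff(eps) for eps in (1.0, 0.1, 0.01)}
```

Reporting a summary like this (or the integrated squared difference) alongside the difference plot would address the objection that the plot hides how close the two densities are.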

Another concern I have is that kernel density estimation requires choosing a bandwidth parameter. There are various ways to choose a bandwidth when the goal is to estimate the density. However, it is not clear to me that the bandwidth that best estimates the density is also the bandwidth that best estimates the difference between the densities of two subpopulations.

The authors of the paper acknowledge some of these concerns (p. 17 and the bottom of p. 16), and they perform a nonparametric test, such as can be computed by using PROC NPAR1WAY. Nevertheless, I wonder when it makes sense to display the difference between two densities.

How would you analyze data like these? How would you visualize the results?

tags: Data Analysis
3月 272013
 

In statistics, distances between observations are used to form clusters, to identify outliers, and to estimate distributions. Distances are used in spatial statistics and in other application areas.

There are many ways to define the distance between observations. I have previously written an article that explains Mahalanobis distance, which is used often in multivariate analysis, and I have showed how to compute the Mahalanobis distance in SAS. Today's article is simpler: how do you compute the usual Euclidean distance in SAS?

Recall that the squared Euclidean distance between the point p = (p1, p2, ..., pn) and the point q = (q1, q2, ..., qn) is the sum of the squares of the differences between the components: Dist²(p,q) = Σi (pi − qi)². The Euclidean distance is then the square root of Dist²(p,q). This article shows three ways to compute the Euclidean distance in SAS:

  1. By using the DISTANCE procedure in SAS/STAT software.
  2. By using the DISTANCE function in SAS/IML software.
  3. By writing your own SAS/IML function.
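As a reference point for the SAS output below, the formula itself is a one-liner in most languages. This Python sketch (illustrative only, not one of the three SAS methods) computes the distance between two of the lattice points used later in the article:

```python
import math

def euclid(p, q):
    """Euclidean distance: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# distance between the lattice points (0,1,0) and (0,0,1) is sqrt(2)
d = euclid((0, 1, 0), (0, 0, 1))
```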

The following DATA step creates 27 observations that are arranged on an integer lattice in three dimensions. Each row of the data set contains an observation in three variables. The 27 points are (0,0,0), (0,0,1), (0,0,2), (0,1,0), ..., (2,2,2).

data Obs;
do x=0 to 2;
   do y=0 to 2;
      do z = 0 to 2;
         output;
      end;
   end;
end;
run;

Compute distance by using SAS/STAT software

PROC DISTANCE can compute many kinds of distances, and it can also standardize the data variables, which is useful when your variables represent different quantities (such as height, weight, and age). In the simple case of Euclidean distance without any standardization, specify the METHOD=EUCLID option and the NOSTD option on the PROC DISTANCE statement, as follows:

proc distance data=Obs out=Dist method=Euclid nostd;
   var interval(x y z);
run;
 
proc print data=Dist(obs=4);
   format Dist: 8.6;
   var Dist1-Dist4;
run;

The output data set has 27 variables and 27 observations. The preceding table shows the first four observations and the first four variables. You can see that the output data set is the lower-triangular portion of the distance matrix. The ith row gives the distance between the ith observation and the jth observation for j ≤ i. For example, the distance between the fourth observation (0,1,0) and the second observation (0,0,1) is sqrt(0² + 1² + 1²) = sqrt(2) = 1.414.

If you prefer to output the full, dense, symmetric matrix of distances, use the SHAPE=SQUARE option on the PROC DISTANCE statement.

Computing distance in SAS/IML Software

In SAS/IML software, you can use the DISTANCE function to compute a variety of distance matrices. The DISTANCE function was introduced in SAS/IML 12.1 (SAS 9.3M2).

By default, the DISTANCE function computes the Euclidean distance, and the output is always a square matrix. Therefore, the following statements compute the Euclidean pairwise distances between the 27 points in the Obs data set:

proc iml;
use Obs;
read all var _NUM_ into X;
close Obs;
 
D = distance(X);
print (D[1:4, 1:4])[format=8.6 c=("Dist1":"Dist4")];

The output shows that the values in the upper-left portion of the distance matrix are the same as were computed by PROC DISTANCE.

A SAS/IML Module for computing Euclidean distance

Sometimes you do not want to compute the complete matrix of pairwise distances between observations. Sometimes you only need the distances between observations that belong to different groups. For example, you might have one set of locations that represent warehouses and another set that represents the locations of retail stores. You might be interested in computing the distances from each store to the warehouses, so that you can efficiently ship goods to each store from the closest warehouse.

The following SAS/IML module computes the distances between observations in the matrix X and the matrix Y. The (i,j)th element of the result is the distance between the ith row of X and the jth row of Y.

/* compute Euclidean distance between points in x and points in y.
   x is a n x d matrix, where each row is a point in d dimensions.
   y is a m x d matrix.
   The function returns the n x m matrix of distances, D, such that
   D[i,j] is the distance between x[i,] and y[j,]. */
start PairwiseDist(x, y);
   if ncol(x)^=ncol(y) then return (.);       /* different dimensions */
   n = nrow(x);  m = nrow(y);
   idx = T(repeat(1:n, m));                   /* index matrix for x   */
   jdx = shape(repeat(1:m, n), n);            /* index matrix for y   */
   diff = X[idx,] - Y[jdx,];
   return( shape( sqrt(diff[,##]), n ) );     /* sqrt(sum of squares) */
finish;
 
p = { 1  0, 
      1  1,
     -1 -1};
q = { 0  0,
     -1  0};
PD = PairwiseDist(p,q); 
print PD;
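For readers who think in array languages other than SAS/IML, the same index-expansion trick is what NumPy's broadcasting does implicitly. This sketch (an illustrative translation, not part of the original article) reproduces the PD matrix for the p and q points above:

```python
import numpy as np

def pairwise_dist(x, y):
    """Distance between each row of x (n x d) and each row of y (m x d); returns n x m."""
    diff = x[:, None, :] - y[None, :, :]      # broadcast to shape (n, m, d)
    return np.sqrt((diff ** 2).sum(axis=2))   # reduce over the coordinate axis

p = np.array([[1.0, 0.0], [1.0, 1.0], [-1.0, -1.0]])
q = np.array([[0.0, 0.0], [-1.0, 0.0]])
PD = pairwise_dist(p, q)
```

Broadcasting avoids building the explicit idx and jdx index matrices, but the intermediate (n, m, d) array embodies the same expansion, so the memory cost is comparable.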

If you are still running SAS/IML 9.3 or earlier, you can use the PairwiseDist function to define your own function that computes the Euclidean distance between rows of a matrix:

/* compute Euclidean distance between points in x.
   x is a p x d matrix, where each row is a point in d dimensions.
   Use the DISTANCE function in SAS/IML 12.1 and later releases. */
start EuclideanDistance(x);  
   y=x;
   return( PairwiseDist(x,y) );
finish;
 
D = EuclideanDistance(X);
print (D[1:4, 1:4])[format=8.6 c=("Dist1":"Dist4")];

The output is the same as for the built-in DISTANCE function, so it is not shown here.

tags: Data Analysis, Matrix Computations, Statistical Programming