data analysis

2月 152012

I previously described how to use Mahalanobis distance to find outliers in multivariate data. This article takes a closer look at Mahalanobis distance. A subsequent article will describe how you can compute Mahalanobis distance.

Distance in standard units

In statistics, we sometimes measure "nearness" or "farness" in terms of the scale of the data. Often "scale" means "standard deviation." For univariate data, we say that an observation that is one standard deviation from the mean is closer to the mean than an observation that is three standard deviations away. (You can also specify the distance between two observations by specifying how many standard deviations apart they are.)

For many distributions, such as the normal distribution, this choice of scale also makes a statement about probability. Specifically, it is more likely to observe an observation that is about one standard deviation from the mean than it is to observe one that is several standard deviations away. Why? Because the probability density function is higher near the mean and nearly zero as you move many standard deviations away.

For normally distributed data, you can specify the distance from the mean by computing the so-called z-score. For a value x, the z-score of x is the quantity z = (x-μ)/σ, where μ is the population mean and σ is the population standard deviation. This is a dimensionless quantity that you can interpret as the number of standard deviations that x is from the mean.

Distance is not always what it seems

You can generalize these ideas to the multivariate normal distribution. The following graph shows simulated bivariate normal data that is overlaid with prediction ellipses. The ellipses in the graph are the 10% (innermost), 20%, ..., and 90% (outermost) prediction ellipses for the bivariate normal distribution that generated the data. The prediction ellipses are contours of the bivariate normal density function. The probability density is high for ellipses near the origin, such as the 10% prediction ellipse. The density is low for ellipses are further away, such as the 90% prediction ellipse.

In the graph, two observations are displayed by using red stars as markers. The first observation is at the coordinates (4,0), whereas the second is at (0,2). The question is: which marker is closer to the origin? (The origin is the multivariate center of this distribution.)

The answer is, "It depends how you measure distance." The Euclidean distances are 4 and 2, respectively, so you might conclude that the point at (0,2) is closer to the origin. However, for this distribution, the variance in the Y direction is less than the variance in the X direction, so in some sense the point (0,2) is "more standard deviations" away from the origin than (4,0) is. </p

Notice the position of the two observations relative to the ellipses. The point (0,2) is located at the 90% prediction ellipse, whereas the point at (4,0) is located at about the 75% prediction ellipse. What does this mean? It means that the point at (4,0) is "closer" to the origin in the sense that you are more likely to observe an observation near (4,0) than to observe one near (0,2). The probability density is higher near (4,0) than it is near (0,2).

In this sense, prediction ellipses are a multivariate generalization of "units of standard deviation." You can use the bivariate probability contours to compare distances to the bivariate mean. A point p is closer than a point q if the contour that contains p is nested within the contour that contains q.

Defining the Mahalanobis distance

You can use the probability contours to define the Mahalanobis distance. The Mahalanobis distance has the following properties:

  • It accounts for the fact that the variances in each direction are different.
  • It accounts for the covariance between variables.
  • It reduces to the familiar Euclidean distance for uncorrelated variables with unit variance.

For univariate normal data, the univariate z-score standardizes the distribution (so that it has mean 0 and unit variance) and gives a dimensionless quantity that specifies the distance from an observation to the mean in terms of the scale of the data. For multivariate normal data with mean μ and covariance matrix Σ, you can decorrelate the variables and standardize the distribution by applying the Cholesky transformation z = L-1(x - μ), where L is the Cholesky factor of Σ, Σ=LLT.

After transforming the data, you can compute the standard Euclidian distance from the point z to the origin. In order to get rid of square roots, I'll compute the square of the Euclidean distance, which is dist2(z,0) = zTz. This measures how far from the origin a point is, and it is the multivariate generalization of a z-score.

You can rewrite zTz in terms of the original correlated variables. The squared distance Mahal2(x,μ) is
= zT z
= (L-1(x - μ))T (L-1(x - μ))
= (x - μ)T (LLT)-1 (x - μ)
= (x - μ)T Σ -1 (x - μ)
The last formula is the definition of the squared Mahalanobis distance. The derivation uses several matrix identities such as (AB)T = BTAT, (AB)-1 = B-1A-1, and (A-1)T = (AT)-1. Notice that if Σ is the identity matrix, then the Mahalanobis distance reduces to the standard Euclidean distance between x and μ.

The Mahalanobis distance accounts for the variance of each variable and the covariance between variables. Geometrically, it does this by transforming the data into standardized uncorrelated data and computing the ordinary Euclidean distance for the transformed data. In this way, the Mahalanobis distance is like a univariate z-score: it provides a way to measure distances that takes into account the scale of the data.

tags: Data Analysis, Statistical Thinking
2月 022012

In two previous blog posts I worked through examples in the survey article, "Robust statistics for outlier detection," by Peter Rousseeuw and Mia Hubert. Robust estimates of location in a univariate setting are well-known, with the median statistic being the classical example. Robust estimates of scale are less well-known, with the best known example being interquartile range (IQR), but a more modern statistic being the MAD function.

For multivariate data, the classical (nonrobust) estimate of location is the vector mean, c, which is simply the vector whose ith component is the mean of the ith variable. The classical (nonrobust) estimate of scatter is the covariance matrix. An outlier is defined as an observation whose Mahalanobis distance from c is greater than some cutoff value. As in the univariate case, both classical estimators are sensitive to outliers in the data. Consequently, statisticians have created robust estimates of the center and the scatter (covariance) matrix.

MCD: Robust estimation by subsampling

A popular algorithm that computes a robust center and scatter of multivariate data is known as the Minimum Covariance Determinant (MCD) algorithm. The main idea is due to Rousseeuw (1985), but the algorithm that is commonly used was developed by Rousseeuw and Van Driessen (1999). The MCD algorithm works by sampling h observations from the data over and over again, where h is typically in the range n/2 < h < 3n/4. The "winning" subset is the h points whose covariance matrix has the smallest determinant. Points far from the center of this subset are excluded, and the center and scatter of the remaining points are used as the robust estimates of location and scatter.

This Monte Carlo approach works very well in practice, but it does have the unfortunate property that it is not deterministic: a different random number seed could result in different robust estimates. Recently, Hubert, Rousseeuw, and Verdonck (2010) have published a deterministic algorithm for the MCD.

Robust MCD estimates in SAS/IML software

The SAS/IML language includes the MCD function for robust estimation of multivariate location and scatter. The following matrix defines a data matrix from Brownlee (1965) that correspond to certain measurements taken on 21 consecutive days. The points are shown in a three-dimensional scatter plot that was created in SAS/IML Studio.

proc iml;
x = { 80  27  89,  80  27  88,  75  25  90, 
      62  24  87,  62  22  87,  62  23  87, 
      62  24  93,  62  24  93,  58  23  87, 
      58  18  80,  58  18  89,  58  17  88, 
      58  18  82,  58  19  93,  50  18  89, 
      50  18  86,  50  19  72,  50  19  79, 
      50  20  80,  56  20  82,  70  20  91 };
/* classical estimates */
labl = {"x1" "x2" "x3"};
mean = mean(x);
cov = cov(x);
print mean[c=labl format=5.2], cov[r=labl c=labl format=5.2];

Most researchers think that observations 1, 2, 3, and 21 are outliers, with others including observation 2 as an outlier. (These points are shown as red crosses in the scatter plot.) The following statement runs the MCD algorithm on these data and prints the robust estimates:

/* robust estimates */
N = nrow(x);   /* 21 observations */
p = ncol(x);   /*  3 variables */
optn = j(8,1,.); /* default options for MCD */
optn[1] = 0;     /* =1 if you want printed output */
optn[4]= floor(0.75*N); /* h = 75% of obs */
call MCD(sc, est, dist, optn, x);
RobustLoc = est[1, ];     /* robust location */
RobustCov = est[3:2+p, ]; /* robust scatter matrix */
print RobustLoc[c=labl format=5.2], RobustCov[r=labl c=labl format=5.2];

The robust estimate of the center of the data is not too different from the classical estimate, but the robust scatter matrix is VERY different from the classical covariance matrix. Each robust estimate excludes points that are identified as outliers.

If you take these robust estimates and plug them into the classical Mahalanobis distance formula, the corresponding distance is known as the robust distance. It measures the distance between each observation and the estimate of the robust center by using a metric that depends on the robust scatter matrix. The MCD subroutine returns distance information in a matrix that I've called DIST (the third argument). The first row of DIST is the classical Mahalanobis distance. The second row is the robust distance, which is based on the robust estimates of location and scatter. The third row is an indicator variable with the value 1 if an observation is closer to the robust center than some cutoff value, and 0 otherwise. Consequently, the following statements find the outliers.
/* rules to detect outliers */
outIdx = loc(dist[3,]=0); /* RD > cutoff */
print outIdx;

The MCD algorithm has determined that observations 1, 2, 3, and 21 are outliers.

Incidentally, the cutoff value used by MCD is based on a quantile of the chi-square distribution because the squared Mahalanobis distance of multivariate normal data obeys a chi-square distribution with p degress of freedom, where p is the number of variables. The cutoff used is as follows:
cutoff = sqrt( quantile("chisquare", 0.975, p) ); /* dist^2 ~ chi-square */

In a recent paper, Hardin and Rocke (2005) propose a different criterion, based on the distribution of robust distances.

Robust MCD estimates in SAS/STAT software: How to "trick" PROC ROBUSTREG

The ROBUSTREG procedure can also compute MCD estimates. Usually, the ROBUSTREG procedure is used as a regression procedure, but you can also use it to obtain the MCD estimates by "inventing" a response variable. The MCD estimates are produced for the explanatory variables, so the choice of a response variable is unimportant. In the following example, I generate random values for the response variable.

In a regression context, the word "outlier" is reserved for an observation for which the value of the response variable is far from the predicted value. In other words, in regression an outlier means "far away (from the model) in the Y direction." In contrast, the ROBUSTREG procedure uses the MCD algorithm to identify influential observations in the space of explanatory (that is, X) variables. These are also called high-leverage points. They are observations that are far from the center of the X variables. High-leverage points are very influential in ordinary least squares regression, and that is why it is important to identify them.

To generate the MCD estimates, specify the DIAGNOSTICS and the LEVERAGE(MCDINFO) options on the MODEL statement, as shown in the following statements:

/* write data from SAS/IML (or use a DATA step) */
create X from x[c=labl]; append from x; close;
data X;
set X;
y=rannor(1); /* random response variable */
proc robustreg data=X method=lts;
   model y = x1 x2 x3 / diagnostics leverage(MCDInfo);
   ods select MCDCenter MCDCov Diagnostics;
   ods output diagnostics=Diagnostics(where=(leverage=1));
proc print data=Diagnostics; 
var Obs Mahalanobis RobustDist Leverage;

The robust estimates of location and scatter are the same, as are the robust distances. The "leverage" variable is an indicator variable that tells you which observations are far from the center of the explanatory variables. They are multivariate "outliers" in the the space of the X variables, although they are not necessarily outliers for the response (Y) variable.

tags: Data Analysis, Statistical Programming
1月 272012

In a previous blog post on robust estimation of location, I worked through some of the examples in the survey article, "Robust statistics for outlier detection," by Peter Rousseeuw and Mia Hubert. I showed that SAS/IML software and PROC UNIVARIATE both support the robust estimators of location that are mentioned in the paper. Today's post looks at the robust estimators of scale that are mentioned in the same paper and works through more examples in the paper. The paper uses the following five measurements, which contain one outlier:
6.25, 6.27, 6.28, 6.34, 63.1

Robust scale statistics in SAS/IML software

SAS/IML software contains several functions for robust estimation of scale. For estimating scale, the MAD function is often used. The MAD statistic is an acronym for "median of all absolute deviations from the median." The MAD statistic is often multiplied by a constant in order to make it unbiased for data that are normally distributed. The constant is 1.483, but you don't need to remember that value because the MAD function has the "NMAD" option that automatically includes the multiplication factor, as shown in the following example:

proc iml;
x = {6.25, 6.27, 6.28, 6.34, 63.1};
mad = mad(x, "NMAD"); 
print mad;

Rousseeuw and Hubert briefly mention two other robust measures of scale: the Qn estimator (Rousseeuw and Croux, JASA, 1993) and the interquartile range (IQR), which is well-known from the Tukey box plot. You can compute both of these estimators in SAS/IML software, as follow:

Qn = mad(x, "QN"); 
call qntl(q, x, {0.25 0.75}); /* compute 25th and 75th percentile */
IQR = q[2] - q[1];
print Qn IQR;

The three robust estimates of scale are similar. They range from 0.04 (MAD) to 0.07 (IQR). The IQR is sometimes divided by 1.349 in order to estimate the scale of normally distributed data. If you divide 0.07 by 1.349, you get 0.052, which make the estimates even more similar.

The connection with outlier detection

All this discussion of robust estimation of location and scale is closely related to detecting outliers. In practice, outliers are often detected using a rule or formula. The classical rule is to compute z-scores, which are just the normalized values zi = (xi - x̄)/s, where is the sample mean and s is the sample standard deviation. An outlier is defined as any observation for which |zi| exceeds some cutoff value, typically 2.5 or 3.

This rule fails when there is a large outlier in the data. For example, the following SAS/IML statements compute the classical z-scores for the Rousseeuw and Hubert example:

/* rules to detect outliers */
z = (x - mean(x)) / std(x);
print z;

Because the mean and standard deviation are both influenced by the outlier, no observation has a large z-score, and therefore none is flagged as an outlier. However, using robust estimators in the z-score formula does successfully identify the outlier, as shown in the following statements:

zRobust = (x - median(x)) / mad(x, "NMAD");
print zRobust;

The outlier has a HUGE "robust score." Of course, you don't have to print out the scores and inspect them. The following SAS/IML statements use the LOC function (the most useful function that you've never heard of!) to find all of the data for which the robust z-score exceeds 2.5, and prints only the outliers:

outIdx = loc(abs(zRobust)>2.5);
if ncol(outIdx)>0 then 
   outliers = x[outIdx];
   outliers = .;
print outliers;

Robust Estimates in the UNIVARIATE Procedure

The UNIVARIATE procedure also supports robust estimates of scale. The ROBUSTSCALE option on the PROC UNIVARIATE statement computes the robust estimates in the Rousseeuw and Hubert article, as well as others. The documentation for the UNIVARIATE procedure includes a section that describes the robust estimates of scale. The following example computes robust estimates of scale:

data a;
input x @@;
6.25 6.27 6.28 6.34 63.1
proc univariate data=a robustscale;
   var x;
   ods select RobustScale;

Notice that the output from PROC UNIVARIATE includes two columns. The first column is an unadjusted robust estimate. The second column estimates the standard deviation for normally distributed data, which can be derived from the first column.

tags: Data Analysis, Statistical Programming
1月 202012

I encountered a wonderful survey article, "Robust statistics for outlier detection," by Peter Rousseeuw and Mia Hubert. Not only are the authors major contributors to the field of robust estimation, but the article is short and very readable. This blog post walks through the examples in the paper and shows how to compute each example by using SAS. In particular, this post shows how to compute robust estimates of location for univariate data. Future posts will show how to compute robust estimates of scale and multivariate estimates.

The Rousseeuw and Hubert article begins with a quote:

In real data sets, it often happens that some observations are different from the majority. Such observations are called outliers. ...They do not fit the model well. It is very important to be able to detect these outliers.

The quote explains why outlier detection is connected to robust estimation methods. Classical statistical estimators are so affected by the outliers that "the resulting fitted model does not allow [you] to detect the deviating observations." The goal of robust statistical methods is to "find a fit that is close to the fit [you] would have found without the [presence of] outliers." You can then identify the outliers by their large deviation from the robust model.

The simplest example is computing the "center" of a set of data, which is known as estimating location. Consider the following five measurements:
6.25, 6.27, 6.28, 6.34, 63.1
As the song says, one of these points is not like the other.... The last datum is probably a miscoding of 6.31.

Robust estimate of location in SAS/IML software

SAS/IML software contains several functions for robust estimation. For estimating location, the MEAN and MEDIAN functions are the primary computational tools. It is well known that the mean is sensitive to even a single outlier, whereas the median is not. The following SAS/IML statements compute the mean and median of these data:

proc iml;
x = {6.25, 6.27, 6.28, 6.34,  63.1};
mean = mean(x); /* or x[:] */
median = median(x);
print mean median;

The mean is not representative of the bulk of the data, but the median is.

Although the survey article doesn't mention it, there are two other robust estimators of location that have been extensively studied. They are the trimmed mean and the Winsorized mean:

trim = mean(x, "trimmed", 0.2);    /* 20% of obs */
winsor = mean(x, "winsorized", 1); /* one obs */
print trim winsor;

The trimmed mean is computed by excluding the k smallest and k largest values, and computing the mean of the remaining values. The Winsorized mean is computed by replacing the k smallest values with the (k+1)st smallest, and replacing the k largest values with the (k+1)st largest. The mean of these remaining values is the Winsorized mean. For both of these functions, you can specify either a number of observations to trim or Winsorize, or a percentage of values. Formulas for the trimmed and Winsorized means are included in the documentation of the UNIVARIATE procedure. If you prefer an example, here are the equivalent computations for the trimmed and Winsorized means:

trim2 = mean( x[2:4] );
winsor2 = mean( x[2] // x[2:4] // x[4] );
print trim2 winsor2;

Robust Estimates in the UNIVARIATE Procedure

The UNIVARIATE procedure also supports these robust estimators. The trimmed and Winsorized means are computed by using the TRIM= and WINSOR= options, respectively. Not only does PROC UNIVARIATE compute robust estimates, but it computes standard errors as shown in the following example.

data a;
input x @@;
6.25 6.27 6.28 6.34 63.1
proc univariate data=a trim=0.2 winsor=1;
   var x;
ods select BasicMeasures TrimmedMeans WinsorizedMeans;

Next time: robust estimates of scale.

tags: Data Analysis, Statistical Programming
1月 132012

A recent question on a SAS Discussion Forum was "how can you overlay multiple kernel density estimates on a single plot?" There are three ways to do this, depending on your goals and objectives.

Overlay different estimates of the same variable

Sometimes you have a single variable and want to overlay various density estimates, either parametric or nonparametric. You can use the HISTOGRAM statement in the UNIVARIATE procedure to accomplish this. The following SAS code overlays three kernel density estimates with different bandwidths on a histogram of the MPG_CITY variable in the SASHelp.Cars data set:

/* use UNIVARIATE to overlay different estimates of the same variable */
proc univariate;
   var mpg_city;
   histogram / kernel(C=SJPI MISE 0.5); /* three bandwidths */

In the same way, you can overlay various parametric estimates and combine parametric and nonparametric estimates.

Overlay estimates of different variables

Sometimes you might want to overlay the density estimates of several variables in order to compare their densities. You can use the KDE procedure to accomplish this by using the PLOTS=DensityOverlay graph. The following SAS code overlays the density curves of two different variables: the miles per gallon for vehicles in the city and the miles per gallon for the same variables on the highway:

/* use KDE to overlay estimates of different variables */
proc kde;
   univar mpg_city mpg_highway / plots=densityoverlay;

Overlay arbitrary densities

Sometimes you might need to overlay density estimates that come from multiple sources. For example, you might use PROC UNIVARIATE construct a parametric density estimate, but overlay it on a density estimate that you computed by using PROC KDE or that you computed yourself by writing an algorithm in PROC IML. In these cases, you want to write the density estimates to a data set, combine them with the DATA step, and plot them using the SERIES statement in PROC SGPLOT.

There are three ways to get density estimates in a data set:

  • In PROC KDE, the UNIVAR statement has an OUT= option that you can use to write the density estimate to a SAS data set.
  • In PROC UNIVARIATE, the HISTOGRAM statement has an OUTKERNEL= option that you can use to write the kernel density estimate to a SAS data set.
  • For parametric estimates that are computed in PROC UNIVARIATE, you can use the ODS OUTPUT statement to save the ParameterEstimates table to a SAS data set. You can then use a DATA step in conjunction with the PDF function to create the (x,y) values along a parametric density curve.

For some of these situations, you might need to transpose a data set from a long format to a wide format. For extremely complicated graphs that overlay multiples density estimates on a histogram, you might need to use PROC SGRENDER and the Graphics Template Language (GTL).

If you prefer to panel (rather than overlay) density estimates for different levels of a classification variable, the SAS & R blog shows an example that uses the SGPANEL procedure.

tags: Data Analysis, Statistical Programming
12月 072011
Yesterday, December 7, 1941, a date which will live in infamy...
- Franklin D. Roosevelt

Today is the 70th anniversary of the Japanese attack on Pearl Harbor. The very next day, America declared war.

During a visit to the Smithsonian National Museum of American History, I discovered the results of a 1939 poll that shows American opinions about war at the start of the European conflict. (I could not determine from the exhibit whether the poll was taken before or after the invasion of Poland by Germany in September 1939.) Although most Americans (83%) favored the Allies, more than 50% of those surveyed supported either providing no aid to either side (25%) or selling supplies to both sides (29%).

I was also amused by the 13.5% who thought the US should fight with the allies "if they are losing."

In addition to wide range of opinions about who to support in Europe, two statistical aspects of this table jumped out at me:

  • Why did the editor print "1/10 of 1%" for the "Help Germany" category? They could have printed 0.1%, or "less than 1%," or even lumped that response into the "others" category. I think the display was an intentional ploy to emphasize how little support there was for helping Germany.
  • Why did the editor use 13.5% when the rest of the table is rounded to the nearest percent? The value seems out of place. I suppose the editor did not want to round the value up to 14%, because then the total percentage would be 100.1%. But so what?

Why do you think the editor displayed the survey results as he did? Do you think this data would be better visualized as a graph, or does the table do a better job?

tags: Data Analysis, Statistical Thinking
12月 022011

Recently the "SAS Sample of the Day" was a Knowledge Base article with an impressively long title:

Sample 42165: Using a stored process to eliminate duplicate values caused by multiple group memberships when creating a group-based, identity-driven filter in SAS® Information Map Studio

"Wow," I thought. "This is the longest title on a SAS Sample that I have ever seen!"

This got me wondering whether anyone has run statistics on the SAS Knowledge Base. It would be interesting, I thought, to see a distribution of the length of the titles, to see words that appear most frequently in titles, and so forth.

I enlisted the aid of my friend Chris Hemedinger who is no dummy at reading data into SAS. A few minutes later, Chris had assembled a SAS data set that contained the titles of roughly 2,250 SAS Samples.

The length of titles

The first statistic I looked at was the length of the title, which you can compute by using the LENGTH function. A quick call to PROC UNIVARIATE and—presto!—the analysis is complete:

proc univariate data=SampleTitles;
   var TitleLength;
   histogram TitleLength;

The table of basic statistical measures shows that the median title length is about 50 characters long, with 50% of titles falling into the range 39–67 characters. Statistically speaking, a "typical" SAS Sample has 50 characters, such as this one: "Calculating rolling sums and averages using arrays." A histogram of the title lengths indicates that the distribution has a long tail:

The shortest title is the pithy "Heat Maps," which contains only nine characters. The longest title is the mouth-filling behemoth mentioned at the beginning of this article, which tips the scales at an impressive 173 characters and crushes the nearest competitor, which has a mere 149 characters.

Frequency of words that appear most often in SAS Samples

The next task was to investigate the frequency of words in the titles. Which words appear most often? The visual result of this investigation is a Wordle word cloud, shown at the beginning of this article. (In the word cloud, capitalization matters, so Using and using both appear.) As you might have expected, SAS and PROC are used frequently, as are action words such as use/using and create/creating. Nouns such as data, variable, example, and documentation also appear frequently.

You can do a frequency analysis of the words in the titles by using the COUNTW, SCAN, and SUBSTR functions to decompose the titles into words. The following SAS code excludes certain simple words (such as "a," "the," and "to") and runs PROC FREQ to perform a frequency analysis on the words that remain. The UPCASE function is used to combine words that differ only in capitalization:

data words;
keep Word;
set SampleTitles;
length Word $20;
count = countw(title);
do i = 1 to count;
   Word = scan(title, i);
   if substr(Word,1,3)="SAS" then Word="SAS"; /* get rid of (R) symbol */
   if upcase(Word) NOT IN ("A" "THE" "TO" "WITH" "FOR" "IN" "OF"
           "AND" "FROM" "AN" "ON" "THAT" "OR" "WHEN" 
           "1" "2" "3" "4" "5" "6" "7" "8" "9")
      & Word NOT IN ("by" "By") then do;
      Word = upcase(Word);
proc freq data=words order=freq noprint;
tables Word / out=FreqOut(where=(count>=50));
ods graphics / height=1200 width=750;
proc sgplot data=FreqOut;
dot Word / response=count categoryorder=respdesc;
xaxis values=(0 to 650 by 50) grid fitpolicy=rotate;

As is often the case, the distribution of frequencies decreases quickly and then has a long tail. The graph shows the frequency counts of terms that appear in titles more than 50 times.

tags: Data Analysis, Just for Fun, Statistical Graphics
11月 182011

Halloween night was rainy, so many fewer kids knocked on the door than usual. Consequently, I'm left with a big bucket of undistributed candy.

One evening as I treated myself to a mouthful of tooth decay, I chanced to open a package of Wonka® Bottle Caps. The package contained three little soda-flavored candies, and I was surprised to find that all three were the same flavor: grape. "Hey," I exclaimed to my youngest child, "they're all the same color! What do you think are the odds of that? You want to help me check some other packages?"

"Mom! Daddy's making me work on his blog," she complained.

After assuring her that she could devour the data when I finished my analysis, she helped me open the remaining packages of Bottle Caps and tabulate the flavors of the candies within.

The flavors of Bottle Caps candies are cola, root beer, grape, cherry, and orange. As we unwrapped the sugary data, my daughter and I hypothesized that the flavors of Bottle Caps were evenly distributed.

Although the hypothesis is the obvious one to make, it is not necessarily true for other candies. There have been dozens (hundreds?) of studies about the distribution of colors in M&M® candies. In fact, many math and statistics classes count the colors of M&M candies and compare the empirical counts to official percentages that are supplied by the manufacturer. The great thing about this classic project is that the colors of M&Ms are not uniformly distributed. Last time I checked, the official proportions were 30% brown, 20% yellow, 20% red, 10% orange, 10% green, and 10% blue.

So what about the Bottle Caps? There were 101 candies in 34 packages (one package contained only two candies). If our hypothesis is correct, the expected number of each flavor is 20.2 candies. We counted 23 Cherry, 15 Cola, 21 Grape, 25 Orange, and 17 Root Beer.

I entered the data into SAS for each package. (If you'd like to do further analysis, you can download the data.) Each package is a sample from a multinomial distribution, and I want to test whether the five flavors each have the same proportion, which is 1/5. I used the FREQ procedure to analyze the distribution of flavors and to run a chi-square test for equal proportions:

proc freq data=BottleCaps;
label Flavor = "Flavor";
weight Count;
tables Flavor / nocum chisq plots=DeviationPlot;

The chi-square test gives a large p-value, so there is no statistical indication that the proportions of flavors of Bottle Caps are not uniform. All of the deviations can be attributed to random sampling.

The FREQ procedure can automatically produce plots as part of the analysis. For these data, I asked for a bar chart that shows the relative deviations of the observed frequencies (O) from the expected frequencies (E). The relative deviation is part of the chi-square computation, and is computed as (O-E)/E.

Although the Bottle Caps that I observed had fewer cola and root beer flavors than expected, the chi-square test shows that these deviations are not significant. Anyway, my favorite Bottle Caps are cherry and orange, so I think the sampling gods smiled on me during this experiment.

Conclusion? Looks like Wonka makes the same number of each flavor. Now back to my pile of sugary goodness. I wonder how many candies are in a typical box of Nerds...

tags: Data Analysis, Just for Fun
11月 042011

Being able to reshape data is a useful skill in data analysis. Most of the time you can use the TRANSPOSE procedure or the SAS DATA step to reshape your data. But the SAS/IML language can be handy, too.

I only use PROC TRANSPOSE a few times per year, so my skills never progress beyond the "beginner" stage. I always have to look up the syntax! Sometimes, when I am trying to meet a deadline, I resort to using the SAS/IML language, which I find more intuitive for reshaping data.

Recently I had data that contained two variables: a character categorical variable and a numerical variable. I wanted to reshape the data so that each level (category) of the categorical variable became a new variable, and I wanted to use the levels to name the new variables. (This is an example of converting data from a "long" description to a "wide" description.) Because the number of observations usually differs among categories, some of the new variables will have missing values.

A Canonical Example

A simple example is given by the Sashelp.Class data set. The SEX variable contains two values, F and M. There are several numerical variables; I'll use HEIGHT for this example. For these data, I want to reshape the data to create two variables named X_F and X_M, where the first variable contains the heights of the females and the second variable contains the heights of the males. The following DATA step shows one way to accomplish this:

data combo;
 keep x_F x_M;
 merge sashelp.class(where=(sex="F") rename=(height=x_F))
       sashelp.class(where=(sex="M") rename=(height=x_M));
proc print; run;

Notice that the X_F variable contains a missing value because there are only nine females in the data, whereas there are ten males.

This technique works when you know the levels of the categorical variable, which you can discover by using PROC FREQ. However, what do you do if the categorical variable has dozens or hundreds of levels? It would be tedious to have to type in the generalization of this DATA step. I'd prefer to have code that works for an arbitrary categorical variable with k levels, and that automatically forms the names of the new variables as X_C1, X_C2, ..., X_Ck where C1, C2, ..., are the unique values of the categorical variable.

Obviously, this can be done, but I had a deadline to meet. Rather than mess with SAS macro, PROC SQL, and other tools that I do not use every day, I turned to the SAS/IML language

SAS/IML to the Rescue

To reshape the data, I needed to do the following:

  1. Find the levels (unique values) of the categorical variable and count the number of observations in each level.
  2. Create the names of the new variables by appending the values of each level to the prefix "X_".
  3. Allocate a matrix large enough to hold the results.
  4. For the ith level of the categorical variable, copy the corresponding values from the continuous variable into the ith column of the matrix.

The following SAS/IML statements implement this algorithm:

proc iml;
use sashelp.class;
   read all var {sex} into C;
   read all var {height} into x;
/* TABULATE is SAS 9.3; you can also use UNIQUE + LOC */
call tabulate(u, count, C); /* 1. find unique values and count obs */
labels = "x_" + u;          /* 2. create new variable names */
y = j(max(count), ncol(count), .); /* 3. allocate result matrix */
do i = 1 to ncol(count);           /* 4. for each level... */
   y[1:count[i], i] = x[loc(C=u[i])]; /* copy values into i_th column */
print y[colname=labels];

The y matrix contains the desired values. Each column corresponds to a level of the categorical variable. Notice how the LOC function (also known as "the most useful function you've never heard of") is used to identify the observations for each category; the output from the LOC function is used to extract the corresponding values of x.

The program not only works for the Sashelp.Class data, it works in general. The TABULATE function (new in SAS 9.3) finds the unique values of the categorical variable and counts the number of observations for each value. If you do not have SAS 9.3, you can use the UNIQUE-LOC technique to obtain these values (see Section 3.3.5 of my book, Statistical Programming with SAS/IML Software). The remainder of the program is written in terms of these quantities. So, for example, if I want to reshape the MPG_CITY variable in the Sashelp.Cars data according to levels of the ORIGIN variable, all I have to do is change the first few lines of the program:

   read all var {origin} into C;
   read all var {mpg_city} into x;

Obviously, this could also be made into a macro or a SAS/IML module[REF], where the data set name, the name of the categorical variable, and the name of the numerical variable are included as parameters. It is also straightforward to support numerical categorical variables.

Yes, it can be done with Base SAS...

Although the SAS/IML solution is short, simple, and does not require any macro shenanigans, I might as well provide a generalization of the earlier DATA step program. The trick is to use PROC SQL to compute the number of levels and the unique values of the categorical variable, and to put this information into SAS macro variables, like so:

proc sql noprint;
select strip(put(count(distinct sex),8.)) into :Count 
       from sashelp.class;
select distinct sex into :C1- :C&Count
       from sashelp.class;

The macro variable Count contains the number of levels, and the macro variables C1, C2,... contain the unique values. Using these quantities, you can write a short macro loop that generalizes the KEEP and MERGE statements in the original DATA step:

/* create the string x_C1 x_C2 ... where C_i are unique values */
%macro keepvars;
   %do i=1 %to &Count; x_&&C&i %end;
/* create the various data sets to merge together, and create 
   variable names x_C1 x_C2 ... where C_i are unique values */
%macro mergevars;
   %do i=1 %to &Count; 
      sashelp.class(where=(sex="&&C&i") rename=(height=x_&&C&i))
data combo;
 keep %keepvars;
 merge %mergevars;

Again, this could be made into a macro where the data set name, the name of the categorical variable, and the name of the numerical variable are included as parameters.

I'd like to thank my colleagues Jerry and Jason for ideas that led to the formulation of the preceding macro code. My colleagues also suggested several other methods for accomplishing the same task. I invite you to post your favorite technique in the comments.

Addendum (11:00am): Several people asked why I want to do this. The reason is that some procedures do not support a classification variable (or don't handle classification variables the way I want). By using this transformation, you can create multiple variables and have a procedure operate on those. For example, the DENSITY statement in PROC SGPLOT does not support a GROUP= option, but you can use this trick to overlay the densities of subgroups.

In reponse to this post, several other techniques in Base SAS were submitted to the SAS-L discussion forum. I particularly like Nat Wooding's solution, which uses PROC TRANSPOSE.

tags: Data Analysis, Reading and Writing Data, Statistical Programming
10月 282011

"I think that my data are exponentially distributed, but how can I check?"

I get asked that question a lot. Well, not specifically that question. Sometimes the question is about the normal, lognormal, or gamma distribution. A related question is "Which distribution does my data have," which was recently discussed by John D. Cook on his blog.

Regardless of the exact phrasing, the questioner wants to know "What methods are available for checking whether a given distribution fits the data?" In SAS, I recommend the UNIVARIATE procedure. It supports three techniques that are useful for comparing the distribution of data to some common distributions: goodness-of-fit tests, overlaying a curve on a histogram of the data, and the quantile-quantile (Q-Q) plot. (Some people drop the hyphen and write "the QQ plot.")

Constructing a Q-Q Plot for any distribution

The UNIVARIATE procedure supports many common distributions, such as the normal, exponential, and gamma distributions. In SAS 9.3, the UNIVARIATE procedure supports five new distributions. They are the Gumbel distribution, the inverse Gaussian (Wald) distribution, the generalized Pareto distribution, the power function distribution, and the Rayleigh distribution.

But what if you want to check whether your data fits some distribution that is not supported by PROC UNIVARIATE? No worries, creating a Q-Q plot is easy, provided you can compute the quantile function of the theoretical distribution. The steps are as follows:

  1. Sort the data.
  2. Compute n evenly spaced points in the interval (0,1), where n is the number of data points in your sample.
  3. Compute the quantiles (inverse CDF) of the evenly spaced points.
  4. Create a scatter plot of the sorted data versus the quantiles computed in Step 3.

If the data are in a SAS/IML vector, the following statements carry out these steps:

proc iml;
y = {1.7, 1.0, 0.5, 3.5, 1.9, 0.7, 0.4, 
     5.1, 0.2, 5.6, 4.6, 2.8, 3.8, 1.4, 
     1.6, 0.9, 0.3, 0.4, 1.9, 0.5};
n = nrow(y);
call sort(y, 1); /* 1 */
v = ((1:n) - 0.375) / (n + 0.25);  /* 2 (Blom, 1958) */
q = quantile("Exponential", v, 2); /* 3 */

If you plot the data (y) against the quantiles of the exponential distribution (q), you get the following plot:

"But, Rick," you might argue, "the plotted points fall neatly along the diagonal line only because you somehow knew to use a scale parameter of 2 in Step 3. What if I don't know what parameter to use?!"

Ahh, but that is the beauty of the Q-Q plot! If you plot the data against the standardized distribution (that is, use a unit scale parameter), then the slope of the line in a Q-Q plot is an estimate of the unknown scale parameter for your data! For example, modify the previous SAS/IML statements so that the quantiles of the exponential distribution are computed as follows: q = quantile("Exponential", v); /* 3 */

The resulting Q-Q plot shows points that lie along a line with slope 2, which implies that the distribution of the data is approximately exponentially distributed with a shape parameter close to 2.

Choice of quantiles for the theoretical distribution

The Wikipedia article on Q-Q plots states, "The choice of quantiles from a theoretical distribution has occasioned much discussion." Wow, is that an understatement! Literally dozens of papers have been written on this topic. SAS uses a formula suggested by Blom (1958): (i - 3/8) / (n + 1/4), i=1,2,...,n. Another popular choice is (i-0.5)/n, or even i/(n+1). For large n, the choices are practically equivalent. See O. Thas (2010), Comparing Distributions, p. 57–59 for a discussion of various choices. In SAS, you can use the RANKADJ= and NADJ= options to accomodate different choices.

Repeating the construction by using the DATA step

These computations are simple enough to perform by using the DATA step and PROC SORT. For completeness, here is the SAS code:

data A;
input y @@;
1.7 1.0 0.5 3.5 1.9 0.7 0.4 
5.1 0.2 5.6 4.6 2.8 3.8 1.4 
1.6 0.9 0.3 0.4 1.9 0.5
proc sort data=A; by y; run; /* 1 */
data Exp;
set A nobs=nobs;
v = (_N_ - 0.375) / (nobs + 0.25); /* 2 */
q = quantile("Exponential", v, 2); /* 3 */
proc sgplot data=Exp noautolegend; /* 4 */
scatter x=q y=y;
lineparm x=0 y=0 slope=1; /* SAS 9.3 statement */
xaxis label="Exponential Quantiles" grid; 
yaxis label="Observed Data" grid;

Use PROC UNIVARIATE for Simple Q-Q Plots

Of course, for this example, I don't need to do any computations at all, since PROC UNIVARIATE supports the exponential distribution and other common distributions. The following statements compute goodness-of-fit tests, overlay a curve on the histogram, and display a Q-Q plot:

proc univariate data=A;
var y;
histogram y / exp(sigma=2); 
QQplot y / exp(theta=0 sigma=2);

However, if you think your data are distributed according to some distribution that is not built into PROC UNIVARIATE, the techniques in this article show how to construct a Q-Q plot to help you assess whether some "named" distribution might model your data.

tags: Data Analysis, Statistical Programming