data analysis

4月 152011
In a previous blog post, I showed how you can use simulation to construct confidence intervals for ranks. This idea (from a paper by E. Marshall and D. Spiegelhalter), enables you to display a graph that compares the performance of several institutions, where "institutions" can mean schools, companies, airlines, or any set of similar groups.

A more recent paper (Spiegelhalter, 2004) recommends doing away with individual confidence intervals altogether. Instead, Spiegelhalter recommends plotting the means of each group versus "an interpretable measure of its precision" and overlaying "control limits" similar to those found on a Shewhart control chart.

This article shows how to create a funnel plot in SAS. You can download all of the data and the SAS program used in this analysis.

Spiegelhalter lays out four steps for creating the funnel plot that displays the mean of a continuous response for multiple groups (see Appendix A.3.1):

  1. Compute the mean of each category.
  2. Compute the overall mean, y and standard deviation, s.
  3. Compute control limits y ± zα * s / sqrt(n) where zα is the α quantile of the normal distribution and n varies between the number of observations in the smallest group and the number in the largest group.
  4. Plot the mean of each category against the sample size and overlay the control limits.

The Temperature of Car Roofs in a Parking Lot

The data for this example are from an experiment by Clark Andersen in which he measured the temperature of car roofs in a parking lot. He observed that black and burgundy cars had the hottest roofs, whereas white and silver cars were cooler to touch.

Andersen measured the roof temperatures of 52 cars on a mild (71 degree F) day. The cars were of nine colors and 4–8 measurements were recorded for each color. You could rank the car colors by the mean temperature of the roof (remember to include confidence intervals!) but an alternative display is to plot the mean temperature for each car color versus the number of cars with that color. You can then overlay funnel-shaped "control limits" on the chart.

The funnel plot is shown below. (Click to enlarge.) The remainder of this post shows how to create the graph.

Computing the Mean of Each Category

The following statements read the data into SAS/IML vectors:

proc iml;
use CarTemps;
   read all var {Color Temperature};
close CarTemps;

You can use the UNIQUE/LOC technique to compute the sample size and mean temperature for each color category. The following technique is described in Section 3.3.5 of Statistical Programming with SAS/IML Software:

/** for each car color, compute the mean and
    standard error of the mean **/
u = unique(Color); /** unique colors **/
p = ncol(u); /** how many colors? **/
mean = j(p,1); sem = j(p,1); n = j(p,1);
do i = 1 to p;
   idx = loc(Color=u[i]);
   n[i] = ncol(idx);
   T = Temperature[idx]; 
   mean[i] = mean(T); /** mean temp of category **/
   sem[i] = sqrt(var(T)/n[i]); /** stderr **/

The following table summarizes the data by ranking the mean temperatures for each color. Although the standard errors are not used in the funnel plot, they are included here so that you can see that many of the confidence intervals overlap. You could use the SGPLOT procedure to display these means along with 95% confidence intervals, but that is not shown here.

   color            n  mean  sem

   black            8 137.3  3.6
   burgundy         4 133.9  4.6
   green            4 130.9  7.3
   gray             6 130.5  2.5
   blue             7 129.3  3.9
   red              6 128.5  2.9
   tan              4 116.1  5.1
   silver           6 107.9  3.2
   white            7  98.4  1.8

Those car roofs get pretty hot! Are the black cars much hotter than the average? What about the burgundy cars? Notice that the sample means of the burgundy and green cars are based on only four observations, so the uncertainty in those estimates are greater than for cars with more measurements. The funnel plot displays both the mean temperature and the precision in a single graph.

Computing the Overall Mean and Standard Deviation

The funnel plot compares the group means to the overall mean. The following statements compute the overall mean and standard deviation of the data, ignoring colors:

/** compute overall mean and variance **/
y = Temperature[:];
s = sqrt(var(Temperature));
print y s[label="StdDev"];





The overall mean temperature is 123 degrees F, with a standard deviation of 16 degrees. The VAR function is available in SAS/IML 9.22. If you are using an earlier version of SAS/IML, you can use the VAR module from my book.

Computing Control Limits

A funnel plot is effective because it explicitly reveals a source of variability for the means, namely the different sample sizes. The following statements compute the control limits for these data:

/** confidence limits for a range of sample sizes **/
n = T( do(3, 8.5, 0.1) );
p = {0.001 0.025 0.975 0.999}; /** lower/upper limits **/
z = quantile("normal", p);
/** compute 56 x 4 matrix that contains confidence limits
    for n = 3 to 8.5 by 0.1 **/
limits = y + s / sqrt(n) * z;

Notice that the limits variable is a matrix. The expression s/sqrt(n) is a column vector with 56 rows, whereas the row vector z contains four z-scores. Therefore the (outer) product is a 56x4 matrix. The values in this matrix are used to overlay control limits on a plot of mean versus the sample size.

Creating a Funnel Plot

After writing the SAS/IML computations to a data set (not shown here), you can use PROC SGPLOT to display the mean temperature for each car color and use the BAND statement to overlay 95% and 99.8% control limits.

title "Temperatures of Car Roofs";
title2 "71 Degrees in the Shade";
proc sgplot data=All;
   scatter x=N y=Mean /datalabel=Color;
   refline 123.4 / axis=y;
   band x=N lower=L95 upper=U95 / nofill;
   band x=N lower=L998 upper=U998 / nofill;
   xaxis label="Number of Cars";
   yaxis label="Average Temperature";

The funnel plot indicates that black cars are hotter than average. Silver and white cars are cooler than average. More precisely, the plot shows that the mean temperature of the black cars exceeds the 95% prediction limits, which indicates that the mean is greater than would be expected by random variation from the overall mean. Similarly, the mean temperature of the silver cars is lower than the 95% prediction limits. The mean temperature of the white cars is even more extreme: it is beyond the 99.8% prediction limits.


The funnel plot is a simple way to compare group means to the overall average. The funnel-shaped curves are readily explainable to a non-statistician and the plot enables you to compare different groups without having to rank them. The eye is naturally drawn to observations that are outside the funnel curves, which is good because these observations often warrant special investigation.

More advantages and some limitations are described in Spiegelhalter's paper. He also shows how to construct the control limits for funnel plots for proportions, ratios of proportions, odd ratios, and other situations.

Two criticisms of my presentation come to mind. The first is that I've illustrated the funnel plot by using groups that have a small number of observations. In order to make the control limits look like funnels, I used a "continuous" value of n, even though there is no such thing as a group with 4.5 or 5.8 observations! Spiegelhalter's main application is displaying the mean mortality rates of hospitals that deal with hundreds or thousands of patients, so the criticism does not apply to his examples.

The second criticism is that I have not adjusted the control limits for multiple comparisons. I am doing nine comparisons of individual means to the overall mean, but the limits are based on the assumption that I'm making a single comparison. Spiegelhalter notes (p. 1196) that "the only allowance for multiple comparisons is the use of small p-values for the control limits. These could be chosen based on some formal criterion such as Bonferroni..., but we suggest that this is best carried out separately." Following Spiegelhalter's suggestion, I leave that adjustment for another blog post.

4月 082011
At the beginning of 2011, I heard about the Dow Piano, which was created by The Dow Piano visualizes the performance of the Dow Jones industrial average in 2010 with a line plot, but also adds an auditory component. As Bård Edlund, Art Director at, said,
The daily trading of the Dow Jones industrial average determines the songwriting here, translating the ups and downs of 2010's market into musical notes. Using a five-note scale spanning three octaves, pitch is determined by each day's closing level.

When I first saw the Dow Piano, I immediately thought, "I can do that in SAS/IML Studio by using the SOUND call in the SAS/IML language!"

To be fair, I can't fully duplicate the Dow Piano in SAS/IML software. The SOUND call in SAS is very simple: it only provides pitch and duration. The music for the Dow Piano is more complex:

  • It uses piano notes, which involves sound characteristics such as attack and decay.
  • It uses dynamics to represent trading volume: high-volume days are represented by loud notes, low-volume days by soft notes.
  • It overlays a percussion track so that you hear a percussive beat on days that the stock market is closed.
Nevertheless, I think the SAS/IML version captures the main idea by allowing you to listen to the performance of the Dow Jones industrial average. (Get it? The performance?)

Will this kind of visualization (auditorization? sonification?) catch on? Time will tell, but for these data, the sound doesn't add any new information; it merely uses the sense of sound to repeat information that is already in the line plot. In fact, the sound present less information than the line plot: there are more than 250 closing values of the Dow Jones industrial average, but the Dow Piano collapses these values into 15 notes (three octave of a five note scale).

But enough analysis! Listen to the Way of the Dow. The demo starts at 0:45, after I introduce the problem.

4月 072011
In a previous blog post about computing confidence intervals for rankings, I inadvertently used the VAR function in SAS/IML 9.22, without providing equivalent functionality for those readers who are running an earlier version of SAS/IML software. (Thanks to Eric for pointing this out.)

The following module computes the variance of each column of a matrix, and correctly handles missing values in the data:

/** Var: return sample variance of each column of a data matrix **/
/** For this module
  x is a matrix that contains the data
start Var(x);
   mean = x[:,];
   countn = j(1, ncol(x)); /** allocate vector for counts **/
   do i = 1 to ncol(x); /** count nonmissing values **/
      countn[i] = sum(x[,i]^=.); /** in each column **/
   var = (x-mean)[##,] / (countn-1);
return ( var );

This module appears in Appendix D of my book, Statistical Programming with SAS/IML Software.

4月 062011
When comparing scores from different subjects, it is often useful to rank the subjects. A rank is the order of a subject when the associated score is listed in ascending order.

I've written a few articles about the importance of including confidence intervals when you display rankings, but I haven't blogged about the basics of computing ranks. In Base SAS you can use the RANK procedure, but this article focuses on how to compute ranks and other related quantities in SAS/IML software.

Ranking Values in SAS/IML Software

When the values are in a SAS/IML vector, you can use the RANK function to assign ranks. The RANK function assigns 1 to the lowest score, 2 to the second lowest, and so on. Ties are broken arbitrarily. For example, the following statements rank the competitors in the semifinal heat of a swimming event at the 2008 Olympics according to their times (in seconds):

proc iml;
/** Men's 100m butterfly, semifinal heat **/
name = {Crocker Pini Lauterstein Cavic 
        Phelps Dunford Serdinov Fujii};
times = {51.27 51.62 51.27 50.92 
         50.97 51.33 51.41 51.59};
r = rank(times);
print r;










The output tells you that the first competitor (Crocker) is ranked third, the second competitor (Pini) finished eighth, and so one. In particular, the first-place finisher was the fourth competitor in the list (Cavic) and second-place was the fifth competitor (Phelps).

Notice that the times for Crocker and Lauterstein were the same, but one was ranked third and the other fourth, which shows you that the RANK function breaks ties arbitrarily. If you want to include ties in your ranking, you can use the RANKTIE function.

The RANK function returns a matrix that is the same shape as the input matrix. This means that you can compute the ranks for column vectors, row vectors, and matrices.

Reordering by Rank

You can use the rank to order the competitors by their times. To do this, make a copy of the data and use the ranks, r, to index the data on the left hand side of the equal sign:

order = name; /** copy data **/
order[,r] = name; /** order by rank **/
print order;








This syntax might seem strange. Why do you need to index the data on the left hand side? The simple answer is "because that's how ranks work." A rank is different from a sorting index such as is produced by the SORTNDX subroutine. As I showed in a previous post, when you use a sorting index to sort a matrix, you use the index on the right hand side of the equal sign. For this reason, the sorting index is sometimes called the anti-rank. The following statements compute the anti-rank from the rank vector, and by using the SORTNDX subroutine:

/** compute anti-rank from rank **/
ar = r; 
ar[,r] = 1:ncol(r); /** permute indices **/

/** SORTNDX operates on columns! **/
call sortndx(idx, times`, 1);

The two vectors, ar and idx, contain the same values in the same order, but ar is a row vector whereas idx is a column vector. (The SORTNDX function always operates on the columns of the input data and always returns a column vector.) You can use either of these vectors as indices on the right side of the equal sign to reorder the data:

order2 = name[idx];

Rank Values in Descending Order

Here's one last tip. The RANK function ranks data in ascending order, which is good for sports such as racing and golf in which the lowest score wins. However, in other sports, teams with higher scores are ranked ahead of teams with lower scores. Thus, you need to be able to rank data in descending order.

If there are N teams and r is the result of the RANK function, you can flip the order of the ranks by using the following statements:

N = ncol(r); /** or nrow(r) **/
RankDesc = N + 1 - r;

Alternatively, you can call the RANK function and pass in the negative of the scores. For example, to rank the Olympic swimmers in descending order:

RankDesc = rank( -times );

3月 312011
This week, I posted the 100th article to The DO Loop. To celebrate, I'm going to analyze the content of my first 100 articles.

In December 2010 I compiled a list of The DO Loop's most-read posts, so I won't repeat that exercise. Instead, I thought it would be interesting to analyze what I've said in my blog by looking at the frequency of words over the first 100 posts.

I concatenated the first 100 posts into a single file, and I used SAS software (specifically, PROC FREQ) to analyze the most frequent terms. Then I created a Wordle word cloud, which uses size to represent the frequency of occurrence of the top terms.

Click to enlarge

I like this word cloud because it summarizes my objectives for The DO Loop:

  • Data is central to my blog.
  • Data is surrounded by terms related to statistical programming: matrix, function, SAS, SAS/IML, program, and compute.
  • Surrounding those terms are keywords in the SAS/IML language: USE, READ, DO, IF/THEN, PRINT, and CALL.
  • Interspersed throughout the cloud are terms related to matrix computations: row, column, values, and vector.
  • Statistical terms are also interspersed: random, variables, probability, distribution, mean, and covariance.
The word cloud interlaces these different objectives, just like I attempt to write articles that interleave statistics, data analysis, and programming.

In that spirit, I can't think of a better way to celebrate Post #101 than to write a SAS program that uses the SGPLOT procedure to visualize the general topics of my blog posts. (Counts do not add to 100 because some articles belong to multiple categories.)

title "The DO Loop: Top Categories";
proc sgplot data=Categories;
   hbar Category / freq=Freq;
   xaxis grid integer;
   yaxis discreteorder=data;

Thanks to my colleagues, Alison, Anne, and Chris, for their support of this blog. Thanks to all of my readers for your patronage, comments, and encouragement. I look forward to writing the next 100 articles, and I hope you look forward to reading them.

3月 302011
In a previous post, I described how to compute means and standard errors for data that I want to rank. The example data (which are available for download) are mean daily delays for 20 US airlines in 2007.

The previous post carried out steps 1 and 2 of the method for computing confidence intervals for rankings, as suggested in Marshall and Spiegelhalter (1998):

  1. Compute the mean delay for each airline in 2007.
  2. Compute the standard error for each mean.
  3. Use a simulation to compute the confidence intervals for the ranking.

This post carries out step 3 of the method.

Ordering Airlines by Mean Delays

To recap, here is the SAS/IML program so far. The following statements compute the mean delays and standard errors for each airline in 2007. The airlines are then reordered by their mean delays:

proc iml;
/** read mean delay for each day in 2007
    for each airline carrier **/
Carrier = {AA AQ AS B6 CO DL EV F9 FL HA
           MQ NW OH OO PE UA US WN XE YV};
use dir.mvts; 
read all var Carrier into x;
close dir.mvts;
meanDelay = x[:,];

/** sort descending; reorder columns **/
call sortndx(idx, MeanDelay`, 1, 1);
Carrier = Carrier[,idx];
x = x[,idx];
meanDelay = meanDelay[,idx];

Ranking Airlines...with Confidence

You can use the simulation method suggested by Marshall and Spiegelhalter to compute confidence intervals for the rankings. The following statements sample 10,000 times from normal distributions. (I've previously written about how to write an efficient simulation in SAS/IML software.) Each of these samples represents a potential realization of the underlying random variables:

/** simulation of true ranks, according to the method 
    proposed by Marshall and Spiegelhalter (1998) 
    British Med. J. **/
/** parameters for normal distribution of means **/
mean = meanDelay`;
sem = sqrt(var(x)/nrow(x)); /** std error of mean **/

p = ncol(Carrier);
NSim = 10000;
s = j(NSim, p); /** allocate matrix for results **/
z = j(NSim, 1);
call randseed(1234);
do j = 1 to p;
   call randgen(z, "Normal", mean[j], sem[j]);
   s[,j] = z;

The matrix s contains 10,000 rows and 20 columns. Each row is a potential set of means for the airlines, where the ith mean is chosen from a normal distribution with population parameters μi and σi. Because the population parameters are unknown, Marshall and Spiegelhalter advocate replacing the population means with the sample means, and the population standard deviations by the sample standard errors. This kind of simulation is similar to a parametric bootstrap.

The following statements use the SAS/IML RANK function to compute the rank (1–20) for each simulated set of means. (If you need to conserve memory, you can re-use the s matrix to store the ranks.)

/** for each sample, determine the rankings **/
r = j(NSim, p);
do i = 1 to NSim;
   r[i,] = rank(s[i,]);

Each column of the r matrix contains simulated ranks for a single airline. You can use the QNTL subroutine to compute 95% confidence intervals for the rank. Basically, you just chop off the lower and upper 2.5% of the simulated ranks. (If you have not upgraded to SAS/IML 9.22, you can use the QNTL module for this step.)

/** find the 95% CI of the rankings **/
call qntl(q, r, {0.025 0.975});

The simulation is complete. The matrix q contains two rows and 20 columns. The first row contains the lower end of the 95% confidence limits for the ranks; the second row contains the upper limit.

You can print q to see the results in tabular form. I prefer graphical displays, so the following statements create a SAS data set that contains the estimated ranks and the upper and lower confidence limits.

d = rank(mean) || T(q);
varNames = {"Rank" "LCL" "UCL"};
create Rank from d[r=Carrier c=varNames];
append from d[r=Carrier];
close Rank;

It is now straightforward to use SCATTER statement in PROC SGPLOT to create a scatter plot with confidence intervals (called "error bars" in some disciplines):

proc sgplot data=Rank noautolegend;
title "Ranking Airlines (On-Time Performance, 2007)";
scatter y=carrier x=Rank /
xaxis label="Rank with 95% Confidence Intervals";
yaxis grid discreteorder=data label="Carrier"; 

For these data, the ranking of the first three airlines (AQ, HA, WN) seems fairly certain, but there is more uncertainty for the rankings of airlines that are in the middle of the pack. So, for example, if Frontier Airlines (F9) issues a press release about their 2007 on-time performance, it shouldn't claim "We're #5!" Rather, it should state "We're #5 (maybe), but we might be as high as #4 or as low as #8."

In keeping with my New Year's Resolutions, I've learned something new. Next time I want to rank a list, I will pay more attention to uncertainty in the data. Plotting confidence intervals for the rankings is one way to do that.

3月 252011
I recently posted an article about representing uncertainty in rankings on the blog of the ASA Section for Statistical Programmers and Analysts (SSPA). The posting discusses the importance of including confidence intervals or other indicators of uncertainty when you display rankings.

Today's article complements the SSPA post by showing how to construct the confidence intervals in SAS. This article implements the techniques in Marshall and Spiegelhalter (1998), Brit. Med. J., which is a very well written and readable paper. (If you prefer sports to medicine, Spiegelhalter wrote a similar article on ranking teams in the English Premier Football League.)

To illustrate the ideas, consider the problem of ranking US airlines on the basis of their average on-time performance. The data for this example are described in the Statistical Computing and Graphics Newsletter (Dec. 2009) and in my 2010 SAS Global Forum paper. The data are available for download.

The data are in a SAS data set named MVTS that consists of 365 observations and 21 variables. The Date variable records the day in 2007 for which the data were recorded. The other 20 variables contain the mean delays, in minutes, experienced by an airline carrier for that day’s flights. The airline variables in the data are named by using two-digit airline carrier codes. For example, AA is the carrier code for American Airlines.

The appendix of Marshall and Spiegelhalter's paper describes how to compute rankings with confidence intervals:

  1. Compute the mean delay for each airline in 2007.
  2. Compute the standard error for each mean.
  3. Use a simulation to compute the confidence intervals for the ranking.

Ordering Airlines by Mean Delays

You can use PROC MEANS or PROC IML to compute the means and standard errors for each carrier. Because simulation will be used in Step 3 to compute confidence intervals for the rankings, PROC IML is the natural choice for this analysis. The following statements compute the mean delay for all airlines (sometimes called the "grand mean") and the mean delays for each airline in 2007. The airlines are then sorted by their mean delays, as described in a previous article about ordering the columns of a matrix.

proc iml;
/** read mean delay for each day in 2007
    for each airline carrier **/
Carrier = {AA AQ AS B6 CO DL EV F9 FL HA
           MQ NW OH OO PE UA US WN XE YV};
use dir.mvts; 
read all var Carrier into x;
close dir.mvts;

grandMean = x[:];
meanDelay = x[:,];
call sortndx(idx, MeanDelay`, 1);

/** reorder columns **/
Carrier = Carrier[,idx];
x = x[,idx];
meanDelay = meanDelay[,idx];
print grandMean, carrier;





You can use these computations to reorder the variables in the data set (this step is not shown) and use PROC SGPLOT to plot the means and standard errors for the each airline carrier:

/** convert data from wide to long format**/
proc transpose data=mvts 
   out=Delay(rename=(col1=Delay)) name=Carrier;
by date;

title "Comparison of Airline Daily Delays (2007)";
proc sgplot data=delay noautolegend;
  dot carrier / response=delay stat=mean limitstat=clm;
  refline 9.646 / axis=x lineattrs=(pattern=dash);
  xaxis label="Mean Delay with 95% Confidence Intervals";
  yaxis discreteorder=data label="Carrier"; 

Notice that two airlines (Aloha Airlines (AQ) and Hawaiian Airlines (HA)) have much better on-time performance than the others. In fact, their average daily delay is negative, which means that they typically land a minute or two ahead of schedule!

Many analysts would end the analysis here by assigning ranks based on the mean statistics. But as I point out on my SSPA blog, it is important for rankings to reflect the uncertainty in the mean estimates.

Notice that for most airlines, the confidence interval overlaps with the confidence intervals of one or more other airlines. The "true mean" for each airline is unknown, but probably lies within the confidence intervals.

So, for example Frontier Airlines (F9) is nominally ranked #5. But you can see from the overlapping confidence intervals that there is a chance that the true mean for Frontier Airlines is lower than the true mean of Delta Airlines (DL), which would make Frontier ranked #4. However, there is also a chance that the true mean could be less than the true mean of Continental Airlines (CO), which would make Frontier ranked #11. How can you rank the airlines so that that uncertainty is reflected in the ranking?

The answer is in Step 3 of Marshall and Spiegelhalter's approach: use a simulation to compute the confidence intervals for the ranking. This step will appear in a second article, which I will post next week.

3月 232011
When you think of statistical process control, or SPC for short, what industry first comes to your mind? In the past 10 or 15 years, diverse industries have begun to standardize processes and administrative tasks with statistical process control. While the top two bars of the industrial Pareto chart are probably still manufacturing and machine maintenance, in North America, SPC is being used in far more types of work than many people realize.

One reason that researchers and process managers turn to Walter Shewhart’s trusty techniques to distinguish between common cause and special cause variation—in other words, variation that represents unimportant bumps on the road and variation that means that the wheels have fallen off the bus—is that they work far better than gut instinct in many cases. Furthermore, they are defensible, because they are based on sound, scientific practice. What might surprise you is the extent to which social science, human subject research, and administrative fields are making use of SPC.

1. Health Care. This is the area I hear the most buzz about for SPC. A great deal of interesting work is being done in monitoring process variables such as late payments, the number of available beds, payments to providers, script-writing rates for high-risk drugs, pain levels during recovery from surgery, and so on. I can only scratch the surface here. The writing is on the wall about health care becoming still more expensive, data are becoming more and more plentiful, and it is especially easy for problems to hide in massive data. Unpredictable problems = cost. Process control = savings.

2. Customer Service How long did you have to wait in line the last time you were in a big box store? When you last called the cable company? Some companies recognize that if customer service is slow, you will take your business elsewhere. Some are even willing to take action on that knowledge. There are plenty of measureable service quality characteristics that can be tapped into to identify times of day or product lines that are inconsistent, which translates to a better customer experience, and to customer loyalty.

3. Survey Research This is the one I’m most excited about right now. SPC interest in survey research has been on the rise for the past 5 or so years, and I think it’s an area ripe for this sort of analysis. If the news tells you that 52% of Americans who eat raw oysters only eat them to be polite, someone had to get that information. Enter the heroes of the survey world. Survey researchers, the people behind all the Things We Know about How We Are, are applying statistical process control methods with variables related to survey data collection, such as interview length, nonresponse rate, interviewer driving distances, and so on.

Continue reading "The Human Side of Statistical Process Control: Three Applications of SAS/QC You Might Not Have Thought About"
3月 042011
Do you have many points in your scatter plots that overlap each other? If so, your graph exhibits overplotting. Overplotting occurs when many points have similar coordinates. For example, the following scatter plot (which is produced by using the ODS statistical graphics procedure, SGPLOT) displays 12,000 points, many of which are plotted on top of each other.

/** 12,000 points; notice the overplotting **/
proc sgplot data=A;
scatter x=x y=y / markerattrs=(symbol=CircleFilled);

There are several ways to deal with overplotting. When the overplotting is caused because the data are rounded to a convenient unit (such as plot of height versus age), some analysts use jittering to reduce the overplotting. When the overplotting is caused by having too many data points, as in the previous scatter plot, you have to use an alternate approach.

One of the techniques I use is the "transparency trick" for visualizing scatter plots that suffer from a large amount of overplotting.

When a colleague recently came to me with a scatter plot that was dense with points, I suggested that he try the TRANSPARENCY= option in the SGPLOT procedure. This option enables you to specify a value between 0 and 1 that determines the transparency of the markers that are used in the scatter plot. A value of 0 indicates that all markers are opaque. As you increase the transparency factor, singleton markers fade away, whereas areas of the plot that have a high density of overlapping points remain dark.

That one option makes a huge difference in the plot and enables you to visualize areas that have a high density of points. For example, the following statements display a scatter plot of the same data, but the markers are highly transparent:

/** use transparency **/
proc sgplot data=A;
scatter x=x y=y / markerattrs=(symbol=CircleFilled) 

The new scatter plot makes it easy to see that there is a high density of points near -3 and +3 on the horizontal axis. Consequently, a scatter plot with a large transparency factor is sometimes called the "poor man's density estimator."

Of course, there is no real reason to use transparency to emulate density estimation, when there is a real density estimation procedure in SAS/STAT software. If you want to compute the density of the data, use the KDE procedure and request a contour plot of the data, as shown below:

proc kde data=A;
   bivar x y / plots=(contour);

2月 252011
I was inspired by Chris Hemedinger's blog posts about his daughter's science fair project. Explaining statistics to a pre-teenager can be a humbling experience.

My 11-year-old son likes science. He recently set about trying to measure which of three projectile launchers is the most accurate. I think he wanted to determine their accuracy so that he would know which launcher to use when he wanted to attack his little sister, but I didn't ask. Sometimes it is better not to know.

He and his sister created a bullseye target. They then launched six projectiles from each of the three launchers and recorded the location of the projectile relative to the center of the target. In order to make the experiment repeatable, my kids aligned each launcher with the target and then tied it down so that it would not move during the six shots.

For each launcher, they averaged the six distances to the center. The launcher with the smallest average distance was declared to be The Most Accurate Launcher in the World.

In the accompanying figure, the red crosses represent the winning launcher. The average distance to the center is 5.2 centimeters, whereas the black squares have an average distance of 6.7 cm and the blue circles have an average distance of 13.1 cm. Three of the black squares have the same location and are marked by "(3)."

When I saw the target on which they had recorded their results, I had a hard time containing the teacher in me. This target shows two fundamental statistical concepts: bias and scatter.

Bias Explained

When I look at the target, the first characteristic I see is bias due to a misalignment of the launchers. Two of the three launchers are aimed too high relative to the bullseye. This kind of experimental error is sometimes known as systematic bias because it leads to measurements that are systematically too high or too low.

The second figure shows the average locations of projectiles for each launcher. The average location of the red crosses (marked "C2") is the closest to the center of the target. The next closest is the center of the black squares, which is marked "C1." The fact that "C1" and "C3" are far away from the center of the target shows the bias in the experiment.

I pointed this out to my son, who shrugged. "I guess the launchers weren't perfectly lined up," he said.

Thinking that perhaps he might want to realign the launchers and re-run the experiment, I asked him what he intended to do about this bias.

He hesitated. I could see the wheels turning in his mind.

"Aim low?" he guessed.

Scatter Explained

The second feature of the data that I see is the scatter of the projectiles. The way that the projectiles scatter is measured by a statistical term known as the covariance. The covariance measures the spread in the horizontal direction, the vertical direction, and jointly along both directions.

I pointed out to my son that the red crosses tend to spread out in the up-down direction much more than they do in the left-right direction. In contrast, the black squares spread in both directions, but, more importantly, the directions tend to vary together. In four out of five cases, a shot that was high was also left of center. This is an example of correlation and covariance.

I asked my son to describe the spread of the blue circles. "Well," he cautiously began, "they vary a little bit up and down, but they are all over the place left-to-right."

"And do the shot locations tend to vary together?" I prompted.

"Um, not really," he replied.

Bingo! Covariance explained. All of the launchers show up-and-down and left-and-right variability, but the black squares also exhibit co-variability: the deviations in the horizontal direction are correlated with the deviations in the vertical direction.

For Adults Only

My son ran off to attack his sister, grateful to escape my merciless questioning.

I carefully measured the locations of the data, created a SAS DATA step, and began to analyze the data in SAS/IML Studio.

The third figure shows the center of each group of projectiles. I have also added ellipses that help to visualize the way that the projectiles scatter. (Each ellipse is a 75% prediction ellipse for bivariate normal data with covariance equal to the sample covariance.)

The ellipses make it easier to see that locations of the red crosses and blue circles do not vary together. The red ellipse is tall and skinny, which represents a small variation in the horizontal direction and a medium variation in the vertical direction. The blue ellipse is very wide, representing a large variation in the horizontal direction. The black ellipse is tilted, which shows the correlation between the horizontal and vertical directions.

The programming techniques that are used to draw the figures are described in Chapters 9 and 10 of my book, Statistical Programming with SAS/IML Software.