data analysis

5月 132011
 
For years I've been making presentations about SAS/IML software at conferences. Since 2008, I've always mentioned to SAS customers that they can call R from within SAS/IML software. (This feature was introduced in SAS/IML Studio 3.2 and was added to the IML procedure in SAS/IML 9.22.) I also included a chapter on calling R in my book, Statistical Programming with SAS/IML Software.

However, I've never blogged about it until today. Why? Frankly, I don't usually have a reason to call R from SAS/IML software. Both R and SAS/IML are high-level languages with a rich run-time library of functions. They both enable you to extend the language by writing and sharing user-defined functions. They both enable you to use matrix computations to compactly represent and solve statistical problems and to analyze data. I use SAS/IML software for all of my day-to-day computational needs.

However, sometimes I hear about a technique that someone has implemented in R and I want to try that technique too. If I'm going to use the technique many times, I'll consider implementing it myself in the SAS/IML language. However, if I'm only going to use it once (or I'm not sure how often I'll use it), I'll save myself time and call R.

Earlier this week, I showed some plots of airline routes that a colleague created in SAS, which are based on similar plots (created in R) that appeared on the Flowing Data blog. In my blog post I said:

I don't have the time right now to implement a great circle algorithm in SAS/IML. ... I could also generate the great arcs by using the same R package that Flowing Data uses.

Basically, I was being lazy. However, after Flowing Data posted the R code to plot great arcs, I no longer had an excuse not to use great arcs. This is a situation in which calling R will save me time: I want to compute arcs of great circles, I don't know of a comparable function already written in SAS, and I'll rarely use this functionality in the future. So, I wrote a short SAS/IML program that calls an R function to compute the arcs.

Read on if you want to learn about how to call R from SAS. If you just want to see the final result, here is it (click to enlarge):

How to Call R from a SAS/IML Program

To call R, install R 2.11 or earlier1 on the same computer that runs SAS and install the R package that contains the function you want. The function that I want to call is the gcIntermediate function in the geosphere package, so I installed the package as described on the Flowing Data blog.

In general, a SAS/IML program that calls R contains four steps:

  1. Transfer data from a SAS data set or a SAS/IML matrix into a similar data structure in R.
  2. Call R by using the SUBMIT statement with the R option.
  3. Transfer results from R into a SAS data set or a SAS/IML matrix.
  4. Use the results in SAS.
I'll discuss each step in turn.

Step 1: Transfer Data from SAS to R

For ease of presentation, assume that there is a data set called DeltaFlights that contains airline routes for Delta airlines. (The general case of multiple airlines is handled similarly.) The data contains the following variables:

  • Origin_Long and Origin_Lat contain the longitude and latitude of the origin airport.
  • Dest_Long and Dest_Lat contain the longitude and latitude of the destination airport.
I can use the ExportDataSetToR subroutine to create an R data frame from a SAS data set, or I can use the ExportMatrixToR subroutine to transfer data from a SAS/IML matrix into an R matrix.

Because I like to work in the SAS/IML environment, I'll choose the second option. The following statements read the data into SAS/IML vectors or matrices:

/** requires SAS/IML 9.22 or SAS/IML Studio 3.2 **/ 
libname MyLib "C:\Users\...\MyData";
proc iml;
use MyLib.DeltaFlights; /** 376 routes **/
  read all var {origin_long origin_lat} into mOrig;
  read all var {dest_long dest_lat} into mDest;
close MyLib.DeltaFlights;

The matrices mOrig and mDest contain 376 rows and two columns. The following statements transfer data from the two matrices into R matrices of the same dimensions:

/** copy SAS/IML matrices to R **/
run ExportMatrixToR(mOrig, "orig");
run ExportMatrixToR(mDest, "dest");

The result is two R matrices named orig and dest. Each row of orig and each row of dest contains the longitude and latitude of an airport. (The statements also start R if it is not already running.)

Step 2: Call R to Generate the Great Arcs

The call to R is straightforward:

/** get points on great arc between airports **/
submit / R;
library(geosphere)
dist <- gcIntermediate(orig, dest)
endsubmit;

The resulting R object, named dist, is a list of 376 matrices. Each matrix has 50 rows and two columns. The ith matrix represents the longitude and latitude of 50 points along a great arc that connects the ith row of orig and the ith row of dest.

Step 3: Transfer the Results

You can use the ImportMatrixFromR subroutine to copy the data from dist into a SAS/IML matrix named distance:

/** get arcs back from R **/
run ImportMatrixFromR(distance, "dist");

The distance matrix has 50 rows and 2 x 376 columns. The first two columns correspond to dist[[1]], the next two to dist[[2]], and so forth.

Step 4: Use the Results in SAS

These results are intended to be overlaid on a map. To visualize the flight paths in SAS/IML Studio, I can use the following IMLPlus statements, which are similar to the mapping examples in my 2008 SAS Global Forum Paper:

/** create map in SAS/IML Studio **/
a = shape(distance[1,], 0, 2); /** airports **/
declare ScatterPlot p;
p = ScatterPlot.Create("flights", a[,1], a[,2]);
p.DrawUseDataCoordinates();
p.DrawSetPenColor(GREY);
do i = 1 to ncol(distance)/2;
   p.DrawLine(distance[,2*i-1], distance[,2*i]);
end;

/** draw map in background (see SAS Global Forum
    Paper) then zoom in on US **/


However, this particular computation was for my colleague, Robert Allison, who uses SAS/GRAPH software to visualize the flight paths. Therefore, I wrote the arcs to a SAS data set and let him use PROC GMAP with the ANNOTATE= option to create the image seen earlier in this article.


1. As I've said elsewhere, R changed its directory structure between versions 2.11 and 2.12. Consequently, SAS 9.22 (which shipped before R 2.12 was released) looks for certain DLLs in directories that no longer exist. The workaround is to use R 2.11 with SAS 9.22. This workaround is not required for SAS 9.3.
5月 112011
 
When Charlie H. posted an interesting article titled Top 10 most powerful functions for PROC SQL, there was one item on his list that was unfamiliar: the COALESCE function.

Ever since I posted my first response, "SAS/IML Operations Compared with PROC SQL: Part I," I had been dreading writing an article on the COALESCE function. Why? Because SAS/IML does not have a COALESCE function, so I assumed I'd have to write a SAS/IML module that mimics the functionality. Although writing a module is easy, I wasn't sure what the COALESCE function did, so I would have to do research.

However, a recent Tip of the Day by Mike Zdeb at SASCommunity.org reminded me of a programming tip that I always recommend to others: you can call Base SAS functions from the SAS/IML language. In particular, Zdeb showed me that the DATA step supports the COALESCE function. My problem was solved!

The COALESCE Function

The COALSCE function returns the first nonmissing value in a list of variables. Why this is useful is best illustrated with an example. Suppose that you have a data set that contains information about contact information for students. Each row represents a student. Variables include a column for the mother's cell-phone number, the father's cell-phone number, the phone number at home, and various work numbers. Some of these phone numbers might be missing. The school administrator wants you to write a program that creates a new column that contains one of the phone numbers, based on the following algorithm:

  1. If the mother's cell-phone number is not missing use it.
  2. Otherwise, if the father's cell-phone number is not missing use it.
  3. Otherwise, if the home-phone number not missing use it.
  4. ...(and so forth with the other phone numbers)...
The COALESCE function enables you to process these data with a single function call instead of writing multiple IF-THEN/ELSE statements.

The following program provides a small example of the COALESCE function. The data set B has 10 observations and two variables. The PROC SQL statements uses the COALESCE function to report the value of x1 if that value is not missing. Otherwise, it reports the value of x2 if x2 is nonmissing. If both values are missing, then the COALESCE function returns a missing value.

data B;
input x1 x2;
datalines;
 . 11  
 .  . 
 3 13
 4  . 
 . 15
 6 16
;
run;

proc sql; 
title 'Coalesce() -- Combine column values';
select Monotonic() as obs, 
     coalesce(x1, x2) 
from B;
quit;

obs

1

11

2

.

3

3

4

4

5

15

6

6

Because the COALESCE function is part of Base SAS, you can call it from PROC IML:

proc iml;
use B;
read all var {x1 x2};
close B;
y = coalesce(x1, x2);

The vector y contains the same values computed by PROC SQL. Notice that it is not necessary to write a loop: the COALESCE function automatically applies to each row of the argument vectors. (If you have character data, use the COALESCEC function instead.)

The lesson to learn is this: although about 300 functions and subroutines appear in the SAS/IML User's Guide, you can also use the 500 or so functions that appear in Base SAS software. And use all the formats, informats, and global statements. This is one reason that the SAS/IML language is a natural progression for a SAS programmer who wants to learn matrix programming: a SAS programmer can leverage existing DATA step knowledge when writing an analysis in PROC IML.

5月 102011
 
Last week the Flowing Data blog published an excellent visualization of the flight patterns of major US airlines.

On Friday, I sent the link to Robert Allison, my partner in the 2009 ASA Data Expo, which explored airline data. Robert had written a SAS program for the Expo that plots line segments for the flight routes between US cities (see Figure 8 of our joint work), and on Monday he modified his SAS program to mimic to the Flowing Data graphs.

Here's one example of his efforts (click to enlarge):


He used line segments to display the flight patterns because I don't have the time right now to implement a great circle algorithm in SAS/IML. Sorry, Robert! (Because I can call R packages from SAS/IML software, I could also generate the great arcs by using the same R package that Flowing Data uses.)

Visit Robert's visualization gallery to see the other plots.

High-flying graphics, indeed!

5月 042011
 
More than a month ago I wrote a first article in response to an interesting article by Charlie H. titled Top 10 most powerful functions for PROC SQL. In that article I described SAS/IML equivalents to the MONOTONIC, COUNT, N, FREQ, and NMISS Functions in PROC SQL.

In this article, I discuss the UNIQUE functions in PROC SQL and in PROC IML.

The UNIQUE function in SQL finds the unique values of a variable. For categorical variables, it gives the levels of the variable. Charlie H. wrote a program that uses PROC SQL to count the number of levels for the Origin and Type variables in the SASHELP.CARS data set:

proc sql;
title'Unique() -- Find levels of categorical variables';
select count(unique(Origin)) as L_origin, 
       count(unique(Type)) as L_type
from sashelp.cars;
quit;

L_origin

L_type

3

6

The SAS/IML language also has a UNIQUE function. The UNIQUE function always returns a row vector that contains the unique sorted values of its argument, as shown in the following statements:

proc iml;
use sashelp.cars;
read all var {Origin Type};
close sashelp.cars;

uOrigin = unique(Origin);
uType   = unique(Type);
print uOrigin, uType;

uOrigin

Asia

Europe

USA

uType

Hybrid

SUV

Sedan

Sports

Truck

Wagon

Because you know that the output is a row vector, you can count the number of columns by using the NCOL function, as shown below:

L_Origin = ncol(uOrigin); /** = 3 **/
L_Type   = ncol(uType);   /** = 6 **/

I think that the UNIQUE function is one of the most important functions in SAS/IML because it enables you to compute statistics for each level of a categorical variable. Furthermore, the UNIQUE/LOC technique (which is described on p. 69 of my book, Statistical Programming with SAS/IML Software) is a highly useful technique that should be a part of every statistical programmer's toolbox.

4月 292011
 
In last week's article on how to create a funnel plot in SAS, I wrote the following comment:
I have not adjusted the control limits for multiple comparisons. I am doing nine comparisons of individual means to the overall mean, but the limits are based on the assumption that I'm making a single comparison.
This article discusses how to adjust the control limits (called decision limits in the GLM procedure) to account for multiple comparisons. Because the adjustments are more complicated when the group sizes are not constant, this article treats the simpler case in which each group has the same number of observations. For details on multiple comparisons, see Multiple Comparisons and Multiple Tests Using SAS (the second edition is scheduled for Summer 2011).

Example Data and ANOM Chart

In the funnel plot article, I used data for the temperatures of 52 cars. Each car was one of nine colors, and I was interested in whether the mean temperature of a group (say, black cars) was different from the overall mean temperature of the cars. The number of cars in each color group varied. However, in order to simplify the analysis, today's analysis uses only the temperatures of the first four cars of each color.

You can download the new data and all of the SAS statements used in this article. The following statements create the data and run the SAS/QC ANOM procedure to generate the ANOM chart:

ods graphics on;
proc anom data=CarTemp2;
   xchart Temperature*Color;
   label Temperature = 'Mean Temperature (F)';
   label Color = 'Car Color';
run;


Even for this smaller set of data, it is apparent that black cars are warmer than average and silver and white cars are cooler than average. You can create a similar plot by using the LSMEANS statement in the SAS/STAT GLM procedure.

Computing Decision Limits: An Overview

The formulas for computing decision limits are available in the documentation of the XCHART statement in the ANOM procedure. The decision limits have three components:

  1. The central value, y, with which you want to compare each individual group mean. Often this is the grand mean of the data.
  2. A variance term, v, which involves the root mean square error, the number of groups, and the size of each group. This term quantifies the accuracy of the comparisons.
  3. A multiplier, h, which depends on the significance level, α, and accounts for the multiple comparisons.
The upper and lower decision limits are then formed as y ± h * v. The following sections compute each component of the decision limits.

Computing the Central Value

The central value is the easiest component to compute. When the group sizes are constant, the central value is merely the overall mean:

proc iml;
use CarTemp2; 
read all var {Color Temperature}; 
close CarTemp2;

/** 1. overall mean **/
y = mean(Temperature); 

This value is 123.6, as shown in the ANOM chart. The ANOM chart compares each individual group mean to this central value.

Computing the Variance Term

The second component in the computation of decision limits is the variance term. This term measures the accuracy you have when comparing group means to the overall mean. (More variance means less accuracy.) The formula involves the mean square error, which in this case is just the average of the sample variances of the nine groups. For convenience, the following statements define a SAS/IML module that computes the average variance:

/** 2. variance term **/
start MSEofGroups(g, x);
   u = unique(g); /** g is group var **/
   nGroups = ncol(u);
   v = j(1, nGroups);
   do i = 1 to nGroups;
      v[i] = var( x[loc(g=u[i])] );
   end;
   return( sum(v)/nGroups );
finish;

The module is then used to compute the variance term:

MSE = MSEofGroups(Color, Temperature);
nGroups = 9; /** or determine from data **/
size = repeat(4, nGroups); /** {4,4,...,4} **/
v = sqrt(MSE) * sqrt((nGroups-1)/sum(size));

Computing the ANOM Multiplier

The final component in forming the ANOM decision limits is the multiplier, h. In elementary statistics, the value 2 (or more precisely, the 0.975 quantile of a t distribution) might be used as a multiplier, but that value isn’t big enough when multiple comparisons are being made. The PROC ANOM documentation states that in a comparison of several group means with the overall mean, the proper value of h is the α quantile of a certain distribution. However, the documentation does not specify how to compute this quantile.

In SAS software you can compute the quantile by using the PROBMC function. I had never heard of the PROBMC function until I started working on this article, but it is similar to the QUANTILE function in that it enables you to obtain quantiles from one of several distributions that are used in multiple comparison computations. (You can also use the PROBMC function to obtain probabilities.)

The following statements compute h for α = 0.05 and for the case of nine groups, each with four observations:

/** 3. multiplier for ANOM **/
alpha = 0.05;
pAnom = 1 - alpha;

/** degrees of freedom for 
    pooled estimate of variance **/
df = sum(size)-nGroups; 
h = probmc("ANOM", ., pAnom, df, nGroups); 

The main idea is that h is the α quantile of the "ANOM distribution." Although the "ANOM distribution" is not as well-known as the t distribution, the idea is the same. The distribution involves parameters for the degrees of freedom and the number of groups. In the general case (when the group sizes are not constant), the sizes of the groups are also parameters for the distribution (not shown here).

Computing the Decision Limits

All three pieces are computed, so it is easy to put them together to compute the upper and lower decision limits:

/** compute decision limits **/
upperAnom = y + h * v;
lowerAnom = y - h * v;
print lowerAnom upperAnom;

lowerAnom

upperAnom

109.89971

137.29473

Notice that these values are identical to the values graphed by the ANOM procedure.

Comparing the ANOM Multiplier with Better-Known Multipliers

The computation is finished, but it is interesting to compare the ANOM multiplier with more familiar multipliers from the t distribution.

A classic way to handle multiple comparisons is to use the Bonferroni adjustment. In this method, you divide α by the number of comparisons (9) but continue to use quantiles of the t distribution. By dividing α by the number of groups, you find quantiles that are further in the tail of the t distribution and therefore are larger than the unadjusted values. You can show that the Bonferroni multiplier is a conservative multiplier that will always be larger than the ANOM multiplier.

The following statements compute decision limit multipliers based on an unadjusted t quantile (such as is used for a classical confidence interval for a mean) and on a Bonferroni adjusted quantile. These are printed, along with the multiplier h that was computed previously.

/** compare with unadjusted and Bonferroni multipliers **/
q = quantile("T", 1-alpha/2, df); 
qBonf = quantile("T", 1-alpha/2/nGroups, df); 
print q qBonf h;

q

qBonf

h

2.0518305

3.0137174

2.9785179

For these data, the Bonferroni multiplier is only about 1% larger than h. You can see that the Bonferroni and ANOM multipliers are about 50% larger than the multiplier based on the unadjusted quantile, which means that the decision limits based on these quantiles will be wider. This is good, because the unadjusted limits are too narrow for multiple comparisons.

4月 272011
 
The log transformation is one of the most useful transformations in data analysis. It is used as a transformation to normality and as a variance stabilizing transformation. A log transformation is often used as part of exploratory data analysis in order to visualize (and later model) data that ranges over several orders of magnitude. Common examples include data on income, revenue, populations of cities, sizes of things, weights of things, and so forth.

In many cases, the variable of interest is positive and the log transformation is immediately applicable. However, some quantities (for example, profit) might contain a few negative values. How do you handle negative values if you want to log-transform the data?

Solution 1: Translate, then Transform

A common technique for handling negative values is to add a constant value to the data prior to applying the log transform. The transformation is therefore log(Y+a) where a is the constant. Some people like to choose a so that min(Y+a) is a very small positive number (like 0.001). Others choose a so that min(Y+a) = 1. For the latter choice, you can show that a = b – min(Y), where b is either a small number or is 1.

In the SAS/IML language, this transformation is easily programmed in a single statement. The following example uses b=1 and calls the LOG10 function, but you can call LOG, the natural logarithm function, if you prefer.

proc iml;
Y = {-3,1,2,.,5,10,100}; /** negative datum **/
LY = log10(Y + 1 - min(Y)); /** translate, then transform **/

Solution 2: Use Missing Values

A criticism of the previous method is that some practicing statisticians don't like to add an arbitrary constant to the data. They argue that a better way to handle negative values is to use missing values for the logarithm of a nonpositive number.

This is the point at which some programmers decide to resort to loops and IF statements. For example, some programmers write the following inefficient SAS/IML code:

n = nrow(Y);
LogY = j(n,1); /** allocate result vector **/
do i = 1 to n; /** loop is inefficient **/
   if Y>0 then LogY[i] = log(Y);
   else LogY[i] = .;
end;

The preceding approach is fine for the DATA step, but the DO loop is completely unnecessary in PROC IML. It is more efficient to use the LOC function to assign LogY, as shown in the following statements.

/** more efficient statements **/
LogY = j(nrow(Y),1,.); /** allocate missing **/
idx = loc(Y>0); /** find indices where Y>0 **/
if ncol(idx)>0 then 
   LogY[idx] = log10(Y[idx]);

print Y LY LogY;

Y

LY

LogY

-3

0

.

1

0.69897

0

2

0.7781513

0.30103

.

.

.

5

0.9542425

0.69897

10

1.146128

1

100

2.0170333

2

The preceding statements initially define LogY to be a vector of missing values. The LOC function finds the indices of Y for which Y is positive. If at least one such index is found, those positive values are transformed and overwrite the missing values. A missing value remains in LogY for any element for which Y is negative.

You can see why some practitioners prefer the second method over the first: the logarithms of the data are unchanged by the second method, which makes it easy to mentally convert the transformed data back to the original scale (see the transformed values for 1, 10, and 100). The translation method makes the mental conversion harder.

You can use the previous technique for other functions that have restricted domains. For example, the same technique applies to the SQRT function and to inverse trigonometric functions such as ARSIN and ARCOS.

4月 222011
 
Last week I showed how to create a funnel plot in SAS. A funnel plot enables you to compare the mean values (or rates, or proportions) of many groups to some other value. The group means are often compared to the overall mean, but they could also be a compared to a value that is mandated by a regulatory agency.

A colleague at SAS mentioned that several SAS procedures automatically produce statistical graphs that are similar to the funnel plot. This blog post compares the funnel plot to the output from these procedures. Because my original post used a continuous response, I will concentrate on analysis of means plots.

The main idea of Spiegelhalter's paper on funnel plots is to create a scatter plot that has control limits "in close analogy to standard Shewhart charts." (Spiegelhalter 2004, p. 1185). It therefore makes sense to compare funnel plots with a plot created by the ANOM procedure in SAS/QC software. This article also examines output from the GLM procedure in SAS/STAT software. (You can use other SAS procedures, such as LOGISTIC, GENMOD, and GLIMMIX, to create funnel-like plots for proportions.)

The example used here is the same as for the previous article: Clark Andersen's data on the temperature of car roofs on a 71 degree (F) day.

The ANOM Procedure

Analysis of means (ANOM) is a statistical method for simultaneously comparing group means with the overall mean at a specified significance level. The control limits (often called decision limits for the ANOM plot) are adjusted for the fact that multiple group means are being compared with the overall mean. You can create an ANOM chart with the ANOM procedure in SAS/QC software as follows:

ods graphics on;
proc anom data=CarTemps;
   xchart Temperature*Color;
   label Temperature = 'Mean Temperature (F)';
   label Color  = 'Car Color';
run;

The information displayed in ANOM chart (at left; click to enlarge) is similar to the funnel plot in that they both show deviation from the overall mean. (See the documentation for the XCHART statement in PROC ANOM.)

PROC ANOM also supports funnel-like charts for proportions (see the PCHART statement) and the rates (see the UCHART statement).

Although PROC ANOM produces the ANOM plot automatically, I think that the funnel plot improves on the ANOM plot produced by PROC ANOM in two ways. First, the funnel plot explicitly presents one of the sources of the variation among groups, namely the sample size, whereas the ANOM plot does not. Second, the ANOM plot orders the categories in alphabetical order (black, burgundy, ..., white), whereas the funnel plot orders them according to sample size.


The GLM Procedure

An analysis of means plot is also available from the GLM procedure. The GLM procedure can order the groups by sample size (see the ORDER= option) . You can use the GLM procedure for comparing group means, but you need to use a different procedure if you want to compare rates or proportions, or if you want to compare them to a quantity other than the overall mean.

It is straightforward to analyze the data with the LSMEANS statement of the GLM procedure. (IF you are unfamiliar with the LSMEANS statement, see the recent SAS Global Forum paper on the LSMEANS and the LSMESTIMATE statements.) PROC GLM automatically adjusts the control limits for multiplicity (see the documentation of the ADJUST= option).

ods graphics on;
proc glm data=CarTemps order=freq;
   class Color;
   model Temperature = Color;
   lsmeans Color / pdiff=anom adjust=nelson;
run;

This plot is almost equivalent to the funnel plot. The main differences are that the groups are shown in decreasing order (instead of increasing) and the X axis does not explicitly show the sample size. For small examples such as the current data, this is not an issue, since there is room for all nine categories. However, the funnel plot easily handles hundreds or thousands of categories, whereas the ANOM plot would be unable to handle such a crowded axis.

Conclusions

For a small number of groups, the ANOM and GLM procedures can automatically produce funnel-like plots. They also automatically adjust the decision limits to account for multiple comparisons.

The funnel plot is a useful display for comparing the mean scores of hundreds of groups, and it explicitly shows the sample size, which is a source of variability. A drawback of the funnel plot is that you need to compute the control limits yourself, including (preferably) adjusting the limits for multiple comparisons. Next week I'll show how to compute these adjusted limits.

4月 152011
 
In a previous blog post, I showed how you can use simulation to construct confidence intervals for ranks. This idea (from a paper by E. Marshall and D. Spiegelhalter), enables you to display a graph that compares the performance of several institutions, where "institutions" can mean schools, companies, airlines, or any set of similar groups.

A more recent paper (Spiegelhalter, 2004) recommends doing away with individual confidence intervals altogether. Instead, Spiegelhalter recommends plotting the means of each group versus "an interpretable measure of its precision" and overlaying "control limits" similar to those found on a Shewhart control chart.

This article shows how to create a funnel plot in SAS. You can download all of the data and the SAS program used in this analysis.

Spiegelhalter lays out four steps for creating the funnel plot that displays the mean of a continuous response for multiple groups (see Appendix A.3.1):

  1. Compute the mean of each category.
  2. Compute the overall mean, y and standard deviation, s.
  3. Compute control limits y ± zα * s / sqrt(n) where zα is the α quantile of the normal distribution and n varies between the number of observations in the smallest group and the number in the largest group.
  4. Plot the mean of each category against the sample size and overlay the control limits.

The Temperature of Car Roofs in a Parking Lot

The data for this example are from an experiment by Clark Andersen in which he measured the temperature of car roofs in a parking lot. He observed that black and burgundy cars had the hottest roofs, whereas white and silver cars were cooler to touch.

Andersen measured the roof temperatures of 52 cars on a mild (71 degree F) day. The cars were of nine colors and 4–8 measurements were recorded for each color. You could rank the car colors by the mean temperature of the roof (remember to include confidence intervals!) but an alternative display is to plot the mean temperature for each car color versus the number of cars with that color. You can then overlay funnel-shaped "control limits" on the chart.

The funnel plot is shown below. (Click to enlarge.) The remainder of this post shows how to create the graph.

Computing the Mean of Each Category

The following statements read the data into SAS/IML vectors:

proc iml;
use CarTemps;
   read all var {Color Temperature};
close CarTemps;

You can use the UNIQUE/LOC technique to compute the sample size and mean temperature for each color category. The following technique is described in Section 3.3.5 of Statistical Programming with SAS/IML Software:

/** for each car color, compute the mean and
    standard error of the mean **/
u = unique(Color); /** unique colors **/
p = ncol(u); /** how many colors? **/
mean = j(p,1); sem = j(p,1); n = j(p,1);
do i = 1 to p;
   idx = loc(Color=u[i]);
   n[i] = ncol(idx);
   T = Temperature[idx]; 
   mean[i] = mean(T); /** mean temp of category **/
   sem[i] = sqrt(var(T)/n[i]); /** stderr **/
end;

The following table summarizes the data by ranking the mean temperatures for each color. Although the standard errors are not used in the funnel plot, they are included here so that you can see that many of the confidence intervals overlap. You could use the SGPLOT procedure to display these means along with 95% confidence intervals, but that is not shown here.


   color            n  mean  sem

   black            8 137.3  3.6
   burgundy         4 133.9  4.6
   green            4 130.9  7.3
   gray             6 130.5  2.5
   blue             7 129.3  3.9
   red              6 128.5  2.9
   tan              4 116.1  5.1
   silver           6 107.9  3.2
   white            7  98.4  1.8


Those car roofs get pretty hot! Are the black cars much hotter than the average? What about the burgundy cars? Notice that the sample means of the burgundy and green cars are based on only four observations, so the uncertainty in those estimates are greater than for cars with more measurements. The funnel plot displays both the mean temperature and the precision in a single graph.

Computing the Overall Mean and Standard Deviation

The funnel plot compares the group means to the overall mean. The following statements compute the overall mean and standard deviation of the data, ignoring colors:

/** compute overall mean and variance **/
y = Temperature[:];
s = sqrt(var(Temperature));
print y s[label="StdDev"];

y

StdDev

123.40385

15.582808

The overall mean temperature is 123 degrees F, with a standard deviation of 16 degrees. The VAR function is available in SAS/IML 9.22. If you are using an earlier version of SAS/IML, you can use the VAR module from my book.

Computing Control Limits

A funnel plot is effective because it explicitly reveals a source of variability for the means, namely the different sample sizes. The following statements compute the control limits for these data:

/** confidence limits for a range of sample sizes **/
n = T( do(3, 8.5, 0.1) );
p = {0.001 0.025 0.975 0.999}; /** lower/upper limits **/
z = quantile("normal", p);
/** compute 56 x 4 matrix that contains confidence limits
    for n = 3 to 8.5 by 0.1 **/
limits = y + s / sqrt(n) * z;

Notice that the limits variable is a matrix. The expression s/sqrt(n) is a column vector with 56 rows, whereas the row vector z contains four z-scores. Therefore the (outer) product is a 56x4 matrix. The values in this matrix are used to overlay control limits on a plot of mean versus the sample size.

Creating a Funnel Plot

After writing the SAS/IML computations to a data set (not shown here), you can use PROC SGPLOT to display the mean temperature for each car color and use the BAND statement to overlay 95% and 99.8% control limits.

title "Temperatures of Car Roofs";
title2 "71 Degrees in the Shade";
proc sgplot data=All;
   scatter x=N y=Mean /datalabel=Color;
   refline 123.4 / axis=y;
   band x=N lower=L95 upper=U95 / nofill;
   band x=N lower=L998 upper=U998 / nofill;
   xaxis label="Number of Cars";
   yaxis label="Average Temperature";
run;

The funnel plot indicates that black cars are hotter than average. Silver and white cars are cooler than average. More precisely, the plot shows that the mean temperature of the black cars exceeds the 95% prediction limits, which indicates that the mean is greater than would be expected by random variation from the overall mean. Similarly, the mean temperature of the silver cars is lower than the 95% prediction limits. The mean temperature of the white cars is even more extreme: it is beyond the 99.8% prediction limits.

Conclusions

The funnel plot is a simple way to compare group means to the overall average. The funnel-shaped curves are readily explainable to a non-statistician and the plot enables you to compare different groups without having to rank them. The eye is naturally drawn to observations that are outside the funnel curves, which is good because these observations often warrant special investigation.

More advantages and some limitations are described in Spiegelhalter's paper. He also shows how to construct the control limits for funnel plots for proportions, ratios of proportions, odd ratios, and other situations.

Two criticisms of my presentation come to mind. The first is that I've illustrated the funnel plot by using groups that have a small number of observations. In order to make the control limits look like funnels, I used a "continuous" value of n, even though there is no such thing as a group with 4.5 or 5.8 observations! Spiegelhalter's main application is displaying the mean mortality rates of hospitals that deal with hundreds or thousands of patients, so the criticism does not apply to his examples.

The second criticism is that I have not adjusted the control limits for multiple comparisons. I am doing nine comparisons of individual means to the overall mean, but the limits are based on the assumption that I'm making a single comparison. Spiegelhalter notes (p. 1196) that "the only allowance for multiple comparisons is the use of small p-values for the control limits. These could be chosen based on some formal criterion such as Bonferroni..., but we suggest that this is best carried out separately." Following Spiegelhalter's suggestion, I leave that adjustment for another blog post.

4月 082011
 
At the beginning of 2011, I heard about the Dow Piano, which was created by CNNMoney.com. The Dow Piano visualizes the performance of the Dow Jones industrial average in 2010 with a line plot, but also adds an auditory component. As Bård Edlund, Art Director at CNNMoney.com, said,
The daily trading of the Dow Jones industrial average determines the songwriting here, translating the ups and downs of 2010's market into musical notes. Using a five-note scale spanning three octaves, pitch is determined by each day's closing level.

When I first saw the Dow Piano, I immediately thought, "I can do that in SAS/IML Studio by using the SOUND call in the SAS/IML language!"

To be fair, I can't fully duplicate the Dow Piano in SAS/IML software. The SOUND call in SAS is very simple: it only provides pitch and duration. The music for the Dow Piano is more complex:

  • It uses piano notes, which involves sound characteristics such as attack and decay.
  • It uses dynamics to represent trading volume: high-volume days are represented by loud notes, low-volume days by soft notes.
  • It overlays a percussion track so that you hear a percussive beat on days that the stock market is closed.
Nevertheless, I think the SAS/IML version captures the main idea by allowing you to listen to the performance of the Dow Jones industrial average. (Get it? The performance?)

Will this kind of visualization (auditorization? sonification?) catch on? Time will tell, but for these data, the sound doesn't add any new information; it merely uses the sense of sound to repeat information that is already in the line plot. In fact, the sound present less information than the line plot: there are more than 250 closing values of the Dow Jones industrial average, but the Dow Piano collapses these values into 15 notes (three octave of a five note scale).

But enough analysis! Listen to the Way of the Dow. The demo starts at 0:45, after I introduce the problem.

4月 072011
 
In a previous blog post about computing confidence intervals for rankings, I inadvertently used the VAR function in SAS/IML 9.22, without providing equivalent functionality for those readers who are running an earlier version of SAS/IML software. (Thanks to Eric for pointing this out.)

The following module computes the variance of each column of a matrix, and correctly handles missing values in the data:

/** Var: return sample variance of each column of a data matrix **/
/** For this module
  x is a matrix that contains the data
**/
start Var(x);
   mean = x[:,];
   countn = j(1, ncol(x)); /** allocate vector for counts **/
   do i = 1 to ncol(x); /** count nonmissing values **/
      countn[i] = sum(x[,i]^=.); /** in each column **/
   end;
   var = (x-mean)[##,] / (countn-1);
return ( var );
finish;

This module appears in Appendix D of my book, Statistical Programming with SAS/IML Software.