Statistical Programming

June 20, 2011
 
Recently Charlie Huang showed how to use the SAS/IML language to compute an exponentially weighted moving average of some financial data. In the commentary to his analysis, he said:

I found that if a matrix or a vector is declared with specified size before the computation step, the program’s efficiency would be significantly improved. It may suggest that declaring matrices explicitly could be a good habit for SAS/IML programming.

Charlie has stated a general rule: in matrix/vector languages such as SAS/IML, MATLAB, or R, you should allocate space for a matrix outside of a loop, rather than using concatenation to grow the matrix inside of a loop. I cover this in Chapter 2 (p. 42-43) of my book, Statistical Programming with SAS/IML Software. You can download Chapter 2 at no cost from the book's Web site.

Do Not Grow Arrays Dynamically

This general rule is relevant when you compute values of a matrix one element at a time. That is, you are using np separate computations to fill all elements of an n x p matrix.

Suppose that you want to compute a vector that contains the results of several similar computations. Naturally, you write a DO loop and compute each value within the loop. For example, the following SAS/IML statements define a SAS/IML module and call it eight times within a DO loop:

proc iml;
start Func(x);
   /** compute any quantity **/
   return (ssq(x)); /** sums of squares **/
finish;

/** NOT EFFICIENT **/
/** grow the result array dynamically **/
do i = 1 to 8;
   x = i:8;
   result = result || Func(x);
end;
print result;

result
204 203 199 190 174 149 113 64
The program is not efficient because it uses the horizontal concatenation operator (||) to grow the result vector dynamically within the DO loop. After the ith iteration, the vector contains i elements, but during the ith iteration, the previous i – 1 elements are needlessly copied from the old array to the new (larger) array. If the DO loop iterates n times, then

  1. the first element is copied n – 1 times
  2. the second element is copied n – 2 times
  3. the third element is copied n – 3 times, and so forth

In summary, when you grow an array dynamically within a loop, there are n allocations, and n(n – 1)/2 elements are needlessly copied. For example, when n = 8 there are 8 allocations and 28 needless copies; when n = 10,000 there are nearly 50 million needless copies.

A Better Approach: Pre-Allocate the Array

When you know the ultimate size of an array, it is best to allocate the array prior to the loop. You can then assign values to the elements of the array inside the loop, as shown in the following statements:

/** EFFICIENT: Pre-allocate result array **/
result = j(1, 8);      
do i = 1 to 8;
   x = i:8;
   result[i] = Func(x);
end;

This computation does not contain any needless allocations, nor any unnecessary copying of elements that were previously computed.
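
If you want to see the difference on your own machine, a rough timing comparison (not part of the original example) looks like the following. It reuses the Func module defined above and uses the TIME function to measure elapsed time; the number of iterations is arbitrary:

/** sketch: compare dynamic growth with pre-allocation **/
n = 10000;

t0 = time();
free result;
do i = 1 to n;
   result = result || Func(i:n);   /** grow dynamically **/
end;
tGrow = time() - t0;

t0 = time();
result = j(1, n);
do i = 1 to n;
   result[i] = Func(i:n);          /** fill pre-allocated vector **/
end;
tAlloc = time() - t0;

print tGrow tAlloc;   /** compare the elapsed times **/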

This technique is essential for efficient sampling in the SAS/IML language: when you want random values from a specified distribution (such as the normal distribution), you should pre-allocate a matrix and then call the RANDGEN subroutine once in order to fill the matrix with random numbers.
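
For example, the following statements (a small sketch added here) pre-allocate a 1000 x 1 vector and fill it with standard normal variates by using a single call to the RANDGEN subroutine:

/** pre-allocate, then fill with random values in one call **/
call randseed(123);          /** seed chosen arbitrarily **/
z = j(1000, 1);              /** pre-allocate 1000 x 1 vector **/
call randgen(z, "Normal");   /** fill z with N(0,1) variates **/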

June 15, 2011
 
A colleague asked, "How can I enumerate the levels of a categorical classification variable in SAS/IML software?" The variable was a character variable with n observations, but he wanted the following:
  1. A "look-up table" that contains the k (unique) levels of the variable.
  2. A vector with n elements that contains the values 1, 2, ..., k.

Factors in the R Language

My colleague had previously used the R language, and wanted to duplicate the concept of an R "factor." For example, the following R statements define a character vector, pets, and then use the as.factor function to obtain a factor that has the values 1, 2, and 3 that correspond to the character values "cat," "dog," and "fish":

# R statements
pets <- c("dog","dog","cat","cat","fish","dog","cat")
f <- as.factor(pets)
str(f)   # print structure of f

  Factor w/ 3 levels "cat","dog","fish": 2 2 1 1 3 2 1

The vector, pets, has three unique levels: "cat", "dog", and "fish". The vector {2 2 1 1 3 2 1} specifies the elements in terms of the look-up table: the first two elements are "dog" (level 2), the third element is "cat" (level 1), and so forth.

Creating a "Factor" in the SAS/IML Language

Creating a factor is yet another application of the UNIQUE/LOC technique, which is described on p. 69 of my book, Statistical Programming with SAS/IML Software.

The following statements define a SAS/IML module named AsFactor. The third argument is the input argument, x. Upon return, the first argument contains the unique sorted levels of x, whereas the second argument contains integers that correspond to each level of x:

proc iml;
/** AsFactor takes a categorical variable x and 
    returns two matrices:
    Levels = a row vector that contains the 
             k unique sorted values of x
    Codes = a matrix that is the same dimensions 
            as x and contains the values
            1, 2, ..., k. **/
start AsFactor(Levels, Codes, x);
   Levels = unique(x);
   Codes = j(nrow(x),ncol(x));
   do i = 1 to ncol(Levels);
      idx = loc(x=Levels[i]);
      Codes[idx] = i;
   end;
finish;

Notice that the output argument, Codes, has the same dimensions as the input argument, x. The following statements call the AsFactor module and print the output values:

pets = {"dog","dog","cat","cat","fish","dog","cat"};
run AsFactor(Lev, C, pets);
print Lev, C;

Lev
cat dog fish

C
2
2
1
1
3
2
1
I'll mention that the AsFactor module is written so that it also works for numeric vectors. For example, the following statements find the unique levels of the numerical vector, Grades:

grades = {90,90,80,80,100,90,80};
run AsFactor(Lev, C, grades);
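
A useful property of this representation (not mentioned in the original post, but easy to verify) is that you can recover the original values by using the codes as subscripts into the look-up table:

/** recover the original values from the levels and codes **/
g = Lev[C];   /** Lev = {80 90 100}; C contains values 1, 2, 3 **/
print g;      /** same values as the grades vector **/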

Of course, if you want only the unique levels of a variable in a SAS data set, you can use the TABLES statement in the FREQ procedure.
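
For example, the following step (a brief sketch using the Sashelp.Class sample data) lists the unique levels of the Sex variable along with their frequencies:

proc freq data=sashelp.class;
   tables Sex / nocum nopercent;  /** unique levels and their counts **/
run;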

June 8, 2011
 
Over at the SAS/IML Discussion Forum, there have been several posts about how to call a Base SAS function from SAS/IML when the Base SAS function supports a variable number of arguments.

It is easy to call a Base SAS function from SAS/IML software when the function takes a fixed number of parameters: simply use the same syntax in SAS/IML as you use for the DATA step.

The problem arises for functions that support a varying number of parameters. One solution uses a slick trick that I developed and put in my book, Statistical Programming with SAS/IML Software.

What's the Problem?

The problem is that some Base SAS functions such as the DUR function, FINANCE function, and NPV function (in general, a lot of the financial functions) have a syntax that supports a varying number of arguments. The Base SAS function expects a comma-separated list of parameters, whereas in SAS/IML it is natural to store the parameters in a vector.

How can you convert a vector of values into a comma-separated list? The trick is to do the following:

  1. Convert the vector of values into a comma-delimited string.
  2. Put the string into a macro variable by using the SYMPUT subroutine.
  3. Use the macro variable in the Base SAS function.

A Canonical Example

The syntax for the DUR function supports an arbitrary number of parameters:

d = dur(Y, Freq, c1, ..., ck);

The last k parameters are usually specified either by a comma-delimited list or by using an array and the OF operator, as shown in the following example:

data Duration;
   /** first approach: use comma-delimited list **/
   d1 = dur(1/20,1,.33,.44,.55,.49,.5,.22,.4,.8,.01,.36,.2,.4);
   /** alternative approach: use array **/
   Y = 1/20;
   freq = 1;
   array C[12] (.33,.44,.55,.49,.50,.22,.4,.8,.01,.36,.2,.4);
   d2 = dur(Y, freq, of C[*]);
run;

proc print noobs; var d1 d2; run;

   d1        d2
5.28402   5.28402

However, when you try to do the "natural" thing in SAS/IML software, you get an error:

proc iml;
use Duration;
   read all var {Y freq};
   vNames = "c1":"c12";
   read all var vNames into C;
close Duration;

/** OF syntax is not supported in SAS/IML **/
d = dur(Y, freq, of C); /** ERROR: doesn't work! **/

The problem is that the OF syntax is not supported in SAS/IML software. The DUR function in Base SAS is expecting a comma-separated list of arguments, not a SAS/IML vector.

Macro to the Rescue: Converting a Vector into a Comma-Separated String

The following statements define a SAS/IML module called GetArgList that takes a vector of values and creates a comma-delimited string from it. The module uses the PUTN function to convert from numbers to character values, and uses the ROWCAT function to form a single string:

/** convert numeric elements to a single string **/
start GetArgList(v);
   y = strip(putn(rowvec(v),"BEST12.")) + ",";
   /** get rid of last comma **/
   y[ncol(y)] = strip(putn(v[ncol(y)],"BEST12."));
   return( rowcat(y) ); /** scalar string **/
finish;

s = GetArgList(C);
print s;

s
0.33,0.44,0.55,0.49,0.5 ,0.22,0.4 ,0.8 ,0.01,0.36,0.2 ,0.4

This function is what we need to call the DUR function (and similar functions) with parameters that are contained in a SAS/IML vector. The trick is to save the comma-delimited string, s, into a macro variable, and then call the DUR function with that macro variable:

call symput("myArgs", s); /** create macro var **/
d = dur(Y, freq, &myArgs); /** use macro var **/

Although it is a bit of a hack, this technique enables you to call Base SAS functions that have a varying number of parameters, and pass in values from a SAS/IML vector.

May 18, 2011
 
Andrew Ratcliffe posted a fine article titled "Inadequate Mends" in which he extols the benefits of including the name of a macro on the %MEND statement. That is, if you create a macro function named foo, he recommends that you include the name in two places:

%macro foo(x);
   /** define the macro function **/
%mend foo;

Did you know that the same option exists for SAS/IML modules? That is, if you want to be explicitly clear that you are ending a particular module definition, you can include the name of the module on the FINISH statement:

proc iml;
start foo(x);
   /** define the SAS/IML module **/
finish foo;

May 16, 2011
 
A fundamental operation in data analysis is finding data that satisfy some criterion. How many people are older than 85? What are the phone numbers of the voters who are registered Democrats? These questions are examples of locating data with certain properties or characteristics.

The SAS DATA step has a variety of statements (such as the WHERE and IF statements) that enable statistical programmers to locate observations and subset data. The SAS/IML language has similar language features, but because data is often stored in SAS/IML matrices, the SAS/IML language also has a function that is not available in the DATA step: the LOC function.

The LOC Function

If your data are in a SAS/IML matrix, x, the LOC function enables you to find elements of x for which a given criterion is true. The LOC function returns the LOCations (indices) of the relevant elements. (In the R language, the which function implements similar functionality.) For example, the following statements define a numeric vector, x, and use the LOC function to find the indices for which the numbers are greater than 3:

proc iml;
x = {1 4 3 5 2 7 3 5};
/** which elements are > 3? **/
k = loc( x>3 ); 
print k;

k
2 4 6 8

Notice the following:

  • The argument to the LOC function is an expression that resolves to a vector of 0s and 1s. (Some languages call this a logical vector.) In practice, the argument to the LOC function is almost always an expression.
  • The result of the LOC function is always a row vector. The number of columns is the number of elements of x that satisfy the given criterion.
  • The LOC function returns indices of x, not values of x. To obtain the values, use x[k], as shown in the example after this list. (Indices and subscripts are related; for vectors, they are the same.)
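
For example, the following statements (a small addition for illustration) use the indices in k to retrieve the corresponding values of x:

vals = x[k];   /** values of x at the indices in k **/
print vals;    /** 4 5 7 5 **/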

How Many Elements Satisfy the Criterion?

You can exploit the fact that the LOC function outputs a row vector. To count the number of elements that satisfy the criterion, simply use the NCOL function, as follows:

n = ncol(k); /** how many? **/
print n;

n
4

What If No Elements Satisfy the Criterion?

The expression ncol(j) always tells you the number of elements that satisfy the criterion, even when no elements satisfy the criterion. The following statements ask for the elements larger than 100 and handle both possible results:

j = loc( x>100 );
if ncol(j) > 0 then do;
   print "At least one element found";
   /** handle this case **/
end;
else do;
   print "No elements found";
   /** handle alternate case **/
end;

In the preceding example, x does not contain any elements that are greater than 100. Therefore the matrix j is an empty matrix, which means that j has zero rows and zero columns. It is a good programming practice to check the results of the LOC function to see if any elements satisfied the criterion. For more details, see Chapter 3 of Statistical Programming with SAS/IML Software.

Using the LOC Function to Subset a Vector

The LOC function finds the indices of elements that satisfy some criterion. These indices can be used to subset the data. For example, the following statements read information about vehicles in the SasHelp.Cars data set. The READ statement creates a vector that contains the make of each vehicle ("Acura," "Audi," "BMW,"...) and creates a second vector that contains the engine size (in liters) for each vehicle. The LOC function is used to find the indices for the vehicles made by Acura. These indices are then used to subset the EngineSize vector in order to produce a vector, s, that contains only the engine volumes for the Acura vehicles:

use sashelp.cars; 
read all var {Make EngineSize};
close sashelp.cars; 

/** find observations that 
    satisfy a criterion **/
idx = loc( Make="Acura" );
s = EngineSize[idx];
print s[label="EngineSize (Acura)"];

EngineSize (Acura)

3.5

2

2.4

3.2

3.5

3.5

3.2

LOC = Efficient SAS Programming

I have called the LOC function the most useful function that most DATA step programmers have never heard of. Despite its relative obscurity, it is essential that SAS/IML programmers master the LOC function. By using the LOC function, you can write efficient vectorized programs, rather than inefficient programs that loop over data.
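
As a small illustration (added here, not in the original post), here is the difference in style for a simple task: counting the elements of the vector x (defined earlier in this post) that are greater than 3. The loop gives the same answer as the vectorized statement, but the LOC version is shorter and scales better to long vectors:

/** looping approach: count elements greater than 3 **/
count = 0;
do i = 1 to ncol(x);
   if x[i] > 3 then count = count + 1;
end;

/** vectorized approach: one statement **/
count2 = ncol(loc(x > 3));
print count count2;   /** both are 4 **/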

April 29, 2011
 
In last week's article on how to create a funnel plot in SAS, I wrote the following comment:
I have not adjusted the control limits for multiple comparisons. I am doing nine comparisons of individual means to the overall mean, but the limits are based on the assumption that I'm making a single comparison.
This article discusses how to adjust the control limits (called decision limits in the GLM procedure) to account for multiple comparisons. Because the adjustments are more complicated when the group sizes are not constant, this article treats the simpler case in which each group has the same number of observations. For details on multiple comparisons, see Multiple Comparisons and Multiple Tests Using SAS (the second edition is scheduled for Summer 2011).

Example Data and ANOM Chart

In the funnel plot article, I used data for the temperatures of 52 cars. Each car was one of nine colors, and I was interested in whether the mean temperature of a group (say, black cars) was different from the overall mean temperature of the cars. The number of cars in each color group varied. However, in order to simplify the analysis, today's analysis uses only the temperatures of the first four cars of each color.

You can download the new data and all of the SAS statements used in this article. The following statements create the data and run the SAS/QC ANOM procedure to generate the ANOM chart:

ods graphics on;
proc anom data=CarTemp2;
   xchart Temperature*Color;
   label Temperature = 'Mean Temperature (F)';
   label Color = 'Car Color';
run;


Even for this smaller set of data, it is apparent that black cars are warmer than average and silver and white cars are cooler than average. You can create a similar plot by using the LSMEANS statement in the SAS/STAT GLM procedure.

Computing Decision Limits: An Overview

The formulas for computing decision limits are available in the documentation of the XCHART statement in the ANOM procedure. The decision limits have three components:

  1. The central value, y, with which you want to compare each individual group mean. Often this is the grand mean of the data.
  2. A variance term, v, which involves the root mean square error, the number of groups, and the size of each group. This term quantifies the accuracy of the comparisons.
  3. A multiplier, h, which depends on the significance level, α, and accounts for the multiple comparisons.

The upper and lower decision limits are then formed as y ± h * v. The following sections compute each component of the decision limits.

Computing the Central Value

The central value is the easiest component to compute. When the group sizes are constant, the central value is merely the overall mean:

proc iml;
use CarTemp2; 
read all var {Color Temperature}; 
close CarTemp2;

/** 1. overall mean **/
y = mean(Temperature); 

This value is 123.6, as shown in the ANOM chart. The ANOM chart compares each individual group mean to this central value.

Computing the Variance Term

The second component in the computation of decision limits is the variance term. This term measures the accuracy you have when comparing group means to the overall mean. (More variance means less accuracy.) The formula involves the mean square error, which in this case is just the average of the sample variances of the nine groups. For convenience, the following statements define a SAS/IML module that computes the average variance:

/** 2. variance term **/
start MSEofGroups(g, x);
   u = unique(g); /** g is group var **/
   nGroups = ncol(u);
   v = j(1, nGroups);
   do i = 1 to nGroups;
      v[i] = var( x[loc(g=u[i])] );
   end;
   return( sum(v)/nGroups );
finish;

The module is then used to compute the variance term:

MSE = MSEofGroups(Color, Temperature);
nGroups = 9; /** or determine from data **/
size = repeat(4, nGroups); /** {4,4,...,4} **/
v = sqrt(MSE) * sqrt((nGroups-1)/sum(size));

Computing the ANOM Multiplier

The final component in forming the ANOM decision limits is the multiplier, h. In elementary statistics, the value 2 (or more precisely, the 0.975 quantile of a t distribution) might be used as a multiplier, but that value isn't big enough when multiple comparisons are being made. The PROC ANOM documentation states that in a comparison of several group means with the overall mean, the proper value of h is the 1 – α quantile of a certain distribution. However, the documentation does not specify how to compute this quantile.

In SAS software you can compute the quantile by using the PROBMC function. I had never heard of the PROBMC function until I started working on this article, but it is similar to the QUANTILE function in that it enables you to obtain quantiles from one of several distributions that are used in multiple comparison computations. (You can also use the PROBMC function to obtain probabilities.)

The following statements compute h for α = 0.05 and for the case of nine groups, each with four observations:

/** 3. multiplier for ANOM **/
alpha = 0.05;
pAnom = 1 - alpha;

/** degrees of freedom for 
    pooled estimate of variance **/
df = sum(size)-nGroups; 
h = probmc("ANOM", ., pAnom, df, nGroups); 

The main idea is that h is the 1 – α quantile of the "ANOM distribution." Although the "ANOM distribution" is not as well-known as the t distribution, the idea is the same. The distribution involves parameters for the degrees of freedom and the number of groups. In the general case (when the group sizes are not constant), the sizes of the groups are also parameters for the distribution (not shown here).

Computing the Decision Limits

All three pieces are computed, so it is easy to put them together to compute the upper and lower decision limits:

/** compute decision limits **/
upperAnom = y + h * v;
lowerAnom = y - h * v;
print lowerAnom upperAnom;

lowerAnom   upperAnom
109.89971   137.29473

Notice that these values are identical to the values graphed by the ANOM procedure.

Comparing the ANOM Multiplier with Better-Known Multipliers

The computation is finished, but it is interesting to compare the ANOM multiplier with more familiar multipliers from the t distribution.

A classic way to handle multiple comparisons is to use the Bonferroni adjustment. In this method, you divide α by the number of comparisons (9) but continue to use quantiles of the t distribution. By dividing α by the number of groups, you find quantiles that are further in the tail of the t distribution and therefore are larger than the unadjusted values. You can show that the Bonferroni multiplier is a conservative multiplier that will always be larger than the ANOM multiplier.

The following statements compute decision limit multipliers based on an unadjusted t quantile (such as is used for a classical confidence interval for a mean) and on a Bonferroni adjusted quantile. These are printed, along with the multiplier h that was computed previously.

/** compare with unadjusted and Bonferroni multipliers **/
q = quantile("T", 1-alpha/2, df); 
qBonf = quantile("T", 1-alpha/2/nGroups, df); 
print q qBonf h;

    q        qBonf         h
2.0518305  3.0137174  2.9785179

For these data, the Bonferroni multiplier is only about 1% larger than h. You can see that the Bonferroni and ANOM multipliers are about 50% larger than the multiplier based on the unadjusted quantile, which means that the decision limits based on these quantiles will be wider. This is good, because the unadjusted limits are too narrow for multiple comparisons.

April 27, 2011
 
The log transformation is one of the most useful transformations in data analysis. It is used as a transformation to normality and as a variance stabilizing transformation. A log transformation is often used as part of exploratory data analysis in order to visualize (and later model) data that ranges over several orders of magnitude. Common examples include data on income, revenue, populations of cities, sizes of things, weights of things, and so forth.

In many cases, the variable of interest is positive and the log transformation is immediately applicable. However, some quantities (for example, profit) might contain a few negative values. How do you handle negative values if you want to log-transform the data?

Solution 1: Translate, then Transform

A common technique for handling negative values is to add a constant value to the data prior to applying the log transform. The transformation is therefore log(Y+a) where a is the constant. Some people like to choose a so that min(Y+a) is a very small positive number (like 0.001). Others choose a so that min(Y+a) = 1. For the latter choice, you can show that a = b – min(Y), where b is either a small number or is 1.

In the SAS/IML language, this transformation is easily programmed in a single statement. The following example uses b=1 and calls the LOG10 function, but you can call LOG, the natural logarithm function, if you prefer.

proc iml;
Y = {-3,1,2,.,5,10,100}; /** contains a negative value and a missing value **/
LY = log10(Y + 1 - min(Y)); /** translate, then transform **/

Solution 2: Use Missing Values

A criticism of the previous method is that some practicing statisticians don't like to add an arbitrary constant to the data. They argue that a better way to handle negative values is to use missing values for the logarithm of a nonpositive number.

This is the point at which some programmers decide to resort to loops and IF statements. For example, some programmers write the following inefficient SAS/IML code:

n = nrow(Y);
LogY = j(n,1); /** allocate result vector **/
do i = 1 to n; /** loop is inefficient **/
   if Y[i]>0 then LogY[i] = log10(Y[i]);
   else LogY[i] = .;
end;

The preceding approach is fine for the DATA step, but the DO loop is completely unnecessary in PROC IML. It is more efficient to use the LOC function to assign LogY, as shown in the following statements.

/** more efficient statements **/
LogY = j(nrow(Y),1,.); /** allocate missing **/
idx = loc(Y>0); /** find indices where Y>0 **/
if ncol(idx)>0 then 
   LogY[idx] = log10(Y[idx]);

print Y LY LogY;

  Y        LY        LogY

 -3   0                 .
  1   0.69897           0
  2   0.7781513   0.30103
  .   .                 .
  5   0.9542425   0.69897
 10   1.146128          1
100   2.0170333         2
The preceding statements initially define LogY to be a vector of missing values. The LOC function finds the indices of Y for which Y is positive. If at least one such index is found, those positive values are transformed and overwrite the missing values. A missing value remains in LogY for any element for which Y is missing or nonpositive.

You can see why some practitioners prefer the second method over the first: the logarithms of the data are unchanged by the second method, which makes it easy to mentally convert the transformed data back to the original scale (see the transformed values for 1, 10, and 100). The translation method makes the mental conversion harder.

You can use the previous technique for other functions that have restricted domains. For example, the same technique applies to the SQRT function and to inverse trigonometric functions such as ARSIN and ARCOS.
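
For example, the following statements (a brief sketch, not from the original post) use the same pattern to take the square root of only the nonnegative elements of Y:

/** same technique for the SQRT function **/
SqrtY = j(nrow(Y),1,.);   /** allocate vector of missing values **/
idx = loc(Y>=0);          /** indices where the square root is defined **/
if ncol(idx)>0 then 
   SqrtY[idx] = sqrt(Y[idx]);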

April 15, 2011
 
In a previous blog post, I showed how you can use simulation to construct confidence intervals for ranks. This idea (from a paper by E. Marshall and D. Spiegelhalter), enables you to display a graph that compares the performance of several institutions, where "institutions" can mean schools, companies, airlines, or any set of similar groups.

A more recent paper (Spiegelhalter, 2004) recommends doing away with individual confidence intervals altogether. Instead, Spiegelhalter recommends plotting the means of each group versus "an interpretable measure of its precision" and overlaying "control limits" similar to those found on a Shewhart control chart.

This article shows how to create a funnel plot in SAS. You can download all of the data and the SAS program used in this analysis.

Spiegelhalter lays out four steps for creating the funnel plot that displays the mean of a continuous response for multiple groups (see Appendix A.3.1):

  1. Compute the mean of each category.
  2. Compute the overall mean, y and standard deviation, s.
  3. Compute control limits y ± zα * s / sqrt(n) where zα is the α quantile of the normal distribution and n varies between the number of observations in the smallest group and the number in the largest group.
  4. Plot the mean of each category against the sample size and overlay the control limits.

The Temperature of Car Roofs in a Parking Lot

The data for this example are from an experiment by Clark Andersen in which he measured the temperature of car roofs in a parking lot. He observed that black and burgundy cars had the hottest roofs, whereas white and silver cars were cooler to touch.

Andersen measured the roof temperatures of 52 cars on a mild (71 degree F) day. The cars were of nine colors and 4–8 measurements were recorded for each color. You could rank the car colors by the mean temperature of the roof (remember to include confidence intervals!) but an alternative display is to plot the mean temperature for each car color versus the number of cars with that color. You can then overlay funnel-shaped "control limits" on the chart.

The funnel plot is shown below. The remainder of this post shows how to create the graph.

Computing the Mean of Each Category

The following statements read the data into SAS/IML vectors:

proc iml;
use CarTemps;
   read all var {Color Temperature};
close CarTemps;

You can use the UNIQUE/LOC technique to compute the sample size and mean temperature for each color category. The following technique is described in Section 3.3.5 of Statistical Programming with SAS/IML Software:

/** for each car color, compute the mean and
    standard error of the mean **/
u = unique(Color); /** unique colors **/
p = ncol(u); /** how many colors? **/
mean = j(p,1); sem = j(p,1); n = j(p,1);
do i = 1 to p;
   idx = loc(Color=u[i]);
   n[i] = ncol(idx);
   T = Temperature[idx]; 
   mean[i] = mean(T); /** mean temp of category **/
   sem[i] = sqrt(var(T)/n[i]); /** stderr **/
end;

The following table summarizes the data by ranking the mean temperatures for each color. Although the standard errors are not used in the funnel plot, they are included here so that you can see that many of the confidence intervals overlap. You could use the SGPLOT procedure to display these means along with 95% confidence intervals, but that is not shown here.


   color            n  mean  sem

   black            8 137.3  3.6
   burgundy         4 133.9  4.6
   green            4 130.9  7.3
   gray             6 130.5  2.5
   blue             7 129.3  3.9
   red              6 128.5  2.9
   tan              4 116.1  5.1
   silver           6 107.9  3.2
   white            7  98.4  1.8


Those car roofs get pretty hot! Are the black cars much hotter than the average? What about the burgundy cars? Notice that the sample means of the burgundy and green cars are based on only four observations, so the uncertainty in those estimates is greater than for cars with more measurements. The funnel plot displays both the mean temperature and the precision in a single graph.

Computing the Overall Mean and Standard Deviation

The funnel plot compares the group means to the overall mean. The following statements compute the overall mean and standard deviation of the data, ignoring colors:

/** compute overall mean and variance **/
y = Temperature[:];
s = sqrt(var(Temperature));
print y s[label="StdDev"];

    y        StdDev
123.40385   15.582808

The overall mean temperature is 123 degrees F, with a standard deviation of 16 degrees. The VAR function is available in SAS/IML 9.22. If you are using an earlier version of SAS/IML, you can use the VAR module from my book.

Computing Control Limits

A funnel plot is effective because it explicitly reveals a source of variability for the means, namely the different sample sizes. The following statements compute the control limits for these data:

/** confidence limits for a range of sample sizes **/
n = T( do(3, 8.5, 0.1) );
p = {0.001 0.025 0.975 0.999}; /** lower/upper limits **/
z = quantile("normal", p);
/** compute 56 x 4 matrix that contains confidence limits
    for n = 3 to 8.5 by 0.1 **/
limits = y + s / sqrt(n) * z;

Notice that the limits variable is a matrix. The expression s/sqrt(n) is a column vector with 56 rows, whereas the row vector z contains four z-scores. Therefore the (outer) product is a 56x4 matrix. The values in this matrix are used to overlay control limits on a plot of mean versus the sample size.

Creating a Funnel Plot

After writing the SAS/IML computations to a data set (not shown here), you can use PROC SGPLOT to display the mean temperature for each car color and use the BAND statement to overlay 95% and 99.8% control limits.
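
The statements that write the results to a data set are part of the downloadable program; the following is only a rough sketch of one possible approach, added here for completeness. It assumes that the group sizes were saved in a separate vector (here called groupN) before the variable n was reused for the grid of sample sizes, and that the variable names match those used in the PROC SGPLOT step:

/** sketch: write the group means and the control limits to data sets **/
Color = T(u);                                /** column of color names **/
create Means var {Color groupN Mean};  append;  close Means;

L95  = limits[,2];   U95  = limits[,3];      /** 2.5% and 97.5% limits **/
L998 = limits[,1];   U998 = limits[,4];      /** 0.1% and 99.9% limits **/
create Limits var {n L95 U95 L998 U998};  append;  close Limits;
quit;

data All;   /** combine the group means and the limit curves **/
   set Means(rename=(groupN=N)) Limits;
run;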

title "Temperatures of Car Roofs";
title2 "71 Degrees in the Shade";
proc sgplot data=All;
   scatter x=N y=Mean /datalabel=Color;
   refline 123.4 / axis=y;
   band x=N lower=L95 upper=U95 / nofill;
   band x=N lower=L998 upper=U998 / nofill;
   xaxis label="Number of Cars";
   yaxis label="Average Temperature";
run;

The funnel plot indicates that black cars are hotter than average. Silver and white cars are cooler than average. More precisely, the plot shows that the mean temperature of the black cars exceeds the 95% prediction limits, which indicates that the mean is greater than would be expected by random variation from the overall mean. Similarly, the mean temperature of the silver cars is lower than the 95% prediction limits. The mean temperature of the white cars is even more extreme: it is beyond the 99.8% prediction limits.

Conclusions

The funnel plot is a simple way to compare group means to the overall average. The funnel-shaped curves are readily explainable to a non-statistician and the plot enables you to compare different groups without having to rank them. The eye is naturally drawn to observations that are outside the funnel curves, which is good because these observations often warrant special investigation.

More advantages and some limitations are described in Spiegelhalter's paper. He also shows how to construct the control limits for funnel plots for proportions, ratios of proportions, odds ratios, and other situations.

Two criticisms of my presentation come to mind. The first is that I've illustrated the funnel plot by using groups that have a small number of observations. In order to make the control limits look like funnels, I used a "continuous" value of n, even though there is no such thing as a group with 4.5 or 5.8 observations! Spiegelhalter's main application is displaying the mean mortality rates of hospitals that deal with hundreds or thousands of patients, so the criticism does not apply to his examples.

The second criticism is that I have not adjusted the control limits for multiple comparisons. I am doing nine comparisons of individual means to the overall mean, but the limits are based on the assumption that I'm making a single comparison. Spiegelhalter notes (p. 1196) that "the only allowance for multiple comparisons is the use of small p-values for the control limits. These could be chosen based on some formal criterion such as Bonferroni..., but we suggest that this is best carried out separately." Following Spiegelhalter's suggestion, I leave that adjustment for another blog post.

April 13, 2011
 
I have recently returned from five days at SAS Global Forum in Las Vegas. On the way there, I finally had time to read a classic statistical paper: Bayer and Diaconis (1992) describes how many shuffles are needed to randomize a deck of cards. Their famous result that it takes seven shuffles to randomize a 52-card deck is known as "the bane of bridge players" because the result motivated many bridge clubs to switch from hand shuffling to computer generated shuffling. Casual bridge players also blame this result for "slowing down the game" while the cards are shuffled more times than seems intuitively necessary.

In the second paragraph of the paper, Bayer and Diaconis introduce a "mathematically precise model of shuffling," which is known as the Gilbert-Shannon-Reeds (GSR) model. This model is known to be a "good description of the way real people shuffle real cards." (See Diaconis (1988) and the references at the end of this article.) This article describes how to implement the GSR shuffling model in SAS/IML software.

The Riffle Shuffle

Computationally, you can shuffle a deck by generating a permutation of the set 1:n, but that is not how real cards are shuffled.

The riffle (or "dovetail") shuffle is the most common shuffling algorithm. A deck of n cards is split into two parts and the two stacks are interleaved. The GSR algorithm simulates this physical process.

The GSR model splits the deck into two pieces according to the binomial distribution. Each piece has roughly n/2 cards. Then cards are dropped from the two stacks according to the number of cards remaining in each stack. Specifically, if there are NL cards in the left stack and NR cards in the right stack, then the probability of the next card dropping from the right stack is NR / (NR + NL).

The following SAS/IML module is a straightforward implementation of the GSR algorithm:

proc iml;
/** Gilbert-Shannon-Reeds statistical 
    model of a riffle shuffle. Described in 
    Bayer and Diaconis (1992) **/
start GSRShuffle(deck);
   n = nrow(deck);
   /** cut into two stacks **/
   nL = rand("Binomial", 0.5, n);
   nR = n - nL;

   /** left stack, right stack **/
   L = deck[1:nL]; R = deck[nL+1:n];
   j = 1; k = 1; /** card counters **/
   shuffle = j(n,1); /** allocate **/
   do i = 1 to n;
      c = rand("Bernoulli", nR/(nL+nR)); 
      if c=0 then do;  /** drop from left **/
        shuffle[i] = L[j];
        nL = nL-1;  j=j+1; /** update left **/
      end;
      else do;  /** drop from right **/
        shuffle[i] = R[k];
        nR = nR-1;  k=k+1; /** update right **/
      end;
   end;
   return(shuffle);
finish;

Testing the GSR Algorithm

You can test the algorithm by starting with a deck of cards in a known order and observing how the cards are mixed by consecutive riffle shuffles. The following statements riffle the cards seven times and store the results of each shuffle. To save space, a 20-card deck is used. The original order of the cards is denoted 1, 2, 3, ..., 20.

call randseed(1234);
n = 20; /** n=52 for a standard deck **/
deck = T(1:n);

s = j(n, 8); /** allocate for 8 decks **/
s[,1] = deck; /** original order **/
do i = 1 to 7;
   s[,i+1] = GSRShuffle( s[,i] );
end;
names = "s0":"s7";
print s[colname=names];


                  s
    s0  s1  s2  s3  s4  s5  s6  s7

     1   1   1   1   1   1  18   2
     2   2  14  14  14  11   1  13
     3  12   2   2   4  14  11  12
     4   3   6   6   2   4  14  18
     5  13  12   3  12   2  19   1
     6   4  15  16  15  12   7  11
     7   5   7  13  17  15   3  14
     8  14   8   4  10  17   4  15
     9   6   9  12   6  10  16  19
    10  15   3  15  18   6   2   8
    11   7  16  17   7  18  13   7
    12   8  13  10   3  19  12   3
    13   9   4  18  16   7  15   9
    14  16  17   7  13   3   8   4
    15  17  10   8   8  16   9  17
    16  10  18   5   5  13  17  20
    17  18   5  11  11   8  20  16
    18  11  11  19  19   9  10  10
    19  19  19   9   9  20   5   5
    20  20  20  20  20   5   6   6

It is interesting to study the evolution of mixing the cards:

  • For the first shuffle, the original deck is cut into two stacks (1:11 and 12:20) and riffled to form the second column. Notice that usually one or two cards are dropped from each stack, although at one point three cards (7, 8, 9) are dropped from the left stack.
  • The second cut occurs at card 14, which is the eighth card in the second column. After the second riffle, notice that the cards 7, 8, and 9 are still consecutive, and the first and last cards are still in their original locations. Six cards (30%) are still in their initial positions in the deck.
  • The pair 7 and 8 is not separated until the fourth shuffle.
  • The last card (20) does not move from the bottom of the deck until the fifth shuffle.
  • The first card (1) does not move from the top of the deck until the sixth shuffle.

The Efficiency of the GSR Algorithm

As far as efficiency goes, the GSRShuffle module that I've implemented here is not very efficient. As I've said before, the SAS/IML language is a vector language, so statements that operate on a few long vectors run much faster than equivalent statements that involve many scalar quantities.

This implementation of the shuffling algorithm is not vectorized. Unfortunately, because the probability of a card dropping from the left stack changes at every iteration, there is no way to call the RANDGEN function once and have it return all n numbers required to simulate a single riffle shuffle.

Or is there? Perhaps there is an equivalent algorithm that can be vectorized? Next week I'll present a more efficient version of the GSR algorithm that does not require an explicit loop over the number of cards in the deck.

References

D. Bayer and P. Diaconis (1992), "Trailing the Dovetail Shuffle to Its Lair," Annals of Applied Probability, 2(2), 294–313.
P. Diaconis (1988), Group Representations in Probability and Statistics, IMS, Hayward, CA.
E. Gilbert (1955), "Theory of Shuffling," Technical Memorandum, Bell Laboratories.
J. Reeds (1981), unpublished manuscript.
April 4, 2011
 
In my article on computing confidence intervals for rankings, I had to generate p random vectors that each contained N random numbers. Each vector was generated from a normal distribution with different parameters. This post compares two different ways to generate p vectors that are sampled from independent normal distributions.

Sampling from Normal Distributions

Recall that the following SAS/IML statements sample from p identical normal distributions:

proc iml;
call randseed(1234); /** set random number seed **/
N = 10000; /** number of observations per sample **/
p = 100; /** number of variables **/
mean = 1; stdDev = 2;
x = j(N, p); /** allocate space for results **/
call randgen(x, "Normal", mean, stdDev);

Each of the p columns of the matrix x is a sample from the same normal population. Today's article describes how to sample each column from a different population. That is, the means and standard deviations are different for each column of x.

As I've discussed previously, when you want correlated vectors you can sample from a multivariate normal distribution by using the RANDNORMAL module. Each of the resulting vectors is univariate normal and they are correlated with each other.

You can also use the RANDNORMAL module to sample from p independent and uncorrelated variables: just pass in a diagonal matrix for the covariance matrix. However, there is a more efficient way to sample from p independent normal variables: call the RANDGEN function p times in a loop.

Why Sampling in a Loop Is Better

Some long-time readers might be surprised that I would recommend writing a loop because I have blogged many times about the evils of unnecessary looping in the SAS/IML language, especially for sampling and simulation. But loops are not inherently evil. The point I often make is that loops should be avoided when there is a more efficient vectorized function that does the same thing.

But in the case of sampling from independent normals, a loop is more efficient than calling the RANDNORMAL module. Here's why. The RANDNORMAL algorithm has three steps:

  1. Sample from p independent standard normal distributions.
  2. Compute the Cholesky decomposition of the covariance matrix by using the SAS/IML ROOT function.
  3. Perform a matrix multiplication that transforms the uncorrelated variates into correlated variates.

Because the RANDNORMAL algorithm requires sampling but also involves a Cholesky decomposition (which is an O(p³) operation), I expect it to be less efficient than an operation that uses only sampling.

The following statements use PROC IML to compute p independent normal variates, each with its own mean and variance:

mean = j(p,1,0); /** use zero means **/
var = uniform( j(p,1) ); /** random variances **/

x = j(N, p); /** allocate space for results **/
z = j(N,1); /** allocate helper vector **/
do i = 1 to p;
   call randgen(z, "Normal", mean[i], sqrt(var[i]));
   x[,i] = z;
end;

Comparing the Two Algorithms

For comparison, the following statements sample from a multivariate normal distribution with uncorrelated components:

D = diag(var); /** diagonal covariance matrix **/
x = randnormal(N, mean, D);

You can use the techniques in Chapter 15 of my book, Statistical Programming with SAS/IML Software, to compare the performance of these two methods for a range of values of p. A graph of the results shows that the RANDNORMAL method does indeed require more computational time. Using the module, it takes about one second to create a 300x300 covariance matrix and use it to generate 300 random vectors with 10,000 observations. If you use a loop, sampling from 300 univariate normal distributions takes less than 0.2 seconds.
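
A minimal timing sketch (simpler than the Chapter 15 techniques, and assuming that N, p, mean, and var are still defined from the previous statements) might look like the following. The TIME function returns the time of day in seconds, so differences give elapsed times:

/** sketch: time the RANDNORMAL method **/
t0 = time();
D = diag(var);
x1 = randnormal(N, mean, D);
tMVN = time() - t0;

/** sketch: time the loop of univariate RANDGEN calls **/
t0 = time();
x2 = j(N, p);
z = j(N, 1);
do i = 1 to p;
   call randgen(z, "Normal", mean[i], sqrt(var[i]));
   x2[,i] = z;
end;
tLoop = time() - t0;

print tMVN tLoop;   /** the loop is typically faster **/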

So go ahead and loop. This time, it's the more efficient choice.