Statistical Programming

May 16, 2011
A fundamental operation in data analysis is finding data that satisfy some criterion. How many people are older than 85? What are the phone numbers of the voters who are registered Democrats? These questions are examples of locating data with certain properties or characteristics.

The SAS DATA step has a variety of statements (such as the WHERE and IF statements) that enable statistical programmers to locate observations and subset data. The SAS/IML language has similar language features, but because data is often stored in SAS/IML matrices, the SAS/IML language also has a function that is not available in the DATA step: the LOC function.

The LOC Function

If your data are in a SAS/IML matrix, x, the LOC function enables you to find elements of x for which a given criterion is true. The LOC function returns the LOCations (indices) of the relevant elements. (In the R language, the which function implements similar functionality.) For example, the following statements define a numeric vector, x, and use the LOC function to find the indices for which the numbers are greater than 3:

proc iml;
x = {1 4 3 5 2 7 3 5};
/** which elements are > 3? **/
k = loc( x>3 ); 
print k;

Notice the following:

  • The argument to the LOC function is an expression that resolves to a vector of 0s and 1s. (Some languages call this a logical vector.) In practice, the argument to the LOC function is almost always an expression.
  • The result of the LOC function is always a row vector. The number of columns is the number of elements of x that satisfy the given criterion.
  • The LOC function returns indices of x, not values of x. To obtain the values, use x[k], as shown in the short example after this list. (Indices and subscripts are related; for vectors, they are the same.)

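For example, continuing the previous example, the following short sketch retrieves the values that correspond to the indices in k:

v = x[k];   /** values of x that are greater than 3 **/
print v;    /** v = {4 5 7 5} **/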
How Many Elements Satisfy the Criterion?

You can exploit the fact that the LOC function outputs a row vector. To count the number of elements that satisfy the criterion, simply use the NCOL function, as follows:

n = ncol(k); /** how many? **/
print n;

What If No Elements Satisfy the Criterion?

The expression ncol(j) always tells you the number of elements that satisfy the criterion, even when no elements satisfy the criterion. The following statements ask for the elements larger than 100 and handle the possible results:

j = loc( x>100 );
if ncol(j) > 0 then do;
   print "At least one element found";
   /** handle this case **/
end;
else do;
   print "No elements found";
   /** handle alternate case **/
end;

In the preceding example, x does not contain any elements that are greater than 100. Therefore the matrix j is an empty matrix, which means that j has zero rows and zero columns. It is a good programming practice to check the results of the LOC function to see if any elements satisfied the criterion. For more details, see Chapter 3 of Statistical Programming with SAS/IML Software.
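As a quick check (a small sketch that continues the example), you can print the dimensions of the empty matrix:

j = loc( x>100 );                     /** no elements satisfy the criterion **/
dim = nrow(j) || ncol(j);
print dim[colname={"rows" "cols"}];   /** both are 0 for an empty matrix **/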

Using the LOC Function to Subset a Vector

The LOC function finds the indices of elements that satisfy some criterion. These indices can be used to subset the data. For example, the following statements read information about vehicles in the SasHelp.Cars data set. The READ statement creates a vector that contains the make of each vehicle ("Acura," "Audi," "BMW,"...) and creates a second vector that contains the engine size (in liters) for each vehicle. The LOC function is used to find the indices for the vehicles made by Acura. These indices are then used to subset the EngineSize vector in order to produce a vector, s, that contains only the engine volumes for the Acura vehicles:

use Sashelp.Cars;
read all var {Make EngineSize};
close Sashelp.Cars;

/** find observations that 
    satisfy a criterion **/
idx = loc( Make="Acura" );
s = EngineSize[idx];
print s[label="EngineSize (Acura)"];

LOC = Efficient SAS Programming

I have called the LOC function the most useful function that most DATA step programmers have never heard of. Despite its relative obscurity, it is essential that SAS/IML programmers master the LOC function. By using the LOC function, you can write efficient vectorized programs, rather than inefficient programs that loop over data.
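As a small illustration of the difference (a sketch, not part of the original example), here is a loop-based count next to the vectorized equivalent for the vector x defined earlier:

/** looping over data: many scalar comparisons **/
count = 0;
do i = 1 to ncol(x);
   if x[i] > 3 then count = count + 1;
end;

/** vectorized: a single LOC call **/
count2 = ncol(loc(x > 3));
print count count2;   /** both equal 4 **/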

April 29, 2011
In last week's article on how to create a funnel plot in SAS, I wrote the following comment:
I have not adjusted the control limits for multiple comparisons. I am doing nine comparisons of individual means to the overall mean, but the limits are based on the assumption that I'm making a single comparison.
This article discusses how to adjust the control limits (called decision limits in the GLM procedure) to account for multiple comparisons. Because the adjustments are more complicated when the group sizes are not constant, this article treats the simpler case in which each group has the same number of observations. For details on multiple comparisons, see Multiple Comparisons and Multiple Tests Using SAS (the second edition is scheduled for Summer 2011).

Example Data and ANOM Chart

In the funnel plot article, I used data for the temperatures of 52 cars. Each car was one of nine colors, and I was interested in whether the mean temperature of a group (say, black cars) was different from the overall mean temperature of the cars. The number of cars in each color group varied. However, in order to simplify the analysis, today's analysis uses only the temperatures of the first four cars of each color.

You can download the new data and all of the SAS statements used in this article. The following statements create the data and run the SAS/QC ANOM procedure to generate the ANOM chart:

ods graphics on;
proc anom data=CarTemp2;
   xchart Temperature*Color;
   label Temperature = 'Mean Temperature (F)';
   label Color = 'Car Color';
run;

Even for this smaller set of data, it is apparent that black cars are warmer than average and silver and white cars are cooler than average. You can create a similar plot by using the LSMEANS statement in the SAS/STAT GLM procedure.

Computing Decision Limits: An Overview

The formulas for computing decision limits are available in the documentation of the XCHART statement in the ANOM procedure. The decision limits have three components:

  1. The central value, y, with which you want to compare each individual group mean. Often this is the grand mean of the data.
  2. A variance term, v, which involves the root mean square error, the number of groups, and the size of each group. This term quantifies the accuracy of the comparisons.
  3. A multiplier, h, which depends on the significance level, α, and accounts for the multiple comparisons.
The upper and lower decision limits are then formed as y ± h * v. The following sections compute each component of the decision limits.

Computing the Central Value

The central value is the easiest component to compute. When the group sizes are constant, the central value is merely the overall mean:

proc iml;
use CarTemp2; 
read all var {Color Temperature}; 
close CarTemp2;

/** 1. overall mean **/
y = mean(Temperature); 

This value is 123.6, as shown in the ANOM chart. The ANOM chart compares each individual group mean to this central value.

Computing the Variance Term

The second component in the computation of decision limits is the variance term. This term measures the accuracy you have when comparing group means to the overall mean. (More variance means less accuracy.) The formula involves the mean square error, which in this case is just the average of the sample variances of the nine groups. For convenience, the following statements define a SAS/IML module that computes the average variance:

/** 2. variance term **/
start MSEofGroups(g, x);
   u = unique(g); /** g is group var **/
   nGroups = ncol(u);
   v = j(1, nGroups);
   do i = 1 to nGroups;
      v[i] = var( x[loc(g=u[i])] );
   end;
   return( sum(v)/nGroups );
finish;

The module is then used to compute the variance term:

MSE = MSEofGroups(Color, Temperature);
nGroups = 9; /** or determine from data **/
size = repeat(4, nGroups); /** {4,4,...,4} **/
v = sqrt(MSE) * sqrt((nGroups-1)/sum(size));

Computing the ANOM Multiplier

The final component in forming the ANOM decision limits is the multiplier, h. In elementary statistics, the value 2 (or more precisely, the 0.975 quantile of a t distribution) might be used as a multiplier, but that value isn’t big enough when multiple comparisons are being made. The PROC ANOM documentation states that in a comparison of several group means with the overall mean, the proper value of h is the 1 – α quantile of a certain distribution. However, the documentation does not specify how to compute this quantile.

In SAS software you can compute the quantile by using the PROBMC function. I had never heard of the PROBMC function until I started working on this article, but it is similar to the QUANTILE function in that it enables you to obtain quantiles from one of several distributions that are used in multiple comparison computations. (You can also use the PROBMC function to obtain probabilities.)

The following statements compute h for α = 0.05 and for the case of nine groups, each with four observations:

/** 3. multiplier for ANOM **/
alpha = 0.05;
pAnom = 1 - alpha;

/** degrees of freedom for 
    pooled estimate of variance **/
df = sum(size)-nGroups; 
h = probmc("ANOM", ., pAnom, df, nGroups); 

The main idea is that h is the 1 – α quantile of the "ANOM distribution." Although the "ANOM distribution" is not as well-known as the t distribution, the idea is the same. The distribution involves parameters for the degrees of freedom and the number of groups. In the general case (when the group sizes are not constant), the sizes of the groups are also parameters for the distribution (not shown here).

Computing the Decision Limits

All three pieces are computed, so it is easy to put them together to compute the upper and lower decision limits:

/** compute decision limits **/
upperAnom = y + h * v;
lowerAnom = y - h * v;
print lowerAnom upperAnom;

Notice that these values are identical to the values graphed by the ANOM procedure.

Comparing the ANOM Multiplier with Better-Known Multipliers

The computation is finished, but it is interesting to compare the ANOM multiplier with more familiar multipliers from the t distribution.

A classic way to handle multiple comparisons is to use the Bonferroni adjustment. In this method, you divide α by the number of comparisons (9) but continue to use quantiles of the t distribution. By dividing α by the number of groups, you find quantiles that are further in the tail of the t distribution and therefore are larger than the unadjusted values. You can show that the Bonferroni multiplier is a conservative multiplier that will always be larger than the ANOM multiplier.

The following statements compute decision limit multipliers based on an unadjusted t quantile (such as is used for a classical confidence interval for a mean) and on a Bonferroni adjusted quantile. These are printed, along with the multiplier h that was computed previously.

/** compare with unadjusted and Bonferroni multipliers **/
q = quantile("T", 1-alpha/2, df); 
qBonf = quantile("T", 1-alpha/2/nGroups, df); 
print q qBonf h;

For these data, the Bonferroni multiplier is only about 1% larger than h. You can see that the Bonferroni and ANOM multipliers are about 50% larger than the multiplier based on the unadjusted quantile, which means that the decision limits based on these quantiles will be wider. This is good, because the unadjusted limits are too narrow for multiple comparisons.

April 27, 2011
The log transformation is one of the most useful transformations in data analysis. It is used as a transformation to normality and as a variance stabilizing transformation. A log transformation is often used as part of exploratory data analysis in order to visualize (and later model) data that ranges over several orders of magnitude. Common examples include data on income, revenue, populations of cities, sizes of things, weights of things, and so forth.

In many cases, the variable of interest is positive and the log transformation is immediately applicable. However, some quantities (for example, profit) might contain a few negative values. How do you handle negative values if you want to log-transform the data?

Solution 1: Translate, then Transform

A common technique for handling negative values is to add a constant value to the data prior to applying the log transform. The transformation is therefore log(Y+a) where a is the constant. Some people like to choose a so that min(Y+a) is a very small positive number (like 0.001). Others choose a so that min(Y+a) = 1. For the latter choice, you can show that a = b – min(Y), where b is either a small number or is 1.

In the SAS/IML language, this transformation is easily programmed in a single statement. The following example uses b=1 and calls the LOG10 function, but you can call LOG, the natural logarithm function, if you prefer.

proc iml;
Y = {-3,1,2,.,5,10,100}; /** negative datum **/
LY = log10(Y + 1 - min(Y)); /** translate, then transform **/

Solution 2: Use Missing Values

A criticism of the previous method is that some practicing statisticians don't like to add an arbitrary constant to the data. They argue that a better way to handle negative values is to use missing values for the logarithm of a nonpositive number.

This is the point at which some programmers decide to resort to loops and IF statements. For example, some programmers write the following inefficient SAS/IML code:

n = nrow(Y);
LogY = j(n,1); /** allocate result vector **/
do i = 1 to n; /** loop is inefficient **/
   if Y[i]>0 then LogY[i] = log10(Y[i]);
   else LogY[i] = .;
end;

The preceding approach is fine for the DATA step, but the DO loop is completely unnecessary in PROC IML. It is more efficient to use the LOC function to assign LogY, as shown in the following statements.

/** more efficient statements **/
LogY = j(nrow(Y),1,.); /** allocate missing **/
idx = loc(Y>0); /** find indices where Y>0 **/
if ncol(idx)>0 then 
   LogY[idx] = log10(Y[idx]);

print Y LY LogY;

The preceding statements initially define LogY to be a vector of missing values. The LOC function finds the indices of Y for which Y is positive. If at least one such index is found, those positive values are transformed and overwrite the missing values. A missing value remains in LogY for any element for which Y is missing or nonpositive.

You can see why some practitioners prefer the second method over the first: the logarithms of the data are unchanged by the second method, which makes it easy to mentally convert the transformed data back to the original scale (see the transformed values for 1, 10, and 100). The translation method makes the mental conversion harder.
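As a quick sketch (an addition that continues the example above; it transforms only the positive values), you can invert each transformation to see the difference:

pos = loc(Y > 0);
invLogY = 10##(LogY[pos]);              /** method 2: recovers 1, 2, 5, 10, 100 directly **/
invLY   = 10##(LY[pos]) + min(Y) - 1;   /** method 1: you must also undo the translation **/
print (Y[pos]) invLY invLogY;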

You can use the previous technique for other functions that have restricted domains. For example, the same technique applies to the SQRT function and to inverse trigonometric functions such as ARSIN and ARCOS.
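For example, here is a sketch (an adaptation of the code above, not code from the original post) that applies the same pattern to the SQRT function:

SqrtY = j(nrow(Y), 1, .);      /** allocate missing **/
idx = loc(Y >= 0);             /** SQRT is defined only for nonnegative values **/
if ncol(idx) > 0 then
   SqrtY[idx] = sqrt(Y[idx]);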

April 15, 2011
In a previous blog post, I showed how you can use simulation to construct confidence intervals for ranks. This idea (from a paper by E. Marshall and D. Spiegelhalter) enables you to display a graph that compares the performance of several institutions, where "institutions" can mean schools, companies, airlines, or any set of similar groups.

A more recent paper (Spiegelhalter, 2004) recommends doing away with individual confidence intervals altogether. Instead, Spiegelhalter recommends plotting the means of each group versus "an interpretable measure of its precision" and overlaying "control limits" similar to those found on a Shewhart control chart.

This article shows how to create a funnel plot in SAS. You can download all of the data and the SAS program used in this analysis.

Spiegelhalter lays out four steps for creating the funnel plot that displays the mean of a continuous response for multiple groups (see Appendix A.3.1):

  1. Compute the mean of each category.
  2. Compute the overall mean, y, and the standard deviation, s.
  3. Compute control limits y ± zα * s / sqrt(n) where zα is the α quantile of the normal distribution and n varies between the number of observations in the smallest group and the number in the largest group.
  4. Plot the mean of each category against the sample size and overlay the control limits.

The Temperature of Car Roofs in a Parking Lot

The data for this example are from an experiment by Clark Andersen in which he measured the temperature of car roofs in a parking lot. He observed that black and burgundy cars had the hottest roofs, whereas white and silver cars were cooler to touch.

Andersen measured the roof temperatures of 52 cars on a mild (71 degree F) day. The cars were of nine colors and 4–8 measurements were recorded for each color. You could rank the car colors by the mean temperature of the roof (remember to include confidence intervals!) but an alternative display is to plot the mean temperature for each car color versus the number of cars with that color. You can then overlay funnel-shaped "control limits" on the chart.

The funnel plot is shown below. The remainder of this post shows how to create the graph.

Computing the Mean of Each Category

The following statements read the data into SAS/IML vectors:

proc iml;
use CarTemps;
   read all var {Color Temperature};
close CarTemps;

You can use the UNIQUE/LOC technique to compute the sample size and mean temperature for each color category. The following technique is described in Section 3.3.5 of Statistical Programming with SAS/IML Software:

/** for each car color, compute the mean and
    standard error of the mean **/
u = unique(Color); /** unique colors **/
p = ncol(u); /** how many colors? **/
mean = j(p,1); sem = j(p,1); n = j(p,1);
do i = 1 to p;
   idx = loc(Color=u[i]);
   n[i] = ncol(idx);
   T = Temperature[idx]; 
   mean[i] = mean(T); /** mean temp of category **/
   sem[i] = sqrt(var(T)/n[i]); /** stderr **/
end;

The following table summarizes the data by ranking the mean temperatures for each color. Although the standard errors are not used in the funnel plot, they are included here so that you can see that many of the confidence intervals overlap. You could use the SGPLOT procedure to display these means along with 95% confidence intervals, but that is not shown here.

   color            n  mean  sem

   black            8 137.3  3.6
   burgundy         4 133.9  4.6
   green            4 130.9  7.3
   gray             6 130.5  2.5
   blue             7 129.3  3.9
   red              6 128.5  2.9
   tan              4 116.1  5.1
   silver           6 107.9  3.2
   white            7  98.4  1.8

Those car roofs get pretty hot! Are the black cars much hotter than the average? What about the burgundy cars? Notice that the sample means of the burgundy and green cars are based on only four observations, so the uncertainty in those estimates is greater than for cars with more measurements. The funnel plot displays both the mean temperature and the precision in a single graph.

Computing the Overall Mean and Standard Deviation

The funnel plot compares the group means to the overall mean. The following statements compute the overall mean and standard deviation of the data, ignoring colors:

/** compute overall mean and variance **/
y = Temperature[:];
s = sqrt(var(Temperature));
print y s[label="StdDev"];

The overall mean temperature is 123 degrees F, with a standard deviation of 16 degrees. The VAR function is available in SAS/IML 9.22. If you are using an earlier version of SAS/IML, you can use the VAR module from my book.

Computing Control Limits

A funnel plot is effective because it explicitly reveals a source of variability for the means, namely the different sample sizes. The following statements compute the control limits for these data:

/** confidence limits for a range of sample sizes **/
n = T( do(3, 8.5, 0.1) );
p = {0.001 0.025 0.975 0.999}; /** lower/upper limits **/
z = quantile("normal", p);
/** compute 56 x 4 matrix that contains confidence limits
    for n = 3 to 8.5 by 0.1 **/
limits = y + s / sqrt(n) * z;

Notice that the limits variable is a matrix. The expression s/sqrt(n) is a column vector with 56 rows, whereas the row vector z contains four z-scores. Therefore the (outer) product is a 56x4 matrix. The values in this matrix are used to overlay control limits on a plot of mean versus the sample size.

Creating a Funnel Plot

After writing the SAS/IML computations to a data set (not shown here), you can use PROC SGPLOT to display the mean temperature for each car color and use the BAND statement to overlay 95% and 99.8% control limits.

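The post does not show that step, so here is a minimal sketch of one possible approach. It assumes that the group summaries (Color, N, Mean) were already written to a data set named Means before the grid of sample sizes overwrote the vector n, and it uses the assumed variable names L998, L95, U95, and U998 for the control limits:

/** sketch only: write the control limits, then stack them with the group means **/
CL = n || limits;
create Limits from CL[c={"N" "L998" "L95" "U95" "U998"}];
append from CL;
close Limits;
quit;

data All;                /** Means is assumed to contain Color, N, and Mean **/
   set Means Limits;
run;
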
title "Temperatures of Car Roofs";
title2 "71 Degrees in the Shade";
proc sgplot data=All;
   scatter x=N y=Mean /datalabel=Color;
   refline 123.4 / axis=y;
   band x=N lower=L95 upper=U95 / nofill;
   band x=N lower=L998 upper=U998 / nofill;
   xaxis label="Number of Cars";
   yaxis label="Average Temperature";

The funnel plot indicates that black cars are hotter than average. Silver and white cars are cooler than average. More precisely, the plot shows that the mean temperature of the black cars exceeds the 95% prediction limits, which indicates that the mean is greater than would be expected by random variation from the overall mean. Similarly, the mean temperature of the silver cars is lower than the 95% prediction limits. The mean temperature of the white cars is even more extreme: it is beyond the 99.8% prediction limits.


The funnel plot is a simple way to compare group means to the overall average. The funnel-shaped curves are readily explainable to a non-statistician and the plot enables you to compare different groups without having to rank them. The eye is naturally drawn to observations that are outside the funnel curves, which is good because these observations often warrant special investigation.

More advantages and some limitations are described in Spiegelhalter's paper. He also shows how to construct the control limits for funnel plots for proportions, ratios of proportions, odds ratios, and other situations.

Two criticisms of my presentation come to mind. The first is that I've illustrated the funnel plot by using groups that have a small number of observations. In order to make the control limits look like funnels, I used a "continuous" value of n, even though there is no such thing as a group with 4.5 or 5.8 observations! Spiegelhalter's main application is displaying the mean mortality rates of hospitals that deal with hundreds or thousands of patients, so the criticism does not apply to his examples.

The second criticism is that I have not adjusted the control limits for multiple comparisons. I am doing nine comparisons of individual means to the overall mean, but the limits are based on the assumption that I'm making a single comparison. Spiegelhalter notes (p. 1196) that "the only allowance for multiple comparisons is the use of small p-values for the control limits. These could be chosen based on some formal criterion such as Bonferroni..., but we suggest that this is best carried out separately." Following Spiegelhalter's suggestion, I leave that adjustment for another blog post.

April 13, 2011
I have recently returned from five days at SAS Global Forum in Las Vegas. On the way there, I finally had time to read a classic statistical paper: Bayer and Diaconis (1992), which describes how many shuffles are needed to randomize a deck of cards. Their famous result that it takes seven shuffles to randomize a 52-card deck is known as "the bane of bridge players" because the result motivated many bridge clubs to switch from hand shuffling to computer generated shuffling. Casual bridge players also blame this result for "slowing down the game" while the cards are shuffled more times than seems intuitively necessary.

In the second paragraph of the paper, Bayer and Diaconis introduce a "mathematically precise model of shuffling," which is known as the Gilbert-Shannon-Reeds (GSR) model. This model is known to be a "good description of the way real people shuffle real cards." (See Diaconis (1988) and the references at the end of this article.) This article describes how to implement the GSR shuffling model in SAS/IML software.

The Riffle Shuffle

Computationally, you can shuffle a deck by generating a permutation of the set 1:n, but that is not how real cards are shuffled.

The riffle (or "dovetail") shuffle is the most common shuffling algorithm. A deck of n cards is split into two parts and the two stacks are interleaved. The GSR algorithm simulates this physical process.

The GSR model splits the deck into two pieces according to the binomial distribution. Each piece has roughly n/2 cards. Then cards are dropped from the two stacks according to the number of cards remaining in each stack. Specifically, if there are NL cards in the left stack and NR cards in the right stack, then the probability of the next card dropping from the right stack is NR / (NR + NL).

The following SAS/IML module is a straightforward implementation of the GSR algorithm:

proc iml;
/** Gilbert-Shannon-Reeds statistical 
    model of a riffle shuffle. Described in 
    Bayer and Diaconis (1992) **/
start GSRShuffle(deck);
   n = nrow(deck);
   /** cut into two stacks **/
   nL = rand("Binomial", 0.5, n);
   nR = n - nL;

   /** left stack, right stack **/
   L = deck[1:nL]; R = deck[nL+1:n];
   j = 1; k = 1; /** card counters **/
   shuffle = j(n,1); /** allocate **/
   do i = 1 to n;
      c = rand("Bernoulli", nR/(nL+nR)); 
      if c=0 then do;  /** drop from left **/
        shuffle[i] = L[j];
        nL = nL-1;  j=j+1; /** update left **/
      end;
      else do;  /** drop from right **/
        shuffle[i] = R[k];
        nR = nR-1;  k=k+1; /** update right **/
      end;
   end;
   return( shuffle );
finish;

Testing the GSR Algorithm

You can test the algorithm by starting with a deck of cards in a known order and observing how the cards are mixed by consecutive riffle shuffles. The following statements riffle the cards seven times and store the results of each shuffle. To save space, a 20-card deck is used. The original order of the cards is denoted 1, 2, 3, ..., 20.

call randseed(1234);
n = 20; /** n=52 for a standard deck **/
deck = T(1:n);

s = j(n, 8); /** allocate for 8 decks **/
s[,1] = deck; /** original order **/
do i = 1 to 7;
   s[,i+1] = GSRShuffle( s[,i] );
end;
names = "s0":"s7";
print s[colname=names];

    s0  s1  s2  s3  s4  s5  s6  s7

     1   1   1   1   1   1  18   2
     2   2  14  14  14  11   1  13
     3  12   2   2   4  14  11  12
     4   3   6   6   2   4  14  18
     5  13  12   3  12   2  19   1
     6   4  15  16  15  12   7  11
     7   5   7  13  17  15   3  14
     8  14   8   4  10  17   4  15
     9   6   9  12   6  10  16  19
    10  15   3  15  18   6   2   8
    11   7  16  17   7  18  13   7
    12   8  13  10   3  19  12   3
    13   9   4  18  16   7  15   9
    14  16  17   7  13   3   8   4
    15  17  10   8   8  16   9  17
    16  10  18   5   5  13  17  20
    17  18   5  11  11   8  20  16
    18  11  11  19  19   9  10  10
    19  19  19   9   9  20   5   5
    20  20  20  20  20   5   6   6

It is interesting to study the evolution of mixing the cards:

  • For the first shuffle, the original deck is cut into two stacks (1:11 and 12:20) and riffled to form the second column. Notice that usually one or two cards are dropped from each stack, although at one point three cards (7, 8, 9) are dropped from the left stack.
  • The second cut occurs at card 14, which is the eighth card in the second column. After the second riffle, notice that the cards 7, 8, and 9 are still consecutive, and the first and last cards are still in their original locations. Six cards (30%) are still in their initial positions in the deck.
  • The pair 7 and 8 is not separated until the fourth shuffle.
  • The last card (20) does not move from the bottom of the deck until the fifth shuffle.
  • The first card (1) does not move from the top of the deck until the sixth shuffle.

The Efficiency of the GSR Algorithm

As far as efficiency goes, the GSRShuffle module that I've implemented here is not very efficient. As I've said before, the SAS/IML language is a vector language, so statements that operate on a few long vectors run much faster than equivalent statements that involve many scalar quantities.

This implementation of the shuffling algorithm is not vectorized. Unfortunately, because the probability of a card dropping from the left stack changes at every iteration, there is no way to call the RANDGEN function once and have it return all n numbers required to simulate a single riffle shuffle.

Or is there? Perhaps there is an equivalent algorithm that can be vectorized? Next week I'll present a more efficient version of the GSR algorithm that does not require an explicit loop over the number of cards in the deck.


D. Bayer and P. Diaconis (1992), "Trailing the Dovetail Shuffle to Its Lair," Annals of Applied Probability, 2(2), 294–313.
P. Diaconis (1988), Group Representations in Probability and Statistics, IMS, Hayward, CA.
E. Gilbert (1955), "Theory of Shuffling," technical memorandum, Bell Laboratories.
J. Reeds (1981), unpublished manuscript.
April 4, 2011
In my article on computing confidence intervals for rankings, I had to generate p random vectors that each contained N random numbers. Each vector was generated from a normal distribution with different parameters. This post compares two different ways to generate p vectors that are sampled from independent normal distributions.

Sampling from Normal Distributions

Recall that the following SAS/IML statements sample from p identical normal distributions:

proc iml;
call randseed(1234); /** set random number seed **/
N = 10000; /** number of observations per sample **/
p = 100; /** number of variables **/
mean = 1; stdDev = 2;
x = j(N, p); /** allocate space for results **/
call randgen(x, "Normal", mean, stdDev);

Each of the p columns of the matrix x is a sample from the same normal population. Today's article describes how to sample each column from a different population. That is, the means and standard deviations are different for each column of x.

As I've discussed previously, when you want correlated vectors you can sample from a multivariate normal distribution by using the RANDNORMAL module. Each of the resulting vectors is univariate normal and they are correlated with each other.

You can also use the RANDNORMAL module to sample from p independent and uncorrelated variables: just pass in a diagonal matrix for the covariance matrix. However, there is a more efficient way to sample from p independent normal variables: call the RANDGEN function p times in a loop.

Why Sampling in a Loop Is Better

Some long-time readers might be surprised that I would recommend writing a loop because I have blogged many times about the evils of unnecessary looping in the SAS/IML language, especially for sampling and simulation. But loops are not inherently evil. The point I often make is that loops should be avoided when there is a more efficient vectorized function that does the same thing.

But in the case of sampling from independent normals, a loop is more efficient than calling the RANDNORMAL module. Here's why. The RANDNORMAL algorithm has three steps:

  1. Sample from p independent standard normal distributions.
  2. Compute the Cholesky decomposition of the covariance matrix by using the SAS/IML ROOT function.
  3. Perform a matrix multiplication that transforms the uncorrelated variates into correlated variates.
Because the RANDNORMAL algorithm requires sampling, but also involves a Cholesky decomposition (which is an O(p³) operation), I expect it to be less efficient than an operation that uses only sampling.
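For reference, here is a minimal sketch of those three steps (an assumption about how such an algorithm can be written, not the actual RANDNORMAL source), for a p x 1 mean vector mu and a p x p covariance matrix Sigma, which are placeholder names:

z = j(N, p);
call randgen(z, "Normal");      /** 1. p independent standard normal columns **/
U = root(Sigma);                /** 2. Cholesky root: Sigma = U`*U **/
x = z*U + repeat(mu`, N);       /** 3. transform to correlated variates **/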

The following statements use PROC IML to compute p independent normal variates, each with its own mean and variance:

mean = j(p,1,0); /** use zero means **/
var = uniform( j(p,1) ); /** random variances **/

x = j(N, p); /** allocate space for results **/
z = j(N,1); /** allocate helper vector **/
do i = 1 to p;
   call randgen(z, "Normal", mean[i], sqrt(var[i]));
   x[,i] = z;
end;

Comparing the Two Algorithms

For comparison, the following statements sample from a multivariate normal distribution with uncorrelated components:

D = diag(var); /** diagonal covariance matrix **/
x = randnormal(N, mean, D);

You can use the techniques in Chapter 15 of my book, Statistical Programming with SAS/IML Software, to compare the performance of these two methods for a range of values of p. A graph of the results shows that the RANDNORMAL method does indeed require more computational time. Using the module, it takes about one second to create a 300x300 covariance matrix and use it to generate 300 random vectors with 10,000 observations. If you use a loop, sampling from 300 univariate normal distributions takes less than 0.2 seconds.
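As an informal check (a sketch that reuses the vectors defined above, not the full benchmarking code from the book), you can time both methods for a single value of p by using the TIME function:

t0 = time();
do i = 1 to p;
   call randgen(z, "Normal", mean[i], sqrt(var[i]));
   x[,i] = z;
end;
tLoop = time() - t0;      /** elapsed seconds for the loop method **/

t0 = time();
x = randnormal(N, mean, D);
tMVN = time() - t0;       /** elapsed seconds for the RANDNORMAL method **/
print tLoop tMVN;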

So go ahead and loop. This time, it's the more efficient choice.

March 30, 2011
In a previous post, I described how to compute means and standard errors for data that I want to rank. The example data (which are available for download) are mean daily delays for 20 US airlines in 2007.

The previous post carried out steps 1 and 2 of the method for computing confidence intervals for rankings, as suggested in Marshall and Spiegelhalter (1998):

  1. Compute the mean delay for each airline in 2007.
  2. Compute the standard error for each mean.
  3. Use a simulation to compute the confidence intervals for the ranking.

This post carries out step 3 of the method.

Ordering Airlines by Mean Delays

To recap, here is the SAS/IML program so far. The following statements compute the mean delays and standard errors for each airline in 2007. The airlines are then reordered by their mean delays:

proc iml;
/** read mean delay for each day in 2007
    for each airline carrier **/
Carrier = {AA AQ AS B6 CO DL EV F9 FL HA
           MQ NW OH OO PE UA US WN XE YV};
use dir.mvts; 
read all var Carrier into x;
close dir.mvts;
meanDelay = x[:,];

/** sort descending; reorder columns **/
call sortndx(idx, MeanDelay`, 1, 1);
Carrier = Carrier[,idx];
x = x[,idx];
meanDelay = meanDelay[,idx];

Ranking Airlines...with Confidence

You can use the simulation method suggested by Marshall and Spiegelhalter to compute confidence intervals for the rankings. The following statements sample 10,000 times from normal distributions. (I've previously written about how to write an efficient simulation in SAS/IML software.) Each of these samples represents a potential realization of the underlying random variables:

/** simulation of true ranks, according to the method 
    proposed by Marshall and Spiegelhalter (1998) 
    British Med. J. **/
/** parameters for normal distribution of means **/
mean = meanDelay`;
sem = sqrt(var(x)/nrow(x)); /** std error of mean **/

p = ncol(Carrier);
NSim = 10000;
s = j(NSim, p); /** allocate matrix for results **/
z = j(NSim, 1);
call randseed(1234);
do j = 1 to p;
   call randgen(z, "Normal", mean[j], sem[j]);
   s[,j] = z;
end;

The matrix s contains 10,000 rows and 20 columns. Each row is a potential set of means for the airlines, where the ith mean is chosen from a normal distribution with population parameters μi and σi. Because the population parameters are unknown, Marshall and Spiegelhalter advocate replacing the population means with the sample means, and the population standard deviations by the sample standard errors. This kind of simulation is similar to a parametric bootstrap.

The following statements use the SAS/IML RANK function to compute the rank (1–20) for each simulated set of means. (If you need to conserve memory, you can re-use the s matrix to store the ranks.)

/** for each sample, determine the rankings **/
r = j(NSim, p);
do i = 1 to NSim;
   r[i,] = rank(s[i,]);
end;

Each column of the r matrix contains simulated ranks for a single airline. You can use the QNTL subroutine to compute 95% confidence intervals for the rank. Basically, you just chop off the lower and upper 2.5% of the simulated ranks. (If you have not upgraded to SAS/IML 9.22, you can use the QNTL module for this step.)

/** find the 95% CI of the rankings **/
call qntl(q, r, {0.025 0.975});

The simulation is complete. The matrix q contains two rows and 20 columns. The first row contains the lower end of the 95% confidence limits for the ranks; the second row contains the upper limit.

You can print q to see the results in tabular form. I prefer graphical displays, so the following statements create a SAS data set that contains the estimated ranks and the upper and lower confidence limits.

d = rank(mean) || T(q);
varNames = {"Rank" "LCL" "UCL"};
create Rank from d[r=Carrier c=varNames];
append from d[r=Carrier];
close Rank;

It is now straightforward to use the SCATTER statement in PROC SGPLOT to create a scatter plot with confidence intervals (called "error bars" in some disciplines):

proc sgplot data=Rank noautolegend;
title "Ranking Airlines (On-Time Performance, 2007)";
scatter y=carrier x=Rank / xerrorlower=LCL xerrorupper=UCL;
xaxis label="Rank with 95% Confidence Intervals";
yaxis grid discreteorder=data label="Carrier"; 
run;

For these data, the ranking of the first three airlines (AQ, HA, WN) seems fairly certain, but there is more uncertainty for the rankings of airlines that are in the middle of the pack. So, for example, if Frontier Airlines (F9) issues a press release about their 2007 on-time performance, it shouldn't claim "We're #5!" Rather, it should state "We're #5 (maybe), but we might be as high as #4 or as low as #8."

In keeping with my New Year's Resolutions, I've learned something new. Next time I want to rank a list, I will pay more attention to uncertainty in the data. Plotting confidence intervals for the rankings is one way to do that.

March 25, 2011
I recently posted an article about representing uncertainty in rankings on the blog of the ASA Section for Statistical Programmers and Analysts (SSPA). The posting discusses the importance of including confidence intervals or other indicators of uncertainty when you display rankings.

Today's article complements the SSPA post by showing how to construct the confidence intervals in SAS. This article implements the techniques in Marshall and Spiegelhalter (1998), Brit. Med. J., which is a very well written and readable paper. (If you prefer sports to medicine, Spiegelhalter wrote a similar article on ranking teams in the English Premier Football League.)

To illustrate the ideas, consider the problem of ranking US airlines on the basis of their average on-time performance. The data for this example are described in the Statistical Computing and Graphics Newsletter (Dec. 2009) and in my 2010 SAS Global Forum paper. The data are available for download.

The data are in a SAS data set named MVTS that consists of 365 observations and 21 variables. The Date variable records the day in 2007 for which the data were recorded. The other 20 variables contain the mean delays, in minutes, experienced by an airline carrier for that day’s flights. The airline variables in the data are named by using two-digit airline carrier codes. For example, AA is the carrier code for American Airlines.

The appendix of Marshall and Spiegelhalter's paper describes how to compute rankings with confidence intervals:

  1. Compute the mean delay for each airline in 2007.
  2. Compute the standard error for each mean.
  3. Use a simulation to compute the confidence intervals for the ranking.

Ordering Airlines by Mean Delays

You can use PROC MEANS or PROC IML to compute the means and standard errors for each carrier. Because simulation will be used in Step 3 to compute confidence intervals for the rankings, PROC IML is the natural choice for this analysis. The following statements compute the mean delay for all airlines (sometimes called the "grand mean") and the mean delays for each airline in 2007. The airlines are then sorted by their mean delays, as described in a previous article about ordering the columns of a matrix.

proc iml;
/** read mean delay for each day in 2007
    for each airline carrier **/
Carrier = {AA AQ AS B6 CO DL EV F9 FL HA
           MQ NW OH OO PE UA US WN XE YV};
use dir.mvts; 
read all var Carrier into x;
close dir.mvts;

grandMean = x[:];
meanDelay = x[:,];
call sortndx(idx, MeanDelay`, 1);

/** reorder columns **/
Carrier = Carrier[,idx];
x = x[,idx];
meanDelay = meanDelay[,idx];
print grandMean, carrier;

You can use these computations to reorder the variables in the data set (this step is not shown) and use PROC SGPLOT to plot the means and standard errors for each airline carrier:

/** convert data from wide to long format**/
proc transpose data=mvts 
   out=Delay(rename=(col1=Delay)) name=Carrier;
by date;
run;

title "Comparison of Airline Daily Delays (2007)";
proc sgplot data=delay noautolegend;
  dot carrier / response=delay stat=mean limitstat=clm;
  refline 9.646 / axis=x lineattrs=(pattern=dash);
  xaxis label="Mean Delay with 95% Confidence Intervals";
  yaxis discreteorder=data label="Carrier"; 
run;

Notice that two airlines (Aloha Airlines (AQ) and Hawaiian Airlines (HA)) have much better on-time performance than the others. In fact, their average daily delay is negative, which means that they typically land a minute or two ahead of schedule!

Many analysts would end the analysis here by assigning ranks based on the mean statistics. But as I point out on my SSPA blog, it is important for rankings to reflect the uncertainty in the mean estimates.

Notice that for most airlines, the confidence interval overlaps with the confidence intervals of one or more other airlines. The "true mean" for each airline is unknown, but probably lies within the confidence intervals.

So, for example, Frontier Airlines (F9) is nominally ranked #5. But you can see from the overlapping confidence intervals that there is a chance that the true mean for Frontier Airlines is lower than the true mean of Delta Airlines (DL), which would make Frontier ranked #4. However, there is also a chance that the true mean could be greater than the true mean of Continental Airlines (CO), which would make Frontier ranked #11. How can you rank the airlines so that that uncertainty is reflected in the ranking?

The answer is in Step 3 of Marshall and Spiegelhalter's approach: use a simulation to compute the confidence intervals for the ranking. This step will appear in a second article, which I will post next week.

March 23, 2011
I recently blogged about how to eliminate a macro loop in favor of using SAS/IML language statements. The purpose of the program was to extract N 3x3 matrices from a big 3Nx3 matrix. The main portion of my PROC IML program looked something like this:

proc iml;
do i=0 to N-1;
   rows = (3*i+1):(3*i+3);    /** find certain rows **/
   s = X[rows,];              /** extract those rows **/
   /** do something with s **/
end;

A reader correctly pointed out that my version of the program does not enable the programmer to access multiple matrices simultaneously. For example, after looping through all the rows of the data, the programmer cannot compute the sum of several of the 3x3 matrices. In my version, each 3x3 matrix is overwritten by the next.

However, you can address this issue by using the VALSET and VALUE functions. Never heard of them? Not very many people have, but they provide a mechanism to indirectly assign a value to a matrix name, or to indirectly get the value of a matrix name.

Indirect Assignment of Values to Matrices

A simple example illustrates the usage of the VALSET function. Suppose you want to assign values to a matrix named x1. The usual way is to write x1={1 2 3}, but you can use the following statements to achieve the same result:

/** indirect assignment: create a matrix
    name and assign a value to it **/
VarName = "x1";
call valset(VarName, {1 2 3});
print x1;

So what use is that? Well, 99% of the time it's of no use at all! However, this kind of "indirect assignment" is useful in the situation where you need to assign many matrices, and the assignment is most easily done in a loop.

For example, suppose you want to assign the matrix x1 to be a 2x2 matrix that consists entirely of 1s, the matrix x2 to be a 2x2 matrix that consists entirely of 2s, and so forth, up to x1000. Furthermore, suppose that you want all of these matrices available for future computations.

You have three options. You can write 1,000 literal assignment statements, you can write a macro %DO loop, or you can use the VALSET function. As I described in a previous blog [LINK], the following statements create the matrix names (x1,...,x1000) by using the string concatenation operator (+) in conjunction with the CHAR function (which converts a number into a character string) and the STRIP function. The VALSET function then assigns a value to each matrix:

do i = 1 to 1000;
   VarName = "x" + strip(char(i,4)); /** x1, x2,... **/
   v = repeat(i, 2, 2); /** create 2x2 matrix **/
   call valset(VarName, v);
end;

You can use the SHOW NAMES statement to convince yourself that the matrices x1,...,x1000 are assigned. You can also compute with any or all of these matrices:

t = x1 + x13 + x7 + x29 + x450;
print t;

Indirect Retrieval of Values from Matrices

The VALSET function sets a matrix to a value. The VALUE function retrieves the value. For example, the following statement gets the value in the x1 matrix:

v = value("x1");

This syntax is rarely necessary, but it can be useful to retrieve values within a loop. For example, to compute the sum x1 + x13 + x7 + x29 + x450, you can define a vector that contains the suffixes, and then create the name of each matrix and use the VALUE function to retrieve its value, as shown in the following statements:

/** indirect reference: create a matrix
    name and retrieve its value **/
k = {1 13 7 29 450};
t = j(2, 2, 0);          /** initialize to zero matrix **/
do i = 1 to ncol(k);
   VarName = "x" + strip(char(k[i],4)); /** x1, x13,... **/
   t = t + value(VarName);
end;

In conclusion, the VALSET and VALUE functions in SAS/IML are little-known functions that can be used to assign and retrieve values from matrices. They can sometimes be used to replace the functionality of a %DO loop. They can also be used for other purposes, such as to produce the functionality of multi-dimensional arrays in PROC IML.
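For example, here is a small sketch (an assumption, not code from the post) that emulates a three-dimensional array by storing each "slice" in its own matrix, A1 through A5:

proc iml;
do k = 1 to 5;
   call valset("A" + strip(char(k,4)), j(2, 3, k)); /** slice k is a 2x3 matrix **/
end;
slice3 = value("A3");   /** retrieve the third slice **/
print slice3;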

March 2, 2011
I don't use the SAS macro language very often. Because the SAS/IML language has statements for looping and evaluating expressions, I rarely write a macro function as part of a SAS/IML programs. Oh, sure, I use the %LET statement to define global constants, but I seldom use the %DO and %EVALF macro statements. [Edit: 2Mar2011. I meant to say the %EVAL function.]

I was therefore momentarily baffled when I tried to decipher the following SAS/IML statements, which include a macro loop:

%macro ExtractSubmatrices;
proc iml; 
use ManyMatrices;  /** 3,747 rows of data **/
read all into X;

%do i=0 %to (3747/3)-1;
   %let j = %eval((&i*3) + 1);
   %let k = %eval((&i*3) + 2);
   %let l = %eval((&i*3) + 3);

   s&i. = X[{&j.,&k.,&l.},];
   /** do something with s&i. **/
%end;
%mend ExtractSubmatrices;
%ExtractSubmatrices;

The program looks complicated because of the macro syntax, but it is actually fairly simple. The program reads a data set, ManyMatrices, which contains 3,747 rows and 3 variables. The first three rows represent a 3 x 3 matrix, the next three rows represent a different 3 x 3 matrix, and so on. (There are a total of 3,747 / 3 = 1,249 matrices in the data set.) For each 3 x 3 matrix, the programmer wants to compute a quantity (not shown) based on the matrix.

The program can be improved by making the following changes:

proc iml;                     /** 1 **/
use ManyMatrices;
   read all into X; 
close ManyMatrices;           /** 2 **/

p = ncol(X);                  /** 3 **/
do i=0 to nrow(X)/p-1;        /** 4 **/
   rows = (p*i+1):(p*i+p);    /** 5 **/
   s = X[rows,];              /** 6 **/
   /** do something with s **/
end;

The numbered comments correspond to the following improvements:

  1. Eliminate the macro function. There is no need to use one.
  2. It is always a good idea to close the input data set when you are finished reading from it.
  3. Why limit yourself to 3 x 3 matrices? Use the NCOL function to assign a variable, p, that contains the number of variables in the data.
  4. Use the NROW function to determine the number of observations in the data set, rather than hard-code this value.
  5. Use the index creation operator (:) to create the sequence of rows that contain the ith p x p matrix.
  6. Subscript the data to extract the relevant rows and all of the columns. There is no need to create and store 1,249 matrices, each with a unique name. It is more efficient to reuse the same matrix name (s) for each computation.
There are often good reasons to write a macro function. However, the programming statements in the SAS/IML language often make macro functions unnecessary in PROC IML.