Statistical Programming

March 25, 2011

I recently posted an article about representing uncertainty in rankings on the blog of the ASA Section for Statistical Programmers and Analysts (SSPA). The posting discusses the importance of including confidence intervals or other indicators of uncertainty when you display rankings.

Today's article complements the SSPA post by showing how to construct the confidence intervals in SAS. This article implements the techniques in Marshall and Spiegelhalter (1998), Brit. Med. J., which is a very well written and readable paper. (If you prefer sports to medicine, Spiegelhalter wrote a similar article on ranking teams in the English Premier Football League.)

To illustrate the ideas, consider the problem of ranking US airlines on the basis of their average on-time performance. The data for this example are described in the Statistical Computing and Graphics Newsletter (Dec. 2009) and in my 2010 SAS Global Forum paper. The data are available for download.

The data are in a SAS data set named MVTS that consists of 365 observations and 21 variables. The Date variable records the day in 2007 for which the data were recorded. The other 20 variables contain the mean delays, in minutes, experienced by an airline carrier for that day’s flights. The airline variables in the data are named by using two-digit airline carrier codes. For example, AA is the carrier code for American Airlines.

The appendix of Marshall and Spiegelhalter's paper describes how to compute rankings with confidence intervals:

  1. Compute the mean delay for each airline in 2007.
  2. Compute the standard error for each mean.
  3. Use a simulation to compute the confidence intervals for the ranking.

Ordering Airlines by Mean Delays

You can use PROC MEANS or PROC IML to compute the means and standard errors for each carrier. Because simulation will be used in Step 3 to compute confidence intervals for the rankings, PROC IML is the natural choice for this analysis. The following statements compute the mean delay for all airlines (sometimes called the "grand mean") and the mean delays for each airline in 2007. The airlines are then sorted by their mean delays, as described in a previous article about ordering the columns of a matrix.

proc iml;
/** read mean delay for each day in 2007
    for each airline carrier **/
Carrier = {AA AQ AS B6 CO DL EV F9 FL HA
           MQ NW OH OO PE UA US WN XE YV};
use dir.mvts; 
read all var Carrier into x;
close dir.mvts;

grandMean = x[:];
meanDelay = x[:,];
call sortndx(idx, MeanDelay`, 1);

/** reorder columns **/
Carrier = Carrier[,idx];
x = x[,idx];
meanDelay = meanDelay[,idx];
print grandMean, carrier;


               grandMean

               9.6459671


                Carrier

AQ HA WN DL F9 FL PE OO AS XE CO YV ... AA EV

You can use these computations to reorder the variables in the data set (this step is not shown here, but a sketch follows the plot) and use PROC SGPLOT to plot the mean and 95% confidence interval for each airline carrier:

/** convert data from wide to long format**/
proc transpose data=mvts 
   out=Delay(rename=(col1=Delay)) name=Carrier;
by date;
run;

title "Comparison of Airline Daily Delays (2007)";
proc sgplot data=delay noautolegend;
  dot carrier / response=delay stat=mean limitstat=clm;
  refline 9.646 / axis=x lineattrs=(pattern=dash);
  xaxis label="Mean Delay with 95% Confidence Intervals";
  yaxis discreteorder=data label="Carrier"; 
run;
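
For completeness, here is one way that the omitted reordering step might be done. This is my own sketch, not code from the original analysis: the SYMPUTX call belongs at the end of the earlier PROC IML step, while the sorted Carrier vector is still available.

/** sketch: at the end of the PROC IML step, write the sorted
    carrier names to a macro variable **/
call symputx("sortedNames", rowcat(Carrier + " "));
quit;

/** then reorder the variables with a RETAIN statement **/
data mvts;
   retain Date &sortedNames;  /** RETAIN fixes the variable order **/
   set mvts;
run;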

Notice that two airlines (Aloha Airlines (AQ) and Hawaiian Airlines (HA)) have much better on-time performance than the others. In fact, their average daily delay is negative, which means that they typically land a minute or two ahead of schedule!

Many analysts would end the analysis here by assigning ranks based on the mean statistics. But as I point out on my SSPA blog, it is important for rankings to reflect the uncertainty in the mean estimates.

Notice that for most airlines, the confidence interval overlaps with the confidence intervals of one or more other airlines. The "true mean" for each airline is unknown, but probably lies within the confidence intervals.

So, for example, Frontier Airlines (F9) is nominally ranked #5. But you can see from the overlapping confidence intervals that there is a chance that the true mean for Frontier Airlines is lower than the true mean of Delta Airlines (DL), which would make Frontier ranked #4. However, there is also a chance that the true mean could be greater than the true mean of Continental Airlines (CO), which would make Frontier ranked #11. How can you rank the airlines so that this uncertainty is reflected in the ranking?

The answer is in Step 3 of Marshall and Spiegelhalter's approach: use a simulation to compute the confidence intervals for the ranking. This step will appear in a second article, which I will post next week.

March 23, 2011

I recently blogged about how to eliminate a macro loop in favor of using SAS/IML language statements. The purpose of the program was to extract N 3x3 matrices from a big 3Nx3 matrix. The main portion of my PROC IML program looked something like this:

proc iml;
...
do i=0 to N-1;
   rows = (3*i+1):(3*i+3);    /** find certain rows **/
   s = X[rows,];              /** extract those rows **/
   /** do something with s **/
end;

A reader correctly pointed out that my version of the program does not enable the programmer to access multiple matrices simultaneously. For example, after looping through all the rows of the data, the programmer cannot compute the sum of several of the 3x3 matrices. In my version, each 3x3 matrix is overwritten by the next.

However, you can address this issue by using the VALSET and VALUE functions. Never heard of them? Not very many people have, but they provide a mechanism to indirectly assign a value to a matrix name, or to indirectly get the value of a matrix name.

Indirect Assignment of Values to Matrices

A simple example illustrates the usage of the VALSET function. Suppose you want to assign values to a matrix named x1. The usual way is to write x1={1 2 3}, but you can use the following statements to achieve the same result:

/** indirect assignment: create a matrix
    name and assign a value to it **/
VarName = "x1";
call valset(VarName, {1 2 3});
print x1;

X1

1  2  3

So what use is that? Well, 99% of the time it's of no use at all! However, this kind of "indirect assignment" is useful in the situation where you need to assign many matrices, and the assignment is most easily done in a loop.

For example, suppose you want to assign the matrix x1 to be a 2x2 matrix that consists entirely of 1s, the matrix x2 to be a 2x2 matrix that consists entirely of 2s, and so forth, up to x1000. Furthermore, suppose that you want all of these matrices available for future computations.

You have three options. You can write 1,000 literal assignment statements, you can write a macro %DO loop, or you can use the VALSET function. As I described in a previous blog [LINK], the following statements create the matrix names (x1,...,x1000) by using the string concatenation operator (+) in conjunction with the CHAR function (which converts a number into a character string) and the STRIP function. The VALSET function then assigns a value to each matrix:

do i = 1 to 1000;
   VarName = "x" + strip(char(i,4)); /** x1, x2,... **/
   v = repeat(i, 2, 2); /** create 2x2 matrix **/
   call valset(VarName, v);
end;

You can use the SHOW NAMES statement to convince yourself that the matrices x1,...,x1000 are assigned. You can also compute with any or all of these matrices:

t = x1 + x13 + x7 + x29 + x450;
print t;

t

500  500
500  500

Indirect Retrieval of Values from Matrices

The VALSET function sets a matrix to a value. The VALUE function retrieves the value. For example, the following statement gets the value in the x1 matrix:

v = value("x1");

This syntax is rarely necessary, but it can be useful to retrieve values within a loop. For example, to compute the sum x1 + x13 + x7 + x29 + x450, you can define a vector that contains the suffixes, and then create the name of each matrix and use the VALUE function to retrieve its value, as shown in the following statements:

/** indirect reference: create a matrix
    name and retrieve its value **/
k = {1 13 7 29 450};
t = j(2, 2, 0);          /** initialize to zero matrix **/
do i = 1 to ncol(k);
   VarName = "x" + strip(char(k[i],4)); /** x1, x13,... **/
   t = t + value(VarName);
end;

In conclusion, the VALSET and VALUE functions in SAS/IML are little-known functions that can be used to assign and retrieve values from matrices. They can sometimes be used to replace the functionality of a %DO loop. They can also be used for other purposes, such as to produce the functionality of multi-dimensional arrays in PROC IML.
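
To illustrate that last point, here is a sketch of my own (not from the original post) that emulates a two-dimensional array of matrices by encoding the indices into each matrix name:

proc iml;
/** sketch: emulate an array A[r][c] of 2x2 matrices by
    encoding the indices r and c in the matrix names **/
do r = 1 to 2;
   do c = 1 to 3;
      Name = "A_" + strip(char(r,2)) + "_" + strip(char(c,2));
      call valset(Name, j(2, 2, 10*r + c));
   end;
end;

slice = value("A_2_3"); /** retrieve the (2,3) "element" **/
print slice;            /** a 2x2 matrix of 23s **/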

March 2, 2011

I don't use the SAS macro language very often. Because the SAS/IML language has statements for looping and evaluating expressions, I rarely write a macro function as part of a SAS/IML program. Oh, sure, I use the %LET statement to define global constants, but I seldom use the %DO and %EVALF macro statements. [Edit: 2Mar2011. I meant to say the %EVAL function.]

I was therefore momentarily baffled when I tried to decipher the following SAS/IML statements, which include a macro loop:

%macro ExtractSubmatrices;
proc iml; 
use ManyMatrices;  /** 3,747 rows of data **/
read all into X;

%do i=0 %to (3747/3)-1;
   %let j = %eval((&i*3) + 1);
   %let k = %eval((&i*3) + 2);
   %let l = %eval((&i*3) + 3);

   s&i. = X[{&j.,&k.,&l.},];
   /** do something with s&i. **/
%end;
quit;
%mend;
%ExtractSubmatrices;

The program looks complicated because of the macro syntax, but it is actually fairly simple. The program reads a data set, ManyMatrices, which contains 3,747 rows and 3 variables. The first three rows represent a 3 x 3 matrix, the next three rows represent a different 3 x 3 matrix, and so on. (There are a total of 3,747 / 3 = 1,249 matrices in the data set.) For each 3 x 3 matrix, the programmer wants to compute a quantity (not shown) based on the matrix.

The program can be improved by making the following changes:

proc iml;                     /** 1 **/
use ManyMatrices;
   read all into X; 
close ManyMatrices;           /** 2 **/

p = ncol(X);                  /** 3 **/
do i=0 to nrow(X)/p-1;        /** 4 **/
   rows = (p*i+1):(p*i+p);    /** 5 **/
   s = X[rows,];              /** 6 **/
   /** do something with s **/
end;
quit;

The numbered comments correspond to the following improvements:

  1. Eliminate the macro function. There is no need to use one.
  2. It is always a good idea to close the input data set when you are finished reading from it.
  3. Why limit yourself to 3 x 3 matrices? Use the NCOL function to assign a variable, p, that contains the number of variables in the data.
  4. Use the NROW function to determine the number of observations in the data set, rather than hard-code this value.
  5. Use the index creation operator (:) to create the sequence of rows that contain the ith p x p matrix.
  6. Subscript the data to extract the relevant rows and all of the columns. There is no need to create and store 1,249 matrices, each with a unique name. It is more efficient to reuse the same matrix name (s) for each computation.

There are often good reasons to write a macro function. However, the programming statements in the SAS/IML language often make macro functions unnecessary in PROC IML.

February 9, 2011

The other day, someone asked me how to compute a matrix of pairwise differences for a vector of values. The person asking the question was using SQL to do the computation for 2,000 data points, and it was taking many hours to compute the pairwise differences. He asked if SAS/IML could solve the problem faster.

The result? My program ran in a fraction of a second.

Now, to be fair, I'm sure that the SQL code is not as efficient as it could be. It has been my experience that a well-written SQL or DATA step program usually performs as well as (or better than) comparable SAS/IML code. I'm not an SQL programmer, so I couldn't offer advice on how to improve his SQL code. But since I solved the problem by using PROC IML, I'll discuss the solution, along with sharing some thoughts about efficient SAS/IML programming.

An Example Problem: Comparing Gas Prices at Different Stations

The following DATA step creates a variable that contains gas prices (in cents per gallon) in Raleigh, NC, on a certain day for seven gas stations:

/** data from www.gaspricewatch.com, 05FEB2011 **/
data GasPrices;
length Station $6;
input Price DOLLAR5.2 Station $;
Price = Price * 100; /** convert to cents **/
datalines;
$2.90 Citgo
$2.91 Exxon1
$2.99 Shell1
$3.00 Shell2
$3.03 BP    
$3.05 Exxon2
$3.09 Texaco
;
run;

Suppose you are interested in the differences between pairs of prices. That is, you want to create the matrix, D, whose ijth entry is the difference between the price at the ith station and the price at the jth station. Clearly, D is an anti-symmetric matrix with zeros on the diagonal. (An anti-symmetric matrix is also called skew-symmetric.) You can compute D with the following SAS/IML statements:

proc iml;
use GasPrices;
read all var {Station Price};
close GasPrices;

/** compute matrix of differences **/
/** 1. use matrices **/
start FullMatDiff(x); /** x is a column vector **/
   n = nrow(x);   
   m = shape(x, n, n); /** duplicate data **/
   return( m` - m );
finish;

D = FullMatDiff(Price);
print D[c=Station r=Station];

D

         Citgo  Exxon1  Shell1  Shell2    BP  Exxon2  Texaco
Citgo        0      -1      -9     -10   -13     -15     -19
Exxon1       1       0      -8      -9   -12     -14     -18
Shell1       9       8       0      -1    -4      -6     -10
Shell2      10       9       1       0    -3      -5      -9
BP          13      12       4       3     0      -2      -6
Exxon2      15      14       6       5     2       0      -4
Texaco      19      18      10       9     6       4       0

Because the data are ordered by price, the pairwise differences increase as you scan down any column of D.

Analyzing Efficiency

In the SAS/IML language, as with many programming languages, there is more than one way to compute the same quantity. Consequently, I am always asking myself whether my first solution is the most efficient solution.

From a computational point of view, the FullMatDiff module is efficient. Notice that it uses the SHAPE function to create a matrix whose ith column contains n copies of the ith price. If you are not familiar with the SHAPE function, you might unnecessarily write a loop.
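
To see what SHAPE is doing, here is a tiny illustration (mine, not from the original post):

/** illustrate how FullMatDiff uses SHAPE: each row of m is a
    copy of x`, so the ith column of m contains copies of x[i] **/
x = {1, 2, 3};
n = nrow(x);
m = shape(x, n, n);
D = m` - m;         /** D[i,j] = x[i] - x[j] **/
print m, D;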

The main computation is a single matrix expression, m`-m, which computes all n² pairs of differences.

However, when I look at the module critically, I notice that it requires more memory than is necessary. The US Census Bureau estimates that there were 117,000 gas stations in the US in 2008, so I can imagine running this module on many tens of thousands of prices. This could be a problem because each matrix requires memory for n² entries and there are three matrices that need to be in RAM simultaneously: m, the transpose of m, and the matrix of differences.

It is not necessary to explicitly compute either m or the transpose. If I'm willing to write a loop, I can compute the pairwise differences while allocating only a single n x n matrix:

/** 2. compute each row directly **/
start FullRowDiff(x);
   n = nrow(x);   
   diff = j(n, n, .);
   xt = T(x); 
   do i = 1 to n;
      diff[i,] = xt[i] - xt;
   end;
   return( diff );
finish;

Another thought that occurred to me is that the matrix of differences is anti-symmetric, so it is not necessary to compute all of the entries but only the "n choose 2" entries above the diagonal (or, equivalently, below the diagonal). The SAS/IML module for computing the upper-triangular entries is similar to the previous module:

/** 3. compute only upper triangular differences 
       [COMB(n,2) elements] **/
start TriDiff(x);
   n = nrow(x);   
   diff = j(n, n, .);
   xt = T(x); 
   do i = 1 to n-1;
      cols = i+1:n;
      diff[i,cols] = x[i] - xt[,cols];
   end;
   return( diff );
finish;

Comparing the Performance of the Modules

I wondered whether one algorithm would run faster than the others, so I timed the performance on data with a wide range of sizes. It turns out that the time required to compute the pairwise differences is about the same for the three algorithms, regardless of the size of data. Furthermore, even for 10,000 prices, each algorithm executes in seconds. For more data than that, my computer cannot allocate sufficient memory.
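
If you want to try the timing experiment yourself, a minimal harness might look like the following sketch (the module names are the ones defined above; the exact timings depend on your machine):

/** time one module on n random uniform prices **/
n = 10000;
x = j(n, 1);
call randgen(x, "uniform");

t0 = time();
D = FullMatDiff(x);
elapsed = time() - t0;  /** elapsed seconds **/
print elapsed;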


I know that some of my readers are SQL programmers, so feel free to post an SQL solution. Also, let me know how long your SQL program requires to compute the pairwise differences of 2,000 and 10,000 random uniform values. Can an SQL solution handle 20,000 prices? How long does it take?

February 7, 2011

When you pass a matrix as a parameter (argument) to a SAS/IML module, the SAS/IML language does not create a copy of the matrix. Copying the data, an approach known as "calling by value," is inefficient. It is well known that languages that implement call-by-value semantics suffer performance penalties.

In the SAS/IML language, matrices are passed to modules "by reference." This has two important implications:

  • You can pass large matrices into SAS/IML modules without hurting performance.
  • If you change or redefine an argument inside the module, the corresponding matrix also changes outside the module.

The second point means that you should be careful: when you modify the value or shape of matrices in a SAS/IML module, the change percolates up to the calling environment.

An Example of Call-By-Reference

A simple example can make these ideas clearer. Suppose that you write a module that sorts an input argument as part of a computation, such as computing the empirical cumulative distribution function:

proc iml;
start EmpiricalCDF( x );
   call sort(x,1);
   /** compute with the sorted data **/
   return( cusum(x) );
finish;

When you call that module from a SAS/IML program, it actually sorts the matrix that is sent in:

y = {3,2,1,1,4,2};
cdf = EmpiricalCDF( y );
print y cdf;

y  cdf

1    1
1    2
2    4
2    6
3    9
4   13

The Advantages of Call-By-Reference Semantics

An advantage of call-by-reference semantics is that your program doesn't waste time and memory copying data from the global environment into the local variable. Instead, the local variable inside the module is just an alias to the same matrix that is passed into the module. (C programmers are familiar with the concept of a pointer. Same idea.)

Call-by-reference semantics also enables you to return multiple results from a single module call. For example, if you know that you need both the mean and the standard deviation of data, you might decide to write a module that computes both quantities at the same time:

start GetDescriptive(Mean, StdDev, /** output arguments **/
                     x); /** input argument **/
   Mean = x[:];
   StdDev = sqrt( ssq(x-Mean)/(nrow(x)-1) );
finish;

run GetDescriptive(m, s, y);
print m s;

m          s

2.1666667  1.1690452

Notice that you don't have to allocate m or s before you call the module.

A Tip for Using Call-By-Reference Semantics

At the time that you write the EmpiricalCDF module, it might not matter to you that the module changes the order of the data. Maybe the rest of the program never refers to the data again. Or maybe it only computes statistics that do not depend on the order of the data (such as the mean and standard deviation).

But what happens if you share the module with a colleague?

Murphy's Law predicts that your colleague will get burned.

Your colleague will insert your module call into a working program, and suddenly some computation that used to be correct will now give a wrong answer. He will spend hours debugging his program and will eventually discover that the first value in his y matrix no longer corresponds to the first observation in the data set. Your colleague might be tempted to angrily march into your office and shout, "your $!@#$% module changed the #!%$@ order of one of my variables!"

In order to maintain a healthy working relationship with your colleagues, a good programming practice is to observe the following convention:

Programming Tip: Do not write a module that modifies an input argument.

When I write a module that needs to change the values or shape of an input argument, I make an explicit copy of it and then modify the copy. For example, I might rewrite the EmpiricalCDF module as follows:

start EmpiricalCDF( _x );
   x = _x; /** copy input argument **/
   call sort(x,1);
   return( cusum(x) );
finish;

I discuss these issues and give additional examples in my book, Statistical Programming with SAS/IML Software.

February 4, 2011

Last week I generated two kinds of random point patterns: one from the uniform distribution on a two-dimensional rectangle, the other by jittering a regular grid by a small amount.

My show choir director liked the second method (jittering) better because of the way it looks on stage: there are no gaps and no two performers are too close to each other. The statistician in me wants to know what statistical properties make the jittered pattern more desirable than the other. Said differently, can I do any of the following?

  • Compute a statistical measure that characterizes each kind of point pattern?
  • Use statistics to predict whether a given pattern would be acceptable to her?

As part of my plan to learn new statistics, I've been reading Statistical Analysis and Modelling of Spatial Point Patterns by J. Illian, A. Penttinen, H. Stoyan, and D. Stoyan. The spatial statistics in this blog post come from their book. This post has four parts:

  1. I describe a simple test that indicates whether a point pattern is likely to have been generated from a random uniform distribution.
  2. I implement the test in the SAS/IML language.
  3. I simulate data to show what the test means and how it works.
  4. I return to the above questions and show that the answer to both questions is "yes."

Complete Spatial Randomness --- A Test

Not all random configurations are created equal. A jittered pattern is random, but a pattern that is generated from a uniform distribution on a 2D rectangle exhibits complete spatial randomness (CSR).

There are several ways to test for CSR. Probably the easiest method is called the method of quadrats. The method consists of the following steps:

  1. Divide the rectangular domain into k disjoint cells (bins) of equal area and count the number of points, c_i, that fall into each cell. You can use the Bin2D module for this step.
  2. If the points were generated from a distribution with a uniform density, you would expect e = N/k points in each bin, where N is the total number of points. Subtract that value from the actual bin count to obtain the residual count for each bin.
  3. Just as for two-way frequency tables, you can compute a normalized sum of the squared residuals and use that measure to determine whether the point pattern was likely to have come from a distribution with a uniform density. The computation is just the usual Pearson chi-square statistic: Σ (c_i - e)² / e.

The theory says that if a point pattern is generated from a uniform distribution, the Pearson statistic has an approximate chi-square distribution with k-1 degrees of freedom, provided that k > 6 and e > 1.

Ugh! It takes longer to explain the algorithm in words than it does to implement it in the SAS/IML language! It's time for a concrete example.

SAS/IML Code to Test for Complete Spatial Randomness

The following SAS/IML statements generate 20 points from a random uniform distribution, define a 3x3 grid on the unit square, and call the Bin2D module to count how many points are in each cell of the grid:

proc iml;
N = 20;
pUnif = j(N,2);
call randseed(1234);
call randgen(pUnif, "uniform"); /** generate random uniform **/

/** Define regular grid: 0, 0.33, 0.67, 1 **/
XMin = 0; XMax = 1; Nx = 3; dx = 1/Nx; 
YMin = 0; YMax = 1; Ny = 3; dy = 1/Ny;
XDiv = do(XMin, XMax, dx); 
YDiv = do(YMin, YMax, dy);

/** count the points in each cell **/
load module=Bin2D;
countUnif = Bin2D(pUnif, XDiv, YDiv); 

For this example, there are 20 points and 9 cells, so you should expect 2.22 points in each cell. The following statements compute the residuals (observed counts minus the expected value) for each cell and compute the Pearson statistic, which in spatial statistics is also known as the index of dispersion:

c = countUnif;
NBins = (ncol(XDiv)-1) * (ncol(YDiv)-1);
e = N/NBins;  /** expected value **/
resid = (c - e);
d = sum( resid##2 ) / e; /** index of dispersion **/
print resid;

resid

 1.7777778  -1.222222  0.7777778
 0.7777778  -2.222222  -1.222222
 -0.222222  -1.222222  2.7777778

When the points come from a uniform random distribution, d has a 90% chance of being within the interval [L, R], where L and R are the following quantiles of the chi-square distribution:

L = quantile("chisquare", 0.05, NBins-1);
R = quantile("chisquare", 0.95, NBins-1);
print d L R;

d    L          R

9.7  2.7326368  15.507313

What Does It Mean? Time to Simulate

My brain is starting to hurt. What does this all mean in practice?

What it means is that I can do the following experiment: simulate 1,000 point patterns from the random uniform distribution and compute the corresponding 1,000 Pearson statistics. If I plot a histogram of the 1,000 Pearson statistics, I should see that the histogram looks similar to a chi-square distribution with 8 degrees of freedom and that roughly 100 (10%) of the Pearson statistics will be outside of the interval [L, R].
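
The experiment is only a few lines of SAS/IML code. Here is a sketch of my reconstruction (it assumes that N, e, XDiv, YDiv, L, R, and the Bin2D module from the previous section are still defined):

NSim = 1000;
d = j(NSim, 1);
p = j(N, 2);
do sim = 1 to NSim;
   call randgen(p, "uniform");      /** new point pattern **/
   c = Bin2D(p, XDiv, YDiv);        /** bin counts **/
   d[sim] = sum( (c-e)##2 ) / e;    /** Pearson statistic **/
end;

NumOutside = sum(d < L | d > R);    /** expect roughly 100 **/
print NumOutside;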

The adjacent image shows the results of the experiment, which I ran in SAS/IML Studio. The curve is the density of a chi-square distribution with 8 degrees of freedom. The vertical lines are at the values L and R.

For this experiment, 71 point patterns have Pearson statistics that are outside of the interval. This is not quite 10%, but it is close. Repeating the experiment with different numbers of points (for example, N=30 or 50) and different grids (for example, a 2x3 or 4x3) results in similar histograms that also have roughly 10% of their values outside of the corresponding confidence interval.

The little scatter plots underneath give an example of a point pattern with the indicated Pearson statistic. The middle scatter plot (d=9.7) is the plot of the random points generated in the previous section. It is a "typical" point pattern from the random uniform distribution; most of the point patterns look like this one. You can see that there is a lot of variation in the cell counts for this configuration: one of the bins contains zero points, another contains five.

The scatter plot on the left (d=0.7) is not a typical point pattern from the random uniform distribution. All of the cells in this pattern have two or three points. It is a little counterintuitive, but it is rare that a truly random uniform point pattern has points that are evenly spaced in the unit square.

The scatter plot on the right (d=27.7) shows the other extreme: more variation than is expected. Five of the cells in this pattern have zero or one points. A single cell contains almost half of the points!

What about the Jittered Patterns?

The previous graph implies that the arrangements of points that are randomly generated from a uniform distribution are rarely evenly spaced. The usual case exhibits gaps and clusters.

If you instead generate 1,000 jittered patterns, as in my previous post, the distribution of the Pearson statistics looks like the following image. The distribution of Pearson's statistic does not follow a chi-square distribution for jittered patterns. Most of the values of Pearson's statistic are small, which makes sense because the statistic measures the departure from the average density, and the configurations are permutations of evenly spaced points.
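
For the record, the jittered experiment differs from the previous sketch only in how each pattern is generated. Assuming the GridJitter module from my earlier post is loaded, the body of the loop becomes something like this:

do sim = 1 to NSim;
   p = GridJitter(0, 1, 5,  0, 1, 4,  0.05); /** 20 jittered points in the unit square **/
   c = Bin2D(p, XDiv, YDiv);
   d[sim] = sum( (c-e)##2 ) / e;    /** Pearson statistic **/
end;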

It's time to revisit the questions posed at the beginning of this post.

Can I compute a statistical measure that characterizes each kind of point pattern? Yes. The Pearson statistic tends to be small for jittered point patterns and tends to be larger for random uniform configurations. However, if the Pearson statistic for an example pattern is, say, between 3 and 6, I can't predict whether the points were generated by using the jittering algorithm or whether they were generated from a random uniform distribution.

Can I use statistics to predict whether a given pattern might be acceptable to my director? Yes. Regardless of how the configuration was generated, a pattern that has a small value of the Pearson statistic (less than 3 for this example) does not exhibit gaps or clusters, and therefore will be acceptable to my director.

Although statistics can be used to check the appropriateness of certain blocking configurations, I don’t think that my director will be replaced by a SAS/IML program anytime soon: she also plays the piano!

February 2, 2011

One of my New Year's resolutions is to learn a new area of statistics. I'm off to a good start, because I recently investigated an issue which started me thinking about spatial statistics—a branch of statistics that I have never formally studied.

During the investigation, I asked myself: Given a configuration of points in a rectangle, how can I compute a statistic that indicates whether the configuration is uniformly spread out in the region?

One way to investigate this question is to look at the density of points in the configuration. Are the points spread out or clumped together?

Binning the Points

In the one-dimensional case, you can use a histogram to visually examine the density of univariate data. If you see that the heights of the bars in each bin are roughly the same height, then you have visual evidence that the data are uniformly spaced.

Of course, you don't really need the histogram: the counts in each bin suffice. You can divide the range of the data into N equally spaced subintervals and write SAS/IML statements that count the number of points that fall into each bin.
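
For example, a one-dimensional count might be computed as in the following sketch (inside PROC IML; the data and cutpoints are hypothetical):

/** a 1-D sketch: count points in equally spaced bins **/
x = {0.1, 0.3, 0.35, 0.8};        /** hypothetical data **/
cutpts = do(0, 1, 0.25);          /** four equal bins on [0,1] **/
count = j(1, ncol(cutpts)-1, 0);
do i = 1 to ncol(cutpts)-1;
   count[i] = sum(x >= cutpts[i] & x < cutpts[i+1]);
end;
print count;  /** 1 2 0 1 **/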

It is only slightly harder to compute the number of points that fall into bins of a two-dimensional histogram or, equivalently, into cells of a two-dimensional grid:

  1. For each row in the grid...
  2. Extract the points whose vertical coordinates indicate that they are in this row. (Use the LOC function for this step!)
  3. For these extracted points, apply the one-dimensional algorithm: count the number of points in each horizontal subdivision (column in the grid).

The following SAS/IML module uses this algorithm to count the number of points in each cell. Notice that the algorithm loops over the subdivisions of the grid, rather than over the observations.

proc iml;
/** Count the number of points in each cell of a 2D grid.
    Input: p    - a kx2 matrix of 2D points
           XDiv - a subdivision of the X coordinate direction
           YDiv - a subdivision of the Y coordinate direction
    Output: The number of points in p that are contained in 
            each cell of the grid defined by XDiv and YDiv. **/
start Bin2D(p, XDiv, YDiv);
   x = p[,1];       y = p[,2];
   Nx = ncol(XDiv); Ny = ncol(YDiv);
   
   /** Loop over the cells. Do not loop over observations! **/
   count = j(Ny-1, Nx-1, 0); /** initialize counts for each cell **/
   do i = 1 to Ny-1; /** for each row in grid **/
      idx = loc((y >= YDiv[i]) & (y < YDiv[i+1]));
      if ncol(idx) > 0 then do;
         t = x[idx]; /** extract point in this row (if any) **/
         do j = 1 to Nx-1; /** count points in each column **/
            count[i,j] = sum((t >= XDiv[j]) & (t < XDiv[j+1]));
         end;
      end;
   end;
   return( count );
finish;

Testing the Binning Module

A convenient way to test the module is to generate 100 random uniform points in a rectangle and print out the counts. The following example counts the number of points in each of 20 bins. There are five bins in each horizontal row and four bins in each vertical column.

call randseed(1234);
p = j(100,2);
call randgen(p, "uniform"); /** 100 random uniform points **/

/** define bins: 5 columns and 4 rows on [0,1]x[0,1] **/
XMin = 0; XMax = 1; dx = 1/5;
YMin = 0; YMax = 1; dy = 1/4;
XDiv = do(XMin, XMax, dx);
YDiv = do(YMin, YMax, dy);

count = Bin2D(p, XDiv, YDiv);

A visualization of the random points and the grid is shown in the scatter plot. From the graph, you can quickly estimate that each bin has between 1 and 10 points. The exact counts are contained in the count matrix.

Notice that the Y axis of the scatter plot has large values of Y at the top of the image and small values at the bottom of the image. In order to compare the values of the count matrix with the scatter plot, it is helpful to reverse the order of the rows in count:


/** reverse rows to conform to scatter plot **/
count = count[nrow(count):1,];
c = char(XDiv, 3); /** form column labels **/
r = char(YDiv[ncol(YDiv)-1:1], 4); /** row labels (reversed) **/
print count[colname=c rowname=r];

count

        0    0.2    0.4    0.6    0.8
0.75    2      3      6      7      7
0.5     5      9      7      4      1
0.25    9      2      7      2      7
0       7      5      2      2      6

You can use these bin counts to investigate the density of the point pattern. I'll present that idea on Friday.

January 28, 2011

I sing in the SAS-sponsored VocalMotion show choir. It's like an adult version of Glee, except we have more pregnancies and fewer slushie attacks.

For many musical numbers, the choreographer arranges the 20 performers on stage in an orderly manner, such as four rows of five singers. But every once in a while the director wants to create a disorganized look for a portion of a song (for example, a free-form '60's segment). She tells the performers to "go to random positions."

The choir always has problems achieving the look that the director wants. We end up in positions that are equally spaced from one another, but to the audience it often looks like we tried to arrange ourselves in rows, but failed miserably!

Since I know what a random point pattern looks like, I try to help. I whisper "clump more" and "vary the sizes of the gaps between performers." I tell my fellow performers, "random uniform does not imply uniformly spaced," but mostly they just ignore me or roll their eyes.

This year I took it upon myself to solve the problem, and I discovered that it led to some interesting statistics.

The Experiment

My first attempt to help the director was to generate many examples of random point patterns. Our stage is approximated by the rectangle S = [0, 2] x [0, 1]. It is easy to generate 20 points that are uniformly distributed in S. My idea was that the director would have 10–12 of these patterns. When she wants a random configuration, she can hold up a pattern and say, "go to this configuration." I sent the director the patterns on the adjacent image.

I was quite proud of myself. But guess what?

She didn't like them.


Random Uniform versus Evenly Spaced

The director has been telling us to go to random positions, so how can she not like the random configurations that I sent her? She explained that there are two characteristics that she doesn't like:

  1. Large areas of the stage are devoid of performers, which I translate to mean "the point patterns have too many gaps."
  2. Some of the performers are too close to one another, which I translate to mean "the point patterns have too many clusters."

From her comments, I conclude that I have been wrong all these years. She does not actually want the performers in random uniform positions on stage! The characteristics that she mentions are intrinsic to random uniform point patterns, so if she doesn't want those characteristics, she doesn't want uniform random.

So what does she want?

A Perturbation Algorithm

It was time for Plan B. I talked to the director again and heard her say that she wants the performers in a configuration that is close to a regular grid, but not too close. It has to "look random." To me, this suggests a perturbation approach to the problem: start by arranging the performers on a regular grid, but then perturb that arrangement by a small amount.

I thought about how to physically do this on stage and came up with the following algorithm:

  1. Arrange the performers on a regular grid, such as four rows of five performers.
  2. Ask the performers to turn in place for 5–10 seconds. Each performer chooses his or her own speed and direction. When the director calls "stop turning," each performer should be pointing in a random direction.
  3. Each performer takes a step forward in whatever direction he or she is facing.

This algorithm results in a perturbation of a regular grid in which each point is perturbed (jittered) by a small amount in an arbitrary direction. But does this perturbation give the look that my director wants? There's only one way to find out: implement the algorithm and look at some pictures.

Random Jittering of a Regular Grid

Forming a regular grid with SAS/IML software is no problem: I can use the Define2DGrid module, which I have previously described. It is also easy to use the RANDSEED and RANDGEN subroutines to generate angles that are distributed uniformly in [0, 2 pi]. The following SAS/IML module perturbs each position on a regular grid:

proc iml;
/** Module to define ordered pairs that are a random jittering
    of a uniform grid of points.
    The first six input parameters are the same as for Define2DGrid.
    The last parameter, delta, is the size of the jitter.

    Output: Ordered pairs are returned in a matrix with
         (Nx x Ny) rows and two columns.
**/
start GridJitter(XMin, XMax, Nx, YMin, YMax, Ny, delta);
   /** initial placement (grid) **/
   Lx = XMax - XMin; Ly = YMax - YMin;
   dx = Lx / Nx;     dy = Ly / Ny;
   load module=Define2DGrid;
   xy = Define2DGrid(dx/2, Lx-dx/2,Nx, dy/2, Ly-dy/2,Ny);
   x0 = xy[,1];  y0 = xy[,2];

   /** generate random directions in [0, 2 pi];
       RANDGEN uses the seed set by RANDSEED **/
   NPts = Nx * Ny;
   pi = constant('PI');
   angle = j(NPts, 1);
   call randgen(angle, "uniform");
   angle = 2 * pi * angle;

   /** each person takes on step in some direction **/
   x = x0 + delta * cos(angle);
   y = y0 + delta * sin(angle);

   /** don't let anyone fall off the stage! **/
   x = choose(x<0, 0, choose(x>Lx, Lx, x));
   y = choose(y<0, 0, choose(y>Ly, Ly, y));
   return ( x || y );
finish;

call randseed(654321);
r = 0.1; /** size of a small step **/
/** jitter 4x5 grid on the "stage" [0,2]x[0,1] **/
Jitter = GridJitter(0,2,5, 0,1,4, r); 

The results of several runs of this algorithm are shown in the following image. I showed the point patterns to the director. "Yes!" she exclaimed. "These are the kinds of configurations that I'm looking for."


Conclusions and Next Steps

One conclusion is obvious. In the initial stages of this project, I violated the Fundamental Rule of Consulting, which is "listen to a client in order to truly understand the problem." I assumed that my notion of "random placement" was the same as my director's, but it wasn't.

I am now excited to begin the next step of this process: a statistical analysis of point patterns.

Regular readers of my blog will recall that one of my New Year's resolutions is to learn something about an area of statistics that is new to me. I've already read a few articles and book chapters that will undoubtedly spark future blog posts about spatial point patterns.

For example, my director emphasized two features of point patterns: clustering and distance between adjacent points. In future blog posts, I will analyze statistics that are associated with these characteristics.

But—shhh!—don't tell the choir that I'm computing these statistics. I'd prefer not to get doused with a slushie.

January 26, 2011

A histogram displays the number of points that fall into a specified set of bins. This blog post shows how to efficiently compute a SAS/IML vector that contains those counts.

I stress the word "efficiently" because, as is often the case, a SAS/IML programmer has a variety of ways to solve this problem. Whereas a DATA step programmer might sort the data and then loop over all observations to count how many observations are contained in each bin, this algorithm is not the best way to solve the problem in the SAS/IML language.

To visualize the problem, suppose that you are interested in the distribution of the manufacturer's suggested retail price (MSRP) of vehicles in the Sashelp.Cars data set. You can call PROC UNIVARIATE to obtain the following histogram:

proc univariate data=sashelp.cars;
var MSRP;
histogram MSRP / vscale=count barlabel=count
                endpoints=0 to 200000 by 20000;
run;


The histogram has a bin width of $20,000 and displays the number of observations that fall into each bin. The bins are the intervals [0, 20000), [20000, 40000), and so on to [180000, 200000).

Constructing Bins

Suppose that you want to use the SAS/IML language to count the number of observations in each bin. The first step is to specify the endpoints of the bins. For these data (prices of vehicles), the minimum value is never less than zero. Therefore, you can use zero as the leftmost endpoint. The endpoint of the rightmost bin is the least multiple of $20,000 that is greater than the maximum vehicle price. Here is one way to define the bins:

proc iml;
use sashelp.cars;               
read all var {MSRP} into v; /** read data into vector v **/
close sashelp.cars;

w = 20000;                   /** specify bin width **/
EndPts = do(0, max(v)+w, w); /** end points of bins **/
print EndPts;

EndPts

0  20000  40000  60000  80000  100000  120000  140000  160000  180000  200000

Counting Observations in Each Bin

After the bins are defined, you can count the observations that are between the ith and (i + 1)th endpoint. If there are k endpoints, there are k – 1 bins. The following statements loop over each bin and count the number of observations in it:

NumBins = ncol(EndPts)-1;
count = j(1,NumBins,0);      /** allocate vector for results **/
/** loop over bins, not observations **/
do i = 1 to NumBins;      
   count[i] = sum(v >= EndPts[i] & v < EndPts[i+1]); 
end;

labels = char(EndPts/1000,3)+"K"; /** use left endpoint of bin as label **/
print count[colname=labels];

count

 0K  20K  40K  60K  80K  100K  120K  140K  160K  180K
 98  228   70   19    9     0     3     0     0     1

You can compare the SAS/IML counts with the PROC UNIVARIATE histogram to determine that the algorithm correctly counts these data.

The body of the DO loop consists of a single statement. The statement v >= EndPts[i] & v < EndPts[i+1] is a vector of zeros and ones that indicates which observations are in the ith bin. If you call that vector b, you can use b to compute the following quantities:

  • You can count the number of observations in the bin by summing the ones: n = sum(b).
  • You can compute the proportion of observations in the bin by computing the mean: pct = b[:]. (Multiply by 100 to express it as a percentage.)
  • You can find the indices of observations in the bin by using the LOC function: idx = loc(b). This is useful, for example, if you want to extract those observations in order to compute some statistic for each bin.
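
Putting those three uses together, a small demonstration might look like the following sketch (v and the endpoints are the ones defined earlier; the quantities shown are for the second bin):

b = (v >= 20000 & v < 40000); /** indicator vector for the second bin **/
n = sum(b);                   /** count: 228 for these data **/
pct = b[:];                   /** proportion of obs in the bin **/
idx = loc(b);                 /** indices of the obs in the bin **/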

Unequally Spaced Bins

The algorithm uses only endpoint information to count observations. Consequently, it is not limited to equally spaced bins. For example, you might want to divide the prices of cars into ordinal categories such as "Cheap," "Affordable," "Luxury," and "Extravagant" by arbitrarily defining price points to separate the categories, as shown in the following statements:

/** count the number of observations in bins of different widths **/
EndPts = {0 20000 60000 100000} || (max(v)+1);
labels = {"Cheap" "Affordable" "Luxury" "Extravagant"};

NumBins = ncol(EndPts)-1;
count = j(1,NumBins,0);      /** allocate vector for results **/
do i = 1 to NumBins;      
   count[i] = sum(v >= EndPts[i] & v < EndPts[i+1]); 
end;

print count[colname=labels];

count

Cheap  Affordable  Luxury  Extravagant
   98         298      28            4

In Chapter 10 of my book, Statistical Programming with SAS/IML Software, I use this technique to color observations in scatter plots according to the value of a third variable.

I'll leave you with a few questions to consider:

  1. Why did I use max(v)+1 instead of max(v) as the rightmost endpoint for the unequally spaced bins?
  2. Does this algorithm handle missing values in the data?
  3. This algorithm uses half-open intervals of the form [a,b) for the bins. That is, each bin includes its left endpoint, but not its right endpoint. Which parts of the algorithm change if you want bins of the form (a,b]?
  4. Notice that the quantity bR = (v < EndPts[i+1]) for i=k is related to the quantity bL = (v >= EndPts[i]) for i=k+1. Specifically, bR equals ^bL. Can you exploit this fact to make the algorithm more efficient?