Statistical Programming

2月 092011
 
The other day, someone asked me how to compute a matrix of pairwise differences for a vector of values. The person asking the question was using SQL to do the computation for 2,000 data points, and it was taking many hours to compute the pairwise differences. He asked if SAS/IML could solve the problem faster.

The result? My program ran in a fraction of a second.

Now, to be fair, I'm sure that the SQL code is not as efficient as it could be. It has been my experience that a well-written SQL or DATA step program usually performs as well as (or better than) comparable SAS/IML code. I'm not an SQL programmer, so I couldn't offer advice on how to improve his SQL code. But since I solved the problem by using PROC IML, I'll discuss the solution, along with sharing some thoughts about efficient SAS/IML programming.

An Example Problem: Comparing Gas Prices at Different Stations

The following DATA step creates a variable that contains gas prices (in cents per gallon) in Raleigh, NC, on a certain day for seven gas stations:

/** data from www.gaspricewatch.com, 05FEB2011 **/
data GasPrices;
length Station $6;
input Price DOLLAR5.2 Station $;
Price = Price * 100; /** convert to cents **/
datalines;
$2.90 Citgo
$2.91 Exxon1
$2.99 Shell1
$3.00 Shell2
$3.03 BP    
$3.05 Exxon2
$3.09 Texaco
;
run;

Suppose you are interested in the differences between pairs of prices. That is, you want to create the matrix, D, whose ijth entry is the difference in price between the price at the ith station and the price at the jth station. Cleary, D is an anti-symmetric matrix with zero on the diagonal. (An anti-symmetric matrix is also called skew-symmetric.) You can compute D with the following SAS/IML statements:

proc iml;
use GasPrices;
read all var {Station Price};
close GasPrices;

/** compute matrix of differences **/
/** 1. use matrices **/
start FullMatDiff(x); /** x is a column vector **/
   n = nrow(x);   
   m = shape(x, n, n); /** duplicate data **/
   return( m` - m );
finish;

D = FullMatDiff(Price);
print D[c=Station r=Station];

D

Citgo

Exxon1

Shell1

Shell2

BP

Exxon2

Texaco

Citgo

0

-1

-9

-10

-13

-15

-19

Exxon1

1

0

-8

-9

-12

-14

-18

Shell1

9

8

0

-1

-4

-6

-10

Shell2

10

9

1

0

-3

-5

-9

BP

13

12

4

3

0

-2

-6

Exxon2

15

14

6

5

2

0

-4

Texaco

19

18

10

9

6

4

0

Because the data are ordered by price, the pairwise differences increase as you scan down any column of D.

Analyzing Efficiency

In the SAS/IML language, as with many programming languages, there is more than one way to compute the same quantity. Consequently, I am always asking myself whether my first solution is the most efficient solution.

From a computational point of view, the FullMatDiff module is efficient. Notice that it uses the SHAPE function to create a matrix whose ith column contains n copies of the ith price. If you are not familiar with the SHAPE function, you might unnecessarily write a loop.

The main computation is a single matrix expression, m`-m, which computes all n2 pairs of differences.

However, when I look at the module critically, I notice that it requires more memory than is necessary. The US Census Bureau estimates that there were 117,000 gas stations in the US in 2008, so I can imagine running this module on many tens of thousands of prices. This could be a problem because each matrix requires memory for n2 entries and there are three matrices that need to be in RAM simultaneously: m, the transpose of m, and the matrix of differences.

It is not necessary to explicitly compute either m or the transpose. If I'm willing to write a loop, I can compute the pairwise differences while allocating only a single n x n matrix:

/** 2. compute each row directly **/
start FullRowDiff(x);
   n = nrow(x);   
   diff = j(n, n, .);
   xt = T(x); 
   do i = 1 to n;
      diff[i,] = xt[i] - xt;
   end;
   return( diff );
finish;

Another thought that occurred to me is that the matrix of differences is anti-symmetric, so it is not necessary to compute all of the entries but only the "n choose 2" entries above the diagonal (or, equivalently, below the diagonal). The SAS/IML module for computing the upper-triangular entries is similar to the previous module:

/** 3. compute only upper triangular differences 
       [COMB(n,2) elements] **/
start TriDiff(x);
   n = nrow(x);   
   diff = j(n, n, .);
   xt = T(x); 
   do i = 1 to n-1;
      cols = i+1:n;
      diff[i,cols] = x[i] - xt[,cols];
   end;
   return( diff );
finish;

Comparing the Performance of the Modules

I wondered whether one algorithm would run faster than the others, so I timed the performance on data with a wide range of sizes. It turns out that the time required to compute the pairwise differences is about the same for the three algorithms, regardless of the size of data. Furthermore, even for 10,000 prices, each algorithm executes in seconds. For more data than that, my computer cannot allocate sufficient memory.


I know that some of my readers are SQL programmers, so feel free to post an SQL solution. Also, let me know how long your SQL program requires to compute the pairwise differences of 2,000 and 10,000 random uniform values. Can an SQL solution handle 20,000 prices? How long does it take?

2月 072011
 
When you pass a matrix as an parameter (argument) to a SAS/IML module, the SAS/IML language does not create a copy of the matrix. That approach, known as "calling by value," is inefficient. It is well-known that languages that implement call-by-value semantics suffer performance penalties.

In the SAS/IML language, matrices are passed to modules "by reference." This has two important implications:

  • You can pass large matrices into SAS/IML modules without hurting performance.
  • If you change or redefine an argument inside the module, the corresponding matrix also changes outside the module.
The second point means that you should be careful: when you modify the value or shape of matrices in a SAS/IML module, the change percolates up to the calling environment.

An Example of Call-By-Reference

A simple example can make these ideas clearer. Suppose that you write a module that sorts an input argument as part of a computation, such as computing the empirical cumulative distribution function:

proc iml;
start EmpiricalCDF( x );
   call sort(x,1);
   /** compute with the sorted data **/
   return( cusum(x) );
finish;

When you call that module from a SAS/IML program, it actually sorts the matrix that is sent in:

y = {3,2,1,1,4,2};
cdf = EmpiricalCDF( y );
print y cdf;

y

cdf

1

1

1

2

2

4

2

6

3

9

4

13

The Advantages of Call-By-Reference Semantics

An advantage of call-by-reference semantics is that your program doesn't waste time and memory copying data from the global environment into the local variable. Instead, the local variable inside the module is just an alias to the same matrix that is passed into the module. (C programmers are familiar with the concept of a pointer. Same idea.)

Call-by-reference semantics also enables you to return multiple results from a single module call. For example, if you know that you need both the mean and the standard deviation of data, you might decide to write a module that computes both quantities at the same time:

start GetDescriptive(Mean, StdDev, /** output arguments **/
                     x); /** input argument **/
   Mean = x[:];
   StdDev = sqrt( ssq(x-Mean)/(nrow(x)-1) );
finish;

run GetDescriptive(m, s, y);
print m s;

m

s

2.1666667

1.1690452

Notice that you don't have to allocate m or s before you call the module.

A Tip for Using Call-By-Reference Semantics

At the time that you write the EmpiricalCDF module, it might not matter to you that the module changes the order of the data. Maybe the rest of the program never refers to the data again. Or maybe it only computes statistics that do not depend on the order of the data (such as the mean and standard deviation).

But what happens if you share the module with a colleague?

Murphy's Law predicts that your colleague will get burned.

Your colleague will insert your module call into a working program, and suddenly some computation that used to be correct will now give a wrong answer. He will spend hours debugging his program and will eventually discover that the first value in his y matrix no longer corresponds to the first observation in the data set. Your colleague might be tempted to angrily march into your office and shout, "your $!@#$% module changed the #!%$@ order of one of my variables!"

In order to maintain a healthy working relationship with your colleagues, a good programming practice is to observe the following convention:

Programming Tip: Do not write a module that modifies an input argument.

When I write a module that needs to change the values or shape of an input argument, I make an explicit copy of it and then modify the copy. For example, I might rewrite the EmpiricalCDF module as follows:

start EmpiricalCDF( _x );
   x = _x; /** copy input argument **/
   call sort(x,1);
   return( cusum(x) );
finish;

I discuss these issues and give additional examples in my book, Statistical Programming with SAS/IML Software.

2月 042011
 
Last week I generated two kinds of random point patterns: one from the uniform distribution on a two-dimensional rectangle, the other by jittering a regular grid by a small amount.

My show choir director liked the second method (jittering) better because of the way it looks on stage: there are no gaps and no two performers are too close to each other. The statistician in me wants to know what statistical properties make the jittered pattern more desirable than the other. Said differently, can I do any of the following?

  • Compute a statistical measure that characterizes each kind of point pattern?
  • Use statistics to predict whether a given pattern would be acceptable to her?

As part of my plan to learn new statistics, I've been reading Statistical Analysis and Modelling of Spatial Point Patterns by J. Illian, A. Penttinen, H. Stoyan, and D. Stoyan. The spatial statistics in this blog post come from their book. This post has four parts:

  1. I describe a simple test that indicates whether a point pattern is likely to have been generated from a random uniform distribution.
  2. I implement the test in the SAS/IML language.
  3. I simulate data to show what the test means and how it works.
  4. I return to the above questions and show that the answer to both questions is "yes."

Complete Spatial Randomness --- A Test

Not all random configurations are created equal. A jittered pattern is random, but a pattern that is generated from a uniform distribution on a 2D rectangle exhibits complete spatial randomness (CSR).

There are several ways to test for CSR. Probably the easiest method is called the method of quadrats. The method consists of the following steps:

  1. Divide the rectangular domain into k disjoint cells (bins) of equal area and count the number of points, ci, that fall into each cell. You can use the Bin2D module for this step.
  2. If the points were generated from a distribution with a uniform density, you would expect e = N/k points in each bin, where N is the total number of points. Subtract that value from the actual bin count to obtain the residual count for each bin.
  3. Just as for two-way frequency tables, you can compute a normalized sum of the squared residuals and use that measure to determine whether the point pattern was likely to have come from a distribution with a uniform density. The computation is just the usual Pearson chi-square statistic: Σ (ci - e)2 / e.
The theory says that if a point pattern is generated from a uniform distribution, the Pearson statistic has an approximate chi-square distribution with k-1 degrees of freedom, provided that k > 6 and e > 1.

Ugh! It takes longer to explain the algorithm in words than it does to implement it in the SAS/IML language! It's time for a concrete example.

SAS/IML Code to Test for Complete Spatial Randomness

The following SAS/IML statements generate 20 points from a random uniform distribution, define a 3x3 grid on the unit square, and call the Bin2D module to count how many points are in each cell of the grid:

proc iml;
N = 20;
pUnif = j(N,2);
call randseed(1234);
call randgen(pUnif, "uniform"); /** generate random uniform **/

/** Define regular grid: 0, 0.33, 0.67, 1 **/
XMin = 0; XMax = 1; Nx = 3; dx = 1/Nx; 
YMin = 0; YMax = 1; Ny = 3; dy = 1/Ny;
XDiv = do(XMin, XMax, dx); 
YDiv = do(YMin, YMax, dy);

/** count the points in each cell **/
load module=Bin2D;
countUnif = Bin2D(pUnif, XDiv, YDiv); 

For this example, there are 20 points and 9 cells, so you should expect 2.22 points in each cell. The following statements compute the residuals (observed counts minus the expected value) for each cell and compute the Pearson statistic, which in spatial statistics is also known as the index of dispersion:

c = countUnif;
NBins = (ncol(XDiv)-1) * (ncol(YDiv)-1);
e = N/NBins;  /** expected value **/
resid = (c - e);
d = sum( resid##2 ) / e; /** index of dispersion **/
print resid;

resid

1.7777778

-1.222222

0.7777778

0.7777778

-2.222222

-1.222222

-0.222222

-1.222222

2.7777778

When the points come from a uniform random distribution, d has a 90% chance of being within the interval [L, R], where L and R are the following quantiles of the chi-square distribution:

L = quantile("chisquare", 0.05, NBins-1);
R = quantile("chisquare", 0.95, NBins-1);
print d L R;

d

L

R

9.7

2.7326368

15.507313

What Does It Mean? Time to Simulate

My brain is starting to hurt. What does this all mean in practice?

What it means is that I can do the following experiment: simulate 1,000 point patterns from the random uniform distribution and compute the corresponding 1,000 Pearson statistics. If I plot a histogram of the 1,000 Pearson statistics, I should see that histogram looks similar to a chi-square distribution with 8 degrees of freedom and that roughly 100 (10%) of the Pearson statistics will be outside of the interval [L, R].

The adjacent image shows the results of the experiment, which I ran in SAS/IML Studio. The curve is a χ82 distribution. The vertical lines are at the values L and R.

For this experiment, 71 point patterns have Pearson statistics that are outside of the interval. This is not quite 10%, but it is close. Repeating the experiment with different numbers of points (for example, N=30 or 50) and different grids (for example, a 2x3 or 4x3) results in similar histograms that also have roughly 10% of their values outside of the corresponding confidence interval.

The little scatter plots underneath give an example of a point pattern with the indicated Pearson statistic. The middle scatter plot (d=9.7) is the plot of the random points generated in the previous section. It is a "typical" point pattern from the random uniform distribution; most of the point patterns look like this one. You can see that there is a lot of variation in the cell counts for this configuration: one of the bins contains zero points, another contains five.

The scatter plot on the left (d=0.7) is a not a typical point pattern from the random uniform distribution. All of the cells in this pattern have two or three points. It is a little counterintuitive, but it is rare that a truly random uniform point pattern has points that are evenly spaced in the unit square.

The scatter plot on the right (d=27.7) shows the other extreme: more variation than is expected. Five of the cells in this pattern have zero or one points. A single cell contains almost half of the points!

What about the Jittered Patterns?

The previous graph implies that the arrangements of points that are randomly generated from a uniform distribution are rarely evenly spaced. The usual case exhibits gaps and clusters.

If you instead generate 1,000 jittered patterns, as in my previous post, the distribution of the Pearson statistics looks like the following image. The distribution of Pearson's statistic does not follow a chi-square distribution for jittered patterns. Most of the values of Pearson's statistic are small, which makes sense because the statistic measures the departure from the average density, and the configurations are permutations of evenly spaced points.

It's time to revisit the questions posed at the beginning of this post.

Can I compute a statistical measure that characterizes each kind of point pattern? Yes. The Pearson statistic tends to be small for jittered point patterns and tends to be larger for random uniform configurations. However, if the Pearson statistic for an example pattern is, say, between 3 and 6, I can't predict whether the points were generated by using the jittering algorithm or whether they were generated from a random uniform distribution.

Can I use statistics to predict whether a given pattern might be acceptable to my director? Yes. Regardless of how the configuration was generated, a pattern that has a small value of the Pearson statistic (less than 3 for this example) does not exhibit gaps or clusters, and therefore will be acceptable to my director.

Although statistics can be used to check the appropriateness of certain blocking configurations, I don’t think that my director will be replaced by a SAS/IML program anytime soon: she also plays the piano!

2月 022011
 
One of my New Year's resolutions is to learn a new area of statistics. I'm off to a good start, because I recently investigated an issue which started me thinking about spatial statistics—a branch of statistics that I have never formally studied.

During the investigation, I asked myself: Given an configuration of points in a rectangle, how can I compute a statistic that indicates whether the configuration is uniformly spread out in the region?

One way to investigate this question is to look at the density of points in the configuration. Are the points spread out or clumped together?

Binning the Points

In the one-dimensional case, you can use a histogram to visually examine the density of univariate data. If you see that the heights of the bars in each bin are roughly the same height, then you have visual evidence that the data are uniformly spaced.

Of course, you don't really need the histogram: the counts in each bin suffice. You can divide the range of the data into N equally spaced subintervals and write SAS/IML statements that count the number of points that fall into each bin.

It is only slightly harder to compute the number of point that fall into bins of a two-dimensional histogram or, equivalently, into cells of a two-dimensional grid:

  1. For each row in the grid...
  2. Extract the points whose vertical coordinates indicate that they are in this row. (Use the LOC function for this step!)
  3. For these extracted points, apply the one-dimensional algorithm: count the number of points in each horizontal subdivision (column in the grid).
The following SAS/IML module uses this algorithm to count the number of points in each cell. Notice that the algorithm loops over the subdivisions of the grid, rather than over the observations.

proc iml;
/** Count the number of points in each cell of a 2D grid.
    Input: p    - a kx2 matrix of 2D points
           XDiv - a subdivision of the X coordinate direction
           YDiv - a subdivision of the Y coordinate direction
    Output: The number of points in p that are contained in 
            each cell of the grid defined by XDiv and YDiv. **/
start Bin2D(p, XDiv, YDiv);
   x = p[,1];       y = p[,2];
   Nx = ncol(XDiv); Ny = ncol(YDiv);
   
   /** Loop over the cells. Do not loop over observations! **/
   count = j(Ny-1, Nx-1, 0); /** initialize counts for each cell **/
   do i = 1 to Ny-1; /** for each row in grid **/
      idx = loc((y >= YDiv[i]) & (y < YDiv[i+1]));
      if ncol(idx) > 0 then do;
         t = x[idx]; /** extract point in this row (if any) **/
         do j = 1 to Nx-1; /** count points in each column **/
            count[i,j] = sum((t >= XDiv[j]) & (t < XDiv[j+1]));
         end;
      end;
   end;
   return( count );
finish;

Testing the Binning Module

A convenient way to test the module is to generate 100 random uniform points in a rectangle and print out the counts. The following example counts the number of points in each of 20 bins. There are five bins in each horizontal row and four bins in each vertical column.

call randseed(1234);
p = j(100,2);
call randgen(p, "uniform"); /** 100 random uniform points **/

/** define bins: 5 columns and 4 rows on [0,1]x[0,1] **/
XMin = 0; XMax = 1; dx = 1/5;
YMin = 0; YMax = 1; dy = 1/4;
XDiv = do(XMin, XMax, dx);
YDiv = do(YMin, YMax, dy);

count = Bin2D(p, XDiv, YDiv);

A visualization of the random points and the grid is shown in the scatter plot. From the graph, you can quickly estimate that each bin has between 1–10 points. The exact counts are contained in the count matrix.

Notice that the Y axis of the scatter plot has large values of Y at the top of the image and small values at the bottom of the image. In order to compare the values of the count matrix with the scatter plot, it is helpful to reverse the order of the rows in count:


/** reverse rows to conform to scatter plot **/
count = count[nrow(count):1,];
c = char(XDiv, 3); /** form column labels **/
r = char(YDiv[ncol(YDiv)-1:1], 4); /** row labels (reversed) **/
print count[colname=c rowname=r];

count

0

0.2

0.4

0.6

0.8

0.75

2

3

6

7

7

0.5

5

9

7

4

1

0.25

9

2

7

2

7

0

7

5

2

2

6

You can use these bin counts to investigate the density of the point pattern. I'll present that idea on Friday.

2月 022011
 
One of my New Year's resolutions is to learn a new area of statistics. I'm off to a good start, because I recently investigated an issue which started me thinking about spatial statistics—a branch of statistics that I have never formally studied.

During the investigation, I asked myself: Given an configuration of points in a rectangle, how can I compute a statistic that indicates whether the configuration is uniformly spread out in the region?

One way to investigate this question is to look at the density of points in the configuration. Are the points spread out or clumped together?

Binning the Points

In the one-dimensional case, you can use a histogram to visually examine the density of univariate data. If you see that the heights of the bars in each bin are roughly the same height, then you have visual evidence that the data are uniformly spaced.

Of course, you don't really need the histogram: the counts in each bin suffice. You can divide the range of the data into N equally spaced subintervals and write SAS/IML statements that count the number of points that fall into each bin.

It is only slightly harder to compute the number of point that fall into bins of a two-dimensional histogram or, equivalently, into cells of a two-dimensional grid:

  1. For each row in the grid...
  2. Extract the points whose vertical coordinates indicate that they are in this row. (Use the LOC function for this step!)
  3. For these extracted points, apply the one-dimensional algorithm: count the number of points in each horizontal subdivision (column in the grid).
The following SAS/IML module uses this algorithm to count the number of points in each cell. Notice that the algorithm loops over the subdivisions of the grid, rather than over the observations.

proc iml;
/** Count the number of points in each cell of a 2D grid.
    Input: p    - a kx2 matrix of 2D points
           XDiv - a subdivision of the X coordinate direction
           YDiv - a subdivision of the Y coordinate direction
    Output: The number of points in p that are contained in 
            each cell of the grid defined by XDiv and YDiv. **/
start Bin2D(p, XDiv, YDiv);
   x = p[,1];       y = p[,2];
   Nx = ncol(XDiv); Ny = ncol(YDiv);
   
   /** Loop over the cells. Do not loop over observations! **/
   count = j(Ny-1, Nx-1, 0); /** initialize counts for each cell **/
   do i = 1 to Ny-1; /** for each row in grid **/
      idx = loc((y >= YDiv[i]) & (y < YDiv[i+1]));
      if ncol(idx) > 0 then do;
         t = x[idx]; /** extract point in this row (if any) **/
         do j = 1 to Nx-1; /** count points in each column **/
            count[i,j] = sum((t >= XDiv[j]) & (t < XDiv[j+1]));
         end;
      end;
   end;
   return( count );
finish;

Testing the Binning Module

A convenient way to test the module is to generate 100 random uniform points in a rectangle and print out the counts. The following example counts the number of points in each of 20 bins. There are five bins in each horizontal row and four bins in each vertical column.

call randseed(1234);
p = j(100,2);
call randgen(p, "uniform"); /** 100 random uniform points **/

/** define bins: 5 columns and 4 rows on [0,1]x[0,1] **/
XMin = 0; XMax = 1; dx = 1/5;
YMin = 0; YMax = 1; dy = 1/4;
XDiv = do(XMin, XMax, dx);
YDiv = do(YMin, YMax, dy);

count = Bin2D(p, XDiv, YDiv);

A visualization of the random points and the grid is shown in the scatter plot. From the graph, you can quickly estimate that each bin has between 1–10 points. The exact counts are contained in the count matrix.

Notice that the Y axis of the scatter plot has large values of Y at the top of the image and small values at the bottom of the image. In order to compare the values of the count matrix with the scatter plot, it is helpful to reverse the order of the rows in count:


/** reverse rows to conform to scatter plot **/
count = count[nrow(count):1,];
c = char(XDiv, 3); /** form column labels **/
r = char(YDiv[ncol(YDiv)-1:1], 4); /** row labels (reversed) **/
print count[colname=c rowname=r];

count

0

0.2

0.4

0.6

0.8

0.75

2

3

6

7

7

0.5

5

9

7

4

1

0.25

9

2

7

2

7

0

7

5

2

2

6

You can use these bin counts to investigate the density of the point pattern. I'll present that idea on Friday.

1月 282011
 
I sing in the SAS-sponsored VocalMotion show choir. It's like an adult version of Glee, except we have more pregnancies and fewer slushie attacks.

For many musical numbers, the choreographer arranges the 20 performers on stage in an orderly manner, such as four rows of five singers. But every once in a while the director wants to create a disorganized look for a portion of a song (for example, a free-form '60's segment). She tells the performers to "go to random positions."

The choir always has problems achieving the look that the director wants. We end up in positions that are equally spaced from one another, but to the audience it often looks like we tried to arrange ourselves in rows, but failed miserably!

Since I know what a random point pattern looks like, I try to help. I whisper "clump more" and "vary the sizes of the gaps between performers." I tell my fellow performers, "random uniform does not imply uniformly spaced," but mostly they just ignore me or roll their eyes.

This year I took it upon myself to solve the problem, and I discovered that it led to some interesting statistics.

The Experiment

My first attempt to help the director was to generate many examples of random point patterns. Our stage is approximated by the rectangle S = [0, 2] x [0, 1]. It is easy to generate 20 points that are uniformly distributed in S. My idea was that the director would have 10–12 of these patterns. When she wants a random configuration, she can hold up a pattern and say, "go to this configuration." I sent the director the patterns on the adjacent image.

I was quite proud of myself. But guess what?

She didn't like them.


Random Uniform versus Evenly Spaced

The director has been telling us to go to random positions, so how can she not like the random configurations that I sent her? She explained that there are two characteristics that she doesn't like:

  1. Large areas of the stage are devoid of performers, which I translate to mean "the point patterns have too many gaps."
  2. Some of the performers are too close to one another, which I translate to mean "the point patterns have too many clusters."
From her comments, I conclude that I have been wrong all these years. She does not actually want the performers in random uniform positions on stage! The characteristics that she mentions are intrinsic to random uniform point patterns, so if she doesn't want those characteristics, she doesn't want uniform random.

So what does she want?

A Perturbation Algorithm

It was time for Plan B. I talked to the director again and heard her say that she wants the performers in a configuration that is close to a regular grid, but not too close. It has to "look random." To me, this suggests a perturbation approach to the problem: start by arranging the performers on a regular grid, but then perturb that arrangement by a small amount.

I thought about how to physically do this on stage and came up with the following algorithm:

  1. Arrange the performers on a regular grid, such as four rows of five performers.
  2. Ask the performers to turn in place for 5–10 seconds. Each performer chooses his or her own speed and direction. When the director calls "stop turning," each performer should be pointing in a random direction.
  3. Each performer takes a step forward in whatever direction he or she is facing.
This algorithm results in a perturbation of a regular grid in which each point is perturbed (jittered) by a small amount in an arbitrary direction. But does this perturbation give the look that my director wants? There's only one way to find out: implement the algorithm and look at some pictures.

Random Jittering of a Regular Grid

Forming a regular grid with SAS/IML software is no problem: I can use the Define2DGrid module, which I have previously described. It is also easy to use the RANDSEED and RANDGEN subroutines to generate angles that are distributed uniformly in [0, 2 pi]. The following SAS/IML module perturbs each position on a regular grid:

proc iml;
/** Module to define ordered pairs that are a random jittering
    of a uniform grid of points.
    The first six input parameters are the same as for Define2DGrid.
    The last parameter, delta, is the size of the jitter.

    Output: Ordered pairs are returned in a matrix with
         (Nx x Ny) rows and two columns.
**/
start GridJitter(XMin, XMax, Nx, YMin, YMax, Ny, delta);
   /** initial placement (grid) **/
   Lx = XMax - XMin; Ly = YMax - YMin;
   dx = Lx / Nx;     dy = Ly / Ny;
   load module=Define2DGrid;
   xy = Define2DGrid(dx/2, Lx-dx/2,Nx, dy/2, Ly-dy/2,Ny);
   x0 = xy[,1];  y0 = xy[,2];

   /** generate random directions in [0, 2 pi] **/
   NPts = Nx * Ny;
   pi = constant('PI');
   angle = 2 * pi * uniform( j(NPts,1) ); 

   /** each person takes on step in some direction **/
   x = x0 + delta * cos(angle);
   y = y0 + delta * sin(angle);

   /** don't let anyone fall off the stage! **/
   x = choose(x<0, 0, choose(x>Lx, Lx, x));
   y = choose(y<0, 0, choose(y>Ly, Ly, y));
   return ( x || y );
finish;

call randseed(654321);
r = 0.1; /** size of a small step **/
/** jitter 4x5 grid on the "stage" [0,2]x[0,1] **/
Jitter = GridJitter(0,2,5, 0,1,4, r); 

The results of several runs of this algorithm are shown in the following image. I showed the point patterns to the director. "Yes!" she exclaimed. "These are the kinds of configurations that I'm looking for."


Conclusions and Next Steps

One conclusion is obvious. In the initial stages of this project, I violated the Fundamental Rule of Consulting, which is "listen to a client in order to truly understand the problem." I assumed that my notion of "random placement" was the same as my director's, but it wasn't.

I am now excited to begin the next step of this process: a statistical analysis of point patterns.

Regular readers of my blog will recall that one of my New Year's resolutions is to learn something about an area of statistics that is new to me. I've already read a few articles and book chapters that will undoubtedly spark future blog posts about spatial point patterns.

For example, my director emphasized two features of point patterns: clustering and distance between adjacent points. In future blog posts, I will analyze statistics that are associated with these characteristics.

But—shhh!—don't tell the choir that I'm computing these statistics. I'd prefer not to get doused with a slushie.

1月 262011
 
A histogram displays the number of points that fall into a specified set of bins. This blog post shows how to efficiently compute a SAS/IML vector that contains those counts.

I stress the word "efficiently" because, as is often the case, a SAS/IML programmer has a variety of ways to solve this problem. Whereas a DATA step programmer might sort the data and then loop over all observations to count how many observations are contained in each bin, this algorithm is not the best way to solve the problem in the SAS/IML language.

To visualize the problem, suppose that you are interested in the distribution of the manufacturer's suggested retail price (MSRP) of vehicles in the Sashelp.Cars data set. You can call PROC UNIVARIATE to obtain the following histogram:

proc univariate data=sashelp.cars;
var MSRP;
histogram MSRP / vscale=count barlabel=count
                endpoints=0 to 200000 by 20000;
run;


The histogram has a bin width of $20,000 and displays the number of observations that fall into each bin. The bins are the intervals [0, 20000), [20000, 40000), and so on to [180000, 200000).

Constructing Bins

Suppose that you want to use the SAS/IML language to count the number of observations in each bin. The first step is to specify the endpoints of the bins. For these data (prices of vehicles), the minimum value is never less than zero. Therefore, you can use zero as the leftmost endpoint. The endpoint of the rightmost bin is the least multiple of $20,000 that is greater than the maximum vehicle price. Here is one way to define the bins:

proc iml;
use sashelp.cars;               
read all var {MSRP} into v; /** read data into vector v **/
close sashelp.cars;

w = 20000;                   /** specify bin width **/
EndPts = do(0, max(v)+w, w); /** end points of bins **/
print EndPts;

EndPts

0

20000

40000

60000

80000

100000

120000

140000

160000

180000

200000

Counting Observations in Each Bin

After the bins are defined, you can count the observations that are between the ith and (i + 1)th endpoint. If there are k endpoints, there are k – 1 bins. The following statements loop over each bin and count the number of observations in it:

NumBins = ncol(EndPts)-1;
count = j(1,NumBins,0);      /** allocate vector for results **/
/** loop over bins, not observations **/
do i = 1 to NumBins;      
   count[i] = sum(v >= EndPts[i] & v < EndPts[i+1]); 
end;

labels = char(EndPts/1000,3)+"K"; /** use left endpoint of bin as label **/
print count[colname=labels];

count

0K

20K

40K

60K

80K

100K

120K

140K

160K

180K

98

228

70

19

9

0

3

0

0

1

You can compare the SAS/IML counts with the PROC UNIVARIATE histogram to determine that the algorithm correctly counts these data.

The body of the DO loop consists of a single statement. The statement v >= EndPts[i] & v < EndPts[i+1] is a vector of zeros and ones that indicates which observations are in the ith bin. If you call that vector b, you can use b to compute the following quantities:

  • You can count the number of observations in the bin by summing the ones: n = sum(b).
  • You can compute the percentage of observations in the bin by computing the mean: pct = b[:].
  • You can find the indices of observations in the bin by using the LOC function: idx = loc(b). This is useful, for example, if you want to extract those observations in order to compute some statistic for each bin.

Unequally Spaced Bins

The algorithm uses only endpoint information to count observations. Consequently, it is not limited to equally spaced bins. For example, you might want divide the prices of cars into ordinal categories such as "Cheap," "Affordable," "Luxury," and "Extravagant" by arbitrarily defining price points to separate the categories, as shown in the following statements:

/** count the number of observations in bins of different widths **/
EndPts = {0 20000 60000 100000} || (max(v)+1);
labels = {"Cheap" "Affordable" "Luxury" "Extravagant"};

NumBins = ncol(EndPts)-1;
count = j(1,NumBins,0);      /** allocate vector for results **/
do i = 1 to NumBins;      
   count[i] = sum(v >= EndPts[i] & v < EndPts[i+1]); 
end;

print count[colname=labels];

count

Cheap

Affordable

Luxury

Extravagant

98

298

28

4

In Chapter10 of my book, Statistical Programming with SAS/IML Software, I use this technique to color observations in scatter plots according to the value of a third variable.

I'll leave you with a few questions to consider:

  1. Why did I use max(v)+1 instead of max(v) as the rightmost endpoint for the unequally spaced bins?
  2. Does this algorithm handle missing values in the data?
  3. This algorithm uses half-open intervals of the form [a,b) for the bins. That is, each bin includes its left endpoint, but not its right endpoint. Which parts of the algorithm change if you want bins of the form (a,b]?
  4. Notice that the quantity bR = (v < EndPts[i+1]) for i=k is related to the quantity bL = (v >= EndPts[i]) for i=k+1. Specifically, bR equals ^bL. Can you exploit this fact to make the algorithm more efficient?
1月 252011
 
Have you ever wanted to compute the exact value of a really big number such as 200! = 200*199*...*2*1? You can do it—if you're willing to put forth some programming effort. This blog post shows you how.

Jiangtang Hu's recent blog discusses his quest to compute large factorials in many programming languages. As he discovered, many languages that use standard double precision arithmetic do not compute n! for large n, whereas other software, such as Mathematica, succeed.

The reason is simple: 171! is greater than 1.79769e+308, which on many computers is the largest floating point number that can be represented in eight bytes. For example, the following statements print the maximum double precision value on my Windows PC:

proc iml;
MaxBig = constant("big");
print MaxBig;

MaxBig

1.798E308

Because of the limitation of double precision arithmetic, floating point computations can result in a run-time error. For example, the following factorial computation overflows in SAS software:

f = fact(200);  /** numerical overflow **/

Of course, there are ways around this limitation. The reason that software such as Maple and Mathematica succeed is that they implement special algorithms that enable them to compute results to arbitrary precision.

Computing something to "arbitrary precision" sounds like magic, but it's not. For example, I can use the SAS/IML language (or the DATA step, or C, or many other languages) to write an algorithm that computes the factorial of fairly large numbers. The following algorithm is naive and inefficient, but it shows how you can implement an arbitrary precision algorithm by storing the result in a numerical array instead of in a double precision variable.

An Algorithm for Computing the Factorial of Large Numbers

Again, I emphasize that this algorithm is for illustration purposes and is not intended for "real" work. Given an integer, n, here's how you can compute the exact value of f = n!. The key is to store each digit of the result in a numerical array.

  1. Compute the number of digits of f = n!. The number of digits is related to the logarithm (base 10) of f, so you can use the LFACT function. The LFACT function returns the natural logarithm of the f, but you can convert that number to the base-10 logarithm.
  2. Allocate an array with the correct number of digits. The ith element of this array will hold the ith digit of the answer. In other words, if the ith digit is d, the digit represents the value d x 10i.
  3. Initialize the result to 1.
  4. For each integer 2, 3, ..., n, multiply the current result by the next factor. This is done place-value-by-place-value, and results that are greater than 10 are carried over to the next place value.
  5. After all the multiplications, the array f has the value of n!. The numbers in this array need to be reversed, because, for example, the array {0 2 1} represents the number 120.

start BigFactorial(n);
 numDigits = ceil(lfact(n)/log(10)); /** 1 **/
 f = j(1,numDigits,0);               /** 2 **/
 f[1] = 1;                           /** 3 **/
 do i = 2 to n;                      /** 4 **/
   carry = 0;
   digits = ceil(lfact(i)/log(10)); /** how many digits in the answer so far? **/
   do j = 1 to digits ;             /** multiply the product by next factor **/
     g = i*f[j] + carry;            /** j_th place value **/
     digit = mod(g, 10); 
     f[j] = digit;                  /** put value in the j_th place **/
     carry = (g - digit)/10;        /** carry the rest to the (j+1)_th place **/
   end;
 end;
 return( f[,ncol(f):1] );           /** 5. reverse the digits **/
finish;

Can this module really compute the factorial of large numbers? Yep. You can compute 200! and see that the exact answer contains 375 digits. The following statements compute the factorial and print the result:

f = BigFactorial(200);  /** compute 200! **/
/** convert answer from a numerical array to a character string **/
Fact200 = rowcat(char(f,1)); 

numDigits200 = nleng(Fact200); /** how many digits in the answer? **/
print numDigits200;
print Fact200;

numDigits200

375

                                     Fact200

     788657867364790503552363213932185062295135977687173263294742533244359449963
     403342920304284011984623904177212138919638830257642790242637105061926624952
     829931113462857270763317237396988943922445621451664240254033291864131227428
     294853277524242407573903240321257405579568660226031904170324062351700858796
     178922222789623703897374720000000000000000000000000000000000000000000000000

Obviously, I have entirely too much time on my hands, but does this example demonstrate anything else? It demonstrates that programming with arrays enables you to implement a wide variety of algorithms. Perhaps the example also demystifies software that evaluates expressions to "arbitrary precision."

Mostly, though, I did it because I thought it would be fun. I was right. And next time I'm at a party, I might casually sidle next to someone and mention that 1,000! has 2,568 digits. I'll let you know their response.

1月 212011
 
In a previous post, I described ways to create SAS/IML vectors that contain uniformly spaced values. The methods did not involve writing any loops.

This post describes how to perform a similar operation: creating evenly spaced values on a two-dimensional grid. The DATA step solution is simple, but an efficient SAS/IML solution requires thinking about the problem in terms of vectors and matrices.

The DATA Step Solution

A grid of values is useful in many contexts, such as scoring and visualizing a statistical model. In the DATA step, you can use loops to create a grid of values. For example, to generate a 3 x 2 grid of values on the rectangle [0,2] x [0,1] you can use the following program:

data grid;
keep x y;
/** set parameters for the general problem **/
XMin = 0; XMax = 2; Nx = 3;
YMin = 0; YMax = 1; Ny = 2;

dx = (XMax-XMin)/(Nx-1);
dy = (YMax-YMin)/(Ny-1);
do y = YMin to YMax by dy;
   do x = XMin to XMax by dx;
      output;
   end;
end;
run;

proc print; var x y; run;

Obs

x

y

1

0

0

2

1

0

3

2

0

4

0

1

5

1

1

6

2

1

This solution uses a double loop to compute the values on a uniform grid. There are Nx points in the X direction, evenly spaced on the interval [XMin, XMax], and Ny points in the Y direction, evenly spaced on the interval [YMin, YMax].

As I've written before, the SAS/IML language is a vector-matrix language, and an efficient SAS/IML program is one that avoids writing unnecessary loops. How can you write a similar program in PROC IML that avoids one or both loops?

A SAS/IML Language Solution

There are three simple SAS/IML functions that you can use to create a module that computes a grid of evenly spaced values:

The first step is to use the parameters in the problem to determine the points in the X and Y directions that will form the grid:

proc iml;
XMin = 0; XMax = 2; Nx = 3;
YMin = 0; YMax = 1; Ny = 2;
XDiv = do(XMin, XMax, (XMax-XMin)/(Nx-1));
YDiv = do(YMin, YMax, (YMax-YMin)/(Ny-1));
print XDiv, YDiv;


XDiv

0

1

2

YDiv

0

1

These are the coordinates of the grid points, but you need to arrange them into a matrix of ordered pairs that contains Nx * Ny rows and 2 columns.

The X coordinates are easy to construct. To form the X coordinates you can repeat the XDiv vector Ny times, and then reshape those values into a column vector:

X = repeat(XDiv, Ny);
X = shape(X, Nx*Ny);  /** or use COLVEC module in IMLMLIB **/
print X;

X

0

1

2

0

1

2

The Y coordinates are a little more difficult to construct. Recall that SAS/IML matrices are stored row-wise. This means that if you use the REPEAT function to form a set of points from the YDiv vector, the points will be in the wrong order:

Y = repeat(YDiv, Nx);
print Y; /** wrong order: elements stored row-wise **/

Y

0

1

0

1

0

1

What you really want is to traverse the Y matrix down the columns instead of across the rows. No problem! Use the T function to transpose the values, and then reshape those transposed values into a column:

Y = shape(T(Y), Nx*Ny);  /** or use COLVEC module in IMLMLIB **/

A SAS/IML Module That Defines Ordered Pairs on a Grid

You can put these statements into a module, which enables you to conveniently generate a grid of points:


/** Module to define ordered pairs in a uniform grid of points.
   Input: XMin = minimum value of the X coordinate of the grid
          XMax = maximum value of the X coordinate of the grid
          Nx = number of grid lines along the X coordinate
          YMin = minimum value of the Y coordinate of the grid
          YMax = maximum value of the Y coordinate of the grid
          Ny = number of grid lines along the Y coordinate

    Output: Ordered pairs are returned in a matrix with
         (Nx x Ny) rows and two columns.
**/
start Define2DGrid(XMin, XMax, Nx, YMin, YMax, Ny);
   XDiv = do(XMin, XMax, (XMax-XMin)/(Nx-1));
   YDiv = do(YMin, YMax, (YMax-YMin)/(Ny-1));
   X = repeat(XDiv, Ny);
   X = shape(X, Nx*Ny);
   Y = repeat(YDiv, Nx);
   Y = shape(T(Y), Nx*Ny);
   return ( X || Y );
finish;

/** Example: Use the module to define a grid of points and 
      then evaluate a function f(x,y) on the grid. **/
p = Define2DGrid(-2, 2, 23,  /** 23 points in [-2,2] **/
                 -1, 1, 11); /** 11 points in [-1,1] **/
x = p[,1];  y = p[,2];
z = x##2 - 2*y##2;  /** evaluate a function z = f(x,y) **/

/** the following SAS/IML Studio statement creates a rotating plot **/
RotatingPlot.Create("Quadratic Surface", x, y, z);

1月 192011
 
"What is the chance that two people in a room of 20 share initials?"

This was the question posed to me by a colleague who had been taking notes at a meeting with 20 people. He recorded each person's initials next to their comments and, upon editing the notes, was shocked to discover that no two people in the meeting shared initials.

Last week I analyzed the frequency distribution of initials for 4,502 SAS employees who work with me in Cary, NC. The distribution is highly skewed. For example, there are 676 possible pairs of initials (AA, AB, AC, ..., ZX, ZY, ZZ), but 212 of those combinations (31%) are not the initials of any SAS employee in Cary. Another 89 pairs (13%) are the initials of only a single person. In contrast, the most common initials, JS, are shared by 61 people.

How does this skewed distribution affect the chance that two people in the same room share initials? It makes matches more likely! Munford (TAS, 1977) showed that a uniform distribution of frequencies corresponds to the case in which it is least likely that two people in the same room share a common characteristic (such as a birthday or the same initials).

So how can you estimate the probability that two people in a room have the same initials? Use simulation!

A Simulation-Based Estimate

It works like this. Suppose you want to estimate the probability that two SAS employees in a room of 20 have a common pair of initials. You can simulate this situation in SAS/IML software as follows:

  1. Read the frequencies (or percentages) for each pair of initials into a SAS/IML vector. From these frequencies you can compute the empirical proportion for each pair of initials.
  2. Randomly select 20 initials from the vector to represent the 20 people in the room. The selections are made with a probability equal to the proportion of the initials at SAS. The program that follows uses the SampleWithReplace module from my book Statistical Programming with SAS/IML Software. (Technically, the module uses sampling with replacement, and I should use sampling without replacement because I have a finite population, but I'll ignore that small detail.)
  3. Count the number of unique initials in the simulated room. If the number is smaller than 20, then there is a match: at least two people have the same initials.
  4. Repeat the previous two steps a large number of times. The number of matches divided by the number of simulated rooms is an estimate of the probability that a random room will have a match. For example, if you generate 100,000 rooms and 58,455 of those rooms have a match, then the probability of a random room having a match is estimated by 0.58455.
The following SAS/IML program carries out this simulation:

/** program to compute the initial-matching problem **/
proc iml;
call randseed(123);
load module=SampleWithReplace; 

/** read percentages for each pair of initials **/
use Sasuser.InitialFreq where(Percent>0);
read all var {Percent};
close Sasuser.InitialFreq;

p = Percent/Percent[+]; /** probability of events **/
NumRooms = 1e5; /** number of simulated rooms **/
match = j(NumRooms, 1); /** allocate results vector **/
N=20;
initials = SampleWithReplace(1:nrow(p), NumRooms||N, p);
do j = 1 to NumRooms;
   u = unique(initials[j,]); /** number of unique initials **/
   match[j] = N - ncol(u); /** number of common initials **/
end;
/** estimated prob of >= 1 matching initials **/
ProbEst = (match>0)[:];
print ProbEst;

ProbEst

0.58455

According to this simulation, if a meeting at SAS is attended by 20 employees, there is a 58% chance that two people in attendance have matching initials.

Generalizing the Problem

It is not difficult to estimate the probability that two people will have matching initials when there are N people at the meeting: just put a loop around the previous program. (You can download the SAS/IML program that carries out the simulation.)

The following graph shows the probability estimates as the number of people in the room vary. The chance reaches 50% when there are 18 people in the room. Even for small meetings, the probability is not negligible: for 10 people in a room the chance is almost 20%.

Do these estimates generalize to other situations such as 20 people in a bar? Sometimes. SAS is a diverse community and so the frequency distribution I've used here might not be representative of more homogenous populations. If you record the initials of individuals in a pub in Ireland, the distributions of initials is likely to be different. (More O'Briens and McDowell's, perhaps?)

Nevertheless, I expect these estimates to apply to many situations. So next time you're stuck in a meeting, conduct a little experiment. Does anyone in the room have the same initials?