
February 10, 2011
 
Because of this week's story about a geostatistician, Mohan Srivastava, who figured out how to predict winning tickets in a scratch-off lottery, I've been thinking about scratch-off games. He discovered how to predict winners when he began to "wonder how they make these [games]."

Each ticket has a set of "lucky numbers." A ticket is a winner if the lucky numbers align on a tic-tac-toe grid along any row, column, or diagonal. The scratch-off portion of the game involves scratching an area on the ticket to reveal the lucky numbers. See my previous blog article for further details.

Srivastava's observation was that the "lucky numbers" on the tic-tac-toe grid have to occur less frequently than the other numbers. Otherwise, the game will pay out more money than it takes in.

Simplifying the Game for Analysis

How might someone design a tic-tac-toe scratch-off game? I'll simplify the problem by eliminating the actual lucky numbers, the scratch-off portion of the ticket, and all but one 3x3 grid. Represent each cell that contains a lucky number by a 1 and each remaining cell by a 0. In this scheme, a card is a winner if it has a row, column, or diagonal of all 1s.

The game costs $3 to play, and pays out the following amounts:

  • If the first row contains all 1s, the payout is $5.
  • If the second row contains all 1s, the payout is $10.
  • If the third row contains all 1s, the payout is $100.
  • If the first column contains all 1s, the payout is $3 (that is, you get your money back).
  • If the second column contains all 1s, the payout is $20.
  • If the third column contains all 1s, the payout is $100.
  • If either of the diagonals contains all 1s, the payout is $250.

What If Lucky Numbers Are Assigned with Equal Probability?

What would happen if the 0s and 1s are assigned to each cell with equal probability? For example, suppose that you specify a 15% chance that any individual cell is assigned a 1. What is the expected gain or loss for each ticket, and what percentage of tickets will be winners?

The answer is provided by the following simulation and graph. (It is possible to get an exact solution, but I'm feeling lazy.) The simulation constructs a large number of randomly assigned grids. (Each grid is represented by a row of the x matrix.) Each cell of a grid has a certain probability of being a 1. The TestWinner module computes the gain or loss for each grid by checking whether any row, column, or diagonal contains all 1s. The simulation is repeated for a range of probability values, from 10% to 30%.

proc iml;
/** Given N grids (represented by rows of x), 
    determine the payout for each grid. **/ 
start TestWinner(x);
   w = j(nrow(x),1, -3); /** gain: initialize to -3 **/
   do i = 1 to nrow(x);
      if all(x[i,1:3])     then w[i] = w[i]+5;
      if all(x[i,4:6])     then w[i] = w[i]+10;
      if all(x[i,7:9])     then w[i] = w[i]+100;
      if all(x[i,{1 4 7}]) then w[i] = w[i]+3;
      if all(x[i,{2 5 8}]) then w[i] = w[i]+20;
      if all(x[i,{3 6 9}]) then w[i] = w[i]+100;
      if all(x[i,{1 5 9}]) then w[i] = w[i]+250;
      if all(x[i,{3 5 7}]) then w[i] = w[i]+250;
   end;
   return(w);
finish;

call randseed(54321);
NSim = 1E5;
x = j(NSim,9); /** each row of x is a grid **/

p = do(0.1, 0.3, 0.01);
pctWin = j(ncol(p),1);
ExpWin = pctWin;
do i = 1 to ncol(p);
   call randgen(x, "BERNOULLI", p[i]);
   win = TestWinner(x);
   ExpWin[i] = win[:];
   PctWin[i] = (win>=0)[:];
end;

The graph shows the expected gain or loss when cells in the tic-tac-toe grid are randomly assigned according to a fixed probability value. It indicates that the lottery corporation (that is, the government) cannot make a profit on games for which the chance of a cell being a 1 is more than about 15%. However, profitable games that are constructed like this have very few winners. For example, when the chance of a 1 is 15%, only 2.5% of players get a winning tic-tac-toe combination.
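As an aside, the "exact solution" that I mentioned is easy for the expected gain: by linearity of expectation, each of the eight winning lines contributes its payout times p³, no matter how the lines overlap. A minimal sketch, continuing in the same PROC IML session (the variable names here are mine):

/** exact expected gain, assuming each cell is 1 independently
    with probability p (the same assumption as the simulation) **/
payouts = {5 10 100 3 20 100 250 250}; /** the eight winning lines **/
prob = do(0.1, 0.3, 0.01);
ExpGain = sum(payouts) * prob##3 - 3;  /** 738*p^3 - 3 dollars **/
print (prob`)[label="p"] (ExpGain`)[label="ExpGain"];

Setting 738p³ = 3 gives a break-even probability of about 0.16, which agrees with the graph.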

This is a problem for the lottery corporation, because people don't play games that they usually lose. The key to getting people to play a scratch-off game (and slot machines, and other gambling games) is to award many small prizes, but few large prizes. If the 0s and 1s are assigned to each cell with equal probability, then the large cash awards have the same chance of occurring as the smaller awards. The chance of winning has to be set very small (too small!) in order to manage the losses.

This hypothetical analysis shows that it is a bad idea to generate a tic-tac-toe game by using equal probabilities. Instead, cells that contribute to the large cash awards (the diagonals) must be assigned a small probability of getting the "lucky numbers," whereas cells that contribute to small cash awards (the first row and column) can be assigned larger probabilities.

By controlling the probability that each cell contains a lucky number, the lottery corporation can control the chance of winning each prize. It can guarantee a profit for the game, while awarding many small prizes that keep customers coming back to play again.

Srivastava recognized this fact. In the Wired article he is quoted as saying:

It would be really nice if the computer could just spit out random digits. But that’s not possible, since the lottery corporation needs to control the number of winning tickets.

Distribution of the Lucky Numbers

As a statistical footnote to this analysis, you can construct the frequency distribution of the number of 1s in a tic-tac-toe grid when each cell is independently assigned a 1 with a fixed probability. The count of 1s follows a binomial distribution. For example, if the probability that a cell contains a 1 is 15%, the following SAS/IML statements compute the frequency distribution of 1s in the grid:

/** distribution of 1s in a 3x3 grid in which 
    each cell has a 15% chance of a 1 **/
k = 0:9;
pdf = pdf("Binomial", k, 0.15, 9);

The scatter plot shows the probability that a tic-tac-toe grid will have a given number, k, of 1s for k=0, 1, 2,...,9, when there is a 15% chance that a given cell contains a 1. The graph shows that most grids (86%) that are constructed under this scheme will have zero, one, or two 1s. Obviously, none of these grids can be a winning ticket.
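That 86% figure is easy to verify by summing the first three probabilities. A quick check, continuing the statements above:

/** probability of zero, one, or two 1s when p = 0.15 **/
pLow = sum( pdf[1:3] ); /** pdf[1] is the k=0 term **/
print pLow;             /** approximately 0.86 **/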

Once again, the conclusion is that a game constructed by using this scheme would not be fun to play: the "lucky numbers" don't appear often enough! To increase the frequency of lucky numbers while still managing the payout risk, you need to use a construction method that assigns different probabilities to the cells of the grid.

I'll describe that approach tomorrow.

February 9, 2011
 
The other day, someone asked me how to compute a matrix of pairwise differences for a vector of values. The person asking the question was using SQL to do the computation for 2,000 data points, and it was taking many hours to compute the pairwise differences. He asked if SAS/IML could solve the problem faster.

The result? My program ran in a fraction of a second.

Now, to be fair, I'm sure that the SQL code is not as efficient as it could be. It has been my experience that a well-written SQL or DATA step program usually performs as well as (or better than) comparable SAS/IML code. I'm not an SQL programmer, so I couldn't offer advice on how to improve his SQL code. But since I solved the problem by using PROC IML, I'll discuss the solution, along with sharing some thoughts about efficient SAS/IML programming.

An Example Problem: Comparing Gas Prices at Different Stations

The following DATA step creates a data set that contains gas prices (in cents per gallon) at seven gas stations in Raleigh, NC, on a certain day:

/** data from www.gaspricewatch.com, 05FEB2011 **/
data GasPrices;
length Station $6;
input Price DOLLAR5.2 Station $;
Price = Price * 100; /** convert to cents **/
datalines;
$2.90 Citgo
$2.91 Exxon1
$2.99 Shell1
$3.00 Shell2
$3.03 BP    
$3.05 Exxon2
$3.09 Texaco
;
run;

Suppose you are interested in the differences between pairs of prices. That is, you want to create the matrix, D, whose ijth entry is the difference between the price at the ith station and the price at the jth station. Clearly, D is an anti-symmetric matrix with zeros on the diagonal. (An anti-symmetric matrix is also called skew-symmetric.) You can compute D with the following SAS/IML statements:

proc iml;
use GasPrices;
read all var {Station Price};
close GasPrices;

/** compute matrix of differences **/
/** 1. use matrices **/
start FullMatDiff(x); /** x is a column vector **/
   n = nrow(x);   
   m = shape(x, n, n); /** duplicate data **/
   return( m` - m );
finish;

D = FullMatDiff(Price);
print D[c=Station r=Station];

D       Citgo  Exxon1  Shell1  Shell2      BP  Exxon2  Texaco
Citgo       0      -1      -9     -10     -13     -15     -19
Exxon1      1       0      -8      -9     -12     -14     -18
Shell1      9       8       0      -1      -4      -6     -10
Shell2     10       9       1       0      -3      -5      -9
BP         13      12       4       3       0      -2      -6
Exxon2     15      14       6       5       2       0      -4
Texaco     19      18      10       9       6       4       0

Because the data are ordered by price, the pairwise differences increase as you scan down any column of D.

Analyzing Efficiency

In the SAS/IML language, as with many programming languages, there is more than one way to compute the same quantity. Consequently, I am always asking myself whether my first solution is the most efficient solution.

From a computational point of view, the FullMatDiff module is efficient. Notice that it uses the SHAPE function to create a matrix whose ith column contains n copies of the ith price. If you are not familiar with the SHAPE function, you might unnecessarily write a loop.

The main computation is a single matrix expression, m`-m, which computes all n² pairs of differences.

However, when I look at the module critically, I notice that it requires more memory than is necessary. The US Census Bureau estimates that there were 117,000 gas stations in the US in 2008, so I can imagine running this module on many tens of thousands of prices. This could be a problem because each matrix requires memory for n² entries, and there are three matrices that need to be in RAM simultaneously: m, the transpose of m, and the matrix of differences.
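To make that concrete, with n = 117,000 a single n x n matrix of 8-byte doubles would require about 8 × 117,000² ≈ 1.1 × 10¹¹ bytes, which is roughly 100 GB for each matrix.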

It is not necessary to explicitly compute either m or the transpose. If I'm willing to write a loop, I can compute the pairwise differences while allocating only a single n x n matrix:

/** 2. compute each row directly **/
start FullRowDiff(x);
   n = nrow(x);   
   diff = j(n, n, .);
   xt = T(x); 
   do i = 1 to n;
      diff[i,] = xt[i] - xt;
   end;
   return( diff );
finish;

Another thought that occurred to me is that the matrix of differences is anti-symmetric, so it is not necessary to compute all of the entries but only the "n choose 2" entries above the diagonal (or, equivalently, below the diagonal). The SAS/IML module for computing the upper-triangular entries is similar to the previous module:

/** 3. compute only upper triangular differences 
       [COMB(n,2) elements] **/
start TriDiff(x);
   n = nrow(x);   
   diff = j(n, n, .);
   xt = T(x); 
   do i = 1 to n-1;
      cols = i+1:n;
      diff[i,cols] = x[i] - xt[,cols];
   end;
   return( diff );
finish;

Comparing the Performance of the Modules

I wondered whether one algorithm would run faster than the others, so I timed the performance on data with a wide range of sizes. It turns out that the time required to compute the pairwise differences is about the same for the three algorithms, regardless of the size of data. Furthermore, even for 10,000 prices, each algorithm executes in seconds. For more data than that, my computer cannot allocate sufficient memory.
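The post doesn't show my timing code, but here is a minimal sketch of one way to run the comparison in the same PROC IML session (the sizes and seed are my own choices; the three modules are as defined above):

/** time the three modules on random data of increasing size **/
sizes = {1000, 2000, 5000, 10000};
results = j(nrow(sizes), 3);
call randseed(1);
do k = 1 to nrow(sizes);
   z = j(sizes[k], 1);
   call randgen(z, "uniform");
   t0 = time(); D = FullMatDiff(z); results[k,1] = time() - t0;
   t0 = time(); D = FullRowDiff(z); results[k,2] = time() - t0;
   t0 = time(); D = TriDiff(z);     results[k,3] = time() - t0;
end;
print sizes results[colname={"FullMat" "FullRow" "Tri"}];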


I know that some of my readers are SQL programmers, so feel free to post an SQL solution. Also, let me know how long your SQL program requires to compute the pairwise differences of 2,000 and 10,000 random uniform values. Can an SQL solution handle 20,000 prices? How long does it take?

February 8, 2011
 
On Friday, I posted an article about using spatial statistics to detect whether a pattern of points is truly random. That day, one of my colleagues asked me whether there are any practical applications of detecting spatial randomness or non-randomness. "Oh, sure," I replied, and rattled off a list of applications in biology, materials science, manufacturing, and epidemiology.

On Monday, I read in Andrew Gelman's blog about a statistician who, in 2003, figured out how to pick cards in a certain scratch-off game so as to increase the probability of winning. When I followed the link to the original story in Wired magazine, I was astonished to discover that the statistician, Mohan Srivastava, is a geostatistician and that his technique uses spatial statistics that are similar to the ideas that I laid out in my blog post!

The basic idea, which is illustrated below and described halfway through the article, is to look at the distribution of numbers on the ticket and use a frequency analysis to determine which tickets have layouts that are less random than is expected by chance. In the example shown in the article, the ticket has infrequent numbers (numbers with a frequency of 1) in a winning tic-tac-toe configuration. (This is circled in the image below.) Such a configuration is unlikely to happen by chance alone, so you should buy the ticket. Srivastava experimentally showed that the ticket is a winner about 90% of the time.

So, add "picking scratch-off tickets" to the list of applications of spatial statistics.

February 4, 2011
 
Last week I generated two kinds of random point patterns: one from the uniform distribution on a two-dimensional rectangle, the other by jittering a regular grid by a small amount.

My show choir director liked the second method (jittering) better because of the way it looks on stage: there are no gaps and no two performers are too close to each other. The statistician in me wants to know what statistical properties make the jittered pattern more desirable than the other. Said differently, can I do any of the following?

  • Compute a statistical measure that characterizes each kind of point pattern?
  • Use statistics to predict whether a given pattern would be acceptable to her?

As part of my plan to learn new statistics, I've been reading Statistical Analysis and Modelling of Spatial Point Patterns by J. Illian, A. Penttinen, H. Stoyan, and D. Stoyan. The spatial statistics in this blog post come from their book. This post has four parts:

  1. I describe a simple test that indicates whether a point pattern is likely to have been generated from a random uniform distribution.
  2. I implement the test in the SAS/IML language.
  3. I simulate data to show what the test means and how it works.
  4. I return to the above questions and show that the answer to both questions is "yes."

Complete Spatial Randomness --- A Test

Not all random configurations are created equal. A jittered pattern is random, but a pattern that is generated from a uniform distribution on a 2D rectangle exhibits complete spatial randomness (CSR).

There are several ways to test for CSR. Probably the easiest method is called the method of quadrats. The method consists of the following steps:

  1. Divide the rectangular domain into k disjoint cells (bins) of equal area and count the number of points, cᵢ, that fall into each cell. You can use the Bin2D module for this step.
  2. If the points were generated from a distribution with a uniform density, you would expect e = N/k points in each bin, where N is the total number of points. Subtract that value from the actual bin count to obtain the residual count for each bin.
  3. Just as for two-way frequency tables, you can compute a normalized sum of the squared residuals and use that measure to determine whether the point pattern was likely to have come from a distribution with a uniform density. The computation is just the usual Pearson chi-square statistic: Σ (cᵢ - e)² / e.
The theory says that if a point pattern is generated from a uniform distribution, the Pearson statistic has an approximate chi-square distribution with k-1 degrees of freedom, provided that k > 6 and e > 1.

Ugh! It takes longer to explain the algorithm in words than it does to implement it in the SAS/IML language! It's time for a concrete example.

SAS/IML Code to Test for Complete Spatial Randomness

The following SAS/IML statements generate 20 points from a random uniform distribution, define a 3x3 grid on the unit square, and call the Bin2D module to count how many points are in each cell of the grid:

proc iml;
N = 20;
pUnif = j(N,2);
call randseed(1234);
call randgen(pUnif, "uniform"); /** generate random uniform **/

/** Define regular grid: 0, 0.33, 0.67, 1 **/
XMin = 0; XMax = 1; Nx = 3; dx = 1/Nx; 
YMin = 0; YMax = 1; Ny = 3; dy = 1/Ny;
XDiv = do(XMin, XMax, dx); 
YDiv = do(YMin, YMax, dy);

/** count the points in each cell **/
load module=Bin2D;
countUnif = Bin2D(pUnif, XDiv, YDiv); 

For this example, there are 20 points and 9 cells, so you should expect 2.22 points in each cell. The following statements compute the residuals (observed counts minus the expected value) for each cell and compute the Pearson statistic, which in spatial statistics is also known as the index of dispersion:

c = countUnif;
NBins = (ncol(XDiv)-1) * (ncol(YDiv)-1);
e = N/NBins;  /** expected value **/
resid = (c - e);
d = sum( resid##2 ) / e; /** index of dispersion **/
print resid;

resid
 1.7777778  -1.222222  0.7777778
 0.7777778  -2.222222  -1.222222
 -0.222222  -1.222222  2.7777778

When the points come from a uniform random distribution, d has a 90% chance of being within the interval [L, R], where L and R are the following quantiles of the chi-square distribution:

L = quantile("chisquare", 0.05, NBins-1);
R = quantile("chisquare", 0.95, NBins-1);
print d L R;

        d          L          R
      9.7  2.7326368  15.507313

What Does It Mean? Time to Simulate

My brain is starting to hurt. What does this all mean in practice?

What it means is that I can do the following experiment: simulate 1,000 point patterns from the random uniform distribution and compute the corresponding 1,000 Pearson statistics. If I plot a histogram of the 1,000 Pearson statistics, I should see that the histogram looks similar to a chi-square distribution with 8 degrees of freedom and that roughly 100 (10%) of the Pearson statistics fall outside of the interval [L, R].
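Here is a minimal sketch of that experiment, reusing N, the Bin2D module, XDiv, YDiv, e, L, and R from the statements above (the loop structure is mine; only the results appear in the post):

/** simulate 1,000 uniform point patterns and compute the
    Pearson statistic for each one **/
NRep = 1000;
d = j(NRep, 1);
do r = 1 to NRep;
   pts = j(N, 2);
   call randgen(pts, "uniform");
   c = Bin2D(pts, XDiv, YDiv);
   d[r] = sum( (c - e)##2 ) / e;
end;
PctOutside = ( (d < L) | (d > R) )[:]; /** should be near 0.10 **/
print PctOutside;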

The adjacent image shows the results of the experiment, which I ran in SAS/IML Studio. The curve is a chi-square density with 8 degrees of freedom. The vertical lines are at the values L and R.

For this experiment, 71 point patterns have Pearson statistics that are outside of the interval. This is not quite 10%, but it is close. Repeating the experiment with different numbers of points (for example, N=30 or 50) and different grids (for example, a 2x3 or 4x3) results in similar histograms that also have roughly 10% of their values outside of the corresponding confidence interval.

The little scatter plots underneath give an example of a point pattern with the indicated Pearson statistic. The middle scatter plot (d=9.7) is the plot of the random points generated in the previous section. It is a "typical" point pattern from the random uniform distribution; most of the point patterns look like this one. You can see that there is a lot of variation in the cell counts for this configuration: one of the bins contains zero points, another contains five.

The scatter plot on the left (d=0.7) is not a typical point pattern from the random uniform distribution. All of the cells in this pattern have two or three points. It is a little counterintuitive, but it is rare for a truly random uniform point pattern to have points that are evenly spaced in the unit square.

The scatter plot on the right (d=27.7) shows the other extreme: more variation than is expected. Five of the cells in this pattern have zero or one points. A single cell contains almost half of the points!

What about the Jittered Patterns?

The previous graph implies that the arrangements of points that are randomly generated from a uniform distribution are rarely evenly spaced. The usual case exhibits gaps and clusters.

If you instead generate 1,000 jittered patterns, as in my previous post, the distribution of the Pearson statistics looks like the following image. The distribution of Pearson's statistic does not follow a chi-square distribution for jittered patterns. Most of the values of Pearson's statistic are small, which makes sense because the statistic measures the departure from the average density, and the configurations are small perturbations of evenly spaced points.

It's time to revisit the questions posed at the beginning of this post.

Can I compute a statistical measure that characterizes each kind of point pattern? Yes. The Pearson statistic tends to be small for jittered point patterns and tends to be larger for random uniform configurations. However, if the Pearson statistic for an example pattern is, say, between 3 and 6, I can't predict whether the points were generated by using the jittering algorithm or whether they were generated from a random uniform distribution.

Can I use statistics to predict whether a given pattern might be acceptable to my director? Yes. Regardless of how the configuration was generated, a pattern that has a small value of the Pearson statistic (less than 3 for this example) does not exhibit gaps or clusters, and therefore will be acceptable to my director.

Although statistics can be used to check the appropriateness of certain blocking configurations, I don’t think that my director will be replaced by a SAS/IML program anytime soon: she also plays the piano!

February 3, 2011
 
So, if you were reading last week, we talked about how to structure your data for a mixed models repeated measures analysis. And as my friend Rick pointed out, there's more than one way to go about restructuring your data (if you ask real nice, he'll also do it in PROC IML, the rock star of programming languages). Then we played with a data set in which the dependent measurements were not ordered over time. In fact, they weren't even the same variable.
The Scene:
In order to increase the amount of money customers deposit in three different account types, a bank designs a factorial experiment with two factors: promotion (Gift or Discount) and minimum balance ($75, $125, or $225). Offers are sent to existing customers, and their deposits to the three account types (savings, checking, and investment) are recorded.
The Classical Approach: MANOVA
Multiple continuous variables observed on the same subject is a textbook-perfect scenario for multivariate analysis of variance (MANOVA). MANOVA takes advantage of the correlation among responses within a subject and constructs a matrix of sums of squares and sums of cross-products (SSCP) to compare between- and within-group variability while accounting for correlation among the dependent variables within a subject and unequal variances across the dependent variables.
proc glm data = blog.promoexperiment;
class promotion minbal;
model savbal checkbal investamt= promotion|minbal ;
manova h=_all_;
run;

The data set, as we discussed last week, looks like this:

[Screenshot of the promoexperiment data set]
With one row per customer, one column per dependent variable.
Just like multivariate repeated measures analysis (which is really just MANOVA with some fancy contrasts pre-cooked), a little missing data goes a long way to killing your sample size and therefore your statistical power. Furthermore, working with covariates can be tricky with repeated measures MANOVA. The MANOVA SSCP matrices require estimation of many parameters, which can also eat up your power. There are four multivariate test statistics, which can complicate matters if you are not certain which one is best for you to use.
The Modern Approach: Mixed Models
It turns out that it is really easy to fit an equivalent—but not identical—model in the MIXED procedure.
proc mixed data = blog.promouni;
class promotion minbal;
model value1= promotion|minbal/noint ;
repeated /subject = subject type=un;
run;

The data set looks like this:
[Screenshot of the promouni data set]
One row per observation (a dependent variable within a customer).
More, and Different:
If all we were doing was reproducing MANOVA results with PROC MIXED, I would not be writing this blog. We can do more. Instead of just accommodating unequal variances and covariances within a subject, the mixed models approach directly models the covariance structure of the multiple dependent variables. What's more, you can also simplify the structure, buying you more power and making the interpretation of your model easier. For example, you might suspect that the variances are equal and that the covariances between pairs of dependent variables are equal across the three dependent variables.
proc mixed data = blog.promouni;
class promotion minbal;
model value1= promotion|minbal/noint ;
repeated /subject = subject type=cs;
run;

The fit statistics in the mixed model enable model comparison. Since the mean model is identical in both cases, fit statistics based on REML are appropriate.
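One way to capture those statistics for a side-by-side comparison is with ODS OUTPUT. This sketch refits the two models above; FitStatistics is the name of the ODS table that PROC MIXED uses for the "Fit Statistics" output:

ods output FitStatistics=fit_un;
proc mixed data = blog.promouni;
class promotion minbal;
model value1= promotion|minbal/noint ;
repeated /subject = subject type=un;
run;

ods output FitStatistics=fit_cs;
proc mixed data = blog.promouni;
class promotion minbal;
model value1= promotion|minbal/noint ;
repeated /subject = subject type=cs;
run;

proc print data=fit_un noobs; run;
proc print data=fit_cs noobs; run;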
[Table of REML fit statistics comparing the TYPE=UN and TYPE=CS models]
Along with changing the covariance structure, there are the other advantages that tag along with using a mixed model: more efficient handling of missing data, easy to handle covariates, multiple levels of nesting is easy to accommodate (measurements within subjects within sales territories within your wildest imaginings), a time component is easy to model, heterogeneous groups models, to name a few.
Variation on a Theme: Mixture of Distributions in PROC GLIMMIX
Few days go by that I don't use the GLIMMIX procedure, and as it happens, there's a trick in PROC GLIMMIX that makes these types of models even more flexible. Starting in SAS 9.2, you can model a mixture of distributions from the exponential family, such as one gamma and two normal responses. If my data looked like this:
[Screenshot of the promouni data set with an added column, DISTRIB, that names the distribution for each response]
(Notice the column with the distribution name for each variable) then I could fit the model as follows:
proc glimmix data = blog.promouni;
class promotion minbal;
model value1= promotion|minbal/noint dist=byobs(distrib);
random intercept /subject = subject;
run;

Or like this, instead:
proc glimmix data = blog.promouni;
class promotion minbal;
model value1= promotion|minbal/noint dist=byobs(distrib);
random _residual_ /subject = subject type=un;
run;

Those two models are not equivalent, and they both use pseudo-likelihood estimation, so you will probably only use this kind of model in circumstances where nothing else will do the job. Still, it's quite a bit more than could be done even a couple of years ago.
I know I’m keeping you hanging on for that punchline. So here you are (with my deepest apologies)…
Three correlated responses walk into a bar.
One asks for a pilsner. The second asks for an ale.
The third one tells the bartender, “I’m just not feeling normal today. Better gamma something mixed.”


(edited to fix the automatic underlining in html in the SAS code-- it should be correctly specified now)
February 2, 2011
 
One of my New Year's resolutions is to learn a new area of statistics. I'm off to a good start, because I recently investigated an issue which started me thinking about spatial statistics—a branch of statistics that I have never formally studied.

During the investigation, I asked myself: Given a configuration of points in a rectangle, how can I compute a statistic that indicates whether the configuration is uniformly spread out in the region?

One way to investigate this question is to look at the density of points in the configuration. Are the points spread out or clumped together?

Binning the Points

In the one-dimensional case, you can use a histogram to visually examine the density of univariate data. If the bars are all roughly the same height, then you have visual evidence that the data are approximately uniformly distributed.

Of course, you don't really need the histogram: the counts in each bin suffice. You can divide the range of the data into N equally spaced subintervals and write SAS/IML statements that count the number of points that fall into each bin.
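For example, a minimal one-dimensional version might look like the following module (the name and argument conventions are mine):

proc iml;
/** count the points in each bin defined by an increasing
    row vector of cutpoints; x is a column vector of data **/
start Bin1D(x, cutpoints);
   k = ncol(cutpoints) - 1;
   count = j(1, k, 0);
   do j = 1 to k;
      count[j] = sum( (x >= cutpoints[j]) & (x < cutpoints[j+1]) );
   end;
   return( count );
finish;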

It is only slightly harder to count the number of points that fall into the bins of a two-dimensional histogram or, equivalently, into the cells of a two-dimensional grid:

  1. For each row in the grid...
  2. Extract the points whose vertical coordinates indicate that they are in this row. (Use the LOC function for this step!)
  3. For these extracted points, apply the one-dimensional algorithm: count the number of points in each horizontal subdivision (column in the grid).
The following SAS/IML module uses this algorithm to count the number of points in each cell. Notice that the algorithm loops over the subdivisions of the grid, rather than over the observations.

proc iml;
/** Count the number of points in each cell of a 2D grid.
    Input: p    - a kx2 matrix of 2D points
           XDiv - a subdivision of the X coordinate direction
           YDiv - a subdivision of the Y coordinate direction
    Output: The number of points in p that are contained in 
            each cell of the grid defined by XDiv and YDiv. **/
start Bin2D(p, XDiv, YDiv);
   x = p[,1];       y = p[,2];
   Nx = ncol(XDiv); Ny = ncol(YDiv);
   
   /** Loop over the cells. Do not loop over observations! **/
   count = j(Ny-1, Nx-1, 0); /** initialize counts for each cell **/
   do i = 1 to Ny-1; /** for each row in grid **/
      idx = loc((y >= YDiv[i]) & (y < YDiv[i+1]));
      if ncol(idx) > 0 then do;
         t = x[idx]; /** extract point in this row (if any) **/
         do j = 1 to Nx-1; /** count points in each column **/
            count[i,j] = sum((t >= XDiv[j]) & (t < XDiv[j+1]));
         end;
      end;
   end;
   return( count );
finish;

Testing the Binning Module

A convenient way to test the module is to generate 100 random uniform points in a rectangle and print out the counts. The following example counts the number of points in each of 20 bins. There are five bins in each horizontal row and four bins in each vertical column.

call randseed(1234);
p = j(100,2);
call randgen(p, "uniform"); /** 100 random uniform points **/

/** define bins: 5 columns and 4 rows on [0,1]x[0,1] **/
XMin = 0; XMax = 1; dx = 1/5;
YMin = 0; YMax = 1; dy = 1/4;
XDiv = do(XMin, XMax, dx);
YDiv = do(YMin, YMax, dy);

count = Bin2D(p, XDiv, YDiv);

A visualization of the random points and the grid is shown in the scatter plot. From the graph, you can quickly estimate that each bin contains between 1 and 10 points. The exact counts are contained in the count matrix.

Notice that the Y axis of the scatter plot has large values of Y at the top of the image and small values at the bottom of the image. In order to compare the values of the count matrix with the scatter plot, it is helpful to reverse the order of the rows in count:


/** reverse rows to conform to scatter plot **/
count = count[nrow(count):1,];
c = char(XDiv, 3); /** form column labels **/
r = char(YDiv[ncol(YDiv)-1:1], 4); /** row labels (reversed) **/
print count[colname=c rowname=r];

count       0    0.2    0.4    0.6    0.8
0.75        2      3      6      7      7
0.5         5      9      7      4      1
0.25        9      2      7      2      7
0           7      5      2      2      6

You can use these bin counts to investigate the density of the point pattern. I'll present that idea on Friday.

January 28, 2011
 
I sing in the SAS-sponsored VocalMotion show choir. It's like an adult version of Glee, except we have more pregnancies and fewer slushie attacks.

For many musical numbers, the choreographer arranges the 20 performers on stage in an orderly manner, such as four rows of five singers. But every once in a while the director wants to create a disorganized look for a portion of a song (for example, a free-form '60s segment). She tells the performers to "go to random positions."

The choir always has problems achieving the look that the director wants. We end up in positions that are equally spaced from one another, but to the audience it often looks like we tried to arrange ourselves in rows, but failed miserably!

Since I know what a random point pattern looks like, I try to help. I whisper "clump more" and "vary the sizes of the gaps between performers." I tell my fellow performers, "random uniform does not imply uniformly spaced," but mostly they just ignore me or roll their eyes.

This year I took it upon myself to solve the problem, and I discovered that it led to some interesting statistics.

The Experiment

My first attempt to help the director was to generate many examples of random point patterns. Our stage is approximated by the rectangle S = [0, 2] x [0, 1]. It is easy to generate 20 points that are uniformly distributed in S. My idea was that the director would have 10–12 of these patterns. When she wants a random configuration, she can hold up a pattern and say, "go to this configuration." I sent the director the patterns on the adjacent image.
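Generating one such pattern takes only a few statements. A minimal sketch (the seed is arbitrary):

proc iml;
/** 20 random uniform points on the stage S = [0,2] x [0,1] **/
call randseed(123);
p = j(20, 2);
call randgen(p, "uniform"); /** uniform on [0,1] x [0,1] **/
p[,1] = 2 * p[,1];          /** stretch x coordinates to [0,2] **/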

I was quite proud of myself. But guess what?

She didn't like them.


Random Uniform versus Evenly Spaced

The director has been telling us to go to random positions, so how can she not like the random configurations that I sent her? She explained that there are two characteristics that she doesn't like:

  1. Large areas of the stage are devoid of performers, which I translate to mean "the point patterns have too many gaps."
  2. Some of the performers are too close to one another, which I translate to mean "the point patterns have too many clusters."
From her comments, I conclude that I have been wrong all these years. She does not actually want the performers in random uniform positions on stage! The characteristics that she mentions are intrinsic to random uniform point patterns, so if she doesn't want those characteristics, she doesn't want uniform random.

So what does she want?

A Perturbation Algorithm

It was time for Plan B. I talked to the director again and heard her say that she wants the performers in a configuration that is close to a regular grid, but not too close. It has to "look random." To me, this suggests a perturbation approach to the problem: start by arranging the performers on a regular grid, but then perturb that arrangement by a small amount.

I thought about how to physically do this on stage and came up with the following algorithm:

  1. Arrange the performers on a regular grid, such as four rows of five performers.
  2. Ask the performers to turn in place for 5–10 seconds. Each performer chooses his or her own speed and direction. When the director calls "stop turning," each performer should be pointing in a random direction.
  3. Each performer takes a step forward in whatever direction he or she is facing.
This algorithm results in a perturbation of a regular grid in which each point is perturbed (jittered) by a small amount in an arbitrary direction. But does this perturbation give the look that my director wants? There's only one way to find out: implement the algorithm and look at some pictures.

Random Jittering of a Regular Grid

Forming a regular grid with SAS/IML software is no problem: I can use the Define2DGrid module, which I have previously described. It is also easy to use the RANDSEED and RANDGEN subroutines to generate angles that are distributed uniformly in [0, 2 pi]. The following SAS/IML module perturbs each position on a regular grid:

proc iml;
/** Module to define ordered pairs that are a random jittering
    of a uniform grid of points.
    The first six input parameters are the same as for Define2DGrid.
    The last parameter, delta, is the size of the jitter.

    Output: Ordered pairs are returned in a matrix with
         (Nx x Ny) rows and two columns.
**/
start GridJitter(XMin, XMax, Nx, YMin, YMax, Ny, delta);
   /** initial placement (grid) **/
   Lx = XMax - XMin; Ly = YMax - YMin;
   dx = Lx / Nx;     dy = Ly / Ny;
   load module=Define2DGrid;
   xy = Define2DGrid(dx/2, Lx-dx/2,Nx, dy/2, Ly-dy/2,Ny);
   x0 = xy[,1];  y0 = xy[,2];

   /** generate random directions in [0, 2 pi] **/
   NPts = Nx * Ny;
   pi = constant('PI');
   angle = j(NPts, 1);
   call randgen(angle, "uniform"); /** uniform in [0,1] **/
   angle = 2 * pi * angle;

   /** each person takes on step in some direction **/
   x = x0 + delta * cos(angle);
   y = y0 + delta * sin(angle);

   /** don't let anyone fall off the stage! **/
   x = choose(x<0, 0, choose(x>Lx, Lx, x));
   y = choose(y<0, 0, choose(y>Ly, Ly, y));
   return ( x || y );
finish;

call randseed(654321);
r = 0.1; /** size of a small step **/
/** jitter 4x5 grid on the "stage" [0,2]x[0,1] **/
Jitter = GridJitter(0,2,5, 0,1,4, r); 

The results of several runs of this algorithm are shown in the following image. I showed the point patterns to the director. "Yes!" she exclaimed. "These are the kinds of configurations that I'm looking for."


Conclusions and Next Steps

One conclusion is obvious. In the initial stages of this project, I violated the Fundamental Rule of Consulting, which is "listen to a client in order to truly understand the problem." I assumed that my notion of "random placement" was the same as my director's, but it wasn't.

I am now excited to begin the next step of this process: a statistical analysis of point patterns.

Regular readers of my blog will recall that one of my New Year's resolutions is to learn something about an area of statistics that is new to me. I've already read a few articles and book chapters that will undoubtedly spark future blog posts about spatial point patterns.

For example, my director emphasized two features of point patterns: clustering and distance between adjacent points. In future blog posts, I will analyze statistics that are associated with these characteristics.

But—shhh!—don't tell the choir that I'm computing these statistics. I'd prefer not to get doused with a slushie.

January 26, 2011
 
Next week's blog entry will build on this one, so I want you to take notes, OK?

It's not headline news that in most cases, the best way to handle a repeated measures analysis is with a mixed models approach, especially for normal responses (for other distributions in the exponential family, GLIMMIX also has some advantages over GEEs, but that's another blog post for another day). You have more flexibility in modeling the variances and covariances among the time points, data do not need to be balanced, and a small amount of missingness doesn't spell the end of your statistical power.

But sometimes the data you have are living in the past: arranged as if you were going to use a multivariate repeated measures analysis. This multivariate data structure arranges the data with one row per subject, each time-based response a column in the data set. This enables the GLM procedure to set up H matrices for the model effects, and E and T matrices for testing hypotheses about those effects. It also means that if a subject has any missing time points, you lose the entire subject's data. I've worked on many repeated measures studies in my pre-SAS years, and I tell you, I've been on the phones, email, snail mail, and even knocked on doors to try to bring subjects back to the lab for follow-ups. I mourned over every dropout. To be able to use at least the observations you have for a subject before dropout would be consolation to a weary researcher's broken heart.

Enter the mixed models approach to repeated measures. But, your data need to be restructured before you can use MIXED for repeated measures analysis. This is, coincidentally, the same data structure you would use for a univariate repeated measures, like in the old-olden days of PROC ANOVA with hand-made error terms (well, almost hand-made). Remember those? The good old days. But I digress.

The MIXED and GLIMMIX procedures require the data to be in the univariate structure, with one row per measurement. Notice that these procedures still use complete case analysis (CCA), but now the "case" is different. Instead of a subject, which in the context of a mixed model can be many things at once (a person, a clinic, a network...), the "case" is one measurement occurrence.

How do you put your wide (multivariate) data into the long (univariate) structure? Well, there are a number of ways, and to some extent it depends on how you have organized your data. If the multivariate response variable names share a prefix, then this macro will convert your data easily.

What if you want to go back to the wide structure (for example, to create graphs to profile subjects over time)? There's a macro for that as well.

What if your variables do not share a prefix, but instead have different names (such as SavBal, CheckBal, and InvestAmt)? Then you will need an alternative strategy. For example:

[Screenshot of the wide data set: SavBal, CheckBal, and InvestAmt, with no subject identifier]

This needs some rearrangement, but there are two issues. First, there is no subject identifier, and I will want this in next week's blog when I fit a mixed model. Second, the dependent variables are not named with a common prefix. In fact, they aren't even measured over time! They are three variables measured for one person at a given time. (I'll explain why in next week's blog).
So, my preference is to use arrays to handle this:
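(The post displays the DATA step as an image. The following is a minimal sketch of the kind of step it describes: the data set and variable names follow the post, but the VNAME trick and the use of subject = _N_ are my reconstruction.)

data blog.promouni;
   length varname $ 12;
   set blog.promoexperiment;
   subject = _N_;  /* no ID in the wide data, so number the rows */
   array resp{3} SavBal CheckBal InvestAmt;
   do i = 1 to 3;
      varname = vname(resp{i}); /* name of this response variable */
      value1  = resp{i};        /* its value for this customer */
      output;                   /* one row per response */
   end;
   drop SavBal CheckBal InvestAmt i;
run;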
Which results in the following:
[The long data set: one row per response per customer, with subject, varname, and value1 columns]
I tip my hat to SAS Tech Support, who provide the %MAKELONG and %MAKEWIDE macros and to Gerhard Svolba, who authored them. If someone wants to turn my arrays into a macro, feel free. I'll tip my hat to you, too.

Tune in next week for the punchline to the joke:
"Three correlated responses walk into a bar..."
January 18, 2011
 
A colleague posted some data on his internal SAS blog about key trends in the US Mobile phone industry, as reported by comScore. He graciously shared the data so that I could create a graph that visualizes the trends.


The plot visualizes trends in the data: the Android phone is gaining market share, primarily at the expense of the Blackberry, while the iPhone is holding steady.

Plotting the essential features of the data requires only two statements in the SGPLOT procedure:

title "Top Smartphone Platforms";
title2 "Total US Smartphone Subscribers, Ages 13+";
footnote "Source: comScore MobiLens service";

proc sgplot data=Mobile;
series x=Date y=Pctval / group=Type break curvelabel;
xaxis interval=Month;
run;

  1. The SERIES statement plots the lines.
    • The GROUP= option groups the lines according to the platform and automatically assigns unique colors for each platform.
    • The BREAK option is used because the data contain a missing value for March of 2010. (The missing data is what makes this plot unusual!) If you omit the BREAK option, each line continues from the February value to the April value without interruption.
    • The CURVELABEL option labels each curve in the plot. Without the option, the SGPLOT procedure would automatically insert a legend.
  2. The XAXIS statement causes the horizontal axis to display tick marks for each month. This statement is optional, but without the statement the plot displays ticks every three months, which I don't like.
A comment that I received about a previous data analysis reminded me that some SAS customers are just now upgrading to SAS 9.2, and might not have much experience with PROC SGPLOT. For me, PROC SGPLOT has become a favorite tool for creating static images with minimal effort. Of course, for dynamically linked graphics from SAS/IML programs, I use SAS/IML Studio.