data analysis

February 25, 2011
 
I was inspired by Chris Hemedinger's blog posts about his daughter's science fair project. Explaining statistics to a pre-teenager can be a humbling experience.

My 11-year-old son likes science. He recently set about trying to measure which of three projectile launchers is the most accurate. I think he wanted to determine their accuracy so that he would know which launcher to use when he wanted to attack his little sister, but I didn't ask. Sometimes it is better not to know.

He and his sister created a bullseye target. They then launched six projectiles from each of the three launchers and recorded the location of the projectile relative to the center of the target. In order to make the experiment repeatable, my kids aligned each launcher with the target and then tied it down so that it would not move during the six shots.

For each launcher, they averaged the six distances to the center. The launcher with the smallest average distance was declared to be The Most Accurate Launcher in the World.

In the accompanying figure, the red crosses represent the winning launcher. The average distance to the center is 5.2 centimeters, whereas the black squares have an average distance of 6.7 cm and the blue circles have an average distance of 13.1 cm. Three of the black squares have the same location and are marked by "(3)."
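If you want to reproduce the arithmetic, a minimal SAS/IML sketch is shown below. The coordinates and group labels are made up for illustration; only the computation (the mean Euclidean distance from the center for each launcher) follows the kids' procedure.

proc iml;
/** hypothetical (x,y) positions, in cm, relative to the center;
    each row is one shot and Launcher identifies the group **/
Launcher = {"Red" "Red" "Red" "Black" "Black" "Black"};
xy = { 1  5,   -2  4,    0  6,
       3  7,   -1  5,    2  8 };
dist = sqrt( xy[ ,##] );          /** distance of each shot from the center **/
u = unique(Launcher);
do i = 1 to ncol(u);
   idx = loc(Launcher = u[i]);    /** shots from this launcher **/
   d = dist[idx];
   avgDist = d[:];                /** average distance to the center **/
   print (u[i])[label="Launcher"] avgDist;
end;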

When I saw the target on which they had recorded their results, I had a hard time containing the teacher in me. This target shows two fundamental statistical concepts: bias and scatter.

Bias Explained

When I look at the target, the first characteristic I see is bias due to a misalignment of the launchers. Two of the three launchers are aimed too high relative to the bullseye. This kind of experimental error is sometimes known as systematic bias because it leads to measurements that are systematically too high or too low.

The second figure shows the average locations of projectiles for each launcher. The average location of the red crosses (marked "C2") is the closest to the center of the target. The next closest is the center of the black squares, which is marked "C1." The fact that "C1" and "C3" are far away from the center of the target shows the bias in the experiment.

I pointed this out to my son, who shrugged. "I guess the launchers weren't perfectly lined up," he said.

Thinking that perhaps he might want to realign the launchers and re-run the experiment, I asked him what he intended to do about this bias.

He hesitated. I could see the wheels turning in his mind.

"Aim low?" he guessed.


Scatter Explained

The second feature of the data that I see is the scatter of the projectiles. The way that the projectiles scatter is measured by a statistical term known as the covariance. The covariance measures the spread in the horizontal direction, the vertical direction, and jointly along both directions.

I pointed out to my son that the red crosses tend to spread out in the up-down direction much more than they do in the left-right direction. In contrast, the black squares spread in both directions, but, more importantly, the directions tend to vary together. In four out of five cases, a shot that was high was also left of center. This is an example of correlation and covariance.

I asked my son to describe the spread of the blue circles. "Well," he cautiously began, "they vary a little bit up and down, but they are all over the place left-to-right."

"And do the shot locations tend to vary together?" I prompted.

"Um, not really," he replied.

Bingo! Covariance explained. All of the launchers show up-and-down and left-and-right variability, but the black squares also exhibit co-variability: the deviations in the horizontal direction are correlated with the deviations in the vertical direction.
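To make "varying together" concrete, here is a minimal SAS/IML sketch that computes the sample covariance and correlation matrices for one group of shots. The coordinates are hypothetical (chosen so that high shots tend to be left of center); the point is simply that a nonzero off-diagonal element of the covariance matrix is what co-variability means.

proc iml;
/** hypothetical deviations (x = left-right, y = up-down), in cm **/
xy = { -4  6,   -2  3,   0  1,   2 -2,   3 -4,   1  0 };
S = cov(xy);     /** 2x2 sample covariance matrix **/
R = corr(xy);    /** correlation between the x and y deviations **/
print S, R;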

For Adults Only

My son ran off to attack his sister, grateful to escape my merciless questioning.

I carefully measured the locations of the data, created a SAS DATA step, and began to analyze the data in SAS/IML Studio.

The third figure shows the center of each group of projectiles. I have also added ellipses that help to visualize the way that the projectiles scatter. (Each ellipse is a 75% prediction ellipse for bivariate normal data with covariance equal to the sample covariance.)

The ellipses make it easier to see that locations of the red crosses and blue circles do not vary together. The red ellipse is tall and skinny, which represents a small variation in the horizontal direction and a medium variation in the vertical direction. The blue ellipse is very wide, representing a large variation in the horizontal direction. The black ellipse is tilted, which shows the correlation between the horizontal and vertical directions.
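As a rough sketch of how such an ellipse can be computed, the following SAS/IML statements generate points on an approximate 75% prediction ellipse for a bivariate sample. The data are hypothetical, and the chi-square quantile used here is an approximation to the exact F-based prediction ellipse.

proc iml;
/** points on an approximate 75% prediction ellipse for bivariate data **/
xy = { -4  6,   -2  3,   0  1,   2 -2,   3 -4,   1  0 };  /** hypothetical shots **/
m = xy[:,];                               /** sample mean (1x2) **/
S = cov(xy);                              /** sample covariance (2x2) **/
c = quantile("ChiSquare", 0.75, 2);       /** scale factor for 75% coverage **/
t = do(0, 360, 5)` * constant("pi")/180;  /** angles around a circle **/
circle = cos(t) || sin(t);                /** points on the unit circle **/
U = root(S);                              /** Cholesky root: U`*U = S **/
ellipse = sqrt(c) * circle * U + repeat(m, nrow(t));
/** each row of 'ellipse' is a point on the ellipse; plot the columns as (x,y) **/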

The programming techniques that are used to draw the figures are described in Chapters 9 and 10 of my book, Statistical Programming with SAS/IML Software.

February 19, 2011
 
Lunch. For some workers, it’s the sweetest part of an otherwise bitter day at the grindstone. Nothing can turn that sweetness sour like going into the breakroom to discover that someone has taken your lunch and eaten it themselves.

Nothing like that ever happens here at SAS.

But if it did, I would set up a system to repeatedly collect and identify the saliva of the top suspects, and do an elegant chemical analysis. When a lunch goes missing, there’s always some residual spit on the used container.

I could develop a discriminant analysis model to identify each suspect. Then I’d score newly missing lunches with the model, flag the culprit, track them down and make them buy a box of Belgian chocolates for the person whose lunch they pilfered.
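A hedged sketch of what that might look like in SAS is shown below. The data set names (Suspects, NewLunch), the chemical measurements (x1-x3), and the threshold value are all hypothetical; PROC DISCRIM builds the classification rule from the labeled saliva samples, scores the new evidence, and the THRESHOLD= option refuses to classify an observation unless its posterior probability is high enough.

/** hypothetical sketch: train on labeled saliva samples, score new evidence **/
proc discrim data=Suspects testdata=NewLunch testout=Scored threshold=0.8;
   class Suspect;       /** known identity of each training sample **/
   var x1 x2 x3;        /** chemical measurements from the saliva **/
run;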

But what if I falsely accused someone who was innocent? Oh gosh. That could be an embarrassing and expensive error.

Let’s review how the discriminant analysis would look:


Continue reading "Who Ate My Lunch? Discriminant Thresholds to Reduce False Accusations"
February 18, 2011
 
The Flowing Data blog posted some data about how much TV actors get paid per episode. About a dozen folks have created various visualizations of the data (see the comments in the Flowing Data blog), several of them very glitzy and fancy.

One variable in the data is a categorical variable that specifies whether the actor appears in a TV comedy or a drama. It is interesting to compare the salaries for comedies with those for dramas.

One way to do this is by creating a comparative histogram. In SAS software, you can create a comparative histogram by using PROC UNIVARIATE with a CLASS statement and a HISTOGRAM statement. The ShowType variable is the classification variable; the Pay variable contains the pay per episode for each actor. The following statements create a comparative histogram:

ods graphics on;
proc univariate data=TV;
   class ShowType;                           /* comedy vs. drama */
   histogram Pay / kernel(c=SJPI)            /* Sheather-Jones plug-in bandwidth */
                   endpoints=0 to 13;        /* explicit histogram bin endpoints */
run;


In the example, the bandwidth is chosen by using the Sheather-Jones plug-in method because the default bandwidth selection algorithm seems to oversmooth these data. A smaller bandwidth helps detect the cluster of actors who earn close to $100,000 per episode, and another cluster that earns $400,000 per episode.

I also used the ENDPOINTS option to explicitly override the default bin width selection algorithm.

Notice that the comparative histogram automatically labels each histogram according to the value of the classification variable, and automatically equates the scales of all histograms for accurate visual comparison of the distribution of the sample. I can use the same code regardless of the number of categories in the classification variable.

This is the beauty of ODS graphics: the procedures automatically create graphs that are appropriate for an analysis.

Oh, and who is the outlier in the comedy group? That's Charlie Sheen. He makes $1.25 million per episode. I think his show, Two and a Half Men, should be renamed Two and a Half Times the Salary, Man!

February 11, 2011
 
In a previous blog post, I described the rules for a tic-tac-toe scratch-off lottery game and showed that it is a bad idea to generate the game tickets by using a scheme that uses equal probabilities. Instead, cells that yield large cash awards must be assigned a small probability of getting a tic-tac-toe and cells that yield small cash awards can be assigned larger probabilities. As Mohan Srivastava, the statistician who "cracked the scratch lottery code," said:
It would be really nice if the computer could just spit out random digits. But that’s not possible, since the lottery corporation needs to control the number of winning tickets.
This article describes how to construct a scratch-off game that has lots of winners of small prizes, but is still profitable for the lottery corporation (the government) because the large prizes occur infrequently.

A Cell-By-Cell Probability Model for a Scratch-Off Game

The lottery corporation wants to give away many small prizes (so that the game is fun to play) while restricting the number of large payouts (so that the game is profitable).

Recall that a ticket is a winner if a set of "lucky numbers" aligns along a row, column, or diagonal. For convenience, represent the lucky numbers by a 1 and the other numbers by a 0. The key is to assign each cell in the grid a certain probability of being a 1. For example, consider the following SAS/IML matrix of probabilities:

proc iml;
p = { 1.0  1.0  0.1,
      1.0  0.02 0.2,
      0.25 0.1  0.1 };

If you assign a 1 to a grid cell based on these probabilities, the resulting grids have the following properties:

  • A 1 is always placed in the first two positions of the first column (because the chance of placing a 1 there is 100%).
  • The probability of getting a winning tic-tac-toe in the first column is 0.25, the product of the probabilities in the first column. This means that one fourth of all gamers will get their $3 investment back.
  • A 1 is always placed in the first two positions of the first row.
  • The probability of getting a winning tic-tac-toe in the first row is 0.1. This means that one tenth of all gamers will get a $5 payout, for a net gain of $2.
  • Other ways of winning the game occur much less frequently.

The probabilities are assigned to lessen the chance of paying the big cash awards. For example, the middle cell gets a lucky number only 2% of the time, so winning the game by having all 1s on a diagonal (the big $250 prize!) will not happen often.

How Much and How Often Would You Win This Game?

In order to find the expected payout (in dollars) for each combination, you can compute the products of the row, column, and diagonal probabilities and multiply those values by the corresponding payouts, as shown in the following SAS/IML statements:

rowPay = p[,#] # {5,10,100};
colPay = p[#,] # {3 20 100};
diagPay = p[{1 5 9}][#] # 250;
antiDiagPay = p[{3 5 7}][#] #250;
print rowPay, colPay, diagPay antiDiagPay;

rowPay
0.5
0.04
0.25

colPay
0.75   0.04   0.2

diagPay   antiDiagPay
0.5       0.125

The payout for the entire game is just the sum of these individual payouts. To compute the expected gain/loss for playing the game, subtract the initial $3 purchase price:

ExpGain = -3 + sum(rowPay, colPay, diagPay, antiDiagPay);
print ExpGain;

ExpGain

-0.595

Scratch, Scratch, Scratch: Simulating Many Games

For people more comfortable with simulation than with probability theory, you can also write a SAS/IML simulation that "plays" the game many times. The statements call the TestWinner module from my previous blog post.

NSim = 1E5;
x = j(NSim,9);
y = j(NSim,1);
/** each cell has its own probability;
    sample the 9 cells one at a time **/
do i = 1 to 9;
   call randgen(y, "BERNOULLI", p[i]);
   x[,i] = y;
end;
win = TestWinner(x);
ExpWin = win[:];
PctWin = (win>=0)[:];
print ExpWin PctWin;

ExpWin     PctWin
-0.58358   0.33059

Notice that this game pays out an award 33% of the time, so it gives the players (false) hope and keeps them coming back. (Of course, for most of those "wins," you only get your money back!) The simulated result estimates that, on average, you will lose $0.58 every time you play the game. (The true expected loss is $0.595.)

Unless, of course, you know Srivastava's trick.

Choosing the Probabilities: A Better Way?

One thing bothers me about the probabilities that I chose: the probability of winning a $10 or $20 award is too small. In order to lower the chance of winning the big $250 award (getting tic-tac-toe on the diagonal), I chose a small probability for the middle cell. But this also lowers the probability of getting tic-tac-toe on the middle row and middle column.

What scheme would you use so that the chance of winning an award is a decreasing function of the size of the award?

February 10, 2011
 
Because of this week's story about a geostatistician, Mohan Srivastava, who figured out how to predict winning tickets in a scratch-off lottery, I've been thinking about scratch-off games. He discovered how to predict winners when he began to "wonder how they make these [games]."

Each ticket has a set of "lucky numbers." A ticket is a winner if the lucky numbers align on a tic-tac-toe grid along any row, column, or diagonal. The scratch-off portion of the game involves scratching an area on the ticket to reveal the lucky numbers. See my previous blog article for further details.

Srivastava's observation was that the "lucky numbers" on the tic-tac-toe grid have to occur less frequently than the other numbers. Otherwise, the game will pay out more money than it takes in.

Simplifying the Game for Analysis

How might someone design a tic-tac-toe scratch-off game? I'll simplify the problem by eliminating the lucky numbers, the scratch-off portion of the ticket, and all but one 3x3 grid. Represent all the lucky numbers by a 1, and the remaining numbers by a 0. In this scheme, a card is a winner if it has a row, column, or diagonal of all 1s.

The game costs $3 to play, and pays out the following amounts:

  • If the first row contains all 1s, the payout is $5.
  • If the second row contains all 1s, the payout is $10.
  • If the third row contains all 1s, the payout is $100.
  • If the first column contains all 1s, the payout is $3 (that is, you get your money back).
  • If the second column contains all 1s, the payout is $20.
  • If the third column contains all 1s, the payout is $100.
  • If either of the diagonals contains all 1s, the payout is $250.

What If Lucky Numbers Are Assigned with Equal Probability?

What would happen if the 0s and 1s are assigned to each cell with equal probability? For example, suppose that you specify a 15% chance that any individual cell is assigned a 1. What is the expected gain/loss for each ticket, and what percentage of tickets will be winners?

The answer is provided by the following simulation and graph. (It is possible to get an exact solution, but I'm feeling lazy.) The simulation constructs a large number of randomly assigned grids. (Each grid is represented by a row of the x matrix.) Each cell of the grid has a certain probability of being a 1. The TestWinner module determines the gain or loss for each grid by determining whether there is a row, column, or diagonal that contains all 1s. The simulation is repeated for a range of probability values, from a 10% probability to a 30% probability.

proc iml;
/** Given N grids (represented by rows of x), 
    determine the payout for each grid. **/ 
start TestWinner(x);
   w = j(nrow(x),1, -3); /** gain: initialize to -3 **/
   do i = 1 to nrow(x);
      if all(x[i,1:3])     then w[i] = w[i]+5;
      if all(x[i,4:6])     then w[i] = w[i]+10;
      if all(x[i,7:9])     then w[i] = w[i]+100;
      if all(x[i,{1 4 7}]) then w[i] = w[i]+3;
      if all(x[i,{2 5 8}]) then w[i] = w[i]+20;
      if all(x[i,{3 6 9}]) then w[i] = w[i]+100;
      if all(x[i,{1 5 9}]) then w[i] = w[i]+250;
      if all(x[i,{3 5 7}]) then w[i] = w[i]+250;
   end;
   return(w);
finish;

call randseed(54321);
NSim = 1E5;
x = j(NSim,9); /** each row of x is a grid **/

p = do(0.1, 0.3, 0.01);
pctWin = j(ncol(p),1);
ExpWin = pctWin;
do i = 1 to ncol(p);
   call randgen(x, "BERNOULLI", p[i]);
   win = TestWinner(x);
   ExpWin[i] = win[:];
   PctWin[i] = (win>=0)[:];
end;

The graph shows the expected gain or loss when cells in the tic-tac-toe grid are randomly assigned according to a fixed probability value. The graph shows that the lottery corporation (that is, the government) cannot make a profit on games for which the chance of a cell being a 1 is more than 15%. However, profitable games that are constructed like this have very few winners. For example, when the chance of a 1 is 15%, only 2.5% of those playing the game get a winning tic-tac-toe combination.

This is a problem for the lottery corporation, because people don't play games that they usually lose. The key to getting people to play a scratch-off game (and slot machines, and other gambling games) is to award many small prizes, but few large prizes. If the 0s and 1s are assigned to each cell with equal probability, then the large cash awards have the same chance of occurring as the smaller awards. The chance of winning has to be set very small (too small!) in order to manage the losses.

This hypothetical analysis shows that it is a bad idea to generate a tic-tac-toe game by using equal probabilities. Instead, cells that contribute to the large cash awards (the diagonals) must be assigned a small probability of getting the "lucky numbers," whereas cells that contribute to small cash awards (the first row and column) can be assigned larger probabilities.

By controlling the probability that each cell contains a lucky number, the lottery corporation can control the chance of winning each prize. It can guarantee a profit for the game, while awarding many small prizes that keep customers coming back to play again.

Srivastava recognized this fact. In the Wired article he is quoted as saying:

It would be really nice if the computer could just spit out random digits. But that’s not possible, since the lottery corporation needs to control the number of winning tickets.

Distribution of the Lucky Numbers

As a statistical footnote to this analysis, you can construct the frequency distribution of the number of 1s in a tic-tac-toe grid that uses a random uniform assignment for each cell. The count of 1s follows a binomial distribution. For example, if the probability that a cell contains a 1 is 15%, the following SAS/IML statements compute the frequency distribution of 1s in the grid:

/** distribution of 1s in a 3x3 grid in which 
    each cell has a 15% chance of a 1 **/
k = 0:9;
pdf = pdf("Binomial", k, 0.15, 9);

The scatter plot shows the probability that a tic-tac-toe grid will have a given number, k, of 1s for k=0, 1, 2,...,9, when there is a 15% chance that a given cell contains a 1. The graph shows that most grids (86%) that are constructed under this scheme will have zero, one, or two 1s. Obviously, none of these grids can be a winning ticket.
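You can check that figure directly from the binomial CDF; the following statement (a minimal check under the same 15% assumption) returns approximately 0.86:

/** P(number of 1s <= 2) when each of 9 cells is a 1 with probability 0.15 **/
ProbLowCount = cdf("Binomial", 2, 0.15, 9);
print ProbLowCount;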

Once again, the conclusion is that a game that is constructed using this scheme would not be fun to play: the "lucky numbers" don't appear often enough! In order to increase the frequency of lucky numbers while managing the risk of the tickets that pay out, you need to use a construction method that assigns different probabilities to each cell of the grid.

I'll describe that approach tomorrow.

February 9, 2011
 
The other day, someone asked me how to compute a matrix of pairwise differences for a vector of values. The person asking the question was using SQL to do the computation for 2,000 data points, and it was taking many hours to compute the pairwise differences. He asked if SAS/IML could solve the problem faster.

The result? My program ran in a fraction of a second.

Now, to be fair, I'm sure that the SQL code is not as efficient as it could be. It has been my experience that a well-written SQL or DATA step program usually performs as well as (or better than) comparable SAS/IML code. I'm not an SQL programmer, so I couldn't offer advice on how to improve his SQL code. But since I solved the problem by using PROC IML, I'll discuss the solution, along with sharing some thoughts about efficient SAS/IML programming.

An Example Problem: Comparing Gas Prices at Different Stations

The following DATA step creates a variable that contains gas prices (in cents per gallon) in Raleigh, NC, on a certain day for seven gas stations:

/** data from www.gaspricewatch.com, 05FEB2011 **/
data GasPrices;
length Station $6;
input Price DOLLAR5.2 Station $;
Price = Price * 100; /** convert to cents **/
datalines;
$2.90 Citgo
$2.91 Exxon1
$2.99 Shell1
$3.00 Shell2
$3.03 BP    
$3.05 Exxon2
$3.09 Texaco
;
run;

Suppose you are interested in the differences between pairs of prices. That is, you want to create the matrix, D, whose ijth entry is the difference in price between the price at the ith station and the price at the jth station. Clearly, D is an anti-symmetric matrix with zero on the diagonal. (An anti-symmetric matrix is also called skew-symmetric.) You can compute D with the following SAS/IML statements:

proc iml;
use GasPrices;
read all var {Station Price};
close GasPrices;

/** compute matrix of differences **/
/** 1. use matrices **/
start FullMatDiff(x); /** x is a column vector **/
   n = nrow(x);   
   m = shape(x, n, n); /** duplicate data **/
   return( m` - m );
finish;

D = FullMatDiff(Price);
print D[c=Station r=Station];

D
          Citgo  Exxon1  Shell1  Shell2     BP  Exxon2  Texaco
Citgo         0      -1      -9     -10    -13     -15     -19
Exxon1        1       0      -8      -9    -12     -14     -18
Shell1        9       8       0      -1     -4      -6     -10
Shell2       10       9       1       0     -3      -5      -9
BP           13      12       4       3      0      -2      -6
Exxon2       15      14       6       5      2       0      -4
Texaco       19      18      10       9      6       4       0

Because the data are ordered by price, the pairwise differences increase as you scan down any column of D.

Analyzing Efficiency

In the SAS/IML language, as with many programming languages, there is more than one way to compute the same quantity. Consequently, I am always asking myself whether my first solution is the most efficient solution.

From a computational point of view, the FullMatDiff module is efficient. Notice that it uses the SHAPE function to create a matrix whose ith column contains n copies of the ith price. If you are not familiar with the SHAPE function, you might unnecessarily write a loop.

The main computation is a single matrix expression, m`-m, which computes all n² pairs of differences.
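If the behavior of SHAPE is not obvious, a tiny example (with arbitrary numbers) shows exactly what the module relies on:

proc iml;
x = {1, 2, 3};          /** a column vector of three "prices" **/
m = shape(x, 3, 3);     /** each row is a copy of x`, so column i holds copies of x[i] **/
d = m` - m;             /** d[i,j] = x[i] - x[j] **/
print m, d;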

However, when I look at the module critically, I notice that it requires more memory than is necessary. The US Census Bureau estimates that there were 117,000 gas stations in the US in 2008, so I can imagine running this module on many tens of thousands of prices. This could be a problem because each matrix requires memory for n² entries and there are three matrices that need to be in RAM simultaneously: m, the transpose of m, and the matrix of differences.

It is not necessary to explicitly compute either m or the transpose. If I'm willing to write a loop, I can compute the pairwise differences while allocating only a single n x n matrix:

/** 2. compute each row directly **/
start FullRowDiff(x);
   n = nrow(x);   
   diff = j(n, n, .);
   xt = T(x); 
   do i = 1 to n;
      diff[i,] = xt[i] - xt;
   end;
   return( diff );
finish;

Another thought that occurred to me is that the matrix of differences is anti-symmetric, so it is not necessary to compute all of the entries but only the "n choose 2" entries above the diagonal (or, equivalently, below the diagonal). The SAS/IML module for computing the upper-triangular entries is similar to the previous module:

/** 3. compute only upper triangular differences 
       [COMB(n,2) elements] **/
start TriDiff(x);
   n = nrow(x);   
   diff = j(n, n, .);
   xt = T(x); 
   do i = 1 to n-1;
      cols = i+1:n;
      diff[i,cols] = x[i] - xt[,cols];
   end;
   return( diff );
finish;

Comparing the Performance of the Modules

I wondered whether one algorithm would run faster than the others, so I timed the performance on data with a wide range of sizes. It turns out that the time required to compute the pairwise differences is about the same for the three algorithms, regardless of the size of data. Furthermore, even for 10,000 prices, each algorithm executes in seconds. For more data than that, my computer cannot allocate sufficient memory.
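For readers who want to repeat the comparison, a minimal timing sketch is shown below. It assumes the three modules above are already defined in the PROC IML session; the size and seed are arbitrary.

/** time one module on N random prices **/
call randseed(1);
N = 5000;
Price = j(N, 1);
call randgen(Price, "uniform");
t0 = time();
D = FullMatDiff(Price);
elapsed = time() - t0;    /** seconds to compute all pairwise differences **/
print N elapsed;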


I know that some of my readers are SQL programmers, so feel free to post an SQL solution. Also, let me know how long your SQL program requires to compute the pairwise differences of 2,000 and 10,000 random uniform values. Can an SQL solution handle 20,000 prices? How long does it take?

February 8, 2011
 
On Friday, I posted an article about using spatial statistics to detect whether a pattern of points is truly random. That day, one of my colleagues asked me whether there are any practical applications of detecting spatial randomness or non-randomness. "Oh, sure," I replied, and rattled off a list of applications in biology, materials science, manufacturing, and epidemiology.

On Monday, I read in Andrew Gelman's blog about a statistician who, in 2003, figured out how to pick cards in a certain scratch-off game so as to increase the probability of winning. When I followed the link to the original story in Wired magazine, I was astonished to discover that the statistician, Mohan Srivastava, is a geostatistician and that his technique uses spatial statistics that are similar to the ideas that I laid out in my blog post!

The basic idea, which is illustrated below and described halfway through the article, is to look at the distribution of numbers on the ticket and use a frequency analysis to determine which tickets have layouts that are less random than is expected by chance. In the example shown in the article, the ticket has infrequent numbers (numbers with a frequency of 1) in a winning tic-tac-toe configuration. (This is circled in the image below.) Such a configuration is unlikely to happen by chance alone, so you should buy the ticket. Srivastava experimentally showed that the ticket is a winner about 90% of the time.

So, add "picking scratch-off tickets" to the list of applications of spatial statistics.

February 4, 2011
 
Last week I generated two kinds of random point patterns: one from the uniform distribution on a two-dimensional rectangle, the other by jittering a regular grid by a small amount.

My show choir director liked the second method (jittering) better because of the way it looks on stage: there are no gaps and no two performers are too close to each other. The statistician in me wants to know what statistical properties make the jittered pattern more desirable than the other. Said differently, can I do any of the following?

  • Compute a statistical measure that characterizes each kind of point pattern?
  • Use statistics to predict whether a given pattern would be acceptable to her?

As part of my plan to learn new statistics, I've been reading Statistical Analysis and Modelling of Spatial Point Patterns by J. Illian, A. Penttinen, H. Stoyan, and D. Stoyan. The spatial statistics in this blog post come from their book. This post has four parts:

  1. I describe a simple test that indicates whether a point pattern is likely to have been generated from a random uniform distribution.
  2. I implement the test in the SAS/IML language.
  3. I simulate data to show what the test means and how it works.
  4. I return to the above questions and show that the answer to both questions is "yes."

Complete Spatial Randomness --- A Test

Not all random configurations are created equal. A jittered pattern is random, but a pattern that is generated from a uniform distribution on a 2D rectangle exhibits complete spatial randomness (CSR).

There are several ways to test for CSR. Probably the easiest method is called the method of quadrats. The method consists of the following steps:

  1. Divide the rectangular domain into k disjoint cells (bins) of equal area and count the number of points, ci, that fall into each cell. You can use the Bin2D module for this step.
  2. If the points were generated from a distribution with a uniform density, you would expect e = N/k points in each bin, where N is the total number of points. Subtract that value from the actual bin count to obtain the residual count for each bin.
  3. Just as for two-way frequency tables, you can compute a normalized sum of the squared residuals and use that measure to determine whether the point pattern was likely to have come from a distribution with a uniform density. The computation is just the usual Pearson chi-square statistic: Σ (cᵢ − e)² / e.
The theory says that if a point pattern is generated from a uniform distribution, the Pearson statistic has an approximate chi-square distribution with k-1 degrees of freedom, provided that k > 6 and e > 1.

Ugh! It takes longer to explain the algorithm in words than it does to implement it in the SAS/IML language! It's time for a concrete example.

SAS/IML Code to Test for Complete Spatial Randomness

The following SAS/IML statements generate 20 points from a random uniform distribution, define a 3x3 grid on the unit square, and call the Bin2D module to count how many points are in each cell of the grid:

proc iml;
N = 20;
pUnif = j(N,2);
call randseed(1234);
call randgen(pUnif, "uniform"); /** generate random uniform **/

/** Define regular grid: 0, 0.33, 0.67, 1 **/
XMin = 0; XMax = 1; Nx = 3; dx = 1/Nx; 
YMin = 0; YMax = 1; Ny = 3; dy = 1/Ny;
XDiv = do(XMin, XMax, dx); 
YDiv = do(YMin, YMax, dy);

/** count the points in each cell **/
load module=Bin2D;
countUnif = Bin2D(pUnif, XDiv, YDiv); 

For this example, there are 20 points and 9 cells, so you should expect 2.22 points in each cell. The following statements compute the residuals (observed counts minus the expected value) for each cell and compute the Pearson statistic, which in spatial statistics is also known as the index of dispersion:

c = countUnif;
NBins = (ncol(XDiv)-1) * (ncol(YDiv)-1);
e = N/NBins;  /** expected value **/
resid = (c - e);
d = sum( resid##2 ) / e; /** index of dispersion **/
print resid;

resid
 1.7777778   -1.222222   0.7777778
 0.7777778   -2.222222   -1.222222
 -0.222222   -1.222222   2.7777778

When the points come from a uniform random distribution, d has a 90% chance of being within the interval [L, R], where L and R are the following quantiles of the chi-square distribution:

L = quantile("chisquare", 0.05, NBins-1);
R = quantile("chisquare", 0.95, NBins-1);
print d L R;

d     L           R
9.7   2.7326368   15.507313

What Does It Mean? Time to Simulate

My brain is starting to hurt. What does this all mean in practice?

What it means is that I can do the following experiment: simulate 1,000 point patterns from the random uniform distribution and compute the corresponding 1,000 Pearson statistics. If I plot a histogram of the 1,000 Pearson statistics, I should see that histogram looks similar to a chi-square distribution with 8 degrees of freedom and that roughly 100 (10%) of the Pearson statistics will be outside of the interval [L, R].
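A minimal sketch of that experiment appears below. It assumes the Bin2D module is loaded and that pUnif, XDiv, YDiv, e, L, and R are defined as in the previous section.

/** simulate many uniform patterns and compute the Pearson statistic for each **/
NRep = 1000;
d = j(NRep, 1);
do k = 1 to NRep;
   call randgen(pUnif, "uniform");     /** a new random pattern (N x 2) **/
   c = Bin2D(pUnif, XDiv, YDiv);       /** bin counts for this pattern **/
   d[k] = sum( (c - e)##2 ) / e;       /** Pearson statistic **/
end;
PropOutside = (d < L | d > R)[:];      /** should be roughly 0.10 **/
print PropOutside;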

The adjacent image shows the results of the experiment, which I ran in SAS/IML Studio. The curve is a χ² distribution with 8 degrees of freedom. The vertical lines are at the values L and R.

For this experiment, 71 point patterns have Pearson statistics that are outside of the interval. This is not quite 10%, but it is close. Repeating the experiment with different numbers of points (for example, N=30 or 50) and different grids (for example, a 2x3 or 4x3) results in similar histograms that also have roughly 10% of their values outside of the corresponding confidence interval.

The little scatter plots underneath give an example of a point pattern with the indicated Pearson statistic. The middle scatter plot (d=9.7) is the plot of the random points generated in the previous section. It is a "typical" point pattern from the random uniform distribution; most of the point patterns look like this one. You can see that there is a lot of variation in the cell counts for this configuration: one of the bins contains zero points, another contains five.

The scatter plot on the left (d=0.7) is not a typical point pattern from the random uniform distribution. All of the cells in this pattern have two or three points. It is a little counterintuitive, but it is rare that a truly random uniform point pattern has points that are evenly spaced in the unit square.

The scatter plot on the right (d=27.7) shows the other extreme: more variation than is expected. Five of the cells in this pattern have zero or one points. A single cell contains almost half of the points!

What about the Jittered Patterns?

The previous graph implies that the arrangements of points that are randomly generated from a uniform distribution are rarely evenly spaced. The usual case exhibits gaps and clusters.

If you instead generate 1,000 jittered patterns, as in my previous post, the distribution of the Pearson statistics looks like the following image. The distribution of Pearson's statistic does not follow a chi-square distribution for jittered patterns. Most of the values of Pearson's statistic are small, which makes sense because the statistic measures the departure from the average density, and the configurations are permutations of evenly spaced points.
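The jittering algorithm from the earlier post is not reproduced here, but a minimal sketch of one simple variant (a regular grid of 20 points, each perturbed by a small uniform amount) looks like this; the grid spacing and the size of the perturbation are arbitrary:

/** one way to jitter a regular grid of 20 points on the unit square **/
gx = do(0.1, 0.9, 0.2);                /** 5 grid positions in x **/
gy = do(0.125, 0.875, 0.25);           /** 4 grid positions in y **/
pts = j(20, 2);  k = 0;
do i = 1 to ncol(gx);
   do j = 1 to ncol(gy);
      k = k + 1;
      pts[k, ] = gx[i] || gy[j];
   end;
end;
jitter = j(20, 2);
call randgen(jitter, "uniform");
pJit = pts + 0.05*(jitter - 0.5);      /** small random perturbation **/
/** compute the Pearson statistic for pJit exactly as before **/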

It's time to revisit the questions posed at the beginning of this post.

Can I compute a statistical measure that characterizes each kind of point pattern? Yes. The Pearson statistic tends to be small for jittered point patterns and tends to be larger for random uniform configurations. However, if the Pearson statistic for an example pattern is, say, between 3 and 6, I can't predict whether the points were generated by using the jittering algorithm or whether they were generated from a random uniform distribution.

Can I use statistics to predict whether a given pattern might be acceptable to my director? Yes. Regardless of how the configuration was generated, a pattern that has a small value of the Pearson statistic (less than 3 for this example) does not exhibit gaps or clusters, and therefore will be acceptable to my director.

Although statistics can be used to check the appropriateness of certain blocking configurations, I don’t think that my director will be replaced by a SAS/IML program anytime soon: she also plays the piano!

February 3, 2011
 
So, if you were reading last week, we talked about how to structure your data for a mixed models repeated measures analysis. And as my friend Rick pointed out, there’s more than one way to go about restructuring your data (if you ask real nice, he’ll also do it in PROC IML, the Rockstar of programming languages). Then we played with a data set in which the dependent measurements were not ordered over time. In fact, they weren’t even the same variable.
The Scene:
In order to increase the amount of money customers deposit in three different account types, a bank designs a factorial experiment with two factors: promotion (Gift or Discount) and minimum balance ($75, $125, or $225). Offers are sent to existing customers and the sum of their deposits to the three account types (savings, checking, and investment) are recorded.
The Classical Approach: MANOVA
Multiple continuous variables observed on the same subject is a textbook-perfect scenario for multivariate analysis of variance (MANOVA). MANOVA takes advantage of the correlation among responses within a subject and constructs a matrix of sums of squares and sums of cross-products (SSCP) to compare between- and within-group variability while accounting for correlation among the dependent variables within a subject and unequal variances across the dependent variables.
proc glm data = blog.promoexperiment;
class promotion minbal;
model savbal checkbal investamt= promotion|minbal ;
manova h=_all_;
run;

The data set, as we discussed last week, has one row per customer and one column per dependent variable.
Just like multivariate repeated measures analysis (which is really just MANOVA with some fancy contrasts pre-cooked), a little missing data goes a long way to killing your sample size and therefore statistical power. Furthermore, working with covariates can be tricky with repeated measures MANOVA. The MANOVA SSCP matrices require estimation of many bits, which can also eat up your power. There are four multivariate test statistics, which can also complicate matters if you are not certain which one is the best for you to use.
The Modern Approach: Mixed Models
It turns out that it is really easy to fit an equivalent—but not identical—model in the MIXED procedure.
proc mixed data = blog.promouni;
class promotion minbal;
model value1= promotion|minbal/noint ;
repeated /subject = subject type=un;
run;

The data set for the mixed model has one row per observation (a dependent variable within a customer).
More, and Different:
If all we were doing was reproducing MANOVA results with PROC MIXED, I would not be writing this blog. We can do more. Instead of just accommodating unequal variances and covariance within a subject, the mixed models approach directly models the covariance structure of the multiple dependent variables. What’s more is that you can also simplify the structure, buying you more power, and making the interpretation of your model easier. For example, you might suspect that the variances are equal and the covariances between pairs of dependent variables are equal across the three dependent variables.
proc mixed data = blog.promouni;
class promotion minbal;
model value1= promotion|minbal/noint ;
repeated /subject = subject type=cs;
run;
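To compare the fit of the unstructured and compound-symmetry models, you can capture each run's fit statistics in a data set. A minimal sketch is shown below; FitStatistics is the ODS name of the PROC MIXED fit statistics table (the -2 Res Log Likelihood, AIC, AICC, and BIC), and the output data set names are arbitrary.

ods output FitStatistics = fit_un;   /* run once with type=un ... */
proc mixed data = blog.promouni;
class promotion minbal;
model value1= promotion|minbal/noint ;
repeated /subject = subject type=un;
run;

ods output FitStatistics = fit_cs;   /* ... and again with type=cs, then compare AIC and BIC */
proc mixed data = blog.promouni;
class promotion minbal;
model value1= promotion|minbal/noint ;
repeated /subject = subject type=cs;
run;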

The fit statistics in the mixed model enable model comparison. Since the mean model is identical in both cases, fit statistics based on REML are appropriate.
Along with changing the covariance structure, there are other advantages that tag along with using a mixed model: more efficient handling of missing data, easier handling of covariates, easy accommodation of multiple levels of nesting (measurements within subjects within sales territories within your wildest imaginings), straightforward modeling of a time component, and heterogeneous-groups models, to name a few.
Variation on a Theme: Mixture of Distributions in PROC GLIMMIX
Few days go by that I don’t use the GLIMMIX procedure, and as it happens, there’s a trick in PROC GLIMMIX that makes these types of models even more flexible. Starting in SAS 9.2, you can model a mixture of distributions from the exponential family, such as one gamma and two normal responses. If my data set also contained a column that names the distribution for each response variable (the distrib variable referenced in the code below), then I could fit the model as follows:
proc glimmix data = blog.promouni;
class promotion minbal;
model value1= promotion|minbal/noint dist=byobs(distrib);
random intercept /subject = subject;
run;

Or like this, instead:
proc glimmix data = blog.promouni;
class promotion minbal;
model value1= promotion|minbal/noint dist=byobs(distrib);
random _residual_ /subject = subject type=un;
run;

Those two models are not equivalent, and they both use pseudo-likelihood estimation, so you will probably only use this kind of a model in circumstances where nothing else will do the job. Still, it’s quite a bit more than could be done even a couple of years ago.
I know I’m keeping you hanging on for that punchline. So here you are (with my deepest apologies)…
Three correlated responses walk into a bar.
One asks for a pilsner. The second asks for an ale.
The third one tells the bartender, “I’m just not feeling normal today. Better gamma something mixed.”


(edited to fix the automatic underlining in html in the SAS code-- it should be correctly specified now)
February 2, 2011
 
One of my New Year's resolutions is to learn a new area of statistics. I'm off to a good start, because I recently investigated an issue which started me thinking about spatial statistics—a branch of statistics that I have never formally studied.

During the investigation, I asked myself: Given a configuration of points in a rectangle, how can I compute a statistic that indicates whether the configuration is uniformly spread out in the region?

One way to investigate this question is to look at the density of points in the configuration. Are the points spread out or clumped together?

Binning the Points

In the one-dimensional case, you can use a histogram to visually examine the density of univariate data. If you see that the heights of the bars in each bin are roughly the same height, then you have visual evidence that the data are uniformly spaced.

Of course, you don't really need the histogram: the counts in each bin suffice. You can divide the range of the data into N equally spaced subintervals and write SAS/IML statements that count the number of points that fall into each bin.
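As a sketch, the one-dimensional count can be written in a few SAS/IML statements; the data here are arbitrary uniform values and the interval [0,1] is assumed.

proc iml;
/** count points of x in N equal-width bins of [0,1] **/
call randseed(4321);
x = j(100, 1);
call randgen(x, "uniform");
N = 5;
cutpts = do(0, 1, 1/N);              /** bin endpoints: 0, 0.2, ..., 1 **/
count = j(N, 1, 0);
do i = 1 to N;
   count[i] = sum( (x >= cutpts[i]) & (x < cutpts[i+1]) );
end;
print count;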

It is only slightly harder to compute the number of points that fall into bins of a two-dimensional histogram or, equivalently, into cells of a two-dimensional grid:

  1. For each row in the grid...
  2. Extract the points whose vertical coordinates indicate that they are in this row. (Use the LOC function for this step!)
  3. For these extracted points, apply the one-dimensional algorithm: count the number of points in each horizontal subdivision (column in the grid).
The following SAS/IML module uses this algorithm to count the number of points in each cell. Notice that the algorithm loops over the subdivisions of the grid, rather than over the observations.

proc iml;
/** Count the number of points in each cell of a 2D grid.
    Input: p    - a kx2 matrix of 2D points
           XDiv - a subdivision of the X coordinate direction
           YDiv - a subdivision of the Y coordinate direction
    Output: The number of points in p that are contained in 
            each cell of the grid defined by XDiv and YDiv. **/
start Bin2D(p, XDiv, YDiv);
   x = p[,1];       y = p[,2];
   Nx = ncol(XDiv); Ny = ncol(YDiv);
   
   /** Loop over the cells. Do not loop over observations! **/
   count = j(Ny-1, Nx-1, 0); /** initialize counts for each cell **/
   do i = 1 to Ny-1; /** for each row in grid **/
      idx = loc((y >= YDiv[i]) & (y < YDiv[i+1]));
      if ncol(idx) > 0 then do;
         t = x[idx]; /** extract point in this row (if any) **/
         do j = 1 to Nx-1; /** count points in each column **/
            count[i,j] = sum((t >= XDiv[j]) & (t < XDiv[j+1]));
         end;
      end;
   end;
   return( count );
finish;

Testing the Binning Module

A convenient way to test the module is to generate 100 random uniform points in a rectangle and print out the counts. The following example counts the number of points in each of 20 bins. There are five bins in each horizontal row and four bins in each vertical column.

call randseed(1234);
p = j(100,2);
call randgen(p, "uniform"); /** 100 random uniform points **/

/** define bins: 5 columns and 4 rows on [0,1]x[0,1] **/
XMin = 0; XMax = 1; dx = 1/5;
YMin = 0; YMax = 1; dy = 1/4;
XDiv = do(XMin, XMax, dx);
YDiv = do(YMin, YMax, dy);

count = Bin2D(p, XDiv, YDiv);

A visualization of the random points and the grid is shown in the scatter plot. From the graph, you can quickly estimate that each bin has between 1 and 10 points. The exact counts are contained in the count matrix.

Notice that the Y axis of the scatter plot has large values of Y at the top of the image and small values at the bottom of the image. In order to compare the values of the count matrix with the scatter plot, it is helpful to reverse the order of the rows in count:


/** reverse rows to conform to scatter plot **/
count = count[nrow(count):1,];
c = char(XDiv, 3); /** form column labels **/
r = char(YDiv[ncol(YDiv)-1:1], 4); /** row labels (reversed) **/
print count[colname=c rowname=r];

count
          0     0.2     0.4     0.6     0.8
0.75      2       3       6       7       7
0.5       5       9       7       4       1
0.25      9       2       7       2       7
0         7       5       2       2       6

You can use these bin counts to investigate the density of the point pattern. I'll present that idea on Friday.