
March 31, 2011
 
This week, I posted the 100th article to The DO Loop. To celebrate, I'm going to analyze the content of my first 100 articles.

In December 2010 I compiled a list of The DO Loop's most-read posts, so I won't repeat that exercise. Instead, I thought it would be interesting to analyze what I've said in my blog by looking at the frequency of words over the first 100 posts.

I concatenated the first 100 posts into a single file, and I used SAS software (specifically, PROC FREQ) to analyze the most frequent terms. Then I created a Wordle word cloud, which uses size to represent the frequency of occurrence of the top terms.



I like this word cloud because it summarizes my objectives for The DO Loop:

  • Data is central to my blog.
  • Data is surrounded by terms related to statistical programming: matrix, function, SAS, SAS/IML, program, and compute.
  • Surrounding those terms are keywords in the SAS/IML language: USE, READ, DO, IF/THEN, PRINT, and CALL.
  • Interspersed throughout the cloud are terms related to matrix computations: row, column, values, and vector.
  • Statistical terms are also interspersed: random, variables, probability, distribution, mean, and covariance.
The word cloud interlaces these different objectives, just like I attempt to write articles that interleave statistics, data analysis, and programming.

In that spirit, I can't think of a better way to celebrate Post #101 than to write a SAS program that uses the SGPLOT procedure to visualize the general topics of my blog posts. (Counts do not add to 100 because some articles belong to multiple categories.)
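
The Categories data set that the following procedure reads is not shown. A minimal sketch of its structure, with hypothetical category names and counts, might look like this DATA step:

/* hypothetical category names and counts; the real values are not shown */
data Categories;
   length Category $30;
   input Category & $30. Freq;
   datalines;
Statistical Programming  45
Data Analysis  30
Matrix Computations  25
Simulation  15
Tips and Techniques  10
;
run;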

title "The DO Loop: Top Categories";
proc sgplot data=Categories;
   hbar Category / freq=Freq;
   xaxis grid integer;
   yaxis discreteorder=data;
run;


Thanks to my colleagues, Alison, Anne, and Chris, for their support of this blog. Thanks to all of my readers for your patronage, comments, and encouragement. I look forward to writing the next 100 articles, and I hope you look forward to reading them.

March 30, 2011
 
In a previous post, I described how to compute means and standard errors for data that I want to rank. The example data (which are available for download) are mean daily delays for 20 US airlines in 2007.

The previous post carried out steps 1 and 2 of the method for computing confidence intervals for rankings, as suggested in Marshall and Spiegelhalter (1998):

  1. Compute the mean delay for each airline in 2007.
  2. Compute the standard error for each mean.
  3. Use a simulation to compute the confidence intervals for the ranking.

This post carries out step 3 of the method.

Ordering Airlines by Mean Delays

To recap, here is the SAS/IML program so far. The following statements compute the mean delays and standard errors for each airline in 2007. The airlines are then reordered by their mean delays:

proc iml;
/** read mean delay for each day in 2007
    for each airline carrier **/
Carrier = {AA AQ AS B6 CO DL EV F9 FL HA
           MQ NW OH OO PE UA US WN XE YV};
use dir.mvts; 
read all var Carrier into x;
close dir.mvts;
meanDelay = x[:,];

/** sort ascending by mean delay; reorder columns **/
call sortndx(idx, MeanDelay`, 1);
Carrier = Carrier[,idx];
x = x[,idx];
meanDelay = meanDelay[,idx];

Ranking Airlines...with Confidence

You can use the simulation method suggested by Marshall and Spiegelhalter to compute confidence intervals for the rankings. The following statements sample 10,000 times from normal distributions. (I've previously written about how to write an efficient simulation in SAS/IML software.) Each of these samples represents a potential realization of the underlying random variables:

/** simulation of true ranks, according to the method 
    proposed by Marshall and Spiegelhalter (1998) 
    British Med. J. **/
/** parameters for normal distribution of means **/
mean = meanDelay`;
sem = sqrt(var(x)/nrow(x)); /** std error of mean **/

p = ncol(Carrier);
NSim = 10000;
s = j(NSim, p); /** allocate matrix for results **/
z = j(NSim, 1);
call randseed(1234);
do j = 1 to p;
   call randgen(z, "Normal", mean[j], sem[j]);
   s[,j] = z;
end;

The matrix s contains 10,000 rows and 20 columns. Each row is a potential set of means for the airlines, where the ith mean is chosen from a normal distribution with population parameters μi and σi. Because the population parameters are unknown, Marshall and Spiegelhalter advocate replacing the population means with the sample means, and the population standard deviations by the sample standard errors. This kind of simulation is similar to a parametric bootstrap.

The following statements use the SAS/IML RANK function to compute the rank (1–20) for each simulated set of means. (If you need to conserve memory, you can re-use the s matrix to store the ranks.)

/** for each sample, determine the rankings **/
r = j(NSim, p);
do i = 1 to NSim;
   r[i,] = rank(s[i,]);
end;

Each column of the r matrix contains simulated ranks for a single airline. You can use the QNTL subroutine to compute 95% confidence intervals for the rank. Basically, you just chop off the lower and upper 2.5% of the simulated ranks. (If you have not upgraded to SAS/IML 9.22, you can use the QNTL module for this step.)

/** find the 95% CI of the rankings **/
call qntl(q, r, {0.025 0.975});

The simulation is complete. The matrix q contains two rows and 20 columns. The first row contains the lower end of the 95% confidence limits for the ranks; the second row contains the upper limit.

You can print q to see the results in tabular form. I prefer graphical displays, so the following statements create a SAS data set that contains the estimated ranks and the upper and lower confidence limits.

d = rank(mean) || T(q);
varNames = {"Rank" "LCL" "UCL"};
create Rank from d[r=Carrier c=varNames];
append from d[r=Carrier];
close Rank;

It is now straightforward to use the SCATTER statement in PROC SGPLOT to create a scatter plot with confidence intervals (called "error bars" in some disciplines):

proc sgplot data=Rank noautolegend;
title "Ranking Airlines (On-Time Performance, 2007)";
scatter y=carrier x=Rank /
        xerrorlower=LCL 
        xerrorupper=UCL;
xaxis label="Rank with 95% Confidence Intervals";
yaxis grid discreteorder=data label="Carrier"; 
run;

For these data, the ranking of the first three airlines (AQ, HA, WN) seems fairly certain, but there is more uncertainty for the rankings of airlines that are in the middle of the pack. So, for example, if Frontier Airlines (F9) issues a press release about their 2007 on-time performance, it shouldn't claim "We're #5!" Rather, it should state "We're #5 (maybe), but we might be as high as #4 or as low as #8."

In keeping with my New Year's Resolutions, I've learned something new. Next time I want to rank a list, I will pay more attention to uncertainty in the data. Plotting confidence intervals for the rankings is one way to do that.

March 25, 2011
 
I recently posted an article about representing uncertainty in rankings on the blog of the ASA Section for Statistical Programmers and Analysts (SSPA). The posting discusses the importance of including confidence intervals or other indicators of uncertainty when you display rankings.

Today's article complements the SSPA post by showing how to construct the confidence intervals in SAS. This article implements the techniques in Marshall and Spiegelhalter (1998), Brit. Med. J., which is a very well written and readable paper. (If you prefer sports to medicine, Spiegelhalter wrote a similar article on ranking teams in the English Premier Football League.)

To illustrate the ideas, consider the problem of ranking US airlines on the basis of their average on-time performance. The data for this example are described in the Statistical Computing and Graphics Newsletter (Dec. 2009) and in my 2010 SAS Global Forum paper. The data are available for download.

The data are in a SAS data set named MVTS that consists of 365 observations and 21 variables. The Date variable records the day in 2007 for which the data were recorded. The other 20 variables contain the mean delays, in minutes, experienced by an airline carrier for that day’s flights. The airline variables in the data are named by using two-digit airline carrier codes. For example, AA is the carrier code for American Airlines.
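
If you download the data, you can take a quick look at it before analyzing it. The following is a minimal sketch; the library path is hypothetical, and the libref dir matches the one used in the PROC IML code below:

/** assign a libref for the downloaded data (the path is hypothetical) **/
libname dir "C:\Data\Delays";

proc print data=dir.mvts(obs=3);
   var Date AA AQ HA WN;   /** the Date variable and a few of the carrier variables **/
run;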

The appendix of Marshall and Spiegelhalter's paper describes how to compute rankings with confidence intervals:

  1. Compute the mean delay for each airline in 2007.
  2. Compute the standard error for each mean.
  3. Use a simulation to compute the confidence intervals for the ranking.

Ordering Airlines by Mean Delays

You can use PROC MEANS or PROC IML to compute the means and standard errors for each carrier. Because simulation will be used in Step 3 to compute confidence intervals for the rankings, PROC IML is the natural choice for this analysis. The following statements compute the mean delay for all airlines (sometimes called the "grand mean") and the mean delays for each airline in 2007. The airlines are then sorted by their mean delays, as described in a previous article about ordering the columns of a matrix.

proc iml;
/** read mean delay for each day in 2007
    for each airline carrier **/
Carrier = {AA AQ AS B6 CO DL EV F9 FL HA
           MQ NW OH OO PE UA US WN XE YV};
use dir.mvts; 
read all var Carrier into x;
close dir.mvts;

grandMean = x[:];
meanDelay = x[:,];
call sortndx(idx, MeanDelay`, 1);

/** reorder columns **/
Carrier = Carrier[,idx];
x = x[,idx];
meanDelay = meanDelay[,idx];
print grandMean, carrier;


               grandMean
               9.6459671

                Carrier
AQ HA WN DL F9 FL PE OO AS XE CO YV ... AA EV
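
As noted above, PROC MEANS offers an alternative way to obtain the same means and standard errors; the following is a minimal sketch, shown only for comparison:

proc means data=dir.mvts mean stderr maxdec=3;
   var AA AQ AS B6 CO DL EV F9 FL HA MQ NW OH OO PE UA US WN XE YV;
run;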

You can use these computations to reorder the variables in the data set (this step is not shown) and use PROC SGPLOT to plot the means and standard errors for each airline carrier:

/** convert data from wide to long format **/
proc transpose data=mvts 
   out=Delay(rename=(col1=Delay)) name=Carrier;
by date;
run;

title "Comparison of Airline Daily Delays (2007)";
proc sgplot data=delay noautolegend;
  dot carrier / response=delay stat=mean limitstat=clm;
  refline 9.646 / axis=x lineattrs=(pattern=dash);
  xaxis label="Mean Delay with 95% Confidence Intervals";
  yaxis discreteorder=data label="Carrier"; 
run;

Notice that two airlines (Aloha Airlines (AQ) and Hawaiian Airlines (HA)) have much better on-time performance than the others. In fact, their average daily delay is negative, which means that they typically land a minute or two ahead of schedule!

Many analysts would end the analysis here by assigning ranks based on the mean statistics. But as I point out on my SSPA blog, it is important for rankings to reflect the uncertainty in the mean estimates.

Notice that for most airlines, the confidence interval overlaps with the confidence intervals of one or more other airlines. The "true mean" for each airline is unknown, but probably lies within the confidence intervals.

So, for example, Frontier Airlines (F9) is nominally ranked #5. But you can see from the overlapping confidence intervals that there is a chance that the true mean for Frontier Airlines is lower than the true mean of Delta Airlines (DL), which would make Frontier ranked #4. However, there is also a chance that the true mean could be greater than the true mean of Continental Airlines (CO), which would make Frontier ranked #11. How can you rank the airlines so that this uncertainty is reflected in the ranking?

The answer is in Step 3 of Marshall and Spiegelhalter's approach: use a simulation to compute the confidence intervals for the ranking. This step will appear in a second article, which I will post next week.

March 23, 2011
 
When you think of statistical process control, or SPC for short, what industry first comes to your mind? In the past 10 or 15 years, diverse industries have begun to standardize processes and administrative tasks with statistical process control. While the top two bars of the industrial Pareto chart are probably still manufacturing and machine maintenance, in North America, SPC is being used in far more types of work than many people realize.

One reason that researchers and process managers turn to Walter Shewhart’s trusty techniques to distinguish between common cause and special cause variation—in other words, variation that represents unimportant bumps on the road and variation that means that the wheels have fallen off the bus—is that they work far better than gut instinct in many cases. Furthermore, they are defensible, because they are based on sound, scientific practice. What might surprise you is the extent to which social science, human subject research, and administrative fields are making use of SPC.

1. Health Care. This is the area I hear the most buzz about for SPC. A great deal of interesting work is being done in monitoring process variables such as late payments, the number of available beds, payments to providers, script-writing rates for high-risk drugs, pain levels during recovery from surgery, and so on. I can only scratch the surface here. The writing is on the wall about health care becoming still more expensive, data are becoming more and more plentiful, and it is especially easy for problems to hide in massive data. Unpredictable problems = cost. Process control = savings.

2. Customer Service. How long did you have to wait in line the last time you were in a big box store? When you last called the cable company? Some companies recognize that if customer service is slow, you will take your business elsewhere. Some are even willing to take action on that knowledge. There are plenty of measurable service-quality characteristics that can be monitored to identify times of day or product lines that are inconsistent, which translates into a better customer experience and greater customer loyalty.

3. Survey Research. This is the one I’m most excited about right now. SPC interest in survey research has been on the rise for the past 5 or so years, and I think it’s an area ripe for this sort of analysis. If the news tells you that 52% of Americans who eat raw oysters only eat them to be polite, someone had to get that information. Enter the heroes of the survey world. Survey researchers, the people behind all the Things We Know about How We Are, are applying statistical process control methods with variables related to survey data collection, such as interview length, nonresponse rate, interviewer driving distances, and so on.
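
As an illustration of the kind of chart these applications rely on, here is a minimal sketch that simulates daily mean interview lengths and plots an individual measurements chart with the IRCHART statement in PROC SHEWHART (SAS/QC). The data and variable names are hypothetical:

/* simulate 30 days of mean interview lengths, in minutes (hypothetical data) */
data Interviews;
   call streaminit(123);
   do Day = 1 to 30;
      IntLength = 20 + 2*rand("Normal");   /* a stable process centered near 20 minutes */
      output;
   end;
run;

/* individual measurements and moving range charts */
proc shewhart data=Interviews;
   irchart IntLength * Day;
run;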

Continue reading "The Human Side of Statistical Process Control: Three Applications of SAS/QC You Might Not Have Thought About"
March 4, 2011
 
Do you have many points in your scatter plots that overlap each other? If so, your graph exhibits overplotting. Overplotting occurs when many points have similar coordinates. For example, the following scatter plot (which is produced by using the ODS statistical graphics procedure, SGPLOT) displays 12,000 points, many of which are plotted on top of each other.
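
The data set A is not shown. As a stand-in, here is a minimal sketch that simulates 12,000 points; the two-component mixture (with clusters near -3 and +3 on the horizontal axis) is an assumption based on the description of the density later in this post:

/** simulate 12,000 points (the mixture structure is assumed for illustration) **/
data A;
   call streaminit(1);
   do i = 1 to 12000;
      if rand("Bernoulli", 0.5) then x = rand("Normal", -3, 1);
      else                           x = rand("Normal",  3, 1);
      y = rand("Normal");
      output;
   end;
   drop i;
run;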

/** 12,000 points; notice the overplotting **/
proc sgplot data=A;
scatter x=x y=y / markerattrs=(symbol=CircleFilled);
run;


There are several ways to deal with overplotting. When the overplotting is caused because the data are rounded to a convenient unit (such as plot of height versus age), some analysts use jittering to reduce the overplotting. When the overplotting is caused by having too many data points, as in the previous scatter plot, you have to use an alternate approach.

One of the techniques I use is the "transparency trick" for visualizing scatter plots that suffer from a large amount of overplotting.

When a colleague recently came to me with a scatter plot that was dense with points, I suggested that he try the TRANSPARENCY= option in the SGPLOT procedure. This option enables you to specify a value between 0 and 1 that determines the transparency of the markers that are used in the scatter plot. A value of 0 indicates that all markers are opaque. As you increase the transparency factor, singleton markers fade away, whereas areas of the plot that have a high density of overlapping points remain dark.

That one option makes a huge difference in the plot and enables you to visualize areas that have a high density of points. For example, the following statements display a scatter plot of the same data, but the markers are highly transparent:

/** use transparency **/
proc sgplot data=A;
scatter x=x y=y / markerattrs=(symbol=CircleFilled) 
                  transparency=0.97;
run;


The new scatter plot makes it easy to see that there is a high density of points near -3 and +3 on the horizontal axis. Consequently, a scatter plot with a large transparency factor is sometimes called the "poor man's density estimator."

Of course, there is no real reason to use transparency to emulate density estimation, when there is a real density estimation procedure in SAS/STAT software. If you want to compute the density of the data, use the KDE procedure and request a contour plot of the data, as shown below:

proc kde data=A;
   bivar x y / plots=(contour);
run;


February 25, 2011
 
I was inspired by Chris Hemedinger's blog posts about his daughter's science fair project. Explaining statistics to a pre-teenager can be a humbling experience.

My 11-year-old son likes science. He recently set about trying to measure which of three projectile launchers is the most accurate. I think he wanted to determine their accuracy so that he would know which launcher to use when he wanted to attack his little sister, but I didn't ask. Sometimes it is better not to know.

He and his sister created a bullseye target. They then launched six projectiles from each of the three launchers and recorded the location of the projectile relative to the center of the target. In order to make the experiment repeatable, my kids aligned each launcher with the target and then tied it down so that it would not move during the six shots.

For each launcher, they averaged the six distances to the center. The launcher with the smallest average distance was declared to be The Most Accurate Launcher in the World.

In the accompanying figure, the red crosses represent the winning launcher. The average distance to the center is 5.2 centimeters, whereas the black squares have an average distance of 6.7 cm and the blue circles have an average distance of 13.1 cm. Three of the black squares have the same location and are marked by "(3)."

When I saw the target on which they had recorded their results, I had a hard time containing the teacher in me. This target shows two fundamental statistical concepts: bias and scatter.

Bias Explained

When I look at the target, the first characteristic I see is bias due to a misalignment of the launchers. Two of the three launchers are aimed too high relative to the bullseye. This kind of experimental error is sometimes known as systematic bias because it leads to measurements that are systematically too high or too low.

The second figure shows the average locations of projectiles for each launcher. The average location of the red crosses (marked "C2") is the closest to the center of the target. The next closest is the center of the black squares, which is marked "C1." The fact that "C1" and "C3" are far away from the center of the target shows the bias in the experiment.

I pointed this out to my son, who shrugged. "I guess the launchers weren't perfectly lined up," he said.

Thinking that perhaps he might want to realign the launchers and re-run the experiment, I asked him what he intended to do about this bias.

He hesitated. I could see the wheels turning in his mind.

"Aim low?" he guessed.


Scatter Explained

The second feature of the data that I see is the scatter of the projectiles. The way that the projectiles scatter is described by a statistical quantity known as the covariance. The covariance measures the spread in the horizontal direction, the vertical direction, and jointly along both directions.

I pointed out to my son that the red crosses tend to spread out in the up-down direction much more than they do in the left-right direction. In contrast, the black squares spread in both directions, but, more importantly, the directions tend to vary together. In four out of five cases, a shot that was high was also left of center. This is an example of correlation and covariance.

I asked my son to describe the spread of the blue circles. "Well," he cautiously began, "they vary a little bit up and down, but they are all over the place left-to-right."

"And do the shot locations tend to vary together?" I prompted.

"Um, not really," he replied.

Bingo! Covariance explained. All of the launchers show up-and-down and left-and-right variability, but the black squares also exhibit co-variability: the deviations in the horizontal direction are correlated with the deviations in the vertical direction.

For Adults Only

My son ran off to attack his sister, grateful to escape my merciless questioning.

I carefully measured the locations of the data, created a SAS DATA step, and began to analyze the data in SAS/IML Studio.

The third figure shows the center of each group of projectiles. I have also added ellipses that help to visualize the way that the projectiles scatter. (Each ellipse is a 75% prediction ellipse for bivariate normal data with covariance equal to the sample covariance.)

The ellipses make it easier to see that locations of the red crosses and blue circles do not vary together. The red ellipse is tall and skinny, which represents a small variation in the horizontal direction and a medium variation in the vertical direction. The blue ellipse is very wide, representing a large variation in the horizontal direction. The black ellipse is tilted, which shows the correlation between the horizontal and vertical directions.
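
For readers who want to experiment, the following is a minimal SAS/IML sketch of how a 75% prediction ellipse can be computed from a sample mean and covariance matrix. The shot coordinates are hypothetical; they are not the data from the experiment:

proc iml;
/** hypothetical (x, y) shot locations, in cm, for one launcher **/
xy = { 0  8,  -1 12,  1 15,  0 10,  -1  9,  1 13 };

center = xy[:,];                     /** sample mean of each coordinate **/
S = cov(xy);                         /** 2x2 sample covariance matrix **/
call eigen(lambda, E, S);            /** eigenvalues and eigenvectors of S **/

/** radius of a 75% prediction ellipse under bivariate normality **/
r = sqrt( quantile("ChiSq", 0.75, 2) );
pi = constant("pi");
theta = T( do(0, 2*pi, pi/50) );
circle = r # (cos(theta) || sin(theta));

/** scale by sqrt(eigenvalues), rotate by the eigenvectors, translate to the center **/
ellipse = circle # sqrt(lambda`) * E` + center;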

The programming techniques that are used to draw the figures are described in Chapters 9 and 10 of my book, Statistical Programming with SAS/IML Software.

February 19, 2011
 
Lunch. For some workers, it’s the sweetest part of an otherwise bitter day at the grindstone. Nothing can turn that sweetness sour like going into the breakroom to discover that someone has taken your lunch and eaten it themselves.

Nothing like that ever happens here at SAS.

But if it did, I would set up a system to repeatedly collect and identify the saliva of the top suspects, and do an elegant chemical analysis. When a lunch goes missing, there’s always some residual spit on the used container.

I could develop a discriminant analysis model to identify each suspect. Then I’d score newly missing lunches with the model, flag the culprit, track them down and make them buy a box of Belgian chocolates for the person whose lunch they pilfered.
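
A minimal sketch of what such an analysis might look like in SAS follows; the data set and variable names are hypothetical:

/* build a discriminant model from labeled saliva samples and score new evidence
   (all data set and variable names are hypothetical) */
proc discrim data=SalivaSamples testdata=MissingLunch testout=Scored;
   class Suspect;               /* known identity for each training sample */
   var Compound1-Compound5;     /* chemical measurements for each sample */
run;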

But what if I falsely accused someone who was innocent? Oh gosh. That could be an embarrassing and expensive error.

Let’s review how the discriminant analysis would look:


Continue reading "Who Ate My Lunch? Discriminant Thresholds to Reduce False Accusations"
February 18, 2011
 
The Flowing Data blog posted some data about how much TV actors get paid per episode. About a dozen folks have created various visualizations of the data (see the comments in the Flowing Data blog), several of them very glitzy and fancy.

One variable in the data is a categorical variable that specifies whether the actor appears in a TV comedy or a drama. It is interesting to compare the salaries for comedies with those for dramas.
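
The TV data set that is analyzed below is not shown. A hypothetical sketch of its structure, assuming that the Pay variable is recorded in units of $100,000 (consistent with the ENDPOINTS=0 TO 13 range used later), might look like this:

/* hypothetical structure and values; the real data are on the Flowing Data blog */
data TV;
   length Actor $20 ShowType $10;
   input Actor $ ShowType $ Pay;
   datalines;
Actor_A  Comedy  12.5
Actor_B  Drama    3.5
Actor_C  Comedy   1.0
Actor_D  Drama    0.8
;
run;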

One way to do this is by creating a comparative histogram. In SAS software, you can create a comparative histogram by using PROC UNIVARIATE with a CLASS statement and a HISTOGRAM statement. The ShowType variable is the classification variable; the Pay variable contains the pay per episode for each actor. The following statements create a comparative histogram:

ods graphics on;
proc univariate data=TV;
   class ShowType;
   histogram Pay / kernel(c=SJPI) endpoints=0 to 13;
run;


In the example, the bandwidth is chosen by using the Sheather-Jones plug-in method because the default bandwidth selection algorithm seems to oversmooth these data. A smaller bandwidth helps detect the cluster of actors who earn close to $100,000 per episode, and another cluster that earns $400,000 per episode.

I also used the ENDPOINTS= option to explicitly override the default bin-width selection algorithm.

Notice that the comparative histogram automatically labels each histogram according to the value of the classification variable, and automatically equates the scales of all histograms for accurate visual comparison of the distribution of the sample. I can use the same code regardless of the number of categories in the classification variable.

This is the beauty of ODS graphics: the procedures automatically create graphs that are appropriate for an analysis.

Oh, and who is the outlier in the comedy group? That's Charlie Sheen. He makes $1.25 million per episode. I think his show, Two and a Half Men, should be renamed Two and a Half Times the Salary, Man!

February 11, 2011
 
In a previous blog post, I described the rules for a tic-tac-toe scratch-off lottery game and showed that it is a bad idea to generate the game tickets by using a scheme that uses equal probabilities. Instead, cells that yield large cash awards must be assigned a small probability of getting a tic-tac-toe and cells that yield small cash awards can be assigned larger probabilities. As Mohan Srivastava, the statistician who "cracked the scratch lottery code," said:
It would be really nice if the computer could just spit out random digits. But that’s not possible, since the lottery corporation needs to control the number of winning tickets.
This article describes how to construct a scratch-off game that has lots of winners of small prizes, but is still profitable for the lottery corporation (the government) because the large prizes occur infrequently.

A Cell-By-Cell Probability Model for a Scratch-Off Game

The lottery corporation wants to give away many small prizes (so that the game is fun to play) while restricting the number of large payouts (so that the game is profitable).

Recall that a ticket is a winner if a set of "lucky numbers" aligns along a row, column, or diagonal. For convenience, represent the lucky numbers by a 1 and the other numbers by a 0. The key is to assign each cell in the grid a certain probability of being a 1. For example, consider the following SAS/IML matrix of probabilities:

proc iml;
p = { 1.0  1.0  0.1,
      1.0  0.02 0.2,
      0.25 0.1  0.1 };

If you assign a 1 to a grid cell based on these probabilities, the resulting grids have the following properties:

  • A 1 is always placed in the first two positions of the first column (because the chance of placing a 1 there is 100%).
  • The probability of getting a winning tic-tac-toe in the first column is 0.25, the product of the probabilities in the first column. This means that one fourth of all gamers will get their $3 investment back.
  • A 1 is always placed in the first two positions of the first row.
  • The probability of getting a winning tic-tac-toe in the first row is 0.1. This means that one tenth of all gamers will get a $5 payout, for a net gain of $2.
  • Other ways of winning the game occur much less frequently.

The probabilities are assigned to lessen the chance of paying the big cash awards. For example, the middle cell gets a lucky number only 2% of the time, so winning the game by having all 1s on a diagonal (the big $250 prize!) will not happen often.

How Much and How Often Would You Win This Game?

In order to find the expected payout (in dollars) for each combination, you can compute the products of the row, column, and diagonal probabilities and multiply those values by the corresponding payouts, as shown in the following SAS/IML statements:

rowPay = p[,#] # {5,10,100};
colPay = p[#,] # {3 20 100};
diagPay = p[{1 5 9}][#] # 250;
antiDiagPay = p[{3 5 7}][#] #250;
print rowPay, colPay, diagPay antiDiagPay;

rowPay
   0.5
  0.04
  0.25

colPay
  0.75      0.04       0.2

diagPay antiDiagPay
    0.5       0.125

The expected payout for the entire game is the sum of these individual expected payouts. To compute the expected gain/loss for playing the game, subtract the initial $3 purchase price:

ExpGain = -3 + sum(rowPay, colPay, diagPay, antiDiagPay);
print ExpGain;

ExpGain

-0.595

Scratch, Scratch, Scratch: Simulating Many Games

For people more comfortable with simulation than with probability theory, you can also write a SAS/IML simulation that "plays" the game many times. The statements call the TestWinner module from my previous blog post.

NSim = 1E5;
x = j(NSim,9);
y = j(NSim,1);
/** each cell has a different probability.
    Sample each of the 9 cells. **/
do i = 1 to 9;
   call randgen(y, "BERNOULLI", p[i]);
   x[,i] = y;
end;
win = TestWinner(x);
ExpWin = win[:];
PctWin = (win>=0)[:];
print ExpWin PctWin;

   ExpWin    PctWin
 -0.58358   0.33059

Notice that this game pays out an award 33% of the time, so it gives the players (false) hope and keeps them coming back. (Of course, for most of those "wins," you only get your money back!) The simulated result estimates that, on average, you will lose $0.58 every time you play the game. (The true expected loss is $0.595.)

Unless, of course, you know Srivastava's trick.

Choosing the Probabilities: A Better Way?

One thing bothers me about the probabilities that I chose: the probability of winning a $10 or $20 award is too small. In order to lower the chance of winning the big $250 award (getting tic-tac-toe on the diagonal), I chose a small probability for the middle cell. But this also lowers the probability of getting tic-tac-toe on the middle row and middle column.

What scheme would you use so that the chance of winning an award is a decreasing function of the size of the award?

February 10, 2011
 
Because of this week's story about a geostatistician, Mohan Srivastava, who figured out how to predict winning tickets in a scratch-off lottery, I've been thinking about scratch-off games. He discovered how to predict winners when he began to "wonder how they make these [games]."

Each ticket has a set of "lucky numbers." A ticket is a winner if the lucky numbers align on a tic-tac-toe grid along any row, column, or diagonal. The scratch-off portion of the game involves scratching an area on the ticket to reveal the lucky numbers. See my previous blog article for further details.

Srivastava's observation was that the "lucky numbers" on the tic-tac-toe grid have to occur less frequently than the other numbers. Otherwise, the game will pay out more money than it takes in.

Simplifying the Game for Analysis

How might someone design a tic-tac-toe scratch-off game? I'll simplify the problem by eliminating the lucky numbers, the scratch-off portion of the ticket, and all but one 3x3 grid. Represent all the lucky numbers by a 1, and the remaining numbers by a 0. In this scheme, a card is a winner if it has a row, column, or diagonal of all 1s.

The game costs $3 to play, and pays out the following amounts:

  • If the first row contains all 1s, the payout is $5.
  • If the second row contains all 1s, the payout is $10.
  • If the third row contains all 1s, the payout is $100.
  • If the first column contains all 1s, the payout is $3 (that is, you get your money back).
  • If the second column contains all 1s, the payout is $20.
  • If the third column contains all 1s, the payout is $100.
  • If the either of the diagonals contain all 1s, the payout is $250.

What If Lucky Numbers Are Assigned with Equal Probability?

What would happen if the 0s and 1s are assigned to each cell with equal probability? For example, suppose that you specify a 15% chance that any individual cell is assigned a 1. What is the expected gain/loss for each ticket, and what percentage of tickets will be winners?

The answer is provided by the following simulation and graph. (It is possible to get an exact solution, but I'm feeling lazy.) The simulation constructs a large number of randomly assigned grids. (Each grid is represented by a row of the x matrix.) Each cell of the grid has a certain probability of being a 1. The TestWinner module determines the gain or loss for each grid by determining whether there is a row, column, or diagonal that contains all 1s. The simulation is repeated for a range of probability values, from a 10% probability to a 30% probability.

proc iml;
/** Given N grids (represented by rows of x), 
    determine the payout for each grid. **/ 
start TestWinner(x);
   w = j(nrow(x),1, -3); /** gain: initialize to -3 **/
   do i = 1 to nrow(x);
      if all(x[i,1:3])     then w[i] = w[i]+5;
      if all(x[i,4:6])     then w[i] = w[i]+10;
      if all(x[i,7:9])     then w[i] = w[i]+100;
      if all(x[i,{1 4 7}]) then w[i] = w[i]+3;
      if all(x[i,{2 5 8}]) then w[i] = w[i]+20;
      if all(x[i,{3 6 9}]) then w[i] = w[i]+100;
      if all(x[i,{1 5 9}]) then w[i] = w[i]+250;
      if all(x[i,{3 5 7}]) then w[i] = w[i]+250;
   end;
   return(w);
finish;

call randseed(54321);
NSim = 1E5;
x = j(NSim,9); /** each row of x is a grid **/

p = do(0.1, 0.3, 0.01);
pctWin = j(ncol(p),1);
ExpWin = pctWin;
do i = 1 to ncol(p);
   call randgen(x, "BERNOULLI", p[i]);
   win = TestWinner(x);
   ExpWin[i] = win[:];
   PctWin[i] = (win>=0)[:];
end;
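
The statements that create the graph are not shown in this excerpt. A minimal sketch that writes the simulation results to a data set and plots them with PROC SGPLOT (the data set and variable names are my own, and the PROC SGPLOT step is submitted after the PROC IML program has finished) might look like this:

/** write the simulation results to a SAS data set (names are my own) **/
results = p` || ExpWin || PctWin;
create ScratchSim from results[colname={"Prob" "ExpWin" "PctWin"}];
append from results;
close ScratchSim;

/** after the PROC IML session ends, plot the expected gain **/
proc sgplot data=ScratchSim;
   series x=Prob y=ExpWin / markers;
   refline 0 / axis=y lineattrs=(pattern=dash);
   xaxis label="Probability that a Cell Contains a 1";
   yaxis label="Expected Gain per Ticket ($)";
run;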

The graph shows the expected gain or loss when cells in the tic-tac-toe grid are randomly assigned according to a fixed probability value. The graph shows that the lottery corporation (that is, the government) cannot make a profit on games for which the chance of a cell being a 1 is more than 15%. However, profitable games that are constructed like this have very few winners. For example, when the chance of a 1 is 15%, only 2.5% of those playing the game get a winning tic-tac-toe combination.

This is a problem for the lottery corporation, because people don't play games that they usually lose. The key to getting people to play a scratch-off game (and slot machines, and other gambling games) is to award many small prizes, but few large prizes. If the 0s and 1s are assigned to each cell with equal probability, then the large cash awards have the same chance of occurring as the smaller awards. The chance of winning has to be set very small (too small!) in order to manage the losses.

This hypothetical analysis shows that it is a bad idea to generate a tic-tac-toe game by using equal probabilities. Instead, cells that contribute to the large cash awards (the diagonals) must be assigned a small probability of getting the "lucky numbers," whereas cells that contribute to small cash awards (the first row and column) can be assigned larger probabilities.

By controlling the probability that each cell contains a lucky number, the lottery corporation can control the chance of winning each prize. It can guarantee a profit for the game, while awarding many small prizes that keep customers coming back to play again.

Srivastava recognized this fact. In the Wired article he is quoted as saying:

It would be really nice if the computer could just spit out random digits. But that’s not possible, since the lottery corporation needs to control the number of winning tickets.

Distribution of the Lucky Numbers

As a statistical footnote to this analysis, you can construct the frequency distribution of the number of 1s in a tic-tac-toe grid that uses a random uniform assignment for each cell. The count of 1s follows a binomial distribution. For example, if the probability that a cell contains a 1 is 15%, the following SAS/IML statements compute the frequency distribution of 1s in the grid:

/** distribution of 1s in a 3x3 grid in which 
    each cell has a 15% chance of a 1 **/
k = 0:9;
pdf = pdf("Binomial", k, 0.15, 9);
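
The statements that create the scatter plot are not shown. A minimal sketch that writes the distribution to a data set and plots it (the data set name is my own) is:

/** write the distribution to a data set and plot it (names are my own) **/
out = T(k) || T(pdf);
create Binom from out[colname={"k" "pdf"}];
append from out;
close Binom;

proc sgplot data=Binom;
   scatter x=k y=pdf;
   xaxis label="Number of 1s in the Grid" integer;
   yaxis label="Probability";
run;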

The scatter plot shows the probability that a tic-tac-toe grid will have a given number, k, of 1s for k=0, 1, 2,...,9, when there is a 15% chance that a given cell contains a 1. The graph shows that most grids (86%) that are constructed under this scheme will have zero, one, or two 1s. Obviously, none of these grids can be a winning ticket.

Once again, the conclusion is that a game that is constructed using this scheme would not be fun to play: the "lucky numbers" don't appear often enough! In order to increase the frequency of lucky numbers while managing the risk of the tickets that pay out, you need to use a construction method that assigns different probabilities to each cell of the grid.

I'll describe that approach tomorrow.