data analysis

10月 052011

When you misspell a word on your mobile device or in a word-processing program, the software might "autocorrect" your mistake. This can lead to some funny mistakes, such as the following:

I hate Twitter's autocorrect, although changing "extreme couponing" to "extreme coupling" did make THAT tweet more interesting. [@AnnMariaStat]

When you type "hte," how does the software know that you probably meant "the"? Even more impressively, when you type "satitsisc" into Microsoft Word, how does it create a list of the words you might have meant, a list that includes "statistics"?

For that matter, have you every mistyped the name of a SAS procedure or an option, only to discover that your program still ran? For example, if you type dat a; run; the program will run, although the SAS log will warn you of the misspelling: "WARNING 14-169: Assuming the symbol DATA was misspelled as dat." How can SAS determine that what you typed is close enough to an actual keyword that it assumes that the keyword was misspelled?

There are various techniques for correcting misspellings, but most of them involve a notion of the distance between words. Given a string of letters, you can define a number that relates how "close" the string is to other words (for example, in a dictionary). The string that is typed is called the keyword. The words in the dictionary are called query words. If the keyword is not in the dictionary, then you can list the query words that are close to the keyword. Often the keyword is a misspelling of one of the words on this list.

Base SAS has several functions that you can use to compare how close words are to each other. In this post, I describe the SPEDIS function, which I assume is an acronym for "spelling distance." Similar SAS functions include the COMPLEV function, the COMPGED function, and the SOUNDEX function.

What words are close to "satitsisc"

The SPEDIS function computes the "cost" of converting the keyword to a query word. Each operation—deletion, insertion, transposition, and so forth—is assigned a certain cost.

As an example, type the letters "satitsisc" into Microsoft Word and look at the top words that Word suggests for the misspelling. The words are statistics, statistic, and sadistic. The words satisfy and Saturday are not on the list of suggested words. By using the SAS DATA step or the SAS/IML language, you can examine how close each of these words are to the garbled keyword, as measured by the SPEDIS function:

proc iml;
keyword = "satitsisc";
query = {"statistics" "statistic" "sadistic" "satisfy" "Saturday"};
cost = spedis(query, keyword);
print cost;

According to the SPEDIS function, the words suggested by MS Word (the first three query words) have a lower cost than satisfy and Saturday, which were not suggested.

The cost returned by the SPEDIS function is not symmetric. For example, the cost of converting sadistic to statistic is different from the cost of converting statistic to sadistic, because inverse operations have different costs. For example, the cost of inserting a letter is different from the cost of deleting a letter.

This lack of symmetry means that the SPEDIS function does not return a true distance between words, although some people refer to the cost matrix as an "asymmetric distance." You can use the SAS/IML language to compute the cost matrix, in which the ijth element is the cost of converting the ith word into the jth word:

words = query || keyword;
n = ncol(words);
/* compute (n x n) cost matrix (not symmetric) */
cost = j(n,n,0);
do i = 1 to ncol(words);
   key = words[i];
   cost[i,] = spedis(words, key);
print cost[c=words r=words];

It is possible to visualize the distance between words by using the MDS procedure to compute how to position the words in the plane so that the planar distances most nearly match the (scaled) values in the cost matrix:

/* write SAS/IML cost matrix to data set */
create cost(type=DISTANCE) from cost[c=words r=words];
append from cost[r=words];
/* find 2D configuration of points */
ods graphics on;
proc mds data=cost dimension=2 shape=square;
subject words;

PROC MDS creates a graph that is similar to the one shown. Notice that the cluster in the upper right consists of the keyword and the words that MS Word suggests. These words are close to the keyword. On the other hand, the two words that were not suggested by MS Word are situated far from the keyword.

Further Reading

This post was inspired by Charlie Huang's interesting article titled Top 10 most powerful functions for PROC SQL. The SPEDIS and SOUNDEX functions are listed in item #6.

If you want to read funny examples of mistakes in text messages that were caused by the autocorrect feature of the iPhone, do an internet search for "funny iPhone autocorrect mistakes." Be aware, however, that many of the examples involve adult humor.

You can use the SPEDIS function to do fuzzy matching, as shown in the following papers:

tags: Data Analysis, Statistical Programming
9月 232011

In my previous post, I blogged about how to sample from a finite mixture distribution. I showed how to simulate variables from populations that are composed of two or more subpopulations. Modeling a response variable as a mixture distribution is an active area of statistics, as judged by many talks on the topic at JSM 2011. When SAS 9.3 shipped, I promised to discuss my experiences with the new experimental FMM procedure, which fits finite mixture models. This article describes modeling univariate data as a mixture of normal distributions.

Initial thoughts and misconceptions

I confess: I don't know much about finite mixture models. I read some of the documentation and followed some of the examples, but I've barely begun to explore the capabilities of PROC FMM. (Thanks to my colleagues, Dave and John, for reviewing this article.)

The FMM procedure in SAS/STAT software is not a discrimination procedure. When you use the CANDISC and DISCRIM procedures, you must know the group membership for each observation in order to build a model that classifies the group membership of future observations. Similarly, most SAS regression procedures enable you to use a CLASS statement to build a model that accounts for group membership. The beauty of the FMM procedure is that you don't need this knowledge, and yet you can still build a model for response variables that are multi-modal. (You can, however, supply partial group information when you know it by using the PARTIAL option.) In my mind, I pretend that PROC FMM models the components as latent variables, since I'm familiar with that concept.

The way that the FMM procedure models a response variable reminds me of clustering procedures: You tell the FMM procedure that the response variable consists of k components, and the procedure tries to use the data to discover (and model) the group structure as part of building a regression model. This means that there are many parameters in this problem: there are the usual parameters for the regression effects, but there are also parameters that estimate the location, shape, and strength of each component.

A first example: Modeling the distribution of Scrabble® scores

As a first example, I want to see whether the FMM procedure can model a univariate distribution as a mixture of normal densities. The data are the scores for each word played in series of a Scrabble games between my mother and me.

As I wrote previously, the distribution of scores seems to have three components. Most scores are in the 12–15 point range, and correspond to standard plays. There is a small group of larger scores that correspond to triple word scores. For my mother's scores, there is a third group of small and negative scores that correspond to plays made at the end of a game. You can download the data for these Scrabble games.

For the games recorded here, my scores had few end-of-game penalties. Consequently, I assume that my scores fall into two groups. I use the FMM procedure with the K=2 option to model my scores as a mixture of two normal distributions. Notice that I specify the response variable in the MODEL statement, but do not specify any covariate effects, so this is a model for the mean of the SCORE variable:

proc fmm data=scrabble(where=(Player=2)); /* Rick's scores */
model Score = / k=2;

The analysis models the distribution of scores as a mixture of two normal densities. The first component has mean 12.7, which represents my average score on a "typical" word play. The variance of this distribution is 44.6, which corresponds to a standard deviation of 6.7 points per play. The second component has a mean of 37.3 and a variance of 8.8 (a standard deviation just under 3 points), and represents my triple word score plays.

The blue curve, which is almost not visible, is the overall density estimate. It is a weighted sum of the two component densities. The weights sum to unity, so there is one parameter, called the mixing parameter, that needs to be estimated by the FMM procedure in addition to the two location and scale parameters for each component. For these data, the mixing probability estimate is 2.86. This estimate is on a logit scale, which means it corresponds to a mixing probability of exp(2.86)/(exp(2.86) +exp(0)) = 0.946. This number is shown in the "Probability" column of the second table. Equivalently, the estimated PDF is 0.95*f1 + 0.05*f2, where f1 is the PDF for the first component and f2 is the PDF for the second component. These weights make sense because my mother always "steals" the triple word plays from me, which means that I have few opportunities to play a triple word score!

Now I'll try to estimate my mother's scores as a mixture of three normal densities by using the following call to the FMM procedure:

proc fmm data=scrabble(where=(Player=1)); /* Mom's scores */
model Score = / k=3;

Unfortunately, this approach doesn't work very well for my mother's data. The FMM procedure converges to a set of parameters that correspond to a local maximum of the likelihood function. This solution models the data as a mixture of three distributions that are practically identical in terms of location and scale.

Fortunately, you can give "hints" to the FMM procedure by specifying starting values for the parameter estimates. The following call specifies a guess for the location and scale parameters, based on my knowledge of the game of Scrabble:

proc fmm data=scrabble(where=(Player=1));
model Score = / k=3 
      parameters(-5 4, 13 49, 33 9); /* means: -5, 13, 33 */

This time, the FMM procedure converges to a model that satisfies my prior beliefs. The means of the three components are at -6.3, 12.6, and 27.4. The mixing probability estimates are shown in the "Probabilities" column. They are 0.04 and 0.82, which means that the third mixing probability estimate is 0.14. Notice that the triple-word-score component for my mother accounts for 14% of the probability, whereas it accounts for only 5% of my scores.

Conclusions and gotchas

The FMM procedure did a good job of finding intuitive components for Scrabble scores. For my scores, it found two normal components such that the density of scores is a weighted sum of the components. For my mother's scores, it found three components, although I had to supply a hint. Notice that the procedure needs to estimate eight parameters in order to solve the three-component problem!

There were three lessons I learned from this exercise. The first two are "gotchas" regarding the 9.3 syntax:

  • Unlike some other SAS functions, the FMM procedure represents scale in terms of the variance. If you are used to specifying the scale parameter as a standard deviation in the RAND, PDF, and CDF functions and in the UNIVARIATE and SGPLOT procedures, this convention might cause momentary confusion.
  • Unlike the HISTOGRAM statement in the UNIVARIATE procedure, in which you specify all of the means followed by all of the standard deviations, the syntax of the PARAMETERS statement in the FMM procedure requires that you specify the mean and variance of each component in sequence.

The third lesson concerns the limitations of the FMM procedure. If two components of the response distribution are close to each other, then it is unreasonable to expect the FMM procedure to be able to model them as separate components. This is similar to the fact that clustering procedures have difficulties finding poorly separated clusters. I'll discuss this issue further in a future article.

tags: Data Analysis
9月 162011

Last week I showed a graph of the number of US births for each day in 2002, which shows a strong day-of-the-week effect. The graph also shows that the number of births on a given day is affected by US holidays. This blog post looks closer at the holiday effect. I actually conducted this analysis in 2009 for my book, but decided not to include it.

I want to identify days in 2002 that have fewer births than would be expected, given the day of the week. A box plot is often used for this sort of exploratory data analysis. The following statements use the VBOX statement in the SGPLOT procedure to create a box plot for each day of the week and to label the outliers for each day:

proc sgplot data=birthdays2002;
  title "US Births by Day of Week (2002)";
  vbox Percentage / category=Day datalabel=Date;
  yaxis grid; xaxis display=(nolabel) discreteorder=data;

In the box plots (click to enlarge), the outliers for each day of the week are labeled by using values of the Date variable. Each date belongs to one of the following categories: US holidays, days near holidays, and inauspicious days.

US holidays

Several US holidays in 2002 are responsible for lower than expected births, given the day of the week:

  • Tuesday, 01JAN (New Year's Day)
  • Monday, 27MAY (Memorial Day)
  • Thursday, 04JUL (Independence Day)
  • Monday, 02SEP (Labor Day)
  • Thursday, 28NOV (Thanksgiving Day)
  • Wednesday, 25DEC (Christmas Day)

Christmas Day is the day on which the fewest babies were born.

Several "minor" holidays on Mondays also exhibit slightly smaller-than-expected births. These are not visible in the box plot graph, but can be seen in the time series graph: 21JAN (Birthday of Martin Luther King, Jr.), 18FEB (Washington's Birthday, sometimes known as "President's Day"), 14OCT (Columbus Day), and 11NOV (Veterans Day).

Days near holidays

Families often travel on days near holidays, and that includes doctors and other hospital staff. Several of these days are visible as outliers in the birth data.

  • Wednesday, 02JAN (day after New Year's Day)
  • Friday, 29NOV (day after Thanksgiving Day)
  • Tuesday, 24DEC (Christmas Eve)
  • Thursday, 26DEC (day after Christmas Day)

Friday, 03JUL (the day prior to Independence Day), also exhibits smaller-than-expected births, as seen in the time series graph.

Inauspicious days

The following dates are also outliers:

  • Monday, 01APR (April Fool's Day)
  • Thursday, 31OCT (Halloween Day)

Most parents don't want their child to be teased for being an "April Fool" all his life. It is less clear why a couple would avoid giving birth on Halloween. Superstition? Maybe. Or maybe doctors don't induce deliveries on Halloween so that they can be home for trick-or-treating?

These days might not be preferred for giving birth, but these are both blog-able holidays: I've written Halloween posts and April Fool posts.

Interestingly, for leap years, 29FEB also falls into the "inauspicious day" category. I guess parents avoid that date because the poor child would only get birthday parties every four years? Personally, I think it would be fun to be born on a leap day. And think how impressed people would be when I brag that I completed college before I celebrated my eighth birthday!

9月 132011

My friend Chris posted an analysis of the distribution of birthdays for 236 of his Facebook friends. He noted that more of his friends have birthdays in April than in September. The numbers were 28 for April, but only 25 for September. As I reported in my post on "the most likely birthday in the US," September is supposedly the king of birthday months, as judged by the analysis of birth records at the National Center for Health Statistics (NCHS) (Martin, et al., 2006).

Does Chris's analysis prove that the government researchers are wrong? Has he exposed a sinister government conspiracy? No, but I'd like to use his analysis to discuss two essential principles of statistics:

  1. When you have a sample, you should always ask yourself "What is the population from which this sample is drawn?"
  2. Samples exhibit variability. Small samples are more variable than larger samples.

I could also discuss a third principal, the distinction between "different" and "significantly different," but I'll leave that discussion for another time.

What is the population?

The NCHS report was about birth rates for children born in the US in the late 1990s and early 2000s. Unless Chris's Facebook friends are tweeners and teenagers (insert inappropriate joke here), I think we can assume that Chris's sample is from a different population than the population in the report by Martin, et al.

Chris's data are probably from a population such as "people born between 1930 and 1970" and it is conceivable that the distribution of birthdays in this population is different than the distribution of US birth at the turn of the millennium. For instance, the Great Depression, World War II, and the post-war Baby Boom might have influenced the distribution of birthdays. Furthermore, Chris's friends might not be US-born. Birth rates for countries in the southern hemisphere might be different than those reported for US births by the NCHS. If Chris has a lot of friends from Australian and New Zealand, that could skew the distribution of birthdays.

Sampling variation

A more likely explanation is probably plain ol' sampling variation. It is well known that small samples have more variability than larger samples. If I select at random 10 children born in 2002, I might end up with two children born in April and none born in September. Chris has 236 friends who share their birthday information. With a sample this large, you can get a distribution of birthdays that is probably close to the distribution of birthdays in the population, but you still might get a few more April birthdays than expected, given the population distribution.

An analysis of birthdays at a large US software company

One way to deal with sampling variation is to use a larger sample. It might be a while before Chris has thousands of Facebook friends, and I don't want to wait, so I'll use the birthdays of 3,061 employees who work at a major US-based software company. I wrote a SAS program to collate the birthday information. The following statements use the FREQ procedure to analyze and plot the distribution of birthdays by month:

proc freq data=Birthdays;
tables Month / plots=FreqPlot(scale=percent);

The image shows that September birthdays occur most frequently in the sample of 3,061 employees, but it also shows an apparent "bump" in the April–May. The difference between these peak months is small, about 1%, so a random sample with, say, 263 observations can easily contain more spring birthdays than autumn birthdays.

With more work, you can analyze the birthdays by week and fit a periodic smoother to the weekly frequencies of birthdays. This is shown in the following graph:

The two curves smooth the scatter plot based on a local weighted fit of 12 (blue-gray) or six (red) nearest neighbors. The red curve shows an annual seasonality of fewer births in the winter and more in the late summer and early fall. The blue-gray curve shows that the data might have a smaller cycle with a minor peak near weeks 18–20, in addition to the major peak near weeks 36–38. Both of these curves are consistent with the NCHS report, although the data are not drawn from the same population.

Notice the variability from week to week in this graph. Even with a sample of 3,061, the proportion of birthdays per week varies by ±30%, whereas the 4 million birth records from 2002 show a much smaller week-to-week-variability. For the 2002 data, it was easy to "eyeball" a curve that fit the data. For the birthdays of employees, I would be hard-pressed to predict what a smoothing curve might look like.

If anyone else wants to analyze the 3,061 birthdays, you can download the data as a set of comma-separated values.

9月 022011

My elderly mother enjoys playing Scrabble®. The only problem is that my father and most of my siblings won't play with her because she beats them all the time! Consequently, my mother is always excited when I visit because I'll play a few Scrabble games with her.

During a recent visit, I managed to squeak out a couple of rare victories. (The key is to play when she's tired, like right before naptime!) The following panel shows a concise summary of our games (click to enlarge):

The graphic display has the following features:

  • You can see the score after each turn.
  • You can see the number of turns in each game (16–20).
  • The average slope of the line indicates the average number of points per word (about 14 points).
  • The gap between the curves shows the difference in scores at each turn.

In particular, you can read off some important plays for each game:

  • In Game 1, I was losing the entire game. I almost caught up with a 36-point play at Turn 16, but it was too little, too late.
  • In Game 2, the lead changed hands five times, as seen by the crossing of the curves.
  • In Game 3, I was holding my own until Mom crushed my hope with a 33-point word on Turn 11.
  • In Game 4, I cruised to victory after playing a 37-point word in Turn 4.

I call this a "play-by-play chart." A few years ago, I saw this display used very effectively to show the minute-by-minute evolution of the score of a basketball game, but I can't remember where I saw it.

This display is best used to show the progression of scores for games that feature a lot of scoring, and especially games in which the points per turn can vary. For example, the chart is effective in basketball for seeing the effect of three-point plays and foul shots on the score. This would also be a good display for bowling, golf, and "target shooting" games such as darts and archery. I wouldn't use it for baseball or soccer.

The chart is easy to construct by using the SGPANEL procedure. I added semi-transparent observations in the background, but you can omit the SCATTER statement if you want to see only the lines.

proc sgpanel data=Scrabble noautolegend;
panelby Game;
scatter x=Turn y=Cumul / group=Player transparency=0.75;
series x=Turn y=Cumul / group=Player curvelabel;
rowaxis integer grid;
colaxis grid;

If you'd like to try an alternate visualization, you can download the data.

In addition to this play-by-play chart, you might be interested in the distribution of each player's word scores over multiple games. You can use the SGPLOT procedure for each player, or you can combine histograms into a single paneled display by using PROC SGPANEL as follows:

proc sgpanel data=Scrabble noautolegend;
panelby Player;
histogram Score;
density Score / type=kernel;

The distributions of scores reveal several differences in playing styles and strategies:

  • For both players, the typical play scores 12–15 points.
  • I tend to use all my tiles by the end of the game. My mother has tiles left over, which results in a penalty as seen by her negative scores at the end of games.
  • Both distributions show a "triple word score" peak in the density. Hers is taller than mine is, which indicates that she plays more of the triple word scores. However, my peak is farther to the right, which indicates that my triple word scores tend to be larger. My mother plays a defensive game. If a triple word score space is playable, she grabs it, even if it results in a paltry score for her. (This aggravates my father!)

I only play Scrabble recreationally, but the SGSPANEL procedure enables me to visualize my scores and analyze my playing style. For me, Scrabble is fun to play... and fun to analyze.

8月 262011

Exploring correlation between variables is an important part of exploratory data analysis. Before you start to model data, it is a good idea to visualize how variables related to one another. Zach Mayer, on his Modern Toolmaking blog, posted code that shows how to display and visualize correlations in R. This is such a useful task that I want to repeat it in SAS software.

Basic correlations and a scatter plot matrix

Mayer's used Fisher's iris data for his example, so I will, too. The following statement uses the CORR procedure to compute the correlation matrix and display a scatter plot matrix for the numerical variables in the data:

proc corr data=sashelp.iris plots=matrix(histogram); 

Notice that by omitting the VAR statement, the CORR procedure analyzes all numerical variables in the data set.

The PLOTS=MATRIX option displays a scatter plot matrix. In SAS 9.3, ODS graphics are turned on by default. (In SAS 9.2, you need to submit ODS graphics on; prior to the PROC CORR statement.) The result is a "quick-and-dirty" visualization of pairwise relationships and the distribution of each variable (along the diagonal). This is the beauty of ODS graphics: the procedures automatically create graphs that are appropriate for an analysis.

A fancier a scatter plot matrix

Mayer also showed some fancier graphs. You can use the SGSCATTER procedure to re-plot the data, but with observations colored according to values of the Species variable, and with a kernel density estimate overlaid on the histogram.

proc sgscatter data=sashelp.iris; 
matrix SepalLength--PetalLength /group=Species diagonal=(histogram kernel);

Notice how I specified the variables. Did you know that you can specify a range of consecutive variables by using a double-dash? This SAS syntax isn't widely known, but can be very useful.

More options, more details

I don't usually add smoothers to my scatter plot matrices because I think it gives the false impression that certain variables are response variables. I prefer to focus on correlation first and save modeling for later in the analysis. However, Mayer showed some loess smoothers on his plots, so I feel obligated to show SAS users how to produce similar displays.

The observant reader will have noticed that there are no scales or tick marks on the scatter plot matrices that I've shown so far. The reason is that axes and scales can distract from the primary goal of the exploratory analysis, which is to give an overview of the data and to see potential pairwise relationships. In Tufte's jargon, the scatter plot matrices that I've shown have a large data-to-ink ratio (Ch. 4, The Visual Display of Quantitative Information).

However, scatter plot matrices also can serve another purpose. During the modeling phase of data analysis they can serve as small multiples that enable you to quickly compare and contrast a sequence of related displays. In this context, scales, tick marks, and statistical smoothers are more relevant.

In general, you can use the SGPANEL procedure to display small multiples. However, I'll use the SGSCATTER procedure again to show how you can add more details to the display. Instead of using the MATRIX statement, I will use the PLOT statement to control exactly with pairs of variables I want to plot. If I think that PetalWidth variable explains the other variables, I can use the LOESS option to add a loess smoother to the scatter plots, as shown in the following example:

proc sgscatter data=sashelp.iris; 
plot (SepalLength SepalWidth PetalLength)*PetalWidth /
   group=Species loess rows=1 grid;

Notice that the loess smoothers are added for each group because the GROUP= option is specified. If, instead, you want to smooth the data regardless of the group variable, you can specify the LOESS=(NOGROUP) option, which produces smoothers similar to those shown by Mayer.

8月 152011

This article describes the SAS/IML CHOOSE function: how it works, how it doesn't work, and how to use it to make your SAS/IML programs more compact. In particular, the CHOOSE function has a potential "gotcha!" that you need to understand if you want your program to perform as expected.

What the CHOOSE function does

The CHOOSE function is like a compact alternative to a compound assignment statement. In other words, it can eliminate an IF-THEN/ELSE statement. The CHOOSE function takes three arguments: a condition to evaluate as true or false, a value to use when the condition is true, and a value to use when the condition is false. Each argument can be vector-valued. For example, the following statements use the CHOOSE function to return the absolute value of a vector:

proc iml;
x = -2:2;
absX = choose( x<=0, -x, x );

Of course, using the ABS function is more efficient, but this illustrative example is easy to understand. The CHOOSE function receives three vectors as arguments: a vector of zeros and ones, a vector that contains the negative of x, and the vector x. The CHOOSE function examines each element of the first argument. If an element is nonzero, it returns the corresponding element of the second argument. If an element is zero, it returns the corresponding element of the third argument.

For this example, if x[i] is not positive, the value -x[i] is returned; otherwise x[i] is returned.

The second and third arguments can also be scalar values. For example, if you want to replace all nonpositive values of x with a missing value, you can use the following statement:

z = choose( x<=0, ., x );

The CHOOSE function is analogous to the ternary C/C++ operator "?:" or to the ifelse function in R.

What the CHOOSE function doesn't do

Notice that the CHOOSE function takes values, not expressions. Each argument of the function is evaluated before it is sent into the CHOOSE function.

This is important, because a common mistake is to expect the CHOOSE function to evaluate the second expression only for elements for which the first expression is true. For example, in a previous post, I described several ways to handle negative values in evaluating a logarithmic data transformation. You might assume that the following statements prevent the LOG function from evaluating negative values in Y:

Y = {-3,1,2,.,5,10,100}; /** notice negative datum **/
LogY = choose(Y<=0, ., log(Y)); /* WRONG */

However, that assumption is not correct. The third argument is evaluated before being sent to the CHOOSE function. It is easy to see the mistake if you use a temporary variable and rewrite the code:

Val2 = log(Y); /* WRONG */
LogY = choose(Y<=0, ., Val2);
However, you can use the CHOOSE function to solve this problem: simply operate on the Y values to replace nonpositive values by missing values:
Y2 = choose(Y<=0, ., Y);
LogY = Log(Y2);
In summary, the CHOOSE function is a compact way to implement a complex assignment statement, but it is important to remember that all arguments are evaluated prior to being passed into the function.
7月 202011
"Dad? How many times do I have to roll a die until all six sides appear?"

I stopped what I was doing to consider my son's question. Although I could figure out the answer mathematically, sometimes experiments are more powerful than math equations for showing how probability works.

"Why don't we find out?" I said. "You roll and I'll keep track of the numbers that appear."

Each trial took at least six rolls, but eventually we completed 20 trials. The following SAS DATA step contains the number of rolls required until all six faces appeared on the die. The UNIVARIATE procedure displays a histogram of the data:

data RollsUntilAllFaces;
input rolls @@;
6 7 7 7 8 9 9 9 10 10 11 12 12 12 15 16 17 19 20 26

proc univariate data=RollsUntilAllFaces;
histogram rolls / endpoints=6 to 27 by 3;

The Expected Value

It's not hard to write down the expected number of rolls for a single die. You need one roll to see the first face. After that, the probability of rolling a different number is 5/6. Therefore, on average, you expect the second face after 6/5 rolls. After that value appears, the probability of rolling a new face is 4/6, and therefore you expect the third face after 6/4 rolls. Continuing this process leads to the conclusion that the expected number of rolls before all six faces appear is

6/6 + 6/5 + 6/4 + 6/3 + 6/2 + 6/1 = 14.7 rolls.

When I told my son that "you should expect to roll the die 14.7 times before all six faces appear," he was unimpressed.

"That's too high," he retorted. "Most of the time we only need 10-12 rolls."

The Median Value

He and I had previously discussed the mean, median, and mode, so I knew what he meant when he told me that I was "using the wrong average." And I agree with him. The median value in our experiment is 10.5, but the mean value (which is strongly influenced by the trials with values 19, 20, and 26) is 12.1. The long tail in the histogram makes it apparent that the median and mean values will be substantially different.

So what is the expected median number of rolls before all six faces appear? I don't know. The only way that I know to estimate the median is through simulation. If you know a reference or a better way, leave a comment.

The General Problem

This problem is called the "coupon collector's problem." If you roll two dice, it is harder (but still possible) to find the expected number of rolls because the outcomes 2–12 do not have equal probability. The general formulation is "If you are trying to collect N unique items, and the probability of getting the ith item is pi, what is the expected number of items that you need to obtain before you complete the collection?"

Now I understand why I never completed that 1977 Star Wars trading card set! I kept buying and buying until my jaws hurt from chewing all the gum, but with 66 cards in the set, my quest was essentially futile. Alas, the coveted "light saber" card shall never be mine: the expected number of cards required to complete the set is 315, assuming all cards have an equal probability.

7月 082011
One of the joys of statistics is that you can often use different methods to estimate the same quantity. Last week I described how to compute a parametric density estimate for univariate data, and use the parameters estimates to compute the area under the probability density function (PDF). This article describes how to compute a kernel density estimate (KDE) and to integrate that curve numerically.

Some practicing statisticians (especially those over a certain age!) learned parametric modeling techniques in school, but later incorporated nonparametric techniques into their arsenal of statistical tools. A very common nonparametric technique is kernel density estimation, in which the PDF at a point, x is estimate by a weighted average of the sample data in a neighborhood of x. Kernel density estimates are widely used, especially when the data appear to be multi-modal.

Kernel Density Estimation in SAS

As before, consider the cholesterol of 2,873 women in the Framingham Heart Study as recorded in the SAS data set SASHELP.Heart, which is distributed with SAS/GRAPH software in SAS 9.2. You can use the UNIVARIATE procedure to construct a kernel density estimate for the cholesterol variable.

The kernel density estimator has a parameter (called the bandwidth) that determines the size of the neighborhood used in the computation to compute the estimate. Small values of the bandwidth result in wavy, wiggly, KDEs, whereas large values result in smooth KDEs. The UNIVARIATE procedure has various methods to select the bandwidth automatically. For this example, I use the Sheather-Jones plug-in (SJPI) bandwidth.

The following statements use the OUTKERNEL= option to output the kernel density estimate at a sequence of regularly space values, and the VSCALE= option to set the vertical scale of the histogram to the density scale:

data heart;
set sashelp.heart; where sex="Female";

ods graphics on;
proc univariate data=heart;
var cholesterol;
histogram / vscale=proportion
          kernel(C=SJPI) outkernel=density;

Click to Enlarge

The nonparametric kernel density estimate exhibits features that are not captured by parametric models. For example, the KDE for these data show several bumps and asymmetries that are not captured by the assumption that the data are normally distributed.

The DENSITY data set contains six variables. The _VALUE_ variable represents the X locations for the density curve and the _DENSITY_ variable represents the Y locations. Therefore the area under the density estimate curve can be computed by numerically integrating those two variables.

Overcoming a Bug in the OUTKERNEL= Option

Unfortunately, there is a bug in the OUTKERNEL= option in SAS 9.2 which results in the DENSITY data set containing two (identical) copies of the KDE curve! (The bug is fixed in SAS 9.3.) Here's how to remove the second copy: find the maximum value of the X variable and delete all observations that come after it. The maximum value of X ought to be the last observation in the DENSITY data. The following statements read the density curve into the SAS/IML language and eliminate the second copy of the curve:

proc iml;
/** read in kernel density estimate **/
use density;
read all var {_value_} into x;
read all var {_density_} into y;
close density;

/** work around bug in OUTKERNEL data set in SAS 9.2 **/
/** remove second copy of data **/
max = max(x);
idx = loc(x=max);
if ncol(idx)>1 then do;
  x = x[1:idx[1]];  y = y[1:idx[1]];

The previous statements remove the second copy of the data if it exists, but do nothing if the second copy does not exist. Therefore it is safe to run the statements in all versions of SAS.

The Area under a Kernel Density Estimate Curve

The kernel density estimate is not a simple function. The points on the curve are available only as a sequence of (X, Y) pairs in the DENSITY data set. Consequently, to integrate the KDE you need to use a numerical integration technique. For this example, you can use the TrapIntegral module from my blog post on the trapezoidal rule of integration.

The following statements define the TrapIntegral module and verify that the KDE curve is correctly defined by integrating the curve over the entire range of X values. The area under a density estimate should be 1. If the area is not close to 1, then something is wrong:

start TrapIntegral(x,y);
   N = nrow(x);
   dx    =   x[2:N] - x[1:N-1];
   meanY = ( y[2:N] + y[1:N-1] )/2;
   return( dx` * meanY );

/** compute the integral over all x **/
Area = TrapIntegral(x,y);
print Area;



Ahh! Success! The area under the KDE is 1 to within an acceptable tolerance. Now to the real task: compute the area under the KDE curve for certain medically interesting values of cholesterol. In particular, the following statements estimate the probability that the cholesterol of a random woman in the population is less than 200 mg/dL (desirable), or greater than 240 mg/dL (high):

/** compute the integral for x < 200 **/
idx = loc(x<200);
lt200 = TrapIntegral(x[idx],y[idx]);

/** compute the integral for x >= 240 **/
idx = loc(x>=240);
ge240 = TrapIntegral(x[idx],y[idx]);
print lt200 ge240;





If you believe that the KDE for these data is a good estimate of the distribution of cholesterol in women in the general population, the computations show that about 29% of women have healthy cholesterol levels, whereas 36% have high levels of cholesterol. These are approximately the same values that were computed by using empirical probability estimates, as shown in my original post on this topic. They differ by about 2% from the parametric model, which assumes normally distributed data.

7月 072011
If you create a scatter plot of highly correlated data, you will see little more than a thin cloud of points. Small-scale relationships in the data might be masked by the correlation. For example, Luke Miller recently posted a scatter plot that compares the body temperature of snails when they extend or withdraw their foot. This scatter plot appears in Miller's research article in the Biological Bulletin.

For Miller's data, the Pearson correlation coefficient is 0.99 and the data fall close to (but above) the identity line. This makes sense, because the temperature of a cold-blooded snail will be primarily dependent on the ambient temperature, and only secondarily related to whether a snail's foot is extended of withdrawn.

The scatter plot is reproduced below. (Ignore the red arrows for now.) As I looked at the graph, I thought that I could discern some additional structure in the data. It looks like the difference between the snails' body temperature is more pronounced when the ambient temperature is high. Consequently, I thought it might be interesting to plot the difference between the two variables.

Changing the Coordinate System

You could create a histogram of the differences between the variables, but that would not show how the difference might depend on the ambient temperature. Instead, you can change coordinate systems. The variable that represents the "difference between the temperatures" increases in the direction shown by the upper-left pointing arrow. The "ambient temperature" variable increases in the direction shown by the arrow that points to the upper right.

The coordinate system shown by the red arrows is a rotation by 45 degrees of the standard coordinate system. The matrix that rotates a coordinate system through a counter-clockwise angle, t is given by R = {cos(t) sin(t) // -sin(t) cos(t)}. For t = π/4, the matrix becomes a scalar multiple of R = {1 1, -1 1}. Therefore, original coordinates (x, y) are sent to the new coordinates (x+y, y-x) by the rotation.

The quantity x+y might be hard to explain to a biological audience, but if you divide by two, you get (x+y)/2 which you can interpret as the average between the snails' temperature when the foot is extended and when it is withdrawn. You can use this value as a proxy for the ambient temperature. The quantity y-x is, of course, the difference between the temperature when the foot is withdrawn and when it is extended.

Plotting the difference against the average gives the following graph, which shows much more detail than the original:

This graph reveals three facts that were not apparent from the original presentation of the data:

  1. The position of the foot makes more of a difference when the ambient temperature is high than when it is low.
  2. There are many observations for which the position of the foot does not matter (the difference is zero).
  3. There are three observations for which the difference is negative. Is this a data entry error? Is there a biological explanation for these anomalous values?
In summary, if you are graphing highly correlated data, you can change coordinates in order to better show the difference between the variables. This is nothing more than "tilting your head" to look at the data in a new way, but the results can be dramatic.

Statistical readers will notice that this is a special case of principal component analysis: the "average direction" is the first principal component; the "difference direction" is the second principal component.