Data Analysis

January 14, 2011
 
A colleague related the following story: He was taking notes at a meeting that was attended by a fairly large group of people (about 20). As each person made a comment or presented information, he recorded the two-letter initials of the person who spoke. After the meeting was over, he was surprised to discover that all of the initials of the people in the room were unique! Nowhere in his notes did he write "JS said..." and later wonder "Was that Jim Smith or Joyce Simpson?"

My colleague asked, "If 20 random people are in a room, do they usually have different initials or is it common for two people to share a pair of initials?" In other words, was his experience typical or a rare occurrence?

The Distribution of Initials at a Large US Software Company

In order to answer that question, it is necessary to know the distribution of initials in his workplace.

Clearly, the distribution of initials depends on the population of the people in the workplace. In some cultures, names that begin with X or Q are rare, whereas in other cultures names that begin with those letters (when phonetically translated into English) are more common.

SAS is a large US software company with a diverse base of employees, so I decided to download the names of 4,502 employees who work with me in Cary, NC, and write a DATA step program that extracts the first and last initials of each name.
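Here is a minimal sketch of such a DATA step. It assumes the downloaded names sit in a data set called Names with a single character variable Name of the form "First Last" (both the data set and variable names are hypothetical; the real program may parse a different layout):

data Employees;
   set Names;                                       /* hypothetical input data set */
   length I1 I2 $ 1;
   I1 = upcase(substr(scan(Name,  1, ' '), 1, 1));  /* initial of the first name */
   I2 = upcase(substr(scan(Name, -1, ' '), 1, 1));  /* initial of the last name  */
   keep I1 I2;
run;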

You can use the FREQ procedure to compute the frequencies of the first initial (I1), the last initial (I2), and the frequency of the initials taken as a pair. The following statements output the frequency of the initials in decreasing order:

proc freq data=Employees order=freq;
tables I1 / out=I1Freq;
tables I2 / out=I2Freq;
tables I1*I2 / out=InitialFreq missing sparse noprint;
run;

As an example, I can display the relevant frequencies for my initials (RW) as well as the initials of the SAS cofounders, Jim Goodnight and John Sall:

data SASUSER.InitialFreq; set InitialFreq; 
Initials = I1 || I2;
run;

proc print data=SASUSER.InitialFreq
   (where=(Initials="RW" | Initials="JG" | Initials="JS"));
run;

Obs    I1    I2    COUNT    PERCENT    Initials

  1    J     S        61    1.35495    JS
 10    J     G        21    0.46646    JG
214    R     W        18    0.39982    RW

The initials "JS" are the most frequent initials in my workplace, with 61 employees (1.35%) having those initials. The initials "JG" are also fairly common; they are the 10th most popular initials. My initials are less common and are shared by only 0.4% of my colleagues.

If you want to conduct your own analysis, you can download a comma-separated file that contains the initials and frequencies.

You can use PROC SGPLOT to display bar charts of the first and last initials.
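For example, the bar chart of first initials might be produced from the I1Freq output data set along these lines (a sketch; the axis labels are my own):

proc sgplot data=I1Freq;
   vbar I1 / response=Percent;                      /* PERCENT comes from the OUT= data set of PROC FREQ */
   xaxis label="First Initial" discreteorder=data;  /* keep the descending-frequency order */
   yaxis label="Percent of Employees";
run;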

The bar charts show that J, M, S, D, and C are the most common initials for first names, whereas S, B, H, M, and C are the most common initials for last names.

In contrast, U, Q, and X are initials that do not appear often for either first or last names. For first initials, the 10 least popular initials cumulatively occur less than 5% of the time. For last initials, the 10 least popular initials cumulatively occur about 8% of the time.

Clearly, the distribution of initials is far from uniform. However, for the note-taker, the important issue is the distribution of pairs of initials.


The Distribution of Two-Letter Initials

By using the PROC FREQ output, you can analyze the distribution of the frequencies of the 26² = 676 pairs of initials at my workplace:

  • More than 30% of the frequencies are zero. For example, there is no one at my workplace with initials YV, XU, or QX.
  • If you ignore the initials that do not appear, then the quantiles of the remaining observations are as follows:
    • The lower quartile is 0.044.
    • The median is 0.133.
    • The upper quartile is 0.333.
  • Three pairs are much more prevalent than the others: the initials JM, JB, and JS each occur more than 1% of the time.
The distribution of two-letter initials is summarized by the following box plot:
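A sketch of how the quantiles and the box plot can be computed from the InitialFreq data set, ignoring the pairs that do not occur:

proc means data=InitialFreq(where=(Count>0)) q1 median q3;
   var Percent;
run;

proc sgplot data=InitialFreq(where=(Count>0));
   hbox Percent;      /* box plot of the nonzero pair frequencies */
run;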


Visualizing the Proportions of Two-Letter Initials


With the help of a SAS Global Forum paper that shows how to use PROC SGPLOT to create a heat map, I created a plot that shows the distribution of two-letter initials in my workplace.

When I create a heat map, I often use the quartiles of the response variable to color the cells in the heat map. For these data, I used five colors: white to indicate pairs of initials that are not represented at my workplace, and a blue-to-red color scheme (obtained from colorbrewer.org) to indicate the quartiles of the remaining pairs. Blue indicates pairs of initials that are uncommon, and red indicates pairs that occur frequently.

In terms of counts, blue indicates pairs of initials that are shared by either one or two individuals, and red indicates 18 or more individuals.
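The SAS Global Forum paper uses its own technique, so the following is only a rough alternative sketch: in SAS 9.4 and later, the HEATMAPPARM statement can draw the grid directly. It assumes a hypothetical variable QuartileBin that assigns each pair to "None" or to one of the four quartile bins, and the hex values are illustrative ColorBrewer colors:

proc sgplot data=InitialFreq;
   styleattrs datacolors=(white CXDEEBF7 CX9ECAE1 CXFC9272 CXDE2D26);
   heatmapparm x=I2 y=I1 colorgroup=QuartileBin;   /* one cell per pair of initials */
   xaxis label="Last Initial";
   yaxis label="First Initial";
run;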


The heat map shows several interesting features of the distribution of pairs of initials:

  • Although W and N are not unusual first initials (1.7% and 1.4%, respectively) and D and F are not unusual last initials (5.0% and 3.2%, respectively), there is no one at my workplace with the initials ND or WF.
  • There are 89 individuals at my workplace who have a unique pair of initials, including YX, XX, and QZ.

You can download the SAS program that is used to produce the analysis in this article.

The Probability of Matching Initials

Computing the probability that a group of people have similar characteristics is called a "birthday-matching problem" because the most famous example is "If there are N people in a room, what is the chance that two of them have the same birthday?"
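For reference, under the usual assumption that all 365 birthdays are equally likely (and ignoring leap days), the well-known answer for N people is

   P(at least one shared birthday) = 1 - [365 x 364 x ... x (365 - N + 1)] / 365^N,

which first exceeds 1/2 at N = 23.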

In Chapter 13 of my book, Statistical Programming with SAS/IML Software, I examine the birthday-matching problem. I review the well-known solution under the usual assumption that birthdays are uniformly distributed throughout the year, but then go on to compare that solution to the more realistic case in which birthdays are distributed in a fashion that is consistent with empirical birth data from the National Center for Health Statistics (NCHS).

Obviously, you can do a similar analysis for the "initial-matching problem." Specifically, you can use the actual distribution of initials at SAS to investigate the question, "What is the chance that two people in a room of 20 randomly chosen SAS employees share initials?" Come back next Wednesday to find out the answer!

January 6, 2011
 
A student in my multivariate class last month asked a question about prior probability specifications in discriminant function analysis:
What if I don't know what the probabilities are in my population? Is it best to just use the default in PROC DISCRIM?


First, a quick refresher of priors in discriminant analysis. Consider a problem of classifying 150 cases (let's say, irises) into three categories (let's say, variety). I have four different measurements taken from each of the flowers.

If I walk through the bog and pick another flower and measure its 4 characteristics, how well can I expect to perform in classifying it as the right variety? One way to derive a classification algorithm is to use linear discriminant analysis.

A linear discriminant function to predict group membership is based on the squared Mahalanobis distance from each observation to the centroid of the group, plus a function of the prior probability of membership in that group.
This generalized squared distance is then converted into a score of similarity to each group, and the case is classified into the group to which it is most similar.

The prior probability is the probability that an observation drawn by simple random sampling (with replacement) comes from a particular group.

If the prior probabilities are the same for all three of the groups (also known as equal priors), then the function is only based on the squared Mahalanobis distance.
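In the usual linear (pooled-covariance) formulation, the generalized squared distance from an observation x to group j is

   D_j^2(x) = (x - m_j)' S^{-1} (x - m_j) - 2 ln(q_j),

where m_j is the centroid of group j, S is the pooled covariance matrix, q_j is the prior probability for group j, and the prime denotes transpose. When the priors are equal, the term -2 ln(q_j) is the same constant for every group, so it drops out of the comparison and only the squared Mahalanobis distance matters.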




If the prior for group A is larger than for groups B and C, then the function makes it more likely that an observation will be classified as group A, all else being equal.

The default in PROC DISCRIM is equal priors. This default makes sense in the context of developing computational software: the function with equal priors is the simplest, and therefore the most computationally efficient.
PRIORS equal;

Alternatives are proportional priors (priors set to the proportion of observations from each group in the input data set) and user-specified priors (just what it sounds like: specify them yourself).
PRIORS proportional;
PRIORS 'A' = .5 'B' = .25 'C' = .25;


Of course this kind of problem is far more interesting when you consider something like people making choices, such as kids choosing an action figure of Tinkerbell, Rosetta, or Vidia. Those certainly don't have equal priors, and if your daughter's anything like mine, she doesn't want to be classified into the wrong group.

So back to the original question:
What if I don't know what the probabilities are in my population? Is it best to just use the default in PROC DISCRIM?

In this case, using the default would probably not be a great idea, as it would assign the dolls with equal probability, all else being equal.

So if not the default, then what should you use? This depends on what you're going to be scoring. Your priors should reflect the probabilities in the population that you will be scoring in the future. Some strategies for getting a decent estimate:
1. Go to historical data to see what the probabilities have been in the past.
2. If your input data set is a simple random sample, use proportional priors.
3. Take a simple random sample from the population and count up the number from each group. This can determine the priors.
4. Combine the probabilities you think are correct with the cost of different types of misclassification.

For example, suppose that, among 4-year-olds, the probabilities of wanting the Tinkerbell, Rosetta, and Vidia action figures are really 0.50, 0.35, and 0.15, respectively. After all, not many kids want to be the villain.
PRIORS 'Tink' = .5 'Rosetta' = .35 'Vidia' = .15

What is the cost of giving a girl the Rosetta doll when she wanted Tinkerbell? What is the cost of giving a girl Vidia when she wanted Rosetta? And so on. A table is shown below (based on a very small sample of three interviews of 4-year-old girls):



Clearly the cost of an error is not the same for all errors. It is far worse to assign Vidia to a girl who doesn't want Vidia than for any other error to occur. Also notice the small detail that Vidia fans would prefer to get Rosetta over Tinkerbell. For birthday party favors, I'd massage those priors to err on the side of giving out Rosetta over Vidia.
PRIORS 'Tink' = .5 'Rosetta' = .4 'Vidia' = .1
Of course, depending on your tolerance for crying, you might just give everyone Rosetta and be done with it. But then, really, isn't variety the spice of life?

I hope this has helped at least one or two of you out there who were having trouble with priors. The same concepts apply in logistic regression with offset variables, by the way. But that's a party favor for another day.
January 6, 2011
 
The Junk Charts blog discusses problems with a chart that (poorly) presents statistics on the prevalence of shark attacks by different species.

Here are the same data, presented by overlaying two bar charts by using the SGPLOT procedure. I think this approach works well because the number of deaths is always less than the number of attacks, and the visual juxtaposition of deaths and attacks gives the viewer an impression of the relative risk of each shark killing someone during an attack.


The graph was created by using the following SAS program.

data Sharks;
input Species $14. Attacks Deaths;
label Attacks="Attacks" Deaths="Deaths";
datalines;
White         394 61
Tiger         140 28
Bull           98 21
Sandtiger      74  2
Hammerhead     41  0
Blacktip       33  0
Blue           36  4
Blacktip reef  16  0
Shortfin mako  43  2
Gray reef       9  0
;
run;
proc sort data=Sharks; by descending Attacks; run;
data Others;
Species="Others"; Attacks=276; Deaths=9; output;
run;

data Sharks; set Sharks Others; run;

title 'Shark Attacks and Deaths by Species';
footnote justify=left 'Source: Shark Foundation';
footnote2 justify=left 'http://www.shark.ch/Information/Accidents/index.html';
proc sgplot data=Sharks;
   hbar Species / response=Attacks datalabel;
   hbar Species / response=Deaths datalabel;
   xaxis label=" " grid;
   yaxis label=" " discreteorder=data;
run;
January 4, 2011
 
Happy New Year!! This is a good time to think about what was going on here in SAS Education one year ago, and to introduce you to a big project that I'm really excited to "take public."

In January 2010 (as well as throughout 2009), we kept getting cries for help like this one: "Where do we find MBAs with quantitative skills? They need to know predictive modeling, large-scale time series forecasting, cluster analysis, design and analysis of experiments, honest assessment, and model management. Our problems cannot be solved with t-tests and most universities are not teaching the skills business analysts really need in the modern workplace."

Companies want to hire professionals who have a good head for business, an understanding of data analysis, and facility with advanced software for fitting sophisticated models. They have told us that these people are getting harder and harder to find.

Of course, SAS has had training on statistical analysis and software for decades. But analytical directors and business analysts frequently find themselves in a position of needing to know more about how to apply analytics in the business context: how do you make the most effective use of your (massive, messy, opportunistic) data? How do you "think analytically"?

About half of my time in 2010 (with help from some brilliant colleagues, as well) was spent developing a course to address this need: Advanced Analytics for the Modern Business Analyst. The first version was made exclusively available to university professors last August, and many of them are already teaching it in their graduate programs. Degree-seeking students are getting better prepared for handling large-scale data analysis in the so-called "real-world." We were all very excited about launching the course with the universities and worked individually with over 30 professors worldwide to train them and provide the software needed to teach this course.

Starting in 2011 the course will also be available to professionals.

In March, we are offering a nine-week class that combines statistical and software training with the business context -- think of it as a "how to solve problems with data analysis" class. It is like having that extra semester of graduate training that you wish had been offered.

This course is also a chance to try out a better way of learning. Instead of just a weekly live web class, this course models the strategies many universities are using to offer their distance learning curricula, making it easier for the student to retain learning from week to week and using multiple learning modes to reinforce understanding (reading, writing, listening, watching).

Each week, students will participate in:
...a 3.5-hour live web lecture with their instructor and classmates.
...practice activities using software on our servers.
...assigned reading of book chapters and articles to support the class topics, as well as discussion forums to exchange ideas about the readings.

The total time commitment is about 5-6 hours a week, roughly the same as a semester-long course.

You can read a little more about it here in the SAS Training Newsletter.

You can see the course outline here. There are only a few seats left, so if you're interested, you might want to jump on it. Since every teaching experience is also a learning experience, I hope that teaching this class will generate lots of new ideas and blog posts in the coming months. I'll let you know how it's coming along.

Does your organization have people who can make sense of masses of data, forecast future behavior, and find the next best strategy to gain a competitive edge?

See you in class!
December 17, 2010
 
When I finished writing my book, Statistical Programming with SAS/IML Software, I was elated. However, one small task still remained.

I had to write the index.

How Long Should an Index Be?

My editor told me that SAS Press would send the manuscript to a professional indexer who would index the book for me. They did, and I got back about 30 pages of terms and page numbers! The index shrank a little when I incorporated all those terms into LaTeX, which displays an index in two columns. However, it still occupies 11 pages and includes almost 1200 entries!

I was a little annoyed. The entire book is only 440 pages—why is the index so long? I asked a colleague of mine who is an editor, and she said it didn't seem long to her. She referred me to Technical Writing 101: A Real-World Guide to Planning and Writing Technical Documentation (p. 148):

In technical documentation,... the rule of thumb is that the index should contain approximately one page of two-column index entries for every 20 pages of documentation.
That's crazy, I thought. There's no way my book should have a 22-page index! That might be a valid rule of thumb for documentation, but not for a statistical book.

But if I reject that rule of thumb, how can I determine whether the length of my index is appropriate compared to the length of my book? I decided to do what any sensible statistician would do in this situation: collect data.

Data for Index Length versus Book Length, by Publisher

A colleague and I marched down to the SAS library. We sampled books from three different publishers: SAS Press, Springer, and Wiley. For each book we recorded the number of pages (not including the index) and the number of pages for the index. You can download the data and the SAS/IML Studio program that analyzes the data.


The previous image summarizes the data. It shows the size of each book's index versus the size of the book. The books are colored according to the publisher. Because the data for each publisher appear to be linear except for outliers (for example, the 447-page Springer book with no index, which is a set of conference proceedings), I fit and added robust regression lines computed by the LTS algorithm in the ROBUSTREG procedure. I used a no-intercept model so that I could more easily interpret the slope of the regression line as a ratio that reveals the expected number of index pages, given the number of pages in the book. (The regression lines look similar for a model that includes an intercept term.)
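A sketch of the kind of ROBUSTREG call involved, assuming the data are in a data set named Books with variables Publisher, Pages, and IndexPages (all names hypothetical):

proc sort data=Books; by Publisher; run;

proc robustreg data=Books method=LTS;
   by Publisher;                        /* a separate robust fit for each publisher */
   model IndexPages = Pages / noint;    /* no-intercept model: slope = index pages per page of text */
run;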

The coefficients of the regression model indicate that SAS Press books average about one page of index for every 26 pages of text. In contrast, the other publishers average about one page of index for every 65 pages of text. In other words, for this sample the professionally compiled SAS Press indexes contain, on average, more than twice as many pages as the indexes of comparable books printed by other publishers.

The analysis reveals that the index of my book (marked by an "X") is not too long, at least by the SAS Press standards. In fact, the length of my index is low compared with other SAS Press books, although it is on the high end of Springer and Wiley books of similar length.


I contacted a colleague who wrote one of the Springer books in the sample. I asked whether he had used a professional indexer, or had compiled the index himself. He said that he created the index himself:

We assumed it would cost us [to use a professional indexer], and we were concerned that the indexer wouldn't understand the topic. Doing it myself enabled me to ensure that spelling was consistent. Initially I intended to get students to test the index for me, but at the end of the process I had rather lost my enthusiasm!

I cannot comment on the practices of other publishers, but SAS Press does not charge their authors for this professional service. Still, I share his sense of fatigue: writing an index is hard and tiring work.

If I had compiled the index myself, would I have had the stamina to include 1200 entries? Definitely not. But am I glad that my index is thorough, useful, and professionally compiled? Absolutely.

The cost to me? Three weeks of drudgery. The benefit to my readers? Priceless. Thanks, SAS Press!

December 8, 2010
 
Sample covariance matrices and correlation matrices are used frequently in multivariate statistics. This post shows how to compute these matrices in SAS and use them in a SAS/IML program. There are two ways to compute these matrices:
  1. Compute the covariance and correlation with PROC CORR and read the results into PROC IML
  2. Compute the matrices entirely with PROC IML

Computing a Covariance and Correlation Matrix with PROC CORR

You can use PROC CORR to compute the correlation matrix (or, more correctly, the "Pearson product-moment correlation matrix," since there are other measures of correlation which you can also compute with PROC CORR). The following statements compute the covariance matrix and the correlation matrix for the three numerical variables in the SASHELP.CLASS data set.

ods select Cov PearsonCorr;
proc corr data=sashelp.class noprob outp=OutCorr /** store results **/
          nomiss /** listwise deletion of missing values **/
          cov;   /**  include covariances **/
var Height Weight Age;
run;

The OutCorr data set contains various statistics about the data, as shown by running PROC PRINT:

proc print data=OutCorr; run;

Obs    _TYPE_    _NAME_     Height     Weight        Age

  1    COV       Height     26.287    102.493     6.2099
  2    COV       Weight    102.493    518.652    25.1857
  3    COV       Age         6.210     25.186     2.2281
  4    MEAN                 62.337    100.026    13.3158
  5    STD                   5.127     22.774     1.4927
  6    N                    19.000     19.000    19.0000
  7    CORR      Height      1.000      0.878     0.8114
  8    CORR      Weight      0.878      1.000     0.7409
  9    CORR      Age         0.811      0.741     1.0000
If you want to use the covariance or correlation matrix in PROC IML, you can read the appropriate values into a SAS/IML matrix by using a WHERE clause on the USE statement:


proc iml;
use OutCorr where(_TYPE_="COV");
read all var _NUM_ into cov[colname=varNames];

use OutCorr where(_TYPE_="CORR");
read all var _NUM_ into corr[colname=varNames];
close OutCorr;

Notice that this SAS/IML code is independent of the number of variables in the data set.

Computation of the Covariance and Correlation Matrix in PROC IML

If the data are in SAS/IML vectors, you can compute the covariance and correlation matrices by using matrix multiplication to form the matrix that contains the corrected sum of squares of cross products (CSSCP).

Suppose you are given p SAS/IML vectors x1, x2, ..., xp. To form the covariance matrix for these data:

  1. Use the horizontal concatenation operator to concatenate the vectors into a matrix whose columns are the vectors.
  2. Center each vector by subtracting the sample mean.
  3. Form the CSSCP matrix (also called the "X-prime-X matrix") by multiplying the matrix transpose and the matrix.
  4. Divide by n-1 where n is the number of observations in the vectors.
This process assumes that there are no missing values in the data. Otherwise, it needs to be slightly amended. Formulas for various matrix quantities are given in the SAS/STAT User's Guide.

The following SAS/IML statements define a SAS/IML module that computes the sample covariance matrix of a data matrix. For this example the data are read from a SAS data set, so Step 1 (horizontal concatenation of vectors) is skipped.

proc iml;
start Cov(A);             /** define module to compute a covariance matrix **/
   n = nrow(A);           /** assume no missing values **/
   C = A - A[:,];         /** subtract mean to center the data **/
   return( (C` * C) / (n-1) );
finish;

/** read or enter data matrix into X **/
varNames = {"Height" "Weight" "Age"};
use sashelp.class; read all var varNames into X; close sashelp.class;

cov = Cov(X);
print cov[c=varNames r=varNames];

cov          Height       Weight          Age

Height    26.286901    102.49342    6.2099415
Weight    102.49342    518.65205    25.185673
Age       6.2099415    25.185673    2.2280702

Computing the Pearson correlation matrix requires the same steps, but also that the columns of the centered data matrix be scaled to have unit standard deviation. SAS/IML software already has a built-in CORR function, so it is not necessary to define a Corr module, but it is nevertheless instructive to see how such a module might be written:

start MyCorr(A);
   n = nrow(A);                   /** assume no missing values     **/
   C = A - A[:,];                 /** center the data              **/
   stdCol = sqrt(C[##,] / (n-1)); /** std deviation of columns     **/
   stdC = C / stdCol;             /** assume data are not constant **/
   return( (stdC` * stdC) / (n-1) );
finish;

corr = MyCorr(X);
print corr[c=varNames r=varNames];

corr         Height       Weight          Age

Height            1    0.8777852    0.8114343
Weight    0.8777852            1    0.7408855
Age       0.8114343    0.7408855            1

You should use the built-in CORR function instead of the previous module, because the built-in function handles the case of constant data.

Computation of the Covariance and Correlation Matrix in PROC IML (post-9.2)

In November 2010, SAS released the 9.22 (pronounced "nine point twenty-two") release of SAS/IML software. This release includes the following (a brief usage sketch appears after the list):

  • a built-in COV function which handles missing values in either of two ways
  • new features for the built-in CORR function including
    • handling missing values in either of two ways
    • support for different measures of correlation, including rank-based correlations
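A minimal sketch of calling the new functions (this assumes SAS/IML 9.22 or later; the second argument to CORR selects the correlation measure):

proc iml;
use sashelp.class;
read all var {"Height" "Weight" "Age"} into X;
close sashelp.class;

cov2  = cov(X);               /* built-in sample covariance matrix */
corr2 = corr(X, "Spearman");  /* rank-based (Spearman) correlation */
print cov2, corr2;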
December 3, 2010
 
My last post was a criticism of a statistical graph that appeared in Bloomberg Businessweek. Criticism is easy. Analysis is harder. In this post I re-analyze the data to present two graphics that I think should have replaced the one graphic in Businessweek. You can download the SAS program that creates the plots for this analysis. The program makes use of the excellent SGPLOT and SGPANEL procedures in SAS 9.2.

The original graph visualizes how participation in online social media activities varies across age groups. The graph is reproduced below at a smaller scale:


When I look at these data, I ask myself two questions:

  1. How does participation in social media differ across age groups?
  2. Given that someone in an age group participates, what is the popularity of each activity?

How does participation in social media differ across age groups?

The first question is answered by the "Inactives" plot, but that graph is upside down. You can subtract the data from 100 to get the adjacent bar chart.

The bar chart shows the percentage of each age group that engages in social media activities. You can clearly see the main conclusion: using social media is highly popular among young people, but older people (in 2007) had not embraced it at the same levels.

This plot should be shown separately from the rest of the data because it shows how each age category is divided into two groups: those who participate in online social media and those who do not. Although it is not apparent by looking at the original Businessweek graph, the last row is distinctly different from the others. The category in the last row is mutually exclusive from the first five categories. Said differently, an "Inactive" individual is represented in a single cell in the last row, whereas an "Active" individual might be represented in multiple cells in the first five rows.

For these reasons, I think you should make one chart that shows participation, and a second chart that shows how people participate.

Absolute Percentages or Relative Percentages?

Look at the last column ("Seniors") in the Businessweek chart. The percentages are for all seniors who are online. This is the wrong scale if you want to determine whether age affects the kinds of activities that people participate in. Why? Because the previous bar chart shows that participation rates are different across age groups.


Only 30% of seniors participate in any form of social media, so of course the percentage of all seniors that are Critics (11%) is low in absolute terms. But that 11% is actually 37% of the participating seniors (0.37 = 0.11/0.30). In other words, of the seniors who participate in online social media activities, 37% of them are critics. By looking at a relative percentage, you can assess whether participation in a given activity varies across age groups for those who participate.
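Computationally, the conversion is just a division by the participation rate of each age group. A minimal sketch, assuming hypothetical data sets Social (with variables AgeGroup, Activity, and Pct) and Participation (with AgeGroup and ParticipationRate), both sorted by AgeGroup:

data Relative;
   merge Social Participation;          /* both sorted by AgeGroup */
   by AgeGroup;
   RelPct = Pct / ParticipationRate;    /* e.g., 0.11 / 0.30 = 0.37 for senior Critics */
run;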

The adjacent line plot shows the relative percentages of the data from the Businessweek graph. This plot answers my second question: Given that someone participates, what is the popularity of each activity?

This visualization is more revealing than the "young people are more active" conclusion of the Businessweek graph. Older people who participate in social media are "Critics" and "Spectators" about as often as younger people, and are "Collectors" more often. Only the "Creator" and "Joiner" activities show a marked decrease in participation by older age groups.

Comparing Participation by Age Groups

There are only five activities to visualize, so it is feasible to create a single line plot that shows how participation in all activities varies across age groups. By placing all of the activities on a common plot, it is easier to determine that the relative percentage of seniors who are "Critics" is the same as the percentage of "Collectors."

Although this graph would never make its way into Businessweek, I think that it—along with the bar chart of social media use by age—tells the story of the data. And isn't that the goal of statistical graphics?


December 3, 2010
 
Have you used multivariate procedures in SAS and wanted to save out scores? Some procedures, such as FACTOR, CANDISC, CANCORR, and PRINCOMP, have an OUT= option that saves scores for the observations in the input data set. However, to score a new data set, or to perform scoring with multivariate procedures that do not have an OUT= option, you need another way.

Scoring new data with these procedures is a snap if you know how to use PROC SCORE. The OUTSTAT= data sets from these procedures include the scoring coefficients that the procedure uses to score observations in the DATA= data set. The coefficients are contained in the observations where _TYPE_='SCORE'.
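For the non-hierarchical case, the scoring step is a single PROC SCORE call. Here is a sketch that assumes PROC FACTOR was run with the SCORE option, so that its OUTSTAT= data set (here called FactStat) contains the _TYPE_='SCORE' observations, and that NewData holds the observations to be scored (all data set and variable names are hypothetical):

proc factor data=Original outstat=FactStat score;   /* SCORE adds scoring coefficients to OUTSTAT= */
   var x1-x5;
run;

proc score data=NewData score=FactStat out=Scored;  /* apply the coefficients to the new data */
   var x1-x5;
run;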

The only catch, however, is that if you use a hierarchical method such as the VARCLUS procedure, then scoring coefficients are included for the 1-, 2-, 3-, and so on, cluster solutions, up to the final number of clusters in the analysis. PROC SCORE doesn't know which one to use.

So you have one extra (teeny-tiny) step to perform scoring:
Continue reading "Weekday Morning Quick-trick: How to Score from PROC VARCLUS"
December 1, 2010
 
Recently I read a blog that advertised a data visualization competition. Under the heading "What Are We Looking For?" is a link to a 2007 Bloomberg Businessweek graph that visualizes how participation in online social media activities varies across age groups. The graph is reproduced below at a smaller scale:


A few aspects of this chart bothered me, so in the spirit of Kaiser Fung's excellent Junk Charts blog, here are some thoughts on improving how these data are visualized.

"34%" Is One Number, Not 34

Each cell in the graph consists of a 10 x 10 grid, and the number of colored squares in the grid represents the percentage of a given age group that participate in a given online activity. For example, the cell in the lower left corner has 34 dark gray colored squares to indicate that 34% of young teens do not participate in social media activities online. That's a lot of ink used to represent a single number!

Furthermore, the chart is arranged so that the colored squares across all age groups simulate a line plot. For example, the graph attempts to show that the percentage of "Inactives" varies across age groups. Note the arrangement of the dark gray squares across the first four age groups:

The four "extra" squares in the first cell (34%) are arranged flush to the left. The gap in the second cell (17%) is put in the middle. (By the way, there should be only 17 colored squares in this cell, not 18.) The extra squares in the next two cells are arranged flush right. The effect is that the eye sees a "line" that decreases, reaches a minimum with 18–21 group, and then starts increasing.

This attempt to form a line plot out of colored squares can be deceptive. For example, by pushing all of the extra squares in one age group to the right and all of the colored squares in the adjacent age group to the left, I can bias your eye to see local minima where there are none. This technique also fails miserably with nearly constant data, such as the orange squares used for the "Collector" group. The eye sees little bumps, whereas the percentages are essentially constant across the age groups.

If You Want a Line Plot...

If you have data suitable for a line plot, then create a line plot. Here is a bare-bones strip-out-the-color-and-focus-on-the-data line chart. It shows the data in an undecorated statistical way that the editors at Businessweek would surely reject! However, it does show the data clearly.

The line plot shows that participation in most online social media activities peaks with the college-age students and decreases for older individuals. You can also see that the percentage of "Collectors" is essentially constant across age groups. Lastly, you can see that the "Not Active" category is flipped upside down relative to the other categories: it shows the percentage of people who are not active, and therefore reaches a minimum with the college-age students and increases for older individuals.

The line plot formulation helps to show the variation among age groups for each level of activity. You can, of course, use the same data to create a graph that shows the variation in activities for each age group. Perhaps the creator of the Businessweek graph did not use a line plot because he was hoping that one chart could serve two purposes.

Asking Different Questions

When I look at these data, I ask myself two questions:

  1. How does participation in social media differ across age groups?
  2. Given that someone in an age group participates, what is the popularity of each activity?

On Friday I will use these data to create new graphs that answer these questions, thereby presenting an alternate analysis of these data.

Further Improvements

Do you see features of the Businessweek graph that you think could be improved? Do you think that the original graph has merits that I didn't acknowledge? Post a comment.

November 19, 2010
 
In a previous post, I used statistical data analysis to estimate the probability that my grocery bill is a whole-dollar amount such as $86.00 or $103.00. I used three weeks' grocery receipts to show that the last two digits of prices on items that I buy are not uniformly distributed. (The prices tend to be values such as $1.49 or $1.99 that appear to be lower in price.) I also showed that sales tax increases the chance of getting a whole-dollar receipt.

In this post, I show how you can use resampling techniques in SAS/IML software to simulate thousands of receipts. You can then estimate the probability that a receipt is a whole-dollar amount, and use bootstrap ideas to examine variation in the estimate.

Simulating Thousands of Receipts

The SAS/IML language is ideal for sampling and simulation in the SAS environment. I previously introduced a SAS/IML module called SampleReplaceUni which performs sampling with replacement from a set. You can use this module to resample from the data.

First run a SAS DATA step program to create the WORK.PRICES data set. The ITEM variable contains prices of individual items that I buy. You can resample from these data. I usually buy about 40 items each week, so I'll use this number to simulate a receipt.

The following SAS/IML statements simulate 5,000 receipts. Each row of the y vector contains 40 randomly selected items (one simulated receipt), and there are 5,000 rows. The sum across each row is the pre-tax total for that receipt:

proc iml;
/** read prices of individual items **/
use prices; read all var {"Item"}; close prices;

SalesTax = 0.0775;   /** sales tax in Wake county, NC **/
numItems = 40;       /** number of items in each receipt **/
NumResamples = 5000; /** number of receipts to simulate **/
call randseed(4321);

/** LOAD or define the SampleReplaceUni module here. **/

/** resample from the original data. **/
y = sampleReplaceUni(Item`, NumResamples, NumItems);
pretax = y[,+];       /** sum to get pre-tax cost of each receipt **/
total = round((1 + SalesTax) # pretax, 0.01); /** add sales tax **/
whole = (total=int(total)); /** indicator: whole-dollar receipts **/
Prop = whole[:];    /** proportion that are whole-dollar amounts **/
print NumItems NumResamples (sum(whole))[label="Num00"] 
       Prop[format=PERCENT7.2 label="Pct"];

numItems    NumResamples    Num00      Pct

      40            5000       52    1.04%

The simulation generated 52 whole-dollar receipts out of 5,000, or about 1%.

Estimating the Probability of a Whole-Dollar Receipt

The previous computation is the result of a single simulation. In technical terms, it is a point-estimate for the probability of obtaining a whole-dollar receipt. If you run another simulation, you will get a slightly different estimate. If you run, say, 200 simulations and plot all of the estimates, you obtain the sampling distribution for the estimate. For example, you might obtain something that looks like the following histogram:


The histogram shows the estimates for 200 simulations. The vertical dashed line in the histogram is the mean of the 200 estimates, which is about 1%. So, on average, I should expect my grocery bills to be a whole-dollar amount about 1% of the time. The histogram also indicates how the estimate varies over the 200 simulations.

The Bootstrap Distribution of Proportions

The technique that I've described is known as a resampling or bootstrap technique.

In the SAS/IML language, it is easy to wrap a loop around the previous simulation and to accumulate the result of each simulation. The mean of the accumulated results is a bootstrap estimate for the probability that my grocery receipt is a whole-dollar amount. You can use quantiles of the accumulated results to form a bootstrap confidence interval for the probability. For example:

NumRep = 200;          /** number of bootstrap replications **/
Prop = j(NumRep, 1);   /** allocate vector for results **/
do k = 1 to NumRep;
   /** resample from the original data **/
   ...
   /** proportion of bills that are whole-dollar amounts **/
   Prop[k] = whole[:]; 
end;

meanProp = Prop[:]; 
/** LOAD or define the Qntl module here **/
call Qntl(q, Prop, {0.05 0.95});
print numItems NumRep meanProp q[label="90% CI"];

numItems    NumRep    meanProp          90% CI

      40       200    0.010055    0.0081    0.0124

Notice that 90% of the simulations had proportions that are between 0.81% and 1.24%. I can use this as a 90% confidence interval for the true probability that my grocery receipt will be a whole-dollar amount.

You can download the complete SAS/IML Studio program that computes the estimates and creates the histogram.

Conclusions

My intuition was right: I showed that the chance of my grocery receipt being a whole-dollar amount is about one in a hundred. But, more importantly, I also showed that you can use SAS/IML software to resample from data, run simulations, and implement bootstrap methods.

What's next? Well, all this programming has made me hungry. I’m going to the grocery store!