
December 3, 2010
 
My last post was a criticism of a statistical graph that appeared in Bloomberg Businessweek. Criticism is easy. Analysis is harder. In this post I re-analyze the data to present two graphics that I think should have replaced the one graphic in Businessweek. You can download the SAS program that creates the plots for this analysis. The program makes use of the excellent SGPLOT and SGPANEL procedures in SAS 9.2.

The original graph visualizes how participation in online social media activities varies across age groups. The graph is reproduced below at a smaller scale:


When I look at these data, I ask myself two questions:

  1. How does participation in social media differ across age groups?
  2. Given that someone in an age group participates, what is the popularity of each activity?

How does participation in social media differ across age groups?

The first question is answered by the "Inactives" plot, but that plot is upside down. You can subtract the data from 100 to get the adjacent bar chart.

The bar chart shows the percentage of each age group that engages in social media activities. You can clearly see the main conclusion: using social media is highly popular among young people, but older people (in 2007) had not embraced it at the same levels.
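If you want to build a similar chart yourself, the following is a minimal PROC SGPLOT sketch, not the downloadable program: the data set Social and the variables AgeGroup and Inactive are hypothetical placeholders.

data Active;
   set Social;                 /* hypothetical: one row per age group */
   Active = 100 - Inactive;    /* participation is 100 minus the "Inactives" percentage */
run;

proc sgplot data=Active;
   vbar AgeGroup / response=Active;
   yaxis label="Percent Who Engage in Online Social Media";
run;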

This plot should be shown separately from the rest of the data because it shows how each age category is divided into two groups: those who participate in online social media and those who do not. Although it is not apparent from the original Businessweek graph, the last row is distinctly different from the others. The category in the last row is mutually exclusive with the first five categories. Said differently, an "Inactive" individual is represented in a single cell in the last row, whereas an "Active" individual might be represented in multiple cells in the first five rows.

For these reasons, I think you should make one chart that shows participation, and a second chart that shows how people participate.

Absolute Percentages or Relative Percentages?

Look at the last column ("Seniors") in the Businessweek chart. The percentages are computed for all seniors who are online. This is the wrong scale if you want to determine whether age affects the kinds of activities in which people participate. Why? Because the previous bar chart shows that participation rates differ across age groups.


Only 30% of seniors participate in any form of social media, so of course the percentage of all seniors who are Critics (11%) is low in absolute terms. But that 11% is actually 37% of the participating seniors (0.37 ≈ 0.11/0.30). In other words, of the seniors who participate in online social media activities, 37% are Critics. By looking at a relative percentage, you can assess whether participation in a given activity varies across age groups for those who participate.
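In code, the conversion is a one-liner. Here is a hedged DATA step sketch that assumes hypothetical variables Pct (the absolute percentage for an activity) and Inactive (the percentage of the age group that is inactive):

data Relative;
   set Social;                              /* one row per age group and activity */
   RelPct = 100 * Pct / (100 - Inactive);   /* e.g., 100*11/(100-70) = 37 for seniors */
run;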

The adjacent line plot shows the relative percentages of the data from the Businessweek graph. This plot answers my second question: Given that someone participates, what is the popularity of each activity?

This visualization is more revealing than the "young people are more active" conclusion of the Businessweek graph. Older people who participate in social media are "Critics" and "Spectators" about as often as younger people, and are "Collectors" more often. Only the "Creator" and "Joiner" activities show a marked decrease in participation by older age groups.

Comparing Participation by Age Groups

There are only five activities to visualize, so it is feasible to create a single line plot that shows how participation in all activities varies across age groups. By placing all of the activities on a common plot, it is easier to determine that the relative percentage of seniors who are "Critics" is the same as the percentage of "Collectors."
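Continuing the hypothetical Relative data set from the earlier sketch, one way to overlay all five activities on a common plot is:

proc sgplot data=Relative;
   series x=AgeGroup y=RelPct / group=Activity markers;
   yaxis label="Percent of Those Who Participate";
run;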

Although this graph would never make its way into Businessweek, I think that it—along with the bar chart of social media use by age—tells the story of the data. And isn't that the goal of statistical graphics?


December 3, 2010
 
Have you used multivariate procedures in SAS and wanted to save out scores? Some procedures, such as FACTOR, CANDISC, CANCORR, and PRINCOMP, have an OUT= option to save scores to the input data set. However, to score a new data set, or to perform scoring with multivariate procedures that do not have an OUT= option, you need another way.

Scoring new data with these procedures is a snap if you know how to use PROC SCORE. The OUTSTAT= data sets from these procedures include the scoring coefficients that PROC SCORE uses to score observations in its DATA= data set. The coefficients are contained in the observations where _TYPE_='SCORE'.

The only catch is that if you use a hierarchical method such as the VARCLUS procedure, then scoring coefficients are included for the 1-cluster, 2-cluster, 3-cluster, and subsequent solutions, up to the final number of clusters in the analysis. PROC SCORE doesn't know which one to use.

So you have one extra (teeny-tiny) step to perform scoring:
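Here is a sketch of that step. It assumes a hypothetical three-cluster solution on variables x1-x10, and it relies on the _NCL_ variable in the OUTSTAT= data set to identify each cluster solution (rows that do not depend on the number of clusters, such as means, have a missing _NCL_):

proc varclus data=Train outstat=VCStat noprint;
   var x1-x10;
run;

data VCScore;
   set VCStat;
   if _NCL_ = 3 or _NCL_ = .;   /* keep the 3-cluster scoring rows plus summary rows */
run;

proc score data=New score=VCScore out=Scored;
   var x1-x10;
run;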
Continue reading "Weekday Morning Quick-trick: How to Score from PROC VARCLUS"
December 1, 2010
 
Recently I read a blog that advertised a data visualization competition. Under the heading "What Are We Looking For?" is a link to a 2007 Bloomberg Businessweek graph that visualizes how participation in online social media activities varies across age groups. The graph is reproduced below at a smaller scale:


A few aspects of this chart bothered me, so in the spirit of Kaiser Fung's excellent Junk Charts blog, here are some thoughts on improving how these data are visualized.

"34%" Is One Number, Not 34

Each cell in the graph consists of a 10 x 10 grid, and the number of colored squares in the grid represents the percentage of a given age group that participates in a given online activity. For example, the cell in the lower left corner has 34 dark gray squares to indicate that 34% of young teens do not participate in online social media activities. That's a lot of ink used to represent a single number!

Furthermore, the chart is arranged so that the colored squares across all age groups simulate a line plot. For example, the graph attempts to show that the percentage of "Inactives" varies across age groups. Note the arrangement of the dark gray squares across the first four age groups:

The four "extra" squares in the first cell (34%) are arranged flush to the left. The gap in the second cell (17%) is put in the middle. (By the way, there should be only 17 colored squares in this cell, not 18.) The extra squares in the next two cells are arranged flush right. The effect is that the eye sees a "line" that decreases, reaches a minimum with 18–21 group, and then starts increasing.

This attempt to form a line plot out of colored squares can be deceptive. For example, by pushing all of the extra squares in one age group to the right and all of the colored squares in the adjacent age group to the left, I can bias your eye to see local minima where there are none. The technique also fails miserably with nearly constant data, such as the orange squares used for the "Collector" group: the eye sees little bumps, whereas the percentages are essentially constant across the age groups.

If You Want a Line Plot...

If you have data suitable for a line plot, then create a line plot. Here is a bare-bones strip-out-the-color-and-focus-on-the-data line chart. It shows the data in an undecorated statistical way that the editors at Businessweek would surely reject! However, it does show the data clearly.
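A chart like that takes only a few statements. The following sketch assumes a hypothetical long-format data set with variables AgeGroup, Activity, and Pct:

proc sgplot data=Social;
   series x=AgeGroup y=Pct / group=Activity;
   yaxis label="Percent of Age Group";
run;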

The line plot shows that participation in most online social media activities peaks with the college-age students and decreases for older individuals. You can also see that the percentage of "Collectors" is essentially constant across age groups. Lastly, you can see that the "Not Active" category is flipped upside down relative to the other categories: it shows the percentage of people who are not active, and therefore reaches a minimum with the college-age students and increases for older individuals.

The line plot formulation helps to show the variation among age groups for each level of activity. You can, of course, use the same data to create a graph that shows the variation in activities for each age group. Perhaps the creator of the Businessweek graph did not use a line plot because he was hoping that one chart could serve two purposes.

Asking Different Questions

When I look at these data, I ask myself two questions:

  1. How does participation in social media differ across age groups?
  2. Given that someone in an age group participates, what is the popularity of each activity?

On Friday I will use these data to create new graphs that answer these questions, thereby presenting an alternate analysis of these data.

Further Improvements

Do you see features of the Businessweek graph that you think could be improved? Do you think that the original graph has merits that I didn't acknowledge? Post a comment.

November 19, 2010
 
In a previous post, I used statistical data analysis to estimate the probability that my grocery bill is a whole-dollar amount such as $86.00 or $103.00. I used three weeks' grocery receipts to show that the last two digits of the prices of items that I buy are not uniformly distributed. (The prices tend to be values, such as $1.49 or $1.99, that are designed to appear lower than they are.) I also showed that sales tax increases the chance of getting a whole-dollar receipt.

In this post, I show how you can use resampling techniques in SAS/IML software to simulate thousands of receipts. You can then estimate the probability that a receipt is a whole-dollar amount, and use bootstrap ideas to examine variation in the estimate.

Simulating Thousands of Receipts

The SAS/IML language is ideal for sampling and simulation in the SAS environment. I previously introduced a SAS/IML module called SampleReplaceUni, which performs sampling with replacement from a finite set. You can use this module to resample from the data.
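If you do not have that module handy, here is a minimal sketch of uniform sampling with replacement in the SAS/IML language (the module from the earlier post may differ in its details). You could define it at the spot marked in the program below:

start SampleReplaceUni(x, nRows, nCols);
   n = nrow(x) * ncol(x);       /** number of elements to sample from **/
   u = j(nRows, nCols);
   call randgen(u, "Uniform");  /** u[i,j] ~ U(0,1) **/
   idx = ceil(n * u);           /** random indices in 1..n **/
   return( shape(x[idx], nRows, nCols) );  /** resampled values **/
finish;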

First run a SAS DATA step program to create the WORK.PRICES data set. The ITEM variable contains prices of individual items that I buy. You can resample from these data. I usually buy about 40 items each week, so I'll use this number to simulate a receipt.
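The structure of that data set is a single numeric variable with one observation per item. The following DATA step shows the shape with made-up prices; the downloadable program contains the real data:

data prices;
   input Item @@;   /* price of one grocery item */
datalines;
1.99 0.99 2.49 3.19 1.49 0.50 4.99 1.00 2.99 0.69
;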

The following SAS/IML statements simulate 5,000 receipts. Each row of the y matrix contains 40 randomly selected items (one simulated receipt), and there are 5,000 rows. The sums across the rows are the pre-tax totals for the receipts:

proc iml;
/** read prices of individual items **/
use prices; read all var {"Item"}; close prices;

SalesTax = 0.0775;   /** sales tax in Wake County, NC **/
numItems = 40;       /** number of items in each receipt **/
NumResamples = 5000; /** number of receipts to simulate **/
call randseed(4321);

/** LOAD or define the SampleReplaceUni module here. **/

/** resample from the original data. **/
y = sampleReplaceUni(Item`, NumResamples, NumItems);
pretax = y[,+];       /** row sums are pre-tax receipt totals **/
total = round((1 + SalesTax) # pretax, 0.01); /** add sales tax **/
whole = (total=int(total)); /** indicator: whole-dollar receipts **/
Prop = whole[:];    /** proportion that are whole-dollar amounts **/
print NumItems NumResamples (sum(whole))[label="Num00"] 
       Prop[format=PERCENT7.2 label="Pct"];

numItems    NumResamples    Num00      Pct
      40            5000       52    1.04%

The simulation generated 52 whole-dollar receipts out of 5,000, or about 1%.

Estimating the Probability of a Whole-Dollar Receipt

The previous computation is the result of a single simulation. In technical terms, it is a point estimate for the probability of obtaining a whole-dollar receipt. If you run another simulation, you will get a slightly different estimate. If you run, say, 200 simulations and plot all of the estimates, you obtain an approximation of the sampling distribution of the estimate. For example, you might obtain something that looks like the following histogram:


The histogram shows the estimates for 200 simulations. The vertical dashed line in the histogram is the mean of the 200 estimates, which is about 1%. So, on average, I should expect my grocery bills to be a whole-dollar amount about 1% of the time. The histogram also indicates how the estimate varies over the 200 simulations.

The Bootstrap Distribution of Proportions

The technique that I've described is known as a resampling or bootstrap technique.

In the SAS/IML language, it is easy to wrap a loop around the previous simulation and to accumulate the result of each simulation. The mean of the accumulated results is a bootstrap estimate for the probability that my grocery receipt is a whole-dollar amount. You can use quantiles of the accumulated results to form a bootstrap confidence interval for the probability. For example:

NumRep = 200;          /** number of bootstrap replications **/
Prop = j(NumRep, 1);   /** allocate vector for results **/
do k = 1 to NumRep;
   /** resample from the original data **/
   y = sampleReplaceUni(Item`, NumResamples, NumItems);
   pretax = y[,+];                               /** pre-tax totals **/
   total = round((1 + SalesTax) # pretax, 0.01); /** add sales tax **/
   whole = (total=int(total));                   /** whole-dollar indicator **/
   /** proportion of bills that are whole-dollar amounts **/
   Prop[k] = whole[:]; 
end;

meanProp = Prop[:]; 
/** LOAD or define the Qntl module here **/
call Qntl(q, Prop, {0.05 0.95});
print numItems NumRep meanProp q[label="90% CI"];

numItems    NumRep    meanProp            90% CI
      40       200    0.010055    0.0081    0.0124

Notice that 90% of the simulations had proportions that are between 0.81% and 1.24%. I can use this as a 90% confidence interval for the true probability that my grocery receipt will be a whole-dollar amount.

You can download the complete SAS/IML Studio program that computes the estimates and creates the histogram.

Conclusions

My intuition was right: I showed that the chance of my grocery receipt being a whole-dollar amount is about one in a hundred. But, more importantly, I also showed that you can use SAS/IML software to resample from data, run simulations, and implement bootstrap methods.

What's next? Well, all this programming has made me hungry. I’m going to the grocery store!

November 12, 2010
 
The other day I was at the grocery store buying a week's worth of groceries. When the cashier, Kurt (not his real name), totaled my bill, he announced, "That'll be ninety-six dollars, even."

"Even?" I asked incredulously. "You mean no cents?"

"Yup," he replied. "It happens."

"Wow," I said, with a sly statistical grin appearing on my face, "I'll bet that only happens once in a hundred customers!"

Kurt shrugged, uninterested. As I left, I congratulated myself on my subtle humor, which clearly had not amused my cashier. "Get it?" I thought to myself, "One chance in a hundred? The possibilities are 00 through 99 cents."

But as I drove home, I began to wonder: maybe Kurt knows more about grocery bills than I do. I quickly calculated that if Kurt works eight-hour shifts, he probably processes about 100 transactions every day. Does he see one whole-dollar amount every shift, on average? I thought back to my weekly visits to the grocery store over the past two years. I didn't recall another whole-dollar amount.

So what is the probability that this event (a grocery bill that is exactly a multiple of a dollar) happens? Is it really a one-chance-in-a-hundred event, or is it more rare?

The Distribution of Prices for Grocery Items

As I started thinking about the problem, I became less confident that I knew the probability of this event. I tried to recall some theory about the distribution of a sum. "Hmmmm," I thought, "the distribution of the sum of N independent random variables is the convolution of their distributions, so if each item is uniformly distributed...."

I almost got into an accident when the next thought popped into my head: grocery prices are not uniformly distributed!

I rushed home and left the milk to spoil on the counter while I hunted for old grocery bills. I found three and set about analyzing the distribution of prices for items that I buy at my grocery store. (If you'd like to do your own analysis, you can download the SAS DATA step program.)

First, I used SAS/IML Studio to create a histogram of the last two digits (the cents) for items I bought. As expected, the distribution is not uniform. More than 20% of the items have 99 cents as part of the price. Almost 10% were "dollar specials."



Frequent values for the last two digits are shown in the following PROC FREQ output:

Last2    Frequency     Percent
------------------------------
 0.99          24       20.51 
 0.00          10        8.55 
 0.19           7        5.98 
 0.49           7        5.98 
 0.69           7        5.98 
 0.50           6        5.13 
 0.89           6        5.13 

The distribution of digits would be even more skewed except for the fact that I buy a lot of vegetables, which are priced by the pound.

Hurray for Sales Tax!

Next I wondered whether sales tax affects the chances that my grocery bill is a whole-dollar amount. A sales tax of S results in a total bill that is higher than the cost of the individual items:

Total = (1 + S) * (Cost of Items)

Sales tax is good—if your goal is to get a "magic number" (that is, a whole-dollar amount) on your total grocery bill. Why? Sales tax increases the chances of getting a magic number. Look at it this way: if there is no sales tax, then the total cost of my groceries is a whole-dollar amount when the total cost of the items is $1.00, $2.00, $3.00, and so on. There is exactly $1 between each magic number. However, in Wake County, NC, we have a 7.75% sales tax. My post-tax total will be a whole-dollar amount if the total pre-tax cost of my groceries is $0.93, $1.86, $2.78, $3.71, and so on. These pre-tax numbers are only 92 or 93 cents apart, and therefore occur more frequently than if there were no sales tax. With sales tax rates at a record high in the US, I wonder if other shoppers are seeing more whole-dollar grocery bills?
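You can verify those "magic" pre-tax targets with a short DATA step; the data set name is hypothetical:

data Magic;
   SalesTax = 0.0775;
   do Dollars = 1 to 5;
      /* pre-tax totals that produce whole-dollar bills: 0.93, 1.86, 2.78, 3.71, 4.64 */
      PreTax = round(Dollars / (1 + SalesTax), 0.01);
      output;
   end;
run;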

This suggests that my chances might not be one in 100, but might be as high as one in 93—assuming that the last digits of my pre-tax costs are uniformly distributed. But are they? There is still the nagging fact that grocery items tend to be priced non-uniformly at values such as $1.99 and $1.49.

Simulating My Grocery Bill

There is a computational way to estimate the odds: I can resample from the data and simulate a bunch of grocery bills to estimate how often the post-tax bill is a whole-dollar amount. Since this post is getting a little long, I'll report the results of the simulation next week. If you are impatient, why not try it yourself?

Game-changing Analytics

October 23, 2010
 
The M2010 Data Mining Conference team has arrived in Las Vegas, and we're getting ready to host the 750+ analysts, statisticians, and database managers who will be descending on Sin City this weekend for our 13th annual conference.

Before we left, my colleague Carrie Vetter had a chance to sit down with two of our speakers, Michael Goul and Sule Balkan from Arizona State University. Listen to the podcast interview to learn about game-changing in-database analytics.

Top Ten Government Web Sites for Downloading Data

October 20, 2010
 
Today is World Statistics Day, an event set up to "highlight the role of official statistics and the many achievements of the national statistical system."

I want to commemorate World Statistics Day by celebrating the role of the US government in data collection and dissemination.

Data analysis begins with data. Over the years I have bookmarked several interesting US government Web sites that enable you to download data from samples or surveys. In the last several years, I have seen several of these sites go beyond merely making data available to become sites that offer data visualization in the form of maps, bar charts, or line plots.

Here is my Top 10 list of US government Web sites where you can download interesting data:

  1. Bureau of Transportation Statistics (BTS)
    Are you making airline reservations and want to check whether your plane is likely to be delayed? The BTS site has all sorts of statistics on transportation, and was used as the source for the data for the 2009 ASA Data Expo poster session.

  2. Centers for Disease Control and Prevention (CDC)
    Did you know that 4,316,233 births were registered in 2007 and that 40% of them were to unmarried mothers? Did you know that about one third of those births were by cesarean delivery? At the CDC you can explore trends and analyze hundreds of data sets by race, gender, age, and state of residence.

  3. Environmental Protection Agency (EPA)
    You can download data on air and water pollution, or find out if any industries near your home are incinerating hazardous waste.

  4. Federal Reserve System (The Fed)
    If you want data on the US economy, this is a great place to begin. A server to build custom data sets enables you to create a map of the percentage of prime mortgages that are in foreclosure. Notice the regional variation!

  5. My NASA Data
    The NASA server at this Web site enables you to create your own customized data set from 150 variables in atmospheric and earth science from five NASA scientific projects. This type of data was used for the 2006 ASA Data Expo.

  6. National Oceanic and Atmospheric Administration (NOAA)
    Interested in subsurface oil monitoring data from ships, buoys, and satellites in the aftermath of the Deepwater Horizon spill? More interested in a historical analysis of hurricanes? All this, and more!

  7. National Center for Atmospheric Research (NCAR)
    Everything you wanted to know about weather and climate in North America and beyond. Download data about temperatures, precipitation, arctic ice, and so on.

  8. US Department of Agriculture (USDA)
    Check out the very cool Food Environment Atlas. The Economic Research Service (ERS) branch of the USDA disseminates many data sets on land use, organic farming, and other agricultural concerns. Several USDA researchers use SAS/IML Studio and regularly present research papers at SAS Global Forum.

  9. US Census Bureau
    Do you want to know where people live who satisfy one of hundreds of demographic characteristics, such as the percent change in population from 2000 to 2009? I have two words for you: "thematic maps."

  10. US Geological Survey (USGS)
    Data on scientific topics such as climate change, erosion, earthquakes, volcanoes, and endangered species. What's not to like?
Did I omit YOUR favorite government site that provides raw or summarized data? Post a comment and a link.

Mixed Feelings about Logistic Regression: Eight Hints for Getting Started with PROC GLIMMIX

September 11, 2010
 
Delicious Mixed Model Goodness
Imagine the scene: You're in your favorite coffee shop with your laptop and a chai. The last of the data from a four-year study are validated and ready for analysis. You've explored the plots, preliminary results are promising, and now it is time to fit the model.

It’s not just any model. It’s a three-level multilevel generalized linear mixed model with a binary response. You’ve used GENMOD before. You’ve used MIXED before. Now the two procedures have been sitting in a tree, K-I-S-S-I-N-G, and along comes GLIMMIX in a baby carriage.

We’ve all been there.
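For the record, the skeleton of such a model might look like the following sketch. Every name is hypothetical (a binary response Y, one fixed effect X, and students nested in classrooms nested in schools), and the choice of estimation method deserves more thought than a blog teaser can give it:

proc glimmix data=Study method=laplace;
   class School Classroom;
   model Y(event='1') = X / dist=binary link=logit solution;
   random intercept / subject=School;              /* school level */
   random intercept / subject=Classroom(School);   /* classroom within school */
run;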

Here are some tips for first-time users of PROC GLIMMIX.

Continue reading "Mixed Feelings about Logistic Regression: Eight Hints for Getting Started with PROC GLIMMIX"
June 16, 2010
 

Today I took my 3-year-old, Elizabeth, to lunch at the on-site cafeteria. As she enjoyed some mac & cheese, we noticed a spot on the floor.

Elizabeth: What’s that, mommy?
Me: I don’t know, I think it must be where the rug got messed up and they covered it with tiles. Or maybe that’s an access point to something electrical. Or maybe it’s for decoration.
E: No, mommy, I think that when it’s nighttime, and the people are gone, that’s where the possums have their weddings. And when there are no possum weddings, the mice use that spot to make Cinderella’s dress for the ball.

How can this bit of toddlerish wisdom make you a better data analyst? Sometimes you find anomalies in the data. Things that you might not give a second thought to. Outliers, unusual combinations of variables, maybe a funky-looking standard error (how is it that big?). You might brush it off as something mundane.

Continue reading "Possum Romance and Other Data Analysis Anomalies"