data analysis

November 12, 2010
The other day I was at the grocery store buying a week's worth of groceries. When the cashier, Kurt (not his real name), totaled my bill, he announced, "That'll be ninety-six dollars, even."

"Even?" I asked incredulously. "You mean no cents?"

"Yup," he replied. "It happens."

"Wow," I said, with a sly statistical grin appearing on my face, "I'll bet that only happens once in a hundred customers!"

Kurt shrugged, uninterested. As I left, I congratulated myself on my subtle humor, which clearly had not amused my cashier. "Get it?" I thought to myself, "One chance in a hundred? The possibilities are 00 through 99 cents."

But as I drove home, I began to wonder: maybe Kurt knows more about grocery bills than I do. I quickly calculated that if Kurt works eight-hour shifts, he probably processes about 100 transactions every day. Does he see one whole-dollar amount every shift, on average? I thought back to my weekly visits to the grocery store over the past two years. I didn't recall another whole-dollar amount.

So what is the probability that this event (a grocery bill that is exactly a multiple of a dollar) happens? Is it really a one-chance-in-a-hundred event, or is it more rare?

The Distribution of Prices for Grocery Items

As I started thinking about the problem, I became less confident that I knew the probability of this event. I tried to recall some theory about the distribution of a sum. "Hmmmm," I thought, "the distribution of the sum of N independent random variables is the convolution of their distributions, so if each item is uniformly distributed...."

I almost got into an accident when the next thought popped into my head: grocery prices are not uniformly distributed!
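
To make the convolution idea concrete, here is a minimal sketch in Python (rather than the SAS used in this post). It tracks only the cents part of a sum, so the convolution wraps around modulo 100. The function names and the "every item ends in .99" distribution are illustrative assumptions, not anything taken from the receipts.

```python
def convolve_cents(p, q):
    """Distribution of (X + Y) mod 100 for independent X ~ p and Y ~ q."""
    out = [0.0] * 100
    for i, pi in enumerate(p):
        if pi == 0.0:
            continue
        for j, qj in enumerate(q):
            out[(i + j) % 100] += pi * qj
    return out


def cents_of_sum(p, n_items):
    """Distribution of the cents part of a sum of n_items independent prices."""
    total = [0.0] * 100
    total[0] = 1.0  # an empty basket costs exactly $0.00
    for _ in range(n_items):
        total = convolve_cents(total, p)
    return total


uniform = [0.01] * 100
all_99 = [0.0] * 99 + [1.0]  # hypothetical extreme: every item ends in .99

print(cents_of_sum(uniform, 20)[0])  # stays at about 0.01 for any basket size
print(cents_of_sum(all_99, 100)[0])  # 100 items at .99 always sum to .00
print(cents_of_sum(all_99, 20)[0])   # 20 items at .99 end in .80, never .00
```

With uniformly distributed cents, the chance of a sum ending in .00 stays at one in a hundred no matter how many items are in the cart. With a skewed distribution, it can depend heavily on the number of items.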

I rushed home and left the milk to spoil on the counter while I hunted for old grocery bills. I found three and set about analyzing the distribution of prices for items that I buy at my grocery store. (If you'd like to do your own analysis, you can download the SAS DATA step program.)

First, I used SAS/IML Studio to create a histogram of the last two digits (the cents) for items I bought. As expected, the distribution is not uniform. More than 20% of the items have 99 cents as part of the price. Almost 10% were "dollar specials."


Frequent values for the last two digits are shown in the following PROC FREQ output:

Last2    Frequency     Percent
 0.99          24       20.51 
 0.00          10        8.55 
 0.19           7        5.98 
 0.49           7        5.98 
 0.69           7        5.98 
 0.50           6        5.13 
 0.89           6        5.13 

The distribution of digits would be even more skewed except for the fact that I buy a lot of vegetables, which are priced by the pound.
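
As a quick consistency check on the PROC FREQ output above, the following Python sketch recomputes the percentages from the counts. The total of 117 items is an assumption inferred from the table (24 items at 20.51%); the post does not state it.

```python
# Counts copied from the PROC FREQ output above.
counts = {"0.99": 24, "0.00": 10, "0.19": 7, "0.49": 7,
          "0.69": 7, "0.50": 6, "0.89": 6}

# Assumption: 117 items in total, inferred from 24 / 0.2051; not stated in the post.
N_ITEMS = 117

percent = {cents: round(100 * k / N_ITEMS, 2) for cents, k in counts.items()}
for cents, pct in percent.items():
    # Under a uniform distribution each ending would appear about 1% of the time,
    # so the percentage is also the multiple of the uniform frequency.
    print(f"{cents}  {counts[cents]:3d}  {pct:6.2f}%  ({pct:.1f}x uniform)")

listed_share = round(100 * sum(counts.values()) / N_ITEMS, 2)
print("share of items covered by these seven endings:", listed_share)
```

Under that assumption the recomputed percentages match the table, and these seven endings alone account for more than half of the items.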

Hurray for Sales Tax!

Next I wondered whether sales tax affects the chances that my grocery bill is a whole-dollar amount. A sales tax of S results in a total bill that is higher than the cost of the individual items:

Total = (1 + S) * (Cost of Items)

Sales tax is good, if your goal is to get a "magic number" (that is, a whole-dollar amount) on your total grocery bill. Why? Sales tax increases the chances of getting a magic number. Look at it this way: if there is no sales tax, then the total cost of my groceries is a whole-dollar amount when the total cost of the items is $1.00, $2.00, $3.00, and so on. There is exactly $1 between each magic number. However, in Wake County, NC, we have a 7.75% sales tax. My post-tax total will be a whole-dollar amount if the total pre-tax cost of my groceries is $0.93, $1.86, $2.78, $3.71, and so on. These pre-tax numbers are only 92 or 93 cents apart, and therefore happen more frequently than if there were no sales tax. With sales tax rates at a record high in the US, I wonder if other shoppers are seeing more whole-dollar grocery bills?
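
The pre-tax amounts listed above can be reproduced with a short Python sketch. It assumes that the tax is applied to the whole bill and that the total is rounded half-up to the nearest cent; the actual rounding rule at the register may differ. The function names are illustrative.

```python
def post_tax_cents(pre_tax_cents, rate_bp=775):
    """Post-tax total in cents; rate_bp is the tax rate in hundredths of a percent.

    Assumption: tax is applied to the whole bill and the total is rounded
    half-up to the nearest cent. Integer arithmetic avoids float error.
    """
    return (pre_tax_cents * (10_000 + rate_bp) + 5_000) // 10_000


def magic_pre_tax_amounts(max_cents, rate_bp=775):
    """Pre-tax totals (in cents) whose post-tax total is a whole-dollar amount."""
    return [c for c in range(1, max_cents + 1)
            if post_tax_cents(c, rate_bp) % 100 == 0]


magic = magic_pre_tax_amounts(400)
print([f"${c / 100:.2f}" for c in magic])
# → ['$0.93', '$1.86', '$2.78', '$3.71']
print([b - a for a, b in zip(magic, magic[1:])])
# → [93, 92, 93]
```

Calling `magic_pre_tax_amounts(400, rate_bp=0)` gives the no-tax case, where the magic numbers are exactly one dollar apart.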

This suggests that my chances might not be one in 100, but might be as high as one in 93—assuming that the last digits of my pre-tax costs are uniformly distributed. But are they? There is still the nagging fact that grocery items tend to be priced non-uniformly at values such as $1.99 and $1.49.

Simulating My Grocery Bill

There is a computational way to estimate the odds: I can resample from the data and simulate a bunch of grocery bills to estimate how often the post-tax bill is a whole-dollar amount. Since this post is getting a little long, I'll report the results of the simulation next week. If you are impatient, why not try it yourself?
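
For the impatient, here is one way such a resampling simulation might be sketched in Python. It is not the simulation from the follow-up post. The price list is hypothetical (it is not the receipt data), the fixed basket size of 30 items is an assumption, and the total is assumed to be rounded half-up to the nearest cent after 7.75% tax.

```python
import random


def post_tax_cents(pre_tax_cents, rate_bp=775):
    """Post-tax total in cents, rounded half-up (assumed rounding rule)."""
    return (pre_tax_cents * (10_000 + rate_bp) + 5_000) // 10_000


def whole_dollar_rate(prices_cents, n_items, n_bills, rate_bp=775, seed=None):
    """Fraction of simulated bills whose post-tax total is a whole-dollar amount.

    Each bill resamples n_items prices, with replacement, from prices_cents.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_bills):
        pre_tax = sum(rng.choices(prices_cents, k=n_items))
        if post_tax_cents(pre_tax, rate_bp) % 100 == 0:
            hits += 1
    return hits / n_bills


# Hypothetical prices in cents, NOT the data from the receipts in this post.
sample_prices = [99, 199, 249, 100, 349, 50, 119, 469, 89, 299, 137, 262]

estimate = whole_dollar_rate(sample_prices, n_items=30, n_bills=50_000, seed=1)
print(f"estimated chance of a whole-dollar bill: {estimate:.4f}")
```

To run it on real data, replace `sample_prices` with the item prices from your own receipts, in cents, and choose `n_items` to match a typical shopping trip.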

Game-changing Analytics

conference, data analysis, Data Mining, m2010, Michele Reister
October 23, 2010
The M2010 Data Mining Conference team has arrived in Las Vegas and we’re getting ready to host the 750+ analysts, statisticians and database managers who will be descending upon Sin City this weekend for our 13th annual conference.

Before we left, my colleague Carrie Vetter had a chance to sit down with two of our speakers, Michael Goul and Sule Balkan from Arizona State University. Listen to the podcast interview to learn about game-changing in-database analytics.

Top Ten Government Web Sites for Downloading Data

data analysis
October 20, 2010
Today is World Statistics Day, an event set up to "highlight the role of official statistics and the many achievements of the national statistical system."

I want to commemorate World Statistics Day by celebrating the role of the US government in data collection and dissemination.

Data analysis begins with data. Over the years I have bookmarked several interesting US government Web sites that enable you to download data from samples or surveys. In the last several years, I have seen several of these sites go beyond merely making data available to become sites that offer data visualization in the form of maps, bar charts, or line plots.

Here is my Top 10 list of US government Web sites where you can download interesting data:

  1. Bureau of Transportation Statistics (BTS)
    Are you making airline reservations and want to check whether your plane is likely to be delayed? The BTS site has all sorts of statistics on transportation, and was used as the source for the data for the 2009 ASA Data Expo poster session.

  2. Centers for Disease Control and Prevention (CDC)
    Did you know that 4,316,233 births were registered in 2007 and that 40% of them were to unmarried mothers? Did you know that about one third of those births were by cesarean delivery? At the CDC you can explore trends and analyze hundreds of data sets by race, gender, age, and state of residence.

  3. Environmental Protection Agency (EPA)
    You can download data on air and water pollution, or find out if any industries near your home are incinerating hazardous waste.

  4. Federal Reserve System (The Fed)
    If you want data on the US economy, this is a great place to begin. A server to build custom data sets enables you to create a map of the percentage of prime mortgages that are in foreclosure. Notice the regional variation!

  5. My NASA Data
    The NASA server at this Web site enables you to create your own customized data set from 150 variables in atmospheric and earth science from five NASA scientific projects. This type of data was used for the 2006 ASA Data Expo.

  6. National Oceanic and Atmospheric Administration (NOAA)
    Interested in subsurface oil monitoring data from ships, buoys, and satellites in the aftermath of the Deepwater Horizon spill? More interested in a historical analysis of hurricanes? All this, and more!

  7. National Center for Atmospheric Research (NCAR)
    Everything you wanted to know about weather and climate in North America and beyond. Download data about temperatures, precipitation, arctic ice, and so on.

  8. US Department of Agriculture (USDA)
    Check out the very cool Food Environment Atlas. The Economic Research Service (ERS) branch of the USDA disseminates many data sets on land use, organic farming, and other agricultural concerns. Several USDA researchers use SAS/IML Studio and regularly present research papers at SAS Global Forum.

  9. US Census Bureau
    Do you want to know where people live who satisfy one of hundreds of demographic characteristics, such as the percent change in population from 2000 to 2009? I have two words for you: "thematic maps."

  10. US Geological Survey (USGS)
    Data on scientific topics such as climate change, erosion, earthquakes, volcanoes, and endangered species. What's not to like?

Did I omit YOUR favorite government site that provides raw or summarized data? Post a comment and a link.

Mixed Feelings about Logistic Regression: Eight Hints for Getting Started with PROC GLIMMIX

Cat Truxillo, data analysis, statistical training
September 11, 2010
Delicious Mixed Model Goodness
Imagine the scene: You’re in your favorite coffee shop, laptop and chai. The last of the data from a four-year study are validated and ready for analysis. You’ve explored the plots, preliminary results are promising, and now it is time to fit the model.

It’s not just any model. It’s a three-level multilevel generalized linear mixed model with a binary response. You’ve used GENMOD before. You’ve used MIXED before. Now the two procedures have been sitting in a tree, K-I-S-S-I-N-G, and along comes GLIMMIX in a baby carriage.

We’ve all been there.

Here are some tips for first-time users of PROC GLIMMIX.

Continue reading "Mixed Feelings about Logistic Regression: Eight Hints for Getting Started with PROC GLIMMIX"
June 16, 2010

Today I took my 3-year-old, Elizabeth, to lunch at the on-site cafeteria. As she enjoyed some mac & cheese, we noticed a spot on the floor.

Elizabeth: What’s that, mommy?
Me: I don’t know, I think it must be where the rug got messed up and they covered it with tiles. Or maybe that’s an access point to something electrical. Or maybe it’s for decoration.
E: No, mommy, I think that when it’s nighttime, and the people are gone, that’s where the possums have their weddings. And when there are no possum weddings, the mice use that spot to make Cinderella’s dress for the ball.

How can this bit of toddlerish wisdom make you a better data analyst? Sometimes you find anomalies in the data. Things that you might not give a second thought to. Outliers, unusual combinations of variables, maybe a funky-looking standard error (how is it that big?). You might brush it off as something mundane.

Continue reading "Possum Romance and Other Data Analysis Anomalies"