data analysis

January 6, 2011
 
The Junk Charts blog discusses problems with a chart that (poorly) presents statistics on the prevalence of shark attacks by different species.

Here are the same data, presented by overlaying two bar charts with the SGPLOT procedure. I think this approach works well because the number of deaths is always fewer than the number of attacks, and the visual juxtaposition of deaths and attacks gives the viewer an impression of the relative risk that an attack by each species is fatal.


The graph was created by using the following SAS program.

data Sharks;
input Species $14. Attacks Deaths;
label Attacks="Attacks" Deaths="Deaths";
datalines;
White         394 61
Tiger         140 28
Bull           98 21
Sandtiger      74  2
Hammerhead     41  0
Blacktip       33  0
Blue           36  4
Blacktip reef  16  0
Shortfin mako  43  2
Gray reef       9  0
;
run;
proc sort data=Sharks; by descending Attacks; run;
data Others;
Species="Others"; Attacks=276; Deaths=9; output;
run;

data Sharks; set Sharks Others; run;

title 'Shark Attacks and Deaths by Species';
footnote justify=left 'Source: Shark Foundation';
footnote2 justify=left 'http://www.shark.ch/Information/Accidents/index.html';
proc sgplot data=Sharks;
   hbar Species / response=Attacks datalabel;
   hbar Species / response=Deaths datalabel;
   xaxis label=" " grid;
   yaxis label=" " discreteorder=data;
run;
January 4, 2011
 
Happy New Year!! This is a good time to think about what was going on here in SAS Education one year ago, and to introduce you to a big project that I'm really excited to "take public."

In January 2010 (as well as throughout 2009), we kept getting cries for help like this one: "Where do we find MBAs with quantitative skills? They need to know predictive modeling, large-scale time series forecasting, cluster analysis, design and analysis of experiments, honest assessment, and model management. Our problems cannot be solved with t-tests and most universities are not teaching the skills business analysts really need in the modern workplace."

Companies want to hire professionals who have a good head for business, an understanding of data analysis, and facility with advanced software for fitting sophisticated models. They have told us that these people are getting harder and harder to find.

Of course, SAS has had training on statistical analysis and software for decades. But analytical directors and business analysts frequently find themselves in a position of needing to know more about how to apply analytics in the business context: how do you make the most effective use of your (massive, messy, opportunistic) data? How do you "think analytically"?

About half of my time in 2010 (with help from some brilliant colleagues, as well) was spent developing a course to address this need: Advanced Analytics for the Modern Business Analyst. The first version was made exclusively available to university professors last August, and many of them are already teaching it in their graduate programs. Degree-seeking students are getting better prepared for handling large-scale data analysis in the so-called "real-world." We were all very excited about launching the course with the universities and worked individually with over 30 professors worldwide to train them and provide the software needed to teach this course.

Starting in 2011 the course will also be available to professionals.

In March, we are offering a nine-week class that combines statistical and software training with the business context: think of it as a "how to solve problems with data analysis" class. It is like having that extra semester of graduate training that you wish had been offered.

This course is also a chance to try out a better way of learning. Instead of just a weekly live web class, this course models the strategies many universities are using to offer their distance learning curricula, making it easier for the student to retain learning from week to week and using multiple learning modes to reinforce understanding (reading, writing, listening, watching).

Each week, students will participate in:
...a 3.5-hour live web lecture with your instructor and classmates.
...practice activities using software on our servers.
...assigned reading of book chapters and articles to support the class topics, as well as discussion forums to exchange ideas about the readings.

The total time commitment is about 5-6 hours a week, roughly the same as a semester-long course.

You can read a little more about it here in the SAS Training Newsletter.

You can see the course outline here. There are only a few seats left, so if you're interested, you might want to jump on it. Since every teaching experience is also a learning experience, I hope that teaching this class will generate lots of new ideas and blog posts in the coming months. I'll let you know how it's coming along.

Does your organization have people who can make sense of masses of data, forecast future behavior, and find the next best strategy to gain a competitive edge?

See you in class!
December 17, 2010
 
When I finished writing my book, Statistical Programming with SAS/IML Software, I was elated. However, one small task still remained.

I had to write the index.

How Long Should an Index Be?

My editor told me that SAS Press would send the manuscript to a professional indexer who would index the book for me. They did, and I got back about 30 pages of terms and page numbers! The index shrank a little when I incorporated all those terms into LaTeX, which displays the index in two columns. However, it still occupies 11 pages and includes almost 1200 entries!

I was a little annoyed. The entire book is only 440 pages—why is the index so long? I asked a colleague of mine who is an editor, and she said it didn't seem long to her. She referred me to Technical Writing 101: A Real-World Guide to Planning and Writing Technical Documentation (p. 148):

In technical documentation,... the rule of thumb is that the index should contain approximately one page of two-column index entries for every 20 pages of documentation.
That's crazy, I thought. There's no way my book should have a 22-page index! That might be a valid rule of thumb for documentation, but not for a statistical book.

But if I reject that rule of thumb, how can I determine whether the length of my index is appropriate compared to the length of my book? I decided to do what any sensible statistician would do in this situation: collect data.

Data for Index Length versus Book Length, by Publisher

A colleague and I marched down to the SAS library. We sampled books from three different publishers: SAS Press, Springer, and Wiley. For each book we recorded the number of pages (not including the index) and the number of pages for the index. You can download the data and the SAS/IML Studio program that analyzes the data.


The previous image summarizes the data. It shows the size of each book's index versus the size of the book. The books are colored according to the publisher. Because the data for each publisher appear to be linear except for outliers (for example, the 447-page Springer book with no index, which is a set of conference proceedings), I fit and added robust regression lines computed by the LTS algorithm in the ROBUSTREG procedure. I used a no-intercept model so that I could more easily interpret the slope of the regression line as a ratio that reveals the expected number of index pages, given the number of pages in the book. (The regression lines look similar for a model that includes an intercept term.)
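The following sketch shows the kind of fit described above. The data set name (Books) and the variable names (Publisher, BookPages, IndexPages) are hypothetical; only the use of the LTS method and the no-intercept model come from the analysis itself:

proc sort data=Books;            /* BY-group processing requires sorted data */
   by Publisher;
run;

proc robustreg data=Books method=lts;
   by Publisher;                              /* one robust line per publisher */
   model IndexPages = BookPages / noint;      /* slope = index pages per page of text */
run;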

The coefficients of the regression model indicate that SAS Press books average about one page of index for each 26 pages of text. In contrast, the other publishers average about one page of index for each 65 pages of text. In other words, for this sample the professionally compiled SAS Press indexes contain, on average, more than twice as many pages as the indexes of comparable books printed by other publishers.

The analysis reveals that the index of my book (marked by an "X") is not too long, at least by the SAS Press standards. In fact, the length of my index is low compared with other SAS Press books, although it is on the high end of Springer and Wiley books of similar length.


I contacted a colleague who wrote one of the Springer books in the sample. I asked whether he had used a professional indexer, or had compiled the index himself. He said that he created the index himself:

We assumed it would cost us [to use a professional indexer], and we were concerned that the indexer wouldn't understand the topic. Doing it myself enabled me to ensure that spelling was consistent. Initially I intended to get students to test the index for me, but at the end of the process I had rather lost my enthusiasm!

I cannot comment on the practices of other publishers, but SAS Press does not charge their authors for this professional service. Still, I share his sense of fatigue: writing an index is hard and tiring work.

If I had compiled the index myself, would I have had the stamina to include 1200 entries? Definitely not. But am I glad that my index is thorough, useful, and professionally compiled? Absolutely.

The cost to me? Three weeks of drudgery. The benefit to my readers? Priceless. Thanks, SAS Press!

December 8, 2010
 
Sample covariance matrices and correlation matrices are used frequently in multivariate statistics. This post shows how to compute these matrices in SAS and use them in a SAS/IML program. There are two ways to compute these matrices:
  1. Compute the covariance and correlation with PROC CORR and read the results into PROC IML
  2. Compute the matrices entirely with PROC IML

Computing a Covariance and Correlation Matrix with PROC CORR

You can use PROC CORR to compute the correlation matrix (or, more correctly, the "Pearson product-moment correlation matrix," since there are other measures of correlation which you can also compute with PROC CORR). The following statements compute the covariance matrix and the correlation matrix for the three numerical variables in the SASHELP.CLASS data set.

ods select Cov PearsonCorr;
proc corr data=sashelp.class noprob outp=OutCorr /** store results **/
          nomiss /** listwise deletion of missing values **/
          cov;   /**  include covariances **/
var Height Weight Age;
run;

The OutCorr data set contains various statistics about the data, as shown by running PROC PRINT:

proc print data=OutCorr; run;

Obs    _TYPE_    _NAME_     Height     Weight        Age
  1    COV       Height     26.287    102.493     6.2099
  2    COV       Weight    102.493    518.652    25.1857
  3    COV       Age         6.210     25.186     2.2281
  4    MEAN                 62.337    100.026    13.3158
  5    STD                   5.127     22.774     1.4927
  6    N                    19.000     19.000    19.0000
  7    CORR      Height      1.000      0.878     0.8114
  8    CORR      Weight      0.878      1.000     0.7409
  9    CORR      Age         0.811      0.741     1.0000

If you want to use the covariance or correlation matrix in PROC IML, you can read the appropriate values into a SAS/IML matrix by using a WHERE clause on the USE statement:


proc iml;
use OutCorr where(_TYPE_="COV");
read all var _NUM_ into cov[colname=varNames];

use OutCorr where(_TYPE_="CORR");
read all var _NUM_ into corr[colname=varNames];
close OutCorr;

Notice that this SAS/IML code is independent of the number of variables in the data set.

Computation of the Covariance and Correlation Matrix in PROC IML

If the data are in SAS/IML vectors, you can compute the covariance and correlation matrices by using matrix multiplication to form the matrix that contains the corrected sum of squares of cross products (CSSCP).

Suppose you are given p SAS/IML vectors x1, x2, ..., xp. To form the covariance matrix for these data:

  1. Use the horizontal concatenation operator to concatenate the vectors into a matrix whose columns are the vectors.
  2. Center each vector by subtracting the sample mean.
  3. Form the CSSCP matrix (also called the "X-prime-X matrix") by multiplying the matrix transpose and the matrix.
  4. Divide by n-1 where n is the number of observations in the vectors.
This process assumes that there are no missing values in the data. Otherwise, it needs to be slightly amended. Formulas for various matrix quantities are given in the SAS/STAT User's Guide.
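For example, if the data were in separate SAS/IML column vectors, Step 1 might look like the following one-line sketch (the vector names are hypothetical):

X = x1 || x2 || x3;   /** horizontal concatenation: the columns of X are the data vectors **/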

The following SAS/IML statements define a SAS/IML module that computes the sample covariance matrix of a data matrix. For this example the data are read from a SAS data set, so Step 1 (horizontal concatenation of vectors) is skipped.

proc iml;
start Cov(A);             /** define module to compute a covariance matrix **/
   n = nrow(A);           /** assume no missing values **/
   C = A - A[:,];         /** subtract mean to center the data **/
   return( (C` * C) / (n-1) );
finish;

/** read or enter data matrix into X **/
varNames = {"Height" "Weight" "Age"};
use sashelp.class; read all var varNames into X; close sashelp.class;

cov = Cov(X);
print cov[c=varNames r=varNames];

cov          Height      Weight         Age
Height    26.286901   102.49342   6.2099415
Weight    102.49342   518.65205   25.185673
Age       6.2099415   25.185673   2.2280702

Computing the Pearson correlation matrix requires the same steps, but also that the columns of the centered data matrix be scaled to have unit standard deviation. SAS/IML software already has a built-in CORR function, so it is not necessary to define a Corr module, but it is nevertheless instructive to see how such a module might be written:

start MyCorr(A);
   n = nrow(A);                   /** assume no missing values     **/
   C = A - A[:,];                 /** center the data              **/
   stdCol = sqrt(C[##,] / (n-1)); /** std deviation of columns     **/
   stdC = C / stdCol;             /** assume data are not constant **/
   return( (stdC` * stdC) / (n-1) );
finish;

corr = MyCorr(X);
print corr[c=varNames r=varNames];

corr          Height      Weight         Age
Height            1   0.8777852   0.8114343
Weight    0.8777852           1   0.7408855
Age       0.8114343   0.7408855           1

You should use the built-in CORR function instead of the previous module, because the built-in function handles the case of constant data.

Computation of the Covariance and Correlation Matrix in PROC IML (post-9.2)

In November 2010, SAS released the 9.22 (pronounced "nine point twenty-two") release of SAS/IML software. This release includes the following:

  • a built-in COV function which handles missing values in either of two ways
  • new features for the built-in CORR function including
    • handling missing values in either of two ways
    • support for different measures of correlation, including rank-based correlations
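As a rough sketch of how these functions might be called (the optional arguments shown are assumptions based on the feature list above; check the SAS/IML 9.22 documentation for the exact syntax):

proc iml;
use sashelp.class;
read all var {"Height" "Weight" "Age"} into X;
close sashelp.class;

cov1  = cov(X);                            /** covariance matrix                          **/
cov2  = cov(X, "pairwise");                /** assumed: pairwise handling of missing data **/
corr1 = corr(X);                           /** Pearson correlation (default)              **/
corr2 = corr(X, "spearman", "pairwise");   /** assumed: rank-based correlation option     **/
print cov1, corr1;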
December 3, 2010
 
My last post was a criticism of a statistical graph that appeared in Bloomberg Businessweek. Criticism is easy. Analysis is harder. In this post I re-analyze the data to present two graphics that I think should have replaced the one graphic in Businessweek. You can download the SAS program that creates the plots for this analysis. The program makes use of the excellent SGPLOT and SGPANEL procedures in SAS 9.2.

The original graph visualizes how participation in online social media activities varies across age groups. The graph is reproduced below at a smaller scale:

Click to Enlarge

When I look at these data, I ask myself two questions:

  1. How does participation in social media differ across age groups?
  2. Given that someone in an age group participates, what is the popularity of each activity?

How does participation in social media differ across age groups?

Click to Enlarge

The first question is answered by the "Inactives" plot, but that graph is upside down. You can subtract the data from 100 to get the adjacent bar chart.

The bar chart shows the percentage of each age group that engages in social media activities. You can clearly see the main conclusion: using social media is highly popular among young people, but older people (in 2007) had not embraced it at the same levels.
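A minimal sketch of that transformation and chart, assuming the "Inactives" percentages are in a data set named Inactives with hypothetical variables AgeGroup and Pct:

data Active;
   set Inactives;
   Active = 100 - Pct;   /* percentage of the age group that participates */
run;

proc sgplot data=Active;
   hbar AgeGroup / response=Active datalabel;
   xaxis label="Percent Who Participate in Online Social Media";
run;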

This plot should be shown separately from the rest of the data because it shows how each age category is divided into two groups: those who participate in online social media and those who do not. Although it is not apparent from the original Businessweek graph, the last row is distinctly different from the others. The category in the last row is mutually exclusive of the first five categories. Said differently, an "Inactive" individual is represented in a single cell in the last row, whereas an "Active" individual might be represented in multiple cells in the first five rows.

For these reasons, I think you should make one chart that shows participation, and a second chart that shows how people participate.

Absolute Percentages or Relative Percentages?

Look at the last column ("Seniors") in the Businessweek chart. The percentages are for all seniors who are online. This is the wrong scale if you want to determine whether age affects the kinds of activities that people participate in. Why? Because the previous bar chart shows that participation rates are different across age groups.

Click to Enlarge

Only 30% of seniors participate in any form of social media, so of course the percentage of all seniors that are Critics (11%) is low in absolute terms. But that 11% is actually 37% of the participating seniors (0.37 = 0.11/0.30). In other words, of the seniors who participate in online social media activities, 37% of them are critics. By looking at a relative percentage, you can assess whether participation in a given activity varies across age groups for those who participate.
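A sketch of that conversion, assuming the data are in long form with hypothetical variables AgeGroup, Activity, Pct (the absolute percentage), and Participation (the percentage of the age group that participates in any activity):

data Relative;
   set SocialMedia;
   RelPct = 100 * Pct / Participation;   /* e.g., 100*11/30 = 37% for senior Critics */
run;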

The adjacent line plot shows the relative percentages of the data from the Businessweek graph. This plot answers my second question: Given that someone participates, what is the popularity of each activity?

This visualization is more revealing than the "young people are more active" conclusion of the Businessweek graph. Older people who participate in social media are "Critics" and "Spectators" about as often as younger people, and are "Collectors" more often. Only the "Creator" and "Joiner" activities show a marked decrease in participation by older age groups.

Comparing Participation by Age Groups

There are only five activities to visualize, so it is feasible to create a single line plot that shows how participation in all activities varies across age groups. Placing all of the activities on a common plot makes it easier to see that the relative percentage of seniors who are "Critics" is the same as the percentage who are "Collectors."

Although this graph would never make its way into Businessweek, I think that it—along with the bar chart of social media use by age—tells the story of the data. And isn't that the goal of statistical graphics?

Click to Enlarge

December 3, 2010
 
Have you used multivariate procedures in SAS and wanted to save out scores? Some procedures, such as FACTOR, CANDISC, CANCORR, and PRINCOMP, have an OUT= option that saves scores to a copy of the input data set. However, to score a new data set, or to perform scoring with multivariate procedures that do not have an OUT= option, you need another way.

Scoring new data with these procedures is a snap if you know how to use PROC SCORE. The OUTSTAT= data sets from these procedures include the scoring coefficients that PROC SCORE uses to score observations in the DATA= data set. They are contained in the observations where _TYPE_='SCORE'.
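The general scoring workflow looks like the following sketch, which uses PROC PRINCOMP as the example. The data set names and variable list are hypothetical, and this is not the VARCLUS-specific trick described in the full post:

proc princomp data=Train outstat=PCAStats noprint;
   var x1-x10;
run;

proc score data=NewData score=PCAStats out=Scored;
   var x1-x10;               /* scores are computed from the _TYPE_='SCORE' observations */
run;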

The only catch, however, is that if you use a hierarchical method such as the VARCLUS procedure, then scoring coefficients are included for the 1-, 2-, and 3-cluster solutions, and so on, up to the final number of clusters in the analysis. PROC SCORE doesn't know which one to use.

So you have one extra (teeny-tiny) step to perform scoring:
Continue reading "Weekday Morning Quick-trick: How to Score from PROC VARCLUS"
December 1, 2010
 
Recently I read a blog that advertised a data visualization competition. Under the heading "What Are We Looking For?" is a link to a 2007 Bloomberg Businessweek graph that visualizes how participation in online social media activities varies across age groups. The graph is reproduced below at a smaller scale:

Click to Enlarge

A few aspects of this chart bothered me, so in the spirit of Kaiser Fung's excellent Junk Charts blog, here are some thoughts on improving how these data are visualized.

"34%" Is One Number, Not 34

Each cell in the graph consists of a 10 x 10 grid, and the number of colored squares in the grid represents the percentage of a given age group that participate in a given online activity. For example, the cell in the lower left corner has 34 dark gray colored squares to indicate that 34% of young teens do not participate in social media activities online. That's a lot of ink used to represent a single number!

Furthermore, the chart is arranged so that the colored squares across all age groups simulate a line plot. For example, the graph attempts to show that the percentage of "Inactives" varies across age groups. Note the arrangement of the dark gray squares across the first four age groups:

The four "extra" squares in the first cell (34%) are arranged flush to the left. The gap in the second cell (17%) is put in the middle. (By the way, there should be only 17 colored squares in this cell, not 18.) The extra squares in the next two cells are arranged flush right. The effect is that the eye sees a "line" that decreases, reaches a minimum with 18–21 group, and then starts increasing.

This attempt to form a line plot out of colored squares can be deceptive. For example, by pushing all of the extra squares in one age group to the right and all of the colored squares in the adjacent age group to the left, I can bias your eye to see local minima where there are none. This technique also fails miserably with nearly constant data, such as the orange squares used for the "Collector" group. The eye sees little bumps, whereas the percentages are essentially constant across the age groups.

If You Want a Line Plot...

If you have data suitable for a line plot, then create a line plot. Here is a bare-bones strip-out-the-color-and-focus-on-the-data line chart. It shows the data in an undecorated statistical way that the editors at Businessweek would surely reject! However, it does show the data clearly.
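Such a chart can be created with the SGPLOT procedure. The following sketch is not the exact program that produced the chart; it assumes the data are in long form with hypothetical variables AgeGroup, Activity, and Pct:

proc sgplot data=SocialMedia;
   series x=AgeGroup y=Pct / group=Activity markers;
   yaxis label="Percent of Age Group" grid;
   xaxis discreteorder=data label=" ";
run;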

The line plot shows that participation in most online social media activities peaks with the college-age students and decreases for older individuals. You can also see that the percentage of "Collectors" is essentially constant across age groups. Lastly, you can see that the "Not Active" category is flipped upside down from the previous category. It shows the percentage of people who are not active, and therefore reaches a minimum with the college-age students and increases for older individuals.

The line plot formulation helps to show the variation among age groups for each level of activity. You can, of course, use the same data to create a graph that shows the variation in activities for each age group. Perhaps the creator of the Businessweek graph did not use a line plot because he was hoping that one chart could serve two purposes.

Asking Different Questions

When I look at these data, I ask myself two questions:

  1. How does participation in social media differ across age groups?
  2. Given that someone in an age group participates, what is the popularity of each activity?

On Friday I will use these data to create new graphs that answer these questions, thereby presenting an alternate analysis of these data.

Further Improvements

Do you see features of the Businessweek graph that you think could be improved? Do you think that the original graph has merits that I didn't acknowledge? Post a comment.

November 19, 2010
 
In a previous post, I used statistical data analysis to estimate the probability that my grocery bill is a whole-dollar amount such as $86.00 or $103.00. I used three weeks' worth of grocery receipts to show that the last two digits of prices on items that I buy are not uniformly distributed. (The prices tend to be values such as $1.49 or $1.99 that are chosen to appear lower in price.) I also showed that sales tax increases the chance of getting a whole-dollar receipt.

In this post, I show how you can use resampling techniques in SAS/IML software to simulate thousands of receipts. You can then estimate the probability that a receipt is a whole-dollar amount, and use bootstrap ideas to examine variation in the estimate.

Simulating Thousands of Receipts

The SAS/IML language is ideal for sampling and simulation in the SAS environment. I previously introduced a SAS/IML module called SampleReplaceUni which performs sampling with replacement from a set. You can use this module to resample from the data.

First run a SAS DATA step program to create the WORK.PRICES data set. The ITEM variable contains prices of individual items that I buy. You can resample from these data. I usually buy about 40 items each week, so I'll use this number to simulate a receipt.

The following SAS/IML statements simulate 5,000 receipts. Each row of the matrix y contains 40 randomly selected items (one simulated receipt), and there are 5,000 rows. The sums across the rows are the pre-tax totals for the receipts:

proc iml;
/** read prices of individual items **/
use prices; read all var {"Item"}; close prices;

SalesTax = 0.0775;   /** sales tax in Wake county, NC **/
numItems = 40;       /** number of items in each receipt **/
NumResamples = 5000; /** number of receipts to simulate **/
call randseed(4321);

/** LOAD or define the SampleReplaceUni module here. **/

/** resample from the original data. **/
y = sampleReplaceUni(Item`, NumResamples, NumItems);
pretax = y[,+];       /** row sums give the pre-tax cost of each receipt **/
total = round((1 + SalesTax) # pretax, 0.01); /** add sales tax **/
whole = (total=int(total)); /** indicator: whole-dollar receipts **/
Prop = whole[:];    /** proportion that are whole-dollar amounts **/
print NumItems NumResamples (sum(whole))[label="Num00"] 
       Prop[format=PERCENT7.2 label="Pct"];

numItems    NumResamples    Num00      Pct
      40            5000       52    1.04%

The simulation generated 52 whole-dollar receipts out of 5,000, or about 1%.

Estimating the Probability of a Whole-Dollar Receipt

The previous computation is the result of a single simulation. In technical terms, it is a point-estimate for the probability of obtaining a whole-dollar receipt. If you run another simulation, you will get a slightly different estimate. If you run, say, 200 simulations and plot all of the estimates, you obtain the sampling distribution for the estimate. For example, you might obtain something that looks like the following histogram:


The histogram shows the estimates for 200 simulations. The vertical dashed line in the histogram is the mean of the 200 estimates, which is about 1%. So, on average, I should expect my grocery bills to be a whole-dollar amount about 1% of the time. The histogram also indicates how the estimate varies over the 200 simulations.

The Bootstrap Distribution of Proportions

The technique that I've described is known as a resampling or bootstrap technique.

In the SAS/IML language, it is easy to wrap a loop around the previous simulation and to accumulate the result of each simulation. The mean of the accumulated results is a bootstrap estimate for the probability that my grocery receipt is a whole-dollar amount. You can use quantiles of the accumulated results to form a bootstrap confidence interval for the probability. For example:

NumRep = 200;          /** number of bootstrap replications **/
Prop = j(NumRep, 1);   /** allocate vector for results **/
do k = 1 to NumRep;
   /** resample from the original data **/
   ...
   /** proportion of bills that are whole-dollar amounts **/
   Prop[k] = whole[:]; 
end;

meanProp = Prop[:]; 
/** LOAD or define the Qntl module here **/
call Qntl(q, Prop, {0.05 0.95});
print numItems NumRep meanProp q[label="90% CI"];

numItems    NumRep    meanProp            90% CI
      40       200    0.010055    0.0081    0.0124

Notice that 90% of the simulations had proportions that are between 0.81% and 1.24%. I can use this as a 90% confidence interval for the true probability that my grocery receipt will be a whole-dollar amount.

You can download the complete SAS/IML Studio program that computes the estimates and creates the histogram.

Conclusions

My intuition was right: I showed that the chance of my grocery receipt being a whole-dollar amount is about one in a hundred. But, more importantly, I also showed that you can use SAS/IML software to resample from data, run simulations, and implement bootstrap methods.

What's next? Well, all this programming has made me hungry. I’m going to the grocery store!

November 12, 2010
 
The other day I was at the grocery store buying a week's worth of groceries. When the cashier, Kurt (not his real name), totaled my bill, he announced, "That'll be ninety-six dollars, even."

"Even?" I asked incredulously. "You mean no cents?"

"Yup," he replied. "It happens."

"Wow," I said, with a sly statistical grin appearing on my face, "I'll bet that only happens once in a hundred customers!"

Kurt shrugged, disinterested. As I left, I congratulated myself on my subtle humor, which clearly had not amused my cashier. "Get it?" I thought to myself, "One chance in a hundred? The possibilities are 00 through 99 cents."

But as I drove home, I began to wonder: maybe Kurt knows more about grocery bills than I do. I quickly calculated that if Kurt works eight-hour shifts, he probably processes about 100 transactions every day. Does he see one whole-dollar amount every shift, on average? I thought back to my weekly visits to the grocery store over the past two years. I didn't recall another whole-dollar amount.

So what is the probability that this event (a grocery bill that is exactly a multiple of a dollar) happens? Is it really a one-chance-in-a-hundred event, or is it more rare?

The Distribution of Prices for Grocery Items

As I started thinking about the problem, I became less confident that I knew the probability of this event. I tried to recall some theory about the distribution of a sum. "Hmmmm," I thought, "the distribution of the sum of N independent random variables is the convolution of their distributions, so if each item is uniformly distributed...."

I almost got into an accident when the next thought popped into my head: grocery prices are not uniformly distributed!

I rushed home and left the milk to spoil on the counter while I hunted for old grocery bills. I found three and set about analyzing the distribution of prices for items that I buy at my grocery store. (If you'd like to do your own analysis, you can download the SAS DATA step program.)

First, I used SAS/IML Studio to create a histogram of the last two digits (the cents) for items I bought. As expected, the distribution is not uniform. More than 20% of the items have 99 cents as part of the price. Almost 10% were "dollar specials."

Click to enlarge


Frequent values for the last two digits are shown in the following PROC FREQ output:

Last2    Frequency     Percent
------------------------------
 0.99          24       20.51 
 0.00          10        8.55 
 0.19           7        5.98 
 0.49           7        5.98 
 0.69           7        5.98 
 0.50           6        5.13 
 0.89           6        5.13 

The distribution of digits would be even more skewed except for the fact that I buy a lot of vegetables, which are priced by the pound.
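As a sketch of how the last two digits might be computed and tabulated, assuming the item prices are in a data set named Prices with a numeric variable Item (illustrative code, not the original program):

data Digits;
   set Prices;
   Last2 = mod(round(100*Item), 100) / 100;   /* cents portion of each price */
run;

proc freq data=Digits order=freq;
   tables Last2 / nocum;
run;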

Hurray for Sales Tax!

Next I wondered whether sales tax affects the chances that my grocery bill is a whole-dollar amount. A sales tax of S results in a total bill that is higher than the cost of the individual items:

Total = (1 + S) * (Cost of Items)

Sales tax is good—if your goal is to get a "magic number" (that is, a whole-dollar amount) on your total grocery bill. Why? Sales tax increases the chances of getting a magic number. Look at it this way: if there is no sales tax, then the total cost of my groceries is a whole-dollar amount when the total cost of the items is $1.00, $2.00, $3.00, and so on. There is exactly $1 between each magic number. However, in Wake County, NC, we have a 7.75% sales tax. My post-tax total will be a whole-dollar amount if the total pre-tax cost of my groceries is $0.93, $1.86, $2.78, $3.71, and so on. These pre-tax numbers are only 92 or 93 cents apart, and therefore happen more frequently than if there were no sales tax. With sales tax rates at a record high in the US, I wonder if other shoppers are seeing more whole-dollar grocery bills?
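As a quick check of those thresholds, the following sketch computes the pre-tax cost that yields each whole-dollar total at a 7.75% tax rate (the data set name is arbitrary):

data Thresholds;
   do Dollars = 1 to 5;
      PreTax = round(Dollars / 1.0775, 0.01);   /* e.g., 1.00/1.0775 = 0.93 */
      output;
   end;
run;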

This suggests that my chances might not be one in 100, but might be as high as one in 93—assuming that the last digits of my pre-tax costs are uniformly distributed. But are they? There is still the nagging fact that grocery items tend to be priced non-uniformly at values such as $1.99 and $1.49.

Simulating My Grocery Bill

There is a computational way to estimate the odds: I can resample from the data and simulate a bunch of grocery bills to estimate how often the post-tax bill is a whole-dollar amount. Since this post is getting a little long, I'll report the results of the simulation next week. If you are impatient, why not try it yourself?

Game-changing Analytics

October 23, 2010
 
The M2010 Data Mining Conference team has arrived in Las Vegas, and we're getting ready to host the 750+ analysts, statisticians, and database managers who will be descending on Sin City this weekend for our 13th annual conference.

Before we left, my colleague Carrie Vetter had a chance to sit down with two of our speakers, Michael Goul and Sule Balkan from Arizona State University. Listen to the podcast interview to learn about game-changing in-database analytics.