Just for Fun

November 12, 2010
 
The other day I was at the grocery store buying a week's worth of groceries. When the cashier, Kurt (not his real name), totaled my bill, he announced, "That'll be ninety-six dollars, even."

"Even?" I asked incredulously. "You mean no cents?"

"Yup," he replied. "It happens."

"Wow," I said, with a sly statistical grin appearing on my face, "I'll bet that only happens once in a hundred customers!"

Kurt shrugged, uninterested. As I left, I congratulated myself on my subtle humor, which clearly had not amused my cashier. "Get it?" I thought to myself, "One chance in a hundred? The possibilities are 00 through 99 cents."

But as I drove home, I began to wonder: maybe Kurt knows more about grocery bills than I do. I quickly calculated that if Kurt works eight-hour shifts, he probably processes about 100 transactions every day. Does he see one whole-dollar amount every shift, on average? I thought back to my weekly visits to the grocery store over the past two years. I didn't recall another whole-dollar amount.

So what is the probability that this event (a grocery bill that is exactly a multiple of a dollar) happens? Is it really a one-chance-in-a-hundred event, or is it more rare?

The Distribution of Prices for Grocery Items

As I started thinking about the problem, I became less confident that I knew the probability of this event. I tried to recall some theory about the distribution of a sum. "Hmmmm," I thought, "the distribution of the sum of N independent random variables is the convolution of their distributions, so if each item is uniformly distributed...."

I almost got into an accident when the next thought popped into my head: grocery prices are not uniformly distributed!

I rushed home and left the milk to spoil on the counter while I hunted for old grocery bills. I found three and set about analyzing the distribution of prices for items that I buy at my grocery store. (If you'd like to do your own analysis, you can download the SAS DATA step program.)

First, I used SAS/IML Studio to create a histogram of the last two digits (the cents) for items I bought. As expected, the distribution is not uniform. More than 20% of the items have 99 cents as part of the price. Almost 10% were "dollar specials."


Frequent values for the last two digits are shown in the following PROC FREQ output:

Last2    Frequency     Percent
------------------------------
 0.99          24       20.51 
 0.00          10        8.55 
 0.19           7        5.98 
 0.49           7        5.98 
 0.69           7        5.98 
 0.50           6        5.13 
 0.89           6        5.13 

The distribution of digits would be even more skewed except for the fact that I buy a lot of vegetables, which are priced by the pound.
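For readers without SAS, the same kind of tabulation can be sketched in a few lines of Python. This is only an illustration of the idea behind the PROC FREQ table: the function name and the price list are hypothetical, and the prices are not the receipts analyzed in this post.

```python
from collections import Counter

def cents_frequency(prices_cents):
    """Tabulate the last two digits (the cents) of each price, most common first."""
    counts = Counter(p % 100 for p in prices_cents)
    n = len(prices_cents)
    return [(cents, k, 100.0 * k / n) for cents, k in counts.most_common()]

# Hypothetical prices in cents, for illustration only (not the data from this post).
example_prices = [199, 299, 99, 100, 149, 250, 399]
for cents, k, pct in cents_frequency(example_prices):
    print(f"0.{cents:02d}  {k:3d}  {pct:6.2f}")
```

Working in integer cents avoids floating-point surprises when taking the remainder.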

Hurray for Sales Tax!

Next I wondered whether sales tax affects the chances that my grocery bill is a whole-dollar amount. A sales tax of S results in a total bill that is higher than the cost of the individual items:

Total = (1 + S) * (Cost of Items)

Sales tax is good, at least if your goal is to get a "magic number" (that is, a whole-dollar amount) on your total grocery bill. Why? Sales tax increases the chances of getting a magic number. Look at it this way: if there is no sales tax, then the total cost of my groceries is a whole-dollar amount when the total cost of the items is $1.00, $2.00, $3.00, and so on. There is exactly $1 between consecutive magic numbers. However, in Wake County, NC, we have a 7.75% sales tax. My post-tax total will be a whole-dollar amount if the total pre-tax cost of my groceries is $0.93, $1.86, $2.78, $3.71, and so on. These pre-tax numbers are only 92 or 93 cents apart, and therefore happen more frequently than if there were no sales tax. With sales tax rates at a record high in the US, I wonder if other shoppers are seeing more whole-dollar grocery bills?

This suggests that my chances might not be one in 100, but might be as high as one in 93—assuming that the last digits of my pre-tax costs are uniformly distributed. But are they? There is still the nagging fact that grocery items tend to be priced non-uniformly at values such as $1.99 and $1.49.
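A quick way to test the one-in-93 estimate is to enumerate pre-tax totals directly. The Python sketch below (Python rather than SAS, and not part of the original analysis) assumes that tax is applied to the bill total and rounded half-up to the nearest cent; the function names are made up for illustration.

```python
# Assumption: tax is applied to the bill total and rounded half-up to the nearest cent.
TAX_BASIS_POINTS = 775  # 7.75% sales tax, the Wake County rate quoted in the post

def post_tax_cents(pre_tax_cents, tax_basis_points=TAX_BASIS_POINTS):
    """Post-tax total in cents, using integer math to avoid float rounding."""
    return (pre_tax_cents * (10000 + tax_basis_points) + 5000) // 10000

def magic_pre_tax_cents(max_cents, tax_basis_points=TAX_BASIS_POINTS):
    """Pre-tax amounts (in cents) whose post-tax total is a whole-dollar amount."""
    return [c for c in range(1, max_cents + 1)
            if post_tax_cents(c, tax_basis_points) % 100 == 0]

magic = magic_pre_tax_cents(40000)        # every pre-tax amount up to $400.00
print([c / 100 for c in magic[:4]])       # the first few magic pre-tax amounts
print(len(magic), "of 40000")             # how many pre-tax amounts are magic

reachable = {post_tax_cents(c) for c in range(1, 40001)}
print(1300 in reachable)                  # can the total ever be exactly $13.00?
```

Under those assumptions the first magic amounts match the list above ($0.93, $1.86, $2.78, $3.71), but only 400 of the 40,000 pre-tax amounts up to $400.00 qualify, which is one in 100 rather than one in 93. The reason is that rounding skips some whole-dollar totals: $12.06 becomes $12.99 and $12.07 becomes $13.01, so a bill of exactly $13.00 never occurs.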

Simulating My Grocery Bill

There is a computational way to estimate the odds: I can resample from the data and simulate a bunch of grocery bills to estimate how often the post-tax bill is a whole-dollar amount. Since this post is getting a little long, I'll report the results of the simulation next week. If you are impatient, why not try it yourself?
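For the impatient, here is a minimal resampling sketch in Python (not the SAS simulation promised for next week). It assumes that tax is applied to the bill total and rounded half-up to the nearest cent, and that every bill has the same number of items. The function names, the item count, and the tiny price list are hypothetical placeholders, so the printed rate says nothing about real grocery bills.

```python
import random

def post_tax_cents(pre_tax_cents, tax_basis_points):
    """Apply sales tax and round half-up to the nearest cent (integer math)."""
    return (pre_tax_cents * (10000 + tax_basis_points) + 5000) // 10000

def whole_dollar_rate(prices_cents, items_per_bill, n_bills, tax_basis_points, seed=0):
    """Resample prices with replacement; return the fraction of whole-dollar bills."""
    rng = random.Random(seed)              # seeded so that runs are repeatable
    hits = 0
    for _ in range(n_bills):
        bill = sum(rng.choices(prices_cents, k=items_per_bill))
        if post_tax_cents(bill, tax_basis_points) % 100 == 0:
            hits += 1
    return hits / n_bills

# Hypothetical prices in cents, for illustration only; substitute prices from real receipts.
example_prices = [99, 199, 249, 100, 369, 50, 189, 299]
rate = whole_dollar_rate(example_prices, items_per_bill=20, n_bills=20000,
                         tax_basis_points=775)
print(f"estimated chance of a whole-dollar bill: {rate:.4f}")
```

Sampling with replacement from observed prices keeps the non-uniform cents (all those 99s) in the simulated bills, which is the point of resampling instead of assuming a uniform distribution.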

Statistics Can Save You Money: Estimates, Areas, and Arithmetic Means

Just for Fun
November 4, 2010
 
This post is about an estimate, but not the statistical kind. It also provides yet another example in which the arithmetic mean is not the appropriate measure for a computation.

First, some background.

Last week I read a blog post by Peter Flom that reminded me that it is wrong to use an arithmetic mean to average rates. Later that day I met a man at my house to get an estimate for some work on my roof. The price of the work depends on the area of my roof. My roof has a different pitch on the front than on the back, so the calculation of area requires some trigonometry and high-school algebra. The main calculation involves finding r1 and r2 in the following diagram, which represents a cross section of my roof:


The man who was writing the estimate took a few measurements in my attic and was downstairs with his calculation in only a few minutes.

"The front of your house has a 12/12 slope," he said, using the roofer's convention of specifying a slope by giving the ratio of the number of inches that the roof rises for every twelve inches of horizontal change. "Your back roof has a 6/12 slope," he continued. " The average slope is therefore 9/12, but because I want your business I'm only going to charge you for an average slope of 8/12. I'm saving you money."

Huh? Average slopes? Peter Flom's post came to mind.

After the man left, I sharpened my pencil and put on my thinking cap. You can compute the area of a roof by knowing the square footage of the attic and multiplying that value by a factor that is related to the pitch of the roof. A steeper roof means a bigger multiplication factor, and consequently costs more money. Less pitch means less money.

Hmmm, wasn't the roofing man claiming that the correct computation is to average the slopes of my roof? Isn't a slope a rate?

Using the diagram, I determined the diagonal distances r1 and r2 in terms of the slopes of my roof. The following SAS/IML module summarizes the computations:

proc iml;
/** multiplier for area of a roof, computed from two roof pitches **/
start FindRoofAreaMultiplier(pitch);
   s1 = pitch[1];     /** slope in front of house **/
   s2 = pitch[2];     /** slope in back of house **/
   c = s2/(s1+s2);    /** roof peak occurs at this proportion **/
   h = s1*s2/(s1+s2); /** roof height **/

   /** distances from gutters to peak of roof **/
   r1 = sqrt(c##2 + h##2);     /** along the front **/
   r2 = sqrt((1-c)##2 + h##2); /** along the back  **/
   return(r1+r2);
finish;

/** the two pitches of my roof **/
pitch = 12/12 || 6/12;
ExactMultiplier = FindRoofAreaMultiplier(pitch);
print ExactMultiplier;

ExactMultiplier

1.2167605
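The quantities c and h in the module can be recovered from the geometry. The derivation below is my reading of the code, not something the post states: it assumes the attic width is scaled to 1 and the peak sits a horizontal distance c from the front gutter, so the same height h is reached by rising at slope s_1 over c and at slope s_2 over 1 - c.

```latex
\begin{aligned}
h &= s_1 c = s_2 (1 - c)
  \quad\Longrightarrow\quad
  c = \frac{s_2}{s_1 + s_2}, \qquad h = \frac{s_1 s_2}{s_1 + s_2}, \\
r_1 &= \sqrt{c^2 + h^2}, \qquad r_2 = \sqrt{(1 - c)^2 + h^2}, \\
\text{multiplier} &= r_1 + r_2 .
\end{aligned}
```

For s_1 = 12/12 and s_2 = 6/12 this gives c = h = 1/3, so r_1 = \sqrt{2}/3 ≈ 0.4714 and r_2 = \sqrt{5}/3 ≈ 0.7454, which sum to the printed value of about 1.2168.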

So, for my roof, the area of the roof is 1.217 times the area of the attic. What happens if I use the average pitch to compute the multiplier?

/** average the slopes and use average in calculations **/
s = pitch[:];
pitch = s || s;
AveMultiplier = FindRoofAreaMultiplier(pitch);
print AveMultiplier;

AveMultiplier

1.25

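As I had suspected, using the average of the roof pitches to compute the multiplier gives an incorrect answer. This is obvious if you do the following thought experiment. Imagine that the slope of the front roof gets steeper and steeper while you hold the slope of the back roof constant. In this manner, you can make the average slope, and hence the multiplier computed from it, as large as you want. The true multiplier, however, stays bounded: as the front roof approaches vertical, the peak moves toward the front gutter and the roof approaches a fixed shape determined by the back slope.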
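The thought experiment can be checked numerically. The following is a Python port of the FindRoofAreaMultiplier module (a sketch for illustration, not the author's code; the Python function name is my own), evaluated for ever steeper front slopes while the back slope stays at 6/12.

```python
import math

def roof_area_multiplier(s1, s2):
    """Python port of FindRoofAreaMultiplier: roof area divided by attic area."""
    c = s2 / (s1 + s2)          # peak position as a fraction of the attic width
    h = s1 * s2 / (s1 + s2)     # peak height for a unit-width attic
    return math.hypot(c, h) + math.hypot(1 - c, h)

back = 6 / 12
for front in [12 / 12, 10.0, 100.0, 1000.0]:
    avg = (front + back) / 2
    exact = roof_area_multiplier(front, back)       # uses both slopes
    averaged = roof_area_multiplier(avg, avg)       # uses the average slope twice
    print(f"front={front:8.1f}  exact={exact:.4f}  averaged={averaged:.4f}")
```

The exact multiplier never exceeds s2 + sqrt(1 + s2^2), which is about 1.618 for a 6/12 back slope, while the multiplier computed from the average slope grows without limit.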

In conclusion, it is not valid to use the average slope of the two roofs to estimate the area multiplier when the slopes are substantially different.

So, did the roofing man cheat me? Not really. The following computation shows that the slope of 8/12 that he used in the written estimate yields a multiplier that is slightly less than the true multiplier. The difference is about 1%, and it is in my favor:

/** pitch used in written estimate **/
s = 8/12;
pitch = s || s;
WrittenMultiplier = FindRoofAreaMultiplier(pitch);
print WrittenMultiplier;

WrittenMultiplier

1.2018504

Maybe the roofing man tells people "I'm doing you a favor" to get more business. Or, maybe he made a mistake. Or maybe he is so experienced that he just intuitively knew the correct multiplier.

In any case, statisticians know that the arithmetic mean shouldn't be used indiscriminately, and this story provides yet another example in which using the arithmetic mean leads to a wrong answer.

Solving Scrambled-Word Puzzles

Just for Fun, Statistical Programming
October 15, 2010
 
Have you ever been stuck while trying to solve a scrambled-word puzzle? You stare and stare at the letters, but no word reveals itself? You are stumped. Stymied.

I hope you didn't get stumped on the word puzzle I posted as an anniversary present for my wife. She breezed through it.

In a previous post, I showed that you can scramble and unscramble words by using permutations. I also showed that you can generate the set of all permutations on N elements. By putting these techniques together, you can write a SAS/IML program that solves scrambled-word puzzles.

Here's what you need:

A Dictionary of Words

In this internet age, it is easy to obtain a file that contains a dictionary of words. Some word lists are small. For example, the "Unix dictionary" contains just over 25 thousand entries. Others are quite a bit larger. I like the official Scrabble™ player's dictionary (OSPD) because it limits itself to words with eight or fewer letters.

Download your favorite wordlist and name the file "wordlist.txt." I used the following DATA step code to convert the word list to a SAS data set:

filename wordlist 'C:\Documents and Settings\...\wordlist.txt';
data Dictionary;
   /** keep first 8 chars unless you need longer words **/
   length word $8.; 
   infile wordlist;
   input word $;
run;

A SAS/IML Program to Unscramble Words

Suppose you have a list of 5- and 6-letter scrambled words that you want to unscramble. For this example, I will use the following set of scrambled statistical words:

proc iml;
target = upcase({NAGER, RRREO, ORRCDE, LROANM});
print target;

target

NAGER
RRREO
ORRCDE
LROANM

You can read the dictionary and the permutations into SAS/IML matrices:

/** read word list; convert to uppercase **/
use Dictionary; read all var {Word}; close Dictionary;
Word = upcase(Word);

/** read permutations for 5- and 6-letter words **/
use perm5; read all var _NUM_ into p5; close perm5;
use perm6; read all var _NUM_ into p6; close perm6;

The following algorithm unscrambles the words:

free solution;
do i = 1 to nrow(target); /** for each word **/
   w = strip(target[i]);
   if length(w)=5 then p = p5; 
   else p = p6; /** use permutation for N elements **/
   idx = loc( length(Word) = length(w) );
   list = Word[idx]; /** limit word list **/

   foundWord = 0;  /** how many permutations result in a word? **/ 
   do n = 1 to nrow(p) while (foundWord<=3); /** for each permutation **/
      perm = p[n, ];  
      s = PermuteChars(w, perm); /** apply permutation **/
      idx = loc(list = s);       /** check if this is a word **/
      if ncol(idx)>0 then do;    /** remember solution **/
         foundWord = foundWord + 1;
         solution = solution // (s + " is a solution for " + w); 
      end;
   end;
end;
print solution;

solution

REGNA is a solution for NAGER
RANGE is a solution for NAGER
ANGER is a solution for NAGER
ERROR is a solution for RRREO
ERROR is a solution for RRREO
ERROR is a solution for RRREO
ERROR is a solution for RRREO
CORDER is a solution for ORRCDE
CORDER is a solution for ORRCDE
RECORD is a solution for ORRCDE
RECORD is a solution for ORRCDE
NORMAL is a solution for LROANM

You can see that people who create scrambled-word puzzles need to be careful that there are not two or more permutations that each lead to valid words. For the letters NAGER, both RANGE and ANGER are common words that solve the puzzle. (I'd exclude REGNA.) Also notice that words with repeated letters, such as RRREO, are easier to solve in the sense that there are multiple permutations that unscramble the letters.
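The repeated ERROR lines appear because the three R's are interchangeable: 3! = 6 permutations of RRREO spell ERROR, and the loop above stops after the fourth match. A minimal Python sketch of the same permutation search (not the SAS/IML program, and using a hypothetical mini-dictionary in place of wordlist.txt) can collect the matches in a set so that each word is reported once:

```python
from itertools import permutations

def unscramble(scrambled, words):
    """Return the distinct dictionary words that are permutations of `scrambled`."""
    candidates = {w for w in words if len(w) == len(scrambled)}  # limit word list
    found = set()
    for perm in permutations(scrambled):    # every ordering of the letters
        s = "".join(perm)
        if s in candidates:
            found.add(s)                    # a set keeps each solution once
    return sorted(found)

# Hypothetical mini-dictionary for illustration; a real run would load a full word list.
words = {"ANGER", "RANGE", "REGNA", "ERROR", "RECORD", "CORDER", "NORMAL"}
for target in ["NAGER", "RRREO", "ORRCDE", "LROANM"]:
    print(target, "->", unscramble(target, words))
```

Trying every ordering costs N! checks for an N-letter word, which is fine for the 5- and 6-letter words here (120 and 720 permutations).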

So next time you are stuck, just fire up this SAS program to solve scrambled-word puzzles.