Just for Fun

June 13, 2011
 
My primary purpose in writing The DO Loop blog is to share what I know about statistical programming in general and about SAS programming in particular. But I also write the blog for various personal reasons, including the enjoyment of writing.

The other day I encountered a concept on Ajay Ohri's Decision Stats blog that made me reflect on what I, personally, gain from blogging. The concept is the Johari window, which is a self-assessment technique in cognitive psychology. The following image is taken from Peter Dorrington's article about "unknown unknowns and risk":


By looking at the Johari window, I realized that blogging helps me to become more aware of what I know and what I don't know. The Johari window framework also summarizes the process that I use to write this blog. Long-time readers know that I post three articles a week according to the following schedule:

  • On Mondays, I post Getting Started articles. These articles correspond to the upper right quadrant of my Johari window. They represent topics that I know that I know. I exploit this knowledge to get out a quick article that requires minimal effort.
  • On Wednesdays, I post articles on a variety of topics in statistical programming such as sampling and simulation and efficient SAS programming. These articles often correspond to the lower left quadrant of my Johari window. They represent topics that I am trying to learn. Usually, I am not an expert on these topics, so I risk making a fool of myself. However, blogging gives me an opportunity to share what little I know and it motivates me to get it right. I often experiment with several approaches before I feature one in my blog.
  • On Fridays, I like to post articles about data analysis. These articles correspond to the upper left quadrant of my Johari window. They are often inspired by reading other blogs or by having a robust curiosity about topics such as "What Topics Appear in The Far Side Cartoons?" Even after I explore the data and blog about it, I am aware that there is more that could be said.

What about the lower right quadrant? That comes into play when I am searching for something to blog about. In the past 15 years, I've written and saved thousands of SAS and SAS/IML programs, demos, examples, test programs, presentations, and papers. These are scattered in dozens of directories on my computer. Sometimes I'll stumble upon a program I wrote ten years ago and think, "That's pretty clever; I should write a blog about this!" My challenge is to find these gems that I have forgotten about—to rediscover and expose what I once knew.

My goal is to become the best statistical programmer and data analyst that I can be, and to help other SAS programmers do the same. Blogging helps by making me keenly aware of what I know and what I don't know.

April 13, 2011
 
I have recently returned from five days at SAS Global Forum in Las Vegas. On the way there, I finally had time to read a classic statistical paper: Bayer and Diaconis (1992), which describes how many shuffles are needed to randomize a deck of cards. Their famous result that it takes seven shuffles to randomize a 52-card deck is known as "the bane of bridge players" because it motivated many bridge clubs to switch from hand shuffling to computer-generated shuffling. Casual bridge players also blame this result for "slowing down the game" while the cards are shuffled more times than seems intuitively necessary.

[Image: A riffle shuffle. Photo by Todd Klassy (Flickr), CC BY 2.0.]

In the second paragraph of the paper, Bayer and Diaconis introduce a "mathematically precise model of shuffling," which is known as the Gilbert-Shannon-Reeds (GSR) model. This model is known to be a "good description of the way real people shuffle real cards." (See Diaconis (1988) and the references at the end of this article.) This article describes how to implement the GSR shuffling model in SAS/IML software.

The Riffle Shuffle

Computationally, you can shuffle a deck by generating a permutation of the set 1:n, but that is not how real cards are shuffled.
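For comparison, here is a minimal sketch of that "perfect" computational shuffle in SAS/IML: rank n random uniform numbers to obtain a random permutation of 1:n. (Some releases also provide a RANPERM function that generates random permutations directly.)

proc iml;
call randseed(1234);
n = 52;
deck = T(1:n);
u = j(n, 1);
call randgen(u, "Uniform");   /** n independent uniform numbers **/
call sortndx(ndx, u, 1);      /** ranking the uniforms gives a random permutation of 1:n **/
shuffled = deck[ndx];
print (shuffled[1:10])[label="First 10 cards of a randomly permuted deck"];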

The riffle (or "dovetail") shuffle is the most common shuffling algorithm. A deck of n cards is split into two parts and the two stacks are interleaved. The GSR algorithm simulates this physical process.

The GSR model splits the deck into two pieces according to the binomial distribution. Each piece has roughly n/2 cards. Then cards are dropped from the two stacks according to the number of cards remaining in each stack. Specifically, if there are NL cards in the left stack and NR cards in the right stack, then the probability of the next card dropping from the right stack is NR / (NR + NL).

The following SAS/IML module is a straightforward implementation of the GSR algorithm:

proc iml;
/** Gilbert-Shannon-Reeds statistical 
    model of a riffle shuffle. Described in 
    Bayer and Diaconis (1992) **/
start GSRShuffle(deck);
   n = nrow(deck);
   /** cut into two stacks **/
   nL = rand("Binomial", 0.5, n);
   nR = n - nL;

   /** left stack, right stack **/
   L = deck[1:nL]; R = deck[nL+1:n];
   j = 1; k = 1; /** card counters **/
   shuffle = j(n,1); /** allocate **/
   do i = 1 to n;
      c = rand("Bernoulli", nR/(nL+nR)); 
      if c=0 then do;  /** drop from left **/
        shuffle[i] = L[j];
        nL = nL-1;  j=j+1; /** update left **/
      end;
      else do;  /** drop from right **/
        shuffle[i] = R[k];
        nR = nR-1;  k=k+1; /** update right **/
      end;
   end;
   return(shuffle);
finish;

Testing the GSR Algorithm

You can test the algorithm by starting with a deck of cards in a known order and observing how the cards are mixed by consecutive riffle shuffles. The following statements riffle the cards seven times and store the results of each shuffle. To save space, a 20-card deck is used. The original order of the cards is denoted 1, 2, 3, ..., 20.

call randseed(1234);
n = 20; /** n=52 for a standard deck **/
deck = T(1:n);

s = j(n, 8); /** allocate for 8 decks **/
s[,1] = deck; /** original order **/
do i = 1 to 7;
   s[,i+1] = GSRShuffle( s[,i] );
end;
names = "s0":"s7";
print s[colname=names];


                  s
    s0  s1  s2  s3  s4  s5  s6  s7

     1   1   1   1   1   1  18   2
     2   2  14  14  14  11   1  13
     3  12   2   2   4  14  11  12
     4   3   6   6   2   4  14  18
     5  13  12   3  12   2  19   1
     6   4  15  16  15  12   7  11
     7   5   7  13  17  15   3  14
     8  14   8   4  10  17   4  15
     9   6   9  12   6  10  16  19
    10  15   3  15  18   6   2   8
    11   7  16  17   7  18  13   7
    12   8  13  10   3  19  12   3
    13   9   4  18  16   7  15   9
    14  16  17   7  13   3   8   4
    15  17  10   8   8  16   9  17
    16  10  18   5   5  13  17  20
    17  18   5  11  11   8  20  16
    18  11  11  19  19   9  10  10
    19  19  19   9   9  20   5   5
    20  20  20  20  20   5   6   6

It is interesting to study the evolution of mixing the cards:

  • For the first shuffle, the original deck is cut into two stacks (1:11 and 12:20) and riffled to form the second column. Notice that usually one or two cards are dropped from each stack, although at one point three cards (7, 8, 9) are dropped from the left stack.
  • The second cut occurs at card 14, which is the eighth card in the second column. After the second riffle, notice that the cards 7, 8, and 9 are still consecutive, and the first and last cards are still in their original locations. Six cards (30%) are still in their initial positions in the deck.
  • The pair 7 and 8 is not separated until the fourth shuffle.
  • The last card (20) does not move from the bottom of the deck until the fifth shuffle.
  • The first card (1) does not move from the top of the deck until the sixth shuffle.

The Efficiency of the GSR Algorithm

As far as efficiency goes, the GSRShuffle module that I've implemented here is not very efficient. As I've said before, the SAS/IML language is a vector language, so statements that operate on a few long vectors run much faster than equivalent statements that involve many scalar quantities.

This implementation of the shuffling algorithm is not vectorized. Unfortunately, because the probability of a card dropping from the left stack changes at every iteration, there is no way to call the RANDGEN function once and have it return all n numbers required to simulate a single riffle shuffle.

Or is there? Perhaps there is an equivalent algorithm that can be vectorized? Next week I'll present a more efficient version of the GSR algorithm that does not require an explicit loop over the number of cards in the deck.
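In the meantime, here is one possible vectorized sketch of my own (not necessarily the approach in that follow-up article). It relies on an equivalent description of the GSR model: cut the deck binomially, then place the left-stack cards into a uniformly random set of positions while preserving the relative order within each stack.

/** one possible vectorization (continuing the PROC IML session above) **/
start GSRShuffleVec(deck);
   n = nrow(deck);
   nL = rand("Binomial", 0.5, n);        /** cut the deck, as before **/
   if nL=0 | nL=n then return(deck);     /** degenerate cut: nothing to interleave **/

   u = j(n,1);  call randgen(u, "Uniform");
   call sortndx(ndx, u, 1);              /** ndx is a random permutation of 1:n **/
   leftPos  = ndx[1:nL];    call sort(leftPos, 1);  /** random positions for the left stack **/
   rightPos = ndx[nL+1:n];  call sort(rightPos, 1); /** remaining positions for the right stack **/

   shuffle = j(n,1);
   shuffle[leftPos]  = deck[1:nL];       /** left-stack cards keep their relative order **/
   shuffle[rightPos] = deck[nL+1:n];     /** so do the right-stack cards **/
   return(shuffle);
finish;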

References

D. Bayer and P. Diaconis (1992), "Trailing the Dovetail Shuffle to Its Lair," Annals of Applied Probability, 2(2), 294-313.
P. Diaconis (1988), Group Representations in Probability and Statistics, IMS, Hayward, CA.
E. Gilbert (1955), "Theory of Shuffling," Technical memorandum, Bell Laboratories.
J. Reeds (1981), Unpublished manuscript.
April 8, 2011
 
At the beginning of 2011, I heard about the Dow Piano, which was created by CNNMoney.com. The Dow Piano visualizes the performance of the Dow Jones industrial average in 2010 with a line plot, but also adds an auditory component. As Bård Edlund, Art Director at CNNMoney.com, said,
The daily trading of the Dow Jones industrial average determines the songwriting here, translating the ups and downs of 2010's market into musical notes. Using a five-note scale spanning three octaves, pitch is determined by each day's closing level.

When I first saw the Dow Piano, I immediately thought, "I can do that in SAS/IML Studio by using the SOUND call in the SAS/IML language!"

To be fair, I can't fully duplicate the Dow Piano in SAS/IML software. The SOUND call in SAS is very simple: it only provides pitch and duration. The music for the Dow Piano is more complex:

  • It uses piano notes, which involves sound characteristics such as attack and decay.
  • It uses dynamics to represent trading volume: high-volume days are represented by loud notes, low-volume days by soft notes.
  • It overlays a percussion track so that you hear a percussive beat on days that the stock market is closed.
Nevertheless, I think the SAS/IML version captures the main idea by allowing you to listen to the performance of the Dow Jones industrial average. (Get it? The performance?)

Will this kind of visualization (auditorization? sonification?) catch on? Time will tell, but for these data, the sound doesn't add any new information; it merely uses the sense of sound to repeat information that is already in the line plot. In fact, the sound presents less information than the line plot: there are more than 250 closing values of the Dow Jones industrial average, but the Dow Piano collapses these values into 15 notes (three octaves of a five-note scale).
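To make the mapping concrete, here is a minimal sketch of this kind of sonification in SAS/IML. The closing values are made up for illustration (they are not the 2010 Dow data), and the sketch assumes that the SOUND call accepts a frequency in hertz and a duration in seconds:

proc iml;
/** hypothetical closing values; the real Dow Piano uses all of 2010 **/
close = {10583 10617 10430 10680 10710 10566 10402 10699};

/** a five-note (pentatonic) scale spanning three octaves: 15 pitches **/
scale = {0 2 4 7 9};                          /** semitone offsets within one octave **/
semitones = scale || (scale+12) || (scale+24);
freqs = 220 # 2##(semitones/12);              /** frequencies, starting at A3 = 220 Hz **/

/** map each closing level to one of the 15 pitches **/
minC = min(close);  maxC = max(close);
bin = 1 + floor( 14.999 # (close - minC) / (maxC - minC) );

do i = 1 to ncol(close);
   f = freqs[bin[i]];                         /** pitch determined by the closing level **/
   call sound(f, 0.2);                        /** fixed duration for each note **/
end;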

But enough analysis! Listen to the Way of the Dow. The demo starts at 0:45, after I introduce the problem.

April 1, 2011
 
Today, SAS, the leader in business analytics, announces significant changes to two popular SAS blogs, The DO Loop (written by Rick Wicklin) and The SAS Dummy (previously written by Chris Hemedinger).

The two blogs will be merged into a single blog, called The SAS Smarty: A Blog for an Elite Few. The blog, which will be written by Wicklin, will debut next week.

Ima Phül, director of SAS Brand Preservation, says the change is needed.

"Like many companies, SAS continuously monitors the way its brand is perceived by customers," says Phül. "By using SAS Social Media Analytics and SAS Web Analytics, my team was able to identify several problems in the way that the SAS brand is perceived by customers."

Particularly troubling was the high correlation between the terms "SAS" and "dummy," and the terms "SAS" and "dogfood".

"Thanks to SAS Social Media Analytics, Chris's blog was determined to be the root cause of these undesireable relationships," continues Phül.

Wicklin says that re-aligning the SAS Dummy blog with corporate directions will not be hard. "Expect less humor and more math," he says. "I've got some great articles planned on computing multi-dimensional integrals and using orthogonal regression polynomials."

The archives of the SAS Dummy will continue to be hosted on blogs.sas.com, but the title of the discontinued blog will be changed to an unpronounceable symbol:


Alison Bolin, Editor of Blogs and Social Content at SAS, thanks Chris for his years of blogging. "Chris was instrumental in the success of our SAS blogging efforts. We are happy to announce that Chris will be leading a new R&D effort from our Siberian office, and we wish him well in his new endeavors."

These changes are effective as of today, April 1, 2011, also known as April Fools' Day.

February 14, 2011
 
If you tell my wife that she's married to a statistical geek, she'll nod knowingly. She is used to hearing sweet words of affection such as
You are more beautiful than Euler's identity.
or
My love for you is like the exponential function: increasing, unbounded, and transcendental.
But those are ordinary, everyday sentiments. For Valentine's Day, I want to do something really special, such as formulating a parametric expression whose image is heart-shaped. If you haven't gotten anything sweet for your special someone, you too can use SAS/IML Studio to create the following image!

Modify the program to create your own personal message. My personal message to my wife is "I love you."




/** Valentine's Day heart. Rick Wicklin: blogs.sas.com/iml **/
/** Parametrize in polar coordinates h(t) = (r(t), theta(t)) **/
Pi = constant("Pi");
t = do(0, 2*Pi, 0.01*Pi/4)`;
r = 2 - 2*sin(t) + sin(t)#sqrt(abs(cos(t))) / (sin(t)+1.4);
/** Convert to Euclidean coordinates for plotting **/
x = r#cos(t);
y = r#sin(t);

/** Use SAS/IML Studio to produce the image **/
declare DataObject dobj;
dobj = DataObject.Create("Valentine", {"x" "y"}, x||y );
dobj.AddVar("const", j(nrow(t),1));
dobj.SetMarkerFillColor(OBS_ALL, RED);

declare PolygonPlot p;
p = PolygonPlot.Create(dobj, "x", "y", "const", true);
p.SetGraphAreaMargins(0,0,0,0);
p.SetAspectRatio(1);
p.ShowAxes(false);
p.SetWindowPosition(50,0,32,50);
p.DrawUseDataCoordinates();
p.DrawText(0,-4.1,
"r = 2 - 2*sin(t) + sin(t)*sqrt(|cos(t)|) / (sin(t)+1.4)");
p.DrawSetTextSize(30);
p.DrawText(0,-0.75,"Happy\nValentine's\nDay!");
February 10, 2011
 
I enjoyed the Dataists' data-driven blog on the best numbers to choose in a Super Bowl betting pool. It reminded me of my recent investigation of which initials are most common.

Because the Dataists' blog featured an R function that converts Arabic numerals into Roman numerals, the blog post also reminded me that SAS has a built-in format that does the same thing. The ROMANw. format makes it easy to convert everyday whole numbers into their Roman equivalents.

For example, here are a few numbers and their Roman counterparts:

data RomanNumerals;
input x @@;
n = x;
format n ROMAN8.;
datalines;
1 2 3 4 5 9 30 40 45 50 1999 2011
;
run;
proc print data=RomanNumerals noobs; run;

   x          n

   1          I
   2          II
   3          III
   4          IV
   5          V
   9          IX
  30          XXX
  40          XL
  45          XLV
  50          L
1999          MCMXCIX
2011          MMXI

I've always wondered why SAS has this obscure format, but now I know of at least one useful application: using it to troll through Wikipedia to extract box scores from Super Bowl I through XLV.

January 25, 2011
 
Have you ever wanted to compute the exact value of a really big number such as 200! = 200*199*...*2*1? You can do it—if you're willing to put forth some programming effort. This blog post shows you how.

Jiangtang Hu's recent blog discusses his quest to compute large factorials in many programming languages. As he discovered, many languages that use standard double precision arithmetic cannot compute n! for large n, whereas other software, such as Mathematica, succeeds.

The reason is simple: 171! is greater than 1.79769e+308, which on many computers is the largest floating point number that can be represented in eight bytes. For example, the following statements print the maximum double precision value on my Windows PC:

proc iml;
MaxBig = constant("big");
print MaxBig;

MaxBig

1.798E308

Because of the limitation of double precision arithmetic, floating point computations can result in a run-time error. For example, the following factorial computation overflows in SAS software:

f = fact(200);  /** numerical overflow **/

Of course, there are ways around this limitation. The reason that software such as Maple and Mathematica succeed is that they implement special algorithms that enable them to compute results to arbitrary precision.

Computing something to "arbitrary precision" sounds like magic, but it's not. For example, I can use the SAS/IML language (or the DATA step, or C, or many other languages) to write an algorithm that computes the factorial of fairly large numbers. The following algorithm is naive and inefficient, but it shows how you can implement an arbitrary precision algorithm by storing the result in a numerical array instead of in a double precision variable.

An Algorithm for Computing the Factorial of Large Numbers

Again, I emphasize that this algorithm is for illustration purposes and is not intended for "real" work. Given an integer, n, here's how you can compute the exact value of f = n!. The key is to store each digit of the result in a numerical array.

  1. Compute the number of digits of f = n!. The number of digits is related to the logarithm (base 10) of f, so you can use the LFACT function. The LFACT function returns the natural logarithm of f, but you can convert that number to the base-10 logarithm.
  2. Allocate an array with the correct number of digits. The ith element of this array will hold the ith digit of the answer. In other words, if the ith digit is d, that digit represents the value d x 10^(i-1).
  3. Initialize the result to 1.
  4. For each integer 2, 3, ..., n, multiply the current result by the next factor. This is done place-value-by-place-value, and any amount of 10 or more in a place is carried over to the next place value.
  5. After all the multiplications, the array f has the value of n!. The numbers in this array need to be reversed, because, for example, the array {0 2 1} represents the number 120.

start BigFactorial(n);
 numDigits = ceil(lfact(n)/log(10)); /** 1 **/
 f = j(1,numDigits,0);               /** 2 **/
 f[1] = 1;                           /** 3 **/
 do i = 2 to n;                      /** 4 **/
   carry = 0;
   digits = ceil(lfact(i)/log(10)); /** how many digits in the answer so far? **/
   do j = 1 to digits ;             /** multiply the product by next factor **/
     g = i*f[j] + carry;            /** j_th place value **/
     digit = mod(g, 10); 
     f[j] = digit;                  /** put value in the j_th place **/
     carry = (g - digit)/10;        /** carry the rest to the (j+1)_th place **/
   end;
 end;
 return( f[,ncol(f):1] );           /** 5. reverse the digits **/
finish;

Can this module really compute the factorial of large numbers? Yep. You can compute 200! and see that the exact answer contains 375 digits. The following statements compute the factorial and print the result:

f = BigFactorial(200);  /** compute 200! **/
/** convert answer from a numerical array to a character string **/
Fact200 = rowcat(char(f,1)); 

numDigits200 = nleng(Fact200); /** how many digits in the answer? **/
print numDigits200;
print Fact200;

numDigits200

375

                                     Fact200

     788657867364790503552363213932185062295135977687173263294742533244359449963
     403342920304284011984623904177212138919638830257642790242637105061926624952
     829931113462857270763317237396988943922445621451664240254033291864131227428
     294853277524242407573903240321257405579568660226031904170324062351700858796
     178922222789623703897374720000000000000000000000000000000000000000000000000

Obviously, I have entirely too much time on my hands, but does this example demonstrate anything else? It demonstrates that programming with arrays enables you to implement a wide variety of algorithms. Perhaps the example also demystifies software that evaluates expressions to "arbitrary precision."

Mostly, though, I did it because I thought it would be fun. I was right. And next time I'm at a party, I might casually sidle next to someone and mention that 1,000! has 2,568 digits. I'll let you know their response.
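By the way, you can verify that digit count without forming the full product; the LFACT trick from step 1 of the algorithm suffices (continuing the PROC IML session):

numDigits1000 = ceil( lfact(1000)/log(10) );  /** number of decimal digits in 1000! **/
print numDigits1000;                          /** 2568 **/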

November 19, 2010
 
In a previous post, I used statistical data analysis to estimate the probability that my grocery bill is a whole-dollar amount such as $86.00 or $103.00. I used three weeks' grocery receipts to show that the last two digits of prices on items that I buy are not uniformly distributed. (The prices tend to be values such as $1.49 or $1.99, which appear lower than they really are.) I also showed that sales tax increases the chance of getting a whole-dollar receipt.

In this post, I show how you can use resampling techniques in SAS/IML software to simulate thousands of receipts. You can then estimate the probability that a receipt is a whole-dollar amount, and use bootstrap ideas to examine variation in the estimate.

Simulating Thousands of Receipts

The SAS/IML language is ideal for sampling and simulation in the SAS environment. I previously introduced a SAS/IML module called SampleReplaceUni, which performs sampling with replacement from a set. You can use this module to resample from the data.
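The module definition itself is in that earlier post; paste it (or a LOAD statement) at the indicated comment in the program below. If you don't have it handy, the following minimal stand-in (my own sketch, not the original module) samples with replacement by generating uniform random indices, and it accepts the same arguments as the call below: the data, the number of samples, and the sample size.

start SampleReplaceUni(x, nSamples, sampleSize);
   nElem = nrow(x)*ncol(x);               /** number of elements in x **/
   idx = j(nSamples, sampleSize);
   call randgen(idx, "Uniform");          /** uniform numbers in (0,1) **/
   idx = ceil( nElem # idx );             /** random indices into x, with replacement **/
   return( shape(x[idx], nSamples, sampleSize) );
finish;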

First run a SAS DATA step program to create the WORK.PRICES data set. The ITEM variable contains prices of individual items that I buy. You can resample from these data. I usually buy about 40 items each week, so I'll use this number to simulate a receipt.

The following SAS/IML statements simulate 5,000 receipts. Each row of the matrix y contains 40 randomly selected items (one simulated receipt), and there are 5,000 rows. The sum across each row is the pre-tax total for that receipt:

proc iml;
/** read prices of individual items **/
use prices; read all var {"Item"}; close prices;

SalesTax = 0.0775;   /** sales tax in Wake county, NC **/
numItems = 40;       /** number of items in each receipt **/
NumResamples = 5000; /** number of receipts to simulate **/
call randseed(4321);

/** LOAD or define the SampleReplaceUni module here. **/

/** resample from the original data. **/
y = sampleReplaceUni(Item`, NumResamples, NumItems);
pretax = y[,+];       /** row sums give the pre-tax cost of each receipt **/
total = round((1 + SalesTax) # pretax, 0.01); /** add sales tax **/
whole = (total=int(total)); /** indicator: whole-dollar receipts **/
Prop = whole[:];    /** proportion that are whole-dollar amounts **/
print NumItems NumResamples (sum(whole))[label="Num00"] 
       Prop[format=PERCENT7.2 label="Pct"];

numItems    NumResamples    Num00       Pct

      40            5000       52     1.04%

The simulation generated 52 whole-dollar receipts out of 5,000, or about 1%.

Estimating the Probability of a Whole-Dollar Receipt

The previous computation is the result of a single simulation. In technical terms, it is a point-estimate for the probability of obtaining a whole-dollar receipt. If you run another simulation, you will get a slightly different estimate. If you run, say, 200 simulations and plot all of the estimates, you obtain the sampling distribution for the estimate. For example, you might obtain something that looks like the following histogram:


The histogram shows the estimates for 200 simulations. The vertical dashed line in the histogram is the mean of the 200 estimates, which is about 1%. So, on average, I should expect my grocery bills to be a whole-dollar amount about 1% of the time. The histogram also indicates how the estimate varies over the 200 simulations.

The Bootstrap Distribution of Proportions

The technique that I've described is known as a resampling or bootstrap technique.

In the SAS/IML language, it is easy to wrap a loop around the previous simulation and to accumulate the result of each simulation. The mean of the accumulated results is a bootstrap estimate for the probability that my grocery receipt is a whole-dollar amount. You can use quantiles of the accumulated results to form a bootstrap confidence interval for the probability. For example:

NumRep = 200;          /** number of bootstrap replications **/
Prop = j(NumRep, 1);   /** allocate vector for results **/
do k = 1 to NumRep;
   /** resample from the original data **/
   ...
   /** proportion of bills that are whole-dollar amounts **/
   Prop[k] = whole[:]; 
end;

meanProp = Prop[:]; 
/** LOAD or define the Qntl module here **/
call Qntl(q, Prop, {0.05 0.95});
print numItems NumRep meanProp q[label="90% CI"];

numItems    NumRep    meanProp            90% CI

      40       200    0.010055    0.0081    0.0124

Notice that 90% of the simulations had proportions that are between 0.81% and 1.24%. I can use this as a 90% confidence interval for the true probability that my grocery receipt will be a whole-dollar amount.

You can download the complete SAS/IML Studio program that computes the estimates and creates the histogram.

Conclusions

My intuition was right: I showed that the chance of my grocery receipt being a whole-dollar amount is about one in a hundred. But, more importantly, I also showed that you can use SAS/IML software to resample from data, run simulations, and implement bootstrap methods.

What's next? Well, all this programming has made me hungry. I’m going to the grocery store!

November 12, 2010
 
The other day I was at the grocery store buying a week's worth of groceries. When the cashier, Kurt (not his real name), totaled my bill, he announced, "That'll be ninety-six dollars, even."

"Even?" I asked incredulously. "You mean no cents?"

"Yup," he replied. "It happens."

"Wow," I said, with a sly statistical grin appearing on my face, "I'll bet that only happens once in a hundred customers!"

Kurt shrugged, disinterested. As I left, I congratulated myself on my subtle humor, which clearly had not amused my cashier. "Get it?" I thought to myself, "One chance in a hundred? The possibilities are 00 through 99 cents."

But as I drove home, I began to wonder: maybe Kurt knows more about grocery bills than I do. I quickly calculated that if Kurt works eight-hour shifts, he probably processes about 100 transactions every day. Does he see one whole-dollar amount every shift, on average? I thought back to my weekly visits to the grocery store over the past two years. I didn't recall another whole-dollar amount.

So what is the probability that this event (a grocery bill that is exactly a multiple of a dollar) happens? Is it really a one-chance-in-a-hundred event, or is it more rare?

The Distribution of Prices for Grocery Items

As I started thinking about the problem, I became less confident that I knew the probability of this event. I tried to recall some theory about the distribution of a sum. "Hmmmm," I thought, "the distribution of the sum of N independent random variables is the convolution of their distributions, so if each item is uniformly distributed...."

I almost got into an accident when the next thought popped into my head: grocery prices are not uniformly distributed!

I rushed home and left the milk to spoil on the counter while I hunted for old grocery bills. I found three and set about analyzing the distribution of prices for items that I buy at my grocery store. (If you'd like to do your own analysis, you can download the SAS DATA step program.)

First, I used SAS/IML Studio to create a histogram of the last two digits (the cents) for items I bought. As expected, the distribution is not uniform. More than 20% of the items have 99 cents as part of the price. Almost 10% were "dollar specials."



Frequent values for the last two digits are shown in the following PROC FREQ output:

Last2    Frequency     Percent
------------------------------
 0.99          24       20.51 
 0.00          10        8.55 
 0.19           7        5.98 
 0.49           7        5.98 
 0.69           7        5.98 
 0.50           6        5.13 
 0.89           6        5.13 
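Here is a sketch of the kind of DATA step and PROC FREQ code that could produce a table like this from the PRICES data set (the same data set used in the follow-up simulation post), in which the ITEM variable contains the individual prices:

data Last2Digits;
   set prices;
   Last2 = round( mod(Item, 1), 0.01 );  /* the cents portion of each price */
   format Last2 6.2;
run;

proc freq data=Last2Digits order=freq;
   tables Last2 / nocum;
run;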

The distribution of digits would be even more skewed except for the fact that I buy a lot of vegetables, which are priced by the pound.

Hurray for Sales Tax!

Next I wondered whether sales tax affects the chances that my grocery bill is a whole-dollar amount. A sales tax of S results in a total bill that is higher than the cost of the individual items:

Total = (1 + S) * (Cost of Items)

Sales tax is good—if your goal is to get a "magic number" (that is, a whole-dollar amount) on your total grocery bill. Why? Sales tax increases the chances of getting a magic number. Look at it this way: if there is no sales tax, then the total cost of my groceries is a whole-dollar amount when the total cost of the items is $1.00, $2.00, $3.00, and so on. There is exactly $1 between each magic number. However, in Wake County, NC, we have a 7.75% sales tax. My post-tax total will be a whole-dollar amount if the total pre-tax cost of my groceries is $0.93, $1.86, $2.78, $3.71, and so on. These pre-tax numbers are only 92 or 93 cents apart, and therefore happen more frequently than if there were no sales tax. With sales tax rates at a record high in the US, I wonder if other shoppers are seeing more whole-dollar grocery bills?
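Here is a quick SAS/IML check of that arithmetic: compute the pre-tax amounts that produce whole-dollar totals at a 7.75% tax rate and look at the spacing between them.

proc iml;
S = 0.0775;                           /** sales tax rate in Wake County, NC **/
D = T(1:5);                           /** whole-dollar totals: $1, $2, ..., $5 **/
pretax = round( D/(1+S), 0.01 );      /** pre-tax cost that yields each whole-dollar total **/
gap = pretax[2:5] - pretax[1:4];      /** spacing between consecutive "magic" pre-tax amounts **/
print (D || pretax)[colname={"Total" "PreTax"}], gap;

The gaps come out to 92 or 93 cents, in agreement with the pre-tax values listed above.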

This suggests that my chances might not be one in 100, but might be as high as one in 93—assuming that the last digits of my pre-tax costs are uniformly distributed. But are they? There is still the nagging fact that grocery items tend to be priced non-uniformly at values such as $1.99 and $1.49.

Simulating My Grocery Bill

There is a computational way to estimate the odds: I can resample from the data and simulate a bunch of grocery bills to estimate how often the post-tax bill is a whole-dollar amount. Since this post is getting a little long, I'll report the results of the simulation next week. If you are impatient, why not try it yourself?

Statistics Can Save You Money: Estimates, Areas, and Arithmetic Means

November 4, 2010
 
This post is about an estimate, but not the statistical kind. It also provides yet another example in which the arithmetic mean is not the appropriate measure for a computation.

First, some background.

Last week I read a blog post by Peter Flom that reminded me that it is wrong to use an arithmetic mean to average rates. Later that day I met a man at my house to get an estimate for some work on my roof. The price of the work depends on the area of my roof. My roof has a different pitch on the front than on the back, so the calculation of area requires some trigonometry and high-school algebra. The main calculation involves finding r1 and r2 in the following diagram, which represents a cross section of my roof:


The man who was writing the estimate took a few measurements in my attic and was downstairs with his calculation in only a few minutes.

"The front of your house has a 12/12 slope," he said, using the roofer's convention of specifying a slope by giving the ratio of the number of inches that the roof rises for every twelve inches of horizontal change. "Your back roof has a 6/12 slope," he continued. " The average slope is therefore 9/12, but because I want your business I'm only going to charge you for an average slope of 8/12. I'm saving you money."

Huh? Average slopes? Peter Flom's post came to mind.

After the man left, I sharpened my pencil and put on my thinking cap. You can compute the area of the roof by knowing the square footage of the attic and multiplying that value by a factor that is related to the pitch of the roof. A steeper roof means a bigger multiplication factor, and consequently costs more money. Less pitch means less money.

Hmmm, wasn't the roofing man claiming that the correct computation is to average the slopes of my roof? Isn't a slope a rate?

Using the diagram, I determined the diagonal distances r1 and r2 in terms of the slopes of my roof. The following SAS/IML module summarizes the computations:

proc iml;
/** multiplier for area of a roof, computed from two roof pitches **/
start FindRoofAreaMultiplier(pitch);
   s1 = pitch[1];     /** slope in front of house **/
   s2 = pitch[2];     /** slope in back of house **/
   c = s2/(s1+s2);    /** roof peak occurs at this proportion **/
   h = s1*s2/(s1+s2); /** roof height **/

   /** distances from gutters to peak of roof **/
   r1 = sqrt(c##2 + h##2);     /** along the front **/
   r2 = sqrt((1-c)##2 + h##2); /** along the back  **/
   return(r1+r2);
finish;

/** the two pitches of my roof **/
pitch = 12/12 || 6/12;
ExactMultiplier = FindRoofAreaMultiplier(pitch);
print ExactMultiplier;

ExactMultiplier

1.2167605

So, for my roof, the area of the roof is 1.217 times the area of the attic. What happens if I use the average pitch to compute the multiplier?

/** average the slopes and use average in calculations **/
s = pitch[:];
pitch = s || s;
AveMultiplier = FindRoofAreaMultiplier(pitch);
print AveMultiplier;

AveMultiplier

1.25

As I had suspected, using the average of the roof pitches to compute the multiplier gives an incorrect answer. This is obvious if you do the following thought experiment. Imagine that the slope of the front roof gets steeper and steeper while you hold the slope of the back roof constant. You can make the average slope (and therefore the multiplier computed from it) as large as you want, even though the true roof area stays bounded because the front face approaches a vertical wall.
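You can check the thought experiment numerically by continuing the PROC IML session above and reusing the module. The 100/12 pitch below is a made-up, extreme value chosen for illustration:

/** thought experiment: very steep front roof, same back roof **/
steep = 100/12 || 6/12;
SteepExact = FindRoofAreaMultiplier(steep);

s = steep[:];                      /** average of the two slopes **/
SteepAve = FindRoofAreaMultiplier(s || s);
print SteepExact SteepAve;

The exact multiplier stays modest (the front face is nearly a vertical wall), while the multiplier based on the average slope is several times larger.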

In conclusion, when the two pitches are substantially different, it is not valid to use their average to estimate the area multiplier.

So, did the roofing man cheat me? Not really. The following computation shows that the slope of 8/12 that he used in the written estimate yields a multiplier that is slightly less than the true multiplier. The difference is about 1%, and it is in my favor:

/** pitch used in written estimate **/
s = 8/12;
pitch = s || s;
WrittenMultiplier = FindRoofAreaMultiplier(pitch);
print WrittenMultiplier;

WrittenMultiplier

1.2018504

Maybe the roofing man tells people "I'm doing you a favor" to get more business. Or, maybe he made a mistake. Or maybe he is so experienced that he just intuitively knew the correct multiplier.

In any case, statisticians know that the arithmetic mean shouldn't be used indiscriminately, and this story provides yet another example in which using the arithmetic mean leads to a wrong answer.