Just for Fun

September 27, 2011
 

I was intrigued by a math puzzle posted to the SAS Discussion Forum (from New Scientist magazine). The problem is repeated below, but I have added numbers in brackets because I am going to comment on each clue:

[1] I have written down three different 5-digit perfect squares, which [2] between them use five different digits. [3] Each of the five digits is used a different number of times, [4] the five numbers of times being the same as the five digits of the perfect squares. [5] No digit is used its own number of times. [6] If you knew which digit I have used just once, you could deduce my three squares with certainty.

What are the three perfect squares?

I solved the problem by using the SAS/IML language. If you want to try solving the problem yourself without further hints, don't read any further!

Deciphering the clues

I found the clues difficult to understand, so I'll give an example for each clue:

  1. You are looking for three five-digit squares, such as 12321 = 111²
  2. There are five distinct digits among the 15 digits in the three numbers. For example, the numbers can't be (10000, 10201, 10404), because there are only four unique digits in these numbers: 0, 1, 2, and 4.
  3. The frequencies of the five unique digits are distinct. Because there are 15 digits in the three numbers, we immediately know that the distribution of frequencies is 1, 2, 3, 4 and 5. So, for example, the numbers can't be (12321, 12544, 13225) because although there are five 2s and four 1s, no digit appears three times.
  4. The five frequencies are the same as the five digits. Therefore, the five digits are 1, 2, 3, 4, and 5. For example, 57121 cannot be one of the numbers, because it contains a 7 among its digits.
  5. The phrase "no digit is used its own number of times" baffled me. I finally figured out that it means that the digit 1 appears more than one time, the digit 2 appears either once or more than twice, the digit 3 does not appear exactly three times, and so forth.
  6. At first I didn't realize that the last sentence is also a clue. The person who posted the puzzle said he found seven solutions, but the puzzle implies that there is a unique solution. The last sentence means that if you look at the frequency distribution of the digits among the seven potential solutions, only one of them is not a repeat of another. For example, the solution is not (12321, 244521, 355225) because the 3 digit appears once, but there is also another (potential) solution for which the 3 digit appears once.

When I wrote the SAS/IML program that solves the problem, I was pleased to discover that I had used three functions that are new to SAS/IML 9.3! Since the entire program is less than 25 lines long, that's a pretty good ratio of new functions to total lines. The three new functions I used are as follows:

  • The ELEMENT function enables you to find which elements in one set are contained in another set. I use this function to get rid of all 5-digit perfect squares that contain any of the digits {0,6,7,8,9}.
  • The ALLCOMB function generates all combinations of n elements taken k at a time. After I reduced the problem to a set of nine 5-digit numbers, I used the ALLCOMB function to look at all triplets of the candidate numbers.
  • The TABULATE subroutine computes the frequency distribution of elements in a vector or matrix. I used this subroutine to apply rules 4 and 5. (A sketch that combines these three functions appears after this list.)
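
To give a flavor of how these functions fit together (without giving away the answer), here is a rough sketch. It generates the candidate squares as described in the clues and leaves the tests inside the loop as comments. The variable names are my own, and this outline is not necessarily the program that I will post on Thursday.

proc iml;
/** all 5-digit perfect squares: 100**2 = 10000 through 316**2 = 99856 **/
squares = T(100:316)##2;

/** split each square into its five digits **/
digits = j(nrow(squares), 5);
do i = 1 to 5;
   digits[,i] = mod( floor(squares / 10##(5-i)), 10 );
end;

/** clue 4: keep only squares whose digits all belong to {1,2,3,4,5} **/
keep = loc( element(digits, 1:5)[,+] = 5 );
candidates = squares[keep, ];          /** nine candidates remain **/

/** clue 1: examine every triplet of the remaining candidates **/
idx = allcomb(nrow(candidates), 3);
do j = 1 to nrow(idx);
   trip = candidates[ idx[j,], ];                 /** one triplet of squares **/
   d = shape( digits[ keep[idx[j,]], ], 0, 1 );   /** its 15 digits **/
   call tabulate(levels, freq, d);                /** digit frequencies **/
   /** ... test LEVELS and FREQ against clues 2, 3, and 5 ... **/
end;
quit;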

You can use this post to ask questions or to clarify the directions. I will post my solution on Thursday. You can post your own solution as a comment to Thursday's post. DATA step and PROC SQL solutions are also welcome. Good luck!

tags: Just for Fun
July 19, 2011
 
Yesterday, Jiangtang Hu did a frequency analysis of my blog posts and noticed that there are some holidays on which I post to my blog and others on which I do not.

The explanation is simple: I post on Mondays, Wednesdays, and Fridays, provided that SAS Institute (World Headquarters) is not closed. Because I am absent-minded, I wrote a little SAS/IML program that tells me when I am supposed to post. Notice the use of the HOLIDAY function in Base SAS in order to figure out the dates of certain holidays.

proc iml;
date = today(); /** get today's date **/
/** is today M, W, or F? **/
MWF = any( putn(date, "DOWName3.") = {"Mon" "Wed" "Fri"} );

/** Is SAS closed today?**/
/** Find SAS holidays that might occur Mon-Fri **/
year = year(date);
Float = holiday({Labor Memorial NewYear USIndependence}, year);
/** Thanksgiving or day after **/
TDay = holiday("Thanksgiving", year) + (0:1); 
/** Christmas or week after **/
WinterBreak = holiday("Christmas", year) + (0:7);
SASHolidays = Float || TDay || WinterBreak;
SASClosed = any(date = SASHolidays);

/** should Rick post to his blog today? **/
if MWF & ^SASClosed then 
   print "Rick posts a blog today.";
else 
   print "Rick is working on a post for some other day.";
quit;

Readers are also welcome to run this program to determine whether a post is scheduled or not.

Kidding. Just kidding.

June 13, 2011
 
My primary purpose in writing The DO Loop blog is to share what I know about statistical programming in general and about SAS programming in particular. But I also write the blog for various personal reasons, including the enjoyment of writing.

The other day I encountered a concept on Ajay Ohri's Decision Stats blog that made me reflect on what I, personally, gain from blogging. The concept is the Johari window, which is a self-assessment technique in cognitive psychology. The following image is taken from Peter Dorrington's article about "unknown unknowns and risk":


By looking at the Johari window, I realized that blogging helps me to become more aware of what I know and what I don't know. The Johari window framework also summarizes the process that I use to write this blog. Long-time readers know that I post three articles a week according to the following schedule:

  • On Mondays, I post Getting Started articles. These articles correspond to the upper right quadrant of my Johari window. They represent topics that I know that I know. I exploit this knowledge to get out a quick article that requires minimal effort.
  • On Wednesdays, I post articles on a variety of topics in statistical programming such as sampling and simulation and efficient SAS programming. These articles often correspond to the lower left quadrant of my Johari window. They represent topics that I am trying to learn. Usually, I am not an expert on these topics, so I risk making a fool of myself. However, blogging gives me an opportunity to share what little I know and it motivates me to get it right. I often experiment with several approaches before I feature one in my blog.
  • On Fridays, I like to post articles about data analysis. These articles correspond to the upper left quadrant of my Johari window. They are often inspired by reading other blogs or by having a robust curiosity about topics such as "What Topics Appear in The Far Side Cartoons?" Even after I explore the data and blog about it, I am aware that there is more that could be said.

What about the lower right quadrant? That comes into play when I am searching for something to blog about. In the past 15 years, I've written and saved thousands of SAS and SAS/IML programs, demos, examples, test programs, presentations, and papers. These are scattered in dozens of directories on my computer. Sometimes I'll stumble upon a program I wrote ten years ago and think, "That's pretty clever; I should write a blog about this!" My challenge is to find these gems that I have forgotten about—to rediscover and expose what I once knew.

My goal is to become the best statistical programmer and data analyst that I can be, and to help other SAS programmers do the same. Blogging helps by making me keenly aware of what I know and what I don't know.

April 13, 2011
 
I have recently returned from five days at SAS Global Forum in Las Vegas. [Image: a riffle shuffle. Photo by Todd Klassy, http://www.flickr.com/photos/latitudes/66424863/in/set-1442169/, CC BY 2.0.] On the way there, I finally had time to read a classic statistical paper: Bayer and Diaconis (1992), which describes how many shuffles are needed to randomize a deck of cards. Their famous result that it takes seven shuffles to randomize a 52-card deck is known as "the bane of bridge players" because the result motivated many bridge clubs to switch from hand shuffling to computer-generated shuffling. Casual bridge players also blame this result for "slowing down the game" while the cards are shuffled more times than seems intuitively necessary.

In the second paragraph of the paper, Bayer and Diaconis introduce a "mathematically precise model of shuffling," which is known as the Gilbert-Shannon-Reeds (GSR) model. This model is known to be a "good description of the way real people shuffle real cards." (See Diaconis (1988) and the references at the end of this article.) This article describes how to implement the GSR shuffle model in SAS/IML software.

The Riffle Shuffle

Computationally, you can shuffle a deck by generating a permutation of the set 1:n, but that is not how real cards are shuffled.

The riffle (or "dovetail") shuffle is the most common shuffling algorithm. A deck of n cards is split into two parts and the two stacks are interleaved. The GSR algorithm simulates this physical process.

The GSR model splits the deck into two pieces according to the binomial distribution. Each piece has roughly n/2 cards. Then cards are dropped from the two stacks according to the number of cards remaining in each stack. Specifically, if there are NL cards in the left stack and NR cards in the right stack, then the probability of the next card dropping from the right stack is NR / (NR + NL).

The following SAS/IML module is a straightforward implementation of the GSR algorithm:

proc iml;
/** Gilbert-Shannon-Reeds statistical 
    model of a riffle shuffle. Described in 
    Bayer and Diaconis (1992) **/
start GSRShuffle(deck);
   n = nrow(deck);
   /** cut into two stacks **/
   nL = rand("Binomial", 0.5, n);
   nR = n - nL;

   /** left stack, right stack **/
   L = deck[1:nL]; R = deck[nL+1:n];
   j = 1; k = 1; /** card counters **/
   shuffle = j(n,1); /** allocate **/
   do i = 1 to n;
      c = rand("Bernoulli", nR/(nL+nR)); 
      if c=0 then do;  /** drop from left **/
        shuffle[i] = L[j];
        nL = nL-1;  j=j+1; /** update left **/
      end;
      else do;  /** drop from right **/
        shuffle[i] = R[k];
        nR = nR-1;  k=k+1; /** update right **/
      end;
   end;
   return(shuffle);
finish;

Testing the GSR Algorithm

You can test the algorithm by starting with a deck of cards in a known order and observing how the cards are mixed by consecutive riffle shuffles. The following statements riffle the cards seven times and store the results of each shuffle. To save space, a 20-card deck is used. The original order of the cards is denoted 1, 2, 3, ..., 20.

call randseed(1234);
n = 20; /** n=52 for a standard deck **/
deck = T(1:n);

s = j(n, 8); /** allocate for 8 decks **/
s[,1] = deck; /** original order **/
do i = 1 to 7;
   s[,i+1] = GSRShuffle( s[,i] );
end;
names = "s0":"s7";
print s[colname=names];


                  s
    s0  s1  s2  s3  s4  s5  s6  s7

     1   1   1   1   1   1  18   2
     2   2  14  14  14  11   1  13
     3  12   2   2   4  14  11  12
     4   3   6   6   2   4  14  18
     5  13  12   3  12   2  19   1
     6   4  15  16  15  12   7  11
     7   5   7  13  17  15   3  14
     8  14   8   4  10  17   4  15
     9   6   9  12   6  10  16  19
    10  15   3  15  18   6   2   8
    11   7  16  17   7  18  13   7
    12   8  13  10   3  19  12   3
    13   9   4  18  16   7  15   9
    14  16  17   7  13   3   8   4
    15  17  10   8   8  16   9  17
    16  10  18   5   5  13  17  20
    17  18   5  11  11   8  20  16
    18  11  11  19  19   9  10  10
    19  19  19   9   9  20   5   5
    20  20  20  20  20   5   6   6

It is interesting to study the evolution of mixing the cards:

  • For the first shuffle, the original deck is cut into two stacks (1:11 and 12:20) and riffled to form the second column. Notice that usually one or two cards are dropped from each stack, although at one point three cards (7, 8, 9) are dropped from the left stack.
  • The second cut occurs at card 14, which is the eighth card in the second column. After the second riffle, notice that the cards 7, 8, and 9 are still consecutive, and the first and last cards are still in their original locations. Six cards (30%) are still in their initial positions in the deck.
  • The pair 7 and 8 is not separated until the fourth shuffle.
  • The last card (20) does not move from the bottom of the deck until the fifth shuffle.
  • The first card (1) does not move from the top of the deck until the sixth shuffle.

The Efficiency of the GSR Algorithm

As far as efficiency goes, the GSRShuffle module that I've implemented here is not very efficient. As I've said before, the SAS/IML language is a vector language, so statements that operate on a few long vectors run much faster than equivalent statements that involve many scalar quantities.

This implementation of the shuffling algorithm is not vectorized. Unfortunately, because the probability of a card dropping from the left stack changes at every iteration, there is no way to call the RANDGEN function once and have it return all n numbers required to simulate a single riffle shuffle.

Or is there? Perhaps there is an equivalent algorithm that can be vectorized? Next week I'll present a more efficient version of the GSR algorithm that does not require an explicit loop over the number of cards in the deck.
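
For the curious, here is one way such a vectorization could work. This is a sketch of my own (not necessarily the algorithm in next week's post), and it relies on a property of the GSR model: given the binomial cut, every interleaving that preserves the order of the cards within each stack is equally likely. Therefore you can choose the positions of the left-stack cards all at once instead of dropping cards one at a time. The module name GSRShuffleVec is mine.

/** sketch: vectorized GSR shuffle (no loop over cards) **/
start GSRShuffleVec(deck);
   n = nrow(deck);
   nL = rand("Binomial", 0.5, n);      /** binomial cut, as in GSRShuffle **/
   u = j(n, 1);
   call randgen(u, "Uniform");
   isLeft = ( rank(u) <= nL );         /** nL positions chosen uniformly at random **/
   /** position k receives the next unused card from the left or right stack **/
   leftIdx  = cusum(isLeft);           /** 1,2,...,nL at the left positions **/
   rightIdx = nL + cusum(^isLeft);     /** nL+1,...,n at the right positions **/
   srcIdx = isLeft#leftIdx + (^isLeft)#rightIdx;
   return( deck[srcIdx, ] );
finish;

Written this way, each shuffle costs a handful of vector operations, regardless of the number of cards in the deck.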

References

D. Bayer and P. Diaconis (1992), "Trailing the Dovetail Shuffle to Its Lair," Annals of Applied Probability, 2(2), 294-313.
P. Diaconis (1988), Group Representations in Probability and Statistics, IMS, Hayward, CA.
E. Gilbert (1955), "Theory of Shuffling," technical memorandum, Bell Laboratories.
J. Reeds (1981), unpublished manuscript.
April 8, 2011
 
At the beginning of 2011, I heard about the Dow Piano, which was created by CNNMoney.com. The Dow Piano visualizes the performance of the Dow Jones industrial average in 2010 with a line plot, but also adds an auditory component. As Bård Edlund, Art Director at CNNMoney.com, said,
The daily trading of the Dow Jones industrial average determines the songwriting here, translating the ups and downs of 2010's market into musical notes. Using a five-note scale spanning three octaves, pitch is determined by each day's closing level.

When I first saw the Dow Piano, I immediately thought, "I can do that in SAS/IML Studio by using the SOUND call in the SAS/IML language!"

To be fair, I can't fully duplicate the Dow Piano in SAS/IML software. The SOUND call in SAS is very simple: it only provides pitch and duration. The music for the Dow Piano is more complex:

  • It uses piano notes, which involves sound characteristics such as attack and decay.
  • It uses dynamics to represent trading volume: high-volume days are represented by loud notes, low-volume days by soft notes.
  • It overlays a percussion track so that you hear a percussive beat on days that the stock market is closed.
Nevertheless, I think the SAS/IML version captures the main idea by allowing you to listen to the performance of the Dow Jones industrial average. (Get it? The performance?)
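
As a rough illustration of the idea, the following sketch maps a short series of closing values onto a five-note scale spanning three octaves (the same mapping that the Dow Piano uses) and plays the notes with the SOUND call. The closing values here are made-up numbers for illustration; the real analysis would read the 2010 DJIA closing levels from a data set.

proc iml;
/** hypothetical closing values (not actual DJIA data) **/
close = {10583, 10572, 10607, 10618, 10664, 10627, 10540, 10520};

/** a five-note (pentatonic) scale repeated over three octaves: 15 pitches **/
scale = {261.63 293.66 329.63 392.00 440.00};    /** C D E G A, in hertz **/
freqs = scale || 2*scale || 4*scale;

/** map each closing level to one of the 15 pitches **/
bin = 1 + floor( 14 * (close - min(close)) / (max(close) - min(close)) );

do i = 1 to nrow(close);
   call sound( freqs[ bin[i] ], 0.2 );           /** pitch (Hz), duration (sec) **/
end;
quit;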

Will this kind of visualization (auditorization? sonification?) catch on? Time will tell, but for these data, the sound doesn't add any new information; it merely uses the sense of sound to repeat information that is already in the line plot. In fact, the sound presents less information than the line plot: there are more than 250 closing values of the Dow Jones industrial average, but the Dow Piano collapses these values into 15 notes (three octaves of a five-note scale).

But enough analysis! Listen to the Way of the Dow. The demo starts at 0:45, after I introduce the problem.

April 1, 2011
 
Today, SAS, the leader in business analytics, announces significant changes to two popular SAS blogs, The DO Loop (written by Rick Wicklin) and The SAS Dummy (previously written by Chris Hemedinger).

The two blogs will be merged into a single blog, called The SAS Smarty: A Blog for an Elite Few. The blog, which will be written by Wicklin, will debut next week.

Ima Phül, director of SAS Brand Preservation, says the change is needed.

"Like many companies, SAS continuously monitors the way its brand is perceived by customers," says Phül. "By using SAS Social Media Analytics and SAS Web Analytics, my team was able to identify several problems in the way that the SAS brand is perceived by customers."

Particularly troubling was the high correlation between the terms "SAS" and "dummy," and the terms "SAS" and "dogfood".

"Thanks to SAS Social Media Analytics, Chris's blog was determined to be the root cause of these undesireable relationships," continues Phül.

Wicklin says that re-aligning the SAS Dummy blog with corporate directions will not be hard. "Expect less humor and more math," he says. "I've got some great articles planned on computing multi-dimensional integrals and using orthogonal regression polynomials."

The archives of the SAS Dummy will continue to be hosted on blogs.sas.com, but the title of the discontinued blog will be changed to an unpronounceable symbol:


Alison Bolin, Editor of Blogs and Social Content at SAS, thanks Chris for his years of blogging. "Chris was instrumental in the success of our SAS blogging efforts. We are happy to announce that Chris will be leading a new R&D effort from our Siberian office, and we wish him well in his new endeavors."

These changes are effective as of today, April 1, 2011, also known as April Fools' Day.

February 14, 2011
 
If you tell my wife that she's married to a statistical geek, she'll nod knowingly. She is used to hearing sweet words of affection such as
You are more beautiful than Euler's identity.
or
My love for you is like the exponential function: increasing, unbounded, and transcendental.
But those are ordinary, everyday sentiments. For Valentine's Day, I want to do something really special, such as formulating a parametric expression whose image is heart-shaped. If you haven't gotten anything sweet for your special someone, you too can use SAS/IML Studio to create the following image!

Modify the program to create your own personal message. My personal message to my wife is "I love you."




/** Valentine's Day heart. Rick Wicklin: blogs.sas.com/iml **/
/** Parametrize in polar coordinates h(t) = (r(t), theta(t)) **/
Pi = constant("Pi");
t = do(0, 2*Pi, 0.01*Pi/4)`;
r = 2 - 2*sin(t) + sin(t)#sqrt(abs(cos(t))) / (sin(t)+1.4);
/** Convert to Euclidean coordinates for plotting **/
x = r#cos(t);
y = r#sin(t);

/** Use SAS/IML Studio to produce the image **/
declare DataObject dobj;
dobj = DataObject.Create("Valentine", {"x" "y"}, x||y );
dobj.AddVar("const", j(nrow(t),1));
dobj.SetMarkerFillColor(OBS_ALL, RED);

declare PolygonPlot p;
p = PolygonPlot.Create(dobj, "x", "y", "const", true);
p.SetGraphAreaMargins(0,0,0,0);
p.SetAspectRatio(1);
p.ShowAxes(false);
p.SetWindowPosition(50,0,32,50);
p.DrawUseDataCoordinates();
p.DrawText(0,-4.1,
"r = 2 - 2*sin(t) + sin(t)*sqrt(|cos(t)|) / (sin(t)+1.4)");
p.DrawSetTextSize(30);
p.DrawText(0,-0.75,"Happy\nValentine's\nDay!");
February 10, 2011
 
I enjoyed the Dataists' data-driven blog on the best numbers to choose in a Super Bowl betting pool. It reminded me of my recent investigation of which initials are most common.

Because the Dataists' blog featured an R function that converts Arabic numerals into Roman numerals, the blog post also reminded me that SAS has a built-in format that does the same thing. The ROMANw. format makes it easy to convert everyday whole numbers into their Roman equivalents.

For example, here are a few numbers and their Roman counterparts:

data RomanNumerals;
input x @@;
n = x;
format n ROMAN8.;
datalines;
1 2 3 4 5 9 30 40 45 50 1999 2011
;
run;
proc print noobs; run;

    x            n

    1            I
    2           II
    3          III
    4           IV
    5            V
    9           IX
   30          XXX
   40           XL
   45          XLV
   50            L
 1999      MCMXCIX
 2011         MMXI

I've always wondered why SAS has this obscure format, but now I know of at least one useful application: using it to troll through Wikipedia to extract box scores from Super Bowl I through XLV.

January 25, 2011
 
Have you ever wanted to compute the exact value of a really big number such as 200! = 200*199*...*2*1? You can do it—if you're willing to put forth some programming effort. This blog post shows you how.

Jiangtang Hu's recent blog discusses his quest to compute large factorials in many programming languages. As he discovered, many languages that use standard double precision arithmetic cannot compute n! for large n, whereas other software, such as Mathematica, succeeds.

The reason is simple: 171! is greater than 1.79769e+308, which on many computers is the largest floating point number that can be represented in eight bytes. For example, the following statements print the maximum double precision value on my Windows PC:

proc iml;
MaxBig = constant("big");
print MaxBig;

MaxBig

1.798E308

Because of the limitation of double precision arithmetic, floating point computations can result in a run-time error. For example, the following factorial computation overflows in SAS software:

f = fact(200);  /** numerical overflow **/

Of course, there are ways around this limitation. The reason that software such as Maple and Mathematica succeed is that they implement special algorithms that enable them to compute results to arbitrary precision.

Computing something to "arbitrary precision" sounds like magic, but it's not. For example, I can use the SAS/IML language (or the DATA step, or C, or many other languages) to write an algorithm that computes the factorial of fairly large numbers. The following algorithm is naive and inefficient, but it shows how you can implement an arbitrary precision algorithm by storing the result in a numerical array instead of in a double precision variable.

An Algorithm for Computing the Factorial of Large Numbers

Again, I emphasize that this algorithm is for illustration purposes and is not intended for "real" work. Given an integer, n, here's how you can compute the exact value of f = n!. The key is to store each digit of the result in a numerical array.

  1. Compute the number of digits of f = n!. The number of digits is related to the logarithm (base 10) of f, so you can use the LFACT function. The LFACT function returns the natural logarithm of f, which you can convert to the base-10 logarithm by dividing by log(10).
  2. Allocate an array with the correct number of digits. The ith element of this array will hold the ith digit of the answer. In other words, if the ith digit is d, that element represents the value d x 10^(i-1).
  3. Initialize the result to 1.
  4. For each integer 2, 3, ..., n, multiply the current result by the next factor. This is done place-value-by-place-value, and results that are greater than 10 are carried over to the next place value.
  5. After all the multiplications, the array f has the value of n!. The numbers in this array need to be reversed, because, for example, the array {0 2 1} represents the number 120.

start BigFactorial(n);
 numDigits = ceil(lfact(n)/log(10)); /** 1 **/
 f = j(1,numDigits,0);               /** 2 **/
 f[1] = 1;                           /** 3 **/
 do i = 2 to n;                      /** 4 **/
   carry = 0;
   digits = ceil(lfact(i)/log(10)); /** how many digits in the answer so far? **/
   do j = 1 to digits ;             /** multiply the product by next factor **/
     g = i*f[j] + carry;            /** j_th place value **/
     digit = mod(g, 10); 
     f[j] = digit;                  /** put value in the j_th place **/
     carry = (g - digit)/10;        /** carry the rest to the (j+1)_th place **/
   end;
 end;
 return( f[,ncol(f):1] );           /** 5. reverse the digits **/
finish;

Can this module really compute the factorial of large numbers? Yep. You can compute 200! and see that the exact answer contains 375 digits. The following statements compute the factorial and print the result:

f = BigFactorial(200);  /** compute 200! **/
/** convert answer from a numerical array to a character string **/
Fact200 = rowcat(char(f,1)); 

numDigits200 = nleng(Fact200); /** how many digits in the answer? **/
print numDigits200;
print Fact200;

numDigits200

375

                                     Fact200

     788657867364790503552363213932185062295135977687173263294742533244359449963
     403342920304284011984623904177212138919638830257642790242637105061926624952
     829931113462857270763317237396988943922445621451664240254033291864131227428
     294853277524242407573903240321257405579568660226031904170324062351700858796
     178922222789623703897374720000000000000000000000000000000000000000000000000

Obviously, I have entirely too much time on my hands, but does this example demonstrate anything else? It demonstrates that programming with arrays enables you to implement a wide variety of algorithms. Perhaps the example also demystifies software that evaluates expressions to "arbitrary precision."

Mostly, though, I did it because I thought it would be fun. I was right. And next time I'm at a party, I might casually sidle next to someone and mention that 1,000! has 2,568 digits. I'll let you know their response.
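
(If you want to check that digit count yourself, step 1 of the algorithm above gives it directly from the LFACT function:)

proc iml;
/** number of decimal digits in 1000!, via the base-10 logarithm of the factorial **/
nDigits = ceil( lfact(1000) / log(10) );
print nDigits;   /** 2568 **/
quit;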

November 19, 2010
 
In a previous post, I used statistical data analysis to estimate the probability that my grocery bill is a whole-dollar amount such as $86.00 or $103.00. I used three weeks' grocery receipts to show that the last two digits of prices on items that I buy are not uniformly distributed. (The prices tend to be values such as $1.49 or $1.99 that appear to be lower in price.) I also showed that sales tax increases the chance of getting a whole-dollar receipt.

In this post, I show how you can use resampling techniques in SAS/IML software to simulate thousands of receipts. You can then estimate the probability that a receipt is a whole-dollar amount, and use bootstrap ideas to examine variation in the estimate.

Simulating Thousands of Receipts

The SAS/IML language is ideal for sampling and simulation in the SAS environment. I previously introduced a SAS/IML module called SampleReplaceUni which performs sampling with replacement from a set. You can use this module to resample from the data.

First run a SAS DATA step program to create the WORK.PRICES data set. The ITEM variable contains prices of individual items that I buy. You can resample from these data. I usually buy about 40 items each week, so I'll use this number to simulate a receipt.

The following SAS/IML statements simulate 5,000 receipts. Each row of the y matrix contains 40 randomly selected items (one simulated receipt), and there are 5,000 rows. The sum across each row is the pre-tax total for that receipt:

proc iml;
/** read prices of individual items **/
use prices; read all var {"Item"}; close prices;

SalesTax = 0.0775;   /** sales tax in Wake county, NC **/
numItems = 40;       /** number of items in each receipt **/
NumResamples = 5000; /** number of receipts to simulate **/
call randseed(4321);

/** LOAD or define the SampleReplaceUni module here. **/

/** resample from the original data. **/
y = sampleReplaceUni(Item`, NumResamples, NumItems);
pretax = y[,+];       /** sum to get pre-tax cost of each receipt **/
total = round((1 + SalesTax) # pretax, 0.01); /** add sales tax **/
whole = (total=int(total)); /** indicator: whole-dollar receipts **/
Prop = whole[:];    /** proportion that are whole-dollar amounts **/
print NumItems NumResamples (sum(whole))[label="Num00"] 
       Prop[format=PERCENT7.2 label="Pct"];

numItems    NumResamples    Num00      Pct

      40            5000       52    1.04%

The simulation generated 52 whole-dollar receipts out of 5,000, or about 1%.

Estimating the Probability of a Whole-Dollar Receipt

The previous computation is the result of a single simulation. In technical terms, it is a point-estimate for the probability of obtaining a whole-dollar receipt. If you run another simulation, you will get a slightly different estimate. If you run, say, 200 simulations and plot all of the estimates, you obtain the sampling distribution for the estimate. For example, you might obtain something that looks like the following histogram:


The histogram shows the estimates for 200 simulations. The vertical dashed line in the histogram is the mean of the 200 estimates, which is about 1%. So, on average, I should expect my grocery bills to be a whole-dollar amount about 1% of the time. The histogram also indicates how the estimate varies over the 200 simulations.

The Bootstrap Distribution of Proportions

The technique that I've described is known as a resampling or bootstrap technique.

In the SAS/IML language, it is easy to wrap a loop around the previous simulation and to accumulate the result of each simulation. The mean of the accumulated results is a bootstrap estimate for the probability that my grocery receipt is a whole-dollar amount. You can use quantiles of the accumulated results to form a bootstrap confidence interval for the probability. For example:

NumRep = 200;          /** number of bootstrap replications **/
Prop = j(NumRep, 1);   /** allocate vector for results **/
do k = 1 to NumRep;
   /** resample from the original data **/
   ...
   /** proportion of bills that are whole-dollar amounts **/
   Prop[k] = whole[:]; 
end;

meanProp = Prop[:]; 
/** LOAD or define the Qntl module here **/
call Qntl(q, Prop, {0.05 0.95});
print numItems NumRep meanProp q[label="90% CI"];

numItems    NumRep    meanProp           90% CI

      40       200    0.010055    0.0081    0.0124

Notice that 90% of the simulations had proportions that are between 0.81% and 1.24%. I can use this as a 90% confidence interval for the true probability that my grocery receipt will be a whole-dollar amount.

You can download the complete SAS/IML Studio program that computes the estimates and creates the histogram.

Conclusions

My intuition was right: I showed that the chance of my grocery receipt being a whole-dollar amount is about one in a hundred. But, more importantly, I also showed that you can use SAS/IML software to resample from data, run simulations, and implement bootstrap methods.

What's next? Well, all this programming has made me hungry. I’m going to the grocery store!