2月 072018
Multiple-Birthday Problem: The distribution of the number of shared birthdays among 23 random people

If N random people are in a room, the classical birthday problem provides the probability that at least two people share a birthday. The birthday problem does not consider how many birthdays are in common. However, a generalization (sometimes called the Multiple-Birthday Problem) examines the distribution of the number of shared birthdays. Specifically, among N people, what is the probability that exactly k birthdays are shared (k = 1, 2, 3, ..., floor(N/2))? The bar chart at the right shows the distribution for N=23. The heights of the bars indicate the probability of 0 shared birthdays, 1 shared birthday, and so on.

This article uses simulation in SAS to examine the multiple-birthday problem. If you are interested in a theoretical treatment, see Fisher, Funk, and Sams (2013, p. 5-14).

The probability of multiple shared birthdays for N=23

You can explore the classical birthday problem by using probability theory or you can estimate the probabilities by using Monte Carlo simulation. The simulation in this article assumes 365 equally distributed birthdays, but see my previous article to extend the simulation to include leap days or a nonuniform distribution of birthdays.

The simulation-based approach enables you to investigate the multiple birthday problem. To begin, consider a room that contains N=23 people. The following SAS/IML statements are taken from my previous article. The Monte Carlo simulation generates one million rooms of size 23 and estimates the proportion of rooms in which the number of shared birthdays is 0, 1, 2, and so forth.

proc iml;
/* Function that simulates B rooms, each with N people, and counts the number of shared 
   (matching) birthdays in each room. The return value is a row vector that has B counts. */
start BirthdaySim(N, B);                 
   bday = Sample(1:365, B||N);             /* each column is a room */
   match = N - countunique(bday, "col");   /* number of unique birthdays in each col */
   return ( match );
call randseed(12345);                      /* set seed for random number stream */
NumPeople = 23;
NumRooms = 1e6;
match = BirthdaySim(NumPeople, NumRooms);  /* 1e6 counts */
call tabulate(k, freq, match);  /* summarize: k=number of shared birthdays, freq=counts */ 
prob = freq / NumRooms;         /* proportion of rooms that have 0, 1, 2,... shared birthdays */
print prob[F=BEST6. C=(char(k,2)) L="Estimate of Probability (N=23)"];
Multiple-Birthday Problem: The distribution of shared birthdays among 23 random people

The output summarizes the results. In less than half the rooms, no person shared a birthday with anyone else. In about 36% of the rooms, one birthday is shared by two or more people. In about 12% of the room, there were two birthdays that were shared by four or more people. About 2% of the rooms had three birthdays shared among six or more individuals, and so forth. In theory, there is a small probability of a room in which 8, 9, 10, or 11 birthdays are shared, but the probability of these events is very small. The probability estimates are plotted in the bar chart at the top of this article.

There is a second way to represent this probability distribution: you can create a stacked bar chart that shows the cumulative probability of k or fewer shared birthdays for k=1, 2, 3,.... You can see that the probability of 0 matching birthdays is less than 50%, the probability of 1 or fewer matches is 87%, the probability of 2 or fewer matches is 97%, and so forth. The probability for 5, 6, or 7 matches is not easily determined from either chart.

Multiple-Birthday Problem: The cumulative distribution of shared birthdays among 23 random people

Mathematically speaking, the information in the stacked bar chart is equivalent to the information in the regular bar chart. The regular bar chart shows the PMF (probability mass function) for the distribution whereas the stacked chart shows the CDF (cumulative distribution function).

The probability of shared birthdays among N people

The previous section showed the distribution of shared birthdays for N=23. You can download the SAS program that simulates the multiple birthday problem for rooms that contain N=2, 3, ..., and 60 people. For each simulation, the program computes estimates for the probabilities of k matching birthdays, where k ranges from 0 to 30.

Multiple-Birthday Problem: Three distributions of the number of shared birthdays among N=25, 40, and 60 random people

There are a few ways to visualize those 59 distributions. One way is to plot 59 bar charts, either in a panel or by using an animation. A panel of three bar charts is shown to the right for rooms of size N=25, 40, and 60. You can see that the bar chart for N=25 is similar to the one shown earlier. For N=40, most rooms have between one and three shared birthdays. For rooms with N=60 people, most rooms have between three and six shared birthdays. The horizontal graphs are truncated at k=12, even though a few simulations generated rooms that contain more than 12 matches.

Obviously plotting all 59 bar charts would require a lot of space. A different way to visualize these 59 distributions is to place 59 stacked bar charts side by side. This visualization is more compact. In fact, instead of 59 stacked bar charts, you might want to create a stacked band plot, which is what I show below.

The following stacked band chart is an attempt to visualize the distribution of shared birthdays for ALL room sizes N=2, 3, ..., 60. (Click to enlarge.)

Multiple-Birthday Problem: The cumulative distribution of the probability of k or fewer shared birthdays among N random people, N=2,3,...,60

How do you interpret this graph? A room that contains N=23 people corresponds to drawing a vertical line at N=23. That vertical line intersects the red, green, and brown bands near 0.5, 0.87, and 0.97. Therefore, these are the probabilities that a room with 23 people contains 0 shared birthdays, 1 or fewer shared birthdays, and 2 or fewer shared birthdays. The vertical heights of the bands qualitatively indicate the individual probabilities.

Next consider a room that contains N=40 people. A vertical line at N=40 intersects the colored bands at 0.11, 0.37, 0.66, 0.86, and 0.95. These are the cumulative probabilities P(k ≤ 0), P(k ≤ 1), P(k ≤ 2), and so forth. If you need the exact values, you can use PROC PRINT to display the actual probability estimates. For example:

proc print data=BDayProb noobs;
   where N=40 and k < 5;
Multiple-Birthday Problem: The distribution of shared birthdays among N=40 random people


In summary, this article examines the multiple-birthday problem, which is a generalization of the classical birthday problem. If a room contains N random people, the multiple-birthday problem examines the probability that k birthdays are shared by two or more people for k = 0, 1, 2, ..., floor(N/2). This article shows how to use Monte Carlo simulation in SAS to estimate the probabilities. You can visualize the results by using a panel of bar charts or by using a stacked band plot to visualize the cumulative distribution. You can download the SAS program that simulates the multiple birthday problem, estimates the probabilities, and visualizes the results.

What do you think? Do you like the stacked band plot that summarizes the results in one graph, or do you prefer a series of traditional bar charts? Leave a comment.

The post The distribution of shared birthdays in the Birthday Problem appeared first on The DO Loop.

2月 052018
The Birthday Problem: The probability of a matching birthday in a room of that contains N people (N=2,3,...,60). You can simulate the birthday problem in SAS and compute a Monte Carlo estimate of the probability of a match.

This article simulates the birthday-matching problem in SAS. The birthday-matching problem (also called the birthday problem or birthday paradox) answers the following question: "if there are N people in a room, what is the probability that at least two people share a birthday?" The birthday problem is famous because the probability of duplicate birthdays is much higher than most people would guess: Among 23 people, the probability of a shared birthday is more than 50%.

If you assume a uniform distribution of birthdays, the birthday-matching problem can be solved exactly. In reality birthdays are not uniformly distributed, but as I show in Statistical Programming with SAS/IML Software (Wicklin 2010), you can use a Monte Carlo simulation to estimate the probability of a matching birthday.

This article uses ideas from Wicklin (2010) to simulate the birthday problem in the SAS/IML language. The simulation is simplest to understand if you assume uniformly distributed birthdays (p. 335–337), but it is only slightly harder to run a simulation that uses an empirical distribution of birthdays (p. 344–346). For both cases, the simulation requires only a few lines of code.

Monte Carlo estimates of the probabilities in the birthday problem

The birthday-matching problem readily lends itself to Monte Carlo simulation. Represent the birthdays by the integers {1, 2, ..., 365} and do the following:

  1. Choose a random sample of size N (with replacement) from the set of birthdays. That sample represents a "room" of N people.
  2. Determine whether any of the birthdays in the room match. There are N(N-1)/2 pairs of birthdays to check.
  3. Repeat this process for thousands of rooms and compute the proportion of rooms that contain a match. This is a Monte Carlo estimate of the probability of a shared birthday.
  4. If desired, simulate rooms of various sizes, such as N=2, 3, 4,..., 60, and plot the results.

How to simulate the birthday problem in SAS

Before running the full simulation, let's examine a small-scale simulation that simulates only 10 rooms, each containing a random sample of N = 23 people:

proc iml;
call randseed(12345);                 /* set seed for random number stream */
N = 23;                               /* number of people in room */
NumRooms = 10;                        /* number of simulated rooms */
bday = Sample(1:365, NumRooms||N);    /* simulate: each column is a room that contains N birthdays */
*print bday;                          /* optional: print the 23 x 10 matrix of birthdays */
match = N - countunique(bday, "col"); /* number of unique birthdays in each col */
print match[label="Num Matches (N=23)" c=("Rm1":"Rm10")];
/* estimate the probability of a match, based on 10 simulations */
probMatch = (match > 0)[:];          /* mean = proportion of matches */
print probMatch;
Simulate the birthday problem in SAS and estimate the probability of a match

The output shows the number of matches in 10 rooms, each with 23 people. The first room did not contain any people with matching birthdays, nor did rooms 3, 5, 6, 7, and 10. The second room contained one matching birthday, as did rooms 8 and 9. The fourth room contains two shared birthdays. In this small simulation, 4/10 of the rooms contain a shared birthday, so the Monte Carlo estimate of the probability is 0.4. You can compute the estimate by forming the binary indicator variable (match > 0) and computing the mean. (Recall that the mean of a binary vector is the proportion of 1s.)

Notice that the simulation requires only three SAS/IML vectorized statements. The SAMPLE function produces a matrix that has N rows. Each column of the matrix represents the birthdays of N people in a room. The COUNTUNIQUE function returns a row vector that contains the number of unique birthdays in each column. If you subtract that number from N, you obtain the number of shared birthdays. Finally, the estimate is obtained by computing the mean of an indicator variable.

Simulate the birthday problem in SAS

For convenience, the following SAS/IML statements define a function that simulates B rooms, each containing a random sample of N birthdays. The function returns a row vector of size B that contains the number of matching birthdays in each room. The program calls the function in a loop and graphs the results.

start BirthdaySim(N, B);
   bday = Sample(1:365, B||N);           /* each column is a room */
   match = N - countunique(bday, "col"); /* number of unique birthdays in each col */
   return ( match );
maxN = 60;                    /* consider rooms with 2, 3, ..., maxN people */
NumPeople = 2:maxN;
probMatch = j(MaxN, 1, 0);    /* allocate vector for results */
NumRooms = 10000;             /* number of simulated rooms */
do N = 1 to ncol(NumPeople);
   match =  BirthdaySim(NumPeople[N], NumRooms);
   probMatch[N] = (match > 0)[:];
title "The Birthday Matching Problem";
call series(NumPeople, probMatch) grid={x y} 
     label={"Number of People in Room" "Probability of a Shared Birthday"};

The graph is shown at the top of this article. It shows the estimated probability of one or more matching birthdays in a room of size N for N=2, 3, ..., 60. The estimates are based on 10,000 Monte Carlo iterations. (Why did I use 10,000 rooms? See the Appendix.) As mentioned earlier, the probability exceeds 50% when a room contains 23 people. For 30 people, the probability of a match is 71%. If you put 50 people in a room, the chances that two people in the room share a birthday is about 97%.

Allow leap days and a nonuniform distribution of birthdays

The curious reader might wonder how this analysis changes if you account for people born on leap day (February 29th). The answer is that the probabilities of a match decrease very slightly because there are more birthdays, and you can calculate the probability analytically. If you are interested in running a simulation that accounts for leap day, use the following SAS/IML function:

start BirthdaySim(N, B);
   /* A four-year period has 3*365 + 366 = 1461 days. Assuming uniformly distributed 
      birthdays, the probability vector for randomly choosing a birthday is as follows: */
   p = j(366, 1, 4/1461);      /* most birthdays occur 4 times in 4 years */
   p[31+29] = 1/1461;          /* leap day occurs once in 4 years */
   /* sample with replacement using this probability of selection */
   bday = Sample(1:366, B||N, "replace", p);           /* each column is a room */
   match = N - countunique(bday, "col"); /* number of unique birthdays in each col */
   return ( match );

If you run the simulation, you will find that the estimated probability of sharing a birthday among N people does not change much if you include leap-day birthdays. The difference between the estimated probabilities is about 0.003 on average, which is much less than the standard error of the estimates. In practice, you can exclude February 29 without changing the conclusion that shared birthdays are expected to occur.

In a similar way, you can set the probability vector to be the empirical distribution of birthdays in the population. Wicklin (2010, p. 346) reports that the simulation-based estimates are slightly higher than the exact results that you obtain from assuming a uniform distribution. This agrees with theory: A. G. Munford (TAS, 1977) showed that any nonuniform distribution increases the likelihood of a matching birthday. In other words, in the real world, the probability of a shared birthday is slightly higher than in the idealized mathematical world.

Appendix: How many Monte Carlo simulations are necessary?

Why did I use 10,000 Monte Carlo simulations? Notice that the Monte Carlo estimate in this problem is an estimate of a proportion of a binary variable, which means that you can estimate the standard error. If p̂ is the estimate, then a 95% confidence interval has radius 1.96 sqrt(p̂(1-p̂)/B), where B is the number of Monte Carlo simulations (that is, the number of "rooms"). This quantity is largest when p̂ = 1/2, so if you want, say, two digits of accuracy you can choose a value of B which is large enough so that 1.96 sqrt(0.5(1-0.5)/B) ≤ 0.01. Solving the equation for B give B ≥ 10000. If you choose that many Monte Carlo iterations, then the conservative 95% confidence interval has a half-width no greater than 0.01.

This simple calculation shows that if you use B=10,000 for the number of simulated "room," you can expect the Monte Carlo estimates to be within 0.01 of the true probability 95% of the time. The standard error is largest where the curve is steepest (near N=23). You could add error bars to the graph, but the errors would be too small to see.

The post Simulate the birthday-matching problem appeared first on The DO Loop.

1月 292018

What is a random number generator? What are the random-number generators in SAS, and how can you use them to generate random numbers from probability distributions? In SAS 9.4M5, you can use the STREAMINIT function to select from eight random-number generators (RNGs), including five new RNGs. After choosing an RNG, you can use the RAND function to obtain random values from probability distributions.

What is a random-number generator?

A random-number generator usually refers to a deterministic algorithm that generates uniformly distributed random numbers. The algorithm (sometimes called a pseudorandom-number generator, or PRNG) is initialized by an integer, called the seed, which sets the initial state of the PRNG. Each iteration of the algorithm produces a pseudorandom number and advances the internal state. The seed completely determines the sequence of random numbers. Although it is deterministic, a modern PRNG generates extremely long sequences of numbers whose statistical properties are virtually indistinguishable from a truly random uniform process.

An RNG can also refer to a hardware-based device (sometimes called a "truly random RNG") that samples from an entropy source, which is often thermal noise within the silicon of a computer chip. In the Intel implementation, the random entropy source is used to initialize a PRNG in the chip, which generates a short stream of pseudorandom values before resampling from the entropy source and repeating the process. A hardware-based RNG is not deterministic or reproducible.

An RNG generates uniformly distributed numbers. If you request random numbers from a nonuniform probability distribution (such as the normal or exponential distributions), SAS will automatically apply mathematical transformations that convert the uniform numbers into nonuniform random variates.

Using random numbers in the real world

In the real world, random numbers are used as the basis for statistical simulations. Financial analysts use simulations of the economy to evaluate the risk of a portfolio. Insurance companies use simulations of natural disasters to assess the impact on their customers. Government workers use simulations of population growth to predict the infrastructure and services that will be required by future generations. In all fields, data analysts generate random values for statistical sampling, bootstrap and resampling methods, Monte Carlo estimation, and data simulation.

For these techniques to be statistically valid, the values that are produced by statistical software must contain a high degree of randomness. The new random-number generators in SAS provide analysts with fast, high-quality, state-of-the-art random numbers.

New random-number generators in SAS

Prior to SAS 9.4M5, SAS supported the Mersenne twister random-number generator (introduced in SAS 9.1) and a hardware-based RNG (supported on Intel CPUs in SAS 9.4M3). SAS 9.4M5 introduces new random-number generators and new variants of the Mersenne twister, as follows:

  • PCG: A 64-bit permuted congruential generator (O’Neill 2014) with good statistical properties.
  • TF2: A 2x64-bit counter-based RNG that is based on the Threefish encryption function in the Random123 library (Salmon et al. 2011). The generator is also known as the 2x64 Threefry generator.
  • TF4: A 4x64-bit counter-based RNG that is also known as the 4x64 Threefry generator.
  • RDRAND: A hardware-based RNG (Intel, 2014) that is repeatedly reseeded by using random values from thermal noise in the chip. The RNG requires an Intel processor (Ivy Bridge and later) that supports the RDRAND instruction.
  • MT1998: (deprecated) The original 1998 32-bit Mersenne twister algorithm (Matsumoto and Nishimura, 1998). This was the default RNG for the RAND function prior to SAS 9.4M3. For a small number of seed values (those exactly divisible by 8,192), this RNG does not produce a sufficiently random stream of numbers. Consequently, other RNGS are preferable.
  • MTHYBRID: (default) A hybrid method that improves the MT1998 method by using the MT2002 initialization for seeds that are exactly divisible by 8,192. This is the default RNG, beginning with the SAS 9.4M3.
  • MT2002: The 2002 32-bit Mersenne twister algorithm (Matsumoto and Nishimura 2002).
  • MT64: A 64-bit version of the 2002 Mersenne twister algorithm.

Choose a random-number generator in SAS

In SAS, you can use the STREAMINIT subroutine to choose an RNG and to set the seed value. You can then use the RAND function to produce high-quality streams of random numbers for many common distributions. The first argument to the RAND function specifies the name of the distribution. Subsequent arguments specify parameters for the distribution.

If you want to use the default RNG (which is 'MTHYBRID'), you can specify only the seed. For example, the following DATA step initializes the default PRNG with the seed value 12345 and then generates five observations from the uniform distribution on (0,1) and from the standard normal distribution:

data RandMT(drop=i);
   call streaminit(12345);   /* choose default RNG (MTHYBRID) and seed=12345 */
   do i = 1 to 5;
      uMT = rand('Uniform'); /* generate random uniform: U ~ U(0,1)           */
      nMT = rand('Normal');  /* generate random normal: N ~ N(mu=0, sigma=1) */

If you want to choose a different RNG, you can specify the name of the RNG followed by the seed. For example, the following DATA step initializes the PCG method with the seed value 12345 and then generates five observations from the uniform and normal distributions:

data RandPCG(drop=i);
   call streaminit('PCG', 12345); /* SAS 9.4M5: choose PCG method, same seed */
   do i = 1 to 5;
      uPCG = rand('Uniform'); nPCG = rand('Normal'); output;

A hardware-based method is initialized by the entropy source, so you do not need to specify a seed value. For example, the following DATA step initializes the RDRAND method and then generates five observations. The three data sets are then merged and printed:

data RandHardware(drop=i);
   call streaminit('RDRAND');     /* SAS 9.4M3: Hardware method, no seed */
   do i = 1 to 5;
      uHardware = rand('Uniform'); nHardware = rand('Normal'); output;
data All;
   merge RandMT RandPCG RandHardware;
title "Uniform Random Variates for Three RNGs";
proc print data=All noobs; var u:; run;
title "Normal Random Variates for Three RNGs";
proc print data=All noobs; var n:; run;

The output shows that each RNG generates a different sequence of numbers. If you run this program yourself, you will generate the same sequences for the first two columns (the MTHYBRID and PCG streams), but your numbers for the RDRAND RNG will be different. Each column of the first table is an independent random sample from the uniform distribution. Similarly, each column of the second table is an independent random sample from the standard normal distribution.


In SAS 9.4M5, you can choose from eight different random-number generators. You use the STREAMINIT function to select the RNG and use the RAND function to obtain random values. In a future blog post, I will compare the attributes of the different RNGs and discuss the advantages of each. I will also show how to use the new random-number generators to generate random numbers in parallel by running a DS2 program in multiple threads.

The information in this article is taken from my forthcoming SAS Global Forum 2018 paper, "Tips and Techniques for Using the Random-Number Generators in SAS" (Sarle and Wicklin, 2018).


The post How to use random-number generators in SAS appeared first on The DO Loop.

1月 152018

Last week I got the following message:

Dear Rick: How can I create a normal distribution within a specified range (min and max)? I need to simulate a normal distribution that fits within a specified range. I realize that a normal distribution is by definition infinite... Are there any alternatives, such as a distribution that has shape and properties similar to a normal distribution, but for which I can restrict the range? - Carol

Mathematically, this request doesn't make sense, as Carol (not her real name) acknowledges. However, if you are working with a client who is not statistically savvy, you might be asked to do the impossible! How can Carol make sense of this request?

I provide a two-part answer. First I discuss general ways to choose a distribution that models a data sample. You can use these methods when you have access to the data. Then I address the specific question and present five distributions that you can use to model data that looks like a "bounded" normal distribution. These distributions can be used when you don't have access to the data.

How to choose a distribution that matches the data?

In general, statisticians are often asked to choose a model that seems to fit an observed set of data. In simulation studies, the model is used to simulate samples that have the same distributional properties as the real data. In the book Simulating Data with SAS (Chapter 16), I discuss many strategies, including the following:

  • Use domain-specific knowledge to guide you. There might be physical or biological reasons to prefer one probability distribution over another.
  • Sample from the empirical distribution of the data, which is equivalent to bootstrap resampling. For more information, see the article about how to bootstrap in SAS or Chapter 15 of Simulating Data with SAS.
  • Use a well-known “named” parametric distribution to model the data, such as the normal, gamma, and beta distributions. Typically you will fit several candidate distributions to the data to see which fits best. (In SAS, you can use PROC UNIVARIATE, PROC SEVERITY, or PROC NLMIXED to fit distributions.) After you choose a distribution that fits the data, you can draw random samples from the fitted distribution.
  • For univariate data, you can choose a flexible system of distributions such as the Pearson system or the Johnson system. The Johnson system is supported by PROC UNIVARIATE (Wicklin, p. 112–114).
  • Use a graphical tool, called the moment-ratio diagram, to help select candidate distributions for the data. Traditionally the moment-ratio diagram is a theoretical tool for understanding the relationships of families and systems of distributions, but Chapter 16 shows how to use the moment-ratio diagram as a tool to organize simulation studies.

These ideas (and others) are illustrated in the following diagram, which is also from Wicklin (2013, p. 298):

The flowchart shows a few paths that a researcher can follow to model data. Most of the paths are also applicable to multivariate data.

Simulate data from the "eyeball distribution"

Now let's discuss the specific question that I was asked: how to simulate a "bounded" normal distribution. If Carol has access to the data, she could use the techniques in the previous section. Consequently, I assume that Carol does not have access to the data. This can happen in several ways, such as when you are trying to reproduce the results in a published article but the article includes only summary statistics or a graph of the data. If you cannot obtain the original data from the authors, you might be forced to simulate fake data based only on the graph and the summary statistics.

I call this using the "eyeball distribution." For non-native speakers of English, "to eyeball" means to look at or observe something. When used as an adjective, “eyeball” indicates that a quantity was obtained by visual inspection, rather than through formal measurements. Thus an "eyeball distribution" is one that is chosen heuristically because it looks similar to the histogram of the data. (You can extend these ideas to multivariate data.)

From Carol's description, the histogram of her data is symmetric, unimodal, and "bell-shaped." The data are in the interval [a, b]. I can think of several "eyeball distributions" that could model data like this:

  1. If you know the sample mean (m) and standard deviation (s), then draw random samples from N(m, s).
  2. If you know sample quantiles of the data, you can approximate the CDF and use inverse probability sampling for the simulation.
  3. Theory tells us that 99.7% of the normal distribution is within three standard deviations of the mean. For symmetric distributions, the mean equals the midrange m = (a + b)/2. Thus you could use the three-sigma rule to approximate s ≈ (b - a)/6 and sample from N(m, s).
  4. If you have domain knowledge that guarantees that the data are truly bounded on [a, b], you could use the truncated normal distribution, which samples from N(m, s) but discards any random variates that are outside of the interval.
  5. Alternatively, you can sample from a symmetric bounded distribution such as the Beta(α, α) distribution. For example, Beta(5, 5) is approximately bell-shaped on the interval [0,1].

The first option is often the most reasonable choice. Even though the original data are in the interval [a, b], the simulated data do not have to be in the same interval. If you believe that the original data are normally distributed, then simulate from that model and accept that you might obtain simulated data that are beyond the bounds of the original sample. The same fact holds for the third option, which estimates the standard deviation from a histogram.

The following SAS DATA step implements the last three options. The sample distributions are displayed by using a comparative histogram in SAS:

%let N = 500;      /* sample size */
%let min = 5;
%let max = 15;
proc format;
value MethodFmt 1 = "Normal, 3-Sigma Rule"
                2 = "Truncated Normal"
                3 = "Beta(5, 5), Scaled";
data BellSim(keep=x Method
format method MethodFmt.;
call streaminit(54321);  /* set random number seed */
numSD = 3;               /* use 3-sigma rule to estiamte std dev */
mu = (&min + &max)/2;
sigma = (&max - &min)/ (2*numSD);
method = 1;        /* Normal distribution N(mu, sigma) */
   do i = 1 to &N;
      x = rand("Normal", mu, sigma);  output;
method = 2;        /* truncated normal distribution TN(mu, sigma; a, b) */
   i = 0;
   do until (i = &N);
      x = rand("Normal", mu, sigma);
      if &min < x < &max then do;
         i + 1;
method = 3;        /* symmetric beta distribution, scaled into [a, b] */
   alpha = 5;
   do i = 1 to &N;
      x = &min + (&max - &min)*rand("beta", alpha, alpha);
ods graphics / width=480px height=480px;
title "Simulate Bell-Shaped Data on [&min, &max]";
proc sgpanel data=BellSim;
   panelby method / columns=1 novarname;
   histogram x;
   refline &min &max / axis=x;
   colaxis values=(&min to &max) valueshint;

The histograms show a simulated sample of size 500 for each of the three models. The top sample is from the normal distribution. It contains simulated observations that are outside of the interval [a,b]. In many situations, that is fine. The second sample is from the truncated normal distribution on [a,b]. The third is from a Beta(5,5) distribution, which has been scaled onto the interval [a,b].

Although the preceding SAS program answers the question Carol asked, let me emphasize that if you (and Carol) have access to the data, you should use it to choose a model. When the data are unobtainable, you can use an "eyeball distribution," which is basically a guess. However, if your goal is to generate fake data so that you can test a computational method, an eyeball distribution might be sufficient.

The post Data unavailable? Use the "eyeball distribution" to simulate appeared first on The DO Loop.

11月 202017

This article shows how to simulate beta-binomial data in SAS and how to compute the density function (PDF). The beta-binomial distribution is a discrete compound distribution. The "binomial" part of the name means that the discrete random variable X follows a binomial distribution with parameters N (number of trials) and p, but there is a twist: The parameter p is not a constant value but is a random variable that follows the Beta(a, b) distribution.

The beta-binomial distribution is used to model count data where the counts are "almost binomial" but have more variance than can be explained by a binomial model. Therefore this article also compares the binomial and beta-binomial distributions.

Simulate data from the beta-binomial distribution

To generate a random value from the beta-binomial distribution, use a two-step process. The first step is to draw p randomly from the Beta(a, b) distribution. Then you draw x from the binomial distribution Bin(p, N). The beta-binomial distribution is not natively supported by the RAND function SAS, but you can call the RAND function twice to simulate beta-binomial data, as follows:

/* simulate a random sample from the beta-binomial distribution */
%let SampleSize = 1000;
data BetaBin;                         
a = 6; b = 4; nTrials = 10;           /* parameters */
call streaminit(4321);
do i = 1 to &SampleSize;
   p = rand("Beta", a, b);            /* p[i] ~ Beta(a,b)  */
   x = rand("Binomial", p, nTrials);  /* x[i] ~ Bin(p[i], nTrials) */
keep x;

The result of the simulation is shown in the following bar chart. The expected values are overlaid. The next section shows how to compute the expected values.

Beta-binomial distribution and expected values in SAS

The PDF of the beta-binomial distribution

The Wikipedia article about the beta-binomial distribution contains a formula for the PDF of the distribution. Since the distribution is discrete, some references prefer to use "PMF" (probability mass function) instead of PDF. Regardless, if X is a random variable that follows the beta-binomial distribution then the probability that X=x is given by
PDF of the beta-binomial distribution
where B is the complete beta function.

The binomial coefficients ("N choose x") and the beta function are defined in terms of factorials and gamma functions, which get big fast. For numerical computations, it is usually more stable to compute the log-transform of the quantities and then exponentiate the result. The following DATA step computes the PDF of the beta-binomial distribution. For easy comparison with the distribution of the simulated data, the DATA step also computes the expected count for each value in a random sample of size N. The PDF and the simulated data are merged and plotted on the same graph by using the VBARBASIC statement in SAS 9.4M3. The graph was shown in the previous section.

data PDFBetaBinom;           /* PMF function */
a = 6; b = 4; nTrials = 10;  /* parameters */
do x = 0 to nTrials;
   logPMF = lcomb(nTrials, x) +
            logbeta(x + a, nTrials - x + b) -
            logbeta(a, b);
   PMF = exp(logPMF);        /* probability that X=x */
   EX = &SampleSize * PMF;   /* expected value in random sample */
keep x PMF EX;
/* Merge simulated data and PMF. Overlay PMF on data distribution. */
data All;
merge BetaBin PDFBetaBinom(rename=(x=t));
title "The Beta-Binomial Distribution";
title2 "Sample Size = &SampleSize";
proc sgplot data=All;
   vbarbasic x / barwidth=1 legendlabel='Simulated Sample';      /* requires SAS 9.4M3 */
   scatter x=t y=EX /  legendlabel='Expected Value'
      markerattrs=GraphDataDefault(symbol=CIRCLEFILLED size=10);
   inset "nTrials = 10"  "a = 6"  "b = 4" / position=topleft border;
   yaxis grid;
   xaxis label="x" integer type=linear;             /* force TYPE=LINEAR */

Compare the binomial and beta-binomial distributions

One application of the beta-binomial distribution is to model count data that are approximately binomial but have more variance ("thicker tails") than the binomial model predicts. The expected value of a Beta(a, b) distribution is a/(a + b), so let's compare the beta-binomial distribution to the binomial distribution with p = a/(a + b).

The following graph overlays the two PDFs for a = 6, b = 4, and nTrials = 10. The blue distribution is the binomial distribution with p = 6/(6 + 4) = 0.6. The pink distribution is the beta-binomial. You can see that the beta-binomial distribution has a shorter peak and thicker tails than the corresponding binomial distribution. The expected value for both distributions is 6, but the variance of the beta-binomial distribution is greater. Thus you can use the beta-binomial distribution as an alternative to the binomial distribution when the data exhibit greater variance than expected under the binomial model (a phenomenon known as overdispersion).

Compare the binomial and beta-binomial distributions with the same mean. The beta-binomial distribution has greater variance than the binomial.


The beta-binomial distribution is an example of a compound distribution. You can simulate data from a compound distribution by randomly drawing the parameters from some distribution and then using those random parameters to draw the data. For the beta-binomial distribution, the probability parameter p is drawn from a beta distribution and then used to draw x from a binomial distribution where the probability of success is the value of p. You can use the beta-binomial distribution to model data that have greater variance than expected under the binomial model.

The post Simulate data from the beta-binomial distribution in SAS appeared first on The DO Loop.

10月 112017

The article "Fisher's transformation of the correlation coefficient" featured a Monte Carlo simulation that generated sample correlations from bivariate normal data. The simulation used three steps:

  1. Simulate B samples of size N from a bivariate normal distribution with correlation ρ.
  2. Use PROC CORR to compute the sample correlation matrix for each of the B samples.
  3. Use the DATA step to extract the off-diagonal elements from the correlation matrices.

After the three steps, you obtain a distribution of B sample correlation coefficients that approximates the sampling distribution of the Pearson correlation coefficient for bivariate normal data.

There is a simpler way to simulate the correlation estimates: You can directly simulate from the Wishart distribution. Each draw from the Wishart distribution is a sample covariance matrix for a multivariate normal sample of size N. If you convert that covariance matrix to a correlation matrix, you can immediately extract the off-diagonal elements, as shown in the following SAS/IML statements:

%let rho = 0.8;           /* correlation for bivariate normal distribution */
%let N = 20;              /* sample size */
%let NumSamples = 2500;   /* number of simulated samples */
/* generate sample correlation coefficients by using Wishart distribution */
proc iml;
call randseed(12345);
NumSamples = &NumSamples;
DF = &N - 1;              /* X ~ N obs from MVN(0, Sigma) */
Sigma = {1     &rho,      /* covariance for MVN samples */
         &rho   1  };
S = RandWishart(NumSamples, DF, Sigma); /* each row is 2x2 matrix */
Corr = j(NumSamples, 1);  /* allocate vector for correlation estimates */
do i = 1 to nrow(S);      /* convert to correlation; extract off-diagonal */
   Corr[i] = cov2corr( shape(S[i,], 2, 2) )[1,2];

You can create a comparative histogram of the sample correlation coefficients. In the following graph, the histogram at the top of the panel displays the distribution of the simulated correlation coefficients from the three-step method. The bottom histogram displays the distribution of correlations coefficients that are generated from the Wishart distribution.

Visually, the histograms appear to be similar. You can use PROC NPAR1WAY to run various hypothesis tests that compare the distributions; all tests support the hypothesis that these two distributions are equivalent.

If you'd like to see the complete analysis, you can download the SAS program that runs both simulations and compares the resulting distributions.

Although the Wishart distribution is more efficient for this simulation, recall that the Wishart distribution assumes that the underlying data distribution is multivariate normal. In contrast, the three-step simulation is more general. It can be used to generate correlation coefficients for any data distribution. So although the three-step simulation is not necessary for multivariate normal data, it is still an important technique to store in your simulation toolbox.

The post Simulate correlations by using the Wishart distribution appeared first on The DO Loop.

9月 272017

In a large simulation study, it can be convenient to have a "control file" that contains the parameters for the study. My recent article about how to simulate multivariate normal clusters demonstrates a simple example of this technique. The simulation in that article uses an input data set that contains the parameters (mean, standard deviations, and correlations) for the simulation. A SAS procedure (PROC SIMNORMAL) simulates data based on the parameters in the input data set.

This is a powerful paradigm. Instead of hard-coding the parameters in the program (or as macro variables), the parameters are stored in a data set that is processed by the program. This is sometimes called data-driven programming. (Some people call it dynamic programming, but there is an optimization technique of the same name so I will use the term "data-driven.") In a data-driven program, when you want to run the program with new parameters, you merely modify the data set that contains the control parameters.

I have previously written about a different way to control a batch program by passing in parameters on the command line when you invoke the SAS program.

Static programming and hard-coded parameters

Before looking at data-driven programming, let's review the static approach. I will simulate clusters of univariate normal data as an example.

Suppose that you want to simulate normal data for three different groups. Each group has its own sample size (N), mean, and standard deviation. In my book Simulating Data with SAS (p. 206), I show how to simulate this sort of ANOVA design by using arrays, as follows.

/* Static simulation: Parameters embedded in the simulation program */
data AnovaStatic;
/* define parameters for three simulated group */
array N[3]       _temporary_ (50,   50,   50);   /* sample sizes */
array Mean[3]    _temporary_ (14.6, 42.6, 55.5); /* center for each group */
array StdDev[3]  _temporary_ ( 1.7,  4.7,  5.5); /* spread for each group */
call streaminit(12345);
do k = 1 to dim(N);              /* for each group */
   do i = 1 to N[k];             /* simulate N[k] observations */
      x = rand("Normal", Mean[k], StdDev[k]); /* from k_th normal distribution */

The DATA step contains two loops, one for the groups and the other for the observations within each group. The parameters for each group are stored in arrays. Notice that if you want to change the parameters (including the number of groups), you need to edit the program. I call this method "static programming" because the behavior of the program is determined at the time that the program is written. This is a perfectly acceptable method for most applications. It has the advantage that you know exactly what the program will do by looking at the program.

Data-driven programming: Put parameters in a file

An alternative is to put the parameters for each group into a file or data set. If the k_th row in the data set contains the parameters for the k_th group, then the implicit loop in the DATA step will iterate over all groups, regardless of the number of groups. The following DATA step creates the parameters for three groups, which are read and processed by the second DATA step. The parameter values are the same as for the static example, but are transposed and processed row-by-row instead of via arrays:

/* Data-driven simulation: Parameters in a data set, processed by the simulation program */
data params;                     /* define parameters for each simulated group */
input N Mean StdDev;
50 14.6 1.7
50 42.6 4.7
50 55.5 5.5
data AnovaDynamic;
call streaminit(12345);
set params;                      /* implicit loop over groups k=1,2,... */
do i = 1 to N;                   /* simulate N[k] observations */
   x = rand("Normal", Mean, StdDev); /* from k_th normal distribution */

Notice the difference between the static and dynamic techniques. The static technique simulates data from three groups whose parameters are specified in temporary arrays. The dynamic technique simulates data from an arbitrary number of groups. Currently, the PARAMS data specifies three groups, but if I change the PARAMS data set to represent 10 or 1000 groups, the AnovaDynamic DATA step will simulate data from the new design without any modification.

Generate the parameters from real data

The data-driven technique is useful when the parameters are themselves the results of an analysis. For example, a common simulation technique is to generate the moments of real data (mean, variance, skewness, and so forth) and to use those statistics in place of the population parameters that they estimate. (See Chapter 16, "Moment Matching," in Simulating Statistics with SAS.)

The following call to PROC MEANS generates the sample mean and standard deviation for real data and writes those values to a data set:

proc means data=sashelp.iris N Mean StdDev stackods;
   class Species;
   var PetalLength;
   ods output Summary=params;

The output data set from PROC MEANS creates a PARAMS data set that contains the variables (N, MEAN, and STDDEV) that are read by the simulation program. Therefore, you can immediately run the AnovaDynamic DATA step to simulate normal data from the sample statistics. A visualization of the resulting simulated data is shown below.

You can run PROC MEANS on other data and other variables and the AnovaDynamic step will continue to work without any modification. The simulation is controlled entirely by the values in the "control file," which is the PARAMS data set.

You can generalize this technique by wrapping the program in a SAS macro in which the name of the parameter file and the name of the simulated data set are provided at run time. With a macro implementation, you can read from multiple input files and write to multiple output data sets. You could use such a macro, for example, to break up a large simulation study into smaller independent sub-simulations, each controlled by its own file of input parameters. In a gridded environment, each sub-simulation can be processed independently and in parallel, thus reducing the total time required to complete the study.

Although this article discusses control files in the context of statistical simulation, other applications are possible. Have you used a similar technique to control a program by using an input file that contains the parameters for the program? Leave a comment.

The post Data-driven simulation appeared first on The DO Loop.

9月 252017

My article about Fisher's transformation of the Pearson correlation contained a simulation. The simulation uses the RANDNORMAL function in SAS/IML software to simulate multivariate normal data. If you are a SAS programmer who does not have access to SAS/IML software, you can use the SIMNORMAL procedure in SAS/STAT software to simulate data from a multivariate normal distribution.

The 'TYPE' of a SAS data set

Most SAS procedures read and analyze raw data. However, some SAS procedures read and write special data sets that represent a statistical summary of data. PROC SIMNORMAL can read a TYPE=CORR or TYPE=COV data set. Usually, these special data sets are created as an output data set from another procedure. For example, the following SAS statements compute the correlation between four variables from a sample of 50 Iris versicolor flowers:

proc corr data=sashelp.iris(where=(Species="Versicolor"))  /* input raw data */
          nomiss noprint outp=OutCorr;                     /* output statistics */
var PetalLength PetalWidth SepalLength SepalWidth;
proc print data=OutCorr; run;
SAS TYPE=CORR data set

The output data set contains summary statistics including the mean, standard deviations, and correlation matrix for the four variables in the analysis. PROC PRINT does not display the 'TYPE' attribute of this data set, but if you run PROC CONTENTS you will see a field labeled "Data Set Type," which has the value "CORR".

You can also create a TYPE=CORR or TYPE=COV data set by using the DATA step as shown in the documentation for PROC SIMNORMAL.

Use PROC SIMNORMAL to generate multivariate normal data

Recall that you can use the standard deviations and correlations to construct a covariance matrix. When you call PROC SIMNORMAL, it internally constructs the covariance matrix from the information in the OutCorr data set and use the mean and covariance matrix to simulate multivariate normal data. The following call to PROC SIMNORMAL simulates 50 observations from a multivariate normal population. The DATA step combines the original and simulated data; the call to PROC SGSCATTER overlays the original and the simulated samples. Click to enlarge the graph.

proc simnormal data=OutCorr outsim=SimMVN
               numreal = 50           /* number of realizations = size of sample */
               seed = 12345;          /* random number seed */
   var PetalLength PetalWidth SepalLength SepalWidth;
/* combine the original data and the simulated data */
data Both;
set sashelp.iris(where=(Species="Versicolor")) /* original */
    SimMVN(in=sim);                            /* simulated */
Simulated = sim;
ods graphics / attrpriority=none;   /* use different markers for each group */
title "Overlay of Original and Simulated MVN Data";
proc sgscatter data=Both;
   matrix PetalLength PetalWidth SepalLength SepalWidth / group=Simulated;
ods graphics / attrpriority=none;   /* reset markers */
Overlay of original and simulated four-dimensional data

Notice that the original data are rounded whereas the simulated data are not. Except for that minor difference, the simulated data appear to be similar to the original data. Of course, the simulated data will not match unless the original data is approximately multivariate normal.

Simulate many samples from a multivariate normal distribution

The SIMNORMAL procedure supports the NUMREAL= option, which you can use to specify the size of the simulated sample. (NUMREAL stands for "number of realizations," which is the number of independent draws.) You can use this option to generate multiple samples from the same multivariate normal population. For example, suppose you are conducting a Monte Carlo study and you want to generate 100 samples of size N=50, each drawn from the same multivariate normal population. This is equivalent to drawing 50*100 observations where the first 50 observations represent the first sample, the next 50 observations represent the second sample, and so on. The following statements generate 50*100 observations and then construct an ID variable that identifies each sample:

%let N = 50;            /* sample size */
%let NumSamples = 100;  /* number of samples */
proc simnormal data=OutCorr outsim=SimMVN
               numreal = %sysevalf(&N*&NumSamples) 
               seed = 12345;          /* random number seed */
   var PetalLength PetalWidth SepalLength SepalWidth;
data SimMVNAll;
set SimMVN;
ID = floor((_N_-1) / &N) + 1;   /* ID = 1,1,...,1, 2,2,...,2, etc */

After adding the ID variable, you can efficiently analyze all samples by using a single call to a procedure. The procedure should use a BY statement to analyze each sample. For example, you could use PROC CORR with a BY ID statement to obtain a Monte Carlo estimate of the sampling distribution of the correlation for multivariate normal data.

In summary, although the SAS/IML language is the best tool for general multivariate simulation tasks, you can use the SIMNORMAL procedure in SAS/STAT software to simulate multivariate normal data. The key is to construct a TYPE=CORR or TYPE=COV data set, which is then processed by PROC SIMNORMAL.

The post Simulate multivariate normal data in SAS by using PROC SIMNORMAL appeared first on The DO Loop.

9月 132017
Simulate clusters of multivariate normal observations in SAS by using a Gaussian mixture distribution

This article shows how to simulate data from a mixture of multivariate normal distributions, which is also called a Gaussian mixture. You can use this simulation to generate clustered data. The adjacent graph shows three clusters, each simulated from a four-dimensional normal distribution. Each cluster has its own within-cluster covariance, which controls the spread of the cluster and the amount overlap between clusters.

This article is based on an example in Simulating Data with SAS (Wicklin, 2013, p. 138). Previous articles have explained how to simulate from a multivariate normal distribution and how to generate a random sample from a univariate mixture distribution.

A review of mixture distributions

The graph at the top of this article shows 100 random observation from a Gaussian mixture distribution. A mixture distribution consists of K component distributions and a set of mixing probabilities that determine the probability that a random observation belongs to each component. For example, if π = {0.35, 0.5, 0.15} is a vector of mixing probabilities, then in large random samples about 35% of the observations are drawn from the first component, about 50% from the second component, and about 15% from the third component.

For a mixture of Gaussian components, you need to specify the mean vector and the covariance matrix for each component. For this example, the means and covariances are approximately the estimates from Fisher's famous iris data set, so the scatter plot matrix might look familiar to statisticians who have previously analyzed the iris data. The means of the three component distributions are μ1 = {50, 34, 15, 2}, μ2 = {59, 28, 43, 13}, and μ3 = {66, 30, 56, 20}. The covariance matrices for the component distributions are shown later.

Simulate a Gaussian mixture in SAS

The SAS/IML language is the easiest way to simulate multivariate data in SAS. To simulate from a mixture of K Gaussian distributions, do the following:

  1. Generate a random draw from the multinomial distribution with probability vector π. This gives the number of observations to sample from each component.
  2. For each component, simulate from a multivariate normal distribution.

A technical issue is how to pass the mean vectors and covariance matrices into a module that simulates the data. If you are using SAS/IML 14.2 or beyond, you can use a list of lists (Wicklin, 2017, p. 12–14). For compatibility with older versions of SAS, the implementation in this article uses a different technique: each mean and covariance are stored as rows of a matrix. Because the covariance matrices are symmetric, Wicklin (2013) stores them in lower-triangular form. For simplicity, this article stores each 4 x 4 covariance matrix as a row in a 3 x 16 matrix.

proc iml;
/* Simulate from mixture of K mulitvariate normal distributions in dimension d.  
   OUTPUT variables:
   X  : a SampleSize x d matrix of simulated observations
   ID : a column vector that contains the component from which each obs is drawn
   INPUT variables:
   pi  : 1 x K vector of mixing probabilities, sum(pi)=1
   mu  : K x d matrix whose i_th row contains the mean vector for the i_th component
   Cov : K x (d**2) matrix whose i_th row contains the covariance matrix for the i_th component
start SimMVNMixture(X, ID,                     /* output arguments */
                    SampleSize, pi, mu, Cov);  /* input arguments  */
   K = ncol(pi);                               /* number of components */
   d = ncol(mu);                               /* number of variables */
   X = j(SampleSize, d);                       /* output: each row is observation */
   ID = j(SampleSize, 1);                      /* ID variable */
   N = RandMultinomial(1, SampleSize, pi);     /* vector of samples sizes for components */
   b = 1;                                      /* b = beginning index for group i */
   do i = 1 to K;
      e = b + N[i] - 1;                        /* e = ending index for group i */
      ID[b:e] = i;                             /* set ID variable */
      c = shape(Cov[i,], d, d);                /* reshape Cov to square matrix */
      X[b:e, ] = RandNormal(N[i], mu[i,], c);  /* simulate i_th MVN sample */
      b = e + 1;                               /* next group starts at this index */

The SimMVNMixture routine allocates a data matrix (X) that is large enough to hold the results. It generates a vector N ={N1, N2,..., NK} to determine the number of observations that will be drawn from each component and calls the RANDNORMAL function to simulate from each Gaussian component. The scalar values b and e keep track of the beginning and ending rows of each sample from each component.

After the module is defined, you can call it for a specific set of parameters. Assume you want to generate K=3 clusters of four-dimensional data (d=4). The following statements specify the mixing probabilities for a three-component model. The mu matrix is a K x d matrix whose rows are the mean vectors for the components. The Cov matrix is a K x (d**2) matrix whose rows are the covariance matrices for the components. The following statements generate a total of 100 observations from the three-component mixture distribution:

/* specify input args; means/cov correspond to sashelp.iris data for species */
pi = {0.35 0.5 0.15};                   /* mixing probs for K=3 groups */
mu = {50 34 15  2 ,                            /* means of Group 1     */
      59 28 43 13 ,                            /* means of Group 2     */
      66 30 56 20 };                           /* means of Group 3     */
/* specify within-group covariances */
Cov = {12 10  2 1  10 14 1 1   2 1  3 1  1 1 1 1 ,   /* cov of Group 1 */
       27  9 18 6   9 10 8 4  18 8 22 7  6 4 7 4 ,   /* cov of Group 2 */
       40  9 30 5   9 10 7 5  30 7 30 5  5 5 5 8 };  /* cov of Group 3 */
/* run the simulation to generate 100 observations */
call randseed(12345);
run SimMVNMixture(X, Group, 100, pi, mu, Cov);

The call to the SimMVNMixture routine returned the simulated random sample in X, which is a 100 x d matrix. The module also returns an ID vector that identifies the component from which each observation was drawn. You can visualize the random sample by writing the data to a SAS data set and using the SGSCATTER procedure to create a paneled scatter plot, as follows:

/* save simulated data to data set */
Y = Group || X;
create Clusters from Y[c=("Group" || varNames)];  append from Y;  close;
ods graphics / attrpriority=none;
title "Multivariate Normal Components";
title2 "(*ESC*){unicode pi} = {0.35, 0.5, 0.15}";
proc sgscatter data=Clusters;
   compare y=(x1 x2) x=(x3 x4) / group=Group markerattrs=(Size=12);

The graph is shown at the top of this article.

Although each component in this example is multivariate normal, the same technique will work for any component distributions. For example, one cluster could be multivariate normal, another multivariate t, and a third multivariate uniform.

In summary, you can create a function module in the SAS/IML language to simulate data from a mixture of Gaussian components. The RandMutinomial function returns a random vector that determines the number of observations to draw from each component. The RandNormal function generates the observations. A technical detail involves how to pass the parameters to the module. This implementation packs the parameters into matrices, but other options, such as a list of lists, are possible.

The post Simulate multivariate clusters in SAS appeared first on The DO Loop.

7月 262017

A classical problem in elementary probability asks for the expected lengths of line segments that result from randomly selecting k points along a segment of unit length. It is both fun and instructive to simulate such problems. This article uses simulation in the SAS/IML language to estimate solutions to the following problems:

  • Randomly choose a point at random in the interval (0, 1). The point divides the interval into two segments of length x and 1-x. What is the expected length of the larger (smaller) segment?
  • Broken stick problem: What is the probability that three randomly chosen points will break a segment into a triangle?
  • Randomly choose k points at random in the interval (0, 1). The points divide the interval into k+1 segments. What is the expected length of the largest (smallest) segment?
  • When k=2, the points divide the interval into three segments. What is the probability that the three segments can form a triangle? This is called the broken-stick problem and is illustrated in the figure to the right.

You can find a discussion and solution to these problems on many websites, but I like the website, which includes proofs and interactive Java applets.

Simulate a solution in SAS

You can simulate these problems in SAS by writing a DATA step or a SAS/IML program. I discuss the DATA step at the end of this article. The body of this article presents a SAS/IML simulation and constructed helper modules that solve the general problem. The simulation will do the following:

  1. Generate k points uniformly at random in the interval (0, 1). For convenience, sort the points in increasing order.
  2. Compute the lengths of the k+1 segments.
  3. Find the length of the largest and smallest segments.

In many languages (including the SAS DATA step), you would write a loop that performs these operations for each random sample. You would then estimate the expected length by computing the mean value of the largest segment for each sample. However, in the SAS/IML language, you can use matrices instead of using a loop. Each sample of random points can be held in the column of a matrix. The lengths of the segments can also be held in a matrix. The largest segment for each trial is stored in a row vector.

The following SAS/IML modules help solve the general simulation problem for k random points. Because the points are ordered, the lengths of the segments are the differences between adjacent rows. You can use the DIF function for this computation, but the following program uses the DifOp module to construct a small difference operator, and it uses matrix multiplication to compute the differences.

proc iml;
/* Independently sort column in a matrix.
   See */
start SortCols(A);
   do i = 1 to ncol(A);
      v = A[ ,i];  call sort(v);  A[ ,i] = v; /* get i_th col and sort it */
/* Generate a random (k x NSim) matrix of points, then sort each column. */
start GenPts(k, NSim);
   x = j(k, NSim);               /* allocate k x NSim matrix */
   call randgen(x, "Uniform");   /* fill with random uniform in (0,1) */
   if k > 1 then run SortCols(x);  
   return x;
/* Return matrix for difference operator.
   See */
start DifOp(dim);
   D = j(dim-1, dim, 0);         /* allocate zero matrix */
   n = nrow(D); m = ncol(D);
   D[do(1,n*m, m+1)] = -1;       /* assign -1 to diagonal elements */
   D[do(2,n*m, m+1)] = 1;        /* assign +1 to super-diagonal elements */
   return D;
/* Find lengths of segments formed by k points in the columns of x.
   Assume each column of x is sorted and all points are in (0,1). */
start SegLengths(x);   
   /* append 0 and 1 to top and bottom (respectively) of each column */
   P = j(1, ncol(x), 0) // x // j(1, ncol(x), 1);
   D = DifOp(nrow(P));           /* construct difference operator */
   return D*P;                   /* use difference operator to find lengths */
P = {0.1  0.2  0.3,
     0.3  0.8  0.5,
     0.7  0.9  0.8 };
L = SegLengths(P);
print L[label="Length (k=3)"];
Lengths of segments formed by random points in the unit interval

The table shows the lengths of three different sets of points for k=3. The first column of P corresponds to points at locations {0.1, 0.3, 0.7}. These three points divide the interval [0, 1] into four segments of lengths 0.1, 0.2, 0.4, and 0.3. Similar computations hold for the other columns.

The expected length of the longer of two segments

For k=1, the problem generates a random point in (0, 1) and asks for the expected length of the longer segment. Obviously the expected length is greater than 1/2, and you can read the Cut-The-Knot website to find a proof that shows that the expected length is 3/4 = 0.75.

The following SAS/IML statements generate one million random points and compute the larger of the segment lengths. The average value of the larger segments is computed and is very close to the expected value:

call randseed(54321);
k = 1;  NSim = 1E6;
x = GenPts(k, NSim);             /* simulations of 1 point dropped onto (0,1) */
L = SegLengths(x);               /* lengths of  segments */
Largest = L[<>, ];               /* max length among the segments */
mean = mean(Largest`);           /* average of the max lengths */
print mean;
Estimate for expected length of longest segment

You might not be familiar with the SAS/IML max subscript operator (<>) and the min subscript operator (><). These operators compute the minimum or maximum values for each row or column in a matrix.

The expected length of the longest of three segments

For k=2, the problem generates two random points in (0, 1) and asks for the expected length of the longest segment. You can also ask for the average shortest length. The Cut-The-Knot website shows that the expected length for the longest segment is 11/18 = 0.611, whereas the expected length of the shortest segment is 2/18 = 0.111.

The following SAS/IML statements simulate choosing two random points on one million unit intervals. The program computes the one million lengths for the resulting longest and shortest segments. Again, the average values of the segments are very close to the expected values:

k = 2;  NSim = 1E6;
x = GenPts(k, NSim);             /* simulations of 2 points dropped onto (0,1) */
L = SegLengths(x);               /* lengths of segments */
maxL = L[<>, ];                  /* max length among the segments */
meanMax = mean(maxL`);           /* average of the max lengths */
minL = L[><, ];                  /* min length among the segments */
meanMin = mean(minL`);           /* average of the max lengths */
print meanMin meanMax;
Estimates for expected lengths of the shorted and longest segments formed by two random points in the unit interval

The broken stick problem

You can use the previous simulation to estimate the broken stick probability. Recall that three line segments can form a triangle provided that they satisfy the triangle inequality: the sum of the two smaller lengths must be greater than the third length. If you randomly choose two points in (0,1), the probability that the resulting three segments can form a triangle is 1/4, which is smaller than what most people would guess.

The vectors maxL and minL each contain one million lengths, so it is trivial to compute the vector of that contains the third lengths.

/* what proportion of randomly broken sticks form triangles? */
medL = 1 - maxL - minL;          /* compute middle length */
isTriangle = (maxL <= minL + medL); /* do lengths satisfy triangle inequality? */
prop = mean(isTriangle`);        /* proportion of segments that form a triangle */
print prop;
Estimate for the probability that three random segments form a triangle

As expected, about 0.25 of the simulations resulted in segments that satisfy the triangle inequality.

In conclusion, this article shows how to use the SAS/IML language to solve several classical problems in probability. By using matrices, you can run the simulation by using vectorized computations such as matrix multiplication finding the minimum or maximum values of columns. (However, I had to use a loop to sort the points. Bummer!)

If you want to try this simulation yourself in the DATA step, I suggest that you transpose the SAS/IML setup. Use arrays to hold the random points and use the CALL SORTN subroutine to sort the points. Use the LARGEST function and the SMALLEST function to compute the largest and smallest elements in an array. Feel free to post your solution (and any other thoughts) in the comments.

The post Random segments and broken sticks appeared first on The DO Loop.