
April 16, 2018
 

A colleague and I recently discussed how to generate random permutations without encountering duplicates. Given a set of n items, there are n! permutations. My colleague wants to generate k unique permutations at random from among the total of n!. Said differently, he wants to sample without replacement from the set of all possible permutations.

In SAS, you can use the RANPERM function to generate random permutations. For example, the statement p = ranperm(4) generates a random permutation of the elements in the set {1, 2, 3, 4}. However, the function does not ensure that the permutations it generates will be unique.

I can think of two ways to generate random permutations without replacement. The first is to generate all permutations and then choose a random subset of size k without replacement from the set of n! possibilities. This technique is efficient when n is small or when k is a substantial proportion of n!. The second technique is to randomly generate permutations until you get a total of k that are unique. This technique is efficient when n is large and k is small relative to n!.

Generate all permutations, then sample without replacement

For small sets (say, n ≤ 8), an efficient way to generate random permutations is to generate all permutations and then extract a random sample without replacement. In the SAS/IML language you can use the ALLPERM function to generate all permutations of a set of n items, where n ≤ 18. The ALLPERM function returns a matrix that has n! rows and n columns. Each row is a permutation. You can then use the SAMPLE function to sample without replacement from among the rows, as follows:

proc iml;
call randseed(123);
n = 4;                       /* permutations of n items */
k = 3;                       /* want k unique permutations */
 
p = allperm(n);               /* matrix has n! rows; each row is a permutation */
rows = sample(1:nrow(p), k, "WOR"); /* random rows w/o replacement */
ranPermWOR = p[rows, ];       /* extract rows */
print ranPermWOR;

Generate random permutations, then check for uniqueness

The matrix of all permutations has n! rows, which gets big fast. If you want only a small number of permutations from among a huge set of possible permutations, it is more efficient to use the RANPERM function to generate permutations, then discard duplicates. Last week I showed how to eliminate duplicate rows from a numeric matrix so that the remaining rows are unique. The following SAS/IML statements use the UniqueRows function, which is defined in the previous post. You must define or load the module before you can use it.

/* <define or load the DupRows and UniqueRows modules HERE> */
 
n = 10;                      /* permutations of n items */
k = 5;                       /* want k unique permutations; k << n! */
p = ranperm(n, 2*k);         /* 2x more permutations than necessary */
U = UniqueRows(p);           /* get unique rows of 2k x n matrix */
if nrow(U) >= k then U = U[1:k, ];  /* extract the first k rows */
else do;
   U = UniqueRows( U // ranperm(n, 10*k) ); /* generate more... repeat as necessary */
   if nrow(U) >= k then U = U[1:k, ];  /* extract the first k rows */
end;
print U;

Notice in the previous statements that the call to RANPERM requests 2*k random permutations, even though we only want k. You should ask for more permutations than you need because some of them might be duplicates. The factor of 2 is ad hoc; it is not based on any statistical consideration.

If k is much smaller than n!, then you might think that the probability of generating a duplicate permutation is small. However, the Birthday Problem teaches us that duplicates arise much more frequently than we might expect, so it is best to expect some duplicates and generate more permutations than necessary. When k is moderately large relative to n!, you might want to use a factor of 5, 10, or even 100. I have not tried to compute the number of permutations that would generate k unique permutations with 95% confidence, but that would be a fun exercise. In my program, if the initial attempt generates fewer than k unique rows, it generates an additional 10*k permutations. You could repeat this process, if necessary.
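
If you are curious how likely duplicates are, you can estimate the probability that m random permutations contain at least k unique ones by using a short simulation. The following sketch is one way to do it; the number of trials (B) and the trick of encoding each permutation as a string (so that COUNTUNIQUE can count distinct rows) are my own choices, not part of the original program:

proc iml;
call randseed(123);
n = 10;  k = 5;
m = 2*k;                           /* candidate number of draws */
B = 1000;                          /* number of Monte Carlo trials */
numUnique = j(B, 1, .);
do t = 1 to B;
   p = ranperm(n, m);              /* m random permutations of n items */
   s = rowcat(char(p, 3));         /* encode each row as a string */
   numUnique[t] = countunique(s);  /* count distinct permutations */
end;
probOK = (numUnique >= k)[:];      /* est. P(at least k unique among m draws) */
print probOK;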

In summary, if you want to generate random permutations without any duplicates, you have at least two choices. You can generate all permutations and then extract a random sample of k rows. This method works well for small values of n, such as n ≤ 8. For larger values of n, you might want to generate random permutations (more than k) and use the first k unique rows of the matrix of permutations.

The post Random permutations without duplicates appeared first on The DO Loop.

March 26, 2018
 
Zipper plot for normally distributed data

Simulation studies are used for many purposes, one of which is to examine how distributional assumptions affect the coverage probability of a confidence interval. This article describes the "zipper plot," which enables you to compare the coverage probability of a confidence interval when the data do or do not follow certain distributional assumptions.

I saw the zipper plot in a tutorial about how to design, implement, and report the results of simulation studies (Morris, White, Crowther, 2017). An example of a zipper plot is shown to the right. The main difference between the zipper plot and the usual plot of confidence intervals is that the zipper plot sorts the intervals by their associated p-values, which causes the graph to look like a garment's zipper. The zipper plot in this blog post is slightly different from the one that is presented in Morris et al. (2017, p. 22), as discussed at the end of this article.

Coverage probabilities for confidence intervals

A formula for a confidence interval (CI) is based on an assumption about the distribution of the parameter estimate for a simple random sample. For example, the two-sided 95% confidence interval for the mean of normally distributed data has upper and lower limits given by the formula
x̄ ± t* s / sqrt(n)
where x̄ is the sample mean, s is the sample standard deviation, n is the sample size, and t* is the 97.5th percentile of the t distribution with n – 1 degrees of freedom. You can confirm that the formula is a 95% CI by simulating many N(0,1) samples and computing the CI for each sample. About 95% of the intervals should contain the population mean, which is 0.

The central limit theorem indicates that the formula is asymptotically valid for sufficiently large samples of nonnormal data. However, for small nonnormal samples, the true coverage probability of those intervals might be different from 95%. Again, you can estimate the coverage probability by simulating many samples, constructing the intervals, and counting how many times the mean of the distribution is inside the confidence interval. A previous article shows how to use simulation to estimate the coverage probability of this formula.
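
As a concrete illustration, the following SAS/IML program is a minimal sketch of the normal-data case. It estimates the coverage probability of the t-based interval from 10,000 samples of size n = 50; the seed and sample counts are arbitrary choices:

proc iml;
call randseed(1);
n = 50;  B = 10000;                     /* sample size; number of samples */
x = j(B, n);
call randgen(x, "Normal");              /* each row is a sample from N(0,1) */
xbar = x[ , :];                         /* sample mean of each row */
s = sqrt( (x - xbar)[ , ##] / (n-1) );  /* sample std dev of each row */
tStar = quantile("t", 0.975, n-1);      /* critical value */
covered = (xbar - tStar*s/sqrt(n) < 0) & (0 < xbar + tStar*s/sqrt(n));
coverage = covered[:];                  /* proportion of CIs that contain 0 */
print coverage;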

Zipper plots and simulation studies of confidence intervals

Zipper plot that compares the empirical coverage probability of confidence intervals that assume normality. The data samples are size N=50 drawn from the normal or exponential distribution.

The graph to the right shows the zipper plot applied to the results of the two simulations in the previous article. The zipper plot on the left visualizes the estimate of the coverage probability for 10,000 normal samples of size n = 50. (The graph at the top of this article provides more detail.) The Monte Carlo estimate of the coverage probability of the CIs is 0.949, which is extremely close to the true coverage of 0.95. The Monte Carlo 95% confidence interval, shown by a blue band, is [0.945, 0.953], which includes 0.95. You can see that the width of the confidence intervals does not seem to depend on the sample mean: the samples whose means are "too low" tend to produce intervals that are about the same length as the intervals for the samples whose means are "too high."

The zipper plot on the right is for 10,000 random samples of size n = 50 that are drawn from an exponential distribution with unit scale parameter that is shifted to have mean 0. (The graph at this link provides more detail.) The Monte Carlo estimate of the coverage probability is 0.936. The Monte Carlo 95% confidence interval, shown by a blue band, is [0.931, 0.941] and does not include 0.95. Notice that the left side of the "zipper" tends to contain intervals that are shorter than those on the right side. This indicates that samples whose means are "too low" tend to produce shorter confidence intervals than the samples whose means are "too high."

The zipper plot enables you to compare the results of two simulations that generate data from different distributions. The zipper plot enables you to visualize the fact that the nominal 95% CIs do, in fact, have empirical coverage of about 95% for normal data, whereas the intervals have about 93.6% empirical coverage for the exponential data.

If you want to experiment with the zipper plot or use it for your own simulation studies in SAS, you can download the SAS program that generates the graphs in this article.

Differences from the "zip plot" of Morris et al.

There are some minor differences between the zipper plot in this article and the graph in Morris et al. (2017, p. 22).

  • Morris et al. use the term "zip plot." However, statisticians use "ZIP" for the zero-inflated Poisson distribution, so I prefer the term "zipper plot."
  • Morris et al. bin the p-values into 100 centiles and graph the CIs against the centiles. This has the advantage of being able to plot the CI's from thousands or millions of simulations in a graph that uses only a few hundred vertical pixels. In contrast, I plot the CI's for each fractional rank, which is the rank divided by the number of simulations. In the SAS program, I indicate how to compute and use the centiles.
  • Morris et al. plot all the centiles. Consequently, the interesting portion of the graph occupies only about 5-10% of the total graph area. In contrast, I display only the CIs whose fractional rank is less than some cutoff value, which is 0.2 in this article. Thus my zipper plots are a "zoomed in" version of the ones that appear in Morris et al. In the SAS program, you can set the FractionMaxDisplay macro variable to 1.0 to show all the centiles.

The post A zipper plot for visualizing coverage probability in simulation studies appeared first on The DO Loop.

February 7, 2018
 
Multiple-Birthday Problem: The distribution of the number of shared birthdays among 23 random people

If N random people are in a room, the classical birthday problem provides the probability that at least two people share a birthday. The birthday problem does not consider how many birthdays are in common. However, a generalization (sometimes called the Multiple-Birthday Problem) examines the distribution of the number of shared birthdays. Specifically, among N people, what is the probability that exactly k birthdays are shared (k = 0, 1, 2, ..., floor(N/2))? The bar chart at the right shows the distribution for N=23. The heights of the bars indicate the probability of 0 shared birthdays, 1 shared birthday, and so on.

This article uses simulation in SAS to examine the multiple-birthday problem. If you are interested in a theoretical treatment, see Fisher, Funk, and Sams (2013, p. 5-14).

The probability of multiple shared birthdays for N=23

You can explore the classical birthday problem by using probability theory or you can estimate the probabilities by using Monte Carlo simulation. The simulation in this article assumes a uniform distribution of 365 birthdays, but see my previous article to extend the simulation to include leap days or a nonuniform distribution of birthdays.

The simulation-based approach enables you to investigate the multiple birthday problem. To begin, consider a room that contains N=23 people. The following SAS/IML statements are taken from my previous article. The Monte Carlo simulation generates one million rooms of size 23 and estimates the proportion of rooms in which the number of shared birthdays is 0, 1, 2, and so forth.

proc iml;
/* Function that simulates B rooms, each with N people, and counts the number of shared 
   (matching) birthdays in each room. The return value is a row vector that has B counts. */
start BirthdaySim(N, B);                 
   bday = Sample(1:365, B||N);             /* each column is a room */
   match = N - countunique(bday, "col");   /* number of unique birthdays in each col */
   return ( match );
finish;
 
call randseed(12345);                      /* set seed for random number stream */
NumPeople = 23;
NumRooms = 1e6;
match = BirthdaySim(NumPeople, NumRooms);  /* 1e6 counts */
 
call tabulate(k, freq, match);  /* summarize: k=number of shared birthdays, freq=counts */ 
prob = freq / NumRooms;         /* proportion of rooms that have 0, 1, 2,... shared birthdays */
print prob[F=BEST6. C=(char(k,2)) L="Estimate of Probability (N=23)"];
Multiple-Birthday Problem: The distribution of shared birthdays among 23 random people

The output summarizes the results. In less than half the rooms, no person shared a birthday with anyone else. In about 36% of the rooms, one birthday is shared by two or more people. In about 12% of the rooms, two birthdays are shared by four or more people. About 2% of the rooms have three birthdays shared among six or more individuals, and so forth. In theory, there is a small probability of a room in which 8, 9, 10, or 11 birthdays are shared, but the probability of these events is very small. The probability estimates are plotted in the bar chart at the top of this article.

There is a second way to represent this probability distribution: you can create a stacked bar chart that shows the cumulative probability of k or fewer shared birthdays for k=1, 2, 3,.... You can see that the probability of 0 matching birthdays is less than 50%, the probability of 1 or fewer matches is 87%, the probability of 2 or fewer matches is 97%, and so forth. The probability for 5, 6, or 7 matches is not easily determined from either chart.

Multiple-Birthday Problem: The cumulative distribution of shared birthdays among 23 random people

Mathematically speaking, the information in the stacked bar chart is equivalent to the information in the regular bar chart. The regular bar chart shows the PMF (probability mass function) for the distribution whereas the stacked chart shows the CDF (cumulative distribution function).
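
If you want the cumulative probabilities themselves rather than a chart, you can compute them directly from the estimated PMF. The following statement is a minimal continuation of the earlier PROC IML session; it assumes that the vector prob is still in memory:

cdf = cusum(prob);    /* P(k or fewer shared birthdays), k = 0, 1, 2, ... */
print cdf[F=BEST6. L="Cumulative Probability (N=23)"];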

The probability of shared birthdays among N people

The previous section showed the distribution of shared birthdays for N=23. You can download the SAS program that simulates the multiple birthday problem for rooms that contain N=2, 3, ..., and 60 people. For each simulation, the program computes estimates for the probabilities of k matching birthdays, where k ranges from 0 to 30.

Multiple-Birthday Problem: Three distributions of the number of shared birthdays among N=25, 40, and 60 random people

There are a few ways to visualize those 59 distributions. One way is to plot 59 bar charts, either in a panel or by using an animation. A panel of three bar charts is shown to the right for rooms of size N=25, 40, and 60. You can see that the bar chart for N=25 is similar to the one shown earlier. For N=40, most rooms have between one and three shared birthdays. For rooms with N=60 people, most rooms have between three and six shared birthdays. The horizontal axes are truncated at k=12, even though a few simulations generated rooms that contain more than 12 matches.

Obviously plotting all 59 bar charts would require a lot of space. A different way to visualize these 59 distributions is to place 59 stacked bar charts side by side. This visualization is more compact. In fact, instead of 59 stacked bar charts, you might want to create a stacked band plot, which is what I show below.

The following stacked band chart is an attempt to visualize the distribution of shared birthdays for ALL room sizes N=2, 3, ..., 60. (Click to enlarge.)

Multiple-Birthday Problem: The cumulative distribution of the probability of k or fewer shared birthdays among N random people, N=2,3,...,60

How do you interpret this graph? A room that contains N=23 people corresponds to drawing a vertical line at N=23. That vertical line intersects the red, green, and brown bands near 0.5, 0.87, and 0.97. Therefore, these are the probabilities that a room with 23 people contains 0 shared birthdays, 1 or fewer shared birthdays, and 2 or fewer shared birthdays. The vertical heights of the bands qualitatively indicate the individual probabilities.

Next consider a room that contains N=40 people. A vertical line at N=40 intersects the colored bands at 0.11, 0.37, 0.66, 0.86, and 0.95. These are the cumulative probabilities P(k ≤ 0), P(k ≤ 1), P(k ≤ 2), and so forth. If you need the exact values, you can use PROC PRINT to display the actual probability estimates. For example:

proc print data=BDayProb noobs;
   where N=40 and k < 5;
run;
Multiple-Birthday Problem: The distribution of shared birthdays among N=40 random people

Summary

In summary, this article examines the multiple-birthday problem, which is a generalization of the classical birthday problem. If a room contains N random people, the multiple-birthday problem examines the probability that k birthdays are shared by two or more people for k = 0, 1, 2, ..., floor(N/2). This article shows how to use Monte Carlo simulation in SAS to estimate the probabilities. You can visualize the results by using a panel of bar charts or by using a stacked band plot to visualize the cumulative distribution. You can download the SAS program that simulates the multiple birthday problem, estimates the probabilities, and visualizes the results.

What do you think? Do you like the stacked band plot that summarizes the results in one graph, or do you prefer a series of traditional bar charts? Leave a comment.

The post The distribution of shared birthdays in the Birthday Problem appeared first on The DO Loop.

February 5, 2018
 
The Birthday Problem: The probability of a matching birthday in a room that contains N people (N=2,3,...,60). You can simulate the birthday problem in SAS and compute a Monte Carlo estimate of the probability of a match.

This article simulates the birthday-matching problem in SAS. The birthday-matching problem (also called the birthday problem or birthday paradox) answers the following question: "if there are N people in a room, what is the probability that at least two people share a birthday?" The birthday problem is famous because the probability of duplicate birthdays is much higher than most people would guess: Among 23 people, the probability of a shared birthday is more than 50%.

If you assume a uniform distribution of birthdays, the birthday-matching problem can be solved exactly. In reality birthdays are not uniformly distributed, but as I show in Statistical Programming with SAS/IML Software (Wicklin 2010), you can use a Monte Carlo simulation to estimate the probability of a matching birthday.

This article uses ideas from Wicklin (2010) to simulate the birthday problem in the SAS/IML language. The simulation is simplest to understand if you assume uniformly distributed birthdays (p. 335–337), but it is only slightly harder to run a simulation that uses an empirical distribution of birthdays (p. 344–346). For both cases, the simulation requires only a few lines of code.

Monte Carlo estimates of the probabilities in the birthday problem

The birthday-matching problem readily lends itself to Monte Carlo simulation. Represent the birthdays by the integers {1, 2, ..., 365} and do the following:

  1. Choose a random sample of size N (with replacement) from the set of birthdays. That sample represents a "room" of N people.
  2. Determine whether any of the birthdays in the room match. There are N(N-1)/2 pairs of birthdays to check.
  3. Repeat this process for thousands of rooms and compute the proportion of rooms that contain a match. This is a Monte Carlo estimate of the probability of a shared birthday.
  4. If desired, simulate rooms of various sizes, such as N=2, 3, 4,..., 60, and plot the results.

How to simulate the birthday problem in SAS

Before running the full simulation, let's examine a small-scale simulation that simulates only 10 rooms, each containing a random sample of N = 23 people:

proc iml;
call randseed(12345);                 /* set seed for random number stream */
N = 23;                               /* number of people in room */
NumRooms = 10;                        /* number of simulated rooms */
bday = Sample(1:365, NumRooms||N);    /* simulate: each column is a room that contains N birthdays */
*print bday;                          /* optional: print the 23 x 10 matrix of birthdays */
match = N - countunique(bday, "col"); /* number of unique birthdays in each col */
print match[label="Num Matches (N=23)" c=("Rm1":"Rm10")];
 
/* estimate the probability of a match, based on 10 simulations */
probMatch = (match > 0)[:];          /* mean = proportion of matches */
print probMatch;
Simulate the birthday problem in SAS and estimate the probability of a match

The output shows the number of matches in 10 rooms, each with 23 people. The first room did not contain any people with matching birthdays, nor did rooms 3, 5, 6, 7, and 10. The second room contained one matching birthday, as did rooms 8 and 9. The fourth room contains two shared birthdays. In this small simulation, 4/10 of the rooms contain a shared birthday, so the Monte Carlo estimate of the probability is 0.4. You can compute the estimate by forming the binary indicator variable (match > 0) and computing the mean. (Recall that the mean of a binary vector is the proportion of 1s.)

Notice that the simulation requires only three SAS/IML vectorized statements. The SAMPLE function produces a matrix that has N rows. Each column of the matrix represents the birthdays of N people in a room. The COUNTUNIQUE function returns a row vector that contains the number of unique birthdays in each column. If you subtract that number from N, you obtain the number of shared birthdays. Finally, the estimate is obtained by computing the mean of an indicator variable.

Simulate the birthday problem in SAS

For convenience, the following SAS/IML statements define a function that simulates B rooms, each containing a random sample of N birthdays. The function returns a row vector of size B that contains the number of matching birthdays in each room. The program calls the function in a loop and graphs the results.

start BirthdaySim(N, B);
   bday = Sample(1:365, B||N);           /* each column is a room */
   match = N - countunique(bday, "col"); /* number of unique birthdays in each col */
   return ( match );
finish;
 
maxN = 60;                    /* consider rooms with 2, 3, ..., maxN people */
NumPeople = 2:maxN;
probMatch = j(ncol(NumPeople), 1, 0);    /* allocate vector for results */
NumRooms = 10000;             /* number of simulated rooms */
 
do N = 1 to ncol(NumPeople);
   match =  BirthdaySim(NumPeople[N], NumRooms);
   probMatch[N] = (match > 0)[:];
end;
 
title "The Birthday Matching Problem";
call series(NumPeople, probMatch) grid={x y} 
     label={"Number of People in Room" "Probability of a Shared Birthday"};

The graph is shown at the top of this article. It shows the estimated probability of one or more matching birthdays in a room of size N for N=2, 3, ..., 60. The estimates are based on 10,000 Monte Carlo iterations. (Why did I use 10,000 rooms? See the Appendix.) As mentioned earlier, the probability exceeds 50% when a room contains 23 people. For 30 people, the probability of a match is 71%. If you put 50 people in a room, the chance that two people in the room share a birthday is about 97%.

Allow leap days and a nonuniform distribution of birthdays

The curious reader might wonder how this analysis changes if you account for people born on leap day (February 29th). The answer is that the probabilities of a match decrease very slightly because there are more birthdays, and you can calculate the probability analytically. If you are interested in running a simulation that accounts for leap day, use the following SAS/IML function:

start BirthdaySim(N, B);
   /* A four-year period has 3*365 + 366 = 1461 days. Assuming uniformly distributed 
      birthdays, the probability vector for randomly choosing a birthday is as follows: */
   p = j(366, 1, 4/1461);      /* most birthdays occur 4 times in 4 years */
   p[31+29] = 1/1461;          /* leap day occurs once in 4 years */
   /* sample with replacement using this probability of selection */
   bday = Sample(1:366, B||N, "replace", p);           /* each column is a room */
   match = N - countunique(bday, "col"); /* number of unique birthdays in each col */
   return ( match );
finish;

If you run the simulation, you will find that the estimated probability of sharing a birthday among N people does not change much if you include leap-day birthdays. The difference between the estimated probabilities is about 0.003 on average, which is much less than the standard error of the estimates. In practice, you can exclude February 29 without changing the conclusion that shared birthdays are expected to occur.

In a similar way, you can set the probability vector to be the empirical distribution of birthdays in the population. Wicklin (2010, p. 346) reports that the simulation-based estimates are slightly higher than the exact results that you obtain from assuming a uniform distribution. This agrees with theory: A. G. Munford (TAS, 1977) showed that any nonuniform distribution increases the likelihood of a matching birthday. In other words, in the real world, the probability of a shared birthday is slightly higher than in the idealized mathematical world.
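
If you want to run that simulation yourself, the only change is the probability vector that you pass to the SAMPLE function. The following sketch assumes a hypothetical vector p whose i_th element is the proportion of the population born on day i:

start BirthdaySimEmp(N, B, p);
   /* p is a 365-element vector: p[i] = proportion of births on day i */
   bday = Sample(1:365, B||N, "replace", p);  /* each column is a room */
   match = N - countunique(bday, "col");      /* shared birthdays per room */
   return ( match );
finish;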

Appendix: How many Monte Carlo simulations are necessary?

Why did I use 10,000 Monte Carlo simulations? Notice that the Monte Carlo estimate in this problem is an estimate of a proportion of a binary variable, which means that you can estimate the standard error. If p̂ is the estimate, then a 95% confidence interval has radius 1.96 sqrt(p̂(1-p̂)/B), where B is the number of Monte Carlo simulations (that is, the number of "rooms"). This quantity is largest when p̂ = 1/2, so if you want, say, two digits of accuracy you can choose a value of B that is large enough so that 1.96 sqrt(0.5(1-0.5)/B) ≤ 0.01. Solving the inequality for B gives B ≥ 9604, which you can round up to B = 10,000. If you choose that many Monte Carlo iterations, then the conservative 95% confidence interval has a half-width no greater than 0.01.

This simple calculation shows that if you use B=10,000 for the number of simulated "rooms," you can expect the Monte Carlo estimates to be within 0.01 of the true probability 95% of the time. The standard error is largest where the curve is steepest (near N=23). You could add error bars to the graph, but the errors would be too small to see.

The post Simulate the birthday-matching problem appeared first on The DO Loop.

January 29, 2018
 

What is a random number generator? What are the random-number generators in SAS, and how can you use them to generate random numbers from probability distributions? In SAS 9.4M5, you can use the STREAMINIT function to select from eight random-number generators (RNGs), including five new RNGs. After choosing an RNG, you can use the RAND function to obtain random values from probability distributions.

What is a random-number generator?

A random-number generator usually refers to a deterministic algorithm that generates uniformly distributed random numbers. The algorithm (sometimes called a pseudorandom-number generator, or PRNG) is initialized by an integer, called the seed, which sets the initial state of the PRNG. Each iteration of the algorithm produces a pseudorandom number and advances the internal state. The seed completely determines the sequence of random numbers. Although it is deterministic, a modern PRNG generates extremely long sequences of numbers whose statistical properties are virtually indistinguishable from a truly random uniform process.
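
For example, because the seed determines the stream, running the following DATA step twice produces identical values. This sketch uses the default RNG and an arbitrary seed:

data _null_;
   call streaminit(12345);   /* same seed ==> same stream */
   do i = 1 to 3;
      u = rand("Uniform");
      put u=;                /* writes the same three values on every run */
   end;
run;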

An RNG can also refer to a hardware-based device (sometimes called a "truly random RNG") that samples from an entropy source, which is often thermal noise within the silicon of a computer chip. In the Intel implementation, the random entropy source is used to initialize a PRNG in the chip, which generates a short stream of pseudorandom values before resampling from the entropy source and repeating the process. A hardware-based RNG is not deterministic or reproducible.

An RNG generates uniformly distributed numbers. If you request random numbers from a nonuniform probability distribution (such as the normal or exponential distributions), SAS will automatically apply mathematical transformations that convert the uniform numbers into nonuniform random variates.
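
One standard transformation is the inverse-CDF method: if U is uniform on (0,1) and F is the CDF of the target distribution, then the quantile F^(-1)(U) follows the target distribution. The RAND function applies transformations like this internally, so the following DATA step is purely illustrative:

data InvCDF;
   call streaminit(123);
   do i = 1 to 5;
      u = rand("Uniform");             /* U ~ U(0,1) */
      x = quantile("Exponential", u);  /* x = -log(1-u) ~ Exp(1) */
      output;
   end;
run;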

Using random numbers in the real world

In the real world, random numbers are used as the basis for statistical simulations. Financial analysts use simulations of the economy to evaluate the risk of a portfolio. Insurance companies use simulations of natural disasters to assess the impact on their customers. Government workers use simulations of population growth to predict the infrastructure and services that will be required by future generations. In all fields, data analysts generate random values for statistical sampling, bootstrap and resampling methods, Monte Carlo estimation, and data simulation.

For these techniques to be statistically valid, the values that are produced by statistical software must contain a high degree of randomness. The new random-number generators in SAS provide analysts with fast, high-quality, state-of-the-art random numbers.

New random-number generators in SAS

Prior to SAS 9.4M5, SAS supported the Mersenne twister random-number generator (introduced in SAS 9.1) and a hardware-based RNG (supported on Intel CPUs in SAS 9.4M3). SAS 9.4M5 introduces new random-number generators and new variants of the Mersenne twister, as follows:

  • PCG: A 64-bit permuted congruential generator (O’Neill 2014) with good statistical properties.
  • TF2: A 2x64-bit counter-based RNG that is based on the Threefish encryption function in the Random123 library (Salmon et al. 2011). The generator is also known as the 2x64 Threefry generator.
  • TF4: A 4x64-bit counter-based RNG that is also known as the 4x64 Threefry generator.
  • RDRAND: A hardware-based RNG (Intel, 2014) that is repeatedly reseeded by using random values from thermal noise in the chip. The RNG requires an Intel processor (Ivy Bridge and later) that supports the RDRAND instruction.
  • MT1998: (deprecated) The original 1998 32-bit Mersenne twister algorithm (Matsumoto and Nishimura, 1998). This was the default RNG for the RAND function prior to SAS 9.4M3. For a small number of seed values (those exactly divisible by 8,192), this RNG does not produce a sufficiently random stream of numbers. Consequently, other RNGs are preferable.
  • MTHYBRID: (default) A hybrid method that improves the MT1998 method by using the MT2002 initialization for seeds that are exactly divisible by 8,192. This is the default RNG, beginning with SAS 9.4M3.
  • MT2002: The 2002 32-bit Mersenne twister algorithm (Matsumoto and Nishimura 2002).
  • MT64: A 64-bit version of the 2002 Mersenne twister algorithm.

Choose a random-number generator in SAS

In SAS, you can use the STREAMINIT subroutine to choose an RNG and to set the seed value. You can then use the RAND function to produce high-quality streams of random numbers for many common distributions. The first argument to the RAND function specifies the name of the distribution. Subsequent arguments specify parameters for the distribution.

If you want to use the default RNG (which is 'MTHYBRID'), you can specify only the seed. For example, the following DATA step initializes the default PRNG with the seed value 12345 and then generates five observations from the uniform distribution on (0,1) and from the standard normal distribution:

data RandMT(drop=i);
   call streaminit(12345);   /* choose default RNG (MTHYBRID) and seed=12345 */
   do i = 1 to 5;
      uMT = rand('Uniform'); /* generate random uniform: U ~ U(0,1)           */
      nMT = rand('Normal');  /* generate random normal: N ~ N(mu=0, sigma=1) */
      output;
   end;
run;

If you want to choose a different RNG, you can specify the name of the RNG followed by the seed. For example, the following DATA step initializes the PCG method with the seed value 12345 and then generates five observations from the uniform and normal distributions:

data RandPCG(drop=i);
   call streaminit('PCG', 12345); /* SAS 9.4M5: choose PCG method, same seed */
   do i = 1 to 5;
      uPCG = rand('Uniform'); nPCG = rand('Normal'); output;
   end;
run;

A hardware-based method is initialized by the entropy source, so you do not need to specify a seed value. For example, the following DATA step initializes the RDRAND method and then generates five observations. The three data sets are then merged and printed:

 
data RandHardware(drop=i);
   call streaminit('RDRAND');     /* SAS 9.4M3: Hardware method, no seed */
   do i = 1 to 5;
      uHardware = rand('Uniform'); nHardware = rand('Normal'); output;
   end;
run;
 
data All;
   merge RandMT RandPCG RandHardware;
run;
 
title "Uniform Random Variates for Three RNGs";
proc print data=All noobs; var u:; run;
 
title "Normal Random Variates for Three RNGs";
proc print data=All noobs; var n:; run;

The output shows that each RNG generates a different sequence of numbers. If you run this program yourself, you will generate the same sequences for the first two columns (the MTHYBRID and PCG streams), but your numbers for the RDRAND RNG will be different. Each column of the first table is an independent random sample from the uniform distribution. Similarly, each column of the second table is an independent random sample from the standard normal distribution.

Summary

In SAS 9.4M5, you can choose from eight different random-number generators. You use the STREAMINIT function to select the RNG and use the RAND function to obtain random values. In a future blog post, I will compare the attributes of the different RNGs and discuss the advantages of each. I will also show how to use the new random-number generators to generate random numbers in parallel by running a DS2 program in multiple threads.

The information in this article is taken from my forthcoming SAS Global Forum 2018 paper, "Tips and Techniques for Using the Random-Number Generators in SAS" (Sarle and Wicklin, 2018).

The post How to use random-number generators in SAS appeared first on The DO Loop.

January 15, 2018
 

Last week I got the following message:

Dear Rick: How can I create a normal distribution within a specified range (min and max)? I need to simulate a normal distribution that fits within a specified range. I realize that a normal distribution is by definition infinite... Are there any alternatives, such as a distribution that has shape and properties similar to a normal distribution, but for which I can restrict the range? - Carol

Mathematically, this request doesn't make sense, as Carol (not her real name) acknowledges. However, if you are working with a client who is not statistically savvy, you might be asked to do the impossible! How can Carol make sense of this request?

I provide a two-part answer. First I discuss general ways to choose a distribution that models a data sample. You can use these methods when you have access to the data. Then I address the specific question and present five distributions that you can use to model data that looks like a "bounded" normal distribution. These distributions can be used when you don't have access to the data.

How to choose a distribution that matches the data?

In general, statisticians are often asked to choose a model that seems to fit an observed set of data. In simulation studies, the model is used to simulate samples that have the same distributional properties as the real data. In the book Simulating Data with SAS (Chapter 16), I discuss many strategies, including the following:

  • Use domain-specific knowledge to guide you. There might be physical or biological reasons to prefer one probability distribution over another.
  • Sample from the empirical distribution of the data, which is equivalent to bootstrap resampling. For more information, see the article about how to bootstrap in SAS or Chapter 15 of Simulating Data with SAS.
  • Use a well-known “named” parametric distribution to model the data, such as the normal, gamma, and beta distributions. Typically you will fit several candidate distributions to the data to see which fits best. (In SAS, you can use PROC UNIVARIATE, PROC SEVERITY, or PROC NLMIXED to fit distributions; a minimal example follows this list.) After you choose a distribution that fits the data, you can draw random samples from the fitted distribution.
  • For univariate data, you can choose a flexible system of distributions such as the Pearson system or the Johnson system. The Johnson system is supported by PROC UNIVARIATE (Wicklin, p. 112–114).
  • Use a graphical tool, called the moment-ratio diagram, to help select candidate distributions for the data. Traditionally the moment-ratio diagram is a theoretical tool for understanding the relationships of families and systems of distributions, but Chapter 16 shows how to use the moment-ratio diagram as a tool to organize simulation studies.
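
As a minimal sketch of the "named distribution" strategy, the following call to PROC UNIVARIATE fits two candidate distributions to a variable and displays goodness-of-fit statistics. The data set and variable are chosen only for illustration:

proc univariate data=sashelp.iris(where=(Species="Versicolor"));
   var PetalLength;
   histogram PetalLength / normal gamma;  /* overlay fitted normal and gamma curves */
   ods select Histogram GoodnessOfFit;    /* show the fit diagnostics */
run;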

These ideas (and others) are illustrated in the following diagram, which is also from Wicklin (2013, p. 298):

The flowchart shows a few paths that a researcher can follow to model data. Most of the paths are also applicable to multivariate data.

Simulate data from the "eyeball distribution"

Now let's discuss the specific question that I was asked: how to simulate a "bounded" normal distribution. If Carol has access to the data, she could use the techniques in the previous section. Consequently, I assume that Carol does not have access to the data. This can happen in several ways, such as when you are trying to reproduce the results in a published article but the article includes only summary statistics or a graph of the data. If you cannot obtain the original data from the authors, you might be forced to simulate fake data based only on the graph and the summary statistics.

I call this using the "eyeball distribution." For non-native speakers of English, "to eyeball" means to look at or observe something. When used as an adjective, “eyeball” indicates that a quantity was obtained by visual inspection, rather than through formal measurements. Thus an "eyeball distribution" is one that is chosen heuristically because it looks similar to the histogram of the data. (You can extend these ideas to multivariate data.)

From Carol's description, the histogram of her data is symmetric, unimodal, and "bell-shaped." The data are in the interval [a, b]. I can think of several "eyeball distributions" that could model data like this:

  1. If you know the sample mean (m) and standard deviation (s), then draw random samples from N(m, s).
  2. If you know sample quantiles of the data, you can approximate the CDF and use inverse probability sampling for the simulation.
  3. Theory tells us that 99.7% of the normal distribution is within three standard deviations of the mean. For symmetric distributions, the mean equals the midrange m = (a + b)/2. Thus you could use the three-sigma rule to approximate s ≈ (b - a)/6 and sample from N(m, s).
  4. If you have domain knowledge that guarantees that the data are truly bounded on [a, b], you could use the truncated normal distribution, which samples from N(m, s) but discards any random variates that are outside of the interval.
  5. Alternatively, you can sample from a symmetric bounded distribution such as the Beta(α, α) distribution. For example, Beta(5, 5) is approximately bell-shaped on the interval [0,1].

The first option is often the most reasonable choice. Even though the original data are in the interval [a, b], the simulated data do not have to be in the same interval. If you believe that the original data are normally distributed, then simulate from that model and accept that you might obtain simulated data that are beyond the bounds of the original sample. The same fact holds for the third option, which estimates the standard deviation from a histogram.

The following SAS DATA step implements the last three options. The sample distributions are displayed by using a comparative histogram in SAS:

%let N = 500;      /* sample size */
%let min = 5;
%let max = 15;
 
proc format;
value MethodFmt 1 = "Normal, 3-Sigma Rule"
                2 = "Truncated Normal"
                3 = "Beta(5, 5), Scaled";
run;
data BellSim(keep=x Method);
format method MethodFmt.;
call streaminit(54321);  /* set random number seed */
numSD = 3;               /* use 3-sigma rule to estimate std dev */
mu = (&min + &max)/2;
sigma = (&max - &min)/ (2*numSD);
 
method = 1;        /* Normal distribution N(mu, sigma) */
   do i = 1 to &N;
      x = rand("Normal", mu, sigma);  output;
   end;
 
method = 2;        /* truncated normal distribution TN(mu, sigma; a, b) */
   i = 0;
   do until (i = &N);
      x = rand("Normal", mu, sigma);
      if &min < x < &max then do;
         output;
         i + 1;
      end;
   end;
 
method = 3;        /* symmetric beta distribution, scaled into [a, b] */
   alpha = 5;
   do i = 1 to &N;
      x = &min + (&max - &min)*rand("beta", alpha, alpha);
      output;
   end;
run;
 
ods graphics / width=480px height=480px;
title "Simulate Bell-Shaped Data on [&min, &max]";
proc sgpanel data=BellSim;
   panelby method / columns=1 novarname;
   histogram x;
   refline &min &max / axis=x;
   colaxis values=(&min to &max) valueshint;
run;

The histograms show a simulated sample of size 500 for each of the three models. The top sample is from the normal distribution. It contains simulated observations that are outside of the interval [a,b]. In many situations, that is fine. The second sample is from the truncated normal distribution on [a,b]. The third is from a Beta(5,5) distribution, which has been scaled onto the interval [a,b].

Although the preceding SAS program answers the question Carol asked, let me emphasize that if you (and Carol) have access to the data, you should use it to choose a model. When the data are unobtainable, you can use an "eyeball distribution," which is basically a guess. However, if your goal is to generate fake data so that you can test a computational method, an eyeball distribution might be sufficient.

The post Data unavailable? Use the "eyeball distribution" to simulate appeared first on The DO Loop.

November 20, 2017
 

This article shows how to simulate beta-binomial data in SAS and how to compute the density function (PDF). The beta-binomial distribution is a discrete compound distribution. The "binomial" part of the name means that the discrete random variable X follows a binomial distribution with parameters N (number of trials) and p, but there is a twist: The parameter p is not a constant value but is a random variable that follows the Beta(a, b) distribution.

The beta-binomial distribution is used to model count data where the counts are "almost binomial" but have more variance than can be explained by a binomial model. Therefore this article also compares the binomial and beta-binomial distributions.

Simulate data from the beta-binomial distribution

To generate a random value from the beta-binomial distribution, use a two-step process. The first step is to draw p randomly from the Beta(a, b) distribution. Then you draw x from the binomial distribution Bin(p, N). The beta-binomial distribution is not natively supported by the RAND function in SAS, but you can call the RAND function twice to simulate beta-binomial data, as follows:

/* simulate a random sample from the beta-binomial distribution */
%let SampleSize = 1000;
data BetaBin;                         
a = 6; b = 4; nTrials = 10;           /* parameters */
call streaminit(4321);
do i = 1 to &SampleSize;
   p = rand("Beta", a, b);            /* p[i] ~ Beta(a,b)  */
   x = rand("Binomial", p, nTrials);  /* x[i] ~ Bin(p[i], nTrials) */
   output;
end;
keep x;
run;

The result of the simulation is shown in the following bar chart. The expected values are overlaid. The next section shows how to compute the expected values.

Beta-binomial distribution and expected values in SAS

The PDF of the beta-binomial distribution

The Wikipedia article about the beta-binomial distribution contains a formula for the PDF of the distribution. Since the distribution is discrete, some references prefer to use "PMF" (probability mass function) instead of PDF. Regardless, if X is a random variable that follows the beta-binomial distribution, then the probability that X=x is
P(X = x) = C(N, x) B(x + a, N – x + b) / B(a, b),   x = 0, 1, ..., N
where C(N, x) is the binomial coefficient and B is the complete beta function.

The binomial coefficients ("N choose x") and the beta function are defined in terms of factorials and gamma functions, which get big fast. For numerical computations, it is usually more stable to compute the log-transform of the quantities and then exponentiate the result. The following DATA step computes the PDF of the beta-binomial distribution. For easy comparison with the distribution of the simulated data, the DATA step also computes the expected count for each value in a random sample of size N. The PDF and the simulated data are merged and plotted on the same graph by using the VBARBASIC statement in SAS 9.4M3. The graph was shown in the previous section.

data PDFBetaBinom;           /* PMF function */
a = 6; b = 4; nTrials = 10;  /* parameters */
do x = 0 to nTrials;
   logPMF = lcomb(nTrials, x) +
            logbeta(x + a, nTrials - x + b) -
            logbeta(a, b);
   PMF = exp(logPMF);        /* probability that X=x */
   EX = &SampleSize * PMF;   /* expected value in random sample */
   output;
end;
keep x PMF EX;
run;
 
/* Merge simulated data and PMF. Overlay PMF on data distribution. */
data All;
merge BetaBin PDFBetaBinom(rename=(x=t));
run;
 
title "The Beta-Binomial Distribution";
title2 "Sample Size = &SampleSize";
proc sgplot data=All;
   vbarbasic x / barwidth=1 legendlabel='Simulated Sample';      /* requires SAS 9.4M3 */
   scatter x=t y=EX /  legendlabel='Expected Value'
      markerattrs=GraphDataDefault(symbol=CIRCLEFILLED size=10);
   inset "nTrials = 10"  "a = 6"  "b = 4" / position=topleft border;
   yaxis grid;
   xaxis label="x" integer type=linear;             /* force TYPE=LINEAR */
run;

Compare the binomial and beta-binomial distributions

One application of the beta-binomial distribution is to model count data that are approximately binomial but have more variance ("thicker tails") than the binomial model predicts. The expected value of a Beta(a, b) distribution is a/(a + b), so let's compare the beta-binomial distribution to the binomial distribution with p = a/(a + b).

The following graph overlays the two PDFs for a = 6, b = 4, and nTrials = 10. The blue distribution is the binomial distribution with p = 6/(6 + 4) = 0.6. The pink distribution is the beta-binomial. You can see that the beta-binomial distribution has a shorter peak and thicker tails than the corresponding binomial distribution. The expected value for both distributions is 6, but the variance of the beta-binomial distribution is greater. Thus you can use the beta-binomial distribution as an alternative to the binomial distribution when the data exhibit greater variance than expected under the binomial model (a phenomenon known as overdispersion).

Compare the binomial and beta-binomial distributions with the same mean. The beta-binomial distribution has greater variance than the binomial.
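
If you want to reproduce the comparison, you can compute the binomial PMF by using the PDF function, as in the following minimal sketch. Overlaying it with the beta-binomial PMF uses the same SGPLOT technique as shown earlier:

data PDFBinom;
a = 6; b = 4; nTrials = 10;               /* same parameters as before */
p = a / (a + b);                          /* mean of the Beta(a,b) distribution */
do x = 0 to nTrials;
   PMF = pdf("Binomial", x, p, nTrials);  /* P(X=x) for Bin(p, nTrials) */
   output;
end;
keep x PMF;
run;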

Summary

The beta-binomial distribution is an example of a compound distribution. You can simulate data from a compound distribution by randomly drawing the parameters from some distribution and then using those random parameters to draw the data. For the beta-binomial distribution, the probability parameter p is drawn from a beta distribution and then used to draw x from a binomial distribution where the probability of success is the value of p. You can use the beta-binomial distribution to model data that have greater variance than expected under the binomial model.

The post Simulate data from the beta-binomial distribution in SAS appeared first on The DO Loop.

October 11, 2017
 

The article "Fisher's transformation of the correlation coefficient" featured a Monte Carlo simulation that generated sample correlations from bivariate normal data. The simulation used three steps:

  1. Simulate B samples of size N from a bivariate normal distribution with correlation ρ.
  2. Use PROC CORR to compute the sample correlation matrix for each of the B samples.
  3. Use the DATA step to extract the off-diagonal elements from the correlation matrices.

After the three steps, you obtain a distribution of B sample correlation coefficients that approximates the sampling distribution of the Pearson correlation coefficient for bivariate normal data.

There is a simpler way to simulate the correlation estimates: You can directly simulate from the Wishart distribution. Each draw from the Wishart distribution is a sample covariance matrix for a multivariate normal sample of size N. If you convert that covariance matrix to a correlation matrix, you can immediately extract the off-diagonal elements, as shown in the following SAS/IML statements:

%let rho = 0.8;           /* correlation for bivariate normal distribution */
%let N = 20;              /* sample size */
%let NumSamples = 2500;   /* number of simulated samples */
 
/* generate sample correlation coefficients by using Wishart distribution */
proc iml;
call randseed(12345);
NumSamples = &NumSamples;
DF = &N - 1;              /* X ~ N obs from MVN(0, Sigma) */
Sigma = {1     &rho,      /* covariance for MVN samples */
         &rho   1  };
S = RandWishart(NumSamples, DF, Sigma); /* each row is 2x2 matrix */
Corr = j(NumSamples, 1);  /* allocate vector for correlation estimates */
do i = 1 to nrow(S);      /* convert to correlation; extract off-diagonal */
   Corr[i] = cov2corr( shape(S[i,], 2, 2) )[1,2];
end;

You can create a comparative histogram of the sample correlation coefficients. In the following graph, the histogram at the top of the panel displays the distribution of the simulated correlation coefficients from the three-step method. The bottom histogram displays the distribution of correlation coefficients that are generated from the Wishart distribution.

Visually, the histograms appear to be similar. You can use PROC NPAR1WAY to run various hypothesis tests that compare the distributions; all tests support the hypothesis that these two distributions are equivalent.
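
A minimal sketch of such a comparison follows. It assumes that the estimates from both simulations are stacked into a data set named Corrs that has a classification variable Method and an analysis variable Corr; those names are mine, not part of the downloadable program:

proc npar1way data=Corrs edf;   /* EDF option: Kolmogorov-Smirnov and related tests */
   class Method;
   var Corr;
run;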

If you'd like to see the complete analysis, you can download the SAS program that runs both simulations and compares the resulting distributions.

Although the Wishart distribution is more efficient for this simulation, recall that the Wishart distribution assumes that the underlying data distribution is multivariate normal. In contrast, the three-step simulation is more general. It can be used to generate correlation coefficients for any data distribution. So although the three-step simulation is not necessary for multivariate normal data, it is still an important technique to store in your simulation toolbox.

The post Simulate correlations by using the Wishart distribution appeared first on The DO Loop.

September 27, 2017
 

In a large simulation study, it can be convenient to have a "control file" that contains the parameters for the study. My recent article about how to simulate multivariate normal clusters demonstrates a simple example of this technique. The simulation in that article uses an input data set that contains the parameters (mean, standard deviations, and correlations) for the simulation. A SAS procedure (PROC SIMNORMAL) simulates data based on the parameters in the input data set.

This is a powerful paradigm. Instead of hard-coding the parameters in the program (or as macro variables), the parameters are stored in a data set that is processed by the program. This is sometimes called data-driven programming. (Some people call it dynamic programming, but there is an optimization technique of the same name so I will use the term "data-driven.") In a data-driven program, when you want to run the program with new parameters, you merely modify the data set that contains the control parameters.

I have previously written about a different way to control a batch program by passing in parameters on the command line when you invoke the SAS program.

Static programming and hard-coded parameters

Before looking at data-driven programming, let's review the static approach. I will simulate clusters of univariate normal data as an example.

Suppose that you want to simulate normal data for three different groups. Each group has its own sample size (N), mean, and standard deviation. In my book Simulating Data with SAS (p. 206), I show how to simulate this sort of ANOVA design by using arrays, as follows.

/* Static simulation: Parameters embedded in the simulation program */
data AnovaStatic;
/* define parameters for three simulated groups */
array N[3]       _temporary_ (50,   50,   50);   /* sample sizes */
array Mean[3]    _temporary_ (14.6, 42.6, 55.5); /* center for each group */
array StdDev[3]  _temporary_ ( 1.7,  4.7,  5.5); /* spread for each group */
 
call streaminit(12345);
do k = 1 to dim(N);              /* for each group */
   do i = 1 to N[k];             /* simulate N[k] observations */
      x = rand("Normal", Mean[k], StdDev[k]); /* from k_th normal distribution */
      output;
   end;
end;
run;

The DATA step contains two loops, one for the groups and the other for the observations within each group. The parameters for each group are stored in arrays. Notice that if you want to change the parameters (including the number of groups), you need to edit the program. I call this method "static programming" because the behavior of the program is determined at the time that the program is written. This is a perfectly acceptable method for most applications. It has the advantage that you know exactly what the program will do by looking at the program.

Data-driven programming: Put parameters in a file

An alternative is to put the parameters for each group into a file or data set. If the k_th row in the data set contains the parameters for the k_th group, then the implicit loop in the DATA step will iterate over all groups, regardless of the number of groups. The following DATA step creates the parameters for three groups, which are read and processed by the second DATA step. The parameter values are the same as for the static example, but are transposed and processed row-by-row instead of via arrays:

/* Data-driven simulation: Parameters in a data set, processed by the simulation program */
data params;                     /* define parameters for each simulated group */
input N Mean StdDev;
datalines; 
50 14.6 1.7
50 42.6 4.7
50 55.5 5.5
;
 
data AnovaDynamic;
call streaminit(12345);
set params;                      /* implicit loop over groups k=1,2,... */
do i = 1 to N;                   /* simulate N[k] observations */
   x = rand("Normal", Mean, StdDev); /* from k_th normal distribution */
   output;
end;
run;

Notice the difference between the static and dynamic techniques. The static technique simulates data from three groups whose parameters are specified in temporary arrays. The dynamic technique simulates data from an arbitrary number of groups. Currently, the PARAMS data set specifies three groups, but if I change the PARAMS data set to represent 10 or 1000 groups, the AnovaDynamic DATA step will simulate data from the new design without any modification.

Generate the parameters from real data

The data-driven technique is useful when the parameters are themselves the results of an analysis. For example, a common simulation technique is to compute the moments of real data (mean, variance, skewness, and so forth) and to use those statistics in place of the population parameters that they estimate. (See Chapter 16, "Moment Matching," in Simulating Data with SAS.)

The following call to PROC MEANS generates the sample mean and standard deviation for real data and writes those values to a data set:

proc means data=sashelp.iris N Mean StdDev stackods;
   class Species;
   var PetalLength;
   ods output Summary=params;
run;

The ODS OUTPUT statement creates a PARAMS data set that contains the variables (N, MEAN, and STDDEV) that are read by the simulation program. Therefore, you can immediately run the AnovaDynamic DATA step to simulate normal data from the sample statistics. A visualization of the resulting simulated data is shown below.

You can run PROC MEANS on other data and other variables and the AnovaDynamic step will continue to work without any modification. The simulation is controlled entirely by the values in the "control file," which is the PARAMS data set.

You can generalize this technique by wrapping the program in a SAS macro in which the name of the parameter file and the name of the simulated data set are provided at run time. With a macro implementation, you can read from multiple input files and write to multiple output data sets. You could use such a macro, for example, to break up a large simulation study into smaller independent sub-simulations, each controlled by its own file of input parameters. In a gridded environment, each sub-simulation can be processed independently and in parallel, thus reducing the total time required to complete the study.
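
A minimal sketch of such a macro follows. The macro name and parameters are illustrative, not part of any SAS-supplied interface:

%macro SimAnova(ParamData=, OutData=, seed=12345);
data &OutData;
   call streaminit(&seed);
   set &ParamData;              /* implicit loop over groups */
   do i = 1 to N;               /* simulate N observations for each group */
      x = rand("Normal", Mean, StdDev);
      output;
   end;
run;
%mend SimAnova;
 
%SimAnova(ParamData=params, OutData=AnovaDynamic2);  /* one control file, one output */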

Although this article discusses control files in the context of statistical simulation, other applications are possible. Have you used a similar technique to control a program by using an input file that contains the parameters for the program? Leave a comment.

The post Data-driven simulation appeared first on The DO Loop.

September 25, 2017
 

My article about Fisher's transformation of the Pearson correlation contained a simulation. The simulation uses the RANDNORMAL function in SAS/IML software to simulate multivariate normal data. If you are a SAS programmer who does not have access to SAS/IML software, you can use the SIMNORMAL procedure in SAS/STAT software to simulate data from a multivariate normal distribution.

The 'TYPE' of a SAS data set

Most SAS procedures read and analyze raw data. However, some SAS procedures read and write special data sets that represent a statistical summary of data. PROC SIMNORMAL can read a TYPE=CORR or TYPE=COV data set. Usually, these special data sets are created as an output data set from another procedure. For example, the following SAS statements compute the correlation between four variables from a sample of 50 Iris versicolor flowers:

proc corr data=sashelp.iris(where=(Species="Versicolor"))  /* input raw data */
          nomiss noprint outp=OutCorr;                     /* output statistics */
var PetalLength PetalWidth SepalLength SepalWidth;
run;
 
proc print data=OutCorr; run;
SAS TYPE=CORR data set

The output data set contains summary statistics including the mean, standard deviations, and correlation matrix for the four variables in the analysis. PROC PRINT does not display the 'TYPE' attribute of this data set, but if you run PROC CONTENTS you will see a field labeled "Data Set Type," which has the value "CORR".

You can also create a TYPE=CORR or TYPE=COV data set by using the DATA step as shown in the documentation for PROC SIMNORMAL.

Use PROC SIMNORMAL to generate multivariate normal data

Recall that you can use the standard deviations and correlations to construct a covariance matrix. When you call PROC SIMNORMAL, it internally constructs the covariance matrix from the information in the OutCorr data set and uses the mean and covariance matrix to simulate multivariate normal data. The following call to PROC SIMNORMAL simulates 50 observations from a multivariate normal population. The DATA step combines the original and simulated data; the call to PROC SGSCATTER overlays the original and the simulated samples. Click to enlarge the graph.

proc simnormal data=OutCorr outsim=SimMVN
               numreal = 50           /* number of realizations = size of sample */
               seed = 12345;          /* random number seed */
   var PetalLength PetalWidth SepalLength SepalWidth;
run;
 
/* combine the original data and the simulated data */
data Both;
set sashelp.iris(where=(Species="Versicolor")) /* original */
    SimMVN(in=sim);                            /* simulated */
Simulated = sim;
run;
 
ods graphics / attrpriority=none;   /* use different markers for each group */
title "Overlay of Original and Simulated MVN Data";
proc sgscatter data=Both;
   matrix PetalLength PetalWidth SepalLength SepalWidth / group=Simulated;
run;
ods graphics / reset=attrpriority;  /* reset markers */
Overlay of original and simulated four-dimensional data

Notice that the original data are rounded whereas the simulated data are not. Except for that minor difference, the simulated data appear to be similar to the original data. Of course, the simulated data will not resemble the original data unless the original data are approximately multivariate normal.

Simulate many samples from a multivariate normal distribution

The SIMNORMAL procedure supports the NUMREAL= option, which you can use to specify the size of the simulated sample. (NUMREAL stands for "number of realizations," which is the number of independent draws.) You can use this option to generate multiple samples from the same multivariate normal population. For example, suppose you are conducting a Monte Carlo study and you want to generate 100 samples of size N=50, each drawn from the same multivariate normal population. This is equivalent to drawing 50*100 observations where the first 50 observations represent the first sample, the next 50 observations represent the second sample, and so on. The following statements generate 50*100 observations and then construct an ID variable that identifies each sample:

%let N = 50;            /* sample size */
%let NumSamples = 100;  /* number of samples */
proc simnormal data=OutCorr outsim=SimMVN
               numreal = %sysevalf(&N*&NumSamples) 
               seed = 12345;          /* random number seed */
   var PetalLength PetalWidth SepalLength SepalWidth;
run;
 
data SimMVNAll;
set SimMVN;
ID = floor((_N_-1) / &N) + 1;   /* ID = 1,1,...,1, 2,2,...,2, etc */
run;

After adding the ID variable, you can efficiently analyze all samples by using a single call to a procedure. The procedure should use a BY statement to analyze each sample. For example, you could use PROC CORR with a BY ID statement to obtain a Monte Carlo estimate of the sampling distribution of the correlation for multivariate normal data.
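
A minimal sketch of that analysis follows; the WHERE clause keeps only the correlation rows of the output data set:

proc corr data=SimMVNAll noprint
          outp=CorrOut(where=(_TYPE_="CORR")); /* one correlation matrix per sample */
   by ID;
   var PetalLength PetalWidth SepalLength SepalWidth;
run;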

In summary, although the SAS/IML language is the best tool for general multivariate simulation tasks, you can use the SIMNORMAL procedure in SAS/STAT software to simulate multivariate normal data. The key is to construct a TYPE=CORR or TYPE=COV data set, which is then processed by PROC SIMNORMAL.

The post Simulate multivariate normal data in SAS by using PROC SIMNORMAL appeared first on The DO Loop.