simulation

7月 112018
 

In a previous article, I showed how to find the intersection (if it exists) between two line segments in the plane. There are some fun problems in probability theory that involve intersections of line segments. One is "What is the probability that two randomly chosen chords of a circle intersect?" This article shows how to create a simulation in SAS to estimate the probability.

An exact answer

For this problem, a "random chord" is defined as the line segment that joins two points chosen at random (with uniform probability) on the circle. The probability that two random chords intersect can be derived by using a simple counting argument. Suppose that you pick four points at random on the circle. Label the points according to their polar angle as p1, p2, p3, and p4. As illustrated by the following graphic, the points are arranged on the circle in one of the following three ways. Consequently, the probability that two random chords intersect is 1/3 because the chords intersect in only one of the three possible arrangements.

Possible arrangements of two random chords in the circle

A simulation in SAS

You can create a simulation to estimate the probability that two random chords intersect. The intersection of two segments can be detected by using either of the two SAS/IML modules in my article about the intersection of line segments. The following simulation generates four angles chosen uniformly at random in the interval (0, 2π). It converts those points to (x,y) coordinates on the unit circle. It then computes whether the chord between the first two points intersects the chord between the third and fourth points. It repeats this process 100,000 times and reports the proportion of times that the chords intersect.

proc iml;
/* Find the intersection between 2D line segments [p1,p2] and [q1,q2].
   This function assumes that the line segments have different slopes (A is nonsingular) */
start IntersectSegsSimple(p1, p2, q1, q2);
   b = colvec(q1 - p1); 
   A = colvec(p2-p1) || colvec(q1-q2); /* nonsingular when segments have different slopes */
   x = solve(A, b);                    /* x = (s,t) */
   if all(0<=x && x<=1) then           /* if x is in [0,1] x [0,1] */
      return (1-x[1])*p1 + x[1]*p2;    /* return intersection */
   else                                /* otherwise, segments do not intersect */
      return ({. .});                  /* return missing values */
finish;
 
/* Generate two random chords on the unit circle.
   Simulate the probability that they intersect  */
N = 1e5;
theta = j(N, 4);
call randseed(123456);
call randgen(theta, "uniform", 0, 2*constant('pi'));
intersect = j(N,1,0);
do i = 1 to N;
   t = theta[i,]`;                 /* 4 random U(0, 2*pi) */
   pts = cos(t) || sin(t);         /* 4 pts on unit circle */
   p1 = pts[1,];    p2 = pts[2,];
   q1 = pts[3,];    q2 = pts[4,];
   intersect[i] = all(IntersectSegsSimple(p1, p2, q1, q2) ^= .);
end;
 
prob = mean(intersect);
print prob;

This simulation produces an estimate that is close to the exact probability of 1/3.

Connection to Bertrand's Paradox

This problem has an interesting connection to Bertrand's Paradox. Bertrand's paradox shows the necessity of specifying the process that is used to define the random variables in a probability problem. It turns out that there are multiple ways to define "random chords" in a circle, and the different definitions can lead to different answers to probability questions. See the Wikipedia article for an example.

For the definition of "random chords" in this problem, the density of the endpoints is uniform on the circle. After you make that choice, other distributions are determined. For example, the distribution of the lengths of 1,000 random chords is shown below. The lengths are NOT uniformly distributed! The theoretical density of the chord lengths is overlaid on the distribution of the sample. If you change the process by which chords are randomly chosen (for example, you force the lengths to be uniformly distributed), you might also change the answer to the problem, as shown in Bertrand's Paradox.

Distribution of lengths of random chords in the unit circle, generated by choosing uniformly distributed endpoints

The post The probability that two random chords of a circle intersect appeared first on The DO Loop.

6月 152018
 

My 2018 SAS Global Forum paper was about "how to use the random-number generators (RNGs) in SAS." You can read the paper for details, but I recently recorded a short video that summarizes the main ideas in the paper. In particular, the video gives an overview of the new RNGs in SAS, which include the following:

  • MTHYBRID, MT2002, and MT64: Variants of the Mersenne twister RNG. The MTHYBRID method is the default RNG in SAS, beginning with the SAS 9.4M3.
  • PCG: A 64-bit permuted congruential generator
  • RDRAND: A hardware-based RNG that generates random numbers from thermal noise in the chip
  • TF2 and TF4: Counter-based Threefry RNGs

If your browser does not support embedded video, you can go directly to the video on YouTube.

The following references provide more information about the random number generators in SAS:

The post Video: New random number generators in SAS appeared first on The DO Loop.

6月 062018
 

The SURVEYSELECT procedure in SAS 9.4M5 supports the OUTRANDOM option, which causes the selected items in a simple random sample to be randomly permuted after they are selected. This article describes several statistical tasks that benefit from this option, including simulating card games, randomly permuting observations in a DATA step, assigning a random ID to patients in a clinical study, and generating bootstrap samples. In each case, the new OUTRANDOM option reduces the number of statements that you need to write. The OUTRANDOM option can also be specified by using OUTORDER=RANDOM.

Sample data with PROC SURVEYSELECT

Often when you draw a random sample (with or without replacement) from a population, the order in which the items were selected is not important. For example, if you have 10 patients in a clinical trial and want to randomly assign five patients to the control group, the control group does not depend on the order in which the patients were selected. Similarly, in simulation studies, many statistics (means, proportions, standard deviations,...) depend only on the sample, not on the order in which the sample was generated. For these reasons, and for efficiency, the SURVEYSELECT procedure in SAS uses a "one-pass" algorithm to select observations in the same order that they appear in the "population" data set.

However, sometimes you might require the output data set from PROC SURVEYSELECT to be in a random order. For example, in a poker simulation, you might want the output of PROC SURVEYSELECT to represent a random shuffling of the 52 cards in a deck. To be specific, the following DATA step generates a deck of 52 cards in order: Aces first, then 2s, and so on up to jacks, queens, and kings. If you use PROC SURVEYSELECT and METHOD=SRS to select 10 cards at random (without replacement), you obtain the following subset:

data CardDeck;
length Face $2 Suit $8;
do Face = 'A','2','3','4','5','6','7','8','9','10','J','Q','K';
   do Suit = 'Clubs', 'Diamonds', 'Hearts', 'Spades';
      CardNumber + 1;
      output;
   end;
end;
run;
 
/* Deal 10 cards. Order is determined by input data */
proc surveyselect data=CardDeck out=Deal noprint
     seed=1234 method=SRS /* sample w/o replacement */
     sampsize=10;         /* number of observations in sample */
run;
 
proc print data=Deal; run;
Random sample without replacement from a card deck. The sample is in the same order as the data.

Notice that the call to PROC SURVEYSELECT did not use the OUTRANDOM option. Consequently, the cards are in the same order as they appear in the input data set. This sample is adequate if you want to simulate dealing hands and estimate probabilities of pairs, straights, flushes, and so on. However, if your simulation requires the cards to be in a random order (for example, you want the first five observations to represent the first player's cards), then clearly this sample is inadequate and needs an additional random permutation of the observations. That is exactly what the OUTRANDOM option provides, as shown by the following call to PROC SURVEYSELECT:

/* Deal 10 cards in random order */
proc surveyselect data=CardDeck out=Deal2 noprint
     seed=1234 method=SRS /* sample w/o replacement */
     sampsize=10          /* number of observations in sample */
     OUTRANDOM;           /* SAS/STAT 14.3: permute order */
run;
 
proc print data=Deal2; run;
Random sample without replacement from a card deck. The sample is then permuted into a random order.

You can use this sample when the output needs to be in a random order. For example, in a poker simulation, you can now assign the first five cards to the first player and the second five cards to a second player.

Permute the observations in a data set

A second application of the OUTRANDOM option is to permute the rows of a SAS data set. If you sample without replacement and request all observations (SAMPRATE=1), you obtain a copy of the original data in random order. For example, the students in the Sashelp.Class data set are listed in alphabetical order by their name. The following statements use the OUTRANDOM option to rearrange the students in a random order:

/* randomly permute order of observations */
proc surveyselect data=Sashelp.Class out=RandOrder noprint
     seed=123 method=SRS /* sample w/o replacement */
     samprate=1          /* proportion of observations in sample */
     OUTRANDOM;          /* SAS/STAT 14.3: permute order */
run;
 
proc print data=RandOrder; run;
Use PROC SURVEYSELECT to produce a random ordering of a SAS data set

There are many other ways to permute the rows of a data set, such as adding a uniform random variable to the data and then sorting. The two methods are equivalent, but the code for the SURVEYSELECT procedure is shorter to write.

Assign unique random IDs to patients in a clinical trial

Another application of the OUTRANDOM option is to assign a unique random ID to participants in an experimental trial. For example, suppose that four-digit integers are used for an ID variable. Some clinical trials assign an ID number sequentially to each patient in the study, but I recently learned from a SAS discussion forum that some companies assign random ID values to subjects. One way to assign random IDs is to sample randomly without replacement from the set of all ID values. The following DATA step generates all four-digit IDs, selects 19 of them in random order, and then merges those IDs with the participants in the study:

data AllIDs;
do ID = 1000 to 9999;  /* create set of four-digit ID values */
   output;
end;
run;
 
/* randomly select 19 unique IDs */
proc surveyselect data=AllIDs out=ClassIDs noprint
     seed=12345 method=SRS  /* sample w/o replacement */
     sampsize=19            /* number of observations in sample */
     OUTRANDOM;             /* SAS/STAT 14.3: permute order */
run;
 
data Class;
   merge ClassIDs Sashelp.Class;   /* merge ID variable and subjects */
run;
 
proc print data=Class; var ID Name Sex Age; run;

Random order for other sampling methods

The OUTRANDOM option also works for other sampling schemes, such as sampling with replacement (METHOD=URS, commonly used for bootstrap sampling) or stratified sampling. If you use the REPS= option to generate multiple samples, each sample is randomly ordered.

It is worth mentioning that the SAMPLE function in SAS/IML also can to perform a post-selection sort. Suppose that X is any vector that contains N elements. Then the syntax SAMPLE(X, k, "NoReplace") generates a random sample of k elements from the set of N. The documentation states that "the elements ... might appear in the same order as in X." This is likely to happen when k is almost equal to N. If you need the sample in random order, you can use the syntax SAMPLE(X, k, "WOR") which adds a random sort after the sample is selected, just like PROC SURVEYSELECT does when you use the OUTRANDOM option.

The post Sample and obtain the results in random order appeared first on The DO Loop.

5月 092018
 

In a previous blog post, I discussed ways to produce statistically independent samples from a random number generator (RNG). The best way is to generate all samples from one stream. However, if your program uses two or more SAS DATA steps to simulate the data, you cannot use the same seed value multiple times because that could introduce biases into your study. An easy way to deal with this situation is to use the CALL STREAM subroutine to set different key values for each DATA step.

An example of using two DATA steps to generate random numbers

Suppose that you have a simulation that needs to call the DATA step twice, and each call generates random numbers. If you want to encapsulate that functionality into a macro call, you might want to pass in the seed for the random number streams. An INCORRECT approach is to do the following:

%macro DoItWRONG(seed=);
data A;
   call streaminit('PCG', &seed);
   /* ... do something with random numbers ... */
run;
data B;
   call streaminit('PCG', &seed);       /* <== uh-oh! Problem here! */
   /* ... do something else with random numbers ... */
run;
%mend;

This code is unacceptable because the same seed value and RNG are used for both DATA steps. That means that the stream of random numbers is the same for the second DATA step as for the first, which can lead to correlations that can bias your study.

Prior to SAS 9.4M5, the only way to resolve this issue was to use different seed values for each DATA step. For example, you could use &seed + 1 in the second DATA step. Because the Mersenne twister RNG (the only RNG in SAS prior to 9.4M5) has an astronomically large period, streams that use different seed values have an extremely small probability of overlapping.

Use the CALL STREAM subroutine

SAS 9.4M5 introduced new random number generators. The Mersenne twister algorithm supports one parameter (the seed) which initializes each stream. The Threefry and PCG algorithms support two initialization parameters. To prevent confusion, the first parameter is called the seed value and the second is called the key value. By default, the key value is zero, but you can use the CALL STREAM subroutine to set a positive key value.

You should use the key value (not the seed value) to generate independent streams in two DATA steps. The following statements are true:

  • The PCG generator is not guaranteed to generate independent streams for different seed values. However, you can use the key value to ensure independent streams. Two streams with the same seed and different key values never overlap.
  • The Threefry generators (TF2 and TF4) generate independent streams for different seed values. However, the Threefry generators also accept key values, and two streams with different key values never overlap.
  • The Mersenne twister algorithm does not inherently support a key value. However, the STREAM routine creates a new seed from the current seed and the specified key value. Because of the long-period property of the MT algorithm, two streams that use the same seed and different key values are "independent with high probability."

The following example modifies the previous macro to use the CALL stream routine. For the PCG RNG, two streams that have the same seed and different key values do not overlap:

%macro DoIt(seed=);
data A;
   call streaminit('PCG', &seed);
   call stream(1); 
   /* ... do something with random numbers ... */
run;
data B;
   call streaminit('PCG', &seed);
   call stream(2);                  /* <== different stream, guaranteed */
   /* ... do something else with random numbers ... */
run;
%mend;

Of course, this situation generalizes. If the macro uses three (or 10 or 20) DATA steps, each of which generates random numbers, you can use a different key value in each DATA step. If you initialize each DATA step with the same seed and a different key, the DATA steps generate independent streams (for the Threefry and PCG RNGs) or generate independent streams with high probability (for the Mersenne twister RNG).

A real example of using different streams

This section provides a real-life example of using the CALL STREAM subroutine. Suppose a researcher wants to create a macro that simulates many samples for a regression model. The macro first creates a data set that contains p independent continuous variables. The macro then simulates B response variables for a regression model that uses the independent variables from the first data set.

To prevent the second DATA step from using the same random number stream that was used in the first DATA step, the second DATA step calls the STREAM subroutine to set a key value:

%macro Simulate(seed=, N=, NCont=,  /* params for indep vars */
                key=, betas=,       /* params for dep vars */
                NumSamples=);       /* params for MC simul */
/* Step 1: Generate nCont explanatory variables */
data MakeData;
   call streaminit('PCG', &seed);
   /* if you do not call STREAM, you get the stream for key=0 */
   array x[&NCont];         /* explanatory vars are named x1-x&nCont  */
 
   do ObsNum = 1 to &N;     /* for each observation in the sample  */
      do j = 1 to &NCont;
         x[j] = rand("Normal"); /* simulate NCont explanatory variables   */
      end;
      output;
    end;
    keep ObsNum x:;
run;
 
/* Step 2: Generate NumSamples responses, each from the same regression model 
   Y = X*beta + epsilon, where epsilon ~ N(0, 0.1)  */
data SimResponse;
   call streaminit('PCG', &seed);
   call stream(&key); 
   keep SampleId ObsNum x: Y;
   /* define regression coefficients beta[0], beta[1], beta[2], ... */
   array beta[0:&NCont] _temporary_ (&betas);
 
   /* construct linear model eta = beta[0] + beta[1]*x1 + betat[2]*x2 + ... */
   set MakeData;                    /* for each obs in data ...*/
   array x[&nCont] x1-x&NCont;      /* put the explanatory variables into an array */
   eta = beta[0];
   do i = 1 to &NCont;
      eta = eta + beta[i] * x[i];
   end;
 
   /* add random noise for each obs in each sample */
   do SampleID = 1 to &NumSamples;
      Y = eta + rand("Normal", 0, 0.1); 
      output;
   end;
run;
 
/* sort to prepare data for BY-group analysis */
proc sort data=SimResponse;
   by SampleId ObsNum;
run;
%mend;

For a discussion of how the simulation works, see the article "Simulate many samples from a linear regression model." In this example, the macro uses two DATA steps to simulate the regression data. The second DATA step uses the STREAM subroutine to ensure that the random numbers it generates are different from those used in the first DATA step.

For completeness, here's how you can call the macro to generate the data and then a call PROC REG with a BY statement to analyze the 1,000 regression models.

%Simulate(seed=123, N=50, NCont=3,         /* params for indep vars */
          key=1, Betas = 1 -0.5 0.5 -0.25, /* params for dep vars */
          NumSamples=1000)                 /* params for MC sim */
 
proc reg data=SimResponse noprint outest=PE;
   by SampleID;
   model Y = x1-x3;
run;
 
title "Parameter Estimates for key=1";
proc print data=PE(obs=5) noobs;
   var SampleID Intercept x1-x3;
run;

You can verify that the macro generates different simulated data if you use the same seed value but change the key value in the macro call.

In summary, this article shows that you can use the STREAM call in the SAS DATA step to generate different random number streams from the same seed value. For many simulation studies, the STREAM subroutine is not necessary. However, it useful when the simulation is run in stages or is implemented in modules that need to run independently.

The post Independent streams of random numbers in SAS appeared first on The DO Loop.

5月 072018
 

Simulation studies require both randomness and reproducibility, two qualities that are sometimes at odds with each other. A Monte Carlo simulation might need to generate millions of random samples, where each sample contains dozens of continuous variables and many thousands of observations. In simulation studies, the researcher wants each sample to be statistically independent. For example, if you are drawing samples of size N=100 from a continuous distribution, it is not acceptable for the millionth sample to equal the first sample. Such a situation would bias the study. Similarly, it is not acceptable for half (or a fourth, or an eighth,...) of the numbers in the millionth sample to be the same as the numbers in the first sample.

The easiest way to ensure that each sample is statistically independent is to generate all samples from one high-quality random number stream. However, if the structure of your simulation requires that you use two or more streams, then SAS 9.4M5 provides several ways to produce random samples that are statistically independent. One way is to use one of the new Threefry random number generators. The other way is to use the new CALL STREAM subroutine.

Independence and overlap in random number streams

I've previously written about the many random number generators (RNGs) in SAS. Recall that all RNGs produce uniform variates. When you call a RNG repeatedly, you are generating a stream of random numbers. A stream is the sequence of pseudorandom uniform values that is generated from a given seed value. (If you request a random value from a nonuniform distribution, such as normal or exponential, the software generates one or more uniform variates and then applies a mathematical transformation to produce a nonuniform variate.)

In SAS, the modern RNGs are high quality and have a very long period. For all practical purposes, the samples that are generated from a single stream are independent. However, for the Mersenne twister and PCG RNGs, samples from different streams might not be independent. It might happen that the first sample for one stream is equal to (or has a large intersection with) the millionth sample from another seed, as shown in the image to the right. This phenomenon is known as overlap of streams.

The possibility that streams might overlap is important because some programmers use a macro loop to generate thousands of samples, where each sample is from a different stream. Not only is this technique inefficient, but the birthday problem in probability warns us that the probability that two streams overlap is larger than our intuition suggests.

The probability of overlapping streams is usually small, even for very large simulation studies. The "number of birthdays" equals the period of the RNG, which for Mersenne twister RNG is astronomically huge (≈ 106000). However, for the PCG RNG (which has a smaller period of "only" 264 ≈ 1019), the probability of an overlap could be nonnegligible.

Vigna (2009) estimates the probability of overlap when you use the first L variates in each of n streams from a RNG that has period P. Assuming that nL/P ≪ 1, the probability is approximately n2L/P. When P = 1019 and you generate n = 10,000 different streams, each containing L = 1,000,000 variates, the probability of an overlap is approximately 10-5.

Fortunately, SAS makes it it is easy to ensure the independence of random samples:

  • As shown in Simulating Data with SAS and in many blog posts, you can almost always use a single stream to generate all the random samples.
  • You can use one of the Threefry generators, TF2 or TF4. For a Threefry generator, two streams with different seed values never overlap! Similarly, streams for the hardware-based RDRAND generator never overlap.
  • You can use the CALL STREAM subroutine to set a key value for the PCG algorithm. For the PCG RNG, streams that have the same seed value but different key values never overlap.

In my next blog post, I will demonstrate how to use the STREAM subroutine in the SAS DATA step to generate independent streams. The technique can be useful when you are running a simulation that requires several DATA steps, each of which generates random numbers.

The post Independence and overlap in streams of random numbers appeared first on The DO Loop.

4月 162018
 

A colleague and I recently discussed how to generate random permutations without encountering duplicates. Given a set of n items, there are n! permutations My colleague wants to generate k unique permutations at random from among the total of n!. Said differently, he wants to sample without replacement from the set of all possible permutations.

In SAS, you can use the RANPERM function to generate ranom permutations. For example, the statement p = ranperm(4) generates a random permutation of the elements in the set {1, 2, 3, 4}. However, the function does not ensure that the permutations it generates will be unique.

I can think of two ways to generate random permutations without replacement. The first is to generate all permutations and then choose a random subset of size k without replacement from the set of n! possibilities. This technique is efficient when n is small or when k is a substantial proportion of n!. The second technique is to randomly generate permutations until you get a total of k that are unique. This technique is efficient when n is large and k is small relative to n!.

Generate all permutations, then sample without replacement

For small sets (say, n ≤ 8), an efficient way to generate random permutations is to generate all permutations and then extract a random sample without replacement. In the SAS/IML language you can use the ALLPERM function to generate all permutations of a set of n items, where n ≤ 18. The ALLPERM function returns a matrix that has n! rows and n columns. Each row is a permutation. You can then use the SAMPLE function to sample without replacement from among the rows, as follows:

proc iml;
call randseed(123);
n = 4;                       /* permutations of n items */
k = 3;                       /* want k unique permutation */
 
p = allperm(n);               /* matrix has n! rows; each row is a permutation */
rows = sample(1:n, k, "WOR"); /* random rows w/o replacement */
ranPermWOR = p[rows, ];       /* extract rows */
print ranPermWOR;

Generate random permutations, then check for uniqueness

The matrix of all permutations has n! rows, which gets big fast. If you want only a small number of permutations from among a huge set of possible permutations, it is more efficient to use the RANPERM function to generate permutations, then discard duplicates. Last week I showed how to eliminate duplicate rows from a numeric matrix so that the remaining rows are unique. The following SAS/IML statements use the UniqueRows function, which is defined in the previous post. You must define or load the module before you can use it.

/* <define or load the DupRows and UniqueRows modules HERE> */
 
n = 10;                      /* permutations of n items */
k = 5;                       /* want k unique permutation; k << n! */
p = ranperm(n, 2*k);         /* 2x more permutations than necessary */
U = UniqueRows(p);           /* get unique rows of 2k x n matrix */
if nrow(U) >= k then U = U[1:k, ];  /* extract the first k rows */
else do;
   U = UniqueRows( U // ranperm(n, 10*k) ); /* generate more... repeat as necessary */
   if nrow(U) >= k then U = U[1:k, ];  /* extract the first k rows */
end;
print U;

Notice in the previous statements that the call to RANPERM requests 2*k random permutations, even though we only want k. You should ask for more permutations than you need because some of them might be duplicates. The factor of 2 is ad hoc; it is not based on any statistical consideration.

If k is much smaller than n!, then you might think that the probability of generating duplicate a permutation is small. However, the Birthday Problem teaches us that duplicates arise much more frequently than we might expect, so it is best to expect some duplicates and generate more permutations than necessary. When k is moderately large relative to n!, you might want to use a factor of 5, 10, or even 100. I have not tried to compute the number of permutations that would generate k unique permutations with 95% confidence, but that would be a fun exercise. In my program, if the initial attempt generates fewer than k unique rows, it generates an additional 10*k permutations. You could repeat this process, if necessary.

In summary, if you want to generate random permutations without any duplicates, you have at least two choices. You can generate all permutations and then extract a random sample of k rows. This method works well for small values of n, such as n ≤ 8. For larger values of n, you might want to generate random permutation (more than k) and use the first k unique rows of the matrix of permutations.

The post Random permutations without duplicates appeared first on The DO Loop.

3月 262018
 
Zipper plot for normally distributed data

Simulation studies are used for many purposes, one of which is to examine how distributional assumptions affect the coverage probability of a confidence interval. This article describes the "zipper plot," which enables you to compare the coverage probability of a confidence interval when the data do or do not follow certain distributional assumptions.

I saw the zipper plot in a tutorial about how to design, implement, and report the results of simulation studies (Morris, White, Crowther, 2017). An example of a zipper plot is shown to the right. The main difference between the zipper plot and the usual plot of confidence intervals is that the zipper plot sorts the intervals by their associated p-values, which causes the graph to look like a garment's zipper. The zipper plot in this blog post is slightly different from the one that is presented in Morris et al. (2017, p. 22), as discussed at the end of this article.

Coverage probabilities for confidence intervals

A formulas for a confidence interval (CI) is based on an assumption about the distribution of the parameter estimates on a simple random sample. For example, the two-sided 95% confidence interval for the mean of normally distributed data has upper and lower limits given by the formula
x̄ ± t* s / sqrt(n)
where x̄ is the sample mean, s is the sample standard deviation, n is the sample size, and t* is the 97.5th percentile of the t distribution with n – 1 degrees of freedom. You can confirm that the formula is a 95% CI by simulating many N(0,1) samples and computing the CI for each sample. About 95% of the samples should contain the population mean, which is 0.

The central limit theorem indicates that the formula is asymptotically valid for sufficiently large samples nonnormal data. However, for small nonnormal samples, the true coverage probability of those intervals might be different from 95%. Again, you can estimate the coverage probability by simulating many samples, constructing the intervals, and counting how many time the mean of the distribution is inside the confidence interval. A previous article shows how to use simulation to estimate the coverage probability of this formula.

Zipper plots and simulation studies of confidence intervals

Zipper plot that compares the empirical coverage probability of confidence intervals that assume normality. The data samples are size N=50 drawn from the normal or exponential distribution.

The graph to the right shows the zipper plot applied to the results of the two simulations in the previous article. The zipper plot on the left visualizes the estimate of the coverage probability 10,000 for normal samples of size n = 50. (The graph at the top of this article provides more detail.) The Monte Carlo estimate of the coverage probability of the CIs is 0.949, which is extremely close to the true coverage of 0.95. The Monte Carlo 95% confidence interval, shown by a blue band, is [0.945, 0.953], which includes 0.95. You can see that the width of the confidence intervals does not seem to depend on the sample mean: the samples whose means are "too low" tend to produce intervals that are about the same length as the intervals for the samples whose means are "too high."

The zipper plot on the right is for 10,000 random samples of size n = 50 that are drawn from an exponential distribution with mean 0 and unit scale parameter. (The graph at this link provides more detail.) The Monte Carlo estimate of the coverage probability is 0.9360. The Monte Carlo 95% confidence interval, shown by a blue band, is [0.931, 0.941] and does not include 0.95. Notice that the left side of the "zipper" tends to contain intervals that are shorter than those on the right side. This indicates that samples whose means are "too low" tend to produce shorter confidence intervals than the samples whose means are "too high."

The zipper plot enables you to compare the results of two simulations that generate data from different distributions. The zipper plot enables you to visualize the fact that the nominal 95% CIs do, in fact, have empirical coverage of about 95% for normal data, whereas the intervals have about 93.6% empirical coverage for the exponential data.

If you want to experiment with the zipper plot or use it for your own simulation studies in SAS, you can download the SAS program that generates the graphs in this article.

Differences from the "zip plot" of Morris et al.

There are some minor differences between the zipper plot in this article and the graph in Morris et al. (2017, p. 22).

  • Morris et al., use the term "zip plot." However, statisticians use "ZIP" for the zero-inflated Poisson distribution, so I prefer the term "zipper plot."
  • Morris et al. bin the p-values into 100 centiles and graph the CIs against the centiles. This has the advantage of being able to plot the CI's from thousands or millions of simulations in a graph that uses only a few hundred vertical pixels. In contrast, I plot the CI's for each fractional rank, which is the rank divided by the number of simulations. In the SAS program, I indicate how to compute and use the centiles.
  • Morris et al. plot all the centiles. Consequently, the interesting portion of the graph occupies only about 5-10% of the total graph area. In contrast, I display only the CIs whose fractional rank is less than some cutoff value, which is 0.2 in this article. Thus my zipper plots are a "zoomed in" version of the ones that appear in Morris et al. In the SAS program, you can set the FractionMaxDisplay macro variable to 1.0 to show all the centiles.

The post A zipper plot for visualizing coverage probability in simulation studies appeared first on The DO Loop.

2月 072018
 
Multiple-Birthday Problem: The distribution of the number of shared birthdays among 23 random people

If N random people are in a room, the classical birthday problem provides the probability that at least two people share a birthday. The birthday problem does not consider how many birthdays are in common. However, a generalization (sometimes called the Multiple-Birthday Problem) examines the distribution of the number of shared birthdays. Specifically, among N people, what is the probability that exactly k birthdays are shared (k = 1, 2, 3, ..., floor(N/2))? The bar chart at the right shows the distribution for N=23. The heights of the bars indicate the probability of 0 shared birthdays, 1 shared birthday, and so on.

This article uses simulation in SAS to examine the multiple-birthday problem. If you are interested in a theoretical treatment, see Fisher, Funk, and Sams (2013, p. 5-14).

The probability of multiple shared birthdays for N=23

You can explore the classical birthday problem by using probability theory or you can estimate the probabilities by using Monte Carlo simulation. The simulation in this article assumes 365 equally distributed birthdays, but see my previous article to extend the simulation to include leap days or a nonuniform distribution of birthdays.

The simulation-based approach enables you to investigate the multiple birthday problem. To begin, consider a room that contains N=23 people. The following SAS/IML statements are taken from my previous article. The Monte Carlo simulation generates one million rooms of size 23 and estimates the proportion of rooms in which the number of shared birthdays is 0, 1, 2, and so forth.

proc iml;
/* Function that simulates B rooms, each with N people, and counts the number of shared 
   (matching) birthdays in each room. The return value is a row vector that has B counts. */
start BirthdaySim(N, B);                 
   bday = Sample(1:365, B||N);             /* each column is a room */
   match = N - countunique(bday, "col");   /* number of unique birthdays in each col */
   return ( match );
finish;
 
call randseed(12345);                      /* set seed for random number stream */
NumPeople = 23;
NumRooms = 1e6;
match = BirthdaySim(NumPeople, NumRooms);  /* 1e6 counts */
 
call tabulate(k, freq, match);  /* summarize: k=number of shared birthdays, freq=counts */ 
prob = freq / NumRooms;         /* proportion of rooms that have 0, 1, 2,... shared birthdays */
print prob[F=BEST6. C=(char(k,2)) L="Estimate of Probability (N=23)"];
Multiple-Birthday Problem: The distribution of shared birthdays among 23 random people

The output summarizes the results. In less than half the rooms, no person shared a birthday with anyone else. In about 36% of the rooms, one birthday is shared by two or more people. In about 12% of the room, there were two birthdays that were shared by four or more people. About 2% of the rooms had three birthdays shared among six or more individuals, and so forth. In theory, there is a small probability of a room in which 8, 9, 10, or 11 birthdays are shared, but the probability of these events is very small. The probability estimates are plotted in the bar chart at the top of this article.

There is a second way to represent this probability distribution: you can create a stacked bar chart that shows the cumulative probability of k or fewer shared birthdays for k=1, 2, 3,.... You can see that the probability of 0 matching birthdays is less than 50%, the probability of 1 or fewer matches is 87%, the probability of 2 or fewer matches is 97%, and so forth. The probability for 5, 6, or 7 matches is not easily determined from either chart.

Multiple-Birthday Problem: The cumulative distribution of shared birthdays among 23 random people

Mathematically speaking, the information in the stacked bar chart is equivalent to the information in the regular bar chart. The regular bar chart shows the PMF (probability mass function) for the distribution whereas the stacked chart shows the CDF (cumulative distribution function).

The probability of shared birthdays among N people

The previous section showed the distribution of shared birthdays for N=23. You can download the SAS program that simulates the multiple birthday problem for rooms that contain N=2, 3, ..., and 60 people. For each simulation, the program computes estimates for the probabilities of k matching birthdays, where k ranges from 0 to 30.

Multiple-Birthday Problem: Three distributions of the number of shared birthdays among N=25, 40, and 60 random people

There are a few ways to visualize those 59 distributions. One way is to plot 59 bar charts, either in a panel or by using an animation. A panel of three bar charts is shown to the right for rooms of size N=25, 40, and 60. You can see that the bar chart for N=25 is similar to the one shown earlier. For N=40, most rooms have between one and three shared birthdays. For rooms with N=60 people, most rooms have between three and six shared birthdays. The horizontal graphs are truncated at k=12, even though a few simulations generated rooms that contain more than 12 matches.

Obviously plotting all 59 bar charts would require a lot of space. A different way to visualize these 59 distributions is to place 59 stacked bar charts side by side. This visualization is more compact. In fact, instead of 59 stacked bar charts, you might want to create a stacked band plot, which is what I show below.

The following stacked band chart is an attempt to visualize the distribution of shared birthdays for ALL room sizes N=2, 3, ..., 60. (Click to enlarge.)

Multiple-Birthday Problem: The cumulative distribution of the probability of k or fewer shared birthdays among N random people, N=2,3,...,60

How do you interpret this graph? A room that contains N=23 people corresponds to drawing a vertical line at N=23. That vertical line intersects the red, green, and brown bands near 0.5, 0.87, and 0.97. Therefore, these are the probabilities that a room with 23 people contains 0 shared birthdays, 1 or fewer shared birthdays, and 2 or fewer shared birthdays. The vertical heights of the bands qualitatively indicate the individual probabilities.

Next consider a room that contains N=40 people. A vertical line at N=40 intersects the colored bands at 0.11, 0.37, 0.66, 0.86, and 0.95. These are the cumulative probabilities P(k ≤ 0), P(k ≤ 1), P(k ≤ 2), and so forth. If you need the exact values, you can use PROC PRINT to display the actual probability estimates. For example:

proc print data=BDayProb noobs;
   where N=40 and k < 5;
run;
Multiple-Birthday Problem: The distribution of shared birthdays among N=40 random people

Summary

In summary, this article examines the multiple-birthday problem, which is a generalization of the classical birthday problem. If a room contains N random people, the multiple-birthday problem examines the probability that k birthdays are shared by two or more people for k = 0, 1, 2, ..., floor(N/2). This article shows how to use Monte Carlo simulation in SAS to estimate the probabilities. You can visualize the results by using a panel of bar charts or by using a stacked band plot to visualize the cumulative distribution. You can download the SAS program that simulates the multiple birthday problem, estimates the probabilities, and visualizes the results.

What do you think? Do you like the stacked band plot that summarizes the results in one graph, or do you prefer a series of traditional bar charts? Leave a comment.

The post The distribution of shared birthdays in the Birthday Problem appeared first on The DO Loop.

2月 052018
 
The Birthday Problem: The probability of a matching birthday in a room of that contains N people (N=2,3,...,60). You can simulate the birthday problem in SAS and compute a Monte Carlo estimate of the probability of a match.

This article simulates the birthday-matching problem in SAS. The birthday-matching problem (also called the birthday problem or birthday paradox) answers the following question: "if there are N people in a room, what is the probability that at least two people share a birthday?" The birthday problem is famous because the probability of duplicate birthdays is much higher than most people would guess: Among 23 people, the probability of a shared birthday is more than 50%.

If you assume a uniform distribution of birthdays, the birthday-matching problem can be solved exactly. In reality birthdays are not uniformly distributed, but as I show in Statistical Programming with SAS/IML Software (Wicklin 2010), you can use a Monte Carlo simulation to estimate the probability of a matching birthday.

This article uses ideas from Wicklin (2010) to simulate the birthday problem in the SAS/IML language. The simulation is simplest to understand if you assume uniformly distributed birthdays (p. 335–337), but it is only slightly harder to run a simulation that uses an empirical distribution of birthdays (p. 344–346). For both cases, the simulation requires only a few lines of code.

Monte Carlo estimates of the probabilities in the birthday problem

The birthday-matching problem readily lends itself to Monte Carlo simulation. Represent the birthdays by the integers {1, 2, ..., 365} and do the following:

  1. Choose a random sample of size N (with replacement) from the set of birthdays. That sample represents a "room" of N people.
  2. Determine whether any of the birthdays in the room match. There are N(N-1)/2 pairs of birthdays to check.
  3. Repeat this process for thousands of rooms and compute the proportion of rooms that contain a match. This is a Monte Carlo estimate of the probability of a shared birthday.
  4. If desired, simulate rooms of various sizes, such as N=2, 3, 4,..., 60, and plot the results.

How to simulate the birthday problem in SAS

Before running the full simulation, let's examine a small-scale simulation that simulates only 10 rooms, each containing a random sample of N = 23 people:

proc iml;
call randseed(12345);                 /* set seed for random number stream */
N = 23;                               /* number of people in room */
NumRooms = 10;                        /* number of simulated rooms */
bday = Sample(1:365, NumRooms||N);    /* simulate: each column is a room that contains N birthdays */
*print bday;                          /* optional: print the 23 x 10 matrix of birthdays */
match = N - countunique(bday, "col"); /* number of unique birthdays in each col */
print match[label="Num Matches (N=23)" c=("Rm1":"Rm10")];
 
/* estimate the probability of a match, based on 10 simulations */
probMatch = (match > 0)[:];          /* mean = proportion of matches */
print probMatch;
Simulate the birthday problem in SAS and estimate the probability of a match

The output shows the number of matches in 10 rooms, each with 23 people. The first room did not contain any people with matching birthdays, nor did rooms 3, 5, 6, 7, and 10. The second room contained one matching birthday, as did rooms 8 and 9. The fourth room contains two shared birthdays. In this small simulation, 4/10 of the rooms contain a shared birthday, so the Monte Carlo estimate of the probability is 0.4. You can compute the estimate by forming the binary indicator variable (match > 0) and computing the mean. (Recall that the mean of a binary vector is the proportion of 1s.)

Notice that the simulation requires only three SAS/IML vectorized statements. The SAMPLE function produces a matrix that has N rows. Each column of the matrix represents the birthdays of N people in a room. The COUNTUNIQUE function returns a row vector that contains the number of unique birthdays in each column. If you subtract that number from N, you obtain the number of shared birthdays. Finally, the estimate is obtained by computing the mean of an indicator variable.

Simulate the birthday problem in SAS

For convenience, the following SAS/IML statements define a function that simulates B rooms, each containing a random sample of N birthdays. The function returns a row vector of size B that contains the number of matching birthdays in each room. The program calls the function in a loop and graphs the results.

start BirthdaySim(N, B);
   bday = Sample(1:365, B||N);           /* each column is a room */
   match = N - countunique(bday, "col"); /* number of unique birthdays in each col */
   return ( match );
finish;
 
maxN = 60;                    /* consider rooms with 2, 3, ..., maxN people */
NumPeople = 2:maxN;
probMatch = j(MaxN, 1, 0);    /* allocate vector for results */
NumRooms = 10000;             /* number of simulated rooms */
 
do N = 1 to ncol(NumPeople);
   match =  BirthdaySim(NumPeople[N], NumRooms);
   probMatch[N] = (match > 0)[:];
end;
 
title "The Birthday Matching Problem";
call series(NumPeople, probMatch) grid={x y} 
     label={"Number of People in Room" "Probability of a Shared Birthday"};

The graph is shown at the top of this article. It shows the estimated probability of one or more matching birthdays in a room of size N for N=2, 3, ..., 60. The estimates are based on 10,000 Monte Carlo iterations. (Why did I use 10,000 rooms? See the Appendix.) As mentioned earlier, the probability exceeds 50% when a room contains 23 people. For 30 people, the probability of a match is 71%. If you put 50 people in a room, the chances that two people in the room share a birthday is about 97%.

Allow leap days and a nonuniform distribution of birthdays

The curious reader might wonder how this analysis changes if you account for people born on leap day (February 29th). The answer is that the probabilities of a match decrease very slightly because there are more birthdays, and you can calculate the probability analytically. If you are interested in running a simulation that accounts for leap day, use the following SAS/IML function:

start BirthdaySim(N, B);
   /* A four-year period has 3*365 + 366 = 1461 days. Assuming uniformly distributed 
      birthdays, the probability vector for randomly choosing a birthday is as follows: */
   p = j(366, 1, 4/1461);      /* most birthdays occur 4 times in 4 years */
   p[31+29] = 1/1461;          /* leap day occurs once in 4 years */
   /* sample with replacement using this probability of selection */
   bday = Sample(1:366, B||N, "replace", p);           /* each column is a room */
   match = N - countunique(bday, "col"); /* number of unique birthdays in each col */
   return ( match );
finish;

If you run the simulation, you will find that the estimated probability of sharing a birthday among N people does not change much if you include leap-day birthdays. The difference between the estimated probabilities is about 0.003 on average, which is much less than the standard error of the estimates. In practice, you can exclude February 29 without changing the conclusion that shared birthdays are expected to occur.

In a similar way, you can set the probability vector to be the empirical distribution of birthdays in the population. Wicklin (2010, p. 346) reports that the simulation-based estimates are slightly higher than the exact results that you obtain from assuming a uniform distribution. This agrees with theory: A. G. Munford (TAS, 1977) showed that any nonuniform distribution increases the likelihood of a matching birthday. In other words, in the real world, the probability of a shared birthday is slightly higher than in the idealized mathematical world.

Appendix: How many Monte Carlo simulations are necessary?

Why did I use 10,000 Monte Carlo simulations? Notice that the Monte Carlo estimate in this problem is an estimate of a proportion of a binary variable, which means that you can estimate the standard error. If p̂ is the estimate, then a 95% confidence interval has radius 1.96 sqrt(p̂(1-p̂)/B), where B is the number of Monte Carlo simulations (that is, the number of "rooms"). This quantity is largest when p̂ = 1/2, so if you want, say, two digits of accuracy you can choose a value of B which is large enough so that 1.96 sqrt(0.5(1-0.5)/B) ≤ 0.01. Solving the equation for B give B ≥ 10000. If you choose that many Monte Carlo iterations, then the conservative 95% confidence interval has a half-width no greater than 0.01.

This simple calculation shows that if you use B=10,000 for the number of simulated "room," you can expect the Monte Carlo estimates to be within 0.01 of the true probability 95% of the time. The standard error is largest where the curve is steepest (near N=23). You could add error bars to the graph, but the errors would be too small to see.

The post Simulate the birthday-matching problem appeared first on The DO Loop.

1月 292018
 

What is a random number generator? What are the random-number generators in SAS, and how can you use them to generate random numbers from probability distributions? In SAS 9.4M5, you can use the STREAMINIT function to select from eight random-number generators (RNGs), including five new RNGs. After choosing an RNG, you can use the RAND function to obtain random values from probability distributions.

What is a random-number generator?

A random-number generator usually refers to a deterministic algorithm that generates uniformly distributed random numbers. The algorithm (sometimes called a pseudorandom-number generator, or PRNG) is initialized by an integer, called the seed, which sets the initial state of the PRNG. Each iteration of the algorithm produces a pseudorandom number and advances the internal state. The seed completely determines the sequence of random numbers. Although it is deterministic, a modern PRNG generates extremely long sequences of numbers whose statistical properties are virtually indistinguishable from a truly random uniform process.

An RNG can also refer to a hardware-based device (sometimes called a "truly random RNG") that samples from an entropy source, which is often thermal noise within the silicon of a computer chip. In the Intel implementation, the random entropy source is used to initialize a PRNG in the chip, which generates a short stream of pseudorandom values before resampling from the entropy source and repeating the process. A hardware-based RNG is not deterministic or reproducible.

An RNG generates uniformly distributed numbers. If you request random numbers from a nonuniform probability distribution (such as the normal or exponential distributions), SAS will automatically apply mathematical transformations that convert the uniform numbers into nonuniform random variates.

Using random numbers in the real world

In the real world, random numbers are used as the basis for statistical simulations. Financial analysts use simulations of the economy to evaluate the risk of a portfolio. Insurance companies use simulations of natural disasters to assess the impact on their customers. Government workers use simulations of population growth to predict the infrastructure and services that will be required by future generations. In all fields, data analysts generate random values for statistical sampling, bootstrap and resampling methods, Monte Carlo estimation, and data simulation.

For these techniques to be statistically valid, the values that are produced by statistical software must contain a high degree of randomness. The new random-number generators in SAS provide analysts with fast, high-quality, state-of-the-art random numbers.

New random-number generators in SAS

Prior to SAS 9.4M5, SAS supported the Mersenne twister random-number generator (introduced in SAS 9.1) and a hardware-based RNG (supported on Intel CPUs in SAS 9.4M3). SAS 9.4M5 introduces new random-number generators and new variants of the Mersenne twister, as follows:

  • PCG: A 64-bit permuted congruential generator (O’Neill 2014) with good statistical properties.
  • TF2: A 2x64-bit counter-based RNG that is based on the Threefish encryption function in the Random123 library (Salmon et al. 2011). The generator is also known as the 2x64 Threefry generator.
  • TF4: A 4x64-bit counter-based RNG that is also known as the 4x64 Threefry generator.
  • RDRAND: A hardware-based RNG (Intel, 2014) that is repeatedly reseeded by using random values from thermal noise in the chip. The RNG requires an Intel processor (Ivy Bridge and later) that supports the RDRAND instruction.
  • MT1998: (deprecated) The original 1998 32-bit Mersenne twister algorithm (Matsumoto and Nishimura, 1998). This was the default RNG for the RAND function prior to SAS 9.4M3. For a small number of seed values (those exactly divisible by 8,192), this RNG does not produce a sufficiently random stream of numbers. Consequently, other RNGS are preferable.
  • MTHYBRID: (default) A hybrid method that improves the MT1998 method by using the MT2002 initialization for seeds that are exactly divisible by 8,192. This is the default RNG, beginning with the SAS 9.4M3.
  • MT2002: The 2002 32-bit Mersenne twister algorithm (Matsumoto and Nishimura 2002).
  • MT64: A 64-bit version of the 2002 Mersenne twister algorithm.

Choose a random-number generator in SAS

In SAS, you can use the STREAMINIT subroutine to choose an RNG and to set the seed value. You can then use the RAND function to produce high-quality streams of random numbers for many common distributions. The first argument to the RAND function specifies the name of the distribution. Subsequent arguments specify parameters for the distribution.

If you want to use the default RNG (which is 'MTHYBRID'), you can specify only the seed. For example, the following DATA step initializes the default PRNG with the seed value 12345 and then generates five observations from the uniform distribution on (0,1) and from the standard normal distribution:

data RandMT(drop=i);
   call streaminit(12345);   /* choose default RNG (MTHYBRID) and seed=12345 */
   do i = 1 to 5;
      uMT = rand('Uniform'); /* generate random uniform: U ~ U(0,1)           */
      nMT = rand('Normal');  /* generate random normal: N ~ N(mu=0, sigma=1) */
      output;
   end;
run;

If you want to choose a different RNG, you can specify the name of the RNG followed by the seed. For example, the following DATA step initializes the PCG method with the seed value 12345 and then generates five observations from the uniform and normal distributions:

data RandPCG(drop=i);
   call streaminit('PCG', 12345); /* SAS 9.4M5: choose PCG method, same seed */
   do i = 1 to 5;
      uPCG = rand('Uniform'); nPCG = rand('Normal'); output;
   end;
run;

A hardware-based method is initialized by the entropy source, so you do not need to specify a seed value. For example, the following DATA step initializes the RDRAND method and then generates five observations. The three data sets are then merged and printed:

 
data RandHardware(drop=i);
   call streaminit('RDRAND');     /* SAS 9.4M3: Hardware method, no seed */
   do i = 1 to 5;
      uHardware = rand('Uniform'); nHardware = rand('Normal'); output;
   end;
run;
 
data All;
   merge RandMT RandPCG RandHardware;
run;
 
title "Uniform Random Variates for Three RNGs";
proc print data=All noobs; var u:; run;
 
title "Normal Random Variates for Three RNGs";
proc print data=All noobs; var n:; run;

The output shows that each RNG generates a different sequence of numbers. If you run this program yourself, you will generate the same sequences for the first two columns (the MTHYBRID and PCG streams), but your numbers for the RDRAND RNG will be different. Each column of the first table is an independent random sample from the uniform distribution. Similarly, each column of the second table is an independent random sample from the standard normal distribution.

Summary

In SAS 9.4M5, you can choose from eight different random-number generators. You use the STREAMINIT function to select the RNG and use the RAND function to obtain random values. In a future blog post, I will compare the attributes of the different RNGs and discuss the advantages of each. I will also show how to use the new random-number generators to generate random numbers in parallel by running a DS2 program in multiple threads.

The information in this article is taken from my forthcoming SAS Global Forum 2018 paper, "Tips and Techniques for Using the Random-Number Generators in SAS" (Sarle and Wicklin, 2018).

References

The post How to use random-number generators in SAS appeared first on The DO Loop.