1月 282019

This article shows how to use SAS to simulate data that fits a linear regression model that has categorical regressors (also called explanatory or CLASS variables). Simulating data is a useful skill for both researchers and statistical programmers. You can use simulation for answering research questions, but you can also use it generate "fake" (or "synthetic") data for a presentation or blog post when the real data are confidential. I have previously shown how to use SAS to simulate data for a linear model and for a logistic model, but both articles used only continuous regressors in the model.

This discussion and SAS program are based on Chapter 11 in Simulating Data with SAS (Wicklin, 2013, p. 208-209). The main steps in the simulation are as follows. The comments in the SAS DATA program indicate each step:

  1. Macro variables are used to define the number of continuous and categorical regressors. Another macro variable is used to specify the number of levels for each categorical variable. This program simulates eight continuous regressors (x1-x8) and four categorical regressors (c1-c4). Each categorical regressor in this simulation has three levels.
  2. The continuous regressors are simulated from a normal distribution, but you can use any distribution you want. The categorical levels are 1, 2, and 3, which are generated uniformly at random by using the "Integer" distribution. The discrete "Integer" distribution was introduced in SAS 9.4M5; for older versions of SAS, use the %RandBetween macro as shown in the article "How to generate random integers in SAS." You can also generate the levels non-uniformly by using the "Table" distribution.
  3. The response variable, Y, is defined as a linear combination of some explanatory variables. In this simulation, the response depends linearly on the x1 and x8 continuous variables and on the levels of the C1 and C4 categorical variables. Noise is added to the model by using a normally distributed error term.
/* Program based on Simulating Data with SAS, Chapter 11 (Wicklin, 2013, p. 208-209) */
%let N = 10000;                  /* 1a. number of observations in simulated data */
%let numCont =  8;               /* number of continuous explanatory variables */
%let numClass = 4;               /* number of categorical explanatory variables */
%let numLevels = 3;              /* (optional) number of levels for each categorical variable */
data SimGLM; 
  call streaminit(12345); 
  /* 1b. Use macros to define arrays of variables */
  array x[&numCont]  x1-x&numCont;   /* continuous variables named x1, x2, x3, ... */
  array c[&numClass] c1-c&numClass;  /* CLASS variables named c1, c2, ... */
  /* the following statement initializes an array that contains the number of levels
     for each CLASS variable. You can hard-code different values such as (2, 3, 3, 2, 5) */
  array numLevels[&numClass] _temporary_ (&numClass * &numLevels);  
  do k = 1 to &N;                 /* for each observation ... */
    /* 2. Simulate value for each explanatory variable */ 
    do i = 1 to &numCont;         /* simulate independent continuous variables */
       x[i] = round(rand("Normal"), 0.001);
    do i = 1 to &numClass;        /* simulate values 1, 2, ..., &numLevels with equal prob */
       c[i] = rand("Integer", numLevels[i]);        /* the "Integer" distribution requires SAS 9.4M5 */
    /* 3. Simulate response as a function of certain explanatory variables */
    y = 4 - 3*x[1] - 2*x[&numCont] +                /* define coefficients for continuous effects */
       -3*(c[1]=1) - 4*(c[1]=2) + 5*c[&numClass]    /* define coefficients for categorical effects */
        + rand("Normal", 0, 3);                     /* normal error term */
  drop i k;
proc glm data=SimGLM;
   class c1-c&numClass;
   model y = x1-x&numCont c1-c&numClass / SS3 solution;
   ods select ModelANOVA ParameterEstimates;
Parameter estimates for synthetic (simulated) data that follows a regression model.

The ModelANOVA table from PROC GLM (not shown) displays the Type 3 sums of squares and indicates that the significant terms in the model are x1, x8, c1, and c4.

The parameter estimates from PROC GLM are shown to the right. You can see that each categorical variable has three levels, and you can use PROC FREQ to verify that the levels are uniformly distributed. I have highlighted parameter estimates for the significant effects in the model:

  • The estimates for the significant continuous effects are highlighted in red. The estimates for the coefficients of x1 and x8 are about -3 and -2, respectively, which are the parameter values that are specified in the simulation.
  • The estimates for the levels of the C1 CLASS variable are highlighted in green. They are close to (-3, -4, 0), which are the parameter values that are specified in the simulation.
  • The estimates for the Intercept and the C4 CLASS variable are highlighted in magenta. Notice that they seem to differ from the parameters in the simulation. As discussed previously, the simulation and PROC GLM use different parameterizations of the C4 effect. The simulation assigns Intercept = 4. The contribution of the first level of C4 is 5*1, the contribution for the second level is 5*2, and the contribution for the third level is 5*3. As explained in the previous article, the GLM parameterization reparameterizes the C4 effect as (4 + 15) + (5 - 15)*(C4=1) + (10 - 15)*(C4=2). The estimates are very close to these parameter values.

Although this program simulates a linear regression model, you can modify the program and simulate from a generalized linear model such as the logistic model. You just need to compute the linear predictor, eta (η), and then use the link function and the RAND function to generate the response variable, as shown in a previous article about how to simulate data from a logistic model.

In summary, this article shows how to simulate data for a linear regression model in the SAS DATA step when the model includes both categorical and continuous regressors. The program simulates arbitrarily many continuous and categorical variables. You can define a response variable in terms of the explanatory variables and their interactions.

The post Simulate data for a regression model with categorical and continuous variables appeared first on The DO Loop.

10月 032018

It is sometimes necessary for researchers to simulate data with thousands of variables. It is easy to simulate thousands of uncorrelated variables, but more difficult to simulate thousands of correlated variables. For that, you can generate a correlation matrix that has special properties, such as a Toeplitz matrix or a first-order autoregressive (AR(1)) correlation matrix. I have previously written about how to generate a large Toeplitz matrix in SAS. This article describes three useful results for an AR(1) correlation matrix:

  • How to generate an AR(1) correlation matrix in the SAS/IML language
  • How to use a formula to compute the explicit Cholesky root of an AR(1) correlation matrix.
  • How to efficiently simulate multivariate normal variables with AR(1) correlation.

Generate an AR(1) correlation matrix in SAS

The AR(1) correlation structure is used in statistics to model observations that have correlated errors. (For example, see the documentation of PROC MIXED in SAS.) If Σ is AR(1) correlation matrix, then its elements are constant along diagonals. The (i,j)th element of an AR(1) correlation matrix has the form Σ[i,j] = ρ|i – j|, where ρ is a constant that determines the geometric rate at which correlations decay between successive time intervals. The exponent for each term is the L1 distance between the row number and the column number. As I have shown in a previous article, you can use the DISTANCE function in SAS/IML 14.3 to quickly evaluate functions that depend on the distance between two sets of points. Consequently, the following SAS/IML function computes the correlation matrix for a p x p AR(1) matrix:

proc iml;
/* return p x p matrix whose (i,j)th element is rho^|i - j| */
start AR1Corr(rho, p); 
   return rho##distance(T(1:p), T(1:p), "L1");
/* test on 10 x 10 matrix with rho = 0.8 */
rho = 0.8;  p = 10;
Sigma = AR1Corr(rho, p);
print Sigma[format=Best7.];

A formula for the Cholesky root of an AR(1) correlation matrix

Every covariance matrix has a Cholesky decomposition, which represents the matrix as the crossproduct of a triangular matrix with itself: Σ = RTR, where R is upper triangular. In SAS/IML, you can use the ROOT function to compute the Cholesky root of an arbitrary positive definite matrix. However, the AR(1) correlation matrix has an explicit formula for the Cholesky root in terms of ρ. This explicit formula does not appear to be well known by statisticians, but it is a special case of a general formula developed by V. Madar (Section 5.1, 2016), who presented a poster at a Southeast SAS Users Group (SESUG) meeting a few years ago. An explicit formula means that you can compute the Cholesky root matrix, R, in a direct and efficient manner, as follows:

/* direct computation of Cholesky root for an AR(1) matrix */
start AR1Root(rho, p);
   R = j(p,p,0);                /* allocate p x p matrix */
   R[1,] = rho##(0:p-1);        /* formula for 1st row */
   c = sqrt(1 - rho**2);        /* scaling factor: c^2 + rho^2 = 1 */
   R2 = c * R[1,];              /* formula for 2nd row */
   do j = 2 to p;               /* shift elements in 2nd row for remaining rows */
      R[j, j:p] = R2[,1:p-j+1]; 
   return R;
R = AR1Root(rho, p);   /* compare to R = root(Sigma), which requires forming Sigma */
print R[L="Cholesky Root" format=Best7.];

You can compute an AR(1) covariance matrix from the correlation matrix by multiplying the correlation matrix by a positive scalar, σ2.

Efficient simulation of multivariate normal variables with AR(1) correlation

An efficient way to simulate data from a multivariate normal population with covariance Σ is to use the Cholesky decomposition to induce correlation among a set of uncorrelated normal variates. This is the technique used by the RandNormal function in SAS/IML software. Internally, the RandNormal function calls the ROOT function, which can compute the Cholesky root of an arbitrary positive definite matrix.

When there are thousands of variables, the Cholesky decomposition might take a second or more. If you call the RandNormal function thousands of times during a simulation study, you pay that one-second penalty during each call. For the AR(1) covariance structure, you can use the explicit formula for the Cholesky root to save a considerable amount of time. You also do not need to explicitly form the p x p matrix, Σ, which saves RAM. The following SAS/IML function is an efficient way to simulate N observations from a p-dimensional multivariate normal distribution that has an AR(1) correlation structure with parameter ρ:

/* simulate multivariate normal data from a population with AR(1) correlation */
start RandNormalAR1( N, Mean, rho );
    mMean = rowvec(Mean);
    p = ncol(mMean);
    U = AR1Root(rho, p);      /* use explicit formula instead of ROOT(Sigma) */
    Z = j(N,p);
    call randgen(Z,'NORMAL');
    return (mMean + Z*U);
call randseed(12345);
p = 1000;                  /* big matrix */
mean = j(1, p, 0);         /* mean of MVN distribution */
/* simulate data from MVN distribs with different values of rho */
v = do(0.01, 0.99, 0.01);  /* choose rho from list 0.01, 0.02, ..., 0.99 */
t0 = time();               /* time it! */
do i = 1 to ncol(v);
   rho = v[i];
   X = randnormalAR1(500, mean, rho);  /* simulate 500 obs from MVN with p vars */
t_SimMVN = time() - t0;     /* total time to simulate all data */
print t_SimMVN;

The previous loop generates a sample that contains N=500 observations and p=1000 variables. Each sample is from a multivariate normal distribution that has an AR(1) correlation, but each sample is generated for a different value of ρ, where ρ = 0.01. 0.02, ..., 0.99. On my desktop computer, this simulation of 100 correlated samples takes about 4 seconds. This is about 25% of the time for the same simulation that explicitly forms the AR(1) correlation matrix and calls RandNormal.

In summary, the AR(1) correlation matrix is an easy way to generate a symmetric positive definite matrix. You can use the DISTANCE function in SAS/IML 14.3 to create such a matrix, but for some applications you might only require the Cholesky root of the matrix. The AR(1) correlation matrix has an explicit Cholesky root that you can use to speed up simulation studies such as generating samples from a multivariate normal distribution that has an AR(1) correlation.

The post Fast simulation of multivariate normal data with an AR(1) correlation structure appeared first on The DO Loop.

7月 112018

In a previous article, I showed how to find the intersection (if it exists) between two line segments in the plane. There are some fun problems in probability theory that involve intersections of line segments. One is "What is the probability that two randomly chosen chords of a circle intersect?" This article shows how to create a simulation in SAS to estimate the probability.

An exact answer

For this problem, a "random chord" is defined as the line segment that joins two points chosen at random (with uniform probability) on the circle. The probability that two random chords intersect can be derived by using a simple counting argument. Suppose that you pick four points at random on the circle. Label the points according to their polar angle as p1, p2, p3, and p4. As illustrated by the following graphic, the points are arranged on the circle in one of the following three ways. Consequently, the probability that two random chords intersect is 1/3 because the chords intersect in only one of the three possible arrangements.

Possible arrangements of two random chords in the circle

A simulation in SAS

You can create a simulation to estimate the probability that two random chords intersect. The intersection of two segments can be detected by using either of the two SAS/IML modules in my article about the intersection of line segments. The following simulation generates four angles chosen uniformly at random in the interval (0, 2π). It converts those points to (x,y) coordinates on the unit circle. It then computes whether the chord between the first two points intersects the chord between the third and fourth points. It repeats this process 100,000 times and reports the proportion of times that the chords intersect.

proc iml;
/* Find the intersection between 2D line segments [p1,p2] and [q1,q2].
   This function assumes that the line segments have different slopes (A is nonsingular) */
start IntersectSegsSimple(p1, p2, q1, q2);
   b = colvec(q1 - p1); 
   A = colvec(p2-p1) || colvec(q1-q2); /* nonsingular when segments have different slopes */
   x = solve(A, b);                    /* x = (s,t) */
   if all(0<=x && x<=1) then           /* if x is in [0,1] x [0,1] */
      return (1-x[1])*p1 + x[1]*p2;    /* return intersection */
   else                                /* otherwise, segments do not intersect */
      return ({. .});                  /* return missing values */
/* Generate two random chords on the unit circle.
   Simulate the probability that they intersect  */
N = 1e5;
theta = j(N, 4);
call randseed(123456);
call randgen(theta, "uniform", 0, 2*constant('pi'));
intersect = j(N,1,0);
do i = 1 to N;
   t = theta[i,]`;                 /* 4 random U(0, 2*pi) */
   pts = cos(t) || sin(t);         /* 4 pts on unit circle */
   p1 = pts[1,];    p2 = pts[2,];
   q1 = pts[3,];    q2 = pts[4,];
   intersect[i] = all(IntersectSegsSimple(p1, p2, q1, q2) ^= .);
prob = mean(intersect);
print prob;

This simulation produces an estimate that is close to the exact probability of 1/3.

Connection to Bertrand's Paradox

This problem has an interesting connection to Bertrand's Paradox. Bertrand's paradox shows the necessity of specifying the process that is used to define the random variables in a probability problem. It turns out that there are multiple ways to define "random chords" in a circle, and the different definitions can lead to different answers to probability questions. See the Wikipedia article for an example.

For the definition of "random chords" in this problem, the density of the endpoints is uniform on the circle. After you make that choice, other distributions are determined. For example, the distribution of the lengths of 1,000 random chords is shown below. The lengths are NOT uniformly distributed! The theoretical density of the chord lengths is overlaid on the distribution of the sample. If you change the process by which chords are randomly chosen (for example, you force the lengths to be uniformly distributed), you might also change the answer to the problem, as shown in Bertrand's Paradox.

Distribution of lengths of random chords in the unit circle, generated by choosing uniformly distributed endpoints

The post The probability that two random chords of a circle intersect appeared first on The DO Loop.

6月 152018

My 2018 SAS Global Forum paper was about "how to use the random-number generators (RNGs) in SAS." You can read the paper for details, but I recently recorded a short video that summarizes the main ideas in the paper. In particular, the video gives an overview of the new RNGs in SAS, which include the following:

  • MTHYBRID, MT2002, and MT64: Variants of the Mersenne twister RNG. The MTHYBRID method is the default RNG in SAS, beginning with the SAS 9.4M3.
  • PCG: A 64-bit permuted congruential generator
  • RDRAND: A hardware-based RNG that generates random numbers from thermal noise in the chip
  • TF2 and TF4: Counter-based Threefry RNGs

If your browser does not support embedded video, you can go directly to the video on YouTube.

The following references provide more information about the random number generators in SAS:

The post Video: New random number generators in SAS appeared first on The DO Loop.

6月 062018

The SURVEYSELECT procedure in SAS 9.4M5 supports the OUTRANDOM option, which causes the selected items in a simple random sample to be randomly permuted after they are selected. This article describes several statistical tasks that benefit from this option, including simulating card games, randomly permuting observations in a DATA step, assigning a random ID to patients in a clinical study, and generating bootstrap samples. In each case, the new OUTRANDOM option reduces the number of statements that you need to write. The OUTRANDOM option can also be specified by using OUTORDER=RANDOM.

Sample data with PROC SURVEYSELECT

Often when you draw a random sample (with or without replacement) from a population, the order in which the items were selected is not important. For example, if you have 10 patients in a clinical trial and want to randomly assign five patients to the control group, the control group does not depend on the order in which the patients were selected. Similarly, in simulation studies, many statistics (means, proportions, standard deviations,...) depend only on the sample, not on the order in which the sample was generated. For these reasons, and for efficiency, the SURVEYSELECT procedure in SAS uses a "one-pass" algorithm to select observations in the same order that they appear in the "population" data set.

However, sometimes you might require the output data set from PROC SURVEYSELECT to be in a random order. For example, in a poker simulation, you might want the output of PROC SURVEYSELECT to represent a random shuffling of the 52 cards in a deck. To be specific, the following DATA step generates a deck of 52 cards in order: Aces first, then 2s, and so on up to jacks, queens, and kings. If you use PROC SURVEYSELECT and METHOD=SRS to select 10 cards at random (without replacement), you obtain the following subset:

data CardDeck;
length Face $2 Suit $8;
do Face = 'A','2','3','4','5','6','7','8','9','10','J','Q','K';
   do Suit = 'Clubs', 'Diamonds', 'Hearts', 'Spades';
      CardNumber + 1;
/* Deal 10 cards. Order is determined by input data */
proc surveyselect data=CardDeck out=Deal noprint
     seed=1234 method=SRS /* sample w/o replacement */
     sampsize=10;         /* number of observations in sample */
proc print data=Deal; run;
Random sample without replacement from a card deck. The sample is in the same order as the data.

Notice that the call to PROC SURVEYSELECT did not use the OUTRANDOM option. Consequently, the cards are in the same order as they appear in the input data set. This sample is adequate if you want to simulate dealing hands and estimate probabilities of pairs, straights, flushes, and so on. However, if your simulation requires the cards to be in a random order (for example, you want the first five observations to represent the first player's cards), then clearly this sample is inadequate and needs an additional random permutation of the observations. That is exactly what the OUTRANDOM option provides, as shown by the following call to PROC SURVEYSELECT:

/* Deal 10 cards in random order */
proc surveyselect data=CardDeck out=Deal2 noprint
     seed=1234 method=SRS /* sample w/o replacement */
     sampsize=10          /* number of observations in sample */
     OUTRANDOM;           /* SAS/STAT 14.3: permute order */
proc print data=Deal2; run;
Random sample without replacement from a card deck. The sample is then permuted into a random order.

You can use this sample when the output needs to be in a random order. For example, in a poker simulation, you can now assign the first five cards to the first player and the second five cards to a second player.

Permute the observations in a data set

A second application of the OUTRANDOM option is to permute the rows of a SAS data set. If you sample without replacement and request all observations (SAMPRATE=1), you obtain a copy of the original data in random order. For example, the students in the Sashelp.Class data set are listed in alphabetical order by their name. The following statements use the OUTRANDOM option to rearrange the students in a random order:

/* randomly permute order of observations */
proc surveyselect data=Sashelp.Class out=RandOrder noprint
     seed=123 method=SRS /* sample w/o replacement */
     samprate=1          /* proportion of observations in sample */
     OUTRANDOM;          /* SAS/STAT 14.3: permute order */
proc print data=RandOrder; run;
Use PROC SURVEYSELECT to produce a random ordering of a SAS data set

There are many other ways to permute the rows of a data set, such as adding a uniform random variable to the data and then sorting. The two methods are equivalent, but the code for the SURVEYSELECT procedure is shorter to write.

Assign unique random IDs to patients in a clinical trial

Another application of the OUTRANDOM option is to assign a unique random ID to participants in an experimental trial. For example, suppose that four-digit integers are used for an ID variable. Some clinical trials assign an ID number sequentially to each patient in the study, but I recently learned from a SAS discussion forum that some companies assign random ID values to subjects. One way to assign random IDs is to sample randomly without replacement from the set of all ID values. The following DATA step generates all four-digit IDs, selects 19 of them in random order, and then merges those IDs with the participants in the study:

data AllIDs;
do ID = 1000 to 9999;  /* create set of four-digit ID values */
/* randomly select 19 unique IDs */
proc surveyselect data=AllIDs out=ClassIDs noprint
     seed=12345 method=SRS  /* sample w/o replacement */
     sampsize=19            /* number of observations in sample */
     OUTRANDOM;             /* SAS/STAT 14.3: permute order */
data Class;
   merge ClassIDs Sashelp.Class;   /* merge ID variable and subjects */
proc print data=Class; var ID Name Sex Age; run;

Random order for other sampling methods

The OUTRANDOM option also works for other sampling schemes, such as sampling with replacement (METHOD=URS, commonly used for bootstrap sampling) or stratified sampling. If you use the REPS= option to generate multiple samples, each sample is randomly ordered.

It is worth mentioning that the SAMPLE function in SAS/IML also can to perform a post-selection sort. Suppose that X is any vector that contains N elements. Then the syntax SAMPLE(X, k, "NoReplace") generates a random sample of k elements from the set of N. The documentation states that "the elements ... might appear in the same order as in X." This is likely to happen when k is almost equal to N. If you need the sample in random order, you can use the syntax SAMPLE(X, k, "WOR") which adds a random sort after the sample is selected, just like PROC SURVEYSELECT does when you use the OUTRANDOM option.

The post Sample and obtain the results in random order appeared first on The DO Loop.

5月 092018

In a previous blog post, I discussed ways to produce statistically independent samples from a random number generator (RNG). The best way is to generate all samples from one stream. However, if your program uses two or more SAS DATA steps to simulate the data, you cannot use the same seed value multiple times because that could introduce biases into your study. An easy way to deal with this situation is to use the CALL STREAM subroutine to set different key values for each DATA step.

An example of using two DATA steps to generate random numbers

Suppose that you have a simulation that needs to call the DATA step twice, and each call generates random numbers. If you want to encapsulate that functionality into a macro call, you might want to pass in the seed for the random number streams. An INCORRECT approach is to do the following:

%macro DoItWRONG(seed=);
data A;
   call streaminit('PCG', &seed);
   /* ... do something with random numbers ... */
data B;
   call streaminit('PCG', &seed);       /* <== uh-oh! Problem here! */
   /* ... do something else with random numbers ... */

This code is unacceptable because the same seed value and RNG are used for both DATA steps. That means that the stream of random numbers is the same for the second DATA step as for the first, which can lead to correlations that can bias your study.

Prior to SAS 9.4M5, the only way to resolve this issue was to use different seed values for each DATA step. For example, you could use &seed + 1 in the second DATA step. Because the Mersenne twister RNG (the only RNG in SAS prior to 9.4M5) has an astronomically large period, streams that use different seed values have an extremely small probability of overlapping.

Use the CALL STREAM subroutine

SAS 9.4M5 introduced new random number generators. The Mersenne twister algorithm supports one parameter (the seed) which initializes each stream. The Threefry and PCG algorithms support two initialization parameters. To prevent confusion, the first parameter is called the seed value and the second is called the key value. By default, the key value is zero, but you can use the CALL STREAM subroutine to set a positive key value.

You should use the key value (not the seed value) to generate independent streams in two DATA steps. The following statements are true:

  • The PCG generator is not guaranteed to generate independent streams for different seed values. However, you can use the key value to ensure independent streams. Two streams with the same seed and different key values never overlap.
  • The Threefry generators (TF2 and TF4) generate independent streams for different seed values. However, the Threefry generators also accept key values, and two streams with different key values never overlap.
  • The Mersenne twister algorithm does not inherently support a key value. However, the STREAM routine creates a new seed from the current seed and the specified key value. Because of the long-period property of the MT algorithm, two streams that use the same seed and different key values are "independent with high probability."

The following example modifies the previous macro to use the CALL stream routine. For the PCG RNG, two streams that have the same seed and different key values do not overlap:

%macro DoIt(seed=);
data A;
   call streaminit('PCG', &seed);
   call stream(1); 
   /* ... do something with random numbers ... */
data B;
   call streaminit('PCG', &seed);
   call stream(2);                  /* <== different stream, guaranteed */
   /* ... do something else with random numbers ... */

Of course, this situation generalizes. If the macro uses three (or 10 or 20) DATA steps, each of which generates random numbers, you can use a different key value in each DATA step. If you initialize each DATA step with the same seed and a different key, the DATA steps generate independent streams (for the Threefry and PCG RNGs) or generate independent streams with high probability (for the Mersenne twister RNG).

A real example of using different streams

This section provides a real-life example of using the CALL STREAM subroutine. Suppose a researcher wants to create a macro that simulates many samples for a regression model. The macro first creates a data set that contains p independent continuous variables. The macro then simulates B response variables for a regression model that uses the independent variables from the first data set.

To prevent the second DATA step from using the same random number stream that was used in the first DATA step, the second DATA step calls the STREAM subroutine to set a key value:

%macro Simulate(seed=, N=, NCont=,  /* params for indep vars */
                key=, betas=,       /* params for dep vars */
                NumSamples=);       /* params for MC simul */
/* Step 1: Generate nCont explanatory variables */
data MakeData;
   call streaminit('PCG', &seed);
   /* if you do not call STREAM, you get the stream for key=0 */
   array x[&NCont];         /* explanatory vars are named x1-x&nCont  */
   do ObsNum = 1 to &N;     /* for each observation in the sample  */
      do j = 1 to &NCont;
         x[j] = rand("Normal"); /* simulate NCont explanatory variables   */
    keep ObsNum x:;
/* Step 2: Generate NumSamples responses, each from the same regression model 
   Y = X*beta + epsilon, where epsilon ~ N(0, 0.1)  */
data SimResponse;
   call streaminit('PCG', &seed);
   call stream(&key); 
   keep SampleId ObsNum x: Y;
   /* define regression coefficients beta[0], beta[1], beta[2], ... */
   array beta[0:&NCont] _temporary_ (&betas);
   /* construct linear model eta = beta[0] + beta[1]*x1 + betat[2]*x2 + ... */
   set MakeData;                    /* for each obs in data ...*/
   array x[&nCont] x1-x&NCont;      /* put the explanatory variables into an array */
   eta = beta[0];
   do i = 1 to &NCont;
      eta = eta + beta[i] * x[i];
   /* add random noise for each obs in each sample */
   do SampleID = 1 to &NumSamples;
      Y = eta + rand("Normal", 0, 0.1); 
/* sort to prepare data for BY-group analysis */
proc sort data=SimResponse;
   by SampleId ObsNum;

For a discussion of how the simulation works, see the article "Simulate many samples from a linear regression model." In this example, the macro uses two DATA steps to simulate the regression data. The second DATA step uses the STREAM subroutine to ensure that the random numbers it generates are different from those used in the first DATA step.

For completeness, here's how you can call the macro to generate the data and then a call PROC REG with a BY statement to analyze the 1,000 regression models.

%Simulate(seed=123, N=50, NCont=3,         /* params for indep vars */
          key=1, Betas = 1 -0.5 0.5 -0.25, /* params for dep vars */
          NumSamples=1000)                 /* params for MC sim */
proc reg data=SimResponse noprint outest=PE;
   by SampleID;
   model Y = x1-x3;
title "Parameter Estimates for key=1";
proc print data=PE(obs=5) noobs;
   var SampleID Intercept x1-x3;

You can verify that the macro generates different simulated data if you use the same seed value but change the key value in the macro call.

In summary, this article shows that you can use the STREAM call in the SAS DATA step to generate different random number streams from the same seed value. For many simulation studies, the STREAM subroutine is not necessary. However, it useful when the simulation is run in stages or is implemented in modules that need to run independently.

The post Independent streams of random numbers in SAS appeared first on The DO Loop.

5月 072018

Simulation studies require both randomness and reproducibility, two qualities that are sometimes at odds with each other. A Monte Carlo simulation might need to generate millions of random samples, where each sample contains dozens of continuous variables and many thousands of observations. In simulation studies, the researcher wants each sample to be statistically independent. For example, if you are drawing samples of size N=100 from a continuous distribution, it is not acceptable for the millionth sample to equal the first sample. Such a situation would bias the study. Similarly, it is not acceptable for half (or a fourth, or an eighth,...) of the numbers in the millionth sample to be the same as the numbers in the first sample.

The easiest way to ensure that each sample is statistically independent is to generate all samples from one high-quality random number stream. However, if the structure of your simulation requires that you use two or more streams, then SAS 9.4M5 provides several ways to produce random samples that are statistically independent. One way is to use one of the new Threefry random number generators. The other way is to use the new CALL STREAM subroutine.

Independence and overlap in random number streams

I've previously written about the many random number generators (RNGs) in SAS. Recall that all RNGs produce uniform variates. When you call a RNG repeatedly, you are generating a stream of random numbers. A stream is the sequence of pseudorandom uniform values that is generated from a given seed value. (If you request a random value from a nonuniform distribution, such as normal or exponential, the software generates one or more uniform variates and then applies a mathematical transformation to produce a nonuniform variate.)

In SAS, the modern RNGs are high quality and have a very long period. For all practical purposes, the samples that are generated from a single stream are independent. However, for the Mersenne twister and PCG RNGs, samples from different streams might not be independent. It might happen that the first sample for one stream is equal to (or has a large intersection with) the millionth sample from another seed, as shown in the image to the right. This phenomenon is known as overlap of streams.

The possibility that streams might overlap is important because some programmers use a macro loop to generate thousands of samples, where each sample is from a different stream. Not only is this technique inefficient, but the birthday problem in probability warns us that the probability that two streams overlap is larger than our intuition suggests.

The probability of overlapping streams is usually small, even for very large simulation studies. The "number of birthdays" equals the period of the RNG, which for Mersenne twister RNG is astronomically huge (≈ 106000). However, for the PCG RNG (which has a smaller period of "only" 264 ≈ 1019), the probability of an overlap could be nonnegligible.

Vigna (2009) estimates the probability of overlap when you use the first L variates in each of n streams from a RNG that has period P. Assuming that nL/P ≪ 1, the probability is approximately n2L/P. When P = 1019 and you generate n = 10,000 different streams, each containing L = 1,000,000 variates, the probability of an overlap is approximately 10-5.

Fortunately, SAS makes it it is easy to ensure the independence of random samples:

  • As shown in Simulating Data with SAS and in many blog posts, you can almost always use a single stream to generate all the random samples.
  • You can use one of the Threefry generators, TF2 or TF4. For a Threefry generator, two streams with different seed values never overlap! Similarly, streams for the hardware-based RDRAND generator never overlap.
  • You can use the CALL STREAM subroutine to set a key value for the PCG algorithm. For the PCG RNG, streams that have the same seed value but different key values never overlap.

In my next blog post, I will demonstrate how to use the STREAM subroutine in the SAS DATA step to generate independent streams. The technique can be useful when you are running a simulation that requires several DATA steps, each of which generates random numbers.

The post Independence and overlap in streams of random numbers appeared first on The DO Loop.

4月 162018

A colleague and I recently discussed how to generate random permutations without encountering duplicates. Given a set of n items, there are n! permutations My colleague wants to generate k unique permutations at random from among the total of n!. Said differently, he wants to sample without replacement from the set of all possible permutations.

In SAS, you can use the RANPERM function to generate ranom permutations. For example, the statement p = ranperm(4) generates a random permutation of the elements in the set {1, 2, 3, 4}. However, the function does not ensure that the permutations it generates will be unique.

I can think of two ways to generate random permutations without replacement. The first is to generate all permutations and then choose a random subset of size k without replacement from the set of n! possibilities. This technique is efficient when n is small or when k is a substantial proportion of n!. The second technique is to randomly generate permutations until you get a total of k that are unique. This technique is efficient when n is large and k is small relative to n!.

Generate all permutations, then sample without replacement

For small sets (say, n ≤ 8), an efficient way to generate random permutations is to generate all permutations and then extract a random sample without replacement. In the SAS/IML language you can use the ALLPERM function to generate all permutations of a set of n items, where n ≤ 18. The ALLPERM function returns a matrix that has n! rows and n columns. Each row is a permutation. You can then use the SAMPLE function to sample without replacement from among the rows, as follows:

proc iml;
call randseed(123);
n = 4;                       /* permutations of n items */
k = 3;                       /* want k unique permutation */
p = allperm(n);               /* matrix has n! rows; each row is a permutation */
rows = sample(1:n, k, "WOR"); /* random rows w/o replacement */
ranPermWOR = p[rows, ];       /* extract rows */
print ranPermWOR;

Generate random permutations, then check for uniqueness

The matrix of all permutations has n! rows, which gets big fast. If you want only a small number of permutations from among a huge set of possible permutations, it is more efficient to use the RANPERM function to generate permutations, then discard duplicates. Last week I showed how to eliminate duplicate rows from a numeric matrix so that the remaining rows are unique. The following SAS/IML statements use the UniqueRows function, which is defined in the previous post. You must define or load the module before you can use it.

/* <define or load the DupRows and UniqueRows modules HERE> */
n = 10;                      /* permutations of n items */
k = 5;                       /* want k unique permutation; k << n! */
p = ranperm(n, 2*k);         /* 2x more permutations than necessary */
U = UniqueRows(p);           /* get unique rows of 2k x n matrix */
if nrow(U) >= k then U = U[1:k, ];  /* extract the first k rows */
else do;
   U = UniqueRows( U // ranperm(n, 10*k) ); /* generate more... repeat as necessary */
   if nrow(U) >= k then U = U[1:k, ];  /* extract the first k rows */
print U;

Notice in the previous statements that the call to RANPERM requests 2*k random permutations, even though we only want k. You should ask for more permutations than you need because some of them might be duplicates. The factor of 2 is ad hoc; it is not based on any statistical consideration.

If k is much smaller than n!, then you might think that the probability of generating duplicate a permutation is small. However, the Birthday Problem teaches us that duplicates arise much more frequently than we might expect, so it is best to expect some duplicates and generate more permutations than necessary. When k is moderately large relative to n!, you might want to use a factor of 5, 10, or even 100. I have not tried to compute the number of permutations that would generate k unique permutations with 95% confidence, but that would be a fun exercise. In my program, if the initial attempt generates fewer than k unique rows, it generates an additional 10*k permutations. You could repeat this process, if necessary.

In summary, if you want to generate random permutations without any duplicates, you have at least two choices. You can generate all permutations and then extract a random sample of k rows. This method works well for small values of n, such as n ≤ 8. For larger values of n, you might want to generate random permutation (more than k) and use the first k unique rows of the matrix of permutations.

The post Random permutations without duplicates appeared first on The DO Loop.

3月 262018
Zipper plot for normally distributed data

Simulation studies are used for many purposes, one of which is to examine how distributional assumptions affect the coverage probability of a confidence interval. This article describes the "zipper plot," which enables you to compare the coverage probability of a confidence interval when the data do or do not follow certain distributional assumptions.

I saw the zipper plot in a tutorial about how to design, implement, and report the results of simulation studies (Morris, White, Crowther, 2017). An example of a zipper plot is shown to the right. The main difference between the zipper plot and the usual plot of confidence intervals is that the zipper plot sorts the intervals by their associated p-values, which causes the graph to look like a garment's zipper. The zipper plot in this blog post is slightly different from the one that is presented in Morris et al. (2017, p. 22), as discussed at the end of this article.

Coverage probabilities for confidence intervals

A formulas for a confidence interval (CI) is based on an assumption about the distribution of the parameter estimates on a simple random sample. For example, the two-sided 95% confidence interval for the mean of normally distributed data has upper and lower limits given by the formula
x̄ ± t* s / sqrt(n)
where x̄ is the sample mean, s is the sample standard deviation, n is the sample size, and t* is the 97.5th percentile of the t distribution with n – 1 degrees of freedom. You can confirm that the formula is a 95% CI by simulating many N(0,1) samples and computing the CI for each sample. About 95% of the samples should contain the population mean, which is 0.

The central limit theorem indicates that the formula is asymptotically valid for sufficiently large samples nonnormal data. However, for small nonnormal samples, the true coverage probability of those intervals might be different from 95%. Again, you can estimate the coverage probability by simulating many samples, constructing the intervals, and counting how many time the mean of the distribution is inside the confidence interval. A previous article shows how to use simulation to estimate the coverage probability of this formula.

Zipper plots and simulation studies of confidence intervals

Zipper plot that compares the empirical coverage probability of confidence intervals that assume normality. The data samples are size N=50 drawn from the normal or exponential distribution.

The graph to the right shows the zipper plot applied to the results of the two simulations in the previous article. The zipper plot on the left visualizes the estimate of the coverage probability 10,000 for normal samples of size n = 50. (The graph at the top of this article provides more detail.) The Monte Carlo estimate of the coverage probability of the CIs is 0.949, which is extremely close to the true coverage of 0.95. The Monte Carlo 95% confidence interval, shown by a blue band, is [0.945, 0.953], which includes 0.95. You can see that the width of the confidence intervals does not seem to depend on the sample mean: the samples whose means are "too low" tend to produce intervals that are about the same length as the intervals for the samples whose means are "too high."

The zipper plot on the right is for 10,000 random samples of size n = 50 that are drawn from an exponential distribution with mean 0 and unit scale parameter. (The graph at this link provides more detail.) The Monte Carlo estimate of the coverage probability is 0.9360. The Monte Carlo 95% confidence interval, shown by a blue band, is [0.931, 0.941] and does not include 0.95. Notice that the left side of the "zipper" tends to contain intervals that are shorter than those on the right side. This indicates that samples whose means are "too low" tend to produce shorter confidence intervals than the samples whose means are "too high."

The zipper plot enables you to compare the results of two simulations that generate data from different distributions. The zipper plot enables you to visualize the fact that the nominal 95% CIs do, in fact, have empirical coverage of about 95% for normal data, whereas the intervals have about 93.6% empirical coverage for the exponential data.

If you want to experiment with the zipper plot or use it for your own simulation studies in SAS, you can download the SAS program that generates the graphs in this article.

Differences from the "zip plot" of Morris et al.

There are some minor differences between the zipper plot in this article and the graph in Morris et al. (2017, p. 22).

  • Morris et al., use the term "zip plot." However, statisticians use "ZIP" for the zero-inflated Poisson distribution, so I prefer the term "zipper plot."
  • Morris et al. bin the p-values into 100 centiles and graph the CIs against the centiles. This has the advantage of being able to plot the CI's from thousands or millions of simulations in a graph that uses only a few hundred vertical pixels. In contrast, I plot the CI's for each fractional rank, which is the rank divided by the number of simulations. In the SAS program, I indicate how to compute and use the centiles.
  • Morris et al. plot all the centiles. Consequently, the interesting portion of the graph occupies only about 5-10% of the total graph area. In contrast, I display only the CIs whose fractional rank is less than some cutoff value, which is 0.2 in this article. Thus my zipper plots are a "zoomed in" version of the ones that appear in Morris et al. In the SAS program, you can set the FractionMaxDisplay macro variable to 1.0 to show all the centiles.

The post A zipper plot for visualizing coverage probability in simulation studies appeared first on The DO Loop.

2月 072018
Multiple-Birthday Problem: The distribution of the number of shared birthdays among 23 random people

If N random people are in a room, the classical birthday problem provides the probability that at least two people share a birthday. The birthday problem does not consider how many birthdays are in common. However, a generalization (sometimes called the Multiple-Birthday Problem) examines the distribution of the number of shared birthdays. Specifically, among N people, what is the probability that exactly k birthdays are shared (k = 1, 2, 3, ..., floor(N/2))? The bar chart at the right shows the distribution for N=23. The heights of the bars indicate the probability of 0 shared birthdays, 1 shared birthday, and so on.

This article uses simulation in SAS to examine the multiple-birthday problem. If you are interested in a theoretical treatment, see Fisher, Funk, and Sams (2013, p. 5-14).

The probability of multiple shared birthdays for N=23

You can explore the classical birthday problem by using probability theory or you can estimate the probabilities by using Monte Carlo simulation. The simulation in this article assumes 365 equally distributed birthdays, but see my previous article to extend the simulation to include leap days or a nonuniform distribution of birthdays.

The simulation-based approach enables you to investigate the multiple birthday problem. To begin, consider a room that contains N=23 people. The following SAS/IML statements are taken from my previous article. The Monte Carlo simulation generates one million rooms of size 23 and estimates the proportion of rooms in which the number of shared birthdays is 0, 1, 2, and so forth.

proc iml;
/* Function that simulates B rooms, each with N people, and counts the number of shared 
   (matching) birthdays in each room. The return value is a row vector that has B counts. */
start BirthdaySim(N, B);                 
   bday = Sample(1:365, B||N);             /* each column is a room */
   match = N - countunique(bday, "col");   /* number of unique birthdays in each col */
   return ( match );
call randseed(12345);                      /* set seed for random number stream */
NumPeople = 23;
NumRooms = 1e6;
match = BirthdaySim(NumPeople, NumRooms);  /* 1e6 counts */
call tabulate(k, freq, match);  /* summarize: k=number of shared birthdays, freq=counts */ 
prob = freq / NumRooms;         /* proportion of rooms that have 0, 1, 2,... shared birthdays */
print prob[F=BEST6. C=(char(k,2)) L="Estimate of Probability (N=23)"];
Multiple-Birthday Problem: The distribution of shared birthdays among 23 random people

The output summarizes the results. In less than half the rooms, no person shared a birthday with anyone else. In about 36% of the rooms, one birthday is shared by two or more people. In about 12% of the room, there were two birthdays that were shared by four or more people. About 2% of the rooms had three birthdays shared among six or more individuals, and so forth. In theory, there is a small probability of a room in which 8, 9, 10, or 11 birthdays are shared, but the probability of these events is very small. The probability estimates are plotted in the bar chart at the top of this article.

There is a second way to represent this probability distribution: you can create a stacked bar chart that shows the cumulative probability of k or fewer shared birthdays for k=1, 2, 3,.... You can see that the probability of 0 matching birthdays is less than 50%, the probability of 1 or fewer matches is 87%, the probability of 2 or fewer matches is 97%, and so forth. The probability for 5, 6, or 7 matches is not easily determined from either chart.

Multiple-Birthday Problem: The cumulative distribution of shared birthdays among 23 random people

Mathematically speaking, the information in the stacked bar chart is equivalent to the information in the regular bar chart. The regular bar chart shows the PMF (probability mass function) for the distribution whereas the stacked chart shows the CDF (cumulative distribution function).

The probability of shared birthdays among N people

The previous section showed the distribution of shared birthdays for N=23. You can download the SAS program that simulates the multiple birthday problem for rooms that contain N=2, 3, ..., and 60 people. For each simulation, the program computes estimates for the probabilities of k matching birthdays, where k ranges from 0 to 30.

Multiple-Birthday Problem: Three distributions of the number of shared birthdays among N=25, 40, and 60 random people

There are a few ways to visualize those 59 distributions. One way is to plot 59 bar charts, either in a panel or by using an animation. A panel of three bar charts is shown to the right for rooms of size N=25, 40, and 60. You can see that the bar chart for N=25 is similar to the one shown earlier. For N=40, most rooms have between one and three shared birthdays. For rooms with N=60 people, most rooms have between three and six shared birthdays. The horizontal graphs are truncated at k=12, even though a few simulations generated rooms that contain more than 12 matches.

Obviously plotting all 59 bar charts would require a lot of space. A different way to visualize these 59 distributions is to place 59 stacked bar charts side by side. This visualization is more compact. In fact, instead of 59 stacked bar charts, you might want to create a stacked band plot, which is what I show below.

The following stacked band chart is an attempt to visualize the distribution of shared birthdays for ALL room sizes N=2, 3, ..., 60. (Click to enlarge.)

Multiple-Birthday Problem: The cumulative distribution of the probability of k or fewer shared birthdays among N random people, N=2,3,...,60

How do you interpret this graph? A room that contains N=23 people corresponds to drawing a vertical line at N=23. That vertical line intersects the red, green, and brown bands near 0.5, 0.87, and 0.97. Therefore, these are the probabilities that a room with 23 people contains 0 shared birthdays, 1 or fewer shared birthdays, and 2 or fewer shared birthdays. The vertical heights of the bands qualitatively indicate the individual probabilities.

Next consider a room that contains N=40 people. A vertical line at N=40 intersects the colored bands at 0.11, 0.37, 0.66, 0.86, and 0.95. These are the cumulative probabilities P(k ≤ 0), P(k ≤ 1), P(k ≤ 2), and so forth. If you need the exact values, you can use PROC PRINT to display the actual probability estimates. For example:

proc print data=BDayProb noobs;
   where N=40 and k < 5;
Multiple-Birthday Problem: The distribution of shared birthdays among N=40 random people


In summary, this article examines the multiple-birthday problem, which is a generalization of the classical birthday problem. If a room contains N random people, the multiple-birthday problem examines the probability that k birthdays are shared by two or more people for k = 0, 1, 2, ..., floor(N/2). This article shows how to use Monte Carlo simulation in SAS to estimate the probabilities. You can visualize the results by using a panel of bar charts or by using a stacked band plot to visualize the cumulative distribution. You can download the SAS program that simulates the multiple birthday problem, estimates the probabilities, and visualizes the results.

What do you think? Do you like the stacked band plot that summarizes the results in one graph, or do you prefer a series of traditional bar charts? Leave a comment.

The post The distribution of shared birthdays in the Birthday Problem appeared first on The DO Loop.