10月 112017

The article "Fisher's transformation of the correlation coefficient" featured a Monte Carlo simulation that generated sample correlations from bivariate normal data. The simulation used three steps:

  1. Simulate B samples of size N from a bivariate normal distribution with correlation ρ.
  2. Use PROC CORR to compute the sample correlation matrix for each of the B samples.
  3. Use the DATA step to extract the off-diagonal elements from the correlation matrices.

After the three steps, you obtain a distribution of B sample correlation coefficients that approximates the sampling distribution of the Pearson correlation coefficient for bivariate normal data.

There is a simpler way to simulate the correlation estimates: You can directly simulate from the Wishart distribution. Each draw from the Wishart distribution is a sample covariance matrix for a multivariate normal sample of size N. If you convert that covariance matrix to a correlation matrix, you can immediately extract the off-diagonal elements, as shown in the following SAS/IML statements:

%let rho = 0.8;           /* correlation for bivariate normal distribution */
%let N = 20;              /* sample size */
%let NumSamples = 2500;   /* number of simulated samples */
/* generate sample correlation coefficients by using Wishart distribution */
proc iml;
call randseed(12345);
NumSamples = &NumSamples;
DF = &N - 1;              /* X ~ N obs from MVN(0, Sigma) */
Sigma = {1     &rho,      /* covariance for MVN samples */
         &rho   1  };
S = RandWishart(NumSamples, DF, Sigma); /* each row is 2x2 matrix */
Corr = j(NumSamples, 1);  /* allocate vector for correlation estimates */
do i = 1 to nrow(S);      /* convert to correlation; extract off-diagonal */
   Corr[i] = cov2corr( shape(S[i,], 2, 2) )[1,2];

You can create a comparative histogram of the sample correlation coefficients. In the following graph, the histogram at the top of the panel displays the distribution of the simulated correlation coefficients from the three-step method. The bottom histogram displays the distribution of correlations coefficients that are generated from the Wishart distribution.

Visually, the histograms appear to be similar. You can use PROC NPAR1WAY to run various hypothesis tests that compare the distributions; all tests support the hypothesis that these two distributions are equivalent.

If you'd like to see the complete analysis, you can download the SAS program that runs both simulations and compares the resulting distributions.

Although the Wishart distribution is more efficient for this simulation, recall that the Wishart distribution assumes that the underlying data distribution is multivariate normal. In contrast, the three-step simulation is more general. It can be used to generate correlation coefficients for any data distribution. So although the three-step simulation is not necessary for multivariate normal data, it is still an important technique to store in your simulation toolbox.

The post Simulate correlations by using the Wishart distribution appeared first on The DO Loop.

9月 272017

In a large simulation study, it can be convenient to have a "control file" that contains the parameters for the study. My recent article about how to simulate multivariate normal clusters demonstrates a simple example of this technique. The simulation in that article uses an input data set that contains the parameters (mean, standard deviations, and correlations) for the simulation. A SAS procedure (PROC SIMNORMAL) simulates data based on the parameters in the input data set.

This is a powerful paradigm. Instead of hard-coding the parameters in the program (or as macro variables), the parameters are stored in a data set that is processed by the program. This is sometimes called data-driven programming. (Some people call it dynamic programming, but there is an optimization technique of the same name so I will use the term "data-driven.") In a data-driven program, when you want to run the program with new parameters, you merely modify the data set that contains the control parameters.

I have previously written about a different way to control a batch program by passing in parameters on the command line when you invoke the SAS program.

Static programming and hard-coded parameters

Before looking at data-driven programming, let's review the static approach. I will simulate clusters of univariate normal data as an example.

Suppose that you want to simulate normal data for three different groups. Each group has its own sample size (N), mean, and standard deviation. In my book Simulating Data with SAS (p. 206), I show how to simulate this sort of ANOVA design by using arrays, as follows.

/* Static simulation: Parameters embedded in the simulation program */
data AnovaStatic;
/* define parameters for three simulated group */
array N[3]       _temporary_ (50,   50,   50);   /* sample sizes */
array Mean[3]    _temporary_ (14.6, 42.6, 55.5); /* center for each group */
array StdDev[3]  _temporary_ ( 1.7,  4.7,  5.5); /* spread for each group */
call streaminit(12345);
do k = 1 to dim(N);              /* for each group */
   do i = 1 to N[k];             /* simulate N[k] observations */
      x = rand("Normal", Mean[k], StdDev[k]); /* from k_th normal distribution */

The DATA step contains two loops, one for the groups and the other for the observations within each group. The parameters for each group are stored in arrays. Notice that if you want to change the parameters (including the number of groups), you need to edit the program. I call this method "static programming" because the behavior of the program is determined at the time that the program is written. This is a perfectly acceptable method for most applications. It has the advantage that you know exactly what the program will do by looking at the program.

Data-driven programming: Put parameters in a file

An alternative is to put the parameters for each group into a file or data set. If the k_th row in the data set contains the parameters for the k_th group, then the implicit loop in the DATA step will iterate over all groups, regardless of the number of groups. The following DATA step creates the parameters for three groups, which are read and processed by the second DATA step. The parameter values are the same as for the static example, but are transposed and processed row-by-row instead of via arrays:

/* Data-driven simulation: Parameters in a data set, processed by the simulation program */
data params;                     /* define parameters for each simulated group */
input N Mean StdDev;
50 14.6 1.7
50 42.6 4.7
50 55.5 5.5
data AnovaDynamic;
call streaminit(12345);
set params;                      /* implicit loop over groups k=1,2,... */
do i = 1 to N;                   /* simulate N[k] observations */
   x = rand("Normal", Mean, StdDev); /* from k_th normal distribution */

Notice the difference between the static and dynamic techniques. The static technique simulates data from three groups whose parameters are specified in temporary arrays. The dynamic technique simulates data from an arbitrary number of groups. Currently, the PARAMS data specifies three groups, but if I change the PARAMS data set to represent 10 or 1000 groups, the AnovaDynamic DATA step will simulate data from the new design without any modification.

Generate the parameters from real data

The data-driven technique is useful when the parameters are themselves the results of an analysis. For example, a common simulation technique is to generate the moments of real data (mean, variance, skewness, and so forth) and to use those statistics in place of the population parameters that they estimate. (See Chapter 16, "Moment Matching," in Simulating Statistics with SAS.)

The following call to PROC MEANS generates the sample mean and standard deviation for real data and writes those values to a data set:

proc means data=sashelp.iris N Mean StdDev stackods;
   class Species;
   var PetalLength;
   ods output Summary=params;

The output data set from PROC MEANS creates a PARAMS data set that contains the variables (N, MEAN, and STDDEV) that are read by the simulation program. Therefore, you can immediately run the AnovaDynamic DATA step to simulate normal data from the sample statistics. A visualization of the resulting simulated data is shown below.

You can run PROC MEANS on other data and other variables and the AnovaDynamic step will continue to work without any modification. The simulation is controlled entirely by the values in the "control file," which is the PARAMS data set.

You can generalize this technique by wrapping the program in a SAS macro in which the name of the parameter file and the name of the simulated data set are provided at run time. With a macro implementation, you can read from multiple input files and write to multiple output data sets. You could use such a macro, for example, to break up a large simulation study into smaller independent sub-simulations, each controlled by its own file of input parameters. In a gridded environment, each sub-simulation can be processed independently and in parallel, thus reducing the total time required to complete the study.

Although this article discusses control files in the context of statistical simulation, other applications are possible. Have you used a similar technique to control a program by using an input file that contains the parameters for the program? Leave a comment.

The post Data-driven simulation appeared first on The DO Loop.

9月 252017

My article about Fisher's transformation of the Pearson correlation contained a simulation. The simulation uses the RANDNORMAL function in SAS/IML software to simulate multivariate normal data. If you are a SAS programmer who does not have access to SAS/IML software, you can use the SIMNORMAL procedure in SAS/STAT software to simulate data from a multivariate normal distribution.

The 'TYPE' of a SAS data set

Most SAS procedures read and analyze raw data. However, some SAS procedures read and write special data sets that represent a statistical summary of data. PROC SIMNORMAL can read a TYPE=CORR or TYPE=COV data set. Usually, these special data sets are created as an output data set from another procedure. For example, the following SAS statements compute the correlation between four variables from a sample of 50 Iris versicolor flowers:

proc corr data=sashelp.iris(where=(Species="Versicolor"))  /* input raw data */
          nomiss noprint outp=OutCorr;                     /* output statistics */
var PetalLength PetalWidth SepalLength SepalWidth;
proc print data=OutCorr; run;
SAS TYPE=CORR data set

The output data set contains summary statistics including the mean, standard deviations, and correlation matrix for the four variables in the analysis. PROC PRINT does not display the 'TYPE' attribute of this data set, but if you run PROC CONTENTS you will see a field labeled "Data Set Type," which has the value "CORR".

You can also create a TYPE=CORR or TYPE=COV data set by using the DATA step as shown in the documentation for PROC SIMNORMAL.

Use PROC SIMNORMAL to generate multivariate normal data

Recall that you can use the standard deviations and correlations to construct a covariance matrix. When you call PROC SIMNORMAL, it internally constructs the covariance matrix from the information in the OutCorr data set and use the mean and covariance matrix to simulate multivariate normal data. The following call to PROC SIMNORMAL simulates 50 observations from a multivariate normal population. The DATA step combines the original and simulated data; the call to PROC SGSCATTER overlays the original and the simulated samples. Click to enlarge the graph.

proc simnormal data=OutCorr outsim=SimMVN
               numreal = 50           /* number of realizations = size of sample */
               seed = 12345;          /* random number seed */
   var PetalLength PetalWidth SepalLength SepalWidth;
/* combine the original data and the simulated data */
data Both;
set sashelp.iris(where=(Species="Versicolor")) /* original */
    SimMVN(in=sim);                            /* simulated */
Simulated = sim;
ods graphics / attrpriority=none;   /* use different markers for each group */
title "Overlay of Original and Simulated MVN Data";
proc sgscatter data=Both;
   matrix PetalLength PetalWidth SepalLength SepalWidth / group=Simulated;
ods graphics / attrpriority=none;   /* reset markers */
Overlay of original and simulated four-dimensional data

Notice that the original data are rounded whereas the simulated data are not. Except for that minor difference, the simulated data appear to be similar to the original data. Of course, the simulated data will not match unless the original data is approximately multivariate normal.

Simulate many samples from a multivariate normal distribution

The SIMNORMAL procedure supports the NUMREAL= option, which you can use to specify the size of the simulated sample. (NUMREAL stands for "number of realizations," which is the number of independent draws.) You can use this option to generate multiple samples from the same multivariate normal population. For example, suppose you are conducting a Monte Carlo study and you want to generate 100 samples of size N=50, each drawn from the same multivariate normal population. This is equivalent to drawing 50*100 observations where the first 50 observations represent the first sample, the next 50 observations represent the second sample, and so on. The following statements generate 50*100 observations and then construct an ID variable that identifies each sample:

%let N = 50;            /* sample size */
%let NumSamples = 100;  /* number of samples */
proc simnormal data=OutCorr outsim=SimMVN
               numreal = %sysevalf(&N*&NumSamples) 
               seed = 12345;          /* random number seed */
   var PetalLength PetalWidth SepalLength SepalWidth;
data SimMVNAll;
set SimMVN;
ID = floor((_N_-1) / &N) + 1;   /* ID = 1,1,...,1, 2,2,...,2, etc */

After adding the ID variable, you can efficiently analyze all samples by using a single call to a procedure. The procedure should use a BY statement to analyze each sample. For example, you could use PROC CORR with a BY ID statement to obtain a Monte Carlo estimate of the sampling distribution of the correlation for multivariate normal data.

In summary, although the SAS/IML language is the best tool for general multivariate simulation tasks, you can use the SIMNORMAL procedure in SAS/STAT software to simulate multivariate normal data. The key is to construct a TYPE=CORR or TYPE=COV data set, which is then processed by PROC SIMNORMAL.

The post Simulate multivariate normal data in SAS by using PROC SIMNORMAL appeared first on The DO Loop.

9月 132017
Simulate clusters of multivariate normal observations in SAS by using a Gaussian mixture distribution

This article shows how to simulate data from a mixture of multivariate normal distributions, which is also called a Gaussian mixture. You can use this simulation to generate clustered data. The adjacent graph shows three clusters, each simulated from a four-dimensional normal distribution. Each cluster has its own within-cluster covariance, which controls the spread of the cluster and the amount overlap between clusters.

This article is based on an example in Simulating Data with SAS (Wicklin, 2013, p. 138). Previous articles have explained how to simulate from a multivariate normal distribution and how to generate a random sample from a univariate mixture distribution.

A review of mixture distributions

The graph at the top of this article shows 100 random observation from a Gaussian mixture distribution. A mixture distribution consists of K component distributions and a set of mixing probabilities that determine the probability that a random observation belongs to each component. For example, if π = {0.35, 0.5, 0.15} is a vector of mixing probabilities, then in large random samples about 35% of the observations are drawn from the first component, about 50% from the second component, and about 15% from the third component.

For a mixture of Gaussian components, you need to specify the mean vector and the covariance matrix for each component. For this example, the means and covariances are approximately the estimates from Fisher's famous iris data set, so the scatter plot matrix might look familiar to statisticians who have previously analyzed the iris data. The means of the three component distributions are μ1 = {50, 34, 15, 2}, μ2 = {59, 28, 43, 13}, and μ3 = {66, 30, 56, 20}. The covariance matrices for the component distributions are shown later.

Simulate a Gaussian mixture in SAS

The SAS/IML language is the easiest way to simulate multivariate data in SAS. To simulate from a mixture of K Gaussian distributions, do the following:

  1. Generate a random draw from the multinomial distribution with probability vector π. This gives the number of observations to sample from each component.
  2. For each component, simulate from a multivariate normal distribution.

A technical issue is how to pass the mean vectors and covariance matrices into a module that simulates the data. If you are using SAS/IML 14.2 or beyond, you can use a list of lists (Wicklin, 2017, p. 12–14). For compatibility with older versions of SAS, the implementation in this article uses a different technique: each mean and covariance are stored as rows of a matrix. Because the covariance matrices are symmetric, Wicklin (2013) stores them in lower-triangular form. For simplicity, this article stores each 4 x 4 covariance matrix as a row in a 3 x 16 matrix.

proc iml;
/* Simulate from mixture of K mulitvariate normal distributions in dimension d.  
   OUTPUT variables:
   X  : a SampleSize x d matrix of simulated observations
   ID : a column vector that contains the component from which each obs is drawn
   INPUT variables:
   pi  : 1 x K vector of mixing probabilities, sum(pi)=1
   mu  : K x d matrix whose i_th row contains the mean vector for the i_th component
   Cov : K x (d**2) matrix whose i_th row contains the covariance matrix for the i_th component
start SimMVNMixture(X, ID,                     /* output arguments */
                    SampleSize, pi, mu, Cov);  /* input arguments  */
   K = ncol(pi);                               /* number of components */
   d = ncol(mu);                               /* number of variables */
   X = j(SampleSize, d);                       /* output: each row is observation */
   ID = j(SampleSize, 1);                      /* ID variable */
   N = RandMultinomial(1, SampleSize, pi);     /* vector of samples sizes for components */
   b = 1;                                      /* b = beginning index for group i */
   do i = 1 to K;
      e = b + N[i] - 1;                        /* e = ending index for group i */
      ID[b:e] = i;                             /* set ID variable */
      c = shape(Cov[i,], d, d);                /* reshape Cov to square matrix */
      X[b:e, ] = RandNormal(N[i], mu[i,], c);  /* simulate i_th MVN sample */
      b = e + 1;                               /* next group starts at this index */

The SimMVNMixture routine allocates a data matrix (X) that is large enough to hold the results. It generates a vector N ={N1, N2,..., NK} to determine the number of observations that will be drawn from each component and calls the RANDNORMAL function to simulate from each Gaussian component. The scalar values b and e keep track of the beginning and ending rows of each sample from each component.

After the module is defined, you can call it for a specific set of parameters. Assume you want to generate K=3 clusters of four-dimensional data (d=4). The following statements specify the mixing probabilities for a three-component model. The mu matrix is a K x d matrix whose rows are the mean vectors for the components. The Cov matrix is a K x (d**2) matrix whose rows are the covariance matrices for the components. The following statements generate a total of 100 observations from the three-component mixture distribution:

/* specify input args; means/cov correspond to sashelp.iris data for species */
pi = {0.35 0.5 0.15};                   /* mixing probs for K=3 groups */
mu = {50 34 15  2 ,                            /* means of Group 1     */
      59 28 43 13 ,                            /* means of Group 2     */
      66 30 56 20 };                           /* means of Group 3     */
/* specify within-group covariances */
Cov = {12 10  2 1  10 14 1 1   2 1  3 1  1 1 1 1 ,   /* cov of Group 1 */
       27  9 18 6   9 10 8 4  18 8 22 7  6 4 7 4 ,   /* cov of Group 2 */
       40  9 30 5   9 10 7 5  30 7 30 5  5 5 5 8 };  /* cov of Group 3 */
/* run the simulation to generate 100 observations */
call randseed(12345);
run SimMVNMixture(X, Group, 100, pi, mu, Cov);

The call to the SimMVNMixture routine returned the simulated random sample in X, which is a 100 x d matrix. The module also returns an ID vector that identifies the component from which each observation was drawn. You can visualize the random sample by writing the data to a SAS data set and using the SGSCATTER procedure to create a paneled scatter plot, as follows:

/* save simulated data to data set */
Y = Group || X;
create Clusters from Y[c=("Group" || varNames)];  append from Y;  close;
ods graphics / attrpriority=none;
title "Multivariate Normal Components";
title2 "(*ESC*){unicode pi} = {0.35, 0.5, 0.15}";
proc sgscatter data=Clusters;
   compare y=(x1 x2) x=(x3 x4) / group=Group markerattrs=(Size=12);

The graph is shown at the top of this article.

Although each component in this example is multivariate normal, the same technique will work for any component distributions. For example, one cluster could be multivariate normal, another multivariate t, and a third multivariate uniform.

In summary, you can create a function module in the SAS/IML language to simulate data from a mixture of Gaussian components. The RandMutinomial function returns a random vector that determines the number of observations to draw from each component. The RandNormal function generates the observations. A technical detail involves how to pass the parameters to the module. This implementation packs the parameters into matrices, but other options, such as a list of lists, are possible.

The post Simulate multivariate clusters in SAS appeared first on The DO Loop.

7月 262017

A classical problem in elementary probability asks for the expected lengths of line segments that result from randomly selecting k points along a segment of unit length. It is both fun and instructive to simulate such problems. This article uses simulation in the SAS/IML language to estimate solutions to the following problems:

  • Randomly choose a point at random in the interval (0, 1). The point divides the interval into two segments of length x and 1-x. What is the expected length of the larger (smaller) segment?
  • Broken stick problem: What is the probability that three randomly chosen points will break a segment into a triangle?
  • Randomly choose k points at random in the interval (0, 1). The points divide the interval into k+1 segments. What is the expected length of the largest (smallest) segment?
  • When k=2, the points divide the interval into three segments. What is the probability that the three segments can form a triangle? This is called the broken-stick problem and is illustrated in the figure to the right.

You can find a discussion and solution to these problems on many websites, but I like the website, which includes proofs and interactive Java applets.

Simulate a solution in SAS

You can simulate these problems in SAS by writing a DATA step or a SAS/IML program. I discuss the DATA step at the end of this article. The body of this article presents a SAS/IML simulation and constructed helper modules that solve the general problem. The simulation will do the following:

  1. Generate k points uniformly at random in the interval (0, 1). For convenience, sort the points in increasing order.
  2. Compute the lengths of the k+1 segments.
  3. Find the length of the largest and smallest segments.

In many languages (including the SAS DATA step), you would write a loop that performs these operations for each random sample. You would then estimate the expected length by computing the mean value of the largest segment for each sample. However, in the SAS/IML language, you can use matrices instead of using a loop. Each sample of random points can be held in the column of a matrix. The lengths of the segments can also be held in a matrix. The largest segment for each trial is stored in a row vector.

The following SAS/IML modules help solve the general simulation problem for k random points. Because the points are ordered, the lengths of the segments are the differences between adjacent rows. You can use the DIF function for this computation, but the following program uses the DifOp module to construct a small difference operator, and it uses matrix multiplication to compute the differences.

proc iml;
/* Independently sort column in a matrix.
   See */
start SortCols(A);
   do i = 1 to ncol(A);
      v = A[ ,i];  call sort(v);  A[ ,i] = v; /* get i_th col and sort it */
/* Generate a random (k x NSim) matrix of points, then sort each column. */
start GenPts(k, NSim);
   x = j(k, NSim);               /* allocate k x NSim matrix */
   call randgen(x, "Uniform");   /* fill with random uniform in (0,1) */
   if k > 1 then run SortCols(x);  
   return x;
/* Return matrix for difference operator.
   See */
start DifOp(dim);
   D = j(dim-1, dim, 0);         /* allocate zero matrix */
   n = nrow(D); m = ncol(D);
   D[do(1,n*m, m+1)] = -1;       /* assign -1 to diagonal elements */
   D[do(2,n*m, m+1)] = 1;        /* assign +1 to super-diagonal elements */
   return D;
/* Find lengths of segments formed by k points in the columns of x.
   Assume each column of x is sorted and all points are in (0,1). */
start SegLengths(x);   
   /* append 0 and 1 to top and bottom (respectively) of each column */
   P = j(1, ncol(x), 0) // x // j(1, ncol(x), 1);
   D = DifOp(nrow(P));           /* construct difference operator */
   return D*P;                   /* use difference operator to find lengths */
P = {0.1  0.2  0.3,
     0.3  0.8  0.5,
     0.7  0.9  0.8 };
L = SegLengths(P);
print L[label="Length (k=3)"];
Lengths of segments formed by random points in the unit interval

The table shows the lengths of three different sets of points for k=3. The first column of P corresponds to points at locations {0.1, 0.3, 0.7}. These three points divide the interval [0, 1] into four segments of lengths 0.1, 0.2, 0.4, and 0.3. Similar computations hold for the other columns.

The expected length of the longer of two segments

For k=1, the problem generates a random point in (0, 1) and asks for the expected length of the longer segment. Obviously the expected length is greater than 1/2, and you can read the Cut-The-Knot website to find a proof that shows that the expected length is 3/4 = 0.75.

The following SAS/IML statements generate one million random points and compute the larger of the segment lengths. The average value of the larger segments is computed and is very close to the expected value:

call randseed(54321);
k = 1;  NSim = 1E6;
x = GenPts(k, NSim);             /* simulations of 1 point dropped onto (0,1) */
L = SegLengths(x);               /* lengths of  segments */
Largest = L[<>, ];               /* max length among the segments */
mean = mean(Largest`);           /* average of the max lengths */
print mean;
Estimate for expected length of longest segment

You might not be familiar with the SAS/IML max subscript operator (<>) and the min subscript operator (><). These operators compute the minimum or maximum values for each row or column in a matrix.

The expected length of the longest of three segments

For k=2, the problem generates two random points in (0, 1) and asks for the expected length of the longest segment. You can also ask for the average shortest length. The Cut-The-Knot website shows that the expected length for the longest segment is 11/18 = 0.611, whereas the expected length of the shortest segment is 2/18 = 0.111.

The following SAS/IML statements simulate choosing two random points on one million unit intervals. The program computes the one million lengths for the resulting longest and shortest segments. Again, the average values of the segments are very close to the expected values:

k = 2;  NSim = 1E6;
x = GenPts(k, NSim);             /* simulations of 2 points dropped onto (0,1) */
L = SegLengths(x);               /* lengths of segments */
maxL = L[<>, ];                  /* max length among the segments */
meanMax = mean(maxL`);           /* average of the max lengths */
minL = L[><, ];                  /* min length among the segments */
meanMin = mean(minL`);           /* average of the max lengths */
print meanMin meanMax;
Estimates for expected lengths of the shorted and longest segments formed by two random points in the unit interval

The broken stick problem

You can use the previous simulation to estimate the broken stick probability. Recall that three line segments can form a triangle provided that they satisfy the triangle inequality: the sum of the two smaller lengths must be greater than the third length. If you randomly choose two points in (0,1), the probability that the resulting three segments can form a triangle is 1/4, which is smaller than what most people would guess.

The vectors maxL and minL each contain one million lengths, so it is trivial to compute the vector of that contains the third lengths.

/* what proportion of randomly broken sticks form triangles? */
medL = 1 - maxL - minL;          /* compute middle length */
isTriangle = (maxL <= minL + medL); /* do lengths satisfy triangle inequality? */
prop = mean(isTriangle`);        /* proportion of segments that form a triangle */
print prop;
Estimate for the probability that three random segments form a triangle

As expected, about 0.25 of the simulations resulted in segments that satisfy the triangle inequality.

In conclusion, this article shows how to use the SAS/IML language to solve several classical problems in probability. By using matrices, you can run the simulation by using vectorized computations such as matrix multiplication finding the minimum or maximum values of columns. (However, I had to use a loop to sort the points. Bummer!)

If you want to try this simulation yourself in the DATA step, I suggest that you transpose the SAS/IML setup. Use arrays to hold the random points and use the CALL SORTN subroutine to sort the points. Use the LARGEST function and the SMALLEST function to compute the largest and smallest elements in an array. Feel free to post your solution (and any other thoughts) in the comments.

The post Random segments and broken sticks appeared first on The DO Loop.

6月 052017

If you toss a coin 28 times, you would not be surprised to see three heads in a row, such as ...THHHTH.... But what about eight heads in a row? Would a sequence such as THHHHHHHHTH... be a rare event?

This question popped into my head last weekend as I attended my son's graduation ceremony. As the students marched in, I noticed that men were dressed in green cap and gowns, whereas the women were dressed in white. They entered in alphabetical order, which randomized the men and women. They filed into 12 rows that each contained 28 seats. Thus each row is like an independent toss of a coin, with green and white representing heads and tails, respectively.

Phot of graduating men and women in different colored caps and gowns

When the students entered the ninth row from the left (fourth from the right), I noticed a sequence of eight consecutive "little green men," which is highlighted in red in the picture on this page. (Click to enlarge.) I wish I had a photo of the students seated in their chairs because the effect is more dramatic when the green mortarboards are all aligned. But take my word for it: the long sequence of green was very noticeable.

The picture shows that there was actually a row to the extreme left that was partially filled. For the purpose of this article, ignore the partial row. In the 12 full rows, the number of men in each row is (from left to right) {15, 15, 14, 11, 16, 16, 15, 10, 20, 9, 14, 13}. Remarkably, this adds to 168, so the proportion of men is exactly 0.5 of the 12 x 28 = 336 students.

Simulate the binary pattern

You can simulate the students by generating 336 random binary values arranged on a 12 x 28 grid. Since this was the graduating class of 2017, I used 2017 as the random number seed in the following DATA step:

%let NumRows = 12;
%let NumInRow= 28;
data Graduation;
call streaminit(2017);  
do row = 1 to &NumRows;
   do seat = 1 to &NumInRow;
      Male = rand("Bernoulli", 0.5); 
title "One Simulated Seating Arrangement";
proc sgplot data=Graduation;
   styleattrs wallcolor=grey DATACONTRASTCOLORS=(white green);
   scatter x=Row y=Seat / group=male markerattrs=(symbol=SquareFilled);
   xaxis integer values=(1 to 12);
Random binary values on a regular grid

If you look at row 5 in the image, you will see a sequence of nine consecutive green markers. The fact that a simulated data set reproduced the graduation scenario on the very first attempt makes me think that this situation is not very rare. However, changing the seed a few times shows that the situation does not always occur.

Runs in coin tosses

There are 12 rows, each containing 28 students. The event of interest is a row with eight or more consecutive males. The easiest way to compute the probability of this happening is to first compute the probability for one row. Since the rows are assumed to be independent, you can then compute the probability of seeing the event in any of the 12 rows.

A sequence of consecutive events is also called a "run" of events. If you do an internet search for "probability of k heads in a row" or "probability of runs in coin toss", you will find many solutions to this problem. The source I used is a question that was asked on StackExchange about "blocks of events." Whereas many people approach this problem by using a simulation or an explicit recursive mathematical formula, "Neil G" and "COOLSerdash" compute the probability by using a Markov transition matrix, which is easy to create in the SAS/IML matrix language.

The following statements define a function that creates the Markov transition matrix and iterates it to compute the probability that coin will show k consecutive heads in N tosses. The program works for any probability of heads, not merely p=0.5. See the StackExchange article for the explanation:

proc iml;
k = 8;                     * desired number of correct trials in a row;
p = 1/2;                   * probability of getting a correct trial;
N = 28;                    * Total number of trials;
/* Iterate Markov transition matrix to compute probability of 
   k consecutive heads in N tosses of a coin that has 
   probability p of showing heads */
start ProbConsec(N, p, k);
   M = j(k+1, k+1, 0);     * set up the transition matrix M;
   M[1, 1:k] = (1-p);      * first row, except for last column;
   M[k+1, k+1] = 1;        * lower right corner;
   do i = 2 to (k+1);
      M[i, i-1] = p;       * subdiagonal elements;
   Mn = M**N;              * Calculate M^N;
   /* Prob that starting in State 1 ends in State (k+1) */
   return(Mn[(k+1), 1]);   
prob = ProbConsec(N, p, k);
print prob;

The result shows that the probability of seeing 8 consecutive heads out of 28 tosses is 0.0426. This is the same probability as observing 8 consecutive men in green in one of the rows at graduation, assuming that alphabetical ordering randomizes men and women. However, remember that there were 12 rows at graduation, so the probability of observing this event in ANY row is higher, as shown below:

ProbSee0 = (1-prob)##12;   * P(Not in Row1 AND ... NOT in Row 12);
ProbSeeAny = 1 - ProbSee0; * P(In Row1 OR ... OR in Row 12);
print ProbSeeAny ProbSee0;

The chance of observing exactly eight consecutive men in any of the 12 rows is about 41%. Of course, you can also compute the probability of observing 9, 10, 11, or more consecutive men. When you add up the probabilities, you discover that the cumulative probability of observing an "extreme arrangement" of 8 or more consecutive men is about 0.64. And why stop there? You could extend this analysis to include a sequence of consecutive women!


In summary, graduation events can be long, but computing the probabilities of interesting arrangements of the students can help make the time go faster! I wasn't able to compute the probabilities in my head while at the graduation, but it didn't take long to research the problem and solve it with SAS after I got home. I conclude that observing a long sequence of men in a randomized seating arrangement that has 12 rows of 28 seats is not a rare event. In fact, the chance of observing a run of eight or more men is about 64%.

The real lesson for all of us is that we should keep our eyes open and look around. Math and statistics are everywhere!

The post Runs in coin tosses; patterns in random seating appeared first on The DO Loop.

6月 012017

Last week I was asked a simple question: "How do I choose a seed for the random number functions in SAS?" The answer might surprise you: use any seed you like. Each seed of a well-designed random number generator is likely to give rise to a stream of random numbers, so you can view the various streams as statistically equivalent.

Random means random

To be clear, I am talking about using a seed value to initialize a modern, high-quality, pseudorandom number generator (RNG). For example, in SAS you can use the STREAMINIT subroutine to initialize the Mersenne twister algorithm that is used by the RAND function. If you are still using the old-style RANUNI or RANNOR functions in SAS, please read the article "Six reasons you should stop using the RANUNI function to generate random numbers."

A seed value specifies a particular stream from a set of possible random number streams. When you specify a seed, SAS generates the same set of pseudorandom numbers every time you run the program. However, there is no intrinsic reason to prefer one stream over another. The stream for seed=12345 is just as random as the stream for the nine-digit prime number 937162211.

Some people see the number 937162211 and think that it looks "more random" than 12345. They then assume that the random number stream that follows from CALL STREAMINIT(937162211) is "more random" than the random number stream for CALL STREAMINIT(12345). Nope, random means random. In modern pseudorandom generators, the streams for different seeds should have similar statistical properties. Furthermore, many RNGs use the base-2 representation of the seed for initialization and (12345)10 = (11000000111001)2 looks pretty random! In fact, if you avoid powers of 2, the base-2 representations of most base-10 numbers "look random."

Initialization: Hard for researchers, easy for users

Researchers who specialize in random number generators might criticize what I've said as overly simplistic. There have been many research papers written about how to take a 32-bit integer and use that information to initialize a RNG whose internal state contains more than 32 bits. There have been cases where a RNG was published and the authors later modified the initialization routine because certain seeds did not result in streams that were sufficiently random. There have been many discussions about how to create a seed initialization algorithm that is easy to call and that almost always results in a high-quality stream of random numbers.

These are hard problems, but fortunately researchers have developed ways to initialize a stream from a seed so that there is a high probability that the stream will have excellent statistical properties. The relevant question for many SAS programmers is "can I use 12345 or my telephone number as seed values, or do I always need to type a crazy-looking nine-digit sequence?" My response is that there is no reason to prefer the crazy-looking seed over an easy-to-type sequence such as your phone number, your birthday, or the first few digits of pi.

Choosing a random seed

If you absolutely insist on using a "random seed," SAS can help. If you call the STREAMINIT subroutine with the value 0, then SAS will use the date, time of day, and possibly other information to manufacture a seed when you call the RAND function. SAS puts the seed value into the SYSRANDOM system macro variable. That means you can use %PUT to display the seed that SAS created, as follows:

data _null_;
call streaminit(0);   /* generate seed from system clock */
x = rand("uniform");

Every time you run this program, you will get a different seed value that you can use as the seed for a next program.

A second method is to use the RAND function to generate a random integer between 1 and 231-1, which is the range of valid seed values for the Mersenne twister generator in SAS 9.4m4. The following program generates a random seed value:

data _null_;
call streaminit(0);
seed = ceil( (2**31 - 1)*rand("uniform") ); 
put seed=;

Both of these methods will generate a seed for you. However, the randomly generated seed does not provide any benefit. For a modern, high-quality, pseudorandom number generator, the stream should have good statistical properties regardless of the seed value. Using a random seed value does not make a stream "more random" than a seed that is easier to type.

The post How to choose a seed for generating random numbers in SAS appeared first on The DO Loop.

5月 102017

A SAS customer asked how to simulate data from a three-parameter lognormal distribution as specified in the PROC UNIVARIATE documentation. In particular, he wanted to incorporate a threshold parameter into the simulation.

Simulating lognormal data is easy if you remember an important fact: if X is lognormally distributed, then Y=log(X) is normally distributed. The converse is also true: If Y is normally distributed, then X=exp(Y) is lognormally distributed. Consequently, to simulate lognormal data you can simulate Y from the normal distribution and exponentiate it to get X, which is lognormally distributed by definition. If you want, you can add a threshold parameter to ensure that all values are greater than the threshold.

Simulate a sample from a two-parameter lognormal distribution

To reiterate: if Y ~ N(μ, σ) is normally distributed with location parameter μ and scale parameter σ, then X = exp(Y) is lognormally distributed with log-location parameter μ and log-scale parameter σ. Different authors use different names for the μ and σ parameters. The PROC UNIVARIATE documentation uses the symbol ζ (zeta) instead of μ, and it calls ζ a scale parameter. Hence, I will use the symbol ζ, too. I have previously written about the relationship between the two lognormal parameters and the mean and variance of the lognormal distribution.

Regardless of what name and symbol you use, you can use the definition to simulate lognormal data. The following SAS DATA set simulates one sample of size 1000 from a lognormal distribution with parameters ζ=2 and σ=0.5. PROC UNIVARIATE then fits a two-parameter lognormal distribution to the simulated data. The maximum likelihood estimates for the sample are 2.01 and 0.49, so the estimates from the simulated data are very close to the parameter values:

ods graphics/reset;
%let N = 1000;          /* sample size */
data LN1;
call streaminit(98765);
sigma = 0.5;      /* shape or log-scale parameter    */
zeta = 2;         /* scale or log-location parameter */
do i = 1 to &N;
   Y = rand("Normal", zeta, sigma);  /* Y ~ N(zeta, sigma)    */
   X = exp(Y);                       /* X ~ LogN(zeta, sigma) */
keep X;
proc univariate data=LN1;      /* visualize simulated data and check fit */
   histogram X / lognormal endpoints=(0 to 50 by 5)
                 odstitle="Simulated Lognormal Data (zeta=2, sigma=0.5)";
Simulated lognormal data

Simulate many samples from a three-parameter lognormal distribution

You can modify the previous program to simulate from a lognormal distribution that has a threshold parameter. You simply add the threshold value to the exp(Y) value, like this: X = theta + exp(Y). Because exp(Y) is always positive, X is always greater than theta, which is the threshold value.

In Monte Carlo simulation studies, you often want to investigate the sampling distribution of the model parameter estimates. That is, you want to generate many samples from the same model and see how the estimates differ across the random samples. The following DATA step simulates 500 random samples from the three-parameter lognormal distribution with threshold value 10. You can analyze all the samples with one call to PROC UNIVARIATE that uses the BY statement to identify each sample. This is the efficient way to perform Monte Carlo simulation studies in SAS.

%let N = 100;          /* sample size */
%let NumSamples = 500; /* number of samples */
%let Threshold = 10;   
data LN;               /* generate many random samples */
call streaminit(98765);
sigma = 0.5;      /* shape or log-scale parameter    */
zeta = 2;         /* scale or log-location parameter */
do SampleID = 1 to &NumSamples;
   do i = 1 to &N;
      Y = rand("Normal", zeta, sigma);
      X = &Threshold + exp(Y);
keep SampleID X;
ods exclude all;                 /* do not produce tables during analyses */
proc univariate data=LN;
   by SampleID;                  /* analyze the many random samples */
   histogram x / lognormal(theta=&Threshold); /* 2-param estimation */
   ods output parameterestimates=PE;
ods exclude none;
data Estimates;                 /* convert from long to wide for plotting */
keep SampleID Zeta Sigma;
merge PE(where=(Parameter="Scale") rename=(Estimate=Zeta))
      PE(where=(Parameter="Shape") rename=(Estimate=Sigma));
by sampleID;
label Zeta="zeta: Estimates of Scale (log-location)"
      Sigma="sigma: Estimate of Shape (log-scale)";
title "Approximate Sampling Distribution of Lognormal Estimates";
title2 "Estimates from &NumSamples Random Samples (N=&N)";
proc sgplot data=Estimates;
   scatter x=Zeta y=Sigma;
   refline 2 / axis=x;
   refline 0.5 / axis=y;
MLE Parameter Estimates for 500 Simulated Random Samples from Lognormal Distribution

The distribution of the 500 estimates appears to be centered on (ζ, σ) = (2, 0.5), which are the parameter values that were used to simulate the data. You can use the usual techniques in Monte Carlo simulation to estimate the standard deviation of the estimates.

A few closing remarks:

  • The RAND function does not support location and scale parameters for the lognormal distribution in SAS in 9.4m4. However, the RANDGEN function in SAS/IML does support two-parameter lognormal parameters. The RAND function will support lognormal parameters in 9.4m5.
  • In this study, the estimates are all two-parameter estimates, which assumes that you know the threshold value in the population. If not, you can use THETA=EST on the HISTOGRAM statement to obtain three-parameter lognormal estimates.
  • Because you need to exponentiate the Y variable, random values of Y must be less than the value of CONSTANT('logbig'), which is about 709. To avoid numerical overflows, make sure that ζ + 4*σ is safely less than 709.
  • This sort of univariate simulation is discussed in detail in Chapter 7 of the book Simulating Data with SAS, along with a general discussion about how to simulate from location-scale families even for distributions for which the RAND function does not support location or scale parameters.

The post Simulate lognormal data in SAS appeared first on The DO Loop.

3月 012017

Monte Carlo techniques have many applications, but a primary application is to approximate the probability that some event occurs. The idea is to simulate data from the population and count the proportion of times that the event occurs in the simulated data.

For continuous univariate distributions, the probability of an event is the area under a density curve. The integral of the density from negative infinity to a particular value is the definition of the cumulative distribution function (CDF) for a distribution. Instead of performing numerical integration, you can use Monte Carlo simulation to approximate the probability.

One-dimensional CDFs

In SAS software, you can use the CDF function to compute the CDF of many standard univariate distributions. For example, the statement prob = cdf("Normal", -1) computes the probability that a standard normal random variable takes on a value less than -1.

The CDF function is faster and more accurate than a Monte Carlo approximation, but let's see how the two methods compare. You can estimate the probability P(X < -1) by generating many random values from the N(0,1) distribution and computing the proportion that is less than -1, as shown in the following SAS DATA step

data _NULL_;
call streaminit(123456);
N = 10000;                       /* sample size */
do i = 1 to N;
   x = rand("Normal");           /* X ~ N(0,1) */
   cnt + (x < -1);               /* sum of counts for value less than -1 */
Prob = cdf("Normal", -1);        /* P(x< -1) */
MCEst = cnt / N;                 /* Monte Carlo approximation */
put Prob=;
put MCEst=;
Prob =0.1586552539

The Monte Carlo estimate is correct to two decimal places. The accuracy of this Monte Carlo computation is proportional to 1/sqrt(N), where N is the size of the Monte Carlo sample. Thus if you want to double the accuracy you need to quadruple the sample size.

Two-dimensional CDFs

SAS provides the PROBBNRM function for computing the CDF of a bivariate normal distribution, but does not provide a built-in function that computes the CDF for other multivariate probability distributions. However, you can use Monte Carlo techniques to approximate multivariate CDFs for any multivariate probability distribution for which you can generate random variates.

I have previously blogged about how to use the PROBBNRM function to compute the bivariate normal CDF. The following SAS/IML statements demonstrate how to use a Monte Carlo computation to approximate the bivariate normal CDF. The example uses a bivariate normal random variable Z ~ MVN(0, Σ), where Σ is the correlation matrix with Σ12 = 0.6.

The example computes the probability that a bivariate normal random variable is in the region G = {(x,y) | x<x0 and y<y0}. The program first calls the built-in PROBBNRM function to compute the probability. Then the program calls the RANDNORMAL function to generate 100,000 random values from the bivariate normal distribution. A binary vector (group) indicates whether each observation is in G. The MEAN function computes the proportion of observations that are in the region.

proc iml;
x0  = 0.3; y0 = 0.4; rho = 0.6; 
Prob = probbnrm(x0, y0, rho);      /* P(x<x0 and y<y0) */
call randseed(123456);
N = 1e5;                           /* sample size */
Sigma = { 1   &rho,                /* correlation matrix */
         &rho 1 };
mean = {0 0};
Z = randnormal(N, mean, Sigma);    /* sample from MVN(0, Sigma) */
group = (Z[,1] < x0 & Z[,2] < y0); /* binary vector */
MCEst = mean(group);               /* = sum(group=1) / N  */
print Prob MCEst;

You can use a scatter plot to visualize the Monte Carlo technique. The following statements create a scatter plot and use the DROPLINE statement in PROC SGPLOT to indicate the region G. Of the 100000 random observations, 49750 of them were in the region G. These observations are drawn in red. The observations that are outside the region are drawn in blue.

ods graphics / width=400px height=400px;
title "Estimate of P(x < x0  and y < y0) is 0.4978";
title2 "x0 = 0.3; y0 = 0.4; rho = 0.6";
call scatter(Z[,1], Z[,2]) group=group grid={x y}
     procopt="noautolegend aspect=1"
     option="transparency=0.9  markerattrs=(symbol=CircleFilled)"
     other="dropline x=0.3 y=0.4 / dropto=both;";

Higher dimensions

The Monte Carlo technique works well in low dimensions. As the dimensions get larger, you need to generate a lot of random variates in order to obtain an accurate estimate. For example, the following statements generate 10 million random values from the five-dimensional distribution of uncorrelated normal variates and estimate the probability of all components being less than 0:

d = 5;                         /* dimension */
N = 1e7;                       /* sample size */
mean = j(1, d, 0);             /* {0,0,...,0} */
Z = randnormal(N, mean, I(d)); /* Z ~  MVN (0, I)  */
v0 = {0 0 0 0 0};              /* cutoff values in each component */
ComponentsInRegion = (Z < v0)[,+]; /* number of components in region */
group = (ComponentsInRegion=d);    /* binary indicator vector */
MCEst = mean(group);               /* proportion of obs in region */
print (1/2**d)[label="Prob"] MCEst;

Because the normal components are independent, the joint probability is the product of the probabilities for each component: 1/25 = 0.03125. The Monte Carlo estimate is accurate to three decimal places.

The Monte Carlo technique can also handle non-rectangular regions. For example, you can compute the probability that a random variable is in a spherical region.

The Monte Carlo method is computationally expensive for high-dimensional distributions. In high dimensions (say, d > 10), you might need billions of random variates to obtain a reasonable approximation to the true probability. This is another example of the curse of dimensionality.

The post Monte Carlo estimates of joint probabilities appeared first on The DO Loop.

2月 012017

In a previous article, I showed how to simulate data for a linear regression model with an arbitrary number of continuous explanatory variables. To keep the discussion simple, I simulated a single sample with N observations and p variables. However, to use Monte Carlo methods to approximate the sampling distribution of statistics, you need to simulate many samples from the same regression model.

This article shows how to simulate many samples efficiently. Efficient simulation is the main emphasis of my book Simulating Data with SAS. For a detailed discussion about simulating data from regression models, see chapters 11 and 12.

The SAS DATA step in my previous post contains four steps. To simulate multiple samples, put a DO loop around the steps that generate the error term and the response variable for each observation in the model. The following program modifies the previous program and creates a single data set that contains NumSamples (=100) samples. Each sample is identified by an ordinal variable named SampleID.

/* Simulate many samples from a  linear regression model */
%let N = 50;            /* N = sample size               */
%let nCont = 10;        /* p = number of continuous variables */
%let NumSamples = 100;  /* number of samples                  */
data SimReg(keep= SampleID i Y x:);
call streaminit(54321);
array x[&nCont];        /* explanatory variables are named x1-x&nCont */
/* 1. Specify model coefficients. You can hard-code values such as
array beta[0:&nCont] _temporary_ (-4 2 -1.33 1 -0.8 0.67 -0.57 0.5 -0.44 0.4 -0.36);
      or you can use a formula such as the following */
array beta[0:&nCont] _temporary_;
do j = 0 to &nCont;
   beta[j] = 4 * (-1)**(j+1) / (j+1);       /* formula for beta[j] */
do i = 1 to &N;              /* for each observation in the sample */
   do j = 1 to &nCont;
      x[j] = rand("Normal"); /* 2. Simulate explanatory variables  */
   eta = beta[0];                       /* model = intercept term  */
   do j = 1 to &nCont;
      eta = eta + beta[j] * x[j];       /*     + sum(beta[j]*x[j]) */
   /* 5. simulate response for each sample */
   do SampleID = 1 to &NumSamples;      /* <== LOOP OVER SAMPLES   */
      epsilon = rand("Normal", 0, 1.5); /* 3. Specify error distrib*/
      Y = eta + epsilon;                /* 4. Y = model + error    */

The efficient way to analyzed simulated samples with SAS is to use BY-group processing. With By-group processing you can analyze all samples with a single procedure call. The following statements sort the data by the SampleID variable and call PROC REG to analyze all samples. The NOPRINT option ensures that the procedure does not spew out thousands of tables and graphs. (For procedures that do not support the NOPRINT option, there are other ways to turn off ODS when analyzing simulated data.) The OUTEST= option saves the parameter estimates for all samples to a SAS data set.

proc sort data=SimReg;
   by SampleID i;
proc reg data=SimReg outest=PE NOPRINT;
   by SampleID;
   model y = x:;

The PE data set contains NumSamples rows. Each row contains the p parameter estimates for the analysis of one simulated sample. The distribution of estimates is an approximation to the true (theoretical) sampling distribution of the statistics. The following image visualizes the joint distribution of the estimates of four regression coefficients. You can see that the distribution of the estimates appears to be multivariate normal and centered at the values of the population parameters.

You can download the SAS program that simulates the data, analyzes it, and produces the graph. The program is very efficient. For 10,000 random samples of size N=50 that contain p=10 variables, it takes about one second to run the Monte Carlo simulation and analyses.

tags: Simulation, Statistical Programming

The post Simulate many samples from a linear regression model appeared first on The DO Loop.