simulation

January 10, 2022
 

On this blog, I write about a diverse set of topics that are relevant to statistical programming and data visualization. In a previous article, I presented some of the most popular blog posts from 2021. The most popular articles often deal with elementary or familiar topics that are useful to almost every data analyst. However, I also write articles that discuss more advanced topics. Did you make a New Year's resolution to learn something new this year? Here is your chance! The following articles deserve a second look. I have grouped them into four categories: SAS programming, statistics, optimization, and data simulation.

SAS programming

For years, I've been writing articles about how to accomplish tasks in Base SAS. In addition, I now write about how to program basic tasks in SAS Viya.

  • Use the DOLIST Syntax to Specify Tick Marks on a Graph

Probability and statistics

[Figure: A Family of Density Curves for the Inverse Gamma Distribution]

Probability and statistics provide the theoretical basis for modern data analysis. You cannot understand data science, machine learning, or artificial intelligence without understanding the basic principles of statistics. Most readers are familiar with common probability distributions, Pearson correlation, and least-squares regression. The following articles discuss some of the lesser-known statistical cousins of these topics:

  • The inverse gamma distribution: To use any probability distribution, you need to know four essential functions: the density, the cumulative probability, the quantiles, and how to generate random variates. You might be familiar with the gamma distribution, but what is the inverse gamma distribution and how do you define the four essential functions in SAS? Similarly, what is the generalized gamma distribution?
  • The Hoeffding D statistic: The Hoeffding D statistic measures the association between two variables. How do you compute the Hoeffding D statistic in SAS, and how do you interpret the results?
  • Weibull regression: In ordinary least squares regression, the response variable is assumed to be modeled by a set of explanatory variables and normally distributed errors. If the error terms are Weibull distributed, you can estimate parameters for the Weibull distribution as part of a regression analysis, but you need to transform the regression estimates to obtain the usual Weibull parameters.

Optimization

Optimization is at the heart of many statistical techniques, such as maximum likelihood estimation. But sometimes you need to solve a "pure" optimization problem. SAS supports many methods for solving optimization problems.

Multivariate simulation and bootstrapping

It is straightforward to simulate univariate data. It is harder to simulate multivariate data while preserving important relations between the variables, such as correlation. Similarly, it can be challenging to analyze the bootstrap distribution of a multivariate statistic, such as a correlation matrix:

  • The Geometry of the Iman-Conover Transformation

Your turn

Did I omit one of your favorite blog posts from The DO Loop in 2021? If so, leave a comment and tell me what topic you found interesting or useful.

The post 12 blog posts from 2021 that deserve a second look appeared first on The DO Loop.

January 5, 2022
 

You can use the Cholesky decomposition of a covariance matrix to simulate data from a correlated multivariate normal distribution. This method is encapsulated in the RANDNORMAL function in SAS/IML software, but you can also perform the computations manually by calling the ROOT function to get the Cholesky root and then use matrix multiplication to correlate multivariate normal (MVN) data, as follows:

%let d = 12;           /* number of variables */
%let N = 500;          /* number of observations */
proc iml;
d = &d;                /* want to generate d variables that are MVN */
N = &N;                /* sample size */
 
/* easy to generate uncorrelated MVN variables */
call randseed(1234);
z = j(d, N);
call randgen(z, "Normal"); /* z is MVN(0,I(d)) */
 
/* use Cholesky root to transform to correlated variables */
Sigma = toeplitz(d:1);   /* example of covariance matrix */
G = root(Sigma);         /* the Cholesky root of the full covariance matrix */
x = G`*z;                /* x is MVN(0, Sigma) */
title "Two Components of MVN(0,Sigma) Data";
call scatter(x[1,], x[2,]) label={x1 x2};

The simulation creates d=12 correlated variables and 500 observations. The first two variables are plotted against each other so that you can see that they are correlated. The other pairs of variables are also correlated according to the entries in the Σ covariance matrix.
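
As a quick numerical check (these statements are not part of the original program), you can compare the sample covariance of the simulated data with Σ. The agreement is approximate for N=500 and improves as the sample size grows:

SampleCov = cov(x`);                   /* rows of x` are the observations */
maxDiff = max(abs(SampleCov - Sigma)); /* sampling error; decreases as N grows */
print maxDiff;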

Generating a huge number of MVN variables

In the comments to one of my previous articles, a SAS programmer asked whether it is possible to generate MVN data even if the covariance matrix is so large that it cannot be factored by the ROOT function. For example, suppose you want to generate d=20,000 MVN variables. A d x d covariance matrix of that size requires about 3 GB of RAM (20,000² × 8 bytes ≈ 3.2×10⁹ bytes), and on some operating systems you cannot form the Cholesky root of a matrix that large. However, I have developed an algorithm that enables you to deal with the covariance matrix (Σ) as a block matrix.

To introduce the algorithm, represent the Σ matrix as a 2 x 2 block matrix. (See the last section for a more general representation.) For a 2 x 2 block algorithm, you only need to form the Cholesky roots of matrices of size (d/2), which are 10000 x 10000 matrices for this example. Matrices of that size require only 0.75 GB and can be quickly factored. Of course, there is no such thing as a free lunch: The new algorithm is more complicated and requires additional computations, albeit on smaller matrices.

Let's look at how to simulate MVN data by treating Σ as a block matrix with blocks of size d/2. For ease of exposition, assume d is even. The algorithm easily handles the case where d is odd: choose the upper blocks to have floor(d/2) rows and the lower blocks to have ceil(d/2) rows. The SAS code (below) handles the general case. (In fact, the algorithm doesn't care about the block size. For any p in (0, 1), one block can be size pd and the other size (1-p)d.)

In block form, the covariance matrix is
\(\Sigma = \begin{bmatrix}A & B\\B^\prime & C\end{bmatrix}\)
There is a matrix decomposition called the LDLT decomposition that enables you to write Σ as the product of three block matrices:

\(\begin{bmatrix}A & B\\B^\prime & C\end{bmatrix} = \begin{bmatrix}I & 0\\B^\prime A^{-1} & I\end{bmatrix} \begin{bmatrix}A & 0\\0 & S\end{bmatrix} \begin{bmatrix}I & A^{-1}B\\0 & I\end{bmatrix} \)
where \(S = C - B^\prime A^{-1} B\) is the Schur complement of the block A. There is a theorem that says that if Σ is symmetric positive definite (SPD), then so is every principal submatrix; thus A and C are SPD. Consequently, we can use the Cholesky decomposition and write \(A = G^\prime_A G_A\) for an upper triangular matrix, \(G_A\). Another theorem states that the Schur complement of an SPD matrix is also SPD, so we can write \(S = G^\prime_S G_S\) for an upper triangular matrix \(G_S\).

Using these definitions, the Cholesky decomposition of Σ can be written in block form as
\(\Sigma = \begin{bmatrix}G^\prime_A & 0\\B^\prime G^{-1}_A & G^\prime_S\end{bmatrix} \begin{bmatrix}G_A & (G^\prime_A)^{-1} B \\ 0 & G_S\end{bmatrix} \)

To simulate MVN data, we let z be uncorrelated MVN(0, I) data and define
\(x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix}G^\prime_A & 0\\B^\prime G^{-1}_A & G^\prime_S\end{bmatrix} \begin{bmatrix}z_1 \\ z_2\end{bmatrix} = \begin{bmatrix}G^\prime_A z_1 \\ B^\prime G^{-1}_A z_1 + G^\prime_S z_2\end{bmatrix} \)

The previous equation indicates that you can generate correlated MVN(0, Σ) data if you know the Cholesky decomposition of the upper-left block of Σ (which is \(G_A\)) and the Cholesky decomposition of the Schur complement (which is \(G_S\)). Each of the matrices in these equations is of size d/2, which means that the computations are performed on matrices that are half the size of Σ. However, notice that generating the \(x_2\) components requires some extra work: the Schur complement involves the inverse of A, and the formula for \(x_2\) also involves the inverse of \(G_A\). These issues are not insurmountable, but it means that the block algorithm is more complicated than the original algorithm, which merely multiplies z by the Cholesky root of Σ.

A SAS program for block simulation of MVN data

The SAS/IML program in a previous section shows how to generate correlated MVN(0, Σ) data (x) from uncorrelated MVN(0, I) data (z) by using the Cholesky root (G) of Σ. This section shows how to get exactly the same x values by performing block operations. For the block operations, each matrix is of size d/2 and therefore contains 1/4 of the elements of Σ.

The first step is to represent Σ and z in block form.

/* The Main Idea: get the same MVN data by performing block 
   operations on matrices that are size d/2 */
/* 1. Represent Sigma={A B, B` C} and z={z1, z2} in block form */
d1 = ceil(d/2);           /* half the variables */
d2 = d - d1;              /* the other half of the vars */
dp1 = d1 + 1;
 
A = Sigma[1:d1,  1:d1];   /* break up the symmetric (d x d) covariance */
B = Sigma[1:d1,  dp1:d];  /* matrix into blocks of size d/2 */
C = Sigma[dp1:d, dp1:d];
z1 = z[1:d1, ];           /* extract the first d1 obs for MVN(0,I) data */
z2 = z[dp1:d, ];          /* extract the last d2 obs for MVN(0,I) data */

It is easy to generate x1, which contains the first d/2 components of the MVN(0, Σ) simulated data. You simply use the Cholesky decomposition of A, which is the upper-left block of Σ:

/* 2. Compute Cholesky root of A and compute x1 from z1 */
G_A = root(A);            /* Cholesky of upper left block */
x1 = G_A`*z1;             /* generate first half of variables */

It is not as easy to generate x2, which contains the last d/2 components. In block form, \(x_2 = v + u\), where \(v = B^\prime G_A^{-1} z_1\) and \(u = G_S^\prime z_2\), and where \(S = G_S^\prime G_S\) is the Cholesky decomposition of the Schur complement. This is shown in the following statements:

/* 3. Generate the second half of variables */
S = C - B`*inv(A)*B;      /* S is the Schur complement */
G_S = root(S);            /* Cholesky root of Schur complement */
v = B`*inv(G_A)*z1;
u = G_S`*z2;
x2 = v + u;
 
/* verify that the computation gives exactly the same simulated data */
print (max(abs(x[1:d1,] - x1)))[L="Diff(x - x1)"];
print (max(abs(x[dp1:d,] - x2)))[L="Diff(x - x2)"];

The output shows that the block computation, which uses multiple matrices of size d/2, gives exactly the same answer as the original computation, which uses matrices of size d.

Analysis of the block algorithm

Now that the algorithm runs on a small example, you can increase the dimensions of the matrices. For example, try rerunning the algorithm with the following parameters:

%let d = 5000;    /* number of variables */
%let N = 1000;    /* number of observations */

When I run this larger program on my PC, it takes about 2.3 seconds to simulate the MVN data by using the full Cholesky decomposition. It takes about 17 seconds to simulate the data by using the block method. Most of the time for the block method is spent on the computation v = B`*inv(G_A)*z1, which takes about 15 seconds. So if you want to improve the performance, you should focus on improving that computation.

You can improve the performance of the block algorithm in several ways:
  • You don't need to find the inverse of A. You have the Cholesky root of A, which is triangular, so you can define \(H = (G_A^\prime)^{-1} B\) by solving a triangular system and compute the Schur complement as C - H`*H.
  • You don't need to explicitly form any inverses. Ultimately, all these matrices operate on the vector z2, which means that you can simulate the data by using only matrix multiplication and solutions of linear equations. A sketch of this approach follows this list.
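
For example, the following SAS/IML statements (a sketch, not part of the original program) use the TRISOLV function to solve triangular systems instead of forming inverses. The same matrix H serves both purposes because \(B^\prime G_A^{-1} = H^\prime\):

/* sketch: block algorithm without explicit inverses */
H = trisolv(2, G_A, B);    /* solves G_A`*H = B, so H = inv(G_A`)*B */
S = C - H`*H;              /* Schur complement without forming inv(A) */
G_S = root(S);             /* Cholesky root of Schur complement */
v = H`*z1;                 /* because B`*inv(G_A) = H` */
x2 = v + G_S`*z2;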

I tried several techniques, but I could not make the block algorithm competitive with the performance of the original Cholesky algorithm. The issue is that the block algorithm computes two Cholesky decompositions, solves some linear equations, and performs matrix multiplication. Even though each computation is on a matrix that is half the size of the original matrix, the additional computations result in a longer runtime.

Generalization to additional blocks

The algorithm generalizes to other block sizes. I will outline the case for a 3 x 3 block matrix. The general case of k x k blocks follows by induction.

Represent the Σ covariance matrix as a 3 x 3 block matrix. By using the 2 x 2 block algorithm, you can obtain multivariate normal vectors x1 and x2, which correspond to the covariances in the four blocks in the upper-left corner of the matrix. You can then view the upper four blocks as a single (larger) block, thus obtaining a 2 x 2 block matrix where the upper-left block has size 2d/3 and the lower-right block (Σ33) has size d/3.

There are many details to work through, but the main idea is that you can repeat the computation on this new block matrix, where the block sizes are unequal. To proceed, you need the Cholesky decomposition of the large upper-left block, but we have already shown how to compute that in terms of the Cholesky root of Σ11, its inverse, and the Cholesky root of the corresponding Schur complement. You also need to form the Schur complement of Σ33, but that is feasible because we have the Cholesky decomposition of the large upper-left block.

I have not implemented the details because it is not clear to me whether three or more blocks provides a practical advantage. Even if I work out the details for k=3, the algebra for k > 3 blocks seems to be daunting.

Summary

It is well known that you can use the Cholesky decomposition of a covariance matrix to simulate data from a correlated multivariate normal distribution. This article shows how to break up the task by using a block Cholesky method. The method is implemented for k=2 blocks. The block method requires more matrix operations, but each operation is performed on a smaller matrix. Although the block method takes longer to run, it can be used to generate a large number of variables by generating k variables at a time, where k is small enough that the k x k matrices can be easily handled.

The block method generalizes to k > 2 blocks, but it is not clear whether more blocks provide any practical benefits.

The post A block-Cholesky method to simulate multivariate normal data appeared first on The DO Loop.

December 6, 2021
 

While discussing how to compute convex hulls in SAS with a colleague, we wondered how the size of the convex hull compares to the size of the sample. For most distributions of points, I claimed that the size of the convex hull is much less than the size of the original sample. This article looks at the expected value of the proportion p = k / N, where k is the size of the convex hull for a set of N planar points. The proportion depends on the distribution of the points. This article shows the results for three two-dimensional distributions: the uniform distribution on a rectangle, the normal distribution, and the uniform distribution on a disk.

Monte Carlo estimates of the expected number of points on a convex hull

You can use Monte Carlo simulation to estimate the expected number of points on a convex hull. First, specify the distribution of the points and the size of the sample, N. Then do the following:

  • Simulate N random points from the distribution and compute the convex hull of these points. This gives you a value, k, for this one sample.
  • Repeat the simulation B times. You will have B values of k: {k1, k2, ..., kB}.
  • The average of the B values is an estimate of the expected value for the number of points on the convex hull for a random set of size N.
  • If you prefer to think in terms of proportions, the ratio k / N is an estimate of the expected proportion of points on the convex hull.

Simulate uniform random samples for N=200

Before you run a full simulation study, you should always run a smaller simulation to ensure that your implementation is correct. The following SAS/IML statements generate 1000 samples of size N=200 from the uniform distribution on the unit square. For each sample, you can compute the number of points on the convex hull. You can then plot the distribution of the number of points on the convex hull, as follows:

proc iml;
call randseed(1);
 
/* 1. Convex hull of N=200 random uniformly distributed 
      points in the unit square [0,1] x [0,1] */
NRep = 1000;            /* number of Monte Carlo samples */
Distrib = "Uniform";
N = 200;                /* sample size */
k = j(NRep, 1, 0);      /* store the number of points on CH */
X = j(N, 2, 0);         /* allocate matrix for sample */
do j = 1 to NRep;
   call randgen(X, Distrib);  /* random sample */
   Indices = cvexhull(X);
   k[j] = sum(Indices>0);     /* the positive indices are on CH */
end;
 
Ek = mean(k);           /* MC estimate of expected value */
print Ek;
title "k = Number of Points on Convex Hull; N = 200";
call bar(k) other="inset ('E[k] =' = '13.88') / border;";

The bar chart shows the distribution of the size of the convex hull for the 1,000 simulated samples. For most samples, the convex hull contains between 12 and 15 points. The expected value is the average of these values, which is 13.88 for this simulation.

If you want to compare samples of different sizes, it is useful to standardize the estimated quantity. Rather than predict the number of points on the convex hull, it is useful to predict the proportion of the sample that lies on the convex hull. Furthermore, we should report not only the expected value but also a confidence interval. This makes it clearer that the proportion for any one sample is likely to be within a range of values:

p = k / N;              /* proportion of points on the CH */
mean = mean(p);         /* expected proportion */
call qntl(CI, p, {0.025, 0.975}); /* 95% confidence interval */
print mean CI;
 
refStmt = 'refline ' + char(mean//CI) + '/ axis=X lineattrs=(color=red);';
title "Proportion of Points on Convex Hull, 1000 Random Samples";
title2 "X ~ U(Square); N = 200";
call histogram(p) other=refStmt label='Proportion';

For this distribution of points, you should expect about 7% of the sample to lie on the convex hull. An empirical 95% confidence interval is between 5% and 9%.

According to an unpublished preprint by Har-Peled ("On the Expected Complexity of Random Convex Hulls", 2011), the expected number of points on the convex hull of a random sample from the unit square is asymptotically O(log(n)).

Simulate samples for many sizes

The previous section simulated B samples for size N=200. This section repeats the simulation for values of N between 300 and 8,000. Instead of plotting histograms, I plot the expected value and a 95% confidence interval for each value of N. So that the simulation runs more quickly, I have set B=250. The following SAS/IML program simulates B samples for various values of N, all drawn from the uniform distribution on the unit square:

/* 2. simulation for X ~  U(Square), N = 300..8000 */
NRep = 250;  /* Monte Carlo simulation: use 250 samples of size N */
Distrib = "Uniform";
N = do(300, 900, 100) || do(1000, 8000, 500);  /* sample sizes */
k = j(NRep, ncol(N), 0);         /* store the number of points on CH */
do i = 1 to ncol(N);
   X = j(N[i], 2, 0);            /* allocate matrix for sample */
   do j = 1 to NRep;
      call randgen(X, Distrib);  /* random sample */
      Indices = cvexhull(X);
      k[j,i] = sum(Indices>0);   /* the positive indices are on CH */
   end;
end;
 
p = k / N;                       /* proportion of points on the CH */
mean = mean(p);                  /* expected proportion */
call qntl(CI, p, {0.025, 0.975});  /* 95% confidence interval */

You can use the CREATE statement to write the results to a SAS data set. You can use the SUBMIT/ENDSUBMIT statements to call PROC SGPLOT to visualize the result. Because I want to create the same plot several times, I will define a subroutine that creates the graph:

/* define subroutine that creates a scatter plot with confidence intervals */
start HiLow(N, p, lower, upper);
    Ek = N#p;  /* expected value of k */
    create CH var {N p lower upper Ek}; append; close;
    submit;
      proc sgplot data=CH noautolegend;
        format Ek 2.0;
        highlow x=N low=lower high=upper;
        scatter x=N y=p / datalabel=Ek;
        xaxis grid label='Sample Size (N)' offsetmax=0.08;
        yaxis grid label='Proportion on Convex Hull';
      run;
    endsubmit;
finish;
 
title "Expected Number of Points on Convex Hull of Sample of Size N";
title2 "X ~ U(Square)";
run HiLow(N, mean, CI[1,], CI[2,]);

The graph shows the expected proportion of points on the convex hull for samples of size N. The expected number of points (rounded to the nearest integer) is shown adjacent to each marker. For example, you should expect a sample of size N=2000 to have about 1% of its points (k=20) on the convex hull. As the sample size grows larger, the proportion decreases. For N=8000, the expected proportion is only about 0.3%, or 23.5 points on the convex hull.
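
Har-Peled's O(log(N)) rate, cited earlier, implies that E[k] should grow roughly linearly in log(N). As a quick check (this step is not in the original post), you can fit that relationship to the CH data set that the HiLow subroutine creates. Run the following after the PROC IML program finishes:

/* sketch: check the O(log N) growth of E[k] for uniform points in a square */
data CHfit;
   set CH;              /* CH was created by the HiLow subroutine */
   logN = log(N);
run;
proc reg data=CHfit;
   model Ek = logN;     /* E[k] ~ b0 + b1*log(N) if the asymptotic rate holds */
quit;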

Simulate samples from the bivariate normal distribution

The previous graph shows the results for points drawn uniformly at random from the unit square. If you change the distribution of the points, you will change the graph. For example, a bivariate normal distribution has most of its points deep inside the convex hull and very few points near the convex hull. Intuitively, you would expect the ratio k / N to be smaller for bivariate normal data, as compared to uniformly distributed data.

You can analyze bivariate normal data by changing only one line of the previous program. Simply change the statement Distrib = "Uniform" to Distrib = "Normal". With that modification, the program creates the following graph:

For bivariate normal data, the graph shows that the expected proportion of points on the convex hull is much less than for data drawn from the uniform distribution on the square. For a sample of size N=2000, you should expect about 0.6% of its points (k=12) on the convex hull. For N=8000, the expected proportion is only about 0.2%, or 13.7 points on the convex hull.

Simulate samples from the uniform distribution on the disk

Let's examine one last distribution: the uniform distribution on the unit disk. I have previously shown how to generate random uniform points from the disk, or, in fact, from any d-dimensional ball. The following program defines the RandBall subroutine, which generates random samples from the unit disk. This subroutine is used instead of the RANDGEN subroutine:

/* 4. random uniform in disk: X ~ U(Disk)
   https://blogs.sas.com/content/iml/2016/03/30/generate-uniform-2d-ball.html */
start RandBall(X, radius=1);
   N = nrow(X); d = ncol(X);
   call randgen(X, "Normal");       /* X ~ MVN(0, I(d)) */
   u = randfun(N, "Uniform");       /* U ~ U(0,1)       */
   r = radius * u##(1/d);           /* r proportional to d_th root of U */
   X = r # X / sqrt(X[,##]);        /* X[,##] is sum of squares for each row */
finish;
 
NRep = 250;  /* Monte Carlo simulation: use 250 samples of size N */
N = do(300, 900, 100) || do(1000, 8000, 500);  /* sample sizes */
k = j(NRep, ncol(N), 0);         /* store the number of points on CH */
do i = 1 to ncol(N);
   X = j(N[i], 2, 0);            /* allocate matrix for sample */
   do j = 1 to NRep;
      call RandBall(X);          /* random sample */
      Indices = cvexhull(X);
      k[j,i] = sum(Indices>0);   /* the positive indices are on CH */
   end;
end;
 
p = k / N;                       /* proportion of points on the CH */
mean = mean(p);                  /* expected proportion */
call qntl(CI, p, {0.025, 0.975});  /* 95% confidence interval */
 
title "Expected Number of Points on Convex Hull of Sample of Size N";
title2 "X ~ U(Disk)";
run HiLow(N, mean, CI[1,], CI[2,]);

For this distribution, a greater proportion of points are on the convex hull, as compared to the other examples. For a sample of size N=2000, you should expect about 2% of its points (k=42) on the convex hull. For N=8000, the expected proportion is about 0.8%, or 67.7 points on the convex hull.

According to Har-Peled (2011), the expected number of points on the convex hull of a random sample from the unit disk is asymptotically \(O(n^{1/3})\).

Linear transformations of convex hulls

Although I ran simulations for the uniform distribution on a square and on a disk, the same results hold for arbitrary rectangles and ellipses. It is not hard to show that if X is a set of points and C(X) is the convex hull, then for any affine transformation T: R2 → R2, the set T(C(X)) is the convex hull of T(X). Thus, the number of points on the convex hull is invariant under invertible affine transformations.

In the same way, the simulation for bivariate normal data (which used uncorrelated variates) remains valid for correlated normal data. This is because the Cholesky transformation is an invertible linear map that converts uncorrelated variates into correlated variates.
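
The following SAS/IML statements (a sketch that is not part of the original post; the matrix M and vector c are arbitrary choices) verify the invariance numerically by comparing the hull size of a sample with the hull size of its image under an invertible affine map:

proc iml;
call randseed(321);
N = 200;
X = j(N, 2);
call randgen(X, "Normal");        /* bivariate normal sample */
M = {2 1, 0 3};                   /* invertible linear map (det = 6) */
c = {10 -5};                      /* translation vector */
Y = X*M` + repeat(c, N, 1);       /* affine image T(X) */
kX = sum(cvexhull(X) > 0);        /* number of points on the hull of X */
kY = sum(cvexhull(Y) > 0);        /* number of points on the hull of T(X) */
print kX kY;                      /* the counts are identical */
quit;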

Summary

This article uses simulation to understand how the size of a convex hull relates to the size of the underlying sample, drawn randomly from a 2-D distribution. Convex hulls are preserved under linear transformations, so you can simulate from a standardized distribution. This article presents results for three distributions: the uniform distribution on a rectangle, the bivariate normal distribution, and the uniform distribution on an ellipse.

The post The expected number of points on a convex hull appeared first on The DO Loop.


November 8, 2021
 

Recall that the binomial distribution is the distribution of the number of successes in a set of independent Bernoulli trials, each having the same probability of success. Most introductory statistics textbooks discuss the approximation of the binomial distribution by the normal distribution. The graph to the right shows that the normal density (the red curve, N(μ=9500, σ=21.79)) can be a very good approximation to the binomial density (blue bars, Binom(p=0.95, nTrials=10000)). However, the binomial density is discrete and is defined only for nonnegative integers, whereas the normal density is defined for all real numbers.

In this graph, the binomial density and the normal density are close. But what does the approximation look like if you overlay a bar chart of a random sample from the binomial distribution? It turns out that the bar chart can have large deviations from the normal curve, even for a large sample. Read on for an example.

The normal approximation to the binomial distribution

I have written two previous articles that discuss the normal approximation to the binomial distribution.

The normal approximation is used to estimate probabilities because it is often easier to use the area under the normal curve than to sum many discrete values. However, as shown in the second article, the discrete binomial distribution can have statistical properties that are different from the normal distribution.
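
To make this concrete, the following DATA step (a small example that is not in the original post) compares an exact binomial probability with its normal approximation for the parameters used in this article; the cutoff value 9520 is an arbitrary choice:

data _null_;
   p = 0.95;  n = 10000;
   mu = n*p;  sigma = sqrt(n*p*(1-p));
   exact  = cdf("Binomial", 9520, p, n);        /* P(X <= 9520) by summing the PMF */
   approx = cdf("Normal", 9520+0.5, mu, sigma); /* normal area, continuity corrected */
   put exact= approx=;
run;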

Random binomial samples

It is important to remember that a random sample (from any distribution!) can look much different from the underlying probability density function. The following graph shows a random sample from the binomial distribution Binom(0.95, 10000). The distribution of the sample looks quite different from the density curve.

In statistical terms, the observed and expected values have large deviations. Also, note that there can be a considerable deviation between adjacent bars. For example, in the graph, some bars have about 2% of the total frequency whereas an adjacent bar might have half that value. I was surprised to observe the large deviations in a large sample.
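
A back-of-the-envelope calculation (not in the original post) explains the size of the deviations. The height of each bar is itself a binomial count: if a value has probability p_j under Binom(0.95, 10000), then its count in a sample of N=1000 has standard deviation sqrt(N*p_j*(1-p_j)). The following DATA step computes the value for the modal bar:

data _null_;
   p = 0.95;  nTrials = 10000;  N = 1000;
   pj = pdf("Binomial", 9500, p, nTrials);  /* probability of the modal value, ~0.018 */
   Ecount = N*pj;                           /* expected count for that bar, ~18 */
   SD = sqrt(N*pj*(1-pj));                  /* SD of the bar count, ~4.2 */
   put pj= Ecount= SD=;
run;

A standard deviation that is roughly 23% of the expected bar height makes it unsurprising that adjacent bars sometimes differ by a factor of two.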

This graph emphasizes the fact that a random sample from the binomial distribution can look different from the smooth bell-shaped curve of the probability density. I think I was surprised by the magnitude of the deviations from the expected values because I have more experience visualizing continuous distributions. For a continuous distribution, we use a histogram to display the empirical distribution. When the bin width of the histogram is greater than 1, it smooths out the differences between adjacent bars to create a much smoother estimate of the density. For example, the following graph displays the distribution of the same random sample but uses a bin width of 10 to aggregate the frequency. The resulting histogram is much smoother and resembles the normal approximation.

Summary

The normal approximation is a convenient way to compute probabilities for the binomial distribution. However, it is important to remember that the binomial distribution is a discrete distribution. A binomial random variable can assume only nonnegative integer values. One consequence of this observation is that a bar chart of a random binomial sample can show considerable deviations from the theoretical density. This is normal (pun intended!). It is often overlooked because if you treat the random variable as if it were continuous and use a histogram to estimate the density, the histogram smooths out a lot of the bar-to-bar deviations.

Appendix

The following SAS program creates the graphs in this article.

/* define parameters for the Binom(p=0.95, nTrials=10000) simulation */
%let p = 0.95;                    /* probability of success */
%let NTrials = 10000;             /* number of trials */
%let N = 1000;                    /* sample size */   
 
/* First graph: Compute the density of the Normal and Binomial distributions. See
   https://blogs.sas.com/content/iml/2016/09/12/overlay-curve-bar-chart-sas.html
*/
data Binom;
n = &nTrials;  p = &p;  q = 1 - p;
mu = n*p;  sigma = sqrt(n*p*q);   /* parameters for the normal approximation */
Lower = mu-3.5*sigma;             /* evaluate normal density on [Lower, Upper] */
Upper = mu+3.5*sigma;
 
/* PDF of normal distribution */
do t = Lower to Upper by sigma/20;
   Normal = pdf("normal", t, mu, sigma);       output;
end;
 
/* PMF of binomial distribution */
t = .; Normal = .;        /* these variables are not used for the bar chart */
do j = max(0, floor(Lower)) to ceil(Upper);
   Binomial = pdf("Binomial", j, p, n);      output;
end;
/* store mu and sigma in macro variables */
call symput("mu", strip(mu));      
call symput("sigma", strip(round(sigma,0.01)));
label Binomial="Binomial Probability"  Normal="Normal Density";
keep t Normal j Binomial;
run;
 
/* overlay binomial density (needle plot) and normal density (series plot) */
title "Binomial Probability and Normal Approximation";
title2 "Binom(0.95, 10000) and N(9500, 21.79)";
proc sgplot data=Binom;
   needle x=j y=Binomial;
   series x=t y=Normal / lineattrs=GraphData2(thickness=2);
   inset "p = &p"  "q = %sysevalf(1-&p)" "nTrials = &nTrials"  
         "(*ESC*){unicode mu} = np = &mu"           /* use Greek letters */
         "(*ESC*){unicode sigma} = sqrt(npq) = &sigma" /
         position=topright border;
   yaxis label="Probability";
   xaxis label="x" integer;
run;
 
/*************************/
 
/* Second graph: simulate a random sample from Binom(p, NTrials) */
data Bin(keep=x);
call streaminit(1234);
do i = 1 to &N; 
   x = rand("Binomial", &p, &NTrials); 
   output;
end;
run; 
 
/* count the frequency of each observed count */
proc freq data=Bin noprint;
   tables x / out=FreqOut;
run;
 
data All;
set FreqOut Binom(keep=t Normal);
Normal = 100*1*Normal;   /* match scales: 100*h*PDF, where h=binwidth */
run;
 
/* overlay sample and normal approximation */
title "Random Binomial Sample";
title2 "Bar Chart of Binom(0.95, 10000)";
proc sgplot data=All;   
   needle x=x y=Percent;
   series x=t y=Normal / lineattrs=GraphData2(thickness=2);
   inset "n = &n"  "p = &p" 
         "(*ESC*){unicode mu} = &mu"           /* use Greek letters */
         "(*ESC*){unicode sigma} = &sigma" /
         position=topright border;
   yaxis label="Percent / Scaled Density";
   xaxis label="x" integer;
run;
 
/*************************/
 
/* Third graph: the bar-to-bar deviations are smoothed if you use a histogram */
title2 "Histogram BinWidth=10";
proc sgplot data=Bin;   
   histogram x / scale=percent binwidth=10;
   xaxis label="x" integer values=(9400 to 9600 by 20);
run;
 
/* for comparison, the histogram looks like the bar chart if you set BINWIDTH=1 */
title2 "Histogram BinWidth=1";
proc sgplot data=Bin;   
   histogram x / binwidth=1 scale=percent;
   xaxis label="x" integer;
run;

The post The normal approximation and random samples of the binomial distribution appeared first on The DO Loop.

November 2, 2021
 

There are times when it is useful to simulate data. One of the reasons I use simulated data sets is to demonstrate statistical techniques such as multiple or logistic regression. By using SAS random functions and some DATA step logic, you can create variables that follow certain distributions or are correlated with other variables. You might decide to simulate a drug study where the drug group has a higher or lower mean than a placebo group.

Because most programs that create simulated data use random numbers, let's start off by discussing the RAND function. This function can generate random numbers from dozens of distributions, including the uniform, normal, and Bernoulli distributions. Veteran SAS programmers might be more familiar with older random number functions such as RANUNI and RANNOR: RANUNI generated uniform random numbers (numbers between 0 and 1), and RANNOR generated random numbers from a normal distribution. The RAND function has replaced all of the older functions and has a number of advantages over them.

The first argument of the RAND function is the name of the distribution that you want to use, such as Uniform, Normal, or Bernoulli. For some of the distributions, such as Normal, you can supply parameters such as the mean and standard deviation. Here are some examples:

  • rand('Uniform') generates uniform random numbers (between 0 and 1)
  • rand('Normal',100,20) generates values from a normal distribution with a mean of 100 and a standard deviation of 20
  • rand('Bernoulli',.4) generates a 0 or 1, with the probability of a 1 equal to .4
  • rand('Binomial',.2,5) generates random numbers that represent the number of successes in a sample of size 5, with the probability of success equal to .2
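
The following short DATA step (a sketch that is not in the original post) calls each of the four functions above so that you can inspect the generated values:

data RandDemo;
   call streaminit(13579);          /* seed for reproducibility (see the note below) */
   do i = 1 to 5;
      u = rand('Uniform');          /* uniform on (0,1) */
      x = rand('Normal',100,20);    /* normal, mean=100, SD=20 */
      b = rand('Bernoulli',.4);     /* 0 or 1, with P(1)=.4 */
      k = rand('Binomial',.2,5);    /* number of successes in 5 trials, p=.2 */
      output;
   end;
   drop i;
run;
 
proc print data=RandDemo noobs; run;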

 

Important Note: if you want a reproducible series of random numbers using the RAND function, you must seed it by a call to STREAMINIT (with a positive integer argument) prior to its use. For example:

       call streaminit(132435);

To clarify the note above, here are two programs that use the RAND function: one with, and one without, the call to STREAMINIT.

data Without;
   do i = 1 to 5;
      x = rand('Uniform');
      output;
   end;
   drop i;
run;

Here is the output from running this program twice.

Notice that the values of x are different in each run. Now let's run the same program with CALL STREAMINIT included. Here is the program.

data With;
   call streaminit(13579);
   do i = 1 to 5;
      x = rand('Uniform');
      output;
   end;
   drop i;
run;

And here are the output listings from running this program twice.

Adding CALL STREAMINIT creates the same sequence of random numbers each time the program is run. This is useful if you are generating groups for a drug study and want to be able to re-create the random sequences when it comes time to break the blind and analyze the results. Another reason I sometimes want to generate a repeatable sequence of random numbers is for problem sets included in many of my books—I want the reader to get exactly the same results as I did.

Let's switch topics and see how to write a program that simulates flipping a coin. The program below uses a popular method, but it is not as elegant as the next program I'm going to show you.

*Old fashioned way to generate "random" events;
data Toss;
   do n = 1 to 10;
      if rand('uniform') lt .5 then Result = 'Tails';
      else Result = 'Heads';
      output;
   end;
run;

In the long run, half of the uniform random numbers will be less than .5, and the proportion of heads and tails will be approximately .5. Here is a listing of data set Toss.

A more sophisticated approach takes advantage of the RAND function's ability to generate random numbers from multiple distributions. A Bernoulli distribution is similar to a coin toss, and you can adjust the probability of a 1 or a 0 by including a second parameter to the function. The Toss2 program, shown below, does just that.

*More sophisticated program;
proc format;
   value Heads_Tails 0="Heads" 1="Tails";
run;
 
data Toss2;
   do n = 1 to 10;
      Results = rand('Bernoulli',.5);
      format Results Heads_Tails.;
      output;
   end;
run;

The format Heads_Tails substitutes the labels "Heads" and "Tails" for values of 0 and 1, respectively. Here is a listing of data set Toss2.

The final topic of this blog concerns generating random values for two or more correlated variables. The example that follows generates correlated x-y pairs.

*Creating correlated x-y pairs;
data Corr;
   do i = 1 to 1000;
      x = rand('normal',100,10);
      y = .5*x + rand('Normal',50,10);
      output;
   end;
   drop i;
run;

By including a proportion of the x-value when creating the y-value, the x- and y-values will be correlated. Shown below is the output from PROC CORR, showing that x and y are correlated (r = .45586).

I used a SAS Studio task to create the scatterplot shown next.

You can increase or decrease the correlation by changing the coefficient of x that is used to create y. For example, you could use y = .8*x + rand('Normal',20,10); to create x-y pairs with a higher correlation.
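
For readers who want a theoretical value (this derivation is not in the original post): if y = b*x + e, where x has standard deviation σx and the independent error e has standard deviation σe, then the correlation is b·σx / sqrt(b²·σx² + σe²). For b = .5 and σx = σe = 10, this gives .5(10)/sqrt(25 + 100) ≈ .447, which is close to the observed r = .45586. For b = .8, the theoretical correlation rises to about .62.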

You can see more examples of the RAND function in my book, SAS Functions by Example, Second Edition, available as an e-book from RedShelf or in print form from Amazon.

To learn more about how to use SAS Studio as part of OnDemand for Academics, to write SAS programs, or to use SAS Studio tasks, please take a look at my new book: Getting Started with SAS Programming: Using SAS Studio in the Cloud (available in e-book from RedShelf or in a paper version from Amazon).

I hope you enjoyed reading this blog and, as usual, I invite comments and/or questions.

Creating Simulated Data Sets was published on SAS Users.

September 22, 2021
 

The field of probability and statistics is full of asymptotic results. The Law of Large Numbers and the Central Limit Theorem are two famous examples. An asymptotic result can be both a blessing and a curse. For example, consider a result that says that the distribution of some statistic converges to the normal distribution as the sample size goes to infinity. This is a blessing because it means that you can estimate standard errors and confidence intervals for the statistic by applying simple formulas from the normal distribution. However, when I use an asymptotic approximation, I always hear a little voice in my head that whispers, "Are you sure the sample size is sufficiently large?"

One way to quiet that voice is to use a simulation to explore how large the sample must be before an asymptotic result becomes valid in practice. For some asymptotic results, textbooks give guidance about when the asymptotic result is valid. For example, many undergraduate texts recommend N ≥ 30 as a cutoff at which the Central Limit Theorem (CLT) applies, regardless of the shape of the underlying distribution. The Wikipedia article on the Central Limit Theorem shows several simulations that demonstrate the CLT.

This article examines a question that a reader asked on my blog. Recall that for normally distributed data X ~ N(μ, σ), you can use the sample mean, standard deviation, and sample size to estimate a confidence interval for the mean. The endpoints of the interval estimate are \(\bar{x} \pm t_{1-\alpha/2, N-1} \, s/\sqrt{N}\), where \(t_{1-\alpha/2, N-1}\) is a critical quantile from the t distribution with N-1 degrees of freedom. The reader asked when this formula is valid for statistics that are asymptotically normal. This article shows how to use simulation in SAS to answer questions like this on a case-by-case basis.

The distribution of the sample mean of ChiSq(k)

The question concerns confidence intervals for the mean of a chi-square distribution with k degrees of freedom. How large does the sample size have to be before you can use the asymptotic formula for a (normal) confidence interval?

The Central Limit Theorem applies to this case, so you can assume that the formula is approximately valid for a sample of size N ≥ 30. But what does "approximately valid" mean? And if you use the normal-based formula on smaller samples, is the result approximately correct, or is it way off? The next section demonstrates how to answer these questions by using a simulation study.

A simulation to reveal an asymptotic result

This section shows how to use a simulation to investigate when the normal-based formula for a 95% confidence interval is a good estimate for the true confidence interval. This section follows the techniques in the article "Coverage probability of confidence intervals: A simulation approach." If you have not previously performed a simulation to determine confidence intervals, start by reading that article, which explains the fundamental ideas.

When you conduct a simulation study, it is best to write down what you are simulating, from what distribution, and what you are trying to investigate. The following pseudocode specifies the simulation for this example:

For each parameter value k=10, 20, ... do the following steps:
    For sample sizes N=2, 3, ..., 40, do the following steps:
        Generate 10,000 samples of size N from the chi-square distribution with k degrees of freedom.
            For each sample:
                Compute the sample mean and standard deviation. 
                Use them in the formula for the normal confidence interval.
                Determine whether the confidence interval contains the true mean of the 
                          ChiSquare(k) distribution, which is k.
        Estimate the coverage probability as the proportion of samples for which the 
                 confidence interval contains the true mean.

In a previous article, I showed how to perform this simulation study by using the SAS DATA step and Base SAS procedures. For comparison, the following program implements the simulation by using the SAS/IML matrix language. For simplicity, I do not loop over the degrees of freedom, but hard-code k=20 degrees of freedom.

%let DF = 20;                      /* degrees of freedom parameter */
%let NumSamples = 10000;           /* number of samples in each simulation */
proc iml;
k = &DF;
call randseed(12345);
Sizes = 2:40;                      /* sample sizes */
/* allocate the results matrix */
varNames = {'N' 'Coverage'};
h = ncol(varNames);
results = j(ncol(Sizes), h, .);
results[,1] = Sizes`;
 
do i = 1 to ncol(Sizes);
   N = Sizes[i];
   mu = k;                         /* true mean of chi-square(k) distribution */
   x = j(N, &NumSamples, .);       /* N x NumSamples matrix */
   call randgen(x, "chisq", k);    /* fill each column with chi-square(k) variates */
 
   /* use the normal formula to estimate CIs from samples */
   SampleMean = mean(x);           /* compute sample mean and standard error */
   SampleSEM = std(x) / sqrt(n);
   tCrit = quantile("t", 1-0.05/2, N-1);  /* two-sided critical value of t statistic, alpha=0.05 */
   Lower = SampleMean - tCrit*SampleSEM;
   Upper = SampleMean + tCrit*SampleSEM;
 
   /* what proportion of CIs contain the true mean? */
   ParamInCI = (Lower<mu & Upper>mu);     /* indicator variable */
   prop = mean(ParamInCI`);
   results[i,2] = prop;
end;
 
create MCResults from results[c=VarNames];
   append from results;
close;
QUIT;
 
title "Empirical Coverage of Confidence Intervals";
title2 "X ~ Chi-Square(&DF)";
proc  sgplot data=MCResults;
   scatter x=N y=Coverage;
   refline 0.95 / axis=y noclip;
   label N="Sample Size (N)" Coverage="Empirical Coverage Probability";
run;

The graph shows the empirical coverage probability for each value of the sample size, N. For k=20 degrees of freedom, the coverage probability appears to be less than 95% for very small samples but increases towards 95% as the sample size grows.

You can change the value of the DF macro variable to see how the graph changes for other values of the degrees-of-freedom parameter. The following graph shows the coverage probability for k=10 degrees of freedom and for a range of sample sizes. If you use the CI formula that is based on the assumption of normality, you will be slightly overconfident. The nominal "95% CI" is more like a 93% or 94% CI for small samples from the chi-square(10) distribution. Of course, you could adjust for this by increasing the confidence level that is used in the normal-based formula. For example, you might want to use the formula for a nominal 96% CI to ensure that the coverage probability for small samples is at least 95% in practice.
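
If you want to experiment with that adjustment (this one-line modification is not part of the original program), change the critical value in the PROC IML program:

   tCrit = quantile("t", 1-0.04/2, N-1);  /* alpha=0.04 gives a nominal 96% CI */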

Notice in both graphs that the coverage probability is close to 95% for sample sizes N ≥ 30. You can see the Central Limit Theorem kicking in! The simulation study enables you to quantify this asymptotic result and to determine whether N ≥ 30 is "sufficiently large" for whatever asymptotic result you are considering using. You can modify the program and convince yourself that N ≥ 60 might be a safer choice when k=10.

Summary

In summary, simulation studies are useful for investigating whether an asymptotic result is valid for the sample sizes that you want to analyze. This article examined the question of whether you can use a normal-based formula to estimate the confidence interval for a mean. The Central Limit Theorem says that you do this for "sufficiently large" samples, but what does that mean in practice? A Monte Carlo simulation study enables you to quantify that statement. You can estimate the empirical coverage probability when you apply the formula to small sample sizes. This article used a SAS/IML program to implement the simulation study, but you could also use the SAS DATA step and Base SAS procedures, as shown in a previous article.

The post Use simulations to evaluate the accuracy of asymptotic results appeared first on The DO Loop.

9月 222021
 

The field of probability and statistics is full of asymptotic results. The Law of Large Numbers and the Central Limit Theorem are two famous examples. An asymptotic result can be both a blessing and a curse. For example, consider a result that says that the distribution of some statistic converges to the normal distribution as the sample size goes to infinity. This is a blessing because it means that you can estimate standard errors and confidence intervals for the statistic by applying simple formulas from the normal distribution. However, when I use an asymptotic approximation, I always hear a little voice in my head that whispers, "Are you sure the sample size is sufficiently large?"

One way to quiet that voice is to use a simulation to explore how large the sample must be before an asymptotic result becomes valid in practice. For some asymptotic results, textbooks give guidance about when the asymptotic result is valid. For example, many undergraduate texts recommend N ≥ 30 as a cutoff at which the Central Limit Theorem (CLT) applies, regardless of the shape of the underlying distribution. The Wikipedia article on the Central Limit Theorem shows several simulations that demonstrate the CLT.

This article examines a question that a reader asked on my blog. Recall that for normally distributed data X ~ N(μ, σ), you can use the sample mean, standard deviation, and sample size to estimate a confidence interval for the mean. The endpoints of the interval estimate are \(\bar{x} \pm t_{1-\alpha/2, N-1} \, s/\sqrt{N}\), where \(t_{1-\alpha/2, N-1}\) is a critical quantile from the t distribution with N-1 degrees of freedom. The reader asked when this formula is valid for statistics that are asymptotically normal. This article shows how to use simulation in SAS to answer questions like this on a case-by-case basis.
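To make the formula concrete, the following SAS/IML statements compute the normal-based 95% confidence interval for the mean of one sample. This is a minimal sketch; the data values are made up for illustration:

proc iml;
x = {18.1, 22.5, 19.7, 25.2, 16.4, 21.9};    /* hypothetical sample of size 6 */
N = nrow(x);
alpha = 0.05;
tCrit = quantile("t", 1-alpha/2, N-1);       /* two-sided critical value of t */
CI = mean(x) + {-1 1}*tCrit*std(x)/sqrt(N);  /* lower and upper endpoints */
print CI[c={'Lower' 'Upper'}];
quit;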

The distribution of the sample mean of ChiSq(k)

The question concerns confidence intervals for the mean of a chi-square distribution with k degrees of freedom. How large does the sample size have to be before you can use the asymptotic formula for a (normal) confidence interval?

The Central Limit Theorem applies to this case, so you can assume that the formula is approximately valid for a sample of size N ≥ 30. But what does "approximately valid" mean? And if you use the normal-based formula on smaller samples, is the result approximately correct, or is it way off? The next section demonstrates how to answer these questions by using a simulation study.

A simulation to reveal an asymptotic result

This section shows how to use a simulation to investigate when the normal-based formula for a 95% confidence interval is a good estimate for the true confidence interval. This section follows the techniques in the article "Coverage probability of confidence intervals: A simulation approach." If you have not previously performed a simulation to determine confidence intervals, start by reading that article, which explains the fundamental ideas.

When you conduct a simulation study, it is best to write down what you are simulating, from what distribution, and what you are trying to investigate. The following pseudocode specifies the simulation for this example:

For each parameter value k=10, 20, ..., do the following steps:
    For sample sizes N=2, 3, ..., 40, do the following steps:
        Generate 10,000 samples of size N from the chi-square distribution with k degrees of freedom.
        For each sample:
            Compute the sample mean and standard deviation.
            Use them in the formula for the normal confidence interval.
            Determine whether the confidence interval contains the true mean of the
                ChiSquare(k) distribution, which is k.
        Estimate the coverage probability as the proportion of samples for which the
            confidence interval contains the true mean.

In a previous article, I showed how to perform this simulation study by using the SAS DATA step and Base SAS procedures. For comparison, the following program implements the simulation by using the SAS/IML matrix language. For simplicity, I do not loop over the degrees of freedom, but hard-code k=20 degrees of freedom.

%let DF = 20;                      /* degrees of freedom parameter */
%let NumSamples = 10000;           /* number of samples in each simulation */
proc iml;
k = &DF;
call randseed(12345);
Sizes = 2:40;                      /* sample sizes */
/* allocate the results matrix */
varNames = {'N' 'Coverage'};
h = ncol(varNames);
results = j(ncol(Sizes), h, .);
results[,1] = Sizes`;
 
do i = 1 to ncol(Sizes);
   N = Sizes[i];
   mu = k;                         /* true mean of chi-square(k) distribution */
   x = j(N, &NumSamples, .);       /* N x NumSamples matrix */
   call randgen(x, "chisq", k);    /* fill each column with chi-square(k) variates */
 
   /* use the normal formula to estimate CIs from samples */
   SampleMean = mean(x);           /* compute sample mean and standard error */
   SampleSEM = std(x) / sqrt(N);
   tCrit = quantile("t", 1-0.05/2, N-1);  /* two-sided critical value of t statistic, alpha=0.05 */
   Lower = SampleMean - tCrit*SampleSEM;
   Upper = SampleMean + tCrit*SampleSEM;
 
   /* what proportion of CIs contain the true mean? */
   ParamInCI = (Lower<mu & Upper>mu);     /* indicator variable */
   prop = mean(ParamInCI`);
   results[i,2] = prop;
end;
 
create MCResults from results[c=VarNames];
   append from results;
close;
QUIT;
 
title "Empirical Coverage of Confidence Intervals";
title2 "X ~ Chi-Square(&DF)";
proc sgplot data=MCResults;
   scatter x=N y=Coverage;
   refline 0.95 / axis=y noclip;
   label N="Sample Size (N)" Coverage="Empirical Coverage Probability";
run;

The graph shows the empirical coverage probability for each value of the sample size, N. For k=20 degrees of freedom, the coverage probability appears to be less than 95% for very small samples but increases towards 95% as the sample size grows.

You can change the value of the DF macro variable to see how the graph changes for other values of the degrees-of-freedom parameter. The following graph shows the coverage probability for k=10 degrees of freedom and for a range of sample sizes. If you use the CI formula that is based on the assumption of normality, you will be slightly overconfident. The nominal "95% CI" is more like a 93% or 94% CI for small samples from the chi-square(10) distribution. Of course, you can adjust for this by increasing the confidence level that is used in the normal-based formula. For example, you might want to use the formula for a nominal 96% CI to ensure that the coverage probability for small samples is at least 95% in practice.
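For example, in the program above, a one-statement change (sketched below) requests the nominal 96% CI:

   tCrit = quantile("t", 1-0.04/2, N-1);  /* alpha=0.04 ==> nominal 96% CI */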

Notice in both graphs that the coverage probability is close to 95% for sample sizes N ≥ 30. You can see the Central Limit Theorem kicking in! The simulation study enables you to quantify this asymptotic result and to determine whether N ≥ 30 is "sufficiently large" for whatever asymptotic result you intend to use. You can modify the program and convince yourself that N ≥ 60 might be a safer choice when k=10.

Summary

In summary, simulation studies are useful for investigating whether an asymptotic result is valid for the sample sizes that you want to analyze. This article examined the question of whether you can use a normal-based formula to estimate the confidence interval for a mean. The Central Limit Theorem says that you can do this for "sufficiently large" samples, but what does that mean in practice? A Monte Carlo simulation study enables you to quantify that statement. You can estimate the empirical coverage probability when you apply the formula to small sample sizes. This article used a SAS/IML program to implement the simulation study, but you could also use the SAS DATA step and Base SAS procedures, as shown in a previous article.

The post Use simulations to evaluate the accuracy of asymptotic results appeared first on The DO Loop.

9月 092021
 

A statistical programmer asked how to simulate event-trials data for groups. The subjects in each group have a different probability of experiencing the event. This article describes one way to simulate this scenario. The simulation is similar to simulating from a mixture distribution. This article also shows three different ways to visualize the results of the simulation.

Modeling the data-generation process

A simulation is supposed to reproduce a data-generation process. Typically, a simulation is based on some real data. You (the programmer) need to ensure that the simulation reflects how the real data are generated. For example, there are often differences between simulating a designed experiment or simulating an observational study:

  • In a designed experiment, the number of subjects in each group is determined by the researcher. For example, a researcher might choose 100 subjects, 50 men and 50 women. If so, each simulated data sample should also contain 50 men and 50 women.
  • In an observational study, the number of subjects in each group is a random variable that is based on the proportion of the population in each group. For example, a researcher might choose 100 subjects at random. In the data, there might be 53 men and 47 women, but that split is the result of random chance. Each simulated data sample should generate the number of males and females according to a random binomial variable. Some samples might have a 52:48 split, others a 49:51 split, and so on. In general, the group sizes are simulated from a multinomial distribution, as shown in the sketch below.

For other ways to choose the number of subjects in categories, see the article, "Simulate contingency tables with fixed row and column sums in SAS."
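The following SAS/IML statements sketch the observational-study case. The two population proportions and the seed are hypothetical; each row of the output is one random split of 100 subjects:

proc iml;
call randseed(4321);
prop = {0.5 0.5};                        /* assumed population proportions: men, women */
Size = RandMultinomial(5, 100, prop);    /* five random samples, each with 100 subjects */
print Size[c={'Men' 'Women'}];
quit;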

Data that specify events and trials

Suppose you have the following data from a pilot observational study. Subjects are classified into six groups based on two factors. One factor has three levels (for example, political party) and the other has two levels (for example, sex). For each group, you know the number of subjects (Ni) and the number of events (Ei). You can therefore estimate the proportion of events in the i-th group as Ei / Ni. Because you know the total number of subjects (Sum(N)), you can also estimate the probability that an individual belongs to each group (πi = Ni / Sum(N)).

The following SAS DATA step defines the proportions for each group:

%let NTotal = 107;
data Events;
length Group $ 3;
input Cat1 $ Cat2 $ Events N;
Group = catx(" ", Cat1, Cat2); /* combine two factors into group ID */
p  = Events / N;               /* estimate P(event | Group[i]) */
pi = N / &NTotal;              /* estimate P(subject in Group[i]) */
drop Cat1 Cat2;
format p pi Best5.;
datalines;
A  F  1  10
A  M  8  24
B  F  4  13
B  M  12 36
C  F  12 16
C  M  6  8
;
 
proc print data=Events; run;

This pilot study has 107 subjects. The first group ("A F") contained 10 subjects, which is π1 = 9.3% of the total. In the first group, one person experienced the event, which is p1 = 10% of the group. The other rows of the table are similar. Some people call this "event-trials" data because the quantity of interest is the proportion of events that occurred in the subjects. The next section shows how to use these estimates to simulate new data samples.

Simulate group sizes and proportions in groups

One of the advantages of simulation studies is that you can take preliminary data from a pilot study and "scale it up" to create larger samples, assuming that the statistics from the pilot study are representative of the parameters in the population. For example, the pilot study involved 107 subjects, but you can easily simulate samples of size 250 or 1000 or more.

The following SAS/IML program simulates samples of size 250. The statistics from the pilot study are used as parameters in the simulation. For example, the simulation assumes that 0.093 of the population belongs to the first group, and that 10% of the subjects in the first group experience the event of interest. Notice that some groups will have a small number of subjects (such as Group 1 and Group 6, which have small values for π). Among the groups, some will have a small proportion of events (Group 1) whereas others will have a large proportion (Group 5 and Group 6).

The simulation uses the following features:

  • The RandMultinomial function in SAS/IML generates a random set of group sizes based on the π vector, which estimates the proportion of the population that belongs to each group.
  • For each group, you can use the binomial distribution to randomly generate the number of events in the group.
  • The proportion of events is the ratio (Number of Events) / (Group Size).

/* simulate proportions of events in groups */
proc iml;
use Events;
read all var _ALL_;  /* creates vectors Group, Events, N, p, pi, etc. */
close;
numGroups = nrow(Group);
 
call randseed(12345);
nTrials = 250;       /* simulate data for a study that contains 250 subjects */
 
Size = T( RandMultinomial(1, nTrials, pi) ); /* new random group sizes */
nEvents = j(numGroups, 1, .);                /* vector for number of events */
do i = 1 to nrow(nEvents);
   nEvents[i] = randfun(1, "Binomial", p[i], Size[i]);
end;
Proportion = nEvents / Size;                 /* proportion of events in each group */
 
results = Size || nEvents || Proportion;
print results[r=Group c={'Size' 'nEvents' 'Proportion'} F=Best5.];

The table shows one random sample of size 250 based on the statistics in the smaller pilot study. The group size and the number of events are both random variables. If you rerun this simulation, the number of subjects in the groups will change, as will the proportion of subjects in each group that experience the event. It would be trivial to adapt the program to handle a designed experiment in which the Size vector is constant.
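For example, the following sketch holds the group sizes fixed, as you would for a designed experiment. The Size vector is hypothetical (it sums to 250); only the number of events is random:

proc iml;
call randseed(12345);
p    = {0.1, 0.333, 0.308, 0.333, 0.75, 0.75};  /* event probabilities (rounded) from the pilot study */
Size = {25, 55, 30, 85, 35, 20};                /* hypothetical fixed group sizes; sum = 250 */
nEvents = j(nrow(Size), 1, .);
do i = 1 to nrow(Size);
   nEvents[i] = randfun(1, "Binomial", p[i], Size[i]);
end;
Proportion = nEvents / Size;                    /* proportion of events in each group */
print Size nEvents Proportion[F=Best5.];
quit;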

Simulating multiple samples

An important property of a Monte Carlo simulation study is that it reveals the distribution of statistics that can possibly result in a future data sample. The table in the previous section shows one possible set of statistics, but we can run the simulation additional times to generate hundreds or thousands of additional variations. The following program generates 1,000 possible random samples and outputs the results to a SAS data set. You can then graph the distribution of the statistics. This section uses box plots to visualize the distribution of the proportion of events in each group:

/* simulate 1000 random draws under this model */
SampleID = j(numGroups, 1, .);
nEvents = j(numGroups, 1, .);                /* vector for number of events */
create Sim var {'SampleID' 'Group' 'Size' 'nEvents' 'Proportion'};
do ID = 1 to 1000;
   /* assign all elements the same value:
      https://blogs.sas.com/content/iml/2013/02/18/empty-subscript.html */
   SampleID[,] = ID;
   Size = T( RandMultinomial(1, nTrials, pi) );  /* new random group sizes */
   do i = 1 to nrow(nEvents);
      nEvents[i] = randfun(1, "Binomial", p[i], Size[i]);
   end;
   Proportion = nEvents / Size;
   append;
end;
close Sim;
QUIT;
 
data PlotAll;
set Sim Events(keep=Group p);
run;
 
title "Proportion of Events in Each Group";
title2 "Simulated N=250; NumSim=1000";
/* box plot */
proc sgplot data=PlotAll noautolegend;
   hbox Proportion / category=Group;
   scatter y=Group x=p / markerattrs=GraphData2(symbol=CircleFilled size=11);
   yaxis type=discrete display=(nolabel); /* create categorical axis */
   xaxis grid; 
run;

The box plot shows the distribution of the proportion of events in each group. For example, the proportion in the first group ("A F") ranges from a low of 0% to a high of 33%. The proportion in the sixth group ("C M") ranges from a low of 22% to a high of 100%. The red markers are the parameter values used in the simulation. These are the estimates from the pilot study.

Alternative ways to visualize the distribution within groups

The box plot shows a schematic representation of the distribution of proportions within each group. However, there are alternative ways to visualize the distributions. One way is to use a strip plot, as follows:

/* strip plot */
proc sgplot data=PlotAll noautolegend;
   scatter y=Group x=Proportion / 
           jitter transparency=0.85       /* handle overplotting */
           markerattrs=(symbol=CircleFilled);
   scatter y=Group x=p / markerattrs=(symbol=CircleFilled size=11);
   yaxis type=discrete reverse display=(nolabel); /* create categorical axis */
   xaxis grid; 
run;

The strip plot uses a semi-transparent marker to display each statistic. (Again, the red markers indicate the parameters for the simulation.) The density of the distribution is visualized by the width of the strips and by the darkness of the plot, which is due to overplotting the semi-transparent markers.

This plot makes it apparent that the variation between groups is based on the relative size of the groups in the pilot study. Large groups such as Group 4 ("B M") have less variation than small groups such as Group 6 ("C M"). That's because the standard error of the proportion is inversely proportional to the square root of the sample size. Thus, smaller groups have a wider distribution than the larger groups.
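In symbols, the estimated standard error of a sample proportion is \(\sqrt{\hat{p}(1-\hat{p})/n}\), where n is the group size. In these simulations, the expected size of Group 4 is about 84 subjects (250 × 36/107), whereas the expected size of Group 6 is about 19 subjects (250 × 8/107). Consequently, you should expect the spread of the proportions for Group 6 to be roughly twice the spread for Group 4.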

You can see the same features in a panel of histograms. In the following visualization, the distribution of the proportions is shown by using a histogram. Red vertical lines indicate the parameters for the simulation. This graph might be easier to explain to non-statisticians.

/* plot of histograms */
proc sgpanel data=PlotAll;
   panelby Group / onepanel layout=rowlattice novarname;
   histogram Proportion;
   refline p / axis=x lineattrs=GraphData2(thickness=2);
   colaxis grid; 
run;

Summary

This article shows how to simulate event-trials data in SAS. In this example, the data belong to six different groups, and the probability of experiencing the event varies between groups. The groups are also different sizes. In the simulation, the group size and the number of events in each group are random variables. This article also shows three different ways to visualize the results of the simulation: a box plot, a strip plot, and a panel of histograms.

The post Simulate proportions for groups appeared first on The DO Loop.

7月 072021
 

In general, it is hard to simulate multivariate data that has a specified correlation structure. Copulas make that task easier for continuous distributions. A previous article presented the geometry behind a copula and explained copulas in an intuitive way. Although I strongly believe that statistical practitioners should be familiar with statistical theory, you do not need to be an expert in copulas to use them to simulate multivariate correlated data. In SAS, the COPULA procedure in SAS/ETS software provides a syntax for simulating data from copulas. This article shows how to use PROC COPULA to simulate data that has a specified rank correlation structure. (Similar syntax applies to the CCOPULA procedure in SAS Econometrics, which can perform simulations in parallel in SAS Viya.)

Example data

Let's choose some example data to use for the simulation study. Suppose you want to simulate data that have marginal distributions and correlations that are similar to the joint distribution of the MPG_City, Weight, and EngineSize variables in the Sashelp.Cars data set. For ease of discussion, I rename these variables X1, X2, and X3 in this article. The following call to PROC CORR visualizes the univariate and bivariate distributions for these data:

/* for ease of discussion, rename vars to X1, X2, X3 */
data Have;
set Sashelp.Cars(keep= MPG_City Weight EngineSize);
label MPG_City= Weight= EngineSize=;
rename MPG_City=X1 Weight=X2 EngineSize=X3;
run;
 
/* graph original (renamed) data */
ods graphics / width=500px height=500px;
proc corr data=Have Spearman noprob plots=matrix(hist);
   var X1 X2 X3;
   ods select SpearmanCorr MatrixPlot;
run;

The graph shows the marginal and bivariate distributions for these data. The goal of this article is to use PROC COPULA to simulate a random sample that looks similar to these data. The output from PROC CORR includes a table of Spearman correlations (not shown), which are Corr(X1,X2)=-0.865, Corr(X1,X3)=-0.862, and Corr(X2,X3)=0.835.

Using PROC COPULA to simulate data

As explained in my previous article about copulas, the process of simulating data involves several conceptual steps. For each step, PROC COPULA provides a statement or option that performs the step:

  1. The multivariate distribution is defined by specifying a marginal distribution for each variable and a correlation structure for the joint distribution. In my previous article, these were created "out of thin air." However, in practice, you often want to simulate random samples that match the correlation and marginal distributions of real data. In PROC COPULA, you use the DATA= option and the VAR statement to specify real data.
  2. Choose a copula family and use the data to estimate the parameters of the copula. I've only discussed the Gaussian copula. For the Gaussian copula, the sample covariance matrix estimates the parameters for the copula, although PROC COPULA uses maximum likelihood estimates instead of the sample covariance. You use the FIT statement to fit the parameters of the copula from the data.
  3. Simulate from the copula. For the normal copula, this consists of generating multivariate normal data with the given rank correlations. These simulated data are transformed to uniformity by applying the normal CDF to each component. You use the SIMULATE statement to simulate the data. If you want to see the uniform marginals, you can use the OUTUNIFORM= option to store them to a SAS data set, or you can use the PLOTS=(DATATYPE=UNIFORM) option to display a scatter plot matrix of the uniform marginals.
  4. Transform the uniform marginals into the specified distributions by applying the inverse CDF transformation for each component. In PROC COPULA, you use the MARGINALS=EMPIRICAL option on the SIMULATE statement to create the marginal distributions by using the empirical CDF of the variables. You can use the OUT= option to write the simulated values to a SAS data set. You can use the PLOTS=(DATATYPE=ORIGINAL) option to display a scatter plot matrix of the simulated data.

The following call to PROC COPULA carries out these steps:

%let N = 428;            /* sample size */
proc copula data=Have;
   var X1 X2 X3;         /* original data vars */
   fit normal;           /* choose normal copula; estimate covariance by MLE */
   simulate / seed=1234 ndraws=&N
              marginals=empirical    /* transform from copula by using empirical CDF of data */
              out=SimData            /* contains the simulated data */
              plots=(datatype=both); /* optional: scatter plots of copula and simulated data */
              /* optional: use OUTUNIFORM= option to store the copula */
   ods select SpearmanCorrelation MatrixPlotSOrig MatrixPlotSUnif;
run;

The primary result of the PROC COPULA call is an output data set called SimData. The data set contains random simulated data. The Spearman correlation of the simulated data is close to the Spearman correlation of the original data. The marginal distributions of the simulated variables are similar to the marginal distributions of the original variables. The following graph shows the distributions of the simulated data. You can compare this graph to the graph of the original data.
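One way to verify these claims is to analyze the simulated data. The following sketch reuses the options from the earlier PROC CORR call to compute the Spearman correlations and the matrix of distributions for the SimData data set:

proc corr data=SimData Spearman noprob plots=matrix(hist);
   var X1 X2 X3;
   ods select SpearmanCorr MatrixPlot;
run;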

When you use PROC COPULA, it is not necessary to visualize the copula, which encapsulates the correlation between the variables. However, if you want to visualize the copula, you can use the PLOTS=(DATATYPE=UNIFORM) option to create a scatter plot matrix of the uniform marginals.

Use PROC COPULA for Monte Carlo simulations

In Monte Carlo simulations, it is useful to have one SAS data set that contains B simulated samples. You can then use BY-group processing to analyze all simulated samples with a single call to a SAS procedure.

In PROC COPULA, the NDRAWS= option on the SIMULATE statement specifies how many observations to simulate. If your original data contains N observations, you can use NDRAWS=N. To get a total of B simulated samples, you can simulate N*B observations. You can then add an ID variable that identifies which observations belong to each sample, as follows:

/* you can run a Monte Carlo simulation from the simulated data */
%let NumSamples = 1000; /* = B = number of Monte Carlo simulations */
%let NTotal = %eval(&N * &NumSamples);
%put &=NTotal;
 
ods exclude all;
proc copula data=Have;
   var X1 X2 X3;
   fit normal;
   simulate / seed=1234 ndraws=&NTotal marginals=empirical out=SimData;
run;
ods exclude none;
 
/* add ID variable that identifies each set of &N observations */
data MCData;
set SimData;
SampleID = ceil(_N_/&N);  /* 1,1,...,1,2,2,....,2,3,3,... */
run;

You can use your favorite SAS procedure and the BY SAMPLEID statement to analyze each sample in the MCData data set.
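For example, the following sketch uses the OUTS= option in PROC CORR to write the Spearman correlations for each of the 1,000 simulated samples to a data set:

proc corr data=MCData Spearman noprint outs=CorrOut;
   by SampleID;          /* analyze each simulated sample separately */
   var X1 X2 X3;
run;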

Summary

Given a set of continuous variables, a copula enables you to simulate a random sample from a distribution that has the same rank correlation structure and marginal distributions as the specified variables. A previous article discusses the mathematics and the geometry of copulas. In SAS, the COPULA procedure enables you to simulate data from copulas. This article shows how to use PROC COPULA to simulate data. The procedure can create graphs that visualize the simulated data and the copula. The main output is a SAS data set that contains the simulated data.

The post Simulate multivariate correlated data by using PROC COPULA in SAS appeared first on The DO Loop.