
July 7, 2021
 

In general, it is hard to simulate multivariate data that has a specified correlation structure. Copulas make that task easier for continuous distributions. A previous article presented the geometry behind a copula and explained copulas in an intuitive way. Although I strongly believe that statistical practitioners should be familiar with statistical theory, you do not need to be an expert in copulas to use them to simulate multivariate correlated data. In SAS, the COPULA procedure in SAS/ETS software provides a syntax for simulating data from copulas. This article shows how to use PROC COPULA to simulate data that has a specified rank correlation structure. (Similar syntax applies to the CCOPULA procedure in SAS Econometrics, which can perform simulations in parallel in SAS Viya.)

Example data

Let's choose some example data to use for the simulation study. Suppose you want to simulate data that have marginal distributions and correlations that are similar to the joint distribution of the MPG_City, Weight, and EngineSize variables in the Sashelp.Cars data set. For ease of discussion, I rename these variables X1, X2, and X3 in this article. The following call to PROC CORR visualizes the univariate and bivariate distributions for these data:

/* for ease of discussion, rename vars to X1, X2, X3 */
data Have;
set Sashelp.Cars(keep= MPG_City Weight EngineSize);
label MPG_City= Weight= EngineSize=;
rename MPG_City=X1 Weight=X2 EngineSize=X3;
run;
 
/* graph original (renamed) data */
ods graphics / width=500px height=500px;
proc corr data=Have Spearman noprob plots=matrix(hist);
   var X1 X2 X3;
   ods select SpearmanCorr MatrixPlot;
run;

The graph shows the marginal and bivariate distributions for these data. The goal of this article is to use PROC COPULA to simulate a random sample that looks similar to these data. The output from PROC CORR includes a table of Spearman correlations (not shown), which are Corr(X1,X2)=-0.865, Corr(X1,X3)=-0.862, and Corr(X2,X3)=0.835.

Using PROC COPULA to simulate data

As explained in my previous article about copulas, the process of simulating data involves several conceptual steps. For each step, PROC COPULA provides a statement or option that performs the step:

  1. The multivariate distribution is defined by specifying a marginal distribution for each variable and a correlation structure for the joint distribution. In my previous article, these were created "out of thin air." However, in practice, you often want to simulate random samples that match the correlation and marginal distributions of real data. In PROC COPULA, you use the DATA= option and the VAR statement to specify real data.
  2. Choose a copula family and use the data to estimate the parameters of the copula. I've only discussed the Gaussian copula. For the Gaussian copula, the sample covariance matrix estimates the parameters for the copula, although PROC COPULA uses maximum likelihood estimates instead of the sample covariance. You use the FIT statement to fit the parameters of the copula from the data.
  3. Simulate from the copula. For the normal copula, this consists of generating multivariate normal data with the given rank correlations. These simulated data are transformed to uniformity by applying the normal CDF to each component. You use the SIMULATE statement to simulate the data. If you want to see the uniform marginals, you can use the OUTUNIFORM= option to store them to a SAS data set, or you can use the PLOTS=(DATATYPE=UNIFORM) option to display a scatter plot matrix of the uniform marginals.
  4. Transform the uniform marginals into the specified distributions by applying the inverse CDF transformation for each component. In PROC COPULA, you use the MARGINALS=EMPIRICAL option on the SIMULATE statement to create the marginal distributions by using the empirical CDF of the variables. You can use the OUT= option to write the simulated values to a SAS data set. You can use the PLOTS=(DATATYPE=ORIGINAL) option to display a scatter plot matrix of the simulated data.

The following call to PROC COPULA carries out these steps:

%let N = 428;            /* sample size */
proc copula data=Have;
   var X1 X2 X3;         /* original data vars */
   fit normal;           /* choose normal copula; estimate covariance by MLE */
   simulate / seed=1234 ndraws=&N
              marginals=empirical    /* transform from copula by using empirical CDF of data */
              out=SimData            /* contains the simulated data */
              plots=(datatype=both); /* optional: scatter plots of copula and simulated data */
              /* optional: use OUTUNIFORM= option to store the copula */
   ods select SpearmanCorrelation MatrixPlotSOrig MatrixPlotSUnif;
run;

The primary result of the PROC COPULA call is an output data set called SimData. The data set contains random simulated data. The Spearman correlation of the simulated data is close to the Spearman correlation of the original data. The marginal distributions of the simulated variables are similar to the marginal distributions of the original variables. The following graph shows the distributions of the simulated data. You can compare this graph to the graph of the original data.
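If you want to verify these statements, a minimal check (assuming the OUT= data set contains variables with the same names as in the VAR statement) is to run the same PROC CORR analysis on the simulated data and compare the output to the output for the original data:

/* compare the rank correlations and marginal distributions of the simulated data to the original data */
proc corr data=SimData Spearman noprob plots=matrix(hist);
   var X1 X2 X3;
   ods select SpearmanCorr MatrixPlot;
run;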

When you use PROC COPULA, it is not necessary to visualize the copula, which encapsulates the correlation between the variables. However, if you want to visualize the copula, you can use the PLOTS= option to create a scatter plot matrix of the uniform marginals, as was done by specifying PLOTS=(DATATYPE=BOTH) in the earlier call.
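Alternatively, you can save the uniform variates and graph them yourself. The following sketch assumes that the OUTUNIFORM= data set stores the uniform variates under the same names as the VAR statement; it re-runs the simulation and uses PROC SGSCATTER to display the copula:

proc copula data=Have;
   var X1 X2 X3;
   fit normal;
   simulate / seed=1234 ndraws=&N marginals=empirical
              outuniform=UnifData;   /* store the uniform variates (the copula draws) */
run;
 
/* scatter plot matrix of the uniform marginals */
proc sgscatter data=UnifData;
   matrix X1 X2 X3 / diagonal=(histogram);
run;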

Use PROC COPULA for Monte Carlo simulations

In Monte Carlo simulations, it is useful to have one SAS data set that contains B simulated samples. You can then use BY-group processing to analyze all simulated samples with a single call to a SAS procedure.

In PROC COPULA, the NDRAWS= option on the SIMULATE statement specifies how many observations to simulate. If your original data contains N observations, you can use NDRAWS=N. To get a total of B simulated samples, you can simulate N*B observations. You can then add an ID variable that identifies which observations belong to each sample, as follows:

/* you can run a Monte Carlo simulation from the simulated data */
%let NumSamples = 1000; /* = B = number of Monte Carlo simulations */
%let NTotal = %eval(&N * &NumSamples);
%put &=NTotal;
 
ods exclude all;
proc copula data=Have;
   var X1 X2 X3;
   fit normal;
   simulate / seed=1234 ndraws=&NTotal marginals=empirical out=SimData;
run;
ods exclude none;
 
/* add ID variable that identifies each set of &N observations */
data MCData;
set SimData;
SampleID = ceil(_N_/&N);  /* 1,1,...,1,2,2,....,2,3,3,... */
run;

You can use your favorite SAS procedure and the BY SAMPLEID statement to analyze each sample in the MCData data set.
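For example, the following sketch computes the Spearman correlations for each simulated sample. (PROC CORR is used only for illustration; substitute whatever analysis your Monte Carlo study requires.)

/* compute statistics for each simulated sample by using BY-group processing */
proc corr data=MCData Spearman noprob noprint outs=SpearmanBySample;
   by SampleID;
   var X1 X2 X3;
run;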

Summary

Given a set of continuous variables, a copula enables you to simulate a random sample from a distribution that has the same rank correlation structure and marginal distributions as the specified variables. A previous article discusses the mathematics and the geometry of copulas. In SAS, the COPULA procedure enables you to simulate data from copulas. This article shows how to use PROC COPULA to simulate data. The procedure can create graphs that visualize the simulated data and the copula. The main output is a SAS data set that contains the simulated data.

The post Simulate multivariate correlated data by using PROC COPULA in SAS appeared first on The DO Loop.

July 5, 2021
 

Do you know what a copula is? It is a popular way to simulate multivariate correlated data. The literature for copulas is mathematically formidable, but this article provides an intuitive introduction to copulas by describing the geometry of the transformations that are involved in the simulation process. Although there are several families of copulas, this article focuses on the Gaussian copula, which is the simplest to understand.

This article shows the geometry of copulas. This article and its example are based on Chapter 9 of Simulating Data with SAS (Wicklin, 2013). A previous article shows the geometry of the Iman-Conover transformation, which is an alternative way to create simulated data that have a specified rank correlation structure and specified marginal distributions.

Simulate data by using a copula

Recall that you can use the CDF function to transform any distribution to the uniform distribution. Similarly, you can use the inverse CDF to transform the uniform distribution to any distribution. To simulate correlated multivariate data from a Gaussian copula, follow these three steps:

  1. Simulate correlated multivariate normal data from a correlation matrix. The marginal distributions are all standard normal.
  2. Use the standard normal CDF to transform the normal marginals to the uniform distribution.
  3. Use inverse CDFs to transform the uniform marginals to whatever distributions you want.

The transformations in the second and third steps are performed on the individual columns of a data matrix. The transformations are monotonic, which means that they do not change the rank correlation between the columns. Thus, the final data have the same rank correlation as the multivariate normal data in the first step.

The next sections explore a two-dimensional example of using a Gaussian copula to simulate correlated data where one variable is Gamma-distributed and the other is Lognormally distributed. The article then provides intuition about what a copula is and how to visualize it. You can download the complete SAS program that generates the example and creates the graphs in this article.

A motivating example

Suppose that you want to simulate data from a bivariate distribution that has the following properties:

  • The rank correlation between the variables is approximately 0.6.
  • The marginal distribution of the first variable, X1, is Gamma(4) with unit scale.
  • The marginal distribution of the second variable, X2, is lognormal with parameters μ=0.5 and σ=0.8. That is, log(X2) ~ N(μ, σ).

The hard part of a multivariate simulation is getting the correlation structure correct, so let's start there. It seems daunting to generate a "Gamma-Lognormal distribution" with a correlation of 0.6, but it is straightforward to generate a bivariate NORMAL distribution with that correlation. Let's do that. Then we'll use a series of transformations to transform the normal marginal variables into the distributions that we want while preserving the rank correlation at each step.

Simulate multivariate normal data

The SAS/IML language supports the RANDNORMAL function, which can generate multivariate normal samples, as shown in the following statements:

proc iml;
N = 1e4;
call randseed(12345);
/* 1. Z ~ MVN(0, Sigma) */
Sigma = {1.0  0.6,
         0.6  1.0};
Z = RandNormal(N, {0,0}, Sigma);          /* Z ~ MVN(0, Sigma) */

The matrix Z contains 10,000 observations drawn from a bivariate normal distribution with correlation coefficient ρ=0.6. Because the mean vector is (0,0) and the covariance parameter is a correlation matrix, the marginal distributions are standard normal. The following graph shows a scatter plot of the bivariate normal data along with histograms for each marginal distribution.

Transform marginal distributions to uniform

The first step is to transform the normal marginals into a uniform distribution by using the probability integral transform (also known as the CDF transformation). The columns of Z are standard normal, so Φ(Z) ~ U(0,1), where Φ is the cumulative distribution function (CDF) of the standard normal distribution. In SAS/IML, the CDF function applies the cumulative distribution function to all elements of a matrix, so the transformation of the columns of Z is a one-liner:

/* 2. transform marginal variables to U(0,1) */
U = cdf("Normal", Z);                     /* U_i are correlated U(0,1) variates */

The columns of U are samples from a standard uniform distribution. However, the columns are not independent. Because the normal CDF is a monotonic transformation, it does not change the rank correlation between the columns. That is, the rank correlation of U is the same as the rank correlation of Z:

/* the rank correlations for Z and U are exactly the same */
rankCorrZ = corr(Z, "Spearman")[2]; 
rankCorrU = corr(U, "Spearman")[2]; 
print rankCorrZ rankCorrU;

The following graph shows a scatter plot of the transformed data along with histograms for each marginal distribution. The histograms show that the columns U1 and U2 are uniformly distributed on [0,1]. However, the scatter plot shows that the joint distribution is correlated.

Transform marginal distributions to any distribution

Now comes the magic. One-dimensional uniform variates are useful because you can transform them into any distribution! How? Just apply the inverse cumulative distribution function of whatever distribution you want. For example, you can obtain gamma variates from the first column of U by applying the inverse gamma CDF. Similarly, you can obtain lognormal variates from the second column of U by applying the inverse lognormal CDF. In SAS, the QUANTILE function applies the inverse CDF, as follows:

/* 3. construct the marginals however you wish */
gamma = quantile("Gamma", U[,1], 4);            /* gamma ~ Gamma(alpha=4)   */
LN    = quantile("LogNormal", U[,2], 0.5, 0.8); /* LN ~ LogNormal(0.5, 0.8) */
X = gamma || LN;
 
/* check that the rank correlation still has not changed */
rankCorrX = corr(X, "Spearman")[2]; 
print rankCorrZ rankCorrX;

The first column of X is gamma-distributed. The second column of X is lognormally distributed. But because the inverse of a continuous CDF is monotonic, the column transformations do not change the rank correlation between the columns. The following graph shows a scatter plot of the newly transformed data along with histograms for each marginal distribution. The histograms show that the columns X1 and X2 are distributed as gamma and lognormal, respectively. The joint distribution is correlated.

What about the Pearson correlation?

For multivariate normal data, the Pearson correlation is close to the rank correlation. However, that is not true for nonnormal distributions. The CDF and inverse-CDF transformations are nonlinear, so the Pearson correlation is not preserved when you transform the marginals. For example, the following statements compute the Pearson correlation for Z (the multivariate normal data) and for X (the gamma-lognormal data):

rhoZ = corr(Z, "Pearson")[2];
rhoX = corr(X, "Pearson")[2];
print rhoZ rhoX;

Chapter 9 of Wicklin (2013) discusses the fact that adjusting the correlation for Z affects the correlation for X in a complicated manner. With a little work, you can choose a correlation for Z that results in a specified correlation for X. For example, if you want the final Pearson correlation to be 0.6, you can use 0.642 as the correlation for Z:

/* re-run the example with new correlation 0.642 */
Sigma = {1.0    0.642,
         0.642  1.0};
newZ = RandNormal(N, {0,0}, Sigma);
newU = cdf("Normal", newZ);           /* columns of U are U(0,1) variates */
gamma = quantile("Gamma", newU[,1], 4);      /* gamma ~ Gamma(alpha=4) */
expo = quantile("Expo", newU[,2]);           /* expo ~ Exp(1)          */
newX = gamma || expo;
 
rhoZ = corr(newZ, "Pearson")[2];
rhoX = corr(newX, "Pearson")[2];
print rhoZ rhoX;

Success! The Pearson correlation for the transformed variables is close to 0.6. In general, this is a complicated process because the value of the initial correlation depends on the target value and on the specific forms of the marginal distributions. Finding the initial value requires finding the root of an equation that involves the CDF and inverse CDF.
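One rough way to automate that search is to treat the Pearson correlation of the transformed data as a function of the correlation parameter in the copula and apply a numerical root-finder. The following SAS/IML sketch uses the FROOT function for the gamma and lognormal marginals from the earlier example; because each evaluation of the objective function is itself a simulation, the result is approximate and depends on the random-number seed and the sample size:

proc iml;
call randseed(123);
N = 1e5;                       /* sample size for each evaluation of the objective */
 
/* Pearson correlation of the transformed (gamma, lognormal) data
   as a function of the correlation parameter in the Gaussian copula */
start PearsonCorr(rho) global(N);
   Sigma = (1 || rho) // (rho || 1);
   Z = RandNormal(N, {0,0}, Sigma);
   U = cdf("Normal", Z);
   X = quantile("Gamma", U[,1], 4) || quantile("LogNormal", U[,2], 0.5, 0.8);
   return corr(X, "Pearson")[2];
finish;
 
start Objective(rho) global(target);
   return PearsonCorr(rho) - target;      /* zero when the target correlation is achieved */
finish;
 
target = 0.6;                              /* desired Pearson correlation for the final data */
rhoZ = froot("Objective", {0.5 0.9});      /* find a root on the interval [0.5, 0.9] */
print rhoZ;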

Higher dimensional data

The example in this article generalizes to higher-dimensional data in a natural way. The SAS program that generates the example in this article also includes a four-dimensional example of simulating correlated data. The program simulates four correlated variables whose marginal distributions are gamma, lognormal, exponential, and inverse Gaussian. The following panel shows the bivariate scatter plots and marginal histograms for this four-dimensional simulated data.

What is a copula?

I've shown many graphs, but what is a copula? The word copula comes from a Latin word (copulare) which means to bind, link, or join. The same Latin root gives us the word "copulate." A mathematical copula is a joint probability distribution that induces a specified correlation structure among independent marginal distributions. Thus, a copula links or joins individual univariate distributions into a joint multivariate distribution that has a specified correlation structure.

Mathematically, a copula is any multivariate cumulative distribution function for which each component variable has a uniform marginal distribution on the interval [0, 1]. That's it. A copula is a cumulative distribution function whose domain is the cube [0,1]^d. So the second graph in this article is the graph most nearly related to the copula for the bivariate data, since it shows the relationship between the uniform marginals.

The scatter plot shows the density, not the cumulative distribution. However, you can use the data to estimate the bivariate CDF and visualize the Gaussian copula for the correlation of 0.6 as a heat map.
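In fact, for the Gaussian copula you do not need to estimate the CDF from the data, because the copula has a closed form: \(C(u,v) = \Phi_2(\Phi^{-1}(u), \Phi^{-1}(v); \rho)\), where \(\Phi_2\) is the bivariate normal CDF with correlation ρ and \(\Phi^{-1}\) is the standard normal quantile function. The following sketch evaluates that formula on a grid by using the PROBBNRM function and displays the result as a heat map:

/* evaluate the Gaussian copula CDF C(u,v) = Phi2(Phi^{-1}(u), Phi^{-1}(v); rho) on a grid */
data CopulaCDF;
rho = 0.6;
do u = 0.02 to 0.98 by 0.02;
   do v = 0.02 to 0.98 by 0.02;
      C = probbnrm( quantile("Normal", u), quantile("Normal", v), rho );
      output;
   end;
end;
run;
 
title "Gaussian Copula CDF, rho = 0.6";
proc sgplot data=CopulaCDF;
   heatmapparm x=u y=v colorresponse=C;
run;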

Notice that this function does not depend on the final marginal distributions. It depends only on the multivariate normal distribution and the CDF transformation that produces the uniform marginals. In that sense, the copula captures only the correlation structure, without regard for the final form of the marginal distributions. You can use this same copula for ANY set of marginal distributions because the copula captures the correlation structure independently of the final transformations of the marginals.

Why are copulas important?

A remarkable theorem (Sklar’s theorem) says that every joint distribution can be written as a copula and a set of marginal distributions. If the marginals are continuous, then the copula is unique. Let that sink in for a second: Every continuous multivariate probability distribution can be expressed in terms of its marginal distributions and a copula. Every one. So, if you can understand copulas you can understand all multivariate distributions. Powerful math, indeed!

For simulation studies, you can use the converse of Sklar's theorem, which is also true. Specifically, if you have a copula (which provides a set of d correlated uniform random variables) and a set of d marginal distribution functions, you can combine them to obtain a d-dimensional probability distribution that has the specified marginals.

Summary

Copulas are mathematically sophisticated. However, you can use copulas to simulate data without needing to understand all the mathematical details. This article presents an example of using a Gaussian copula to simulate multivariate correlated data. It shows the geometry at each step of the three-step process:

  1. Simulate data from a multivariate normal distribution with a known correlation matrix.
  2. Use the normal CDF to transform the marginal distributions to uniform.
  3. Use inverse CDFs to obtain any marginal distributions that you want.

The result is simulated correlated multivariate data that has the same rank correlation as the original simulated data but has arbitrary marginal distributions.

This article uses SAS/IML to show the geometry of the copula transformations. However, SAS/ETS software provides PROC COPULA, which enables you to perform copula modeling and simulation in a single procedure. A subsequent article shows how to use PROC COPULA.

The post An introduction to simulating correlated data by using copulas appeared first on The DO Loop.

June 23, 2021
 

This article uses simulation to demonstrate the fact that any continuous distribution can be transformed into the uniform distribution on (0,1). The function that performs this transformation is a familiar one: it is the cumulative distribution function (CDF). A continuous CDF is defined as an integral, so the transformation is called the probability integral transformation.

This article demonstrates the probability integral transformation and its close relative, the inverse CDF transformation. These transformations are useful in many applications, such as constructing statistical tests and simulating data.

The probability integral transform

Let X be a continuous random variable whose probability density function is f. Then the corresponding cumulative distribution function (CDF) is the integral \(F(x) = \int\nolimits_{-\infty}^{x} f(t)\,dt\). You can prove that the random variable Y = F(X) is uniformly distributed on (0,1). In terms of data, if {X1, X2, ..., Xn} is a random sample from X, the values {F(X1), F(X2), ..., F(Xn)} are a random sample from the uniform distribution.
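The proof of this fact is short when F is continuous and strictly increasing on the support of X: for any y in (0,1), \(P(Y \leq y) = P(F(X) \leq y) = P(X \leq F^{-1}(y)) = F(F^{-1}(y)) = y\), which is the CDF of the uniform distribution on (0,1).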

Let's look at an example. The following SAS DATA step generates a random sample from the Gamma distribution with shape parameter α=4. At the same time, the DATA step computes U=F(X), where F is the CDF of the gamma distribution. A histogram of the U variable shows that it is uniformly distributed:

/* The probability integral transformation: If X is a random 
   variable with CDF F, then Y=F(X) is uniformly distributed on (0,1).
*/
%let N = 10000;
data Gamma4(drop=i);
call streaminit(1234);
do i = 1 to &N;
   x = rand("Gamma", 4);     /* X ~ Gamma(4) */
   u = cdf("Gamma", x, 4);   /* U = F(X) ~ U(0,1)   */
   output;
end;
run;
 
title "U ~ U(0,1)";
proc sgplot data=Gamma4;
   histogram U / binwidth=0.04 binstart=0.02;  /* center first bin at h/2 */
   yaxis grid;
run;

The histogram has some bars that are greater than the average and others that are less than the average. That is expected in a random sample.

One application of the CDF transformation is to test whether a random sample comes from a particular distribution. You can transform the sample by using the CDF function, then test whether the transformed values are distributed uniformly at random. If the transformed variates are uniformly distributed, then the original data was distributed according to the distribution F.
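For example, the U variable in the Gamma4 data set (created earlier) is obtained by applying the gamma CDF to gamma-distributed data, so it should pass a test of uniformity. One way to run such a test is sketched below: because the Beta(1,1) distribution is the same as the U(0,1) distribution, you can fit it in PROC UNIVARIATE and request the goodness-of-fit statistics:

/* test the transformed values for uniformity: Beta(1,1) is the U(0,1) distribution */
proc univariate data=Gamma4;
   var u;
   histogram u / beta(alpha=1 beta=1);
   ods select GoodnessOfFit;
run;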

The inverse probability transformation

It is also useful to run the CDF transform in reverse. That is, start with a random uniform sample and apply the inverse-CDF function (the quantile function) to generate random variates from the specified distribution. This "inverse CDF method" is a standard technique for generating random samples.

The following example shows how it works. The DATA step generates random uniform variates on (0,1). The variates are transformed by using the quantiles of the lognormal(0.5, 0.8) distribution. A call to PROC UNIVARIATE overlays the density curve of the lognormal(0.5, 0.8) distribution on the histogram of the simulated data. The curve matches the data well:

/* inverse probability transformation */
data LN(drop=i);
call streaminit(12345);
do i = 1 to &N;
   u = rand("Uniform");                    /* U ~ U(0,1)   */
   x = quantile("Lognormal", u, 0.5, 0.8); /* X ~ LogNormal(0.5, 0.8) */
   output;
end;
run;
 
proc univariate data=LN;
   var x;
   histogram x / lognormal(scale=0.5 shape=0.8) odstitle="X ~ LogNormal(0.5, 0.8)";
run;

Summary

The probability integral transform (also called the CDF transform) is a way to transform a random sample from any distribution into the uniform distribution on (0,1). The inverse CDF transform transforms uniform data into a specified distribution. These transformations are used in testing distributions and in generating simulated data.

Appendix: Histograms of uniform variables

The observant reader might have noticed that I used the BINWIDTH= and BINSTART= options to set the endpoints of the bins for the histogram of uniform data on (0,1). Why? Because, by default, the binning algorithm centers the bins, which means that the first bin is centered at 0. For data that are bounded on an interval, you can use the BINWIDTH= and BINSTART= options to improve the visualization.

Let's use the default binning algorithm to create a histogram of the uniform data:

title "U ~ U(0,1)";
title2 "34 Bins: First Bin Centered at 0 (Too Short)";
proc sgplot data=Gamma4;
   histogram u;
   yaxis grid;
run;

Notice that the first bin is "too short." The first bin is centered at the location U=0. Because the data cannot be less than 0, the first bin has only half as many counts as the other bins. The last bin is also affected, although the effect is not usually as dramatic.

The histogram that uses the default binning is not wrong, but it doesn't show the uniformity of the data as clearly as a histogram that uses the BINWIDTH= and BINSTART= options to force 0 and 1 to be the endpoints of the bins.

The post The probability integral transform appeared first on The DO Loop.

June 16, 2021
 

A previous article showed how to simulate multivariate correlated data by using the Iman-Conover transformation (Iman and Conover, 1982). The transformation preserves the marginal distributions of the original data but permutes the values (columnwise) to induce a new correlation among the variables.

When I first read about the Iman-Conover transformation, it seemed like magic. But, as with all magic tricks, if you peer behind the curtain and understand how the trick works, you realize that it's math, not magic. This article shows the main mathematical ideas behind the Iman-Conover transformation. For clarity, the article shows how two variables can be transformed to create new variables that have a new correlation but the same marginal distributions.

A series of transformations

The Iman-Conover transformation is actually a sequence of four transformations. Let X be a data matrix (representing continuous data) and let C be a target rank-correlation matrix. The goal of the Iman-Conover transformation is to permute the elements in the columns of X to create a new data matrix (W) whose rank correlation is close to C. Here are the four steps in the Iman-Conover algorithm:

  1. Create normal scores: Compute the van der Waerden normal scores for each column of X. Let S be the matrix of normal scores.
  2. Uncorrelate the scores. Let Z be the matrix of uncorrelated scores. Let Q^(-1) be the transformation that maps S to Z.
  3. Correlate the scores to match the target. Let Y be the correlated scores. Let P be the transformation that maps Z to Y.
  4. Reorder the elements in the i_th column of X to match the order of the elements in the i_th column of Y. The reordered matrix, W, has the same rank correlation as Y.

The following diagram presents the steps of the Iman-Conover transformation. The original data are in the upper-left graph. The final data are in the lower-left graph. Each transformation is explained in a separate section of this article.

Background information

To understand the Iman-Conover transformation, recall the three mathematical facts that make it possible:

  • Monotonic transformations of the columns of a data matrix (such as computing normal scores) do not change the rank correlation between the columns.
  • For multivariate normal data, multiplying by the Cholesky root of a correlation matrix induces that correlation, and multiplying by the inverse of the Cholesky root removes correlation.
  • Any two matrices whose columns have the same ranks have the same rank correlation.

The purpose of this article is to present the main ideas behind the Iman-Conover transformation. This is not a proof, and almost every sentence should include the word "approximately." For example, when I write, "the data are multivariate normal" or "the data has the specified correlation," you should insert the word "approximately" into those sentences. I also blur the distinction between correlation in the population and sample correlations.

Example data

This example uses the EngineSize and MPG_Highway variables from the Sashelp.Cars data set. These variables are not normally distributed. They have a rank correlation of -0.77. The goal of this example is to permute the values to create new variables that have a rank correlation that is close to a target correlation. This example uses +0.6 as the target correlation, but you could choose any other value in the interval (-1, 1).

The following DATA step creates the data for this example. The graph in the upper-left of the previous diagram shows the negative correlation between these variables:

data Orig;
set sashelp.cars(rename=(EngineSize=X MPG_Highway=Y));
keep X Y;
label X= Y=;
run;

Transformation 1: Create normal scores

You can compute the van der Waerden scores for each variable. The van der Waerden scores are approximately normal—or at least "more normal" than the original variables. The following SAS/IML statements create the van der Waerden scores as columns of a matrix, S. Because the normalizing transformation is monotonic, the columns of S have the same rank correlation as the original data. The second graph in the previous diagram shows a scatter plot of the columns of S.

proc iml;
varNames = {"X" "Y"};
use Orig; read all var varNames into X; close;
 
/* T1: Create normal scores of each column 
   Columns of S have exactly the same ranks as columns of X.
   ==> Spearman corr is the same    */
N = nrow(X);
S = J(N, ncol(X));
do i = 1 to ncol(X);
   ranks = ranktie(X[,i], "mean");          /* tied ranks */
   S[,i] = quantile("Normal", ranks/(N+1)); /* van der Waerden scores */
end;
/* verify that rank correlation has not changed */
corrX      = corr(X, "Spearman")[2];   /* rank correlation of the original data */
corrScores = corr(S, "Spearman")[2];   /* rank correlation of the normal scores */
print corrX corrScores;

Transformation 2: Uncorrelate the scores

For many data sets, the van der Waerden scores are approximately multivariate normal. Let CS be the Pearson correlation matrix of S. Let Q be the Cholesky root of CS. The inverse of Q is a transformation that removes any correlations among multivariate normal data. Thus, if you use the inverse of Q to transform the scores, you obtain multivariate data that are uncorrelated. The third graph in the previous diagram shows a scatter plot of the columns of Z.

/* T2: use the inverse Cholesky transform of the scores to uncorrelate */
CS = corr(S);        /* correlation of scores */
Q = root(CS);        /* Cholesky root of correlation of scores */
Z = S*inv(Q);        /* uncorrelated MVN data */
corrZ = corr(Z, "Spearman")[2];
print corrZ;

Transformation 3: Correlate the scores

Let C be the target correlation matrix. Let P be its Cholesky root. The matrix P is a transformation that induces the specified correlations in uncorrelated multivariate normal data. Thus, you obtain data, Y, that are multivariate normal with the given Pearson correlation. In most situations, the Spearman rank correlation is close to the Pearson correlation. The fourth graph in the previous diagram shows a scatter plot of the columns of Y.

/* T3: use the Cholesky transform of the target matrix to induce correlation */
/* define the target correlation */
C = {1    0.6,
     0.6 1  };
P = root(C);         /* Cholesky root of target correlation */
Y = Z*P;
corrY = corr(Y, "Spearman")[2];
print corrY;

Transformation 4: Reorder the values in columns of the original matrix

The data matrix Y has the correct correlations, but the Cholesky transformations of the scores have changed the marginal distributions of the scores. But that's okay: Iman and Conover recognized that any two matrices whose columns have the same ranks also have the same rank correlation. Thus, the last step uses the column ranks of Y to reorder the values in the columns of X.

Iman and Conover (1982) assume that the columns of X do not have any duplicate values. Under that assumption, when you use the ranks of each column of Y to permute the columns of X, the rank correlation of Y and X will be the same. (And, of course, you have not changed the marginal distributions of X.) If X has duplicate values (as in this example), then the rank correlations of Y and X will be close, as shown by the following statements:

/* T4: Permute or reorder data in the columns of X to have the same ranks as Y */
W = X; 
do i = 1 to ncol(Y);
   rank = rank(Y[,i]);          /* use ranks as subscripts, so no tied ranks */
   tmp = W[,i]; call sort(tmp); /* sort column by ranks */
   W[,i] = tmp[rank];           /* reorder the column of X by the ranks of Y */
end;
 
corrW = corr(W, "Spearman")[2];
print corrW;

The rank correlation for the matrix W is very close to the target correlation. Furthermore, the columns have the same values as the original data, and therefore the marginal distributions are the same. The lower-left graph in the previous diagram shows a scatter plot of the columns of W.

The effect of tied values

As mentioned previously, the Iman-Conover transformation was designed for continuous variables that do not have any duplicate values. In practice, the Iman-Conover transformation still works well even if there are some duplicate values in the data. For this example, there are 428 observations. However, the first column of X has only 43 unique values and the second column has only 33 unique values. Because the normal scores are created by using the RANKTIE function, the columns of the S matrix also have 43 and 33 unique values, respectively.

Because the Q^(-1) matrix is upper triangular, the number of unique values might change for the Z matrix. For these data, the columns of Z and Y have 43 and 194 unique values, respectively. These unique values determine the matrix W, which is why the rank correlation for W is slightly different than the rank correlation for Y in this example. If all data values were unique, then W and Y would have exactly the same rank correlation.
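If you want to verify these counts, the following sketch (which assumes that the PROC IML session from the previous sections is still active, so that the matrices X, S, Z, and Y are defined) counts the unique values in each column:

/* count the unique values in each column of X, S, Z, and Y */
labels = {"X" "S" "Z" "Y"};
nUnique = j(4, ncol(X), .);
do i = 1 to ncol(X);
   nUnique[1,i] = ncol(unique(X[,i]));
   nUnique[2,i] = ncol(unique(S[,i]));
   nUnique[3,i] = ncol(unique(Z[,i]));
   nUnique[4,i] = ncol(unique(Y[,i]));
end;
print nUnique[r=labels c=varNames];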

Summary

The Iman-Conover transformation seems magical when you first see it. This article attempts to lift the curtain and reveal the mathematics behind the algorithm. By understanding the geometry of various transformations, you can understand the mathe-magical properties that make the Iman-Conover algorithm an effective tool in multivariate simulation studies.

You can download the complete SAS program that creates all graphs and tables in this article.

The post The geometry of the Iman-Conover transformation appeared first on The DO Loop.

June 14, 2021
 

Simulating univariate data is relatively easy. Simulating multivariate data is much harder. The main difficulty is to generate variables that have given univariate distributions but also are correlated with each other according to a specified correlation matrix. However, Iman and Conover (1982, "A distribution-free approach to inducing rank correlation among input variables") proposed a transformation that approximately induces a specified correlation among component variables without changing the univariate distribution of the components. (The univariate distributions are called the marginal distributions.) The Iman-Conover transformation is designed to transform continuous variables.

This article defines the Iman-Conover transformation. You can use the transformation to generate a random sample that has (approximately) a given correlation structure and known marginal distributions. The Iman-Conover transformation uses rank correlation, rather than Pearson correlation. In practice, the rank correlation and the Pearson correlation are often close to each other.

This article shows how to use the Iman-Conover transformation. A separate article explains how the Iman-Conover transformation works. The example and SAS program in this article are adapted from Chapter 9 of my book Simulating Data with SAS (Wicklin, 2013).

Create simulated data with known distributions

To use the Iman-Conover transformation, you must specify two things: A set of data and a target correlation matrix. The transformation will induce the specified correlation in the data by rearranging (or permuting) the columns of the data. In other words, the algorithm permutes the columns independently to induce the specified correlation without changing the marginal distributions.

This example generates data with four variables whose marginal distributions are normal, lognormal, exponential, and uniform, respectively. It does not matter whether the original data are correlated or not. For simplicity, generate each variable independently, so the initial correlation matrix is approximately the identity matrix. The following SAS DATA step simulates four variables from the specified distributions. The call to PROC CORR computes the rank correlation of the original data. It also creates a scatter plot matrix that shows the marginal distributions of the variable in the diagonal cells and the bivariate joint distributions in the off-diagonal cells.

%let N = 100; /* sample size */
/* independently simulate variables from known distributions */
data SimIndep;
call streaminit(12345);
do i = 1 to &N;
   Normal    = rand("Normal");
   Lognormal = rand("Lognormal");
   Expon     = rand("Expo");
   Uniform   = rand("Uniform");
   output;
end;
drop i;
run;
 
proc corr data=SimIndep Spearman noprob plots=matrix(hist);
   var Normal Lognormal Expon Uniform;
run;

For these simulated data, the table shows that the rank correlation between any pair of variables is small. None of the correlations are statistically different from 0, which you can verify by removing the NOPROB option from the PROC CORR statement.

Implement the Iman-Conover transformation in SAS

The Iman-Conover transformation is actually a sequence of transformations of multivariate data. The following program defines the Iman-Conover transformation as a SAS/IML function. An explanation of the function is provided in a separate article.

/* define and store a SAS/IML function that computes the Iman-Conover transformation */
proc iml;
/* Input: X is a data matrix with k columns
          C is a (k x k) rank correlation matrix
   Output: A matrix W. The i_th column of W is a permutation of the i_th
          columns of X. The rank correlation of W is close to the 
          specified correlation matrix, C.
*/          
start ImanConoverTransform(X, C);
   N = nrow(X);
   S = J(N, ncol(X));
   /* T1: Create normal scores of each column */
   do i = 1 to ncol(X);
      ranks = ranktie(X[,i], "mean");          /* tied ranks */
      S[,i] = quantile("Normal", ranks/(N+1)); /* van der Waerden scores */
   end;
   /* T2: apply two linear transformations to the scores */
   CS = corr(S);        /* correlation of scores */
   Q = root(CS);        /* Cholesky root of correlation of scores */
   P = root(C);         /* Cholesky root of target correlation */
   T = solve(Q,P);      /* same as  T = inv(Q) * P; */
   Y = S*T;             /* transform scores: Y has rank corr close to target C */
 
   /* T3: Permute or reorder data in the columns of X to have the same ranks as Y */
   W = X;
   do i = 1 to ncol(Y);
      rank = rank(Y[,i]);          /* use ranks as subscripts, so no tied ranks */
      tmp = W[,i]; call sort(tmp); /* sort column by ranks */
      W[,i] = tmp[rank];           /* reorder the column of X by the ranks of Y */
   end;
   return( W );
finish;
store module=(ImanConoverTransform);  /* store definition for later use */
quit;

Unlike many algorithms in the field of simulation, the Iman-Conover transformation is deterministic. Given a matrix of data (X) and a correlation matrix (C), the function returns the same transformed data every time you call it.

Use the Iman-Conover transformation to simulate correlated data

The following SAS/IML program shows how to use the Iman-Conover transformation to simulate correlated data. There are three steps:

  1. Read real or simulated data into a matrix, X. The columns of X define the marginal distributions. For this example, we will use the SimIndep data, which contains four variables whose marginal distributions are normal, lognormal, exponential, and uniform, respectively.
  2. Define a "target" correlation matrix. This specifies the rank correlations that you hope to induce. For this example, the target correlation between the Normal and Lognormal variables is 0.75, the target correlation between the Normal and Exponential variables is -0.7, the target correlation between the Normal and Uniform variables is 0, and so forth.
  3. Call the ImanConoverTransform function. This returns a new data matrix, W. The i_th column of W is a permutation of the i_th column of X. The permutations are chosen so that the rank correlation of W is close to the specified correlation matrix:
proc iml;
load module=(ImanConoverTransform);  /* load the function definition */
 
/* Step 1: Read in the data */
varNames = {'Normal' 'Lognormal' 'Expon' 'Uniform'};
use SimIndep; read all var varNames into X; close;
 
/* Step 2: specify target rank correlation */
/*    X1    X2    X3    X4                 */
C = { 1.00  0.75 -0.70  0,           /* X1 */
      0.75  1.00 -0.95  0,           /* X2 */
     -0.70 -0.95  1.00 -0.3,         /* X3 */
      0     0    -0.3   1.0};        /* X4 */
 
/* Step 3: Strategically reorder the columns of X. 
           The new ordering has the specified rank corr (approx)
           and preserves the original marginal distributions. */
W = ImanConoverTransform(X, C);
RankCorr = corr(W, "Spearman");
print RankCorr[format=6.3 r=varNames c=varNames];

As advertised, the rank correlation of the new data matrix, W, is very close to the specified correlation matrix, C. You can write the data matrix to a SAS data set and use PROC CORR to create a scatter plot matrix that has the marginal histograms on the diagonal:

/* write new data to a SAS data set */
newvarNames = {'newNormal' 'newLognormal' 'newExpon' 'newUniform'};
create SimCorr from W[c=newvarNames];  append from W;  close;
quit;
 
proc corr data=SimCorr Spearman noprob plots=matrix(hist);
   var newNormal newLognormal newExpon newUniform;
run;

If you compare the diagonal histograms in this graph to the histograms in the previous graph, you will see that they are identical. They have to be identical because each new variable contains exactly the same data values as the original variables.

At the same time, the bivariate scatter plots show how permuting the data has induced the prescribed correlations between the variables. For this example, the newNormal and newLognormal variables have a strong positive rank correlation (0.73). The newNormal and newExpon variables have a strong negative rank correlation (-0.65). The newNormal and newUniform variables are essentially uncorrelated (-0.03), and so forth.

Summary

The Iman-Conover transformation is a remarkable algorithm that induces a specified correlation structure on a data matrix while preserving the marginal distributions. You start with any data matrix (X) and any target correlation matrix. (The columns of X must represent continuous data.) You apply the Iman-Conover transformation to obtain a new matrix (W). The rank correlation of the new matrix is close to the values you specified. The columns of W have the same values as the columns of X, but the values have been reordered. Thus, the new matrix has the same marginal distributions as the original matrix.

In a follow-up article, I will explain the geometry of the Iman-Conover transformation and how it works.

The post Simulate correlated variables by using the Iman-Conover transformation appeared first on The DO Loop.

May 5, 2021
 

Quick! Which fraction is bigger, 40/83 or 27/56? It's not always easy to mentally compare two fractions to determine which is larger. For this example, you can easily see that both fractions are a little less than 1/2, but to compare the numbers you need to compare the products 40*56 and 27*83. Wouldn't it be great if there were a simpler way to compare fractions, maybe by using addition rather than multiplication?

It turns out that there is a simple equation that is often (but not always!) correct. I learned about this equation from John D. Cook, who blogged about this trick for comparing fractions by using only addition. According to a mathematical result by K. Odani (1998), the rule is correct more than 91% of the time.

Specifically, let N be a large integer. Randomly choose integers a, b, c, and d between 1 and N. Look at the signs of the following differences:

  • X = a*d – b*c
  • Y = (a+d) – (b+c)

Because a, b, c, and d are random, the quantities X and Y are random variables. K. Odani (1998) shows that X and Y have the same sign quite often for large N. As N goes to infinity, the probability that X and Y have the same sign is 11/12 ≈ 0.91667. I will refer to this result as Odani's truism.

Odani's truism provides a "rough and ready" rule for comparing two fractions a/b and c/d. If a/b < c/d, then X < 0, and, asymptotically, the expression (a+d) < (b+c) is true 91% of the time. Or, inverting the logic, you can evaluate (a+d) and (b+c) to determine, with high probability, whether the fraction a/b is greater than or less than the fraction c/d.

For example, again consider the fractions 40/83 and 27/56. It is easy to compute the sums 40+56=96 and 27+83=110. The first sum is less than the second, so Odani's truism suggests that 40/83 is less than 27/56, which is, in fact, true!

However, the summation rule does not always hold. For example, the summation rule for the fractions 13/27 and 40/83 gives the sums 96 and 67, but 13/27 is less than 40/83. If you use Odani's truism, you can never be sure that it holds for the two fractions you are trying to compare!

Simulation of Odani's truism

You can demonstrate Odami's truism by using a simulation. Choose a large number N, and choose integers a, b, c, and d uniformly at random in [1, N]. Then the difference of the products (X = a*d – b*c) and the difference of the sums (Y = (a+d) – (b+c)) should have the same sign about 91% of the time. The following SAS DATA step carries out the simulation for a particular choice of N:

/* Simulation to demonstrate Odani's truism: To determine if 
   a/b < c/d
   it is often sufficient to test whether 
   a+d < b+c
*/
data _null_;
retain count 0;
call streaminit(54321);
N = 1E8;                /* upper bound. Integers in [1,N] */
numSamples = 1E6;       /* number of random draws */
 
do i = 1 to numSamples;
   a = rand('integer', 1, N);
   b = rand('integer', 1, N);
   c = rand('integer', 1, N);
   d = rand('integer', 1, N);
   x = a*d - b*c;               /* correct way to compare a/b and c/d */
   y = (a+d) - (b+c);           /* Odani's truism is correct 11/12 of time */
   count + (sign(x) = sign(y)); /* is truism correct for this 4-tuple (a,b,c,d)? */
end;
prob = count / numSamples;    /* empirical probability for simulation */
put "Odani's truism holds with probability " prob;
run;
Odani's truism holds with probability 0.91648

This simulation randomly generates 1,000,000 integer 4-tuples between 1 and 100 million. For each 4-tuple, the program compares the difference in the product (a*d - b*c) and the difference in the sum ((a+d) - (b+c)). If these quantities have the same signs (positive, negative, or zero), a counter is incremented. The result shows that the quantities have the same sign more than 91% of the time, in accordance with Odani's theorem.

Interesting math versus practical math

As an applied mathematician, I am always curious whether an interesting mathematical result has a practical application. As John D. Cook points out at the end of his blog post, some fractions are easy to mentally compare whereas others are not. I don't need any trick to know that 6/29 is less than 18/35. It is obvious that the first fraction is less than 1/2 (in fact, it is close to 1/5) whereas the second is greater than 1/2. And I certainly don't need a trick to compare 6/29 and 35/18 because the latter fraction is greater than 1. Furthermore, Odani's result includes unreduced fractions such as 2/4, 3/6, and 50/100.

John D. Cook suggests that "it would be interesting to try to assess how often the rule of thumb presented here is correct in practice. You might try to come up with a model for the kinds of fractions people can’t compare instantly, such as proper fractions that have similar size."

That is a great idea. I don't need to use Odani's truism to compare 6/29 and 18/35, yet (6,29,18,35) is one of the 4-tuples for which Odani's truism is true. Similarly, it is true for (6,29,35,18), (1,6,5,8) and (1,10,9,10) and many other 4-tuples that represent fractions that can be easily compared without doing any explicit calculations. If you omit the "obvious" 4-tuples and unreduced fractions, you will reduce the probability that Odani's truism is true on the 4-tuples that remain. Thus, intuitively, Odani's truism should have a lower probability of being true if you restrict it only to reduced fractions that are close to each other. But how low? Is the rule any better than tossing a coin?

In the next blog post, I propose a way to measure how often Odani's truism is useful in practice. I will estimate the probability that Odani's truism holds for reduced fractions in the interval [0,1] that are close to each other.

Summary

Odani's truism is an asymptotic result that states that you can compare the fractions a/b and c/d by looking at the sums (a+d) and (b+c). Specifically, the quantities (a*d - b*c) and ((a+d) - (b+c)) have the same sign with a probability that approaches 11/12 when the numerators and denominators are chosen uniformly at random in the interval [1,N] for very large N.

In the next post, I will discuss whether Odani's truism is useful in practice or whether it is merely a mathematical curiosity.

The post Odani's truism: A probabilistic way to compare fractions appeared first on The DO Loop.

April 7, 2021
 

As mentioned in my article about Monte Carlo estimate of (one-dimensional) integrals, one of the advantages of Monte Carlo integration is that you can perform multivariate integrals on complicated regions. This article demonstrates how to use SAS to obtain a Monte Carlo estimate of a double integral over rectangular and non-rectangular regions. Be aware that a Monte Carlo estimate is often less precise than a numerical integration of the iterated integral.

Monte Carlo estimate of multivariate integrals

Multivariate integrals are notoriously difficult to solve. But if you use Monte Carlo methods, higher-dimensional integrals are only marginally more difficult than one-dimensional integrals. For simplicity, suppose we are interested in the double integral \(\int_D f(x,y) \,dx\,dy\), where D is a region in the plane. The basic steps for estimating a multivariate integral are as follows:

  1. Generate a large random sample of points from the uniform distribution on the domain, D.
  2. Evaluate f at each point (x,y) in the random sample. Let W = f(X,Y) be the vector of values.
  3. Compute Area(D)*mean(W), where Area(D) is the area of D and mean(W) is the arithmetic average of W. If you know the exact area of D, use it. If D is a very complicated shape, you can estimate the area of D by using the Monte Carlo methods in this article.

The quantity Area(D)*mean(W) is a Monte Carlo estimate of \(\int_D f(x,y) \,dx\,dy\). The estimate depends on the size of random sample (larger samples tend to give better estimates) and the particular random variates in the sample.
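The reason the estimate works: if (X,Y) is uniformly distributed on D, then \(E[f(X,Y)] = \frac{1}{\mathrm{Area}(D)} \int_D f(x,y) \,dx\,dy\). By the law of large numbers, mean(W) converges to E[f(X,Y)] as the sample size increases, so Area(D)*mean(W) converges to the value of the integral.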

Monte Carlo estimates of double integrals on rectangular regions

Let's start with the simplest case, which is when the domain of integration, D, is a rectangular region. This section estimates the double integral of f(x,y) = cos(x)*exp(y) over the region D = [0,π/2] x [0,1]. That is, we want to estimate the integral
\(\int\nolimits_0^{\pi/2}\int\nolimits_0^1 \cos(x)\exp(y)\,dy\,dx = e - 1 \approx 1.71828\)
where e is the base of the natural logarithm. For this problem, the area of D is (b-a)*(d-c) = π/2. The graph at the right shows a heat map of the function on the rectangular domain.

The following SAS/IML program defines the integrand and the domain of integration. The program generates 5E6 uniform random variates on the interval [a,b]=[0,π/2] and on the interval [c,d] = [0,1]. The (x,y) pairs are evaluated, and the vector W holds the result. The Monte Carlo estimate is the area of the rectangular domain times the mean of W:

/* Monte Carlo approximation of a double integral over a rectangular region */
proc iml;
/* define integrand: f(x,y) = cos(x)*exp(y) */
start func(x);
   return cos(x[,1]) # exp(x[,2]);
finish;
 
/* Domain of integration: D = [a,b] x [c,d] */
pi = constant('pi');
a = 0; b = pi/2;    /* 0 < x < pi/2 */
c = 0; d = 1;       /* 0 < y < 1    */
 
call randseed(1234);
N = 5E6;
 
X = j(N,2);                        /* X ~ U(D)   */
z = j(N,1);  
call randgen(z, "uniform", a, b);  /* z ~ U(a,b) */
X[,1] = z;
call randgen(z, "uniform", c, d);  /* z ~ U(c,d) */
X[,2] = z;
 
W = func(X);                       /* W = f(X1,X2) */
Area = (b-a)*(d-c);                /* area of rectangular region */
MCEst = Area * mean(W);            /* MC estimate of double integral */
 
/* the double integral is separable; solve exactly */
Exact = (sin(b)-sin(a))*(exp(d)-exp(c));  
Diff = Exact - MCEst;
print Exact MCEst Diff;

The output shows that the Monte Carlo estimate is a good approximation to the exact value of the double integral. For this random sample, the estimate is within 0.0002 units of the true value.

Monte Carlo estimates of double integrals on non-rectangular regions

It is only slightly more difficult to estimate a double integral on a non-rectangular domain, D. It is helpful to split the problem into two subproblems: (1) generate point uniformly in D, and (2) estimate the integral on D.

The next two sections show how to estimate the integral of f(x,y) = exp(-(x^2 + y^2)) over the circle of unit radius centered at the origin: D = {(x,y) | x^2 + y^2 ≤ 1}. The graph at the right shows a heat map of the function on a square. The domain of integration is the unit disk.

By using polar coordinates, you can solve this integral exactly. The double integral has the value 2π*(e – 1)/(2 e) ≈ 1.9858653, where e is the base of the natural logarithm.
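For the record, the polar-coordinate computation is \(\int_D e^{-(x^2+y^2)} \,dA = \int\nolimits_0^{2\pi} \int\nolimits_0^1 e^{-r^2} r \,dr \,d\theta = 2\pi \left[ -\tfrac{1}{2} e^{-r^2} \right]_0^1 = \pi (1 - e^{-1})\), which equals the value quoted above.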

Generate point uniformly in a non-rectangular planar region

For certain shapes (such as disks, triangles, and polygons), you can directly generate a random sample of uniformly distributed points.

In general, for any planar region, you can use the acceptance-rejection technique. I previously showed an example of using Monte Carlo integration to estimate π by estimating the area of a circle.

Because the acceptance-rejection technique is the most general, let's use it to generate a random set of points in the unit disk, D. The steps are as follows:
  1. Construct a bounding rectangle for the region. In this example, you can use R = [-1,1] x [-1,1].
  2. Generate random points uniformly at random in the rectangle.
  3. Keep the points that are in the domain, D. Discard the others.

If you know the area of D, you can use a useful trick to choose the sample size. When you use an acceptance-rejection technique, you do not know in advance how many points will end up in D. However, you can estimate the number as N_D ≈ N_R*Area(D)/Area(R), where N_R is the sample size in R. If you know the area of D, you can invert this formula. If you want approximately N_D points to be in D, generate N_D*Area(R)/Area(D) points in R. For example, suppose you want N_D=5E6 points in the unit disk, D. We can choose N_R ≥ N_D*4/π, as in the following SAS/IML program, which generates random points in the unit disk:

proc iml;
/* (1) Generate approx N_D points in U(D), where D is 
   the unit disk D = { (x,y) | x**2 + y**2 <= 1 } */
N_D = 5E6;        /* we want this many points in D */
a = -1; b = 1;    /* Bounding rectangle, R:  */
c = -1; d = 1;    /* R = [a,b] x [c,d]       */
 
/* generate points inside R. Generate enough points (N_R)
   in R so that approximately N_D are actually in D */
pi = constant('pi');
area_Rect = (b-a)*(d-c); 
area_D    = pi;
N_R = ceil(N_D * area_Rect / area_D);    /* estimate how many points in R we'll need */
 
call randseed(1234);
X = j(N_R,2);
z = j(N_R,1);  
call randgen(z, "uniform", a, b);
X[,1] = z;
call randgen(z, "uniform", c, d);
X[,2] = z;
 
/* which points in the bounding rectangle are in D? */
b = (X[,##] <= 1);             /* x^2+y^2 <= 1 */
X = X[ loc(b),];               /* these points are in D */
print N_D[L="Target N_D" F=comma10.] 
      (nrow(X))[L="Actual N_D" F=comma10.];

The table shows the result. The program generated 6.3 million points in the rectangle, R. Of these, 4.998 million were in the unit disk, D. This is very close to the desired number of points in D, which was 5 million.

Monte Carlo estimate of an integral over a planar region

Each row of the matrix, X, contains a point in the disk, D. The points are a random sample from the uniform distribution on D. Therefore, you can estimate the integral by calculating the mean value of the function on these points:

/* (2) Monte Carlo approximation of a double integral over a non-rectangular domain. 
   Estimate integral of f(x,y) = exp(-(x**2 + y**2))
   over the disk D = { (x,y) | x**2 + y**2 <= 1 }     */
/* integrand: f(x,y) = exp(-(x**2 + y**2)) */
start func(x);
   r2 = x[,1]##2 + x[,2]##2;
   return exp(-r2);
finish;
 
W = func(X);
MCEst = Area_D * mean(W);
 
/* compare the estimate to the exact value of the integral */
e = constant('e');
Exact = 2*pi*(e-1)/(2*e);
Diff = Exact - MCEst;
print Exact MCEst Diff;

The computation shows that a Monte Carlo estimate of the integral over the unit disk is very close to the exact value. For this random sample, the estimate is within 0.00006 units of the true value.

Summary

This article shows how to use SAS to compute Monte Carlo estimates of double integrals on a planar region. The first example shows a Monte Carlo integration over a rectangular domain. The second example shows a Monte Carlo integration over a non-rectangular domain. For a non-rectangular domain, the integration requires two steps: first, use an acceptance-rejection technique to generate points uniformly at random in the domain; then use Monte Carlo to estimate the integral. The technique in this article generalizes to integration on higher-dimensional domains.
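As a brief illustration of that generalization, here is a minimal sketch (it does not appear in the programs above) that applies the same two steps to the triple integral of exp(-(x² + y² + z²)) over the unit ball B = { (x,y,z) | x² + y² + z² ≤ 1 }. The comparison value comes from evaluating the integral in spherical coordinates.

proc iml;
/* Sketch: Monte Carlo estimate of a triple integral over the unit ball.
   (1) Use acceptance-rejection in the bounding cube [-1,1]^3.
   (2) Estimate the integral as Volume(B) times the mean of the integrand. */
call randseed(54321);
pi = constant('pi');
N_B = 1E6;                                 /* target number of points in the ball */
vol_Cube = 8;                              /* volume of [-1,1]^3 */
vol_Ball = 4*pi/3;                         /* volume of the unit ball */
N_C = ceil(N_B * vol_Cube / vol_Ball);     /* how many points to generate in the cube */
 
X = randfun(N_C // 3, "Uniform", -1, 1);   /* uniform points in the cube */
inBall = ( X[,##] <= 1 );                  /* keep points with x^2+y^2+z^2 <= 1 */
X = X[ loc(inBall), ];
 
f = exp( -X[,##] );                        /* integrand evaluated at the accepted points */
MCEst = vol_Ball * mean(f);
Exact = pi##1.5 * erf(1) - 2*pi / exp(1);  /* exact value from spherical coordinates */
print Exact MCEst (Exact - MCEst)[L="Diff"];
QUIT;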

The post Double integrals by using Monte Carlo methods appeared first on The DO Loop.

April 5, 2021
 

A previous article shows how to use Monte Carlo simulation to estimate a one-dimensional integral on a finite interval. A larger random sample will (on average) result in an estimate that is closer to the true value of the integral than a smaller sample. This article shows how you can determine a sample size so that the Monte Carlo estimate is within a specified distance from the true value, with high probability.

This article is inspired by and uses some of the notation from Neal, D. (1993) "Determining Sample Sizes for Monte Carlo Integration," The College Mathematics Journal.

Ensure the estimate is (probably) close to the true value

As shown in the previous article, a Monte Carlo estimate of an integral of g(x) on the interval (a,b) requires three steps:

  1. Generate a uniform random sample of size n on (a,b). Let X be the vector of n uniform random variates.
  2. Use the function g to evaluate each point in the sample. Let Y = g(X) be the vector of transformed variates.
  3. An estimate of the integral \(\int\nolimits_{a}^{b} g(x) \, dx\) is (b - a)*mean(Y), where mean(Y) is the arithmetic average of Y, as shown in the previous article.

The goal of this article is to choose n large enough so that, with probability at least β, the Monte Carlo estimate of the integral is within δ of the true value. The mathematical derivation is at the end of this article. The result (Neal, 1993, p. 257) is that you should choose
\(n > \left( \Phi^{-1}\left( \frac{\beta+1}{2} \right) \frac{(b-a)s_{Y}}{\delta} \right)^2\)
where \(\Phi^{-1}\) is the quantile function of the standard normal distribution, and \(s_{Y}\) is an estimate of the standard deviation of Y.

Applying the formula to a Monte Carlo estimate of an integral

Let's apply the formula to see how large a sample size we need to estimate the integral \(\int\nolimits_{a}^{b} g(x) \, dx\) to three decimal places (δ=5E-4) for the function \(g(x) = x^{\alpha - 1} \exp(-x)\), where α=4 and the interval of integration is (1, 3.5). The function is shown at the top of this article.

The estimate requires knowing an estimate for the standard deviation of Y=g(X), where X ~ U(a,b). You can use a small "pilot study" to obtain the standard deviation, as shown in the following SAS/IML program. You could also use the DATA step and PROC MEANS for this computation.

%let alpha = 4;    /* shape parameter */
%let a = 1;        /* lower limit of integration */
%let b = 3.5;      /* upper limit of integration */
 
proc iml;
/* define the integrand */
start Func(x) global(shape);
   return x##(shape - 1) # exp(-x);
finish;
 
call randseed(1234);
shape = &alpha; a = &a; b = &b;
 
/* Small "pilot study" to estimate s_Y = stddev(Y) */
N = 1e4;                          /* small sample */
X = j(N,1);
call randgen(x, "Uniform", a, b); /* X ~ U(a,b) */
Y = func(X);                      /* Y = f(X)   */
s = std(Y);                       /* estimate of std dev(Y) */
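If you prefer Base SAS, a minimal sketch of the same pilot study with the DATA step and PROC MEANS (run as a separate step, not inside the PROC IML session) might look like the following; the Std Dev statistic plays the role of s_Y:

data Pilot;
call streaminit(1234);
do i = 1 to 1e4;                          /* small pilot sample */
   x = &a + (&b - &a)*rand("Uniform");    /* X ~ U(a,b) */
   y = x**(&alpha - 1) * exp(-x);         /* Y = g(X) */
   output;
end;
keep x y;
run;
 
proc means data=Pilot N Mean Std;
   var y;
run;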

For this problem, the estimate of the standard deviation of Y is about 0.3. For β=0.95, the value of \(\Phi^{-1}((\beta+1)/2) \approx 1.96\), but the following program implements the general formula for any probability, β:

/* find n so that MC est is within delta of true value with 95% prob */
beta  = 0.95;
delta = 5e-4;      
k = quantile("Normal", (beta+1)/2) * (b-a) * s;
sqrtN = k / delta;
roundN = round(sqrtN**2, 1000);       /* round to nearest 1000 */
print beta delta roundN[F=comma10.];

With 95% probability, if you use a sample size of n=8,765,000, the Monte Carlo estimate will be within δ=0.0005 units of the true value of the integral. Wow! That's a larger value than I expected!

I used n=1E6 in the previous article and reported that the difference between the Monte Carlo approximation and the true value of the integral was less than 0.0002. So it is possible to be close to the true value by using a smaller value of n. In fact, the graph to the right (from the previous article) shows that for n=400,000 and n=750,000, the Monte Carlo estimate is very close for the specific random number seed that I used. But the formula in this section provides the sample size that you need to (probably!) be within ±0.0005 REGARDLESS of the random number seed.

The distribution of Monte Carlo estimates

I like to use SAS to check my math. The Monte Carlo estimate of the integral of g will vary according to the random sample. The math in the previous section states that if you generate k random samples of size n=8,765,000, that (on average) about 0.95*k of the sample will be within δ=0.0005 units of the true value of the integral, which is 2.666275. The following SAS/IML program generates k=200 random samples of size n and k estimates of the integral. We expect about 190 estimates to be within δ units of the true value and only about 10 to be farther away:

N = roundN;
X = j(N,1);
 
k = 200;
Est = j(k,1);
do i = 1 to k;
   call randgen(X, "Uniform", a, b);     /* x[i] ~ U(a,b) */
   Y = func(X);                          /* Y = f(X)   */
   f_avg = mean(Y);                      /* estimate E(Y) */
   Est[i] = (b-a)*f_avg;                 /* estimate integral */
end;
 
call quad(Exact, "Func", a||b);          /* find the "exact" value of the integral */
Diff = Est - Exact;
title "Difference Between Monte Carlo Estimate and True Value";
title2 "k=200 Estimates; n=8,765,000";
call Histogram(Diff) other="refline -0.0005 0.0005/axis=x;";

The histogram shows the difference between the exact integral and the estimates. The vertical reference lines are at ±0.0005. As predicted, most of the estimates are less than 0.0005 units from the exact value. What percentage? Let's find out:

/* how many estimates are within and NOT within delta? */
Close    = ncol(loc(abs(Diff)<=delta));    /* number that are closer than delta to true value */
NotClose = ncol(loc(abs(Diff)> delta));    /* number that are farther than delta */
PctClose = Close / k;                      /* percent close to true value */
PctNotClose = NotClose / k;                /* percent not close */
print k Close NotClose PctClose PctNotClose;

For this set of 200 random samples, exactly 95% of the estimates were accurate to within 0.0005. Usually, a simulation of 200 estimates shows that between 93% and 97% of the estimates are close. In this case, the proportion was exactly 95%, but don't expect that to happen every time!
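A quick back-of-the-envelope check on that range: the number of "close" estimates is approximately binomial with k=200 trials and success probability near 0.95, so the standard deviation of the observed proportion is about
\(\sqrt{p(1-p)/k} = \sqrt{0.95 (0.05) / 200} \approx 0.015\)
Consequently, the observed percentage usually falls within a few percentage points of 95%.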

Summary

This article shows how you can use elementary statistics to find a sample size that is large enough so that a Monte Carlo estimate is (probably!) within a certain distance of the exact value. The article was inspired by an article by Neal, D. (1993). The mathematical derivation of the result is provided in the next section.

Derivation of the sample size formula

This section derives the formula for choosing the sample size, n. Because the Monte Carlo estimate is a mean, you can use elementary probability theory to find n. Let {x1, x2, ..., xn} be independent uniform random variates on (a, b). Let Y be the vector {g(x1), g(x2), ..., g(xn)}. Let \(m_{n} = \frac{1}{n} \sum\nolimits_{i = 1}^{n} g(x_{i})\) be the mean of Y and let \(\mu = \frac{1}{b-a} \int\nolimits_{a}^{b} g(x) \, dx\) be the average value of g on the interval (a,b). By the law of large numbers, \(m_n \to \mu\) as \(n \to \infty\). For any probability 0 < β < 1, we want to find n large enough so that
\(P\left( | m_n - \mu | < \frac{\delta}{(b-a)} \right) \geq \beta\)
From the central limit theorem, we can substitute the standard normal random variable, Z, inside the parentheses by dividing by the standard error of the sample mean, which is σY/sqrt(n):
\(P\left( | Z | < \frac{\delta \sqrt{n}}{(b-a)\sigma_{Y}} \right) \geq \beta\)
Equivalently,
\(2 P\left( Z < \frac{\delta \sqrt{n}}{(b-a)\sigma_{Y}} \right) - 1 \geq \beta\)

Solving the equation for sqrt(n) gives
\(\sqrt{n} > \Phi^{-1}\left( \frac{\beta+1}{2} \right) \frac{(b-a)\sigma_{Y}}{\delta}\)
Squaring both sides leads to the desired bound on n. Because we do not know the true standard deviation of Y, we substitute a sample statistic, sY.

The post Sample size for the Monte Carlo estimate of an integral appeared first on The DO Loop.

March 31, 2021
 

Numerical integration is important in many areas of applied mathematics and statistics. For one-dimensional integrals on the interval (a, b), SAS software provides several tools for numerical integration, including the QUAD subroutine in SAS/IML software.

This article shows another way to estimate an integral in SAS: Monte Carlo simulation.

How to use Monte Carlo simulation to estimate an integral

I previously showed an example of using Monte Carlo simulation to estimate the value of pi (π) by using the "average value method." This section presents the mathematics behind the Monte Carlo estimate of the integral.

The Monte Carlo technique takes advantage of a theorem in probability that is whimsically called the Law of the Unconscious Statistician. The theorem is basically the chain rule for integrals. If X is a continuous random variable and Y = g(X) is a random variable that is created by a continuous transformation (g) of X, then the expected value of Y is given by the following integral:
\(E[Y] = E[g(X)] = \int g(x) f_{X}(x) \,dx\)
where \(f_{X}\) is the probability density function for X.

To apply the theorem, choose X to be a uniform random variable on the interval (a,b). The density function of X is therefore \(f_{X}(x) = 1/(b-a)\) if x is in (a,b) and 0 otherwise. Rearranging the equation gives
\(\int_{a}^{b} g(x) \,dx = (b - a) \cdot E[g(X)]\)

Consequently, to estimate the integral of a continuous function g on the interval (a,b), you need to estimate the expected value E[g(X)], where X ~ U(a,b). To do this, generate a uniform random sample in (a,b), evaluate g on each point in the sample, and take the arithmetic mean of those values. In symbols,
\(\int_{a}^{b} g(x) \,dx \approx (b - a) \cdot \frac{1}{n} \sum\nolimits_{i = 1}^{n} g(x_{i})\)
where the \(x_i\) are independent random uniform variates on (a,b).

Estimate an integral in SAS by using Monte Carlo simulation

Suppose you want to estimate the integral of \(g(x) = x^{\alpha - 1} \exp(-x)\) on the interval (a,b) = (1, 3.5). The figure at the top of this article shows the graph of g for α=4 and the area under the curve on the interval (1, 3.5).

As mentioned earlier, an accurate way to numerically integrate the function is to use the QUAD subroutine in SAS/IML, as follows:

%let alpha = 4;    /* shape parameter */
%let a = 1;        /* lower limit of integration */
%let b = 3.5;      /* upper limit of integration */
 
proc iml;
/* define the integrand */
start Func(x) global(shape);
   return x##(shape - 1) # exp(-x);
finish;
 
/* call the QUAD subroutine to perform numerical integration */
shape = &alpha; a= &a; b = &b;
call quad(AreaQuad, "Func", a||b);
print AreaQuad;

The QUAD subroutine approximates the area under the curve as 2.6663. This is a very accurate result. When you compute a Monte Carlo estimate, the estimate will depend on the size of the random sample that you use and the random number seed. The following SAS/IML program samples one million random variates from the uniform distribution on [1, 3.5]. The vector Y contains the transformed points (Y=g(X)). The call to the MEAN function estimates the expected value of g(x) when 1 < x < 3.5. If you multiply the mean by (3.5 - 1) = 2.5, you obtain an estimate of the integral, as follows:

/* Monte Carlo estimation of the integral */
call randseed(1234);
N = 1e6;                              /* sample size for MC estimate */
X = j(N,1);                           /* allocate vector of size N */
call randgen(X, "Uniform", a, b);     /* x[i] ~ U(a,b) */
Y = func(X);                          /* Y = f(X)   */
f_avg = mean(Y);                      /* estimate E(Y) */
AreaMC = (b-a)*f_avg;                 /* estimate integral on (a,b) */
Diff = abs(AreaMC - AreaQuad);        /* how close is the estimate to the true value? */
print AreaQuad AreaMC Diff;

Advantages and drawbacks of a Monte Carlo estimate

The main advantage of a Monte Carlo estimate is its simplicity: sample, evaluate, average. The same technique works for any function over any finite interval of integration. Although I do not demonstrate it here, a second advantage is that you can extend the idea to estimate higher-dimensional integrals.

The main disadvantage is that you do not obtain a definitive answer. You obtain a statistical estimate that is based on a random sample. I won't repeat the arguments from my article about Monte Carlo estimates, but be aware of the following facts about a Monte Carlo estimate of an integral:

  • Unless you use a huge sample size, the Monte Carlo estimate is usually less accurate than numerical integration.
  • A Monte Carlo estimate (like all statistics) has a distribution. If you change the random number seed or you change the algorithm that you use to generate random uniform variates, you will get a different estimate. Thus, you cannot talk about the Monte Carlo estimate, but only about a Monte Carlo estimate. The best you can claim is that the true value of the integral is within a certain distance from your estimate (with a specified probability).

How does the accuracy of the estimate depend on the sample size?

You can extend the previous program to demonstrate the accuracy of the Monte Carlo estimate as a function of the sample size. How would the estimate change if you use only 50,000 or 100,000 random variates? The following program creates a graph that shows the estimate of the integral when you use only the first k elements in the sequence of random variates:

/* how does the estimate depend on the sample size? */
size = do(1E4, N, 1E4);
MCest = j(1,ncol(size),.);
do i = 1 to ncol(size);
   Z = X[1:size[i]];                  /* sample size k=size[i] */
   Y = func(Z);                       /* Y = f(Z)       */
   MCEst[i] = (b-a)*mean(Y);          /* estimate integral */
end;
 
title "Monte Carlo Estimates as a Function of Sample Size";
stmt = "refline " + char(AreaQuad,6,4) + "/ axis=y label; format size comma10.;";
call scatter(size, MCEst) other=stmt label={"Sample Size" "Estimate of Area"};

The markers in the scatter plot show the estimates for the integral when only the first k random variates are used. The horizontal line shows the exact value of the integral. For this random-number seed, the estimate stays close to the exact value when the sample size is more than 400,000. If you use a different random number seed, you will get a different graph. However, this graph is fairly typical. The Monte Carlo estimates can either overestimate or underestimate the integral.

The takeaway lesson is that the estimate converges to the exact value, but not very quickly. From general theory, we know that the standard error of the mean of n numbers is SD(Y)/sqrt(n), where SD(Y) is the standard deviation of Y for Y=g(X). That means that a 95% confidence interval for the mean has a radius 1.96*SD(Y)/sqrt(n). You can use this fact to choose a sample size so that the Monte Carlo estimate is (probably) close to the exact value of the integral. I will discuss this fact in a second article.
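For example, a minimal sketch of that computation (continuing the PROC IML session above and using the full sample of n=1E6 variates) is the following:

/* Sketch: radius of a 95% confidence interval for the Monte Carlo estimate */
Y = func(X);                             /* Y = g(X) for the full sample */
radiusMean = 1.96 * std(Y) / sqrt(N);    /* CI radius for mean(Y) */
radiusArea = (b - a) * radiusMean;       /* CI radius for the integral estimate */
print radiusMean radiusArea;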

Summary

This article shows how to use Monte Carlo simulation to estimate a one-dimensional integral. Although Monte Carlo simulation is less accurate than other numerical integration methods, it is simple to implement and it readily generalizes to higher-dimensional integrals.

The post Estimate an integral by using Monte Carlo simulation appeared first on The DO Loop.

February 3, 2021
 

In a previous article, I showed how to generate random points uniformly inside a d-dimensional sphere. In that article, I stated the following fact:

If Y is drawn from the uncorrelated multivariate normal distribution, then S = Y / ||Y|| has the uniform distribution on the unit sphere.

I was reminded of this fact recently when I wrote about how to simulate a random walk in high dimensions. Each step of a random walk needs to generate a random direction and a random step length. For a 2-D random walk, it is easy to generate a random direction as an angle, θ, chosen from the uniform distribution on the interval [0,2π). However, for higher-dimensional random walks, I do not suggest using random angles to determine directions. Although it is mathematically possible to use spherical coordinates in d-dimensions, computations in a spherical coordinate system are extremely messy.

A better alternative is to use random unit vectors to determine a direction for each step in a random walk. (In fact, a unit vector is sometimes called a direction vector.) Since a unit vector is a vector from the origin to a point on the unit sphere, this article shows how to generate random points uniformly on the unit sphere. I show the computation twice: once by using the SAS/IML matrix language and again by using the SAS DATA step. In order to visualize the data, I show how to create a contour plot of ungridded data in SAS.

Random points on a circle

For simplicity, let's see how to generate random points on the unit circle in 2-D. The following SAS/IML program simulates N=1000 points from d=2 independent normal distributions. The notation X[,##] is a subscript reduction operator that returns a column vector that contains the sum of the squared elements of each row of X. Thus the expression sqrt(X[,##]) gives the Euclidean distance of each row to the origin. If you divide the coordinates by the distance, you obtain a point on the unit circle:

%let N = 1000;         /* number of steps in random walk */
%let d = 2;            /* S^{d-1} sphere embedded in R^d */
%let r = 1;            /* radius of sphere */
 
proc iml;
call randseed(12345);
r = &r; d = &d;
x = randfun(&N // d, "Normal");   /* N points from MVN(0, I(d)) */
norm = sqrt( x[,##] );            /* Euclidean distance to origin */
x = r * x / norm;                 /* scale point so that distance to origin is r */
title "&n Random Points on Circle";
call scatter(x[,1], x[,2]) grid={x y}
             procopt="aspect=1" option="transparency=0.8";
QUIT;

As promised, the resulting points are on the circle of radius r=1. Recall that uniformly at random does not mean evenly spaced. Point patterns that are generated by a uniform process will typically have gaps and clusters.

Random points on a sphere

You can also use the SAS DATA step to generate the random points on a sphere. You can generate each coordinate independently from a normal distribution and use the EUCLID function in Base SAS to compute the Euclidean distance from a point to the origin. You can then scale the coordinates so that the point is on the sphere of radius r. The following DATA step generates d-dimensional points for any d:

%let N = 2000;         /* number of steps in random walk */
%let d = 3;            /* S^{d-1} sphere embedded in R^d */
%let r = 1;            /* radius of sphere */
data RandSphere;
array x[&d];
call streaminit(12345);
do i = 1 to &N;
   do j = 1 to &d;
      x[j] = rand("Normal");      /* random point from MVN(0, I(d)) */
   end;
   norm = Euclid( of x[*] );      /* Euclidean distance to origin */
   do j = 1 to &d;
      x[j] = &r * x[j] / norm;    /* scale point so that distance to origin is r */
   end;
   output;
end;
drop j norm;
run;

How can we visualize the points to assure ourselves that the program is running correctly? One way is to plot the first two coordinates and use colors to represent the value of the points in the third coordinate direction. For example, the following call to PROC SGPLOT creates a scatter plot of (x1,x2) and uses the COLORRESPONSE=x3 option to color the markers. The default color ramp is the ThreeAltColor ramp, which (for my ODS style) is a blue-black-red color ramp. That means that points near the "south pole" (x3 = -1) will be colored blue, points near the equator will be colored black, and points near the "north pole" (x3 = 1) will be colored red. I have also used the TRANSPARENCY= option so that the density of the points is shown.

title "&n Random Points on a Sphere";
proc sgplot data=RandSphere aspect=1;
   scatter x=x1 y=x2 / colorresponse=x3 transparency=0.5 markerattrs=(symbol=CircleFilled);
   xaxis grid; yaxis grid;
run;

Switching to (x,y,z) notation instead of (x1,x2,x3), the graph shows that observations near the (x,y)-plane (for which x² + y² ≈ 1) have a dark color because the z-value is close to zero. In contrast, observations for which x² + y² ≈ 0 have either a pure blue or a pure red color because those observations are near one of the poles.

A contour plot of ungridded data

Another way to visualize the data would be to construct a contour plot of the points in the upper hemisphere of the sphere. The exact contours of the upper hemisphere are concentric circles (lines of constant latitude) centered at (x,y)=(0,0). For these simulated data, a contour plot should display contours that are approximately concentric circles.
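To see why, recall that the upper hemisphere is the graph of
\(z = \sqrt{1 - x^2 - y^2}\)
so the level set z = c (for 0 ≤ c ≤ 1) is the circle x² + y² = 1 − c², which has radius sqrt(1 − c²).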

In a previous article, I showed how to use PROC TEMPLATE and PROC SGRENDER in SAS to create a contour plot. However, in that article the data were aligned on an evenly spaced grid in the (x,y) plane. The data in this article have random positions. Consequently, on the CONTOURPLOTPARM statement in PROC TEMPLATE, you must use the GRIDDED=FALSE option so that the plot interpolates the data onto a rectangular grid. The following Graph Template Language (GTL) defines a contour plot for irregularly spaced data and creates the contour plot for the random point on the upper hemisphere:

/* Set GRIDDED=FALSE if the points are not arranged on a regular grid */
proc template;
define statgraph ContourPlotParmNoGrid;
dynamic _X _Y _Z _TITLE;
begingraph;
   entrytitle _TITLE;
   layout overlay;
      contourplotparm x=_X y=_Y z=_Z / GRIDDED=FALSE /* for irregular data */
        contourtype=fill nhint=12  colormodel=twocolorramp name="Contour";
      continuouslegend "Contour" / title=_Z;
   endlayout;
endgraph;
end;
run;
 
ods graphics / width=480px height=480px;
proc sgrender data=RandSphere template=ContourPlotParmNoGrid;
where x3 >= 0;
dynamic _TITLE="Contour Plot of &n Random Points on Upper Hemisphere"
        _X="x1" _Y="x2" _Z="x3";
run;

The data are first interpolated onto an evenly spaced grid and then the contours are estimated. The contours are approximately concentric circles. From the contour plot, you can conclude that the data form a dome or hemisphere above the (x,y) plane.

Summary

This article shows how to use SAS to generate random points on the sphere in any dimension. You can use this technique to generate random directions for a random walk. In addition, the article shows how to use the GRIDDED=FALSE option on the CONTOURPLOTPARM statement in GTL to create a contour plot from irregularly spaced ("ungridded") data in SAS.

The post Generate random points on a sphere appeared first on The DO Loop.