
September 9, 2021
 

A statistical programmer asked how to simulate event-trials data for groups. The subjects in each group have a different probability of experiencing the event. This article describes one way to simulate this scenario. The simulation is similar to simulating from a mixture distribution. This article also shows three different ways to visualize the results of the simulation.

Modeling the data-generation process

A simulation is supposed to reproduce a data-generation process. Typically, a simulation is based on some real data. You (the programmer) need to ensure that the simulation reflects how the real data are generated. For example, there are often differences between simulating a designed experiment and simulating an observational study:

  • In a designed experiment, the number of subjects in each group is determined by the researcher. For example, a researcher might choose 100 subjects, 50 men and 50 women. If so, each simulated data sample should also contain 50 males and 50 females.
  • In an observational study, the number of subjects in each group is a random variable that is based on the proportion of the population in each group. For example, a researcher might choose 100 subjects at random. In the data, there might be 53 men and 47 women, but that split is the result of random chance. Each simulated data sample should generate the number of males and females according to a random binomial variable. Some samples might have a 52:48 split, others a 49:51 split, and so on. In general, the group sizes are simulated from a multinomial distribution.
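
For a concrete sketch of the observational-study case, the following DATA step draws a random group size for 100 subjects. This is only an illustration: the GroupSizes data set name and the 53% population proportion of men are made-up values, and with only two groups the multinomial distribution reduces to a binomial distribution.

/* sketch: random group sizes for an observational study with two groups */
data GroupSizes;
call streaminit(123);
nMen   = rand("Binomial", 0.53, 100); /* number of men among 100 randomly chosen subjects */
nWomen = 100 - nMen;                  /* the remaining subjects are women */
run;
 
proc print data=GroupSizes noobs; run;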

For other ways to choose the number of subjects in categories, see the article, "Simulate contingency tables with fixed row and column sums in SAS."

Data that specify events and trials

Suppose you have the following data from a pilot observational study. Subjects are classified into six groups based on two factors. One factor has three levels (for example, political party) and another has two levels (for example, sex). For each group, you know the number of subjects (Ni) and the number of events (Ei). You can therefore estimate the proportion of events in the i_th group as Ei / Ni. Because you know the total number of subjects (Sum(N)), you can also estimate the probability that an individual belongs to each group (πi = Ni / Sum(N)).

The following SAS DATA step defines the proportions for each group:

%let NTotal = 107;
data Events;
length Group $ 3;
input Cat1 $ Cat2 $ Events N;
Group = catx(" ", Cat1, Cat2); /* combine two factors into group ID */
p  = Events / N;               /* estimate P(event | Group[i]) */
pi = N / &NTotal;              /* estimate P(subject in Group[i]) */
drop Cat1 Cat2;
format p pi Best5.;
datalines;
A  F  1  10
A  M  8  24
B  F  4  13
B  M  12 36
C  F  12 16
C  M  6  8
;
 
proc print data=Events; run;

This pilot study has 107 subjects. The first group ("A F") contained 10 subjects, which is π1 = 9.3% of the total. In the first group, one person experienced the event, which is p1 = 10% of the group. The other rows of the table are similar. Some people call this "event-trials" data because the quantity of interest is the proportion of events that occurred in the subjects. The next section shows how to use these estimates to simulate new data samples.

Simulate group sizes and proportions in groups

One of the advantages of simulation studies is that you can take preliminary data from a pilot study and "scale it up" to create larger samples, assuming that the statistics from the pilot study are representative of the parameters in the population. For example, the pilot study involved 107 subjects, but you can easily simulate samples of size 250 or 1000 or more.

The following SAS/IML program simulates samples of size 250. The statistics from the pilot study are used as parameters in the simulation. For example, the simulation assumes that 0.093 of the population belongs to the first group, and that 10% of the subjects in the first group experience the event of interest. Notice that some groups will have a small number of subjects (such as Group 1 and Group 6, which have small values for π). Among the groups, some will have a small proportion of events (Group 1) whereas others will have a large proportion (Group 5 and Group 6).

The simulation uses the following features:

  • The RandMultinomial function in SAS/IML generates a random set of sample sizes based on the π vector, which estimates the proportion of the population that belongs to each group.
  • For each group, you can use the binomial distribution to randomly generate the number of events in the group.
  • The proportion of events is the ratio (Number of Events) / (Group Size).

/* simulate proportions of events in groups */
proc iml;
use Events;
read all var _ALL_;  /* creates vectors Group, Events, N, p, pi, etc. */
close;
numGroups = nrow(Group);
 
call randseed(12345);
nTrials = 250;       /* simulate data for a study that contains 250 subjects */
 
Size = T( RandMultinomial(1, nTrials, pi) ); /* new random group sizes */
nEvents = j(numGroups, 1, .);                /* vector for number of events */
do i = 1 to nrow(nEvents);
   nEvents[i] = randfun(1, "Binomial", p[i], Size[i]);
end;
Proportion = nEvents / Size;                 /* proportion of events in each group */
 
results = Size || nEvents || Proportion;
print results[r=Group c={'Size' 'nEvents' 'Proportion'} F=Best5.];

The table shows one random sample of size 250 based on the statistics in the smaller pilot study. The group size and the number of events are both random variables. If you rerun this simulation, the number of subjects in the groups will change, as will the proportion of subjects in each group that experience the event. It would be trivial to adapt the program to handle a designed experiment in which the Size vector is constant.
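
For example, here is a minimal sketch of the designed-experiment variant, run in the same PROC IML session. The proportional allocation of the 250 subjects is an assumption (and, because of rounding, the fixed sizes might not sum to exactly 250).

/* sketch: designed experiment with fixed group sizes (no multinomial draw) */
Size = round(nTrials * pi);                  /* fixed, proportional allocation */
do i = 1 to nrow(nEvents);
   nEvents[i] = randfun(1, "Binomial", p[i], Size[i]);
end;
Proportion = nEvents / Size;
print (Size || nEvents || Proportion)[r=Group c={'Size' 'nEvents' 'Proportion'} F=Best5.];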

Simulating multiple samples

An important property of a Monte Carlo simulation study is that it reveals the distribution of statistics that can possibly result in a future data sample. The table in the previous section shows one possible set of statistics, but we can run the simulation additional times to generate hundreds or thousands of additional variations. The following program generates 1,000 possible random samples and outputs the results to a SAS data set. You can then graph the distribution of the statistics. This section uses box plots to visualize the distribution of the proportion of events in each group:

/* simulate 1000 random draws under this model */
SampleID = j(numGroups, 1, .);
nEvents = j(numGroups, 1, .);                /* vector for number of events */
create Sim var {'SampleID' 'Group' 'Size' 'nEvents' 'Proportion'};
do ID = 1 to 1000;
   /* assign all elements the same value:
      https://blogs.sas.com/content/iml/2013/02/18/empty-subscript.html */
   SampleID[,] = ID;
   Size = T( RandMultinomial(1, nTrials, pi) );  /* new random group sizes */
   do i = 1 to nrow(nEvents);
      nEvents[i] = randfun(1, "Binomial", p[i], Size[i]);
   end;
   Proportion = nEvents / Size;
   append;
end;
close Sim;
QUIT;
 
data PlotAll;
set Sim Events(keep=Group p);
run;
 
title "Proportion of Events in Each Group";
title2 "Simulated N=250; NumSim=1000";
/* box plot */
proc sgplot data=PlotAll noautolegend;
   hbox Proportion / category=Group;
   scatter y=Group x=p / markerattrs=GraphData2(symbol=CircleFilled size=11);
   yaxis type=discrete display=(nolabel); /* create categorical axis */
   xaxis grid; 
run;

The box plot shows the distribution of the proportion of events in each group. For example, the proportion in the first group ("A F") ranges from a low of 0% to a high of 33%. The proportion in the sixth group ("C M") ranges from a low of 22% to a high of 100%. The red markers are the parameter values used in the simulation. These are the estimates from the pilot study.

Alternative ways to visualize the distribution within groups

The box plot shows a schematic representation of the distribution of proportions within each group. However, there are alternative ways to visualize the distributions. One way is to use a strip plot, as follows:

/* strip plot */
proc sgplot data=PlotAll noautolegend;
   scatter y=Group x=Proportion / 
           jitter transparency=0.85       /* handle overplotting */
           markerattrs=(symbol=CircleFilled);
   scatter y=Group x=p / markerattrs=(symbol=CircleFilled size=11);
   yaxis type=discrete reverse display=(nolabel); /* create categorical axis */
   xaxis grid; 
run;

The strip plot uses a semi-transparent marker to display each statistic. (Again, the red markers indicate the parameters for the simulation.) The density of the distribution is visualized by the width of the strips and by the darkness of the plot, which is due to overplotting the semi-transparent markers.

This plot makes it apparent that the variation between groups is based on the relative size of the groups in the pilot study. Large groups such as Group 4 ("B M") have less variation than small groups such as Group 6 ("C M"). That's because the standard error of the proportion is inversely proportional to the square root of the sample size. Thus, smaller groups have a wider distribution than the larger groups.
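
Quantitatively, for a group of size n in which the event probability is p, the standard error of the sample proportion is sqrt(p(1-p)/n). Halving the group size therefore widens the spread of the simulated proportions by a factor of about sqrt(2) ≈ 1.4.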

You can see the same features in a panel of histograms. In the following visualization, the distribution of the proportions is shown by using a histogram. Red vertical lines indicate the parameters for the simulation. This graph might be easier to explain to non-statisticians.

/* plot of histograms */
proc sgpanel data=PlotAll;
   panelby Group / onepanel layout=rowlattice novarname;
   histogram Proportion;
   refline p / axis=x lineattrs=GraphData2(thickness=2);
   colaxis grid; 
run;

Summary

This article shows how to simulate event-trials data in SAS. In this example, the data belong to six different groups, and the probability of experiencing the event varies between groups. The groups are also different sizes. In the simulation, the group size and the number of events in each group are random variables. This article also shows three different ways to visualize the results of the simulation: a box plot, a strip plot, and a panel of histograms.

The post Simulate proportions for groups appeared first on The DO Loop.

July 7, 2021
 

In general, it is hard to simulate multivariate data that has a specified correlation structure. Copulas make that task easier for continuous distributions. A previous article presented the geometry behind a copula and explained copulas in an intuitive way. Although I strongly believe that statistical practitioners should be familiar with statistical theory, you do not need to be an expert in copulas to use them to simulate multivariate correlated data. In SAS, the COPULA procedure in SAS/ETS software provides a syntax for simulating data from copulas. This article shows how to use PROC COPULA to simulate data that has a specified rank correlation structure. (Similar syntax applies to the CCOPULA procedure in SAS Econometrics, which can perform simulations in parallel in SAS Viya.)

Example data

Let's choose some example data to use for the simulation study. Suppose you want to simulate data that have marginal distributions and correlations that are similar to the joint distribution of the MPG_City, Weight, and EngineSize variables in the Sashelp.Cars data set. For ease of discussion, I rename these variables X1, X2, and X3 in this article. The following call to PROC CORR visualizes the univariate and bivariate distributions for these data:

/* for ease of discussion, rename vars to X1, X2, X3 */
data Have;
set Sashelp.Cars(keep= MPG_City Weight EngineSize);
label MPG_City= Weight= EngineSize=;
rename MPG_City=X1 Weight=X2 EngineSize=X3;
run;
 
/* graph original (renamed) data */
ods graphics / width=500px height=500px;
proc corr data=Have Spearman noprob plots=matrix(hist);
   var X1 X2 X3;
   ods select SpearmanCorr MatrixPlot;
run;

The graph shows the marginal and bivariate distributions for these data. The goal of this article is to use PROC COPULA to simulate a random sample that looks similar to these data. The output from PROC CORR includes a table of Spearman correlations (not shown), which are Corr(X1,X2)=-0.865, Corr(X1,X3)=-0.862, and Corr(X2,X3)=0.835.

Using PROC COPULA to simulate data

As explained in my previous article about copulas, the process of simulating data involves several conceptual steps. For each step, PROC COPULA provides a statement or option that performs the step:

  1. The multivariate distribution is defined by specifying a marginal distribution for each variable and a correlation structure for the joint distribution. In my previous article, these were created "out of thin air." However, in practice, you often want to simulate random samples that match the correlation and marginal distributions of real data. In PROC COPULA, you use the DATA= option and the VAR statement to specify real data.
  2. Choose a copula family and use the data to estimate the parameters of the copula. I've only discussed the Gaussian copula. For the Gaussian copula, the sample covariance matrix estimates the parameters for the copula, although PROC COPULA uses maximum likelihood estimates instead of the sample covariance. You use the FIT statement to fit the parameters of the copula from the data.
  3. Simulate from the copula. For the normal copula, this consists of generating multivariate normal data with the given rank correlations. These simulated data are transformed to uniformity by applying the normal CDF to each component. You use the SIMULATE statement to simulate the data. If you want to see the uniform marginals, you can use the OUTUNIFORM= option to store them to a SAS data set, or you can use the PLOTS=(DATATYPE=UNIFORM) option to display a scatter plot matrix of the uniform marginals.
  4. Transform the uniform marginals into the specified distributions by applying the inverse CDF transformation for each component. In PROC COPULA, you use the MARGINALS=EMPIRICAL option on the SIMULATE statement to create the marginal distributions by using the empirical CDF of the variables. You can use the OUT= option to write the simulated values to a SAS data set. You can use the PLOTS=(DATATYPE=ORIGINAL) option to display a scatter plot matrix of the simulated data.

The following call to PROC COPULA carries out these steps:

%let N = 428;            /* sample size */
proc copula data=Have;
   var X1 X2 X3;         /* original data vars */
   fit normal;           /* choose normal copula; estimate covariance by MLE */
   simulate / seed=1234 ndraws=&N
              marginals=empirical    /* transform from copula by using empirical CDF of data */
              out=SimData            /* contains the simulated data */
              plots=(datatype=both); /* optional: scatter plots of copula and simulated data */
              /* optional: use OUTUNIFORM= option to store the copula */
   ods select SpearmanCorrelation MatrixPlotSOrig MatrixPlotSUnif;
run;

The primary result of the PROC COPULA call is an output data set called SimData. The data set contains random simulated data. The Spearman correlation of the simulated data is close to the Spearman correlation of the original data. The marginal distributions of the simulated variables are similar to the marginal distributions of the original variables. The following graph shows the distributions of the simulated data. You can compare this graph to the graph of the original data.

When you use PROC COPULA, it is not necessary to visualize the copula, which encapsulates the correlation between the variables. However, if you want to visualize the copula, you can use the PLOTS= option to create a scatter plot matrix of the uniform marginals.

Use PROC COPULA for Monte Carlo simulations

In Monte Carlo simulations, it is useful to have one SAS data set that contains B simulated samples. You can then use BY-group processing to analyze all simulated samples with a single call to a SAS procedure.

In PROC COPULA, the NDRAWS= option on the SIMULATE statement specifies how many observations to simulate. If your original data contains N observations, you can use NDRAWS=N. To get a total of B simulated samples, you can simulate N*B observations. You can then add an ID variable that identifies which observations belong to each sample, as follows:

/* you can run a Monte Carlo simulation from the simulated data */
%let NumSamples = 1000; /* = B = number of Monte Carlo simulations */
%let NTotal = %eval(&N * &NumSamples);
%put &=NTotal;
 
ods exclude all;
proc copula data=Have;
   var X1 X2 X3;
   fit normal;
   simulate / seed=1234 ndraws=&NTotal marginals=empirical out=SimData;
run;
ods exclude none;
 
/* add ID variable that identifies each set of &N observations */
data MCData;
set SimData;
SampleID = ceil(_N_/&N);  /* 1,1,...,1,2,2,....,2,3,3,... */
run;

You can use your favorite SAS procedure and the BY SAMPLEID statement to analyze each sample in the MCData data set.
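
For instance, here is a minimal sketch that computes the Spearman correlations for every simulated sample; the CorrBySample data set name is arbitrary.

/* sketch: analyze each simulated sample with a single procedure call */
proc corr data=MCData Spearman noprint outs=CorrBySample;
   by SampleID;
   var X1 X2 X3;
run;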

Summary

Given a set of continuous variables, a copula enables you to simulate a random sample from a distribution that has the same rank correlation structure and marginal distributions as the specified variables. A previous article discusses the mathematics and the geometry of copulas. In SAS, the COPULA procedure enables you to simulate data from copulas. This article shows how to use PROC COPULA to simulate data. The procedure can create graphs that visualize the simulated data and the copula. The main output is a SAS data set that contains the simulated data.

The post Simulate multivariate correlated data by using PROC COPULA in SAS appeared first on The DO Loop.

July 5, 2021
 

Do you know what a copula is? It is a popular way to simulate multivariate correlated data. The literature for copulas is mathematically formidable, but this article provides an intuitive introduction to copulas by describing the geometry of the transformations that are involved in the simulation process. Although there are several families of copulas, this article focuses on the Gaussian copula, which is the simplest to understand.

This article shows the geometry of copulas. This article and its example are based on Chapter 9 of Simulating Data with SAS (Wicklin, 2013). A previous article shows the geometry of the Iman-Conover transformation, which is an alternative way to create simulated data that have a specified rank correlation structure and specified marginal distributions.

Simulate data by using a copula

Recall that you can use the CDF function to transform any distribution to the uniform distribution. Similarly, you can use the inverse CDF to transform the uniform distribution to any distribution. To simulate correlated multivariate data from a Gaussian copula, follow these three steps:

  1. Simulate correlated multivariate normal data from a correlation matrix. The marginal distributions are all standard normal.
  2. Use the standard normal CDF to transform the normal marginals to the uniform distribution.
  3. Use inverse CDFs to transform the uniform marginals to whatever distributions you want.

The transformations in the second and third steps are performed on the individual columns of a data matrix. The transformations are monotonic, which means that they do not change the rank correlation between the columns. Thus, the final data has the same rank correlation as the multivariate normal data in the first step.

The next sections explore a two-dimensional example of using a Gaussian copula to simulate correlated data where one variable is Gamma-distributed and the other is Lognormally distributed. The article then provides intuition about what a copula is and how to visualize it. You can download the complete SAS program that generates the example and creates the graphs in this article.

A motivating example

Suppose that you want to simulate data from a bivariate distribution that has the following properties:

  • The rank correlation between the variables is approximately 0.6.
  • The marginal distribution of the first variable, X1, is Gamma(4) with unit scale.
  • The marginal distribution of the second variable, X2, is lognormal with parameters μ=0.5 and σ=0.8. That is, log(X2) ~ N(μ, σ).

The hard part of a multivariate simulation is getting the correlation structure correct, so let's start there. It seems daunting to generate a "Gamma-Lognormal distribution" with a correlation of 0.6, but it is straightforward to generate a bivariate NORMAL distribution with that correlation. Let's do that. Then we'll use a series of transformations to transform the normal marginal variables into the distributions that we want while preserving the rank correlation at each step.

Simulate multivariate normal data

The SAS/IML language supports the RANDNORMAL function, which can generate multivariate normal samples, as shown in the following statements:

proc iml;
N = 1e4;
call randseed(12345);
/* 1. Z ~ MVN(0, Sigma) */
Sigma = {1.0  0.6,
         0.6  1.0};
Z = RandNormal(N, {0,0}, Sigma);          /* Z ~ MVN(0, Sigma) */

The matrix Z contains 10,000 observations drawn from a bivariate normal distribution with correlation coefficient ρ=0.6. Because the mean vector is (0,0) and the covariance parameter is a correlation matrix, the marginal distributions are standard normal. The following graph shows a scatter plot of the bivariate normal data along with histograms for each marginal distribution.

Transform marginal distributions to uniform

The first step is to transform the normal marginals into a uniform distribution by using the probability integral transform (also known as the CDF transformation). The columns of Z are standard normal, so Φ(X) ~ U(0,1), where Φ is the cumulative distribution function (CDF) for the univariate normal distribution. In SAS/IML, the CDF function applies the cumulative distribution function to all elements of a matrix, so the transformation of the columns of Z is a one-liner:

/* 2. transform marginal variables to U(0,1) */
U = cdf("Normal", Z);                     /* U_i are correlated U(0,1) variates */

The columns of U are samples from a standard uniform distribution. However, the columns are not independent. Because the normal CDF is a monotonic transformation, it does not change the rank correlation between the columns. That is, the rank correlation of U is the same as the rank correlation of Z:

/* the rank correlations for Z and U are exactly the same */
rankCorrZ = corr(Z, "Spearman")[2]; 
rankCorrU = corr(U, "Spearman")[2]; 
print rankCorrZ rankCorrU;

The following graph shows a scatter plot of the transformed data along with histograms for each marginal distribution. The histograms show that the columns U1 and U2 are uniformly distributed on [0,1]. However, the joint distribution is correlated, as shown in the following scatter plot:

Transform marginal distributions to any distribution

Now comes the magic. One-dimensional uniform variates are useful because you can transform them into any distribution! How? Just apply the inverse cumulative distribution function of whatever distribution you want. For example, you can obtain gamma variates from the first column of U by applying the inverse gamma CDF. Similarly, you can obtain lognormal variates from the second column of U by applying the inverse lognormal CDF. In SAS, the QUANTILE function applies the inverse CDF, as follows:

/* 3. construct the marginals however you wish */
gamma = quantile("Gamma", U[,1], 4);            /* gamma ~ Gamma(alpha=4)   */
LN    = quantile("LogNormal", U[,2], 0.5, 0.8); /* LN ~ LogNormal(0.5, 0.8) */
X = gamma || LN;
 
/* check that the rank correlation still has not changed */
rankCorrX = corr(X, "Spearman")[2]; 
print rankCorrZ rankCorrX;

The first column of X is gamma-distributed. The second column of X is lognormally distributed. But because the inverse of a continuous CDF is monotonic, the column transformations do not change the rank correlation between the columns. The following graph shows a scatter plot of the newly transformed data along with histograms for each marginal distribution. The histograms show that the columns X1 and X2 are distributed as gamma and lognormal, respectively. The joint distribution is correlated.

What about the Pearson correlation?

For multivariate normal data, the Pearson correlation is close to the rank correlation. However, that is not true for nonnormal distributions. The CDF and inverse-CDF transformations are nonlinear, so the Pearson correlation is not preserved when you transform the marginals. For example, the following statements compute the Pearson correlation for Z (the multivariate normal data) and for X (the gamma-lognormal data):

rhoZ = corr(Z, "Pearson")[2];
rhoX = corr(X, "Pearson")[2];
print rhoZ rhoX;

Chapter 9 of Wicklin (2013) discusses the fact that you can adjust the correlation for Z and it will affect the correlation for X in a complicated manner. With a little work, you can choose a correlation for Z that will result in a specified correlation for X. For example, if you want the final Pearson correlation for X to be 0.6, you can use 0.642 as the correlation for Z:

/* re-run the example with new correlation 0.642 */
Sigma = {1.0    0.642,
         0.642  1.0};
newZ = RandNormal(N, {0,0}, Sigma);
newU = cdf("Normal", newZ);           /* columns of U are U(0,1) variates */
gamma = quantile("Gamma", newU[,1], 4);             /* gamma ~ Gamma(alpha=4)   */
LN    = quantile("LogNormal", newU[,2], 0.5, 0.8);  /* LN ~ LogNormal(0.5, 0.8) */
newX = gamma || LN;
 
rhoZ = corr(newZ, "Pearson")[2];
rhoX = corr(newX, "Pearson")[2];
print rhoZ rhoX;

Success! The Pearson correlation between the gamma and lognormal variables is 0.6. In general, this is a complicated process because the value of the initial correlation depends on the target value and on the specific forms of the marginal distributions. Finding the initial value requires finding the root of an equation that involves the CDF and inverse CDF.
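
A minimal sketch of that root-finding approach follows, under the assumption that you estimate the Pearson correlation by simulation. The module name, the bracketing interval {0.4 0.9}, and the sample size are arbitrary choices; because the objective function is estimated by simulation, it is slightly noisy, so use a large sample.

/* sketch: find the normal correlation rho that yields Pearson corr 0.6 for gamma-lognormal data */
proc iml;
call randseed(54321);
N = 1e5;                                   /* a large N keeps the Monte Carlo noise small */
start PearsonMinusTarget(rho) global(N);
   Sigma = (1 || rho) // (rho || 1);
   Z = RandNormal(N, {0,0}, Sigma);        /* bivariate normal with correlation rho */
   U = cdf("Normal", Z);                   /* uniform marginals */
   X = quantile("Gamma", U[,1], 4) || quantile("LogNormal", U[,2], 0.5, 0.8);
   return( corr(X, "Pearson")[2] - 0.6 );  /* deviation from the target Pearson correlation */
finish;
rho = froot("PearsonMinusTarget", {0.4 0.9});  /* solve for the root on [0.4, 0.9] */
print rho;
quit;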

Higher dimensional data

The example in this article generalizes to higher-dimensional data in a natural way. The SAS program that generates the example in this article also includes a four-dimensional example of simulating correlated data. The program simulates four correlated variables whose marginal distributions are gamma, lognormal, exponential, and inverse Gaussian. The following panel shows the bivariate scatter plots and marginal histograms for this four-dimensional simulated data.

What is a copula?

I've shown many graphs, but what is a copula? The word copula comes from a Latin word (copulare) which means to bind, link, or join. The same Latin root gives us the word "copulate." A mathematical copula is a joint probability distribution that induces a specified correlation structure among independent marginal distributions. Thus, a copula links or joins individual univariate distributions into a joint multivariate distribution that has a specified correlation structure.

Mathematically, a copula is any multivariate cumulative distribution function for which each component variable has a uniform marginal distribution on the interval [0, 1]. That's it. A copula is a cumulative distribution function whose domain is the cube [0,1]^d. So the second graph in this article is the graph most nearly related to the copula for the bivariate data, since it shows the relationship between the uniform marginals.

The scatter plot shows the density, not the cumulative distribution. However, you can use the data to estimate the bivariate CDF. The following heat map visualizes the Gaussian copula for the correlation of 0.6:
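
Here is a minimal sketch of one way to compute such a heat map. It assumes that the U matrix from the earlier steps is still in the PROC IML session; the CopulaCDF data set name and the 0.05 grid spacing are arbitrary choices.

/* sketch: estimate the copula C(u1,u2) = Pr(U1 <= u1, U2 <= u2) on a grid */
u1 = do(0, 1, 0.05);                /* grid spacing of 0.05 is arbitrary */
u2 = do(0, 1, 0.05);
grid = j(ncol(u1)*ncol(u2), 3, .);
k = 1;
do i = 1 to ncol(u1);
   do j = 1 to ncol(u2);
      grid[k,] = u1[i] || u2[j] || mean(U[,1]<=u1[i] & U[,2]<=u2[j]);
      k = k + 1;
   end;
end;
create CopulaCDF from grid[c={'u1' 'u2' 'cdf'}];  append from grid;  close;
quit;
 
/* display the empirical copula CDF as a heat map */
proc sgplot data=CopulaCDF;
   heatmapparm x=u1 y=u2 colorresponse=cdf;
run;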

Notice that this function does not depend on the final marginal distributions. It depends only on the multivariate normal distribution and the CDF transformation that produces the uniform marginals. In that sense, the copula captures only the correlation structure, without regard for the final form of the marginal distributions. You can use this same copula for ANY set of marginal distributions because the copula captures the correlation structure independently of the final transformations of the marginals.

Why are copulas important?

A remarkable theorem (Sklar’s theorem) says that every joint distribution can be written as a copula and a set of marginal distributions. If the marginals are continuous, then the copula is unique. Let that sink in for a second: Every continuous multivariate probability distribution can be expressed in terms of its marginal distributions and a copula. Every one. So, if you can understand copulas you can understand all multivariate distributions. Powerful math, indeed!

For simulation studies, you can use the converse of Sklar's theorem, which is also true. Specifically, if you have a set of d uniform random variables and a set of marginal distribution functions, a copula transforms the d components into a d-dimensional probability distribution.

Summary

Copulas are mathematically sophisticated. However, you can use copulas to simulate data without needing to understand all the mathematical details. This article presents an example of using a Gaussian copula to simulate multivariate correlated data. It shows the geometry at each step of the three-step process:

  1. Simulate data from a multivariate normal distribution with a known correlation matrix.
  2. Use the normal CDF to transform the marginal distributions to uniform.
  3. Use inverse CDFs to obtain any marginal distributions that you want.

The result is simulated correlated multivariate data that has the same rank correlation as the original simulated data but has arbitrary marginal distributions.

This article uses SAS/IML to show the geometry of the copula transformations. However, SAS/ETS software provides PROC COPULA, which enables you to perform copula modeling and simulation in a single procedure. A subsequent article shows how to use PROC COPULA.

The post An introduction to simulating correlated data by using copulas appeared first on The DO Loop.

June 23, 2021
 

This article uses simulation to demonstrate the fact that any continuous distribution can be transformed into the uniform distribution on (0,1). The function that performs this transformation is a familiar one: it is the cumulative distribution function (CDF). A continuous CDF is defined as an integral, so the transformation is called the probability integral transformation.

This article demonstrates the probability integral transformation and its close relative, the inverse CDF transformation. These transformations are useful in many applications, such as constructing statistical tests and simulating data.

The probability integral transform

Let X be a continuous random variable whose probability density function is f. Then the corresponding cumulative distribution function (CDF) is the integral \(F(x) = \int\nolimits_{-\infty}^{x} f(t)\,dt\). You can prove that the random variable Y = F(X) is uniformly distributed on (0,1). In terms of data, if {X1, X2, ..., Xn} is a random sample from X, the values {F(X1), F(X2), ..., F(Xn)} are a random sample from the uniform distribution.
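
The argument is short when F is continuous and strictly increasing: for \(0 < y < 1\), \(P(Y \leq y) = P(F(X) \leq y) = P(X \leq F^{-1}(y)) = F(F^{-1}(y)) = y\), which is the CDF of the uniform distribution on (0,1).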

Let's look at an example. The following SAS DATA step generates a random sample from the Gamma distribution with shape parameter α=4. At the same time, the DATA step computes U=F(X), where F is the CDF of the gamma distribution. A histogram of the U variable shows that it is uniformly distributed:

/* The probability integral transformation: If X is a random 
   variable with CDF F, then Y=F(X) is uniformly distributed on (0,1).
*/
%let N = 10000;
data Gamma4(drop=i);
call streaminit(1234);
do i = 1 to &N;
   x = rand("Gamma", 4);     /* X ~ Gamma(4) */
   u = cdf("Gamma", x, 4);   /* U = F(X) ~ U(0,1)   */
   output;
end;
run;
 
title "U ~ U(0,1)";
proc sgplot data=Gamma4;
   histogram U / binwidth=0.04 binstart=0.02;  /* center first bin at h/2 */
   yaxis grid;
run;

The histogram has some bars that are greater than the average and others that are less than the average. That is expected in a random sample.

One application of the CDF transformation is to test whether a random sample comes from a particular distribution. You can transform the sample by using the CDF function, then test whether the transformed values are distributed uniformly at random. If the transformed variates are uniformly distributed, then the original data was distributed according to the distribution F.
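
A minimal sketch of such a check for the Gamma4 data and the U variable created above: because the Beta(1,1) distribution on [0,1] is the same as U(0,1), and because all parameters are specified, PROC UNIVARIATE computes EDF goodness-of-fit tests for uniformity.

/* sketch: test the transformed values for uniformity */
proc univariate data=Gamma4;
   var u;
   histogram u / beta(alpha=1 beta=1 theta=0 sigma=1);  /* Beta(1,1) on [0,1] is U(0,1) */
   ods select Histogram GoodnessOfFit;
run;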

The inverse probability transformation

It is also useful to run the CDF transform in reverse. That is, start with a random uniform sample and apply the inverse-CDF function (the quantile function) to generate random variates from the specified distribution. This "inverse CDF method" is a standard technique for generating random samples.

The following example shows how it works. The DATA step generates random uniform variates on (0,1). The variates are transformed by using the quantiles of the lognormal(0.5, 0.8) distribution. A call to PROC UNIVARIATE overlays the density curve of the lognormal(0.5, 0.8) distribution on the histogram of the simulated data. The curve matches the data well:

/* inverse probability transformation */
data LN(drop=i);
call streaminit(12345);
do i = 1 to &N;
   u = rand("Uniform");                    /* U ~ U(0,1)   */
   x = quantile("Lognormal", u, 0.5, 0.8); /* X ~ LogNormal(0.5, 0.8) */
   output;
end;
run;
 
proc univariate data=LN;
   var x;
   histogram x / lognormal(scale=0.5 shape=0.8) odstitle="X ~ LogNormal(0.5, 0.8)";
run;

Summary

The probability integral transform (also called the CDF transform) is a way to transform a random sample from any distribution into the uniform distribution on (0,1). The inverse CDF transform transforms uniform data into a specified distribution. These transformations are used in testing distributions and in generating simulated data.

Appendix: Histograms of uniform variables

The observant reader might have noticed that I used the BINWIDTH= and BINSTART= options to set the endpoints of the bins for the histogram of uniform data on (0,1). Why? Because, by default, the histogram centers bins at round values (such as 0), so the first and last bins extend beyond the range of the data. For data that are bounded on an interval, you can use the BINWIDTH= and BINSTART= options to improve the visualization.

Let's use the default binning algorithm to create a histogram of the uniform data:

title "U ~ U(0,1)";
title2 "34 Bins: First Bin Centered at 0 (Too Short)";
proc sgplot data=Gamma4;
   histogram u;
   yaxis grid;
run;

Notice that the first bin is "too short." The first bin is centered at the location U=0. Because the data cannot be less than 0, the first bin has only half as many counts as the other bins. The last bin is also affected, although the effect is not usually as dramatic.

The histogram that uses the default binning is not wrong, but it doesn't show the uniformity of the data as clearly as a histogram that uses the BINWIDTH= and BINSTART= options to force 0 and 1 to be the endpoints of the bins.

The post The probability integral transform appeared first on The DO Loop.

June 16, 2021
 

A previous article showed how to simulate multivariate correlated data by using the Iman-Conover transformation (Iman and Conover, 1982). The transformation preserves the marginal distributions of the original data but permutes the values (columnwise) to induce a new correlation among the variables.

When I first read about the Iman-Conover transformation, it seemed like magic. But, as with all magic tricks, if you peer behind the curtain and understand how the trick works, you realize that it's math, not magic. This article shows the main mathematical ideas behind the Iman-Conover transformation. For clarity, the article shows how two variables can be transformed to create different variables that have a new correlation but the same marginal distributions.

A series of transformations

The Iman-Conover transformation is actually a sequence of four transformations. Let X be a data matrix (representing continuous data) and let C be a target rank-correlation matrix. The goal of the Iman-Conover transformation is to permute the elements in the columns of X to create a new data matrix (W) whose rank correlation is close to C. Here are the four steps in the Iman-Conover algorithm:

  1. Create normal scores: Compute the van der Waerden normal scores for each column of X. Let S be the matrix of normal scores.
  2. Uncorrelate the scores. Let Z be the matrix of uncorrelated scores. Let Q^(-1) be the transformation that maps S to Z.
  3. Correlate the scores to match the target. Let Y be the correlated scores. Let P be the transformation that maps Z to Y.
  4. Reorder the elements in the i_th column of X to match the order of the elements in the i_th column of Y. This ensures that the reordered data matrix, W, has the same rank correlation as Y.

The following diagram presents the steps of the Iman-Conover transformation. The original data are in the upper-left graph. The final data are in the lower-left graph. Each transformation is explained in a separate section of this article.

Background information

To understand the Iman-Conover transformation, recall the three mathematical facts that make it possible:

The purpose of this article is to present the main ideas behind the Iman-Conover transformation. This is not a proof, and almost every sentence should include the word "approximately." For example, when I write, "the data are multivariate normal" or "the data has the specified correlation," you should insert the word "approximately" into those sentences. I also blur the distinction between correlation in the population and sample correlations.

Example data

This example uses the EngineSize and MPG_Highway variables from the Sashelp.Cars data set. These variables are not normally distributed. They have a rank correlation of -0.77. The goal of this example is to permute the values to create new variables that have a rank correlation that is close to a target correlation. This example uses +0.6 as the target correlation, but you could choose any other value in the interval (-1, 1).

The following DATA step creates the data for this example. The graph in the upper-left of the previous diagram shows the negative correlation between these variables:

data Orig;
set sashelp.cars(rename=(EngineSize=X MPG_Highway=Y));
keep X Y;
label X= Y=;
run;

Transformation 1: Create normal scores

You can compute the van der Waerden scores for each variable. The van der Waerden scores are approximately normal—or at least "more normal" than the original variables. The following SAS/IML statements create the van der Waerden scores as columns of a matrix, S. Because the normalizing transformation is monotonic, the columns of S have the same rank correlation as the original data. The second graph in the previous diagram shows a scatter plot of the columns of S.

proc iml;
varNames = {"X" "Y"};
use Orig; read all var varNames into X; close;
 
/* T1: Create normal scores of each column 
   Columns of S have exactly the same ranks as columns of X.
   ==> Spearman corr is the same    */
N = nrow(X);
S = J(N, ncol(X));
do i = 1 to ncol(X);
   ranks = ranktie(X[,i], "mean");          /* tied ranks */
   S[,i] = quantile("Normal", ranks/(N+1)); /* van der Waerden scores */
end;
/* verify that rank correlation has not changed */
corrX = corr(X, "Spearman")[2];        /* rank correlation of the original data */
corrScores = corr(S, "Spearman")[2];   /* rank correlation of the normal scores */
print corrX, corrScores;

Transformation 2: Uncorrelate the scores

For many data sets, the van der Waerden scores are approximately multivariate normal. Let CS be the Pearson correlation matrix of S. Let Q be the Cholesky root of CS. The inverse of Q is a transformation that removes any correlations among multivariate normal data. Thus, if you use the inverse of Q to transform the scores, you obtain multivariate data that are uncorrelated. The third graph in the previous diagram shows a scatter plot of the columns of Z.

/* T2: use the inverse Cholesky transform of the scores to uncorrelate */
CS = corr(S);        /* correlation of scores */
Q = root(CS);        /* Cholesky root of correlation of scores */
Z = S*inv(Q);        /* uncorrelated MVN data */
corrZ = corr(Z, "Spearman")[2];
print corrZ;

Transformation 3: Correlate the scores

Let C be the target correlation matrix. Let P be its Cholesky root. The matrix P is a transformation that induces the specified correlations in uncorrelated multivariate normal data. Thus, you obtain data, Y, that are multivariate normal with the given Pearson correlation. In most situations, the Spearman rank correlation is close to the Pearson correlation. The fourth graph in the previous diagram shows a scatter plot of the columns of Y.

/* T3: use the Cholesky transform of the target matrix to induce correlation */
/* define the target correlation */
C = {1    0.6,
     0.6 1  };
P = root(C);         /* Cholesky root of target correlation */
Y = Z*P;
corrY = corr(Y, "Spearman")[2];
print corrY;

Transformation 4: Reorder the values in columns of the original matrix

The data matrix Y has the correct correlations, but the Cholesky transformations of the scores have changed the marginal distributions of the scores. But that's okay: Iman and Conover recognized that any two matrices whose columns have the same ranks also have the same rank correlation. Thus, the last step uses the column ranks of Y to reorder the values in the columns of X.

Iman and Conover (1982) assume that the columns of X do not have any duplicate values. Under that assumption, when you use the ranks of each column of Y to permute the columns of X, the rank correlation of Y and X will be the same. (And, of course, you have not changed the marginal distributions of X.) If X has duplicate values (as in this example), then the rank correlations of Y and X will be close, as shown by the following statements:

/* T4: Permute or reorder data in the columns of X to have the same ranks as Y */
W = X; 
do i = 1 to ncol(Y);
   rank = rank(Y[,i]);          /* use ranks as subscripts, so no tied ranks */
   tmp = W[,i]; call sort(tmp); /* sort column by ranks */
   W[,i] = tmp[rank];           /* reorder the column of X by the ranks of Y */
end;
 
corrW = corr(W, "Spearman")[2];
print corrW;

The rank correlation for the matrix W is very close to the target correlation. Furthermore, the columns have the same values as the original data, and therefore the marginal distributions are the same. The lower-left graph in the previous diagram shows a scatter plot of the columns of W.

The effect of tied values

As mentioned previously, the Iman-Conover transformation was designed for continuous variables that do not have any duplicate values. In practice, the Iman-Conover transformation still works well even if there are some duplicate values in the data. For this example, there are 428 observations. However, the first column of X has only 43 unique values and the second column has only 33 unique values. Because the normal scores are created by using the RANKTIE function, the columns of the S matrix also have 43 and 33 unique values, respectively.

Because the Q^(-1) matrix is upper triangular, the number of unique values might change for the Z matrix. For these data, the columns of Z and Y have 43 and 194 unique values, respectively. These unique values determine the matrix W, which is why the rank correlation for W is slightly different than the rank correlation for Y in this example. If all data values were unique, then W and Y would have exactly the same rank correlation.
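
If you want to verify these counts yourself, here is a minimal sketch, run in the same PROC IML session (which still contains the X, S, Z, and Y matrices):

/* sketch: count the distinct values in each column of each matrix */
counts = j(ncol(X), 4, .);
do i = 1 to ncol(X);
   counts[i,] = ncol(unique(X[,i])) || ncol(unique(S[,i])) ||
                ncol(unique(Z[,i])) || ncol(unique(Y[,i]));
end;
print counts[c={'X' 'S' 'Z' 'Y'} r={'Col1' 'Col2'}];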

Summary

The Iman-Conover transformation seems magical when you first see it. This article attempts to lift the curtain and reveal the mathematics behind the algorithm. By understanding the geometry of various transformations, you can understand the mathe-magical properties that make the Iman-Conover algorithm an effective tool in multivariate simulation studies.

You can download the complete SAS program that creates all graphs and tables in this article.

The post The geometry of the Iman-Conover transformation appeared first on The DO Loop.

June 14, 2021
 

Simulating univariate data is relatively easy. Simulating multivariate data is much harder. The main difficulty is to generate variables that have given univariate distributions but also are correlated with each other according to a specified correlation matrix. However, Iman and Conover (1982, "A distribution-free approach to inducing rank correlation among input variables") proposed a transformation that approximately induces a specified correlation among component variables without changing the univariate distribution of the components. (The univariate distributions are called the marginal distributions.) The Iman-Conover transformation is designed to transform continuous variables.

This article defines the Iman-Conover transformation. You can use the transformation to generate a random sample that has (approximately) a given correlation structure and known marginal distributions. The Iman-Conover transformation uses rank correlation, rather than Pearson correlation. In practice, the rank correlation and the Pearson correlation are often close to each other.

This article shows how to use the Iman-Conover transformation. A separate article explains how the Iman-Conover transformation works. The example and SAS program in this article are adapted from Chapter 9 of my book Simulating Data with SAS (Wicklin, 2013).

Create simulated data with known distributions

To use the Iman-Conover transformation, you must specify two things: a set of data and a target correlation matrix. The transformation induces the specified correlation in the data by rearranging (permuting) the values within each column of the data. In other words, the algorithm permutes each column independently to induce the specified correlation without changing the marginal distributions.

This example generates data with four variables whose marginal distributions are normal, lognormal, exponential, and uniform, respectively. It does not matter whether the original data are correlated or not. For simplicity, generate each variable independently, so the initial correlation matrix is approximately the identity matrix. The following SAS DATA step simulates four variables from the specified distributions. The call to PROC CORR computes the rank correlation of the original data. It also creates a scatter plot matrix that shows the marginal distributions of the variables in the diagonal cells and the bivariate joint distributions in the off-diagonal cells.

%let N = 100; /* sample size */
/* independently simulate variables from known distributions */
data SimIndep;
call streaminit(12345);
do i = 1 to &N;
   Normal    = rand("Normal");
   Lognormal = rand("Lognormal");
   Expon     = rand("Expo");
   Uniform   = rand("Uniform");
   output;
end;
drop i;
run;
 
proc corr data=SimIndep Spearman noprob plots=matrix(hist);
   var Normal Lognormal Expon Uniform;
run;

For these simulated data, the table shows that the rank correlation between any pair of variables is small. None of the correlations are statistically different from 0, which you can verify by removing the NOPROB option from the PROC CORR statement.

Implement the Iman-Conover transformation in SAS

The Iman-Conover transformation is actually a sequence of transformations of multivariate data. The following program defines the Iman-Conover transformation as a SAS/IML function. An explanation of the function is provided in a separate article.

/* define and store a SAS/IML function that computes the Iman-Conover transformation */
proc iml;
/* Input: X is a data matrix with k columns
          C is a (k x k) rank correlation matrix
   Output: A matrix W. The i_th column of W is a permutation of the i_th
          column of X. The rank correlation of W is close to the 
          specified correlation matrix, C.
*/          
start ImanConoverTransform(X, C);
   N = nrow(X);
   S = J(N, ncol(X));
   /* T1: Create normal scores of each column */
   do i = 1 to ncol(X);
      ranks = ranktie(X[,i], "mean");          /* tied ranks */
      S[,i] = quantile("Normal", ranks/(N+1)); /* van der Waerden scores */
   end;
   /* T2: apply two linear transformations to the scores */
   CS = corr(S);        /* correlation of scores */
   Q = root(CS);        /* Cholesky root of correlation of scores */
   P = root(C);         /* Cholesky root of target correlation */
   T = solve(Q,P);      /* same as  T = inv(Q) * P; */
   Y = S*T;             /* transform scores: Y has rank corr close to target C */
 
   /* T3: Permute or reorder data in the columns of X to have the same ranks as Y */
   W = X;
   do i = 1 to ncol(Y);
      rank = rank(Y[,i]);          /* use ranks as subscripts, so no tied ranks */
      tmp = W[,i]; call sort(tmp); /* sort column by ranks */
      W[,i] = tmp[rank];           /* reorder the column of X by the ranks of Y */
   end;
   return( W );
finish;
store module=(ImanConoverTransform);  /* store definition for later use */
quit;

Unlike many algorithms in the field of simulation, the Iman-Conover transformation is deterministic. Given a matrix of data (X) and a correlation matrix (C), the function returns the same transformed data every time you call it.

Use the Iman-Conover transformation to simulate correlated data

The following SAS/IML program shows how to use the Iman-Conover transformation to simulate correlated data. There are three steps:

  1. Read real or simulated data into a matrix, X. The columns of X define the marginal distributions. For this example, we will use the SimIndep data, which contains four variables whose marginal distributions are normal, lognormal, exponential, and uniform, respectively.
  2. Define a "target" correlation matrix. This specifies the rank correlations that you hope to induce. For this example, the target correlation between the Normal and Lognormal variables is 0.75, the target correlation between the Normal and Exponential variables is -0.7, the target correlation between the Normal and Uniform variables is 0, and so forth.
  3. Call the ImanConoverTransform function. This returns a new data matrix, W. The i_th column of W is a permutation of the i_th column of X. The permutations are chosen so that the rank correlation of W is close to the specified correlation matrix:

proc iml;
load module=(ImanConoverTransform);  /* load the function definition */
 
/* Step 1: Read in the data */
varNames = {'Normal' 'Lognormal' 'Expon' 'Uniform'};
use SimIndep; read all var varNames into X; close;
 
/* Step 2: specify target rank correlation */
/*    X1    X2    X3    X4                 */
C = { 1.00  0.75 -0.70  0,           /* X1 */
      0.75  1.00 -0.95  0,           /* X2 */
     -0.70 -0.95  1.00 -0.3,         /* X3 */
      0     0    -0.3   1.0};        /* X4 */
 
/* Step 3: Strategically reorder the columns of X. 
           The new ordering has the specified rank corr (approx)
           and preserves the original marginal distributions. */
W = ImanConoverTransform(X, C);
RankCorr = corr(W, "Spearman");
print RankCorr[format=6.3 r=varNames c=varNames];

As advertised, the rank correlation of the new data matrix, W, is very close to the specified correlation matrix, C. You can write the data matrix to a SAS data set and use PROC CORR to create a scatter plot matrix that has the marginal histograms on the diagonal:

/* write new data to a SAS data set */
newvarNames = {'newNormal' 'newLognormal' 'newExpon' 'newUniform'};
create SimCorr from W[c=newvarNames];  append from W;  close;
quit;
 
proc corr data=SimCorr Spearman noprob plots=matrix(hist);
   var newNormal newLognormal newExpon newUniform;
run;

If you compare the diagonal histograms in this graph to the histograms in the previous graph, you will see that they are identical. They have to be identical because each new variable contains exactly the same data values as the original variables.

At the same time, the bivariate scatter plots show how permuting the data has induced the prescribed correlations between the variables. For this example, the newNormal and newLognormal variables have a strong positive rank correlation (0.73). The newNormal and newExpon variables have a strong negative rank correlation (-0.65). The newNormal and newUniform variables are essentially uncorrelated (-0.03), and so forth.

Summary

The Iman-Conover transformation is a remarkable algorithm that induces a specified correlation structure on a data matrix while preserving the marginal distributions. You start with any data matrix (X) and any target correlation matrix. (The columns of X must represent continuous data.) You apply the Iman-Conover transformation to obtain a new matrix (W). The rank correlation of the new matrix is close to the values you specified. The columns of W have the same values as the columns of X, but the values have been reordered. Thus, the new matrix has the same marginal distributions as the original matrix.

In a follow-up article, I will explain the geometry of the Iman-Conover transformation and how it works.

The post Simulate correlated variables by using the Iman-Conover transformation appeared first on The DO Loop.

May 5, 2021
 

Quick! Which fraction is bigger, 40/83 or 27/56? It's not always easy to mentally compare two fractions to determine which is larger. For this example, you can easily see that both fractions are a little less than 1/2, but to compare the numbers you need to compare the products 40*56 and 27*83. Wouldn't it be great if there were a simpler way to compare fractions, maybe by using addition rather than multiplication?

It turns out that there is a simple equation that is often (but not always!) correct. I learned about this equation from John D. Cook, who blogged about this trick for comparing fractions by using only addition. According to a mathematical result by K. Odani (1998), the rule is correct more than 91% of the time.

Specifically, let N be a large integer. Randomly choose integers a, b, c, and d between 1 and N. Look at the signs of the following differences:

  • X = a*d – b*c
  • Y = (a+d) – (b+c)

Because a, b, c, and d are random, the quantities X and Y are random variables. K. Odani (1998) shows that X and Y have the same sign quite often for large N. As N goes to infinity, the probability that X and Y have the same sign is 11/12 ≈ 0.91667. I will refer to this result as Odani's truism.

Odani's truism provides a "rough and ready" rule for comparing two fractions a/b and c/d. If a/b < c/d, then X < 0, and, asymptotically, the expression (a+d) < (b+c) is true 91% of the time. Or, inverting the logic, you can evaluate (a+d) and (b+c) to determine, with high probability, whether the fraction a/b is greater than or less than the fraction c/d.

For example, again consider the fractions 40/83 and 27/56. It is easy to compute the sums 40+56=96 and 27+83=110. The first sum is less than the second, so Odani's truism suggests that 40/83 is less than 27/56, which is, in fact, true!

However, the summation rule does not always hold. For example, the summation rule for the fractions 13/27 and 40/83 gives the sums 96 and 67, but 13/27 is less than 40/83. If you use Odani's truism, you can never be sure that it holds for the two fractions you are trying to compare!

Simulation of Odani's truism

You can demonstrate Odani's truism by using a simulation. Choose a large number N, and choose integers a, b, c, and d uniformly at random in [1, N]. Then the difference of the products (X = a*d – b*c) and the difference of the sums (Y = (a+d) – (b+c)) should have the same sign about 91% of the time. The following SAS DATA step carries out the simulation for a particular choice of N:

/* Simulation to demonstrate Odani's truism: To determine if 
   a/b < c/d
   it is often sufficient to test whether 
   a+d < b+c
*/
data _null_;
retain count 0;
call streaminit(54321);
N = 1E8;                /* upper bound. Integers in [1,N] */
numSamples = 1E6;       /* number of random draws */
 
do i = 1 to numSamples;
   a = rand('integer', 1, N);
   b = rand('integer', 1, N);
   c = rand('integer', 1, N);
   d = rand('integer', 1, N);
   x = a*d - b*c;               /* correct way to compare a/b and c/d */
   y = (a+d) - (b+c);           /* Odani's truism is correct 11/12 of time */
   count + (sign(x) = sign(y)); /* is truism correct for this 4-tuple (a,b,c,d)? */
end;
prob = count / numSamples;    /* empirical probability for simulation */
put "Odani's truism holds with probability " prob;
run;
Odani's truism holds with probability 0.91648

This simulation randomly generates 1,000,000 integer 4-tuples between 1 and 100 million. For each 4-tuple, the program compares the difference in the product (a*d - b*c) and the difference in the sum ((a+d) - (b+c)). If these quantities have the same signs (positive, negative, or zero), a counter is incremented. The result shows that the quantities have the same sign more than 91% of the time, in accordance with Odani's theorem.

Interesting math versus practical math

As an applied mathematician, I am always curious whether an interesting mathematical result has a practical application. As John D. Cook points out at the end of his blog post, some fractions are easy to mentally compare whereas others are not. I don't need any trick to know that 6/29 is less than 18/35. It is obvious that the first fraction is less than 1/2 (in fact, it is close to 1/5) whereas the second is greater than 1/2. And I certainly don't need a trick to compare 6/29 and 35/18 because the latter fraction is greater than 1. Furthermore, Odani's result includes unreduced fractions such as 2/4, 3/6, and 50/100.

John D. Cook suggests that "it would be interesting to try to assess how often the rule of thumb presented here is correct in practice. You might try to come up with a model for the kinds of fractions people can’t compare instantly, such as proper fractions that have similar size."

That is a great idea. I don't need to use Odani's truism to compare 6/29 and 18/35, yet (6,29,18,35) is one of the 4-tuples for which Odani's truism is true. Similarly, it is true for (6,29,35,18), (1,6,5,8) and (1,10,9,10) and many other 4-tuples that represent fractions that can be easily compared without doing any explicit calculations. If you omit the "obvious" 4-tuples and unreduced fractions, you will reduce the probability that Odani's truism is true on the 4-tuples that remain. Thus, intuitively, Odani's truism should have a lower probability of being true if you restrict it only to reduced fractions that are close to each other. But how low? Is the rule any better than tossing a coin?

In the next blog post, I propose a way to measure how often Odani's truism is useful in practice. I will estimate the probability that Odani's truism holds for reduced fractions in the interval [0,1] that are close to each other.

Summary

Odani's truism is an asymptotic result that states that you can compare the fractions a/b and c/d by looking at the sums (a+d) and (b+c). Specifically, the quantities (a*d - b*c) and ((a+d) - (b+c)) have the same sign with a probability that approaches 11/12 when the numerators and denominators are integers chosen uniformly at random from 1 to N for very large N.

In the next post, I will discuss whether Odani's truism is useful in practice or whether it is merely a mathematical curiosity.

The post Odani's truism: A probabilistic way to compare fractions appeared first on The DO Loop.

4月 072021
 

As mentioned in my article about Monte Carlo estimates of (one-dimensional) integrals, one of the advantages of Monte Carlo integration is that you can evaluate multivariate integrals on complicated regions. This article demonstrates how to use SAS to obtain a Monte Carlo estimate of a double integral over rectangular and non-rectangular regions. Be aware that a Monte Carlo estimate is often less precise than a numerical integration of the iterated integral.

Monte Carlo estimate of multivariate integrals

Multivariate integrals are notoriously difficult to solve. But if you use Monte Carlo methods, higher-dimensional integrals are only marginally more difficult than one-dimensional integrals. For simplicity, suppose we are interested in the double integral \(\int_D f(x,y) \,dx\,dy\), where D is a region in the plane. The basic steps for estimating a multivariate integral are as follows:

  1. Generate a large random sample of points from the uniform distribution on the domain, D.
  2. Evaluate f at each point (x,y) in the random sample. Let W = f(X,Y) be the vector of values.
  3. Compute Area(D)*mean(W), where Area(D) is the area of D and mean(W) is the arithmetic average of W. If you know the exact area of D, use it. If D is a very complicated shape, you can estimate the area of D by using the Monte Carlo methods in this article.

The quantity Area(D)*mean(W) is a Monte Carlo estimate of \(\int_D f(x,y) \,dx\,dy\). The estimate depends on the size of the random sample (larger samples tend to give better estimates) and on the particular random variates in the sample.

Monte Carlo estimates of double integrals on rectangular regions

Let's start with the simplest case, which is when the domain of integration, D, is a rectangular region. This section estimates the double integral of f(x,y) = cos(x)*exp(y) over the region D = [0,π/2] x [0,1]. That is, we want to estimate the integral
\(\int\nolimits_0^{\pi/2}\int\nolimits_0^1 \cos(x)\exp(y)\,dy\,dx = e - 1 \approx 1.71828\)
where e is the base of the natural logarithm. For this problem, the area of D is (b-a)*(d-c) = π/2. The graph at the right shows a heat map of the function on the rectangular domain.

The following SAS/IML program defines the integrand and the domain of integration. The program generates 5E6 uniform random variates on the interval [a,b]=[0,π/2] and on the interval [c,d] = [0,1]. The (x,y) pairs are evaluated, and the vector W holds the result. The Monte Carlo estimate is the area of the rectangular domain times the mean of W:

/* Monte Carlo approximation of a double integral over a rectangular region */
proc iml;
/* define integrand: f(x,y) = cos(x)*exp(y) */
start func(x);
   return cos(x[,1]) # exp(x[,2]);
finish;
 
/* Domain of integration: D = [a,b] x [c,d] */
pi = constant('pi');
a = 0; b = pi/2;    /* 0 < x < pi/2 */
c = 0; d = 1;       /* 0 < y < 1    */
 
call randseed(1234);
N = 5E6;
 
X = j(N,2);                        /* X ~ U(D)   */
z = j(N,1);  
call randgen(z, "uniform", a, b);  /* z ~ U(a,b) */
X[,1] = z;
call randgen(z, "uniform", c, d);  /* z ~ U(c,d) */
X[,2] = z;
 
W = func(X);                       /* W = f(X1,X2) */
Area = (b-a)*(d-c);                /* area of rectangular region */
MCEst = Area * mean(W);            /* MC estimate of double integral */
 
/* the double integral is separable; solve exactly */
Exact = (sin(b)-sin(a))*(exp(d)-exp(c));  
Diff = Exact - MCEst;
print Exact MCEst Diff;

The output shows that the Monte Carlo estimate is a good approximation to the exact value of the double integral. For this random sample, the estimate is within 0.0002 units of the true value.

Monte Carlo estimates of double integrals on non-rectangular regions

It is only slightly more difficult to estimate a double integral on a non-rectangular domain, D. It is helpful to split the problem into two subproblems: (1) generate points uniformly in D, and (2) estimate the integral on D.

The next two sections show how to estimate the integral of f(x,y) = exp(-(x² + y²)) over the circle of unit radius centered at the origin: D = {(x,y) | x² + y² ≤ 1}. The graph at the right shows a heat map of the function on a square. The domain of integration is the unit disk.

By using polar coordinates, you can solve this integral exactly. The double integral has the value 2π*(e – 1)/(2 e) ≈ 1.9858653, where e is the base of the natural logarithm.
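In polar coordinates, the integrand depends only on the radius, so the double integral reduces to a one-dimensional integral:
\(\int_D e^{-(x^2+y^2)} \,dA = \int\nolimits_0^{2\pi} \int\nolimits_0^1 e^{-r^2} \, r \,dr \,d\theta = 2\pi \left[ -\frac{1}{2} e^{-r^2} \right]_0^1 = \pi (1 - e^{-1}) = \frac{2\pi(e-1)}{2e}\)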

Generate points uniformly in a non-rectangular planar region

For certain shapes (such as rectangles, triangles, and disks), you can directly generate a random sample of uniformly distributed points.

In general, for any planar region, you can use the acceptance-rejection technique. I previously showed an example of using Monte Carlo integration to estimate π by estimating the area of a circle.

Because the acceptance-rejection technique is the most general, let's use it to generate a random set of points in the unit disk, D. The steps are as follows:
  1. Construct a bounding rectangle for the region. In this example, you can use R = [-1,1] x [-1,1].
  2. Generate random points uniformly at random in the rectangle.
  3. Keep the points that are in the domain, D. Discard the others.

If you know the area of D, you can use a useful trick to choose the sample size. When you use an acceptance-rejection technique, you do not know in advance how many points will end up in D. However, you can estimate the number as N_D ≈ N_R*Area(D)/Area(R), where N_R is the sample size in R. If you know the area of D, you can invert this formula. If you want approximately N_D points to be in D, generate N_D*Area(R)/Area(D) points in R. For example, suppose you want N_D=5E6 points in the unit disk, D. We can choose N_R ≥ N_D*4/π, as in the following SAS/IML program, which generates random points in the unit disk:

proc iml;
/* (1) Generate approx N_D points in U(D), where D is 
   the unit disk D = { (x,y) | x**2 + y**2 <= 1 } */
N_D = 5E6;        /* we want this many points in D */
a = -1; b = 1;    /* Bounding rectangle, R:  */
c = -1; d = 1;    /* R = [a,b] x [c,d]       */
 
/* generate points inside R. Generate enough points (N_R)
   in R so that approximately N_D are actually in D */
pi = constant('pi');
area_Rect = (b-a)*(d-c); 
area_D    = pi;
N_R = ceil(N_D * area_Rect / area_D);    /* estimate how many points in R we'll need */
 
call randseed(1234);
X = j(N_R,2);
z = j(N_R,1);  
call randgen(z, "uniform", a, b);
X[,1] = z;
call randgen(z, "uniform", c, d);
X[,2] = z;
 
/* which points in the bounding rectangle are in D? */
b = (X[,##] <= 1);             /* x^2+y^2 <= 1 */
X = X[ loc(b),];               /* these points are in D */
print N_D[L="Target N_D" F=comma10.] 
      (nrow(X))[L="Actual N_D" F=comma10.];

The table shows the result. The program generated about 6.4 million points in the rectangle, R. Of these, 4.998 million were in the unit disk, D. This is very close to the desired number of points in D, which was 5 million.

Monte Carlo estimate of an integral over a planar region

Each row of the matrix, X, contains a point in the disk, D. The points are a random sample from the uniform distribution on D. Therefore, you can estimate the integral by calculating the mean value of the function on these points:

/* (2) Monte Carlo approximation of a double integral over a non-rectangular domain. 
   Estimate integral of f(x,y) = exp(-(x**2 + y**2))
   over the disk D = { (x,y) | x**2 + y**2 <= 1 }     */
/* integrand: f(x,y) = exp(-(x**2 + y**2)) */
start func(x);
   r2 = x[,1]##2 + x[,2]##2;
   return exp(-r2);
finish;
 
W = func(X);
MCEst = Area_D * mean(W);
 
/* compare the estimate to the exact value of the integral */
e = constant('e');
Exact = 2*pi*(e-1)/(2*e);
Diff = Exact - MCEst;
print Exact MCEst Diff;

The computation shows that a Monte Carlo estimate of the integral over the unit disk is very close to the exact value. For this random sample, the estimate is within 0.00006 units of the true value.

Summary

This article shows how to use SAS to compute Monte Carlo estimates of double integrals on a planar region. The first example shows a Monte Carlo integration over a rectangular domain. The second example shows a Monte Carlo integration over a non-rectangular domain. For a non-rectangular domain, the integration requires two steps: first, use an acceptance-rejection technique to generate points uniformly at random in the domain; then use Monte Carlo to estimate the integral. The technique in this article generalizes to integration on higher-dimensional domains.

The post Double integrals by using Monte Carlo methods appeared first on The DO Loop.

4月 052021
 

A previous article shows how to use Monte Carlo simulation to estimate a one-dimensional integral on a finite interval. A larger random sample will (on average) result in an estimate that is closer to the true value of the integral than a smaller sample. This article shows how you can determine a sample size so that the Monte Carlo estimate is within a specified distance from the true value, with high probability.

This article is inspired by and uses some of the notation from Neal, D. (1993), "Determining Sample Sizes for Monte Carlo Integration," The College Mathematics Journal.

Ensure the estimate is (probably) close to the true value

As shown in the previous article, a Monte Carlo estimate of an integral of g(x) on the interval (a,b) requires three steps:

  1. Generate a uniform random sample of size n on (a,b). Let X be the vector of n uniform random variates.
  2. Use the function g to evaluate each point in the sample. Let Y = g(X) be the vector of transformed variates.
  3. An estimate of the integral \(\int\nolimits_{a}^{b} g(x) \, dx\) is (b - a)*mean(Y), where mean(Y) is the arithmetic average of Y, as shown in the previous article.

The goal of this article is to choose n large enough so that, with probability at least β, the Monte Carlo estimate of the integral is within δ of the true value. The mathematical derivation is at the end of this article. The result (Neal, 1993, p. 257) is that you should choose
\(n > \left( \Phi^{-1}\left( \frac{\beta+1}{2} \right) \frac{(b-a)s_{Y}}{\delta} \right)^2\)
where \(\Phi^{-1}\) is the quantile function of the standard normal distribution, and \(s_{Y}\) is an estimate of the standard deviation of Y.

Applying the formula to a Monte Carlo estimate of an integral

Let's apply the formula to see how large a sample size we need to estimate the integral \(\int\nolimits_{a}^{b} g(x) \, dx\) to three decimal places (δ=5E-4) for the function \(g(x) = x^{\alpha - 1} \exp(-x)\), where α=4 and the interval of integration is (1, 3.5). The function is shown at the top of this article.

The formula requires an estimate of the standard deviation of Y=g(X), where X ~ U(a,b). You can obtain the estimate from a small "pilot study," as shown in the following SAS/IML program. You could also use the DATA step and PROC MEANS for this computation.

%let alpha = 4;    /* shape parameter */
%let a = 1;        /* lower limit of integration */
%let b = 3.5;      /* upper limit of integration */
 
proc iml;
/* define the integrand */
start Func(x) global(shape);
   return x##(shape - 1) # exp(-x);
finish;
 
call randseed(1234);
shape = &alpha; a = &a; b = &b;
 
/* Small "pilot study" to estimate s_Y = stddev(Y) */
N = 1e4;                          /* small sample */
X = j(N,1);
call randgen(x, "Uniform", a, b); /* X ~ U(a,b) */
Y = func(X);                      /* Y = f(X)   */
s = std(Y);                       /* estimate of std dev(Y) */

For this problem, the estimate of the standard deviation of Y is about 0.3. For β=0.95, the value of \(\Phi^{-1}((\beta+1)/2) \approx 1.96\), but the following program implements the general formula for any probability, β:

/* find n so that MC est is within delta of true value with 95% prob */
beta  = 0.95;
delta = 5e-4;      
k = quantile("Normal", (beta+1)/2) * (b-a) * s;
sqrtN = k / delta;
roundN = round(sqrtN**2, 1000);       /* round to nearest 1000 */
print beta delta roundN[F=comma10.];

With 95% probability, if you use a sample size of n=8,765,000, the Monte Carlo estimate will be within δ=0.0005 units of the true value of the integral. Wow! That's a larger value than I expected!

I used n=1E6 in the previous article and reported that the difference between the Monte Carlo approximation and the true value of the integral was less than 0.0002. So it is possible to be close to the true value by using a smaller value of n. In fact, the graph to the right (from the previous article) shows that for n=400,000 and n=750,000, the Monte Carlo estimate is very close for the specific random number seed that I used. But the formula in this section provides the sample size that you need to (probably!) be within ±0.0005 REGARDLESS of the random number seed.

The distribution of Monte Carlo estimates

I like to use SAS to check my math. The Monte Carlo estimate of the integral of g will vary according to the random sample. The math in the previous section states that if you generate k random samples of size n=8,765,000, then (on average) about 0.95*k of the resulting estimates will be within δ=0.0005 units of the true value of the integral, which is 2.666275. The following SAS/IML program generates k=200 random samples of size n and k estimates of the integral. We expect about 190 estimates to be within δ units of the true value and only about 10 to be farther away:

N = roundN;
X = j(N,1);
 
k = 200;
Est = j(k,1);
do i = 1 to k;
   call randgen(X, "Uniform", a, b);     /* x[i] ~ U(a,b) */
   Y = func(X);                          /* Y = f(X)   */
   f_avg = mean(Y);                      /* estimate E(Y) */
   Est[i] = (b-a)*f_avg;                 /* estimate integral */
end;
 
call quad(Exact, "Func", a||b);          /* find the "exact" value of the integral */
Diff = Est - Exact;
title "Difference Between Monte Carlo Estimate and True Value";
title2 "k=200 Estimates; n=8,765,000";
call Histogram(Diff) other="refline -0.0005 0.0005/axis=x;";

The histogram shows the difference between the exact integral and the estimates. The vertical reference lines are at ±0.0005. As predicted, most of the estimates are less than 0.0005 units from the exact value. What percentage? Let's find out:

/* how many estimates are within and NOT within delta? */
Close    = ncol(loc(abs(Diff)<=delta));    /* number that are closer than delta to true value */
NotClose = ncol(loc(abs(Diff)> delta));    /* number that are farther than delta */
PctClose = Close / k;                      /* percent close to true value */
PctNotClose = NotClose / k;                /* percent not close */
print k Close NotClose PctClose PctNotClose;

For this set of 200 random samples, exactly 95% of the estimates were accurate to within 0.0005. Usually, a simulation of 200 estimates would show that between 93% and 97% of the estimates are close. In this case, the answer was exactly 95%, but don't expect that to happen always!

Summary

This article shows how you can use elementary statistics to find a sample size that is large enough so that a Monte Carlo estimate is (probably!) within a certain distance of the exact value. The article was inspired by an article by Neal, D. (1993). The mathematical derivation of the result is provided in the next section.

Derivation of the sample size formula

This section derives the formula for choosing the sample size, n. Because the Monte Carlo estimate is a mean, you can use elementary probability theory to find n. Let \(x_1, x_2, \ldots, x_n\) be independent uniform random variates on (a, b). Let Y be the vector \(\{g(x_1), g(x_2), \ldots, g(x_n)\}\). Let \(m_{n} = \frac{1}{n} \sum\nolimits_{i = 1}^{n} g(x_{i})\) be the mean of Y and let \(\mu = \frac{1}{b-a} \int\nolimits_{a}^{b} g(x) \, dx\) be the average value of g on the interval (a,b). We know that \(m_n \to \mu\) as \(n \to \infty\). For any probability 0 < β < 1, we want to find n large enough so that
\(P\left( | m_n - \mu | < \frac{\delta}{(b-a)} \right) \geq \beta\)
From the central limit theorem, we can substitute the standard normal random variable, Z, inside the parentheses by dividing by the standard error of the mean, which is \(\sigma_{Y}/\sqrt{n}\):
\(P\left( | Z | < \frac{\delta \sqrt{n}}{(b-a)\sigma_{Y}} \right) \geq \beta\)
Equivalently,
\(2 P\left( Z < \frac{\delta \sqrt{n}}{(b-a)\sigma_{Y}} \right) - 1 \geq \beta\)

Solving the equation for sqrt(n) gives
\(\sqrt{n} > \Phi^{-1}\left( \frac{\beta+1}{2} \right) \frac{(b-a)\sigma_{Y}}{\delta}\)
Squaring both sides leads to the desired bound on n. Because we do not know the true standard deviation of Y, we substitute a sample statistic, sY.

The post Sample size for the Monte Carlo estimate of an integral appeared first on The DO Loop.

3月 312021
 

Numerical integration is important in many areas of applied mathematics and statistics. For one-dimensional integrals on the interval (a, b), SAS software provides built-in tools for numerical integration, such as the QUAD subroutine in SAS/IML software.

This article shows an alternative method to estimate an integral in SAS: Monte Carlo simulation.

How to use Monte Carlo simulation to estimate an integral

I previously showed an example of using Monte Carlo simulation to estimate the value of pi (π) by using the "average value method." This section presents the mathematics behind the Monte Carlo estimate of the integral.

The Monte Carlo technique takes advantage of a theorem in probability that is whimsically called the Law of the Unconscious Statistician. The theorem is basically the chain rule for integrals. If X is a continuous random variable and Y = g(X) is a random variable that is created by a continuous transformation (g) of X, then the expected value of Y is given by the following integral:
\(E[Y] = E[g(X)] = \int g(x) f_{X}(x) \,dx\)
where \(f_{X}\) is the probability density function for X.

To apply the theorem, choose X to be a uniform random variable on the interval (a,b). The density function of X is therefore \(f_{X}(x) = 1/(b-a)\) if x is in (a,b) and 0 otherwise. Rearranging the equation gives
\(\int_{a}^{b} g(x) \,dx = (b - a) \cdot E[g(X)]\)
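To see why, substitute the density of X into the expectation: \(E[g(X)] = \int\nolimits_{a}^{b} g(x) \frac{1}{b-a} \,dx = \frac{1}{b-a} \int\nolimits_{a}^{b} g(x) \,dx\). Multiplying both sides by (b - a) gives the previous formula.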

Consequently, to estimate the integral of a continuous function g on the interval (a,b), you need to estimate the expected value E[g(X)], where X ~ U(a,b). To do this, generate a uniform random sample in (a,b), evaluate g on each point in the sample, and take the arithmetic mean of those values. In symbols,
\(\int_{a}^{b} g(x) \,dx \approx (b - a) \cdot \frac{1}{n} \sum\nolimits_{i = 1}^{n} g(x_{i})\)
where the \(x_i\) are independent random uniform variates on (a,b).

Estimate an integral in SAS by using Monte Carlo simulation

Suppose you want to estimate the integral of \(g(x) = x^{\alpha - 1} \exp(-x)\) on the interval (a,b) = (1, 3.5). The figure at the top of this article shows the graph of g for α=4 and the area under the curve on the interval (1, 3.5).

As mentioned earlier, an accurate way to numerically integrate the function is to use the QUAD subroutine in SAS/IML, as follows:

%let alpha = 4;    /* shape parameter */
%let a = 1;        /* lower limit of integration */
%let b = 3.5;      /* upper limit of integration */
 
proc iml;
/* define the integrand */
start Func(x) global(shape);
   return x##(shape - 1) # exp(-x);
finish;
 
/* call the QUAD subroutine to perform numerical integration */
shape = &alpha; a= &a; b = &b;
call quad(AreaQuad, "Func", a||b);
print AreaQuad;

The QUAD subroutine approximates the area under the curve as 2.6663. This is a very accurate result. When you compute a Monte Carlo estimate, the estimate will depend on the size of the random sample that you use and the random number seed. The following SAS/IML program samples one million random variates from the uniform distribution on [1, 3.5]. The vector Y contains the transformed points (Y=g(X)). The call to the MEAN function estimates the expected value of g(x) when 1 < x < 3.5. If you multiply the mean by (3.5 - 1) = 2.5, you obtain an estimate of the integral, as follows:

/* Monte Carlo estimation of the integral */
call randseed(1234);
N = 1e6;                              /* sample size for MC estimate */
X = j(N,1);                           /* allocate vector of size N */
call randgen(X, "Uniform", a, b);     /* x[i] ~ U(a,b) */
Y = func(X);                          /* Y = f(X)   */
f_avg = mean(Y);                      /* estimate E(Y) */
AreaMC = (b-a)*f_avg;                 /* estimate integral on (a,b) */
Diff = abs(AreaMC - AreaQuad);        /* how close is the estimate to the true value? */
print AreaQuad AreaMC Diff;

Advantages and drawbacks of a Monte Carlo estimate

The main advantage of a Monte Carlo estimate is its simplicity: sample, evaluate, average. The same technique works for any function over any finite interval of integration. Although I do not demonstrate it here, a second advantage is that you can extend the idea to estimate higher-dimensional integrals.

The main disadvantage is that you do not obtain a definitive answer. You obtain a statistical estimate that is based on a random sample. I won't repeat the arguments from my article about Monte Carlo estimates, but be aware of the following facts about a Monte Carlo estimate of an integral:

  • Unless you use a huge sample size, the Monte Carlo estimate is usually less accurate than numerical integration.
  • A Monte Carlo estimate (like all statistics) has a distribution. If you change the random number seed or you change the algorithm that you use to generate random uniform variates, you will get a different estimate. Thus, you cannot talk about the Monte Carlo estimate, but only about a Monte Carlo estimate. The best you can claim is that the true value of the integral is within a certain distance from your estimate (with a specified probability).

How does the accuracy of the estimate depend on the sample size?

You can extend the previous program to demonstrate the accuracy of the Monte Carlo estimate as a function of the sample size. How would the estimate change if you use only 50,000 or 100,000 random variates? The following program creates a graph that shows the estimate of the integral when you use only the first k elements in the sequence of random variates:

/* how does the estimate depend on the sample size? */
size = do(1E4, N, 1E4);
MCest = j(1,ncol(size),.);
do i = 1 to ncol(size);
   Z = X[1:size[i]];                  /* sample size k=size[i] */
   Y = func(Z);                       /* Y = f(Z)       */
   MCEst[i] = (b-a)*mean(Y);          /* estimate integral */
end;
 
title "Monte Carlo Estimates as a Function of Sample Size";
stmt = "refline " + char(AreaQuad,6,4) + "/ axis=y label; format size comma10.;";
call scatter(size, MCEst) other=stmt label={"Sample Size" "Estimate of Area"};

The markers in the scatter plot show the estimates for the integral when only the first k random variates are used. The horizontal line shows the exact value of the integral. For this random-number seed, the estimate stays close to the exact value when the sample size is more than 400,000. If you use a different random number seed, you will get a different graph. However, this graph is fairly typical. The Monte Carlo estimates can either overestimate or underestimate the integral.

The takeaway lesson is that the estimate converges to the exact value, but not very quickly. From general theory, we know that the standard error of the mean of n numbers is SD(Y)/sqrt(n), where SD(Y) is the standard deviation of Y for Y=g(X). That means that a 95% confidence interval for the mean has a radius 1.96*SD(Y)/sqrt(n). You can use this fact to choose a sample size so that the Monte Carlo estimate is (probably) close to the exact value of the integral. I will discuss this fact in a second article.
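As a rough sketch of that idea (assuming the vector X, the Func module, and the scalars a, b, and N from the earlier program are still defined in the PROC IML session), you can compute the half-width of an approximate 95% confidence interval for the Monte Carlo estimate:

/* Sketch: half-width of an approximate 95% CI for the Monte Carlo estimate.
   Assumes X, a, b, N, and the Func module are defined as in the earlier program. */
Y = func(X);                   /* Y = g(X) for the full sample */
SE = std(Y) / sqrt(N);         /* standard error of mean(Y) */
radius = (b-a) * 1.96 * SE;    /* radius of the approx 95% CI for the integral */
print radius;

Because SD(Y) is about 0.3 for this integrand, the radius is roughly 2.5*1.96*0.3/1000 ≈ 0.0015, which is consistent with the observed error of less than 0.0002 for this particular random number seed.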

Summary

This article shows how to use Monte Carlo simulation to estimate a one-dimensional integral. Although Monte Carlo simulation is less accurate than other numerical integration methods, it is simple to implement and it readily generalizes to higher-dimensional integrals.

The post Estimate an integral by using Monte Carlo simulation appeared first on The DO Loop.