simulation

4月 172023
 

The "Teacher’s Corner" of The American Statistician enables statisticians to discuss topics that are relevant to teaching and learning statistics. Sometimes, the articles have practical relevance, too. Andersson (2023) "The Wald Confidence Interval for a Binomial p as an Illuminating 'Bad' Example," is intended for professors and masters-level students in mathematical statistics. However, the topic has practical importance. The introduction to the article states, "the standard two-sided Wald interval for the binomial [proportion]is ... known to be of low quality with respect to actual coverage, and it ought to be stricken from all courses" in favor of alternative intervals such as the Wilson (score).

If Andersson wants the method "stricken from all courses," then presumably he also advocates that the Wald CI should not be used in practice. That is a strong statement with practical importance. The Wald confidence interval (CI) is the default CI for PROC FREQ in SAS, although the documentation states "the Wilson interval has ... better performance than the Wald interval."

Let's investigate this issue. This article presents a simulation study that compares the coverage probability of the two-sided CIs for the (uncorrected) Wald method and for the Wilson (score) method. The simulation demonstrates that Andersson makes an excellent point: In practice, the Wald method is an inferior choice for the CI. You can use the Wilson CI to obtain a better confidence interval. In PROC FREQ, you can use the BINOMIAL(CL=WILSON) option on the TABLES statement to obtain the Wilson CI.

The Wald confidence interval for a binomial proportion

As Andersson (2023) points out, "generations" of students have memorized the formula for the Wald CI, "and probably some of them still remember it." If a sample has n observations and n1 are an event of interest, then a point-estimate for the proportion of events is \(\hat{p} = n_1 / n\). This is called "the binomial proportion" because the distribution of n1 is Binomial(p, n). The Wald confidence interval for the proportion is given by the formula
\( \hat{p} \pm z \sqrt { \hat{p} (1 - \hat{p} ) / n } \)
where z is the (1-α/2)th quantile of the standard normal distribution. The formula for the Wilson interval is given in the documentation of PROC FREQ.

I won't repeat Andersson's mathematical arguments about why the Wald interval is deficient, but a few issues are:

  • The point estimator and its standard error (the quantity under the square root sign) are correlated.
  • The Wald CI approximates the discrete sampling distribution of the Wald statistic by using a continuous normal distribution. This leads to a mismatch between the quantiles of the actual sampling distribution and the approximate quantiles.
  • The Wald statistic is biased, skewed, and exhibits excess kurtosis. (See Table 1 in Andersson (2023).)
  • The Wald interval has lower-than-expected coverage. For example, a 95% confidence interval might contain the proportion parameter for only 93% of random samples.

Coverage probability for the Wald confidence interval

Andersson states that the Wald interval is "low quality with respect to actual coverage." The coverage probability of a confidence interval is the probability that the interval (which is a statistic) contains the value of the distribution parameter. For example, a 95% confidence interval should contain the distribution parameter in 95% of all random samples. For the binomial proportion, you can run a simulation to estimate the coverage probability of the Wald interval and Wilson intervals, as follows:

  • Choose a proportion parameter, p.
  • Simulate n independent Bernoulli trials with probability p. This results in n1 events. Equivalently, yhat you can simulate the n1 values directly from the Binomial(p, n) distribution.
  • Calculate the Wald and Wilson intervals. Assess whether the parameter is inside each interval.
  • Repeat the simulation B times, where B is a large number. Estimate the coverage probability as k/B, where k is the number of times that the parameter is in the confidence interval.

The following SAS program simulates B=10,000 samples of size n=50 from the Binomial(p, n) distribution for p=0.2. You can use PROC FREQ to estimate the proportion and confidence intervals from each sample. The OUTPUT statement writes the statistics to a SAS data set. The variables L_Bin and U_Bin are the lower and upper confidence limits, respectively, for the Wald interval. The variables L_W_Bin and U_W_Bin are the corresponding limits for the Wilson interval. A second DATA step creates binary variables that indicate whether each interval contains the parameter, p. A second call to PROC FREQ calculates the proportion of samples for which the interval contains the parameter, which estimates the coverage probability.

/* simulation of Wald interval vs Wilson interval for binomial proportion */
%let B = 10000;                    /* number of simulations in study */
%let N = 50;                       /* sample size */
data BinomSim;
call streaminit(12345);
N = &N;
do p = 0.2;
   do ID = 1 to &B;
      Event = 1; Count = rand("Binom", p, N); /* number of events in N trials */
      output;
      Event = 0; Count = N - Count;           /* number of non-events */
      output;
   end;
end;
run;
 
/* for each sample, estimate proportion and CIs (Wald and Wilson) */
proc freq data=BinomSim noprint;
   by p ID;
   weight Count / zeros;
   tables Event / nocum binomial(level='1' CL=Wald CL=Wilson);
   output out=Est binomial;
run;
 
/* create binary (0/1) variables that indicate whether p is in the CI */
data Coverage;
set Est;
label _BIN_="Estimate (pHat)" L_Bin="Lower Wald CL" U_Bin="Upper Wald CL"
                              L_W_Bin="Lower Wilson CL" U_W_Bin="Upper Wilson CL";
WaldInCL   = (L_Bin   <= p & p <= U_Bin);      /* is p in Wald CI?   */
WilsonInCL = (L_W_Bin <= p & p <= U_W_Bin);    /* is p in Wilson CI? */
keep p ID _Bin_ L_Bin U_Bin L_W_Bin U_W_Bin WaldInCL WilsonInCL;
run;
 
/* estimate the empirical coverage probability for the nominal 95% CIs */
proc freq data=Coverage;
   by p;
   tables WaldInCL WilsonInCl / nocum;
run;

The output for p=0.2 is shown. We wanted 95% confidence intervals, but the simulation p indicates that the empirical coverage probability for the Wald confidence interval is about 93.89%, which is substantially lower than 95%. In contrast, the empirical coverage probability for the Wald confidence interval is about 95.11% when p=0.2.

Repeat the simulation for a range of parameters

Figure 1 of Andersson (2023), which shows the oscillation phenomenon for the Wald interval when n=50.

Andersson's article taught me something new. Namely, that the coverage probability for an interval is not a smooth (or even continuous) function of the proportion parameter. I knew that interval estimates are better when p ≈ 1/2 and can be bad when p is near 0 or 1, but I did not know that the coverage probability can jump dramatically if you change p by a small amount. Andersson shows this in Figure 1, which is reproduced on the right. Andersson states that "the ragged performance" is "due to the lattice structure of the binomial distribution." This phenomenon is explained in Brown, Cai, and DasGupta (2001), which contains several similar plots. The phenomenon is not some artifact of the Wald interval. It is a real phenomenon that affects all confidence intervals, including the Wilson interval.

Let's extend the simulation so that it estimates the Wald and Wilson coverage probability for a range of proportion parameters. Most textbooks caution that you should not use the Wald interval unless np(1-p) > 5, which means that when n=50 we should consider only p > 0.11. You can convert the previous simulation to a series of simulations by changing one line in the first DATA step. You should replace the line that sets p=0.2 with a DO loop that iterates over a range of values.

/* simulation of Wald interval vs Wilson interval for binomial proportion */
%let B = 10000;                    /* number of simulations in study */
%let N = 50;                       /* sample size */
/* Simulate random samples of size N where Pr(Event=1)=p */
data BinomSim;
call streaminit(12345);
N = &N;
/* n*p*(1-p) > 5 ==> p > 0.1127 */
do p = 0.11 to 0.5 by 0.005;       /* <== iterate over a range of proportion parameters */
   do ID = 1 to &B;
      Event = 1; Count = rand("Binom", p, N); /* number of events in N trials */
      output;
      Event = 0; Count = N - Count;           /* number of non-events */
      output;
   end;
end;
run;
 
/* construct Wald and Wilson CIs for estimate of proportion, \hat{p} */
proc freq data=BinomSim noprint;
   by p ID;
   weight Count / zeros;
   tables Event / nocum binomial(level='1' CL=Wald CL=Wilson);
   output out=Est binomial;
run;
 
/* create binary variable to indicate whether interval contains p */
data Coverage;
set Est;
label _BIN_="Estimate (pHat)" L_Bin="Lower Wald CL" U_Bin="Upper Wald CL"
                              L_W_Bin="Lower Wilson CL" U_W_Bin="Upper Wilson CL";
WaldInCL   = (L_Bin   <= p & p <= U_Bin);      /* is p in Wald CI?   */
WilsonInCL = (L_W_Bin <= p & p <= U_W_Bin);    /* is p in Wilson CI? */
keep p ID _Bin_ L_Bin U_Bin L_W_Bin U_W_Bin WaldInCL WilsonInCL;
run;

Instead of looking at tables of the empirical coverage probabilities, the following program statements graph the estimated coverage versus the parameter for both the Wald and Wilson intervals:

/* plot empirical coverage probability of the Wald CI for range of p */
proc freq data=Coverage noprint;
   by p;
   tables WaldInCL / nocum binomial(level='1' CL=Wilson);
   output out=CoverageWard binomial;
run;
title "Coverage Probability of Ward CI";
proc sgplot data=CoverageWard;
   label _BIN_="Ward Coverage" p="Binomial Probability";
   scatter y=p x=_BIN_ / xerrorlower=L_W_Bin xerrorupper=U_W_Bin;
   refline 0.95 / axis=x noclip;
   xaxis min=0.88 max=0.98;
run;
 
/* plot empirical coverage probability of the Wilson CI for range of p */
proc freq data=Coverage noprint;
   by p;
   tables WilsonInCL / nocum binomial(level='1' CL=Wilson);
   output out=CoverageWilson binomial;
run;
title "Coverage Probability of Wilson CI";
proc sgplot data=CoverageWilson;
   label _BIN_="Wilson Coverage" p="Binomial Probability";
   scatter y=p x=_BIN_ / xerrorlower=L_W_Bin xerrorupper=U_W_Bin;
   refline 0.95 / axis=x noclip;
   xaxis min=0.88 max=0.98;
run;

I've aligned the graphs for comparison. From the graphs, you can notice the following:

  • The coverage probability "jumps around" in both graphs as the parameter, p, changes. This is a real phenomenon and is not due to Monte Carlo variation.
  • In general, the Wald confidence interval exhibits less than 95% coverage. For proportion parameters less than 0.2, the coverage is especially bad.
  • Sometimes the Wilson confidence interval is less than 95% coverage, but other times it is more. The Wilson intervals appear to be roughly symmetric around the 95% reference line. This is a good property to have, as opposed to the systematic downward bias of the Wald CI.

These graphs show why the Wilson interval is preferred over the Wald interval. The Wald interval has actual coverage that is less than its nominal coverage. If you do not know the proportion parameter (and you never do!) then the Wilson interval is a more reliable and dependable interval. If you pardon the pun, you can have more "confidence" in the Wilson interval.

Summary

Andersson (2023) states, "the standard two-sided Wald interval for the binomial [proportion]is ... known to be of low quality with respect to actual coverage." The documentation for PROC FREQ in SAS agrees by stating, "the Wilson interval has ... better performance than the Wald interval." This article runs a simulation study that provides evidence for those statements.

Andersson suggests that the Wald CI should not be used in practice. In PROC FREQ in SAS, you can use the BINOMIAL(CL=WILSON) option on the TABLES statement to obtain the Wilson CI. With this small change to your programs, you will obtain a more reliable interval estimate for a binomial proportion.

The post Should you use the Wald confidence interval for a binomial proportion? appeared first on The DO Loop.

3月 132023
 

A previous article describes the metalog distribution (Keelin, 2016). The metalog distribution is a flexible family of distributions that can model a wide range of shapes for data distributions. The metalog system can model bounded, semibounded, and unbounded continuous distributions. This article shows how to use the metalog distribution in SAS by downloading a library of SAS IML functions from a GitHub site.

The article contains the following sections:

  • An overview of the metalog library of SAS IML functions
  • How to download the functions from GitHub and include the functions in a program
  • How to use a metalog model to fit an unbounded model to data
  • A discussion of infeasible models

For a complete discussion of the metalog functions in SAS, see "A SAS IML Library for the Metalog Distribution" (Wicklin, 2023).

The metalog library of functions in SAS

The metalog library is a collection of SAS IML modules. The modules are object-oriented, which means that you must first construct a "metalog object," usually from raw data. The object contains information about the model. All subsequent operations are performed by passing the metalog object as an argument to functions. You can classify the functions into four categories:

  • Constructors create a metalog object from data. By default, the constructors fit a five-term unbounded metalog model.
  • Query functions enable you to query properties of the model.
  • Computational functions enable you to compute the PDF and quantiles of the distribution and to generate random variates.
  • Plotting routines create graphs of the CDF and PDF of the metalog distribution. The plotting routines are available when you use PROC IML.

Before you create a metalog object, you must choose how to model the data. There are four types of models: unbounded, semibounded with a lower bound, semibounded with an upper bound, or bounded. Then, you must decide the number of terms (k) to use in the metalog model. A small number of terms (3 ≤ k ≤ 6) results in a smooth model. A larger number of terms (k > 6) can fit multimodal data distributions, but be careful not to overfit the data by choosing k too large.

The most common way to create a metalog object is to fit a model to the empirical quantiles of the data, which estimates the k parameters in the model. For an unbounded metalog model, the quantile function is
\( Q(p) = a_1 + a_2 L(p) + a_3 c(p) L(p) + a_4 c(p) + a_5 c(p)^2 + a_6 c(p)^2 L(p) + a_7 c(p)^3 + \ldots \)
where L(p) = Logit(p) is the quantile function of the logistic distribution and c(p) = p – 1/2 is a linear function. You use ordinary least squares regression to fit the parameters (\(a_1, a_2, \ldots, a_k\)) in the model.

The semibounded and bounded versions are metalog models of log-transformed variables.

  • If X > L, fit a metalog model to Z = log(X – L)
  • If L < X < U, fit a metalog model to Z = log((X – L) / (U – X))

How to download the metalog functions from GitHub

The process of downloading SAS code from a GitHub site is explained in the article, "How to use Git to share SAS programs." The process is also explained in Appendix A in the metalog documentation. The following SAS program downloads the metalog functions to the WORK libref. You could use a permanent libref, if you prefer:

/* download metalog package from GitHub */
options dlcreatedir;
%let repoPath = %sysfunc(getoption(WORK))/sas-iml-packages;  /* clone repo to WORK, or use permanent libref */
 
/* clone repository; if repository exists, skip download */
data _null_;
if fileexist("&repoPath.") then 
   put 'Repository already exists; skipping the clone operation'; 
else do;
   put "Cloning repository 'sas-iml-packages'";
   rc = gitfn_clone("https://github.com/sassoftware/sas-iml-packages/", "&repoPath." ); 
end;
run;
 
/* Use %INCLUDE to read source code and STORE functions to current storage library */
proc iml;
%include "&repoPath./Metalog/ML_proc.sas";   /* each file ends with a STORE statement */
%include "&repoPath./Metalog/ML_define.sas";
quit;

Each file ends with a STORE statement, so running the above program results in storing the metalog functions in the current storage library. The first file (ML_PROC.sas) should be included only when you are running the program in PROC IML. It contains the plotting routines. The second file (ML_define.sas) should be included in both PROC IML and the iml action.

Fit an unbounded metalog model to data

The Sashelp.heart data set contains data on patients in a heart study. The data set contains a variable (Systolic) for the systolic blood pressure of 5,209 patients. The following call to PROC SGPLOT creates a histogram of the data and overlays a kernel density estimate.

title 'Systolic Blood Pressure';
proc sgplot data=sashelp.heart;
   histogram Systolic;
   density Systolic / type=kernel;
   keylegend /location=inside;
run;

The kernel density estimate (KDE) is a nonparametric model. The histogram and the KDE indicate that the data are not normally distributed, and it is not clear whether the data can be modeled by using one of the common "named" probability distributions such as gamma or lognormal. In situations such as this, the metalog distribution can be an effective tool for modeling and simulation. The following SAS IML statements load the metalog library of functions, then fit a five-term unbounded metalog distribution to the data. After you fit the model, it is a good idea to run the ML_Summary subroutine, which displays a summary of the model and displays the parameter estimates.

proc iml;
load module=_all_;     /* load the metalog library */
use sashelp.heart;
   read all var "Systolic";
close;
 
MLObj = ML_CreateFromData(Systolic, 5);  /* 5-term unbounded model */
run ML_Summary(MLObj);

The ML_Summary subroutine displays two tables. The first table tells you that the model is a 5-term unbounded model, and it is feasible. Not every model is feasible, which is why it is important to examine the summary of the model. (Feasibility is discussed in a subsequent section.) The second table provides the parameter estimates for the model.

The MLObj variable is an object that encapsulates relevant information about the metalog model. After you create a metalog object, you can pass it to other functions to query, compute, or plot the model. For example, in PROC IML, you can use the ML_PlotECDF subroutine to show the agreement between the cumulative distribution of the model and the empirical cumulative distribution of the data, as follows:

title '5-Term Metalog Model for Systolic Blood Pressure';
run ML_PlotECDF(MLObj);

The model is fit to 5,209 observations. For this blog post, I drew the CDF of the model in red so that it would stand out. The model seems to fit the data well. If you prefer to view the density function, you can run the statement
    run ML_PlotPDF(MLObj);
Again, note that the MLObj variable is passed as the first argument to the function.

One way to assess the goodness of fit is to compare the sample quantiles to the quantiles predicted by the model. You can use the ML_Quantile function to obtain the predicted quantiles, as follows:

/* estimates of quantiles */
Prob = {0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99};
call qntl(SampleEst, Systolic, Prob);   /* sample quantiles */
ModelEst = ML_Quantile(MLObj, Prob);    /* model quantiles */
print Prob SampleEst ModelEst;

The model quantiles agree well with the sample quantiles. Note that the median estimate from the model (the 0.5 quantile) is equal to the a1 parameter estimate. This is always the case.

An advantage of the metalog model is that it has an explicit quantile function. As a consequence, it is easy to simulate from the model by using the inverse CDF method. You can use the ML_Rand function to simulate from the model, as follows:

/* simulate 500 new blood pressure values from the model */
call randseed(1234);
title 'Simulated Systolic Values';
BP = ML_Rand(MLObj, 500);
call histogram(BP);

Notice that the shape of this histogram is similar to the histogram of the original data. The simulated data seem to have similar properties as the original data.

The metalog documentation provides the syntax and documentation for every public method in the library.

Infeasible metalog functions

As discussed in a previous article, not every metalog model is a valid distribution. An infeasible model is one for which the quantile function is not monotone increasing. This can happen if you have a small data set that has extreme outliers or gaps, and you attempt to fit a metalog model that has many parameters. The model will overfit the data, which results in an infeasible model.

I was unable to create an infeasible metalog model for the blood-pressure data, which has thousands of observations and no extreme observations. However, if I use only the first 20 observations, I can demonstrate an infeasible model by fitting a six-term metalog model. The following SAS IML statements create a new data vector that contains only 20 observations. A six-term metalog model is fit to those data.

/* What if we use a much smaller data set? Set N = 20. */
S = Systolic[1:20];               /* N = 20 */
MLObj = ML_CreateFromData(S, 6);  /* 6-term model */
run ML_Summary(MLObj);

As shown by the output of the ML_Summary call, the six-term metalog model is not feasible. The following graph shows the quantile function of the model overlaid on the empirical quantiles. Notice that the quantile function of the model is not monotone increasing, which explains why the model is infeasible. If you encounter this situation in practice, Keelin (2016) suggests reducing the number of terms in the model.

Summary

This article describes how to use the metalog distribution in SAS. The metalog distribution is a flexible family that can model a wide range of distributional shapes. You can download the Metalog package from the SAS Software GitHub site. The package contains documentation and a collection of SAS IML modules. You can call the modules in a SAS IML program to fit a metalog model. After fitting the model, you can compute properties of the model and simulate data from the model. More information and examples are available in the documentation of the Metalog package.

The post Use the metalog distribution in SAS appeared first on The DO Loop.

2月 202023
 

A SAS programmer asked for help to simulate data from a distribution that has certain properties. The distribution must be supported on the interval [a, b] and have a specified mean, μ, where a < μ < b. It turns out that there are infinitely many distributions that satisfy these conditions. This article describes the shapes for a family of beta distributions that solve this problem.

Common bounded distributions

There are three common distributions that are used to model data on a bounded interval:

  • The triangular distribution has a peak (mode) that is easy to specify. The PDF looks like a triangle, so this distribution might not be a good model for real data.
  • The PERT distribution also has a mode that is easy to specify. The PERT distribution is a particular example of a beta distribution that is used in decision analysis.
  • The two-parameter beta distribution is a flexible family that can model a wide range of distributional shapes.

An interesting fact about the two-parameter beta distribution is that you can model many different shapes. The parameters for the beta distribution enable you to model distributions for which the PDF is decreasing, increasing, U-shaped, and has either positive or negative skewness.

If Y is a beta-distributed random variable on [0,1] that has mean p, then X = (ba)Y + a is a random variable on [a, b] that has mean μ = (ba)p + a. Thus, we can simulate beta-distributed data, and then scale and translate the data to any other bounded interval.

Beta distributions that have a common mean

Let's examine the shapes of some beta distributions that all have the same mean, p, in [0,1]. The mean of the Beta(α, β) distribution is p = α/(α+β). Thus, for any specified mean, there is a one-parameter family of beta distributions, each with a different shape, that all have the same mean. For any value of the β parameter, choose α = p / (1 – p) β to ensure that the Beta(α, β) distribution has mean p.

Let's compute the PDF for a few members of the family to see what they look like. In the following program, I specify that I want a beta distribution that has mean value p = 2/3, which forces α = 2 β. I then plot the PDF for several values of β to visualize the different shapes:

/* show PDFs for a sample of (alpha, beta) values such that the
   Beta(alpha, beta) distribution has mean=2/3 */ 
data BetaPDF;
keep alpha beta y pdf;
p = 2/3;                     /* mean of Y ~ Beta(alpha, beta) distribution */
do beta = 0.2, 0.8, 2, 6;
   alpha = p/(1-p) * beta;   /* choose alpha so that distrib has mean p */
   do y = 0.01 to 0.99 by 0.01;
      PDF = pdf("beta", y, alpha, beta);
      output;
   end; 
end;
run;
 
title "A Family of Beta Distributions for Mean = 2/3";
proc sgplot data=BetaPDF;
   series x=y y=PDF / group=beta lineattrs=(thickness=2);
   yaxis min=0 max=4 label="Density";
run;

Notice the shapes of the resulting beta distributions:

  • The PDF for β=0.2 is U-shaped.
  • The PDF for β=0.8 is monotonic increasing.
  • The PDF for β=2 has a mode at 0.75.
  • The PDF for β=6 has a mode at 0.6875. It appears to be approximately bell-shaped.

All these distributions have the same mean, which is p = 2/3. As β increases, the distribution becomes nearly normal, and the mode approaches the mean.

Simulate data from a bounded distribution with a specified mean

The PDF of the distributions is easier to visualize than a random sample. But you can modify the program to generate random variates instead of a PDF. To obtain a random sample on [a, b] that has mean μ, you can transform the problem: use the beta distribution to simulate a sample on [0, 1], then transform the data into the interval [a, b].

For example, suppose you want a random sample from a distribution that has mean 20 and is bounded on the interval [10, 25]. Because 20 is two-thirds of the way between 10 and 25, you can simulate from a beta distribution on [0, 1] that has mean p = 2/3. If Y is a beta-distributed random variable on [0, 1], then X = (25-10)*Y + 10 is a random variable on [10, 25].

The following SAS DATA step demonstrates this technique. Because the problem does not have a unique solution, the program generates six random samples, each with N=200 observations. Each sample has a different shape, but they are all generated from a distribution whose mean is 20.

/* Define interval [a,b] and mean, mu */
%let a = 10;
%let b = 25;
%let mu = 20;                /* note that mu is 2/3 of the way from a to b */
/* if X is r.v. on [a,b] with mean mu, then 
   Y = (X-a)/(b-a) is r.v. on [0,1] with mean p=a + (b-a)*mu */
data BetaSim;
call streaminit(1234);
keep alpha beta x y;
a = &a; b = &b; mu = &mu;
p = (mu - a)/(b-a);          /* mean of Y ~ Beta in [0, 1] */
do beta = 0.2, 0.5, 0.8, 1, 2, 6;
   alpha = p/(1-p) * beta;   /* choose alpha so that distrib has mean p */
   do i = 1 to 200;                  /* N = 200 for this example */
      y = rand("beta", alpha, beta); /* Y ~ Beta(alpha, beta) on [0,1] */
      x = (b-a)*y + a;               /* transform values into [a,b] */
      output;
   end; 
end;
run;
 
proc sgpanel data=BetaSim;
   panelby alpha beta / columns=3;
   histogram x;
   colaxis grid;
run;

The panel shows six different samples. Each sample is drawn from a distribution that has mean 20. Four of the samples are generated from a (rescaled) distribution that was shown in the previous section. As you can see, the shape of the distributions vary. Some are U-shaped, some are nearly linear, and some are bell-shaped.

If you want a unique solution to this problem, you must add an additional constraint. A common choice is to match not just the mean of some sample data, but also the variance. These beta distributions all have different variances, so adding a constraint on the variance ensures a unique beta distribution.

Summary

This article shows how to simulate data from a distribution on the interval [a, b] that has a specified mean, μ. There are infinitely many distributions that satisfy these constraints. This article visualizes the shapes for a family of beta distributions that you can use to solve this problem. To get a unique solution, you can specify an additional requirement, such as a value for the variance.

The post Simulate from a bounded distribution that has a specified mean appeared first on The DO Loop.

2月 082023
 

SAS programmers love to make special graphs for Valentine's Day. In fact, there is a long history of heart-shaped graphs and love-inspired programs written in SAS! Last year, I added to the collection by showing how a ball bounces on a heart-shaped billiards table. This year, I create a similar image, but one that is stochastic instead of deterministic. This year's image is a random walk in a heart-shaped region.

In the figure to the right, the red lines trace the path of a random walk that starts in the middle of the heart. For each step, the new location is determined by a random bivariate normal vector. That is, from the position (x,y), the next location is (x+dx, y+dy) where (dx, dy) is a random bivariate normal vector. If the next location would be outside of the heart-shaped region, then shorten the step and land on the boundary. The figure shows the path of a constrained random walk for 2000 steps.

Perhaps this image is a visual metaphor for randomly exploring love? Some readers will be content to view this image and reflect on the love in their lives. Others will want to learn more about how the image is created in SAS. If you are in the second group, read on!

An unconstrained random walk

Before I show how to constrain the random walk to stay inside the heart, let's create and visualize an unconstrained random walk. There are many kinds of random walks, but this one takes a step according to a random bivariate normal vector. Starting from a point (x,y), you generate a vector (dx,dy) ~ MVN(0, I(2)) and move to the new point (x+dx, y+dy). You repeat this process many time, then draw the path that results. The following SAS DATA step generates the random walk by using 0.2 as the standard deviation of each variate:

/* random 2-D walk where each step is a vector v, which is bivariate normal
   v ~ MVN(0, I(2)) */
data RandomWalk;
N = 2000;
call streaminit(12345);
x = 0; y = 0;
do i = 1 to N;
   dx = rand("Normal", 0, 0.2);
   dy = rand("Normal", 0, 0.2);
   x = x + dx;
   y = y + dy;
   output;
end;
keep x y;
run;
ods graphics / reset width=480px height=480px;
title "Random Walk with Bivariate Normal Steps";
proc sgplot data=RandomWalk;
   series x=x y=y / lineattrs=(color=lightred);
   xaxis grid;
   yaxis grid;
run;

This graph displays a random walk with 2,000 steps, starting from the origin. Most of the time, the two consecutive points are close to each other. However, sometimes the next point is far away from the previous point.

A constrained random walk

Now imagine a variation on the previous random walk. The initial point is inside a compact closed region. Instead of always moving from an initial point (x,y) to a final point (x+dx, y+dy), you compute whether the final point is inside the region. If so, move to the point. If not, truncate the step so that you land on the boundary of the region. Repeat this many times to obtain a constrained random walk that is always inside the region.

For Valentine's Day, I will choose the region to be heart-shaped. It is easiest to perform these computations in the SAS IML language. The following program defines a few helper functions and generates 2,000 points for the constrained random walk. The points are written to a SAS data set and overlaid on a graph of the heart-shaped region:

/* Instead of an unconstrained walk, at each step check whether the 
   new location is outside of a region. If so, truncate the step to 
   land on the boundary of the region. */
proc iml;
/* define the heart-shaped region */
start H(xy);
  x = xy[,1]; y = xy[,2];
  return ( (x**2 + y**2 - 1)**3 - x**2 * y**3 );
finish;
 
/* return 1 if (x,y) is inside the region */
start InRegion(xy, tol=1e-14);
  return (H(xy) <= tol);
finish;
 
/* given a point, x, and a vector, v, this function returns the function
   f(t) = H(x + t*v), t in [0,1]. If f has a root, then the line segment from 
   x to x+v intersects the boundary of the reqion. 
   This function is used by FROOT to find points on the boundary of the region. */
start OnBoundary(t) global(G_x, G_v);
  return ( H( G_x + t*G_v ) );
finish;
 
/* Start at x. Try to step to x+v.
   If x+v is in the region, take the step.
   Otherwise, see if we can take part of the step to land on the boundary. */
start StepInRegion(x, v) global(G_x, G_v);
   if InRegion(x + v) then 
      return (x + v);
   if InRegion(x) then do;
      G_x = x; G_v = v;
      /* does the line from x to x+v cross the boundary? */
      t = froot("OnBoundary", {0 1});  /* find intersection of line and region */
      if t > 0 then 
         return (x + t*v);       /* step onto the boundary */
      else
         return (x);             /* cannot step in this direction */
   end;
   /* something is wrong: x is not in the region */
   return ( {. .} );
finish;
 
N = 2000;
call randseed(12345);
vel = randfun(N//2, "Normal", 0, 0.2);
 
x = j(1, 2, 0);
create walk from x [c={x y}];
do i = 1 to N;
   x = StepInRegion(x, vel[i,]);
   append from x;
end;
close;
QUIT;
 
title "Random Walk Inside a Heart-Shaped Region";
proc sgplot data=Walk;
   series x=x y=y / lineattrs=(color=lightred);
run;

The trajectory of the random walk is shown for 2,000 random steps inside a heart-shaped region. You can carry out this computation for any compact region, provided that you can represent the region as the level set of some continuous function H(x,y)=0 that divides the plane into an inside region (where H(x,y) < 0) and an outside region (where H(x,y) > 0).

To visualize the heart-shaped region, you can overlay the trajectory on the region, as done at the top of this article. The images below visualize how the random walk evolves and gradually fills the region. For details about how these images were created, see the SAS program that creates these images.

Summary

It is easy to simulate an unconstrained random walk. This article shows how to use SAS to generate a random walk in which the points are contained in a compact region. During the simulation, you can monitor whether the next step will occur outside the region. If so, you can decrease the step so that the point is on the boundary of the region. This is illustrated by using a heart-shaped region, but the same algorithm can constrain the random walk to any compact region of the plane.

The post A random walk inside a heart appeared first on The DO Loop.

1月 182023
 

A previous article shows that you can use the Intercept parameter to control the ratio of events to nonevents in a simulation of data from a logistic regression model. If you decrease the intercept parameter, the probability of the event decreases; if you increase the intercept parameter, the probability of the event increases.

The probability of Y=1 also depends on the slope parameters (coefficients of the regressor effects) and on the distribution of the explanatory variables. This article shows how to visualize the probability of an event as a function of the regression parameters. I show the visualization for two one-variable logistic models. For the first model, the distribution of the explanatory variable is standard normal. In the second model, the distribution is exponential.

Visualize the probability as the Intercept varies

In a previous article, I simulated data from a one-variable binary logistic model. I simulated the explanatory variable, X, from a standard normal distribution. The linear predictor is given by η = β0 + β1 X. The probability of the event Y=1 is given by μ = logistic(η) = 1 / (1 + exp(-η)).

In the previous article, I looked at various combinations of Intercept (β0) and Slope (β1) parameters. Now, let's look systematically at how the probability of Y=1 depends on the Intercept for a fixed value of Slope > 0. Most of the following program is repeated and explained in my previous post. For your convenience, it is repeated here. First, the program simulates data (N=1000) for a variable, X, from a standard normal distribution. Then it simulates a binary variable for a series of binary logistic models as the intercept parameter varies on the interval [-3, 3]. (To get better estimates of the probability of Y=1, the program simulates 100 data sets for each value of Intercept.) Lastly, the variable calls PROC FREQ to compute the proportion of simulated events (Y=1), and it plots the proportion versus the Intercept value.

/* simulate X ~ N(0,1) */
data Explanatory;
call streaminit(12345);
do i = 1 to 1000;
   x = rand("Normal",  0, 1);   /* ~ N(0,1) */
   output;
end;
drop i;
run;
 
/* as Intercept changes, how does P(Y=1) change when Slope=2? */
%let Slope = 2;
data SimLogistic;
call streaminit(54321);
set Explanatory;
do Intercept = -3 to 3 by 0.2;
   do nSim = 1 to 100;               /* Optional: param estimate is better for large samples */
      eta = Intercept + &Slope*x;    /* eta = linear predictor */
      mu = logistic(eta);            /* mu = Prob(Y=1) */
      Y = rand("Bernoulli", mu);     /* simulate binary response */
      output;
   end;
end;
run;
 
proc sort data=SimLogistic; by Intercept; run;
ods select none;
proc freq data=SimLogistic;
   by Intercept;
   tables Y / nocum;
   ods output OneWayFreqs=FreqOut;
run;
ods select all;
 
title "Percent of Y=1 in Logistic Model";
title2 "Slope=&Slope";
footnote J=L "X ~ N(0,1)";
proc sgplot data=FreqOut(where=(Y=1));
   series x=Intercept y=percent;
   xaxis grid;
   yaxis grid label="Percent of Y=1";
run;

In the simulation, the slope parameter is set to Slope=2, and the Intercept parameter varies systematically between -3 and 3. The graph visualizes the probability (as a percentage) that Y=1, given various values for the Intercept parameter. This graph depends on the value of the slope parameter and on the distribution of the data. It also has random variation, due to the call to generate Y as a random Bernoulli variate. For these data, the graph shows the estimated probability of the event as a function of the Intercept parameter. When Intercept = –2, Pr(Y=1) = 22%; when Intercept = 0, Pr(Y=1) = 50%; and when Intercept = 2, Pr(Y=1) = 77%. The statement that Pr(Y=1) = 50% when Intercept=0 should be approximately true when the data is from a symmetric distribution with zero mean.

Vary the slope and intercept together

In the previous section, the slope parameter is fixed at Slope=2. What if we allow the slope to vary over a range of values? We can restrict our attention to positive slopes because logistic(η) = 1 – logistic(-η). The following program varies the Intercept parameter in the range [-3,3] and the Slope parameter in the range [0, 4].

/* now do two-parameter simulation study where slope and intercept are varied */
data SimLogistic2;
call streaminit(54321);
set Explanatory;
do Slope = 0 to 4 by 0.2;
do Intercept = -3 to 3 by 0.2;
   do nSim = 1 to 50;                  /* optional: the parameter estimate are better for larger samples */
   eta = Intercept + Slope*x;          /* eta = linear predictor */
   mu = logistic(eta);                 /* transform by inverse logit */
   Y = rand("Bernoulli", mu);          /* simulate binary response */
   output;
   end;
end;
end;
run;
 
/* Monte Carlo estimate of Pr(Y=1) for each (Int,Slope) pair */
proc sort data=SimLogistic2; by Intercept Slope; run;
ods select none;
proc freq data=SimLogistic2;
   by Intercept Slope;
   tables Y / nocum;
   ods output OneWayFreqs=FreqOut2;
run;
ods select all;

The output data set (FreqOut2) contains Monte Carlo estimates of the probability that Y=1 for each pair of (Intercept, Slope) parameters, given the distribution of the explanatory variable. You can use a contour plot to visualize the probability of the event for each combination of slope and intercept:

/* Create template for a contour plot
   https://blogs.sas.com/content/iml/2012/07/02/create-a-contour-plot-in-sas.html
*/
proc template;
define statgraph ContourPlotParm;
dynamic _X _Y _Z _TITLE _FOOTNOTE;
begingraph;
   entrytitle _TITLE;
   entryfootnote halign=left _FOOTNOTE;
   layout overlay;
      contourplotparm x=_X y=_Y z=_Z /
        contourtype=fill nhint=12  colormodel=twocolorramp name="Contour";
      continuouslegend "Contour" / title=_Z;
   endlayout;
endgraph;
end;
run;
/* render the Monte Carlo estimates as a contour plot */
proc sgrender data=Freqout2 template=ContourPlotParm;
where Y=1;
dynamic _TITLE="Percent of Y=1 in a One-Variable Logistic Model"
        _FOOTNOTE="X ~ N(0, 1)"
        _X="Intercept" _Y="Slope" _Z="Percent";
run;

As mentioned earlier, because X is approximately symmetric, the contours of the graph have reflective symmetry. Notice that the probability that Y=1 is 50% whenever Intercept=0 for these data. Furthermore, if the points (β0, β1) are on the contour for Pr(Y=1)=α, then the contour for Pr(Y=1)=1-α contains points close to (-β0, β1). The previous line plot is equivalent to slicing the contour plot along the horizontal line Slope=2.

You can use a graph like this to simulate data that have a specified probability of Y=1. For example, if you want approximately 70% of the cases to be Y=1, you can choose any pair of (Intercept, Slope) values along that contour, such as (0.8, 0), (1, 1), (2, 3.4), and (2.4, 4). If you want to see all parameter values for which Pr(Y=1) is close to a desired value, you can use the WHERE statement in PROC PRINT. For example, the following call to PROC PRINT displays all parameter values for which the Pr(Y=1) is approximately 0.7 or 70%:

%let Target = 70;
proc print data=FreqOut2;
   where Y=1 and %sysevalf(&Target-1) <= Percent <= %sysevalf(&Target+1);
   var Intercept Slope Y Percent;
run;

Probabilities for nonsymmetric data

The symmetries in the previous graphs are a consequence of the symmetry in the data for the explanatory variable. To demonstrate how the graphs change for a nonsymmetric data distribution, you can run the same simulation study, but use data that are exponentially distributed. To eliminate possible effects due to a different mean and variance in the data, the following program standardizes the explanatory variable so that it has zero mean and unit variance.

data Expo;
call streaminit(12345);
do i = 1 to 1000;
   x = rand("Expon", 1.5);   /* ~ Exp(1.5) */
   output;
end;
drop i;
run;
 
proc stdize data=Expo method=Std out=Explanatory;
   var x;
run;

If you rerun the simulation study by using the new distribution of the explanatory variable, you obtain the following contour plot of the probability as a function of the Intercept and Slope parameters:

If you compare this new contour plot to the previous one, you will see that they are very similar for small values of the slope parameter. However, they are different for larger values such as Slope=2. The contours in the new plot do not show reflective symmetry about the vertical line Intercept=0. Pairs of parameters for which Pr(Y=1) is approximately 0.7 include (0.8, 0) and (1, 1), which are the same as for the previous plot, and (2.2, 3.4), and (2.6, 4), which are different from the previous example.

Summary

This article presents a Monte Carlo simulation study in SAS to compute and visualize how the Intercept and Slope parameters of a one-variable logistic model affect the probability of the event. A previous article notes that decreasing the Intercept decreases the probability. This article shows that the probability depends on both the Intercept and Slope parameters. Furthermore, the probability depends on the distribution of the explanatory variables. You can use the results of the simulation to control the proportion of events to nonevents in the simulated data.

You can download the complete SAS program that generates the results in this article.

The ideas in this article generalize to logistic regression models that contain multiple explanatory variables. For multivariate models, the effect of the Intercept parameter is similar. However, the effect of the slope parameters is more complicated, especially when the variables are correlated with each other.

The post Visualize how parameters in a binary logistic regression model affect the probability of the event appeared first on The DO Loop.

1月 162023
 

This article shows that you can use the intercept parameter to control the probability of the event in a simulation study that involves a binary logistic regression model. For simplicity, I will simulate data from a logistic regression model that involves only one explanatory variable, but the main idea applies to multivariate regression models. I also show mathematically why the intercept parameter controls the probability of the event.

The problem: Unbalance events when simulating regression data

Usually, simulation studies are based on real data. From a set of data, you run a statistical model that estimates parameters. Assuming the model fits the data, you can simulate new data by using the estimates as if they were the true parameters for the model. When you start with real data, you obtain simulated data that often have similar statistical properties.

However, sometimes you might want to simulate data for an example, and you don't have any real data to model. This is often the case on online discussion forums such as the SAS Support Community. Often, a user posts a question but cannot post the data because it is proprietary. In that case, I often simulate data to demonstrate how to use SAS to answer the question. The simulation is usually straightforward for univariate data or for data from an ordinary least squares regression model. However, the simulation can become complicated for generalized linear regression models and nonlinear models. Why? Because not every choice of parameters leads to a reasonable data set. For example, if you choose parameters "out of thin air" in an arbitrary manner, you might end up with a model for which the probability of the event (Pr(Y=1)) is close to 1 for all values of the explanatory variables. This model is not very useful.

It turns out that you can sometimes "fix" your model by decreasing the value of the intercept parameter in the model. This reduces the value of Pr(Y=1), which can lead to a more equitable and balanced distribution of the response values.

Simulate and visualize a one-variable binary logistic regression model

Let's construct an example. For a general discussion and a multivariate example, see the article, "Simulating data for a logistic regression model." For this article, let's keep it simple and use a one-variable model. Let X be an independent variable that is distributed as a standardized normal variate: X ~ N(0,1). Then choose an intercept parameter, β0, and a slope parameter, β1, and form the linear predictor η = β0 + β1 X. A binary logistic model uses a logistic transformation to transform the linear predictor to a probability: μ = logistic(η), where logistic(η) = 1 / (1 + exp(-η)). The graph of the S-shaped logistic function is shown to the right. You can then simulate a response as a Bernoulli random variable Y ~ Bern(μ).

In SAS, the following DATA step simulates data from a standard normal variable:

data Explanatory;
call streaminit(12345);
do i = 1 to 1000;
   x = rand("Normal",  0, 1);   /* ~ N(0,1) */
   output;
end;
drop i;
run;

Regardless of the distribution of X, the following DATA step simulates data from a logistic model that has a specified set of values for the regression parameters. I embed the program in a SAS macro so that I can run several models with different values of the Intercept and Slope parameters. The macro also runs PROC FREQ to determine the relative proportions of the binary response Y=0 and Y=1. Lastly, the macro calls PROC SGPLOT to display a graph that shows the logistic regression model and the "observed" responses as a function of the linear predictor values,

/* Simulate data from a one-variable logistic model.
   X ~ N(0,1) and Y ~ Bern(eta)
   where eta = logistic(Intercept + Slope*X).
   Display frequency table for Y and a graph of Y and Prob(Y=1) vs eta */
%macro SimLogistic(Intercept, Slope);
   title "Logistic Model for Simulated Data";
   title2 "Intercept=&Intercept; Slope=&Slope";
   data Sim1;
      call streaminit(54321);
      set Explanatory;
      eta = &Intercept + &Slope*x;   /* eta = linear predictor */
      mu = logistic(eta);            /* mu = Prob(Y=1) */
      Y = rand("Bernoulli", mu);     /* simulate binary response */
   run;
 
   proc freq data=Sim1;
      tables Y / nocum;
   run;
   proc sort data=Sim1; by eta; run;
 
   proc sgplot data=Sim1;
      label Y="Observed" mu="P(Y=1)" eta="Linear Predictor";
      scatter x=eta y=Y / transparency=0.8 markerattrs=(symbol=CircleFilled);
      series x=eta y=mu;
      xaxis grid values = (-16 to 16) valueshint;
      yaxis grid label="Probability";
      keylegend / location=inside position=E opaque;
   run;
%mend;

How does the probability of the event depend on the intercept?

Let's run the program. The following call uses an arbitrary choice of values for the regression parameters: Intercept=5 and Slope=2:

/* call the macro with Intercept=5 and Slope=2 */
%SimLogistic(5, 2);

The output shows what can happen if you aren't careful when choosing the regression parameters. For this simulation, 96.5% of the response values are Y=1 and only a few are Y=0. This could be a satisfactory model for a rare event, but not for situations in which the ratio of events to nonevents is more balanced, such as 75-25, 60-40, or 50-50 ratios.

The graph shows the problem: the linear predictor (in matrix form, η = X*β) is in the range [-2, 12], and about 95% of the linear prediction are greater than 1.5. Because logistic(1.5) > 0.8, the probability of Y=1 is very high for most observations.

Since you have complete control of the parameters in the simulation, you can adjust the parameters. If you shift η to the left, the corresponding probabilities will decrease. For example, if you subtract 4 from η, then the range of η becomes [-6, 8], which should provide a more balanced distribution of the response category. The following statement implements this choice:

/* call the macro with Intercept=1 and Slope=2 */
%SimLogistic(1, 2);

The output shows the results of the simulation after decreasing the intercept parameter. When the Intercept=1 (instead of 5), the proportion of simulated responses is more balanced. In this case, 63% of the responses are Y=1 and 37% are Y=0. If you want to further decrease the proportion of Y=1, you can decrease the Intercept parameter even more. For example, if you run %SimLogistic(-1, 2), then only 35.9% of the simulated responses are Y=1.

The mathematics

I have demonstrated that decreasing η (the linear predictor) also decreases μ (the probability of Y=1). This is really a simple mathematical consequence of the formula for the logistic function. Suppose that η1 < η2. Then -η1 > -η2
⇒    1 + exp(-η1) > 1 + exp(-η2) because EXP is a monotonic increasing function
⇒    1 / (1 + exp(-η1)) < 1 / (1 + exp(-η2)) because the reciprocal function is monotonic decreasing
⇒    μ1 < μ2 by definition.
Therefore, decreasing η results in decreasing μ For some reason, visualizing the simulation makes more of an impact than the math.

Summary

This article shows that by adjusting the value of the Intercept parameter in a simulation study, you can adjust the ratio of events to nonevents. Decrease the intercept parameter to decrease the number of events; increase the intercept parameter to increase the number of events. I demonstrated this fact for a one-variable model, but the main ideas apply to multivariate models for which the linear predictor (η) is a linear combination of multiple variables.

The probability of Y=1 depends continuously on the regression parameters because the linear predictor and the logistic function are both continuous. In my next article, I show how to visualize the probability of an event as a function of the regression parameters.

The post Simulate data from a logistic regression model: How the intercept parameter affects the probability of the event appeared first on The DO Loop.

11月 302022
 

A probabilistic card trick is a trick that succeeds with high probability and does not require any skill from the person performing the trick. I have seen a certain trick mentioned several times on social media. I call it "ladders" or the "ladders game" because it reminds me of the childhood board game Chutes and Ladders (previously known as Snakes and Ladders in India and the UK). In Chutes and Ladders, some squares on the board tell the player to move forward (go up a ladder) or to move backward (go down a chute). In the "ladders" game, you only move forward.

This article uses Monte Carlo simulation in SAS to show that the "ladders" card trick "works" about 98.2% of the time.

I have been told that Martin Gardner wrote about this game. He referred to mathematical card tricks as those that "do not require prestidigitation," or sleight of hand.

The ladders card trick

The rules of the game and a Python simulation are presented in a blog post by Bob Carpenter, who saw the trick performed at a children's museum. The performer shuffles an ordinary deck of 52 playing cards and places them face up in a row on a table. The performer then describes the game of ladders:

  • A player starts by pointing to any card in the first five positions.
  • The value of the card indicates the number of cards to move forward. If the card is 2, 3, ..., 9, you move forward that many cards. If the card is a 10, jack, queen, king, or ace, you move forward one space.
  • From the new position, apply the rule again. Repeat this process until you cannot advance any more.

The performer selects two volunteers from the audience and tells them each to point randomly to any card from among the first five cards. Each player does so. The performer then announces that both players will end up on a certain "magical" card and he tells everyone the name of the magical card. The players then follow the game rules. Amazingly, each person's path terminates on the magical card that was previously announced. The trick works about 98.2% of the time.

To illustrate the ladders game, first consider a simpler situation of a deck that contains only 13 cards, one for each card value (for example, use only spades). One random shuffle of the 13 cards is shown to the right. For this arrangement of cards, the performer announces that both volunteers will end up on the 7 card (the magical card for this shuffle), which is the 11th card in the row of cards. Suppose the first volunteer points to the ace in the first position as her starting card. The rules of the game dictate the following sequence of moves:

  • From the A, the player moves forward one card and lands on the J.
  • From the J, the player moves forward one card and lands on the 2.
  • From the 2, the player moves forward two cards and lands on the 6.
  • From the 6, the player moves forward six cards and lands on the 7. The player cannot move forward seven spaces, so the 7 is her final card, as predicted in advance.

Thus, the path for the first player is the sequence of positions (1, 2, 3, 5, 11).

What happens for the other player? Well, if the second volunteer chooses his initial card at positions 2, 3, or 5, then his path will instantly coincide with the path of the first player, so his path, too, terminates at the 7 card in the 11th position. If the second volunteer chooses his initial card at position 4, then his path hits positions 4, 8, and 11. This path also terminates at the 7 card in the 11th position. Thus, for this set of cards and this shuffle, all initial positions follow a path that eventually lands on the 7.

How does the performer know the name of the final magical card? As he deals, he secretly counts the path determined by the first card in the deck.

Simulate the ladders game in SAS

You can use Monte Carlo simulation to estimate the probability that the ladders card trick "works." To keep it simple, let's say the trick "works" when the paths of the two players converge. That is, the players initially start on two random positions (1-5), but they end up on the same final card after they each follow the path dictated by their initial position. The simulation for the ladders card trick has the following major steps:

  1. Set up the card deck. For each card, assign the number of steps to move forward.
  2. Create a function that computes the final position, given the initial position and the sequence of moves determined by the shuffled deck.
  3. Run the Monte Carlo simulation. This consists of several sub-steps:
    • Shuffle the deck (that is, shuffle the array of moves).
    • Use sampling without replacement to choose two initial positions.
    • Compute the final card for each initial position.
    • Record whether the two paths converged.
  4. Estimate the probability of the trick working as the proportion of times that the paths converged in the simulation.

As is often the case, this simulation is easiest to perform in SAS by using the SAS/IML language. You can use the RANPERM function to shuffle the deck. You can use the SAMPLE function to sample without replacement from the values 1-5. You can write a function that computes the final position based on the initial position. The following program is one way to implement a Monte Carlo simulation for the ladders card trick:

/* The Ladders card trick:
   https://statmodeling.stat.columbia.edu/2022/10/07/mira-magic-a-probabilistic-card-trick/
   Lay out a shuffled deck of cards.
   Choose one of the first five cards as a starting point.
   Move down the deck as follows:
   If the current card is 2-9, move that many cards forward.
   For other cards, move one space forward.
*/
proc iml;
/* 1. set up the card deck and assign values. 
   Note: We don't actually need the deck, only the instructions
   about how to move. */
values = {A,'2','3','4','5','6','7','8','9','10', J, Q, K};
advance= {1,  2,  3,  4,  5,  6,  7,  8,  9,  1,  1, 1, 1};
deck  = repeat(values, 4);   /* full deck of 52 cards */
moves = repeat(advance, 4);  /* corresponding moves */
nCards = nrow(moves);
 
/* 2. Find the final position, given the first position and 
   the vector of moves for the current shuffled deck.
   startPos : a scalar value 1-5
   Move : a vector of values. If your current card is in 
          position i, the next card is position (i + Move[i])
*/
start EndPos(startPos, Move);
   Pos = startPos;
   do while(Pos + Move[Pos] < nrow(Move));
      Pos = Pos + Move[Pos];
   end;
   return( Pos );
finish;
 
/* 3. Run the Monte Carlo simulation:
      A. Shuffle the deck (vector of moves)
      B. Use sampling w/o replacement to choose two initial positions
      C. Compute the final card for each initial position
      D. Record whether the two paths converged
*/
call randseed(1234);
NSim = 1E5;
match = j(NSim,1);
do j = 1 to nSim;
   k = ranperm(nCards);          /* shuffle the deck */
   Move = moves[k];
   /* choose two different starting positions in 1-5 */
   startPos = sample(1:5, 2, "NoReplace");
   EndPos1 = EndPos(startPos[1], Move);
   EndPos2 = EndPos(startPos[2], Move);
   match[j] = (EndPos1=EndPos2);
end;
 
/* 4. Estimate probability as the proportion of matches (1s) in a binary vector.
      Optional: Compute a Wald CL for the proportion */
ProbMatch = mean(match);
StdErr = sqrt( ProbMatch*(1-ProbMatch)/ NSim );
CL = ProbMatch + {-1 1}*1.96*StdErr;
print NSim[F=comma9.] ProbMatch StdErr CL[c={'LCL' 'UCL'}];

The simulation shows that the two players end up on the same card more than 98% of the time.

Summary

This article describes a probabilistic card trick, which I call the game of ladders. Two players choose random cards as an initial condition, then follow the rules of the game. For about 98.2% of the deals and initial positions, the paths of the two players will converge to the same card. As the performer is laying out the card, he can predict in advance that the paths will converge and, with high probability, announce the name of the final card.

Bob Carpenter's blog post mentions the connection between this card trick and Markov chains. He also presents several variations, such as using more cards in the deck.

The post Ladders: A probabilistic card trick appeared first on The DO Loop.

11月 282022
 

A SAS programmer was trying to simulate poker hands. He was having difficulty because the sampling scheme for simulating card games requires that you sample without replacement for each hand. In statistics, this is called "simple random sampling."

If done properly, it is straightforward to simulate poker hands in SAS. This article shows how to generate a standard deck of 52 cards. It then shows how to generate many card hands for a game that involves several players. This can be done in two ways: The SAMPLE function in the SAS/IML language and PROC SURVEYSELECT in SAS/STAT software.

Encode a deck of cards

The first step is to generate a standard deck of 52 playing cards. There are many ways to do this, but the following DATA step iterates over 13 values (A, 2, 3, ..., Q, K) and over four suits (clubs, diamonds, hearts, and spades). The step creates a data set that has 52 cards. To make the output more compact, I concatenate the value of the card with the first letter of the suit to obtain a two-character string for each card. For example, "5C" is the "five of clubs" and "AS" is the "ace of spades." So that each card can be represented by using two characters, I use "T" instead of "10." For example, "TH" is the "ten of hearts." The

data Cards;
length Value $2 Suit $8 Card $3;
array Values[13] $ _temporary_ ('A','2','3','4','5','6','7','8','9','T','J','Q','K');
array Suits[4] $ _temporary_ ('Clubs', 'Diamonds', 'Hearts', 'Spades');
do v = 1 to 13;
   do s = 1 to 4;
      Value = Values[v]; Suit = Suits[s]; Card = cats(Value,substr(Suit,1,1));
      output;
   end;
end;
run;
 
proc print data=Cards(obs=8); 
   var v s Value Suit Card;
run;
TABLE

The cards are generated in order. The output shows the first eight cards in the deck, which are the aces and twos. For some types of poker hands (such as determining pairs and straights), the V variable can be useful, and for other analyses (such as determining flushes), the S variable can be useful.

Deal cards by using SAS/IML

When humans deal cards, we use a two-step method: shuffle the deck and then deal the top cards to players in a round-robin fashion. A computer simulation can simluate dealing cards by using a one-step method: deal random cards (without replacement) to every player. SAS provides built-in sampling methods to make this process easy to program.

In SAS/IML, the default sampling method for the SAMPLE function is sampling without replacement. If there are P players and each player is to receive C cards, then a deal consists of P*C cards, drawn without replacement from the deck. The deck does not need to be shuffled because the SAMPLE function outputs the cards in random order. Because the cards are in random order, it does not matter how you assign the cards to the players. The following SAS/IML program uses P=3 players and generates C=5 cards for each player. The program simulates one deal by assigning the first C cards to the first player, the second C cards to the second player, and so on:

%let nCards   = 5;    /* cards per player */
%let nPlayers = 3;    /* number of players */
%let nDeals   = 10;   /* number of deals to simulate */
 
/* simulate dealing many card hands in SAS/IML */
proc iml;
call randseed(1234);                       /* set random number seed */
use Cards;  read all var "Card";  close;   /* read the deck of cards into a vector */
 
nCardsPerDeal = &nCards * &nPlayers;       /* each row is a complete deal */
D = sample(Card, nCardsPerDeal, "WOR" );   /* sample without replacement from the deck */
Deal = shape(D, &nPlayers, &nCards);       /* reshape into nPlayers x nCards matrix */
 
/* print one deal */
cardNames   = "C1":"C&nCards";
playerNames = "P1":"P&nPlayers";
print Deal[c=cardNames r=playerNames];
TABLE

For this deal, the first player has two tens, two fives, and a king. The second player has a four, a five, a nine, an ace, and a two. You could sort the cards in each row, but I have left each player's cards unsorted so that you can see the random nature of the assignment.

Simulate multiple deals

The previous program simulates one deal. What if you want to simulate multiple deals? One way is to loop over the number of deals, but there is a more efficient method. The second argument to the SAMPLE function can be a two-element vector. The first element specifies the number of cards for each deal, and the second element specifies the number of deals. Thus, for example, if you want to simulate 10 deals, you can use the following statement to generate a matrix that has 10 rows. Each row is a sample (without replacement) from a 52-card deck:

/* use one call to simulate many deals */
D = sample(Card, nCardsPerDeal//&nDeals, "WOR" );   /* simulate many deals */
 
/* Ex: The 8th row contains the 8th deal. Display it. */
Deal8 = shape(D[8,], &nPlayers, &nCards);
print Deal8[c=cardNames r=playerNames];
TABLE

To demonstrate that each row is a deal, I have extracted, reshaped, and displayed the eighth row. The eighth deal is unusual because each player is dealt a "good" hand. The first player has a pair of eights, the second player has two pairs (threes and sixes), and the third player has a pair of sevens. If you want to print (or analyze) the 10 deals, you can loop over the rows, as follows:

do i = 1 to &nDeals;
   Deal = shape(D[i,], &nPlayers, &nCards);
   /* analyze or print the i_th deal */
   print i, Deal[c=cardNames r=playerNames];
end;

Deal random hands by using PROC SURVEYSELECT

You can also simulate card deals by using the SURVEYSELECT procedure. You should use the following options:

  • Use the METHOD=SRS option to specify sampling without replacement for each deal. To obtain the cards in random order, use the OUTRANDOM option.
  • Use the N= option to specify the number of cards in each deal.
  • Use the REPS= option to specify the total number of deals.

An example is shown in the following call to PROC SURVEYSELECT:

/* similar approach for PROC SURVEYSELECT */
proc surveyselect data=Cards seed=123 NOPRINT
     method=SRS                   /* sample without replacement */
     outrandom                    /* randomize the order of the cards */
     N=%eval(&nCards * &nPlayers) /* num cards per deal */
     reps=&nDeals                 /* total number of deals */
     out=AllCards;
run;

The output data set (AllCards) contains a variable named Replicate, which identifies the cards in each deal. Because the cards for each deal are in a random order, you can arbitrarily assign the cards to players. You can use a round-robin method or assign the first C cards to the first player, the next C cards to the second player, and so on. This is done by using the following DATA step:

/* assign the cards to players */
data AllDeals;
set AllCards;
by Replicate;
if first.Replicate then do;
   Cnt = 0;
end;
Player = 1 + floor(Cnt/&nCards);
Cnt + 1;
drop Cnt;
run;
OUTPUT

The simulated data are in long form. For some analyses, it might be more convenient to convert it into wide form. You can do this by using the DATA step or by using PROC TRANSPOSE, as below:

/* OPTIONAL: Convert from long form to wide form */
proc transpose data=AllDeals 
               out=Deals(rename=(col1-col&nCards=C1-C&nCards) drop=_NAME_);
   var Card;
   by Replicate Player;
run;
 
proc print data=Deals;
   where Replicate=8;
   var Replicate Player C:;
run;
OUTPUT

To demonstrate the results, I display the result of the eighth deal (Replicate=8). For this deal, the first player has several face cards but no pairs, the second player has a pair of threes, and the third player has a pair of fours.

Summary

SAS provides many tools for simulation studies. For simulations of card games, this article shows how to create a data set that represents a standard deck of 52 playing cards. From the cards, you can use the SAMPLE function in SAS/IML software or the SURVEYSELECT procedure to simulate many deals, where each deal is a sample without replacement from the deck. For each deal, you can assign the cards to the players in a systematic manner.

The post Simulate poker hands in SAS appeared first on The DO Loop.

11月 072022
 

I recently blogged about how to compute the area of the convex hull of a set of planar points. This article discusses the expected value of the area of the convex hull for n random uniform points in the unit square. The article introduces an exact formula (due to Buchta, 1984) and uses Monte Carlo simulation to approximate the sampling distribution of the area of a convex hull. A simulation like this is useful when testing data against the null hypothesis that the points are distributed uniformly at random.

The expected value of a convex hull

Assume that you sample n points uniformly at random in the unit square, n ≥ 3. What is the expected area of the convex hull that contains the points? Christian Buchta ("Zufallspolygone in konvexen Vielecken," 1984) proved a formula for the expected value of the area of n random points in the unit square. (The unit square is sufficient: For any other rectangle, simply multiply the results for the unit square by the area of the rectangle.)

Unfortunately, Buchta's result is published in German and is behind a paywall, but the formula is given as an answer to a question on MathOverflow by 'user9072'. Let \(s = \sum_{k=1}^{n+1} \frac{1}{k} (1 - \frac{1}{2^k}) \). Then the expected area of the convex hull of n points in [0,1] x [0,1] is \( E[A(n)] = 1 -\frac{8}{3(n+1)} \bigl[ s - \frac{1}{(n+1)2^{n+1}} \bigr] \)

The following SAS DATA step evaluates Buchta's formula for random samples of n ≤ 100. A few values are printed. You can use PROC SGPLOT to graph the expected area against n:

/* Expected area of the convex hull of a random sample of size n in the unit square.
   Result by Buchta (1984, in German), as reported by 
   https://mathoverflow.net/questions/93099/area-enclosed-by-the-convex-hull-of-a-set-of-random-points
*/
data EArea;
do n = 3 to 100;
   s = 0;
   do k = 1 to n+1;
      s + (1 - 1/2**k) / k;
   end;
   EArea = 1 - 8/3 /(n+1) * (s - 2**(-n-1) / (n+1));
   output;
end;
keep n EArea;
run;
 
proc print data=EArea noobs; 
   where n in (3,4,6,8,10,12,13) or mod(n,10)=0;
run;
 
title "Expected Area of Convex Hull";
proc sgplot data=EArea;
   series x=n y=EArea;
   xaxis grid label="Number of Random Points in Unit Square";
   yaxis grid label="Expected Area" min=0 max=1;
run;

The computation shows that when n is small, the expected area, E[A(n)], is very small. The first few values are E[A(3)]=11/144, E[A(4)]=11/72, E[A(5)] = 79/360, and E[A(6)] = 199/720. For small n, the expected area increases quickly as n increases. When n = 12, the expected area is about 0.5 (half the area of the bounding square). As n gets larger, the expected area asymptotically approaches 1 very slowly. For n = 50, the expected area is 0.8. For n = 100, the expected area is 0.88.

The distribution of the expected area of a convex hull

Buchta's result gives the expected value, but the expected value is only one number. You can use Monte Carlo simulation to compute and visualize the distribution of the area statistics for random samples. The distribution enables you to estimate other quantities, such as the median area and an interval that predicts the range of the areas for 95% of random samples.

The following SAS/IML program performs a Monte Carlo simulation to approximate the sampling distribution of the area of the convex hull of a random sample in the unit square of size n = 4. The program does the following:

  1. Load the PolyArea function, which was defined in a previous article that shows how to compute the area of a convex hull. Or you can copy/paste from that article.
  2. Define a function (CHArea) that computes the convex hull of a set of points and returns the area.
  3. Let B be the number of Monte Carlo samples, such as B = 10,000. Use the RANDFUN function to generate B*n random points in the unit square. This is an efficient way to generate B samples of size n.
  4. Loop over the samples. For each sample, extract the points in the sample and call the CHArea function to compute the area.
  5. The Monte Carlo simulation computes B sample areas. The distribution of the areas is the Monte Carlo approximation of the sampling distribution. You can compute descriptive statistics such as mean, median, and percentiles to characterize the shape of the distribution.
proc iml;
/* load the PolyArea function or copy it from 
   https://blogs.sas.com/content/iml/2022/11/02/area-perimeter-convex-hull.html
*/
load module=(PolyArea);
 
/* compute the area of the convex hull of n pts in 2-D */
start CHArea(Pts);
   indices = cvexhull(Pts);            /* compute convex hull of the N points */
   hullIdx = indices[loc(indices>0)];  /* the positive indices are on the CH */
   P = Pts[hullIdx, ];                 /* extract rows to form CH polygon */
   return PolyArea(P);                 /* return the area */
finish;
 
call randseed(1234);
NSim = 1E4;                            /* number of Monte Carlo samples */
n = 4;
X = randfun((NSim*n)//2, "Uniform");   /* matrix of uniform variates in [0,1]x[0,1] */
Area = j(NSim, 1, .);                  /* allocate array for areas */
do i = 1 to NSim;
   beginIdx = (i-1)*n + 1;             /* first obs for this sample */
   ptIdx    = beginIdx:(beginIdx+n-1); /* range of obs for this sample */
   Pts = X[ptIdx, ];                   /* use these n random points */
   Area[i] = CHArea(Pts);              /* find the area */
end;
 
meanArea = mean(Area);
medianArea = median(Area);
call qntl(CL95, Area, {0.025 0.975});
print n meanArea medianArea (CL95`)[c={LCL UCL}];

For samples of size n = 4, the computation shows that the Monte Carlo estimate of the expected area is 0.154, which is very close to the expected value (0.153). In addition, the Monte Carlo simulation enables you to estimate other quantities, such as the median area (0.139) and a 95% interval estimate for the predicted area ([0.022, 0.369]).

You can visualize the distribution of the areas by using a histogram. The following SAS/IML statements create a histogram and overlay a vertical reference line at the expected value.

title "Distribution of Areas of Convex Hull";
title2 "N=4; NSim = 10,000";
refStmt = "refline 0.15278 / axis=x lineattrs=(color=red) label='E(Area)';";
call histogram(Area) other=refStmt;

The histogram shows that the distribution of areas is skewed to the right for n = 4. For this distribution, a better measure of centrality might be the median, which is smaller than the expected value (the mean). Or you can describe the distribution by reporting a percentile range such as the inter-quartile range or the range for 95% of the areas.

Graphs of the distribution of the area of a convex hull

You can repeat the previous Monte Carlo simulation for other values of n. For example, if you use n = 50, you get the histogram shown to the right. The distribution for n = 50 is skewed to the left. A prediction interval for 95% of areas is [0.69, 0.89].

Whether or not you know about Buchta's theoretical result, you can use Monte Carlo simulation to estimate the expected value for any sample size. The following SAS/IML program simulates 10,000 random samples of size n for n in the range [3, 100]. For each simulation, the program writes the estimate of the mean and a prediction interval to a SAS data set. You can use the BAND statement in PROC SGPLOT to overlay the interval estimates and the estimate of the expected value for each value of n:

/* repeat Monte Carlo simulation for many values of N. Write results to a data set and visualize */
create ConvexHull var {'n' MeanArea LCL UCL};
 
numPts = 3:100;
NSim = 1E4;                            /* number of Monte Carlo samples */
do j = 1 to ncol(numPts);                 /* number of random pts for each convex hull */
   N = numPts[j];
   X = randfun((NSim*N)//2, "Uniform");   /* matrix of uniform variates in [0,1]x[0,1] */
   Area = j(NSim, 1, .);                  /* allocate array for areas */
   LCL = j(NSim, 1, .);                   /* allocate arrays for 95% lower/upper CL */
   UCL = j(NSim, 1, .);
   do i = 1 to NSim;
      beginIdx = (i-1)*N + 1;             /* first obs for this sample */
      ptIdx    = beginIdx:(beginIdx+N-1); /* range of obs for this sample */
      Pts = X[ptIdx, ];                   /* use these N random points */
      Area[i] = CHArea(Pts);
   end;
 
   meanArea = mean(Area);
   call qntl(CL95, Area, {0.025 0.975});
   LCL = CL95[1];   UCL = CL95[2];
   append;
end;
close ConvexHull;
QUIT;
 
title "Monte Carlo Estimate of Expected Area of Convex Hull";
proc sgplot data=ConvexHull noautolegend;
   band x=n lower=LCL upper=UCL / legendlabel="95% Prediction";
   series x=n y=meanArea / legendlabel="Expected Area";
   keylegend / location=inside position=E opaque across=1;
   xaxis grid label="Number of Random Points in Unit Square";
   yaxis grid label="Expected Area" min=0 max=1;
run;

From this graph, you can determine the expected value and a 95% prediction range for any sample size, 3 ≤ n ≤ 100. Notice that the solid line is the Monte Carlo estimate of the expected value. This estimate is very close to the true expected value, which we know from Buchta's formula.

Summary

Some researchers use the area of a convex hull to estimate the area in which an event occurs. The researcher might want to test whether the area is significantly different from the area of the convex hull of a random point pattern. This article presents several properties of the area of the convex hull of a random sample in the unit square. The expected value of the area is known theoretically (Buchta, 1984). Other properties can be estimated by using a Monte Carlo simulation.

You can download the SAS program that generates the tables and graphs in this article.

The post The area of the convex hull of random points appeared first on The DO Loop.

10月 102022
 

One of the benefits of social media is the opportunity to learn new things. Recently, I saw a post on Twitter that intrigued me. The tweet said that the expected volume of a random tetrahedron in the unit cube (in 3-D) is E[Volume] = 0.0138427757.... This number seems surprisingly small! After all, the cube has unit volume, and the volume of the largest tetrahedron in the cube is 1/3. So, the result says that the average volume of a random tetrahedron is 24 times smaller than the largest tetrahedron!

The result is attributed to Zinani, A. (2003) "The Expected Volume of a Tetrahedron whose Vertices are Chosen at Random in the Interior of a Cube," which is behind a paywall. The abstract says that the analytical result is exactly
      E[Volume] = 3977 / 216000 - π2 / 2160
and that this number is "in convincing agreement with a simulation."

This article shows how to run a Monte Carlo simulation in SAS to simulate random tetrahedra and compute their volumes. We use the simulation to confirm Zinani's result and to visualize the distribution of the volume.

The volume of a tetrahedron

In multivariable calculus, students are introduced to the scalar triple product of three vectors. Given three vectors u, v, and w, the scalar triple product is the quantity (u x v)⋅w, where (u x v) is the cross product of the vectors, and zw = zT*w is the scalar dot product. The scalar triple product is the (signed) volume of the parallelepiped formed by the vectors.

Recall that the volume of a tetrahedron with vertices (A,B,C,D) is 1/6 of the volume of the parallelepiped formed by the vectors between vertices. Consequently, the volume of a tetrahedron is
      V = 1/6 |(AB x AC)⋅AD|
where AB is the vector from A to B, AC is the vector from A to C, and AD is the vector from A to D.

Compute the volume of a tetrahedron in SAS

Since the volume formula includes vectors and vector operations, it is natural to use the SAS/IML language to compute the volume. The following functions implement the definition of the vector cross-product. To test the function, the program computes the volume of the largest tetrahedron that can be imbedded in the unit cube. One such tetrahedron is shown to the right. The tetrahedron has vertices A(0,0,0), B(1,1,0), C(0,1,1), and D(1,0,1). The volume of this tetrahedron is 1/3, as shown by the following program.

/* volume of tetrahedron in 3D */
proc iml;
/* vector cross product (3D vectors only)
   https://blogs.sas.com/content/iml/2013/05/20/matri-multiplication.html
*/
start CrossProd(u, v);
   i231 = {2 3 1};   i312 = {3 1 2};
   return( u[i231]#v[i312] - v[i231]#u[i312] );
finish;
 
start VolumeTetra(A,B,C,D);
   z = CrossProd(B-A, C-A);    /* z = AB x AC */
   w = colvec(D - A);          /* w = AD */
   volume = abs(z`*w) / 6;     /* V = 1/6 | z`*w | */
   return( volume );
finish;
 
/* test the formula: compute the volume of the largest tetrahedron inside the unit cube */
A = {0 0 0};
B = {1 1 0};
C = {0 1 1}; 
D = {1 0 1};
maxVol = VolumeTetra(A,B,C,D);
print maxVol;

A Monte Carlo simulation of random tetrahedron

It is not hard to write a SAS/IML program that generates four random vertices for a tetrahedron inside the unit cube in 3-D. To generate the four vertices, you can generate 12 uniform variates independently in the interval [0,1]. Assign the first three numbers to be the (x,y,z) Euclidean coordinates of the first vertex, the next three numbers to be the coordinates of the second vertex, and so forth. From the vertices, you can call the VolumeTetra function to produce the volume.

The following program generates 100,000 random tetrahedra and computes their volumes. The program then calls the MEAN function computes the average volume, which is the Monte Carlo estimate of the expected value.

call randseed(1234);
N = 1E5;
X = randfun(N//12, "Uniform");   /* Z x 12 matrix of uniform variates */
Vol = j(N, 1, .);                /* allocate array for volumes */
do i = 1 to N;
   A = X[i, 1:3];   B = X[i, 4:6];  /* form four vertices from the 12 random numbers */
   C = X[i, 7:9];   D = X[i, 10:12];
   Vol[i] = VolumeTetra(A,B,C,D);
end;
 
avgVol = mean(Vol);                                       /* Monte Carlo estimate of mean */
expectedValue = 3977/216000 - constant('pi')**2 / 2160;   /* Zinani's exact result for E[Volume] */
print avgVol expectedValue (expectedValue-avgVol)[L="Diff"];

The output shows that the Monte Carlo estimate is very close to the expected value. This confirms Zinani's result: the average volume of a random tetrahedron is smaller than you might think.

Visualize the distribution of volumes

The expected value for the distribution of tetrahedral volumes is small. Since we know that at least one tetrahedron has a relatively large value (1/3), we might conjecture that most of the tetrahedra have very small volumes and only a few have larger volumes. A way to test that conjecture is to visualize the distribution by drawing a histogram of the volumes for the 100,000 random tetrahedra. The following SAS/IML statements write the expected value to a macro variable and write the volumes of the random tetrahedra to a data set. PROC UNIVARIATE then computes descriptive statistics for the volumes and plots a histogram of the distribution of volumes:

/* E[Volume] = 3977 / 216000 - pi**2 / 2160 = 0.01384277574...  */
call symputx("EV",expectedValue);       /* write E[Volume] to a macro variable */
create Vol var "Vol";  append;  close;  /* write volumes to a data set */
QUIT;
 
title "Volume of a Random Tetrahedron";
proc univariate data=Vol;
   histogram Vol / endpoints href=&EV odstitle=title;
run;

The graph indicates that a random tetrahedron is likely to have a volume that is near 0. The vertical line is the expected value of the distribution. Notice that only a small proportion of the tetrahedra have volumes that exceed 0.1. In this collection of 100,000 random tetrahedra, the largest volume was 0.1777, which is about 50% of the volume of the largest possible tetrahedron. The distribution has a long but thin tail.

If you look carefully at the table of descriptive statistics (the "Moments" table), you will see that the mean and the standard deviation are nearly the same. This property is characteristic of an exponential distribution. An exponential distribution has a skewness of 2 and an excess kurtosis of 6. However, the sample skewness and sample kurtosis are biased statistics, and the estimates shown here (1.8 and 5.1, respectively) are typical for a large sample from an exponential distribution. This suggests that the tetrahedral volume is distributed as V ~ Exp(λ=E[Volume]). Let's use PROC UNIVARIATE again and overlay the density of the Exp(λ=0.0139) distribution. Furthermore, you can use a Q-Q plot to visualize how well the exponential model fits the simulated volumes.

proc univariate data=Vol;
   histogram Vol / exponential(theta=0 sigma=&EV) endpoints;
   qqplot Vol / exponential(theta=0 sigma=&EV);
run;

The fit is excellent! Not only does the density curve overlay perfectly on the histogram, but the quantiles of the data (not shown) are in excellent agreement with the expected quantiles of the exponential distribution. If you look at the Q-Q plot, you again see excellent agreement between the data and the exponential model. The only shortcoming is that the simulated volumes have fewer extreme values than predicted by the model.

The model enables you to compute the probability of certain events. For example, the probability that a random tetrahedron has volume greater than 0.1 is 1-cdf("Expo", 0.1, &EV), which is 0.00073.

Summary

A result by Zinani (2003) gives the expected volume of a tetrahedron whose vertices are chosen uniformly at random in the unit cube. Zinani's result is E[Volume] = 0.0138427757.... Although this number seems surprisingly small, the expected value is confirmed by using a Monte Carlo simulation in SAS. The simulation suggests that the volumes are exponentially distributed.

For an alternative mathematical proof of the result, see Philip, J. (2007) "The Expected Volume of a Random Tetrahedron in a Cube", which is free to view and is very well written.

The post The expected volume of a random tetrahedron in a cube appeared first on The DO Loop.