simulation

May 2, 2022
 

Recently, I wrote about Bartlett's test for sphericity. The purpose of this hypothesis test is to determine whether the variables in the data are uncorrelated. It works by testing whether the sample correlation matrix is close to the identity matrix.

Often statistics textbooks or articles include a statement such as "under the null hypothesis, the test statistic is distributed as a chi-square statistic with DF degrees of freedom." That sentence contains a lot of information! Using algebraic formulas and probability theory to describe a sampling distribution can be very complicated. But it is straightforward to describe a sampling distribution by using simulation and statistical graphics.

This article discusses how to use simulation to understand the sampling distribution of a statistic under a null hypothesis. You can find many examples that simulate the sampling distribution of a statistic for univariate data, but this article shows the sampling distribution of a statistic for a hypothesis test on multivariate data. The hypothesis test in this article is Bartlett's sphericity test, but the ideas apply to other statistics. The article also shows a useful trick for simulating correlation matrices for multivariate normal (MVN) data.

The simulation approach to null distributions

Simulation is well-suited for explaining the sampling distribution of a statistic under the null hypothesis. For Bartlett's sphericity test, the null hypothesis is that the data are a random sample of size N from a multivariate normal population of p uncorrelated variables. Equivalently, the correlation matrix for the population is the p x p identity matrix. Under this hypothesis, the test statistic
\(T = -\log(\det(R)) (N-1-(2p+5)/6)\)
has a chi-square distribution with p(p–1)/2 degrees of freedom, where R is the sample correlation matrix. In other words, if we randomly sample many times from an uncorrelated MVN distribution, the statistics for each sample will follow a chi-square distribution. Let's use simulation to visualize the null distribution for Bartlett's test.

Here is a useful fact: You do not have to generate MVN data if the statistic is related to the covariance or correlation of the data. Instead, you can DIRECTLY simulate a correlation matrix for MVN data by using the Wishart distribution. The following SAS/IML statements use the Wishart distribution to simulate 10,000 correlation matrices for MVN(0, Σ) samples, where Σ is a diagonal covariance matrix:

proc iml;
call randseed(12345);
p = 3;                               /* number of MVN variables */
N = 50;                              /* MVN sample size   */
Sigma = diag(1:p);                   /* population covariance */
NumSamples = 10000;                  /* number of samples in simulation */
 
/* Simulate correlation matrices for MVN data by using the Wishart distribution.
   Each row of A is a scatter matrix; each row of B is a covariance matrix */
A = RandWishart(NumSamples, N-1, Sigma); /* N-1 degrees of freedom */
B = A / (N-1);                           /* rescale to form covariance matrix */
 
/* print one row to show an example */
C = shape(B[1,], p);                     /* reshape 1st row to p x p matrix */
R = cov2corr(C);                         /* convert covariance to correlation */
print C, R;

The output shows one random covariance matrix (C) and its associated correlation matrix (R) from among the 10,000 random matrices. The B matrix is 10000 x 9, and each row is a sample covariance matrix for an MVN(0, Σ) sample that has N=50 observations.

Recall that the determinant of a correlation matrix is always in [0,1]. The determinant equals 1 only when R=I(p). Therefore, the expression -log(det(R)) is close to 0 when R is close to the identity matrix and becomes more positive as R moves farther from the identity matrix. (It is undefined if R is singular.) So if we apply Bartlett's formula to each of the random matrices, we expect to get many statistics that are close to 0 (because R should be close to the identity) and a few that are far away. The following SAS/IML statements carry out this method and plot the 10,000 statistics that result:

T = j(numSamples, 1);
do i = 1 to NumSamples;
   cov = shape(B[i,], p, p);             /* convert each row to square covariance matrix */
   R = cov2corr(cov);                    /* convert covariance to correlation */
   T[i] = -log(det(R)) * (N-1-(2*p+5)/6);
end;
 
title "Bartlett's Sphericity Statistic";
title2 "&numSamples MVN(0, Sigma) Samples for Diagonal Covariance";
call histogram(T) rebin={0.25 0.5};

The histogram shows the null distribution, which is the distribution of Bartlett's statistic under the null hypothesis. As expected, most statistics are close to 0. Only a few are far from 0, where "far" is a relative term that depends on the dimension of the data (p).

Knowing that the null distribution is a chi-square distribution with DF=p(p-1)/2 degrees of freedom helps to provide a quantitative value to "far from 0." You can use the 95th percentile of the chi-square(DF) distribution to decide whether a sample correlation matrix is "far from 0":

/* critical value of chi-square(3) */
DF = p*(p-1)/2;
crit = quantile("ChiSq", 0.95, DF);  /* one-sided: Pr(chiSq < crit)=0.95 */
print crit;

You can overlay the chi-square(3) distribution on the null distribution and add the critical value to obtain the following figure:

This graph summarizes the Bartlett sphericity test. The histogram approximates the null distribution by computing Bartlett's statistic on 10,000 random samples from an uncorrelated trivariate normal distribution. The curve is the asymptotic chi-square(DF=3) distribution. The vertical line is the critical value for testing the hypothesis at the 95% confidence level. The next section uses this graph and Bartlett's test to determine whether real data is a sample from an uncorrelated MVN distribution.
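
One way to create such an overlay (a sketch, not the original post's code: it assumes T, p, and crit from the previous steps, and the data set names NullDist, Curve, and Overlay are arbitrary) is to write the statistics and a chi-square density curve to data sets and then call PROC SGPLOT from within SAS/IML:

/* sketch: overlay the chi-square(DF) density and the critical value */
DF = p*(p-1)/2;
x = do(0, 20, 0.1)`;                   /* grid for the density curve */
chisqPDF = pdf("ChiSq", x, DF);        /* chi-square density on the grid */
create NullDist var {T};         append;  close NullDist;
create Curve var {x chisqPDF};   append;  close Curve;
submit crit;
   data Overlay;  set NullDist Curve;  run;   /* combine for plotting */
   proc sgplot data=Overlay noautolegend;
      histogram T / scale=density;
      series x=x y=chisqPDF;
      refline &crit / axis=X lineattrs=(color=red);
   run;
endsubmit;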

Bartlett's test on real data

The previous graph shows the null distribution. If you run the test on a sample that contains three variables and 50 observations, you will get a value of the test statistic. If the value is greater than 7.8, it is unlikely that the data are from an uncorrelated multivariate normal distribution.

A previous article showed how to use PROC FACTOR to run Bartlett's test in SAS. Let's run PROC FACTOR on 50 observations and three variables of Fisher's Iris data:

proc factor data=sashelp.iris(obs=50) method=ML heywood;
   var SepalLength SepalWidth PetalLength ;
   ods select SignifTests; /* output only Bartlett's test */
run;

The value of Bartlett's statistic on these data is 41.35. The X axis for the null distribution only goes up to 20, so this value is literally "off the chart"! You would reject the null hypothesis for these data and conclude that these data do not come from the null distribution.
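
As a cross-check, you can apply the formula for Bartlett's statistic directly to the same 50 observations. The following sketch (the matrix name Y is arbitrary) computes the sample correlation matrix and evaluates the formula; the result should be close to the value reported by PROC FACTOR:

proc iml;
use sashelp.iris;
read all var {SepalLength SepalWidth PetalLength} into Y;
close;
Y = Y[1:50, ];                          /* same 50 observations as OBS=50 */
N = nrow(Y);   p = ncol(Y);
R = corr(Y);                            /* sample correlation matrix */
T = -log(det(R)) * (N-1-(2*p+5)/6);     /* Bartlett's statistic */
print T;
quit;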

When researchers see this result, they often assume that one or more variables in the data are correlated. However, it could also be the case that the data are not multivariate normal since normality is an assumption that was used to generate the null distribution.

Summary

This article shows how to simulate the null distribution for Bartlett's sphericity test. The null distribution is obtained by simulating many data samples from an uncorrelated multivariate normal distribution and graphing the distribution of the test statistics. For Bartlett's test, you get a chi-square distribution.

The ideas in this article apply more generally to other hypothesis tests. An important use of simulation is to approximate the null distribution for tests when an exact form of the distribution is not known.

This article also shows that you can use the Wishart distribution to avoid having to simulate MVN data. If the goal of the simulation is to obtain a covariance or correlation matrix for MVN data, you can use the Wishart distribution to simulate the matrix directly.

The post Simulate the null distribution for a hypothesis test appeared first on The DO Loop.

March 21, 2022
 

While studying business intelligence as an undergraduate student at business school HEC Montreal, Camille Duchesne encountered Cortex, an analytics simulation that pits participants against each other to develop the most accurate models for a particular task. In this case, the simulation supports a fictional charity by predicting which subjects from [...]

Hooked on data science: gamification drives engagement among students and trainees was published on SAS Voices by Alex Coop

January 20, 2022
 

Here's a fun problem to think about: Suppose that you have two different valid ways to test a statistical hypothesis. For a given sample, will both tests reject or fail to reject the hypothesis? Or might one test reject it whereas the other does not?

The answer is that two different tests can and will disagree for some samples, although they are likely to agree most of the time. Statistical software sometimes displays the results of multiple statistical tests when you request a hypothesis test. For example, PROC UNIVARIATE in SAS outputs four different statistical tests for normality, including the Shapiro-Wilk and Kolmogorov-Smirnov tests. One or more tests might reject the hypothesis of normality whereas other tests fail to reject it.

A SAS customer asked a question that made me wonder how often the four normality tests agree, and what the samples look like when the tests do not agree. This article presents a simulation study that generates random samples from the normal distribution and calls PROC UNIVARIATE to perform tests for normality on each sample. At the 95% confidence level, we expect each of these tests to reject the hypothesis of normality for 5% of the samples. But even though all the tests are designed to assess normality, they do so in different ways. This means that the 5% of samples that are rejected by one test might be different from the 5% of samples that are rejected by another test.

Tests for Normality

PROC UNIVARIATE in SAS supports four different tests for normality of a univariate distribution. The Shapiro-Wilk statistic (W) is based on a computation that involves the variance and quantiles of the data. The other three tests are based on the empirical cumulative distribution function (ECDF). The ECDF-based tests include the Kolmogorov-Smirnov statistic (D), the Anderson-Darling statistic (A2), and the Cramer–von Mises statistic (W2). When the associated p-value of a statistic is less than α, the test rejects the null hypothesis ("the sample is from a normal distribution") at the α significance level. This article uses α=0.05.

What should an analyst do if some tests reject the hypothesis whereas others fail to reject it? I know a statistician who habitually rejects the assumption of normality if ANY of the tests reject normality. I think this is a bad idea because he is making decisions at less than the 95% confidence level. The simulation study in this article estimates the probability that a random sample from the normal distribution is not rejected simultaneously by all four tests. That is, all four tests agree that the sample is normal.

Run a simulation study

If you simulate a moderate-sized sample from the normal distribution, these four tests should each reject normality for about 5% of the samples. The following SAS program uses the standard Monte Carlo simulation technique in SAS:

  1. Simulate B=10,000 samples of size N=50 from the normal distribution. Associate a unique ID to each sample.
  2. Make sure that you suppress the output. Then use the BY statement to analyze the B samples. Write the B statistics to a data set.
  3. Analyze the sampling distribution of the statistics or tabulate the results of a statistical test.

The first two steps are straightforward. You can use the ODS OUTPUT statement to save the statistics to a data set, as follows:

/* simulate normal data */
%let n = 50;
%let numSamples = 10000;
data Sim;
call streaminit(123);
do SampleID=1 to &NumSamples;
   do i = 1 to &n;
      x = rand("Normal", 0);  /* sample from null hypothesis */
      output;
   end;
end;
run;
 
ods exclude all;    /* suppress output */
proc univariate data=Sim normal;
   by SampleID;
   var x;
   ods output TestsForNormality = NormTests;
run;
ods exclude none;

Transpose and analyze the results of the simulation study

The TestsForNormality table lists the statistics in a "long form," where each statistic is on a separate row. To analyze the distribution of the statistics, it is convenient to transpose the data to wide form. The following call to PROC TRANSPOSE creates four variables (pValue_W, pValue_D, pValue_W_Sq, and pValue_A_Sq), which are the p-values for the four normality tests.

/* Transpose from long to wide form. This creates four variables
   pValue_W, pValue_D, pValue_W_Sq, and pValue_A_Sq. */
proc transpose data=NormTests out=Tran prefix=pValue_;
   by SampleID;
   id TestLab;
   var pValue;
run;

Let's create a binary indicator variable for each p-value. The indicator variable has the value 1 if the test rejects the hypothesis of normality at the α=0.05 significance level. If the test fails to reject, the indicator variable has the value 0. You can then tabulate the results for each individual test:

%let alpha = 0.05;
data Results;
set Tran;
RejectW = (pValue_W < &alpha);
RejectD = (pValue_D < &alpha);
RejectWSq = (pValue_W_Sq < &alpha);
RejectASq = (pValue_A_Sq < &alpha);
run;
 
/* tabulate the results */
proc freq data=Results;
   tables RejectW RejectD RejectWSq RejectASq / norow nocol nocum;
run;

As expected, each test rejects the hypothesis of normality on about 5% of the random samples. However, the tests have different characteristics. For a given sample, one test might reject the hypothesis whereas another test fails to reject it. For example, you can investigate how many samples were jointly rejected by the Shapiro-Wilk test (W) and the Kolmogorov-Smirnov test (D):

proc freq data=Results;
   tables RejectW*RejectD / norow nocol nocum list; 
run;

The table shows the results for the 10,000 samples. For 192 samples (1.9%), both tests reject the null hypothesis. For 9,224 samples (92.2%), both tests fail to reject it. Thus, the tests agree for 9,416 samples, or 94.2%.

The interesting cases are when one test rejects the hypothesis and the other does not. The Kolmogorov-Smirnov test rejects the hypothesis on 279 samples that are not rejected by the Shapiro-Wilk test. Conversely, the Shapiro-Wilk test rejects 305 samples that the Kolmogorov-Smirnov test does not reject. Two such samples are shown in the following graph. A casual inspection does not reveal why the tests differ on these samples, but they do.
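
As a sketch of how you might locate and plot such a sample yourself, the following statements find one sample that the Kolmogorov-Smirnov test rejects but the Shapiro-Wilk test does not (the macro variable name id is arbitrary) and overlay a normal density on its histogram:

/* find one sample that D rejects but W does not, then plot it */
proc sql noprint;
   select min(SampleID) into :id
   from Results
   where RejectD=1 and RejectW=0;
quit;
 
proc sgplot data=Sim(where=(SampleID=&id));
   histogram x;
   density x / type=normal;    /* overlay a normal density for reference */
run;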

Multiple tests

When you run four tests, the situation becomes more complicated. For a given sample, you could find that zero, one, two, three, or all four tests reject normality. Since these statistics all test the same hypothesis, they should agree often. Thus, you should expect that for most random samples, either zero or four tests will reject the null hypothesis. Again, the interesting cases are when some tests reject it and others fail to reject it. The following table shows the results of all four tests on the samples in the simulation study:

 
proc freq data=Results;
   tables RejectW*RejectD*RejectWSq*RejectASq / norow nocol nocum list;
run;

There are a total of 9,103 samples for which all tests agree that the sample is normally distributed. If you reject the hypothesis if ANY of the tests reject it, you are making decisions at the 91% confidence level, not at the 95% level.

The first and last rows show the cases where all four tests agree. You can see that the tests agree with each other about 92.8% of the time. However, there are some other cases that occur frequently:

  • The 5th and 12th rows indicate samples for which the Kolmogorov-Smirnov test does not agree with the other three tests. This happens for 2.6% of the samples.
  • The 8th and 9th rows indicate samples for which the Shapiro-Wilk test does not agree with the ECDF tests. This happens for 2.5% of the samples.
  • The ECDF tests agree with each other for 95.3% of these samples. This is shown by the 1st, 8th, 9th, and 16th rows.
  • It is very rare for the Anderson-Darling statistic (A2) and the Cramer–von Mises statistic (W2) to disagree when the null hypothesis is true. They agree for 98.6% of these samples.
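
Another way to summarize the joint results is to count how many of the four tests reject each sample. The following sketch (the data set name Agree and the variable NumReject are arbitrary) tabulates that count:

data Agree;
   set Results;
   NumReject = RejectW + RejectD + RejectWSq + RejectASq;  /* 0-4 tests reject */
run;
 
proc freq data=Agree;
   tables NumReject / nocum;
run;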

Summary

This article uses a simulation study to investigate how often different tests (for the same hypothesis) agree with each other. The article uses small samples of size N=50 that are normally distributed. The results show that these tests for normality tend to agree with each other. At least one test disagrees with the others for 7.2% of the samples.

As usual, this simulation study applies only to the conditions in the study. Wicklin (Simulating Data with SAS, 2013, p. 105) cautions against generalizing too much from a particular simulation: "real data are rarely distributed according to any 'named' distribution, so how well do the results generalize to real data? The answer can depend on the true distribution of the data and the sample size."

Nevertheless, this article provides some insight into how often statistical tests of normality agree and disagree with each other when the null hypothesis is true. You can use the techniques in this article to compare other tests.

The post How often do different statistical tests agree? A simulation study appeared first on The DO Loop.

January 18, 2022
 

Several probability distributions model the outcomes of various trials when the probabilities of certain events are given. For some distributions, the definitions make sense even when a probability is 0. For other distributions, the definitions do not make sense unless all probabilities are strictly positive. This article examines how zero probabilities affect some common probability distributions, and how to simulate events in SAS when a probability is 0. However, the same techniques can be used to detect and handle other situations where an invalid parameter is used as an argument to a SAS function.

Elementary distributions and event-trial scenarios

The following discrete probability distributions are usually presented in an introductory statistics course. They describe how likely it is to observe one or more events when you run one or more trials. In the following, X is a random variable and p is the probability of success in one Bernoulli trial.

  • X follows the Bernoulli(p) distribution if P(X=1) = p and P(X=0) = 1 - p.
  • X follows the Binomial(p, k) distribution if it is the sum of k independent Bernoulli random variables.
  • X follows a Table(p1, p2, ..., pd) distribution if X takes the i_th value in a set of d values with probability P(X = i) = pi.
  • X follows a Geometric(p) distribution if X is the number of failures until the first success in a sequence of independent Bernoulli trials, each with probability of success p.
  • X follows a NegativeBinomial(p, k) distribution if X is the number of failures before k successes in a sequence of independent Bernoulli trials, each with probability of success p.

Among these distributions, the Bernoulli and binomial distributions make sense if p=0. In that case, the random variable always has the value 0 because no successes are ever observed. The Table distribution can handle pi=0, but it usually is advisable to remove the categories that have zero probabilities. However, the geometric and negative binomial distributions are not well-defined if p=0.

Simulation in the SAS DATA step

What happens if you use p=0 for these five distributions? The result might depend on the software you use. The following SAS DATA step shows what happens in SAS:

%let NSim = 200;         /* sample size */
data Sample; 
/* define array so that sum(prob)=1 */
array prob[10] _temporary_ (0 0.2 0.125 0.2 0 0.25 0 0.125 0 0.1); 
call streaminit(54321); 
p = 0;                   /* p=0 might cause errors */
k = 10;
do i = 1 to &NSim;
   /* For some distributions, p=0 is okay. */
   /* Success (1) or failure (0) of one trial when Prob(success)=p. */
   Bern = rand("Bernoulli", p);
   /* The number of successes in k indep trials when Prob(success)=p. */
   Bin = rand("Binomial", p, 10);
 
   /* For some distributions, p=0 is questionable, but supported. */
   /* The Table distribution returns a number between 1 and dim(prob)
      when the probability of observing the i_th category is prob[i]. */
   T = rand("Table", of prob[*]);
 
   /* For some distributions, p=0 does not make sense. */
   /* The number of trials that are needed to obtain one success when Prob(success)=p. */
   Geo = rand("Geometric", p);      /* ERROR */
   /* The number of failures before k successes when Prob(success)=p. */
   NB = rand("NegBinomial", p, k);  /* ERROR */
   output; 
end; 
run;

When you run this program, the log will display many notes like the following:

NOTE: Argument 2 to function RAND('Geometric',0) at line 1735
      column 10 is invalid.
NOTE: Argument 2 to function RAND('NegBinomial',0,10) at line
      1737 column 9 is invalid.
<many more notes>
NOTE: Mathematical operations could not be performed at the
      following places. The results of the operations have been
      set to missing values.
<more notes>

The log tells you that the RAND function considers p=0 to be an invalid argument for the geometric and negative binomial distributions. Why? A geometric random variable shows the number of trials that are needed to obtain one success. But if the probability of success is exactly zero, then we will never observe a success. The number of trials will be infinite. Similarly, a negative binomial random variable shows the number of failures before k successes, but there will never be any successes when p=0. Again, the number of trials will be infinite.

Because the geometric and negative binomial distributions are not defined when p=0, SAS probability routines will note the error and return a missing value, as shown by printing the output data:

proc print data=Sample(obs=5); run;

Notice that the Bernoulli and binomial variables are well-defined. Every Bernoulli trial results in X=0, which means that the sum of arbitrarily many trials is also 0. The Table distribution will return only the categories for which the probabilities are nonzero. However, the geometric and negative binomial values are assigned missing values.

Eliminate the notes during a simulation

If you are running a large simulation in which p=0 is a valid possibility, the log will contain many notes. Eventually, SAS will print the following note:

NOTE: Over 100 NOTES, additional NOTES suppressed.

In general, it is a bad idea to suppress notes because there might be additional notes later in the program that you need to see. A better approach is to get rid of all the notes that are telling you that p=0 is an invalid argument. You can do that by using IF-THEN logic to trap the cases where p is very small. For those cases, you can manually assign a missing value. SAS will not issue a note in this case because all calls to SAS functions have valid arguments. This technique is shown in the following DATA step program:

/* workaround: use different logic when p is too small */
%let cutoff = 1e-16;
data Sample2; 
format p Geo NB Best7.;
call streaminit(54321); 
input p @@;
/* The number of trials that are needed to obtain one success when Prob(success)=p. */
if p > &cutoff then 
   Geo = rand("Geometric", p);
else
   Geo = .;
/* The number of failures before k successes when Prob(success)=p. */
if p > &cutoff then 
   NB = rand("NegBinomial", p, 10);  /* k=10 */
else
   NB = .;
datalines;
-0.1  0  1e-20  1e-15 1e-10  1e-6 0.001 0.1
;
 
proc print data=Sample2; run;

When you run the program, it "runs clean." There are no notes in the log about invalid parameters. The output shows that values of p that are below the cutoff value result in a missing value. Very small values of p that are above the cutoff value are handled by the RAND function, which returns large integers.
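
For intuition, recall that if X counts the number of failures before the first success, the expected value of the geometric distribution is
\(E[X] = (1-p)/p \approx 1/p\)
for small p. For example, when p=1e-6 the expected value is about one million, which explains the large integers in the output.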

Summary

This article discusses how to handle extremely small probabilities in functions that compute probability distributions. The trick is to use IF-THEN logic to prevent the invalid probability from being used as an argument to the RAND function (or other probability functions, such as the PDF, CDF, or QUANTILE functions).
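
For example, the following sketch applies the same trap to the QUANTILE function for the geometric distribution (the cutoff is the same one used earlier, and the variable names are arbitrary):

data _null_;
   p = 0;   q = 0.5;
   if p > 1e-16 then median = quantile("Geometric", q, p);
   else              median = .;     /* assign missing; no note is written */
   put median=;
run;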

Although this article discusses zero probabilities, this situation is actually a specific case of a larger programming issue: how to prevent notes from appearing in the SAS log when a parameter to a function is invalid. Thus, you can think of this article as yet another example of how to use the "trap and cap" method of defensive programming.

The post Simulate events when some probabilities are zero appeared first on The DO Loop.

January 10, 2022
 

On this blog, I write about a diverse set of topics that are relevant to statistical programming and data visualization. In a previous article, I presented some of the most popular blog posts from 2021. The most popular articles often deal with elementary or familiar topics that are useful to almost every data analyst. However, I also write articles that discuss more advanced topics. Did you make a New Year's resolution to learn something new this year? Here is your chance! The following articles deserve a second look. I have grouped them into four categories: SAS programming, statistics, optimization, and data simulation.

SAS programming

For years, I've been writing articles about how to accomplish tasks in Base SAS. In addition, I now write about how to program basic tasks in SAS Viya.

Use the DOLIST Syntax to Specify Tick Marks on a Graph

Probability and statistics

A Family of Density Curves for the Inverse Gamma Distribution

Probability and statistics provide the theoretical basis for modern data analysis. You cannot understand data science, machine learning, or artificial intelligence without understanding the basic principles of statistics. Most readers are familiar with common probability distributions, Pearson correlation, and least-squares regression. The following articles discuss some of the lesser-known statistical cousins of these topics:

  • The inverse gamma distribution: To use any probability distribution, you need to know four essential functions: the density, the cumulative probability, the quantiles, and how to generate random variates. You might be familiar with the gamma distribution, but what is the inverse gamma distribution and how do you define the four essential functions in SAS? Similarly, what is the generalized gamma distribution? (A short sketch after this list evaluates the four essential functions for the ordinary gamma distribution.)
  • The Hoeffding D statistic: The Hoeffding D statistic measures the association between two variables. How do you compute the Hoeffding D statistic in SAS, and how do you interpret the results?
  • Weibull regression: In ordinary least squares regression, the response variable is assumed to be modeled by a set of explanatory variables and normally distributed errors. If the error terms are Weibull distributed, you can estimate parameters for the Weibull distribution as part of a regression analysis, but you need to transform the regression estimates to obtain the usual Weibull parameters.
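
As a quick illustration of the "four essential functions," the following sketch evaluates them for the ordinary gamma distribution (the shape parameter and evaluation points are arbitrary):

data _null_;
   call streaminit(1);
   alpha = 4;                            /* shape parameter */
   f = pdf("Gamma", 2, alpha);           /* density at x=2 */
   F = cdf("Gamma", 2, alpha);           /* cumulative probability at x=2 */
   q = quantile("Gamma", 0.95, alpha);   /* 95th percentile */
   x = rand("Gamma", alpha);             /* one random variate */
   put f= F= q= x=;
run;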

Optimization

Optimization is at the heart of many statistical techniques, such as maximum likelihood estimation. But sometimes you need to solve a "pure" optimization problem. SAS supports many methods for solving optimization problems.

Multivariate simulation and bootstrapping

It is straightforward to simulate univariate data. It is harder to simulate multivariate data while preserving important relations between the variables, such as correlation. Similarly, it can be challenging to analyze the bootstrap distribution of a multivariate statistic, such as a correlation matrix:

The Geometry of the Iman-Conover Transformation

Your turn

Did I omit one of your favorite blog posts from The DO Loop in 2021? If so, leave a comment and tell me what topic you found interesting or useful.

The post 12 blog posts from 2021 that deserve a second look appeared first on The DO Loop.

January 5, 2022
 

You can use the Cholesky decomposition of a covariance matrix to simulate data from a correlated multivariate normal distribution. This method is encapsulated in the RANDNORMAL function in SAS/IML software, but you can also perform the computations manually by calling the ROOT function to get the Cholesky root and then use matrix multiplication to correlate multivariate normal (MVN) data, as follows:

%let d = 12;           /* number of variables */
%let N = 500;          /* number of observations */
proc iml;
d = &d;                /* want to generate d variables that are MVN */
N = &N;                /* sample size */
 
/* easy to generate uncorrelated MVN variables */
call randseed(1234);
z = j(d, N);
call randgen(z, "Normal"); /* z is MVN(0,I(d)) */
 
/* use Cholesky root to transform to correlated variables */
Sigma = toeplitz(d:1);   /* example of covariance matrix */
G = root(Sigma);         /* the Cholesky root of the full covariance matrix */
x = G`*z;                /* x is MVN(0, Sigma) */
title "Two Components of MVN(0,Sigma) Data";
call scatter(x[1,], x[2,]) label={x1 x2};

The simulation creates d=12 correlated variables and 500 observations. The first two variables are plotted against each other so that you can see that they are correlated. The other pairs of variables are also correlated according to the entries in the Σ covariance matrix.

Generating a huge number of MVN variables

In the comments to one of my previous articles, a SAS programmer asked whether it is possible to generate MVN data even if the covariance matrix is so large that it cannot be factored by the ROOT function. For example, suppose you want to generate d=20,000 MVN variables. A d x d covariance matrix of that size requires 3 GB of RAM, and on some operating systems you cannot form the Cholesky root of a matrix that large. However, I have developed an algorithm that enables you to deal with the covariance matrix (Σ) as a block matrix.

To introduce the algorithm, represent the Σ matrix as a 2 x 2 block matrix. (See the last section for a more general representation.) For a 2 x 2 block algorithm, you only need to form the Cholesky roots of matrices of size (d/2), which are 10000 x 10000 matrices for this example. Matrices of that size require only 0.75 GB and can be quickly factored. Of course, there is no such thing as a free lunch: The new algorithm is more complicated and requires additional computations, albeit on smaller matrices.

Let's look at how to simulate MVN data by treating Σ as a block matrix of size d/2. For ease of exposition, assume d is even and each block is of size d/2. However, the algorithm easily handles the case where d is odd. When d is odd, choose the upper blocks to have floor(d/2) rows and choose the lower blocks to have ceil(d/2) rows. The SAS code (below) handles the general case. (In fact, the algorithm doesn't care about the block size: for any p in (0, 1), one block can be size pd and the other size (1-p)d.)

In block form, the covariance matrix is
\(\Sigma = \begin{bmatrix}A & B\\B^\prime & C\end{bmatrix}\)
There is a matrix decomposition called the LDLT decomposition that enables you to write Σ as the product of three block matrices:

\(\begin{bmatrix}A & B\\B^\prime & C\end{bmatrix} = \begin{bmatrix}I & 0\\B^\prime A^{-1} & I\end{bmatrix} \begin{bmatrix}A & 0\\0 & S\end{bmatrix} \begin{bmatrix}I & A^{-1}B\\0 & I\end{bmatrix} \)
where \(S = C - B^\prime A^{-1} B\) is the Schur complement of the block matrix C. There is a theorem that says that if Σ is symmetric positive definite (SPD), then so is every principal submatrix, so A is SPD. Thus, we can use the Cholesky decomposition and write \(A = G^\prime_A G_A\) for an upper triangular matrix, \(G_A\). By reordering the variables, C is SPD as well. The Schur complement of an SPD matrix is also SPD, so we can write \(S = G^\prime_S G_S\) for an upper triangular matrix \(G_S\).

Using these definitions, the Cholesky decomposition of Σ can be written in block form as
\(\Sigma = \begin{bmatrix}G^\prime_A & 0\\B^\prime G^{-1}_A & G^\prime_S\end{bmatrix} \begin{bmatrix}G_A & (G^\prime_A)^{-1} B \\ 0 & G_S\end{bmatrix} \)

To simulate MVN data, we let z be uncorrelated MVN(0, I) data and define
\(x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix}G^\prime_A & 0\\B^\prime G^{-1}_A & G^\prime_S\end{bmatrix} \begin{bmatrix}z_1 \\ z_2\end{bmatrix} = \begin{bmatrix}G^\prime_A z_1 \\ B^\prime G^{-1}_A z_1 + G^\prime_S z_2\end{bmatrix} \)

The previous equation indicates that you can generate correlated MVN(0, Σ) data if you know the Cholesky decomposition of the upper-left block of Σ (which is G_A) and the Cholesky decomposition of the Schur complement (which is G_S). Each of the matrices in these equations is of size d/2, which means that the computations are performed on matrices that are half the size of Σ. However, notice that generating the x2 components requires some extra work: the Schur complement involves the inverse of A, and the formula for x2 also involves the inverse of G_A. These issues are not insurmountable, but they mean that the block algorithm is more complicated than the original algorithm, which merely multiplies z by the Cholesky root of Σ.

A SAS program for block simulation of MVN data

The SAS/IML program in a previous section shows how to generate correlated MVN(0, Σ) data (x) from uncorrelated MVN(0, I) data (z) by using the Cholesky root (G) of Σ. This section shows how to get exactly the same x values by performing block operations. For the block operations, each matrix is size d/2, so has 1/4 of the elements of Σ.

The first step is to represent Σ and z in block form.

/* The Main Idea: get the same MVN data by performing block 
   operations on matrices that are size d/2 */
/* 1. Represent Sigma={A B, B` C} and z={z1, z2} in block form */
d1 = ceil(d/2);           /* half the variables */
d2 = d - d1;              /* the other half of the vars */
dp1 = d1 + 1;
 
A = Sigma[1:d1,  1:d1];   /* break up the symmetric (d x d) covariance */
B = Sigma[1:d1,  dp1:d];  /* matrix into blocks of size d/2 */
C = Sigma[dp1:d, dp1:d];
z1 = z[1:d1, ];           /* rows for the first d1 variables of the MVN(0,I) data */
z2 = z[dp1:d, ];          /* rows for the last d2 variables of the MVN(0,I) data */

It is easy to generate x1, which contains the first d/2 components of the MVN(0, Σ) simulated data. You simply use the Cholesky decomposition of A, which is the upper-left block of Σ:

/* 2. Compute Cholesky root of A and compute x1 from z1 */
G_A = root(A);            /* Cholesky of upper left block */
x1 = G_A`*z1;             /* generate first half of variables */

It is not as easy to generate x2, which contains the last d/2 components. In block form, \(x_2 = v + u\), where \(v = B^\prime G_A^{-1} z_1\) and \(u = G_S^\prime z_2\), and where \(S = G_S^\prime G_S\) is the Cholesky decomposition of the Schur complement. This is shown in the following statements:

/* 3. Generate the second half of variables */
S = C - B`*inv(A)*B;      /* S is the Schur complement */
G_S = root(S);            /* Cholesky root of Schur complement */
v = B`*inv(G_A)*z1;
u = G_S`*z2;
x2 = v + u;
 
/* verify that the computation gives exactly the same simulated data */
print (max(abs(x[1:d1,] - x1)))[L="Diff(x - x1)"];
print (max(abs(x[dp1:d,] - x2)))[L="Diff(x - x2)"];

The output shows that the block computation, which uses multiple matrices of size d/2, gives exactly the same answer as the original computation, which uses matrices of size d.

Analysis of the block algorithm

Now that the algorithm runs on a small example, you can increase the dimensions of the matrices. For example, try rerunning the algorithm with the following parameters:

%let d = 5000;    /* number of variables */
%let N = 1000;    /* number of observations */

When I run this larger program on my PC, it takes about 2.3 seconds to simulate the MVN data by using the full Cholesky decomposition. It takes about 17 seconds to simulate the data by using the block method. Most of the time for the block method is spent on the computation v = B`*inv(G_A)*z1, which takes about 15 seconds. So if you want to improve the performance, you should focus on improving that computation.

You can improve the performance of the block algorithm in several ways:
  • You don't need to find the inverse of A. You have the Cholesky root of A, which is triangular, so you can define H = inv(G_A`)*B and compute the Schur complement as C - H`*H, as sketched in the code after this list.
  • You don't need to explicitly form any inverses. Ultimately, all these matrices are going to operate on the vector z2, which means that you can simulate the data by using only matrix multiplication and solving linear equations.
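
Here is a minimal sketch of those two ideas in SAS/IML. It uses the TRISOLV function to solve triangular systems instead of forming inverses and assumes that B, C, G_A, z1, and z2 exist from the previous steps:

/* sketch: avoid explicit inverses by solving triangular systems */
H   = trisolv(2, G_A, B);       /* solves G_A`*H = B, so H = inv(G_A`)*B */
S   = C - H`*H;                 /* Schur complement without inv(A)       */
G_S = root(S);                  /* Cholesky root of the Schur complement */
w   = trisolv(1, G_A, z1);      /* solves G_A*w = z1, so w = inv(G_A)*z1 */
x2  = B`*w + G_S`*z2;           /* same values as v + u computed earlier */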

I tried several techniques, but I could not make the block algorithm competitive with the performance of the original Cholesky algorithm. The issue is that the block algorithm computes two Cholesky decompositions, solves some linear equations, and performs matrix multiplication. Even though each computation is on a matrix that is half the size of the original matrix, the additional computations result in a longer runtime.

Generalization to additional blocks

The algorithm generalizes to other block sizes. I will outline the case for a 3 x 3 block matrix. The general case of k x k blocks follows by induction.

The figure to the right shows the Σ covariance matrix as a 3 x 3 block matrix. By using the 2 x 2 block algorithm, you can obtain multivariate normal vectors x1 and x2, which correspond to the covariances in the four blocks in the upper-left corner of the matrix. You can then view the upper four blocks as a single (larger) block, thus obtaining a 2 x 2 block matrix where the upper-left block has size 2d/3 and the lower-right block (Σ33) has size d/3.

There are many details to work through, but the main idea is that you can repeat the computation on this new block matrix where the block sizes are unequal. To proceed, you need the Cholesky decomposition of the large upper-left block, but we have already shown how to compute that in terms of the Cholesky root of Σ11, its inverse, and the Cholesky root of the Schur complement of Σ22. You also need to form the Schur complement of Σ33, but that is feasible because we have the Cholesky decomposition of the large upper-left block.

I have not implemented the details because it is not clear to me whether three or more blocks provides a practical advantage. Even if I work out the details for k=3, the algebra for k > 3 blocks seems to be daunting.

Summary

It is well known that you can use the Cholesky decomposition of a covariance matrix to simulate data from a correlated multivariate normal distribution. This article shows how to break up the task by using a block Cholesky method. The method is implemented for k=2 blocks. The block method requires more matrix operations, but each operation is performed on a smaller matrix. Although the block method takes longer to run, it can be used to generate a large number of variables by generating k variables at a time, where k is small enough that the k x k matrices can be easily handled.

The block method generalizes to k > 2 blocks, but it is not clear whether more blocks provide any practical benefits.

The post A block-Cholesky method to simulate multivariate normal data appeared first on The DO Loop.

December 6, 2021
 

While discussing how to compute convex hulls in SAS with a colleague, we wondered how the size of the convex hull compares to the size of the sample. For most distributions of points, I claimed that the size of the convex hull is much less than the size of the original sample. This article looks at the expected value of the proportion p = k / N, where k is the size of the convex hull for a set of N planar points. The proportion depends on the distribution of the points. This article shows the results for three two-dimensional distributions: the uniform distribution on a rectangle, the normal distribution, and the uniform distribution on a disk.

Monte Carlo estimates of the expected number of points on a convex hull

You can use Monte Carlo simulation to estimate the expected number of points on a convex hull. First, specify the distribution of the points and the size of the sample, N. Then do the following:

  • Simulate N random points from the distribution and compute the convex hull of these points. This gives you a value, k, for this one sample.
  • Repeat the simulation B times. You will have B values of k: {k1, k2, ..., kB}.
  • The average of the B values is an estimate of the expected value for the number of points on the convex hull for a random set of size N.
  • If you prefer to think in terms of proportions, the ratio k / N is an estimate of the expected proportion of points on the convex hull.

Simulate uniform random samples for N=200

Before you run a full simulation study, you should always run a smaller simulation to ensure that your implementation is correct. The following SAS/IML statements generate 1000 samples of size N=200 from the uniform distribution on the unit square. For each sample, you can compute the number of points on the convex hull. You can then plot the distribution of the number of points on the convex hull, as follows:

proc iml;
call randseed(1);
 
/* 1. Convex hull of N=200 random uniformly distributed 
      points in the unit square [0,1] x [0,1] */
NRep = 1000;            /* number of Monte Carlo samples */
Distrib = "Uniform";
N = 200;                /* sample size */
k = j(NRep, 1, 0);      /* store the number of points on CH */
X = j(N, 2, 0);         /* allocate matrix for sample */
do j = 1 to NRep;
   call randgen(X, Distrib);  /* random sample */
   Indices = cvexhull(X);
   k[j] = sum(Indices>0);     /* the positive indices are on CH */
end;
 
Ek = mean(k);           /* MC estimate of expected value */
print Ek;
title "k = Number of Points on Convex Hull; N = 200";
call bar(k) other="inset ('E[k] =' = '13.88') / border;";

The bar chart shows the distribution of the size of the convex hull for the 1,000 simulated samples. For most samples, the convex hull contains between 12 and 15 points. The expected value is the average of these values, which is 13.88 for this simulation.

If you want to compare samples of different sizes, it is useful to standardize the estimated quantity. Rather than predict the number of points on the convex hull, it is useful to predict the proportion of the sample that lies on the convex hull. Furthermore, we should report not only the expected value but also a confidence interval. This makes it clearer that the proportion for any one sample is likely to be within a range of values:

p = k / N;              /* proportion of points on the CH */
mean = mean(p);         /* expected proportion */
call qntl(CI, p, {0.025, 0.975}); /* 95% confidence interval */
print mean CI;
 
refStmt = 'refline ' + char(mean//CI) + '/ axis=X lineattrs=(color=red);';
title "Proportion of Points on Convex Hull, 1000 Random Samples";
title2 "X ~ U(Square); N = 200";
call histogram(p) other=refStmt label='Proportion';

For this distribution of points, you should expect about 7% of the sample to lie on the convex hull. An empirical 95% confidence interval is between 5% and 9%.

According to an unpublished preprint by Har-Peled ("On the Expected Complexity of Random Convex Hulls", 2011), the asymptotic formula for the expected number of points on the convex hull of the unit square is O(log(n)).

Simulate samples for many sizes

The previous section simulated B samples for size N=200. This section repeats the simulation for values of N between 300 and 8,000. Instead of plotting histograms, I plot the expected value and a 95% confidence interval for each value of N. So that the simulation runs more quickly, I have set B=250. The following SAS/IML program simulates B samples for various values of N, all drawn from the uniform distribution on the unit square:

/* 2. simulation for X ~  U(Square), N = 300..8000 */
NRep = 250;  /* Monte Carlo simulation: use 250 samples of size N */
Distrib = "Uniform";
N = do(300, 900, 100) || do(1000, 8000, 500);  /* sample sizes */
k = j(NRep, ncol(N), 0);         /* store the number of points on CH */
do i = 1 to ncol(N);
   X = j(N[i], 2, 0);            /* allocate matrix for sample */
   do j = 1 to NRep;
      call randgen(X, Distrib);  /* random sample */
      Indices = cvexhull(X);
      k[j,i] = sum(Indices>0);   /* the positive indices are on CH */
   end;
end;
 
p = k / N;                       /* proportion of points on the CH */
mean = mean(p);                  /* expected proportion */
call qntl(CI, p, {0.025, 0.975});  /* 95% confidence interval */

You can use the CREATE statement to write the results to a SAS data set. You can use the SUBMIT/ENDSUBMIT statements to call PROC SGPLOT to visualize the result. Because I want to create the same plot several times, I will define a subroutine that creates the graph:

/* define subroutine that creates a scatter plot with confidence intervals */
start HiLow(N, p, lower, upper);
    Ek = N#p;  /* expected value of k */
    create CH var {N p lower upper Ek}; append; close;
    submit;
      proc sgplot data=CH noautolegend;
        format Ek 2.0;
        highlow x=N low=lower high=upper;
        scatter x=N y=p / datalabel=Ek;
        xaxis grid label='Sample Size (N)' offsetmax=0.08;
        yaxis grid label='Proportion on Convex Hull';
      run;
    endsubmit;
finish;
 
title "Expected Number of Points on Convex Hull of Sample of Size N";
title2 "X ~ U(Square)";

12月 062021
 

While discussing how to compute convex hulls in SAS with a colleague, we wondered how the size of the convex hull compares to the size of the sample. For most distributions of points, I claimed that the size of the convex hull is much less than the size of the original sample. This article looks at the expected value of the proportion p = k / N, where k is the size of the convex hull for a set of N planar points. The proportion depends on the distribution of the points. This article shows the results for three two-dimensional distributions: the uniform distribution on a rectangle, the normal distribution, and the uniform distribution on a disk.

Monte Carlo estimates of the expected number of points on a convex hull

You can use Monte Carlo simulation to estimate the expected number of points on a convex hull. First, specify the distribution of the points and the size of the sample, N. Then do the following:

  • Simulate N random points from the distribution and compute the convex hull of these points. This gives you a value, k, for this one sample.
  • Repeat the simulation B times. You will have B values of k: {k1, k2, ..., kB}.
  • The average of the B values is an estimate of the expected value for the number of points on the convex hull for a random set of size N.
  • If you prefer to think in terms of proportions, the ratio k / N is an estimate of the expected proportion of points on the convex hull.

Simulate uniform random samples for N=200

Before you run a full simulation study, you should always run a smaller simulation to ensure that your implementation is correct. The following SAS/IML statements generate 1000 samples of size N=200 from the uniform distribution on the unit square. For each sample, you can compute the number of points on the convex hull. You can then plot the distribution of the number of points on the convex hull, as follows:

proc iml;
call randseed(1);
 
/* 1. Convex hull of N=200 random uniformly distributed 
      points in the unit square [0,1] x [0,1] */
NRep = 1000;            /* number of Monte Carlo samples */
Distrib = "Uniform";
N = 200;                /* sample size */
k = j(NRep, 1, 0);      /* store the number of points on CH */
X = j(N, 2, 0);         /* allocate matrix for sample */
do j = 1 to NRep;
   call randgen(X, Distrib);  /* random sample */
   Indices = cvexhull(X);
   k[j] = sum(Indices>0);     /* the positive indices are on CH */
end;
 
Ek = mean(k);           /* MC estimate of expected value */
print Ek;
title "k = Number of Points on Convex Hull; N = 200";
call bar(k) other="inset ('E[k] =' = '13.88') / border;";

The bar chart shows the distribution of the size of the convex hull for the 1,000 simulated samples. For most samples, the convex hull contains between 12 and 15 points. The expected value is the average of these values, which is 13.88 for this simulation.
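The Monte Carlo estimate of E[k] is itself a random quantity, so it is useful to know how precise it is. As a small optional addition (a sketch that assumes k and NRep from the program above are still defined in the PROC IML session), you can compute the Monte Carlo standard error of the estimate:

/* optional: Monte Carlo standard error of the estimate of E[k]
   (assumes k and NRep from the program above are still defined) */
MCse = sqrt( var(k) / NRep );
print MCse;

The standard error decreases at the rate 1/sqrt(NRep), so you can make the estimate as precise as you like by increasing the number of Monte Carlo samples.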

If you want to compare samples of different sizes, it is useful to standardize the estimated quantity. Rather than predict the number of points on the convex hull, it is useful to predict the proportion of the sample that lies on the convex hull. Furthermore, we should report not only the expected value but also a confidence interval. This makes it clearer that the proportion for any one sample is likely to be within a range of values:

p = k / N;              /* proportion of points on the CH */
mean = mean(p);         /* expected proportion */
call qntl(CI, p, {0.025, 0.975}); /* 95% confidence interval */
print mean CI;
 
refStmt = 'refline ' + char(mean//CI) + '/ axis=X lineattrs=(color=red);';
title "Proportion of Points on Convex Hull, 1000 Random Samples";
title2 "X ~ U(Square); N = 200";
call histogram(p) other=refStmt label='Proportion';

For this distribution of points, you should expect about 7% of the sample to lie on the convex hull. An empirical 95% confidence interval is between 5% and 9%.

According to an unpublished preprint by Har-Peled ("On the Expected Complexity of Random Convex Hulls", 2011), the asymptotic formula for the expected number of points on the convex hull of the unit square is O(log(n)).

Simulate samples for many sizes

The previous section simulated B samples for size N=200. This section repeats the simulation for values of N between 300 and 8,000. Instead of plotting histograms, I plot the expected value and a 95% confidence interval for each value of N. So that the simulation runs more quickly, I have set B=250. The following SAS/IML program simulates B samples for various values of N, all drawn from the uniform distribution on the unit square:

/* 2. simulation for X ~  U(Square), N = 300..8000 */
NRep = 250;  /* Monte Carlo simulation: use 250 samples of size N */
Distrib = "Uniform";
N = do(300, 900, 100) || do(1000, 8000, 500);  /* sample sizes */
k = j(NRep, ncol(N), 0);         /* store the number of points on CH */
do i = 1 to ncol(N);
   X = j(N[i], 2, 0);            /* allocate matrix for sample */
   do j = 1 to NRep;
      call randgen(X, Distrib);  /* random sample */
      Indices = cvexhull(X);
      k[j,i] = sum(Indices>0);   /* the positive indices are on CH */
   end;
end;
 
p = k / N;                       /* proportion of points on the CH */
mean = mean(p);                  /* expected proportion */
call qntl(CI, p, {0.025, 0.975});  /* 95% confidence interval */

You can use the CREATE statement to write the results to a SAS data set. You can use the SUBMIT/ENDSUBMIT statements to call PROC SGPLOT to visualize the result. Because I want to create the same plot several times, I will define a subroutine that creates the graph:

/* define subroutine that creates a scatter plot with confidence intervals */
start HiLow(N, p, lower, upper);
    Ek = N#p;  /* expected value of k */
    create CH var {N p lower upper Ek}; append; close;
    submit;
      proc sgplot data=CH noautolegend;
        format Ek 2.0;
        highlow x=N low=lower high=upper;
        scatter x=N y=p / datalabel=Ek;
        xaxis grid label='Sample Size (N)' offsetmax=0.08;
        yaxis grid label='Proportion on Convex Hull';
      run;
    endsubmit;
finish;
 
title "Expected Number of Points on Convex Hull of Sample of Size N";
title2 "X ~ U(Square)";
run HiLow(N, mean, CI[1,], CI[2,]);

The graph shows the expected proportion of points on the convex hull for samples of size N. The expected number of points (rounded to the nearest integer) is shown adjacent to each marker. For example, you should expect a sample of size N=2000 to have about 1% of its points (k=20) on the convex hull. As the sample size grows larger, the proportion decreases. For N=8000, the expected proportion is only about 0.3%, or 23.5 points on the convex hull.
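Har-Peled's result says that the expected number of hull points for uniform data in a square grows like log(n). As an informal check (a sketch that assumes the vectors N and mean from the program above are still defined), you can fit the simulated expected counts to the model a + b*log(N):

/* informal check of the O(log(n)) growth
   (assumes N and mean from the program above are still defined) */
Ek = (N # mean)`;                     /* column of simulated expected counts */
M  = j(nrow(Ek), 1, 1) || log(N`);    /* design matrix: intercept and log(N) */
b  = solve(M`*M, M`*Ek);              /* least-squares coefficients          */
print (b`)[colname={"Intercept" "Slope"} label="Fit of E[k] to a + b*log(N)"];

If the logarithmic model is reasonable, the fitted curve a + b*log(N) should track the labeled counts in the graph across the whole range of N.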

Simulate samples from the bivariate normal distribution

The previous graph shows the results for points drawn uniformly at random from the unit square. If you change the distribution of the points, you will change the graph. For example, a bivariate normal distribution has most of its points deep inside the convex hull and very few points near the convex hull. Intuitively, you would expect the ratio k / N to be smaller for bivariate normal data, as compared to uniformly distributed data.

You can analyze bivariate normal data by changing only one line of the previous program. Simply change the statement Distrib = "Uniform" to Distrib = "Normal". With that modification, the program creates the following graph:

For bivariate normal data, the graph shows that the expected proportion of points on the convex hull is much less than for data drawn from the uniform distribution on the square. For a sample of size N=2000, you should expect about 0.6% of its points (k=12) on the convex hull. For N=8000, the expected proportion is only about 0.2%, or 13.7 points on the convex hull.

Simulate samples from the uniform distribution on the disk

Let's examine one last distribution: the uniform distribution on the unit disk. I have previously shown how to generate random uniform points from the disk, or, in fact, from any d-dimensional ball. The following program defines the RandBall subroutine, which generates random samples from the unit disk. The function is used instead of the RANDGEN subroutine:

/* 4. random uniform in disk: X ~ U(Disk)
   https://blogs.sas.com/content/iml/2016/03/30/generate-uniform-2d-ball.html */
start RandBall(X, radius=1);
   N = nrow(X); d = ncol(X);
   call randgen(X, "Normal");       /* X ~ MVN(0, I(d)) */
   u = randfun(N, "Uniform");       /* U ~ U(0,1)       */
   r = radius * u##(1/d);           /* r proportional to d_th root of U */
   X = r # X / sqrt(X[,##]);        /* X[,##] is sum of squares for each row */
finish;
 
NRep = 250;  /* Monte Carlo simulation: use 250 samples of size N */
N = do(300, 900, 100) || do(1000, 8000, 500);  /* sample sizes */
k = j(NRep, ncol(N), 0);         /* store the number of points on CH */
do i = 1 to ncol(N);
   X = j(N[i], 2, 0);            /* allocate matrix for sample */
   do j = 1 to NRep;
      call RandBall(X);          /* random sample */
      Indices = cvexhull(X);
      k[j,i] = sum(Indices>0);   /* the positive indices are on CH */
   end;
end;
 
p = k / N;                       /* proportion of points on the CH */
mean = mean(p);                  /* expected proportion */
call qntl(CI, p, {0.025, 0.975});  /* 95% confidence interval */
 
title "Expected Number of Points on Convex Hull of Sample of Size N";
title2 "X ~ U(Disk)";
run HiLow(N, mean, CI[1,], CI[2,]);

For this distribution, a greater proportion of points are on the convex hull, as compared to the other examples. For a sample of size N=2000, you should expect about 2% of its points (k=42) on the convex hull. For N=8000, the expected proportion is about 0.8%, or 67.7 points on the convex hull.

According to Har-Peled (2011), the asymptotic formula for the expected number of points on the convex hull of the unit disk is O(n^(1/3)).
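You can run the same kind of informal check as for the square (again, a sketch that assumes N and mean from the disk program are still defined), this time fitting the expected counts to a multiple of N^(1/3):

/* informal check of the O(n^(1/3)) growth
   (assumes N and mean from the program above are still defined) */
Ek = (N # mean)`;                 /* column of simulated expected counts             */
w  = (N`)##(1/3);                 /* cube-root regressor                             */
c  = sum(w # Ek) / sum(w # w);    /* least-squares fit of the model E[k] = c*N^(1/3) */
print c[label="Coefficient c in E[k] = c*N^(1/3)"];

A single coefficient that reproduces the labeled counts across the range of N supports the cube-root growth.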

Linear transformations of convex hulls

Although I ran simulations for the uniform distribution on a square and on a disk, the same results hold for arbitrary rectangles and ellipses. It is not hard to show that if X is a set of points and C(X) is its convex hull, then for any affine transformation T: R^2 → R^2, the set T(C(X)) is the convex hull of T(X). Consequently, if T is invertible, the number of points on the convex hull of T(X) equals the number of points on the convex hull of X.

In the same way, the simulation for bivariate normal data (which used uncorrelated variates) remains valid for correlated normal data. This is because correlated normal variates are obtained from uncorrelated variates by applying the Cholesky transformation, which is an invertible linear map.
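Here is a minimal sketch that demonstrates the invariance numerically: it computes the convex hull of a uniform sample before and after applying an invertible affine map and compares the hull counts. (The matrix A and the translation vector c are arbitrary choices for illustration.)

/* demonstrate that an invertible affine map does not change the hull count */
proc iml;
call randseed(1);
N = 500;
X = j(N, 2, 0);
call randgen(X, "Uniform");             /* X ~ U(unit square)                 */
A = {2 1, 0 3};                         /* invertible linear map (det = 6)    */
c = {10 -5};                            /* translation vector                 */
Y = X*A` + repeat(c, N, 1);             /* affine image of the sample         */
kX = sum(cvexhull(X) > 0);              /* number of hull points, original    */
kY = sum(cvexhull(Y) > 0);              /* number of hull points, transformed */
print kX kY;                            /* the two counts agree               */
quit;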

Summary

This article uses simulation to understand how the size of a convex hull relates to the size of the underlying sample, drawn randomly from a 2-D distribution. Convex hulls are preserved under linear transformations, so you can simulate from a standardized distribution. This article presents results for three distributions: the uniform distribution on a rectangle, the bivariate normal distribution, and the uniform distribution on an ellipse.

The post The expected number of points on a convex hull appeared first on The DO Loop.

11月 082021
 

Recall that the binomial distribution is the distribution of the number of successes in a set of independent Bernoulli trials, each having the same probability of success. Most introductory statistics textbooks discuss the approximation of the binomial distribution by the normal distribution. The graph to the right shows that the normal density (the red curve, N(μ=9500, σ=21.79)) can be a very good approximation to the binomial density (blue bars, Binom(p=0.95, nTrials=10000)). However, the binomial distribution is discrete: its density is defined only for the integers 0, 1, ..., nTrials, whereas the normal density is defined for all real numbers.

In this graph, the binomial density and the normal density are close. But what does the approximation look like if you overlay a bar chart of a random sample from the binomial distribution? It turns out that the bar chart can have large deviations from the normal curve, even for a large sample. Read on for an example.

The normal approximation to the binomial distribution

I have written two previous articles that discuss the normal approximation to the binomial distribution.

The normal approximation is used to estimate probabilities because it is often easier to use the area under the normal curve than to sum many discrete values. However, as shown in the second article, the discrete binomial distribution can have statistical properties that are different from the normal distribution.
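As a concrete illustration, the following sketch compares an exact binomial probability to its normal approximation for the Binom(p=0.95, nTrials=10000) distribution. (The cutoff value 9480 is an arbitrary choice for illustration; the half-unit shift is the usual continuity correction.)

/* compare an exact binomial probability to its normal approximation */
data _null_;
   p = 0.95;  n = 10000;
   mu = n*p;                                  /* mean of the binomial: np = 9500          */
   sigma = sqrt(n*p*(1-p));                   /* std dev: sqrt(npq), about 21.79          */
   exact  = cdf("Binomial", 9480, p, n);      /* P(X <= 9480), exact                      */
   approx = cdf("Normal", 9480.5, mu, sigma); /* normal approx with continuity correction */
   put exact= approx=;
run;

The two probabilities agree closely, which is why the normal approximation is so convenient in practice.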

Random binomial samples

It is important to remember that a random sample (from any distribution!) can look much different from the underlying probability density function. The following graph shows a random sample from the binomial distribution Binom(0.95, 10000). The distribution of the sample looks quite different from the density curve.

In statistical terms, the observed frequencies can deviate substantially from the expected values. Notice also that adjacent bars can differ considerably: in the graph, some bars contain about 2% of the total frequency, whereas an adjacent bar might contain half that value. I was surprised to observe such large deviations in a large sample.

This graph emphasizes the fact that a random sample from the binomial distribution can look different from the smooth bell-shaped curve of the probability density. I think I was surprised by the magnitude of the deviations from the expected values because I have more experience visualizing continuous distributions. For a continuous distribution, we use a histogram to display the empirical distribution. When the bin width of the histogram is greater than 1, the binning averages out the differences between adjacent bars and produces a much smoother estimate of the density. For example, the following graph displays the distribution of the same random sample but uses a bin width of 10 to aggregate the frequency. The resulting histogram is much smoother and resembles the normal approximation.
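A quick calculation helps explain the size of the bar-to-bar variation. The height of each bar is a sample proportion that is estimated from only N=1000 draws, so a bar whose true height is about 2% has a standard error of about sqrt(0.02*0.98/1000), which is roughly 0.44 percentage points:

/* standard error of the height of a single bar (a back-of-the-envelope sketch) */
data _null_;
   N = 1000;                       /* sample size used for the bar chart            */
   pj = 0.02;                      /* height of a typical bar, as seen in the graph */
   se = sqrt(pj*(1-pj)/N);         /* standard error of the estimated proportion    */
   put "SE of a 2% bar: " se percent8.2;
run;

Consequently, a difference of a full percentage point between adjacent bars is consistent with ordinary sampling variation.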

Summary

The normal approximation is a convenient way to compute probabilities for the binomial distribution. However, it is important to remember that the binomial distribution is a discrete distribution. A binomial random variable can assume only integer values between 0 and the number of trials. One consequence of this observation is that a bar chart of a random binomial sample can show considerable deviations from the theoretical density. This is normal (pun intended!). It is often overlooked because if you treat the random variable as if it were continuous and use a histogram to estimate the density, the histogram smooths out a lot of the bar-to-bar deviations.

Appendix

The following SAS program creates the graphs in this article.

/* define parameters for the Binom(p=0.95, nTrials=10000) simulation */
%let p = 0.95;                    /* probability of success */
%let NTrials = 10000;             /* number of trials */
%let N = 1000;                    /* sample size */   
 
/* First graph: Compute the density of the Normal and Binomial distributions. See
   https://blogs.sas.com/content/iml/2016/09/12/overlay-curve-bar-chart-sas.html
*/
data Binom;
n = &nTrials;  p = &p;  q = 1 - p;
mu = n*p;  sigma = sqrt(n*p*q);   /* parameters for the normal approximation */
Lower = mu-3.5*sigma;             /* evaluate normal density on [Lower, Upper] */
Upper = mu+3.5*sigma;
 
/* PDF of normal distribution */
do t = Lower to Upper by sigma/20;
   Normal = pdf("normal", t, mu, sigma);       output;
end;
 
/* PMF of binomial distribution */
t = .; Normal = .;        /* these variables are not used for the bar chart */
do j = max(0, floor(Lower)) to ceil(Upper);
   Binomial = pdf("Binomial", j, p, n);      output;
end;
/* store mu and sigma in macro variables */
call symput("mu", strip(mu));      
call symput("sigma", strip(round(sigma,0.01)));
label Binomial="Binomial Probability"  Normal="Normal Density";
keep t Normal j Binomial;
run;
 
/* overlay binomial density (needle plot) and normal density (series plot) */
title "Binomial Probability and Normal Approximation";
title2 "Binom(0.95, 10000) and N(9500, 21.79)";
proc sgplot data=Binom;
   needle x=j y=Binomial;
   series x=t y=Normal / lineattrs=GraphData2(thickness=2);
   inset "p = &p"  "q = %sysevalf(1-&p)" "nTrials = &nTrials"  
         "(*ESC*){unicode mu} = np = &mu"           /* use Greek letters */
         "(*ESC*){unicode sigma} = sqrt(npq) = &sigma" /
         position=topright border;
   yaxis label="Probability";
   xaxis label="x" integer;
run;
 
/*************************/
 
/* Second graph: simulate a random sample from Binom(p, NTrials) */
data Bin(keep=x);
call streaminit(1234);
do i = 1 to &N; 
   x = rand("Binomial", &p, &NTrials); 
   output;
end;
run; 
 
/* count the frequency of each observed count */
proc freq data=Bin noprint;
   tables x / out=FreqOut;
run;
 
data All;
set FreqOut Binom(keep=t Normal);
Normal = 100*1*Normal;   /* match scales: 100*h*PDF, where h=binwidth */
run;
 
/* overlay sample and normal approximation */
title "Random Binomial Sample";
title2 "Bar Chart of Binom(0.95, 10000)";
proc sgplot data=All;   
   needle x=x y=Percent;
   series x=t y=Normal / lineattrs=GraphData2(thickness=2);
   inset "n = &n"  "p = &p" 
         "(*ESC*){unicode mu} = &mu"           /* use Greek letters */
         "(*ESC*){unicode sigma} = &sigma" /
         position=topright border;
   yaxis label="Percent / Scaled Density";
   xaxis label="x" integer;
run;
 
/*************************/
 
/* Third graph: the bar-to-bar deviations are smoothed if you use a histogram */
title2 "Histogram BinWidth=10";
proc sgplot data=Bin;   
   histogram x / scale=percent binwidth=10;
   xaxis label="x" integer values=(9400 to 9600 by 20);
run;
 
/* for comparison, the histogram looks like the bar chart if you set BINWIDTH=1 */
title2 "Histogram BinWidth=1";
proc sgplot data=Bin;   
   histogram x / binwidth=1 scale=percent;
   xaxis label="x" integer;
run;

The post The normal approximation and random samples of the binomial distribution appeared first on The DO Loop.