Statistical Thinking

February 22, 2023
 

The metalog family of distributions (Keelin, Decision Analysis, 2016) is a flexible family that can model a wide range of continuous univariate data distributions when the data-generating mechanism is unknown. This article provides an overview of the metalog distributions. A subsequent article shows how to download and use a library of SAS IML functions that enable you to use the metalog distribution to model data in SAS.

The metalog distribution

There are dozens of continuous probability distributions that are commonly used to model univariate data. Examples include the normal, lognormal, Weibull, gamma, and beta distributions. (These are sometimes called "named" distributions.) Given a set of data, which distribution should you use? Ideally, you would use domain-specific knowledge to decide which distribution models the data-generating process. In practice, the process that generates the data is often unknown. Consequently, some people fit many different distributions to the data and use a goodness-of-fit statistic to choose the distribution that best fits the data.

Unfortunately, this "scattershot method" is neither reliable nor principled. Furthermore, sometimes the usual well-known distributions are not sufficiently flexible to model a set of data. This is shown by the graph to the right, which shows three different distributions fit to the same data. It is not clear that one model is superior to the others.

So that modelers do not have to fit many "named" distributions and choose the best, researchers have developed flexible systems of distributions that can model a wide range of shapes. Popular systems include the following:

  • The Pearson system: Karl Pearson used the normal distribution and seven other families to construct a system that can match any sample skewness and kurtosis. In SAS, you can use PROC SIMSYSTEM to fit a Pearson distribution to data.
  • The Johnson system: Norman Johnson used the normal distribution, the lognormal distribution, an unbounded distribution, and a bounded distribution to match any sample skewness and kurtosis. In SAS, you can use PROC UNIVARIATE or PROC SIMSYSTEM to fit a Johnson distribution to data.
  • The Fleishman system: The book Simulating Data with SAS (Wicklin, 2013) includes SAS IML modules that can fit the Fleishman system (Fleishman, 1978) to data.

The metalog system is a more recent family of distributions that can model many different shapes for bounded, semibounded, and unbounded continuous distributions. To use the metalog system, do the following:

  1. Choose the type of the distribution from the four available types: unbounded, semibounded with a lower bound, semibounded with an upper bound, or bounded. This choice is often based on domain-specific knowledge of the data-generating process.
  2. Choose the number of terms (k) to use in the metalog model. A small number of terms (3 ≤ k ≤ 6) results in a smooth model. A larger number of terms (k ≥ 7) can fit data distributions for which the density appears to have multiple peaks.
  3. Fit the quantile function of the metalog distribution to the data. This estimates the k parameters in the model.

For example, the following graph shows the PDF of a 5-term unbounded metalog model overlaid on the same histogram that was previously shown. This model seems to fit the data better than the "named" distributions.

Advantages and drawbacks of using the metalog distribution

The metalog distribution has the following advantages:

  • Because the metalog distribution can have many parameters, it is very flexible and can fit a wide range of shapes, including multimodal distributions.
  • Keelin (2016) uses ordinary least squares (OLS) to fit the data to the quantile function of the metalog distribution. Thus, the fitting process is simple to implement.
  • Because the quantile function has an explicit formula, it is simple to simulate data from the model by using the inverse CDF method of simulating data.

The metalog distribution has one primary disadvantage: the fitted metalog coefficients might not correspond to a valid distribution. An invalid model is called infeasible. Recall that a quantile function is always monotonically increasing as a function of the probability on the open interval (0,1). However, for an infeasible model, the quantile function is not monotonic. Equivalently, the probability density function (PDF) of the model contains nonpositive values.

Infeasibility is somewhat complicated, but one way to think about it is to recall that the moments of a metalog model are related to its coefficients. If the coefficients lead to impossible moments (for example, a negative variance), then the model is infeasible.

When fitting a metalog model to data, you should always verify that the resulting model is feasible. If it is not, reduce the number of terms in the model. Some data cannot be fit when k > 2. For example, data that have one or more extreme observations often lead to an infeasible set of parameter estimates.

The following panel of graphs shows six models for the same data. The models show increasing complexity as the number of terms increases. If you include six terms, the model becomes bimodal. However, the 9-term model is infeasible because the PDF near x=15 has a negative value, which is circled in the last cell of the panel.

Definitions of the metalog distribution

Most common continuous probability distributions are parameterized on the data scale. That is, if X is a continuous random variable, then the CDF is a function that is defined on the set of outcomes for X. You can write the CDF as some increasing function p = F(x).

From the definition of a continuous CDF, you can obtain the quantile function and the PDF. The quantile function is defined as the inverse CDF: \(x = F^{-1}(p)\) for p in (0,1). The PDF is defined as the derivative of the CDF with respect to x.

Keelin (2016) specifies the definitions of the metalog quantile function and PDF function. The metalog distribution reverses the usual convention by parameterizing the quantile function as x = Q(p). From the quantile function, you can obtain the CDF, which is the inverse of the quantile function. From the CDF, you can obtain the PDF. From the quantile function, you can generate random variates. Thus, knowing the quantile function is the key to working with the metalog distribution.
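To see how the PDF follows from the quantile function, differentiate the identity \(p = F(Q(p))\) with respect to p. The chain rule gives
\(1 = f(Q(p)) \, Q^\prime(p)\), so \(f(Q(p)) = 1 / Q^\prime(p)\)
Consequently, an explicit formula for Q(p) and its derivative provides the PDF of the model, expressed as a function of p.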

The unbounded metalog distribution

When you fit an unbounded metalog model, you estimate the parameters for which the model's predicted values (p, Q(p)) are close (in a least squares sense) to the data pairs (p, x). Here, p is a vector of cumulative probabilities, x is a vector of observed data values, and Q(p) is the quantile function of a metalog distribution.

The predicted values are obtained by using a linear regression model. The unbounded model has a design matrix that uses the following basis functions:

  • a vector of 1s
  • a vector that contains the values logit(p) = log(p/(1-p))
  • a vector that contains the values c(p) = p - 1/2

Specifically, the columns of the design matrix involve powers of the vector c(p) and interactions with the term logit(p). If M = [M1 | M2 | ... | Mk] is the design matrix, then the columns are as follows:

  • M1 is a column of 1s.
  • M2 is the column logit(p).
  • M3 is the interaction term c(p) logit(p).
  • M4 is the linear term c(p).
  • M5 is the quadratic term \(c(p)^2\).
  • M6 is the interaction term \(c(p)^2\) logit(p).
  • Additional columns are of the form \(c(p)^j\) or \(c(p)^j\) logit(p).

The following image shows the basis functions M2-M7 for the metalog regression.

Given the design matrix, M, the least squares parameter estimates are the elements of the vector, b, which solves the matrix equation \(M^\prime M b = M^\prime x\). An estimate of the quantile function is given by Q(p) = M*b. By differentiating this expression with respect to p, you can obtain a formula for the PDF for the model (Keelin, 2016, Eq. 6, p. 254).
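The following SAS IML statements sketch the OLS fit for a 5-term unbounded metalog. This is a minimal illustration, not the library mentioned earlier: the input data set name (Sample) and the plotting positions p_i = (i - 0.5)/n are assumptions made for the example.

proc iml;
use Sample;  read all var {x};  close;       /* hypothetical input data set */
call sort(x);                                /* sort the data values */
n = nrow(x);
p = ( (1:n)` - 0.5 ) / n;                    /* assumed plotting positions in (0,1) */
Lgt = log( p/(1-p) );                        /* logit(p) */
c = p - 0.5;                                 /* centered probabilities, c(p) */
M = j(n,1,1) || Lgt || c#Lgt || c || c##2;   /* columns M1-M5 for a 5-term model */
b = solve( M`*M, M`*x );                     /* OLS estimates: solve M`M b = M`x */
Qhat = M*b;                                  /* estimated quantile function at p */
print b[label="Metalog coefficients (k=5)"];
quit;

After the fit, you should still check feasibility by verifying that the estimated quantile function is increasing on (0,1).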

From these equations, you can see that not every set of coefficients corresponds to an increasing quantile function. For example, when k=2, \(Q(p) = b_1 + b_2 \, \mathrm{logit}(p)\). To be an increasing function of p, it is necessary that \(b_2\) be strictly positive.

The semibounded and bounded metalog distributions

The semibounded and bounded metalog distributions are obtained by transforming the unbounded metalog model.

The semibounded metalog distribution on the interval [L, ∞) is obtained by fitting an unbounded k-term distribution to the transformed data z = log(x-L). This gives the quantile function, Q(p), for z. The quantile function for the semibounded distribution on [L, ∞) is then defined as Q_L(p) = L + exp(Q(p)) for p in (0,1), and Q_L(0) = L.

Similarly, the quantile function for the semibounded distribution on (-∞, U) is obtained by fitting an unbounded k-term distribution to the transformed data z = log(U-x). This gives the quantile function, Q(p), for z. The quantile function for the semibounded distribution is then defined as Q_U(p) = U - exp(-Q(p)) for p in (0,1), and Q_U(1) = U.

Finally, the quantile function for the bounded metalog distribution on an interval [L, U] is obtained by fitting an unbounded k-term distribution to the transformed data z = log((x-L)/(U-x)). This gives the quantile function, Q(p), for z. The quantile function for the bounded distribution is then defined by Q_B(p) = (L + U exp(Q(p))) / (1 + exp(Q(p))) for p in (0,1), and the boundary conditions Q_B(0) = L and Q_B(1) = U.
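As a sketch of the lower-bounded case, the following SAS IML statements transform the data, fit the unbounded model, and back-transform the fitted quantiles. The bound L=0 and the data set name (Sample) are assumptions made for the example.

proc iml;
L = 0;                                       /* assumed known lower bound */
use Sample;  read all var {x};  close;       /* hypothetical data set; requires x > L */
z = log(x - L);                              /* transform to the unbounded scale */
call sort(z);
n = nrow(z);
p = ( (1:n)` - 0.5 ) / n;                    /* assumed plotting positions */
Lgt = log( p/(1-p) );   c = p - 0.5;
M = j(n,1,1) || Lgt || c#Lgt || c || c##2;   /* same 5-term basis as before */
b = solve( M`*M, M`*z );                     /* fit the unbounded metalog to z */
Q_L = L + exp(M*b);                          /* back-transform: quantiles on [L, infinity) */
quit;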

As mentioned previously, for every distribution, you can obtain the PDF function from the quantile function.

Random variates from the metalog distribution

Recall that you can generate a random sample from any continuous probability distribution by using the inverse CDF method. You generate a random variate, u, from the uniform distribution U(0,1), then compute \(x = F^{-1}(u)\), where F is the CDF function and \(F^{-1}\) is the inverse CDF, which is the quantile function. You can prove that x is a random variate that is distributed according to F.

For most distributions, the inverse CDF is not known explicitly, which means that you must solve for x as the root of the nonlinear equation F(x) - u = 0. However, the metalog distribution supports direct simulation because the quantile function is defined explicitly. For the unbounded metalog distribution, you can generate random variates by generating u ~ U(0,1) and computing x = Q(u), where Q is the quantile function. The semibounded and bounded cases are treated similarly.
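As a minimal illustration, the following DATA step simulates 10,000 random variates from a 2-term unbounded metalog that has the hypothetical coefficients b1=10 and b2=2, so that Q(p) = b1 + b2 logit(p):

data MetalogSim;
call streaminit(123);
b1 = 10;  b2 = 2;                 /* hypothetical fitted coefficients */
do i = 1 to 10000;
   u = rand("Uniform");           /* u ~ U(0,1) */
   x = b1 + b2*log(u/(1-u));      /* x = Q(u) for the 2-term metalog */
   output;
end;
keep x;
run;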

Summary

The metalog distribution is a flexible distribution and can fit a wide range of shapes of data distributions, including multimodal distributions. You can fit a model to data by using ordinary least squares regression. Unfortunately, not every model is feasible, which means that the model is not a valid distribution. For an infeasible model, the CDF is not monotone and the PDF is not strictly positive. Nevertheless, the metalog distribution is a powerful option for fitting data when the underlying data-generating method is unknown.

In a future article, I will show how to use a library of SAS IML functions to fit a metalog distribution to data. All the images in this article were generated in SAS by using the library.

The post What is the metalog distribution? appeared first on The DO Loop.

September 28, 2022
 

The moments of a continuous probability distribution are often used to describe the shape of the probability density function (PDF). The first four moments (if they exist) are well known because they correspond to familiar descriptive statistics:

  • The first raw moment is the mean of a distribution. For a random variable, this is the expected value. It can be positive, negative, or zero.
  • The second central moment is the variance. The variance is never negative; in practice, the variance is strictly positive.
  • The third standardized moment is the skewness. It can be positive, negative, or zero. A distribution that has zero skewness is symmetric.
  • The fourth standardized moment is the kurtosis. Whereas the raw kurtosis, which is used in probability theory, is always positive, statisticians most often use the excess kurtosis. The excess kurtosis is 3 less than the raw kurtosis. Statisticians subtract 3 because the normal distribution has raw kurtosis of three, and researchers often want to compare whether the tail of a distribution is thicker than the normal distribution's tail (a positive excess kurtosis) or thinner than the normal distribution's tail (a negative excess kurtosis).

The purpose of this article is to point out a subtle point. There are three common definitions of moments: raw moments, central moments, and standardized moments. The adjectives matter. When you read a textbook, article, or software documentation, you need to know which definition the author is using.

A cautionary tale

If you re-read the previous list, you will see that it is traditional to use the RAW moment for the mean, the CENTRAL moment for the variance, the STANDARDIZED moment for the skewness, and the STANDARDIZED moment (sometimes subtracting 3) for the kurtosis. However, researchers can report other quantities. Recently, I read a paper in which the author reported formulas for the third and fourth moments of a distribution. I assumed that the author was referring to the standardized moments, but when I simulated data from the distribution, my Monte Carlo estimates for the skewness and excess kurtosis did not agree with the author's formulas. After many hours of checking and re-checking the formulas and my simulation, I realized that the author's formulas were for the central moments (not standardized). After I converted the formulas to report standardized moments (and the excess kurtosis), I was able to reconcile the published results and my Monte Carlo estimates.

Definitions of raw moments

For a continuous probability distribution for density function f(x), the nth raw moment (also called the moment about zero) is defined as
\(\mu_n^\prime = \int_{-\infty}^{\infty} x^n f(x)\, dx\)
The mean is defined as the first raw moment. Higher-order raw moments are used less often. The superscript on the symbol \(\mu_n^\prime\) is one way to denote the raw moment, but not everyone uses that notation. For the first raw moment, both the superscript and the subscript are often dropped, and we use \(\mu\) to denote the mean of the distribution.

For all moments, you should recognize that the moments might not exist. For example, the Student t distribution with ν degrees of freedom does not have finite moments of order ν or greater. Thus, you should mentally add the phrase "when they exist" to the definitions in this article.

Definitions of central moments

The nth central moment for a continuous probability distribution with density f(x) is defined as
\(\mu_n = \int_{-\infty}^{\infty} (x - \mu)^n f(x)\, dx\)
where \(\mu\) is the mean of the distribution. The most famous central moment is the second central moment, which is the variance. The second central moment, \(\mu_2\), is usually denoted by \(\sigma^2\) to emphasize that the variance is a positive quantity.

Notice that the central moments of even orders (2, 4, 6, ...) are never negative because they are defined as the integral of a nonnegative quantity. It is easy to show that the first central moment is always 0.
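To see why, expand the integrand and use the fact that the density integrates to 1:
\(\mu_1 = \int_{-\infty}^{\infty} (x - \mu) f(x)\, dx = \int_{-\infty}^{\infty} x f(x)\, dx - \mu \int_{-\infty}^{\infty} f(x)\, dx = \mu - \mu = 0\)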

The third and fourth central moments are used as part of the definition of the skewness and kurtosis, respectively. These moments are covered in the next section.

Definitions of standardized moments

The nth standardized moment is defined by dividing the nth central moment by the nth power of the standard deviation:
\({\tilde \mu}_n = \mu_n / \sigma^n\)
For the standardized moments, we have the following results:

  • The first standardized moment is always 0.
  • The second standardized moment is always 1.
  • The third standardized moment is the skewness of the distribution.
  • The fourth standardized moment is the raw kurtosis of the distribution. Because the raw kurtosis of the normal distribution is 3, it is common to define the excess kurtosis as \({\tilde \mu}_4 - 3\).

A distribution that has a negative excess kurtosis has thinner tails than the normal distribution. An example is the uniform distribution. A distribution that has a positive excess kurtosis has fatter tails than the normal distribution. An example is the t distribution. For more about kurtosis, see "Does this kurtosis make my tail look fat?"
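If you want to verify these definitions numerically, the following SAS IML statements use the QUAD subroutine to compute the moments of the Gamma(4) distribution, whose theoretical mean, variance, skewness, and excess kurtosis are 4, 4, 1, and 1.5, respectively. The integrand module and the choice of distribution are assumptions made for this sketch.

proc iml;
/* integrand for raw (mu=0, sigma=1), central (sigma=1), and standardized moments */
start Moment(x) global(n, mu, sigma);
   return( ((x - mu)/sigma)##n # pdf("Gamma", x, 4) );
finish;

mu = 0;  sigma = 1;  n = 1;                       /* first raw moment = mean */
call quad(m1, "Moment", {0 .P});
mu = m1;  n = 2;                                  /* second central moment = variance */
call quad(m2, "Moment", {0 .P});
sigma = sqrt(m2);                                 /* standardize by the std deviation */
n = 3;  call quad(skew, "Moment", {0 .P});        /* skewness */
n = 4;  call quad(rawKurt, "Moment", {0 .P});     /* raw kurtosis */
exKurt = rawKurt - 3;                             /* excess kurtosis */
print m1 m2 skew exKurt;
quit;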

Moments for discrete distributions

Similar definitions exist for discrete distributions. Technically, the moments are defined by using the notion of the expected value of a random variable. Loosely speaking, you can replace the integrals by summations. For example, if X is a discrete random variable with a countable set of possible values {x1, x2, x3,...} that have probability {p1, p2, p3, ...} of occurring (respectively), then the raw nth moment for X is the sum
\(E[X^n] = \sum_i x_i^n p_i\)
and the nth central moment is
\(E[(X-\mu)^n] = \sum_i (x_i-\mu)^n p_i\)
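For example, for a Bernoulli random variable that has P(X=1) = p and P(X=0) = 1-p, the first raw moment and the second central moment are
\(E[X] = 0\,(1-p) + 1\,p = p \quad \mbox{and} \quad E[(X-\mu)^2] = (0-p)^2 (1-p) + (1-p)^2\, p = p(1-p)\)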

Summary

I almost titled this article, "Will the real moments please stand up!" The purpose of the article is to remind you that "the moment" of a probability distribution has several possible interpretations. You need to use the adjectives "raw," "central," or "standardized" to ensure that your audience knows which moment you are using. Conversely, when you are reading a paper that discusses moments, you need to determine which definition the author is using.

The issue is complicated because the common descriptive statistics refer to different definitions. The mean is defined as the first raw moment. The variance is the second central moment. The skewness and kurtosis are the third and fourth standardized moments, respectively. When using the kurtosis, be aware that most computer software reports the excess kurtosis, which is 3 less than the raw kurtosis.

The post Definitions of moments in probability and statistics appeared first on The DO Loop.

May 25, 2022
 

Many modern statistical techniques incorporate randomness: simulation, bootstrapping, random forests, and so forth. To use the technique, you need to specify a seed value, which determines pseudorandom numbers that are used in the algorithm. Consequently, the seed value also determines the results of the algorithm. In theory, if you know the seed value and the internal details of the pseudorandom algorithm, then the stream is completely determined, and the results of an algorithm are reproducible. For example, if I publish code for a simulation or bootstrap method in SAS, you can reproduce my computations as long as my program specifies the seed value for every part of the program that uses random numbers.

Occasionally, programmers post questions on discussion forums about how to compare the results that were obtained by using different software. A typical question might be, "My colleague used R to run a simulation. He gets a confidence interval of [14.585, 21.42]. When I run the simulation in SAS, I get [14.52, 21.19]. How can I get the same results as my colleague?" In most cases, you can't get the same answer. More importantly, it doesn't matter: these two results are equally valid estimates. When you run an algorithm that uses pseudorandom numbers, you don't get THE answer, you get AN answer. There are many possible answers you can get. Each time you change the random number stream, you will get a different answer, but all answers are equally valid.

This article examines how the random number seed affects the results of a bootstrap analysis by using an example of a small data set that has N=10 observations. I will run the same bootstrap analysis for 50 different seeds and compare the results of a confidence interval. The same ideas apply to simulation studies, machine learning algorithms, and other algorithms that use randomness to obtain an answer.

A bootstrap estimate of a confidence interval

The following data were used by Davison, Hinkley, and Schechtman (1986). They ran an analysis to examine the bootstrap distribution of the sample mean in a small nonnormal set of data. The goal of the following analysis is to obtain a bootstrap estimate of a 90% confidence interval for the population mean. Because we are going to run the same analysis many times with different random number seeds, I will put the bootstrap steps in a SAS macro. The following statements run two bootstrap analyses on the same data. The only difference is the value of the random-number seeds.

data Sample;
input x @@;
datalines;
9.6 10.4 13.0 15.0 16.6 17.2 17.3 21.8 24.0 33.8
;
 
/* macro to perform one bootstrap analysis that uses NumSamples resamples
   and uses a particular seed. Results are in the data set BootEst */
%macro Do1Boot(seed, NumSamples);
   /* 1. generate NumSamples bootstrap resamples from SEED value */
   proc surveyselect data=Sample NOPRINT seed=&SEED
        method=urs              /* URS = resample with replacement */
        samprate=1              /* each bootstrap sample has N obs */
        OUTHITS                 /* do not use a frequency var */
        reps=&NumSamples(repname=SampleID)        /* generate NumSamples samples */
        out=BootSamp;
   run;
 
   /* 2. compute sample mean for each resample; write to data set */
   proc means data=BootSamp NOPRINT;
      by SampleID;
      var x;
      output out=BootStats mean=Mean;  /* bootstrap distribution */
   run;
   /* 3. find bootstrap estimates of mean, stdErr, and 90% CI */
   proc means data=BootStats NOPRINT;
      var Mean;
      output n=N mean=Mean stddev=StdErr p5=LCL90 p95=UCL90
             out=BootEst(drop=_TYPE_ _FREQ_);
   run;
   /* add the seed value to these statistics */
   data BootEst;
      Seed = &seed;
      set BootEst;
   run;
%mend;
 
title "A Boostrap Analysis for Seed=1";
%Do1Boot(1, 1000);
proc print data=BootEst noobs; run;
 
title "A Boostrap Analysis for Seed=1234";
%Do1Boot(1234, 1000);
proc print data=BootEst noobs; run;

The output shows that the bootstrap estimates are similar, but not exactly the same. This is expected. Each analysis uses a different seed, which means that the bootstrap resamples are different between the two runs. Accordingly, the means of the resamples are different, and so the bootstrap distributions are different. Consequently, the descriptive statistics that summarize the bootstrap distributions will be different. To make the differences easier to see, I used only 1,000 samples in each bootstrap analysis. If you increase the number of resamples to, say, 10,000, the differences between the two analyses will be smaller.

The fact that the random number seed "changes" the result makes some people nervous. Which value should you report to a journal, to a customer, or to your boss? Which is "the right answer"?

Both answers are correct. A bootstrap analysis relies on randomness. The results from the analysis depend on the pseudorandom numbers that are used.

Do the results depend on the seed value?

You should think of each bootstrap estimate as being accurate to some number of digits, where the number of digits depends on the number of resamples, B. For some estimates, such as the bootstrap standard error, you can calculate that the accuracy is proportional to 1/sqrt(B). Thus, larger values for B tend to shrink the difference between two bootstrap analyses that use different random number streams.
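For example, increasing the number of resamples from B=1,000 to B=10,000 improves the accuracy of such an estimate by a factor of about \(\sqrt{10000/1000} = \sqrt{10} \approx 3.2\).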

One way to visualize the randomness in a bootstrap estimate is to run many bootstrap analyses that are identical except for the random-number seed. The following SAS macro runs 50 bootstrap analyses and uses the seed values 1,2,3,..., 50:

/* Run nTrials bootstrap analyses. The i_th trial uses SEED=i. The estimates are written 
   to the data set AllBoot by concatenating the results from each analysis. */
%macro DoNBoot(NumSamples, nTrials);
   options nonotes;
   proc datasets noprint nowarn;   /* delete data set */
      delete AllBoot;
   quit;
   %do seed = 1 %to &nTrials;
       %Do1Boot(&seed, &numSamples);   /* run with seed=1,2,3,...,nTrials */
       proc append base=AllBoot data=BootEst force; 
       run;
   %end;
   options notes;
%mend;
 
/* Run 50 bootstrap analyses: numSamples=1000, nTrials=50, seeds=1,2,3,...,50 */
%DoNBoot(1000, 50); 
 
/* find bootstrap estimates of mean, stdErr, and 90% CI */
title "Variation in 50 Traditional Bootstrap Estimates";
proc means data=AllBoot N Mean StdDev Min Max;
   var Mean StdErr LCL90 UCL90;
run;

The table shows how four estimates vary across the 50 bootstrap analyses:

  • The bootstrap estimate of the mean varies in the interval [17.76, 18.00]. The average of the 50 estimates is 17.88.
  • The bootstrap estimate of the standard error varies in the interval [2.06, 2.25].
  • The bootstrap estimate of the lower 90% confidence interval varies in the interval [14.41, 14.80]. The average of the 50 estimates is 14.56.
  • The bootstrap estimate of the upper 90% confidence interval varies in the interval [21.36, 22.09]. The average of the 50 estimates is 21.63.

The table shows that the variation among the 50 estimates is small (and can be made smaller by increasing B). If you and your colleague run an identical bootstrap analysis but use different seed values, you are likely to obtain results that are similar to each other. Even more comforting is that the inferences that you make from the analyses should be the same. For example, suppose you are testing the null hypotheses that these data came from a population in which the mean is μ=20. Let's look at the 90% confidence intervals for the 50 analyses:

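The REFLINE statement in the following code references the macro variable &ObsStat, which is not defined in the code shown above. A minimal way to define it, assuming it should contain the observed sample mean of the original data, is the following:

/* assumption: &ObsStat holds the observed sample mean of the original data */
proc sql noprint;
   select mean(x) into :ObsStat trimmed from Sample;
quit;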
title "Traditional Bootstrap Estimates";
title2 "Varying Seed Values";
proc sgplot data=AllBoot;
   refline &ObsStat / axis=Y label="Sample Mean";
   highlow x=Seed low=LCL90 high=UCL90 / legendlabel="90% CL";
   scatter x=Seed y=Mean / legendlabel="Bootstrap Estimate";
   yaxis label="x" ;
run;

The graph shows that the value μ=20 is inside all 50 of the confidence intervals. Therefore, the decision to accept or reject the null hypothesis does not depend on the seed value when μ=20.

On the other hand, if you are testing whether the population mean is μ=21.5, then some analyses would reject the null hypothesis whereas others would fail to reject it. What to do in that situation? I would acknowledge that there is uncertainty in the inference and conclude that it cannot be reliably decided based on these data and these analyses. It would be easy to generate additional bootstrap samples (increase B). You could also advocate for collecting additional data, if possible.

Summary

In summary, the result of a bootstrap analysis (and ANY statistical technique that uses randomness) depends on the stream of pseudorandom numbers that are used. If you change the seed or use different statistical software, you are likely to obtain a different result, but the two results should be similar in many cases. You can repeat the analysis several times with different seeds to understand how the results vary according to the random-number seed.

If some seed values enable you to reject a hypothesis and others do not, you should acknowledge that fact. It is not ethical to use the seed value that gives the result that you want to see! (See "How to lie with a simulation" for an example.)

If you are nervous about the result of an algorithm that uses randomness, I suggest that you rerun the analysis several times with different seed values. If the results are consistent across seeds, then you can have confidence in the analysis. If you get vastly different results, then you are justified in being nervous.

The post How much does a bootstrap estimate depend on the random number stream? appeared first on The DO Loop.

January 20, 2022
 

Here's a fun problem to think about: Suppose that you have two different valid ways to test a statistical hypothesis. For a given sample, will both tests reject or fail to reject the hypothesis? Or might one test reject it whereas the other does not?

The answer is that two different tests can and will disagree for some samples, although they are likely to agree most of the time. Statistical software sometimes displays the results of multiple statistical tests when you request a hypothesis test. For example, PROC UNIVARIATE in SAS outputs four different statistical tests for normality, including the Shapiro-Wilk and Kolmogorov-Smirnov tests. One or more tests might reject the hypothesis of normality whereas other tests fail to reject it.

A SAS customer asked a question that made me wonder how often the four normality tests agree, and what the samples look like when the tests do not agree. This article presents a simulation study that generates random samples from the normal distribution and calls PROC UNIVARIATE to perform tests for normality on each sample. At the 95% confidence level, we expect each of these tests to reject the hypothesis of normality for 5% of the samples. But even though all the tests are designed to detect departures from normality, they do so in different ways. This means that the 5% of samples that are rejected by one test might be different from the 5% of samples that are rejected by another test.

Tests for Normality

PROC UNIVARIATE in SAS supports four different tests for normality of a univariate distribution. The Shapiro-Wilk statistic (W) is based on a computation that involves the variance and quantiles of the data. The other three tests are based on the empirical cumulative distribution function (ECDF). The ECDF-based tests include the Kolmogorov-Smirnov statistic (D), the Anderson-Darling statistic (A2), and the Cramer–von Mises statistic (W2). When the associated p-value of a statistic is less than α, the test rejects the null hypothesis ("the sample is from a normal distribution") at the α significance level. This article uses α=0.05.

What should an analyst do if some tests reject the hypothesis whereas others fail to reject it? I know a statistician who habitually rejects the assumption of normality if ANY of the tests reject normality. I think this is a bad idea because he is making decisions at less than the 95% confidence level. The simulation study in this article estimates the probability that a random sample from the normal distribution is not rejected simultaneously by all four tests. That is, all four tests agree that the sample is normal.
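For intuition, if the four tests were mutually independent, the chance that at least one of them rejects a truly normal sample would be \(1 - (1-\alpha)^4 = 1 - 0.95^4 \approx 0.19\). Because the tests are positively correlated, the simulation that follows estimates a smaller (but still inflated) familywise rejection rate.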

Run a simulation study

If you simulate a moderate-sized sample from the normal distribution, these four tests should each reject normality for about 5% of the samples. The following SAS program uses the standard Monte Carlo simulation technique in SAS:

  1. Simulate B=10,000 samples of size N=50 from the normal distribution. Associate a unique ID to each sample.
  2. Make sure that you suppress the output. Then use the BY statement to analyze the B samples. Write the B statistics to a data set.
  3. Analyze the sampling distribution of the statistics or tabulate the results of a statistical test.

The first two steps are straightforward. You can use the ODS OUTPUT statement to save the statistics to a data set, as follows:

/* simulate normal data */
%let n = 50;
%let numSamples = 10000;
data Sim;
call streaminit(123);
do SampleID=1 to &NumSamples;
   do i = 1 to &n;
      x = rand("Normal", 0);  /* sample from null hypothesis */
      output;
   end;
end;
run;
 
ods exclude all;    /* suppress output */
proc univariate data=Sim normal;
   by SampleID;
   var x;
   ods output TestsForNormality = NormTests;
run;
ods exclude none;

Transpose and analyze the results of the simulation study

The TestsForNormality table lists the statistics in a "long form," where each statistic is on a separate row. To analyze the distribution of the statistics, it is convenient to transpose the data to wide form. The following call to PROC TRANSPOSE creates four variables (pValue_W, pValue_D, pValue_W_Sq, and pValue_A_Sq), which are the p-values for the four normality tests.

/* Transpose from long to wide form. This creates four variables
   pValue_W, pValue_D, pValue_W_Sq, and pValue_A_Sq. */
proc transpose data=NormTests out=Tran prefix=pValue_;
   by SampleID;
   id TestLab;
   var pValue;
run;

Let's create a binary indicator variable for each p-value. The indicator variable has the value 1 if the test rejects the hypothesis of normality at the α=0.05 significance level. If the test fails to reject, the indicator variable has the value 0. You can then tabulate the results for each individual test:

%let alpha = 0.05;
data Results;
set Tran;
RejectW = (pValue_W < &alpha);
RejectD = (pValue_D < &alpha);
RejectWSq = (pValue_W_Sq < &alpha);
RejectASq = (pValue_A_Sq < &alpha);
run;
 
/* tabulate the results */
proc freq data=Results;
   tables RejectW RejectD RejectWSq RejectASq / norow nocol nocum;
run;

As expected, each test rejects the hypothesis of normality on about 5% of the random samples. However, the tests have different characteristics. For a given sample, one test might reject the hypothesis whereas another test fails to reject it. For example, you can investigate how many samples were jointly rejected by the Shapiro-Wilk test (W) and the Kolmogorov-Smirnov test (D):

proc freq data=Results;
   tables RejectW*RejectD / norow nocol nocum list; 
run;

The table shows the results of 10,000 statistical tests. For 192 samples (1.9%), both tests reject the null hypothesis. For 9,224 samples (92.2%), both tests fail to reject it. Thus, the tests agree for 9,416 samples, or 94.2%.

The interesting cases are when one test rejects the hypothesis and the other does not. The Kolmogorov-Smirnov test rejects the hypothesis on 279 samples that are not rejected by the Shapiro-Wilk test. Conversely, the Shapiro-Wilk test rejects 305 samples that the Kolmogorov-Smirnov test does not reject. Two such samples are shown in the following graph. A casual inspection does not reveal why the tests differ on these samples, but they do.

Multiple tests

When you run four tests, the situation becomes more complicated. For a given sample, you could find that zero, one, two, three, or all four tests reject normality. Since these statistics all test the same hypothesis, they should agree often. Thus, you should expect that for most random samples, either zero or four tests will reject the null hypothesis. Again, the interesting cases are when some tests reject it and others fail to reject it. The following table shows the results of all four tests on the samples in the simulation study:

 
proc freq data=Results;
   tables RejectW*RejectD*RejectWSq*RejectASq / norow nocol nocum list;
run;

There are a total of 9,103 samples for which all tests agree that the sample is normally distributed. If you reject the hypothesis if ANY of the tests reject it, you are making decisions at the 91% confidence level, not at the 95% level.

The first and last rows show the cases where all four tests agree. You can see that the tests agree with each other about 92.8% of the time. However, there are some other cases that occur frequently:

  • The 5th and 12th rows indicate samples for which the Kolmogorov-Smirnov test does not agree with the other three tests. This happens for 2.6% of the samples.
  • The 8th and 9th rows indicate samples for which the Shapiro-Wilk test does not agree with the ECDF tests. This happens for 2.5% of the samples.
  • The ECDF tests agree with each other for 95.3% of these samples. This is shown by the 1st, 8th, 9th, and 16th rows.
  • It is very rare for the Anderson-Darling statistic (A2) and the Cramer–von Mises statistic (W2) to disagree when the null hypothesis is true. They agree for 98.6% of these samples.

Summary

This article uses a simulation study to investigate how often different tests (for the same hypothesis) agree with each other. The article uses small samples of size N=50 that are normally distributed. The results show that these tests for normality tend to agree with each other. At least one test disagrees with the others for 7.2% of the samples.

As usual, this simulation study applies only to the conditions in the study. Wicklin (Simulating Data with SAS, 2013, p. 105) cautions against generalizing too much from a particular simulation: "real data are rarely distributed according to any 'named' distribution, so how well do the results generalize to real data? The answer can depend on the true distribution of the data and the sample size."

Nevertheless, this article provides some insight into how often statistical tests of normality agree and disagree with each other when the null hypothesis is true. You can use the techniques in this article to compare other tests.

The post How often do different statistical tests agree? A simulation study appeared first on The DO Loop.

November 8, 2021
 

Recall that the binomial distribution is the distribution of the number of successes in a set of independent Bernoulli trials, each having the same probability of success. Most introductory statistics textbooks discuss the approximation of the binomial distribution by the normal distribution. The graph to the right shows that the normal density (the red curve, N(μ=9500, σ=21.79)) can be a very good approximation to the binomial density (blue bars, Binom(p=0.95, nTrials=10000)). However, because the binomial distribution is discrete, its density is defined only for nonnegative integers, whereas the normal density is defined for all real numbers.

In this graph, the binomial density and the normal density are close. But what does the approximation look like if you overlay a bar chart of a random sample from the binomial distribution? It turns out that the bar chart can have large deviations from the normal curve, even for a large sample. Read on for an example.

The normal approximation to the binomial distribution

I have written two previous articles that discuss the normal approximation to the binomial distribution.

The normal approximation is used to estimate probabilities because it is often easier to use the area under the normal curve than to sum many discrete values. However, as shown in the second article, the discrete binomial distribution can have statistical properties that are different from the normal distribution.
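For example, the following DATA step compares the exact binomial probability P(X ≤ 9450) with its normal approximation (using a continuity correction). The cutoff value 9450 is chosen arbitrarily for illustration:

data _null_;
   pExact  = cdf("Binomial", 9450, 0.95, 10000);   /* exact binomial CDF */
   pApprox = cdf("Normal", 9450.5, 9500, 21.79);   /* normal approximation with continuity correction */
   put pExact= pApprox=;
run;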

Random binomial samples

It is important to remember that a random sample (from any distribution!) can look much different from the underlying probability density function. The following graph shows a random sample from the binomial distribution Binom(0.95, 10000). The distribution of the sample looks quite different from the density curve.

In statistical terms, the observed and expected values have large deviations. Also, note that there can be a considerable deviation between adjacent bars. For example, in the graph, some bars have about 2% of the total frequency whereas an adjacent bar might have half that value. I was surprised to observe the large deviations in a large sample.

This graph emphasizes the fact that a random sample from the binomial distribution can look different from the smooth bell-shaped curve of the probability density. I think I was surprised by the magnitude of the deviations from the expected values because I have more experience visualizing continuous distributions. For a continuous distribution, we use a histogram to display the empirical distribution. When the bin width of the histogram is greater than 1, it smooths out the differences between adjacent bars to create a much smoother estimate of the density. For example, the following graph displays the distribution of the same random sample but uses a bin width of 10 to aggregate the frequency. The resulting histogram is much smoother and resembles the normal approximation.

Summary

The normal approximation is a convenient way to compute probabilities for the binomial distribution. However, it is important to remember that the binomial distribution is a discrete distribution. A binomial random variable can assume only nonnegative integer values. One consequence of this observation is that a bar chart of a random binomial sample can show considerable deviations from the theoretical density. This is normal (pun intended!). It is often overlooked because if you treat the random variable as if it were continuous and use a histogram to estimate the density, the histogram smooths out a lot of the bar-to-bar deviations.

Appendix

The following SAS program creates the graphs in this article.

/* define parameters for the Binom(p=0.95, nTrials=10000) simulation */
%let p = 0.95;                    /* probability of success */
%let NTrials = 10000;             /* number of trials */
%let N = 1000;                    /* sample size */   
 
/* First graph: Compute the density of the Normal and Binomial distributions. See
   https://blogs.sas.com/content/iml/2016/09/12/overlay-curve-bar-chart-sas.html
*/
data Binom;
n = &nTrials;  p = &p;  q = 1 - p;
mu = n*p;  sigma = sqrt(n*p*q);   /* parameters for the normal approximation */
Lower = mu-3.5*sigma;             /* evaluate normal density on [Lower, Upper] */
Upper = mu+3.5*sigma;
 
/* PDF of normal distribution */
do t = Lower to Upper by sigma/20;
   Normal = pdf("normal", t, mu, sigma);       output;
end;
 
/* PMF of binomial distribution */
t = .; Normal = .;        /* these variables are not used for the bar chart */
do j = max(0, floor(Lower)) to ceil(Upper);
   Binomial = pdf("Binomial", j, p, n);      output;
end;
/* store mu and sigma in macro variables */
call symput("mu", strip(mu));      
call symput("sigma", strip(round(sigma,0.01)));
label Binomial="Binomial Probability"  Normal="Normal Density";
keep t Normal j Binomial;
run;
 
/* overlay binomial density (needle plot) and normal density (series plot) */
title "Binomial Probability and Normal Approximation";
title2 "Binom(0.95, 10000) and N(9500, 21.79)";
proc sgplot data=Binom;
   needle x=j y=Binomial;
   series x=t y=Normal / lineattrs=GraphData2(thickness=2);
   inset "p = &p"  "q = %sysevalf(1-&p)" "nTrials = &nTrials"  
         "(*ESC*){unicode mu} = np = &mu"           /* use Greek letters */
         "(*ESC*){unicode sigma} = sqrt(npq) = &sigma" /
         position=topright border;
   yaxis label="Probability";
   xaxis label="x" integer;
run;
 
/*************************/
 
/* Second graph: simulate a random sample from Binom(p, NTrials) */
data Bin(keep=x);
call streaminit(1234);
do i = 1 to &N; 
   x = rand("Binomial", &p, &NTrials); 
   output;
end;
run; 
 
/* count the frequency of each observed count */
proc freq data=Bin noprint;
   tables x / out=FreqOut;
run;
 
data All;
set FreqOut Binom(keep=t Normal);
Normal = 100*1*Normal;   /* match scales: 100*h*PDF, where h=binwidth */
run;
 
/* overlay sample and normal approximation */
title "Random Binomial Sample";
title2 "Bar Chart of Binom(0.95, 10000)";
proc sgplot data=All;   
   needle x=x y=Percent;
   series x=t y=Normal / lineattrs=GraphData2(thickness=2);
   inset "n = &n"  "p = &p" 
         "(*ESC*){unicode mu} = &mu"           /* use Greek letters */
         "(*ESC*){unicode sigma} = &sigma" /
         position=topright border;
   yaxis label="Percent / Scaled Density";
   xaxis label="x" integer;
run;
 
/*************************/
 
/* Third graph: the bar-to-bar deviations are smoothed if you use a histogram */
title2 "Histogram BinWidth=10";
proc sgplot data=Bin;   
   histogram x / scale=percent binwidth=10;
   xaxis label="x" integer values=(9400 to 9600 by 20);
run;
 
/* for comparison, the histogram looks like the bar chart if you set BINWIDTH=1 */
title2 "Histogram BinWidth=1";
proc sgplot data=Bin;   
   histogram x / binwidth=1 scale=percent;
   xaxis label="x" integer;
run;

The post The normal approximation and random samples of the binomial distribution appeared first on The DO Loop.

October 4, 2021
 

A reader asked whether it is possible to find a bootstrap sample that has some desirable properties. I am using the term "bootstrap sample" to refer to the result of randomly resampling with replacement from a data set. Specifically, he wanted to find a bootstrap sample that has a specific value of the mean and standard deviation. If not, can we find a bootstrap sample whose mean and standard deviation are close to the target value? It's an interesting question. Each bootstrap sample is composed of the original observations, but because it is a sample with replacement, some observations can appear more often than others.

For very small values of N, you can solve the reader's problem by generating all bootstrap samples. In a previous article, I used the data {-1, -0.1, -0.6, 0, 0.5, 0.8, 4}, which has the sample mean 0.51 for N=7. I mentioned that one possible bootstrap sample is {4, 4, 4, 4, 0.5, 4, 4}, which has the mean 3.5. This bootstrap sample contains only large values in the original data. The mean of that sample is much greater than the mean of the original data, and the standard deviation is much less.

Of course, generating all bootstrap samples is impractical for even moderate values of N. Furthermore, bootstrap samples like this are improbable if the probability of choosing each observation is 1/N. But what if you assign a higher probability to select some observations and a lesser probability to others? For example, the sample that has six 4s is not improbable if the probability of choosing the 4 is 0.9. If you sample with unequal probability (and still with replacement) the statistics in the samples might be very different from the statistics in the original data.

This kind of sampling is no longer classical bootstrap sampling. It could be considered a kind of importance sampling. Whatever you call it, each sample you generate is a resampling from the original data, but you can control the kinds of samples that you generate by adjusting the probability of selecting each observation.

Sampling with non-uniform probability

Recall that there are four main ways to sample. You can sample with or without replacement; you can sample with equal or unequal probability. In this article, we use sampling with replacement with unequal probability. SAS supports several ways to sample with replacement and with unequal probability. This article uses PROC SURVEYSELECT with the METHOD=PPS_WR option. The acronym PPS_WR stands for "probability proportional to size (PPS), with replacement."

To illustrate sampling with unequal probability, consider the following data with N=1000 that are simulated from a uniform distribution on [0, 20]:

%let N = 1000;    /* sample size */
data Have;
call streaminit(54321);
do i = 1 to &N;
   x = round( rand("uniform", 0, 20), 0.01);  /* round to nearest 0.01 */
   output;
end;
run;
 
/* compute sample mean and std dev of the original data */
proc means data=Have mean std min max;
  var x;
run;

The mean and standard deviation of the data are 10.04 and 5.80, respectively. If you were to perform the usual bootstrap sampling, which selects each observation with equal probability, then most samples would have a mean in the interval [9.5, 10.5] and a standard deviation in the interval [5.6, 6.0].
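If you want to check that claim empirically, the following statements are a minimal sketch (mine, not part of the original program) of the classical bootstrap. The METHOD=URS option in PROC SURVEYSELECT samples with replacement and with equal probability; the number of replicates and the seed are arbitrary choices:

/* sketch of the classical (equal-probability) bootstrap; REPS= and SEED= are arbitrary */
proc surveyselect noprint data=Have out=BootEqual
     method=URS              /* unrestricted random sampling: with replacement, equal prob */
     N=&N reps=500 seed=98765 outhits;
run;
 
proc means data=BootEqual noprint;
   by Replicate;
   var x;
   output out=BootStats mean=Mean stddev=StdDev;
run;
 
/* the 5th and 95th percentiles show the typical range of the bootstrap statistics */
proc means data=BootStats P5 P95;
   var Mean StdDev;
run;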

Suppose you are seeking a resample that has unusual properties, such as Mean=9 and StdDev=3. It would be rare to obtain these statistics by using the usual method of resampling observations with equal probability. However, you can assign a high probability to observations whose values are near 9. The following DATA step uses the normal density function for the N(9,3) distribution to assign the probability of selecting each observation:

/* Target: mean=9; SD = 3 */
%let mean = 9;
%let StdDev = 3;
/* Use the normal distribution to assign the probability of choosing an observation */
data Have2;
set Have;
Prob = pdf("Normal", x, &mean, &StdDev);
run;

You can use the METHOD=PPS_WR option to resample from the data by using this probability of selecting each observation. The following call to PROC SURVEYSELECT generates a resample of size N from the data but gives preference to observations that are closer to x=9.

/* Sample with replacement from the data, where the probability 
   of choosing an obs is given by the Prob variable */
proc surveyselect noprint data=Have2 out=Resample
     method=PPS_WR 
     seed=12345 N=&N;       /* use OUTHITS option if you want N obs */
   size Prob;               /* specify the probability variable */
run;
 
proc means data=Resample mean std;  /* a different seed will give a different result */
  var x;
  freq NumberHits;
run;

The output from PROC MEANS shows that for these uniformly distributed data, you can obtain a resample whose mean and standard deviations are close to the target values. Of course, using a different random number seed will result in a different resample and different statistics.

You can create a graph to visualize the observations that were chosen for this resample. The following graph plots the NumberHits variable, which records the number of times that each observation appears in the sample:

title "Number of Times Each Observation Appears in Resample";
proc sgplot data=Resample;
   scatter x=x y=NumberHits / markerattrs=(symbol=CircleFilled) transparency=0.75;
   xaxis values=(0 to 20 by 2);
   yaxis values=(1 to 8) valueshint;
run;

You can see that no observations for which x ≥ 18 were selected. Observations for which x ≈ 9 are selected more often than observations for which x is far from 9.

Random samples whose statistics are close to target values

If you change the random number seed, you will get a different resample that has different statistics. If you repeat this process many times, you can choose the resample whose statistics are closest to the target values. You can add the REPS= option to PROC SURVEYSELECT to generate additional resamples. This creates a variable (Replicate) that you can use as a BY-group variable in PROC MEANS. For each (Mean, StdDev) pair, you can compute the distance to the target values. You can then sort the statistics and identify the resamples whose statistics are closest to the target, as follows:

proc surveyselect noprint data=Have2 out=Resample100
     method=PPS_WR 
     Reps=100
     seed=12345 N=&N;       /* use OUTHITS option if you want N obs */
   size Prob;               /* specify the probability variable */
run;
 
/* Compute mean and StdDev of each resample */
proc means data=Resample100 mean std noprint;
  by Replicate;
  var x;
  freq NumberHits;
  output out=BootMeans mean=Mean stddev=StdDev;
run;
 
/* compute distance from statistics to target values */
data Diff;
set BootMeans;     
DiffMean = &Mean - Mean;
DiffStd  = &StdDev - StdDev;
EuclidDist = euclid(DiffMean, DiffStd);  /* distance between the sample statistics and the target values */
run;
 
proc sort data=Diff;
   by EuclidDist;    /* sort by how close statistics are to target values */
run;
 
proc print data=Diff(obs=5);
   var Replicate Mean StdDev DiffMean DiffStd EuclidDist;
run;

The top five resamples (from 100) are shown. For these resamples, the mean is close to 9 and the standard deviation is close to 3, which are the target values. If you measure "closeness" by using a Euclidean distance, then Replicate=45 is the resample whose statistics are closest to the target values.

Why the example works

It is possible to turn this example into a general-purpose method for finding a resample that has statistics that are close to target values. However, it requires some additional work. I will outline why this example works and what is required to generalize the example.

Let f(x) be the probability density function of the data. If you resample each observation with equal probability, then the expected value of a resampled point is \(E(X) = \int x f(x)\, dx\), which is the expected value of the distribution. The classical bootstrap method generates samples for which the expected sample mean is E(X). However, if you select observations with probability proportional to some function g(x), then the expected value of a resampled point is \(\frac{\int x f(x) g(x)\, dx}{\int f(x) g(x)\, dx}\). By strategically choosing g(x), you can control the expected value. The same is true for higher-order moments, such as the variance.
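As a sanity check on that formula (a sketch of my own, not part of the original post), the following DATA step uses a simple midpoint-rule sum to approximate \(\frac{\int x f(x) g(x)\, dx}{\int f(x) g(x)\, dx}\) for the uniform density f on [0, 20] and the N(9,3) density g. The result is approximately 9, which agrees with the resample means observed earlier:

/* numerically approximate the weighted mean for f = U(0,20) and g = N(9,3) */
data WeightedMean(keep=WtMean);
f  = 1/20;                           /* uniform density on [0, 20] */
dx = 0.001;                          /* step size for the midpoint rule */
do x = dx/2 to 20 by dx;
   g = pdf("Normal", x, 9, 3);       /* weight function */
   num + x*f*g*dx;                   /* accumulate the numerator integral */
   den +   f*g*dx;                   /* accumulate the normalizing integral */
end;
WtMean = num / den;                  /* approximately 9 */
output;
run;
 
proc print data=WeightedMean noobs; run;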

For this example, the data have a uniform density, so it was easy to choose a function g(x) so that the expected mean and variance match the target values. For data that have a non-uniform density, the process is more complicated because you must account for the (normalized) product of f(x) and g(x).

Even for this example, you might not be able to match every set of target values. For example, you probably can't find a resample that has a large mean and also a large variance.

Summary

This article considers whether it is possible to find a resample that has a desired set of statistics. A classical bootstrap sample selects observations with replacement and with uniform probability. With high probability, the bootstrap sample has statistical properties that are somewhat similar to the original data. However, if you sample with nonuniform probability, you can obtain resamples whose statistical properties are quite different from the original data. The article uses the METHOD=PPS_WR option in PROC SURVEYSELECT to create an example in the special case where the data are uniformly distributed.

The post Choose samples with specified statistical properties appeared first on The DO Loop.

August 16, 2021
 

People love rankings. You've probably seen articles about the best places to live, the best colleges to attend, the best pizza to order, and so on. Each of these is an example of a ranking that is based on multiple characteristics. For example, a list of the best places to live might base the ranking on quantities such as the average salary, the cost of homes, crime rates, quality of schools, and so forth. Pizzas might be ranked on the crust, the sauce, the toppings, and more. Typically, the ranking is a weighted average of these multiple variables.

Inevitably, after the rankings are released, some people complain that the weights used to construct the ranking were not fair. The most common complaint is that the weights were chosen arbitrarily by the editors. This is a valid criticism: someone must choose the weights, but not everyone will agree with the choice. Some organizations base the weights on surveys in which people are asked to weigh the relative importance of the factors. Other organizations might use equal weights for the characteristics. Others just make up weights that seem to work well.

This article discusses the geometry of a weighted mean (or sum) of multiple variables. When you specify the weights, you are choosing a direction (a line) in the space of the variables. Geometrically, each subject (city, college, pizza, ...) is projected onto the line. The ranking is then determined by the relative placement of each subject when projected onto the line. This is shown visually by the graph to the right. The details of the graph are explained later in this article.

Mathematically, a weighted mean and a weighted sum are very similar and result in the same ranks. I will use the weighted sum in this article. If you prefer, you can get the weighted mean by dividing the weighted sum by the number of factors. Also, a projection onto a line requires scaling the weight vector to have unit length, but I will ignore that technical detail.

Two-dimensional rankings

To illustrate the geometry of ranking subjects, consider the case where there are only two characteristics. Suppose three students are being evaluated for admission to a university program. Each student submits scores on two standardized tests. The following table shows the test scores for the three students:

Which student is the best fit for the university program? It depends on how the program weights the two test scores. Let's pretend that "Test 1" is a test of mathematical and analytical skills and that "Test 2" is a test of verbal and language skills. Some programs (math, engineering, and other STEM disciplines) will put extra weight on the math score, whereas other programs (creative writing, linguistics, journalism, ...) will put extra weight on the verbal score. Here are a few ways that various programs might weigh the tests:

  • A math department might look only at the math test and ignore the language test. This corresponds to the weight vector w1=(1,0).
  • A law school might want to equally weigh the math and language tests. This corresponds to the weight vector w2=(1,1).
  • An engineering program might decide that math skills are twice as important as language skills. This corresponds to the weight vector w3=(2,1).

The following table displays the weighted sums for each student under these different choices for weights:

The table shows that the "best" candidate depends on the weights that the programs choose (a small worked example with hypothetical scores follows this list):

  • For the math department ("Wt Sum 1"), Student A is the best candidate.
  • For the law school ("Wt Sum 2"), Student B is the best candidate.
  • For the engineering program ("Wt Sum 3"), Student C is the best candidate.
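Because the original score table is not reproduced here, the following DATA step uses hypothetical scores (my own illustrative values, chosen to reproduce the outcomes above) and computes the three weighted sums:

/* hypothetical scores (chosen for illustration; not the original table) */
data Scores;
input Student $ Test1 Test2;
WtSum1 = 1*Test1 + 0*Test2;   /* w1 = (1,0): math department    */
WtSum2 = 1*Test1 + 1*Test2;   /* w2 = (1,1): law school         */
WtSum3 = 2*Test1 + 1*Test2;   /* w3 = (2,1): engineering school */
datalines;
A 90 60
B 80 80
C 85 72
;
 
proc print data=Scores noobs; run;

For these hypothetical scores, WtSum1 ranks Student A first, WtSum2 ranks Student B first, and WtSum3 ranks Student C first, with Students A and B tied.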

The geometry of weighted averages

You can plot the test scores as ordered pairs in two-dimensional space. The weight vector determines a line. The projection of the ordered pairs onto that line determines the ranking of the subjects. For example, the following graph shows the geometry for the weight vector w1=(1,0):

For this weight vector, the weighted sum is merely the first component. This is also the projection of the point onto the blue dashed line αw1, for any scalar value of α. The projections that are far from the origin have a larger weighted sum than the projections that are closer to the origin. The students' ranks are determined by their relative positions along the blue dashed line. For this example, Student A is the most highly ranked student (for the math department) because its projection onto the line is farthest from the origin. The gray lines indicate the projection onto the blue line. The gray lines are perpendicular to the blue line.

If you use a different weight vector, you might get a different result. The following graph shows the results for the same students but for the law-school weight vector w2=(1,1). Again, the students' ranks are determined by their relative positions along the blue dashed line. But this time, Student B is the most highly ranked student because its projection onto the line is farthest from the origin. The gray lines are perpendicular to the blue line and indicate the projection onto the blue line.

The ranks of the students change again if you consider the weights given by w3=(2,1). For these weights, Student C is the highest-ranked student by the engineering school. The graph is shown at the top of this article. Note that students A and B are assigned the same rank because they lie on the same gray line, which is perpendicular to the blue line. This can happen for any weight vector: If two students are on the same perpendicular line, their projection is the same.
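If you want to compute the projections explicitly, the following statements (a sketch that reuses the hypothetical Scores data from the previous example) compute the signed distance of each student's projection along the direction of w3=(2,1). Students A and B get the same value, which is why they are ranked equally:

/* project the score vectors onto the unit vector in the direction of w3 = (2,1) */
data Projection;
set Scores;                              /* hypothetical scores from the earlier sketch */
w1 = 2; w2 = 1;                          /* engineering weights */
t = (Test1*w1 + Test2*w2) / sqrt(w1**2 + w2**2);  /* signed distance from the origin */
run;
 
proc sort data=Projection; by descending t; run;
proc print data=Projection noobs; var Student t; run;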

Ranking by using additional variables

The geometry does not change much if you compute a weighted sum of more than two factors. For three factors, the weight vector determines a (blue dashed) line in 3-D space, and the scores are projected onto the line. In the 2-D examples, I drew gray lines perpendicular to the blue line. For three factors, the analogous structures are planes that are perpendicular to the blue line. All subjects on the same plane have the same ranking. In d dimensions, the planes become (d-1)-dimensional hyperplanes.

An important consideration for this process (in any dimension) is that the factors should be measured on comparable scales. In the earlier example, both standardized test scores were assumed to be measured on the same scale. However, you should standardize variables that are on different scales. For example, you would not want to use a weighted average of the median home price (possibly $300,000) and the mean annual temperature (possibly 15°C) without standardizing those quantities.
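As a minimal sketch with hypothetical data (the city names and values are made up for illustration), you can use PROC STDIZE to standardize the variables before forming the weighted sum:

data Cities;
input City $ HomePrice AvgTemp;
datalines;
CityA 300000 15
CityB 450000 22
CityC 250000  9
;
 
proc stdize data=Cities method=std out=CitiesStd;  /* standardize to mean 0, std dev 1 */
   var HomePrice AvgTemp;
run;
 
data CitiesRanked;
set CitiesStd;
Score = 0.5*HomePrice + 0.5*AvgTemp;   /* weighted sum of standardized values; weights are arbitrary */
run;
 
proc sort data=CitiesRanked; by descending Score; run;
proc print data=CitiesRanked noobs; run;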

Summary

Sometimes rankings are based on a statistical model, but other times they are based on a "gut feeling" about which features are important. Magazines publish annual rankings of colleges, cities, hospitals, pizzas, and places to work. It is important to realize that the weights that are used for these rankings are somewhat arbitrary. Even if the weights are based on a survey, there is uncertainty in the weights.

This article shows that if you vary the weights, the rankings can change. It also discusses the geometry of weighted sums and averages. A weight vector determines a direction in a d-dimensional feature space. To rank subjects, you project the d-dimensional points (the tuples of feature values) onto the line that is determined by the weight vector. The highest-ranked subject is the one whose projection is farthest from the origin along the direction of the weight vector.

So next time you disagree with a ranking, you can confidently say to your friends, "Yes, but if they had used different weights, the results would have been different."

You can download the SAS program that creates the tables and graphs in this article.

The post Rankings and the geometry of weighted averages appeared first on The DO Loop.

June 23, 2021
 

This article uses simulation to demonstrate the fact that any continuous distribution can be transformed into the uniform distribution on (0,1). The function that performs this transformation is a familiar one: it is the cumulative distribution function (CDF). A continuous CDF is defined as an integral, so the transformation is called the probability integral transformation.

This article demonstrates the probability integral transformation and its close relative, the inverse CDF transformation. These transformations are useful in many applications, such as constructing statistical tests and simulating data.

The probability integral transform

Let X be a continuous random variable whose probability density function is f. Then the corresponding cumulative distribution function (CDF) is the integral \(F(x) = \int\nolimits_{-\infty}^{x} f(t)\,dt\). You can prove that the random variable Y = F(X) is uniformly distributed on (0,1). In terms of data, if {X1, X2, ..., Xn} is a random sample from X, the values {F(X1), F(X2), ..., F(Xn)} are a random sample from the uniform distribution.
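A short sketch of the proof (assuming that F is continuous and strictly increasing on the support of X): for 0 < y < 1,
\[
P(Y \le y) = P(F(X) \le y) = P(X \le F^{-1}(y)) = F(F^{-1}(y)) = y ,
\]
which is the CDF of the uniform distribution on (0,1).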

Let's look at an example. The following SAS DATA step generates a random sample from the Gamma distribution with shape parameter α=4. At the same time, the DATA step computes U=F(X), where F is the CDF of the gamma distribution. A histogram of the U variable shows that it is uniformly distributed:

/* The probability integral transformation: If X is a random 
   variable with CDF F, then Y=F(X) is uniformly distributed on (0,1).
*/
%let N = 10000;
data Gamma4(drop=i);
call streaminit(1234);
do i = 1 to &N;
   x = rand("Gamma", 4);     /* X ~ Gamma(4) */
   u = cdf("Gamma", x, 4);   /* U = F(X) ~ U(0,1)   */
   output;
end;
run;
 
title "U ~ U(0,1)";
proc sgplot data=Gamma4;
   histogram U / binwidth=0.04 binstart=0.02;  /* center first bin at h/2 */
   yaxis grid;
run;

The histogram has some bars that are greater than the average and others that are less than the average. That is expected in a random sample.

One application of the CDF transformation is to test whether a random sample comes from a particular distribution. You can transform the sample by using the CDF function, then test whether the transformed values are uniformly distributed. If the transformed variates are uniformly distributed, then the original data were distributed according to the distribution F.
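For example, the following call to PROC UNIVARIATE is one way to carry out such a test on the Gamma4 data from above (a sketch of mine, not part of the original post). The trick is that the U(0,1) distribution equals the Beta(1,1) distribution, so fitting a fully specified Beta(1,1) distribution yields goodness-of-fit tests for uniformity:

/* test the transformed values for uniformity by fitting Beta(1,1) = U(0,1) */
proc univariate data=Gamma4;
   var u;
   histogram u / beta(alpha=1 beta=1 theta=0 sigma=1);
run;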

The inverse probability transformation

It is also useful to run the CDF transform in reverse. That is, start with a random uniform sample and apply the inverse-CDF function (the quantile function) to generate random variates from the specified distribution. This "inverse CDF method" is a standard technique for generating random samples.

The following example shows how it works. The DATA step generates random uniform variates on (0,1). The variates are transformed by using the quantiles of the lognormal(0.5, 0.8) distribution. A call to PROC UNIVARIATE overlays the density curve of the lognormal(0.5, 0.8) distribution on the histogram of the simulated data. The curve matches the data well:

/* inverse probability transformation */
data LN(drop=i);
call streaminit(12345);
do i = 1 to &N;
   u = rand("Uniform");                    /* U ~ U(0,1)   */
   x = quantile("Lognormal", u, 0.5, 0.8); /* X ~ LogNormal(0.5, 0.8) */
   output;
end;
run;
 
proc univariate data=LN;
   var x;
   histogram x / lognormal(scale=0.5 shape=0.8) odstitle="X ~ LogNormal(0.5, 0.8)";
run;

Summary

The probability integral transform (also called the CDF transform) is a way to transform a random sample from any distribution into the uniform distribution on (0,1). The inverse CDF transform transforms uniform data into a specified distribution. These transformations are used in testing distributions and in generating simulated data.

Appendix: Histograms of uniform variables

The observant reader might have noticed that I used the BINWIDTH= and BINSTART= options to set the endpoints of the bins for the histogram of uniform data on (0,1). Why? Because, by default, the binning algorithm centers a bin at a convenient value such as 0. For data that are bounded on an interval, you can use the BINWIDTH= and BINSTART= options to improve the visualization.

Let's use the default binning algorithm to create a histogram of the uniform data:

title "U ~ U(0,1)";
title2 "34 Bins: First Bin Centered at 0 (Too Short)";
proc sgplot data=Gamma4;
   histogram u;
   yaxis grid;
run;

Notice that the first bin is "too short." The first bin is centered at the location U=0. Because the data cannot be less than 0, the first bin has only half as many counts as the other bins. The last bin is also affected, although the effect is not usually as dramatic.

The histogram that uses the default binning is not wrong, but it doesn't show the uniformity of the data as clearly as a histogram that uses the BINWIDTH= and BINSTART= options to force 0 and 1 to be the endpoints of the bins.

The post The probability integral transform appeared first on The DO Loop.

October 28, 2020
 

The skewness of a distribution indicates whether a distribution is symmetric or not. The Wikipedia article about skewness discusses two common definitions for the sample skewness, including the definition used by SAS. In the middle of the article, you will discover the following sentence: "In general, the [estimators] are both biased estimators of the population skewness." The article goes on to say that the estimators are not biased for symmetric distributions. Similar statements are true for the sample kurtosis.

This statement might initially surprise you. After all, the statistics that we use to estimate the mean and variance are unbiased. Although biased estimates are not inherently "bad," it is useful to get an intuitive feel for how biased an estimator might be.
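For reference, the skewness statistic that SAS computes by default (the adjusted Fisher–Pearson coefficient, with VARDEF=DF) is
\[
\mbox{skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^{3} ,
\]
where \(\bar{x}\) is the sample mean and s is the sample standard deviation.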

Let's demonstrate the bias in the skewness statistic by running a Monte Carlo simulation. Choose an unsymmetric univariate distribution for which the population skewness is known. For example, the exponential distribution has skewness equal to 2. Then do the following:

  1. Choose a sample size N. Generate B random samples of size N from the chosen distribution.
  2. Compute the sample skewness for each sample.
  3. Compute the average of the B skewness values, which is a Monte Carlo estimate of the expected value of the sample skewness. The difference between the Monte Carlo estimate and the parameter value is an estimate of the bias of the statistic.

In this article, I will generate B=10,000 random samples of size N=100 from the exponential distribution. The simulation shows that the expected value of the skewness is NOT close to the population parameter. Hence, the skewness statistic is biased.

A Monte Carlo simulation of skewness

The following DATA step simulates B random samples of size N from the exponential distribution. The call to PROC MEANS computes the sample skewness for each sample. The call to PROC SGPLOT displays the approximate sampling distribution of the skewness. The graph overlays a vertical reference line at 2, which is the skewness parameter for the exponential distribution, and also overlays a reference line at the Monte Carlo estimate of the expected value.

%let NumSamples = 10000;
%let N = 100;
/* 1. Simulate B random samples from exponential distribution */
data Exp;
call streaminit(1);
do SampleID = 1 to &NumSamples;
   do i = 1 to &N;
      x = rand("Expo");
      output;
   end;
end;
run;
 
/* 2. Estimate skewness (and other stats) for each sample */
proc means data=Exp noprint;
   by SampleID;
   var x;
   output out=MCEst mean=Mean stddev=StdDev skew=Skewness kurt=Kurtosis;
run;
 
/* 3. Graph the sampling distribution and overlay parameter value */
title "Monte Carlo Distribution of Sample Skewness";
title2 "N = &N; B = &NumSamples";
proc sgplot data=MCEst;
   histogram Skewness;
   refline 2 / axis=x lineattrs=(thickness=3 color=DarkRed) 
             labelattrs=(color=DarkRed) label="Parameter";
   refline 1.818 / axis=x lineattrs=(thickness=3 color=DarkBlue) label="Monte Carlo Estimate"
             labelattrs=(color=DarkBlue) labelloc=inside ;
run;
 
/* 4. Display the Monte Carlo estimate of the statistics */ 
proc means data=MCEst ndec=3 mean stddev;
   var Mean StdDev Skewness Kurtosis;
run;
Monte Carlo distribution of skewness statistic (B=10000, N=100)

For the exponential distribution, the skewness parameter has the value 2. However, according to the Monte Carlo simulation, the expected value of the sample skewness is about 1.82 for these samples of size 100. Thus, the bias is approximately 0.18, which is about 9% of the true value.

The kurtosis statistic is also biased. The output from PROC MEANS includes the Monte Carlo estimates for the expected value of the sample mean, standard deviation, skewness, and (excess) kurtosis. For the exponential distribution, the parameter values are 1, 1, 2, and 6, respectively. The Monte Carlo estimates for the sample mean and standard deviation are close to the parameter values because the sample mean is unbiased and the bias of the sample standard deviation is negligible for samples of this size. However, the estimates for the skewness and kurtosis are biased towards zero.

Summary

This article uses Monte Carlo simulation to demonstrate bias in the commonly used definitions of skewness and kurtosis. For skewed distributions, the expected value of the sample skewness is biased towards zero. The bias is greater for highly skewed distributions. The skewness statistic for a symmetric distribution is unbiased.

The post The sample skewness is a biased statistic appeared first on The DO Loop.