Statistical Thinking

March 28, 2018
 

Suppose you want to find observations in multivariate data that are closest to a numerical target value. For example, for the students in the Sashelp.Class data set, you might want to find the students whose (Age, Height, Weight) values are closest to the triplet (13, 62, 100). The way to do this is to compute a distance from each observation to the target. Unfortunately, there are many definitions of distance! Which distance should you use? This article describes and compares a few common distances and shows how to compute them in SAS.

Euclidean and other distances

The two most widely used distances are the Euclidean distance (called the L2 distance by mathematicians) and the "sum of absolute differences" distance, which is better known as the L1 distance or occasionally the taxicab distance. In one dimension, these distances are equal. However, when you have multiple coordinates, the distances are different. The Euclidean and L1 distances between (x1, x2, ..., xn) and a target vector (t1, t2, ..., tn) are defined as follows:
L1 distance: |x1-t1| + |x2-t2| + ... + |xn-tn|
Euclidean distance: sqrt( (x1-t1)^2 + (x2-t2)^2 + ... + (xn-tn)^2 )

Both of these distances are supported in the SAS DATA step. You can use the EUCLID function to compute the Euclidean distance and the SUMABS function to compute the L1 distance. For example, the following DATA step computes the distance from each observation to the target value (Age, Height, Weight) = (13, 62, 100):

data Closest;
/* target (Age, Height, Weight) = (13, 62, 100) */
set Sashelp.Class;
EuclidDist = euclid(Age-13, Height-62, Weight-100);
L1Dist     = sumabs(Age-13, Height-62, Weight-100);
run;
 
/* sort by Euclidean distance */
proc sort data=Closest out=Euclid; by EuclidDist; run;
 
/* plot the distances for each observation */
title "Distance to Target Values";
proc sgplot data=Euclid;
   series x=Name y=EuclidDist / curvelabel="Euclidean";  *datalabel=Weight;
   series x=Name y=L1Dist     / curvelabel="L1";         *datalabel=Weight;
   yaxis grid label="Distance";
   xaxis grid discreteorder=data;
run;
Euclidean and L1 distances between observations and a target value

The graph shows the Euclidean and L1 distances from each student's data to the target value. The X axis is ordered by the Euclidean distance from each observation to the target value. If you add the DATALABEL=Weight or DATALABEL=Height options to the SERIES statements, you can see that the students who appear near the left side of the X axis have heights and weights that are close to the target values (Height, Weight) = (62, 100). The students who appear to the right are much taller/heavier or shorter/lighter than the target values. In particular, Joyce is the smallest student and Philip is the largest.

If you order the students by their L1 distances, you will obtain a different ordering. For example, in an L1 ranking, John and Henry would switch positions and Jeffrey would be ranked 7th instead of 12th.

Distances that account for scale

There is a problem with the computations in the previous section: the variables are measured in different units, but the computations do not account for these differences. In particular, Age contributes very little to the distance computations because all student ages are within three years of the target age. In contrast, Weight dominates the computations: the heaviest student (Philip) weighs 50 pounds more than the target weight.

The distance formula needs to account for the amount of variation within each variable. If you run PROC MEANS on the data, you will discover that the standard deviation of the Age variable is about 1.5, whereas the standard deviations for the Height and Weight variables are about 5.1 and 22.8, respectively. It follows that one year of age should be treated as a substantial difference, whereas one pound of weight should be treated as a small difference.

A way to correct for differences among the scales of variables is to standardize the variables. In SAS, you can use PROC STDIZE to standardize the variables in a variety of ways. By default, the procedure centers each variable by subtracting the sample mean; it scales by dividing by the standard deviation. The following call also writes the mean and standard deviation of each variable to a data set, which is displayed by using PROC PRINT:

proc stdize data=Sashelp.Class out=StdClass outstat=StdIn method=STD;
   var Age Height Weight;
run;
 
proc print data=StdIn(obs=2);
run;

Notice that the target value is not part of the standardization! Only the data are used. However, you must convert the target value to the new coordinates before you can compute the standardized distances. You can standardize the target value by using the METHOD=IN option in PROC STDIZE to tell the procedure to transform the target value by using the location and scale values in the StdIn data set, as follows:

data Target;
   Age=13; Height=62; Weight=100;
run;
 
proc stdize data=Target out=StdTarget method=in(StdIn);
   var Age Height Weight;
run;
 
proc print data=StdTarget; run;

The data and the target values are now in standardized coordinates. You can therefore repeat the earlier DATA step and compute the Euclidean and L1 distances in the new coordinates, as sketched below. The graph of the standardized distances is shown below:
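
The following DATA step is a minimal sketch of that computation (not necessarily the downloadable program mentioned at the end of this article); it assumes the StdClass and StdTarget data sets created by the previous PROC STDIZE calls:

data StdClosest;
   /* copy the standardized target into each observation, then compute distances */
   if _N_ = 1 then set StdTarget(rename=(Age=tAge Height=tHeight Weight=tWeight));
   set StdClass;
   EuclidDist = euclid(Age-tAge, Height-tHeight, Weight-tWeight);
   L1Dist     = sumabs(Age-tAge, Height-tHeight, Weight-tWeight);
run;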

In the standardized coordinates "one unit" corresponds to one standard deviation from the mean in each variable. Thus Age is measured in units of 1.5 years, Height in units of 5.1 inches, and Weight in units of 22.8 pounds. Notice that some of the students have changed positions. Carol is still very similar to the target value, but Jeffrey is now more similar to the target value than previously thought. The smallest student (Joyce) and largest student (Philip) are still dissimilar (very distant) from the target.

Distances that account for correlation

In the previous section, different scales in the data are handled by using distances in a standardized coordinate system. This is an improvement over the unstandardized computations. However, there is one more commonly used distance computation, called the Mahalanobis distance. The Mahalanobis distance is similar to the standardized L2 distance but also accounts for correlations between the variables. I have previously discussed the meaning of Mahalanobis distance and have described how you can use the inverse Cholesky decomposition to uncorrelate variables. I won't repeat the arguments, but the following graph displays the Mahalanobis distance between each observation and the target value. Some students are in a different order than they were for the standardized distances:
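
For readers who want to experiment, the following SAS/IML statements are a minimal sketch that computes the Mahalanobis distance directly from the inverse covariance matrix. (The author's program uses the inverse Cholesky approach; the direct formula below is an alternative, not the author's exact code.)

proc iml;
use Sashelp.Class;
read all var {"Age" "Height" "Weight"} into X;
close;
target = {13 62 100};
S = cov(X);                                   /* sample covariance of the data    */
D = X - target;                               /* deviations from the target value */
MahalDist = sqrt( vecdiag(D * inv(S) * D`) ); /* Mahalanobis distance of each obs */
print MahalDist;
quit;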

If you would like to examine or modify any of my computations, you can download the SAS program that computes the various distances.

In summary, there are several reasonable definitions of distance between multivariate observations. The simplest is the raw Euclidean distance, which is appropriate when all variables are measured on the same scale. The next simplest is the standardized distance, which is appropriate if the scales of the variables are vastly different and the variables are uncorrelated. The third is the Mahalanobis distance, which becomes important if you want to measure distances in a way that accounts for correlation between variables.

The post Find the distance between observations and a target value appeared first on The DO Loop.

January 15, 2018
 

Last week I got the following message:

Dear Rick: How can I create a normal distribution within a specified range (min and max)? I need to simulate a normal distribution that fits within a specified range. I realize that a normal distribution is by definition infinite... Are there any alternatives, such as a distribution that has shape and properties similar to a normal distribution, but for which I can restrict the range? - Carol

Mathematically, this request doesn't make sense, as Carol (not her real name) acknowledges. However, if you are working with a client who is not statistically savvy, you might be asked to do the impossible! How can Carol make sense of this request?

I provide a two-part answer. First I discuss general ways to choose a distribution that models a data sample. You can use these methods when you have access to the data. Then I address the specific question and present five distributions that you can use to model data that looks like a "bounded" normal distribution. These distributions can be used when you don't have access to the data.

How to choose a distribution that matches the data?

In general, statisticians are often asked to choose a model that seems to fit an observed set of data. In simulation studies, the model is used to simulate samples that have the same distributional properties as the real data. In the book Simulating Data with SAS (Chapter 16), I discuss many strategies, including the following:

  • Use domain-specific knowledge to guide you. There might be physical or biological reasons to prefer one probability distribution over another.
  • Sample from the empirical distribution of the data, which is equivalent to bootstrap resampling. For more information, see the article about how to bootstrap in SAS or Chapter 15 of Simulating Data with SAS.
  • Use a well-known “named” parametric distribution to model the data, such as the normal, gamma, and beta distributions. Typically you will fit several candidate distributions to the data to see which fits best. (In SAS, you can use PROC UNIVARIATE, PROC SEVERITY, or PROC NLMIXED to fit distributions.) After you choose a distribution that fits the data, you can draw random samples from the fitted distribution.
  • For univariate data, you can choose a flexible system of distributions such as the Pearson system or the Johnson system. The Johnson system is supported by PROC UNIVARIATE (Wicklin, p. 112–114).
  • Use a graphical tool, called the moment-ratio diagram, to help select candidate distributions for the data. Traditionally the moment-ratio diagram is a theoretical tool for understanding the relationships of families and systems of distributions, but Chapter 16 shows how to use the moment-ratio diagram as a tool to organize simulation studies.

These ideas (and others) are illustrated in the following diagram, which is also from Wicklin (2013, p. 298):

The flowchart shows a few paths that a researcher can follow to model data. Most of the paths are also applicable to multivariate data.

Simulate data from the "eyeball distribution"

Now let's discuss the specific question that I was asked: how to simulate a "bounded" normal distribution. If Carol has access to the data, she could use the techniques in the previous section. Consequently, I assume that Carol does not have access to the data. This can happen in several ways, such as when you are trying to reproduce the results in a published article but the article includes only summary statistics or a graph of the data. If you cannot obtain the original data from the authors, you might be forced to simulate fake data based only on the graph and the summary statistics.

I call this using the "eyeball distribution." For non-native speakers of English, "to eyeball" means to look at or observe something. When used as an adjective, “eyeball” indicates that a quantity was obtained by visual inspection, rather than through formal measurements. Thus an "eyeball distribution" is one that is chosen heuristically because it looks similar to the histogram of the data. (You can extend these ideas to multivariate data.)

From Carol's description, the histogram of her data is symmetric, unimodal, and "bell-shaped." The data are in the interval [a, b]. I can think of several "eyeball distributions" that could model data like this:

  1. If you know the sample mean (m) and standard deviation (s), then draw random samples from N(m, s).
  2. If you know sample quantiles of the data, you can approximate the CDF and use inverse probability sampling for the simulation (see the sketch after this list).
  3. Theory tells us that 99.7% of the normal distribution is within three standard deviations of the mean. For symmetric distributions, the mean equals the midrange m = (a + b)/2. Thus you could use the three-sigma rule to approximate s ≈ (b - a)/6 and sample from N(m, s).
  4. If you have domain knowledge that guarantees that the data are truly bounded on [a, b], you could use the truncated normal distribution, which samples from N(m, s) but discards any random variates that are outside of the interval.
  5. Alternatively, you can sample from a symmetric bounded distribution such as the Beta(α, α) distribution. For example, Beta(5, 5) is approximately bell-shaped on the interval [0,1].
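
The second option is not implemented in the program later in this article, so here is a minimal sketch (the probability-quantile pairs are hypothetical): approximate the inverse CDF by linear interpolation between known sample quantiles, then use inverse probability sampling.

data InvCDFSim(keep=x);
array p[5] _temporary_ (0.05 0.25 0.50 0.75 0.95);   /* cumulative probabilities */
array q[5] _temporary_ (6.2  8.4  10.0 11.6 13.8);   /* assumed sample quantiles */
call streaminit(54321);
do i = 1 to 500;
   u = p[1] + (p[5] - p[1])*rand("Uniform");   /* uniform value in the observed range */
   do j = 2 to 5 until (u <= p[j]);  end;      /* find the bracketing quantiles       */
   /* linear interpolation of the inverse CDF */
   x = q[j-1] + (u - p[j-1]) / (p[j] - p[j-1]) * (q[j] - q[j-1]);
   output;
end;
run;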

The first option is often the most reasonable choice. Even though the original data are in the interval [a, b], the simulated data do not have to be in the same interval. If you believe that the original data are normally distributed, then simulate from that model and accept that you might obtain simulated data that are beyond the bounds of the original sample. The same fact holds for the third option, which estimates the standard deviation from a histogram.

The following SAS DATA step implements the last three options. The sample distributions are displayed by using a comparative histogram in SAS:

%let N = 500;      /* sample size */
%let min = 5;
%let max = 15;
 
proc format;
value MethodFmt 1 = "Normal, 3-Sigma Rule"
                2 = "Truncated Normal"
                3 = "Beta(5, 5), Scaled";
run;
data BellSim(keep=x Method);
format method MethodFmt.;
call streaminit(54321);  /* set random number seed */
numSD = 3;               /* use 3-sigma rule to estimate std dev */
mu = (&min + &max)/2;
sigma = (&max - &min)/ (2*numSD);
 
method = 1;        /* Normal distribution N(mu, sigma) */
   do i = 1 to &N;
      x = rand("Normal", mu, sigma);  output;
   end;
 
method = 2;        /* truncated normal distribution TN(mu, sigma; a, b) */
   i = 0;
   do until (i = &N);
      x = rand("Normal", mu, sigma);
      if &min < x < &max then do;
         output;
         i + 1;
      end;
   end;
 
method = 3;        /* symmetric beta distribution, scaled into [a, b] */
   alpha = 5;
   do i = 1 to &N;
      x = &min + (&max - &min)*rand("beta", alpha, alpha);
      output;
   end;
run;
 
ods graphics / width=480px height=480px;
title "Simulate Bell-Shaped Data on [&min, &max]";
proc sgpanel data=BellSim;
   panelby method / columns=1 novarname;
   histogram x;
   refline &min &max / axis=x;
   colaxis values=(&min to &max) valueshint;
run;

The histograms show a simulated sample of size 500 for each of the three models. The top sample is from the normal distribution. It contains simulated observations that are outside of the interval [a,b]. In many situations, that is fine. The second sample is from the truncated normal distribution on [a,b]. The third is from a Beta(5,5) distribution, which has been scaled onto the interval [a,b].

Although the preceding SAS program answers the question Carol asked, let me emphasize that if you (and Carol) have access to the data, you should use it to choose a model. When the data are unobtainable, you can use an "eyeball distribution," which is basically a guess. However, if your goal is to generate fake data so that you can test a computational method, an eyeball distribution might be sufficient.

The post Data unavailable? Use the "eyeball distribution" to simulate appeared first on The DO Loop.

October 25, 2017
 

This article describes the advantages and disadvantages of principal component regression (PCR) and presents some alternative techniques to PCR.

In a previous article, I showed how to compute a principal component regression in SAS. Recall that principal component regression is a technique for handling near collinearities among the regression variables in a linear regression. The PCR algorithm in most statistical software is more correctly called "incomplete" PCR because it uses only a subset of the principal components. Incomplete PCR means that you compute the principal components for the explanatory variables, keep only the first k principal components (which explain most of the variance among the regressors), and regress the response variable onto those k components.

The principal components that are dropped correspond to the near collinearities. Consequently, the standard errors of the parameter estimates are reduced, although the tradeoff is that the estimates are biased, and "the bias increases as more [principal components] are dropped" (Jackson, p. 276).

Some arguments in this article are from J. E. Jackson's excellent book, A User's Guide to Principal Components, (1991, pp. 271–278). Jackson introduces PCR and then immediately cautions against using it (p. 271). He writes that PCR "is a widely used technique," but "it also has some serious drawbacks." Let's examine the advantages and disadvantages of principal component regression.

Advantages of principal component regression

Principal component regression is a popular and widely used method. Advantages of PCR include the following:

  • PCR can perform regression when the explanatory variables are highly correlated or even collinear.
  • PCR is intuitive: you replace the basis {X1, X2, ..., Xp} with an orthogonal basis of principal components, drop the components that do not explain much variance, and regress the response onto the remaining components.
  • PCR is automatic: The only decision you need to make is how many principal components to keep.
  • The principal components that are dropped give insight into which linear combinations of variables are responsible for the collinearities.
  • PCR has a discrete parameter, namely the number of components kept. This parameter is very interpretable in terms of geometry (linear dimensions kept) and in terms of linear algebra (low-rank approximations).
  • You can run PCR when there are more variables than observations (wide data).

Drawbacks of principal component regression

The algorithm that is currently known as PCR is actually a misinterpretation of the original ideas behind PCR (Jolliffe, 1982, p. 201). When Kendall and Hotelling first proposed PCR in the 1950s, they proposed "complete" PCR, which means replacing the original variables by all the principal components, thereby stabilizing the numerical computations. Which principal components are included in the final model is determined by looking at the significance of the parameter estimates. By the early 1980s, the term PCR had changed to mean "incomplete PCR."

The primary argument against using (incomplete) principal component regression can be summarized in a single sentence: Principal component regression does not consider the response variable when deciding which principal components to drop. The decision to drop components is based only on the magnitude of the variance of the components.

There is no a priori reason to believe that the principal components with the largest variance are the components that best predict the response. In fact, it is trivial to construct an artificial example in which the best predictor is the last component, which will surely be dropped from the analysis. (Just define the response to be the last principal component!) More damning, Jolliffe (1982, p. 302) presents four examples from published papers that advocate PCR, and he shows that some of the low-variance components (which were dropped) have greater predictive power than the high-variance components that were kept. Jolliffe concludes that "it is not necessary to find obscure or bizarre data in order for the last few principal components to be important in principal component regression. Rather it seems that such examples may be rather common in practice."

There is a hybrid version of PCR that enables you to use cross validation and the predicted residual sum of squares (PRESS) criterion to select how many components to keep. (In SAS, the syntax is proc pls method=PCR cv=one cvtest(stat=press).) Although this partially addresses the issue by including the response variable in the selection of components, it is still the case that the first k components are selected and the last p – k are dropped. The method never keeps the first, third, and sixth components, for example.
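
For reference, a minimal sketch of that hybrid call is shown below (the data set and variable names are hypothetical):

proc pls data=MyData method=pcr cv=one cvtest(stat=press);
   model Y = x1-x10;      /* response and correlated explanatory variables */
run;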

Alternatives to principal component regression

Some alternatives to principal component regression include the following:

  • Ridge regression: In ridge regression, a diagonal matrix is added to the X`X matrix so that it becomes better conditioned. This results in biased parameter estimates. You can read an explanation of ridge regression and how to compute it by using PROC REG in SAS.
  • Complete PCR: As mentioned previously, use the PCs as the variables and keep the components whose parameter estimates are significant.
  • Complete PCR with variable selection: Use the PCs as the variables and use the variable-selection techniques to decide which components to retain. However, if your primary goal is variable reduction, then use variable-selection techniques on the original variables.
  • Partial Least Squares (PLS): Partial least squares regression is similar to PCR in that both select components that explain the most variance in the model. The difference is that PLS incorporates the response variable. That is, the components that are produced are those that explain the most variance in the explanatory AND response variables. In SAS, you can compute a PLS regression by using PROC PLS with METHOD=PLS or METHOD=SIMPLS. You will probably also want to use the CV and CVTEST options.

Summary

In summary, principal component regression is a technique for computing regressions when the explanatory variables are highly correlated. It has several advantages, but the main drawback of PCR is that the decision about how many principal components to keep does not depend on the response variable. Consequently, some of the variables that you keep might not be strong predictors of the response, and some of the components that you drop might be excellent predictors. A good alternative is partial least squares regression, which I recommend. In SAS, you can run partial least squares regression by using PROC PLS with METHOD=PLS.

The post Should you use principal component regression? appeared first on The DO Loop.

October 2, 2017
 
Visualization of regression analysis that uses a weight variable in SAS

How can you specify weights for a statistical analysis? Hmmm, that's a "weighty" question! Many people on discussion forums ask "What is a weight variable?" and "How do you choose a weight for each observation?" This article gives a brief overview of weight variables in statistics and includes examples of how weights are used in SAS.

Different kinds of weight variables

One source of confusion is that different areas of statistics use weights in different ways. All weights are not created equal! The weights in survey statistics have a different interpretation from the weights in a weighted least squares regression.

Let's start with a basic definition. A weight variable provides a value (the weight) for each observation in a data set. The i_th weight value, wi, is the weight for the i_th observation. For most applications, a valid weight is nonnegative. A zero weight usually means that you want to exclude the observation from the analysis. Observations that have relatively large weights have more influence in the analysis than observations that have smaller weights. An unweighted analysis is the same as a weighted analysis in which all weights are 1.

There are several kinds of weight variables in statistics. At the 2007 Joint Statistical Meetings in Denver, I discussed weighted statistical graphics for two kinds of statistical weights: survey weights and regression weights. An audience member informed me that STATA software provides four definitions of weight variables, as follows:

  • Frequency weights: A frequency variable specifies that each observation is repeated multiple times. Each frequency value is a nonnegative integer.
  • Survey weights: Survey weights (also called sampling weights or probability weights) indicate that an observation in a survey represents a certain number of people in a finite population. Survey weights are often the reciprocals of the selection probabilities for the survey design.
  • Analytical weights: An analytical weight (sometimes called an inverse variance weight or a regression weight) specifies that the i_th observation comes from a sub-population with variance σ²/wi, where σ² is a common variance and wi is the weight of the i_th observation. These weights are used in multivariate statistics and in meta-analyses where each "observation" is actually the mean of a sample.
  • Importance weights: According to a STATA developer, an "importance weight" is a STATA-specific term that is intended "for programmers, not data analysts." The developer says that the formulas "may have no statistical validity" but can be useful as a programming convenience. Although I have never used STATA, I imagine that a primary use is to downweight the influence of outliers.

Frequency weights and survey weights

I have previously written about an example that shows the distinction between a frequency variable and a weight variable in regression. Briefly, a frequency variable is a notational convenience that enables you to compactly represent the data. A frequency variable determines the sample size (and the degrees of freedom), but using a frequency variable is always equivalent to "expanding" the data set. (To expand the data, create fi identical observations when the i_th value of the frequency variable is fi.) An analysis of the expanded data is identical to the same analysis on the original data that uses a frequency variable.

In SAS, the FREQ statement enables you to specify a frequency variable in most procedures. Ironically, you specify survey weights by using the WEIGHT statement in the SAS SURVEY procedures, which analyze survey data. The SURVEY procedures (including SURVEYMEANS, SURVEYFREQ, and SURVEYREG) also support stratified samples and strata weights.

Inverse variance weights

Inverse variance weights are appropriate for regression and other multivariate analyses. When you include a weight variable in a multivariate analysis, the crossproduct matrix is computed as X`WX, where W is the diagonal matrix of weights and X is the data matrix (possibly centered or standardized). In these analyses, the weight of an observation is assumed to be inversely proportional to the variance of the subpopulation from which that observation was sampled. You can "manually" reproduce a lot of formulas for weighted multivariate statistics by multiplying each row of the data matrix (and the response vector) by the square root of the appropriate weight.
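
As a concrete illustration of that trick, the following SAS/IML statements are a minimal sketch (the data and weight values are hypothetical) that reproduces the weighted normal equations by scaling the rows of X and Y by the square roots of the weights:

proc iml;
X = {1 2, 1 3, 1 5, 1 7};     /* small design matrix (intercept and one regressor) */
Y = {1, 2, 4, 8};             /* response vector                                   */
w = {1, 1, 2, 0.5};           /* analytical (inverse variance) weights             */
rootW = diag(sqrt(w));        /* diagonal matrix of sqrt(w[i])                     */
Xw = rootW * X;               /* scale each row of X by sqrt(w[i])                 */
Yw = rootW * Y;
b = solve(Xw`*Xw, Xw`*Yw);    /* solves X`WX * b = X`WY                            */
print b;
quit;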

In particular, if you use a weight variable in a regression procedure, you get a weighted regression analysis. For regression, the right side of the normal equations is X`WY.

You can also use weights to analyze a set of means, such as you might encounter in meta-analysis or an analysis of means. The weight that you specify for the i_th mean should be inversely proportional to the variance of the i_th sample. Equivalently, the weight for the i_th group is (approximately) proportional to the sample size of the i_th group.

In SAS, most regression procedures support WEIGHT statements. For example, PROC REG performs a weighted least squares regression. The multivariate analysis procedures (DISCRIM, FACTOR, PRINCOMP, ...) use weights to form a weighted covariance or correlation matrix. You can use PROC GLM to compute a meta-analysis of data that are the means from previous studies.
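
For example, a weighted least squares regression in PROC REG is a one-statement change; the following sketch uses a hypothetical data set and variable names:

proc reg data=MyData;
   weight w;             /* w[i] is proportional to 1/variance of the i_th observation */
   model Y = x1 x2;
run;
quit;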

What happens if you "make up" a weight variable?

Analysts can (and do!) create weights arbitrarily based on "gut feelings." You might say, "I don't trust the value of this observation, so I'm going to downweight it." Suppose you assign Observation 1 twice as much weight as Observation 2 because you feel that Observation 1 is twice as "trustworthy." How does a multivariate procedure interpret those weights?

In statistics, precision is the inverse of the variance. When you use those weights you are implicitly stating that you believe that Observation 2 is from a population whose variance is twice as large as the population variance for Observation 1. In other words, "less trust" means that you have less faith in the precision of the measurement for Observation 2 and more faith in the precision of Observation 1.

Examples of weighted analyses in SAS

In SAS, many procedures support a WEIGHT statement. The documentation for each procedure describes how the procedure incorporates weights. In addition to the previously mentioned procedures, the following articles show examples of weighted analyses:

  • How to compute and interpret a weighted mean
  • How to compute and interpret weighted quantiles or weighted percentiles
  • How to compute and visualize a weighted linear regression

The post How to understand weight variables in statistical analyses appeared first on The DO Loop.

September 20, 2017
 
Fisher's Z Transformation: z = arctanh(r)

Pearson's correlation measures the linear association between two variables. Because the correlation is bounded in the interval [-1, 1], the sampling distribution for highly correlated variables is highly skewed. Even for bivariate normal data, the skewness makes it challenging to estimate confidence intervals for the correlation, to run one-sample hypothesis tests ("Is the correlation equal to 0.5?"), and to run two-sample hypothesis tests ("Do these two samples have the same correlation?").

In 1921, R. A. Fisher studied the correlation of bivariate normal data and discovered a wonderful transformation (shown to the right) that converts the skewed distribution of the sample correlation (r) into a distribution that is approximately normal. Furthermore, whereas the variance of the sampling distribution of r depends on the correlation, the variance of the transformed distribution is independent of the correlation. The transformation is called Fisher's z transformation. This article describes Fisher's z transformation and shows how it transforms a skewed distribution into a normal distribution.

The distribution of the sample correlation

The following graph (click to enlarge) shows the sampling distribution of the correlation coefficient for bivariate normal samples of size 20 for four values of the population correlation, rho (ρ). You can see that the distributions are very skewed when the correlation is large in magnitude.

Sampling distributions of correlation for bivariate normal data of size N=20

The graph was created by using simulated bivariate normal data as follows:

  1. For rho=0.2, generate M random samples of size 20 from a bivariate normal distribution with correlation rho. (For this graph, M=2500.)
  2. For each sample, compute the Pearson correlation.
  3. Plot a histogram of the M correlations.
  4. Overlay a kernel density estimate on the histogram and add a reference line to indicate the correlation in the population.
  5. Repeat the process for rho=0.4, 0.6, and 0.8.
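
A minimal SAS/IML sketch of the first two steps is shown below (not necessarily the program used to create the graph; it assumes M=2500 samples of size N=20 and rho=0.2, and omits the histogram steps):

proc iml;
call randseed(1);
N = 20;  M = 2500;  rho = 0.2;
Sigma = (1 || rho) // (rho || 1);    /* correlation matrix of the population      */
r = j(M, 1);                         /* allocate a vector for the M correlations  */
do i = 1 to M;
   X = randnormal(N, {0 0}, Sigma);  /* step 1: bivariate normal sample of size N */
   r[i] = corr(X)[1,2];              /* step 2: Pearson correlation of the sample */
end;
create SimCorr var {"r"};  append;  close;   /* write correlations for plotting   */
quit;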

The histograms approximate the sampling distribution of the correlation coefficient (for bivariate normal samples of size 20) for the various values of the population correlation. The distributions are not simple. Notice that the variance and the skewness of the distributions depend on the value of the underlying correlation (ρ) in the population.

Fisher's transformation of the correlation coefficient

Fisher sought to transform these distributions into normal distributions. He proposed the transformation f(r) = arctanh(r), which is the inverse hyperbolic tangent function. The graph of arctanh is shown at the top of this article. Fisher's transformation can also be written as (1/2)log( (1+r)/(1-r) ). This transformation is sometimes called Fisher's "z transformation" because the letter z is used to represent the transformed correlation: z = arctanh(r).

How he came up with that transformation is a mystery to me, but he was able to show that arctanh is a normalizing and variance-stabilizing transformation. That is, when r is the sample correlation for bivariate normal data and z = arctanh(r) then the following statements are true (See Fisher, Statistical Methods for Research Workers, 6th Ed, pp 199-203):

Transformed sampling distributions of correlation for bivariate normal data of size N=20
  • The distribution of z is approximately normal and "tends to normality rapidly as the sample is increased" (p 201).
  • The standard error of z is approximately 1/sqrt(N-3), which is independent of the value of the correlation.

The graph to the right demonstrates these statements. The graph is similar to the preceding panel, except these histograms show the distributions of the transformed correlations z = arctanh(r). In each cell, the vertical line is drawn at the value arctanh(ρ). The curves are normal density estimates with σ = 1/sqrt(N-3), where N=20.

The two features of the transformed variables are apparent. First, the distributions are normally distributed, or, to quote Fisher, "come so close to it, even for a small sample..., that the eye cannot detect the difference" (p. 202). Second, the variances of these distributions are constant and are independent of the underlying correlation.

Fisher's transformation and confidence intervals

From the graph of the transformed variables, it is clear why Fisher's transformation is important. If you want to test some hypothesis about the correlation, the test can be conducted in the z coordinates where all distributions are normal with a known variance. Similarly, if you want to compute a confidence interval, the computation can be made in the z coordinates and the results "back transformed" by using the inverse transformation, which is r = tanh(z).

You can perform the calculations by applying the standard formulas for normal distributions (see p. 3-4 of Shen and Lu (2006)), but most statistical software provides an option to use the Fisher transformation to compute confidence intervals and to test hypotheses. In SAS, the CORR procedure supports the FISHER option to compute confidence intervals and to test hypotheses for the correlation coefficient.

The following call to PROC CORR computes a sample correlation between the length and width of petals for 50 Iris versicolor flowers. The FISHER option specifies that the output should include confidence intervals based on Fisher's transformation. The RHO0= suboption tests the null hypothesis that the correlation in the population is 0.75. (The BIASADJ= suboption turns off a bias adjustment; a discussion of the bias in the Pearson estimate will have to wait for another article.)

proc corr data=sashelp.iris fisher(rho0=0.75 biasadj=no);
   where Species='Versicolor';
   var PetalLength PetalWidth;
run;
Use Fisher's transformation to compute confidence intervals and to test hypotheses in PROC CORR in SAS

The output shows that the Pearson estimate is r=0.787. A 95% confidence interval for the correlation is [0.651, 0.874]. Notice that r is not the midpoint of that interval. In the transformed coordinates, z = arctanh(0.787) = 1.06 is the center of a symmetric confidence interval (based on a normal distribution with standard error 1/sqrt(N-3)). However, the inverse transformation (tanh) is nonlinear, and the right half-interval gets compressed more than the left half-interval.
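
As a check, the following DATA step is a sketch that applies the formulas directly; the values of r and N are taken from the PROC CORR example above, and the computation reproduces the reported interval:

data FisherCI;
   r = 0.787;  N = 50;                  /* sample correlation and sample size */
   z = atanh(r);                        /* Fisher's z transformation          */
   se = 1/sqrt(N-3);                    /* standard error of z                */
   zCrit = quantile("Normal", 0.975);   /* critical value for a 95% interval  */
   LCL = tanh(z - zCrit*se);            /* back-transform the endpoints       */
   UCL = tanh(z + zCrit*se);
run;
 
proc print data=FisherCI noobs; run;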

For the hypothesis test of ρ = 0.75, the output shows that the p-value is 0.574. The data do not provide evidence to reject the hypothesis that ρ = 0.75 at the 0.05 significance level. The computations for the hypothesis test use only the transformed (z) coordinates.

Summary

This article shows that Fisher's "z transformation," which is z = arctanh(r), is a normalizing transformation for the Pearson correlation of bivariate normal samples of size N. The transformation converts the skewed and bounded sampling distribution of r into a normal distribution for z. The standard error of the transformed distribution is 1/sqrt(N-3), which does not depend on the correlation. You can perform hypothesis tests in the z coordinates. You can also form confidence intervals in the z coordinates and use the inverse transformation (r=tanh(z)) to obtain a confidence interval for ρ.

The Fisher transformation is exceptionally useful for small sample sizes because, as shown in this article, the sampling distribution of the Pearson correlation is highly skewed for small N. When N is large, the sampling distribution of the Pearson correlation is approximately normal except for extreme correlations. Although the theory behind the Fisher transformation assumes that the data are bivariate normal, in practice the Fisher transformation is useful as long as the data are not too skewed and do not contain extreme outliers.

You can download the SAS program that creates all the graphs in this article.

The post Fisher's transformation of the correlation coefficient appeared first on The DO Loop.

August 2, 2017
 

Last week I blogged about the broken-stick problem in probability, which reminded me that the broken-stick model is one of the many techniques that have been proposed for choosing the number of principal components to retain during a principal component analysis. Recall that for a principal component analysis (PCA) of p variables, a goal is to represent most of the variation in the data by using k new variables, where hopefully k is much smaller than p. Thus PCA is known as a dimension-reduction algorithm.

Many researchers have proposed methods for choosing the number of principal components. Some methods are heuristic, others are statistical. No method is perfect. Often different techniques result in different suggestions.

This article uses SAS to implement the broken stick model and compares that method with three other simple rules for dimension reduction. A few references are provided at the end of this article.

A principal component analysis by using PROC PRINCOMP

Let's start with an example. In SAS, you can use the PRINCOMP procedure to conduct a principal component analysis. The following example is taken from the Getting Started example in the PROC PRINCOMP documentation. The program analyzes seven crime rates for the 50 US states in 1977. (The documentation shows how to generate the data set.) The following call generates a scree plot, which shows the proportion of variance explained by each component. It also writes the Eigenvalues table to a SAS data set:

proc princomp data=Crime plots=scree;
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
   id State;
   ods output Eigenvalues=Evals;
run;
Eigenvalues for a principal component analysis in SAS
Scree plot of eigenvalues for a principal component analysis in SAS

The panel shows two graphs that plot the numbers in the "Eigenvalues of the Correlation Matrix" table. The plot on the left is the scree plot, which is a graph of the eigenvalues. The sum of the eigenvalues is 7, which is the number of variables in the analysis. If you divide each eigenvalue by 7, you obtain the proportion of variance that each principal component explains. The graph on the right plots the proportions and the cumulative proportions.

The scree plot as a guide to retaining components

The scree plot is my favorite graphical method for deciding how many principal components to keep. If the scree plot contains an "elbow" (a sharp change in the slopes of adjacent line segments), that location might indicate a good number of principal components (PCs) to retain. For this example, the scree plot shows a large change in slopes at the second eigenvalue and a smaller change at the fourth eigenvalue. From the graph of the cumulative proportions, you can see that the first two PCs explain 76% of the variance in the data, whereas the first four PCs explain 91%.

If "detect the elbow" is too imprecise for you, a more precise algorithm is to start at the right-hand side of the scree plot and look at the points that lie (approximately) on a straight line. The leftmost point along the trend line indicates the number of components to retain. (In geology, "scree" is rubble at the base of a cliff; the markers along the linear trend represent the rubble that can be discarded.) For the example data, the markers for components 4–7 are linear, so components 1–4 would be kept. This rule (and the scree plot) was proposed by Cattell (1966) and revised by Cattell and Jaspers (1967).

How does the broken-stick model choose components?

D. A. Jackson (1993) says that the broken-stick method is one of the better methods for choosing the number of PCs. The method provides "a good combination of simplicity of calculation and accurate evaluation of dimensionality relative to the other statistical approaches" (p. 2212). The broken-stick model retains components that explain more variance than would be expected by randomly dividing the variance into p parts. As I discussed last week, if you randomly divide a quantity into p parts, the expected proportion of the kth largest piece is (1/p)Σ(1/i) where the summation is over the values i=k..p. For example, if p=7 then
E1 = (1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6 + 1/7) / 7 = 0.37,
E2 = (1/2 + 1/3 + 1/4 + 1/5 + 1/6 + 1/7) / 7 = 0.228,
E3 = (1/3 + 1/4 + 1/5 + 1/6 + 1/7) / 7 = 0.156, and so forth.

I think of the "expected proportions" as corresponding to a null model that contains uncorrelated (noise) variables. If you plot the eigenvalues of the correlation matrix against the broken-stick proportions, the observed proportions that are higher than the expected proportions indicate of how many principal component to keep.

The broken-stick model for retaining components

Broken-stick method for retaining principal components

The plot to the right shows the scree plot overlaid on a dashed curve that indicates the expected proportions that result from a broken-stick model. An application of the broken-stick model keeps one PC because only the first observed proportion of variance is higher than the corresponding broken-stick proportion.

How can you compute the points on the dashed curve? The expected proportions in the broken-stick model for p variables are proportional to the cumulative sums of the sequence of ratios {1/p, 1/(p-1), ..., 1}. You can use the CUSUM function in SAS/IML to compute a cumulative sum of a sequence, as shown below. Notice that the previous call to PROC PRINCOMP used the ODS OUTPUT statement to create a SAS data set that contains the values in the Eigenvalue table. The SAS/IML program reads in that data and compares the expected proportions to the observed proportions. The number of components to retain is computed as the largest integer k for which the first k components each explain more variance than the broken-stick model (null model).

proc iml;
use Evals;  read all var {"Number" "Proportion"};  close;
 
/* Broken Stick (Jolliffe 1986; J. E. Jackson, p. 47) */
/* For random p-1 points in [0,1], expected lengths of the p subintervals are: */
p = nrow(Proportion);
g = cusum(1 / T(p:1)) / p;   /* expected lengths of intervals (smallest to largest) */
ExpectedLen = g[p:1];        /* reverse order: largest to smallest */
 
keep = 0;                    /* find first k for which ExpectedLen[i] < Proportion[i] if i<=k */
do i = 1 to p while(ExpectedLen[i] < Proportion[i]);
   keep = i;
end;
print Proportion ExpectedLen keep;
Broken-stick rule for retaining principal components

As seen in the graph, only the first component is retained under the broken-stick model.

Average of eigenvalues

The average-eigenvalue test (Kaiser-Guttman test) retains the eigenvalues that exceed the average eigenvalue. For a p x p correlation matrix, the sum of the eigenvalues is p, so the average value of the eigenvalues is 1. To account for sampling variability, Jolliffe (1972) suggested a more liberal criterion: retain eigenvalues greater than 0.7 times the average eigenvalue. These two suggestions are implemented below:

/* Average Root (Kaiser 1960; Guttman 1954; J. E. Jackson, p. 47) */
mean = mean(Proportion);
keepAvg = loc( Proportion >= mean )[<>];
 
/* Scaled Average Root (Jolliffe 1972; J. E. Jackson, p. 47-48) */
keepScaled = loc( Proportion >= 0.7*mean )[<>];
print keepAvg keepScaled;
Average eigenvalue rule for retaining principal components

Create the broken-stick graph

For completeness, the following statements write the broken-stick proportions to a SAS data set and call PROC SGPLOT to overlay the proportion of variance for the observed data on the broken-stick model:

/* write expected proportions for broken-stick model */
create S var {"Number" "Proportion" "ExpectedLen"}; append; close;
quit;
 
title "Broken Stick Method for Retaining Principal Components";
proc sgplot data=S;
label ExpectedLen = "Broken-Stick Rule"  Proportion = "Proportion of Variance"
      Number = "Number of Components";
   series x=Number y=ExpectedLen / lineattrs=(pattern=dash);
   series x=Number y=Proportion / lineattrs=(thickness=2);
   keylegend / location=inside position=topright across=1;
   xaxis grid;
   yaxis label = "Proportion of Variance";
run;

Summary

Sometimes people ask why PROC PRINCOMP doesn't automatically choose the "correct" number of PCs to use for dimension reduction. This article describes four popular heuristic rules, all which give different answers! The rules in this article are the scree test (2 or 4 components), the broken-stick rule (1 component), the average eigenvalue rule (2 components), and the scaled eigenvalue rule (3 components).

So how should a practicing statistician decide how many PCs to retain? First, remember that these guidelines do not tell you how many components to keep, they merely make suggestions. Second, recognize that any reduction of dimension requires a trade-off between accuracy (high dimensions) and interpretability (low dimensions). Third, these rules—although helpful—cannot replace domain-specific knowledge of the data. Try each suggestion and see if the resulting model contains the features in the data that are important for your analysis.

The post Dimension reduction: Guidelines for retaining principal components appeared first on The DO Loop.

July 19, 2017
 

Skewness is a measure of the asymmetry of a univariate distribution. I have previously shown how to compute the skewness for data distributions in SAS. The previous article computes Pearson's definition of skewness, which is based on the standardized third central moment of the data.

Moment-based statistics are sensitive to extreme outliers. A single extreme observation can radically change the mean, standard deviation, and skewness of data. It is not surprising, therefore, that there are alternative definitions of skewness. One robust definition of skewness that is intuitive and easy to compute is a quantile definition, which is also known as the Bowley skewness or Galton skewness.

A quantile definition of skewness

The quantile definition of skewness uses Q1 (the lower quartile value), Q2 (the median value), and Q3 (the upper quartile value). You can measure skewness as the difference between the lengths of the upper quartile (Q3-Q2) and the lower quartile (Q2-Q1), normalized by the length of the interquartile range (Q3-Q1). In symbols, the quantile skewness γQ is

γQ = [ (Q3 - Q2) - (Q2 - Q1) ] / (Q3 - Q1) = (Q3 - 2*Q2 + Q1) / (Q3 - Q1)

You can visualize this definition by using a figure that shows the relevant lengths used to define the quantile skewness (Bowley skewness). For a symmetric distribution, the quantile skewness is 0 because the length Q3-Q2 is equal to the length Q2-Q1. If the right length (Q3-Q2) is larger than the left length (Q2-Q1), then the quantile skewness is positive. If the left length is larger, then the quantile skewness is negative. For the extreme cases when Q1=Q2 or Q2=Q3, the quantile skewness is ±1. Consequently, whereas the Pearson skewness can be any real value, the quantile skewness is bounded in the interval [-1, 1]. The quantile skewness is not defined if Q1=Q3, just as the Pearson skewness is not defined when the variance of the data is 0.

There is an intuitive interpretation for the quantile skewness formula. Recall that the relative difference between two quantities R and L can be defined as their difference divided by their average value. In symbols, RelDiff = (R - L) / ((R+L)/2). If you choose R to be the length Q3-Q2 and L to be the length Q2-Q1, then quantile skewness is half the relative difference between the lengths.

Compute the quantile skewness in SAS

It is instructive to simulate some skewed data and compute the two measures of skewness. The following SAS/IML statements simulate 1000 observations from a Gamma(a=4) distribution. The Pearson skewness of a Gamma(a) distribution is 2/sqrt(a), so the Pearson skewness for a Gamma(4) distribution is 1. For a large sample, the sample skewness should be close to the theoretical value. The QNTL call computes the quantiles of a sample.

/* compute the quantile skewness for data */
proc iml;
call randseed(12345);
x = j(1000, 1);
call randgen(x, "Gamma", 4);
 
skewPearson = skewness(x);           /* Pearson skewness */
call qntl(q, x, {0.25 0.5 0.75});    /* sample quartiles */
skewQuantile = (q[3] -2*q[2] + q[1]) / (q[3] - q[1]);
print skewPearson skewQuantile;
The Pearson and Bowley skewness statistics for skewed data

For this sample, the Pearson skewness is 1.03 and the quantile skewness is 0.174. If you generate a different random sample from the same Gamma(4) distribution, the statistics will change slightly.

Relationship between quantile skewness and Pearson skewness

In general, there is no simple relationship between quantile skewness and Pearson skewness for a data distribution. (This is not surprising: there is also no simple relationship between a median and a mean, nor between the interquartile range and the standard deviation.) Nevertheless, it is interesting to compare the Pearson skewness to the quantile skewness for a particular probability distribution.

For many probability distributions, the Pearson skewness is a function of the parameters of the distribution. To compute the quantile skewness for a probability distribution, you can use the quantiles for the distribution. The following SAS/IML statements compute the skewness for the Gamma(a) distribution for varying values of a.

/* For Gamma(a), the Pearson skewness is skewP = 2 / sqrt(a).  
   Use the QUANTILE function to compute the quantile skewness for the distribution. */
skewP = do(0.02, 10, 0.02);                  /* Pearson skewness for distribution */
a = 4 / skewP##2;        /* invert skewness formula for the Gamma(a) distribution */
skewQ = j(1, ncol(skewP));                   /* allocate vector for results       */
do i = 1 to ncol(skewP);
   Q1 = quantile("Gamma", 0.25, a[i]);
   Q2 = quantile("Gamma", 0.50, a[i]);
   Q3 = quantile("Gamma", 0.75, a[i]);
   skewQ[i] = (Q3 -2*Q2 + Q1) / (Q3 - Q1);  /* quantile skewness for distribution */
end;
 
title "Pearson vs. Quantile Skewness";
title2 "Gamma(a) Distributions";
call series(skewP, skewQ) grid={x y} label={"Pearson Skewness" "Quantile Skewness"};
Pearson skewness versus quantile skewness for the Gamma distribution

The graph shows a nonlinear relationship between the two skewness measures. This graph is for the Gamma distribution; other distributions would have a different shape. If a distribution has a parameter value for which the distribution is symmetric, then the graph will go through the point (0,0). For highly skewed distributions, the quantile skewness will approach ±1 as the Pearson skewness approaches ±∞.

Alternative quantile definitions

Several researchers have noted that there is nothing special about using the first and third quartiles to measure skewness. An alternative formula (sometimes called Kelly's coefficient of skewness) is to use deciles: γKelly = ((P90 - P50) - (P50 - P10)) / (P90 - P10). Hinkley (1975) considered the q_th and (1-q)_th quantiles for arbitrary values of q.
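
A minimal sketch that computes Kelly's coefficient for the simulated Gamma(4) data is shown below; because it uses the same seed as the earlier program, the statements regenerate the same sample:

proc iml;
call randseed(12345);
x = j(1000, 1);
call randgen(x, "Gamma", 4);                         /* same simulated data as above    */
call qntl(p, x, {0.10 0.50 0.90});                   /* 10th, 50th, and 90th percentiles */
skewKelly = (p[3] - 2*p[2] + p[1]) / (p[3] - p[1]);  /* Kelly's coefficient of skewness  */
print skewKelly;
quit;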

Conclusions

The quantile definition of skewness is easy to compute. In fact, you can compute the statistic by hand without a calculator for small data sets. Consequently, the quantile definition provides an easy way to quickly estimate the skewness of data. Since the definition uses only quantiles, the quantile skewness is robust to extreme outliers.

At the same time, the Bowley-Galton quantile definition has several disadvantages. It uses only the central 50% of the data to estimate the skewness. Two different data sets that have the same quartile statistics will have the same quantile skewness, regardless of the shape of the tails of the distribution. And, as mentioned previously, the use of the 25th and 75th percentiles are somewhat arbitrary.

Although the Pearson skewness is widely used in the statistical community, it is worth mentioning that the quantile definition is ideal for use with a box-and-whisker plot. The Q1, Q2, and Q3 quartiles are part of every box plot. Therefore you can visually estimate the quantile skewness as the relative difference between the lengths of the upper and lower boxes.

The post A quantile definition for skewness appeared first on The DO Loop.

February 8, 2017
 

On discussion forums, I often see questions that ask how to Winsorize variables in SAS. For example, here are some typical questions from the SAS Support Community:

  • I want an efficient way of replacing (upper) extreme values with (95th) percentile. I have a data set with around 600 variables and want to get rid of extreme values of all 600 variables with 95th percentile.
  • I have several (hundreds of) variables that I need to “Winsorize” at the 95% and 5%. I want all the observations with values greater 95th percentile to take the value of the 95th percentile, and all observations with values less than the 5th percentile to take the value of the 5th percentile.

It is clear from the questions that the programmer wants to modify the extreme values of dozens or hundreds of variables. As we will soon learn, neither of these requests satisfy the standard definition of Winsorization. What is Winsorization of data? What are the pitfalls and what are alternative methods?


What is Winsorization?

The process of replacing a specified number of extreme values with a smaller data value has become known as Winsorization or as Winsorizing the data. Let's start by defining Winsorization.

Winsorization began as a way to "robustify" the sample mean, which is sensitive to extreme values. To obtain the Winsorized mean, you sort the data and replace the smallest k values by the (k+1)st smallest value. You do the same for the largest values, replacing the k largest values with the (k+1)st largest value. The mean of this new set of numbers is called the Winsorized mean. If the data are from a symmetric population, the Winsorized mean is a robust unbiased estimate of the population mean.
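
In SAS, a Winsorized mean is available directly from PROC UNIVARIATE (as mentioned again in the summary). The following sketch uses a hypothetical data set and variable and requests the Winsorized mean that modifies the five smallest and five largest values:

proc univariate data=MyData winsorized=5;   /* Winsorize 5 observations in each tail */
   var x;
run;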

The graph to the right provides a visual comparison. The top graph shows the distribution of the original data set. The bottom graph shows the distribution of Winsorized data for which the five smallest and five largest values have been modified. The extreme values were not deleted but were replaced by the sixth smallest or largest data value.

I consulted the Encyclopedia of Statistical Sciences (Kotz et al. (Eds), 2nd Ed, 2006), which has an article "Trimming and Winsorization" by David Ruppert (Vol 14, p. 8765). According to the article:

  • Winsorization is symmetric: Some people want to modify only the large data values. However, Winsorization is a symmetric process that replaces the k smallest and the k largest data values.
  • Winsorization is based on counts: Some people want to modify values based on quantiles, such as the 5th and 95th percentiles. However, using quantiles might not lead to a symmetric process. Let k1 be the number of values less than the 5th percentile and let k2 be the number of values greater than the 95th percentile. If the data contain repeated values, then k1 might not equal k2, which means that you are potentially changing more values in one tail than in the other.

As shown by the quotes at the top of this article, posts on discussion forums sometimes muddle the definition of Winsorization. If you modify the data in an unsymmetric fashion, you will produce biased statistics.

Winsorization: The good

Why do some people want to Winsorize their data? There are a few reasons:

  • Classical statistics such as the mean and standard deviation are sensitive to extreme values. The purpose of Winsorization is to "robustify" classical statistics by reducing the impact of extreme observations.
  • Winsorization is sometimes used in the automated processing of hundreds or thousands of variables when it is impossible for a human to inspect each and every variable.
  • If you compare a Winsorized statistic with its classical counterpart, you can identify variables that might contain contaminated data or are long-tailed and require special handling in models.

Winsorization: The bad

There is no built-in procedure in SAS that Winsorizes variables, but there are some user-defined SAS macros on the internet that claim to Winsorize variables. BE CAREFUL! Some of these macros do not correctly handle missing values. Others use percentiles to determine the extreme values that are modified. If you must Winsorize, I have written a SAS/IML function that Winsorizes data and correctly handles missing values.

As an alternative to Winsorizing your data, SAS software provides many modern robust statistical methods that have advantages over a simple technique like Winsorization, such as the robust regression methods in the ROBUSTREG procedure and the quantile regression methods in the QUANTREG procedure.
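For example, the following call to PROC ROBUSTREG is a minimal sketch (it uses the Sashelp.Class data purely for illustration) of M estimation, which downweights extreme observations instead of modifying the data values:

/* a minimal sketch: M estimation downweights outliers
   rather than altering the data values */
proc robustreg data=sashelp.class method=M;
   model Weight = Height;
run;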

Winsorization: The ugly

If the data contain extreme values, then classical statistics are influenced by those values. However, modifying the data is a draconian measure. Recently I read an article by John Tukey, one of the early investigators of robust estimation. In the article "A survey of sampling from contaminated distributions" (1960), Tukey says (p. 457) that when statisticians encounter a few extreme values in data,

we are likely to think of them as 'strays' [or] 'wild shots' ... and to focus our attention on how normally distributed the rest of the distribution appears to be. One who does this commits two oversights, forgetting Winsor's principle that 'all distributions are normal in the middle,' and forgetting that the distribution relevant to statistical practice is that of the values actually provided and not of the values which ought to have been provided.

A little later in the essay (p. 458), he says

Sets of observations which have been de-tailed by over-vigorous use of a rule for rejecting outliers are inappropriate, since they are not samples.

I love this second quote. All of the nice statistical formulas that are used to make inferences (such as standard errors and confidence intervals) are based on the assumption that the data are a random sample that contains all of the observed values, even extreme values. The tails of a distribution are extremely important, and indiscriminately modifying large and small values invalidates many of the statistical analyses that we take for granted.

Summary

Should you Winsorize data? Tukey argues that indiscriminately modifying data is "inappropriate." In SAS, you can get the Winsorized mean directly from PROC UNIVARIATE. SAS also provides alternative robust methods such as the ones in the ROBUSTREG and QUANTREG procedures.
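For example, the WINSORIZED= option in the PROC UNIVARIATE statement requests a table of Winsorized means without modifying the data. The following call is a minimal sketch that uses the Sashelp.Class data; the value 2 requests that the two smallest and the two largest values be Winsorized:

/* request the Winsorized mean, which treats the 2 smallest and
   2 largest values of Weight as if they were the 3rd smallest/largest */
proc univariate data=sashelp.class winsorized=2;
   var Weight;
   ods select WinsorizedMeans;
run;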

If you decide to use Winsorization to modify your data, remember that the standard definition calls for the symmetric replacement of the k smallest (largest) values of a variable with the (k+1)st smallest (largest). If you download a program from the internet, be aware that some programs use quantiles and others do not handle missing values correctly.

What are your thoughts about Winsorizing data? Share them in the comments.

tags: Statistical Programming, Statistical Thinking

The post Winsorization: The good, the bad, and the ugly appeared first on The DO Loop.

11月 282016
 

In the classic textbook by Johnson and Wichern (Applied Multivariate Statistical Analysis, Third Edition, 1992, p. 164), it says:

All measures of goodness-of-fit suffer the same serious drawback. When the sample size is small, only the most aberrant behaviors will be identified as lack of fit. On the other hand, very large samples invariably produce statistically significant lack of fit. Yet the departure from the specified distributions may be very small and technically unimportant to the inferential conclusions.

In short, goodness-of-fit (GOF) tests are not very informative when the sample size is very small or very large.

I thought it would be useful to create simulated data that demonstrate the statements by Johnson and Wichern. Obviously I can't show "all measures of goodness-of-fit," so this article uses tests for normality. You can construct similar examples for other GOF tests.


Small data: Only "aberrant behaviors" are rejected

As I showed last week, the distribution of a small sample can look quite different from the population distribution. A GOF test must avoid falsely rejecting the bulk of these samples, so the test necessarily rejects "only the most aberrant behaviors."

To demonstrate how GOF tests work with small samples, let's generate four samples of size N=25 from the following populations:

  • A normal N(4,2) distribution. The population mean and standard deviation are 4 and 2, respectively.
  • A gamma(4) distribution. The population mean and standard deviation are 4 and 2, respectively.
  • A shifted exponential(2) distribution. The population mean and standard deviation are 4 and 2, respectively.
  • A lognormal(1.25, 0.5) distribution. The population mean and standard deviation are 3.96 and 2.11, respectively.

The following SAS DATA step creates the four samples. The Distribution variable identifies the observations in each sample. You can use the SGPANEL procedure to visualize the sample distributions and overlay a normal density estimate, as follows:

data Rand;
call streaminit(1234);
N = 25;
do i = 1 to N;
   Distribution = "Normal     ";
   x = rand("Normal", 4, 2);   output;
   Distribution = "Gamma      ";
   x = rand("Gamma", 4);       output;
   Distribution = "Exponential";
   x = 2 + rand("Expo", 2);    output;
   Distribution = "Lognormal  ";
   x = rand("Lognormal", 1.25, 0.5); output;
end;
run;
 
proc sgpanel data=Rand;
  panelby Distribution / rows=4 layout=rowlattice onepanel novarname;
  histogram x;
  density x / type=normal;
run;
A panel of histograms for small samples from the normal, lognormal, gamma, and exponential distributions

We know that three of the four distributions are not normal, but what will the goodness-of-fit tests say? The NORMAL option in PROC UNIVARIATE computes four tests for normality for each sample. The following statements run the tests:

ods select TestsForNormality;
proc univariate data=Rand normal;
   class Distribution;
   var x;
run;

The results (not shown) are that the exponential sample is rejected by the tests for normality (at the α=0.05 level), but the other samples are not. The samples are too small for the tests to rule out the possibility that the gamma and lognormal samples might actually be normal. This makes sense: the distributions of these samples do not appear to be very different from some of the normal samples in my previous blog post.
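If you are screening many variables, you might prefer to have the p-values in a data set instead of in printed tables. One way to do that is with the ODS OUTPUT statement, as in the following minimal sketch (the data set name NormTests is arbitrary):

/* capture the normality tests in a data set instead of printing them */
ods exclude all;                          /* suppress the displayed output */
proc univariate data=Rand normal;
   class Distribution;
   var x;
   ods output TestsForNormality=NormTests;
run;
ods exclude none;
 
proc print data=NormTests noobs; run;     /* one row per test per sample */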

Large samples and small deviations from fit

As Johnson and Wichern said, a large sample might appear to be normal, but it might contain small deviations that cause a goodness-of-fit test to reject the hypothesis of normality. Maybe the tails are a little too fat. Perhaps there are too many or too few outliers. Maybe the values are rounded. For large samples, a GOF test has the power to detect these small deviations and therefore reject the hypothesis of normality.

I will demonstrate how rounded values can make a GOF test reject an otherwise normal sample. The following DATA step creates a random sample of size N=5000. The X variable is normally distributed; the R variable is identical to X except values are rounded to the nearest tenth.

data RandBig;
call streaminit(1234);
N = 5000;
do i = 1 to N;
   x = rand("Normal", 4, 2);
   r = round(x, 0.1);        /* same, but round to nearest 0.1 */
   output;
end;
run;
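As a quick check (a minimal sketch), you can use PROC MEANS to compare the moments of the two variables before you run the formal tests:

/* compare descriptive statistics for the original and rounded variables */
proc means data=RandBig n mean std skew kurt maxdec=3;
   var x r;
run;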

There is little difference between the X and R variables. The means and standard deviations are almost the same. The skewness and kurtosis are almost the same. Histograms of the variables look identical. Yet because the sample size is 5000, the GOF tests reject the hypothesis of normality for the R variable at the α=0.05 significance level. The following call to PROC UNIVARIATE computes the analysis for both X and R:

ods select Moments Histogram TestsForNormality;
proc univariate data=RandBig normal;
   var x r;
   histogram x r / Normal;
run;

Partial results are shown. The first table is for the variable X. The goodness-of-fit tests fail to reject the null hypothesis, so the tests (correctly) find no evidence that the X variable deviates from normality. The second table is for the variable R. The GOF tests reject the hypothesis that R is normally distributed, merely because the values in R are rounded to the nearest 0.1 unit.

Rounded values occur frequently in practice, so you could argue that the variables R and X are not substantially different, yet normality is rejected for one variable but not for the other.

And it is not just rounding that can cause a GOF test to reject normality. Other small and seemingly innocuous deviations from normality would be similarly detected.

In conclusion, be aware of the cautionary words of Johnson and Wichern. For small samples, goodness-of-fit tests do not reject a sample unless it exhibits "aberrant behaviors." For very large samples, the GOF tests "invariably produce statistically significant lack of fit," regardless of whether the deviations from the target distributions are practically important.

tags: Simulation, Statistical Thinking

The post Goodness-of-fit tests: A cautionary tale for large and small samples appeared first on The DO Loop.

11月 232016
 

Somewhere in my past I encountered a panel of histograms for small random samples of normal data. I can't remember the source, but it might have been from John Tukey or William Cleveland. The point of the panel was to emphasize that (because of sampling variation) a small random sample might have a distribution that looks quite different from the distribution of the population. The diversity of shapes in the panel was surprising, and it made a big impact on me. About half the histograms exhibited shapes that looked nothing like the familiar bell shape in textbooks.


In this article I recreate the panel. In the following SAS DATA step I create 20 samples, each of size N. I think the original panel showed samples of size N=15 or N=20. I've used N=15, but you can change the value of the macro variable to explore other sample sizes. If you change the random number seed to 0 and rerun the program, you will get a different panel every time. The SGPANEL procedure creates a 4 x 5 panel of the resulting histograms.

%let N = 15;
data Normal;
call streaminit(93779);
do ID = 1 to 20;
   do i = 1 to &N;
      x = rand("Normal");   output;
   end;
end;
run;
 
title "Random Normal Samples of Size &N";
proc sgpanel data=Normal;
   panelby ID /  rows=4 columns=5 onepanel;
   histogram x;
run;
Panel of histograms for random normal samples (N=15). Due to sampling variation, some histograms are not bell-shaped.

Each sample is drawn from the standard normal distribution, but the panel of histograms reveals a diversity of shapes. About half of the ID values display the typical histogram for normal data: a peak near x=0 and a range of [-3, 3]. However, the other ID values look less typical. The histograms for ID=1, 19, and 20 seem to have fewer negative values than you might expect. The distribution is very flat (almost uniform) for ID=3, 9, 13, and 16.

Because histograms are created by binning the data, they are not always the best way to visualize a sample distribution. You can create normal quantile-quantile (Q-Q) plots to compare the empirical quantiles for the simulated data to the theoretical quantiles for normally distributed data. The following statements use PROC RANK to create the normal Q-Q plots, as explained in a previous article about how to create Q-Q plots in SAS:

proc rank data=Normal normal=blom out=QQ;
   by ID;
   var x;
   ranks Quantile;
run;
 
title "Q-Q Plots for Random Normal Samples of Size &N";
proc sgpanel data=QQ noautolegend;
   panelby ID / rows=4 columns=5 onepanel;
   scatter x=Quantile y=x;
   lineparm x=0 y=0 slope=1;
   colaxis label="Normal Quantiles";
run;
Panel of quantile-quantile plots for random normal samples (N=15), which shows the sampling variation of samples

The Q-Q plots show that the sample distributions are well-modeled by a standard normal distribution, although the deviation in the lower tail is apparent for ID=1, 19, and 20. This panel shows why it is important to use Q-Q plots to investigate the distribution of samples: the bins used to create histograms can make us see shapes that are not really there. The SAS documentation includes a section on how to interpret Q-Q plots.

In conclusion, small random normal samples can display a diversity of shapes. Statisticians understand this sampling variation and routinely caution that "the standard errors are large" for statistics that are computed on a small sample. However, viewing a panel of histograms makes the concept of sampling variation more real and less abstract.

tags: Simulation, Statistical Thinking

The post Sampling variation in small random samples appeared first on The DO Loop.