
July 19, 2017
 

Skewness is a measure of the asymmetry of a univariate distribution. I have previously shown how to compute the skewness for data distributions in SAS. The previous article computes Pearson's definition of skewness, which is based on the standardized third central moment of the data.

Moment-based statistics are sensitive to extreme outliers. A single extreme observation can radically change the mean, standard deviation, and skewness of data. It is not surprising, therefore, that there are alternative definitions of skewness. One robust definition of skewness that is intuitive and easy to compute is a quantile definition, which is also known as the Bowley skewness or Galton skewness.

A quantile definition of skewness

The quantile definition of skewness uses Q1 (the lower quartile value), Q2 (the median value), and Q3 (the upper quartile value). You can measure skewness as the difference between the lengths of the upper quartile (Q3-Q2) and the lower quartile (Q2-Q1), normalized by the length of the interquartile range (Q3-Q1). In symbols, the quantile skewness γQ is

γQ = [ (Q3 - Q2) - (Q2 - Q1) ] / (Q3 - Q1) = (Q3 - 2*Q2 + Q1) / (Q3 - Q1)

You can visualize this definition by using the figure to the right, which shows the relevant lengths. For a symmetric distribution, the quantile skewness is 0 because the length Q3-Q2 is equal to the length Q2-Q1. If the right length (Q3-Q2) is larger than the left length (Q2-Q1), then the quantile skewness is positive. If the left length is larger, then the quantile skewness is negative. For the extreme cases when Q1=Q2 or Q2=Q3, the quantile skewness is ±1. Consequently, whereas the Pearson skewness can be any real value, the quantile skewness is bounded in the interval [-1, 1]. The quantile skewness is not defined if Q1=Q3, just as the Pearson skewness is not defined when the variance of the data is 0.

There is an intuitive interpretation for the quantile skewness formula. Recall that the relative difference between two quantities R and L can be defined as their difference divided by their average value. In symbols, RelDiff = (R - L) / ((R+L)/2). If you choose R to be the length Q3-Q2 and L to be the length Q2-Q1, then quantile skewness is half the relative difference between the lengths.
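
To see why, substitute R = Q3 - Q2 and L = Q2 - Q1 into the definition of the relative difference:
RelDiff = (R - L) / ((R+L)/2) = 2 [ (Q3 - Q2) - (Q2 - Q1) ] / (Q3 - Q1) = 2 γQ
so the quantile skewness is γQ = RelDiff / 2.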

Compute the quantile skewness in SAS

It is instructive to simulate some skewed data and compute the two measures of skewness. The following SAS/IML statements simulate 1000 observations from a Gamma(a=4) distribution. The Pearson skewness of a Gamma(a) distribution is 2/sqrt(a), so the Pearson skewness for a Gamma(4) distribution is 1. For a large sample, the sample skewness should be close to the theoretical value. The QNTL call computes the quantiles of a sample.

/* compute the quantile skewness for data */
proc iml;
call randseed(12345);
x = j(1000, 1);
call randgen(x, "Gamma", 4);
 
skewPearson = skewness(x);           /* Pearson skewness */
call qntl(q, x, {0.25 0.5 0.75});    /* sample quartiles */
skewQuantile = (q[3] -2*q[2] + q[1]) / (q[3] - q[1]);
print skewPearson skewQuantile;
The Pearson and Bowley skewness statistics for skewed data

For this sample, the Pearson skewness is 1.03 and the quantile skewness is 0.174. If you generate a different random sample from the same Gamma(4) distribution, the statistics will change slightly.

Relationship between quantile skewness and Pearson skewness

In general, there is no simple relationship between quantile skewness and Pearson skewness for a data distribution. (This is not surprising: there is also no simple relationship between a median and a mean, nor between the interquartile range and the standard deviation.) Nevertheless, it is interesting to compare the Pearson skewness to the quantile skewness for a particular probability distribution.

For many probability distributions, the Pearson skewness is a function of the parameters of the distribution. To compute the quantile skewness for a probability distribution, you can use the quantiles for the distribution. The following SAS/IML statements compute the skewness for the Gamma(a) distribution for varying values of a.

/* For Gamma(a), the Pearson skewness is skewP = 2 / sqrt(a).  
   Use the QUANTILE function to compute the quantile skewness for the distribution. */
skewP = do(0.02, 10, 0.02);                  /* Pearson skewness for distribution */
a = 4 / skewP##2;        /* invert skewness formula for the Gamma(a) distribution */
skewQ = j(1, ncol(skewP));                   /* allocate vector for results       */
do i = 1 to ncol(skewP);
   Q1 = quantile("Gamma", 0.25, a[i]);
   Q2 = quantile("Gamma", 0.50, a[i]);
   Q3 = quantile("Gamma", 0.75, a[i]);
   skewQ[i] = (Q3 -2*Q2 + Q1) / (Q3 - Q1);  /* quantile skewness for distribution */
end;
 
title "Pearson vs. Quantile Skewness";
title2 "Gamma(a) Distributions";
call series(skewP, skewQ) grid={x y} label={"Pearson Skewness" "Quantile Skewness"};
Pearson skewness versus quantile skewness for the Gamma distribution

The graph shows a nonlinear relationship between the two skewness measures. This graph is for the Gamma distribution; other distributions would have a different shape. If a distribution has a parameter value for which the distribution is symmetric, then the graph will go through the point (0,0). For highly skewed distributions, the quantile skewness will approach ±1 as the Pearson skewness approaches ±∞.

Alternative quantile definitions

Several researchers have noted that there is nothing special about using the first and third quartiles to measure skewness. An alternative formula (sometimes called Kelly's coefficient of skewness) uses deciles: γKelly = ((P90 - P50) - (P50 - P10)) / (P90 - P10). Hinkley (1975) considered the qth and (1-q)th quantiles for arbitrary values of q.
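
For example, if the PROC IML session from the Gamma(4) example earlier in this article is still active (so that the data vector x exists), a minimal sketch of Kelly's coefficient requires only a different set of percentiles:

/* Sketch: Kelly's coefficient of skewness for the simulated Gamma(4) data.
   Assumes the vector x from the earlier PROC IML session is still defined. */
call qntl(pctl, x, {0.1 0.5 0.9});     /* 10th, 50th, and 90th percentiles */
skewKelly = (pctl[3] - 2*pctl[2] + pctl[1]) / (pctl[3] - pctl[1]);
print skewKelly;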

Conclusions

The quantile definition of skewness is easy to compute. In fact, you can compute the statistic by hand without a calculator for small data sets. Consequently, the quantile definition provides an easy way to quickly estimate the skewness of data. Since the definition uses only quantiles, the quantile skewness is robust to extreme outliers.

At the same time, the Bowley-Galton quantile definition has several disadvantages. It uses only the central 50% of the data to estimate the skewness. Two different data sets that have the same quartile statistics will have the same quantile skewness, regardless of the shape of the tails of the distribution. And, as mentioned previously, the use of the 25th and 75th percentiles is somewhat arbitrary.

Although the Pearson skewness is widely used in the statistical community, it is worth mentioning that the quantile definition is ideal for use with a box-and-whisker plot. The Q1, Q2, and Q3 quartiles are part of every box plot. Therefore you can visually estimate the quantile skewness as the relative difference between the lengths of the upper and lower boxes.

The post A quantile definition for skewness appeared first on The DO Loop.

July 17, 2017
 

An important problem in machine learning is the "classification problem." In this supervised learning problem, you build a statistical model that predicts a set of categorical outcomes (responses) based on a set of input features (explanatory variables). You do this by training the model on data for which the outcomes are known. For example, researchers might want to predict the outcomes "Lived" or "Died" for patients with a certain disease. They can use data from a clinical trial to build a statistical model that uses demographic and medical measurements to predict the probability of each outcome.

Prediction regions for the binary classification problem. Graph created in SAS.

SAS software provides several procedures for building parametric classification models, including the LOGISTIC and DISCRIM procedures. SAS also provides various nonparametric models, such as spline effects, additive models, and neural networks.

For each input, the statistical model predicts an outcome. Thus the model divides the input space into disjoint regions for which the first outcome is the most probable, for which the second outcome is the most probable, and so forth. In many textbooks and papers, the classification problem is illustrated by using a two-dimensional graph that shows the prediction regions overlaid with the training data, as shown in the adjacent image, which visualizes a binary outcome and a linear boundary between regions.

This article shows three ways to visualize prediction regions in SAS:

  1. The polygon method: A parametric model provides a formula for the boundary between regions. You can use the formula to construct polygonal regions.
  2. The contour plot method: If there are two outcomes and the model provides probabilities for the first outcome, then the 0.5 contour divides the feature space into disjoint prediction regions.
  3. The background grid method: You can evaluate the model on a grid of points and color each point according to the predicted outcome. You can use small markers to produce a faint indication of the prediction regions, or you can use large markers if you want to tile the graph with color.

This article uses logistic regression to discriminate between two outcomes, but the principles apply to other methods as well. The SAS documentation for the DISCRIM procedure contains some macros that visualize the prediction regions for the output from PROC DISCRIM.

A logistic model to discriminate two outcomes

To illustrate the classification problem, consider some simulated data in which the Y variable is a binary outcome and the X1 and X2 variables are continuous explanatory variables. The following call to PROC LOGISTIC fits a logistic model and displays the parameter estimates. The STORE statement creates an item store that enables you to evaluate (score) the model on future observations. The DATA step creates a grid of evenly spaced points in the (x1, x2) coordinates, and the call to PROC PLM scores the model at those locations. In the Pred data set, GX and GY are the coordinates on the regular grid and PREDICTED is the probability that Y=1.
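
(The LogisticData set is created by the program that you can download at the end of this article. If you want to run the following steps without that program, here is a minimal sketch that simulates similar data; the sample size, the distributions of x1 and x2, and the true coefficients are assumptions for illustration only.)

/* Sketch: simulate data that resemble LogisticData. The sample size,
   distributions, and true coefficients are assumptions for illustration. */
data LogisticData;
call streaminit(54321);
do i = 1 to 500;
   x1 = rand("Uniform");                  /* explanatory variable in [0,1]      */
   x2 = rand("Normal", 0, 3);             /* explanatory variable, roughly ±7.5 */
   eta = 2.4 - 4.8*x1 + 0.8*x2;           /* linear predictor                   */
   y = rand("Bernoulli", logistic(eta));  /* binary outcome                     */
   output;
end;
drop i eta;
run;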

proc logistic data=LogisticData;
   model y(Event='1') = x1 x2;          
   store work.LogiModel;                /* save model to item store */
run;
 
data Grid;                              /* create grid in (x1,x2) coords */
do x1 = 0 to 1 by 0.02;
   do x2 = -7.5 to 7.5 by 0.3;
      output;
   end;
end;
run;
 
proc plm restore=work.LogiModel;        /* use PROC PLM to score model on a grid */
   score data=Grid out=Pred(rename=(x1=gx x2=gy)) / ilink;  /* evaluate the model on new data */
run;

The polygon method

Parameter estimates for the logistic model

This method is only useful for simple parametric models. Recall that the logistic function is 0.5 when its argument is zero, so the level set for 0 of the linear predictor divides the input space into prediction regions. For the parameter estimates shown to the right, the level set {(x1,x2) | 2.3565 -4.7618*x1 + 0.7959*x2 = 0} is the boundary between the two prediction regions. This level set is the graph of the linear function x2 = (-2.3565 + 4.7618*x1)/0.7959. You can compute two polygons that represent the regions: let x1 vary between [0,1] (the horizontal range of the data) and use the formula to evaluate x2, or assign x2 to be the minimum or maximum vertical value of the data.

After you have computed the polygonal regions, you can use the POLYGON statement in PROC SGPLOT to visualize them. The graph is shown at the top of this article. The main drawback of this method is that it requires a parametric model for which one variable is an explicit function of the other. However, it creates a beautiful image!
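
A minimal sketch of the idea is shown below; it uses the parameter estimates shown above and assumes that the vertical data range is roughly [-7.5, 7.5] (the range of the scoring grid). The downloadable program contains the polished version.

/* Sketch: build one polygon below and one above the boundary line
   x2 = (-2.3565 + 4.7618*x1) / 0.7959. The vertical limits (+/-7.5)
   are assumptions based on the range of the scoring grid. */
data Regions;
length PolyID $8;
do PolyID = "Lower", "Upper";
   yEdge = ifn(PolyID="Lower", -7.5, 7.5);
   /* two vertices on the boundary line ... */
   x1 = 0; x2 = (-2.3565 + 4.7618*x1) / 0.7959; output;
   x1 = 1; x2 = (-2.3565 + 4.7618*x1) / 0.7959; output;
   /* ... and two vertices on the top or bottom edge close the polygon */
   x1 = 1; x2 = yEdge; output;
   x1 = 0; x2 = yEdge; output;
end;
drop yEdge;
run;
 
proc sgplot data=Regions noautolegend;
   polygon x=x1 y=x2 id=PolyID / fill group=PolyID transparency=0.5;
   /* to overlay the training data, append LogisticData and add a SCATTER statement */
run;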

The contour plot method

Given an input value, many statistical models produce probabilities for each outcome. If there are only two outcomes, you can plot a contour plot of the probability of the first outcome. The 0.5 contour divides the feature space into disjoint regions.

There are two ways to create such a contour plot. The easiest way is to use the EFFECTPLOT statement, which is supported in many SAS/STAT regression procedures. The following statements show how to use the EFFECTPLOT statement in PROC LOGISTIC to create a contour plot, as shown to the right:

proc logistic data=LogisticData;
   model y(Event='1') = x1 x2;          
   effectplot contour(x=x1 y=x2);       /* 2. contour plot with scatter plot overlay */
run;

Unfortunately, not every SAS procedure supports the EFFECTPLOT statement. An alternative is to score the model on a regular grid of points and use the Graph Template Language (GTL) to create a contour plot of the probability surface. You can read my previous article about how to use the GTL to create a contour plot.
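
For example, a minimal GTL sketch that uses the Pred data set scored earlier might look like the following; the template name is arbitrary:

/* Sketch: contour plot of the predicted probability on the regular grid.
   Uses the Pred data set created by PROC PLM; the template name is arbitrary. */
proc template;
define statgraph ContourSketch;
   begingraph;
      entrytitle "Predicted Probability That Y=1";
      layout overlay;
         contourplotparm x=gx y=gy z=Predicted / contourtype=fill;
      endlayout;
   endgraph;
end;
run;
 
proc sgrender data=Pred template=ContourSketch;
run;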

The drawback of this method is that it only applies to binary outcomes. The advantage is that it is easy to implement, especially if the modeling procedure supports the EFFECTPLOT statement.

The background grid method

Prediction region for a classification problem with two outcomes

In this method, you score the model on a grid of points to obtain the predicted outcome at each grid point. You then create a scatter plot of the grid, where the markers are colored by the outcome, as shown in the graph to the right.

When you create this graph, you get to choose how large to make the dots in the background. The image to the right uses small markers, which is the technique used by Hastie, Tibshirani, and Friedman in their book The Elements of Statistical Learning. If you use square markers and increase the size of the markers, eventually the markers tile the entire background, which makes it look like the polygon plot at the beginning of this article. You might need to adjust the vertical and horizontal pixels of the graph to get the background markers to tile without overlapping each other.
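
A minimal sketch of the background grid method, which reuses the Pred data set and appends the training data so that both can be displayed, is shown below; the marker sizes are arbitrary choices:

/* Sketch: classify each grid point and plot the grid as a colored background.
   The marker sizes are arbitrary; adjust them until the background looks right. */
data Background;
set Pred(in=fromGrid) LogisticData;
if fromGrid then PredClass = (Predicted >= 0.5);   /* predicted outcome at grid point */
run;
 
proc sgplot data=Background noautolegend;
   scatter x=gx y=gy / group=PredClass markerattrs=(symbol=SquareFilled size=4);
   scatter x=x1 y=x2 / group=y markerattrs=(symbol=CircleFilled size=9);  /* training data */
run;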

This method has several advantages. It is the most general method and can be used for any procedure and for any number of outcome categories. It is easy to implement because it merely uses the model to predict the outcomes on a grid of points. The disadvantage is that choosing the size of the background markers is a matter of trial and error; you might need several attempts before you create a graph that looks good.

Summary

This article has shown several techniques for visualizing the predicted outcomes for a model that has two independent variables. The first model is limited to simple parametric models, the second is restricted to binary outcomes, and the third is a general technique that requires scoring the model on a regular grid of inputs. Whichever method you choose, PROC SGPLOT and the Graph Template Language in SAS can help you to visualize different methods for the classification problem in machine learning.

You can download the SAS program that produces the graphs in this article. Which image do you like the best? Do you have a better visualization? Leave a comment!

The post 3 ways to visualize prediction regions for classification problems appeared first on The DO Loop.

July 5, 2017
 

A SAS customer asked how to use SAS to conduct a Z test for the equality of two proportions. He was directed to the SAS Usage Note "Testing the equality of two or more proportions from independent samples." The note says to "specify the CHISQ option in the TABLES statement of PROC FREQ to compute this test," and then adds "this is equivalent to the well-known Z test for comparing two independent proportions."

You might wonder why a chi-square test for association is equivalent to a Z test for the equality of proportions. You might also wonder if there is a direct way to test the equality of proportions. This article implements the well-known test for proportions in the DATA step and compares the results to the chi-square test results. It also shows how to get this test directly from PROC FREQ by using the RISKDIFF option.

A chi-square test for association in SAS

The SAS Usage Note poses the following problem: Suppose you want to compare the proportions responding "Yes" to a question in independent samples of 100 men and 100 women. The number of men responding "Yes" was observed to be 30, and the number of women responding "Yes" was 45.

You can create the data by using the following DATA step, then call PROC FREQ to analyze the association between the response variable and gender.

data Prop;
length Group $12 Response $3;
input Group Response N;
datalines;
Men          Yes  30
Men          No   70
Women        Yes  45
Women        No   55
;
 
proc freq data=Prop order=data;
   weight N;
   tables Group*Response / chisq;
run;
Test for association of two categorical variables in SAS

As explained in the PROC FREQ documentation, the Pearson chi-square statistic indicates an association between the variables in the 2 x 2 table. The results show that the chi-square statistic (for 1 degree of freedom) is 4.8, which corresponds to a p-value of 0.0285. The test indicates that we should reject the null hypothesis of no association at the 0.05 significance level.

As stated in the SAS Usage Note, this association test is equivalent to a Z test for whether the proportion of males who responded "Yes" equals the proportion of females who responded "Yes." The equivalence relies on a fact from probability theory: a chi-square random variable with 1 degree of freedom is the square of a random variable from the standard normal distribution. Thus the square root of the chi-square statistic is the Z statistic (up to a sign) that you get from the test of equality of two proportions. Therefore the Z statistic should be z = ±sqrt(4.8) = ±2.19. The p-value is unchanged.
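
If you want to verify this relationship numerically, a quick DATA step sketch (using the rounded chi-square statistic from the PROC FREQ output) suffices:

/* Sketch: the Z statistic is a square root of the chi-square statistic,
   and the two tests give the same p-value. Uses the rounded value 4.8. */
data _null_;
chi2 = 4.8;                          /* Pearson chi-square statistic (1 DF) */
z    = sqrt(chi2);                   /* |Z| = sqrt(chi-square)              */
pChi = 1 - cdf("ChiSq", chi2, 1);    /* p-value for the chi-square test     */
pZ   = 2*(1 - cdf("Normal", z));     /* two-sided p-value for the Z test    */
put z= pChi= pZ=;
run;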

Z test for the equality of two proportions: A DATA step implementation

For comparison, you can implement the classical Z test by applying the formulas from a textbook or from the course material from Penn State, which includes a section about comparing two proportions. The following DATA step implements the Z test for equality of proportions:

/* Implement the Z test for pre-summarized statistics. Specify the group proportions and sizes. 
   For formulas, see https://onlinecourses.science.psu.edu/stat414/node/268 */
%let alpha = 0.05;
%let N1    = 100;   /* total trials in Group1 */
%let Event1=  30;   /* Number of events in Group1  */
%let N2    = 100;   /* total trials in Group2 */
%let Event2=  45;   /* Number of events in Group2  */
 
%let Side  =   2;   /* use L, U, or 2 for lower, upper, or two-sided test */
title "Test of H0: p1=p2 vs Ha: p1^=p2"; /* change for Side=L or U */
 
data zTestProp;
p1Hat = &Event1 / &N1;                /* observed proportion in Group1 */
var1  = p1Hat*(1-p1Hat) / &N1;        /* variance in Group1 */
p2Hat = &Event2 / &N2;                /* observed proportion in Group2 */
var2  = p2Hat*(1-p2Hat) / &N2;        /* variance in Group2 */
/* use pooled estimate of p for test */
Diff = p1Hat - p2Hat;                 /* estimate of p1 - p2 */
pHat = (&Event1 + &Event2) / (&N1 + &N2);
pVar = pHat*(1-pHat)*(1/&N1 + 1/&N2); /* pooled variance */
SE   = sqrt(pVar);                    /* estimate of standard error */
Z = Diff / SE;    
 
Side = "&Side";
if Side="L" then                      /* one-sided, lower tail */
   pValue = cdf("normal", z);
else if Side="U" then                 /* one-sided, upper tail */
   pValue = sdf("normal", Z);         /* SDF = 1 - CDF */
else if Side="2" then
   pValue = 2*(1-cdf("normal", abs(Z))); /* two-sided */
format pValue PVALUE6.4 Z 7.4;
label pValue="Pr < Z";
drop var1 var2 pHat pVar;
run;
 
proc print data=zTestProp label noobs; run;
Test for equality of two independent proportions in SAS

The DATA step obtains a test statistic of Z = –2.19, which is one of the square roots of the chi-square statistic in the PROC FREQ output. Notice also that the p-value from the DATA step matches the p-value from the PROC FREQ output.

Test equality of proportions by using PROC FREQ

There is actually a direct way to test for the equality of two independent proportions: use the RISKDIFF option in the TABLES statement in PROC FREQ. In the documentation, binomial proportions are called "risks," so a "risk difference" is a difference in proportions. (Also, a "relative risk" (the RELRISK option) measures the ratio of two proportions.) Equality of proportions is equivalent to testing whether the difference of proportions (risks) is zero.

As shown in the documentation, PROC FREQ supports many options for comparing proportions. You can use the following suboptions to reproduce the classical equality of proportions test:

  1. EQUAL requests an equality test for the difference in proportions. By default, the Wald method (METHOD=WALD) is used, but you can choose other methods.
  2. VAR=NULL specifies how to estimate the variance for the Wald interval.
  3. (optional) CL=WALD outputs the Wald confidence interval for the difference.

Combining these options gives the following direct computation of the difference between two proportions:

proc freq data=Prop order=data;
   weight N;
   tables Group*Response / riskdiff(equal var=null cl=wald); /* Wald test for equality */
run;
Test for difference of proportions in SAS

The 95% (Wald) confidence interval is shown in the first table. The confidence interval is centered on the point estimate of the difference (-0.15). The interval does not contain 0, so the difference is significantly different from 0 at the 0.05 significance level.

The second table shows the result of the Wald equality test. The "ASE (H0)" row gives the estimate for the (asymptotic) standard error, assuming the null hypothesis. The Z score and the two-sided p-value match the values from the DATA step computation, and the interpretation is the same.

Summary

In summary, the SAS Usage Note correctly states that the chi-square test of association is equivalent to the Z test for the equality of proportions. To run the Z test explicitly, this article uses the SAS DATA step to implement the test when you have summary statistics. As promised, the Z statistic is one of the square roots of the chi-square statistic and the p-values are the same. The DATA step removes some of the mystery regarding the equivalence between these two tests.

However, writing DATA step code cannot match the convenience of a procedure. For raw or pre-summarized data, you can use the RISKDIFF option in PROC FREQ to run the same test (recast as a difference of proportions or "risks"). To get exactly the same confidence intervals and statistics as the classical test (which is called the Wald test), you need to add a few suboptions. The resulting output matches the DATA step computations.

The post Test for the equality of two proportions in SAS appeared first on The DO Loop.

June 28, 2017
 

Suppose you roll six identical six-sided dice. Chances are that you will see at least one repeated number. The probability that you will see six unique numbers is very small: only 6! / 6^6 ≈ 0.015.

This example can be generalized. If you draw a random sample with replacement from a set of n items, duplicate values occur much more frequently than most people think. This is the source of the famous "birthday matching problem" in probability, which shows that the probability of duplicate birthdays among a small group of people is higher than most people would guess. (Among 23 people, the probability of a duplicate birthday is more than 50%.)
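
If you want to check the birthday figure, a one-statement DATA step computation suffices; this sketch uses the LFACT function to work on the log scale and avoid huge factorials:

/* Sketch: P(at least one shared birthday among 23 people), assuming 365
   equally likely birthdays. The complement is 365*364*...*343 / 365^23. */
data _null_;
n = 23;
pNoMatch = exp( lfact(365) - lfact(365-n) - n*log(365) );
pMatch   = 1 - pNoMatch;       /* approximately 0.507 */
put pMatch=;
run;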

This fact has implications for bootstrap resampling. Recall that if a sample has n observations, then a bootstrap sample is obtained by sampling n times with replacement from the data. Since most bootstrap samples contain a duplicate of at least one observation, it is also true that most samples omit at least one observation. That raises the question: On average, how many of the original observations are not present in an average bootstrap sample?

The average bootstrap sample

I'll give you the answer: an average bootstrap sample contains 63.2% of the original observations and omits 36.8%. The book by Chernick and LaBudde (2011, p. 199) states the following result about bootstrap resamples: "If the sample size is large and we [generate many]  bootstrap samples, we will find that on average, approximately 36.8% of the original observations will be missing from the individual bootstrap samples. Another way to look at this is that for any particular observation, approximately 36.8% of the bootstrap samples will not contain it." (Emphasis added.)

You can use elementary probability to derive this result. Suppose the original data contains n observations. A bootstrap sample is generated by sampling with replacement from the data. The probability that a particular observation is not chosen from a set of n observations is 1 - 1/n, so the probability that the observation is not chosen n times is (1 - 1/n)^n. This is the probability that the observation does not appear in a bootstrap sample.

You might remember from calculus that the limit as n → ∞ of (1 - 1/n)^n is 1/e. Therefore, when n is large, the probability that an observation is not chosen is approximately 1/e ≈ 0.368.
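
The following DATA step sketch shows how quickly (1 - 1/n)^n approaches 1/e as the sample size grows:

/* Sketch: probability that a particular observation is omitted from a
   bootstrap sample, (1 - 1/n)^n, compared with the limiting value 1/e */
data OmitProb;
limit = exp(-1);                    /* 1/e, approximately 0.368 */
do n = 6, 10, 25, 75, 1000;
   pOmit = (1 - 1/n)**n;
   output;
end;
run;
 
proc print data=OmitProb noobs; run;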

Simulation: the proportion of observations in bootstrap samples

If you prefer simulation to calculation, you can simulate many bootstrap samples and count the proportion of samples that do not contain observation 1, observation 2, and so forth. The average of those proportions should be close to 1/e.

The theoretical result is only valid for large values of n, but let's start with n=6 so that we can print out intermediate results during the simulation. The following SAS/IML program uses the SAMPLE function to simulate rolling six dice a total of 10 times. Each column of the matrix M contains the result of one trial:

options linesize=128;
proc iml;
call randseed(54321);
n = 6;
NumSamples = 10;
x = T(1:n);                     /* the original sample {1,2,...,n} */
M = sample(x, NumSamples//n);  /* each column is a draw with replacement */

The table shows that each sample (column) contains duplicates. The first column does not contain the values {4,5,6}. The second column does not contain the value 6.

Given these samples, what proportion of samples does not contain 1? What proportion does not contain 2, and so forth? For this small simulation, we can answer these questions by visual inspection. The number 1 does not appear in the sample S5, so it does not appear in 0.1 of the samples. The number 2 does not appear in S3, S5, S8, S9, or S10, so it does not appear in 0.5 of the samples. The following SAS/IML statements count the proportion of columns that do not contain each number and then take the average of those proportions:

/* how many samples do not contain x[1], x[2], etc */
cnt = j(n, 1, .);     /* allocate space for results */
do i = 1 to n;
   Y = (M=x[i]);      /* binary matrix */
   s = Y[+, ];        /* count for each sample (column) */
   cnt[i] = sum(s=0); /* number of samples that do not contain x[i] */
end;
prop = cnt / NumSamples;
avg_prop = mean(prop);
print avg_prop;

The result says that, on the average, a sample does not contain 0.35 of the data values. Let's increase the sample size and the number of bootstrap samples. Change the parameters to the following and rerun the program:

n = 75;
NumSamples = 10000;

The new estimate is 0.365, which is close to the theoretical value of 1/e. If you like, you can plot a histogram of the n proportions that make up the average:

title "Proportion of Samples That Do Not Contain an Observation";
title2 "n=75; NumSamples=10000";
call histogram(prop) label="Proportion"
         other="refline "+char(avg_prop)+"/axis=x;";
Distribution of the proportion of bootstrap samples that do not contain an observation (n=75)

Elementary statistical theory tells us that the proportions are approximately normally distributed with mean p=1/e and standard deviation sqrt(p(1-p)/NumSamples) ≈ 0.00482. The mean and standard deviation of the simulated proportions are very close to the theoretical values.

In conclusion, when you draw n items with replacement from a large sample of size n, on average the sample contains 63.2% of the original observations and omits 36.8%. In other words, the average bootstrap sample omits 36.8% of the original data.

The post The average bootstrap sample omits 36.8% of the data appeared first on The DO Loop.

June 14, 2017
 

In a previous article, I showed two ways to define a log-likelihood function in SAS. This article shows two ways to compute maximum likelihood estimates (MLEs) in SAS: the nonlinear optimization subroutines in SAS/IML and the NLMIXED procedure in SAS/STAT. To illustrate these methods, I will use the same data sets from my previous post. One data set contains binomial data, the other contains data that are lognormally distributed.

Maximum likelihood estimates for binomial data from SAS/IML

I previously wrote a step-by-step description of how to compute maximum likelihood estimates in SAS/IML. SAS/IML contains many algorithms for nonlinear optimization, including the NLPNRA subroutine, which implements the Newton-Raphson method.

In my previous article I used the LOGPDF function to define the log-likelihood function for the binomial data. The following statements define bounds for the parameter (0 < p < 1) and provide an initial guess of p0=0.5:

/* Before running program, create Binomial and LN data sets from previous post */
 
/* Example 1: MLE for binomial data */
/* Method 1: Use SAS/IML optimization routines */
proc iml;
/* log-likelihood function for binomial data */
start Binom_LL(p) global(x, NTrials);
   LL = sum( logpdf("Binomial", x, p, NTrials) );
   return( LL );
finish;
 
NTrials = 10;    /* number of trials (fixed) */
use Binomial; read all var "x"; close;
 
/* set constraint matrix, options, and initial guess for optimization */
con = { 0,      /* lower bounds: 0 < p     */
        1};     /* upper bounds:     p < 1 */
opt = {1,       /* find maximum of function   */
       2};      /* print some output      */
p0  = 0.5;      /* initial guess for solution */
call nlpnra(rc, p_MLE, "Binom_LL", p0, opt, con);
print p_MLE;
Maximum likelihood estimate for binomial data

The NLPNRA subroutine computes that the maximum of the log-likelihood function occurs for p=0.56, which agrees with the graph in the previous article. We conclude that the parameter p=0.56 (with NTrials=10) is "most likely" to be the binomial distribution parameter that generated the data.

Maximum likelihood estimates for binomial data from PROC NLMIXED

If you've never used PROC NLMIXED before, you might wonder why I am using that procedure, since this problem is not a mixed modeling regression. However, you can use the NLMIXED procedure for general maximum likelihood estimation. In fact, I sometimes joke that SAS could have named the procedure "PROC MLE" because it is so useful for solving maximum likelihood problems.

PROC NLMIXED has built-in support for computing maximum likelihood estimates of data that follow the Bernoulli (binary), binomial, Poisson, negative binomial, normal, and gamma distributions. (You can also use PROC GENMOD to fit these distributions; I have shown an example of fitting Poisson data.)

The syntax for PROC NLMIXED is very simple for the binomial data. You use the PARMS statement to supply an initial guess for the parameter p. On the MODEL statement, you declare that you want to model the X variable as Binom(p) where NTrials=10. Be sure to ALWAYS check the documentation for the correct syntax for a binomial distribution. Some functions, like PDF, CDF, and RAND, use the p parameter as the first argument: Binom(p, NTrials). Some procedures, like the MCMC and NLMIXED procedures, use the p parameter as the second argument: Binom(NTrials, p).

/* Method 2: Use PROC NLMIXED solve using built-in modeling syntax */
proc nlmixed data=Binomial;
   parms p = 0.5;             * initial value for parameter;
   NTrials = 10;
   model x ~ binomial(NTrials, p);
run;
Maximum likelihood estimates for binomial data by using PROC NLMIXED in SAS

Notice that the output from PROC NLMIXED contains the parameter estimate, standard error, and 95% confidence intervals. The parameter estimate is the same value (0.56) as was found by the NLPNRA routine in SAS/IML. The confidence interval confirms what we previously saw in the graph of the log-likelihood function: the function is somewhat flat near the optimum, so a 95% confidence interval is wide: [0.49, 0.63].

Maximum likelihood estimates for lognormal data

You can use similar syntax to compute MLEs for lognormal data. The SAS/IML syntax is similar to the binomial example, so it is omitted. To view it, download the complete SAS program that computes these maximum likelihood estimates.

PROC NLMIXED does not support the lognormal distribution as a built-in distribution, which means that you need to explicitly write out the log-likelihood function and specify it in the GENERAL function on the MODEL statement. Whereas in SAS/IML you have to use the SUM function to sum the log-likelihood over all observations, the syntax for PROC NLMIXED is simpler. Just as the DATA step has an implicit loop over all observations, the NLMIXED procedure implicitly sums the log-likelihood over all observations. You can use the LOGPDF function, or you can explicitly write the log-density formula for each observation.

If you look up the lognormal distribution in the list of "Standard Definition" in the PROC MCMC documentation, you will see that one parameterization of the lognormal PDF in terms of the log-mean μ and log-standard-deviation σ is
f(x; μ, σ) = 1 / (σ x sqrt(2π)) * exp( -(log(x) - μ)**2 / (2 σ**2) )
When you take the logarithm of this quantity, you get two terms, or three if you use the rules of logarithms to isolate quantities that do not depend on the parameters:

proc nlmixed data=LN;
parms mu 1 sigma 1;                 * initial values of parameters;
bounds 0 < sigma;                   * bounds on parameters;
sqrt2pi = sqrt(2*constant('pi'));
LL = -log(sigma) 
     - log(sqrt2pi*x)               /* this term is constant w/r/t (mu, sigma) */
     - (log(x)-mu)**2  / (2*sigma**2);
/* Alternative: LL = logpdf("Lognormal", x, mu, sigma); */
model x ~ general(LL);
run;
Maximum likelihood estimates for lognormal data by using PROC NLMIXED in SAS

The parameter estimates are shown, along with standard errors and 95% confidence intervals. The maximum likelihood estimates for the lognormal data are (μ, σ) = (1.97, 0.50). You will get the same answer if you use the LOGPDF function (inside the comment) instead of the "manual calculation." You will also get the same estimates if you omit the term log(sqrt2pi*x) because that term does not depend on the MLE parameters.

In conclusion, you can use nonlinear optimization in the SAS/IML language to compute MLEs. This approach is especially useful when the computation is part of a larger computational program in SAS/IML. Alternatively, the NLMIXED procedure makes it easy to compute MLEs for discrete and continuous distributions. For some simple distributions, the log-likelihood functions are built into PROC NLMIXED. For others, you can specify the log likelihood yourself and find the maximum likelihood estimates by using the GENERAL function.

The post Two ways to compute maximum likelihood estimates in SAS appeared first on The DO Loop.

June 12, 2017
 

Maximum likelihood estimation (MLE) is a powerful statistical technique that uses optimization techniques to fit parametric models. The technique finds the parameters that are "most likely" to have produced the observed data. SAS provides many tools for nonlinear optimization, so often the hardest part of maximum likelihood is writing down the log-likelihood function. This article shows two simple ways to construct the log-likelihood function in SAS. For simplicity, this article describes fitting the binomial and lognormal distributions to univariate data.

Always use the log-likelihood function!

Although the method is known as maximum likelihood estimation, in practice you should optimize the log-likelihood function, which is numerically superior to work with. For an introduction to MLE, including the definitions of the likelihood and log-likelihood functions, see the Penn State Online Statistics Course, which is a wonderful reference.

MLE assumes that the observed data x={x1, x2, ..., xn} are independently drawn from some population. You want to find the most likely parameters θ = (θ1,...,θk) such that the data are fit by the probability density function (PDF), f(x; θ). Since the data are independent, the probability of observing the data is the product Πi f(xi, θ), which is the likelihood function L(θ | x). If you take the logarithm, the product becomes a sum. The log-likelihood function is
LL(θ | x) = Σi log( f(xi, θ) )

This formula is the key. It says that the log-likelihood function is simply the sum of the log-PDF function evaluated at the data values. Always use this formula. Do not ever compute the likelihood function (the product) and then take the log, because the product is prone to numerical errors, including overflow and underflow.

Two ways to construct the log-likelihood function

There are two simple ways to construct the log-likelihood function in SAS:

  1. Sum the values of the LOGPDF function evaluated at the data values.
  2. Apply the LOG function to the formula for the PDF and sum the resulting expression over the observations.

Example: The log-likelihood function for the binomial distribution

A coin was tossed 10 times and the number of heads was recorded. This was repeated 20 times to get a sample. A student wants to fit the binomial model X ~ Binom(p, 10) to estimate the probability p of the coin landing on heads. For this problem, the vector of MLE parameters θ is merely the one parameter p.
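
(The Binomial data set that stores the sample is created in the complete SAS program that you can download. A minimal sketch that simulates comparable data is shown below; the random seed and the true probability are assumptions for illustration.)

/* Sketch: simulate 20 observations, each the number of heads in 10 coin
   tosses. The seed and the true probability are assumptions for illustration. */
data Binomial;
call streaminit(1234);
do i = 1 to 20;
   x = rand("Binomial", 0.5, 10);   /* number of heads in 10 tosses */
   output;
end;
drop i;
run;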

Recall that if you are using SAS/IML to optimize an objective function, the parameter that you are trying to optimize should be the only argument to the function, and all other parameters should be specified on the GLOBAL statement. Thus one way to write a SAS/IML function for the binomial log-likelihood function is as follows:

proc iml;
/* Method 1: Use LOGPDF. This method works in DATA step as well */
start Binom_LL1(p) global(x, NTrials);
   LL = sum( logpdf("Binomial", x, p, NTrials) );
   return( LL );
finish;
 
/* visualize log-likelihood function, which is a function of p */
NTrials = 10;    /* number of trials (fixed) */
use Binomial; read all var "x"; close;
 
p = do(0.01, 0.99, 0.01);      /* vector of parameter values */
LL = j(1, ncol(p), .);
do i = 1 to ncol(LL);
   LL[i] = Binom_LL1( p[i] );  /* evaluate LL for a sequence of p */
end;
 
title "Graph of Log-Likelihood Function";
title2 "Binomial Distribution, NTrials=10";
call series(p, LL) grid={x y} xvalues=do(0,1,0.1)
                   label={"Probability of Sucess (p)", "Log Likelihood"};
Graph of log-likelihood function for the binomial distribution

Notice that the data are fixed; the log likelihood is considered to be a function of the parameter p. Therefore you can graph the function for representative values of p, as shown. The graph clearly shows that the log likelihood is maximal near p=0.56, which is the maximum likelihood estimate. The graph is fairly flat near its optimal value, which indicates that the estimate has a wide standard error. A 95% confidence interval for the parameter is also wide. If the sample contained 100 observations instead of only 20, the log-likelihood function might have a narrower peak.

Notice also that the LOGPDF function made this computation very easy. You do not need to worry about the actual formula for the binomial density. All you have to do is sum the log-density at the data values.

In contrast, the second method requires a little more work, but can handle any distribution for which you can compute the density function. If you look up the formula for the binomial PDF in the MCMC documentation, you see that
PDF(x; p, NTrials) = comb(NTrials, x) * p**x * (1-p)**(NTrials-x)
where the COMB function computes the binomial coefficient "NTrials choose x." There are three terms in the PDF that are multiplied together. Therefore when you apply the LOG function, you get the sum of three terms. You can use the LCOMB function in SAS to evaluate the logarithm of the binomial coefficients in an efficient manner, as follows:

/* Method 2: Manually compute log likelihood by using formula */
start Binom_LL2(p) global(x, NTrials);
   LL = sum(lcomb(NTrials, x)) + log(p)*sum(x) + log(1-p)*sum(NTrials-x);
   return( LL );
finish;
 
LL2 = Binom_LL2(p);      /* vectorized function, so no need to loop */

The second formulation has an advantage in a vector language such as SAS/IML because you can write the function so that it can evaluate a vector of values with one call, as shown. It also has the advantage that you can modify the function to eliminate terms that do not depend on the parameter p. For example, if your only goal is maximize the log-likelihood function, you can omit the term sum(lcomb(NTrials, x)) because that term is a constant with respect to p. That reduces the computational burden. Of course, if you omit the term then you are no longer computing the exact binomial log likelihood.

Example: The log-likelihood function for the lognormal distribution

In a similar way, you can use the LOGPDF or the formula for the PDF to define the log-likelihood function for the lognormal distribution. For brevity, I will only show the SAS/IML functions, but you can download the complete SAS program that defines the log-likelihood function and computes the graph.

The following SAS/IML modules show two ways to define the log-likelihood function for the lognormal distribution. For the lognormal distribution, the vector of parameters θ = (μ, σ) contains two parameters.

/* Method 1: use LOGPDF */
start LogNormal_LL1(param) global(x);
   mu = param[1];
   sigma = param[2];
   LL = sum( logpdf("Lognormal", x, mu, sigma) );
   return( LL );
finish;
 
/* Method 2: Manually compute log likelihood by using formula
   PDF(x; mu, sigma) = 1/(sigma # x # sqrt(2*pi)) # exp( -(log(x)-mu)##2 / (2*sigma##2) )
*/
start LogNormal_LL2(param) global(x);
   mu = param[1];
   sigma = param[2];
   twopi = 2*constant('pi');
   LL = -nrow(x)/2*log(twopi*sigma##2) 
        - sum( (log(x)-mu)##2 )/(2*sigma##2)
        - sum(log(x));  /* this term is constant w/r/t (mu, sigma) */
   return( LL );
finish;

The function that uses the LOGPDF function is simple to write. The second method is more complicated because the lognormal PDF is more complicated than the binomial PDF. Nevertheless, the complete log-likelihood function only requires a few SAS/IML statements.

For completeness, the contour plot on this page shows the log-likelihood function for 200 simulated observations from the Lognormal(2, 0.5) distribution. The parameter estimates are (μ, σ) = (1.97, 0.5).

Graph of the log-likelihood function for the lognormal distribution

Summary

This article has shown two simple ways to define a log-likelihood function in SAS. You can sum the values of the LOGPDF function evaluated at the observations, or you can manually apply the LOG function to the formula for the PDF function. The log likelihood is regarded as a function of the parameters of the distribution, even though it also depends on the data. For distributions that have one or two parameters, you can graph the log-likelihood function and visually estimate the value of the parameters that maximize the log likelihood.

Of course, SAS enables you to numerically optimize the log-likelihood function, thereby obtaining the maximum likelihood estimates. My next blog post shows two ways to obtain maximum likelihood estimates in SAS.

The post Two simple ways to construct a log-likelihood function in SAS appeared first on The DO Loop.

June 5, 2017
 

If you toss a coin 28 times, you would not be surprised to see three heads in a row, such as ...THHHTH.... But what about eight heads in a row? Would a sequence such as THHHHHHHHTH... be a rare event?

This question popped into my head last weekend as I attended my son's graduation ceremony. As the students marched in, I noticed that the men were dressed in green caps and gowns, whereas the women were dressed in white. They entered in alphabetical order, which randomized the men and women. They filed into 12 rows that each contained 28 seats. Thus each row is like an independent toss of a coin, with green and white representing heads and tails, respectively.

Photo of graduating men and women in different colored caps and gowns

When the students entered the ninth row from the left (fourth from the right), I noticed a sequence of eight consecutive "little green men," which is highlighted in red in the picture on this page. I wish I had a photo of the students seated in their chairs because the effect is more dramatic when the green mortarboards are all aligned. But take my word for it: the long sequence of green was very noticeable.

The picture shows that there was actually a row to the extreme left that was partially filled. For the purpose of this article, ignore the partial row. In the 12 full rows, the number of men in each row is (from left to right) {15, 15, 14, 11, 16, 16, 15, 10, 20, 9, 14, 13}. Remarkably, this adds to 168, so the proportion of men is exactly 0.5 of the 12 x 28 = 336 students.

Simulate the binary pattern

You can simulate the students by generating 336 random binary values arranged on a 12 x 28 grid. Since this was the graduating class of 2017, I used 2017 as the random number seed in the following DATA step:

%let NumRows = 12;
%let NumInRow= 28;
data Graduation;
call streaminit(2017);  
do row = 1 to &NumRows;
   do seat = 1 to &NumInRow;
      Male = rand("Bernoulli", 0.5); 
      output;
   end;
end;
run;
 
title "One Simulated Seating Arrangement";
proc sgplot data=Graduation;
   styleattrs wallcolor=grey DATACONTRASTCOLORS=(white green);
   scatter x=Row y=Seat / group=male markerattrs=(symbol=SquareFilled);
   xaxis integer values=(1 to 12);
run;
Random binary values on a regular grid

If you look at row 5 in the image, you will see a sequence of nine consecutive green markers. The fact that a simulated data set reproduced the graduation scenario on the very first attempt makes me think that this situation is not very rare. However, changing the seed a few times shows that the situation does not always occur.

Runs in coin tosses

There are 12 rows, each containing 28 students. The event of interest is a row with eight or more consecutive males. The easiest way to compute the probability of this happening is to first compute the probability for one row. Since the rows are assumed to be independent, you can then compute the probability of seeing the event in any of the 12 rows.

A sequence of consecutive events is also called a "run" of events. If you do an internet search for "probability of k heads in a row" or "probability of runs in coin toss", you will find many solutions to this problem. The source I used is a question that was asked on StackExchange about "blocks of events." Whereas many people approach this problem by using a simulation or an explicit recursive mathematical formula, "Neil G" and "COOLSerdash" compute the probability by using a Markov transition matrix, which is easy to create in the SAS/IML matrix language.

The following statements define a function that creates the Markov transition matrix and iterates it to compute the probability that a coin will show k or more consecutive heads in N tosses. The program works for any probability of heads, not merely p=0.5. See the StackExchange article for the explanation:

proc iml;
k = 8;                     * desired number of correct trials in a row;
p = 1/2;                   * probability of getting a correct trial;
N = 28;                    * Total number of trials;
 
/* Iterate Markov transition matrix to compute probability of 
   k consecutive heads in N tosses of a coin that has 
   probability p of showing heads */
start ProbConsec(N, p, k);
   M = j(k+1, k+1, 0);     * set up the transition matrix M;
   M[1, 1:k] = (1-p);      * first row, except for last column;
   M[k+1, k+1] = 1;        * lower right corner;
   do i = 2 to (k+1);
      M[i, i-1] = p;       * subdiagonal elements;
   end;
   Mn = M**N;              * Calculate M^N;
   /* Prob that starting in State 1 ends in State (k+1) */
   return(Mn[(k+1), 1]);   
finish;
 
prob = ProbConsec(N, p, k);
print prob;

The result shows that the probability of seeing a run of eight or more consecutive heads in 28 tosses is 0.0426. This is the same as the probability of observing eight or more consecutive men in green in one of the rows at graduation, assuming that alphabetical ordering randomizes men and women. However, remember that there were 12 rows at graduation, so the probability of observing this event in ANY row is higher, as shown below:

ProbSee0 = (1-prob)##12;   * P(Not in Row1 AND ... NOT in Row 12);
ProbSeeAny = 1 - ProbSee0; * P(In Row1 OR ... OR in Row 12);
print ProbSeeAny ProbSee0;

The chance of observing eight or more consecutive men in any of the 12 rows is about 41%. And why stop there? A long run of consecutive women is just as noticeable as a long run of consecutive men. If you extend the analysis to count runs of either sex, you find that the probability of observing an "extreme arrangement" (a single-sex run of eight or more students in some row) is about 0.64.

Summary

In summary, graduation events can be long, but computing the probabilities of interesting arrangements of the students can help make the time go faster! I wasn't able to compute the probabilities in my head while at the graduation, but it didn't take long to research the problem and solve it with SAS after I got home. I conclude that observing a long sequence of men in a randomized seating arrangement that has 12 rows of 28 seats is not a rare event: the chance of observing a run of eight or more men in some row is about 41%, and the chance of a single-sex run of that length (men or women) is about 64%.

The real lesson for all of us is that we should keep our eyes open and look around. Math and statistics are everywhere!

The post Runs in coin tosses; patterns in random seating appeared first on The DO Loop.

May 24, 2017
 

According to Hyndman and Fan ("Sample Quantiles in Statistical Packages," TAS, 1996), there are nine definitions of sample quantiles that commonly appear in statistical software packages. Hyndman and Fan identify three definitions that are based on rounding and six methods that are based on linear interpolation. This blog post shows how to use SAS to visualize and compare the nine common definitions of sample quantiles. It also compares the default definitions of sample quantiles in SAS and R.

Definitions of sample quantiles

Suppose that a sample has N observations that are sorted so that x[1] ≤ x[2] ≤ ... ≤ x[N], and suppose that you are interested in estimating the pth quantile (0 ≤ p ≤ 1) for the population. Intuitively, the data values near x[j], where j = floor(Np), are reasonable values to use to estimate the quantile. For example, if N=10 and you want to estimate the quantile for p=0.64, then j = floor(Np) = 6, so you can use the sixth ordered value (x[6]) and maybe other nearby values to estimate the quantile.

Hyndman and Fan (henceforth H&F) note that the quantile definitions in statistical software have three properties in common:

  • The value p and the sample size N are used to determine two adjacent data values, x[j] and x[j+1]. The quantile estimate will be in the closed interval between those data points. For the previous example, the quantile estimate would be in the closed interval between x[6] and x[7].
  • For many methods, a fractional quantity is used to determine an interpolation parameter, λ. For the previous example, the fractional quantity is (Np - j) = (6.4 - 6) = 0.4. If you use λ = 0.4, then an estimate of the 64th percentile would be the value 40% of the way between x[6] and x[7].
  • Each definition has a parameter m, 0 ≤ m ≤ 1, which determines how the method interpolates between adjacent data points. In general, the methods define the index j by using j = floor(Np + m). The previous example used m=0, but other choices include m=0.5 or values of m that depend on p.

Thus a general formula for quantile estimates is q = (1 - λ) x[j] + λ x[j+1], where λ and j depend on the values of p, N, and a method-specific parameter m.

You can read Hyndman and Fan (1996) for details or see the Wikipedia article about quantiles for a summary. The Wikipedia article points out a practical consideration: for values of p that are very close to 0 or 1, some definitions need to be slightly modified. For example, if p < 1/N, the quantity Np < 1 and so j = floor(Np) equals 0, which is an invalid index. The convention is to return x[1] when p is very small and return x[N] when p is very close to 1.
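
As a concrete example of the general formula, the following SAS/IML sketch implements one member of the family: the interpolation definition with m = 1 - p, which corresponds to the H&F type 7 definition (the default in R). The other interpolation definitions differ only in the choice of m and in the handling of p near 0 or 1. The data are the tiny example used later in this article.

/* Sketch: one instance of the general interpolation formula
   q = (1-lambda)*x[j] + lambda*x[j+1], using m = 1-p (H&F type 7). */
proc iml;
start QuantileType7(x, p);
   y = colvec(x);  call sort(y);      /* sort the data */
   N = nrow(y);
   h = N*p + (1 - p);                 /* h = N*p + m, with m = 1 - p */
   if h <= 1 then return( y[1] );     /* p at or below the first data value */
   if h >= N then return( y[N] );     /* p at or above the last data value  */
   j = floor(h);                      /* index of the lower data value      */
   lambda = h - j;                    /* interpolation parameter            */
   return( (1-lambda)*y[j] + lambda*y[j+1] );
finish;
 
x = {0 1 1 1 2 2 2 4 5 8};            /* tiny example data from this article */
q64 = QuantileType7(x, 0.64);
print q64;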

Compute all nine sample quantile definitions in SAS

SAS has built-in support for five of the quantile definitions, notably in PROC UNIVARIATE, PROC MEANS, and in the QNTL subroutine in SAS/IML. You can use the QNTLDEF= option (called PCTLDEF= in PROC UNIVARIATE) to choose among the five definitions. The following table associates the five QNTLDEF= definitions in SAS with the corresponding definitions from H&F, which are also used by R. In R, you choose the definition by using the type parameter in the quantile function.

SAS definitions of sample quantiles

It is straightforward to write a SAS/IML function to compute the other four definitions in H&F. In fact, H&F present the quantile interpolation functions as specific instances of one general formula that contains a parameter, which they call m. As mentioned above, you can also define a small value c (which depends on the method) such that the method returns x[1] if p < c, and the method returns x[N] if p ≥ 1 - c.

The following table presents the parameters for computing the four sample quantile definitions that are not natively supported in SAS:

Definitions of sample quantiles that are not natively supported in SAS

Visualizing the definitions of sample quantiles

Visualization of nine definitions of sample quantiles, from Hyndman and Fan (1996)

You can download the SAS program that shows how to compute sample quantiles and graphs for any of the nine definitions in H&F. The differences between the definitions are most evident for small data sets and when there is a large "gap" between one or more adjacent data values. The following panel of graphs shows the nine sample quantile methods for a data set that has 10 observations, {0 1 1 1 2 2 2 4 5 8}. Each cell in the panel shows the quantiles for p = 0.001, 0.002, ..., 0.999. The bottom of each cell is a fringe plot that shows the six unique data values.

In these graphs, the horizontal axis represents the data and quantiles. For any value of x, the graph estimates the cumulative proportion of the population that is less than or equal to x. Notice that if you turn your head sideways, you can see the quantile function, which is the inverse function that estimates the quantile for each value of the cumulative probability.

You can see that although the nine quantile functions have the same basic shape, the first three methods estimate quantiles by using a discrete rounding scheme, whereas the other methods use a continuous interpolation scheme.

You can use the same data to compare methods. Instead of plotting each quantile definition in its own cell, you can overlay two or more methods. For example, by default, SAS computes sample quantiles by using the type=2 method, whereas R uses type=7 by default. The following graph overlays the sample quantiles to compare the default methods in SAS and R on this tiny data set. The default method in SAS always returns a data value or the average of adjacent data values; the default method in R can return any value in the range of the data.

Comparison of the default  quantile estimates in SAS and R on a tiny data set

Does the definition of sample quantiles matter?

As shown above, different software packages use different defaults for sample quantiles. Consequently, when you report quantiles for a small data set, it is important to report how the quantiles were computed.

However, in practice analysts don't worry too much about which definition they are using because the difference between methods is typically small for larger data sets (100 or more observations). The biggest differences are often between the discrete methods, which always report a data value or the average between two adjacent data values, and the interpolation methods, which can return any value in the range of the data. Extreme quantiles can also differ between the methods because the tails of the data often have fewer observations and wider gaps.

The following graph shows the sample quantiles for 100 observations that were generated from a random uniform distribution. As before, the two sample quantiles are type=2 (the SAS default) and type=7 (the R default). At this scale, you can barely detect any differences between the estimates. The red dots (type=7) are on top of the corresponding blue dots (type=2), so few blue dots are visible.

Comparison of the default quantile estimates in SAS and R on a larger data set

So does the definition of the sample quantile matter? Yes and no. Theoretically, the different methods compute different estimates and have different properties. If you want to use an estimator that is unbiased or one that is based on distribution-free computations, feel free to read Hyndman and Fan and choose the definition that suits your needs. The differences are evident for tiny data sets. On the other hand, the previous graph shows that there is little difference between the methods for moderately sized samples and for quantiles that are not near gaps. In practice, most data analysts just accept the default method for whichever software they are using.

In closing, I will mention that there are other quantile estimation methods that are not simple formulas. In SAS, the QUANTREG procedure solves a minimization problem to estimate the quantiles. The QUANTREG procedure enables you to not only estimate quantiles, but also estimate confidence intervals, weighted quantiles, the difference between quantiles, conditional quantiles, and more.
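For example, a minimal sketch of estimating unconditional quantiles with PROC QUANTREG fits an intercept-only model; the data set name Have and the variable y are placeholders:

/* the intercepts are the 0.5 and 0.9 quantile estimates;
   CI=RESAMPLING requests bootstrap confidence limits */
proc quantreg data=Have ci=resampling;
   model y = / quantile=0.5 0.9;
run;

The ParameterEstimates table then contains the quantile estimates (the intercepts) and their confidence limits.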

SAS program to compute nine sample quantiles.

The post Sample quantiles: A comparison of 9 definitions appeared first on The DO Loop.

May 22, 2017
 

In last week's article about the Flint water crisis, I computed the 90th percentile of a small data set. Although I didn't mention it, the value that I reported is different from the 90th percentile that is reported in Significance magazine.

That is not unusual. The data only had 20 unique values, and there are many different formulas that you can use to compute sample percentiles (generally called quantiles). Because different software packages use different default formulas for sample quantiles, it is not uncommon for researchers to report different quantiles for small data sets. This article discusses the five percentile definitions that are supported in SAS software.

You might wonder why there are multiple definitions. Recall that a sample quantile is an estimate of a population quantile. Statisticians have proposed many quantile estimators, some of which are based on the empirical cumulative distribution (ECDF) of the sample, which approximates the cumulative distribution function (CDF) for the population. The ECDF is a step function that has a jump discontinuity at each unique data value. Consequently, the inverse ECDF does not exist and the quantiles are not uniquely defined.

Definitions of sample quantiles

In SAS, you can use the PCTLDEF= option in PROC UNIVARIATE or the QNTLDEF= option in other procedures to control the method used to estimate quantiles. A sample quantile does not have to be an observed data value because you are trying to estimate an unknown population parameter.

For convenience, assume that the sample data are listed in sorted order. In high school, you probably learned that if a sorted sample has an even number of observations, then the median value is the average of the middle observations. The default quantile definition in SAS (QNTLDEF=5) extends this familiar rule to other quantiles. Specifically, if the sample size is N and you ask for the q_th quantile, then when Nq is an integer, the quantile is the average of the two data values x[Nq] and x[Nq+1]; that is how the familiar rule for the median arises when N is even and q=0.5. However, when Nq is not an integer, the quantile is the single data value x[j+1], where j = floor(Nq). For example, if N=10 and you want the q=0.17 quantile, then Nq=1.7, so j=1 and the 17th percentile is reported as the ordered value x[2].

These rounding and averaging rules are not the only possible choices. The other percentile definitions handle the two cases differently. For example, you could always round Nq up to the next integer and never average (QNTLDEF=3), or you could round Nq to the nearest integer (QNTLDEF=2). Or you could use linear interpolation (QNTLDEF=1 and QNTLDEF=4) between the data values whose (sorted) indices are closest to Nq. In the example where N=10 and q=0.17, the QNTLDEF=1 interpolated quantile is 0.3 x[1] + 0.7 x[2].
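To show the syntax concretely, the following sketch requests percentiles under QNTLDEF=1 in PROC UNIVARIATE and PROC MEANS. The data set Have and the variable x are placeholder names, not data from this article:

/* PCTLDEF= in PROC UNIVARIATE; QNTLDEF= in most other procedures */
proc univariate data=Have noprint pctldef=1;
   var x;
   output out=Pctl17 pctlpts=17 pctlpre=P;   /* creates the variable P17 */
run;
 
proc means data=Have qntldef=1 p25 p50 p75;  /* same definition, different procedure */
   var x;
run;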

Visualizing the definitions for quantiles

The SAS documentation contains the formulas used for the five percentile definitions, but sometimes a visual comparison is easier than slogging through mathematical equations. The differences between the definitions are most apparent on small data sets that contain integer values, so let's create a tiny data set and apply the five definitions to it. The following example has 10 observations and six unique values.

data Q;
input x @@;
datalines;
0 1 1 1 2 2 2 4 5 8
;
ECDF of a small data set

You can use PROC UNIVARIATE or other methods to plot the empirical cumulative proportions, as shown. Because the ECDF is a step function, most cumulative proportion values (such as 0.45) are "in a gap." By this I mean that there is no observation t in the data for which the cumulative proportion P(X ≤ t) equals 0.45. Depending on how you define the sample quantiles, the 0.45 quantile might be reported as 1, 1.5, 1.95, or 2.

Since the default definition is QNTLDEF=5, let's visualize the sample quantiles for that definition. You can use the PCTLPTS= option on the OUTPUT statement in PROC UNIVARIATE to declare the percentiles that you want to compute. Equivalently, you can use the QNTL subroutine in PROC IML, as below. Either way, you can ask SAS to find the quantiles for a fine grid of probabilities such as {0.001, 0.002, ..., 0.998, 0.999}. You can then graph the probabilities against the quantiles to visualize how the percentile definition computes quantiles for the sample data.

proc iml;
use Q; read all var "x"; close;       /* read data */
prob = T(1:999) / 1000;               /* fine grid of prob values */
call qntl(quantile, x, prob, 5);      /* use method=5 definition  */
create Pctls var {"quantile" "prob" "x"}; append; close;
quit;
 
title "Sample Percentiles";
title2 "QNTLDEF = 5";
proc sgplot data=Pctls noautolegend;
   scatter x=quantile y=prob / markerattrs=(size=5 symbol=CircleFilled);
   fringe x / lineattrs=GraphData2(thickness=3);
   xaxis display=(nolabel) values=(0 to 8);
   yaxis offsetmin=0.05 grid values=(0 to 1 by 0.1) label="Cumulative Proportions";
   refline 0 1 / axis=y;
run;
Sample quantiles (percentiles) for a small data set

For each probability value (Y axis), the graph shows the corresponding sample quantile (X axis) for the default definition in SAS, which is QNTLDEF=5. The X axis also displays red tick marks at the location of the data. You can use this graph to find any quantile. For example, to find the 0.45 quantile, you start at 0.45 on the Y axis, move to the right until you hit a blue marker, and then drop down to the X axis to discover that the 0.45 quantile estimate is 2.

If you prefer to think of the quantiles (the X values) as a function of the probabilities, just interchange the X= and Y= arguments in the SCATTER statement (or turn your head sideways!). Then the quantile function is a step function.
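As mentioned above, you can get similar estimates without SAS/IML by using the PCTLPTS= option in PROC UNIVARIATE. A rough sketch follows; the data set names Wide and Long are mine, and the grid of percentiles is coarser than the 0.001 grid used above. Because the OUTPUT statement writes the percentiles as a single wide observation, a transpose step reshapes them for plotting:

proc univariate data=Q noprint pctldef=5;
   var x;
   output out=Wide pctlpts=1 to 99 by 1 pctlpre=P;   /* P1, P2, ..., P99 in one row */
run;
 
proc transpose data=Wide out=Long(rename=(col1=quantile)) name=PctName;
run;
 
data Long;
   set Long;
   prob = input(substr(PctName, 2), 32.) / 100;      /* 'P45' -> 0.45 */
run;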

Comparing all five SAS percentile definitions

It is easy to put a loop around the SAS/IML computation to compute the sample quantiles for the five different definitions that are supported in SAS. The following SAS/IML program writes a data set that contains the sample quantiles. You can use the WHERE statement in PROC PRINT to compare the same quantile across the different definitions. For example, the following displays the 0.45 quantile (45th percentile) for the five definitions:

/* Compare all SAS methods */
proc iml;
use Q; read all var "x"; close;       /* read data */
prob = T(1:999) / 1000;               /* fine grid of prob values */
create Pctls var {"Qntldef" "quantile" "prob" "x"};
do def = 1 to 5;
   call qntl(quantile, x, prob, def); /* qntldef=1,2,3,4,5 */
   Qntldef = j(nrow(prob), 1, def);   /* ID variable */
   append;
end;
close;
quit;
 
proc print data=Pctls noobs;
   where prob = 0.45;                 /* compare 0.45 quantile for different definitions */
   var Qntldef quantile;
run;

You can see that the different definitions lead to different sample quantiles. How do the quantile functions compare? Let's plot them and see:

ods graphics / antialiasmax=10000;
title "Sample Percentiles in SAS";
proc sgpanel data=Pctls noautolegend;
   panelby Qntldef / onepanel rows=2;
   scatter x=quantile y=prob/ markerattrs=(size=3 symbol=CircleFilled);
   fringe x;
   rowaxis offsetmax=0 offsetmin=0.05 grid values=(0 to 1 by 0.1) label="Cumulative Proportion";
   refline 0 1 / axis=y;
   colaxis display=(nolabel);
run;
Compare percentile definitions in SAS

The graphs (click to enlarge) show that QNTLDEF=1 and QNTLDEF=4 are piecewise-linear interpolation methods, whereas QNTLDEF=2, 3, and 5 are discrete rounding methods. The default method (QNTLDEF=5) is similar to QNTLDEF=2 except for certain averaged values. For the discrete definitions, SAS returns either a data value or the average of adjacent data values. The interpolation methods do not have that property: they can return any value between two observed data values.

If you have a small data set, as in this blog post, it is easy to see how the percentile definitions are different. For larger data sets (say, 100 or more unique values), the five quantile functions look quite similar.

The differences between definitions are most apparent when there are large gaps between adjacent data values. For example, the sample data has a large gap between the ninth and tenth observations, which have the values 5 and 8, respectively. If you compute the 0.901 quantile, you will discover that the "round to the nearest" method (QNTLDEF=2) gives 5 as the sample quantile, whereas the "round up" method (QNTLDEF=3) gives the value 8. Similarly, the "backward interpolation" method (QNTLDEF=1) gives 5.03, whereas the "forward interpolation" method (QNTLDEF=4) gives 7.733.
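If you want to verify these values, a quick sketch (my own, not part of the original program) calls the QNTL subroutine for p = 0.901 under each definition:

/* the 0.901 quantile of the 10-observation sample under QNTLDEF=1,...,5 */
proc iml;
x = {0,1,1,1,2,2,2,4,5,8};
Qntldef = T(1:5);
q = j(5, 1, .);
do def = 1 to 5;
   call qntl(est, x, 0.901, def);
   q[def] = est;
end;
print Qntldef q[format=7.3];
quit;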

In summary, this article shows how the (somewhat obscure) QNTLDEF= option results in different quantile estimates. Most people just accept the default definition (QNTLDEF=5), but if you are looking for a method that interpolates between data values, rather than a method that rounds and averages, I recommend QNTLDEF=1, which performs linear interpolations of the ECDF. The differences between the definitions are most apparent for small samples and when there are large gaps between adjacent data values.

Reference

For more information about sample quantiles, including a mathematical discussion of the various formulas, see
Hyndman, R. J. and Fan, Y. (1996) "Sample quantiles in statistical packages", American Statistician, 50, 361–365.

The post Quantile definitions in SAS appeared first on The DO Loop.

May 17, 2017
 

The April 2017 issue of Significance magazine features a cover story by Robert Langkjaer-Bain about the Flint (Michigan) water crisis. For those who don't know, the Flint water crisis started in 2014 when the impoverished city began using the Flint River as a source of city water. The water was not properly treated, which led to unhealthy (even toxic) levels of lead and other contaminants in the city's water supply. You can read an overview of the Flint Water Crisis on Wikipedia.

The crisis was compounded because someone excluded two data points before computing a quantile of a small data set. This seemingly small transgression had tragic ramifications. This article examines the Flint water quality data and shows why excluding those two points changed the way that the city responded. You can download the SAS program that analyzes these data.

Federal standards for detecting unsafe levels of lead

The federal Lead and Copper Rule of 1991 specifies a statistical method for determining when the concentration of lead in a water supply is too high. First, you sample from a number of "worst case" homes (such as those served by lead pipes), then compute the 90th percentile of the lead levels from those homes. If the 90th percentile exceeds 15 parts per billion (ppb), then the water is unsafe and action must be taken to correct the problem.

In spring 2015, this data collection and analysis was carried out in Flint by the Michigan Department of Environmental Quality (MDEQ), but as discussed in the Significance article, the collection process was flawed. For example, the MDEQ was supposed to collect 100 measurements, but only 71 samples were obtained, and they were not from the "worst case" homes. The 71 lead measurements that they collected are reproduced below, where I have used '0' for "not detectable." A call to PROC MEANS computes the 90th percentile (P90) of the complete sample:

/* values of lead concentration in Flint water samples.
   Use 0 for "not detectable" */
data FlintObs;
label Lead = "Lead Concentration (ppb)";
input lead @@;
Exclude = (Lead=20 | Lead=104); /* MDEQ excluded these two large values */
datalines;
0 0 0 0 0 0 0 0 0 0 0 0 0 
1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 
3 3 3 3 3 3 3 3 3 3 3 4 4 
5 5 5 5 5 5 5 5 6 6 6 6 7 7 7 
8 8 9 10 10 11 13 18 20 21 22 29 43 43 104
;
 
proc means data=FlintObs N p90; 
   var lead;
run;

According to this analysis of the full data, the 90th percentile of the sample is 18 ppb, which exceeds the federal limit of 15 ppb. Consequently, the Flint water fails the safety test, and the city must take action to improve the water.

But that is not what happened. Instead, "the MDEQ told the city's water quality supervisors to remove two of the samples" (Langkjaer-Bain, p. 19) that were over 15 ppb. The MDEQ claimed that these data were improperly collected. The two data points that were excluded have the values 20 and 104. Because these values are both higher than the 90th percentile, excluding these observations lowered the 90th percentile of the modified sample, which has 69 observations. The following call to PROC MEANS computes the 90th percentile for the modified sample:

proc means data=FlintObs N p90; 
   where Exclude=0;     /* exclude two observations */
   var lead;
run;

The second table shows that the 90th percentile of the modified sample is 13 ppb. "The exclusion of these two samples nudged the 90th percentile reading...below the all-important limit of 15 ppb." (Langkjaer-Bain, p. 20) The modified conclusion is that the city does not need to undertake expensive corrective action to render the water safe.

The consequences of keeping or excluding these outliers were huge. If the outliers are kept in the analysis, officials are required to take expensive corrective action; if they are excluded, no action is required. In the end, the modified sample was used, and the citizens of Flint were told that the water supply had passed the safety test.

The following histogram (click to enlarge) is similar to Figure 1 in Langkjaer-Bain's article. The red bars indicate observations that the MDEQ excluded from the analysis. The graph clearly shows that the distribution of lead values has a long tail. A broken axis is used to indicate that the distance to the 104 ppb reading has been shortened to reduce horizontal space. The huge gap near 15 ppb indicates a lack of data near that important value. Therefore the quantiles near that point will be extremely sensitive to deleting extreme values. To me, the graph indicates that more data should be collected so that policy makers can be more confident in their conclusions.

Distribution of Lead Levels in Flint, Michigan (Spring 2015)

Confidence intervals for the 90th percentile

Clearly the point estimate for the 90th percentile depends on whether or not those two measurements are excluded. But can statistics provide any additional insight into the 90th percentile of lead levels in the Flint water supply? When someone reports a statistic for a small sample, I like to ask "how confident are you in that statistic?" A standard error is one way to express the accuracy of a point estimate; a 95% confidence interval (CI) is another. The width of a confidence interval gives us information about the accuracy of the point estimate. As you might expect, the standard error for an extreme quantile (such as 0.9) is typically much bigger than for a quantile near 0.5, especially when there isn't much data near the quantile.

Let's use SAS procedures to construct a 95% CI for the 90th percentile. PROC UNIVARIATE supports the CIPCTLDF option, which produces distribution-free confidence intervals. I'll give the Flint officials the benefit of the doubt and compute a confidence interval for the modified data that excluded the two "outliers":

proc univariate data=FlintObs CIPctlDF;
   where Exclude=0;     /* exclude two observations */
   var Lead;
   ods select quantiles;
run;

The 95% confidence interval for P90 is [8, 43], which is very wide and includes the critical value 15 ppb in its interior. If someone asks, "how confident are you that the 90th percentile does not exceed 15 ppb," you should respond, "based on these data, I am not confident at all."

Statistical tests for the 90th percentile

As I've written before, you can also use the QUANTREG procedure in SAS as an alternative way to compute confidence intervals for percentiles. Furthermore, the QUANTREG procedure supports the ESTIMATE statement, which you can use to perform a one-sided test for the hypothesis "does the 90th percentile exceed 15 ppb?" The following call to PROC QUANTREG performs this analysis and uses 10,000 bootstrap resamples to estimate the confidence interval:

proc quantreg data=FlintObs CI=resampling(nrep=10000);
   where Exclude=0;     /* exclude two observations */
   model lead = / quantile=0.9 seed=12345;
   estimate 'P90 > 15' Intercept 1 / upper CL testvalue=15;
   ods select ParameterEstimates Estimates;
run;

The standard error for the 90th percentile is about 5.3. Based on bootstrap resampling methods, the 95% CI for P90 is approximately [2.4, 23.6]. (Other methods for estimating the CI give similar or wider intervals.) The ESTIMATE statement is used to test the null hypothesis that P90 is greater than 15. The p-value is large, which means that even if you delete two large lead measurements, the data do not provide evidence to reject the null hypothesis.

Conclusions

There is not enough evidence to reject the hypothesis that P90 is greater than the legal limit of 15 ppb. Two different 95% confidence intervals for P90 include 15 ppb in their interiors.

In fact, the confidence intervals include 15 ppb whether you use all 71 observations or just the 69 observations that MDEQ used. So you can argue about whether the MDEQ should have excluded the controversial measurements, but the hypothesis test gives the same conclusion regardless. By using these data, you cannot rule out the possibility that the 90th percentile of the Flint water supply is greater than 15 ppb.
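To check that claim for the full sample, you can rerun the earlier PROC UNIVARIATE step without the WHERE statement so that all 71 observations are used; this is simply the previous code with the exclusion removed:

proc univariate data=FlintObs CIPctlDF;
   var Lead;            /* no WHERE statement, so no observations are excluded */
   ods select quantiles;
run;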

What do you have to say? Share your comments.

The post Quantiles and the Flint water crisis appeared first on The DO Loop.