data analysis

5月 082019
 

At SAS Global Forum 2019, Daymond Ling presented an interesting discussion of binary classifiers in the financial industry. The discussion is motivated by a practical question: If you deploy a predictive model, how can you assess whether the model is no longer working well and needs to be replaced?

Daymond discussed the following three criteria for choosing a model:

  1. Discrimination: The ability of the binary classifier to predict the class of a labeled observation. The area under an ROC curve is one measure of a binary model's discrimination power. In SAS, you can compute the ROC curve for any predictive model.
  2. Accuracy: The ability of the model to estimate the probability of an event. The calibration curve is a graphical indication of a model's accuracy. In SAS, you can compute a calibration curve manually, or you can use PROC LOGISTIC in SAS/STAT 15.1 to automatically compute a calibration curve.
  3. Stability: Point estimates are often used to choose a model, but you should be aware of the variability of the estimates. This is a basic concept in statistics: When choosing between two unbiased estimators, you should usually choose the one that has smaller variance. SAS procedures provide (asymptotic) standard errors for many statistics such as the area under an ROC curve. If you have reason to doubt the accuracy of an asymptotic estimate, you can use bootstrap methods in SAS to estimate the sampling distribution of the statistic.

Estimates of model stability

My article about comparing the ROC curves for predictive models contains two competing models: A model from using PROC LOGISTIC and an "Expert model" that was constructed by asking domain experts for their opinions. (The source of the models is irrelevant; you can use any binary classifier.) You can download the SAS program that produces the following table, which estimates the area under each ROC curve, the standard error, and 90% confidence intervals:

The "Expert" model has a larger Area statistic and a smaller standard error, so you might choose to deploy it as a "champion model."

In his presentation, Daymond asked an important question. Suppose one month later you run the model on a new batch of labeled data and discover that the area under the ROC curve for the new data is only 0.73. Should you be concerned? Does this indicate that the model has degraded and is no longer suitable? Should you cast out this model, re-train all the models (at considerable time and expense), and deploy a new "champion"?

The answer depends on whether you think Area = 0.73 represents a degraded model or whether it can be attributed to sampling variability. The statistic 0.73 is barely more than 1 standard error away from the point estimate, and you will recall that 68% of a normal distribution is within one standard deviation of the mean. From that point of view, the value 0.73 is not surprising. Furthermore, the 90% confidence interval indicates that if you run this model every day for 100 days, you will probably encounter statistics lower than 0.68 merely due to sampling variability. In other words, a solitary low score might not indicate that the model is no longer valid.

Bootstrap estimates of model stability

If "asymptotic normality" makes you nervous, you can use the bootstrap method to obtain estimates of the standard error and the distribution of the Area statistic. The following table summarizes the results of 5,000 bootstrap replications. The results are very close to the asymptotic results in the previous table. In particular, the standard error of the Area statistic is estimated as 0.08 and in 90% of the bootstrap samples, the Area was in the interval [0.676, 0.983]. The conclusion from the bootstrap computation is the same as for the asymptotic estimates: you should expect the Area statistic to bounce around. A value such as 0.73 is not unusual and does not necessarily indicate that the model has degraded.

You can use the bootstrap computations to graphically reveal the stability of the two models. The following comparative histogram shows the bootstrap distributions of the Area statistic for the "Expert" and "Logistic" models. You can see that not only is the upper distribution shifted to the right, but it has less variance and therefore greater stability.

I think Daymond's main points are important to remember. Namely, discrimination and accuracy are important for choosing a model, but understanding the stability of the model (the variation of the estimates) is essential for determining when a model is no longer working well and should be replaced. There is no need to replace a model for a "bad score" if that score is within the range of typical statistical variation.

References

Ling, D. (2019), "Measuring Model Stability", Proceedings of the SAS Global Forum 2019 Conference.

Download the complete SAS program that creates the analyses and graphs in this article.

The post Discrimination, accuracy, and stability in binary classifiers appeared first on The DO Loop.

4月 242019
 

The CUSUM test has many incarnations. Different areas of statistics use different assumption and test for different hypotheses. This article presents a brief overview of CUSUM tests and gives an example of using the CUSUM test in PROC AUTOREG for autoregressive models in SAS.

A CUSUM test uses the cumulative sum of some quantity to investigate whether a sequence of values can be modeled as random. Here are some examples:

  • A sequence of binary values (call them +1 and -1) might appear to be random, like a coin flip, or nonrandom. A random sequence has a cumulative sum that does not deviate far from zero, as I've discussed in a previous about the CUSUM test for randomness of a binary sequence.
  • In quality control, the CUSUM chart and test is used to monitor whether a process is drifting away from its mean. The CUSUM chart is centered around the mean value of the process. The process is said to be "out of control" if the cumulative sums of the standardized deviations exceed a specified range. The documentation for the CUSUM procedure in SAS/QC software includes an example and a page of formulas that describe the statistics behind the CUSUM chart.
  • In time series analysis, the CUSUM statistics use the sequence of residual deviations from a model to indicate whether the autoregressive model is misspecified. The CUSUM statistics are produced by PROC AUTOREG in SAS/ETS software.

Whereas the CUSUM test for a binary sequence uses cumulative sums for a discrete (+1, -1} sequence, the other tests assume that the sequence is a random sequence of normally distributed values. The main idea behind the tests are the same: The test statistic measures how far the sequence has drifted away from an expected value. If the sequence drifts too far too fast, the sequence is unlikely to be random.

CUSUM test for time series

Let's see how the CUSUM test in PROC AUTOREG can help to identify a misspecified model. For simplicity, consider two response variables, one that is linear in time (with uncorrelated errors) and the other that is quadratic in time. If you fit a linear model to both variables, the CUSUM test can help you to see that the model does not fit the quadratic data.

In a previous article, I discussed Anscombe's quartet and created two series that have the same linear fit and correlation coefficient. These series are ideal to use for the CUSUM test because the first series is linear whereas the second is quadratic. The following calls to PROC AUTOREG fit a linear model to each variable.

ods graphics on;
/* PROC AUTOREG models a time series with autocorrelation */
proc autoreg data=Anscombe2;
  Linear: model y1 = x;        /* Y1 is linear. Model is oorrectly specified. */
  output out=CusumLinear cusum=cusum cusumub=upper cusumlb=lower recres=RecursiveResid;
run;
 
proc autoreg data=Anscombe2;
  Quadratic: model y2 = x;     /* Y2 is quadratic. Model is misspecified. */
  output out=CusumQuad cusum=cusum cusumub=upper cusumlb=lower recres=RecursiveResid;
run;

The AUTOREG procedure creates a panel of standard residual diagnostic plots. The panel includes a plot of the residuals and a fit plot that shows the fitted model and the observed values. For the linear data, the residual plots seem to indicate that the model fits the data well:

In contrast, the same residual panel for the quadratic data indicates a systematic pattern in the residuals:

If this were a least squares model, which assumes independence of the residuals, those residual plots would indicate that this data-model combination does not satisfy the assumptions of the least squares regression model. For an autoregressive model, however, raw residuals can be correlated and exhibit a pattern. To determine whether the model is misspecified, PROC AUTOREG supports a special kind of residual analysis that uses recursive residuals.

The recursive residual for the k_th point is formed by fitting a line to the first k-1 points and then forming a standardized residual for the k_th point. The complete formulas are in the AUTOREG documentation. Galpin and Hawkins (1984) suggest plotting the cumulative sums of the recursive residuals as a diagnostic plot. Galpin and Hawkins credit Brown, Durbin, and Evans (1975) with proposing the CUSUM plot of the recursive residuals. The statistics output from the AUTOREG procedure are different than those in Galpin and Hawkin, but the idea and purpose behind the CUSUM charts are the same.

Galpin and Hawkin show a panel of nine plots that display different patterns that you might see in the CUSUM plots. I have reproduced two of the plots from the paper. (Remember, these graphs were produced in 1984!) The graph on the left shows what you should see for a correctly specified model. The cumulative sums stay within a region near the expected value of zero. In contrast, the graph on the right is one example of a CUSUM plot for a misspecified model.

The previous calls to PROC AUTOREG wrote the cumulative sums and the upper and lower boundaries of the confidence region to a data set. You can use PROC SGPLOT to create the CUSUM plot. The BAND statement is used to draw the confidence band:

ods layout gridded columns=2 advance=table;
 proc sgplot data=CusumLinear noautolegend;
    band x=x lower=lower upper=upper;
    series x=x y=cusum / break markers;
    refline 0  /axis=y noclip;
    xaxis grid; yaxis grid;
 run;
 proc sgplot data=CusumQuad noautolegend;
    band x=x lower=lower upper=upper;
    series x=x y=cusum / break markers;
    refline 0  /axis=y noclip;
    xaxis grid; yaxis grid;
 run;
ods layout end;
CUSUM graphs of cumulative  sums of recursive residuals

The graph on the left looks like a random walk on independent normal data. The cumulative sums stay within the colored confidence region. The model seems to fit the data. In contrast, the graph on the right quickly leaves the shaded region, which indicates that the model is misspecified.

In summary, there are many statistical tests that use a CUSUM statistic to determine whether deviations are random. These tests appear in many areas of statistics, including random walks, quality control, and time series analysis. For quality control, SAS supports the CUSUM procedure in SAS/QC software. For time series analysis, the AUTOREG procedure in SAS supports CUSUM charts of recursive residuals, which enable you to diagnose misspecified models.

You can download the SAS program that generates the graphs in this article.

The post A CUSUM test for autregressive models appeared first on The DO Loop.

4月 222019
 

Many statistical tests use a CUSUM statistic as part of the test. It can be confusing when a researcher refers to "the CUSUM test" without providing details about exactly which CUSUM test is being used. This article describes a CUSUM test for the randomness of a binary sequence. You start with a long sequence of binary values such as heads and tails from a coin toss. The test tries to determine whether the sample comes from a Bernoulli distribution with probability p=0.5. In short, is the binary sequence random?

The CUSUM test for randomness of a binary sequence is one of the NIST tests for verifying that a random or pseudorandom generator is generating bits that are indistinguishable from truly random bits (Rukin, et al., 2000, revised 2010, pp 2-31 through 2-33). The test is straightforward to implement. You first translate the data to {-1, +1} values. You then compute the cumulative sums of the sequence. If the sequence is random, the cumulative sum is equivalent to the position of a random walker who takes unit steps. If the sequence is random, the sums will not move away from 0 (the expected sum) too quickly. I've previously visualized the random walk with unit steps (sometimes called a "Drunkard's walk").

Before proceeding to the CUSUM test, I should mention that this test is often used in conjunction with other tests, such as the "runs test" for randomness. That is because a perfectly alternating sequence such as 0101010101... will pass the CUSUM test even though the sequence is clearly not randomly generated. In fact, any sequence that repeatedly has k zeros followed by k ones also passes the test, provided that k is small enough.

The CUSUM test for randomness of a binary sequence in SAS

The NIST report contains an example of calling the CUSUM test with a sequence of length N=100. The following SAS/IML statements define a sequence of {0, 1} values, convert those values to {-1, +1}, and plot the cumulative sums:

proc iml;
eps = {1 1 0 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 0 1
       1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 
       0 1 1 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 
       0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 1 1 0 
       0 1 1 0 0 0 1 0 1 0 0 0 1 0 1 1 1 0 0 0 }; 
x = 2*eps - 1;            /* convert to {-1, +1} sequence */
S = cusum(x);
 
title "Cumulative Sums of Sequence of {-1, +1} Values";
call series(1:ncol(S), S) option="markers" other="refline 0 / axis=y" label="Observation Number";
Plot of the cumulative sums of a binary sequence of values in {-1, +1}.

The sequence contains 58 values of one category and 42 values of the other. For a binomial distribution with p=0.5, the probability of a sample that has proportions at least this extreme is about 13%, as shown by the computation 2*cdf("Binomial", 42, 0.5, 100);. Consequently, the proportions are not unduly alarming. However, to test whether the sequence is random, you need to consider not only the proportion of values, but also the sequence. The graph of the cumulative sums of the {-1, +1} sequence shows a drift away from the line S=0, but it is not clear from the graph whether the deviation is more extreme than would be expected for a random sequence of this length.

The CUSUM test gives you a way to quantify whether the sequence is likely to have occurred as a random draw from a Bernoulli(p=0.5) distribution. The test statistic is the maximum deviation from 0. As you can see from the graph, the test statistic for this sequence is 16. The NIST paper provides a formula for the probability that a statistic at least this extreme occurs in a random sequence of length N=100. I implemented a (vectorized) version of the formula in SAS/IML.

/* NIST CUSUM test for randomness in a binary {-1, +1} sequence.
   https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-22r1a.pdf
   Section 2.13.  Pages 2-31 through 2-33.
   INPUT: x is sequence of {-1, +1} values.      */
start BinaryCUSUMTest(x, PrintTable=1, alpha=0.01);
   S = colvec( cusum(x) );     /* cumulative sums */
   n = nrow(S);
   z = max(abs(S));            /* test statistic = maximum deviation */
 
   /* compute probability of this test statistic for a sequence of this length */
   zn = z/sqrt(n);
   kStart = int( (-n/z +1)/4 );
   kEnd   = int( ( n/z -1)/4 );
   k = kStart:kEnd;
   sum1 = sum( cdf("Normal", (4*k+1)*zn) - cdf("Normal", (4*k-1)*zn) );
 
   kStart = int( (-n/z -3)/4 );
   k = kStart:kEnd;
   sum2 = sum( cdf("Normal", (4*k+3)*zn) - cdf("Normal", (4*k+1)*zn) );
   pValue = 1 - sum1 + sum2;
 
   /* optional: print the test results in a nice human-readable format */
   cusumTest = z || pValue;
   if PrintTable then do;
      print cusumTest[L="Result of CUSUM Test" c={"Test Statistic" "p Value"}];
      labl= "H0: Sequence is a random binary sequence";
      if pValue <= alpha then 
         msg = "Reject H0 at alpha = " + char(alpha); /* sequence seems random */
      else
         msg = "Do not reject H0 at alpha = " + char(alpha); /* sequence does not seem random */
      print msg[L=labl];
   end;
   return ( cusumTest );
finish;
 
/* call the function for the {-1, +1} sequence */
cusumTest = BinaryCUSUMTest(x);

According to the CUSUM test, there is not sufficient evidence to doubt that the sequence was generated from a random Bernoulli process.

A few comments about the program:

  • if you vectorize the computations, the CUSUM test requires only a few SAS/IML statements. Half of the function is dedicated to printing the results in a friendly format.
  • The computation of the p-value took me a while to puzzle over. The formulas in the NIST article did not write the INT() function for the limits of the summation. But the summation only makes sense when the index of summation (k) is an integer.
  • The significance value for the CUSUM test is usually chosen to be alpha = 0.01.
  • The CUSUM test depends only on the maximum deviation of the cumulative sums (the test statistic) and on the length of the sequence. For a sequence of length 100, the test statistic can be as large as 28 without rejecting the null hypothesis. If the statistic is 29 or larger, then the null hypothesis is rejected and we conclude that the sequence is not generated by a random process.

A neat thing about the CUSUM test is that you can compute the maximum test statistic based only on the sequence length. Thus if you plan to toss a coin 100 times to determine if it is fair, you can stop tossing (with 99% confidence) if the number of heads ever exceeds the number of tails by 29. Similarly, you can stop tossing if you know that the number of excess heads cannot possibly be 29 or greater. (For example, you've tossed 80 times and the current cumulative sum is 5.) You can apply the same argument to excess tails.

In summary, this article shows how to implement the CUSUM test for randomness of a binary sequence in SAS. Only a few lines of SAS/IML are required, and you can implement the test without using any loops. Be aware that the CUSUM test is not very powerful because regular sequences can pass the test. For example, the sequence 000111000111000111... has a maximum deviation of 3.

The post The CUSUM test for randomness of a binary sequence appeared first on The DO Loop.

4月 032019
 

I've previously written about how to deal with nonconvergence when fitting generalized linear regression models. Most generalized linear and mixed models use an iterative optimization process, such as maximum likelihood estimation, to fit parameters. The optimization might not converge, either because the initial guess is poor or because the model is not a good fit to the data. SAS regression procedures for which this might happen include PROC LOGISTIC, GENMOD, MIXED, GLMMIX, and NLMIXED.

For mixed models, several problems can occur if you have a misspecified model. One issue results in the following note in the SAS log:
NOTE: Estimated G matrix is not positive definite.
This article describes what the note means, why it might occur, and what to do about it. If you encounter this note during a BY-group analysis or simulation study, this article shows how to identify the samples that have the problem.

There are two excellent references that discuss issues related to convergence in the SAS mixed model procedures:

What is the G matrix?

Before we discuss convergence, let's review what the G matrix is and why it needs to be positive definite. The matrix formulation of a mixed model is
Y = X β + Z γ + ε
where β is a vector of fixed-effect parameters. The random effects are assumed to be random realizations from multivariate normal distributions. In particular, γ ~ MVN(0, G) and ε ~ MVN(0, R), where G and R are covariance matrices. The variance-covariance matrix G is often used to specify subject-specific effects, whereas R specifies residual effects. A goal of mixed models is to specify the structure of the G and/or R matrices and estimate the variance-covariance parameters.

Because G is a covariance matrix, G must be positive semidefinite. A nondegenerate covariance matrix will be fully positive definite. However, estimates of G might not have this property. SAS alerts you if the estimate is not positive definite.

As stated in Kiernan (2018, p. ), "It is important that you do not ignore this message."

Reasons the estimated G matrix is not positive definite

A SAS Usage Note states that the PROC MIXED message means that "one or more variance components on the RANDOM statement is/are estimated to be zero and could/should be removed from the model." Kiernan, Tao, and Gibbs (2012) and Kiernan (2018) describe several reasons why an estimated G matrix can fail to be positive definite. Two of the more common reasons include:

  • There is not enough variation in the response. After controlling for the fixed effects, there isn't any (or much) variation for the random effects. You might want to remove the random effect from the model. (KTG, p. 9)
  • The model is misspecified. You might need to modify the model or change the covariance structure to use fewer parameters. (Kiernan, p. 18)

Convergence issues in simulation studies or BY-group analyses

If you encounter the note "Estimated G matrix is not positive definite" for real data, you should modify the model or collect more data. In a simulation study, however, there might be simulated samples that do not fit the model even when the data is generated from the model! This can happen for very small data sets and for studies in which the variance components are very small. In either case, you might want to identify the samples for which the model fails to converge, either to exclude them from the analysis or to analyze them by using a different model. The same situation can occur for few BY groups during a traditional BY-group analysis.

The following SAS program illustrates how to detect the samples for which the estimated G matrix is not positive definite. The 'SP' data are from an example in the PROC MIXED documentation. The data set named 'ByData' is constructed for this blog post. It contains four copies of the real data, along with an ID variable with values 1, 2, 3, and 4. For the first copy, the response variable is set to 0, which means that there is no variance in the response. For the third copy, the response variable is simulated from a normal distribution. It has variance, but does not conform to the model. The call to PROC MIXED fits the same random-effects model to all four samples:

/* Example from PROC MIXED documentation: https://bit.ly/2YEmpGw  */
data sp;
input Block A B Y @@;
datalines;
1 1 1  56  1 1 2  41  1 2 1  50  1 2 2  36  1 3 1  39  1 3 2  35
2 1 1  30  2 1 2  25  2 2 1  36  2 2 2  28  2 3 1  33  2 3 2  30
3 1 1  32  3 1 2  24  3 2 1  31  3 2 2  27  3 3 1  15  3 3 2  19
4 1 1  30  4 1 2  25  4 2 1  35  4 2 2  30  4 3 1  17  4 3 2  18
;
 
/* concatenate four copies of the data, but modify Y for two copies */
data ByData;
call streaminit(1);
set sp(in=a1) sp(in=a2) sp(in=a3) sp(in=a4);
SampleID = a1 + 2*a2 + 3*a3 + 4*a4;
if a1 then Y = 0;                         /* response is constant (no variation) */
if a3 then Y = rand("Normal", 31, 10);    /* response is indep of factors */
run;
 
/* suppress ODS output: https://blogs.sas.com/content/iml/2013/05/24/turn-off-ods-for-simulations.html */
%macro ODSOff(); ods graphics off; ods exclude all; ods noresults;  %mend;
%macro ODSOn();  ods graphics on; ods exclude none; ods results;    %mend;
 
%ODSOff
proc mixed data=ByData;
   by sampleID;
   class A B Block;
   model Y = A B A*B;
   random Block A*Block;
   ods output ConvergenceStatus=ConvergeStatus;
run;
%ODSOn

The SAS log displays various notes about convergence. The second and fourth samples (the doc example) converged without difficulty. The log shows a WARNING for the first group. The third group displays the note Estimated G matrix is not positive definite.

NOTE: An infinite likelihood is assumed in iteration 0 because of a nonpositive residual variance estimate.
WARNING: Stopped because of infinite likelihood.
NOTE: The above message was for the following BY group:
      SampleID=1
 
NOTE: Convergence criteria met.
NOTE: The above message was for the following BY group:
      SampleID=2
NOTE: Convergence criteria met.
 
NOTE: Estimated G matrix is not positive definite.
NOTE: The above message was for the following BY group:
      SampleID=3
 
NOTE: Convergence criteria met.
NOTE: The above message was for the following BY group:
      SampleID=4

Notice that the ODS OUTPUT statement writes the ConvergenceStatus table for each BY group to a SAS data set called ConvergeStatus. For each BY group, the ConvergenceStatus table is appended to a SAS data set called ConvergeStatus. The result of the PROC PRINT statement shows the columns of the table for each of the four groups:

proc print data=ConvergeStatus;  run;
Convergence status table in PROC MIXED, which shows which BY groups had convergence or modeling problems

The SampleID column is the value of the BY-group variable. The Reason column explains why the optimization process terminated. The Status column is 0 if the model converged and nonzero otherwise. Note that the third model converged, even though the G matrix was not positive definite!

To detect nonpositive definite matrices, you need to look at the pdG column, The pdG indicates which models had a positive definite G matrix (pdG=1) or did not (pdG=0). Although I do not discuss it in this article, the pdH column is an indicator variable that has value 0 if the SAS log displays the message NOTE: Convergence criteria met but final hessian is not positive definite.

The pdG column tells you which models did not have a positive definite variance matrix. You can merge the ConverenceStatus table with the original data and exclude (or keep) the samples that did not converge or that had invalid variance estimates, as shown in the following DATA step:

/* keep samples that converge and for which gradient and Hessian are positive definite */
data GoodSamples;
merge ByData ConvergeStatus;
by SampleID;
if Status=0 & pdG & pdH;
run;

If you run PROC MIXED on the samples in the GoodSamples data set, all samples will converge without any scary-looking notes.

In summary, this article discusses the note Estimated G matrix is not positive definite, which sometimes occurs when using PROC MIXED or PROC GLMMIX in SAS. You should not ignore this note. This note can indicate that the model is misspecified. This article shows how you can detect this problem programmatically during a simulation study or a BY-group analysis. You can use the ConvergenceStatus table to identify the BY groups for which this the problem occurs. Kiernan, Tao, and Gibbs (2012) and Kiernan (2018) provide more insight into this note and other problems that you might encounter when fitting mixed models.

The post Convergence in mixed models: When the estimated G matrix is not positive definite appeared first on The DO Loop.

4月 012019
 

Many SAS procedures support the BY statement, which enables you to perform an analysis for subgroups of the data set. Although the SAS/IML language does not have a built-in "BY statement," there are various techniques that enable you to perform a BY-group analysis. The two I use most often are the UNIQUE-LOC technique and the UNIQUEBY technique. The first is more intuitive, the second is more efficient. This article shows how to use SAS/IML to read and process BY-group data from a data set.

I previously showed that you can perform BY-group processing in SAS/IML by using the UNIQUEBY technique, so this article uses the UNIQUE-LOC technique. The statistical application is simulating clusters of data. If you have a SAS data set that contains the centers and covariance matrices for several groups of observations, you can then read that information into SAS/IML and simulate new observations for each group by using a multivariate normal distribution.

Matrix operations and BY groups

A BY-group analysis in SAS/IML usually starts with a SAS data set that contains a bunch of covariance or correlation matrices. A simple example is a correlation analysis of each species of flower in Fisher's iris data set. The BY-group variable is the species of iris: Setosa, Versicolor, or Virginica. The variables are measurements (in mm) of the sepals and petals of 150 flowers, 50 from each species. A panel of scatter plots for the iris data is shown to the right. You can see that the three species appear to be clustered. From the shapes of the clusters, you might decide to model each cluster by using a multivariate normal distribution.

You can use the OUTP= and COV options in PROC CORR to output mean and covariance statistics for each group, as follows:

proc corr data=sashelp.iris outp=CorrOut COV noprint;
   by Species;
   var Petal: Sepal:;
run;
 
proc print data=CorrOut(where=(_TYPE_ in ("N", "MEAN", "COV"))) noobs;
   where Species="Setosa";  /* just view information for one group */
   by Species  _Type_ notsorted;
   var _NAME_ Petal: Sepal:;
run;

The statistics for one of the groups (Species='Setosa') are shown. The number of observations in the group (N) is actually a scalar value, but it was replicated to fit into a rectangular data set.

Reading BY-group information into SAS/IML

This section reads the sample size, mean vector, and covariance matrix for all groups. A WHERE clause selects only the observations of interest:

/* Read in N, Mean, and Cov for each species. Use to create a parametric bootstrap 
   by simulating N[i] observations from a MVN(Mean[i], Cov[i]) distribution */
proc iml;
varNames = {'PetalLength' 'PetalWidth' 'SepalLength' 'SepalWidth'};
use CorrOut where (_TYPE_="N" & Species^=" ");       /* N */
read all var varNames into mN[rowname=Species];      /* read for all groups */
print mN[c=varNames];
 
use CorrOut where (_TYPE_="MEAN" & Species^=" ");    /* mean */
read all var varNames into mMean[rowname=Species];   /* read for all groups */
print mMean[c=varNames];
 
use CorrOut where (_TYPE_="COV" & Species^=" ");     /* covariance */
read all var varNames into mCov[rowname=Species];    /* read for all groups */
close;
print mCov[c=varNames];

The output (not shown) shows that the matrices mN, mMean, and mCov contain the vertical concatenation (for all groups) of the sample size, mean vectors, and covariance matrices, respectively.

The grouping variable is Species. You can use the UNIQUE function to get the unique (sorted) values of the groups. You can then iterate over the unique values and use the LOC function to extract only the rows of the matrices that correspond to the ith group. What you do with that information depends on your application. In the following program, the information for each group is used to create a random sample from a multivariate normal distribution that has the same size, mean, and covariance as the ith group:

/* Goal: Write random MVN values in X to data set */
X = j(1, ncol(varNames), .);         /* X variables will be numeric */
Spec = BlankStr(nleng(Species));     /* and Spec will be character */
create SimOut from X[rowname=Spec colname=varNames]; /* open for writing */
 
/* The BY-group variable is Species */
call randseed(12345);
u = unique(Species);                 /* get unique values for groups */
do i = 1 to ncol(u);                 /* iterate over unique values */
   idx = loc(Species=u[i]);          /* rows for the i_th group */
   N = mN[i, 1];                     /* extract scalar from i_th group */
   mu = mMean[i,];                   /* extract vector from i_th group */
   Sigma = mCov[idx,];               /* extract matrix from i_th group */
   /* The parameters for this group are now in N, mu, and Sigma.
      Do something with these values. */
   X = RandNormal(N, mu, Sigma);     /* simulate obs for the i_th group */
   X = round(X);                     /* measure to nearest mm */
   Spec = j(N, 1, u[i]);             /* ID for this sample */
   append from X[rowname=Spec];
end;
close SimOut;
quit;
 
ods graphics / attrpriority=none;
title "Parametric Bootstrap Simulation of Iris Data"; 
proc sgscatter data=SimOut(rename=(Spec=Species));
   matrix Petal: Sepal: / group=Species;
run;

The simulation has generated a new set of clustered data. If you compare the simulated data with the original, you will notice many statistical similarities.

Although the main purpose of this article is to discuss BY-group processing in SAS/IML, I want to point out that the simulation in this article is an example of the parametric bootstrap. Simulating Data with SAS (Wicklin, 2013) states that "the parametric bootstrap is nothing more than the process of fitting a model distribution to the data and simulating data from the fitted model." That is what happens in this program. The sample means and covariance matrices are used as parameters to generate new synthetic observations. Thus, the parametric bootstrap technique is really a form of simulation where the parameters for the simulation are obtained from the data.

In conclusion, sometimes you have many matrices in a SAS data set, each identified by a categorical variable. You can perform "BY-group processing" in SAS/IML by reading in all the matrices into a big matrix and then use the UNIQUE-LOC technique to iterate over each matrix.

The post Matrix operations and BY groups appeared first on The DO Loop.

3月 252019
 

An important concept in multivariate statistical analysis is the Mahalanobis distance. The Mahalanobis distance provides a way to measure how far away an observation is from the center of a sample while accounting for correlations in the data. The Mahalanobis distance is a good way to detect outliers in multivariate normal data. It is better than looking at the univariate z-scores of each coordinate because a multivariate outlier does not necessarily have extreme coordinate values.

The geometry of multivariate outliers

In classical statistics, a univariate outlier is an observation that is far from the sample mean. (Modern statistics use robust statistics to determine outliers; the mean is not a robust statistic.) You might assume that an observation that is extreme in every coordinate is also a multivariate outlier, and that is often true. However, the converse is not true: when variables are correlated, you can have a multivariate outlier that is not extreme in any coordinate!

The following schematic diagram gives the geometry of multivariate normal data. The middle of the diagram represents the center of a bivariate sample.

  • The orange elliptical region indicates a region that contains most of the observations. Because the variables are correlated, the ellipse is tilted relative to the coordinate axes.
  • For observations inside the ellipse, their Mahalanobis distance to the sample mean is smaller than some cutoff value. For observations outside the ellipse, their Mahalanobis distance to the sample mean is larger than the cutoff.
  • The green rectangle at the left and right indicate regions where the X1 coordinate is far from the X1 mean.
  • The blue rectangle at the top and bottom indicate regions where the X2 coordinate is far from the X2 mean.
Geometry of multivariate outliers showing the relationship between Mahalanobis distance and univariate outliers. The point 'A' has large univariate z scores but a small Mahalanobis distance. The point 'B' has a large Mahalanobis distance. Only 'B' is a multivariate outlier.

The diagram displays two observations, labeled "A" and "B":

  • The observation "A" is inside the ellipse. Therefore, the Mahalanobis distance from "A" to the center is "small." Accordingly, "A" is not identified as a multivariate outlier. However, notice that "A" is a univariate outlier for both the X1 and X2 coordinates!
  • The observation "B" is outside the ellipse. Therefore, the Mahalanobis distance from "B" to the center is relatively large. The observation is classified as a multivariate outlier. However, notice that "B" is not a univariate outlier for either the X1 or X2 coordinates; neither coordinate is far from its univariate mean.

The main point is this: An observation can be a multivariate outlier even though none of its coordinate values are extreme. It is the combination of values which makes an outlier unusual. In terms of Mahalanobis distance, the diagram illustrates that an observation "A" can have high univariate z scores but not have an extremely high Mahalanobis distance. Similarly, an observation "B" can have a higher Mahalanobis distance than "A" even though its z scores are relatively small.

Applications to real data

This article was motivated by a question from a SAS customer. In his data, one observation had a large Mahalanobis distance score but relatively small univariate z scores. Another observation had large z scores but a smaller Mahalanobis distance. He wondered how that was possible. His data contained four variables, but the following two-variable example illustrates his situation:

Geometry of multivariate outliers. The point 'A' has large univariate z scores but the Mahalanobis distance is only about 2.5. The point 'B' has a Mahalanobis distance of 5 and is a multivariate outlier.

The blue markers were simulated from a bivariate normal distribution with μ = (0, 0) and covariance matrix Σ = {16 32.4, 32.4 81}. The red markers were added manually. The observation marked 'B' is a multivariate outlier. The Mahalanobis distance (MD) from 'B' to the center of the sample is about 5 units. (The center is approximately at (0,0).) In contrast, the observation marked 'A' is not a multivariate outlier even though it has extreme values for both coordinates. In fact, the MD from 'A' to the center of the sample is about 2.5, or approximately half the MD of 'B'. The coordinates (x1, x2), standardized coordinates (z1, z2), and MD for both points are shown below:

You can connect the Mahalanobis distance to the probability of a multivariate random normal variable. The squared MD for multivariate normal data is distributed according to a chi-square distribution. For bivariate normal data, the probability that an observation is within t MD units of the center is 1 - exp(-t2/2). Observations like 'A' are not highly unusual. Observations that have MD ≥ 2.5 occur in exp(-2) = 4.4% of random variates from the bivariate normal distribution. In contrast, observations like 'B' are extremely rare. Observations that have MD ≥ 5 occur with probability exp(-25/2) = 3.73E-6. Yes, if you measure in Euclidean distance, 'A' is farther from the center than 'B' is, but the correlation between the variables makes 'A' much more probable. The Mahalanobis distance incorporates the correlation into the calculation of "distance."

Summary and further reading

In summary, things are not always as they appear. For univariate data, an outlier is an extreme observation. It is far from the center of the data. In higher dimensions, we need to account for correlations among variables when we measure distance. The Mahalanobis distance does that, and the examples in this post show that an observation can be "far from the center" (as measured by the Mahalanobis distance) even if none of its individual coordinates are extreme.

The following articles provide more information about Mahalanobis distance and multivariate outliers:

You can download the SAS program that generates the examples and images in this article.

The post The geometry of multivariate versus univariate outliers appeared first on The DO Loop.

3月 202019
 

An analyst was using SAS to analyze some data from an experiment. He noticed that the response variable is always positive (such as volume, size, or weight), but his statistical model predicts some negative responses. He posted the data and asked if it is possible to modify the graph so that only positive responses are displayed.

This article shows how you can truncate a surface or a contour plot so that negative values are not displayed. You could do something similar to truncate unreasonably high values in a surface plot.

Why does the model predict negative values?

Before showing how to truncate the surface plot, let's figure out why the model predicts negative values when all the observed responses are positive. The following DATA step is a simplified version of the real data. The RSREG procedure uses least squares regression to fit a quadratic response surface. If you use the PLOTS=SURFACE option, the procedure automatically displays a contour plot and surface plot for the predicted response:

data Sample;
input X Y Z @@;
datalines;
10 90 22  22 76 13  22 75  7 
24 78 14  24 76 10  25 63  5
26 62 10  26 94 20  26 63 15
27 94 16  27 95 14  29 66  7
30 69  8  30 74  8
;
 
ods graphics / width=400px height=400px ANTIALIASMAX=10000;
proc rsreg data=Sample plots=surface(fill=pred overlaypairs);
   model Z = Y X;
run;
 
proc rsreg data=Sample plots=surface(3d fill=Pred gridsize=80);
   model Z = Y X;
   ods select Surface;
   ods output Surface=Surface; /* use ODS OUTPUT to save surface data to a data set */
run;

The contour plot overlays a scatter plot of the data. You can see that the data are observed only in the upper-right portion of the plot (the red regions) and that no data are in the lower-left portion of the plot. The RSREG procedure fits a quadratic model to the data. The predicted values near the observed data are all positive. Some of the predicted values that are far from the observed data are negative.

I previously wrote about this phenomenon and showed how to compute the convex hull for these bivariate data. When you evaluate the model inside the convex hull, you are interpolating. When you evaluate the model outside the convex hull, you are extrapolating. It is well known that polynomial regression models can give nonsensical results if you extrapolate far from the data.

Truncating a response surface

The RSREG procedure is not aware that the response variable should be positive. A quadratic surface will eventually get arbitrarily big in the positive and/or negative directions. You can see this on the contour and surface plots, which show the predictions of the model on a regular grid of (X, Y) values.

If you want to display only the positive portion of the prediction surface, you can replace each negative predicted value with a missing value. The first step is to obtain the predicted values on a regular grid. You can use the "missing value trick" to score the quadratic model on a grid, or you can use the ODS OUTPUT statement to obtain the gridded values that are used in the surface plot. I chose the latter option. In the previous section, I used the ODS OUTPUT statement to write the gridded predicted values for the surface plot to a SAS data set named Surface.

As Warren Kuhfeld points out in his article about processing ODS OUTPUT data set, the names in an ODS data object can be "long and hard to type." Therefore, I rename the variables. I also combine the gridded values with the original data so that I can optionally overlay the data and the predicted values.

/* rename vars and set negative responses to missing */
data Surf2;
set Surface(rename=(
       Predicted0_1_0_0 = Pred  /* rename the long ODS names */
       Factor1_0_1_0_0  = GY    /* 'G' for 'gridded' */
       Factor2_0_1_0_0  = GX))  
    Sample(in=theData);         /* combine with original data */
if theData then Type = "Data   ";
else            Type = "Gridded";
if Pred < 0 then Pred = .;      /* replace negative predictions with missing values */
label GX = 'X'  GY = 'Y';
run;

You can use the Graph Template Language (GTL) to generate graphs that are similar to those produced by PROC RSREG. You can then use PROC SGRENDER to create the graph. Because the negative response values were set to missing, the contour plot displays a missing value color (black, for this ODS style) in the lower-left and upper-right portions of the plot. Similarly, the missing values cause the surface plot to be truncated. By using the GRIDSIZE= option, you can make the jagged edges small.

Notice that the colors in the graphs are now based on the range [0, 50], whereas previously the colors were based on the range [-60, 50]. I've added a continuous legend to the plots so that the range of the response variable is obvious.

I'd like to stress that sometimes "nonsensical values" indicate an inappropriate model. If you notice nonsensical values, you should always ask yourself why the model is predicting those values. You shouldn't modify the prediction surface without a good reason. But if you do have a good reason, the techniques in this article should help you.

You can download the complete SAS program that analyzes the data and generates the truncated graphs.

The post Truncate response surfaces appeared first on The DO Loop.

3月 182019
 

Statisticians often emphasize the dangers of extrapolating from a univariate regression model. A common exercise in introductory statistics is to ask students to compute a model of population growth and predict the population far in the future. The students learn that extrapolating from a model can result in a nonsensical prediction, such as trillions of people or a negative number of people! The lesson is that you should be careful when you evaluate a model far beyond the range of the training data.

The same dangers exist for multivariate regression models, but they are emphasized less often. Perhaps the reason is that it is much harder to know when you are extrapolating a multivariate model. Interpolation occurs when you evaluate the model inside the convex hull of the training data. Anything else is an extrapolation. In particular, you might be extrapolating even if you score the model at a point inside the bounding box of the training data. This differs from the univariate case in which the convex hull equals the bounding box (range) of the data. In general, the convex hull of a set of points is smaller than the bounding box.

You can use a bivariate example to illustrate the difference between the convex hull of the data and the bounding box for the data, which is the rectangle [Xmin, Xmax] x [Ymin, Ymax]. The following SAS DATA step defines two explanatory variables (X and Y) and one response variable (Z). The SGPLOT procedure shows the distribution of the (X, Y) variables and colors each marker according to the response value:

data Sample;
input X Y Z @@;
datalines;
10 90 22  22 76 13  22 75  7 
24 78 14  24 76 10  25 63  5
26 62 10  26 94 20  26 63 15
27 94 16  27 95 14  29 66  7
30 69  8  30 74  8
;
 
title "Response Values for Bivariate Data";
proc sgplot data=Sample;
scatter x=x y=y / markerattrs=(size=12 symbol=CircleFilled)
        colorresponse=Z colormodel=AltThreeColorRamp;
xaxis grid; yaxis grid;
run;
Bivariate data. The data are not uniformly distributed in the bounding box of the data.

The data are observed in a region that is approximately triangular. No observations are near the lower-left corner of the plot. If you fit a response surface to this data, it is likely that you would visualize the model by using a contour plot or a surface plot on the rectangular domain [10, 30] x [62, 95]. For such a model, predicted values near the lower-left corner are not very reliable because the corner is far from the data.

In general, you should expect less accuracy when you predict the model "outside" the data (for example, (10, 60)) as opposed to points that are "inside" the data (for example, (25, 70)). This concept is sometimes discussed in courses about the design of experiments. For a nice exposition, see the course notes of Professor Rafi Hafka (2012, p. 49–59) at the University of Florida.

The convex hull of bivariate points

You can use SAS to visualize the convex hull of the bivariate observations. The convex hull is the smallest convex set that contains the observations. The SAS/IML language supports the CVEXHULL function, which computes the convex hull for a set of planar points.

You can represent the points by using an N x 2 matrix, where each row is a 2-D point. When you call the CVEXHULL function, you obtain a vector of N integers. The first few integers are positive and represent the rows of the matrix that comprise the convex hull. The (absolute value of) the negative integers represents the rows that are interior to the convex hull. This is illustrated for the sample data:

proc iml;
use Sample;  read all var {x y} into points;  close;
 
/* get indices of points in the convex hull in counter-clockwise order */
indices = cvexhull( points );
print (indices`)[L="indices"];  /* positive indices are boundary; negative indices are inside */

The output shows that the observation numbers (indices) that form the convex hull are {1, 6, 7, 12, 13, 14, 11}. The other observations are in the interior. You can visualize the interior and boundary points by forming a binary indicator vector that has the value 1 for points on the boundary and 0 for points in the interior. To get the indicator vector in the order of the data, you need to use the SORTNDX subroutine to compute the anti-rank of the indices, as follows:

b = (indices > 0);               /* binary indicator variable for sorted vertices */
call sortndx(ndx, abs(indices)); /* get anti-rank, which is sort index that "reverses" the order */
onBoundary = b[ndx];             /* binary indicator data in original order */
 
title "Convex Hull of Bivariate Data";
call scatter(points[,1], points[,2]) group=onBoundary 
             option="markerattrs=(size=12 symbol=CircleFilled)";
Convex hull: Blue points are on the convex hull and red points are in the interior

The blue points are the boundary of the convex hull whereas the red points are in the interior.

Visualize the convex hull as a polygon

You can visualize the convex hull by forming the polygon that connects the first, sixth, seventh, ..., eleventh observations. You can do this manually by using the POLYGON statement in PROC SGPLOT, which I show in the Appendix section. However, there is an easier way to visualize the convex hull. I previously wrote about SAS/IML packages and showed how to install the polygon package. The polygon package contains a module called PolyDraw, which enables you to draw polygons and overlay a scatter plot.

The following SAS/IML statements extract the positive indices and use them to get the points on the boundary of the convex hull. If the polygon package is installed, you can load the polygon package and visualize the convex hull and data:

hullNdx = indices[loc(b)];         /* get positive indices */
convexHull = points[hullNdx, ];    /* extract the convex hull, in CC order */
 
/* In SAS/IML 14.1, you can use the polygon package to visualize the convex hull:
   https://blogs.sas.com/content/iml/2016/04/27/packages-share-sas-iml-programs-html */
package load polygon;    /* assumes package is installed */
run PolyDraw(convexHull, points||onBoundary) grid={x y}
    markerattrs="size=12 symbol=CircleFilled";

The graph shows the convex hull of the data. You can see that it primarily occupies the upper-right portion of the rectangle. The convex hull shows the interpolation region for regression models. If you evaluate a model outside the convex hull, you are extrapolating. In particular, even though points in the lower left corner of the plot are within the bounding box of the data, they are far from the data.

Of course, if you have 5, 10 or 100 explanatory variables, you will not be able to visualize the convex hull of the data. Nevertheless, the same lesson applies. Namely, when you evaluate the model inside the bounding box of the data, you might be extrapolating rather than interpolating. Just as in the univariate case, the model might predict nonsensical data when you extrapolate far from the data.

Appendix

Packages are supported in SAS/IML 14.1. If you are running an earlier version of SAS, you create the same graph by writing the polygon data and the binary indicator variable to a SAS data set, as follows:

hullNdx = indices[loc(b)];         /* get positive indices */
convexHull = points[hullNdx, ];    /* extract the convex hull, in CC order */
 
/* Write the data and polygon to SAS data sets. Use the POLYGON statement in PROC SGPLOT. */
p = points || onBoundary;       
poly = j(nrow(convexHull), 1, 1) || convexHull;
create TheData from p[colname={x y "onBoundary"}]; append from p;    close;
create Hull from poly[colname={ID cX cY}];         append from poly; close;
quit;
 
data All; set TheData Hull; run;    /* combine the data and convex hull polygon */
 
proc sgplot data=All noautolegend;
   polygon x=cX y=cY ID=id / fill;
   scatter x=x y=y / group=onBoundary markerattrs=(size=12 symbol=CircleFilled); 
   xaxis grid; yaxis grid;
run;

The resulting graph is similar to the one produced by the PolyDraw modules and is not shown.

The post Interpolation vs extrapolation: the convex hull of multivariate data appeared first on The DO Loop.

3月 062019
 

A previous article shows how to use a scatter plot to visualize the average SAT scores for all high schools in North Carolina. The schools are grouped by school districts and ranked according to the median value of the schools in the district. For the school districts that have many schools, the markers might overlap, which makes it difficult to visualize the distribution of scores. This is a general problem with using dot plots. An alternative visualization is to plot a box plot for each school district, which is described in today's article.

Box plots are not for everyone

Box plots (also called box-and-whisker plots) are used by statisticians to provide a schematic visualization of the distribution of some quantity. The previous article was written for non-statisticians, so I did not include any box plots. To understand a box plot, the reader needs to know how to interpret the box and whiskers:

  • The lower and upper portion of the box correspond to the 25th and 75th percentiles of the data, respectively
  • The line in the middle of the box indicates the median score.
  • A diamond marker indicates the mean score.
  • The standard box plot extends the whiskers to observations that are within a distance D from the box, where D is 1.5 times the interquartile range. Other kinds of box plots are sometimes used. The SAS documentation explains the parts of a box plot.

Box plots require a certain level of comfort with statistical ideas. Nevertheless, for a statistical audience, box plots provide a compact way to compare dozens or hundreds of distributions.

The BOXPLOT procedure: An alternative way to display many box plots

I almost always use the SGPLOT procedure to create box plots, but today I'm going to demonstrate the BOXPLOT procedure. The BOXPLOT procedure is from the days before ODS graphics, but it has several nice features, including the following:

  1. When grouping box plots into categories, you can add headers, category dividers, and colors to help distinguish the categories. I demonstrated these so-called "nested box plots" in a previous blog post.
  2. When you want to display statistics about each box plot as a table inside the graph, the BOXPLOT procedure provides a simple syntax. You can use PROC SGPLOT to add a statistical table to a box plot, but you need to pre-compute the statistics and merge the statistics and the data.
  3. When you are plotting dozens or hundreds of box plots, the BOXPLOT procedure automatically splits the graph into a series of panels.

The second and third features are both useful for visualizing the SAT data for public high schools in NC.

Add a statistical table to a graph that contains many box plots

You can use the INSETGROUP statement in PROC BOXPLOT to specify statistics that you want to display under each box plot. For example, the following syntax displays the number of high schools in each district and the median of the schools' SAT scores. The WHERE clause filters the data so that the graph shows only the largest school districts (those with seven or more high schools).

ods graphics / width=700px height=480px;
title "Average SAT Scores for Large NC School Districts";
proc boxplot data=SATSortMerge;
   where _FREQ_ >= 7;      /* restrict to large school districts */
   plot Total*DistrictAbbr / grid odstitle=title nohlabel boxstyle=schematicID vaxis=800 to 1450 by 50;
   insetgroup Q2 N;
run;

The graph shows schematic box plots for 18 large school districts. The districts are sorted according to the median value of the schools' SAT scores. The INSETGROUP statement creates a table inside the graph. The table shows the number of schools in each district and gives the median score for the district. The INSETGROUP statement can display many other statistics such as the mean, standard deviation, minimum value, and maximum value for each district.

Automatically create panels of box plots

One of the coolest features of PROC BOXPLOT is that it will automatically create a panel of box plots. It is difficult to visualize all 115 NC school districts in a single graph. The graph would be very wide (or tall) and the labels for the school districts would potentially collide. However, PROC BOXPLOT will split the display into a panel, which is extremely convenient if you plan to print the graphs on a piece of paper.

For example, the following call to PROC BOXPLOT results in box plots for 115 school districts. The procedure splits these box plots across a panel that contains five graphs and plots 23 box plots in each graph. Notice that I do not have to specify the number of graphs: the procedure uses the data to make an intelligent decision. To save space in this blog post, I omit three of the graphs and only show the first and last graphs:

ods graphics / width=640px height=400px;
title "Average SAT Scores for NC School Districts";
proc boxplot data=SATSortMerge;
   plot Total*DistrictAbbr / grid odstitle=title nohlabel boxstyle=schematicID vaxis=800 to 1450 by 50;
run;

Because the districts are ordered by the median SAT score, the first plot shows the school districts with high SAT scores and the last plot shows districts with lower SAT scores. Districts that have only one school are shown as a diamond (the mean value) with a line through it (the median value). Districts that have two or three schools are shown as a box without whiskers. For larger school districts, the box plots show a schematic representation of the distribution of the schools' SAT scores.

Summary

In summary, PROC BOXPLOT has several useful features for plotting many box plots. This article shows that you can use the INSETGROUP statement to easily add a table of descriptive statistics to the graph. The procedure also automatically creates a panel of graphs so that you can more easily look at dozens or hundreds of box plots.

You can download the SAS program (NCSATBoxplots.sas) that creates the data and the graphs.

The post Use PROC BOXPLOT to display hundreds of box plots appeared first on The DO Loop.

3月 062019
 

A previous article shows how to use a scatter plot to visualize the average SAT scores for all high schools in North Carolina. The schools are grouped by school districts and ranked according to the median value of the schools in the district. For the school districts that have many schools, the markers might overlap, which makes it difficult to visualize the distribution of scores. This is a general problem with using dot plots. An alternative visualization is to plot a box plot for each school district, which is described in today's article.

Box plots are not for everyone

Box plots (also called box-and-whisker plots) are used by statisticians to provide a schematic visualization of the distribution of some quantity. The previous article was written for non-statisticians, so I did not include any box plots. To understand a box plot, the reader needs to know how to interpret the box and whiskers:

  • The lower and upper portion of the box correspond to the 25th and 75th percentiles of the data, respectively
  • The line in the middle of the box indicates the median score.
  • A diamond marker indicates the mean score.
  • The standard box plot extends the whiskers to observations that are within a distance D from the box, where D is 1.5 times the interquartile range. Other kinds of box plots are sometimes used. The SAS documentation explains the parts of a box plot.

Box plots require a certain level of comfort with statistical ideas. Nevertheless, for a statistical audience, box plots provide a compact way to compare dozens or hundreds of distributions.

The BOXPLOT procedure: An alternative way to display many box plots

I almost always use the SGPLOT procedure to create box plots, but today I'm going to demonstrate the BOXPLOT procedure. The BOXPLOT procedure is from the days before ODS graphics, but it has several nice features, including the following:

  1. When grouping box plots into categories, you can add headers, category dividers, and colors to help distinguish the categories. I demonstrated these so-called "nested box plots" in a previous blog post.
  2. When you want to display statistics about each box plot as a table inside the graph, the BOXPLOT procedure provides a simple syntax. You can use PROC SGPLOT to add a statistical table to a box plot, but you need to pre-compute the statistics and merge the statistics and the data.
  3. When you are plotting dozens or hundreds of box plots, the BOXPLOT procedure automatically splits the graph into a series of panels.

The second and third features are both useful for visualizing the SAT data for public high schools in NC.

Add a statistical table to a graph that contains many box plots

You can use the INSETGROUP statement in PROC BOXPLOT to specify statistics that you want to display under each box plot. For example, the following syntax displays the number of high schools in each district and the median of the schools' SAT scores. The WHERE clause filters the data so that the graph shows only the largest school districts (those with seven or more high schools).

ods graphics / width=700px height=480px;
title "Average SAT Scores for Large NC School Districts";
proc boxplot data=SATSortMerge;
   where _FREQ_ >= 7;      /* restrict to large school districts */
   plot Total*DistrictAbbr / grid odstitle=title nohlabel boxstyle=schematicID vaxis=800 to 1450 by 50;
   insetgroup Q2 N;
run;

The graph shows schematic box plots for 18 large school districts. The districts are sorted according to the median value of the schools' SAT scores. The INSETGROUP statement creates a table inside the graph. The table shows the number of schools in each district and gives the median score for the district. The INSETGROUP statement can display many other statistics such as the mean, standard deviation, minimum value, and maximum value for each district.

Automatically create panels of box plots

One of the coolest features of PROC BOXPLOT is that it will automatically create a panel of box plots. It is difficult to visualize all 115 NC school districts in a single graph. The graph would be very wide (or tall) and the labels for the school districts would potentially collide. However, PROC BOXPLOT will split the display into a panel, which is extremely convenient if you plan to print the graphs on a piece of paper.

For example, the following call to PROC BOXPLOT results in box plots for 115 school districts. The procedure splits these box plots across a panel that contains five graphs and plots 23 box plots in each graph. Notice that I do not have to specify the number of graphs: the procedure uses the data to make an intelligent decision. To save space in this blog post, I omit three of the graphs and only show the first and last graphs:

ods graphics / width=640px height=400px;
title "Average SAT Scores for NC School Districts";
proc boxplot data=SATSortMerge;
   plot Total*DistrictAbbr / grid odstitle=title nohlabel boxstyle=schematicID vaxis=800 to 1450 by 50;
run;

Because the districts are ordered by the median SAT score, the first plot shows the school districts with high SAT scores and the last plot shows districts with lower SAT scores. Districts that have only one school are shown as a diamond (the mean value) with a line through it (the median value). Districts that have two or three schools are shown as a box without whiskers. For larger school districts, the box plots show a schematic representation of the distribution of the schools' SAT scores.

Summary

In summary, PROC BOXPLOT has several useful features for plotting many box plots. This article shows that you can use the INSETGROUP statement to easily add a table of descriptive statistics to the graph. The procedure also automatically creates a panel of graphs so that you can more easily look at dozens or hundreds of box plots.

You can download the SAS program (NCSATBoxplots.sas) that creates the data and the graphs.

The post Use PROC BOXPLOT to display hundreds of box plots appeared first on The DO Loop.