Statistical Programming

August 19, 2019

One of my friends likes to remind me that "there is no such thing as a free lunch," which he abbreviates by "TINSTAAFL" (or TANSTAAFL). The TINSTAAFL principle applies to computer programming because you often end up paying a cost (in performance) when you call a convenience function that simplifies your program.

I was thinking about TINSTAAFL recently when I was calling a Base SAS function from the SAS/IML matrix language. The SAS/IML language supports hundreds of built-in functions that operate on vectors and matrices. However, you can also call hundreds of functions in Base SAS and pass in vectors for the parameters. It is awesome and convenient to be able to call the virtual smorgasbord of functions in Base SAS, such as probability functions, string-matching functions, trig functions, financial functions, and more. Of course, there is no such thing as a free lunch, so I wondered about the overhead costs associated with calling a Base SAS function from SAS/IML. Base SAS functions typically are designed to operate on scalar values, so the IML language has to call the underlying function many times, once for each value of the parameter vector. It is more expensive to call a function a million times (each time passing in a scalar parameter) than it is to call a function one time and pass in a vector that contains a million parameters.

To determine the overhead costs, I decided to test the cost of calling the MISSING function in Base SAS. The IML language has a built-in syntax (b = (X=.)) for creating a binary variable that indicates which elements of a vector are missing. The call to the MISSING function (b = missing(X)) is equivalent, but it requires calling a Base SAS function many times, once for each element of X. The native SAS/IML syntax will be faster than calling a Base SAS function (TINSTAAFL!), but how much faster?

The following program incorporates many of my tips for measuring the performance of a SAS computation. The test is run on large vectors of various sizes. Each computation (which is very fast, even on large vectors) is repeated 50 times. The results are presented in a graph. The following program measures the performance for a character vector that contains all missing values.

/* Compare the performance of the built-in IML syntax
   b = (X = " ");
   with the performance of calling the Base SAS MISSING function
   b = missing(X);
*/
proc iml;
numRep = 50;                            /* repeat each computation 50 times */
sizes = {1E4, 2.5E4, 5E4, 10E4, 20E4};  /* number of elements in vector */
labl = {"Size" "T_IML" "T_Missing"};
Results = j(nrow(sizes), 3);
Results[,1] = sizes;
/* measure performance for character data */
do i = 1 to nrow(sizes);
   A = j(sizes[i], 1, " ");            /* every element is missing */
   t0 = time();
   do k = 1 to numRep;
      b = (A = " ");                   /* use built-in IML syntax */
   end;
   Results[i, 2] = (time() - t0) / numRep;
   t0 = time();
   do k = 1 to numRep;
      b = missing(A);                  /* call Base SAS function */
   end;
   Results[i, 3] = (time() - t0) / numRep;
end;
title "Timing Results for (X=' ') vs missing(X) in SAS/IML";
title2 "Character Data";
long = (sizes // sizes) || (Results[,2] // Results[,3]);   /* convert from wide to long for graphing */
Group = j(nrow(sizes), 1, "T_IML") // j(nrow(sizes), 1, "T_Missing"); 
call series(long[,1], long[,2]) group=Group grid={x y} label={"Size" "Time (s)"} 
            option="markers curvelabel" other="format X comma8.;";

The graph shows that the absolute times for creating a binary indicator variable are very fast for both methods. Even for 200,000 observations, creating a binary indicator variable takes less than five milliseconds. However, on a relative scale, the built-in SAS/IML syntax is more than twice as fast as calling the Base SAS MISSING function.

You can run a similar test for numeric values. For numeric values, the SAS/IML syntax is about 10-20 times faster than the call to the MISSING function, but, again, the absolute times are less than five milliseconds.
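The trade-off is not unique to SAS. As a rough cross-language illustration (a Python sketch, not part of the original SAS program), the following compares an inline test for missing values with a per-element library-function call. Both approaches produce the same indicator vector, but the function-call version pays call overhead on every element. (The sketch uses the fact that NaN is the only float value that is not equal to itself.)

```python
import math
import time

data = [1.0, float("nan"), 3.5, float("nan"), 0.0] * 200_000  # 1 million values

# "Built-in syntax" analogue: one comprehension with an inline test
t0 = time.perf_counter()
b1 = [x != x for x in data]          # True exactly when x is NaN
t_inline = time.perf_counter() - t0

# "Function call" analogue: invoke a library function once per element
t0 = time.perf_counter()
b2 = [math.isnan(x) for x in data]
t_call = time.perf_counter() - t0

assert b1 == b2                      # both methods flag the same elements
print(f"inline: {t_inline:.3f}s  per-element call: {t_call:.3f}s")
```

The exact timings depend on the machine, but the lesson is the same as in the SAS test: the convenience of calling a function per element has a measurable, though small, cost.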

So, what's the cost of calling a Base SAS function from SAS/IML? It's not free, but it's very cheap in absolute terms! Of course, the cost depends on the number of elements that you are sending to the Base SAS function. However, in general, there is hardly any cost associated with calling a Base SAS function from SAS/IML. So enjoy the lunch buffet! Not only is it convenient and plentiful, but it's also very cheap!

The post Timing performance in SAS/IML: Built-in functions versus Base SAS functions appeared first on The DO Loop.

August 14, 2019

Many programmers are familiar with "short-circuit" evaluation in an IF-THEN statement. Short circuit means that a program does not evaluate the remainder of a logical expression if the value of the expression is already logically determined. The SAS DATA step supports short-circuiting for simple logical expressions in IF-THEN statements and WHERE clauses (Polzin 1994, p. 1579; Gilsen 1997). For example, in the following logical-AND expression, the condition for the variable Y does not need to be checked if the condition for variable X is false:

data _null_;
set A end=eof;
if x>0 & y>0 then   /* Y is evaluated only if X>0 is true */
   count + 1;
if eof then 
   put count=;
run;

Order the conditions in a logical statement by likelihood

SAS programmers can optimize their IF-THEN and WHERE clauses if they can estimate the probability of each separate condition in a logical expression:

  • In a logical AND statement, put the least likely events first and the most likely events last. For example, suppose you want to find patients at a VA hospital that are male, over the age of 50, and have kidney cancer. You know that kidney cancer is a rare form of cancer. You also know that most patients at the VA hospital are male. To optimize a WHERE clause, you should put the least probable conditions first:
    WHERE Disease="Kidney Cancer" & Age>50 & Sex="Male";
  • In a logical OR statement, put the most likely events first and the least likely events last. For example, suppose you want to find all patients at a VA hospital that are either male, or over the age of 50, or have kidney cancer. To optimize a WHERE clause, you should use
    WHERE Sex="Male" | Age>50 | Disease="Kidney Cancer";

The SAS documentation does not discuss the conditions for which a logical expression does or does not short circuit. Polzin (1994, p. 1579) points out that when you put function calls in the logical expression, SAS evaluates certain function calls that produce side effects. Common functions that have side effects include random number functions and user-defined functions (via PROC FCMP) that have output arguments. The LAG and DIF functions can also produce side effects, but it appears that expressions that involve the LAG and DIF functions are short-circuited. You can force a function evaluation by calling the function prior to an IF-THEN statement. You can use nested IF-THEN/ELSE statements to ensure that functions are not evaluated unless prior conditions are satisfied.
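Short-circuit evaluation works the same way in most languages. Here is a hedged Python sketch (the `check` helper is hypothetical, invented for illustration) that demonstrates both the AND and OR cases, including the fact that a side-effecting function on the right-hand side is never called when the left operand already decides the result:

```python
calls = []

def check(name, value):
    """Record that this condition was evaluated, then return its value."""
    calls.append(name)
    return value

# Logical AND: the right side is evaluated only if the left side is true
result = check("x>0", False) and check("y>0", True)
assert result is False
assert calls == ["x>0"]          # the second check never ran

# Logical OR: the right side is skipped when the left side is already true
calls.clear()
result = check("male", True) or check("kidney cancer", False)
assert result is True
assert calls == ["male"]
```

The same ordering advice applies: for AND, a false (unlikely) condition placed first skips the rest; for OR, a true (likely) condition placed first skips the rest.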

Logical ligatures

The SAS/IML language does not support short-circuiting in IF-THEN statements, but it performs several similar optimizations that are designed to speed up your code execution. One optimization involves the ANY and ALL functions, which test whether any or all (respectively) elements of a matrix satisfy a logical condition. A common usage is to test whether a missing value appears anywhere in a vector, as shown in the following SAS/IML statement:

bAny = any(y = .);   /* TRUE if any element of Y is missing */
/* Complementary test: bAll = all(y ^= .) is TRUE if no element is missing */

The SAS/IML language treats simple logical expressions like these as a single function call, not as two operations. I call this a logical ligature because two operations are combined into one. (A computer scientist might just call this a parser optimization.)

You might assume that the expression ANY(Y=.) is evaluated by using a two-step process. In the first step, the Boolean expression y=. is evaluated and the result is assigned to a temporary binary matrix, which is the same size as Y. In the second step, the temporary matrix is sent to the ANY function, which evaluates the binary matrix and returns TRUE if any element is nonzero. However, it turns out that SAS/IML does not use a temporary matrix. The SAS/IML parser recognizes that the expression inside the ANY function is a simple logical expression. The program can evaluate the function by looking at each element of Y and returning TRUE as soon as it finds a missing value. In other words, it short-circuits the computation. If no value is missing, the expression evaluates to FALSE.
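The early-exit behavior of the ANY function can be mimicked in Python, where `any()` applied to a generator stops at the first TRUE value and never allocates a temporary vector. The following sketch (with a hypothetical `is_missing` helper that counts evaluations) contrasts the short-circuit evaluation with the explicit two-step process:

```python
examined = 0

def is_missing(v):
    """Test one element, counting how many elements get examined."""
    global examined
    examined += 1
    return v is None

y = [None] + [1] * 999_999       # the first element is "missing"

# Early exit: any() stops as soon as is_missing returns True
b_any = any(is_missing(v) for v in y)
assert b_any and examined == 1   # only y[0] was examined

# Two-step process: build the full temporary list, then scan it
examined = 0
temp = [is_missing(v) for v in y]
assert any(temp) and examined == 1_000_000   # every element was examined
```

The two-step version gets the same answer but examines every element and allocates a temporary list, which mirrors the "naive two-step process" discussed below for SAS/IML.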

Short circuiting an operation can save considerable time. In the following SAS/IML program, the vector X contains 100 million elements, all equal to 1. The vector Y also contains 100 million elements, but the first element of the vector is a missing value. Consequently, the computation for Y is essentially instantaneous whereas the computation for X takes a tenth of a second:

proc iml;
numReps = 10;      /* run computations 10 times and report average */
N = 1E8;           /* create vector with 100 million elements */
x = j(N, 1, 1);    /* all elements of x equal 1 */
y = x; y[1] = .;   /* the first element of y is missing */
/* the ALL and ANY functions short-circuit when the 
   argument is a simple logical expression */
/* these function calls examine only the first elements */
t0 = time();
do i = 1 to numReps;
   bAny = any(y = .);   /* returns TRUE after examining y[1]  */
   bAll = all(y ^= .);  /* returns FALSE after examining y[1] */
end;
t = (time() - t0) / numReps;
print t[F=BEST6.];
/* these function calls must examine all elements */
t0 = time();
do i = 1 to numReps;
   bAny = any(x = .);   
   bAll = all(x ^= .);
end;
t = (time() - t0) / numReps;
print t[F=BEST6.];

Although the evaluation of X does not short circuit, it still uses the logical ligature to evaluate the expression. Consequently, the evaluation is much faster than the naive two-step process that is shown explicitly by the following statements, which require about 0.3 seconds and twice as much memory:

   /* two-step process: slower */
   b1 = (y=.);                /* form the binary vector */
   bAny = any(b1);            /* TRUE for y[1] */

In summary, the SAS DATA step uses short-circuit evaluation in IF-THEN statements and WHERE clauses that use simple logical expressions. If the expression contains several subexpressions, you can optimize your program by estimating the probability that each subexpression is true. In the SAS/IML language, the ANY and ALL functions not only short circuit, but when their argument is a simple Boolean expression, the language treats the function call as a logical ligature and evaluates the call in an efficient manner that does not require any temporary memory.

Short circuits can be perplexing if you don't expect them. Equally confusing is expecting a statement to short circuit, but it doesn't. If you have a funny story about short-circuit operators, in any language, leave a comment.

The post Short-circuit evaluation and logical ligatures in SAS appeared first on The DO Loop.

July 31, 2019

Sometimes a little thing can make a big difference. I am enjoying a new enhancement of SAS/IML 15.1, which enables you to use a numeric vector as the column header or row header when you print a SAS/IML matrix. Prior to SAS/IML 15.1, you had to first use the CHAR or PUTN function to manually apply a format to the numeric values. You could then use the formatted values as column or row headers. Now you can skip the formatting step.

A common situation in which I want to use numeric values as a column header is when I am using the TABULATE function to compute the frequencies for a discrete numerical variable. For example, the Cylinders variable in the Sashelp.Cars data set indicates the number of cylinders in each vehicle. You might want to know how many vehicles in the data are four-cylinder, six-cylinder, eight-cylinder, and so forth. In SAS/IML 15.1, you can use a numeric variable (such as the levels of the Cylinders variable) directly in the COLNAME= option of the PRINT statement, as follows:

proc iml;
use Sashelp.Cars;
read all var "Cylinders" into X;
call tabulate(level, freq, X);      /* count number of obs for each level of X */
/* New in SAS/IML 15.1: use numeric values as COLNAME= option to PRINT statement */
print freq[colname=level];

Prior to SAS/IML 15.1, you had to convert numbers to character values by applying a format. This is commonly done by using the CHAR function, which applies a W.D format. This technique is still useful, but optional. You can also use the PUTN function to apply any format you want, such as COMMA., DATE., TIME., and so forth. For example, if you like Roman numerals, you could apply the ROMAN. format!

labels = char(level, 2);         /* apply w.d format */
print freq[colname=labels];
labels = putn(level, "ROMAN.");  /* second arg is name of format */
print freq[colname=labels];
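For comparison, in a general-purpose language the formatting step is always explicit. The sketch below is a Python illustration, not SAS code; the `to_roman` helper is hypothetical, written here to mimic the ROMAN. format:

```python
def to_roman(n: int) -> str:
    """Convert a positive integer to a Roman numeral string."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, numeral in pairs:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)

levels = [3, 4, 5, 6, 8, 10, 12]                 # numeric column levels
labels_wd = [f"{v:2d}" for v in levels]          # like the w.d format
labels_roman = [to_roman(v) for v in levels]     # like the ROMAN. format
print(labels_roman)
```

The point of the SAS/IML 15.1 enhancement is precisely that this conversion step becomes optional when you only need default formatting.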

You can also use numeric values for the ROWNAME= option to obtain row headers in SAS/IML 15.1. Sure, it's a small enhancement, but I like it!

The post Use numeric values for column headers when printing a matrix appeared first on The DO Loop.

July 24, 2019
Probability densities for Gumbel distributions. The parameter values correspond to the parameters that model the distribution of the maximum value in a sample of size n drawn from the standard normal distribution.

SAS supports more than 25 common probability distributions for the PDF, CDF, QUANTILE, and RAND functions. Of course, there are infinitely many distributions, so not every possible distribution is supported. If you need a less-common distribution, I've shown how to extend the functionality of Base SAS (by using PROC FCMP) or the SAS/IML language by defining your own functions. This article shows how to implement the Gumbel distribution in SAS. The density functions for four parameter values of the Gumbel distribution are shown to the right. As I recently discussed, the Gumbel distribution is used to model the maximum (or minimum) in a sample of size n.

Essential functions for the Gumbel Distribution

If you are going to work with a probability distribution, you need at least four essential functions: the CDF (cumulative distribution function), the PDF (probability density function), the QUANTILE function (inverse CDF), and a way to generate random values from the distribution. Because the RAND function in SAS supports the Gumbel distribution, you only need to define the CDF, PDF, and QUANTILE functions.

These functions are trivial to implement for the Gumbel distribution because each has an explicit formula. The following list defines each function. To match the documentation for the RAND function and PROC UNIVARIATE (which both support the Gumbel distribution), μ is used for the location parameter and σ is used for the scale parameter:

  • CDF: F(x; μ, σ) = exp(-exp(-z)), where z = (x - μ) / σ
  • PDF: f(x; μ, σ) = exp(-z-exp(-z)) / σ, where z = (x - μ) / σ
  • QUANTILE: F⁻¹(p; μ, σ) = μ - σ log(-log(p)), where LOG is the natural logarithm.

The RAND function in Base SAS and the RANDGEN function in SAS/IML support generating random Gumbel variates. Alternatively, you could generate u ~ U(0,1) and then use the QUANTILE formula to transform u into a random Gumbel variate.

Using these definitions, you can implement the functions as inline functions, or you can create a function call, such as in the following SAS/IML program:

proc iml;
start CDFGumbel(x, mu, sigma);
   z = (x-mu)/sigma;
   return exp(-exp(-z));              /* CDF of Gumbel distribution */
finish;
start PDFGumbel(x, mu, sigma);
   z = (x-mu)/sigma;
   return exp(-z-exp(-z)) / sigma;    /* PDF of Gumbel distribution */
finish;
start QuantileGumbel(p, mu, sigma);
   return mu - sigma # log(-log(p));  /* quantile (inverse CDF)     */
finish;
/* Example: Call the PDFGumbel function */
mu = 3.09;                       /* params for max value of a sample of size */
sigma = 0.286;                   /*       n=1000 from the std normal distrib */
x = do(2, 5, 0.025);
PDF = PDFGumbel(x, mu, sigma);
title "Gumbel PDF";
call series(x, PDF) grid={x y};

The graph shows the density for the Gumbel(3.09, 0.286) distribution, which models the distribution of the maximum value in a sample of size n=1000 drawn from the standard normal distribution. You can see that the maximum value is typically between 2.5 and 4.5, with values near 3 being the most likely.
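Because the three formulas are explicit, they are easy to port to other languages. The following Python sketch (an illustration, not the SAS/IML implementation above) defines the same functions and checks two facts: the quantile function inverts the CDF, and the Gumbel(3.09, 0.286) distribution matches the average maximum of simulated samples of 1000 standard normal values. The mean formula μ + γσ (where γ is the Euler-Mascheroni constant) is a standard Gumbel fact that is not stated in the post:

```python
import math
import random

def gumbel_cdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-math.exp(-z))

def gumbel_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-z - math.exp(-z)) / sigma

def gumbel_quantile(p, mu, sigma):
    return mu - sigma * math.log(-math.log(p))

mu, sigma = 3.09, 0.286              # parameters quoted for n = 1000
# the quantile function inverts the CDF
assert abs(gumbel_cdf(gumbel_quantile(0.75, mu, sigma), mu, sigma) - 0.75) < 1e-12

# mean of Gumbel(mu, sigma) is mu + gamma*sigma (gamma = Euler-Mascheroni constant)
gumbel_mean = mu + 0.5772156649 * sigma

# compare with the average maximum of samples of 1000 standard normals
random.seed(1)
reps = 1000
avg_max = sum(max(random.gauss(0, 1) for _ in range(1000)) for _ in range(reps)) / reps
print(round(gumbel_mean, 3), round(avg_max, 3))
```

The two printed values should agree to within simulation error, which supports using Gumbel(3.09, 0.286) as the model for the maximum of 1000 standard normal values.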

Plot a family of Gumbel curves

I recently discussed the best way to draw a family of curves in SAS by using PROC SGPLOT. The following DATA step defines the PDF and CDF for a family of Gumbel distributions. You can use this data set to display several curves that indicate how the distribution changes for various values of the parameters.

data Gumbel;
array n[4] _temporary_     (10   100  1000 1E6);   /* sample size */
array mu[4] _temporary_    (1.28 2.33 3.09 4.75);  /* Gumbel location parameter */
array beta [4] _temporary_ (0.5  0.36 0.29 0.2);   /* Gumbel scale parameter */
do i = 1 to dim(beta);                       /* parameters in the outer loop */
   m = mu[i]; b = beta[i];
   Params = catt("mu=", m, "; beta=", b);    /* concatenate parameters */
   do x = 0 to 6 by 0.01;
      z = (x-m)/b;
      CDF = exp( -exp(-z));                  /* CDF for Gumbel */
      PDF = exp(-z - exp(-z)) / b;           /* PDF for Gumbel */
      output;
   end;
end;
drop z;
run;
title "The Gumbel(mu, beta) Cumulative Distribution";
proc sgplot data=Gumbel;
   label cdf="Probability";
   series x=x y=cdf / group=Params lineattrs=(thickness=2);
   keylegend / position=right;
   xaxis grid;  yaxis grid;
run;
title "The Gumbel(mu, beta) Probability Density";
proc sgplot data=Gumbel;
   label pdf="Density";
   series x=x y=pdf / group=Params lineattrs=(thickness=2);
   keylegend / position=right;
   xaxis grid offsetmin=0 offsetmax=0;  yaxis grid;
run;
title "The Gumbel(mu, beta) Quantile Function";
proc sgplot data=Gumbel;
   label cdf="Probability";
   series x=cdf y=x / group=Params lineattrs=(thickness=2);
   keylegend / position=right sortorder=reverseauto;
   xaxis grid;  yaxis grid;
run;

The graph for the CDF is shown. A graph for the PDF is shown at the top of this article. I augmented the PDF curve with labels that show the sample size for which the Gumbel distribution is associated. The leftmost curve is the distribution of the maximum in a sample of size n=10; the rightmost curve is for a sample of size one million.

Random values and fitting parameters to data

For completeness, I will mention that you can use the RAND function to generate random values and use PROC UNIVARIATE to fit Gumbel parameters to data. For example, the following DATA step generates 5,000 random variates from Gumbel(1.28, 0.5). The call to PROC UNIVARIATE fits a Gumbel model to the data. The estimates for (μ, σ) are (1.29, 0.51).

data GumbelRand;
call streaminit(12345);
mu = 1.28;
beta = 0.5;
do i = 1 to 5000;
   x = rand("Gumbel", mu, beta);
   output;
end;
keep x;
run;

proc univariate data=GumbelRand;
   histogram x / Gumbel;
run;

In summary, although the PDF, CDF, and QUANTILE functions in SAS do not support the Gumbel distribution, it is trivial to implement these functions. This article visualizes the PDF and CDF of the Gumbel distributions for several parameter values. It also shows how to simulate random values from a Gumbel distribution and how to fit the Gumbel model to data.

The post Implement the Gumbel distribution in SAS appeared first on The DO Loop.

June 26, 2019

SAS/STAT software contains a number of so-called HP procedures for training and evaluating predictive models. ("HP" stands for "high performance.") A popular HP procedure is HPLOGISTIC, which enables you to fit logistic models on Big Data. A goal of the HP procedures is to fit models quickly. Inferential statistics such as standard errors, hypothesis tests, and p-values are less important for Big Data because, if the sample is big enough, all effects are significant and all hypothesis tests are rejected! Accordingly, many of the HP procedures do not support the same statistical tests as their non-HP cousins (for example, PROC LOGISTIC).

A SAS programmer recently posted an interesting question on the SAS Support Community. She is using PROC HPLOGISTIC for variable selection on a large data set. She obtained a final model, but she also wanted to estimate the covariance matrix for the parameters. PROC HPLOGISTIC does not support that statistic, but PROC LOGISTIC does (the COVB option on the MODEL statement). She asks whether it is possible to get the covariance matrix for the final model from PROC LOGISTIC, given that PROC HPLOGISTIC has already fit the model. Furthermore, PROC LOGISTIC supports computing and graphing odds ratios, so is it possible to get those statistics, too?

It is an intriguing question. The answer is "yes," although PROC LOGISTIC still has to perform some work. The main idea is that you can tell PROC LOGISTIC to use the parameter estimates found by PROC HPLOGISTIC.

What portion of a logistic regression takes the most time?

The main computational burden in logistic regression is threefold:

  • Reading the data and levelizing the CLASS variables in order to form a design matrix. This is done once.
  • Forming the sum of squares and crossproducts matrix (SSCP, also called the X`X matrix). This is done once.
  • Maximizing the likelihood function. The parameter estimates require running a nonlinear optimization. During the optimization, you need to evaluate the likelihood function, its gradient, and its Hessian many times. This is an expensive operation, and it must be done for each iteration of the optimization method.

The amount of time required by each step depends on the number of observations, the number of effects in the model, the number of levels in the classification variables, and the number of computational threads available on your system. You can use the Timing table in the PROC HPLOGISTIC output to determine how much time is spent during each step for your data/model combination. For example, the following results are for a simulated data set that has 500,000 observations, 30 continuous variables, four CLASS variables (each with 3 levels), and a main-effects model. It is running on a desktop PC with four cores. The iteration history is not shown, but this model converged in six iterations, which is relatively fast.

ods select PerformanceInfo IterHistory Timing;
proc hplogistic data=simLogi OUTEST;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass;
   ods output ParameterEstimates=PE;
   performance details;
run;

For these data and model, the Timing table tells you that the HPLOGISTIC procedure fit the model in 3.68 seconds. Of that time, about 17% was spent reading the data, levelizing the CLASS variables, and accumulating the SSCP matrix. The bulk of the time was spent evaluating the loglikelihood function and its derivatives during the six iterations of the optimization process.

Reduce time in an optimization by improving the initial guess

If you want to improve the performance of the model fitting, about the only thing you can control is the number of iterations required to fit the model. Both HPLOGISTIC and PROC LOGISTIC support an INEST= option that enables you to provide an initial guess for the parameter estimates. If the guess is close to the final estimates, the optimization method will require fewer iterations to converge.

Here's the key idea: If you provide the parameter estimates from a previous run, then you don't need to run the optimization algorithm at all! You still need to read the data, form the SSCP matrix, and evaluate the loglikelihood function at the (final) parameter estimates. But you don't need any iterations of the optimization method because you know that the estimates are optimal.
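To see why a warm start eliminates the optimization cost, here is a hedged Python sketch (a toy Newton's-method logistic fit, not the HPLOGISTIC algorithm): starting the optimization at previously computed estimates satisfies the convergence test immediately, so zero Newton iterations are performed.

```python
import math
import random

def newton_logistic(x, y, b0=0.0, b1=0.0, tol=1e-8, max_iter=25):
    """Fit y ~ logistic(b0 + b1*x) by Newton's method; return the estimates
    and the number of Newton iterations actually performed."""
    for it in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p;   g1 += (yi - p) * xi        # gradient of loglikelihood
            h00 += w;       h01 += w * xi              # (negated) Hessian entries
            h11 += w * xi * xi
        if math.hypot(g0, g1) < tol:
            return b0, b1, it               # converged: 'it' iterations were used
        det = h00 * h11 - h01 * h01
        b0 += ( h11 * g0 - h01 * g1) / det  # Newton step: solve the 2x2 system
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1, max_iter

random.seed(2)
x = [random.uniform(-2, 2) for _ in range(2000)]
y = [1 if random.random() < 1/(1 + math.exp(-(0.5 + 1.2*xi))) else 0 for xi in x]

b0, b1, cold_iters = newton_logistic(x, y)           # cold start at (0, 0)
_, _, warm_iters = newton_logistic(x, y, b0, b1)     # warm start at the optimum
print(cold_iters, warm_iters)
```

The warm-started fit still has to pass over the data to evaluate the gradient (the analogue of forming the SSCP matrix and evaluating the loglikelihood), but it performs no optimization iterations, which is exactly the effect of the INEST= and MAXITER=0 options.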

How much time can you save by using this trick? Let's find out. Notice that the previous call used the ODS OUTPUT statement to output the final parameter estimates. (It also used the OUTEST option to add the ParmName variable to the output.) You can run PROC TRANSPOSE to convert the ParameterEstimates table into a data set that can be read by the INEST= option on PROC HPLOGISTIC or PROC LOGISTIC. For either procedure, set the option MAXITER=0 to bypass the optimization process, as follows:

/* create INEST= data set from ParameterEstimates */
proc transpose data=PE out=inest(type=EST) label=_TYPE_;
   label Estimate=PARMS;
   var Estimate;
   id ParmName;
run;

ods select PerformanceInfo IterHistory Timing;
proc hplogistic data=simLogi INEST=inest MAXITER=0;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass;
   performance details;
run;

The time required to read the data, levelize, and form the SSCP matrix is unchanged. However, the time spent evaluating the loglikelihood and its derivatives is about 1/(1+NumIters) of the time from the first run, where NumIters is the number of optimization iterations (6, for these data). The second call to the HPLOGISTIC procedure still has to evaluate the loglikelihood function to form statistics like the AIC and BIC. It needs to evaluate the Hessian to compute standard errors of the estimates. But the total time is greatly decreased by skipping the optimization step.

Use PROC HPLOGISTIC estimates to jump-start PROC LOGISTIC

You can use the same trick with PROC LOGISTIC. You can use the INEST= and MAXITER=0 options to greatly reduce the total time for the procedure. At the same time, you can request statistics such as COVB and ODDSRATIO that are not available in PROC HPLOGISTIC, as shown in the following example:

proc logistic data=simLogi INEST=inest 
   plots(only MAXPOINTS=NONE)=oddsratio(range=clip);
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass / MAXITER=0 COVB;
   oddsratio c1;
run;

The PROC LOGISTIC step takes about 4.5 seconds. It produces odds ratios and plots for the model effects and displays the covariance matrix of the betas (COVB). By using the parameter estimates that were obtained by PROC HPLOGISTIC, it was able to avoid the expensive optimization iterations.

You can also use the STORE statement in PROC LOGISTIC to save the model to an item store. You can then use PROC PLM to create additional graphs and to run additional post-fit analyses.

In summary, PROC LOGISTIC can compute statistics and hypothesis tests that are not available in PROC HPLOGISTIC. It is possible to fit a model by using PROC HPLOGISTIC and then use the INEST= and MAXITER=0 options to pass the parameter estimates to PROC LOGISTIC. This enables PROC LOGISTIC to skip the optimization iterations, which saves substantial computational time.

You can download the SAS program that creates the tables and graphs in this article. The program contains a DATA step that simulates logistic data of any size.

The post Jump-start PROC LOGISTIC by using parameter estimates from PROC HPLOGISTIC appeared first on The DO Loop.

May 20, 2019

Recently I wrote about how to compute the Kolmogorov D statistic, which is used to determine whether a sample has a particular distribution. One of the beautiful facts about modern computational statistics is that if you can compute a statistic, you can use simulation to estimate the sampling distribution of that statistic. That means that instead of looking up critical values of the D statistic in a table, you can estimate the critical value by using empirical quantiles from the simulation.

This is a wonderfully liberating result! No longer are we statisticians constrained by the entries in a table in the appendix of a textbook. In fact, you could claim that modern computation has essentially killed the standard statistical table.

Obtain critical values by using simulation

Before we compute anything, let's recall a little statistical theory. If you get a headache thinking about null hypotheses and sampling distributions, you might want to skip the next two paragraphs!

When you run a hypothesis test, you compare a statistic (computed from data) to a hypothetical distribution (called the null distribution). If the observed statistic is way out in a tail of the null distribution, you reject the hypothesis that the statistic came from that distribution. In other words, the data does not seem to have the characteristic that you are testing for. Statistical tables use "critical values" to designate when a statistic is in the extreme tail. A critical value is a quantile of the null distribution; if the observed statistic is greater than the critical value, then the statistic is in the tail. (Technically, I've described a one-tailed test.)

One of the uses for simulation is to approximate the sampling distribution of a statistic when the true distribution is not known or is known only asymptotically. You can generate a large number of samples from the null hypothesis and compute the statistic on each sample. The union of the statistics approximates the true sampling distribution (under the null hypothesis) so you can use the quantiles to estimate the critical values of the null distribution.

Critical values of the Kolmogorov D distribution

You can use simulation to estimate the critical value for the Kolmogorov-Smirnov statistical test for normality. For the data in my previous article, the null hypothesis is that the sample data follow a N(59, 5) distribution. The alternative hypothesis is that they do not. The previous article computed a test statistic of D = 0.131 for the data (N = 30). If the null hypothesis is true, is that an unusual value to observe? Let's simulate 40,000 samples of size N = 30 from N(59,5) and compute the D statistic for each. Rather than use PROC UNIVARIATE, which computes dozens of statistics for each sample, you can use the SAS/IML computation from the previous article, which is very fast. The following simulation runs in a fraction of a second.

/* parameters of reference distribution: F = cdf("Normal", x, &mu, &sigma) */
%let mu    = 59;
%let sigma =  5;
%let N     = 30;
%let NumSamples = 40000;
proc iml;
call randseed(73);
N = &N;
i = T( 1:N );                           /* ranks */
u = i/N;                                /* ECDF height at right-hand endpoints */
um1 = (i-1)/N;                          /* ECDF height at left-hand endpoints  */
y = j(N, &NumSamples, .);               /* columns of Y are samples of size N */
call randgen(y, "Normal", &mu, &sigma); /* fill with random N(mu, sigma)      */
D = j(&NumSamples, 1, .);               /* allocate vector for results        */
do k = 1 to ncol(y);                    /* for each sample:                   */
   x = y[,k];                           /*    get sample x ~ N(mu, sigma)     */
   call sort(x);                        /*    sort sample                     */
   F = cdf("Normal", x, &mu, &sigma);   /*    CDF of reference distribution   */
   D[k] = max( F - um1, u - F );        /*    D = max( D_minus, D_plus )      */
end;
title "Monte Carlo Estimate of Sampling Distribution of Kolmogorov's D Statistic";
title2 "N = 30; N_MC = &NumSamples";
call histogram(D) other=
     "refline 0.131 / axis=x label='Sample D' labelloc=inside lineattrs=(color=red);";

The test statistic is right smack dab in the middle of the null distribution, so there is no reason to doubt that the sample is distributed as N(59, 5).

How big would the test statistic need to be to be considered extreme? To test the hypothesis at the α significance level, you can compute the 1 – α quantile of the null distribution. The following statements compute the critical value for α = 0.05 and N = 30:

/* estimate critical value as the 1 - alpha quantile */
alpha = 0.05;
call qntl(Dcrit_MC, D, 1-alpha);
print Dcrit_MC;

The estimated critical value for a sample of size 30 is 0.242. This compares favorably with the exact critical value from a statistical table, which gives Dcrit = 0.2417 for N = 30.

You can also use the null distribution to compute a p value for an observed statistic. The p value is estimated as the proportion of statistics in the simulation that exceed the observed value. For example, if you observe data that has a D statistic of 0.28, the estimated p value is obtained by the following statements:

Dobs = 0.28;                        /* hypothetical observed statistic */
pValue = sum(D >= Dobs) / nrow(D);  /* proportion of simulated statistics that exceed Dobs */
print Dobs pValue;

This same technique works for any sample size, N, although most tables list critical values only for N ≤ 30. For larger N, you can use an asymptotic formula, developed by Smirnov (1948), in which the critical value has the form c(α)/sqrt(N), where the constant c(α) = sqrt( -ln(α/2) / 2 ) depends only on α.
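
As a quick numerical check of Smirnov's asymptotic approximation, D_crit ≈ sqrt( -ln(α/2) / (2N) ), the following sketch evaluates it for α = 0.05 and N = 30; as expected for such a small N, it slightly overestimates the exact tabulated value of 0.2417:

```python
from math import log, sqrt

def smirnov_crit(alpha, n):
    """Smirnov's asymptotic critical value sqrt(-ln(alpha/2) / (2n))
       for the one-sample Kolmogorov D statistic."""
    return sqrt(-log(alpha / 2) / (2 * n))

print(round(smirnov_crit(0.05, 30), 4))   # ~0.248, vs the exact tabulated 0.2417
```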

The Kolmogorov D statistic does not depend on the reference distribution

It is reasonable to assume that the results of this article apply only to a normal reference distribution. However, Kolmogorov proved that the sampling distribution of the D statistic is actually independent of the reference distribution. In other words, the distribution (and critical values) are the same regardless of the continuous reference distribution: beta, exponential, gamma, lognormal, normal, and so forth. That is a surprising result, which explains why there is only one statistical table for the critical values of the Kolmogorov D statistic, as opposed to having different tables for different reference distributions.

In summary, you can use simulation to estimate the critical values for the Kolmogorov D statistic. In a vectorized language such as SAS/IML, the entire simulation requires only about a dozen statements and runs extremely fast.

The post Critical values of the Kolmogorov-Smirnov test appeared first on The DO Loop.

5月 152019

Have you ever run a statistical test to determine whether data are normally distributed? If so, you have probably used Kolmogorov's D statistic. Kolmogorov's D statistic (also called the Kolmogorov-Smirnov statistic) enables you to test whether the empirical distribution of data is different than a reference distribution. The reference distribution can be a probability distribution or the empirical distribution of a second sample. Most statistical software packages provide built-in methods for computing the D statistic and using it in hypothesis tests. This article discusses the geometric meaning of the Kolmogorov D statistic, how you can compute it in SAS, and how you can compute it "by hand."

This article focuses on the case where the reference distribution is a continuous probability distribution, such as a normal, lognormal, or gamma distribution. The D statistic can help you determine whether a sample of data appears to be from the reference distribution. Throughout this article, the word "distribution" refers to the cumulative distribution.

What is the Kolmogorov D statistic?

The geometry of Kolmogorov's D statistic, which measures the maximum vertical distance between an empirical distribution and a reference distribution.

The letter "D" stands for "distance." Geometrically, D measures the maximum vertical distance between the empirical cumulative distribution function (ECDF) of the sample and the cumulative distribution function (CDF) of the reference distribution. As shown in the adjacent image, you can split the computation of D into two parts:

  • When the reference distribution is above the ECDF, compute the statistic D– as the maximal difference between the reference distribution and the ECDF. Because the reference distribution is monotonically increasing and because the empirical distribution is a piecewise-constant step function that changes at the data values, the maximal value of D– will always occur at a right-hand endpoint of the ECDF.
  • When the reference distribution is below the ECDF, compute the statistic D+ as the maximal difference between the ECDF and the reference distribution. The maximal value of D+ will always occur at a left-hand endpoint of the ECDF.

The D statistic is simply the maximum of D– and D+.

You can compare the statistic D to critical values of the D distribution, which appear in tables. If the statistic is greater than the critical value, you reject the null hypothesis that the sample came from the reference distribution.

How to compute Kolmogorov's D statistic in SAS?

In SAS, you can use the HISTOGRAM statement in PROC UNIVARIATE to compute Kolmogorov's D statistic for many continuous reference distributions, such as the normal, beta, or gamma distributions. For example, the following SAS statements simulate 30 observations from a N(59, 5) distribution and compute the D statistic as the maximal distance between the ECDF of the data and the CDF of the N(59, 5) distribution:

/* parameters of reference distribution: F = cdf("Normal", x, &mu, &sigma) */
%let mu    = 59;
%let sigma =  5;
%let N     = 30;
/* simulate a random sample of size N from the reference distribution */
data Test;
call streaminit(73);
do i = 1 to &N;
   x = rand("Normal", &mu, &sigma);
   output;
end;
drop i;
run;
proc univariate data=Test;
   ods select Moments CDFPlot GoodnessOfFit;
   histogram x / normal(mu=&mu sigma=&sigma) vscale=proportion;  /* compute Kolmogorov D statistic (and others) */
   cdfplot x / normal(mu=&mu sigma=&sigma) vscale=proportion;    /* plot ECDF and reference distribution */
   ods output CDFPlot=ECDF(drop=VarName CDFPlot);                /* for later use, save values of ECDF */
run;

The computation shows that D = 0.131 for this simulated data. The plot at the top of this article shows that the maximum occurs for a datum that has the value 54.75. For that observation, the ECDF is farther away from the normal CDF than at any other location.

The critical value of the D distribution when N=30 and α=0.05 is 0.2417. Since D < 0.2417, you should not reject the null hypothesis. It is reasonable to conclude that the sample comes from the N(59, 5) distribution.

Although the computation is not discussed in this article, you can use PROC NPAR1WAY to compute the statistic when you have two samples and want to determine if they are from the same distribution. In this two-sample test, the geometry is the same, but the computation is slightly different because the reference distribution is itself an ECDF, which is a step function.

Compute Kolmogorov's D statistic manually

Although PROC UNIVARIATE in SAS computes the D statistic (and other goodness-of-fit statistics) automatically, it is not difficult to compute the statistic from first principles in a vector language such as SAS/IML.

The key is to recall that the ECDF is a piecewise-constant function that changes height at the values of the observations. If you sort the data, the height of the ECDF at the i_th sorted observation is i / N, where N is the sample size. The height of the ECDF an infinitesimal distance before the i_th sorted observation is (i – 1) / N. These facts enable you to compute D– and D+ efficiently.

The following algorithm computes the D statistic:

  1. Sort the data in increasing order.
  2. Evaluate the reference distribution at the data values.
  3. Compute the pairwise differences between the reference distribution and the ECDF heights: D– uses the left-hand heights (i – 1)/N, and D+ uses the right-hand heights i/N.
  4. Compute D as the maximum value of all the pairwise differences.

The following SAS/IML program implements the algorithm:

/* Compute the Kolmogorov D statistic manually */
proc iml;
use Test;  read all var "x";  close;
N = nrow(x);                         /* sample size */
call sort(x);                        /* sort data */
F = cdf("Normal", x, &mu, &sigma);   /* CDF of reference distribution */
i = T( 1:N );                        /* ranks */
Dminus = F - (i-1)/N;
Dplus = i/N - F;
D = max(Dminus, Dplus);              /* Kolmogorov's D statistic */
print D;

The D statistic matches the computation from PROC UNIVARIATE.
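
The same four-step, vectorized computation is easy to reproduce outside of SAS. Here is an illustrative Python sketch (assuming numpy and scipy; the variable names mirror the SAS/IML program), cross-checked against scipy.stats.kstest, which computes the same statistic:

```python
import numpy as np
from scipy.stats import norm, kstest

mu, sigma = 59, 5
rng = np.random.default_rng(73)
x = np.sort(rng.normal(mu, sigma, size=30))   # sorted sample

n = x.size
i = np.arange(1, n + 1)                       # ranks
F = norm.cdf(x, mu, sigma)                    # CDF of reference distribution
Dminus = (F - (i - 1) / n).max()
Dplus = (i / n - F).max()
D = max(Dminus, Dplus)                        # Kolmogorov's D statistic

D_scipy = kstest(x, norm(mu, sigma).cdf).statistic   # independent check
print(D, D_scipy)
```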

The SAS/IML implementation is very compact because it is vectorized. By computing the statistic "by hand," you can perform additional computations. For example, you can find the observation for which the ECDF and the reference distribution are farthest away. The following statements find the index of the maximum for the Dminus and Dplus vectors. You can use that information to find the value of the observation at which the maximum occurs, as well as the heights of the ECDF and reference distribution. You can use these values to create the plot at the top of this article, which shows the geometry of the Kolmogorov D statistic:

/* compute locations of the maximum D+ and D-, then plot with a highlow plot */
i1 = Dminus[<:>];  i2 = Dplus [<:>];              /* indices of maxima */
/* Compute location of max, value of max, upper and lower curve values */
x1=x[i1];  v1=DMinus[i1];  Low1=(i[i1]-1)/N;  High1=F[i1]; 
x2=x[i2];  v2=Dplus [i2];  Low2=F[i2];        High2=i[i2]/N; 
Result = (x1 || v1 || Low1 || High1) // (x2 || v2 || Low2 || High2);
print Result[c={'x' 'Value' 'Low' 'High'} r={'D-' 'D+'} L='Kolmogorov D'];

The observations that maximize the D– and D+ statistics are x=54.75 and x=61.86, respectively. The value of D– is the larger of the two, so that is the value of Kolmogorov's D.

For completeness, the following statements show how to create the graph at the top of this article. The HIGHLOW statement in PROC SGPLOT is used to plot the vertical line segments that represent the D– and D+ statistics.

create KolmogorovD var {x x1 Low1 High1 x2 Low2 High2}; append; close;
quit;

data ALL;
   set KolmogorovD ECDF;   /* combine the ECDF, reference curve, and the D- and D+ line segments */
run;
title "Kolmogorov's D Statistic";
proc sgplot data=All;
   label CDFy = "Reference CDF" ECDFy="ECDF";
   xaxis grid label="x";
   yaxis grid label="Cumulative Probability" offsetmin=0.08;
   fringe x;
   series x=CDFx Y=CDFy / lineattrs=GraphReference(thickness=2) name="F";
   step x=ECDFx Y=ECDFy / lineattrs=GraphData1 name="ECDF";
   highlow x=x1 low=Low1 high=High1 / 
           lineattrs=GraphData2(thickness=4) name="Dm" legendlabel="D-";
   highlow x=x2 low=Low2 high=High2 / 
           lineattrs=GraphData3(thickness=4) name="Dp" legendlabel="D+";
   keylegend "F" "ECDF" "Dm" "Dp";
run;

Concluding thoughts

Why would anyone want to compute the D statistic by hand if PROC UNIVARIATE can compute it automatically? There are several reasons:

  • Understanding: When you compute a statistic manually, you are forced to understand what the statistic means and how it works.
  • Efficiency: When you run PROC UNIVARIATE, it computes many statistics (mean, standard deviation, skewness, kurtosis,...), many goodness-of-fit tests (Kolmogorov-Smirnov, Anderson-Darling, and Cramér–von Mises), a histogram, and more. That's a lot of work! If you are interested only in the D statistic, why needlessly compute the others?
  • Generalizability: You can compute quantities such as the location of the maximum value of D and D+. You can use the knowledge and experience you've gained to compute related statistics.

The post What is Kolmogorov's D statistic? appeared first on The DO Loop.

5月 012019

SAS regression procedures support several parameterizations of classification variables. When a categorical variable is used as an explanatory variable in a regression model, the procedure generates dummy variables that are used to construct a design matrix for the model. The process of forming columns in a design matrix is called a parameterization or encoding. In SAS, most regression procedures use either the GLM encoding, the EFFECT encoding, or the REFERENCE encoding. This article summarizes the default and optional encodings for each regression procedure in SAS/STAT. In many SAS procedures, you can use the PARAM= option to change the default encoding.

The documentation section "Parameterization of Model Effects" provides a complete list of the encodings in SAS and shows how the design matrices are constructed from the levels. (The levels are the values of a classification variable.) Pasta (2005) gives examples and further discussion.

Default and optional encodings for SAS regression procedures

The following SAS regression procedures support the CLASS statement or a similar syntax. The columns GLM, REFERENCE, and EFFECT indicate the three most common encodings. The word "Default" indicates the default encoding. For procedures that support the PARAM= option, the column indicates the supported encodings. The word All means that the procedure supports the complete list of SAS encodings. Most procedures default to using the GLM encoding; the exceptions are highlighted.

Procedure                  GLM       REFERENCE   EFFECT    PARAM= option
------------------------   -------   ---------   -------   -------------
ANOVA                      Default
CATMOD                                           Default
FMM                        Default
GAM                        Default
GAMPL                      Default   Yes                   GLM | REF
GEE                        Default
GENMOD                     Default   Yes         Yes       All
GLM                        Default
GLMSELECT                  Default   Yes         Yes       All
HP regression procedures   Default   Yes                   GLM | REF
ICPHREG                    Default   Yes         Yes       All
LOGISTIC                   Yes       Yes         Default   All
MIXED                      Default
ORTHOREG                   Default   Yes         Yes       All
PLS                        Default
PROBIT                     Default
PHREG                      Yes       Default     Yes       All
QUANTSELECT                Default   Yes         Yes       All
RMSTREG                    Default   Yes         Yes       All
SURVEYPHREG                Default   Yes         Yes       All
TRANSREG                   Yes       Default     Yes

A few comments:

  • The REFERENCE encoding is the default for PHREG and TRANSREG.
  • The EFFECT encoding is the default for CATMOD, LOGISTIC, and SURVEYLOGISTIC.
  • The HP regression procedures all use the GLM encoding by default and support only PARAM=GLM or PARAM=REF. The HP regression procedures include HPFMM, HPGENSELECT, HPLMIXED, HPLOGISTIC, HPNLMOD, HPPLS, HPQUANTSELECT, and HPREG. In spite of its name, GAMPL is also an HP procedure. In spite of its name, HPMIXED is NOT an HP procedure!
  • PROC LOGISTIC and PROC HPLOGISTIC use different default encodings.
  • CATMOD does not have a CLASS statement because all variables are assumed to be categorical.
  • PROC TRANSREG does not support a CLASS statement. Instead, it uses a CLASS() transformation list. It uses different syntax to support parameter encodings.

How to interpret main effects for the SAS encodings

The GLM parameterization is a singular parameterization. The other encodings are nonsingular. The "Other Parameterizations" section of the documentation gives a simple one-sentence summary of how to interpret the parameter estimates for the main effects in each encoding:

  • The GLM encoding estimates the difference in the effect of each level compared to the reference level. You can use the REF= option to specify the reference level. By default, the reference level is the last ordered level. The design matrix for the GLM encoding is singular.
  • The REFERENCE encoding estimates the difference in the effect of each nonreference level compared to the effect of the reference level. You can use the REF= option to specify the reference level. By default, the reference level is the last ordered level. Notice that the REFERENCE encoding gives the same interpretation as the GLM encoding. The difference is that the design matrix for the REFERENCE encoding excludes the column for the reference level, so the design matrix for the REFERENCE encoding is (usually) nonsingular.
  • The EFFECT encoding estimates the difference in the effect of each nonreference level compared to the average effect over all levels.
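
To make the three encodings concrete, here is an illustrative Python sketch (not SAS output) that builds the design-matrix columns for a three-level factor with levels A, B, C, where the reference level is the last level, C:

```python
import numpy as np

levels = ["A", "B", "C"]                 # reference level: the last level, C
data = ["A", "B", "C", "B"]

def glm_design(v):
    """GLM encoding: one indicator column per level (singular with an intercept)."""
    return np.array([[int(val == lev) for lev in levels] for val in v])

def reference_design(v):
    """REFERENCE encoding: drop the reference level's column."""
    return glm_design(v)[:, :-1]

def effect_design(v):
    """EFFECT encoding: like REFERENCE, but the reference level is coded -1."""
    E = reference_design(v).copy()
    E[np.array(v) == levels[-1], :] = -1
    return E

print(effect_design(data))   # the row for level C is [-1, -1]
```

Notice that the GLM matrix has three columns (and is singular when an intercept is added), whereas the REFERENCE and EFFECT matrices have only two.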

This article lists the encodings that are supported for each SAS regression procedure. I hope you will find it to be a useful reference. If I've missed your favorite regression procedure, let me know in the comments.

The post Encodings of CLASS variables in SAS regression procedures: A cheat sheet appeared first on The DO Loop.

4月 292019

Did you know that SAS provides built-in support for working with probability distributions that are finite mixtures of normal distributions? This article shows examples of using the "NormalMix" distribution in SAS and describes a trick that enables you to easily work with distributions that have many components.

As with all probability distributions, there are four essential functions that you need to know: The PDF, CDF, QUANTILE, and RAND functions.

What is a normal mixture distribution?

A finite mixture distribution is a weighted sum of component distributions. When all of the components are normal, the distribution is called a mixture of normals. If the i_th component has parameters (μi, σi), then you can write the probability density function (PDF) of the normal mixture as
f(x) = Σi wi φ(x; μi, σi)
where φ is the normal PDF and the positive constants wi are the mixing weights. The mixing weights must sum to 1.

The adjacent graph shows the density function for a three-component mixture of normal distributions. The means of the components are -6, 3, and 8, respectively, and are indicated by vertical reference lines. The mixing weights are 0.1, 0.3, and 0.6. The SAS program to create the graph is in the next section.

The "NormalMix" distribution in SAS

The PDF and CDF functions in Base SAS support the "NormalMix" distribution. The syntax is a little unusual because the function needs to support an arbitrary number of components. If there are k components, the PDF and CDF functions require 3k + 3 parameters:

  1. The first parameter is the name of the distribution: "NormalMix". The second parameter is the value, x, at which to evaluate the density function.
  2. The third parameter is the number of component distributions, k > 1.
  3. The next k parameters specify the mixing weights, w1, w2, ..., wk.
  4. The next k parameters specify the component means, μ1, μ2, ..., μk.
  5. The next k parameters specify the component standard deviations, σ1, σ2, ..., σk.
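
The weighted-sum definition of the mixture PDF and CDF is simple to express directly. Here is an illustrative Python sketch (assuming numpy and scipy; not the SAS implementation) for the three-component mixture used in this article:

```python
import numpy as np
from scipy.stats import norm

w = np.array([0.1, 0.3, 0.6])        # mixing weights (must sum to 1)
mu = np.array([-6.0, 3.0, 8.0])      # component means
sigma = np.array([0.5, 0.6, 2.5])    # component standard deviations

def mix_pdf(x):
    """PDF of the mixture: weighted sum of component normal densities."""
    return sum(wi * norm.pdf(x, mi, si) for wi, mi, si in zip(w, mu, sigma))

def mix_cdf(x):
    """CDF of the mixture: weighted sum of component normal CDFs."""
    return sum(wi * norm.cdf(x, mi, si) for wi, mi, si in zip(w, mu, sigma))

print(mix_cdf(0.0))   # mass to the left of 0: essentially all of component 1, about 0.1
```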

If you are using a model that has many components, it is tedious to explicitly list every parameter in every function call. Fortunately, there is a simple trick that prevents you from having to list the parameters. You can put the parameters into arrays and use the OF operator (sometimes called the OF keyword) to reference the parameter values in the array. This is shown in the next section.

The PDF and CDF of a normal mixture

The following example demonstrates how to compute the PDF and CDF for a three-component mixture-of-normals distribution. The DATA step shows two tricks:

  • The parameters (weights, means, and standard deviations) are stored in arrays. In the calls to the PDF and CDF functions, syntax such as OF w[*] enables you to specify the parameter values with minimal typing.
  • A normal density is extremely tiny outside of the interval [μ - 5*σ, μ + 5*σ]. You can use this fact to compute the effective domain for the PDF and CDF functions.
/* PDF and CDF of the normal mixture distribution. This example specifies three components. */
data NormalMix;
array w[3]     _temporary_ ( 0.1, 0.3, 0.6);  /* mixing weights */ 
array mu[3]    _temporary_ (-6,   3,   8);    /* mean for each component */
array sigma[3] _temporary_ (0.5,  0.6, 2.5);  /* standard deviation for each component */
/* For each component, the range [mu-5*sigma, mu+5*sigma] is the effective support. */
minX = 1e308; maxX = -1e308;                  /* initialize to extreme values */
do i = 1 to dim(mu);                          /* find largest interval where density > 1E-6 */
   minX = min(minX, mu[i] - 5*sigma[i]);
   maxX = max(maxX, mu[i] + 5*sigma[i]);
end;
/* Visualize the functions on the effective support. Use arrays and the OF operator to
   specify the parameters. An alternative syntax is to list the arguments, as follows:
   cdf = CDF('normalmix', x, 3, 
              0.1, 0.3, 0.6,    
             -6,   3,   8,    
              0.5, 0.6, 2.5);       */
dx = (maxX - minX)/200;
do x = minX to maxX by dx;
  pdf = pdf('normalmix', x, dim(mu), of w[*], of mu[*], of sigma[*]); 
  cdf = cdf('normalmix', x, dim(mu), of w[*], of mu[*], of sigma[*]); 
  output;
end;
keep x pdf cdf;
run;

As shown in the program, the OF operator greatly simplifies and clarifies the function calls. The alternative syntax, which is shown in the comments, is unwieldy.

The following statements create graphs of the PDF and CDF functions. The PDF function is shown at the top of this article. The CDF function, along with a few reference lines, is shown below.

title "PDF function for Normal Mixture Distribution";
title2 "Vertical Lines at Component Means";
proc sgplot data=NormalMix;
   refline -6 3 8 / axis=x;
   series x=x y=pdf;
run;
title "CDF function for Normal Mixture Distribution";
proc sgplot data=NormalMix;
   xaxis grid; yaxis min=0 grid; 
   refline 0.1 0.5 0.7 0.9;
   series x=x y=cdf;
run;

The quantiles of a normal mixture

The quantile function for a continuous distribution is the inverse of the CDF. The graph of the CDF function for a mixture of normals can have flat regions when the component means are far apart relative to their standard deviations. Technically, these regions are not completely flat because the normal distribution has infinite support, but computationally they can be very flat. Because finding a quantile is equivalent to finding the root of a shifted CDF, you might encounter computational problems if you try to compute the quantile that corresponds to an extremely flat region, such as the 0.1 quantile in the previous graph.
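
Because a quantile is a root of the shifted CDF, you can invert the mixture CDF with any root finder. The following illustrative Python sketch (using scipy.optimize.brentq as a stand-in for the internal SAS algorithm; the mixture is redefined so the block is self-contained) recovers the 0.1 quantile:

```python
from scipy.stats import norm
from scipy.optimize import brentq

w, mu, sigma = [0.1, 0.3, 0.6], [-6, 3, 8], [0.5, 0.6, 2.5]

def mix_cdf(x):
    return sum(wi * norm.cdf(x, mi, si) for wi, mi, si in zip(w, mu, sigma))

def mix_quantile(p, lo=-20.0, hi=30.0):
    # find x such that mix_cdf(x) = p; [lo, hi] must bracket the root
    return brentq(lambda x: mix_cdf(x) - p, lo, hi)

print(mix_quantile(0.1))   # approximately -3.8, matching the SAS table
```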

The following DATA step computes the 0.1, 0.5, 0.7, and 0.9 quantiles for the normal mixture distribution. Notice that you can use arrays and the OF operator for the QUANTILE function:

data Q;
array w[3]     _temporary_ ( 0.1, 0.3, 0.6);  /* mixing weights */ 
array mu[3]    _temporary_ (-6,   3,   8);    /* mean for each component */
array sigma[3] _temporary_ (0.5,  0.6, 2.5);  /* standard deviation for each component */
array p[4] (0.1, 0.5, 0.7, 0.9);  /* find quantiles for these probabilities */
do i = 1 to dim(p);
   prob = p[i];
   qntl = quantile('normalmix', prob, dim(mu), of w[*], of mu[*], of sigma[*]); 
   output;
end;
keep qntl prob;
run;
proc print; run;

The table tells you that 10% of the density of the normal mixture is less than x=-3.824. That is essentially the result of the first component, which has weight 0.1 and therefore is responsible for 10% of the total density. Half of the density is less than x=5.58. Fully 70% of the density lies to the left of x=8, which is the mean of the third component. That result makes sense when you look at the mixing weights.

Random values from a normal mixture

The RAND function does not explicitly support the "NormalMix" distribution. However, as I have shown in a previous article, you can simulate from an arbitrary mixture of distributions by using the "Table" distribution in conjunction with the component distributions. For the three-component mixture distribution, the following DATA step simulates a random sample:

/* random sample from a mixture distribution */
%let N = 1000;
data RandMix(drop=i);
call streaminit(12345);
array w[3]     _temporary_ ( 0.1, 0.3, 0.6);  /* mixing weights */ 
array mu[3]    _temporary_ (-6,   3,   8);    /* mean for each component */
array sigma[3] _temporary_ (0.5,  0.6, 2.5);  /* standard deviation for each component */
do obsNum = 1 to &N;
   i = rand("Table", of w[*]);                /* choose the component by using the mixing weights */
   x = rand("Normal", mu[i], sigma[i]);       /* sample from that component */
   output;
end;
run;
title "Random Sample for Normal Mixture Distribution";
proc sgplot data=RandMix;
   histogram x;
   refline -6 3 8 / axis=x;   /* means of component distributions */
run;

The histogram of a random sample looks similar to the graph of the PDF function, as it should.
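
The same choose-a-component-then-sample trick works in any language. An illustrative numpy sketch (not part of the original post):

```python
import numpy as np

rng = np.random.default_rng(12345)
w = [0.1, 0.3, 0.6]                          # mixing weights
mu = np.array([-6.0, 3.0, 8.0])              # component means
sigma = np.array([0.5, 0.6, 2.5])            # component standard deviations

comp = rng.choice(3, size=1000, p=w)         # pick a component for each observation
x = rng.normal(mu[comp], sigma[comp])        # then sample from that component
print(x.mean())                              # near 0.1*(-6) + 0.3*3 + 0.6*8 = 5.1
```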

In summary, SAS provides built-in support for working with the density (PDF), cumulative probability (CDF), and quantiles (QUANTILE) of a normal mixture distribution. You can use arrays and the OF operator to call these Base SAS functions without having to list every parameter. Although the RAND function does not natively support the "NormalMix" distribution, you can use the "Table" distribution to select a component according to the mixing weights and use the RAND("Normal") function to simulate from the selected normal component.

The post The normal mixture distribution in SAS appeared first on The DO Loop.

4月 172019

I think every course in exploratory data analysis should begin by studying Anscombe's quartet. Anscombe's quartet is a set of four data sets (N=11) that have nearly identical descriptive statistics but very different graphical properties. They are a great reminder of why you should graph your data. You can read about Anscombe's quartet on Wikipedia, or check out a quick visual summary by my colleague Robert Allison. Anscombe's first two examples are shown below:

The Wikipedia article states, "It is not known how Anscombe created his data sets." Really? Creating different data sets that have the same statistics sounds like a fun challenge! As a tribute to Anscombe, I decided to generate my own versions of the two data sets shown in the previous scatter plots. The first data set is linear with normal errors. The second is quadratic (without errors) and has the exact same linear fit and correlation coefficient as the first data.

Generating your own version of Anscombe's data

The Wikipedia article notes that there are "several methods to generate similar data sets with identical statistics and dissimilar graphics," but I did not look at the modern papers. I wanted to try it on my own. If you want to solve the problem on your own, stop reading now!

I used the following approach to construct the first two data sets:

  1. Use a simulation to create linear data with random normal errors: Y1 = 3 + 0.5 X + ε, where ε ~ N(0,1).
  2. Compute the regression estimates (b0 and b1) and the sample correlation for the linear data. These are the target statistics. They define three equations that the second data set must match.
  3. The response variable for the second data set is of the form Y2 = β0 + β1 X + β2 X2. There are three equations and three unknown parameters, so you can solve a system of nonlinear equations to find β.

From geometric reasoning, there are three different solutions for the β parameter: one with β2 > 0 (a parabola that opens up), one with β2 = 0 (a straight line), and one with β2 < 0 (a parabola that opens down). Since Anscombe used a downward-pointing parabola, I will make the same choice.

Construct the first data set

You can construct the first data set in a number of ways, but I chose to construct it randomly. The following SAS/IML statements construct the data and define a helper function (LinearReg). The program then computes the target values, which are the parameter estimates for a linear regression and the sample correlation for the data:

proc iml;
call randseed(12345);
x = T( do(4, 14, 0.2) );                              /* evenly spaced X */
eps = round( randfun(nrow(x), "Normal", 0, 1), 0.01); /* normal error */
y = 3 + 0.5*x + eps;                                  /* linear Y + error */
/* Helper function. Return parameter estimates for linear regression. Args are col vectors */
start LinearReg(Y, tX);
   X = j(nrow(tX), 1, 1) || tX;
   b = solve(X`*X, X`*Y);       /* solve normal equations */
   return b;
finish;
targetB = LinearReg(y, x);          /* compute regression estimates */
targetCorr = corr(y||x)[2];         /* compute sample correlation */
print (targetB`||targetCorr)[c={'b0' 'b1' 'corr'} F=5.3 L="Target"];

You can use these values as the target values. The next step is to find a parameter vector β such that Y2 = β0 + β1 X + β2 X2 has the same regression line and corr(Y2, X) has the same sample correlation. For uniqueness, set β2 < 0.

Construct the second data set

You can formulate the problem as a system of equations and use the NLPHQN subroutine in SAS/IML to solve it. (SAS supports multiple ways to solve a system of equations.) The following SAS/IML statements define two functions. Given any value for the β parameter, the first function returns the regression estimates and sample correlation between Y2 and X. The second function is the objective function for an optimization. It subtracts the target values from the estimates. The NLPHQN subroutine implements a hybrid quasi-Newton optimization routine that uses least squares techniques to find the β parameter that generates quadratic data that tries to match the target statistics.

/* Define system of simultaneous equations: */
/* This function returns linear regression estimates (b0, b1) and correlation for a choice of beta */
start LinearFitCorr(beta) global(x);
   y2 = beta[1] + beta[2]*x + beta[3]*x##2;    /* construct quadratic Y */
   b = LinearReg(y2, x);      /* linear fit */
   corr = corr(y2||x)[2];     /* sample corr */
   return ( b` || corr);      /* return row vector */
finish;
/* This function returns the vector quantity (estimates - target).
   Find the beta that minimizes Sum_i | F_i(beta) - Target_i |^2 */
start Func(beta) global(targetB, targetCorr);
   target = rowvec(targetB) || targetCorr;
   G = LinearFitCorr(beta) - target;
   return( G );              /* return row vector */
finish;
/* now let's solve for quadratic parameters so that same 
   linear fit and correlation occur */
beta0 = {-5 1 -0.1};         /* initial guess */
con = {.  .  .,              /* constraint matrix: lower bounds (row 1), upper bounds (row 2) */
       .  .  0};             /* upper bound 0 on beta[3]: quadratic term is negative */
optn = ncol(beta0) || 0;     /* LS with 3 components || amount of printing */
/* minimize sum( beta[i] - target[i])**2 */
call nlphqn(rc, beta, "Func", beta0, optn) blc=con;  /* LS solution */
print beta[L="Optimal beta"];
/* How nearly does the solution solve the problem? Did we match the target values? */
Y2Stats = LinearFitCorr(beta);
print Y2Stats[c={'b0' 'b1' 'corr'} F=5.3];

The first output shows that the linear fit and correlation statistics for the linear and quadratic data are identical (to 3 decimal places). Anscombe would be proud! The second output shows the parameters for the quadratic response: Y2 = 4.955 + 2.566*X - 0.118*X2. The following statements create scatter plots of the new Anscombe-like data:

y2 = beta[1] + beta[2]*x + beta[3]*x##2;
create Anscombe2 var {x y y2};  append;  close;
ods layout gridded columns=2 advance=table;
proc sgplot data=Anscombe2 noautolegend;
   scatter x=X y=y;    lineparm x=0 y=3.561 slope=0.447 / clip;
run;
proc sgplot data=Anscombe2 noautolegend;
   scatter x=x y=y2;   lineparm x=0 y=3.561 slope=0.447 / clip;
run;
ods layout end;

Notice that the construction of the second data set depends only on the statistics for the first data set. If you modify the first data set, the second will automatically adapt. For example, you could choose the errors manually instead of randomly, and the statistics for the second data set should still match.
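
The whole matching procedure can also be prototyped in a few lines outside SAS. The following illustrative Python sketch (using scipy.optimize.least_squares as a stand-in for the NLPHQN subroutine; the data are simulated fresh, so the numbers differ from the SAS run) matches a downward-opening quadratic to the linear fit and correlation of simulated linear data:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(12345)
x = np.linspace(4, 14, 51)                        # evenly spaced X
y = 3 + 0.5 * x + rng.normal(0, 1, size=x.size)   # linear data with normal errors

def fit_stats(yy):
    """Return (b0, b1, corr) for the regression of yy on x."""
    b1, b0 = np.polyfit(x, yy, 1)
    return np.array([b0, b1, np.corrcoef(x, yy)[0, 1]])

target = fit_stats(y)                             # statistics to match

def resid(beta):
    y2 = beta[0] + beta[1] * x + beta[2] * x**2   # quadratic candidate
    return fit_stats(y2) - target

# bound the quadratic coefficient below 0 so the parabola opens down
sol = least_squares(resid, x0=[-5.0, 1.0, -0.1],
                    bounds=([-np.inf] * 3, [np.inf, np.inf, 0.0]))
print(sol.x)                                      # quadratic coefficients
```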

What about the other data sets?

You can create the other data sets similarly. For example:

  • For the data set that consists of points along a line except for one outlier, there are three free parameters. Most of the points fall along the line Y3 = a + b*X, and one point is at the height Y3 = c. Therefore, you can run an optimization to solve for the values of (a, b, c) that match the target statistics.
  • I'm not sure how to formulate the requirements for the fourth data set. It looks like all but one point have coordinates (X, Y4), where X is a fixed value and the vertical mean of the cluster is c. The outlier has coordinate (a, b). I'm not sure whether the variance of the cluster is important.


In summary, you can solve a system of equations to construct data similar to Anscombe's quartet. By using this technique, you can create your own data sets that share descriptive statistics but look very different graphically.

To be fair, the technique I've presented does not enable you to reproduce Anscombe's quartet in its entirety. My data share a linear fit and sample correlation, whereas Anscombe's data share seven statistics!

Anscombe was a pioneer (along with Tukey) in using computers to assist statistical analysis. He was also enamored with the APL language. He introduced computers into the Yale statistics department in the mid-1960s. Since he published his quartet in 1973, it is possible that he used computers to assist his creation of the Anscombe quartet. However he did it, Anscombe's quartet is truly remarkable!

You can download the SAS program that creates the results in this article.

The post Create your own version of Anscombe's quartet: Dissimilar data that have similar statistics appeared first on The DO Loop.