rick wicklin

4月 222019

Many statistical tests use a CUSUM statistic as part of the test. It can be confusing when a researcher refers to "the CUSUM test" without providing details about exactly which CUSUM test is being used. This article describes a CUSUM test for the randomness of a binary sequence. You start with a long sequence of binary values such as heads and tails from a coin toss. The test tries to determine whether the sample comes from a Bernoulli distribution with probability p=0.5. In short, is the binary sequence random?

The CUSUM test for randomness of a binary sequence is one of the NIST tests for verifying that a random or pseudorandom generator is generating bits that are indistinguishable from truly random bits (Rukin, et al., 2000, revised 2010, pp 2-31 through 2-33). The test is straightforward to implement. You first translate the data to {-1, +1} values. You then compute the cumulative sums of the sequence. If the sequence is random, the cumulative sum is equivalent to the position of a random walker who takes unit steps. If the sequence is random, the sums will not move away from 0 (the expected sum) too quickly. I've previously visualized the random walk with unit steps (sometimes called a "Drunkard's walk").

Before proceeding to the CUSUM test, I should mention that this test is often used in conjunction with other tests, such as the "runs test" for randomness. That is because a perfectly alternating sequence such as 0101010101... will pass the CUSUM test even though the sequence is clearly not randomly generated. In fact, any sequence that repeatedly has k zeros followed by k ones also passes the test, provided that k is small enough.

The CUSUM test for randomness of a binary sequence in SAS

The NIST report contains an example of calling the CUSUM test with a sequence of length N=100. The following SAS/IML statements define a sequence of {0, 1} values, convert those values to {-1, +1}, and plot the cumulative sums:

proc iml;
eps = {1 1 0 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 0 1
       1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 
       0 1 1 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 
       0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 1 1 0 
       0 1 1 0 0 0 1 0 1 0 0 0 1 0 1 1 1 0 0 0 }; 
x = 2*eps - 1;            /* convert to {-1, +1} sequence */
S = cusum(x);
title "Cumulative Sums of Sequence of {-1, +1} Values";
call series(1:ncol(S), S) option="markers" other="refline 0 / axis=y" label="Observation Number";
Plot of the cumulative sums of a binary sequence of values in {-1, +1}.

The sequence contains 58 values of one category and 42 values of the other. For a binomial distribution with p=0.5, the probability of a sample that has proportions at least this extreme is about 13%, as shown by the computation 2*cdf("Binomial", 42, 0.5, 100);. Consequently, the proportions are not unduly alarming. However, to test whether the sequence is random, you need to consider not only the proportion of values, but also the sequence. The graph of the cumulative sums of the {-1, +1} sequence shows a drift away from the line S=0, but it is not clear from the graph whether the deviation is more extreme than would be expected for a random sequence of this length.

The CUSUM test gives you a way to quantify whether the sequence is likely to have occurred as a random draw from a Bernoulli(p=0.5) distribution. The test statistic is the maximum deviation from 0. As you can see from the graph, the test statistic for this sequence is 16. The NIST paper provides a formula for the probability that a statistic at least this extreme occurs in a random sequence of length N=100. I implemented a (vectorized) version of the formula in SAS/IML.

/* NIST CUSUM test for randomness in a binary {-1, +1} sequence.
   Section 2.13.  Pages 2-31 through 2-33.
   INPUT: x is sequence of {-1, +1} values.      */
start BinaryCUSUMTest(x, PrintTable=1, alpha=0.01);
   S = colvec( cusum(x) );     /* cumulative sums */
   n = nrow(S);
   z = max(abs(S));            /* test statistic = maximum deviation */
   /* compute probability of this test statistic for a sequence of this length */
   zn = z/sqrt(n);
   kStart = int( (-n/z +1)/4 );
   kEnd   = int( ( n/z -1)/4 );
   k = kStart:kEnd;
   sum1 = sum( cdf("Normal", (4*k+1)*zn) - cdf("Normal", (4*k-1)*zn) );
   kStart = int( (-n/z -3)/4 );
   k = kStart:kEnd;
   sum2 = sum( cdf("Normal", (4*k+3)*zn) - cdf("Normal", (4*k+1)*zn) );
   pValue = 1 - sum1 + sum2;
   /* optional: print the test results in a nice human-readable format */
   cusumTest = z || pValue;
   if PrintTable then do;
      print cusumTest[L="Result of CUSUM Test" c={"Test Statistic" "p Value"}];
      labl= "H0: Sequence is a random binary sequence";
      if pValue <= alpha then 
         msg = "Reject H0 at alpha = " + char(alpha); /* sequence seems random */
         msg = "Do not reject H0 at alpha = " + char(alpha); /* sequence does not seem random */
      print msg[L=labl];
   return ( cusumTest );
/* call the function for the {-1, +1} sequence */
cusumTest = BinaryCUSUMTest(x);

According to the CUSUM test, there is not sufficient evidence to doubt that the sequence was generated from a random Bernoulli process.

A few comments about the program:

  • if you vectorize the computations, the CUSUM test requires only a few SAS/IML statements. Half of the function is dedicated to printing the results in a friendly format.
  • The computation of the p-value took me a while to puzzle over. The formulas in the NIST article did not write the INT() function for the limits of the summation. But the summation only makes sense when the index of summation (k) is an integer.
  • The significance value for the CUSUM test is usually chosen to be alpha = 0.01.
  • The CUSUM test depends only on the maximum deviation of the cumulative sums (the test statistic) and on the length of the sequence. For a sequence of length 100, the test statistic can be as large as 28 without rejecting the null hypothesis. If the statistic is 29 or larger, then the null hypothesis is rejected and we conclude that the sequence is not generated by a random process.

A neat thing about the CUSUM test is that you can compute the maximum test statistic based only on the sequence length. Thus if you plan to toss a coin 100 times to determine if it is fair, you can stop tossing (with 99% confidence) if the number of heads ever exceeds the number of tails by 29. Similarly, you can stop tossing if you know that the number of excess heads cannot possibly be 29 or greater. (For example, you've tossed 80 times and the current cumulative sum is 5.) You can apply the same argument to excess tails.

In summary, this article shows how to implement the CUSUM test for randomness of a binary sequence in SAS. Only a few lines of SAS/IML are required, and you can implement the test without using any loops. Be aware that the CUSUM test is not very powerful because regular sequences can pass the test. For example, the sequence 000111000111000111... has a maximum deviation of 3.

The post The CUSUM test for randomness of a binary sequence appeared first on The DO Loop.

4月 172019

I think every course in exploratory data analysis should begin by studying Anscombe's quartet. Anscombe's quartet is a set of four data sets (N=11) that have nearly identical descriptive statistics but graphical properties. They are a great reminder of why you should graph your data. You can read about Anscombe's quartet on Wikipedia, or check out a quick visual summary by my colleague Robert Allison. Anscombe's first two examples are shown below:

The Wikipedia article states, "It is not known how Anscombe created his data sets." Really? Creating different data sets that have the same statistics sounds like a fun challenge! As a tribute to Anscombe, I decided to generate my own versions of the two data sets shown in the previous scatter plots. The first data set is linear with normal errors. The second is quadratic (without errors) and has the exact same linear fit and correlation coefficient as the first data.

Generating your own version of Anscombe's data

The Wikipedia article notes that there are "several methods to generate similar data sets with identical statistics and dissimilar graphics," but I did not look at the modern papers. I wanted to try it on my own. If you want to solve the problem on your own, stop reading now!

I used the following approach to construct the first two data sets:

  1. Use a simulation to create linear data with random normal errors: Y1 = 3 + 0.5 X + ε, where ε ~ N(0,1).
  2. Compute the regression estimates (b0 and b1) and the sample correlation for the linear data. These are the target statistics. They define three equations that the second data set must match.
  3. The response variable for the second data set is of the form Y2 = β0 + β1 X + β2 X2. There are three equations and three unknowns parameters, so we can solve a system of nonlinear equations to find β.

From geometric reasoning, there are three different solution for the β parameter: One with β2 > 0 (a parabola that opens up), one with β2 = 0 (a straight line), and one with β2 < 0 (a parabola that opens down). Since Anscombe used a downward-pointing parabola, I will make the same choice.

Construct the first data set

You can construct the first data set in a number of ways, but I choose to construct it randomly. The following SAS/IML statements construct the data, defines a helper function (LinearReg). The program computes the target values, which are the parameter estimates for a linear regression and the sample correlation for the data:

proc iml;
call randseed(12345);
x = T( do(4, 14, 0.2) );                              /* evenly spaced X */
eps = round( randfun(nrow(x), "Normal", 0, 1), 0.01); /* normal error */
y = 3 + 0.5*x + eps;                                  /* linear Y + error */
/* Helper function. Return paremater estimates for linear regression. Args are col vectors */
start LinearReg(Y, tX);
   X = j(nrow(tX), 1, 1) || tX;
   b = solve(X`*X, X`*Y);       /* solve normal equation */
   return b;
targetB = LinearReg(y, x);          /* compute regression estimates */
targetCorr = corr(y||x)[2];         /* compute sample correlation */
print (targetB`||targetCorr)[c={'b0' 'b1' 'corr'} F=5.3 L="Target"];

You can use these values as the target values. The next step is to find a parameter vector β such that Y2 = β0 + β1 X + β2 X2 has the same regression line and corr(Y2, X) has the same sample correlation. For uniqueness, set β2 < 0.

Construct the second data set

You can formulate the problem as a system of equations and use the NLPHQN subroutine in SAS/IML to solve it. (SAS supports multiple ways to solve a system of equations.) The following SAS/IML statements define two functions. Given any value for the β parameter, the first function returns the regression estimates and sample correlation between Y2 and X. The second function is the objective function for an optimization. It subtracts the target values from the estimates. The NLPHQN subroutine implements a hybrid quasi-Newton optimization routine that uses least squares techniques to find the β parameter that generates quadratic data that tries to match the target statistics.

/* Define system of simultaneous equations:
   https://blogs.sas.com/content/iml/2018/02/28/solve-system-nonlinear-equations-sas.html */
/* This function returns linear regression estimates (b0, b1) and correlation for a choice of beta */
start LinearFitCorr(beta) global(x);
   y2 = beta[1] + beta[2]*x + beta[3]*x##2;    /* construct quadratic Y */
   b = LinearReg(y2, x);      /* linear fit */
   corr = corr(y2||x)[2];     /* sample corr */
   return ( b` || corr);      /* return row vector */
/* This function returns the vector quantity (beta - target). 
   Find value that minimizes Sum | F_i(beta)-Target+i |^2 */
start Func(beta) global(targetB, targetCorr);
   target = rowvec(targetB) || targetCorr;
   G = LinearFitCorr(beta) - target;
   return( G );              /* return row vector */
/* now let's solve for quadratic parameters so that same 
   linear fit and correlation occur */
beta0 = {-5 1 -0.1};         /* initial guess */
con = {.  .  .,              /* constraint matrix */
       0  .  0};             /* quadratic term is negative */
optn = ncol(beta0) || 0;     /* LS with 3 components || amount of printing */
/* minimize sum( beta[i] - target[i])**2 */
call nlphqn(rc, beta, "Func", beta0, optn) blc=con;  /* LS solution */
print beta[L="Optimal beta"];
/* How nearly does the solution solve the problem? Did we match the target values? */
Y2Stats = LinearFitCorr(beta);
print Y2Stats[c={'b0' 'b1' 'corr'} F=5.3];

The first output shows that the linear fit and correlation statistics for the linear and quadratic data are identical (to 3 decimal places). Anscombe would be proud! The second output shows the parameters for the quadratic response: Y2 = 4.955 + 2.566*X - 0.118*X2. The following statements create scatter plots of the new Anscombe-like data:

y2 = beta[1] + beta[2]*x + beta[3]*x##2;
create Anscombe2 var {x y y2};  append;  close;
ods layout gridded columns=2 advance=table;
proc sgplot data=Anscombe2 noautolegend;
   scatter x=X y=y;    lineparm x=0 y=3.561 slope=0.447 / clip;
proc sgplot data=Anscombe2 noautolegend;
   scatter x=x y=y2;   lineparm x=0 y=3.561 slope=0.447 / clip;
ods layout end;

Notice that the construction of the second data set depends only on the statistics for the first data. If you modify the first data set, and the second will automatically adapt. For example, you could choose the errors manually instead of randomly, and the statistics for the second data set should still match.

What about the other data sets?

You can create the other data sets similarly. For example:

  • For the data set that consist of points along a line except for one outlier, there are three free parameters. Most of the points fall along the line Y3 = a + b*X and one point is at the height Y3 = c. Therefore, you can run an optimization to solve for the values of (a, b, c) that match the target statistics.
  • I'm not sure how to formulate the requirements for the fourth data set. It looks like all but one point have coordinates (X, Y4), where X is a fixed value and the vertical mean of the cluster is c. The outlier has coordinate (a, b). I'm not sure whether the variance of the cluster is important.


In summary, you can solve a system of equations to construct data similar to Anscombe's quartet. By using this technique, you can create your own data sets that share descriptive statistics but look very different graphically.

To be fair, the technique I've presented does not enable you to reproduce Anscombe's quartet in its entirety. My data share a linear fit and sample correlation, whereas Anscombe's data share seven statistics!

Anscombe was a pioneer (along with Tukey) in using computation to assist in statistical computations. He was also enamored with the APL language. He introduced computers into the Yale statistics department in the mid-1960s. Since he published his quartet in 1973, it is possible that he used computers to assist his creation of the Anscombe quartet. However he did it, Anscombe's quartet is truly remarkable!

You can download the SAS program that creates the results in this article.

The post Create your own version of Anscombe's quartet: Dissimilar data that have similar statistics appeared first on The DO Loop.

4月 152019

A quadratic form is a second-degree polynomial that does not have any linear or constant terms. For multivariate polynomials, you can quickly evaluate a quadratic form by using the matrix expression
x` A x
This computation is straightforward in a matrix language such as SAS/IML. However, some computations in statistics require that you evaluate a quadratic form that looks like
x` A-1 x
where A is symmetric and positive definite. Typically, you know A, but not the inverse of A. This article shows how to compute both kinds of quadratic forms efficiently.

What is a quadratic form?

For multivariate polynomials, you can represent the purely quadratic terms by a symmetric matrix, A. The polynomial is q(x) = x` A x, where x is an d x 1 vector and A is a d x d symmetric matrix. For example, if A is the matrix {9 -2, -2 5} and x = {x1, x2} is a column vector, the expression x` A x equals the second degree polynomial q(x1, x2) = 9*x12 - 4*x1 x2 + 5*x22. A contour plot of this polynomial is shown below.

Contour plot of the quadratic form x` A x for A = {9 -2, -2 5}

Probably the most familiar quadratic form is the squared Euclidean distance. If you let A be the d-dimensional identity matrix, then the squared Euclidean distance of a vector x from the origin is x` Id x = x` x = Σi xi2. You can obtain a weighted squared distance by using a diagonal matrix that has positive values. For example, if W = diag({2, 4}), then x` W x = 2*x12 + 4*x22. If you add in off-diagonal elements, you get cross terms in the polynomial.

Efficient evaluation of a quadratic form

If the matrix A is dense, then you can use matrix multiplication to evaluate the quadratic form. The following symmetric 3 x 3 matrix defines a quadratic polynomial in 3 variables. The multiplication evaluates the polynomial at (x1, x2, x3) = (-1. 2. 0.5).

proc iml;
   q(x1, x2, x3) = 4*x1**2 + 6*x2**2 + 9*x3**2 +
                   2*3*x1*x2 + 2*2*x1*x3 + 2*1*x2*x3
A = {4 3 2,
     3 6 1,
     2 1 9};
x = {-1, 2, 0.5};
q1 = x`*A*x;
print q1;

When you are dealing with large matrices, always remember that you should never explicitly form a large diagonal matrix. Multiplying with a large diagonal matrix is a waste of time and memory. Instead, you can use elementwise multiplication to evaluate the quadratic form more efficiently:

w = {4, 6, 9};
q2 = x`*(w#x);     /* more efficient than q = x`*diag(w)*x; */
print q2;

Quadratic forms with positive definite matrices

In statistics, the matrix is often symmetric positive definite (SPD). The matrix A might be a covariance matrix (for a nondegenerate system), the inverse of a covariance matrix, or the Hessian evaluated at the minimum of a function. (Recall that the inverse of a symmetric positive definite (SPD) matrix is SPD.) An important example is the squared Mahalanobis distance x` Σ-1 x, which is a quadratic form.

As I have previously written, you can use a trick in linear algebra to efficiently compute the Mahalanobis distance. The trick is to compute the Cholesky decomposition of the SPD matrix. I'll use a large Toeplitz matrix, which is guaranteed to be symmetric and positive definite, to demonstrate the technique. The function EvalSPDQuadForm evaluates a quadratic form defined by the SPD matrix A at the coordinates given by x:

/* Evaluate quadratic form q = x`*A*x, where A is symmetric positive definite.
   Let A = U`*U be the Cholesky decomposition of A.
   Then q = x`(U`*U)*x = (U*x)`(Ux)
   So define z = U*x and compute q = z`*z
start EvalSPDQuadForm(A, x);
   U = root(A);           /* Cholesky root */
   z = U*x;
   return (z` * z);       /* dot product of vectors */
/* Run on large example */
call randseed(123);
N = 1000;
A = toeplitz(N:1);
x = randfun(N, "Normal");
q3 = EvalSPDQuadForm(A, x);  /* efficient */
qCheck = x`*A*x;             /* check computation by using a slower method */
print q3 qCheck;

You can use a similar trick to evaluate the quadratic form x` A-1 x. I previously used this trick to evaluate the Mahalanobis distance efficiently. It combines a Cholesky decomposition (the ROOT function in SAS/IML) and the TRISOLV function for solving triangular systems.

/* Evaluate quadratic form x`*inv(A)*x, where  A is symmetric positive definite.
   Let w be the solution of A*w = x and A=U`U be the Cholesky decomposition.
   To compute  q = x` * inv(A) * x = x` * w, you need to solve for w.
   (U`U)*w = x, so
        First solve triangular system U` z = x   for z,
        then solve triangular system  U w = z   for w 
start EvalInvQuadForm(A, x);
   U = root(A);              /* Cholesky root */
   z = trisolv(2, U, x);     /* solve linear system */
   w = trisolv(1, U, z);  
   return (x` * w);          /* dot product of vectors */
/* test the function */
q4 = EvalInvQuadForm(A, x);  /* efficient */
qCheck = x`*Solve(A,x);      /* check computation by using a slower method */
print q4 qCheck;

You might wonder how much time is saved by using the Cholesky root and triangular solvers, rather than by computing the general solution by using the SOLVE function. The following graph compares the average time to evaluate the same quadratic form by using both methods. You can see that for large matrices, the ROOT-TRISOLV method is many times faster than the straightforward method that uses SOLVE.

In summary, you can use several tricks in numerical linear algebra to efficiently evaluate a quadratic form, which is a multivariate quadratic polynomial that contains only second-degree terms. These techniques can greatly improve the speed of your computational programs.

The post Efficient evaluation of a quadratic form appeared first on The DO Loop.

4月 102019

In numerical linear algebra, there are often multiple ways to solve a problem, and each way is useful in various contexts. In fact, one of the challenges in matrix computations is choosing from among different algorithms, which often vary in their use of memory, data access, and speed. This article describes four ways to perform the sum of squares and crossproducts matrix, which is usually abbreviated as the SSCP matrix. The SSCP matrix is an essential matrix in ordinary least squares (OLS) regression. The normal equations for OLS are written as (X`*X)*b = X`*Y, where X is a design matrix, Y is the vector of observed responses, and b is the vector of parameter estimates, which must be computed. The X`*X matrix (pronounced "X-prime-X") is the SSCP matrix and the topic of this article.

If you are performing a least squares regressions of "long" data (many observations but few variables), forming the SSCP matrix consumes most of the computational effort. In fact, the PROC REG documentation points out that for long data "you can save 99% of the CPU time by reusing the SSCP matrix rather than recomputing it." That is because the SSCP matrix for an n x p data matrix is a symmetric p x p matrix. When n ≫ p, forming the SSCP matrix requires computing with a lot of rows. After it is formed, it is relatively simple to solve a p x p linear system.

SSCP as a matrix computation

Conceptually, the simplest way to compute the SSCP matrix is by multiplying the matrices in an efficient manner. This is what the SAS/IML matrix language does. It recognizes that X`*X is a special kind of matrix multiplication and uses an efficient algorithm to form the product. However, this approach requires that you be able to hold the entire data matrix in RAM, which might not be possible when you have billions of rows.

The following SAS DATA Step creates a data matrix that contains 426 rows and 10 variables. The PROC IML step reads the data into a matrix and forms the SSCP matrix:

/* remove any rows that contain a missing value:
   https://blogs.sas.com/content/iml/2015/02/23/complete-cases.html */
data cars;
set sashelp.cars;
if not nmiss(of _numeric_);
proc iml;
use cars;  read all var _NUM_ into X[c=varNames]; close;
/* If you want, you can add an intercept column: X = j(nrow(X),1,1) || X; */
n = nrow(X);
p = ncol(X);
SSCP = X`*X;         /* 1. Matrix multiplication */
print n p, SSCP;

Although the data has 426 rows, it only has 10 variables. Consequentially, the SSCP matrix is a small 10 x 10 symmetric matrix. (To simplify the code, I did not add an intercept column, so technically this SSCP matrix would be used for a no-intercept regression.)

SSCP as an array of inner product computations

The (i,j)th element of the SSCP matrix is the inner product of the i_th column and the j_th column. In general, Level 1 BLAS computations (inner products) are not as efficient as Level 3 computations (matrix products). However, if you have the data variables (columns) in an array of vectors, you can compute the p(p+1)/2 elements of the SSCP matrix by using the following loops over columns. This computation assumes that you can hold at least two columns in memory at the same time:

/* 2. Compute SSCP[i,j] as an inner-product of i_th and j_th columns */
Y = j(p, p, .);
do i = 1 to p;
   u = X[,i];             /* u = i_th column */
   do j = 1 to i;
      v = X[,j];          /* v = j_th column */
      Y[i,j] = u` * v;   
      Y[j,i] = Y[i,j];    /* assign symmetric element */

SSCP as a sum of outer products of rows

The third approach is to compute the SSCP matrix as a sum of outer products of rows. Before I came to SAS, I considered the outer-product method to be inferior to the other two. After all, you need to form n matrices (each p x p) and add these matrices together. This did not seem like an efficient scheme. However, when I came to SAS I learned that this method is extremely efficient for dealing with Big Data because you never need to keep more than one row of data into memory! A SAS procedure like PROC REG has to read the data anyway, so as it reads each row, it also forms outer product and updates the SSCP. When it finishes reading the data, the SSCP is fully formed and ready to solve!

I've recently been working on parallel processing, and the outer-product SSCP is ideally suited for reading and processing data in parallel. Suppose you have a grid of G computational nodes, each holding part of a massive data set. If you want to perform a linear regression on the data, each node can read its local data and form the corresponding SSCP matrix. To get the full SSCP matrix, you merely need to add the G SSCP matrices together, which are relatively small and thus cheap to pass between nodes. Consequently, any algorithm that uses the SSCP matrix can greatly benefit from a parallel environment when operating on Big Data. You can also use this scheme for streaming data.

For completeness, here is what the outer-product method looks like in SAS/IML:

/* 3. Compute SSCP as a sum of rank-1 outer products of rows */
Z = j(p, p, 0);
do i = 1 to n;
   w = X[i,];           /* Note: you could read the i_th row; no need to have all rows in memory */
   Z = Z + w` * w;

For simplicity, the previous algorithm works on one row at a time. However, it can be more efficient to process multiple rows. You can easily buffer a block of k rows and perform an outer product of the partial data matrix. The value of k depends on the number of variables in the data, but typically the block size, k, is dozens or hundreds. In a procedure that reads a data set (either serially or in parallel), each operation would read k observations except, possibly, the last block, which would read the remaining observations. The following SAS/IML statements loop over blocks of k=10 observations at a time. You can use the FLOOR-MOD trick to find the total number of blocks to process, assuming you know the total number of observations:

/* 4. Compute SSCP as the sum of rank-k outer products of k rows */
/* number of blocks: https://blogs.sas.com/content/iml/2019/04/08/floor-mod-trick-items-to-groups.html */
k = 10;                                /* block size */
numBlocks = floor(n / k) + (mod(n, k) > 0);  /* FLOOR-MOD trick */
W = j(p, p, 0);
do i = 1 to numBlocks;
   idx = (1 + k*(i-1)) : min(k*i, n);  /* indices of the next block of rows to process */
   A = X[idx,];                        /* partial data matrix: k x p */
   W = W + A` * A;

All computations result in the same SSCP matrix. The following statements compute the sum of squares of the differences between elements of X`*X (as computed by using matrix multiplication) and the other methods. The differences are zero, up to machine precision.

diff = ssq(SSCP - Y) || ssq(SSCP - Z) || ssq(SSCP - W);
if max(diff) < 1e-12 then 
   print "The SSCP matrices are equivalent.";
print diff[c={Y Z W}];


In summary, there are several ways to compute a sum of squares and crossproducts (SSCP) matrix. If you can hold the entire data in memory, a matrix multiplication is very efficient. If you can hold two variables of the data at a time, you can use the inner-product method to compute individual cells of the SSCP. Lastly, you can process one row at a time (or a block of rows) and use outer products to form the SSCP matrix without ever having to hold large amounts of data in RAM. This last method is good for Big Data, streaming data, and parallel processing of data.

The post 4 ways to compute an SSCP matrix appeared first on The DO Loop.

4月 082019

Suppose you need to assign 100 patients equally among 3 treatment groups in a clinical study. Obviously, an equal allocation is impossible because the second number does not evenly divide the first, but you can get close by assigning 34 patients to one group and 33 to the others. Mathematically, this is a grade-school problem in integer division: simply assign floor(100/3) patients to each group, then deal with the remainders. The FLOOR function rounds numbers down.

The problem of allocating a discrete number of items to groups comes up often in computer programming. I call the solution the FLOOR-MOD trick because you can use the FLOOR function to compute the base number in each group and use the MOD function to compute the number of remainders. Although the problem is elementary, this article describes two programming tips that you can use in a vectorized computer language like SAS/IML:

  • You can compute all the group sizes in one statement. You do not need to use a loop to assign the remaining items.
  • If you enumerate the patients, you can directly assign consecutive numbers to each group. There is no need to deal with the remainders at the end of the process.

Assign items to groups, patients to treatments, or tasks to workers

Although I use patients and treatment groups for the example, I encountered this problem as part of a parallel programming problem where I wanted to assign B tasks evenly among k computational resources. In my example, there were B = 14 tasks and k = 3 resources.

Most computer languages support the FLOOR function for integer division and the MOD function for computing the remainder. The FLOOR-MOD trick works like this. If you want to divide B items into k groups, let A be the integer part of B / k and let C be the remainder, C = B - A*k. Then B = A*k + C, where C < k. In computer code, A = FLOOR(B/k) and C = MOD(B, k).

There are many ways to distribute the remaining items, but for simplicity let's give an extra item to each of the first C groups. For example, if B = 14 and k = 3, then A = floor(14/3) = 4 and C = 2. To allocate 14 items to 3 groups, give 4 items to all groups, and give an extra item to the first C=2 groups.

In a vector programming language, you can assign the items to each group without doing any looping, as shown in the following SAS/IML program:

/* Suppose you have B tasks that you want to divide as evenly as possible 
   among k Groups. How many tasks should you assign to the i_th group? */
proc iml;
/* B = total number of items or tasks (B >= 0, scalar)
   k = total number of groups or workers (k > 0, scalar)
   i = scalar or vector that specifies the group(s), max(i)<=k
   Return the number of items to allocate to the i_th group */
start AssignItems(B, k, i);
   n = floor(B / k) + (i <= mod(B, k));     /* the FLOOR-MOD trick */
/* Test the code. Assign 14 tasks to 3 Groups. */
nItems = 14;
nGroups = 3;
idx = T(1:nGroups);          /* get number of items for all groups */
Group = char(idx);           /* ID label for each group */
n = AssignItems(nItems, nGroups, idx);
print n[c={"Num Items"} r=Group L="Items per Group"];

The AssignItems function is a "one-liner." The interesting part of the AssignItems function is the binary expression i <= mod(B, k), which is valid even when i is a vector. In this example, the expression evaluates to the vector {1, 1, 0}, which assigns an extra item to each of the first two groups.

Which items are assigned to each group?

A related problem is figuring out which items get sent to which groups. In the example where B=14 and k=3, I want to put items 1-5 in the first group, items 6-10 in the second group, and items 11-14 in the last group. There is a cool programming trick, called the CUSUM-LAG trick, which enables you to find these indices. The following function is copied from my article on the CUSUM-LAG trick. After you find the number of items in each group, you can use the ByGroupIndices function to find the item numbers in each group:

/* Return kx2 matrix that contains the first and last elements for k groups that have sizes 
    s[1], s[2],...,s[k]. The i_th row contains the first and last index for the i_th group. */
start ByGroupIndices( s );
   size = colvec(s);              /* make sure you have a column vector */
   endIdx = cusum(size);          /* locations of ending index */
   beginIdx = 1 + lag(endIdx);    /* begin after each ending index ... */
   beginIdx[1] = 1;               /*    ...and at 1  */
   return ( beginIdx || endIdx );
/* apply the CUSUM-LAG trick to the item allocation problem */
GroupInfo = n || ByGroupIndices(n);
print GroupInfo[r=Group c={"NumItems" "FirstItem" "LastItem"}];

The table shows the number of items allocated to each group, as well as the item indices in each group. You can use the FLOOR-MOD trick to get the number of items and the CUSUM-LAG trick to get the indices. In a vector language, you can implement the entire computation without using any loops.

The post Use the FLOOR-MOD trick to allocate items to groups appeared first on The DO Loop.

4月 032019

I've previously written about how to deal with nonconvergence when fitting generalized linear regression models. Most generalized linear and mixed models use an iterative optimization process, such as maximum likelihood estimation, to fit parameters. The optimization might not converge, either because the initial guess is poor or because the model is not a good fit to the data. SAS regression procedures for which this might happen include PROC LOGISTIC, GENMOD, MIXED, GLMMIX, and NLMIXED.

For mixed models, several problems can occur if you have a misspecified model. One issue results in the following note in the SAS log:
NOTE: Estimated G matrix is not positive definite.
This article describes what the note means, why it might occur, and what to do about it. If you encounter this note during a BY-group analysis or simulation study, this article shows how to identify the samples that have the problem.

There are two excellent references that discuss issues related to convergence in the SAS mixed model procedures:

What is the G matrix?

Before we discuss convergence, let's review what the G matrix is and why it needs to be positive definite. The matrix formulation of a mixed model is
Y = X β + Z γ + ε
where β is a vector of fixed-effect parameters. The random effects are assumed to be random realizations from multivariate normal distributions. In particular, γ ~ MVN(0, G) and ε ~ MVN(0, R), where G and R are covariance matrices. The variance-covariance matrix G is often used to specify subject-specific effects, whereas R specifies residual effects. A goal of mixed models is to specify the structure of the G and/or R matrices and estimate the variance-covariance parameters.

Because G is a covariance matrix, G must be positive semidefinite. A nondegenerate covariance matrix will be fully positive definite. However, estimates of G might not have this property. SAS alerts you if the estimate is not positive definite.

As stated in Kiernan (2018, p. ), "It is important that you do not ignore this message."

Reasons the estimated G matrix is not positive definite

A SAS Usage Note states that the PROC MIXED message means that "one or more variance components on the RANDOM statement is/are estimated to be zero and could/should be removed from the model." Kiernan, Tao, and Gibbs (2012) and Kiernan (2018) describe several reasons why an estimated G matrix can fail to be positive definite. Two of the more common reasons include:

  • There is not enough variation in the response. After controlling for the fixed effects, there isn't any (or much) variation for the random effects. You might want to remove the random effect from the model. (KTG, p. 9)
  • The model is misspecified. You might need to modify the model or change the covariance structure to use fewer parameters. (Kiernan, p. 18)

Convergence issues in simulation studies or BY-group analyses

If you encounter the note "Estimated G matrix is not positive definite" for real data, you should modify the model or collect more data. In a simulation study, however, there might be simulated samples that do not fit the model even when the data is generated from the model! This can happen for very small data sets and for studies in which the variance components are very small. In either case, you might want to identify the samples for which the model fails to converge, either to exclude them from the analysis or to analyze them by using a different model. The same situation can occur for few BY groups during a traditional BY-group analysis.

The following SAS program illustrates how to detect the samples for which the estimated G matrix is not positive definite. The 'SP' data are from an example in the PROC MIXED documentation. The data set named 'ByData' is constructed for this blog post. It contains four copies of the real data, along with an ID variable with values 1, 2, 3, and 4. For the first copy, the response variable is set to 0, which means that there is no variance in the response. For the third copy, the response variable is simulated from a normal distribution. It has variance, but does not conform to the model. The call to PROC MIXED fits the same random-effects model to all four samples:

/* Example from PROC MIXED documentation: https://bit.ly/2YEmpGw  */
data sp;
input Block A B Y @@;
1 1 1  56  1 1 2  41  1 2 1  50  1 2 2  36  1 3 1  39  1 3 2  35
2 1 1  30  2 1 2  25  2 2 1  36  2 2 2  28  2 3 1  33  2 3 2  30
3 1 1  32  3 1 2  24  3 2 1  31  3 2 2  27  3 3 1  15  3 3 2  19
4 1 1  30  4 1 2  25  4 2 1  35  4 2 2  30  4 3 1  17  4 3 2  18
/* concatenate four copies of the data, but modify Y for two copies */
data ByData;
call streaminit(1);
set sp(in=a1) sp(in=a2) sp(in=a3) sp(in=a4);
SampleID = a1 + 2*a2 + 3*a3 + 4*a4;
if a1 then Y = 0;                         /* response is constant (no variation) */
if a3 then Y = rand("Normal", 31, 10);    /* response is indep of factors */
/* suppress ODS output: https://blogs.sas.com/content/iml/2013/05/24/turn-off-ods-for-simulations.html */
%macro ODSOff(); ods graphics off; ods exclude all; ods noresults;  %mend;
%macro ODSOn();  ods graphics on; ods exclude none; ods results;    %mend;
proc mixed data=ByData;
   by sampleID;
   class A B Block;
   model Y = A B A*B;
   random Block A*Block;
   ods output ConvergenceStatus=ConvergeStatus;

The SAS log displays various notes about convergence. The second and fourth samples (the doc example) converged without difficulty. The log shows a WARNING for the first group. The third group displays the note Estimated G matrix is not positive definite.

NOTE: An infinite likelihood is assumed in iteration 0 because of a nonpositive residual variance estimate.
WARNING: Stopped because of infinite likelihood.
NOTE: The above message was for the following BY group:
NOTE: Convergence criteria met.
NOTE: The above message was for the following BY group:
NOTE: Convergence criteria met.
NOTE: Estimated G matrix is not positive definite.
NOTE: The above message was for the following BY group:
NOTE: Convergence criteria met.
NOTE: The above message was for the following BY group:

Notice that the ODS OUTPUT statement writes the ConvergenceStatus table for each BY group to a SAS data set called ConvergeStatus. For each BY group, the ConvergenceStatus table is appended to a SAS data set called ConvergeStatus. The result of the PROC PRINT statement shows the columns of the table for each of the four groups:

proc print data=ConvergeStatus;  run;
Convergence status table in PROC MIXED, which shows which BY groups had convergence or modeling problems

The SampleID column is the value of the BY-group variable. The Reason column explains why the optimization process terminated. The Status column is 0 if the model converged and nonzero otherwise. Note that the third model converged, even though the G matrix was not positive definite!

To detect nonpositive definite matrices, you need to look at the pdG column, The pdG indicates which models had a positive definite G matrix (pdG=1) or did not (pdG=0). Although I do not discuss it in this article, the pdH column is an indicator variable that has value 0 if the SAS log displays the message NOTE: Convergence criteria met but final hessian is not positive definite.

The pdG column tells you which models did not have a positive definite variance matrix. You can merge the ConverenceStatus table with the original data and exclude (or keep) the samples that did not converge or that had invalid variance estimates, as shown in the following DATA step:

/* keep samples that converge and for which gradient and Hessian are positive definite */
data GoodSamples;
merge ByData ConvergeStatus;
by SampleID;
if Status=0 & pdG & pdH;

If you run PROC MIXED on the samples in the GoodSamples data set, all samples will converge without any scary-looking notes.

In summary, this article discusses the note Estimated G matrix is not positive definite, which sometimes occurs when using PROC MIXED or PROC GLMMIX in SAS. You should not ignore this note. This note can indicate that the model is misspecified. This article shows how you can detect this problem programmatically during a simulation study or a BY-group analysis. You can use the ConvergenceStatus table to identify the BY groups for which this the problem occurs. Kiernan, Tao, and Gibbs (2012) and Kiernan (2018) provide more insight into this note and other problems that you might encounter when fitting mixed models.

The post Convergence in mixed models: When the estimated G matrix is not positive definite appeared first on The DO Loop.

4月 012019

Many SAS procedures support the BY statement, which enables you to perform an analysis for subgroups of the data set. Although the SAS/IML language does not have a built-in "BY statement," there are various techniques that enable you to perform a BY-group analysis. The two I use most often are the UNIQUE-LOC technique and the UNIQUEBY technique. The first is more intuitive, the second is more efficient. This article shows how to use SAS/IML to read and process BY-group data from a data set.

I previously showed that you can perform BY-group processing in SAS/IML by using the UNIQUEBY technique, so this article uses the UNIQUE-LOC technique. The statistical application is simulating clusters of data. If you have a SAS data set that contains the centers and covariance matrices for several groups of observations, you can then read that information into SAS/IML and simulate new observations for each group by using a multivariate normal distribution.

Matrix operations and BY groups

A BY-group analysis in SAS/IML usually starts with a SAS data set that contains a bunch of covariance or correlation matrices. A simple example is a correlation analysis of each species of flower in Fisher's iris data set. The BY-group variable is the species of iris: Setosa, Versicolor, or Virginica. The variables are measurements (in mm) of the sepals and petals of 150 flowers, 50 from each species. A panel of scatter plots for the iris data is shown to the right. You can see that the three species appear to be clustered. From the shapes of the clusters, you might decide to model each cluster by using a multivariate normal distribution.

You can use the OUTP= and COV options in PROC CORR to output mean and covariance statistics for each group, as follows:

proc corr data=sashelp.iris outp=CorrOut COV noprint;
   by Species;
   var Petal: Sepal:;
proc print data=CorrOut(where=(_TYPE_ in ("N", "MEAN", "COV"))) noobs;
   where Species="Setosa";  /* just view information for one group */
   by Species  _Type_ notsorted;
   var _NAME_ Petal: Sepal:;

The statistics for one of the groups (Species='Setosa') are shown. The number of observations in the group (N) is actually a scalar value, but it was replicated to fit into a rectangular data set.

Reading BY-group information into SAS/IML

This section reads the sample size, mean vector, and covariance matrix for all groups. A WHERE clause selects only the observations of interest:

/* Read in N, Mean, and Cov for each species. Use to create a parametric bootstrap 
   by simulating N[i] observations from a MVN(Mean[i], Cov[i]) distribution */
proc iml;
varNames = {'PetalLength' 'PetalWidth' 'SepalLength' 'SepalWidth'};
use CorrOut where (_TYPE_="N" & Species^=" ");       /* N */
read all var varNames into mN[rowname=Species];      /* read for all groups */
print mN[c=varNames];
use CorrOut where (_TYPE_="MEAN" & Species^=" ");    /* mean */
read all var varNames into mMean[rowname=Species];   /* read for all groups */
print mMean[c=varNames];
use CorrOut where (_TYPE_="COV" & Species^=" ");     /* covariance */
read all var varNames into mCov[rowname=Species];    /* read for all groups */
print mCov[c=varNames];

The output (not shown) shows that the matrices mN, mMean, and mCov contain the vertical concatenation (for all groups) of the sample size, mean vectors, and covariance matrices, respectively.

The grouping variable is Species. You can use the UNIQUE function to get the unique (sorted) values of the groups. You can then iterate over the unique values and use the LOC function to extract only the rows of the matrices that correspond to the ith group. What you do with that information depends on your application. In the following program, the information for each group is used to create a random sample from a multivariate normal distribution that has the same size, mean, and covariance as the ith group:

/* Goal: Write random MVN values in X to data set */
X = j(1, ncol(varNames), .);         /* X variables will be numeric */
Spec = BlankStr(nleng(Species));     /* and Spec will be character */
create SimOut from X[rowname=Spec colname=varNames]; /* open for writing */
/* The BY-group variable is Species */
call randseed(12345);
u = unique(Species);                 /* get unique values for groups */
do i = 1 to ncol(u);                 /* iterate over unique values */
   idx = loc(Species=u[i]);          /* rows for the i_th group */
   N = mN[i, 1];                     /* extract scalar from i_th group */
   mu = mMean[i,];                   /* extract vector from i_th group */
   Sigma = mCov[idx,];               /* extract matrix from i_th group */
   /* The parameters for this group are now in N, mu, and Sigma.
      Do something with these values. */
   X = RandNormal(N, mu, Sigma);     /* simulate obs for the i_th group */
   X = round(X);                     /* measure to nearest mm */
   Spec = j(N, 1, u[i]);             /* ID for this sample */
   append from X[rowname=Spec];
close SimOut;
ods graphics / attrpriority=none;
title "Parametric Bootstrap Simulation of Iris Data"; 
proc sgscatter data=SimOut(rename=(Spec=Species));
   matrix Petal: Sepal: / group=Species;

The simulation has generated a new set of clustered data. If you compare the simulated data with the original, you will notice many statistical similarities.

Although the main purpose of this article is to discuss BY-group processing in SAS/IML, I want to point out that the simulation in this article is an example of the parametric bootstrap. Simulating Data with SAS (Wicklin, 2013) states that "the parametric bootstrap is nothing more than the process of fitting a model distribution to the data and simulating data from the fitted model." That is what happens in this program. The sample means and covariance matrices are used as parameters to generate new synthetic observations. Thus, the parametric bootstrap technique is really a form of simulation where the parameters for the simulation are obtained from the data.

In conclusion, sometimes you have many matrices in a SAS data set, each identified by a categorical variable. You can perform "BY-group processing" in SAS/IML by reading in all the matrices into a big matrix and then use the UNIQUE-LOC technique to iterate over each matrix.

The post Matrix operations and BY groups appeared first on The DO Loop.

3月 272019

In simulation studies, sometimes you need to simulate outliers. For example, in a simulation study of regression techniques, you might want to generate outliers in the explanatory variables to see how the technique handles high-leverage points. This article shows how to generate outliers in multivariate normal data that are a specified distance from the center of the distribution. In particular, you can use this technique to generate "regular outliers" or "extreme outliers."

When you simulate data, you know the data-generating distribution. In general, an outlier is an observation that has a small probability of being randomly generated. Consequently, outliers are usually generated by using a different distribution (called the contaminating distribution) or by using knowledge of the data-generating distribution to construct improbable data values.

The geometry of multivariate normal data

A canonical example of adding an improbable value is to add an outlier to normal data by creating a data point that is k standard deviations from the mean. (For example, k = 5, 6, or 10.) For multivaraite data, an outlier does not always have extreme values in all coordinates. However, the idea of an outliers as "far from the center" does generalize to higher dimensions. The following technique generates outliers for multivariate normal data:

  1. Generate uncorrelated, standardized, normal data.
  2. Generate outliers by adding points whose distance to the origin is k units. Because you are using standardized coordinates, one unit equals one standard deviation.
  3. Use a linear transformation (called the Cholesky transformation) to produce multivariate normal data with a desired correlation structure.

In short, you use the Cholesky transformation to transform standardized uncorrelated data into scaled correlated data. The remarkable thing is that you can specify the covariance and then directly compute the Cholesky transformation that will result in that covariance. This is a special property of the multivariate normal distribution. For a discussion about how to perform similar computations for other multivariate distributions, see Chapter 9 in Simulating Data with SAS (Wicklin 2013).

Let's illustrate this method by using a two-dimensional example. The following SAS/IML program generates 200 standardized uncorrelated data points. In the standardized coordinate system, the Euclidean distance and the Mahalanobis distance are the same. It is therefore easy to generate an outlier: you can generate any point whose distance to the origin is larger than some cutoff value. The following program generates outliers that are 5 units from the origin:

ods graphics / width=400px height=400px attrpriority=none;
proc iml;
/* generate standardized uncorrelated data */
call randseed(123);
N = 200;
x = randfun(N//2, "Normal"); /* 2-dimensional data. X ~ MVN(0, I(2)) */
/* covariance matrix is product of diagonal (scaling) and correlation matrices.
   See https://blogs.sas.com/content/iml/2010/12/10/converting-between-correlation-and-covariance-matrices.html
R = {1   0.9,             /* correlation matrix */
     0.9 1  };
D = {4 9};                /* standard deviations */
Sigma = corr2cov(R, D);   /* Covariance matrix Sigma = D*R*D */
print Sigma;
/* U is the Cholesky transformation that scales and correlates the data */
U = root(Sigma);
/* add a few unusual points (outliers) */
pi = constant('pi');
t = T(0:11) * pi / 6;            /* evenly spaced points in [0, 2p) */
outliers = 5#cos(t) || 5#sin(t); /* evenly spaced on circle r=5 */
v = x // outliers;               /* concatenate MVN data and outliers */
w = v*U;                         /* transform from stdized to data coords */
labl = j(N,1," ") // T('0':'11');
Type = j(N,1,"Normal") // j(12,1,"Outliers");
title "Markers on Circle r=5";
title2 "Evenly spaced angles t[i] = i*pi/6";
call scatter(w[,1], w[,2]) grid={x y} datalabel=labl group=Type;

The program simulates standardized uncorrelated data (X ~ MVN(0, I(2))) and then appends 12 points on the circle of radius 5. The Cholesky transformation correlates the data. The correlated data are shown in the scatter plot. The circle of outliers has been transformed into an ellipse of outliers. The original outliers are 5 Euclidean units from the origin; the transformed outliers are 5 units of Mahalanobis distance from the origin, as shown by the following computation:

center = {0 0};
MD = mahalanobis(w, center, Sigma);     /* compute MD based on MVN(0, Sigma) */
obsNum = 1:nrow(w);
title "Mahalanobis Distance of MVN Data and Outliers";
call scatter(obsNum, MD) grid={x y} group=Type;

Adding outliers at different Mahalanobis distances

The same technique enables you to add outliers at any distance to the origin. For example, the following modification of the program adds outliers that are 5, 6, and 10 units from the origin. The t parameter determines the angular position of the outlier on the circle:

MD = {5, 6, 10};               /* MD for outlier */
t = {0, 3, 10} * pi / 6;       /* angular positions */
outliers = MD#cos(t) ||  MD#sin(t);
v = x // outliers;             /* concatenate MVN data and outliers */
w = v*U;                       /* transform from stdized to data coords */
labl = j(N,1," ") // {'p1','p2','p3'};
Type = j(N,1,"Normal") // j(nrow(t),1,"Outlier");
call scatter(w[,1], w[,2]) grid={x y} datalabel=labl group=Type;

If you use the Euclidean distance, the second and third outliers (p2 and p3) are closer to the center of the data than p1. However, if you draw the probability ellipses for these data, you will see that p1 is more probable than p2 and p3. This is why the Mahalanobis distance is used for measuring how extreme an outlier is. The Mahalanobis distances of p1, p2, and p3 are 5, 6, and 10, respectively.

Adding outliers in higher dimensions

In two dimensions, you can use the formula (x(t), y(t)) = r*(cos(t), sin(t)) to specify an observation that is r units from the origin. If t is chosen randomly, you can obtain a random point on the circle of radius r. In higher dimensions, it is not easy to parameterize a sphere, which is the surface of an d-dimensional ball. Nevertheless, you can still generate random points on a d-dimensional sphere of radius r by doing the following:

  1. Generate a point from the d-dimensional multivariate normal distribution: Y = MVN(0, I(d)).
  2. Project the point onto the surface of the d-dimensional sphere of radius r: Z = r*Y / ||Y||.
  3. Use the Cholesky transformation to correlate the variables.

In the SAS/IML language, it would look like this:

/* General technique to generate random outliers at distance r */
d = 2;                              /* use any value of d */
MD = {5, 6, 10};                    /* MD for outlier */
Y = randfun(nrow(MD)//d, "Normal"); /* d-dimensional Y ~ MVN(0, I(d)) */
outliers = MD # Y / sqrt(Y[,##]);   /* surface of spheres of radii MD */
/* then append to data, use Cholesky transform, etc. */

For these particular random outliers, the largest outliers (MD=10) is in the upper right corner. The others are in the lower left corner. If you change the random number seed in the program, the outliers will appear in different locations.

In summary, by understanding the geometry of multivariate normal data and the Cholesky transformation, you can manufacture outliers in specific locations or in random locations. In either case, you can control whether you are adding a "near outlier" or an "extreme outlier" by specifying an arbitrary Mahalanobis distance.

You can download the SAS program that generates the graphs in this article.

The post How to simulate multivariate outliers appeared first on The DO Loop.

3月 252019

An important concept in multivariate statistical analysis is the Mahalanobis distance. The Mahalanobis distance provides a way to measure how far away an observation is from the center of a sample while accounting for correlations in the data. The Mahalanobis distance is a good way to detect outliers in multivariate normal data. It is better than looking at the univariate z-scores of each coordinate because a multivariate outlier does not necessarily have extreme coordinate values.

The geometry of multivariate outliers

In classical statistics, a univariate outlier is an observation that is far from the sample mean. (Modern statistics use robust statistics to determine outliers; the mean is not a robust statistic.) You might assume that an observation that is extreme in every coordinate is also a multivariate outlier, and that is often true. However, the converse is not true: when variables are correlated, you can have a multivariate outlier that is not extreme in any coordinate!

The following schematic diagram gives the geometry of multivariate normal data. The middle of the diagram represents the center of a bivariate sample.

  • The orange elliptical region indicates a region that contains most of the observations. Because the variables are correlated, the ellipse is tilted relative to the coordinate axes.
  • For observations inside the ellipse, their Mahalanobis distance to the sample mean is smaller than some cutoff value. For observations outside the ellipse, their Mahalanobis distance to the sample mean is larger than the cutoff.
  • The green rectangle at the left and right indicate regions where the X1 coordinate is far from the X1 mean.
  • The blue rectangle at the top and bottom indicate regions where the X2 coordinate is far from the X2 mean.
Geometry of multivariate outliers showing the relationship between Mahalanobis distance and univariate outliers. The point 'A' has large univariate z scores but a small Mahalanobis distance. The point 'B' has a large Mahalanobis distance. Only 'B' is a multivariate outlier.

The diagram displays two observations, labeled "A" and "B":

  • The observation "A" is inside the ellipse. Therefore, the Mahalanobis distance from "A" to the center is "small." Accordingly, "A" is not identified as a multivariate outlier. However, notice that "A" is a univariate outlier for both the X1 and X2 coordinates!
  • The observation "B" is outside the ellipse. Therefore, the Mahalanobis distance from "B" to the center is relatively large. The observation is classified as a multivariate outlier. However, notice that "B" is not a univariate outlier for either the X1 or X2 coordinates; neither coordinate is far from its univariate mean.

The main point is this: An observation can be a multivariate outlier even though none of its coordinate values are extreme. It is the combination of values which makes an outlier unusual. In terms of Mahalanobis distance, the diagram illustrates that an observation "A" can have high univariate z scores but not have an extremely high Mahalanobis distance. Similarly, an observation "B" can have a higher Mahalanobis distance than "A" even though its z scores are relatively small.

Applications to real data

This article was motivated by a question from a SAS customer. In his data, one observation had a large Mahalanobis distance score but relatively small univariate z scores. Another observation had large z scores but a smaller Mahalanobis distance. He wondered how that was possible. His data contained four variables, but the following two-variable example illustrates his situation:

Geometry of multivariate outliers. The point 'A' has large univariate z scores but the Mahalanobis distance is only about 2.5. The point 'B' has a Mahalanobis distance of 5 and is a multivariate outlier.

The blue markers were simulated from a bivariate normal distribution with μ = (0, 0) and covariance matrix Σ = {16 32.4, 32.4 81}. The red markers were added manually. The observation marked 'B' is a multivariate outlier. The Mahalanobis distance (MD) from 'B' to the center of the sample is about 5 units. (The center is approximately at (0,0).) In contrast, the observation marked 'A' is not a multivariate outlier even though it has extreme values for both coordinates. In fact, the MD from 'A' to the center of the sample is about 2.5, or approximately half the MD of 'B'. The coordinates (x1, x2), standardized coordinates (z1, z2), and MD for both points are shown below:

You can connect the Mahalanobis distance to the probability of a multivariate random normal variable. The squared MD for multivariate normal data is distributed according to a chi-square distribution. For bivariate normal data, the probability that an observation is within t MD units of the center is 1 - exp(-t2/2). Observations like 'A' are not highly unusual. Observations that have MD ≥ 2.5 occur in exp(-2) = 4.4% of random variates from the bivariate normal distribution. In contrast, observations like 'B' are extremely rare. Observations that have MD ≥ 5 occur with probability exp(-25/2) = 3.73E-6. Yes, if you measure in Euclidean distance, 'A' is farther from the center than 'B' is, but the correlation between the variables makes 'A' much more probable. The Mahalanobis distance incorporates the correlation into the calculation of "distance."

Summary and further reading

In summary, things are not always as they appear. For univariate data, an outlier is an extreme observation. It is far from the center of the data. In higher dimensions, we need to account for correlations among variables when we measure distance. The Mahalanobis distance does that, and the examples in this post show that an observation can be "far from the center" (as measured by the Mahalanobis distance) even if none of its individual coordinates are extreme.

The following articles provide more information about Mahalanobis distance and multivariate outliers:

You can download the SAS program that generates the examples and images in this article.

The post The geometry of multivariate versus univariate outliers appeared first on The DO Loop.

3月 202019

An analyst was using SAS to analyze some data from an experiment. He noticed that the response variable is always positive (such as volume, size, or weight), but his statistical model predicts some negative responses. He posted the data and asked if it is possible to modify the graph so that only positive responses are displayed.

This article shows how you can truncate a surface or a contour plot so that negative values are not displayed. You could do something similar to truncate unreasonably high values in a surface plot.

Why does the model predict negative values?

Before showing how to truncate the surface plot, let's figure out why the model predicts negative values when all the observed responses are positive. The following DATA step is a simplified version of the real data. The RSREG procedure uses least squares regression to fit a quadratic response surface. If you use the PLOTS=SURFACE option, the procedure automatically displays a contour plot and surface plot for the predicted response:

data Sample;
input X Y Z @@;
10 90 22  22 76 13  22 75  7 
24 78 14  24 76 10  25 63  5
26 62 10  26 94 20  26 63 15
27 94 16  27 95 14  29 66  7
30 69  8  30 74  8
ods graphics / width=400px height=400px ANTIALIASMAX=10000;
proc rsreg data=Sample plots=surface(fill=pred overlaypairs);
   model Z = Y X;
proc rsreg data=Sample plots=surface(3d fill=Pred gridsize=80);
   model Z = Y X;
   ods select Surface;
   ods output Surface=Surface; /* use ODS OUTPUT to save surface data to a data set */

The contour plot overlays a scatter plot of the data. You can see that the data are observed only in the upper-right portion of the plot (the red regions) and that no data are in the lower-left portion of the plot. The RSREG procedure fits a quadratic model to the data. The predicted values near the observed data are all positive. Some of the predicted values that are far from the observed data are negative.

I previously wrote about this phenomenon and showed how to compute the convex hull for these bivariate data. When you evaluate the model inside the convex hull, you are interpolating. When you evaluate the model outside the convex hull, you are extrapolating. It is well known that polynomial regression models can give nonsensical results if you extrapolate far from the data.

Truncating a response surface

The RSREG procedure is not aware that the response variable should be positive. A quadratic surface will eventually get arbitrarily big in the positive and/or negative directions. You can see this on the contour and surface plots, which show the predictions of the model on a regular grid of (X, Y) values.

If you want to display only the positive portion of the prediction surface, you can replace each negative predicted value with a missing value. The first step is to obtain the predicted values on a regular grid. You can use the "missing value trick" to score the quadratic model on a grid, or you can use the ODS OUTPUT statement to obtain the gridded values that are used in the surface plot. I chose the latter option. In the previous section, I used the ODS OUTPUT statement to write the gridded predicted values for the surface plot to a SAS data set named Surface.

As Warren Kuhfeld points out in his article about processing ODS OUTPUT data set, the names in an ODS data object can be "long and hard to type." Therefore, I rename the variables. I also combine the gridded values with the original data so that I can optionally overlay the data and the predicted values.

/* rename vars and set negative responses to missing */
data Surf2;
set Surface(rename=(
       Predicted0_1_0_0 = Pred  /* rename the long ODS names */
       Factor1_0_1_0_0  = GY    /* 'G' for 'gridded' */
       Factor2_0_1_0_0  = GX))  
    Sample(in=theData);         /* combine with original data */
if theData then Type = "Data   ";
else            Type = "Gridded";
if Pred < 0 then Pred = .;      /* replace negative predictions with missing values */
label GX = 'X'  GY = 'Y';

You can use the Graph Template Language (GTL) to generate graphs that are similar to those produced by PROC RSREG. You can then use PROC SGRENDER to create the graph. Because the negative response values were set to missing, the contour plot displays a missing value color (black, for this ODS style) in the lower-left and upper-right portions of the plot. Similarly, the missing values cause the surface plot to be truncated. By using the GRIDSIZE= option, you can make the jagged edges small.

Notice that the colors in the graphs are now based on the range [0, 50], whereas previously the colors were based on the range [-60, 50]. I've added a continuous legend to the plots so that the range of the response variable is obvious.

I'd like to stress that sometimes "nonsensical values" indicate an inappropriate model. If you notice nonsensical values, you should always ask yourself why the model is predicting those values. You shouldn't modify the prediction surface without a good reason. But if you do have a good reason, the techniques in this article should help you.

You can download the complete SAS program that analyzes the data and generates the truncated graphs.

The post Truncate response surfaces appeared first on The DO Loop.