Matrix Computations

12月 112019

Binary matrices are used for many purposes. I have previously written about how to use binary matrices to visualize missing values in a data matrix. They are also used to indicate the co-occurrence of two events. In ecology, binary matrices are used to indicate which species of an animal are present in which ecological site. For example, if you remember your study of Darwin's finches in high school biology class, you might recall that certain finches (species) are present or absent on various Galapagos islands (sites). You can use a binary matrix to indicate which finches are present on which islands.

Recently I was involved in a project that required performing a permutation test on rows of a binary matrix. As part of the project, I had to solve three smaller problems involving rows of a binary matrix:

  1. Given two rows, find the locations at which the rows differ.
  2. Given two binary matrices, determine how similar the matrices are by computing the proportion of elements that are the same.
  3. Given two rows, swap some of the elements that differ.

This article shows how to solve each problem by using the SAS/IML matrix language. A future article will discuss permutation tests for binary matrices. For clarity, I introduce the following macro that uses a temporary variable to swap two sets of values:

/* swap values of A and B. You can use this macro in the DATA step or in the SAS/IML language */
%macro SWAP(A, B, TMP=_tmp);
   &TMP = &A; &A = &B; &B = &TMP;

Where do binary matrices differ?

The SAS/IML matrix language enables you to treat matrices as high-level objects. You often can answer questions about matrices without writing loops that iterate over the elements of the matrices. For example, if you have two matrices of the same dimensions, you can determine the cells at which the matrices are unequal by using the "not equal" (^=) operator. The following SAS/IML statements define two 2 x 10 matrices and use the "not equal" operator to find the cells that are different:

proc iml;
A = {0 0 1 0 0 1 0 1 0 0 ,
     1 0 0 0 0 0 1 1 0 1 };
B = {0 0 1 0 0 0 0 1 0 1 ,
     1 0 0 0 0 1 1 1 0 0 };
colLabel = "Sp1":("Sp"+strip(char(ncol(A))));
rowLabel = "A1":("A"+strip(char(nrow(A))));
/* 1. Find locations where binary matrices differ */
Diff = (A^=B);
print Diff[c=colLabel r=rowLabel];

The matrices A and B are similar. They both have three 1s in the first row and fours 1s in the second row. However, the output shows that the matrices are different for the four elements in the sixth and tenth columns. Although I used entire matrices for this example, the same operations work on row vectors.

The proportion of elements in common

You can use various statistics to measure the similarity between the A and B matrices. A simple statistic is the proportion of elements that are in common. These matrices have 20 elements, and 16/20 = 0.8 are common to both matrices. You can compute the proportion in common by using the express (A=B)[:], or you can use the following statements if you have previously computed the Diff matrix:

/* 2. the proportion of elements in B that are the same as in A */
propDiff = 1 - Diff[:];
print propDiff;

As a reminder, the mean subscript reduction operator ([:]) computes the mean value of the elements of the matrix. For a binary matrix, the mean value is the proportion of ones.

Swap elements of rows

The first two tasks were easy. A more complicated task is swapping values that differ between rows. The swapping operation is not difficult, but it requires finding the k locations where the rows differ and then swapping all or some of those values. In a permutation test, the number of elements that you swap is a random integer between 1 and k, but for simplicity, this example uses the SWAP macro to swap two cells that differ. For clarity, the following example uses temporary variables such as x1, x2, d1, and d2 to swap elements in the matrix A:

/* specify the rows whose value you want to swap */
i1 = 1;                         /* index of first row to compare and swap */
i2 = 2;                         /* index of second row to compare and swap */
/* For clarity, use temp vars x1 & x2 instead of A[i1, ] and A[i2, ] */
x1 = A[ i1, ];                  /* get one row */
x2 = A[ i2, ];                  /* get another row */
idx = loc( x1^=x2 );            /* find the locations where rows differ */
if ncol(idx) > 0 then do;       /* do the rows differ? */
   d1 = x1[ ,idx];              /* values at the locations that differ */
   d2 = x2[ ,idx];
   print (d1//d2)[r={'r1' 'r2'} L='Original'];
   /* For a permutation test, choose a random number of locations to swap */
   /* numSwaps = randfun(1, "Integer", 1, n);
      idxSwap = sample(1:ncol(idx), numSwaps, "NoReplace");  */
   idxSwap = {2 4};              /* for now, hard-code locations to swap */
   %SWAP(d1[idxSwap], d2[idxSwap]);
   print (d1//d2)[r={'d1' 'd2'} L='New'];
   /* update matrix */
   A[ i1, idx] = d1;
   A[ i2, idx] = d2;
print A[c=colLabel r=rowLabel];

The vectors x1 and x2 are the rows of A to compare. The vectors d1 and d2 are the subvectors that contain only the elements of x1 and x2 that differ. The example swaps the second and fourth columns of d1 and d2. The new values are then inserted back into the matrix A. You can compare the final matrix to the original to see that the process swapped two elements in each of two rows.

Although the examples are for binary matrices, these techniques work for any numerical matrices.

The post Swap elements in binary matrices appeared first on The DO Loop.

11月 272019
Visualization of a quadratic function and a linear subspace

This article discusses how to restrict a multivariate function to a linear subspace. This is a useful technique in many situations, including visualizing an objective function that is constrained by linear equalities. For example, the graph to the right is from a previous article about how to evaluate quadratic polynomials. The graph shows a heat map for a quadratic polynomial of two variables. The diagonal line represents a linear constraint between the X and Y variables. If you restrict the polynomial to the diagonal line, you obtain a one-dimensional function that you can easily graph and visualize. By repeating this process for other lines, you can construct a "stack" of lower-dimensional slices that enable you to understand the graph of the original bivariate function.

This example generalizes. For a function of many variables, you can restrict the function to a lower-dimensional subspace. By visualizing the restricted function, you can understand the high-dimensional function better. I previously demonstrated this technique for multidimensional regression models by using the SLICEFIT option in the EFFECTPLOT statement in SAS. This article shows how to use ideas from vector calculus to restrict a function to a parameterized linear subspace.

Visualize high-dimensional data and functions

There are basically two techniques for visualizing high-dimensional objects: projection and slicing. Many methods in multivariate statistics compute a linear subspace such that the projection of the data onto the subspace has a desirable property. One example is principal component analysis, which endeavors to find a low dimensional subspace that captures most of the variation in the data. Another example is linear discriminant analysis, which aims to find a linear subspace that maximally separates groups in the data. In both cases, the data are projected onto the computed subspaces to reveal characteristics of the data. There are also statistical methods that attempt to find nonlinear subspaces.

Whereas projection is used for data, slicing is often used to visualize the graphs of functions. In regression, you model a response variable as a function of the explanatory variables. Slicing the function along a linear subspace of the domain can reveal important characteristics about the function. That is the method used by the SLICEFIT option in the EFFECTPLOT statement, which enables you to visualize multidimensional regression models. The most common "slice" for a regression model is to specify constant values for all but one continuous variable in the model.

When visualizing an objective function that is constrained by a set of linear equalities, the idea is similar. The main difference is that the "slice" is usually determined by a general linear combination of variables.

Restrict a function to a one-dimensional subspace

A common exercise in multivariate calculus is to restrict a bivariate function to a line. (This is equivalent to a "slice" of the bivariate function.) Suppose that you choose a point, x0, and a unit vector, u, which represents the "direction". Then a parametric line through x0 in the direction of u is given by the vector expression v(t) = x0 + t u, where t is any real number. If you restrict a multivariate function to the image of v, you obtain a one-dimensional function.

For example, consider the quadratic polynomial in two variables:
f(x,y) = (9*x##2 + x#y + 4*y##2) - 12*x - 4*y + 6
The function is visualized by the heat map at the top of this article. Suppose x0 = {0, -1} and choose u to be the unit vector in the direction of the vector {3, 1}. (This choice corresponds to a linear constraint of the form x – 3y = 3.)

The following SAS/IML program constructs the parameterized line v(t) for a uniform set of t values. The quadratic function is evaluated at this set of values and the resulting one-dimensional function is graphed:

proc iml;
/* Evaluate the quadratic function at each column of X and return a row vector. */
start Func(XY);
   x = XY[1,]; y = XY[2,];
   return (9*x##2  + x#y + 4*y##2) - 12*x - 4*y + 6;
x0 = {0, -1};          /* evaluate polynomial at this point */
d  = {3, 1};           /* vector that determines direction */
u  = d / norm(d);      /* unit vector (direction) */
t = do(-2, 2, 0.1);    /* parameter values */
v = x0 + t @ u;        /* evenly spaced points along the linear subspace */
f = Func(v);           /* evaluate the 2-D function along the 1-D subspace */
title "Quadratic Form Evaluated on Linear Subspace";
call series(t, f) grid={x y};
Quadratic function restricted to linear subspace

The graph shows the restriction of the 2-D function to the 1-D subspace. According to the graph, the restricted function reaches a minimum value for t ≈ 0.92. The corresponding (x, y) values and the corresponding function value are shown below:

tMin = 0.92;           /* t* = parameter near the minimum */
vMin = x0 + tMin * u;  /* corresponding (x(t*), y(t*)) */
fMin = Func(vMin);     /* f(x(t*), y(t*)) */
print tMin vMin fMin;

If you look back to the heat map at the top of this article, you will see that the value (x,y) = (0.87,-0.71) correspond to the location for which the function achieves a minimum value when restricted to the linear subspace. The value of the function at that (x,y) value is approximately 6.6.

The SAS/IML program uses the Kronecker direct-product operator (@) to create points along the line. The operation is v = x0 + t @ u. The symbols x0 and u are both column vectors with two elements. The Kronecker product creates a 2 x k matrix, where k is the number of elements in the row vector t. The Kronecker product enables you to vectorize the program instead of looping over the elements of t and forming the individual vectors x0 + t[i]*u.

You can generalize this example to evaluate a multivariate function on a two-dimensional subspace. If u1 and u2 are two orthogonal, p-dimensional, unit vectors, the expression x0 + s*u1 + t*u2 spans a two-dimensional subspace of Rp. You can use the ExpandGrid function in SAS/IML to generate ordered pairs (s, t) on a two-dimensional grid.


In summary, you can slice a multivariate function along a linear subspace. Usually, 1-D or 2-D subspaces are used and the resulting (restricted) function is graphed. This helps you to understand the function. You can choose a series of slices (often parallel to each other) and "stack" the slices in order to visualize the multivariate function. Typically, this technique works best for functions that have three or four continuous variables.

You can download the SAS program that creates the graphs in this article.

The post Evaluate a function on a linear subspace appeared first on The DO Loop.

11月 252019

What is an efficient way to evaluate a multivariate quadratic polynomial in p variables? The answer is to use matrix computations! A multivariate quadratic polynomial can be written as the sum of a purely quadratic term (degree 2), a purely linear term (degree 1), and a constant term (degree 0). The purely quadratic term is called a quadratic form. This article shows how to use matrix computations to efficiently evaluate a multivariate quadratic polynomial.

Quadratic polynomials and matrix expressions

Visualization of a quadratic polynomial in SAS

As I wrote in a previous article about optimizing a quadratic function, the matrix of second derivatives and the gradient of first derivatives appear in the matrix representation of a quadratic polynomial. I will use the same example as in my previous article. Namely, the following quadratic polynomial in two variables:
f(x,y) = (9*x##2 + x#y + 4*y##2) - 12*x - 4*y + 6
       = (1/2) x` Q x + L` x + 6
where Q = {18  1, 1  8} is a 2x2 symmetric matrix, L = {-12  -4} is a column vector, and x = {x, y} is a column vector that represents the coordinates at which to evaluate the quadratic function.

The graph at the right visualizes this quadratic function by using a heat map. Small values of the function are shown in white. Larger values of the function are shown in blues and greens. The largest values are shown in reds and blacks. The global minimum of this function is approximately (x, y) = (0.64, 0.42), and that point is indicated by a star.

This example generalizes. Every quadratic function in p variables can be written as the sum of a quadratic form (1/2 x` Q x), a linear term (L` x) and a constant. Here Q is a p x p symmetric matrix of second derivatives (the Hessian matrix of f) and L and x are p-dimensional column vectors.

Evaluate quadratic polynomial in SAS

Because the computation involves vectors and matrices, the SAS/IML language is the natural place to use matrices to evaluate a quadratic function. The following SAS/IML statements implement a simple function to evaluate a quadratic polynomial (given in terms of Q, L, and a constant) at an arbitrary two-dimensional vector, x:

proc iml;
/* Evaluate  f(x) = 0.5 * x` * Q * x + L`*x + const, where
   Q is p x p symmetric matrix, 
   L and x are col vector with p elements.
   This version evaluates ONE vector x and returns a scalar value. */
start EvalQuad(x, Q, L, const=0);
   return 0.5 * x`*Q*x + L`*x + const;
/* compute Q and L for f(x,y)= 9*x##2 + x#y + 4*y##2) - 12*x - 4*y + 6 */
Q = {18 1,             /* matrix of second derivatives */
      1 8};
L = { -12, -4};        /* use column vectors for L and x */
const = 6;
x0 = {0, -1};          /* evaluate polynomial at this point */
f = EvalQuad(x0, Q, L, const);
print f[L="f(0,-1)"];

Evaluate a quadratic polynomial at multiple points

When I implement a function in the SAS/IML language, I try to "vectorize" it so that it can evaluate multiple points in a single call. Often you can use matrix operations to vectorize a function evaluation, but I don't see how to make the math work for this problem. The natural way to evaluate a quadratic polynomial at k vectors X1, X2, ..., Xk, is to pack those vectors into a p x k matrix X such that each column of X is a point at which to evaluate the polynomial. Unfortunately, the matrix computation of the quadratic form M = 0.5 * X`*Q*X results in a k x k matrix. Only the k diagonal elements are needed for evaluating the polynomial on the k input vectors, so although it is possible to compute M, doing so would be very inefficient.

In this case, it seems more efficient to loop over the columns of X. The following function implements a SAS/IML module that evaluates a quadratic polynomial at every column of X and returns a row vector of the results. The module is demonstrated by calling it on a matrix that has 5 columns.

/* Evaluate the quadratic function at each column of X and return a row vector. */
start EvalQuadVec(X, Q, L, const=0);
   f = j(1, ncol(X), .);
   do i = 1 to ncol(X);
      v = X[,i];
      f[i] = 0.5 * v`*Q*v + L`*v + const;
   return f;
/*    X1  X2  X3 X4  X5  */
vx = {-1 -0.5 0  0.5 1 ,
      -3 -2  -1  0   1 };
f = EvalQuadVec(vx, Q, L, const=0);
print (vx // f)[r={'x' 'y' 'f(x,y)'} c=('X1':'X5')];

Evaluate a quadratic polynomial on a uniform grid of points

You can use the EvalQuadVec function to evaluate a quadratic polynomial on any set of multiple points. In particular, you can use the ExpandGrid function to construct a regular 2-D grid of points. By evaluating the function at each point on the grid, you can visualize the function. The following statements create a heat map of the function on a regular grid. The heat map is shown at the top of this article.

x = do(-1, 1, 0.1); 
y = do(-3, 1.5, 0.1);
xy = expandgrid(x, y);             /* 966 x 2 matrix */
f = EvalQuadVec(xy`, Q, L, const); /* evaluate polynomial at all points */
/* write results to a SAS data set and visualize the function by using a heat map */
M = xy || f`;
create Heatmap from M[c={'x' 'y' 'f'}];  append from M;  close;
data optimal;
   xx=0.64; yy=0.42;  /* optional: add the optimal (x,y) value */
data All;  set Heatmap optimal; run;
title "Heat Map of Quadratic Function";
proc sgplot data=All;
   heatmapparm x=x y=y colorresponse=f / colormodel= (WHITE CYAN YELLOW RED BLACK);
   scatter x=xx y=yy / markerattrs=(symbol=StarFilled);
   xaxis offsetmin=0 offsetmax=0;
   yaxis offsetmin=0 offsetmax=0;

Quadratic approximations

Perhaps you do not often use quadratic polynomials. This technique is useful even for general nonlinear functions because it enables you to find the best quadratic approximation to a multivariate function at any point x0. The multivariate Taylor series at the point x0, truncated at second order, is
f(x) ≈ f(x0) + L` · (xx0) + (1/2) (xx0)` · Q · (xx0)
where L = ∇f(x0) is the gradient of f evaluate at x0 and Q = D2f(x0) is the symmetric Hessian matrix of second derivatives of f evaluated at x0.


In summary, you can use matrix computations to evaluate a multivariate quadratic polynomial. This article shows how to evaluate a quadratic polynomial at multiple points. For a polynomial of two variables, you can use this technique to visualize quadratic functions.

The post Evaluate a quadratic polynomial in SAS appeared first on The DO Loop.

11月 132019

Biplots are two-dimensional plots that help to visualize relationships in high dimensional data. A previous article discusses how to interpret biplots for continuous variables. The biplot projects observations and variables onto the span of the first two principal components. The observations are plotted as markers; the variables are plotted as vectors. The observations and/or vectors are not usually on the same scale, so they need to be rescaled so that they fit on the same plot. There are four common scalings (GH, COV, JK, and SYM), which are discussed in the previous article.

This article shows how to create biplots in SAS. In particular, the goal is to create the biplots by using modern ODS statistical graphics. You can obtain biplots that use the traditional SAS/GRAPH system by using the %BIPLOT macro by Michael Friendly. The %BIPLOT macro is very powerful and flexible; it is discussed later in this article.

There are four ways to create biplots in SAS by using ODS statistical graphics:

  • You can use PROC PRINQUAL in SAS/STAT software to create the COV biplot.
  • If you have a license for SAS/GRAPH software (and SAS/IML software), you can use Friendly's %BIPLOT macro and use the OUT= option in the macro to save the coordinates of the markers and vectors. You can then use PROC SGPLOT to create a modern version of Friendly's biplot.
  • You can use the matrix computations in SAS/IML to "manually" compute the coordinates of the markers and vectors. (These same computations are performed internally by the %BIPLOT macro.) You can use the Biplot module to create a biplot, or you can use the WriteBiplot module to create a SAS data set that contains the biplot coordinates. You can then use PROC SGPLOT to create the biplot.

For consistency with the previous article, all methods in this article standardize the input variables to have mean zero and unit variance (use the SCALE=STD option in the %BIPLOT macro). All biplots show projections of the same four-dimensional Fisher's iris data. The following DATA step assigns a blank label. If you do not supply an ID variable, some biplots display observations numbers.

data iris;
   set Sashelp.iris;
   id = " ";        /* create an empty label variable */

Use PROC PRINQUAL to compute the COV biplot

The PRINQUAL procedure can perform a multidimensional preference analysis, which is visualized by using a MDPREF plot. The MDPREF plot is closely related to biplot (Jackson (1991), A User’s Guide to Principal Components, p. 204). You can get PROC PRINQUAL to produce a COV biplot by doing the following:

  • Use the N=2 option to specify you want to compute two principal components.
  • Use the MDPREF=1 option to specify that the procedure should not rescale the vectors in the biplot. By default, MDPREF=2.5 and the vectors appear 2.5 larger than they should be. (More on scaling vectors later.)
  • Use the IDENTITY transformation so that the variables are not transformed in a nonlinear manner.

The following PROC PRINQUAL statements produce a COV biplot (click to enlarge):

proc prinqual data=iris plots=(MDPref) 
              n=2       /* project onto Prin1 and Prin2 */
              mdpref=1; /* use COV scaling */
   transform identity(SepalLength SepalWidth PetalLength PetalWidth);  /* identity transform */
   id ID;
   ods select MDPrefPlot;
COV biplot, created in SAS by using PROC PRINQUAL

Use Friendly's %BIPLOT macro

Friendly's books [SAS System for Statistical Graphics (1991) and Visualizing Categorical Data (2000)] introduced many SAS data analysts to the power of using visualization to accompany statistical analysis, and especially the analysis of multivariate data. His macros use traditional SAS/GRAPH graphics from the 1990s. In the mid-2000s, SAS introduced ODS statistical graphics, which were released with SAS 9.2. Although the %BIPLOT macro does not use ODS statistical graphics directly, the macro supports the OUT= option, which enables you to create an output data set that contains all the coordinates for creating a biplot.

The following example assumes that you have downloaded the %BIPLOT macro and the %EQUATE macro from Michael Friendly's web site.

/* A. Use %BIPLOT macro, which uses SAS/IML to compute the biplot coordinates. 
      Use the OUT= option to get the coordinates for the markers and vectors.
   B. Transpose the data from long to wide form.
   C. Use PROC SGPLOT to create the biplot
%let FACTYPE = SYM;   /* options are GH, COV, JK, SYM */
title "Biplot: &FACTYPE, STD";
        var=SepalLength SepalWidth PetalLength PetalWidth, 
        factype=&FACTYPE,  /* GH, COV, JK, SYM */
        std=std,           /* NONE, MEAN, STD  */
        scale=1,           /* if you do not specify SCALE=1, vectors are auto-scaled */
        out=biplotFriendly,/* write SAS data set with results */ 
        symbols=circle dot, inc=1);
/* transpose from long to wide */
data Biplot;
set biplotFriendly(where=(_TYPE_='OBS') rename=(dim1=Prin1 dim2=Prin2 _Name_=_ID_))
    biplotFriendly(where=(_TYPE_='VAR') rename=(dim1=vx dim2=vy _Name_=_Variable_));
proc sgplot data=Biplot aspect=1 noautolegend;
   refline 0 / axis=x; refline 0 / axis=y;
   scatter x=Prin1 y=Prin2 / datalabel=_ID_;
   vector  x=vx    y=vy    / datalabel=_Variable_
                             lineattrs=GraphData2 datalabelattrs=GraphData2;
   xaxis grid offsetmin=0.1 offsetmax=0.2;
   yaxis grid;
SYM biplot, created in SAS by using ODS statistical graphics

Because you are using PROC SGPLOT to display the biplot, you can easily configure the graph. For example, I added grid lines, which are not part of the output from the %BIPLOT macro. You could easily change attributes such as the size of the fonts or add additional features such as an inset. With a little more work, you can merge the original data and the biplot data and color-code the markers by a grouping variable (such as Species) or by a continuous response variable.

Notice that the %BUPLOT macro supports a SCALE= option. The SCALE= option applies an additional linear scaling to the vectors. You can use this option to increase or decrease the lengths of the vectors in the biplot. For example, in the SYM biplot, shown above, the vectors are long relative to the range of the data. If you want to display vectors that are only 25% as long, you can specify SCALE=0.25. You can specify numbers greater than 1 to increase the vector lengths. For example, SCALE=2 will double the lengths of the vectors. If you omit the SCALE= option or set SCALE=0, then the %BIPLOT macro automatically scales the vectors to the range of the data. If you use the SCALE= option, you should tell the reader that you did so.

SAS/IML modules that compute biplots

The %BIPLOT macro uses SAS/IML software to compute the locations of the markers and vectors for each type of biplot. I wrote three SAS/IML modules that perform the three steps of creating a biplot:

  • The CalcBiplot module computes the projections of the observations and scores onto the first few principal components. This module (formerly named CalcPrinCompBiplot) was written in the mid-2000s and has been distributed as part of the SAS/IML Studio application. It returns the scores and vectors as SAS/IML matrices.
  • The WriteBiplot module calls the CalcBiplot module and then writes the scores to a SAS data set called _SCORES and the vectors (loadings) to a SAS data set called _VECTORS. It also creates two macro variables, MinAxis and MaxAxis, which you can use if you want to equate the horizontal and vertical scales of the biplot axes.
  • The Biplot function calls the WriteBiplot module and then calls PROC SGPLOT to create a biplot. It is the "raw SAS/IML" version of the %BIPLOT macro.

You can use the CalcBiplot module to compute the scores and vectors and return them in IML matrices. You can use the WriteBiplot module if you want that information in SAS data sets so that you can create your own custom biplot. You can use the Biplot module to create standard biplots. The Biplot and WriteBiplot modules are demonstrated in the next sections.

Use the Biplot module in SAS/IML

The syntax of the Biplot module is similar to the %BIPLOT macro for most arguments. The input arguments are as follows:

  • X: The numerical data matrix
  • ID: A character vector of values used to label rows of X. If you pass in an empty matrix, observation numbers are used to label the markers. This argument is ignored if labelPoints=0.
  • varNames: A character vector that contains the names of the columns of X.
  • FacType: The type of biplot: 'GH', 'COV', 'JK', or 'SYM'.
  • StdMethod: How the original variables are scaled: 'None', 'Mean', or 'Std'.
  • Scale: A numerical scalar that specifies additional scaling applied to vectors. By default, SCALE=1, which means the vectors are not scaled. To shrink the vectors, specify a value less than 1. To lengthen the vectors, specify a value greater than 1. (Note: The %BIPLOT macro uses SCALE=0 as its default.)
  • labelPoints: A binary 0/1 value. If 0 (the default) points are not labeled. If 1, points are labeled by the ID values. (Note: The %BIPLOT macro always labels points.)

The last two arguments are optional. You can specify them as keyword-value pairs outside of the parentheses. The following examples show how you can call the Biplot module in a SAS/IML program to create a biplot:

ods graphics / width=480px height=480px;
proc iml;
/* assumes the modules have been previously stored */
load module=(CalcBiplot WriteBiplot Biplot);
use sashelp.iris;
read all var _NUM_ into X[rowname=Species colname=varNames];
title "COV Biplot with Scaled Vectors and Labels";
run Biplot(X, Species, varNames, "COV", "Std") labelPoints=1;   /* label obs */
title "JK Biplot: Relationships between Observations";
run Biplot(X, NULL, varNames, "JK", "Std");
title "JK Biplot: Automatic Scaling of Vectors";
run Biplot(X, NULL, varNames, "JK", "Std") scale=0;            /* auto scale; empty ID var */
title "SYM Biplot: Vectors Scaled by 0.25";
run Biplot(X, NULL, varNames, "SYM", "Std") scale=0.25;        /* scale vectors by 0.25 */
SYM biplot with auto-scaled vectors. Biplot created by using ODS statistical graphics

The program creates four biplots, but only the last one is shown. The last plot uses the SCALE=0.25 option to rescale the vectors of the SYM biplot. You can compare this biplot to the SYM biplot in the previous section, which did not rescale the length of the vectors.

Use the WriteBiplot module in SAS/IML

If you prefer to write an output data set and then create the biplot yourself, use the WriteBiplot module. After loading the modules and the data (see the previous section), you can write the biplot coordinates to the _Scores and _Vectors data sets, as follows. A simple DATA step appends the two data sets into a form that is easy to graph:

run WriteBiplot(X, NULL, varNames, "JK", "Std") scale=0;   /* auto scale vectors */
data Biplot;
   set _Scores _Vectors;    /* append the two data sets created by the WriteBiplot module */
title "JK Biplot: Automatic Scaling of Vectors";
title2 "FacType=JK; Std=Std";
proc sgplot data=Biplot aspect=1 noautolegend;
   refline 0 / axis=x; refline 0 / axis=y;
   scatter x=Prin1 y=Prin2 / ; 
   vector  x=vx    y=vy    / datalabel=_Variable_
                             lineattrs=GraphData2 datalabelattrs=GraphData2;
   xaxis grid offsetmin=0.1 offsetmax=0.1 min=&minAxis max=&maxAxis;
   yaxis grid min=&minAxis max=&maxAxis;
JK biplot, created in SAS by using ODS statistical graphics

In the program that accompanies this article, there is an additional example in which the biplot data is merged with the original data so that you can color-code the observations by using the Species variable.


This article shows four ways to use modern ODS statistical graphics to create a biplot in SAS. You can create a COV biplot by using the PRINQUAL procedure. If you have a license for SAS/IML and SAS/GRAPH, you can use Friendly's %BIPLOT macro to write the biplot coordinates to a SAS data set, then use PROC SGPLOT to create the biplot. This article also presents SAS/IML modules that compute the same biplots as the %BIPLOT macro. The WriteBiplot module writes the data to two SAS data sets (_Score and _Vector), which can be appended and used to plot a biplot. This gives you complete control over the attributes of the biplot. Or, if you prefer, you can use the Biplot module in SAS/IML to automatically create biplots that are similar to Friendly's but are displayed by using ODS statistical graphics.

You can download the complete SAS program that is used in this article. For convenience, I have also created a separate file that defines the SAS/IML modules that create biplots.

The post Create biplots in SAS appeared first on The DO Loop.

11月 062019
COV biplot of Fisher's iris data. Commputed in SAS.

Principal component analysis (PCA) is an important tool for understanding relationships in multivariate data. When the first two principal components (PCs) explain a significant portion of the variance in the data, you can visualize the data by projecting the observations onto the span of the first two PCs. In a PCA, this plot is known as a score plot. You can also project the variable vectors onto the span of the PCs, which is known as a loadings plot. See the article "How to interpret graphs in a principal component analysis" for a discussion of the score plot and the loadings plot.

A biplot overlays a score plot and a loadings plot in a single graph. An example is shown at the right. Points are the projected observations; vectors are the projected variables. If the data are well-approximated by the first two principal components, a biplot enables you to visualize high-dimensional data by using a two-dimensional graph.

In general, the score plot and the loadings plot will have different scales. Consequently, you need to rescale the vectors or observations (or both) when you overlay the score and loadings plots. There are four common choices of scaling. Each scaling emphasizes certain geometric relationships between pairs of observations (such as distances), between pairs of variables (such as angles), or between observations and variables. This article discusses the geometry behind two-dimensional biplots and shows how biplots enable you to understand relationships in multivariate data.

Some material in this blog post is based on documentation that I wrote in 2004 when I was working on the SAS/IML Studio product and writing the SAS/IML Studio User's Guide. The documentation is available online and includes references to the literature.

The Fisher iris data

A previous article shows the score plot and loadings plot for a PCA of Fisher's iris data. For these data, the first two principal components explain 96% of the variance in the four-dimensional data. Therefore, these data are well-approximated by a two-dimensional set of principal components. For convenience, the score plot (scatter plot) and the loadings plot (vector plot) are shown below for the iris data. Notice that the loadings plot has a much smaller scale than the score plot. If you overlay these plots, the vectors would appear relatively small unless you rescale one or both plots.

Score plot for Fisher's iris data; first two principal components. Loadings plot for Fisher's iris data; first two principal components.

The mathematics of the biplot

You can perform a PCA by using a singular value decomposition of a data matrix that has N rows (observations) and p columns (variables). The first step in constructing a biplot is to center and (optionally) scale the data matrix. When variables are measured in different units and have different scales, it is usually helpful to standardize the data so that each column has zero mean and unit variance. The examples in this article use standardized data.

The heart of the biplot is the singular value decomposition (SVD). If X is the centered and scaled data matrix, then the SVD of X is
X = U L V`
where U is an N x N orthogonal matrix, L is a diagonal N x p matrix, and V is an orthogonal p x p matrix. It turns out that the principal components (PCs) of X`X are the columns of V and the PC scores are the columns of U. If the first two principal components explain most of the variance, you can choose to keep only the first two columns of U and V and the first 2 x 2 submatrix of L. This is the closest rank-two approximation to X. In a slight abuse of notation,
X ≈ U L V`
where now U, L, and V all have only two columns.

Since L is a diagonal matrix, you can write L = Lc L1-c for any number c in the interval [0, 1]. You can then write
X ≈ (U Lc)(L1-c V`)
   = A B

This the factorization that is used to create a biplot. The most common choices for c are 0, 1, and 1/2.

The four types of biplots

The choice of the scaling parameter, c, will linearly scale the observations and vectors separately. In addition, you can write X ≈ (β A) (B / β) for any constant β. Each choice for c corresponds to a type of biplot:

  • When c=0, the vectors are represented faithfully. This corresponds to the GH biplot. If you also choose β = sqrt(N-1), you get the COV biplot.
  • When c=1, the observations are represented faithfully. This corresponds to the JK biplot.
  • When c=1/2, the observations and vectors are treated symmetrically. This corresponds to the SYM biplot.

The GH biplot for variables

GHbiplot of Fisher's iris data. Commputed in SAS.

If you choose c = 0, then A = U and B = L V`. The literature calls this biplot the GH biplot. I call it the "variable preserving" biplot because it provides the most faithful two-dimensional representation of the relationship between vectors. In particular:

  • The length of each vector (a row of B) is proportional to the variance of the corresponding variable.
  • The Euclidean distance between the i_th and j_th rows of A is proportional to the Mahalanobis distance between the i_th and j_th observations in the data.

In preserving the lengths of the vectors, this biplot distorts the Euclidean distance between points. However, the distortion is not arbitrary: it represents the Mahalanobis distance between points.

The GH biplot is shown to the right, but it is not very useful for these data. In choosing to preserve the variable relationships, the observations are projected onto a tiny region near the origin. The next section discusses an alternative scaling that is more useful for the iris data.

The COV biplot

If you choose c = 0 and β = sqrt(N-1), then A = sqrt(N-1) U and B = L V` / sqrt(N-1). The literature calls this biplot the COV biplot. This biplot is shown at the top of this article. It has two useful properties:

  • The length of each vector is equal to the variance of the corresponding variable.
  • The Euclidean distance between the i_th and j_th rows of A is equal to the Mahalanobis distance between the i_th and j_th observations in the data.

In my opinion, the COV biplot is usually superior to the GH biplot.

The JK biplot

JK biplot of Fisher's iris data. Commputed in SAS.

If you choose c = 1, you get the JK biplot, which preserves the Euclidean distance between observations. Specifically, the Euclidean distance between the i_th and j_th rows of A is equal to the Euclidean distance between the i_th and j_th observations in the data.

In faithfully representing the observations, the angles between vectors are distorted by the scaling.

The SYM biplot

If you choose c = 1/2, you get the SYM biplot (also called the SQ biplot), which attempts to treat observations and variables in a symmetric manner. Although neither the observations nor the vectors are faithfully represented, often neither representation is very distorted. Consequently, some people prefer the SYM biplot as a compromise between the COV and JK biplots. The SYM biplot is shown in the next section.

How to interpret a biplot

SYM biplot of Fisher's iris data. Commputed in SAS.

As discussed in the SAS/IML Studio User's Guide, you can interpret a biplot in the following ways:

  • The cosine of the angle between a vector and an axis indicates the importance of the contribution of the corresponding variable to the principal component.
  • The cosine of the angle between pairs of vectors indicates correlation between the corresponding variables. Highly correlated variables point in similar directions; uncorrelated variables are nearly perpendicular to each other.
  • Points that are close to each other in the biplot represent observations with similar values.
  • You can approximate the relative coordinates of an observation by projecting the point onto the variable vectors within the biplot. However, you cannot use these biplots to estimate the exact coordinates because the vectors have been centered and scaled. You could extend the vectors to become lines and add tick marks, but that becomes messy if you have more than a few variables.

If you want to faithfully interpret the angles between vectors, you should equate the horizontal and vertical axes of the biplot, as I have done with the plots on this page.

If you apply these facts to the standardized iris data, you can make the following interpretations:

  • The PetalLength and PetalWidth variables are the most important contributors to the first PC. The SepalWidth variable is the most important contributor to the second PC.
  • The PetalLength and PetalWidth variables are highly correlated. The SepalWidth variable is almost uncorrelated with the other variables.
  • Although I have suppressed labels on the points, you could label the points by an ID variable or by the observation number and use the relative locations to determine which flowers had measurements that were most similar to each other.


This article presents an overview of biplots. A biplot is an overlay of a score plot and a loadings plot, which are two common plots in a principal component analysis. These two plots are on different scales, but you can rescale the two plots and overlay them on a single plot. Depending upon the choice of scaling, the biplot can provide faithful information about the relationship between variables (lengths and angles) or between observations (distances). It can also provide approximates relationships between variables and observations.

In a subsequent post, I will show how to use SAS to create the biplots in this article.

The post What are biplots? appeared first on The DO Loop.

10月 232019

In response to a recent article about how to compute the cosine similarity of observations, a reader asked whether it is practical (or even possible) to perform these types of computations on data sets that have many thousands of observations. The problem is that the cosine similarity matrix is an N x N matrix, where N is the sample size. Thus, the matrix can be very big even for moderately sized data.

The same problem exists for all distance- and similarity-based measurements. Because the Euclidean distance is familiar to most readers, I will use distance to illustrate the techniques in this article. To be specific, suppose you have N=50,000 observations. The full distance matrix contains 2.5 billion distances between pairs of observations. However, in many applications, you are not interested in all the distances. Rather, you are interested in extreme distances, such as finding observations that are very close to each other or are very far from each other. For definiteness, suppose you want to find observations that are less than a certain distance from each other. That distance (called the cutoff distance) serves as a filter. Instead of storing N2 observations, you only need to store the observations that satisfy the cutoff criterion.

Compute by using block-matrix computations

A way to perform a distance computation is to divide the data into blocks and perform block-matrix computations. If X is the data matrix, then the pairwise distance matrix of the N rows is an N x N matrix. But you can subdivide X into blocks where each block has approximately M rows, where M is a smaller number such as M=1,000. Write X = [A1, A2, ..., Ak], where the block matrices are stacked vertically. Then the full distance matrix D = dist(X, X) is equal to stacking the distance matrices for each of the block matrices: D = [dist(A1,X), dist(A2,X), ..., dist(Ak,X)]. Each of the block matrices is an M x N matrix. The following diagram shows the block-matrices. The main idea is to compute each block matrix (which fits into memory), then identify and extract the values that satisfy the cutoff distance. With this technique, you never hold the full N x N distance matrix in memory.

A small example of block computations

A programming principle that I strongly endorse is "start small." Even if your goal is to work with large data, develop and debug the program for small data. In this section, the data contains only 13 observations and 10 variables. The goal is to use block-matrix computations (with a block size of M=3) to compute the 132 = 169 pairwise distances between the observations. As each block of distances is computed, those distances that are less than 2 (the cutoff distance in this example) are identified and stored. The data are simulated by using the following DATA step:

/* start with a smaller example: 13 observations and 10 vars */
%let N = 13;
data Have;
call streaminit(12345);
array x[10];
do i = 1 to &N;
   do j = 1 to dim(x);   x[j] = rand("Uniform", -1, 1);  end;
keep x:;

It makes sense to report the results by giving the pair of observations numbers and the distance for those pairs that satisfy the criterion. For these simulated data, there are 10 pairs of observations that are closer than 2 units apart, as we shall see shortly.

The following SAS/IML program shows how to use block computation to compute the pairs of observations that are close to each other. For illustration, set the block size to M=3. Suppose you are computing the second block, which contains the observations 4, 5, and 6. The following program computes the distance matrix from those observations to every other observation:

proc iml;
use Have; read all var _NUM_ into X; close;
m = 3;                       /* block size */
n = nrow(X);                 /* number of observations */
cutoff = 2;                  /* report pairs where distance < cutoff */
/* Block matrix X = [A1, A2, A3, ..., Ak]  */
startrow=4; endrow=6;        /* Example: Look at 2nd block with obs 4,5,6 */
rows = startRow:endRow;
/* Compute D_i = distance(Ai, X)         */
A = X[rows,];               /* extract block */
D = distance(A, X);         /* compute all distances for obs in block */
print D[r=rows c=(1:n) F=5.2];

The output shows the distances between the 4th, 5th, and 6th observations and all other observation. From the output, you can see that

  • The 4th observation is not less than 2 units from any observation (except itself).
  • The 5th observation is within 2 units of observations 3, 8, 9, and 10.
  • The 6th observation is not less than 2 units from any observation.

The following statements use the LOC function to locate the elements of D that satisfy the cutoff criterion. (You should always check that the vector returned by the LOC function is nonempty before you use it.) The NDX2SUBS function converts indices into (row, col) subscripts. The subscripts are for the smaller block matrix, so you have to map them to the row numbers in the full distance matrix:

/* Extract elements that satisfy the criterion */
idx = loc( D < cutoff );            /* find elements that satisfy cutoff criterion */
*if ncol(idx) > 0 then do;          /* should check that idx is not empty */
   Distance = D[idx];               /* extract close distances */
   s = ndx2sub(dimension(D), idx);  /* s contains matrix subscripts */
   Row = rows[s[,1]];               /* convert rows to row numbers for full matrix */
   Col = s[,2];          
   print Row Col Distance[F=5.2];

Because the distance matrix is symmetric (and we can exclude the diagonal elements, which are 0), let's agree to report only the upper-triangular portion of the (full) distance matrix. Again, you can use the LOC function to find the (Row, Col) pairs for which Row is strictly less than Col. Those are the pairs of observations that we want to store:

   /* filter: output only upper triangular when full matrix is symmetric */
   k = loc( Row < Col );               /* store only upper triangular in full */
   *if ncol(k)>0 then do;              /* should check that k is not empty */
      Row = Row[k];
      Col = Col[k];
      Distance = Distance[k];
      print Row Col Distance[F=5.2];

Success! The printed observations are the elements of the upper-triangular distance matrix that satisfy the cutoff criterion. If you repeat these computations for other blocks (observations 1-3, 4-6, 7-9, 10-12, and 13) then you can compute all pairs of observations that are less than 2 units from each other. For the record, here are the 10 pairs of observations that are within 0.5 units of each other. Although the point of this article is to avoid computing the full 13 x 13 distance matrix, for this small example you can check the following table by visual inspection of the full distance matrix, F = distance(X, X).

Block computations to compute distances in large data

The previous section describes the main ideas about using block matrix computations to perform distance computations for large data when the whole N x N distance matrix will not fit into memory. The same ideas apply to cosine similarity and or other pairwise computations. For various reasons, I recommend that the size of the block matrices be less than 2 GB.

The following DATA step generates 50,000 observations, which means there are 2.5 billion pairs of observations. The program computes all pairwise distances and finds all pairs that are within 0.5 units. The full distance matrix is about 18.6 GB, but the program never computes the full matrix. Instead, the program uses a block size of 5,000 rows and computes a total of 10 distance matrices, each of which requires 1.86 GB of RAM. (See "How much RAM do I need to store that matrix?) For each of these block matrices, the program finds and writes the pairs of observations that are closer than 0.5 units. The program requires about a minute to run on my desktop PC. If you have a slow PC, you might want to use N=20000 to run the example.

/* Create 10 variables and a large number (N) of obs. The complete 
   distance matrix requires N^2 elements. */
%let N = 50000;
data Have;
call streaminit(12345);
array x[10];
do i = 1 to &N;
   do j = 1 to dim(x);  x[j] = rand("Uniform", -1, 1);  end;
keep x:;
/* Find pairs of observations that within 'cutoff' of each other.
   Write these to a data set. */
proc iml;
use Have; read all var _NUM_ into X; close;
n = nrow(X);                    /* number of observations */
m = 5000;                       /* block size: best if n*m elements fit into 2 GB RAM */
cutoff = 0.5;                   /* report pairs where distance < cutoff */
startRow = 1;
/* set up output data set to store results */
create Results var {'Obs1' 'Obs2' 'Distance'};
do while (startRow <= n);
   /* Block matrix X = [A1, A2, A3, ..., Ak]  */
   endRow = min(startRow + m - 1, n);  /* don't exceed size of matrix */
   rows = startRow:endRow;
   /* Compute D_i = distance(Ai, X)         */
   A = X[rows,];                /* extract block */
   D = distance(A, X);          /* compute all distances for obs in block */
   /* Extract elements that satisfy the criterion */
   idx = loc( D < cutoff );     /* find elements that satisfy cutoff criterion */
   if ncol(idx) > 0 then do;
      Distance = D[idx];        /* extract close distances */
      s = ndx2sub(dimension(D), idx);  /* s contains matrix subscripts */
      Row = rows[s[,1]];        /* convert rows to row numbers for full matrix */ 
      Col = s[,2];          
      /* filter: output only upper triangular when full matrix is symmetric */
      k = loc( Row < Col );     /* store only upper triangular in full */
      if ncol(k)>0 then do;
         Obs1 = Row[k]; Obs2 = Col[k]; Distance = Distance[k];
         append;                /* Write results to data set */
   startRow = startRow + m;
title "Distances Less than 0.5 for Pairs of Observations";
title2 "N = 50,000";
proc sgplot data=Results;
   histogram Distance;

From the 2.5 billion distances between pairs of observations, the program found 1620 pairs that are closer than 0.5 units in a 10-dimensional Euclidean space. The distribution of those distances is shown in the following histogram.

Further improvements

When the full matrix is symmetric, as in this example, you can improve the algorithm. Because only the upper triangular elements are relevant, you can improve the efficiency of the algorithm by skipping computations that are in the lower triangular portion of the distance matrix.

The k_th block matrix, A, represents consecutive rows in the matrix X. Let L be the row number of the first row of X that is contained in A. You do not need to compute the distance between rows of A and all rows of X. Instead, set Y = X[L:N, ] and compute distance(A, Y). With this modification, you need additional bookkeeping to map the column numbers in the block computation to the column numbers in the full matrix. However, for large data sets, you decrease the computational time by almost half. I leave this improvement as an exercise for the motivated reader.


In summary, you can use a matrix langue like SAS/IML to find pairs of observations that are close to (or far from) each other. By using block-matrix computations, you never have to compute the full N x N distance matrix. Instead, you can compute smaller pieces of the matrix and find the pairs of observations that satisfy your search criterion.

Because this problem is quadratic in the number of observations, you can expect the program to take four times as long if you double the size of the input data. So the program in this article will take about four minutes for 100,000 observations and 16 minutes for 200,000 computations. If that seems long, remember that with 200,000 observations, the program must compute 40 billion pairwise distances.

The post Perform matrix computations when the matrices don't fit in memory appeared first on The DO Loop.

6月 122019

For linear regression models, there is a class of statistics that I call deletion diagnostics or leave-one-out statistics. These observation-wise statistics address the question, "If I delete the i_th observation and refit the model, what happens to the statistics for the model?" For example:

  • The PRESS statistic is similar to the residual sum of squares statistic but is based on fitting n different models, where n is the sample size and the i_th model excludes the i_th observation.
  • Cook's D statistic measures the influence of the i_th observation on the fit.
  • The DFBETAS statistics measure how the regression estimates change if you delete the i_th observation.

Although most references define these statistics in terms of deleting an observation and refitting the model, you can use a mathematical trick to compute the statistics without ever refitting the model! For example, the Wikipedia page on the PRESS statistic states, "each observation in turn is removed and the model is refitted using the remaining observations. The out-of-sample predicted value is calculated for the omitted observation in each case, and the PRESS statistic is calculated as the sum of the squares of all the resulting prediction errors." Although this paragraph is conceptually correct, theSAS/STAT documentation for PROC GLMSELECT states that the PRESS statistic "can be efficiently obtained without refitting the model n times."

A rank-1 update to the inverse of a matrix

Recall that you can use the "normal equations" to obtain the least squares estimate for the regression problem with design matrix X and observed responses Y. The normal equations are b = (X`X)-1(X`Y), where X`X is known as the sum of squares and crossproducts (SSCP) matrix and b is the least squares estimate of the regression coefficients. For data sets with many observations (very large n), the process of reading the data and forming the SSCP is a relatively expensive part of fitting a regression model. Therefore, if you want the PRESS statistic, it is better to avoid rebuilding the SSCP matrix and computing its inverse n times. Fortunately, there is a beautiful result in linear algebra that relates the inverse of the full SSCP matrix to the inverse when a row of X is deleted. The result is known as the Sherman-Morrison formula for rank-1 updates.

The key insight is that one way to compute the SSCP matrix is as a sum of outer products of the rows of X. Therefore if xi is the i_th row of X, the SCCP matrix for data where xi is excluded is equal to X`X - xi`xi. You have to invert this matrix to find the least squares estimates after excluding xi.

The Sherman-Morrison formula enables you to compute the inverse of X`X - xi`xi when you already know the inverse of X`X. For brevity, set A = X`X. The Sherman-Morrison formula for deleting a row vector xi` is
(A – xi`xi)-1 = A-1 + A-1 xi`xi A-1 / (1 – xiA-1xi`)

Implement the Sherman-Morrison formula in SAS

The formula shows how to compute the inverse of the updated SSCP by using a matrix-vector multiplication and an outer product. Let's use a matrix language to demonstrate the update method. The following SAS/IML program reads in a small data set, forms the SSCP matrix (X`X), and computes its inverse:

proc iml;
use Sashelp.Class;   /* read data into design matrix X */
read all var _NUM_ into X[c=varNames];  
XpX = X`*X;          /* form SSCP */
XpXinv = inv(XpX);   /* compute the inverse */

Suppose you want to compute a leave-one-out statistic such as PRESS. For each observation, you need to estimate the parameters that result if you delete that observation. For simplicity, let's just look at deleting the first row of the X matrix. The following program creates a new design matrix (Z) that excludes the row, forms the new SSCP matrix, and finds its inverse:

/* Inefficient: Manually delete the row from the X matrix 
   and recompute the inverse */
n = nrow(X);
Z = X[2:n, ];       /* delete first row */
ZpZ = Z`*Z;         /* reform the SSCP matrix */
ZpZinv = inv(ZpZ);  /* recompute the inverse */
print ZpZinv[c=varNames r=varNames L="Inverse of SSCP After Deleting First Row"];

The previous statements essentially repeat the entire least squares computation. To compute a leave-one-out statistic, you would perform a total of n similar computations.

In contrast, it is much cheaper to apply the Sherman-Morrison formula to update the inverse of the original SSCP. The following statements apply the Sherman-Morrison formula as it is written:

/* Alternative: Do not change X or recompute the inverse. 
   Use the Sherman-Morrison rank-1 update formula.–Morrison_formula */
r = X[1, ];          /* first row */
rpr = r`*r;          /* outer product */
/* apply Sherman-Morrison formula */
NewInv = XpXinv + XPXinv*rpr*XPXinv / (1 - r*XpXinv*r`);
print NewInv[c=varNames r=varNames L="Inverse from Sherman-Morrison Formula"];

These statements compute the new inverse by using the old inverse, an outer product, and a few matrix multiplications. Notice that the denominator of the Sherman-Morrison formula includes the expression r*(X`X)-1*r`, which is the leverage statistic for the i_th row.

The INVUPDT function in SAS/IML

Because it is important to be able to update an inverse matrix quickly when an observation is deleted (or added!), the SAS/IML language supports the IMVUPDT function, which implements the Sherman-Morrison formula. You merely specify the inverse matrix to update, the vector (as a column vector) to use for the rank-one update, and an optional scalar value, which is usually +1 if you are adding a new observation and -1 if you are deleting an observation. For example, the following statements are the easiest way to implement the Sherman-Morrison formula in SAS for a leave-one-out statistic:

NewInv2 = invupdt(XpXinv, r`, -1);
print NewInv2[c=varNames r=varNames L="Inverse from INVUPDT Function"];

The output is not displayed because the matrix NewInv2 is the same as the matrix NewInv in the previous section. The documentation includes additional examples.

The general Sherman-Morrison-Woodbury formula

The Sherman-Morrison formula shows how to perform a rank-1 update of an inverse matrix. There is a more general formula, called the Sherman-Morrison-Woodbury formula, which enables you to update an inverse for any rank-k modification of the original matrix. The general formula (Golub and van Loan, p. 51 of 2nd ed. or p. 65 of 4th ed.) shows how to find the matrix of a rank-k modification to a nonsingular matrix, A, in terms of the inverse of A. The general formula is
(A + U VT)-1 = A-1 – A-1 U (I + VT A-1 U) VT A-1
where U and V are p x k and all inverses are assumed to exist. When k = 1, the matrices U and V become vectors and the k x k identify matrix becomes the scalar value 1. In the previous section, U equals -xiT and V equals xiT.

The Sherman-Morrison-Woodbury formula is one of my favorite results in linear algebra. It shows that a rank-k modification of a matrix results in a rank-k modification of its inverse. It is not only a beautiful theoretical result, but it has practical applications to leave-one-out statistics because you can use the formula to quickly compute the linear regression model that results by dropping an observation from the data. In this way, you can study the influence of each observation on the model fit (Cook's D, DFBETAS,...) and perform leave-one-out cross-validation techniques, such as the PRESS statistic.

The post Leave-one-out statistics and a formula to update a matrix inverse appeared first on The DO Loop.

5月 222019

The eigenvalues of a matrix are not easy to compute. It is remarkable, therefore, that with relatively simple mental arithmetic, you can obtain bounds for the eigenvalues of a matrix of any size. The bounds are provided by using a marvelous mathematical result known as Gershgorin's Disc Theorem. For certain matrices, you can use Gershgorin's theorem to quickly determine that the matrix is nonsingular or positive definite.

The Gershgorin Disc Theorem appears in Golub and van Loan (p. 357, 4th Ed; p. 320, 3rd Ed), where it is called the Gershgorin Circle Theorem. The theorem states that the eigenvalues of any N x N matrix, A, are contained in the union of N discs in the complex plane. The center of the i_th disc is the i_th diagonal element of A. The radius of the i_th disc is the absolute values of the off-diagonal elements in the i_th row. In symbols,
Di = {z ∈ C | |z - Ai i| ≤ ri }
where ri = Σi ≠ j |Ai j|. Although the theorem holds for matrices with complex values, this article only uses real-valued matrices.

An example of Gershgorin discs is shown to the right. The discs are shown for the following 4 x 4 symmetric matrix:

At first glance, it seems inconceivable that we can know anything about the eigenvalues without actually computing them. However, two mathematical theorems tells us quite a lot about the eigenvalues of this matrix, just by inspection. First, because the matrix is real and symmetric, the Spectral Theorem tells us that all eigenvalues are real. Second, the Gershgorin Disc Theorem says that the four eigenvalues are contained in the union of the following discs:

  • The first row produces a disc centered at x = 200. The disc has radius |30| + |-15| + |5| = 50.
  • The second row produces a disc centered at x = 100 with radius |30| + |5| + |5| = 40.
  • The third row produces a disc centered at x = 55 with radius |-15| + |5| + |0| = 20.
  • The last row produces a disc centered at x = 25 with radius |5| + |5| + |0| = 10.

Although the eigenvalues for this matrix are real, the Gershgorin discs are in the complex plane. The discs are visualized in the graph at the top of this article. The true eigenvalues of the matrix are shown inside the discs.

For this example, each disc contains an eigenvalue, but that is not true in general. (For example, the matrix A = {1 −1, 2 −1} does not have any eigenvalues in the disc centered at x=1.) What is true, however, is that disjoint unions of discs must contain as many eigenvalues as the number of discs in each disjoint region. For this matrix, the discs centered at x=25 and x=200 are disjoint. Therefore they each contain an eigenvalue. The union of the other two discs must contain two eigenvalues, but, in general, the eigenvalues can be anywhere in the union of the discs.

The visualization shows that the eigenvalues for this matrix are all positive. That means that the matrix is not only symmetric but also positive definite. You can predict that fact from the Gershgorin discs because no disc intersects the negative X axis.

Of course, you don't have to perform the disc calculations in your head. You can write a program that computes the centers and radii of the Gershgorin discs, as shown by the following SAS/IML program, which also computes the eigenvalues for the matrix:

proc iml;
A = { 200  30 -15  5,
       30 100   5  5,
      -15   5  55  0, 
        5   5   0 15};
evals = eigval(A);                 /* compute the eigenvalues */
center = vecdiag(A);               /* centers = diagonal elements */
radius = abs(A)[,+] - abs(center); /* sum of abs values of off-diagonal elements of each row */
discs = center || radius || round(evals,0.01);
print discs[c={"Center" "Radius" "Eigenvalue"} L="Gershgorin Discs"];

Diagonally dominant matrices

For this example, the matrix is strictly diagonally dominant. A strictly diagonally dominant matrix is one for which the magnitude of each diagonal element exceeds the sum of the magnitudes of the other elements in the row. In symbols, |Ai i| > Σi ≠ j |Ai j| for each i. Geometrically, this means that no Gershgorin disc intersects the origin, which implies that the matrix is nonsingular. So, by inspection, you can determine that his matrix is nonsingular.

Gershgorin discs for correlation matrices

The Gershgorin theorem is most useful when the diagonal elements are distinct. For repeated diagonal elements, it might not tell you much about the location of the eigenvalues. For example, all diagonal elements for a correlation matrix are 1. Consequently, all Gershgorin discs are centered at (1, 0) in the complex plane. The following graph shows the Gershgorin discs and the eigenvalues for a 10 x 10 correlation matrix. The eigenvalues of any 10 x 10 correlation matrix must be real and in the interval [0, 10], so the only new information from the Gershgorin discs is a smaller upper bound on the maximum eigenvalue.

Gershgorin discs for unsymmetric matrices

Gershgorin's theorem can be useful for unsymmetric matrices, which can have complex eigenvalues. The SAS/IML documentation contains the following 8 x 8 block-diagonal matrix, which has two pairs of complex eigenvalues:

A = {-1  2  0       0       0       0       0  0,
     -2 -1  0       0       0       0       0  0,
      0  0  0.2379  0.5145  0.1201  0.1275  0  0,
      0  0  0.1943  0.4954  0.1230  0.1873  0  0,
      0  0  0.1827  0.4955  0.1350  0.1868  0  0,
      0  0  0.1084  0.4218  0.1045  0.3653  0  0,
      0  0  0       0       0       0       2  2,
      0  0  0       0       0       0      -2  0 };

The matrix has four smaller Gershgorin discs and three larger discs (radius 2) that are centered at (-1,0), (2,0), and (0,0), respectively. The discs and the actual eigenvalues of this matrix are shown in the following graph. Not only does the Gershgorin theorem bound the magnitude of the real part of the eigenvalues, but it is clear that the imaginary part cannot exceed 2. In fact, this matrix has eigenvalues -1 ± 2 i, which are on the boundary of one of the discs, which shows that the Gershgorin bound is tight.


In summary, the Gershgorin Disc Theorem provides a way to visualize the possible location of eigenvalues in the complex plane. You can use the theorem to provide bounds for the largest and smallest eigenvalues.

I was never taught this theorem in school. I learned it from a talented mathematical friend at SAS. I use this theorem to create examples of matrices that have particular properties, which can be very useful for developing and testing software.

This theorem also helped me to understand the geometry behind "ridging", which is a statistical technique in which positive multiples of the identity are added to a nearly singular X`X matrix. The Gershgorin Disc Theorem shows the effect of ridging a matrix is to translate all of the Gershgorin discs to the right, which moves the eigenvalues away from zero while preserving their relative positions.

You can download the SAS program that I used to create the images in this article.

Further reading

There are several papers on the internet about Gershgorin discs. It is a favorite topic for advanced undergraduate projects in mathematics.

The post Gershgorin discs and the location of eigenvalues appeared first on The DO Loop.

4月 152019

A quadratic form is a second-degree polynomial that does not have any linear or constant terms. For multivariate polynomials, you can quickly evaluate a quadratic form by using the matrix expression
x` A x
This computation is straightforward in a matrix language such as SAS/IML. However, some computations in statistics require that you evaluate a quadratic form that looks like
x` A-1 x
where A is symmetric and positive definite. Typically, you know A, but not the inverse of A. This article shows how to compute both kinds of quadratic forms efficiently.

What is a quadratic form?

For multivariate polynomials, you can represent the purely quadratic terms by a symmetric matrix, A. The polynomial is q(x) = x` A x, where x is an d x 1 vector and A is a d x d symmetric matrix. For example, if A is the matrix {9 -2, -2 5} and x = {x1, x2} is a column vector, the expression x` A x equals the second degree polynomial q(x1, x2) = 9*x12 - 4*x1 x2 + 5*x22. A contour plot of this polynomial is shown below.

Contour plot of the quadratic form x` A x for A = {9 -2, -2 5}

Probably the most familiar quadratic form is the squared Euclidean distance. If you let A be the d-dimensional identity matrix, then the squared Euclidean distance of a vector x from the origin is x` Id x = x` x = Σi xi2. You can obtain a weighted squared distance by using a diagonal matrix that has positive values. For example, if W = diag({2, 4}), then x` W x = 2*x12 + 4*x22. If you add in off-diagonal elements, you get cross terms in the polynomial.

Efficient evaluation of a quadratic form

If the matrix A is dense, then you can use matrix multiplication to evaluate the quadratic form. The following symmetric 3 x 3 matrix defines a quadratic polynomial in 3 variables. The multiplication evaluates the polynomial at (x1, x2, x3) = (-1. 2. 0.5).

proc iml;
   q(x1, x2, x3) = 4*x1**2 + 6*x2**2 + 9*x3**2 +
                   2*3*x1*x2 + 2*2*x1*x3 + 2*1*x2*x3
A = {4 3 2,
     3 6 1,
     2 1 9};
x = {-1, 2, 0.5};
q1 = x`*A*x;
print q1;

When you are dealing with large matrices, always remember that you should never explicitly form a large diagonal matrix. Multiplying with a large diagonal matrix is a waste of time and memory. Instead, you can use elementwise multiplication to evaluate the quadratic form more efficiently:

w = {4, 6, 9};
q2 = x`*(w#x);     /* more efficient than q = x`*diag(w)*x; */
print q2;

Quadratic forms with positive definite matrices

In statistics, the matrix is often symmetric positive definite (SPD). The matrix A might be a covariance matrix (for a nondegenerate system), the inverse of a covariance matrix, or the Hessian evaluated at the minimum of a function. (Recall that the inverse of a symmetric positive definite (SPD) matrix is SPD.) An important example is the squared Mahalanobis distance x` Σ-1 x, which is a quadratic form.

As I have previously written, you can use a trick in linear algebra to efficiently compute the Mahalanobis distance. The trick is to compute the Cholesky decomposition of the SPD matrix. I'll use a large Toeplitz matrix, which is guaranteed to be symmetric and positive definite, to demonstrate the technique. The function EvalSPDQuadForm evaluates a quadratic form defined by the SPD matrix A at the coordinates given by x:

/* Evaluate quadratic form q = x`*A*x, where A is symmetric positive definite.
   Let A = U`*U be the Cholesky decomposition of A.
   Then q = x`(U`*U)*x = (U*x)`(Ux)
   So define z = U*x and compute q = z`*z
start EvalSPDQuadForm(A, x);
   U = root(A);           /* Cholesky root */
   z = U*x;
   return (z` * z);       /* dot product of vectors */
/* Run on large example */
call randseed(123);
N = 1000;
A = toeplitz(N:1);
x = randfun(N, "Normal");
q3 = EvalSPDQuadForm(A, x);  /* efficient */
qCheck = x`*A*x;             /* check computation by using a slower method */
print q3 qCheck;

You can use a similar trick to evaluate the quadratic form x` A-1 x. I previously used this trick to evaluate the Mahalanobis distance efficiently. It combines a Cholesky decomposition (the ROOT function in SAS/IML) and the TRISOLV function for solving triangular systems.

/* Evaluate quadratic form x`*inv(A)*x, where  A is symmetric positive definite.
   Let w be the solution of A*w = x and A=U`U be the Cholesky decomposition.
   To compute  q = x` * inv(A) * x = x` * w, you need to solve for w.
   (U`U)*w = x, so
        First solve triangular system U` z = x   for z,
        then solve triangular system  U w = z   for w 
start EvalInvQuadForm(A, x);
   U = root(A);              /* Cholesky root */
   z = trisolv(2, U, x);     /* solve linear system */
   w = trisolv(1, U, z);  
   return (x` * w);          /* dot product of vectors */
/* test the function */
q4 = EvalInvQuadForm(A, x);  /* efficient */
qCheck = x`*Solve(A,x);      /* check computation by using a slower method */
print q4 qCheck;

You might wonder how much time is saved by using the Cholesky root and triangular solvers, rather than by computing the general solution by using the SOLVE function. The following graph compares the average time to evaluate the same quadratic form by using both methods. You can see that for large matrices, the ROOT-TRISOLV method is many times faster than the straightforward method that uses SOLVE.

In summary, you can use several tricks in numerical linear algebra to efficiently evaluate a quadratic form, which is a multivariate quadratic polynomial that contains only second-degree terms. These techniques can greatly improve the speed of your computational programs.

The post Efficient evaluation of a quadratic form appeared first on The DO Loop.

4月 102019

In numerical linear algebra, there are often multiple ways to solve a problem, and each way is useful in various contexts. In fact, one of the challenges in matrix computations is choosing from among different algorithms, which often vary in their use of memory, data access, and speed. This article describes four ways to perform the sum of squares and crossproducts matrix, which is usually abbreviated as the SSCP matrix. The SSCP matrix is an essential matrix in ordinary least squares (OLS) regression. The normal equations for OLS are written as (X`*X)*b = X`*Y, where X is a design matrix, Y is the vector of observed responses, and b is the vector of parameter estimates, which must be computed. The X`*X matrix (pronounced "X-prime-X") is the SSCP matrix and the topic of this article.

If you are performing a least squares regressions of "long" data (many observations but few variables), forming the SSCP matrix consumes most of the computational effort. In fact, the PROC REG documentation points out that for long data "you can save 99% of the CPU time by reusing the SSCP matrix rather than recomputing it." That is because the SSCP matrix for an n x p data matrix is a symmetric p x p matrix. When n ≫ p, forming the SSCP matrix requires computing with a lot of rows. After it is formed, it is relatively simple to solve a p x p linear system.

SSCP as a matrix computation

Conceptually, the simplest way to compute the SSCP matrix is by multiplying the matrices in an efficient manner. This is what the SAS/IML matrix language does. It recognizes that X`*X is a special kind of matrix multiplication and uses an efficient algorithm to form the product. However, this approach requires that you be able to hold the entire data matrix in RAM, which might not be possible when you have billions of rows.

The following SAS DATA Step creates a data matrix that contains 426 rows and 10 variables. The PROC IML step reads the data into a matrix and forms the SSCP matrix:

/* remove any rows that contain a missing value: */
data cars;
if not nmiss(of _numeric_);
proc iml;
use cars;  read all var _NUM_ into X[c=varNames]; close;
/* If you want, you can add an intercept column: X = j(nrow(X),1,1) || X; */
n = nrow(X);
p = ncol(X);
SSCP = X`*X;         /* 1. Matrix multiplication */
print n p, SSCP;

Although the data has 426 rows, it only has 10 variables. Consequentially, the SSCP matrix is a small 10 x 10 symmetric matrix. (To simplify the code, I did not add an intercept column, so technically this SSCP matrix would be used for a no-intercept regression.)

SSCP as an array of inner product computations

The (i,j)th element of the SSCP matrix is the inner product of the i_th column and the j_th column. In general, Level 1 BLAS computations (inner products) are not as efficient as Level 3 computations (matrix products). However, if you have the data variables (columns) in an array of vectors, you can compute the p(p+1)/2 elements of the SSCP matrix by using the following loops over columns. This computation assumes that you can hold at least two columns in memory at the same time:

/* 2. Compute SSCP[i,j] as an inner-product of i_th and j_th columns */
Y = j(p, p, .);
do i = 1 to p;
   u = X[,i];             /* u = i_th column */
   do j = 1 to i;
      v = X[,j];          /* v = j_th column */
      Y[i,j] = u` * v;   
      Y[j,i] = Y[i,j];    /* assign symmetric element */

SSCP as a sum of outer products of rows

The third approach is to compute the SSCP matrix as a sum of outer products of rows. Before I came to SAS, I considered the outer-product method to be inferior to the other two. After all, you need to form n matrices (each p x p) and add these matrices together. This did not seem like an efficient scheme. However, when I came to SAS I learned that this method is extremely efficient for dealing with Big Data because you never need to keep more than one row of data into memory! A SAS procedure like PROC REG has to read the data anyway, so as it reads each row, it also forms outer product and updates the SSCP. When it finishes reading the data, the SSCP is fully formed and ready to solve!

I've recently been working on parallel processing, and the outer-product SSCP is ideally suited for reading and processing data in parallel. Suppose you have a grid of G computational nodes, each holding part of a massive data set. If you want to perform a linear regression on the data, each node can read its local data and form the corresponding SSCP matrix. To get the full SSCP matrix, you merely need to add the G SSCP matrices together, which are relatively small and thus cheap to pass between nodes. Consequently, any algorithm that uses the SSCP matrix can greatly benefit from a parallel environment when operating on Big Data. You can also use this scheme for streaming data.

For completeness, here is what the outer-product method looks like in SAS/IML:

/* 3. Compute SSCP as a sum of rank-1 outer products of rows */
Z = j(p, p, 0);
do i = 1 to n;
   w = X[i,];           /* Note: you could read the i_th row; no need to have all rows in memory */
   Z = Z + w` * w;

For simplicity, the previous algorithm works on one row at a time. However, it can be more efficient to process multiple rows. You can easily buffer a block of k rows and perform an outer product of the partial data matrix. The value of k depends on the number of variables in the data, but typically the block size, k, is dozens or hundreds. In a procedure that reads a data set (either serially or in parallel), each operation would read k observations except, possibly, the last block, which would read the remaining observations. The following SAS/IML statements loop over blocks of k=10 observations at a time. You can use the FLOOR-MOD trick to find the total number of blocks to process, assuming you know the total number of observations:

/* 4. Compute SSCP as the sum of rank-k outer products of k rows */
/* number of blocks: */
k = 10;                                /* block size */
numBlocks = floor(n / k) + (mod(n, k) > 0);  /* FLOOR-MOD trick */
W = j(p, p, 0);
do i = 1 to numBlocks;
   idx = (1 + k*(i-1)) : min(k*i, n);  /* indices of the next block of rows to process */
   A = X[idx,];                        /* partial data matrix: k x p */
   W = W + A` * A;

All computations result in the same SSCP matrix. The following statements compute the sum of squares of the differences between elements of X`*X (as computed by using matrix multiplication) and the other methods. The differences are zero, up to machine precision.

diff = ssq(SSCP - Y) || ssq(SSCP - Z) || ssq(SSCP - W);
if max(diff) < 1e-12 then 
   print "The SSCP matrices are equivalent.";
print diff[c={Y Z W}];


In summary, there are several ways to compute a sum of squares and crossproducts (SSCP) matrix. If you can hold the entire data in memory, a matrix multiplication is very efficient. If you can hold two variables of the data at a time, you can use the inner-product method to compute individual cells of the SSCP. Lastly, you can process one row at a time (or a block of rows) and use outer products to form the SSCP matrix without ever having to hold large amounts of data in RAM. This last method is good for Big Data, streaming data, and parallel processing of data.

The post 4 ways to compute an SSCP matrix appeared first on The DO Loop.