Matrix Computations

4月 182018

The sweep operator performs elementary row operations on a system of linear equations. The sweep operator enables you to build regression models by "sweeping in" or "sweeping out" particular rows of the X`X matrix. As you do so, the estimates for the regression coefficients, the error sum of squares, and the generalized inverse of the system are updated simultaneously. This is true not only for a single response variable but also for multiple responses.

You might remember elementary row operations from when you learned Gaussian elimination, which is a general technique for solving linear systems of equations. Sweeping a matrix is similar to Gaussian elimination, but the sweep operator is most often used to solve least squares problems for the normal equations X`X = X`Y. Therefore, in practice, the sweep operator is usually applied to the rows of a symmetric positive definite system.

How to use the sweep operator to obtain least squares estimates

Before we discuss details of the sweep operator, let's show how it produces regression estimates. The following program uses the SWEEP function in SAS/IML on a six-observation data set from J. Goodnight (1979, p. 153). The first three columns of the matrix M contains the design matrix for the explanatory variables; the first column of M represents the intercept parameter. The last column of the augmented matrix is the response variable. The sweep operator is called on the uncorrected sum of squares and crossproducts matrix (the USSCP matrix), which is computed as M`*M. In block form, the USSCP matrix contains the submatrices X`X, X`Y, Y`X, and Y`Y. The following program sweeps the first three rows of the USSCP matrix:

proc iml;
/*Intercept X1  X2  Y */
M  = {  1   1   1   1,
        1   2   1   3,
        1   3   1   3,
        1   1  -1   2,
        1   2  -1   2,
        1   3  -1   1};
S = sweep(M`*M, 1:3);     /* sweep first three rows (Intercept, X1, X2) */
print S;
The sweep operator applied to all ros of a USSCP matrix to obtain regression coefficients, SSe, and generalized inverse

The resulting matrix (S) is divided into blocks by black lines. The leading 3x3 block is the (generalized) inverse of the X`X matrix. The first three elements of the last column (outlined in red) are the least squares estimates for the regression model. The lower right cell (outlined in blue) is the sum of squared errors (SSE) for the regression model.

You can compare the result with the results from PROC REG, as follows. The parameter estimates and error sum of squares are highlighted for easy comparison:

data Have;
input X1 X2 Y @@;
1   1   1   2   1   3   3   1   3
1  -1   2   2  -1   2   3  -1   1
proc reg data=Have USSCP plots=none;
   model Y = X1 X2;
   ods select USSCP ANOVA ParameterEstimates;
run; quit;
Regression estimates and SSE for least squares analysis in SAS PROC REG

As claimed, the sweep operator produces the same parameter estimates and SSE statistic as PROC REG. The next sections discuss additional details of the sweep operator.

The basics of the sweep operator

Goodnight (1979) defines the sweep operator as the following sequence of row operations. Given a symmetric positive definite matrix A, SWEEP(A, k) modifies the matrix A by using the pivot element A[k,k] and the k_th row, as follows:

  1. Let D = A[k,k] be the k_th diagonal element.
  2. Divide the k_th row by D.
  3. For every other row i ≠ k, let B = A[i,k] be the i_th element of the k_th column. Subtract B x (row k) from row i. Then set A[i,k] = –B/D.
  4. Set A[k,k] = 1/D.

You could program these steps in a matrix language such as SAS/IML, but as shown previously, the SAS/IML language supports the SWEEP function as a built-in function.

Sweeping in effects

The following program uses the same data as the first section. The USSCP matrix (S0) represents the normal equations (plus a little extra information). If you sweep the first row of the S0 matrix, you solve the regression problem for an intercept-only model:

proc iml;
/*Intercept X1  X2  Y */
M  = {  1   1   1   1,    1   2   1   3,    1   3   1   3,
        1   1  -1   2,    1   2  -1   2,    1   3  -1   1};
S0 = M`*M;          print S0;   /* USSCP matrix */
/* sweep in 1st row (Intercept) */
S1 = sweep(S0, 1);   print S1[F=BEST6.];
The sweep operator applied to the first row of a USSCP matrix to obtain an intercept-only model

The first element in the fourth row (S1[1,4], outlined in red) is the parameter estimate for the intercept-only model. The last element (S1[4,4], outlined in blue) is the sum of squared errors (SSE) for the intercept-only model.

If you "sweep in" the second row, you solve the least squares problem for a model that includes the intercept and X1. The corresponding elements of the last column contain the parameter estimates for the model. The lower-right element again contains the SSE for the revised model. If you proceed to "sweep in" the third row, the last column again contains the parameter estimates and the SSE, this time for the model that includes the intercept, X1, and X2:

/* sweep in 2nd row (X1) and 3rd row (X2) */
S2 = sweep(S1, 2);   print S2[F=BEST6.]; 
S3 = sweep(S2, 3);   print S3[F=BEST6.];
The sweep operator applied to subsequent rows of a USSCP matrix to 'sweep in' additional effects

Sweeping out effects

One of the useful features of the sweep operators is that you can remove an effect from a model as easily as you can add it. The sweep operator has a reversibility property: if you sweep a row a second time you "undo" the first sweep. In symbols, the operator has the property that SWEEP( SWEEP(A,k), k) = A for any row k. For example, if you want to take the X1 variable out of the model, merely sweep on the second row again:

S4 = sweep(S3, 2);   print S4[F=BEST6.]; /* model with Intercept + X2 */
The sweep operator applied twice 'sweeps out' an effect from a regression model

Of course, as shown in an earlier section, you do not need to sweep in one row at a time. You can sweep in multiple rows with one call by specifying a vector of indices. In fact, the sweep operator also commutes with itself, which means that you can specify the rows in any order. The call SWEEP(S0, 1:3) is equivalent to SWEEP(S0, {3 1 2}) or SWEEP(S0, {2 3 1}) or any other permutation of {1 2 3}.


The SWEEP operator (which is implemented in the SWEEP function in SAS/IML) enables you to construct least squares model estimates from the USSCP matrix, which is a block matrix that contains the normal equations. You can "sweep in" rows to add effects to a model or "sweep out" rows to remove effects. After each operation, the parameter estimates appear in the portion of the adjusted USSCP matrix that corresponds to the block for X`Y. Furthermore, the residual sum of squares appears in the portion of the adjusted USSCP matrix that corresponds to Y`Y.

Further reading

The sweep operator is known by different names in different fields and has been rediscovered several times. In statistics, the operator was popularized in a TAS article by J. Goodnight (1979) who mentions Ralston (1960) and Beaton (1964) as early researchers. Outside of statistics, the operator is known as the Principal Pivot Transform (Tucker, 1960), gyration (Duffy, Hazony, and Morrison, 1966), or exchange (Stewart and Stewart, 1998). Tsatsomeros (2000) provides an excellent review of the literature and the history of the sweep operator.

The post The sweep operator: A fundamental operation in regression appeared first on The DO Loop.

4月 112018

Sometimes it is important to ensure that a matrix has unique rows. When the data are all numeric, there is an easy way to detect (and delete!) duplicate rows in a matrix.

The main idea is to subtract one row from another. Start with the first row and subtract it from every row beneath it. If any row of the difference matrix is identically zero, then you have found a row that is identical to the first row. Then do the same thing for the second row: subtract it from the third and higher rows and see if you obtain a row of zeros. Repeat this process for the third and higher rows.

The following SAS/IML program implements this algorithm for an arbitrary numeric matrix. It returns a binary indicator variable (a column vector). The i_th row of the returned vector is 1 for rows that are duplicates of some previous row:

proc iml;
/* return binary vector that indicates which rows in a 
   numerical matrix are duplicates */
start DupRows(A);
   N = nrow(A);
   dupRows = j(N, 1, 0);         /* binary indicator matrix */
   do i = 1 to N-1;
      if dupRows[i] = 0 then do; /* skip rows that are known to be duplicates */
         r = i+1:N;              /* remaining rows */ 
         M = A[r, ]-A[i, ];      /* subtract current row from remaining rows */
         b = M[ ,##];            /* sum of squares = 0 iff duplicate row */
         idx = loc( b=0 );       /* any duplicate rows for current row? */
         if ncol(idx) > 0 then
            dupRows[r[idx]] = 1; /* if so, flag them */
   return dupRows;

To test the function, consider the following 6 x 3 matrix. You can see by inspection that the matrix has three duplicate rows: the third, fifth, and sixth rows. You can call the DupRows function and print the matrix adjacent to the binary vector that indicates the duplicate rows:

A = {1 1 1,
     1 2 3,
     1 1 1,  /* duplicate row */
     3 2 1,
     3 2 1,  /* duplicate row */
     1 1 1};  /* duplicate row */
DupRows = DupRows(A);
print A DupRows;

You can use the DupRows function to write a function that excludes the duplicate rows in a matrix and returns only the unique rows, as follows:

/* return the unique rows in a numerical matrix */
start UniqueRows(A);
   uniqRows = loc(DupRows(A)=0);
   return(A[uniqRows, ]);   /* return rows that are NOT duplicates */
U = UniqueRows(A);
print U;

I like this algorithm because it uses subtraction, which is a very fast operation. However, This algorithm has a complexity of O(n(n-1)/2), where n is the number of rows in the matrix, so it is best for small to moderate values of n.

For long and skinny matrices (or for character data), it might be more efficient to sort the matrix as I showed in a previous article about how to find (and count) the unique rows of a matrix. However, the sorting algorithm requires that you sort the data by ALL columns, which can be inefficient for very wide and short data tables. For small and medium-sized data, you can use either method to find the unique rows in a matrix.

This article discusses how to remove duplicates records for numerical matrices in SAS/IML. In Base SAS, you can use PROC SORT or PROC SQL to remove duplicate records from a SAS DATA set, as shown in Lafler (2017) and other references.

The post Find the unique rows of a numeric matrix appeared first on The DO Loop.

3月 212018
Pseudocode for the conjugate gradient method for solving a linear system (from Wikipedia)

I often claim that the "natural syntax" of the SAS/IML language makes it easy to implement an algorithm or statistical formula as it appears in a textbook or journal. The other day I had an opportunity to test the truth of that statement. A SAS programmer wanted to implement the conjugate gradient algorithm, which is an iterative method for solving a system of equations with certain properties. I looked up the Wikipedia article about the conjugate gradient method and saw the following text:

"The algorithm is detailed below for solving Ax = b where A is a real, symmetric, positive-definite matrix. The input vector x0 can be an approximate initial solution or 0." The text was followed by pseudocode (in the box; click to enlarge)

The conjugate gradient method in SAS/IML

I used the pseudocode to implement the conjugate gradient method in SAS/IML. (The method is explained further in the next section.) I chose not to use the 'k' and 'k+1' notation but merely to overwrite the old values of variables with new values. In the program, the variables x, r, and p are vectors and the variable A is a matrix.

/* Linear conjugate gradient method as presented in Wikipedia:
   Solve the linear system A*x = b, where A is a symmetric positive definite matrix.
   The algorithm converges in at most n iterations, where n is the dimension of A.
   This function requires an initial guess, x0, which can be the zero vector.
   The function does not verify that the matrix A is symmetric positive definite.
   This module returns a matrix mX= x1 || x2 || ... || xn 
   whose columns contain the iterative path from x0 to the approximate solution vector xn.  */
proc iml;
start ConjGrad( x0, A, b, Tolerance=1e-6 );
  x = x0;                            /* initial guess */
  r = b - A*x0;                      /* residual */
  p = r;                             /* initial direction */
  mX = j(nrow(A), nrow(A), .);       /* optional: each column is result of an iteration */
  do k = 1 to ncol(mX) until ( done );
    rpr = r`*r;  Ap = A*p;           /* store partial results */
    alpha =  rpr / (p`*Ap);          /* step size */
    x = x + alpha*p;   mX[,k] = x;   /* remember new guess */
    r = r - alpha*Ap;                /* new residual */
    done = (sqrt(rpr) < Tolerance);  /* stop if ||r|| < Tol */
    beta = (r`*r) / rpr;             /* new stepsize */
    p = r + beta*p;                  /* new direction */
  return ( mX );     /* I want to return the entire iteration history, not just x */

The SAS/IML program is easy to read and looks remarkably similar to the pseudocode in the Wikipedia article. This is in contrast to lower-level languages such as C in which the implementation looks markedly different from the pseudocode.

Convergence of the conjugate gradient method

The conjugate gradient method is an iterative method to find the solution of a linear system A*x=b, where A is a symmetric positive definite n x n matrix, b is a vector, and x is the unknown solution vector. (Recall that symmetric positive definite matrices arise naturally in statistics as the crossproduct matrix (or covariance matrix) of a set of variables.) The beauty of the conjugate gradient method is twofold: it is guaranteed to find the solution (in exact arithmetic) in at most n iterations, and it requires only simple operations, namely matrix-vector multiplications and vector additions. This makes it ideal for solving system of large sparse systems, because you can implement the algorithm without explicitly forming the coefficient matrix.

It is fun to look at how the algorithm converges from the initial guess to the final solution. The following example converges gradually, but I know of examples for which the algorithm seems to make little progress for the first n – 1 iterations, only to make a huge jump on the final iteration straight to the solution!

Recall that you can use a Toeplitz matrix to construct a symmetric positive definite matrix. The following statements define a banded Toeplitz matrix with 5 on the diagonal and specify the right-hand side of the system. The zero vector is used as an initial guess for the algorithm. The call to the ConjGrad function returns a matrix whose columns contain the iteration history for the method. For this problem, the method requires five iterations to converge, so the fifth column contains the solution vector. You can check that the solution to this system is (x1, x2, 3, x4, x5) = (-0.75, 0, 1.5, 0.5, -0.75), either by performing matrix multiplication or by using the SOLVE function in IML to compute the solution vector.

A = {5 4 3 2 1,          /* SPD Toeplitz matrix */
     4 5 4 3 2, 
     3 4 5 4 3,
     2 3 4 5 4,
     1 2 3 4 5};
b = {1, 3, 5, 4, 2};     /* right hand side */
n = ncol(A);
x0 = j(n,1,0);           /* the zero vector */
traj = ConjGrad( x0, A, b );
x = traj[ ,n];           /* for this problem, solution is in last column */

It is instructive to view how the iteration progresses from an initial guess to the final solution. One way to view the iterations is to compute the Euclidean distance between each partial solution and the final solution. You can then graph the distance between each iteration and the final solution, which decreases monotonically (Hestenes and Stiefel, 1952).

Distance = sqrt( (traj - x)[##, ] ); /* || x[j] - x_Soln || */
Iteration = 1:n;
title "Convergence of Conjugate Gradient Method";
call series(Iteration, Distance) grid={x y} xValues=Iteration 
     option="markers" label={"" "Distance to Solution"};

Notice in the distance graph that the fourth iteration almost equals the final solution. You can try different initial guesses to see how the guess affects the convergence.

In addition to the global convergence, you can visualize the convergence in each coordinate. Because the vectors live in high-dimensional space, it impossible to draw a scatter plot of the iterative solutions. However, you can visualize each vector in "parallel coordinates" as a sequence of line segments that connect the coordinates in each variable. In the following graph, each "series plot" represents a partial solution. The curves are labeled by the iteration number. The blue horizontal line represents the initial guess (iteration 0). The partial solution after the first iteration is shown in red and so on until the final solution, which is displayed in black. You can see that the third iteration (for this example) is close to the final solution. The fourth partial solution is so close that it cannot be visually distinguished from the final solution.

Convergence of the conjugate gradient method

In summary, this post shows that the natural syntax of the SAS/IML language makes it easy to translate pseudocode into a working program. The article focuses on the conjugate gradient method, which solves a symmetric, positive definite, linear system in at most n iterations. The article shows two ways to visualize the convergence of the iterative method to the solution.

The post The conjugate gradient method appeared first on The DO Loop.

3月 052018

Many people know that a surface can contain a saddle point, but did you know that you can define the saddle point of a matrix? Saddle points in matrices are somewhat rare, which means that if you choose a random matrix you are unlikely to choose one that has a saddle point. This article shows how to

  • Locate a saddle point in a matrix
  • Use simulation to estimate the probability of finding a saddle point in a random matrix.
  • Use simulation to explore the distribution of the location of a saddle point in a random matrix.

A saddle point of a matrix

You might remember from multivariable calculus that a critical point (x0, y0) is a saddle point of a function f if it is a local minimum of the surface in one direction and a local maximum in another direction. The canonical example is f(x,y) = x2 - y2 at the point (x0, y0) = (0, 0). Along the horizontal line y=0, the point (0, 0) is a local minimum. Along the vertical line x=0, the point (0, 0) is a local maximum.

The definition of a saddle point of a matrix is similar. For an n x p matrix M, the cell (i, j) is a saddle point of the matrix if M[i, j] is simultaneously the minimum value of the i_th row and the maximum value of the j_th column.

Consider the following 3 x 3 matrices. The minimum value for each row is highlighted in blue. The maximum value for each column is highlighted in red. Two of the matrices have saddle points, which are highlighted in purple.

  • For the matrix M1, the (3, 1) cell is a saddle point because 7 is the smallest value in row 3 and the largest value in column 1.
  • For the matrix M2, the (2, 2) cell is a saddle point because 5 is the smallest value in row 2 and the largest value in column 2.
  • The matrix M3 does not have a saddle point. No cell is simultaneously the smallest value in its row and the largest value in its column.
The saddle point of a matrix

Compute the location of a saddle point in a matrix

In a matrix language such as SAS/IML, you can use row and column operators to find the location of a saddle point, if it exists. The SAS/IML language supports subscript reduction operators, which enable you to compute certain statistics for all rows or all columns without writing a loop. For this problem, you can use the "index of the minimum operator" (>:<) to find the elements that are the minimum for each row and the "index of the maximum operator" (<:>) to find the elements that are the maximum for each column. To find whether there is an element in common, you can use the XSECT function find the intersection of the two sets. (SAS/IML supports several functions that perform set operations.)

A function that computes the location of the saddle point follows. It uses the SUB2NDX function to convert subscripts to indices.

proc iml;
start LocSaddlePt(M);
   dim = dimension(M);  n = dim[1];  p = dim[2];
   minRow = M[ ,>:<];                    /* for each row, find column that contains min */
   minSubscripts = T(1:n) || minRow;     /* location as (i,j) pair */
   minIdx = sub2ndx(dim, minSubscripts); /* convert location to index in [1, np] */
   maxCol = T(M[<:>, ]);                 /* for each column, find row that contains max */
   maxSubscripts = maxCol || T(1:p);     /* loation as (i,j) pair */
   maxIdx = sub2ndx(dim, maxSubscripts); /* convert location to to index in [1, np] */
   xsect = xsect(minIdx, maxIdx);        /* intersection; might be empty matrix */
   if ncol(xsect) > 0 then
      return ( xsect );  
      return ( . );
M1 = {1 2 3, 4 5 6, 7 8 9};  idx1 = LocSaddlePt(M1);
M2 = {9 1 2, 8 5 7, 3 4 6};  idx2 = LocSaddlePt(M2);
M3 = {8 1 9, 7 2 6, 3 4 5};  idx3 = LocSaddlePt(M3);
print idx1 idx2 idx3;
Find the location of the saddle point of a matrix

The indices of the saddle points are shown for the three example matrices. The 7th element of the M1 matrix is a saddle point, which corresponds to the (3, 1) cell for a 3 x 3 matrix in row-major order. The 5th element (cell (2,2)) of the M2 matrix is a saddle point. The M3 matrix does not contain a saddle point.

The probability of a saddle point in a 3x3 random matrix

A matrix contains either zero or one saddle point. There is a theorem (Ord, 1979) that says that the probability of finding a saddle point in a random n x p matrix is (n! p!) / (n+p-1)! when the elements of the matrix are randomly chosen from any continuous probability distribution.

Although the theorem holds for any continuous probability distribution, it suffices to prove it for the uniform distribution. Notice that the existence and location of a saddle point is invariant under any monotonic transformation because those transformations do not change the relative order of elements. In particular, if F is any probability distribution, the existence and location of the saddle point is invariant under the monotonic transformation F–1, which transforms the numbers into a sample from the uniform distribution. In fact, you can replace the number by their ranks, which converts the random matrix into a matrix of that contains the consecutive integers 1, 2, ..., np. Thus the previous integer matrices represent the ranks of a random matrix where the elements are drawn from any continuous distribution.

For small values of n and p, you can use the ALLPERM function to enumerate all possible matrices that have integers in the range [1, np]. You can reshape the permutation into an n x p matrix and compute whether the matrix contains a saddle point. If you use a 0/1 indicator array to encode the result, then the exact probability of finding a saddle point in a random n x p matrix is found by computing the mean of the indicator array. The following example computes the exact probability of finding a saddle point in a random 3 x 3 matrix:

A = allperm(9);            /* all permutations of 1:9 */
saddle = j(nrow(A), 1);    /* allocate vector for results */
do i = 1 to nrow(A);       /* for each permutation (row) ... */
   M = shape(A[i,], 3);    /*    reshape row into 3x3 matrix */
   idx = LocSaddlePt( M ); /*    find location of saddle */
   saddle[i] = (idx > 0);  /*    0/1 indicator variable */
prob = saddle[:];          /* (sum of 1s) / (number of matrices) */
print prob;
The probability of a saddle point in a 3 x 3 matrix

As you can see, the computational enumeration shows that the probability of a saddle point is 0.3 for 3 x 3 matrices. This agrees with the theoretical formula (3! 3!) / 5! = 36 / 120 = 3/10.

The probability of a saddle point in a random matrix

Of course, when n and p are larger it becomes infeasible to completely enumerate all n x p matrices. However, you can adopt a Monte Carlo approach: generate a large number of random matrices and compute the proportion which contains a saddle point. The following program simulates one million random 4 x 4 matrices from the uniform distribution and computes the proportion that has a saddle point. This is a Monte Carlo estimate of the true probability.

NSim = 1e6;
n = 4; p=4;                  /* random 4x4 matrices from uniform distribution */
A = j(NSim, n*p);
call randgen(A, "Uniform");
saddleLoc = j(NSim, 1, .);   /* store location of the saddle point */
saddle = j(NSim, 1);         /* binary indicator variable */
do i = 1 to NSim;            /* Monte Carlo estimation of probability */
   M = shape(A[i,], n, p);
   saddleLoc[i] = LocSaddlePt( M );  /* save the location 1-16 */
   saddle[i] = (saddleLoc[i] > 0 );  /* 0/1 binary variable */
estProb = saddle[:];
Prob = fact(n)*fact(p) / fact(n+p-1);
print estProb Prob;
The probability of a saddle point of a 4 x 4 matrix

The output shows that the estimated probability agrees with the exact probability to four decimal places.

The distribution of locations of saddle points

The saddle point can occur in any cell in the matrix. Consider, for example, the integer matrix M1 from the earlier 3 x 3 example. If you swap the first and third rows of M1, you will obtain a new matrix that has a saddle point in the (1,1) cell. If you swap the second and third rows of M1, you obtain a saddle point in the (2,1) cell. Similarly, you can swap columns to move a saddle point to a different column. In general, you can move the saddle point to any cell in the matrix by a transposition of rows and columns. Thus each cell in the matrix has an equal probability of containing a saddle point, and the probability that the saddle point is in cell (i,j) is 1/(np) times to the probability that a saddle point exists.

In the preceding SAS/IML simulation, the saddleLoc variable contains the location of the saddle point in the 4 x 4 matrix. The following statements compute the empirical proportion of times that a saddle point appeared in each of the 16 cells of the matrix:

call tabulate(location, freq, saddleLoc);  /* frequencies in cells 1-16 */
counts = shape(freq, n, p);                /* shape into matrix     */
EstProbLoc = counts / (NSim);              /* empirical proportions */
print EstProbLoc[r=('1':'4') c=('1':'4') F=7.5];
The probabilities that a saddle point of a matrix is in the (i,j)th position

The output indicates that each cell has a probability that is estimated by 0.007 (the "James Bond" probability!). Theoretically, you can prove that the probability in each cell is given by the value of the complete beta function B(n, p), which for n=p=4 yields B(4,4) = 0.0071429 (Hofri, 2006).

In summary, this article has explored three topics that are dear to my heart: probability, simulation, and properties of matrices. Along the way, we encountered factorials, permutations, the complete beta function, and got to write some fun SAS/IML programs. For more information about the probability of a saddle point in random matrices, see "The probability that a matrix has a saddle point" (Thorp, 1979) and and "On the distribution of a saddle point value in a random matrix" (Hofri, 2006).

The post The probability of a saddle point in a matrix appeared first on The DO Loop.

12月 112017
Self-similar Christmas tree created in SAS

Happy holidays to all my readers! My greeting-card to you is an image of a self-similar Christmas tree. The image (click to enlarge) was created in SAS by using two features that I blog about regularly: matrix computations and ODS statistical graphics.

Self-similarity in Kronecker products

I have previously shown that the repeated Kronecker product of a binary matrix leads to self-similar structures.

Specifically, if M is a binary matrix then the Kronecker product M@M is a block matrix in which each 1 in the original matrix is replaced with a copy of M and each 0 is replaced by a zero block. (Here '@' is the Kronecker product (or direct product) operator, which is implemented in SAS/IML software.) If you repeat this process three or more times and create a heat map of the product matrix, the self-similar structure is apparent.

For example, the following 5 x 7 binary matrix has a 1 in positions that approximate the shape of a Christmas tree. The direct product M@M@M is a matrix that has 53 = 125 rows and 73 = 343 columns. If you use the HEATMAPDISC subroutine in SAS/IML to create a heat map of the direct product, you obtain the following image in which a green cell represents a 1:

proc iml;
M = {0 0 0 1 0 0 0,
     0 0 1 1 1 0 0,
     0 1 1 1 1 1 0,
     1 1 1 1 1 1 1,
     0 0 0 1 0 0 0};
Tree = M @ M @ M;     /* M is 5 x 7, so Tree is 125 x 343 */
ods graphics / width=300px height=450px  ANTIALIASMAX=50000;
call heatmapdisc(Tree) title="Happy Holidays to All my Readers!"
     colorramp={White Green} displayoutlines=0 ShowLegend=0;

That's not a bad result for three SAS/IML statements! You can see that the "tree" has a self-similar structure: the large tree is composed of 17 smaller trees, each of which is composed of 17 mini trees.

Adding a star

My readers are worth extra effort, so I want to put a star on the top of the tree. One way to do this is to use a scatter plot and plot a special observation by using a star symbol. To plot the "tree" by using a scatter plot, it is necessary to "unpack" the 125 x 343 matrix into a list of (x, y) values (this is a wide-to-long conversion of the matrix). You can use the NDX2SUB function to extract the row and column of each 1 in the matrix, as follows:

/* extract (row, column) pairs for matrix elements that are '1' */
idx = loc(Tree = 1);                /* indices for the 1s */
s = ndx2sub(dimension(Tree), idx);  /* convert indices to subscripts */
create Tree from s[c={"RNames" "CNames"}];
append from s;
call symputx("XPos", range(s[,2])/2); /* midrange = horiz position of star */

The previous statements create a SAS data set called TREE that contains the (x, y) positions of the 1s in the direct product matrix. The statements also saved the midrange value to a macro variable. The following SAS procedures create the location of the star, concatenate the two data sets, and create a scatter plot of the result, which is shown at the top of this article.

data star;
x = &XPos; y = 0; 
data Tree2;
set Tree star;
ods graphics / width=600px height=800px  ANTIALIASMAX=50000;
title "Happy Holidays";
proc sgplot data=Tree2 noborder noautolegend;
scatter x=CNames y=RNames / markerattrs=(color=ForestGreen symbol=SquareFilled size=3);
scatter x=x y=y / markerattrs=(symbol=StarFilled size=20)             /* yellow star */
                  filledoutlinedmarkers markerfillattrs=(color=yellow);
yaxis reverse display=none;
xaxis display=none;

Previous SAS-created Christmas cards

All year long I blog about how to use SAS for serious pursuits like statistical modeling, data analysis, optimization, and simulation. Consequently, I enjoy my occasional forays into a fun and frivolous topic such as how to create a greeting card with SAS. If you like seeing geeky SAS-generated images, here is a list of my efforts from previous years:

Wherever you are and whatever holiday you celebrate, may you enjoy peace and happiness this season and in the New Year.

The post A self-similar Christmas tree appeared first on The DO Loop.

8月 302017

Aa previous article discussed the mathematical properties of the singular value decomposition (SVD) and showed how to use the SVD subroutine in SAS/IML software. This article uses the SVD to construct a low-rank approximation to an image. Applications include image compression and denoising an image.

Construct a grayscale image

The value of each pixel in a grayscale image can be stored in a matrix where each element of the matrix is a value between 0 (off) and 1 (full intensity). I want to create a small example that is easy to view, so I'll create a small matrix that contains information for a low-resolution image of the capital letters "SVD." Each letter will be five pixels high and three pixels wide, arranges in a 7 x 13 array of 0/1 values. The following SAS/IML program creates the array and displays a heat map of the data:

ods graphics / width=390px height=210px;
proc iml;
SVD = {0 0 0 0 0 0 0 0 0 0 0 0 0,
       0 1 1 1 0 1 0 1 0 1 1 0 0,
       0 1 0 0 0 1 0 1 0 1 0 1 0,
       0 1 1 1 0 1 0 1 0 1 0 1 0,
       0 0 0 1 0 1 0 1 0 1 0 1 0,
       0 1 1 1 0 0 1 0 0 1 1 0 0,
       0 0 0 0 0 0 0 0 0 0 0 0 0 };
A = SVD;
/* ColorRamp:  White    Gray1    Gray2    Gray3    Black   */
ramp =        {CXFFFFFF CXF0F0F0 CXBDBDBD CX636363 CX000000};
call heatmapcont(A) colorramp=ramp showlegend=0 title="Original Image" range={0,1};
Rank-5 data matrix for singular value decomposition (SVD)

A low-rank approximation to an image

Because the data matrix contains only five non-zero rows, the rank of the A matrix cannot be more than 5. The following statements compute the SVD of the data matrix and create a plot of the singular values. The plot of the singular values is similar to the scree plot in principal component analysis, and you can use the plot to help choose the number of components to retain when approximating the data.

call svd(U, D, V, A);  /* A = U*diag(D)*V` */
title "Singular Values";
call series(1:nrow(D), D) grid={x y} xvalues=1:nrow(D) label={"Component" "Singular Value"};
Singular values of rank-5 data matrix

For this example, it looks like retaining three or five components would be good choices for approximating the data. To see how low-rank approximations work, let's generate and view the rank-1, rank-2, and rank-3 approximations:

keep = 1:3;          /* use keep=1:5 to see all approximations */
do i = 1 to ncol(keep);
   idx = 1:keep[i];                           /* components to keep */
   Ak = U[ ,idx] * diag(D[idx]) * V[ ,idx]`;  /* rank k approximation */
   Ak = (Ak - min(Ak)) / range(Ak);           /* for plotting, stdize into [0,1] */
   s = "Rank = " + char(keep[i],1);
   call heatmapcont(Ak) colorramp=ramp showlegend=0 title=s range={0, 1};
Low-rank approximation: Rank-1 approximation via SVD of a data matrix

The rank-1 approximation does a good job of determining the columns and do not contain that contain the letters. The approximation also picks out two rows that contain the horizontal strokes of the capital "S."

Low-rank approximation: Rank-2 approximation via SVD of a data matrix

The rank-2 approximation refines the image and adds additional details. You can begin to make out the letters "SVD." In particular, all three horizontal strokes for the "S" are visible and you can begin to see the hole in the capital "D."

Low-rank approximation: Rank-3 approximation via SVD of a data matrix

The rank-3 approximation contains enough details that someone unfamiliar with the message can read it. The "S" is reconstructed almost perfectly and the space inside the "V" and "D" is almost empty. Even though this data is five-dimensional, this three-dimensional approximation is very good.

Not only is a low-rank approximation easier to work with than the original five-dimensional data, but a low-rank approximation represents a compression of the data. The original image contains 7 x 13 = 91 values. For the rank-3 approximation, three columns of the U matrix contain 33 numbers and three columns of VT contain 15 numbers. So the total number of values required to represent the rank-3 approximation is only 48, which is almost half the number of values as for the original image.

Denoising an image

You can use the singular value decomposition and low-rank approximations to try to eliminate random noise that has corrupted an image. Every TV detective series has shown an episode in which the police obtain a blurry image of a suspect's face or license plate. The detective asks the computer technician if she can enhance the image. With the push of a button, the blurry image is replaced by a crystal clear image that enables the police to identify and apprehend the criminal.

The image reconstruction algorithms used in modern law enforcement are more sophisticated than the SVD. Nevertheless, the SVD can do a reasonable job of removing small random noise from an image, thereby making certain features easier to see. The SVD has to have enough data to work with, so the following statements duplicate the "SVD" image four times before adding random Gaussian noise to the data. The noise has a standard deviation equal to 25% of the range of the data. The noisy data is displayed as a heat map on the range [-0.25, 1.25] by using a color ramp that includes yellow for negative values and blue for values greater than 1.

call randseed(12345);
A = (SVD // SVD) || (SVD // SVD);                     /* duplicate the image */
A = A + randfun( dimension(A), "Normal", 0, 0.25);    /* add Gaussian noise */
/*       Yellow   White    Gray1    Gray2    Gray3    Black    Blue    */
ramp2 = {CXFFEDA0 CXFFFFFF CXF0F0F0 CXBDBDBD CX636363 CX000000 CX3182BD};
call heatmapcont(A) colorramp=ramp2 showlegend=0 title="Noisy Image" range={-0.5, 1.5};
Image corrupted by Gaussian noise

I think most police officers would be able to read this message in spite of the added noise, but let's see if the SVD can clean it up. The following statements compute the SVD and create a plot of the singular values:

call svd(U, D, V, A);  /* A = U*diag(D)*V` */
call series(1:nrow(D), D) grid={x y} xvalues=1:nrow(D) label={"Component" "Singular Value"};
Plot of singular values for a small image that is corrupted by Gaussian noise

There are 14 non-zero singular values. In theory, the main signal is contained in the components that correspond to the largest singular values whereas the noise is captured by the components for the smallest singular values. For these data, the plot of the singular values suggests that three or five (or nine) components might capture the main signal while ignoring the noise. The following statements create and display the rank-3 and rank-5 approximations. Only the rank-5 approximation is shown.

keep = {3 5};        /* which components to examine? */
do i = 1 to ncol(keep);
   idx = 1:keep[i];
   Ak = U[ ,idx] * diag(D[idx]) * V[ ,idx]`;
   Ak = (Ak - min(Ak)) / range(Ak); /* for plotting, standardize into [0,1] */
   s = "Rank = " + char(keep[i],2);
   call heatmapcont(Ak) colorramp=ramp showlegend=0 title=s range={0, 1};
Low-rank approximation: Rank-5 reconstruction via SVD of a small image that is corrupted by Gaussian noise

The denoised low-rank image is not as clear as Hollywood depicts, but it is quite readable. It is certainly good enough to identify a potential suspect. I can hear the detective mutter, "Book 'em, Danno! Murder One."

Summary and further reading

In summary, the singular value decomposition (SVD) enables you to approximate a data matrix by using a low-rank approximation. This article uses a small example for which the full data matrix is rank-5. A plot of the singular values can help you choose the number of components to retain. For this example, a rank-3 approximation represents the essential features of the data matrix.

For similar analyses and examples that use the singular value decomposition, see

In SAS software, the SVD is heavily used in the SAS Text Miner product. For an overview of how a company can use text mining to analyze customer support issues, see Sanders and DeVault (2004) "Using SAS at SAS: The Mining of SAS Technical Support."

The post The singular value decomposition and low-rank approximations appeared first on The DO Loop.

7月 312017

A SAS user needed to convert a program from MATLAB into the SAS/IML matrix language and asked whether these is a SAS/IML equivalent to the fliplr and flipud functions in MATLAB. These functions flip the columns or rows (respectively) of a matrix; "LR" stands for "left-right" and "UD" stands for "up-down."

No, SAS/IML does not have equivalent built-in functions, but as Devo once sang: "When a problem comes along, you must whip it." This article shows how to implement these functions in SAS/IML. Or, to paraphrase Devo's maxim: "When a matrix comes along, you must flip it. It's not too late to flip it. Flip it good!"

The FLIPLR and FLIPUD functions in SAS/IML

To key to reversing the columns or rows of a matrix is to use subscript notation. Recall that in the SAS/IML language, the expression A[ ,1] extracts the first column of A. (When the row subscript is not specified, it means "use all rows.") Similarly, A[ ,1:3] extracts the first three columns of A (in order), whereas A[ ,3:1] extracts the first three columns of A in reverse order. A similar notation is valid for rows. The expression A[3:1, ] extracts the first three rows of A in reverse order. Therefore, the following SAS/IML functions implement the functionality of the fliplr and flipud functions for matrices:

proc iml;
/* reverse columns of a matrix */
start fliplr(A);
   return A[ , ncol(A):1];
/* reverse rows of a matrix */
start flipud(A);
   return A[nrow(A):1, ];
/* test the functions on a 2x3 matrix */
x = {1 2 3, 4 5 6};
lr = fliplr(x);
ud = flipud(x);
print x, lr, ud;

Reversing all elements

You can also reverse all elements of a matrix. Recall that a SAS/IML matrix is stored in row-major order, which means that the elements are enumerated by counting across the first row, then across the second row, and so forth. Thus if A is an n x m matrix, then A[n*m:1] returns the elements of A in reverse order. However, the elements are returned in a column vector. If you want the elements in a matrix that has the same dimensions as A, you can "shape it up" by using the SHAPE function, as follows:

/* reverse elements of a matrix */
start reverse(A);
   n = nrow(A); m = ncol(A);
   return shape(A[n*m:1], n, m);     /* Now flip it into SHAPE */
rev = reverse(x);
print rev;

The MATLAB language also supports the rot90 function, which rotates a matrix. I have previously shown how to rotate the elements of a matrix in the SAS/IML language.

In conclusion, although SAS/IML does not contain built-in functions for reversing the rows and columns of a matrix, these functions are easily defined in terms of matrix subscript operations. In short, when given a matrix, it is not hard to flip it, flip it good!

The post Flip it. Flip it good. appeared first on The DO Loop.

7月 242017

For a time series { y1, y2, ..., yN }, the difference operator computes the difference between two observations. The kth-order difference is the series { yk+1 - y1, ..., yN - yN-k }. In SAS, The DIF function in the SAS/IML language takes a column vector of values and returns a vector of differences.

For example, the following SAS/IML statements define a column vector that has five observations and calls the DIF function to compute the first-order differences between adjacent observations. By convention, the DIF function returns a vector that is the same size as the input vector and inserts a missing value in the first element.

proc iml;
x = {0, 0.1, 0.3, 0.7, 1};
dif = dif(x);            /* by default DIF(x, 1) ==> first-order differences */
print x dif;
First-order difference of time series

The difference operator is a linear operator that can be represented by a matrix. The first nonmissing value of the difference is x[2]-x[1], followed by x[3]-x[2], and so forth. Thus the linear operator can be represented by the matrix that has -1 on the main diagonal and +1 on the super-diagonal (above the diagonal). An efficient way to construct the difference operator is to start with the zero matrix and insert ±1 on the diagonal and super-diagonal elements. You can use the DO function to construct the indices for the diagonal and super-diagonal elements in a matrix:

start DifOp(dim);
   D = j(dim-1, dim, 0);        /* allocate zero martrix */
   n = nrow(D); m = ncol(D);
   diagIdx = do(1,n*m, m+1);    /* index diagonal elements */
   superIdx  = do(2,n*m, m+1);  /* index superdiagonal elements */
   *subIdx  = do(m+1,n*m, m+1); /* index subdiagonal elements (optional) */
   D[diagIdx] = -1;             /* assign -1 to diagonal elements */
   D[superIdx] = 1;             /* assign +1 to super-diagonal elements */
   return D;
B = DifOp(nrow(x));
d = B*x;
print B, d[L="Difference"];
Difference operator and first-order difference of a time series

You can see that the DifOp function constructs an (n-1) x n matrix, which is the correct dimension for transforming an n-dimensional vector into an (n-1)-dimensional vector. Notice that the matrix multiplication omits the element that previously held a missing value.

You probably would not use a matrix multiplication in place of the DIF function if you needed the first-order difference for a single time series. However, the matrix formulation makes it possible to use one matrix multiplication to find the difference for many time series.

The following matrix contains three time-series, one in each column. The B matrix computes the first-order difference for all columns by using a single matrix-matrix multiplication. The same SAS/IML code is valid whether the X matrix has three columns or three million columns.

/* The matrix can operate on a matrix where each column is a time series */
x = {0    0    0,
     0.1  0.2  0.3,
     0.3  0.8  0.5,
     0.7  0.9  0.8,
     1    1    1   };
B = DifOp(nrow(x));
d = B*x;                        /* apply the difference operator */
print d[L="Difference of Columns"];
First-order differences for multiple time series

Other operators in time series analysis can also be represented by matrices. For example, the first-order lag operator is represented by a matrix that has +1 on the super-diagonal. Moving average operators also have matrix representations.

The matrix formulation is efficient for short time series but is not efficient for a time series that contains thousands of elements. If the time series contains n elements, then the dense-matrix representation of the difference operator contains about n2 elements, which consumes a lot of RAM when n is large. However, as we have seen, the matrix representation of an operator is advantageous when you want to operate on a large number of short time series, as might arise in a simulation.

The post Difference operators as matrices appeared first on The DO Loop.

11月 072016

Rotation matrices are used in computer graphics and in statistical analyses. A rotation matrix is especially easy to implement in a matrix language such as the SAS Interactive Matrix Language (SAS/IML). This article shows how to implement three-dimensional rotation matrices and use them to rotate a 3-D point cloud.

Define a 3-D rotation matrix

In three dimensions there are three canonical rotation matrices:

  • The matrix Rx(α) rotates points counterclockwise by the angle α about the X axis. Equivalently, the rotation occurs in the (y, z) plane.
  • The matrix Ry(α) rotates points counterclockwise by the angle α in the (x, z) plane.
  • The matrix Rz(α) rotates points counterclockwise by the angle α in the (x, y) plane.

Each of the following SAS/IML functions return a rotation matrix. The RotPlane function takes an angle and a pair of integers. It returns the rotation matrix that corresponds to a counterclockwise rotation in the (xi, xj) plane. The Rot3D function has a simpler calling syntax. You specify and axis (X, Y, or Z) to get a rotation matrix in the plane that is perpendicular to the specified axis:

proc iml;
/* Rotate a vector by a counterclockwise angle in a coordinate plane. 
        [ 1    0       0   ]
   Rx = [ 0  cos(a) -sin(a)]        ==> Rotate in the (y,z)-plane
        [ 0  sin(a)  cos(a)]
        [ cos(a)  0   -sin(a)]
   Ry = [   0     1      0   ]      ==> Rotate in the (x,z)-plane
        [ sin(a)  0    cos(a)]
        [ cos(a) -sin(a) 0]
   Rz = [ sin(a)  cos(a) 0]         ==> Rotate in the (x,y)-plane
        [   0       0    1]
start RotPlane(a, i, j);
   R = I(3);  
   c = cos(a); s = sin(a);
   R[i,i] = c;  R[i,j] = -s;
   R[j,i] = s;  R[j,j] =  c;
   return R;
start Rot3D(a, axis);   /* rotation in plane perpendicular to axis */
   if upcase(axis)="X" then       
      return RotPlane(a, 2, 3);
   else if upcase(axis)="Y" then
      return RotPlane(a, 1, 3);
   else if upcase(axis)="Z" then
      return RotPlane(a, 1, 2);
   else return I(3);
store module=(RotPlane Rot3D);

NOTE: Some sources define rotation matrices by leaving the object still and rotating a camera (or observer). This is mathematically equivalent to rotating the object in the opposite direction, so if you prefer a camera-based rotation matrix, use the definitions above but specify the angle -α. Note also that some authors change the sign for the Ry matrix; the sign depends whether you are rotating about the positive or negative Y axis.

3-D rotation matrices are simple to construct and use in a matrix language #SASTip
Click To Tweet

Applying rotations to data

Every rotation is a composition of rotations in coordinate planes. You can compute a composition by using matrix multiplication. Let's see how rotations work by defining and rotating some 3-D data. The following SAS DATA step defines a point at the origin and 10 points along a unit vector in each coordinate direction:

data MyData;               /* define points on coordinate axes */
x = 0; y = 0; z = 0; Axis="O"; output;    /* origin */
Axis = "X";
do x = 0.1 to 1 by 0.1;    /* points along unit vector in x direction */
x = 0; Axis = "Y";
do y = 0.1 to 1 by 0.1;    /* points along unit vector in y direction */
y = 0; Axis = "Z";
do z = 0.1 to 1 by 0.1;    /* points along unit vector in z direction */
proc sgscatter data=Mydata;
matrix X Y Z;

If you use PROC SGSCATTER to visualize the data, the results (not shown) are not very enlightening. Because the data are aligned with the coordinate directions, the projection of the 3-D data onto the coordinate planes always projects 10 points onto the origin. The projected data does not look very three-dimensional.

However, you can slightly rotate the data to obtain nondegenerate projections onto the coordinate planes. The following computations form a matrix P which represents a rotation of the data by -π/6 in one coordinate plane followed by a rotation by -π/3 in another coordinate plane:

proc iml;
/* choose any 3D projection matrix as product of rotations */
load module=Rot3D;
pi = constant('pi');
Rz = Rot3D(-pi/6, "Z");    /* rotation matrix for (x,y) plane */
Rx = Rot3D(-pi/3, "X");    /* rotation matrix for (y,z) plane */ 
Ry = Rot3D(    0, "Y");    /* rotation matrix for (x,z) plane */
P = Rx*Ry*Rz;              /* cumulative rotation */
print P;
A rotation matrix is a product of canonical rotations

For a column vector, v, the rotated vector is P*v. However, the data in the SAS data set is in row vectors, so use the transposed matrix to rotate all observations with a single multiplication, as follows:

use MyData;
read all var {x y z} into M;
read all var "Axis";
RDat = M * P`;                    /* rotated data */

Yes, that's it. That one line rotates the entire set of 3-D data. You can confirm the rotation by plotting the projection of the data onto the first two coordinates:

title "Rotation and Projection of Data";
Px = RDat[,1]; Py = RDat[,2]; 
call scatter(Px, Py) group=Axis 
   option="markerattrs=(size=12 symbol=CircleFilled)";
Scatter plot of rotated and projected 3-D data

Alternatively, you can write the rotated data to a SAS data set. You can add reference axes to the plot if you write the columns of the P` matrix to the same SAS data set. The columns are the rotated unit vectors in the coordinate directions, so plotting those coordinates by using the VECTOR statement adds reference axes:

create RotData from RDat[colname={"Px" "Py" "Pz"}];
append from RDat;
A = P`;          /* rotation of X, Y, and Z unit vectors */
create Axes from A[colname={"Ax" "Ay" "Az"}];  append from A; close;
Labels = "X":"Z";
create AxisLabels var "Labels";  append; close;
data RotData;    /* merge all data sets */
merge MyData Rot Axes AxisLabels;
proc sgplot data=RotData;
   vector x=Ax y=Ay / lineattrs=(thickness=3) datalabel=Labels;
   scatter x=Px y=Py / group=Axis markerattrs=(size=12 symbol=CircleFilled);
   xaxis offsetmax=0.1; yaxis offsetmax=0.1; 
Scatter plot of rotated and projected 3-D data

All the data points are visible in this projection of the (rotated) data onto a plane. The use of the VECTOR statement to add coordinate axes is not necessary, but I think it's a nice touch.

Visualizing clouds of 3-D data

This article is about rotation matrices, and I showed how to use matrices to rotate a 3-D cloud of observations. However, I don't want to give the impression that you have to use matrix operations to plot 3-D data! SAS has several "automatic" 3-D visualization methods that more convenient and do not require that you program rotation matrices. The visualization methods include

I also want to mention that Sanjay Matange created a 3-D scatter plot macro that uses ODS graphics to visualize a 3-D point cloud. Sanjay also uses rotation matrices, but because he uses the DATA step and PROC FCMP, his implementation is longer and less intuitive than the equivalent operations in the SAS/IML matrix language. In his blog he says that his macro "is provided for illustration purposes."

In summary, the SAS/IML language makes it convenient to define and use rotation matrices. An application is rotating a 3-D cloud of points. In my next blog post I will present a more interesting visualization example.

tags: Matrix Computations, Statistical Graphics

The post Rotation matrices and 3-D data appeared first on The DO Loop.

10月 312016

Every year near Halloween I write an article in which I demonstrate a simple programming trick that is a real treat to use. This year's trick (which features the CMISS function and the crossproducts matrix in SAS/IML) enables you to count the number of observations that are missing for pairs of columns. In other words, this trick counts how many missing values are in columns 1 and 2, columns 1 and 3, and so on for all "k choose 2" pairwise combinations of k columns.

This trick generalizes, so keep reading even if you don't care about counting missing values. Given ANY indicator matrix, you can use this trick to count the pairwise occurrence of events.

Counting pairwise events: A connection to matrix multiplication #Statistics
Click To Tweet

Counting missing values

Of course, you can count combinations of missing values without using SAS/IML or matrices. I've previously shown how to use PROC MI to count all combinations of missing values among a set of k variables. However, as I said, this SAS/IML trick enables you to count more than just missing values: it generalizes to count ANY pairwise event.

To get started, read data into a SAS/IML matrix. The following statements read 5209 observations and eight variables into the matrix X:

proc iml;
varNames = {"AgeAtStart" "Diastolic" "Systolic" 
            "Height" "Weight" "MRW" "Smoking" "Cholesterol"};
use Sashelp.Heart;               /* open data set */
read all var varNames into X;    /* create numeric data matrix, X */
close Sashelp.Heart;

The first part of the trick is to convert the matrix into a 0/1 indicator matrix, where 1 indicates that a data value is missing and 0 indicates nonmissing. I like to use the CMISS function in Base SAS for this task.

The second part of the trick is to recognize that if C is an indicator matrix, the crossproducts matrix P=C`C is a matrix that gives the counts of the pairwise events in C. The element P[i,j] is the inner product of the i_th and j_th columns of C, and because C is an indicator matrix the inner product counts the number of simultaneous "events" for column i and column j.

In the SAS/IML language, the code to compute pairwise counts of missing values requires only two lines:

C = cmiss(X);                    /* create 0/1 indicator matrix */
P = C`*C;                        /* pairwise counts */
print P[c=varNames r=varNames label="Pairwise Missing Counts"];

The P matrix is symmetric. The diagonal elements P[i,i] show the number of missing values for the i_th variable. For example, the Height, Weight, and MRW variables have 6 missing values whereas the Cholesterol variable has 152 missing values. The off-diagonal elements P[i,j] show the number of observations that are simultaneously missing for the i_th and j_th variables. For example, 28 observations have missing values for both the Smoking and Cholesterol variables; two observations have missing values for both the Height and Weight variables.

Counting any pair of events

If C is any indicator matrix, the crossproducts matrix P=C`C counts the number of observations for which two columns of C are simultaneously 1.

For example, the following statements standardize the data to create z-scores. You can form an indicator matrix that has the value 1 if a z-score is exceeds 3 in absolute value; these are outliers for normally distributed data. The crossproducts matrix counts the number of observations that are univariate outliers (on the diagonal) and the number that are outliers for a pair of variables:

Z = (X - mean(X)) / std(X);      /* standardize data */
L = (abs(Z) > 3);                /* L indicates outliers */
Q = L`*L;                        /* counts for pairs of outliers */
print Q[c=varNames r=varNames label="Pairwise Outliers"];

The table indicates that 52 patients have a diastolic blood pressure (BP) that is more than three standard deviations above the mean. Similarly, 83 patients are outliers for systolic BP, and 35 patients are outliers for both measurements.

The following graph visualizes the patients that have high blood pressure in each category:

Outlier = j(nrow(L), 1, "No         ");       /* create grouping variable */
Outlier[ loc(L[,2] | L[,3]) ] = "Univariate";
Outlier[ loc(L[,2] & L[,3]) ] = "Bivariate";
title "Standardized Blood Pressure";
title2 "Outliers More than Three Standard Deviations from Mean";
call scatter(Z[,2], Z[,3]) group=Outlier 
         label={"Standardized Diastolic" "Standardized Systolic"}
         option="transparency=0.5 markerattrs=(symbol=CircleFilled)"
         other="refline 3/axis=x; refline 3/axis=y;";

In conclusion, you can create an indicator matrix for any set of conditions. By forming the crossproducts matrix, you get the counts of all observations that satisfy each condition individually and in pairs. I think this trick is a real treat!

tags: Data Analysis, Matrix Computations, Tips and Techniques

The post Counting observations for which two events occur appeared first on The DO Loop.