Matrix Computations

10月 222018

A SAS programmer asked how to rearrange elements of a matrix. The rearrangement he wanted was rather complicated: certain blocks of data needed to move relative to other blocks, but the values within each block were to remain unchanged. It turned out that the mathematical operation he needed is called a block transpose. The BTRAN function in SAS/IML performs block-transpose operations, so the complicated rearrangement was easy to implement with a judicious call to the BTRAN function.

This article discusses the block-transpose operation and gives an example. It also shows how a block transpose can conveniently transform wide data into long data and vice versa.

The block-transpose operation

In general, a block matrix can contain blocks of various sizes. However, the BTRAN function requires that all blocks be the same size. Specifically, suppose A is a block matrix where each block is an n x m matrix. That means that A is an (r n) x (c m) matrix for some whole numbers r and c. The BTRAN function transposes the blocks to create a (c n) x (r m) block matrix.

The idea is more easily explained by a picture. The following image shows a matrix A that is composed of six blocks (of the same size). The block-transpose operation rearranges the blocks but leaves the elements within the blocks unchanged.

A block-transpose operation transposes the blocks but leaves the contents of each block unchanged

The BTRAN function in SAS/IML

Let's see how the block transpose works on an example in the SAS/IML language. The following statements create a 4 x 12 matrix. I have overlaid some grid lines to help you visualize this matrix as a 2 x 3 block matrix, where each block is 2 x 4.

proc iml;
x = repeat(1:12,2);
x = x // (10+x);
print x;
A 4x12 matrix visualized as a 2x3 block matrix. Each block is 2x4.

You can use the BTRAN function to apply a block transpose. The first argument is the matrix that contains the data. The second and third arguments specify the size of the blocks:

y = btran(x, 2, 4);    /* block size is 2x4 */
print y;
A 6x6 matrix visualized as a 3x2 block matrix. Each block is 2x4.

The output is a 6 x 6 matrix, visualized as a 3 x 2 block matrix.

Block transpose for wide-to-long transforms

If the number of rows in the block equals the number of rows in the data, then the BTRAN function stacks columns. In particular, if the block size is n x 1, the BTRAN function stacks multiple variables into a single column, just like the SHAPECOL function. For example, the Sashelp.Iris data contains four variables named SepalLength, SepalWidth, PetalLength, and PetalWidth. The following SAS/IML statements stack the four variables into a single column and create a new column that identifies the name of the original variable:

proc iml;
/* wide to long */
varNames = {"SepalLength" "SepalWidth" "PetalLength" "PetalWidth"};
use Sashelp.Iris;
   read all var "Species";        /* 150 x 3 */
   read all var varNames into X;  /* 150 x 4 */
n = nrow(X); m = ncol(X);         /* n = 150; m = 4 */
ID = repeat(Species, m);          /* repeat the vars that are not transposed */
Var = shapecol(repeat(varNames, n), n*m); /* create column that contains variable names */
Value = btran(X, n, 1);           /* vector with 150*4 rows */ 
create test var {ID "Var" "Value"}; append; close;

Block transpose for wide-to-long transforms

In a similar way, you can transpose balanced data from long form to wide form. (Recall that "balanced data" means that each group has the same number of observations.) For example, the Sashelp.Iris data contains a Species variable. The first 50 observations have the value 'Setosa', the next 50 have the value 'Versicolor', and the last 50 have the value 'Virginica'. You can use a WHERE clause or a BY statement to analyze each species separately, but suppose you want to create a new data set that contains only 50 observations and has variables named SepalLengthSetosa, SepalLengthVersicolor, SepalLengthVirginica, and so forth for the other variables. The BTRAN function makes this easy: just specify a block size of 50 x 4, as follows.

/* create pairwise combinations of var names and group levels */
BlockRows = 50;
w = btran(X, BlockRows, m);                  /* 50 x 12 matrix */
s = Species[ do(1, n, BlockRows) ];          /* s = {'Setosa' 'Versicolor', 'Virginica'} */
vNames = rowcatc(expandgrid(varNames, s));   /* combine var names and species */
print vNames, w[c=vNames L="Measurenents (cm)"];

In summary, the BTRAN function is a useful function when you need to rearrange blocks of data without changing the values in a block. For balanced designs, you can use the BTRAN function to convert data between wide and long formats.

The post Transpose blocks to reshape data appeared first on The DO Loop.

10月 032018

It is sometimes necessary for researchers to simulate data with thousands of variables. It is easy to simulate thousands of uncorrelated variables, but more difficult to simulate thousands of correlated variables. For that, you can generate a correlation matrix that has special properties, such as a Toeplitz matrix or a first-order autoregressive (AR(1)) correlation matrix. I have previously written about how to generate a large Toeplitz matrix in SAS. This article describes three useful results for an AR(1) correlation matrix:

  • How to generate an AR(1) correlation matrix in the SAS/IML language
  • How to use a formula to compute the explicit Cholesky root of an AR(1) correlation matrix.
  • How to efficiently simulate multivariate normal variables with AR(1) correlation.

Generate an AR(1) correlation matrix in SAS

The AR(1) correlation structure is used in statistics to model observations that have correlated errors. (For example, see the documentation of PROC MIXED in SAS.) If Σ is AR(1) correlation matrix, then its elements are constant along diagonals. The (i,j)th element of an AR(1) correlation matrix has the form Σ[i,j] = ρ|i – j|, where ρ is a constant that determines the geometric rate at which correlations decay between successive time intervals. The exponent for each term is the L1 distance between the row number and the column number. As I have shown in a previous article, you can use the DISTANCE function in SAS/IML 14.3 to quickly evaluate functions that depend on the distance between two sets of points. Consequently, the following SAS/IML function computes the correlation matrix for a p x p AR(1) matrix:

proc iml;
/* return p x p matrix whose (i,j)th element is rho^|i - j| */
start AR1Corr(rho, p); 
   return rho##distance(T(1:p), T(1:p), "L1");
/* test on 10 x 10 matrix with rho = 0.8 */
rho = 0.8;  p = 10;
Sigma = AR1Corr(rho, p);
print Sigma[format=Best7.];

A formula for the Cholesky root of an AR(1) correlation matrix

Every covariance matrix has a Cholesky decomposition, which represents the matrix as the crossproduct of a triangular matrix with itself: Σ = RTR, where R is upper triangular. In SAS/IML, you can use the ROOT function to compute the Cholesky root of an arbitrary positive definite matrix. However, the AR(1) correlation matrix has an explicit formula for the Cholesky root in terms of ρ. This explicit formula does not appear to be well known by statisticians, but it is a special case of a general formula developed by V. Madar (Section 5.1, 2016), who presented a poster at a Southeast SAS Users Group (SESUG) meeting a few years ago. An explicit formula means that you can compute the Cholesky root matrix, R, in a direct and efficient manner, as follows:

/* direct computation of Cholesky root for an AR(1) matrix */
start AR1Root(rho, p);
   R = j(p,p,0);                /* allocate p x p matrix */
   R[1,] = rho##(0:p-1);        /* formula for 1st row */
   c = sqrt(1 - rho**2);        /* scaling factor: c^2 + rho^2 = 1 */
   R2 = c * R[1,];              /* formula for 2nd row */
   do j = 2 to p;               /* shift elements in 2nd row for remaining rows */
      R[j, j:p] = R2[,1:p-j+1]; 
   return R;
R = AR1Root(rho, p);   /* compare to R = root(Sigma), which requires forming Sigma */
print R[L="Cholesky Root" format=Best7.];

You can compute an AR(1) covariance matrix from the correlation matrix by multiplying the correlation matrix by a positive scalar, σ2.

Efficient simulation of multivariate normal variables with AR(1) correlation

An efficient way to simulate data from a multivariate normal population with covariance Σ is to use the Cholesky decomposition to induce correlation among a set of uncorrelated normal variates. This is the technique used by the RandNormal function in SAS/IML software. Internally, the RandNormal function calls the ROOT function, which can compute the Cholesky root of an arbitrary positive definite matrix.

When there are thousands of variables, the Cholesky decomposition might take a second or more. If you call the RandNormal function thousands of times during a simulation study, you pay that one-second penalty during each call. For the AR(1) covariance structure, you can use the explicit formula for the Cholesky root to save a considerable amount of time. You also do not need to explicitly form the p x p matrix, Σ, which saves RAM. The following SAS/IML function is an efficient way to simulate N observations from a p-dimensional multivariate normal distribution that has an AR(1) correlation structure with parameter ρ:

/* simulate multivariate normal data from a population with AR(1) correlation */
start RandNormalAR1( N, Mean, rho );
    mMean = rowvec(Mean);
    p = ncol(mMean);
    U = AR1Root(rho, p);      /* use explicit formula instead of ROOT(Sigma) */
    Z = j(N,p);
    call randgen(Z,'NORMAL');
    return (mMean + Z*U);
call randseed(12345);
p = 1000;                  /* big matrix */
mean = j(1, p, 0);         /* mean of MVN distribution */
/* simulate data from MVN distribs with different values of rho */
v = do(0.01, 0.99, 0.01);  /* choose rho from list 0.01, 0.02, ..., 0.99 */
t0 = time();               /* time it! */
do i = 1 to ncol(v);
   rho = v[i];
   X = randnormalAR1(500, mean, rho);  /* simulate 500 obs from MVN with p vars */
t_SimMVN = time() - t0;     /* total time to simulate all data */
print t_SimMVN;

The previous loop generates a sample that contains N=500 observations and p=1000 variables. Each sample is from a multivariate normal distribution that has an AR(1) correlation, but each sample is generated for a different value of ρ, where ρ = 0.01. 0.02, ..., 0.99. On my desktop computer, this simulation of 100 correlated samples takes about 4 seconds. This is about 25% of the time for the same simulation that explicitly forms the AR(1) correlation matrix and calls RandNormal.

In summary, the AR(1) correlation matrix is an easy way to generate a symmetric positive definite matrix. You can use the DISTANCE function in SAS/IML 14.3 to create such a matrix, but for some applications you might only require the Cholesky root of the matrix. The AR(1) correlation matrix has an explicit Cholesky root that you can use to speed up simulation studies such as generating samples from a multivariate normal distribution that has an AR(1) correlation.

The post Fast simulation of multivariate normal data with an AR(1) correlation structure appeared first on The DO Loop.

7月 092018

Back in high school, you probably learned to find the intersection of two lines in the plane. The intersection requires solving a system of two linear equations. There are three cases: (1) the lines intersect in a unique point, (2) the lines are parallel and do not intersect, or (3) the lines are coincident. Thus, for two lines, the intersection problem has either 1, 0, or infinitely many solutions. Most students quickly learn that the lines always intersect when their slopes are different, whereas the special cases (parallel or coincident) occur when the lines have the same slope.

Recently I had to find the intersection between two line segments in the plane. Line segments have finite extent, so segments with different slopes may or may not intersect. For example, the following panel of graphs shows three pairs of line segments in the plane. In the first panel, the segments intersect. In the second panel, the segments have the same slopes as in the first panel, but these segments do not intersect. In the third panel, the segments intersect in an interval. This article shows how to construct a linear system of equations that distinguishes between the three cases and compute an intersection point, if it exists.

Parameterization of line segments

Let p1 and p2 be the endpoints of one segment and let q1 and q2 be the endpoints of the other. Recall that a parametrization of the first segment is (1-s)*p1 + s*p2, where s ∈ [0,1] and the endpoints are treated as 2-D vectors. Similarly, a parametrization of the second segment is (1-t)*q1 + t*q2, where t ∈ [0,1]. Consequently, the segments intersect if and only if there exists values for (s,t) in the unit square such that
(1-s)*p1 + s*p2 = (1-t)*q1 + t*q2
You can rearrange the terms to rewrite the equation as
(p2-p1)*s + (q1-q2)*t = q1 - p1

This is a vector equation which can be rewritten in terms of matrices and vectors. Define the 2 x 2 matrix A whose first column contains the elements of (p2-p1) and whose second column contains the elements of (q1-q2). Define b = q1 - p1 and x = (s,t). If the solution of the linear system A*x = b is in the unit square, then the segments intersect. If the solution is not in the unit square, the segments do not intersect. If the segments have the same slope, then the matrix A is singular and you need to perform additional tests to determine whether the segments intersect.

A vector solution for the intersection of segments

As shown above, the intersection of two planar line segments is neatly expressed in terms of a matrix-vector system. In SAS, the SAS/IML language provides a natural syntax for expressing and solving matrix-vector equations. The following SAS/IML function constructs and solves a linear system. For simplicity, this version does not handle the degenerate case of two segments that have the same slope. That case is handled in the next section.

start IntersectSegsSimple(p1, p2, q1, q2);
   b = colvec(q1 - p1); 
   A = colvec(p2-p1) || colvec(q1-q2); /* nonsingular when segments have different slopes */
   x = solve(A, b);                    /* x = (s,t) */
   if all(0<=x && x<=1) then           /* if x is in [0,1] x [0,1] */
      return (1-x[1])*p1 + x[1]*p2;    /* return intersection */
   else                                /* otherwise, segments do not intersect */
      return ({. .});                  /* return missing values */
/* Test 1: intersection at (0.95, 1.25) */
p1 = {1.8 2.1};   p2 = {0.8 1.1};
q1 = {1 1.25};    q2 = {0 1.25};
z = IntersectSegsSimple(p1,p2,q1,q2);
print z;
/* Test 2: no intersection */
p1 = {-1 0.5};  p2 = {1 0.5};
q1 = {0 1};     q2 = {0 2};
v = IntersectSegsSimple(p1, p2, q1, q2);
print v;

The function contains only a few statements. The function is called to solve the examples in the first two panels of the previous graph. The SOLVE function solves the linear system (assuming that a solution exists), and the IF-THEN statement tests whether the solution is in the unit square [0,1] x [0,1]. If so, the function returns the point of intersection. If not, the function returns a pair of missing values.

The general case: Handling overlapping line segments

For many applications, the function in the previous section is sufficient because it handles the generic cases. For completeness the following module also handles segments that have identical slopes. The DET function determines whether the segments have the same slope. If so, the segments could be parallel or collinear. To determine whether collinear segments intersect, you can test for three conditions:

  • The endpoint q1 is inside the segment [p1, p2].
  • The endpoint q2 is inside the segment [p1, p2].
  • The endpoint p1 is inside the segment [q1, q2].

Notice that the condition "p2 is inside [q1,q2]" does not need to be checked separately because it is already handled by the existing checks. If any of the three conditions are true, there are infinitely many solutions (or the segments share an endpoint). If none of the conditions hold, the segments do not intersect. For overlapping segments, the following function returns an endpoint of the intersection interval.

/* handle all cases: determine intersection of two planar line segments 
   [p1, p2] and [q1, q2] */
start Intersect2DSegs(p1, p2, q1, q2);
   b = colvec(q1 - p1); 
   A = colvec(p2-p1) || colvec(q1-q2);
   if det(A)^=0 then do;         /* nonsingular system: 0 or 1 intersection */
      x = solve(A, b);           /* x = (s,t) */
      if all(0<=x && x<=1) then  /* if x is in [0,1] x [0,1] */
         return (1-x[1])*p1 + x[1]*p2;  /* return intersection */
      else                       /* segments do not intersect */
         return ({. .});         /* return missing values */
   /* segments are collinear: 0 or infinitely many intersections */
   denom = choose(p2-p1=0, ., p2-p1);  /* protect against division by 0 */
   s = (q1 - p1) / denom;        /* Is q1 in [p1, p2]? */
   if any(0<=s && s<=1) then
      return q1;
   s = (q2 - p1) / denom;        /* Is q2 in [p1, p2]? */
   if any(0<=s && s<=1) then
      return q2;
   denom = choose(q2-q1=0, ., q2-q1);  /* protect against division by 0 */
   s = (p1 - q1) / denom;        /* Is p1 in [q1, q2]? */
   if any(0<=s && s<=1) then
      return p1;
   return ({. .});               /* segments are disjoint */
/* test overlapping segments; return endpoint of one segment */
p1 = {-1 1};  p2 = {1 1};
q1 = {0 1};   q2 = {2 1};
w = Intersect2DSegs(p1, p2, q1, q2);
print w;

In summary, by using matrices, vectors, and linear algebra, you can easily solve for the intersection of two line segments or determine that the segments do not intersect. The general case needs some special logic to handle degenerate configurations, but the code that solves the generic cases is straightforward when expressed in a vectorized language such as SAS/IML.

The post The intersection of two line segments appeared first on The DO Loop.

4月 182018

The sweep operator performs elementary row operations on a system of linear equations. The sweep operator enables you to build regression models by "sweeping in" or "sweeping out" particular rows of the X`X matrix. As you do so, the estimates for the regression coefficients, the error sum of squares, and the generalized inverse of the system are updated simultaneously. This is true not only for a single response variable but also for multiple responses.

You might remember elementary row operations from when you learned Gaussian elimination, which is a general technique for solving linear systems of equations. Sweeping a matrix is similar to Gaussian elimination, but the sweep operator is most often used to solve least squares problems for the normal equations X`X = X`Y. Therefore, in practice, the sweep operator is usually applied to the rows of a symmetric positive definite system.

How to use the sweep operator to obtain least squares estimates

Before we discuss details of the sweep operator, let's show how it produces regression estimates. The following program uses the SWEEP function in SAS/IML on a six-observation data set from J. Goodnight (1979, p. 153). The first three columns of the matrix M contains the design matrix for the explanatory variables; the first column of M represents the intercept parameter. The last column of the augmented matrix is the response variable. The sweep operator is called on the uncorrected sum of squares and crossproducts matrix (the USSCP matrix), which is computed as M`*M. In block form, the USSCP matrix contains the submatrices X`X, X`Y, Y`X, and Y`Y. The following program sweeps the first three rows of the USSCP matrix:

proc iml;
/*Intercept X1  X2  Y */
M  = {  1   1   1   1,
        1   2   1   3,
        1   3   1   3,
        1   1  -1   2,
        1   2  -1   2,
        1   3  -1   1};
S = sweep(M`*M, 1:3);     /* sweep first three rows (Intercept, X1, X2) */
print S;
The sweep operator applied to all ros of a USSCP matrix to obtain regression coefficients, SSe, and generalized inverse

The resulting matrix (S) is divided into blocks by black lines. The leading 3x3 block is the (generalized) inverse of the X`X matrix. The first three elements of the last column (outlined in red) are the least squares estimates for the regression model. The lower right cell (outlined in blue) is the sum of squared errors (SSE) for the regression model.

You can compare the result with the results from PROC REG, as follows. The parameter estimates and error sum of squares are highlighted for easy comparison:

data Have;
input X1 X2 Y @@;
1   1   1   2   1   3   3   1   3
1  -1   2   2  -1   2   3  -1   1
proc reg data=Have USSCP plots=none;
   model Y = X1 X2;
   ods select USSCP ANOVA ParameterEstimates;
run; quit;
Regression estimates and SSE for least squares analysis in SAS PROC REG

As claimed, the sweep operator produces the same parameter estimates and SSE statistic as PROC REG. The next sections discuss additional details of the sweep operator.

The basics of the sweep operator

Goodnight (1979) defines the sweep operator as the following sequence of row operations. Given a symmetric positive definite matrix A, SWEEP(A, k) modifies the matrix A by using the pivot element A[k,k] and the k_th row, as follows:

  1. Let D = A[k,k] be the k_th diagonal element.
  2. Divide the k_th row by D.
  3. For every other row i ≠ k, let B = A[i,k] be the i_th element of the k_th column. Subtract B x (row k) from row i. Then set A[i,k] = –B/D.
  4. Set A[k,k] = 1/D.

You could program these steps in a matrix language such as SAS/IML, but as shown previously, the SAS/IML language supports the SWEEP function as a built-in function.

Sweeping in effects

The following program uses the same data as the first section. The USSCP matrix (S0) represents the normal equations (plus a little extra information). If you sweep the first row of the S0 matrix, you solve the regression problem for an intercept-only model:

proc iml;
/*Intercept X1  X2  Y */
M  = {  1   1   1   1,    1   2   1   3,    1   3   1   3,
        1   1  -1   2,    1   2  -1   2,    1   3  -1   1};
S0 = M`*M;          print S0;   /* USSCP matrix */
/* sweep in 1st row (Intercept) */
S1 = sweep(S0, 1);   print S1[F=BEST6.];
The sweep operator applied to the first row of a USSCP matrix to obtain an intercept-only model

The first element in the fourth row (S1[1,4], outlined in red) is the parameter estimate for the intercept-only model. The last element (S1[4,4], outlined in blue) is the sum of squared errors (SSE) for the intercept-only model.

If you "sweep in" the second row, you solve the least squares problem for a model that includes the intercept and X1. The corresponding elements of the last column contain the parameter estimates for the model. The lower-right element again contains the SSE for the revised model. If you proceed to "sweep in" the third row, the last column again contains the parameter estimates and the SSE, this time for the model that includes the intercept, X1, and X2:

/* sweep in 2nd row (X1) and 3rd row (X2) */
S2 = sweep(S1, 2);   print S2[F=BEST6.]; 
S3 = sweep(S2, 3);   print S3[F=BEST6.];
The sweep operator applied to subsequent rows of a USSCP matrix to 'sweep in' additional effects

Sweeping out effects

One of the useful features of the sweep operators is that you can remove an effect from a model as easily as you can add it. The sweep operator has a reversibility property: if you sweep a row a second time you "undo" the first sweep. In symbols, the operator has the property that SWEEP( SWEEP(A,k), k) = A for any row k. For example, if you want to take the X1 variable out of the model, merely sweep on the second row again:

S4 = sweep(S3, 2);   print S4[F=BEST6.]; /* model with Intercept + X2 */
The sweep operator applied twice 'sweeps out' an effect from a regression model

Of course, as shown in an earlier section, you do not need to sweep in one row at a time. You can sweep in multiple rows with one call by specifying a vector of indices. In fact, the sweep operator also commutes with itself, which means that you can specify the rows in any order. The call SWEEP(S0, 1:3) is equivalent to SWEEP(S0, {3 1 2}) or SWEEP(S0, {2 3 1}) or any other permutation of {1 2 3}.


The SWEEP operator (which is implemented in the SWEEP function in SAS/IML) enables you to construct least squares model estimates from the USSCP matrix, which is a block matrix that contains the normal equations. You can "sweep in" rows to add effects to a model or "sweep out" rows to remove effects. After each operation, the parameter estimates appear in the portion of the adjusted USSCP matrix that corresponds to the block for X`Y. Furthermore, the residual sum of squares appears in the portion of the adjusted USSCP matrix that corresponds to Y`Y.

Further reading

The sweep operator is known by different names in different fields and has been rediscovered several times. In statistics, the operator was popularized in a TAS article by J. Goodnight (1979) who mentions Ralston (1960) and Beaton (1964) as early researchers. Outside of statistics, the operator is known as the Principal Pivot Transform (Tucker, 1960), gyration (Duffy, Hazony, and Morrison, 1966), or exchange (Stewart and Stewart, 1998). Tsatsomeros (2000) provides an excellent review of the literature and the history of the sweep operator.

The post The sweep operator: A fundamental operation in regression appeared first on The DO Loop.

4月 112018

Sometimes it is important to ensure that a matrix has unique rows. When the data are all numeric, there is an easy way to detect (and delete!) duplicate rows in a matrix.

The main idea is to subtract one row from another. Start with the first row and subtract it from every row beneath it. If any row of the difference matrix is identically zero, then you have found a row that is identical to the first row. Then do the same thing for the second row: subtract it from the third and higher rows and see if you obtain a row of zeros. Repeat this process for the third and higher rows.

The following SAS/IML program implements this algorithm for an arbitrary numeric matrix. It returns a binary indicator variable (a column vector). The i_th row of the returned vector is 1 for rows that are duplicates of some previous row:

proc iml;
/* return binary vector that indicates which rows in a 
   numerical matrix are duplicates */
start DupRows(A);
   N = nrow(A);
   dupRows = j(N, 1, 0);         /* binary indicator matrix */
   do i = 1 to N-1;
      if dupRows[i] = 0 then do; /* skip rows that are known to be duplicates */
         r = i+1:N;              /* remaining rows */ 
         M = A[r, ]-A[i, ];      /* subtract current row from remaining rows */
         b = M[ ,##];            /* sum of squares = 0 iff duplicate row */
         idx = loc( b=0 );       /* any duplicate rows for current row? */
         if ncol(idx) > 0 then
            dupRows[r[idx]] = 1; /* if so, flag them */
   return dupRows;

To test the function, consider the following 6 x 3 matrix. You can see by inspection that the matrix has three duplicate rows: the third, fifth, and sixth rows. You can call the DupRows function and print the matrix adjacent to the binary vector that indicates the duplicate rows:

A = {1 1 1,
     1 2 3,
     1 1 1,  /* duplicate row */
     3 2 1,
     3 2 1,  /* duplicate row */
     1 1 1};  /* duplicate row */
DupRows = DupRows(A);
print A DupRows;

You can use the DupRows function to write a function that excludes the duplicate rows in a matrix and returns only the unique rows, as follows:

/* return the unique rows in a numerical matrix */
start UniqueRows(A);
   uniqRows = loc(DupRows(A)=0);
   return(A[uniqRows, ]);   /* return rows that are NOT duplicates */
U = UniqueRows(A);
print U;

I like this algorithm because it uses subtraction, which is a very fast operation. However, This algorithm has a complexity of O(n(n-1)/2), where n is the number of rows in the matrix, so it is best for small to moderate values of n.

For long and skinny matrices (or for character data), it might be more efficient to sort the matrix as I showed in a previous article about how to find (and count) the unique rows of a matrix. However, the sorting algorithm requires that you sort the data by ALL columns, which can be inefficient for very wide and short data tables. For small and medium-sized data, you can use either method to find the unique rows in a matrix.

This article discusses how to remove duplicates records for numerical matrices in SAS/IML. In Base SAS, you can use PROC SORT or PROC SQL to remove duplicate records from a SAS DATA set, as shown in Lafler (2017) and other references.

The post Find the unique rows of a numeric matrix appeared first on The DO Loop.

3月 212018
Pseudocode for the conjugate gradient method for solving a linear system (from Wikipedia)

I often claim that the "natural syntax" of the SAS/IML language makes it easy to implement an algorithm or statistical formula as it appears in a textbook or journal. The other day I had an opportunity to test the truth of that statement. A SAS programmer wanted to implement the conjugate gradient algorithm, which is an iterative method for solving a system of equations with certain properties. I looked up the Wikipedia article about the conjugate gradient method and saw the following text:

"The algorithm is detailed below for solving Ax = b where A is a real, symmetric, positive-definite matrix. The input vector x0 can be an approximate initial solution or 0." The text was followed by pseudocode (in the box; click to enlarge)

The conjugate gradient method in SAS/IML

I used the pseudocode to implement the conjugate gradient method in SAS/IML. (The method is explained further in the next section.) I chose not to use the 'k' and 'k+1' notation but merely to overwrite the old values of variables with new values. In the program, the variables x, r, and p are vectors and the variable A is a matrix.

/* Linear conjugate gradient method as presented in Wikipedia:
   Solve the linear system A*x = b, where A is a symmetric positive definite matrix.
   The algorithm converges in at most n iterations, where n is the dimension of A.
   This function requires an initial guess, x0, which can be the zero vector.
   The function does not verify that the matrix A is symmetric positive definite.
   This module returns a matrix mX= x1 || x2 || ... || xn 
   whose columns contain the iterative path from x0 to the approximate solution vector xn.  */
proc iml;
start ConjGrad( x0, A, b, Tolerance=1e-6 );
  x = x0;                            /* initial guess */
  r = b - A*x0;                      /* residual */
  p = r;                             /* initial direction */
  mX = j(nrow(A), nrow(A), .);       /* optional: each column is result of an iteration */
  do k = 1 to ncol(mX) until ( done );
    rpr = r`*r;  Ap = A*p;           /* store partial results */
    alpha =  rpr / (p`*Ap);          /* step size */
    x = x + alpha*p;   mX[,k] = x;   /* remember new guess */
    r = r - alpha*Ap;                /* new residual */
    done = (sqrt(rpr) < Tolerance);  /* stop if ||r|| < Tol */
    beta = (r`*r) / rpr;             /* new stepsize */
    p = r + beta*p;                  /* new direction */
  return ( mX );     /* I want to return the entire iteration history, not just x */

The SAS/IML program is easy to read and looks remarkably similar to the pseudocode in the Wikipedia article. This is in contrast to lower-level languages such as C in which the implementation looks markedly different from the pseudocode.

Convergence of the conjugate gradient method

The conjugate gradient method is an iterative method to find the solution of a linear system A*x=b, where A is a symmetric positive definite n x n matrix, b is a vector, and x is the unknown solution vector. (Recall that symmetric positive definite matrices arise naturally in statistics as the crossproduct matrix (or covariance matrix) of a set of variables.) The beauty of the conjugate gradient method is twofold: it is guaranteed to find the solution (in exact arithmetic) in at most n iterations, and it requires only simple operations, namely matrix-vector multiplications and vector additions. This makes it ideal for solving system of large sparse systems, because you can implement the algorithm without explicitly forming the coefficient matrix.

It is fun to look at how the algorithm converges from the initial guess to the final solution. The following example converges gradually, but I know of examples for which the algorithm seems to make little progress for the first n – 1 iterations, only to make a huge jump on the final iteration straight to the solution!

Recall that you can use a Toeplitz matrix to construct a symmetric positive definite matrix. The following statements define a banded Toeplitz matrix with 5 on the diagonal and specify the right-hand side of the system. The zero vector is used as an initial guess for the algorithm. The call to the ConjGrad function returns a matrix whose columns contain the iteration history for the method. For this problem, the method requires five iterations to converge, so the fifth column contains the solution vector. You can check that the solution to this system is (x1, x2, 3, x4, x5) = (-0.75, 0, 1.5, 0.5, -0.75), either by performing matrix multiplication or by using the SOLVE function in IML to compute the solution vector.

A = {5 4 3 2 1,          /* SPD Toeplitz matrix */
     4 5 4 3 2, 
     3 4 5 4 3,
     2 3 4 5 4,
     1 2 3 4 5};
b = {1, 3, 5, 4, 2};     /* right hand side */
n = ncol(A);
x0 = j(n,1,0);           /* the zero vector */
traj = ConjGrad( x0, A, b );
x = traj[ ,n];           /* for this problem, solution is in last column */

It is instructive to view how the iteration progresses from an initial guess to the final solution. One way to view the iterations is to compute the Euclidean distance between each partial solution and the final solution. You can then graph the distance between each iteration and the final solution, which decreases monotonically (Hestenes and Stiefel, 1952).

Distance = sqrt( (traj - x)[##, ] ); /* || x[j] - x_Soln || */
Iteration = 1:n;
title "Convergence of Conjugate Gradient Method";
call series(Iteration, Distance) grid={x y} xValues=Iteration 
     option="markers" label={"" "Distance to Solution"};

Notice in the distance graph that the fourth iteration almost equals the final solution. You can try different initial guesses to see how the guess affects the convergence.

In addition to the global convergence, you can visualize the convergence in each coordinate. Because the vectors live in high-dimensional space, it impossible to draw a scatter plot of the iterative solutions. However, you can visualize each vector in "parallel coordinates" as a sequence of line segments that connect the coordinates in each variable. In the following graph, each "series plot" represents a partial solution. The curves are labeled by the iteration number. The blue horizontal line represents the initial guess (iteration 0). The partial solution after the first iteration is shown in red and so on until the final solution, which is displayed in black. You can see that the third iteration (for this example) is close to the final solution. The fourth partial solution is so close that it cannot be visually distinguished from the final solution.

Convergence of the conjugate gradient method

In summary, this post shows that the natural syntax of the SAS/IML language makes it easy to translate pseudocode into a working program. The article focuses on the conjugate gradient method, which solves a symmetric, positive definite, linear system in at most n iterations. The article shows two ways to visualize the convergence of the iterative method to the solution.

The post The conjugate gradient method appeared first on The DO Loop.

3月 052018

Many people know that a surface can contain a saddle point, but did you know that you can define the saddle point of a matrix? Saddle points in matrices are somewhat rare, which means that if you choose a random matrix you are unlikely to choose one that has a saddle point. This article shows how to

  • Locate a saddle point in a matrix
  • Use simulation to estimate the probability of finding a saddle point in a random matrix.
  • Use simulation to explore the distribution of the location of a saddle point in a random matrix.

A saddle point of a matrix

You might remember from multivariable calculus that a critical point (x0, y0) is a saddle point of a function f if it is a local minimum of the surface in one direction and a local maximum in another direction. The canonical example is f(x,y) = x2 - y2 at the point (x0, y0) = (0, 0). Along the horizontal line y=0, the point (0, 0) is a local minimum. Along the vertical line x=0, the point (0, 0) is a local maximum.

The definition of a saddle point of a matrix is similar. For an n x p matrix M, the cell (i, j) is a saddle point of the matrix if M[i, j] is simultaneously the minimum value of the i_th row and the maximum value of the j_th column.

Consider the following 3 x 3 matrices. The minimum value for each row is highlighted in blue. The maximum value for each column is highlighted in red. Two of the matrices have saddle points, which are highlighted in purple.

  • For the matrix M1, the (3, 1) cell is a saddle point because 7 is the smallest value in row 3 and the largest value in column 1.
  • For the matrix M2, the (2, 2) cell is a saddle point because 5 is the smallest value in row 2 and the largest value in column 2.
  • The matrix M3 does not have a saddle point. No cell is simultaneously the smallest value in its row and the largest value in its column.
The saddle point of a matrix

Compute the location of a saddle point in a matrix

In a matrix language such as SAS/IML, you can use row and column operators to find the location of a saddle point, if it exists. The SAS/IML language supports subscript reduction operators, which enable you to compute certain statistics for all rows or all columns without writing a loop. For this problem, you can use the "index of the minimum operator" (>:<) to find the elements that are the minimum for each row and the "index of the maximum operator" (<:>) to find the elements that are the maximum for each column. To find whether there is an element in common, you can use the XSECT function find the intersection of the two sets. (SAS/IML supports several functions that perform set operations.)

A function that computes the location of the saddle point follows. It uses the SUB2NDX function to convert subscripts to indices.

proc iml;
start LocSaddlePt(M);
   dim = dimension(M);  n = dim[1];  p = dim[2];
   minRow = M[ ,>:<];                    /* for each row, find column that contains min */
   minSubscripts = T(1:n) || minRow;     /* location as (i,j) pair */
   minIdx = sub2ndx(dim, minSubscripts); /* convert location to index in [1, np] */
   maxCol = T(M[<:>, ]);                 /* for each column, find row that contains max */
   maxSubscripts = maxCol || T(1:p);     /* loation as (i,j) pair */
   maxIdx = sub2ndx(dim, maxSubscripts); /* convert location to to index in [1, np] */
   xsect = xsect(minIdx, maxIdx);        /* intersection; might be empty matrix */
   if ncol(xsect) > 0 then
      return ( xsect );  
      return ( . );
M1 = {1 2 3, 4 5 6, 7 8 9};  idx1 = LocSaddlePt(M1);
M2 = {9 1 2, 8 5 7, 3 4 6};  idx2 = LocSaddlePt(M2);
M3 = {8 1 9, 7 2 6, 3 4 5};  idx3 = LocSaddlePt(M3);
print idx1 idx2 idx3;
Find the location of the saddle point of a matrix

The indices of the saddle points are shown for the three example matrices. The 7th element of the M1 matrix is a saddle point, which corresponds to the (3, 1) cell for a 3 x 3 matrix in row-major order. The 5th element (cell (2,2)) of the M2 matrix is a saddle point. The M3 matrix does not contain a saddle point.

The probability of a saddle point in a 3x3 random matrix

A matrix contains either zero or one saddle point. There is a theorem (Ord, 1979) that says that the probability of finding a saddle point in a random n x p matrix is (n! p!) / (n+p-1)! when the elements of the matrix are randomly chosen from any continuous probability distribution.

Although the theorem holds for any continuous probability distribution, it suffices to prove it for the uniform distribution. Notice that the existence and location of a saddle point is invariant under any monotonic transformation because those transformations do not change the relative order of elements. In particular, if F is any probability distribution, the existence and location of the saddle point is invariant under the monotonic transformation F–1, which transforms the numbers into a sample from the uniform distribution. In fact, you can replace the number by their ranks, which converts the random matrix into a matrix of that contains the consecutive integers 1, 2, ..., np. Thus the previous integer matrices represent the ranks of a random matrix where the elements are drawn from any continuous distribution.

For small values of n and p, you can use the ALLPERM function to enumerate all possible matrices that have integers in the range [1, np]. You can reshape the permutation into an n x p matrix and compute whether the matrix contains a saddle point. If you use a 0/1 indicator array to encode the result, then the exact probability of finding a saddle point in a random n x p matrix is found by computing the mean of the indicator array. The following example computes the exact probability of finding a saddle point in a random 3 x 3 matrix:

A = allperm(9);            /* all permutations of 1:9 */
saddle = j(nrow(A), 1);    /* allocate vector for results */
do i = 1 to nrow(A);       /* for each permutation (row) ... */
   M = shape(A[i,], 3);    /*    reshape row into 3x3 matrix */
   idx = LocSaddlePt( M ); /*    find location of saddle */
   saddle[i] = (idx > 0);  /*    0/1 indicator variable */
prob = saddle[:];          /* (sum of 1s) / (number of matrices) */
print prob;
The probability of a saddle point in a 3 x 3 matrix

As you can see, the computational enumeration shows that the probability of a saddle point is 0.3 for 3 x 3 matrices. This agrees with the theoretical formula (3! 3!) / 5! = 36 / 120 = 3/10.

The probability of a saddle point in a random matrix

Of course, when n and p are larger it becomes infeasible to completely enumerate all n x p matrices. However, you can adopt a Monte Carlo approach: generate a large number of random matrices and compute the proportion which contains a saddle point. The following program simulates one million random 4 x 4 matrices from the uniform distribution and computes the proportion that has a saddle point. This is a Monte Carlo estimate of the true probability.

NSim = 1e6;
n = 4; p=4;                  /* random 4x4 matrices from uniform distribution */
A = j(NSim, n*p);
call randgen(A, "Uniform");
saddleLoc = j(NSim, 1, .);   /* store location of the saddle point */
saddle = j(NSim, 1);         /* binary indicator variable */
do i = 1 to NSim;            /* Monte Carlo estimation of probability */
   M = shape(A[i,], n, p);
   saddleLoc[i] = LocSaddlePt( M );  /* save the location 1-16 */
   saddle[i] = (saddleLoc[i] > 0 );  /* 0/1 binary variable */
estProb = saddle[:];
Prob = fact(n)*fact(p) / fact(n+p-1);
print estProb Prob;
The probability of a saddle point of a 4 x 4 matrix

The output shows that the estimated probability agrees with the exact probability to four decimal places.

The distribution of locations of saddle points

The saddle point can occur in any cell in the matrix. Consider, for example, the integer matrix M1 from the earlier 3 x 3 example. If you swap the first and third rows of M1, you will obtain a new matrix that has a saddle point in the (1,1) cell. If you swap the second and third rows of M1, you obtain a saddle point in the (2,1) cell. Similarly, you can swap columns to move a saddle point to a different column. In general, you can move the saddle point to any cell in the matrix by a transposition of rows and columns. Thus each cell in the matrix has an equal probability of containing a saddle point, and the probability that the saddle point is in cell (i,j) is 1/(np) times to the probability that a saddle point exists.

In the preceding SAS/IML simulation, the saddleLoc variable contains the location of the saddle point in the 4 x 4 matrix. The following statements compute the empirical proportion of times that a saddle point appeared in each of the 16 cells of the matrix:

call tabulate(location, freq, saddleLoc);  /* frequencies in cells 1-16 */
counts = shape(freq, n, p);                /* shape into matrix     */
EstProbLoc = counts / (NSim);              /* empirical proportions */
print EstProbLoc[r=('1':'4') c=('1':'4') F=7.5];
The probabilities that a saddle point of a matrix is in the (i,j)th position

The output indicates that each cell has a probability that is estimated by 0.007 (the "James Bond" probability!). Theoretically, you can prove that the probability in each cell is given by the value of the complete beta function B(n, p), which for n=p=4 yields B(4,4) = 0.0071429 (Hofri, 2006).

In summary, this article has explored three topics that are dear to my heart: probability, simulation, and properties of matrices. Along the way, we encountered factorials, permutations, the complete beta function, and got to write some fun SAS/IML programs. For more information about the probability of a saddle point in random matrices, see "The probability that a matrix has a saddle point" (Thorp, 1979) and and "On the distribution of a saddle point value in a random matrix" (Hofri, 2006).

The post The probability of a saddle point in a matrix appeared first on The DO Loop.

12月 112017
Self-similar Christmas tree created in SAS

Happy holidays to all my readers! My greeting-card to you is an image of a self-similar Christmas tree. The image (click to enlarge) was created in SAS by using two features that I blog about regularly: matrix computations and ODS statistical graphics.

Self-similarity in Kronecker products

I have previously shown that the repeated Kronecker product of a binary matrix leads to self-similar structures.

Specifically, if M is a binary matrix then the Kronecker product M@M is a block matrix in which each 1 in the original matrix is replaced with a copy of M and each 0 is replaced by a zero block. (Here '@' is the Kronecker product (or direct product) operator, which is implemented in SAS/IML software.) If you repeat this process three or more times and create a heat map of the product matrix, the self-similar structure is apparent.

For example, the following 5 x 7 binary matrix has a 1 in positions that approximate the shape of a Christmas tree. The direct product M@M@M is a matrix that has 53 = 125 rows and 73 = 343 columns. If you use the HEATMAPDISC subroutine in SAS/IML to create a heat map of the direct product, you obtain the following image in which a green cell represents a 1:

proc iml;
M = {0 0 0 1 0 0 0,
     0 0 1 1 1 0 0,
     0 1 1 1 1 1 0,
     1 1 1 1 1 1 1,
     0 0 0 1 0 0 0};
Tree = M @ M @ M;     /* M is 5 x 7, so Tree is 125 x 343 */
ods graphics / width=300px height=450px  ANTIALIASMAX=50000;
call heatmapdisc(Tree) title="Happy Holidays to All my Readers!"
     colorramp={White Green} displayoutlines=0 ShowLegend=0;

That's not a bad result for three SAS/IML statements! You can see that the "tree" has a self-similar structure: the large tree is composed of 17 smaller trees, each of which is composed of 17 mini trees.

Adding a star

My readers are worth extra effort, so I want to put a star on the top of the tree. One way to do this is to use a scatter plot and plot a special observation by using a star symbol. To plot the "tree" by using a scatter plot, it is necessary to "unpack" the 125 x 343 matrix into a list of (x, y) values (this is a wide-to-long conversion of the matrix). You can use the NDX2SUB function to extract the row and column of each 1 in the matrix, as follows:

/* extract (row, column) pairs for matrix elements that are '1' */
idx = loc(Tree = 1);                /* indices for the 1s */
s = ndx2sub(dimension(Tree), idx);  /* convert indices to subscripts */
create Tree from s[c={"RNames" "CNames"}];
append from s;
call symputx("XPos", range(s[,2])/2); /* midrange = horiz position of star */

The previous statements create a SAS data set called TREE that contains the (x, y) positions of the 1s in the direct product matrix. The statements also saved the midrange value to a macro variable. The following SAS procedures create the location of the star, concatenate the two data sets, and create a scatter plot of the result, which is shown at the top of this article.

data star;
x = &XPos; y = 0; 
data Tree2;
set Tree star;
ods graphics / width=600px height=800px  ANTIALIASMAX=50000;
title "Happy Holidays";
proc sgplot data=Tree2 noborder noautolegend;
scatter x=CNames y=RNames / markerattrs=(color=ForestGreen symbol=SquareFilled size=3);
scatter x=x y=y / markerattrs=(symbol=StarFilled size=20)             /* yellow star */
                  filledoutlinedmarkers markerfillattrs=(color=yellow);
yaxis reverse display=none;
xaxis display=none;

Previous SAS-created Christmas cards

All year long I blog about how to use SAS for serious pursuits like statistical modeling, data analysis, optimization, and simulation. Consequently, I enjoy my occasional forays into a fun and frivolous topic such as how to create a greeting card with SAS. If you like seeing geeky SAS-generated images, here is a list of my efforts from previous years:

Wherever you are and whatever holiday you celebrate, may you enjoy peace and happiness this season and in the New Year.

The post A self-similar Christmas tree appeared first on The DO Loop.

8月 302017

Aa previous article discussed the mathematical properties of the singular value decomposition (SVD) and showed how to use the SVD subroutine in SAS/IML software. This article uses the SVD to construct a low-rank approximation to an image. Applications include image compression and denoising an image.

Construct a grayscale image

The value of each pixel in a grayscale image can be stored in a matrix where each element of the matrix is a value between 0 (off) and 1 (full intensity). I want to create a small example that is easy to view, so I'll create a small matrix that contains information for a low-resolution image of the capital letters "SVD." Each letter will be five pixels high and three pixels wide, arranges in a 7 x 13 array of 0/1 values. The following SAS/IML program creates the array and displays a heat map of the data:

ods graphics / width=390px height=210px;
proc iml;
SVD = {0 0 0 0 0 0 0 0 0 0 0 0 0,
       0 1 1 1 0 1 0 1 0 1 1 0 0,
       0 1 0 0 0 1 0 1 0 1 0 1 0,
       0 1 1 1 0 1 0 1 0 1 0 1 0,
       0 0 0 1 0 1 0 1 0 1 0 1 0,
       0 1 1 1 0 0 1 0 0 1 1 0 0,
       0 0 0 0 0 0 0 0 0 0 0 0 0 };
A = SVD;
/* ColorRamp:  White    Gray1    Gray2    Gray3    Black   */
ramp =        {CXFFFFFF CXF0F0F0 CXBDBDBD CX636363 CX000000};
call heatmapcont(A) colorramp=ramp showlegend=0 title="Original Image" range={0,1};
Rank-5 data matrix for singular value decomposition (SVD)

A low-rank approximation to an image

Because the data matrix contains only five non-zero rows, the rank of the A matrix cannot be more than 5. The following statements compute the SVD of the data matrix and create a plot of the singular values. The plot of the singular values is similar to the scree plot in principal component analysis, and you can use the plot to help choose the number of components to retain when approximating the data.

call svd(U, D, V, A);  /* A = U*diag(D)*V` */
title "Singular Values";
call series(1:nrow(D), D) grid={x y} xvalues=1:nrow(D) label={"Component" "Singular Value"};
Singular values of rank-5 data matrix

For this example, it looks like retaining three or five components would be good choices for approximating the data. To see how low-rank approximations work, let's generate and view the rank-1, rank-2, and rank-3 approximations:

keep = 1:3;          /* use keep=1:5 to see all approximations */
do i = 1 to ncol(keep);
   idx = 1:keep[i];                           /* components to keep */
   Ak = U[ ,idx] * diag(D[idx]) * V[ ,idx]`;  /* rank k approximation */
   Ak = (Ak - min(Ak)) / range(Ak);           /* for plotting, stdize into [0,1] */
   s = "Rank = " + char(keep[i],1);
   call heatmapcont(Ak) colorramp=ramp showlegend=0 title=s range={0, 1};
Low-rank approximation: Rank-1 approximation via SVD of a data matrix

The rank-1 approximation does a good job of determining the columns and do not contain that contain the letters. The approximation also picks out two rows that contain the horizontal strokes of the capital "S."

Low-rank approximation: Rank-2 approximation via SVD of a data matrix

The rank-2 approximation refines the image and adds additional details. You can begin to make out the letters "SVD." In particular, all three horizontal strokes for the "S" are visible and you can begin to see the hole in the capital "D."

Low-rank approximation: Rank-3 approximation via SVD of a data matrix

The rank-3 approximation contains enough details that someone unfamiliar with the message can read it. The "S" is reconstructed almost perfectly and the space inside the "V" and "D" is almost empty. Even though this data is five-dimensional, this three-dimensional approximation is very good.

Not only is a low-rank approximation easier to work with than the original five-dimensional data, but a low-rank approximation represents a compression of the data. The original image contains 7 x 13 = 91 values. For the rank-3 approximation, three columns of the U matrix contain 33 numbers and three columns of VT contain 15 numbers. So the total number of values required to represent the rank-3 approximation is only 48, which is almost half the number of values as for the original image.

Denoising an image

You can use the singular value decomposition and low-rank approximations to try to eliminate random noise that has corrupted an image. Every TV detective series has shown an episode in which the police obtain a blurry image of a suspect's face or license plate. The detective asks the computer technician if she can enhance the image. With the push of a button, the blurry image is replaced by a crystal clear image that enables the police to identify and apprehend the criminal.

The image reconstruction algorithms used in modern law enforcement are more sophisticated than the SVD. Nevertheless, the SVD can do a reasonable job of removing small random noise from an image, thereby making certain features easier to see. The SVD has to have enough data to work with, so the following statements duplicate the "SVD" image four times before adding random Gaussian noise to the data. The noise has a standard deviation equal to 25% of the range of the data. The noisy data is displayed as a heat map on the range [-0.25, 1.25] by using a color ramp that includes yellow for negative values and blue for values greater than 1.

call randseed(12345);
A = (SVD // SVD) || (SVD // SVD);                     /* duplicate the image */
A = A + randfun( dimension(A), "Normal", 0, 0.25);    /* add Gaussian noise */
/*       Yellow   White    Gray1    Gray2    Gray3    Black    Blue    */
ramp2 = {CXFFEDA0 CXFFFFFF CXF0F0F0 CXBDBDBD CX636363 CX000000 CX3182BD};
call heatmapcont(A) colorramp=ramp2 showlegend=0 title="Noisy Image" range={-0.5, 1.5};
Image corrupted by Gaussian noise

I think most police officers would be able to read this message in spite of the added noise, but let's see if the SVD can clean it up. The following statements compute the SVD and create a plot of the singular values:

call svd(U, D, V, A);  /* A = U*diag(D)*V` */
call series(1:nrow(D), D) grid={x y} xvalues=1:nrow(D) label={"Component" "Singular Value"};
Plot of singular values for a small image that is corrupted by Gaussian noise

There are 14 non-zero singular values. In theory, the main signal is contained in the components that correspond to the largest singular values whereas the noise is captured by the components for the smallest singular values. For these data, the plot of the singular values suggests that three or five (or nine) components might capture the main signal while ignoring the noise. The following statements create and display the rank-3 and rank-5 approximations. Only the rank-5 approximation is shown.

keep = {3 5};        /* which components to examine? */
do i = 1 to ncol(keep);
   idx = 1:keep[i];
   Ak = U[ ,idx] * diag(D[idx]) * V[ ,idx]`;
   Ak = (Ak - min(Ak)) / range(Ak); /* for plotting, standardize into [0,1] */
   s = "Rank = " + char(keep[i],2);
   call heatmapcont(Ak) colorramp=ramp showlegend=0 title=s range={0, 1};
Low-rank approximation: Rank-5 reconstruction via SVD of a small image that is corrupted by Gaussian noise

The denoised low-rank image is not as clear as Hollywood depicts, but it is quite readable. It is certainly good enough to identify a potential suspect. I can hear the detective mutter, "Book 'em, Danno! Murder One."

Summary and further reading

In summary, the singular value decomposition (SVD) enables you to approximate a data matrix by using a low-rank approximation. This article uses a small example for which the full data matrix is rank-5. A plot of the singular values can help you choose the number of components to retain. For this example, a rank-3 approximation represents the essential features of the data matrix.

For similar analyses and examples that use the singular value decomposition, see

In SAS software, the SVD is heavily used in the SAS Text Miner product. For an overview of how a company can use text mining to analyze customer support issues, see Sanders and DeVault (2004) "Using SAS at SAS: The Mining of SAS Technical Support."

The post The singular value decomposition and low-rank approximations appeared first on The DO Loop.

7月 312017

A SAS user needed to convert a program from MATLAB into the SAS/IML matrix language and asked whether these is a SAS/IML equivalent to the fliplr and flipud functions in MATLAB. These functions flip the columns or rows (respectively) of a matrix; "LR" stands for "left-right" and "UD" stands for "up-down."

No, SAS/IML does not have equivalent built-in functions, but as Devo once sang: "When a problem comes along, you must whip it." This article shows how to implement these functions in SAS/IML. Or, to paraphrase Devo's maxim: "When a matrix comes along, you must flip it. It's not too late to flip it. Flip it good!"

The FLIPLR and FLIPUD functions in SAS/IML

To key to reversing the columns or rows of a matrix is to use subscript notation. Recall that in the SAS/IML language, the expression A[ ,1] extracts the first column of A. (When the row subscript is not specified, it means "use all rows.") Similarly, A[ ,1:3] extracts the first three columns of A (in order), whereas A[ ,3:1] extracts the first three columns of A in reverse order. A similar notation is valid for rows. The expression A[3:1, ] extracts the first three rows of A in reverse order. Therefore, the following SAS/IML functions implement the functionality of the fliplr and flipud functions for matrices:

proc iml;
/* reverse columns of a matrix */
start fliplr(A);
   return A[ , ncol(A):1];
/* reverse rows of a matrix */
start flipud(A);
   return A[nrow(A):1, ];
/* test the functions on a 2x3 matrix */
x = {1 2 3, 4 5 6};
lr = fliplr(x);
ud = flipud(x);
print x, lr, ud;

Reversing all elements

You can also reverse all elements of a matrix. Recall that a SAS/IML matrix is stored in row-major order, which means that the elements are enumerated by counting across the first row, then across the second row, and so forth. Thus if A is an n x m matrix, then A[n*m:1] returns the elements of A in reverse order. However, the elements are returned in a column vector. If you want the elements in a matrix that has the same dimensions as A, you can "shape it up" by using the SHAPE function, as follows:

/* reverse elements of a matrix */
start reverse(A);
   n = nrow(A); m = ncol(A);
   return shape(A[n*m:1], n, m);     /* Now flip it into SHAPE */
rev = reverse(x);
print rev;

The MATLAB language also supports the rot90 function, which rotates a matrix. I have previously shown how to rotate the elements of a matrix in the SAS/IML language.

In conclusion, although SAS/IML does not contain built-in functions for reversing the rows and columns of a matrix, these functions are easily defined in terms of matrix subscript operations. In short, when given a matrix, it is not hard to flip it, flip it good!

The post Flip it. Flip it good. appeared first on The DO Loop.