Matrix Computations

November 28, 2018

I remember the first time I used PROC GLM in SAS to include a classification effect in a regression model. I thought I had done something wrong because the parameter estimates table was followed by a scary-looking note:

Note: The X'X matrix has been found to be singular, and a generalized inverse 
      was used to solve the normal equations. Terms whose estimates are 
      followed by the letter 'B' are not uniquely estimable. 

Singular matrix? Generalized inverse? Estimates not unique? "What's going on?" I thought.

In spite of the ominous note, nothing is wrong. The note merely tells you that the GLM procedure has computed one particular estimate; other valid estimates also exist. This article explains what the note means in terms of the matrix computations that are used to estimate the parameters in a linear regression model.

The GLM parameterization is a singular parameterization

The note is caused by the fact that the GLM model includes a classification variable. Recall that a classification variable in a regression model is a categorical variable whose levels are used as explanatory variables. Examples of classification variables (called CLASS variables in SAS) are gender, race, and treatment. The levels of the categorical variable are encoded in the columns of a design matrix. The columns are often called dummy variables. The design matrix is used to form the "normal equations" for least squares regression. In terms of matrices, the normal equations are written as (X`*X)*b = X`*Y, where X is a design matrix, Y is the vector of observed responses, and b is the vector of parameter estimates, which must be computed.

There are many ways to construct a design matrix from classification variables. If X is a design matrix that has linearly dependent columns, the crossproduct matrix X`X is singular. Some ways of creating the design matrix always result in linearly dependent columns; these constructions are said to use a singular parameterization.

The simplest and most common parameterization encodes each level of a categorical variable by using a binary indicator column. This is known as the GLM parameterization. It is a singular parameterization because if X1, X2, ..., Xk are the binary columns that indicate the k levels, then Σ Xi = 1 for each observation.
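For example, the following SAS/IML sketch uses the DESIGN function to generate the indicator columns for a hypothetical three-level variable. Because every row sum equals 1, the dummy columns are linearly dependent with an intercept column:

proc iml;
/* sketch: GLM (indicator) coding for a hypothetical three-level variable */
levels = {1, 2, 3, 1, 2, 3};   /* values of a CLASS variable */
X = design(levels);            /* one binary column per level */
rowSum = X[ ,+];               /* sum of the dummy columns is always 1 */
print X rowSum;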

Not surprisingly, the GLM procedure in SAS uses the GLM parameterization. Here is an example that generates the "scary" note. The data are a subset of the Sashelp.Heart data set. The levels of the BP_Status variable are "Optimal", "Normal", and "High":

data Patients;
   set Sashelp.Heart;
   keep BP_Status Cholesterol;
   if ^cmiss(BP_Status, Cholesterol); /* discard any missing values */
run;

proc glm data=Patients plots=none;
   class BP_Status;
   model Cholesterol =  BP_Status / solution;
run; quit;

If you change the reference levels, you get a different estimate

If you have linearly dependent columns among the explanatory variables, the parameter estimates are not unique. The easiest way to see this is to change the reference level for a classification variable. In PROC GLM, you can use the REF=FIRST or REF=LAST option on the CLASS statement to change the reference level. However, the following example uses PROC GLMSELECT (without variable selection) because you can simultaneously use the OUTDESIGN= option to write the design matrix to a SAS data set. The first call writes the design matrix that PROC GLM uses (internally) for the default reference levels. The second call writes the design matrix for an alternate reference level:

/* GLMSELECT can fit the data and output a design matrix in one step */
title "Estimates for GLM Parameterization";
title2 "Default (Last) Reference Levels";
ods select ParameterEstimates(persist);
proc glmselect data=Patients outdesign(fullmodel)=GLMDesign1;
   class BP_Status;
   model Cholesterol =  BP_Status / selection=none;  
run;
/* Change reference levels. Different reference levels result in different estimates. */ 
title2 "Custom Reference Levels";
proc glmselect data=Patients outdesign(fullmodel)=GLMDesign2;
   class BP_Status(ref='Normal');
   model Cholesterol =  BP_Status / selection=none;  
run;
ods select all;
/* compare a few rows of the design matrices */
proc print data=GLMDesign1(obs=10 drop=Cholesterol); run;
proc print data=GLMDesign2(obs=10 drop=Cholesterol); run;

The output shows that changing the reference level results in different parameter estimates. (However, the predicted values are identical for the two estimates.) If you use PROC PRINT to display the first few observations in each design matrix, you can see that the matrices are the same except for the order of two columns. Thus, if you have linearly dependent columns, the GLM estimates might depend on the order of the columns.

The SWEEP operator produces a generalized inverse that is not unique

You might wonder why the parameter estimates change when you change reference levels (or, equivalently, change the order of the columns in the design matrix). The mathematical answer is that there is a whole family of solutions that satisfy the (singular) regression equations, and from among the infinitely many solutions, the GLM procedure chooses the solution for which the estimate of the reference level is exactly zero.

Last week I discussed generalized inverses, including the SWEEP operator and the Moore-Penrose inverse. The SWEEP operator is used by PROC GLM to obtain the parameter estimates. The SWEEP operator produces a generalized inverse that is not unique. In particular, the SWEEP operator computes a generalized inverse that depends on the order of the columns in the design matrix.

The following SAS/IML statements read in the design matrices for each GLM parameterization and use the SWEEP function to reproduce the parameter estimates that are computed by the GLM procedure. For each design matrix, the program computes solutions to the normal equations (X`*X)*b = (X`*Y). The program also computes the Moore-Penrose solution for each design matrix.

proc iml;
/* read design matrix and form X`X and X`*Y */
use GLMDesign1; read all var _NUM_ into Z[c=varNames]; close;
p = ncol(Z);
X = Z[, 1:(p-1)];  Y = Z[, p];  vNames = varNames[,1:(p-1)];
A = X`*X;  c = X`*Y;
/* obtain G2 and Moore-Penrose solution for this design matrix */
Sweep1 = sweep(A)*c;
GInv1  = ginv(A)*c;
print Sweep1[r=vNames], GInv1;
/* read other design matrix and form X`X and X`*Y */
use GLMDesign2; read all var _NUM_ into Z[c=varNames]; close;
p = ncol(Z);
X = Z[, 1:(p-1)];  Y = Z[, p]; vNames = varNames[,1:(p-1)];
A = X`*X;  c = X`*Y;
/* obtain G2 and Moore-Penrose solution for this design matrix */
Sweep2 = sweep(A)*c;
GInv2 = ginv(A)*c;
print Sweep2[r=vNames], GInv2;

The results demonstrate that the SWEEP solution depends on the order of columns in a linearly dependent design matrix. However, the Moore-Penrose solution does not depend on the order. The Moore-Penrose solution is the same no matter which reference levels you choose for the GLM parameterization of classification effects.

In summary, the scary note that PROC GLM produces reminds you of the following mathematical facts:

  • When you include classification effects in a linear regression model and use the GLM parameterization to construct the design matrix, the design matrix has linearly dependent columns.
  • The X`X matrix is singular when X has linearly dependent columns. Consequently, the parameter estimates for least squares regression are not unique.
  • From among the infinitely many solutions to the normal equations, the solution that PROC GLM (and other SAS procedures) computes is based on a generalized inverse that is computed by using the SWEEP operator.
  • The solution obtained by the SWEEP operator depends on the reference levels for the CLASS variables.

If the last fact bothers you (it shouldn't), an alternative estimate is available by using the GINV function to compute the Moore-Penrose inverse. The corresponding estimate is unique and does not depend on the reference level.

The post Singular parameterizations, generalized inverses, and regression estimates appeared first on The DO Loop.

November 21, 2018

A data analyst asked how to compute parameter estimates in a linear regression model when the underlying data matrix is rank deficient. This situation can occur if one of the variables in the regression is a linear combination of other variables. It also occurs when you use the GLM parameterization of a classification variable. In the GLM parameterization, the columns of the design matrix are linearly dependent. As a result, the matrix of crossproducts (the X`X matrix) is singular. In either case, you can understand the computation of the parameter estimates by learning about generalized inverses in linear systems. This article presents an overview of generalized inverses. A subsequent article will specifically apply generalized inverses to the problem of estimating parameters for regression problems with collinearities.

What is a generalized inverse?

Recall that the inverse matrix of a square matrix A is a matrix G such that A*G = G*A = I, where I is the identity matrix. When such a matrix exists, it is unique and A is said to be nonsingular (or invertible). If there are linear dependencies in the columns of A, then an inverse does not exist. However, you can define a series of weaker conditions that are known as the Penrose conditions:

  1. A*G*A = A
  2. G*A*G = G
  3. (A*G)` = A*G
  4. (G*A)` = G*A

Any matrix, G, that satisfies the first condition is called a generalized inverse (or sometimes a "G1" inverse) for A. A matrix that satisfies the first and second condition is called a "G2" inverse for A. The G2 inverse is used in statistics to compute parameter estimates for regression problems (see Goodnight (1979), p. 155). A matrix that satisfies all four conditions is called the Moore-Penrose inverse or the pseudoinverse. When A is square but singular, there are infinitely many matrices that satisfy the first two conditions, but the Moore-Penrose inverse is unique.

Computations with generalized inverses

In regression problems, the parameter estimates are obtained by solving the "normal equations." The normal equations are the linear system (X`*X)*b = (X`*Y), where X is the design matrix, Y is the vector of observed responses, and b is the vector of parameter estimates to be computed. The matrix A = X`*X is symmetric. If the columns of the design matrix are linearly dependent, then A is singular. The following SAS/IML program defines a symmetric singular matrix A and a right-hand-side vector c, which you can think of as X`*Y in the regression context. The call to the DET function computes the determinant of the matrix. A zero determinant indicates that A is singular and that there are infinitely many vectors b that solve the linear system:

proc iml;
A = {100  50 20 10,
      50 106 46 23,
      20  46 56 28,
      10  23 28 14 };
c = {130, 776, 486, 243};
det = det(A);         /* demonstrate that A is singular */
print det;

For nonsingular matrices, you can use either the INV or the SOLVE function in SAS/IML to compute the unique solution of the linear system. However, both functions give errors when called with a singular matrix. SAS/IML provides several ways to compute a generalized inverse, including the SWEEP function and the GINV function. The SWEEP function is an efficient way to use Gaussian elimination to solve the symmetric linear systems that arise in regression. The GINV function computes the Moore-Penrose pseudoinverse. The following SAS/IML statements compute two different solutions for the singular system A*b = c:

b1 = ginv(A)*c;       /* solution even if A is not full rank */
b2 = sweep(A)*c;
print b1 b2;
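As a quick numerical check (a sketch that reuses the matrix A defined above), you can verify the Penrose conditions for the Moore-Penrose inverse. All four quantities below are zero up to rounding error; if you replace GINV(A) with SWEEP(A), only the first two are zero:

/* numerically verify the four Penrose conditions for G = GINV(A) */
G = ginv(A);
c1 = max(abs(A*G*A - A));      /* condition 1 */
c2 = max(abs(G*A*G - G));      /* condition 2 */
c3 = max(abs((A*G)` - A*G));   /* condition 3 */
c4 = max(abs((G*A)` - G*A));   /* condition 4 */
print c1 c2 c3 c4;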

The SAS/IML language also provides a way to obtain any of the other infinitely many solutions to the singular system A*b = c. Because A is a rank-3 matrix (the rank deficiency is 1), it has a one-dimensional kernel (null space). The HOMOGEN function in SAS/IML computes a basis for the null space. That is, it computes a vector that is mapped to the zero vector by A. The following statements compute the unit basis vector for the kernel. The output shows that the vector is mapped to the zero vector:

xNull = homogen(A);   /* basis for nullspace of A */
print xNull (A*xNull)[L="A*xNull"];

All solutions to A*b = c are of the form b + α*xNull, where b is any particular solution.

Properties of the Moore-Penrose solution

You can verify that the Moore-Penrose matrix GINV(A) satisfies the four Penrose conditions, whereas the G2 inverse (SWEEP(A)) only satisfies the first two conditions. I mentioned that the singular system has infinitely many solutions, but the Moore-Penrose solution (b1) is unique. It turns out that the Moore-Penrose solution is the solution that has the smallest Euclidean norm. Here is a computation of the norm for three solutions to the system A*b = c:

/* GINV gives the estimate that has the smallest L2 norm: */
GINVnorm  = norm(b1);
sweepNorm = norm(b2);
/* you can add alpha*xNull to any solution to get another solution */
b3 = b1 + 2*xNull;  /* here's another solution (alpha=2) */
otherNorm = norm(b3);
print ginvNorm sweepNorm otherNorm;

Because all solutions are of the form b1 + α*xNull, where xNull is the basis for the nullspace of A, you can graph the norm of the solutions as a function of α. The graph is shown below and indicates that the Moore-Penrose solution is the minimal-norm solution:

alpha = do(-2.5, 2.5, 0.05);
norm = j(1, ncol(alpha), .);
do i = 1 to ncol(alpha);
   norm[i] = norm(b1 + alpha[i] * xNull);
end;
title "Euclidean Norm of Solutions b + alpha*xNull";
title2 "b = Solution from Moore-Penrose Inverse";
title3 "xNull = Basis for Nullspace of A";
call series(alpha, norm) 
     other="refline 0 / axis=x label='b=GINV';refline 1.78885 / axis=x label='SWEEP';";
Graph of norm of solutions to the singular system A*b=c. The norm is plotted for vectors b + alpha*x_Null where b is the Moore-Penrose solution and x_Null is a basis for the nullspace of A.

In summary, a singular linear system has infinitely many solutions. You can obtain a particular solution by using the sweep operator or by finding the Moore-Penrose solution. You can use the HOMOGEN function to obtain the full family of solutions. The Moore-Penrose solution is expensive to compute but has an interesting property: it is the solution that has the smallest Euclidean norm. The sweep solution is more efficient to compute and is used in SAS regression procedures such as PROC GLM to estimate parameters in models that include classification variables and use a GLM parameterization. The next blog post explores regression estimates in more detail.

The post Generalized inverses for matrices appeared first on The DO Loop.

October 22, 2018

A SAS programmer asked how to rearrange elements of a matrix. The rearrangement he wanted was rather complicated: certain blocks of data needed to move relative to other blocks, but the values within each block were to remain unchanged. It turned out that the mathematical operation he needed is called a block transpose. The BTRAN function in SAS/IML performs block-transpose operations, so the complicated rearrangement was easy to implement with a judicious call to the BTRAN function.

This article discusses the block-transpose operation and gives an example. It also shows how a block transpose can conveniently transform wide data into long data and vice versa.

The block-transpose operation

In general, a block matrix can contain blocks of various sizes. However, the BTRAN function requires that all blocks be the same size. Specifically, suppose A is a block matrix where each block is an n x m matrix. That means that A is an (r n) x (c m) matrix for some whole numbers r and c. The BTRAN function transposes the blocks to create a (c n) x (r m) block matrix.

The idea is more easily explained by a picture. The following image shows a matrix A that is composed of six blocks (of the same size). The block-transpose operation rearranges the blocks but leaves the elements within the blocks unchanged.

A block-transpose operation transposes the blocks but leaves the contents of each block unchanged

The BTRAN function in SAS/IML

Let's see how the block transpose works on an example in the SAS/IML language. The following statements create a 4 x 12 matrix. I have overlaid some grid lines to help you visualize this matrix as a 2 x 3 block matrix, where each block is 2 x 4.

proc iml;
x = repeat(1:12,2);
x = x // (10+x);
print x;
A 4x12 matrix visualized as a 2x3 block matrix. Each block is 2x4.

You can use the BTRAN function to apply a block transpose. The first argument is the matrix that contains the data. The second and third arguments specify the size of the blocks:

y = btran(x, 2, 4);    /* block size is 2x4 */
print y;
A 6x6 matrix visualized as a 3x2 block matrix. Each block is 2x4.

The output is a 6 x 6 matrix, visualized as a 3 x 2 block matrix.

Block transpose for wide-to-long transforms

If the number of rows in the block equals the number of rows in the data, then the BTRAN function stacks columns. In particular, if the block size is n x 1, the BTRAN function stacks multiple variables into a single column, just like the SHAPECOL function. For example, the Sashelp.Iris data contains four variables named SepalLength, SepalWidth, PetalLength, and PetalWidth. The following SAS/IML statements stack the four variables into a single column and create a new column that identifies the name of the original variable:

proc iml;
/* wide to long */
varNames = {"SepalLength" "SepalWidth" "PetalLength" "PetalWidth"};
use Sashelp.Iris;
   read all var "Species";        /* 150 x 1 */
   read all var varNames into X;  /* 150 x 4 */
n = nrow(X); m = ncol(X);         /* n = 150; m = 4 */
ID = repeat(Species, m);          /* repeat the vars that are not transposed */
Var = shapecol(repeat(varNames, n), n*m); /* create column that contains variable names */
Value = btran(X, n, 1);           /* vector with 150*4 rows */ 
create test var {ID "Var" "Value"}; append; close;
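As a quick check, you can peek at the first few rows of the stacked vectors before or after writing them to the data set:

/* quick check: view the first few rows of the long data */
print (ID[1:3])[L="ID"] (Var[1:3])[L="Var"] (Value[1:3])[L="Value"];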

Block transpose for long-to-wide transforms

In a similar way, you can transpose balanced data from long form to wide form. (Recall that "balanced data" means that each group has the same number of observations.) For example, the Sashelp.Iris data contains a Species variable. The first 50 observations have the value 'Setosa', the next 50 have the value 'Versicolor', and the last 50 have the value 'Virginica'. You can use a WHERE clause or a BY statement to analyze each species separately, but suppose you want to create a new data set that contains only 50 observations and has variables named SepalLengthSetosa, SepalLengthVersicolor, SepalLengthVirginica, and so forth for the other variables. The BTRAN function makes this easy: just specify a block size of 50 x 4, as follows.

/* create pairwise combinations of var names and group levels */
BlockRows = 50;
w = btran(X, BlockRows, m);                  /* 50 x 12 matrix */
s = Species[ do(1, n, BlockRows) ];          /* s = {'Setosa', 'Versicolor', 'Virginica'} */
vNames = rowcatc(expandgrid(varNames, s));   /* combine var names and species */
print vNames, w[c=vNames L="Measurements (cm)"];

In summary, the BTRAN function is a useful function when you need to rearrange blocks of data without changing the values in a block. For balanced designs, you can use the BTRAN function to convert data between wide and long formats.

The post Transpose blocks to reshape data appeared first on The DO Loop.

October 3, 2018

It is sometimes necessary for researchers to simulate data with thousands of variables. It is easy to simulate thousands of uncorrelated variables, but more difficult to simulate thousands of correlated variables. For that, you can generate a correlation matrix that has special properties, such as a Toeplitz matrix or a first-order autoregressive (AR(1)) correlation matrix. I have previously written about how to generate a large Toeplitz matrix in SAS. This article describes three useful results for an AR(1) correlation matrix:

  • How to generate an AR(1) correlation matrix in the SAS/IML language
  • How to use a formula to compute the explicit Cholesky root of an AR(1) correlation matrix.
  • How to efficiently simulate multivariate normal variables with AR(1) correlation.

Generate an AR(1) correlation matrix in SAS

The AR(1) correlation structure is used in statistics to model observations that have correlated errors. (For example, see the documentation of PROC MIXED in SAS.) If Σ is an AR(1) correlation matrix, then its elements are constant along diagonals. The (i,j)th element of an AR(1) correlation matrix has the form Σ[i,j] = ρ^|i – j|, where ρ is a constant that determines the geometric rate at which correlations decay between successive time intervals. The exponent for each term is the L1 distance between the row number and the column number. As I have shown in a previous article, you can use the DISTANCE function in SAS/IML 14.3 to quickly evaluate functions that depend on the distance between two sets of points. Consequently, the following SAS/IML function computes the correlation matrix for a p x p AR(1) matrix:

proc iml;
/* return p x p matrix whose (i,j)th element is rho^|i - j| */
start AR1Corr(rho, p); 
   return rho##distance(T(1:p), T(1:p), "L1");
finish;
/* test on 10 x 10 matrix with rho = 0.8 */
rho = 0.8;  p = 10;
Sigma = AR1Corr(rho, p);
print Sigma[format=Best7.];

A formula for the Cholesky root of an AR(1) correlation matrix

Every covariance matrix has a Cholesky decomposition, which represents the matrix as the crossproduct of a triangular matrix with itself: Σ = R`*R, where R is upper triangular. In SAS/IML, you can use the ROOT function to compute the Cholesky root of an arbitrary positive definite matrix. However, the AR(1) correlation matrix has an explicit formula for the Cholesky root in terms of ρ. This explicit formula does not appear to be well known by statisticians, but it is a special case of a general formula developed by V. Madar (Section 5.1, 2016), who presented a poster at a Southeast SAS Users Group (SESUG) meeting a few years ago. An explicit formula means that you can compute the Cholesky root matrix, R, in a direct and efficient manner, as follows:

/* direct computation of Cholesky root for an AR(1) matrix */
start AR1Root(rho, p);
   R = j(p,p,0);                /* allocate p x p matrix */
   R[1,] = rho##(0:p-1);        /* formula for 1st row */
   c = sqrt(1 - rho**2);        /* scaling factor: c^2 + rho^2 = 1 */
   R2 = c * R[1,];              /* formula for 2nd row */
   do j = 2 to p;               /* shift elements in 2nd row for remaining rows */
      R[j, j:p] = R2[,1:p-j+1]; 
   end;
   return R;
finish;
R = AR1Root(rho, p);   /* compare to R = root(Sigma), which requires forming Sigma */
print R[L="Cholesky Root" format=Best7.];

You can compute an AR(1) covariance matrix from the correlation matrix by multiplying the correlation matrix by a positive scalar, σ^2.
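For example, the following sketch (with a hypothetical standard deviation σ = 2) scales the correlation matrix into a covariance matrix:

/* sketch: covariance = sigma^2 times the AR(1) correlation matrix */
sigma = 2;                         /* hypothetical standard deviation */
Cov = sigma##2 # AR1Corr(rho, p);  /* elementwise scaling */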

Efficient simulation of multivariate normal variables with AR(1) correlation

An efficient way to simulate data from a multivariate normal population with covariance Σ is to use the Cholesky decomposition to induce correlation among a set of uncorrelated normal variates. This is the technique used by the RandNormal function in SAS/IML software. Internally, the RandNormal function calls the ROOT function, which can compute the Cholesky root of an arbitrary positive definite matrix.

When there are thousands of variables, the Cholesky decomposition might take a second or more. If you call the RandNormal function thousands of times during a simulation study, you pay that one-second penalty during each call. For the AR(1) covariance structure, you can use the explicit formula for the Cholesky root to save a considerable amount of time. You also do not need to explicitly form the p x p matrix, Σ, which saves RAM. The following SAS/IML function is an efficient way to simulate N observations from a p-dimensional multivariate normal distribution that has an AR(1) correlation structure with parameter ρ:

/* simulate multivariate normal data from a population with AR(1) correlation */
start RandNormalAR1( N, Mean, rho );
    mMean = rowvec(Mean);
    p = ncol(mMean);
    U = AR1Root(rho, p);      /* use explicit formula instead of ROOT(Sigma) */
    Z = j(N,p);
    call randgen(Z,'NORMAL');
    return (mMean + Z*U);
finish;
call randseed(12345);
p = 1000;                  /* big matrix */
mean = j(1, p, 0);         /* mean of MVN distribution */
/* simulate data from MVN distribs with different values of rho */
v = do(0.01, 0.99, 0.01);  /* choose rho from list 0.01, 0.02, ..., 0.99 */
t0 = time();               /* time it! */
do i = 1 to ncol(v);
   rho = v[i];
   X = randnormalAR1(500, mean, rho);  /* simulate 500 obs from MVN with p vars */
end;
t_SimMVN = time() - t0;     /* total time to simulate all data */
print t_SimMVN;

The previous loop generates samples that contain N=500 observations and p=1000 variables. Each sample is from a multivariate normal distribution that has an AR(1) correlation, but each sample is generated for a different value of ρ, where ρ = 0.01, 0.02, ..., 0.99. On my desktop computer, this simulation of 99 correlated samples takes about 4 seconds. This is about 25% of the time for the same simulation that explicitly forms the AR(1) correlation matrix and calls RandNormal.

In summary, the AR(1) correlation matrix is an easy way to generate a symmetric positive definite matrix. You can use the DISTANCE function in SAS/IML 14.3 to create such a matrix, but for some applications you might only require the Cholesky root of the matrix. The AR(1) correlation matrix has an explicit Cholesky root that you can use to speed up simulation studies such as generating samples from a multivariate normal distribution that has an AR(1) correlation.

The post Fast simulation of multivariate normal data with an AR(1) correlation structure appeared first on The DO Loop.

July 9, 2018

Back in high school, you probably learned to find the intersection of two lines in the plane. The intersection requires solving a system of two linear equations. There are three cases: (1) the lines intersect in a unique point, (2) the lines are parallel and do not intersect, or (3) the lines are coincident. Thus, for two lines, the intersection problem has either 1, 0, or infinitely many solutions. Most students quickly learn that the lines always intersect when their slopes are different, whereas the special cases (parallel or coincident) occur when the lines have the same slope.

Recently I had to find the intersection between two line segments in the plane. Line segments have finite extent, so segments with different slopes may or may not intersect. For example, the following panel of graphs shows three pairs of line segments in the plane. In the first panel, the segments intersect. In the second panel, the segments have the same slopes as in the first panel, but these segments do not intersect. In the third panel, the segments intersect in an interval. This article shows how to construct a linear system of equations that distinguishes between the three cases and compute an intersection point, if it exists.

Parameterization of line segments

Let p1 and p2 be the endpoints of one segment and let q1 and q2 be the endpoints of the other. Recall that a parametrization of the first segment is (1-s)*p1 + s*p2, where s ∈ [0,1] and the endpoints are treated as 2-D vectors. Similarly, a parametrization of the second segment is (1-t)*q1 + t*q2, where t ∈ [0,1]. Consequently, the segments intersect if and only if there exists values for (s,t) in the unit square such that
(1-s)*p1 + s*p2 = (1-t)*q1 + t*q2
You can rearrange the terms to rewrite the equation as
(p2-p1)*s + (q1-q2)*t = q1 - p1

This is a vector equation which can be rewritten in terms of matrices and vectors. Define the 2 x 2 matrix A whose first column contains the elements of (p2-p1) and whose second column contains the elements of (q1-q2). Define b = q1 - p1 and x = (s,t). If the solution of the linear system A*x = b is in the unit square, then the segments intersect. If the solution is not in the unit square, the segments do not intersect. If the segments have the same slope, then the matrix A is singular and you need to perform additional tests to determine whether the segments intersect.
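Written out in coordinates (with subscripts x and y denoting the components of each point), the 2 x 2 system is

\[
\begin{pmatrix} p_{2x}-p_{1x} & q_{1x}-q_{2x} \\ p_{2y}-p_{1y} & q_{1y}-q_{2y} \end{pmatrix}
\begin{pmatrix} s \\ t \end{pmatrix} =
\begin{pmatrix} q_{1x}-p_{1x} \\ q_{1y}-p_{1y} \end{pmatrix}
\]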

A vector solution for the intersection of segments

As shown above, the intersection of two planar line segments is neatly expressed in terms of a matrix-vector system. In SAS, the SAS/IML language provides a natural syntax for expressing and solving matrix-vector equations. The following SAS/IML function constructs and solves a linear system. For simplicity, this version does not handle the degenerate case of two segments that have the same slope. That case is handled in the next section.

start IntersectSegsSimple(p1, p2, q1, q2);
   b = colvec(q1 - p1); 
   A = colvec(p2-p1) || colvec(q1-q2); /* nonsingular when segments have different slopes */
   x = solve(A, b);                    /* x = (s,t) */
   if all(0<=x && x<=1) then           /* if x is in [0,1] x [0,1] */
      return (1-x[1])*p1 + x[1]*p2;    /* return intersection */
   else                                /* otherwise, segments do not intersect */
      return ({. .});                  /* return missing values */
finish;
/* Test 1: intersection at (0.95, 1.25) */
p1 = {1.8 2.1};   p2 = {0.8 1.1};
q1 = {1 1.25};    q2 = {0 1.25};
z = IntersectSegsSimple(p1,p2,q1,q2);
print z;
/* Test 2: no intersection */
p1 = {-1 0.5};  p2 = {1 0.5};
q1 = {0 1};     q2 = {0 2};
v = IntersectSegsSimple(p1, p2, q1, q2);
print v;

The function contains only a few statements. The function is called to solve the examples in the first two panels of the previous graph. The SOLVE function solves the linear system (assuming that a solution exists), and the IF-THEN statement tests whether the solution is in the unit square [0,1] x [0,1]. If so, the function returns the point of intersection. If not, the function returns a pair of missing values.

The general case: Handling overlapping line segments

For many applications, the function in the previous section is sufficient because it handles the generic cases. For completeness the following module also handles segments that have identical slopes. The DET function determines whether the segments have the same slope. If so, the segments could be parallel or collinear. To determine whether collinear segments intersect, you can test for three conditions:

  • The endpoint q1 is inside the segment [p1, p2].
  • The endpoint q2 is inside the segment [p1, p2].
  • The endpoint p1 is inside the segment [q1, q2].

Notice that the condition "p2 is inside [q1,q2]" does not need to be checked separately because it is already handled by the existing checks. If any of the three conditions are true, there are infinitely many solutions (or the segments share an endpoint). If none of the conditions hold, the segments do not intersect. For overlapping segments, the following function returns an endpoint of the intersection interval.

/* handle all cases: determine intersection of two planar line segments 
   [p1, p2] and [q1, q2] */
start Intersect2DSegs(p1, p2, q1, q2);
   b = colvec(q1 - p1); 
   A = colvec(p2-p1) || colvec(q1-q2);
   if det(A)^=0 then do;         /* nonsingular system: 0 or 1 intersection */
      x = solve(A, b);           /* x = (s,t) */
      if all(0<=x && x<=1) then  /* if x is in [0,1] x [0,1] */
         return (1-x[1])*p1 + x[1]*p2;  /* return intersection */
      else                       /* segments do not intersect */
         return ({. .});         /* return missing values */
   end;
   /* segments are collinear: 0 or infinitely many intersections */
   denom = choose(p2-p1=0, ., p2-p1);  /* protect against division by 0 */
   s = (q1 - p1) / denom;        /* Is q1 in [p1, p2]? */
   if any(0<=s && s<=1) then
      return q1;
   s = (q2 - p1) / denom;        /* Is q2 in [p1, p2]? */
   if any(0<=s && s<=1) then
      return q2;
   denom = choose(q2-q1=0, ., q2-q1);  /* protect against division by 0 */
   s = (p1 - q1) / denom;        /* Is p1 in [q1, q2]? */
   if any(0<=s && s<=1) then
      return p1;
   return ({. .});               /* segments are disjoint */
finish;
/* test overlapping segments; return endpoint of one segment */
p1 = {-1 1};  p2 = {1 1};
q1 = {0 1};   q2 = {2 1};
w = Intersect2DSegs(p1, p2, q1, q2);
print w;

In summary, by using matrices, vectors, and linear algebra, you can easily solve for the intersection of two line segments or determine that the segments do not intersect. The general case needs some special logic to handle degenerate configurations, but the code that solves the generic cases is straightforward when expressed in a vectorized language such as SAS/IML.

The post The intersection of two line segments appeared first on The DO Loop.

April 18, 2018

The sweep operator performs elementary row operations on a system of linear equations. The sweep operator enables you to build regression models by "sweeping in" or "sweeping out" particular rows of the X`X matrix. As you do so, the estimates for the regression coefficients, the error sum of squares, and the generalized inverse of the system are updated simultaneously. This is true not only for a single response variable but also for multiple responses.

You might remember elementary row operations from when you learned Gaussian elimination, which is a general technique for solving linear systems of equations. Sweeping a matrix is similar to Gaussian elimination, but the sweep operator is most often used to solve least squares problems for the normal equations (X`*X)*b = X`*Y. Therefore, in practice, the sweep operator is usually applied to the rows of a symmetric positive definite system.

How to use the sweep operator to obtain least squares estimates

Before we discuss details of the sweep operator, let's show how it produces regression estimates. The following program uses the SWEEP function in SAS/IML on a six-observation data set from J. Goodnight (1979, p. 153). The first three columns of the matrix M contain the design matrix for the explanatory variables; the first column of M represents the intercept parameter. The last column of the augmented matrix is the response variable. The sweep operator is called on the uncorrected sum of squares and crossproducts matrix (the USSCP matrix), which is computed as M`*M. In block form, the USSCP matrix contains the submatrices X`X, X`Y, Y`X, and Y`Y. The following program sweeps the first three rows of the USSCP matrix:

proc iml;
/*Intercept X1  X2  Y */
M  = {  1   1   1   1,
        1   2   1   3,
        1   3   1   3,
        1   1  -1   2,
        1   2  -1   2,
        1   3  -1   1};
S = sweep(M`*M, 1:3);     /* sweep first three rows (Intercept, X1, X2) */
print S;
The sweep operator applied to all rows of a USSCP matrix to obtain regression coefficients, SSE, and generalized inverse

The resulting matrix (S) is divided into blocks by black lines. The leading 3x3 block is the (generalized) inverse of the X`X matrix. The first three elements of the last column (outlined in red) are the least squares estimates for the regression model. The lower right cell (outlined in blue) is the sum of squared errors (SSE) for the regression model.

You can compare the result with the results from PROC REG, as follows. The parameter estimates and error sum of squares are highlighted for easy comparison:

data Have;
input X1 X2 Y @@;
datalines;
1   1   1   2   1   3   3   1   3
1  -1   2   2  -1   2   3  -1   1
;
proc reg data=Have USSCP plots=none;
   model Y = X1 X2;
   ods select USSCP ANOVA ParameterEstimates;
run; quit;
Regression estimates and SSE for least squares analysis in SAS PROC REG

As claimed, the sweep operator produces the same parameter estimates and SSE statistic as PROC REG. The next sections discuss additional details of the sweep operator.

The basics of the sweep operator

Goodnight (1979) defines the sweep operator as the following sequence of row operations. Given a symmetric positive definite matrix A, SWEEP(A, k) modifies the matrix A by using the pivot element A[k,k] and the k_th row, as follows:

  1. Let D = A[k,k] be the k_th diagonal element.
  2. Divide the k_th row by D.
  3. For every other row i ≠ k, let B = A[i,k] be the i_th element of the k_th column. Subtract B x (row k) from row i. Then set A[i,k] = –B/D.
  4. Set A[k,k] = 1/D.

You could program these steps in a matrix language such as SAS/IML, but as shown previously, the SAS/IML language supports the SWEEP function as a built-in function.
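For instance, here is a direct (and unoptimized) sketch of those four steps; it is meant only to illustrate the definition, and the final statement checks it against the built-in SWEEP function:

proc iml;
/* sketch: SWEEP(A, k) implemented directly from the four steps above */
start SweepRow(A, k);
   S = A;
   D = S[k,k];                 /* 1. pivot element */
   S[k,] = S[k,] / D;          /* 2. divide the k_th row by D */
   do i = 1 to nrow(S);        /* 3. row operations on the other rows */
      if i ^= k then do;
         B = S[i,k];
         S[i,] = S[i,] - B*S[k,];
         S[i,k] = -B/D;
      end;
   end;
   S[k,k] = 1/D;               /* 4. replace the pivot by its reciprocal */
   return S;
finish;

A = {4 2, 2 3};                /* small SPD test matrix */
maxDiff = max(abs( SweepRow(A,1) - sweep(A,1) ));
print maxDiff;                 /* 0, up to rounding error */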

Sweeping in effects

The following program uses the same data as the first section. The USSCP matrix (S0) represents the normal equations (plus a little extra information). If you sweep the first row of the S0 matrix, you solve the regression problem for an intercept-only model:

proc iml;
/*Intercept X1  X2  Y */
M  = {  1   1   1   1,    1   2   1   3,    1   3   1   3,
        1   1  -1   2,    1   2  -1   2,    1   3  -1   1};
S0 = M`*M;          print S0;   /* USSCP matrix */
/* sweep in 1st row (Intercept) */
S1 = sweep(S0, 1);   print S1[F=BEST6.];
The sweep operator applied to the first row of a USSCP matrix to obtain an intercept-only model

The first element of the fourth column (S1[1,4], outlined in red) is the parameter estimate for the intercept-only model. The last element (S1[4,4], outlined in blue) is the sum of squared errors (SSE) for the intercept-only model.

If you "sweep in" the second row, you solve the least squares problem for a model that includes the intercept and X1. The corresponding elements of the last column contain the parameter estimates for the model. The lower-right element again contains the SSE for the revised model. If you proceed to "sweep in" the third row, the last column again contains the parameter estimates and the SSE, this time for the model that includes the intercept, X1, and X2:

/* sweep in 2nd row (X1) and 3rd row (X2) */
S2 = sweep(S1, 2);   print S2[F=BEST6.]; 
S3 = sweep(S2, 3);   print S3[F=BEST6.];
The sweep operator applied to subsequent rows of a USSCP matrix to 'sweep in' additional effects

Sweeping out effects

One of the useful features of the sweep operators is that you can remove an effect from a model as easily as you can add it. The sweep operator has a reversibility property: if you sweep a row a second time you "undo" the first sweep. In symbols, the operator has the property that SWEEP( SWEEP(A,k), k) = A for any row k. For example, if you want to take the X1 variable out of the model, merely sweep on the second row again:

S4 = sweep(S3, 2);   print S4[F=BEST6.]; /* model with Intercept + X2 */
The sweep operator applied twice 'sweeps out' an effect from a regression model

Of course, as shown in an earlier section, you do not need to sweep in one row at a time. You can sweep in multiple rows with one call by specifying a vector of indices. In fact, the sweep operator also commutes with itself, which means that you can specify the rows in any order. The call SWEEP(S0, 1:3) is equivalent to SWEEP(S0, {3 1 2}) or SWEEP(S0, {2 3 1}) or any other permutation of {1 2 3}.
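A quick check (using the S0 matrix defined above) confirms the commutativity:

/* sweeps commute: the order of the pivot rows does not matter */
d = max(abs( sweep(S0, 1:3) - sweep(S0, {3 1 2}) ));
print d;    /* 0, up to rounding error */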


The SWEEP operator (which is implemented in the SWEEP function in SAS/IML) enables you to construct least squares model estimates from the USSCP matrix, which is a block matrix that contains the normal equations. You can "sweep in" rows to add effects to a model or "sweep out" rows to remove effects. After each operation, the parameter estimates appear in the portion of the adjusted USSCP matrix that corresponds to the block for X`Y. Furthermore, the residual sum of squares appears in the portion of the adjusted USSCP matrix that corresponds to Y`Y.

Further reading

The sweep operator is known by different names in different fields and has been rediscovered several times. In statistics, the operator was popularized in a TAS article by J. Goodnight (1979) who mentions Ralston (1960) and Beaton (1964) as early researchers. Outside of statistics, the operator is known as the Principal Pivot Transform (Tucker, 1960), gyration (Duffy, Hazony, and Morrison, 1966), or exchange (Stewart and Stewart, 1998). Tsatsomeros (2000) provides an excellent review of the literature and the history of the sweep operator.

The post The sweep operator: A fundamental operation in regression appeared first on The DO Loop.

April 11, 2018

Sometimes it is important to ensure that a matrix has unique rows. When the data are all numeric, there is an easy way to detect (and delete!) duplicate rows in a matrix.

The main idea is to subtract one row from another. Start with the first row and subtract it from every row beneath it. If any row of the difference matrix is identically zero, then you have found a row that is identical to the first row. Then do the same thing for the second row: subtract it from the third and higher rows and see if you obtain a row of zeros. Repeat this process for the third and higher rows.

The following SAS/IML program implements this algorithm for an arbitrary numeric matrix. It returns a binary indicator variable (a column vector). The i_th row of the returned vector is 1 for rows that are duplicates of some previous row:

proc iml;
/* return binary vector that indicates which rows in a 
   numerical matrix are duplicates */
start DupRows(A);
   N = nrow(A);
   dupRows = j(N, 1, 0);         /* binary indicator matrix */
   do i = 1 to N-1;
      if dupRows[i] = 0 then do; /* skip rows that are known to be duplicates */
         r = i+1:N;              /* remaining rows */ 
         M = A[r, ]-A[i, ];      /* subtract current row from remaining rows */
         b = M[ ,##];            /* sum of squares = 0 iff duplicate row */
         idx = loc( b=0 );       /* any duplicate rows for current row? */
         if ncol(idx) > 0 then
            dupRows[r[idx]] = 1; /* if so, flag them */
      end;
   end;
   return dupRows;
finish;

To test the function, consider the following 6 x 3 matrix. You can see by inspection that the matrix has three duplicate rows: the third, fifth, and sixth rows. You can call the DupRows function and print the matrix adjacent to the binary vector that indicates the duplicate rows:

A = {1 1 1,
     1 2 3,
     1 1 1,  /* duplicate row */
     3 2 1,
     3 2 1,  /* duplicate row */
     1 1 1};  /* duplicate row */
DupRows = DupRows(A);
print A DupRows;

You can use the DupRows function to write a function that excludes the duplicate rows in a matrix and returns only the unique rows, as follows:

/* return the unique rows in a numerical matrix */
start UniqueRows(A);
   uniqRows = loc(DupRows(A)=0);
   return(A[uniqRows, ]);   /* return rows that are NOT duplicates */
finish;
U = UniqueRows(A);
print U;

I like this algorithm because it uses subtraction, which is a very fast operation. However, this algorithm has a complexity of O(n(n-1)/2), where n is the number of rows in the matrix, so it is best for small to moderate values of n.

For long and skinny matrices (or for character data), it might be more efficient to sort the matrix as I showed in a previous article about how to find (and count) the unique rows of a matrix. However, the sorting algorithm requires that you sort the data by ALL columns, which can be inefficient for very wide and short data tables. For small and medium-sized data, you can use either method to find the unique rows in a matrix.

This article discusses how to remove duplicates records for numerical matrices in SAS/IML. In Base SAS, you can use PROC SORT or PROC SQL to remove duplicate records from a SAS DATA set, as shown in Lafler (2017) and other references.

The post Find the unique rows of a numeric matrix appeared first on The DO Loop.

March 21, 2018
Pseudocode for the conjugate gradient method for solving a linear system (from Wikipedia)

I often claim that the "natural syntax" of the SAS/IML language makes it easy to implement an algorithm or statistical formula as it appears in a textbook or journal. The other day I had an opportunity to test the truth of that statement. A SAS programmer wanted to implement the conjugate gradient algorithm, which is an iterative method for solving a system of equations with certain properties. I looked up the Wikipedia article about the conjugate gradient method and saw the following text:

"The algorithm is detailed below for solving Ax = b where A is a real, symmetric, positive-definite matrix. The input vector x0 can be an approximate initial solution or 0." The text was followed by pseudocode (in the box; click to enlarge)

The conjugate gradient method in SAS/IML

I used the pseudocode to implement the conjugate gradient method in SAS/IML. (The method is explained further in the next section.) I chose not to use the 'k' and 'k+1' notation but merely to overwrite the old values of variables with new values. In the program, the variables x, r, and p are vectors and the variable A is a matrix.

/* Linear conjugate gradient method as presented in Wikipedia:
   Solve the linear system A*x = b, where A is a symmetric positive definite matrix.
   The algorithm converges in at most n iterations, where n is the dimension of A.
   This function requires an initial guess, x0, which can be the zero vector.
   The function does not verify that the matrix A is symmetric positive definite.
   This module returns a matrix mX= x1 || x2 || ... || xn 
   whose columns contain the iterative path from x0 to the approximate solution vector xn.  */
proc iml;
start ConjGrad( x0, A, b, Tolerance=1e-6 );
  x = x0;                            /* initial guess */
  r = b - A*x0;                      /* residual */
  p = r;                             /* initial direction */
  mX = j(nrow(A), nrow(A), .);       /* optional: each column is result of an iteration */
  do k = 1 to ncol(mX) until ( done );
    rpr = r`*r;  Ap = A*p;           /* store partial results */
    alpha =  rpr / (p`*Ap);          /* step size */
    x = x + alpha*p;   mX[,k] = x;   /* remember new guess */
    r = r - alpha*Ap;                /* new residual */
    done = (sqrt(rpr) < Tolerance);  /* stop if ||r|| < Tol */
    beta = (r`*r) / rpr;             /* new stepsize */
    p = r + beta*p;                  /* new direction */
  end;
  return ( mX );     /* I want to return the entire iteration history, not just x */
finish;

The SAS/IML program is easy to read and looks remarkably similar to the pseudocode in the Wikipedia article. This is in contrast to lower-level languages such as C in which the implementation looks markedly different from the pseudocode.

Convergence of the conjugate gradient method

The conjugate gradient method is an iterative method to find the solution of a linear system A*x=b, where A is a symmetric positive definite n x n matrix, b is a vector, and x is the unknown solution vector. (Recall that symmetric positive definite matrices arise naturally in statistics as the crossproduct matrix (or covariance matrix) of a set of variables.) The beauty of the conjugate gradient method is twofold: it is guaranteed to find the solution (in exact arithmetic) in at most n iterations, and it requires only simple operations, namely matrix-vector multiplications and vector additions. This makes it ideal for solving large sparse systems because you can implement the algorithm without explicitly forming the coefficient matrix.

It is fun to look at how the algorithm converges from the initial guess to the final solution. The following example converges gradually, but I know of examples for which the algorithm seems to make little progress for the first n – 1 iterations, only to make a huge jump on the final iteration straight to the solution!

Recall that you can use a Toeplitz matrix to construct a symmetric positive definite matrix. The following statements define a banded Toeplitz matrix with 5 on the diagonal and specify the right-hand side of the system. The zero vector is used as an initial guess for the algorithm. The call to the ConjGrad function returns a matrix whose columns contain the iteration history for the method. For this problem, the method requires five iterations to converge, so the fifth column contains the solution vector. You can check that the solution to this system is (x1, x2, x3, x4, x5) = (-0.75, 0, 1.5, 0.5, -0.75), either by performing matrix multiplication or by using the SOLVE function in IML to compute the solution vector.

A = {5 4 3 2 1,          /* SPD Toeplitz matrix */
     4 5 4 3 2, 
     3 4 5 4 3,
     2 3 4 5 4,
     1 2 3 4 5};
b = {1, 3, 5, 4, 2};     /* right hand side */
n = ncol(A);
x0 = j(n,1,0);           /* the zero vector */
traj = ConjGrad( x0, A, b );
x = traj[ ,n];           /* for this problem, solution is in last column */
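As a sanity check, you can compare the final CG iterate with the direct solution from the SOLVE function (mentioned above); the difference should be tiny:

/* sanity check: compare the CG solution to the direct solution */
xDirect = solve(A, b);
diff = max(abs(x - xDirect));
print diff;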

It is instructive to view how the iteration progresses from an initial guess to the final solution. One way to view the iterations is to compute the Euclidean distance between each partial solution and the final solution. You can then graph the distance between each iteration and the final solution, which decreases monotonically (Hestenes and Stiefel, 1952).

Distance = sqrt( (traj - x)[##, ] ); /* || x[j] - x_Soln || */
Iteration = 1:n;
title "Convergence of Conjugate Gradient Method";
call series(Iteration, Distance) grid={x y} xValues=Iteration 
     option="markers" label={"" "Distance to Solution"};

Notice in the distance graph that the fourth iteration almost equals the final solution. You can try different initial guesses to see how the guess affects the convergence.

In addition to the global convergence, you can visualize the convergence in each coordinate. Because the vectors live in high-dimensional space, it is impossible to draw a scatter plot of the iterative solutions. However, you can visualize each vector in "parallel coordinates" as a sequence of line segments that connect the coordinates in each variable. In the following graph, each "series plot" represents a partial solution. The curves are labeled by the iteration number. The blue horizontal line represents the initial guess (iteration 0). The partial solution after the first iteration is shown in red and so on until the final solution, which is displayed in black. You can see that the third iteration (for this example) is close to the final solution. The fourth partial solution is so close that it cannot be visually distinguished from the final solution.

Convergence of the conjugate gradient method

In summary, this post shows that the natural syntax of the SAS/IML language makes it easy to translate pseudocode into a working program. The article focuses on the conjugate gradient method, which solves a symmetric, positive definite, linear system in at most n iterations. The article shows two ways to visualize the convergence of the iterative method to the solution.

The post The conjugate gradient method appeared first on The DO Loop.

March 5, 2018

Many people know that a surface can contain a saddle point, but did you know that you can define the saddle point of a matrix? Saddle points in matrices are somewhat rare, which means that if you choose a random matrix you are unlikely to choose one that has a saddle point. This article shows how to

  • Locate a saddle point in a matrix
  • Use simulation to estimate the probability of finding a saddle point in a random matrix.
  • Use simulation to explore the distribution of the location of a saddle point in a random matrix.

A saddle point of a matrix

You might remember from multivariable calculus that a critical point (x0, y0) is a saddle point of a function f if it is a local minimum of the surface in one direction and a local maximum in another direction. The canonical example is f(x,y) = x^2 – y^2 at the point (x0, y0) = (0, 0). Along the horizontal line y=0, the point (0, 0) is a local minimum. Along the vertical line x=0, the point (0, 0) is a local maximum.

The definition of a saddle point of a matrix is similar. For an n x p matrix M, the cell (i, j) is a saddle point of the matrix if M[i, j] is simultaneously the minimum value of the i_th row and the maximum value of the j_th column.

Consider the following 3 x 3 matrices. The minimum value for each row is highlighted in blue. The maximum value for each column is highlighted in red. Two of the matrices have saddle points, which are highlighted in purple.

  • For the matrix M1, the (3, 1) cell is a saddle point because 7 is the smallest value in row 3 and the largest value in column 1.
  • For the matrix M2, the (2, 2) cell is a saddle point because 5 is the smallest value in row 2 and the largest value in column 2.
  • The matrix M3 does not have a saddle point. No cell is simultaneously the smallest value in its row and the largest value in its column.
The saddle point of a matrix

Compute the location of a saddle point in a matrix

In a matrix language such as SAS/IML, you can use row and column operators to find the location of a saddle point, if it exists. The SAS/IML language supports subscript reduction operators, which enable you to compute certain statistics for all rows or all columns without writing a loop. For this problem, you can use the "index of the minimum operator" (>:<) to find the elements that are the minimum for each row and the "index of the maximum operator" (<:>) to find the elements that are the maximum for each column. To find whether there is an element in common, you can use the XSECT function to find the intersection of the two sets. (SAS/IML supports several functions that perform set operations.)

A function that computes the location of the saddle point follows. It uses the SUB2NDX function to convert subscripts to indices.

proc iml;
start LocSaddlePt(M);
   dim = dimension(M);  n = dim[1];  p = dim[2];
   minRow = M[ ,>:<];                    /* for each row, find column that contains min */
   minSubscripts = T(1:n) || minRow;     /* location as (i,j) pair */
   minIdx = sub2ndx(dim, minSubscripts); /* convert location to index in [1, np] */
   maxCol = T(M[<:>, ]);                 /* for each column, find row that contains max */
   maxSubscripts = maxCol || T(1:p);     /* location as (i,j) pair */
   maxIdx = sub2ndx(dim, maxSubscripts); /* convert location to index in [1, np] */
   xsect = xsect(minIdx, maxIdx);        /* intersection; might be empty matrix */
   if ncol(xsect) > 0 then
      return ( xsect );  
   else
      return ( . );
finish;
M1 = {1 2 3, 4 5 6, 7 8 9};  idx1 = LocSaddlePt(M1);
M2 = {9 1 2, 8 5 7, 3 4 6};  idx2 = LocSaddlePt(M2);
M3 = {8 1 9, 7 2 6, 3 4 5};  idx3 = LocSaddlePt(M3);
print idx1 idx2 idx3;
Find the location of the saddle point of a matrix

The indices of the saddle points are shown for the three example matrices. The 7th element of the M1 matrix is a saddle point, which corresponds to the (3, 1) cell for a 3 x 3 matrix in row-major order. The 5th element (cell (2,2)) of the M2 matrix is a saddle point. The M3 matrix does not contain a saddle point.

The probability of a saddle point in a 3x3 random matrix

A matrix contains either zero or one saddle point. There is a theorem (Ord, 1979) that says that the probability of finding a saddle point in a random n x p matrix is (n! p!) / (n+p-1)! when the elements of the matrix are randomly chosen from any continuous probability distribution.

Although the theorem holds for any continuous probability distribution, it suffices to prove it for the uniform distribution. Notice that the existence and location of a saddle point is invariant under any monotonic transformation because those transformations do not change the relative order of elements. In particular, if the elements are drawn from a distribution that has CDF F, the existence and location of the saddle point is invariant under the monotonic transformation F, which transforms the numbers into a sample from the uniform distribution. In fact, you can replace the numbers by their ranks, which converts the random matrix into a matrix that contains the consecutive integers 1, 2, ..., np. Thus the previous integer matrices represent the ranks of a random matrix whose elements are drawn from any continuous distribution.
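If you want to convince yourself numerically, the following sketch uses the RANK function to replace the elements of a random matrix by their ranks and confirms that the saddle-point location (or its absence) is unchanged:

/* sketch: the saddle-point location is invariant under ranking */
call randseed(1);
M = j(3, 3);
call randgen(M, "Uniform");
R = rank(M);                  /* replace the values by their ranks 1-9 */
print (LocSaddlePt(M))[L="Loc(M)"] (LocSaddlePt(R))[L="Loc(Ranks)"];  /* identical results */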

For small values of n and p, you can use the ALLPERM function to enumerate all possible matrices that have integers in the range [1, np]. You can reshape the permutation into an n x p matrix and compute whether the matrix contains a saddle point. If you use a 0/1 indicator array to encode the result, then the exact probability of finding a saddle point in a random n x p matrix is found by computing the mean of the indicator array. The following example computes the exact probability of finding a saddle point in a random 3 x 3 matrix:

A = allperm(9);            /* all permutations of 1:9 */
saddle = j(nrow(A), 1);    /* allocate vector for results */
do i = 1 to nrow(A);       /* for each permutation (row) ... */
   M = shape(A[i,], 3);    /*    reshape row into 3x3 matrix */
   idx = LocSaddlePt( M ); /*    find location of saddle */
   saddle[i] = (idx > 0);  /*    0/1 indicator variable */
end;
prob = saddle[:];          /* (sum of 1s) / (number of matrices) */
print prob;
The probability of a saddle point in a 3 x 3 matrix

As you can see, the computational enumeration shows that the probability of a saddle point is 0.3 for 3 x 3 matrices. This agrees with the theoretical formula (3! 3!) / 5! = 36 / 120 = 3/10.

The probability of a saddle point in a random matrix

Of course, when n and p are larger, it becomes infeasible to enumerate all n x p matrices completely. However, you can adopt a Monte Carlo approach: generate a large number of random matrices and compute the proportion that contain a saddle point. The following program simulates one million random 4 x 4 matrices from the uniform distribution and computes the proportion that have a saddle point. This proportion is a Monte Carlo estimate of the true probability.

NSim = 1e6;
n = 4; p=4;                  /* random 4x4 matrices from uniform distribution */
A = j(NSim, n*p);
call randgen(A, "Uniform");
saddleLoc = j(NSim, 1, .);   /* store location of the saddle point */
saddle = j(NSim, 1);         /* binary indicator variable */
do i = 1 to NSim;            /* Monte Carlo estimation of probability */
   M = shape(A[i,], n, p);
   saddleLoc[i] = LocSaddlePt( M );  /* save the location 1-16 */
   saddle[i] = (saddleLoc[i] > 0 );  /* 0/1 binary variable */
end;
estProb = saddle[:];
Prob = fact(n)*fact(p) / fact(n+p-1);
print estProb Prob;
The probability of a saddle point of a 4 x 4 matrix

The output shows that the estimated probability agrees with the exact probability to four decimal places.

The distribution of locations of saddle points

The saddle point can occur in any cell in the matrix. Consider, for example, the integer matrix M1 from the earlier 3 x 3 example. If you swap the first and third rows of M1, you obtain a new matrix that has a saddle point in the (1,1) cell. If you swap the second and third rows of M1, you obtain a saddle point in the (2,1) cell. Similarly, you can swap columns to move a saddle point to a different column. In general, you can move the saddle point to any cell in the matrix by a sequence of row and column swaps. Thus each cell in the matrix has an equal probability of containing a saddle point, and the probability that the saddle point is in cell (i,j) is 1/(np) times the probability that a saddle point exists.
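The following statements (which again assume that the LocSaddlePt module is defined in the session) demonstrate the first of these swaps:

/* swap the first and third rows of M1: the saddle point moves to cell (1,1) */
M1 = {1 2 3, 4 5 6, 7 8 9};
M  = M1[{3 2 1}, ];      /* reverse the row order */
idx = LocSaddlePt(M);    /* returns 1, which is the (1,1) cell */
print idx;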

In the preceding SAS/IML simulation, the saddleLoc variable contains the location of the saddle point in the 4 x 4 matrix. The following statements compute the empirical proportion of times that a saddle point appeared in each of the 16 cells of the matrix:

call tabulate(location, freq, saddleLoc);  /* frequencies for cells 1-16; missing values (no saddle point) are ignored */
counts = shape(freq, n, p);                /* shape into matrix     */
EstProbLoc = counts / (NSim);              /* empirical proportions */
print EstProbLoc[r=('1':'4') c=('1':'4') F=7.5];
The probabilities that a saddle point of a matrix is in the (i,j)th position

The output indicates that the estimated probability for each cell is approximately 0.007 (the "James Bond" probability!). Theoretically, you can prove that the probability for each cell is given by the value of the complete beta function B(n, p), which for n=p=4 yields B(4,4) = 0.0071429 (Hofri, 2006).
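To see the connection, note that B(n, p) = (n-1)! (p-1)! / (n+p-1)!, which is exactly the total probability n! p! / (n+p-1)! divided by np. You can confirm the values numerically by using the Base SAS BETA function, as in the following quick check:

/* per-cell probability is B(4,4); total probability is 16*B(4,4) */
b = beta(4, 4);      /* = 3!*3!/7! = 0.0071429 */
total = 16 * b;      /* = 4!*4!/7! = 0.1142857 */
print b total;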

In summary, this article has explored three topics that are dear to my heart: probability, simulation, and properties of matrices. Along the way, we encountered factorials, permutations, and the complete beta function, and we got to write some fun SAS/IML programs. For more information about the probability of a saddle point in random matrices, see "The probability that a matrix has a saddle point" (Thorp, 1979) and "On the distribution of a saddle point value in a random matrix" (Hofri, 2006).

The post The probability of a saddle point in a matrix appeared first on The DO Loop.

December 11, 2017
Self-similar Christmas tree created in SAS

Happy holidays to all my readers! My greeting card to you is an image of a self-similar Christmas tree. The image was created in SAS by using two features that I blog about regularly: matrix computations and ODS statistical graphics.

Self-similarity in Kronecker products

I have previously shown that the repeated Kronecker product of a binary matrix leads to self-similar structures.

Specifically, if M is a binary matrix then the Kronecker product M@M is a block matrix in which each 1 in the original matrix is replaced with a copy of M and each 0 is replaced by a zero block. (Here '@' is the Kronecker product (or direct product) operator, which is implemented in SAS/IML software.) If you repeat this process three or more times and create a heat map of the product matrix, the self-similar structure is apparent.
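If you have not seen this construction before, the following minimal sketch (run inside a PROC IML session) shows the block structure for a tiny 2 x 2 binary matrix:

/* each 1 in M0 is replaced by a copy of M0; each 0 becomes a 2 x 2 zero block */
M0 = {1 0,
      1 1};
K = M0 @ M0;   /* a 4 x 4 block matrix */
print K;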

For example, the following 5 x 7 binary matrix has a 1 in positions that approximate the shape of a Christmas tree. The direct product M@M@M is a matrix that has 5^3 = 125 rows and 7^3 = 343 columns. If you use the HEATMAPDISC subroutine in SAS/IML to create a heat map of the direct product, you obtain the following image in which a green cell represents a 1:

proc iml;
M = {0 0 0 1 0 0 0,
     0 0 1 1 1 0 0,
     0 1 1 1 1 1 0,
     1 1 1 1 1 1 1,
     0 0 0 1 0 0 0};
Tree = M @ M @ M;     /* M is 5 x 7, so Tree is 125 x 343 */
ods graphics / width=300px height=450px  ANTIALIASMAX=50000;
call heatmapdisc(Tree) title="Happy Holidays to All my Readers!"
     colorramp={White Green} displayoutlines=0 ShowLegend=0;

That's not a bad result for three SAS/IML statements! You can see that the "tree" has a self-similar structure: the large tree is composed of 17 smaller trees, each of which is composed of 17 mini trees.
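You can verify the count by tallying the 1s in M (assuming M is still defined in the session):

numOnes = ncol(loc(M = 1));   /* count the 1s in M */
print numOnes;                /* 17: each 1 spawns a copy of the tree */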

Adding a star

My readers are worth extra effort, so I want to put a star on the top of the tree. One way to do this is to use a scatter plot and plot a special observation by using a star symbol. To plot the "tree" by using a scatter plot, it is necessary to "unpack" the 125 x 343 matrix into a list of (x, y) values (this is a wide-to-long conversion of the matrix). You can use the NDX2SUB function to extract the row and column of each 1 in the matrix, as follows:

/* extract (row, column) pairs for matrix elements that are '1' */
idx = loc(Tree = 1);                /* indices for the 1s */
s = ndx2sub(dimension(Tree), idx);  /* convert indices to subscripts */
create Tree from s[c={"RNames" "CNames"}];
append from s;
close Tree;                           /* close the data set so later steps can read it */
call symputx("XPos", range(s[,2])/2); /* half the range of the columns = horiz position of star */

The previous statements create a SAS data set called TREE that contains the (x, y) positions of the 1s in the direct product matrix. The statements also save a horizontal position for the star (half the range of the column values) to a macro variable. The following SAS procedures create the location of the star, concatenate the two data sets, and create a scatter plot of the result, which is shown at the top of this article.

data star;                 /* the star at the top of the tree */
x = &XPos; y = 0;
run;
data Tree2;                /* concatenate the tree and the star */
set Tree star;
run;
ods graphics / width=600px height=800px  ANTIALIASMAX=50000;
title "Happy Holidays";
proc sgplot data=Tree2 noborder noautolegend;
scatter x=CNames y=RNames / markerattrs=(color=ForestGreen symbol=SquareFilled size=3);
scatter x=x y=y / markerattrs=(symbol=StarFilled size=20)             /* yellow star */
                  filledoutlinedmarkers markerfillattrs=(color=yellow);
yaxis reverse display=none;   /* reverse the axis so that row 1 is at the top */
xaxis display=none;
run;

Previous SAS-created Christmas cards

All year long I blog about how to use SAS for serious pursuits like statistical modeling, data analysis, optimization, and simulation. Consequently, I enjoy my occasional forays into fun and frivolous topics such as how to create a greeting card with SAS. If you like seeing geeky SAS-generated images, you can browse my efforts from previous years.

Wherever you are and whatever holiday you celebrate, may you enjoy peace and happiness this season and in the New Year.

The post A self-similar Christmas tree appeared first on The DO Loop.