Matrix Computations

July 14, 2021
 

In a previous article, I discussed various ways to solve a least-squares linear regression model: the SWEEP operator (used by many SAS regression routines), the LU-based methods (the SOLVE and INV functions in SAS/IML), and the QR decomposition (the QR call in SAS/IML). Each method computes the estimates for the regression coefficients, b, by using the normal equations (X`X) b = X`y, where X is a design matrix for the data.

This article describes a QR-based method that does not use the normal equations but works directly with the overdetermined system X b = y. It then compares the performance of the direct QR method to the computational methods that use the normal equations.

The QR solution of an overdetermined system

As shown in the previous article, you can use the QR algorithm to solve the normal equations. However, if you search the internet for "QR algorithm and least squares," you find many articles that show how you can use the QR decomposition to directly solve the overdetermined system X b = y. How does the direct QR method compare to the methods that use the normal equations?

Recall that X is an n x m design matrix, where n > m and X is assumed to have full rank m. For simplicity, I will ignore column pivoting. If you decompose X = QRL, the orthogonal matrix Q is n x n, but the matrix RL is not square. ("L" stands for "long.") However, RL is the vertical concatenation of a square triangular matrix and a rectangular matrix of zeros:
\({\bf R_L} = \begin{bmatrix} {\bf R} \\ {\bf 0} \end{bmatrix}\)
If you let Q1 be the first m columns of Q and let Q2 be the remaining (n-m) columns, you get a partitioned matrix equation:
\(\begin{bmatrix} {\bf Q_1} & {\bf Q_2} \end{bmatrix} \begin{bmatrix} {\bf R} \\ {\bf 0} \end{bmatrix} {\bf b} = {\bf y}\)
If you multiply both sides by Q` (the inverse of the orthogonal matrix, Q), you find out that the important matrix equation to solve is \({\bf R b} = {\bf Q_1^{\prime} y}\). The vector \({\bf Q_1^{\prime} y}\) is the first m rows of the vector \({\bf Q^{\prime} y}\). The QR call in SAS/IML enables you to obtain the triangular R matrix and the vector Q` y directly from the data matrix and the observed vector. The following program uses the same design matrix as for my previous article. Assuming X has rank m, the call to the QR subroutine returns the m x m triangular matrix, R, and the vector Q` y. You can then extract the first m rows of that vector and solve the triangular system, as follows:

/* Use PROC GLMSELECT to write a design matrix */
proc glmselect data=Sashelp.Class outdesign=DesignMat;
   class Sex;
   model Weight = Height Sex Height*Sex/ selection=none;
run;
 
proc iml;
use DesignMat;
   read all var {'Intercept' 'Height' 'Sex_F' 'Height_Sex_F'} into X;
   read all var {'Weight'} into Y;
close;
 
/* The QR algorithm can work directly with the design matrix and the observed responses. */
call QR(Qty, R, piv, lindep, X, , y);   /* return Q`*y and R (and piv) */
m = ncol(X);
c = QTy[1:m];                           /* we only need the first m rows of Q`*y */
b = trisolv(1, R, c, piv);              /* solve triangular system */
print b[L="Direct QR" F=D10.4];

This is the same least-squares solution that was found by using the normal equations in my previous article.

Compare the performance of least-squares solutions

How does this direct method compare with the methods that use the normal equations? You can download a program that creates simulated data and runs each algorithm to estimate the least-squares regression coefficients. The simulated data has 100,000 observations; the number of variables is chosen to be m={10, 25, 50, 75, 100, 250, 500}. The program uses SAS/IML 15.1 on a desktop PC to time the algorithms. The results are shown below:

The most obvious feature of the graph is that the "Direct QR" method that is described in this article is not as fast as the methods that use the normal equations. For 100 variables and 100,000 observations, the "Direct QR" call takes more than 12 seconds on my PC. (It's faster on a Linux server). The graph shows that the direct method shown in this article is not competitive with the normal-equation-based algorithms when using the linear algebra routines in SAS/IML 15.1.

The graph shows that the algorithms that use the normal equations are relatively faster. For the SAS/IML calls on my PC, you can compute the regression estimates for 500 variables in about 2.6 seconds. The graph has a separate line for the time required to form the normal equations (which you can think of as forming the X`X matrix). Most of the time is spent computing the normal equations; only a fraction of the time is spent actually solving the normal equations. The following table shows computations on my PC for the case of 500 variables:

The table shows that it takes about 2.6 seconds to compute the X`X matrix and the vector X`y. After you form the normal equations, solving them is very fast. For this example, the SOLVE and INV methods take only a few milliseconds to solve a 500 x 500 linear system. The QR algorithms take 0.1–0.2 seconds longer. So, for this example, forming the normal equations accounts for more than 90% of the total time.
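
To see this breakdown for yourself, you can time the two stages separately. The following sketch is illustrative; it assumes that the simulated design matrix X and response vector y (as in the downloadable program) are already defined, and the variable names are not part of the original program:

/* Sketch: time forming the normal equations versus solving them.
   Assumes X (about 100,000 x 500) and y are already defined. */
t0 = time();
XpX = X`*X;                      /* form the crossproduct matrix */
Xpy = X`*y;
tForm = time() - t0;             /* time to form the normal equations */
 
t0 = time();
b = solve(XpX, Xpy);             /* solve the normal equations */
tSolve = time() - t0;            /* time to solve them */
print tForm tSolve;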

Another article compares the performance of the SOLVE and INV routines in SAS/IML.

SAS regression procedures

These results are not the best that SAS can do. SAS/IML is a general-purpose tool. SAS regression procedures like PROC REG are optimized to compute regression estimates even faster. They also use the SWEEP operator, which is faster than the SOLVE function. For more than 20 years, SAS regression procedures have used multithreaded computations to optimize the performance of regression computations (Cohen, 2002). More recently, SAS Viya added the capability for parallel processing, which can speed up the computations even more. And, of course, they compute much more than only the coefficient estimates! They also compute standard errors, p-values, related statistics (MSE, R square, ...), diagnostic plots, and more.

Summary

This article compares several methods for obtaining least-squares regression estimates. It uses simulated data where the number of observations is much greater than the number of variables. It shows that methods that use the normal equations are faster than a "Direct QR" method, which does not use the normal equations. When you use the normal equations, most of the time is spent actually forming the normal equations. After you have done that, the time required to solve the system is relatively fast.

You can download the SAS program that computes the tables and graphs in this article.

The post Compare computational methods for least squares regression appeared first on The DO Loop.

July 12, 2021
 

In computational statistics, there are often several ways to solve the same problem. For example, there are many ways to solve for the least-squares solution of a linear regression model. A SAS programmer recently mentioned that some open-source software uses the QR algorithm to solve least-squares regression problems and asked how that compares with SAS. This article shows how to estimate a least-squares regression model by using the QR method in SAS. It shows how to use the QR method in an efficient way.

Solving the least-squares problem

Before discussing the QR method, let's briefly review other ways to construct a least-squares solution to a regression problem. In a regression problem, you have an n x m data matrix, X, and an n x 1 observed vector of responses, y. When n > m, this is an overdetermined system and typically there is no exact solution. Consequently, the goal is to find the m x 1 vector of regression coefficients, b, such that the predicted values (X b) are as close as possible to the observed values. If b is a least-squares solution, then b minimizes the squared vector norm || X b - y ||², where ||v||² is the sum of the squares of the components of v.

Using calculus, you can show that the solution vector, b, satisfies the normal equations (X`X) b = X`y. The normal equations have a unique solution when the crossproduct matrix X`X is nonsingular. Most numerical algorithms for least-squares regression start with the normal equations, which have nice numerical properties that can be exploited.
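
For reference, the calculus step is short: set the gradient of the squared-error objective to zero, which yields the normal equations:
\( \nabla_{\bf b} \, \| {\bf X b} - {\bf y} \|^2 = 2 {\bf X}^{\prime}({\bf X b} - {\bf y}) = {\bf 0} \quad\Longrightarrow\quad ({\bf X}^{\prime}{\bf X}) {\bf b} = {\bf X}^{\prime}{\bf y} \)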

Creating a design matrix

The first step of solving a regression problem is to create the design matrix. For continuous explanatory variables, this is easy: You merely append a column of ones (the intercept column) to the matrix of the explanatory variables. For more complicated regression models that contain manufactured effects such as classification effects, interaction effects, or spline effects, you need to create a design matrix. The GLMSELECT procedure is the best way to create a design matrix for fixed effects in SAS. The following call to PROC GLMSELECT writes the design matrix to the DesignMat data set. The call to PROC REG estimates the regression coefficients:

proc glmselect data=Sashelp.Class outdesign=DesignMat;  /* write design matrix */
   class Sex;
   model Weight = Height Sex Height*Sex/ selection=none;
run;
/* test: use the design matrix to estimate the regression coefficients */
proc reg data=DesignMat plots=none;
   model Weight = Height Sex_F Height_Sex_F;
run;

The goal of the rest of this article is to reproduce the regression estimates by using various other linear algebra operations. For each method, we want to produce the same estimates: {-141.10, 3.91, 23.73, -0.49}.

Linear regression: A linear algebraic approach

I've previously discussed the fact that most SAS regression procedures use the sweep operator to construct least-squares solutions. Another alternative is to use the SOLVE function in SAS/IML:

proc iml;
use DesignMat;
   read all var {'Intercept' 'Height' 'Sex_F' 'Height_Sex_F'} into X;
   read all var {'Weight'} into Y;
close;
 
/* form normal equations */
XpX = X`*X;
Xpy = X`*y;
 
/* an efficient numerical solution: solve a particular RHS */
b = solve(XpX, Xpy);
print b[L="SOLVE" F=D10.4];

The SOLVE function is very efficient and gives the same parameter estimates as the SWEEP operator (which was used by PROC REG). Using the SOLVE function on the system A*b=z is mathematically equivalent to using the matrix inverse to find b = A⁻¹z. However, the INV function explicitly forms the inverse matrix, whereas the SOLVE function does not. Consequently, the SOLVE function is faster and more efficient than using the following SAS/IML statements:

/* a less efficient numerical solution */
Ainv = inv(XpX);       /* explicitly form the (m x m) inverse matrix */
b = Ainv*Xpy;          /* apply the inverse matrix to RHS */
print b[L="INV" F=D10.4];

For a performance comparison of the SOLVE and INV functions, see the article, "Solving linear systems: Which technique is fastest?"

An introduction to the QR method for solving linear equations

There are many ways to solve the normal equations. The SOLVE (and INV) functions use the LU decomposition. An alternative is the QR algorithm, which is slower but can be more accurate for ill-conditioned systems. You can apply the QR decomposition to the normal equations or to the original design matrix. This section discusses the QR decomposition of the design matrix. A subsequent article discusses decomposing the data matrix directly.

Let's see how the QR algorithm solves the normal equations. You can decompose the crossproduct matrix as the product of an orthogonal matrix, Q, and an upper-triangular matrix, R. If A = X`X, then A = QR. (I am ignoring column pivoting, which is briefly discussed below.) The beauty of an orthogonal matrix is that the transpose equals the inverse. This means that you can reduce the system A b = (X` y) to a triangular system R b = Q` (X` y), which is easily solved by using the TRISOLV function in SAS/IML.

The SAS/IML version of the QR algorithm is a version known as "QR with column pivoting." Column pivoting improves the ability to handle rank-deficient systems and provides better numerical accuracy. However, when the algorithm uses column pivoting, you are actually solving the system AP = QR, where P is a permutation matrix. This potentially adds some complexity to dealing with the QR algorithm. Fortunately, the TRISOLV function supports column pivoting. If you use the TRISOLV function, you do not need to worry about whether pivoting occurred or not.

Using the QR method in SAS/IML

Recall from the earlier example that it is more efficient to use the SOLVE function than the INV function. The reason is that the INV function explicitly constructs an m x m matrix, which then is multiplied with the right-hand side (RHS) vector to obtain the answer. In contrast, the SOLVE function never forms the inverse matrix. Instead, it directly applies transformations to the RHS vector.

The QR call in SAS/IML works similarly. If you do not supply an RHS vector, v, then the QR call returns the full m x m matrix, Q. You then must multiply Q` v yourself. If you do supply the RHS vector, then the QR call returns Q` v without ever forming Q. This second method is more efficient.

See the documentation of the QR call for the complete syntax. The following call uses the inefficient method in which the Q matrix is explicitly constructed:

/* It is inefficient to factor A=QR and explicitly solve the equation R*b = Q`*c */
call QR(Q, R, piv, lindep, XpX);       /* explicitly form the (m x m) matrix Q */
c = Q`*Xpy;                            /* apply the inverse Q matrix to RHS */
b = trisolv(1, R, c, piv);             /* equivalent to b = inv(R)*Q`*c */
print b[L="General QR" F=D10.4];

In contrast, the next call is more efficient because it never explicitly forms the Q matrix:

/* It is more efficient to solve for a specific RHS */
call QR(c, R, piv, lindep, XpX, ,Xpy); /* returns c = Q`*Xpy */
b = trisolv(1, R, c, piv);             /* equivalent to b = inv(R)*Q`*c */
print b[L="Specific QR" F=D10.4];

The output is not shown, but both calls produce estimates for the regression coefficients that are exactly the same as for the earlier examples. Notice that the TRISOLV function takes the pivot vector (which represents a permutation matrix) as the fourth argument. That means you do not need to worry about whether column pivoting occurred during the QR algorithm.
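
As a quick sanity check (not part of the original program), you can confirm that the QR-based estimates agree with the SOLVE-based estimates from the earlier section:

/* Illustrative check: the QR-based solution should match the SOLVE-based solution */
maxDiff = max( abs( b - solve(XpX, Xpy) ) );
print maxDiff;                         /* ~0, up to rounding error */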

QR applied to the design matrix

As mentioned earlier, you can also apply the QR algorithm to the design matrix, X, and the QR algorithm will return the least-square solution without ever forming the normal equations. This is shown in a subsequent article, which also compares the speed of the various methods for solving the least-squares problem.

Summary

This article discusses three ways to solve a least-squares regression problem. All start by constructing the normal equations: (X`X) b = X` y. The solution of the normal equations (b) is the vector that minimizes the sum of the squared differences between the predicted values, X b, and the observed responses, y. This article discusses

  • the SWEEP operator, which is used by many SAS regression procedures
  • the SOLVE and INV functions, which use the LU factorization
  • the QR call, which implements the QR algorithm with column pivoting

A follow-up article compares the performance of these methods and of another QR algorithm, which does not use the normal equations.

The post The QR algorithm for least-squares regression appeared first on The DO Loop.

February 8, 2021
 

Look at the following matrices. Do you notice anything that these matrices have in common?

If you noticed that the rows of each matrix are arithmetic progressions, good for you. For each row, there is a constant difference (also called the "increment") between adjacent elements. For these examples:

  • In the first matrix, each row is an arithmetic progression with a difference of 1. That is, the value of the (j+1)th column is one more than the value of the j_th column.
  • In the second matrix, each row is an arithmetic progression with a difference of 2. That is, the value of the (j+1)th column is two more than the value of the j_th column.
  • In the third matrix, each row is an arithmetic progression, but the increments for the rows are not identical. The increment for the first row is -3. The increment for the second row is +1. The increments for the remaining rows are -2 and +3, respectively.

However, there is another characteristic that these matrices have in common: they are all singular matrices. It turns out this is not a coincidence. An n x n matrix (n>2) is singular if each row is an arithmetic progression.

Why are these matrices singular?

Recall that a matrix will be singular if one of the columns can be written as a linear combination of other columns. For these three matrices, you can use mental arithmetic to verify that the third column is a linear combination of the first two columns. Specifically, you can verify that the third column can be written as
X3 = 2 X2 – X1
where X1, X2, X3, ... are the columns of a matrix. Thus, these matrices are singular.

It turns out that this relationship holds in general for any matrix whose rows are arithmetic progressions. Suppose the i_th row is an arithmetic progression with increment d[i], so that X[i, j+1] = X[i, j] + d[i] for j=1,2,...,n-1. Let d be the vector of differences. Then
X2 = X1 + d
which you can solve for d to obtain
d = X2 – X1.
By substitution,
X3 = X2 + d = X2 + (X2 – X1), or
X3 = 2*X2 – X1.
Therefore, the third column is always a linear combination of the first two columns, and the coefficients do not depend on the increments of the arithmetic progressions in the rows.

In fact, even more is true. You can use similar arguments to show that the j_th column can be written as a linear combination of the first two columns for any j > 2. Thus, regardless of the matrix size, matrices whose rows are arithmetic progressions are rank-2 matrices: they have only two linearly independent columns.
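
You can verify this claim numerically. The following sketch (an illustration, not from the original article) computes the singular values of a 5 x 5 matrix whose rows are arithmetic progressions; only two singular values are numerically nonzero, so the matrix has rank 2:

proc iml;
A = shape(1:25, 5, 5);                 /* each row is an arithmetic progression */
call svd(U, D, V, A);                  /* D contains the singular values */
rank = sum( D > 1e-12 * max(D) );      /* count the nonzero singular values */
print (D`)[L="Singular Values"], rank; /* rank = 2 */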

Why should you care about this?

I decided to write about this topic for three reasons:

  1. It's mathematically interesting, and I never noticed it before.
  2. When I write a program, I often need to quickly create an example matrix to test the program. I often use a matrix such as {1 2 3, 4 5 6, 7 8 9} or SHAPE(1:25,5,5). This article shows that both matrices are singular, which is important to know if the program requires a nonsingular matrix.
  3. Lastly, I wanted to emphasize the difference between a "random matrix" and an "arbitrary matrix." In a previous article, I discussed the probability that a random matrix is singular and concluded that a random matrix is almost always nonsingular. Here, "random matrix" means that the matrix elements are drawn randomly from a continuous probability distribution such as the uniform or normal distribution.

    However, the other day I received a comment on that blog post. Essentially, the comment said, "I wrote down this matrix at random, but it is singular. How do you explain that?" The matrix was similar to the matrices at the top of this article. The answer is that the elements of the matrix were not generated randomly. The rows of the matrix form an arithmetic progression (very NON-random!), which, as we have seen, implies that the matrix is singular.

The post A matrix is singular if its rows are arithmetic sequences appeared first on The DO Loop.

September 16, 2020
 

A data analyst recently asked a question about restricted least squares regression in SAS. Recall that a restricted regression puts linear constraints on the coefficients in the model. Examples include forcing a coefficient to be 1 or forcing two coefficients to equal each other. Each of these problems can be solved by using PROC REG in SAS. The analyst's question was, "Why can I use PROC REG to solve a restricted regression problem when the restriction involves EQUALITY constraints, but PROC REG can't solve a regression problem that involves an INEQUALITY constraint?"

This article shows how to use the RESTRICT statement in PROC REG to solve a restricted regression problem that has linear constraints on the parameters. I also use geometry and linear algebra to show that solving an equality-constrained regression problem is mathematically similar to solving an unrestricted least squares system of equations. This explains why PROC REG can solve equality-constrained regression problems. In a future article, I will show how to solve regression problems that involve inequality constraints on the parameters.

The geometry of restricted regression

The linear algebra for restricted least squares regression gets messy, but the geometry is easy to picture. A schematic depiction of restricted regression is shown to the right.

Geometrically, ordinary least-squares (OLS) regression is the orthogonal projection of the observed response (Y) onto the column space of the design matrix. (For continuous regressors, this is the span of the X variables, plus an "intercept column.") If you introduce equality constraints among the parameters, you are restricting the solution to a linear subspace of the span (shown in green). But the geometry doesn't change. The restricted least squares (RLS) solution is still a projection of the observed response, but this time onto the restricted subspace.

In terms of linear algebra, the challenge is to write down the projection matrix for the restricted problem. As shown in the diagram, you can obtain the RLS solution in two ways:

  • Starting from Y: You can directly project the Y vector onto the restricted subspace. The linear algebra for this method is straightforward and is often formulated in terms of Lagrange multipliers.
  • Starting from the OLS solution: If you already know the OLS solution, you can project that solution vector onto the restricted subspace. Writing down the projection matrix is complicated, so I refer you to the online textbook Practical Econometrics and Data Science by A. Buteikis.

Restricted models and the TEST statement in PROC REG

Let's start with an example, which is based on an example in the online textbook by A. Buteikis. The following SAS DATA step simulates data from a regression model
      Y = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X4 + ε
where B3 = B1 and B4 = –2*B2 and ε ~ N(0, 3) is a random error term.

/* Model: Y = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X4
   where B3 = B1 and B4 = -2*B2
*/
data RegSim;
array beta[0:4] (-5, 2, 3, 2, -6);  /* parameter values for regression coefficients */
N = 100;                            /* sample size */
call streaminit(123);
do i = 1 to N;
   x1 = rand("Normal", 5, 2);       /* X1 ~ N(5,2) */
   x2 = rand("Integer", 1, 50);     /* X2 ~ random integer 1:50 */
   x3 = (i-1)*10/(N-1);             /* X3 = evenly spaced values from 0 to 10 */
   x4 = rand("Normal", 10, 3);      /* X4 ~ N(10, 3) */
   eps = rand("Normal", 0, 3);      /* RMSE = 3 */
   Y = beta[0] + beta[1]*x1 + beta[2]*x2 + beta[3]*x3 + beta[4]*x4 + eps;
   output;
end;
run;

Although there are five parameters in the model (B0-B4), there are only three free parameters because the values of B3 and B4 are determined by the values of B1 and B2, respectively.

If you suspect that a parameter in your model is a known constant or is equal to another parameter, you can use the RESTRICT statement in PROC REG to incorporate that restriction into the model. Before you do, however, it is usually a good idea to perform a statistical test to see if the data supports the model. For this example, you can use the TEST statement in PROC REG to hypothesize that B3 = B1 and B4 = –2*B2. Recall that the syntax for the TEST statement uses the variable names (X1-X4) to represent the corresponding regression coefficients. The following call to PROC REG carries out this analysis:

proc reg data=RegSim plots=none;
   model Y = x1 - x4;
   test x3 = x1, x4 = -2*x2;
run;

The output shows that the parameter estimates for the simulated data are close to the parameter values used to generate the data. Specifically, the root MSE is close to 3, which is the standard deviation for the error term. The Intercept estimate is close to -5. The coefficients for the X1 and X3 terms are each approximately 2. The coefficient for X4 is approximately –2 times the coefficient of X2. The F test for these hypotheses has a large p-value, so you should not reject the hypotheses that B3 = B1 and B4 = –2*B2.

The RESTRICT statement in PROC REG

If the test fails to reject the hypotheses, then you can use the RESTRICT statement to recompute the parameter estimates subject to the constraints. You can rerun the entire analysis or, if you are running SAS Studio in interactive mode, you can simply submit a RESTRICT statement and PROC REG will compute the new parameter estimates without rereading the data:

   restrict x3 = x1, x4 = -2*x2; /* restricted least squares (RLS) */
   print;
quit;

The new parameter estimates enforce the restrictions among the regression coefficients. In the new model, the coefficients for X1 and X3 are exactly equal, and the coefficient for X4 is exactly –2 times the coefficient for X2.

Notice that the ParameterEstimates table is augmented by two rows labeled "RESTRICT". These rows have negative degrees of freedom because they represent constraints. The "Parameter Estimate" column shows the values of the Lagrange multipliers that are used to enforce equality constraints while solving the least squares system. Usually, a data analyst does not care about the actual values in these rows, although the Wikipedia article on Lagrange multipliers discusses ways to interpret the values.

The linear algebra of restricted regression

This section shows the linear algebra behind the restricted least squares solution by using SAS/IML. Recall that the usual way to compute the unrestricted OLS solution is to solve the "normal equations" (X`*X)*b = X`*Y for the parameter estimates, b. Although textbooks often solve this equation by using the inverse of the X`X matrix, it is more efficient to use the SOLVE function in SAS/IML to solve the equation, as follows:

proc iml;
varNames = {x1 x2 x3 x4};
use RegSim;                  /* read the data */
   read all var varNames into X;
   read all var "Y";
close;
X = j(nrow(X), 1, 1) || X;   /* add intercept column to form design matrix */
XpX = X`*X;
 
/* Solve the unrestricted OLS system */
b_ols = solve(XpX, X`*Y);    /* normal equations X`*X * b = X`*Y */
ParamNames = "Intercept" || varNames;
print b_ols[r=ParamNames F=10.5];

The OLS solution is equal to the first (unrestricted) solution from PROC REG.

If you want to require that the solution satisfies B3 = B1 and B4 = –2*B2, then you can augment the X`X matrix to enforce these constraints. If L*b = c is the matrix equation for the linear constraints, then the augmented system is
\( \begin{pmatrix} X^{\prime}X & L^{\prime} \\ L & 0 \end{pmatrix} \begin{pmatrix} b_{\small\mathrm{RLS}} \\ \lambda \end{pmatrix} = \begin{pmatrix} X^{\prime}Y \\ c \end{pmatrix} \)

The following statements solve the augmented system:

/* Direct method: Find b_RLS by solving augmented system */
/* Int X1 X2 X3 X4 */
L = {0  1  0 -1  0,     /*    X1 - X3 = 0 */
     0  0 -2  0 -1};    /* -2*X2 - X4 = 0 */
c = {0, 0};
 
zero = j(nrow(L), nrow(L), 0);
A = (XpX || L`   ) //
    (L   || zero );
z = X`*Y // c;
b_rls = solve(A, z);
rNames = ParamNames||{'Lambda1' 'Lambda2'};
print b_rls[r=rNames F=10.5];

The parameter estimates are the same as are produced by using PROC REG. You can usually ignore the last two rows, which are the values of the Lagrange multipliers that enforce the constraints.

By slogging through some complicated linear algebra, you can also obtain the restricted least squares solution from the OLS solution and the constraint matrix (L). The following statements use the formulas in the online textbook by A. Buteikis to compute the restricted solution:

/* Adjust the OLS solution to obtain the RLS solution.
   b_RLS = b_OLS - b_adjustment
   Follows
   http://web.vu.lt/mif/a.buteikis/wp-content/uploads/PE_Book/4-4-Multiple-RLS.html
*/
RA_1 = solve(XpX, L`);
RA_3 = solve(L*RA_1, L*b_ols - c);
b_adj = RA_1 * RA_3;
b_rls = b_ols - b_adj;
print b_rls[r=ParamNames F=10.5];

This answer agrees with the previous answer but does not compute the Lagrange multipliers. Instead, it computes the RLS solution as the projection of the OLS solution onto the restricted linear subspace.
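
As a quick check, you can verify that the restricted estimates satisfy the constraints (up to numerical precision):

/* Check: the RLS estimates satisfy the linear constraints L*b = c */
maxViolation = max( abs( L*b_rls - c ) );
print maxViolation;                    /* ~0 */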

Summary

This article shows how to use the RESTRICT statement in PROC REG to solve a restricted regression problem that involves linear constraints on the parameters. Solving an equality-constrained regression problem is very similar to solving an unrestricted least squares system of equations. Geometrically, they are both projections onto a linear subspace. Algebraically, you can solve the restricted problem directly or as the projection of the OLS solution.

Although changing one of the equality constraints into an INEQUALITY seems like a minor change, it means that the restricted region is no longer a linear subspace. Ordinary least squares can no longer solve the problem because the solution might not be an orthogonal projection. The next article discusses ways to solve problems where the regression coefficients are related by linear inequalities.

The post Restricted least squares regression in SAS appeared first on The DO Loop.

September 8, 2020
 

Matrix balancing is an interesting problem that has a long history. Matrix balancing refers to adjusting the cells of a frequency table to match known values of the row and column sums. One of the early algorithms for matrix balancing is known as the RAS algorithm, but it is also called the raking algorithm in some fields. The presentation in this article is inspired by a paper by Carol Alderman (1992).

The problem: Only marginal totals are known

Alderman shows a typical example (p. 83), but let me give a smaller example. Suppose a troop of six Girl Scouts sells seven different types of cookies. At the end of the campaign, each girl reports how many boxes she sold. Also, the troop leader knows how many boxes of each cookie type were sold. However, no one kept track of how many boxes of each type were sold by each girl.

The situation is summarized by the following 7 x 6 table. We know the row totals, which are the numbers of boxes that were sold for each cookie type. We also know the column totals, which are the numbers of boxes that were sold by each girl. But we do not know the values of the individual cells, which are the type-by-girl totals.

Guessing the cells of a matrix

As stated, the problem usually has infinitely many solutions. For an r x c table, there are r*c cells, but the marginal totals only provide r + c linear constraints. Except for 2 x 2 tables, there are more unknowns than equations, which leads to an underdetermined linear system that (typically) has infinitely many solutions.

Fortunately, often obtain a unique solution if we provide an informed guess for the cells in the table. Perhaps we can ask the girls to provide estimates of the number of boxes of each type that they sold. Or perhaps we have values for the cells from the previous year. In either case, if we can estimate the cell values, we can use a matrix balancing algorithm to adjust the cells so that they match the observed marginal totals.

Let's suppose that the troop leader asks the girls to estimate (from memory) how many boxes of each type they sold. The estimates are shown in the following SAS/IML program. The row and column sums of the estimates do not match the known marginal sums, but that's okay. The program uses subscript reduction operators to compute the marginal sums of the girls' estimates. These are displayed next to the actual totals, along with the differences.

proc iml;
/* Known marginal totals: */
u = {260, 214, 178, 148, 75, 67, 59}; /* total boxes sold, by type */
v = {272 180 152 163 134 100};        /* total boxes sold, by girl */
 
/* We don't know the cell values that produce these marginal totals,
   but we can ask each girl to estimate how many boxes of each type she sold */
/*   G1 G2 G3 G4 G5 G6          */
A = {75 45 40 40 40 30 ,  /* C1 */
     40 35 45 35 30 30 ,  /* C2 */
     40 25 30 40 30 20 ,  /* C3 */
     40 25 25 20 20 20 ,  /* C4 */
     30 25  0 10 10  0 ,  /* C5 */
     20 10 10 10 10  0 ,  /* C6 */
     20 10  0 10  0  0 }; /* C7 */
 
/* notice that the row/col totals for A do not match the truth: */
rowTotEst = A[ ,+];        /* for each row, sum across all cols */
colTotEst = A[+, ];        /* for each col, sum down all rows   */
Items  = 'Cookie1':'Cookie7';                  /* row labels    */
Person = 'Girl1':'Girl6';                      /* column labels */
 
/* print known totals, estimated totals, and difference */
rowSums = (u || rowTotEst || (u-rowTotEst));
print rowSums[r=Items c={'rowTot (u)' 'rowTotEst' 'Diff'}];
colSums = (v // colTotEst // (v-colTotEst));
print colSums[r={'colTot (v)' 'colTotEst' 'Diff'} c=Person];

You can see that the marginal totals do not match the known row and column totals. However, we can use the guesses as an initial estimate for the RAS algorithm. The RAS algorithm will adjust the estimates to obtain a new matrix that DOES satisfy the marginal totals.

Adjusting a matrix to match marginal totals

The RAS algorithm can iteratively adjust the girls' estimates until the row sums and column sums of the matrix match the known row sums and column sums. The resulting matrix will probably not be an integer matrix. It will, however, reflect the structure of the girls' estimates in three ways:

  1. It will approximately preserve each girl's recollection of the relative proportions of cookie types that she sold. If a girl claims that she sold many of one type of cookie and few of another, that relationship will be reflected in the balanced matrix.
  2. It will approximately preserve the relative proportions of cookie types among girls. If one girl claims that she sold many more of one type of cookie than another girl, that relationship will be reflected in the balanced matrix.
  3. It will preserve zeros. If a girl claims that she did not sell any of one type of cookie, the balanced matrix will also contain a zero for that cell.

The RAS algorithm is explained in Alderman (1992). The algorithm starts with a nonnegative r x c matrix, A. Let u be the observed r x 1 vector of row sums. Let v be the observed 1 x c vector of column sums. The main steps of the RAS algorithm are:

  1. Initialize X = A.
  2. Scale the rows. Let R = u / RowSum(X), where 'RowSum' is a function that returns the vector of row sums. Update X ← R # X. Here R#X indicates elementwise multiplication of each row of X by the corresponding element of R.
  3. Scale the columns. Let S = v / ColSum(X), where 'ColSum' is a function that returns the vector of column sums. Update X ← X # S. Here X#S indicates elementwise multiplication of each column of X by the corresponding element of S.
  4. Check for convergence. If RowSum(X) is close to u and ColSum(X) is close to v, then the algorithm has converged, and X is the balanced matrix. Otherwise, return to Step 2.

The RAS algorithm is implemented in the following SAS/IML program:

/* Use the RAS matrix scaling algorithm to find a matrix X that 
   approximately satisfies 
   X[ ,+] = u  (marginals constraints on rows)
   X[+, ] = v  (marginals constraints on cols)
   and X is obtained by scaling rows and columns of A */
start RAS(A, u, v);
   X = A;       /* Step 0: Initialize X */
   converged = 0;
   do k = 1 to 100 while(^Converged);
      /* Step 1: Row scaling: X = u # (X / X[ ,+]); */
      R = u / X[ ,+];
      X = R # X;
 
      /* Step 2: Column scaling: X = (X / X[+, ]) # v; */
      S = v / X[+, ];
      X = X # S;
 
      /* check for convergence: u=rowTot and v=colTot */
      rowTot = X[ ,+];   /* for each row, sum across all cols */
      colTot = X[+, ];   /* for each col, sum down all rows */
      maxDiff = max( abs(u-rowTot), abs(v-colTot) );
      Converged = (maxDiff < 1e-8);
   end;
   return( X );
finish;
 
X = RAS(A, u, v);
print X[r=Items c=Person format=7.1];

The matrix X has row and column sums that are close to the specified values. The matrix also preserves relationships among individual columns and rows, based on the girls' recollections of how many boxes they sold. The algorithm preserves zeros for the girls who claim they sold zero boxes of a certain type. It does not introduce any new zeros for cells.
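
You can check these claims programmatically. The following illustrative statements verify that the balanced matrix matches the known marginal totals and that it preserves the zero cells of the initial estimate:

/* Check: X matches the marginal totals and preserves the zeros of A */
maxMarginDiff  = max( abs(u - X[ ,+]), abs(v - X[+, ]) );
zerosPreserved = all( (A=0) = (X=0) );   /* 1 if the zero cells are identical */
print maxMarginDiff zerosPreserved;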

You can't determine how close the X matrix is to "the truth," because we don't know the true number of boxes that each girl sold. The X matrix depends on the girls' recollections and estimates. If the recollections are close to the truth, then X will be close to the truth as well.

Summary

This article looks at the RAS algorithm, which is one way to perform matrix balancing on a table that has nonnegative cells. "Balancing" refers to adjusting the interior cells of a matrix to match observed marginal totals. The RAS algorithm starts with an estimate, A. The estimate does not need to satisfy the marginal totals, but it should reflect relationships between the rows and columns that you want to preserve, such as relative proportions and zero cells. The RAS algorithm iteratively scales the rows and columns of A to create a new matrix that matches the observed row and column sums.

It is worth mentioning that the row scaling can be accumulated into a diagonal matrix, R, and the column scaling into a diagonal matrix, S. Thus, the balanced matrix is the product of these diagonal matrices and the original estimate, A. As a matrix equation, you can write X = RAS. It is this matrix factorization that gives the RAS algorithm its name.
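
If you want to see this factorization explicitly, you can accumulate the scaling factors during the iteration. The following sketch modifies the RAS module and uses the A, u, and v matrices defined above; the module name RAS_Factors and the output arguments rowScale and colScale are illustrative and are not part of the original program:

/* Sketch: accumulate the row and column scaling factors so that the
   balanced matrix factors as X = diag(rowScale)*A*diag(colScale) */
start RAS_Factors(A, u, v, rowScale, colScale);
   X = A;
   rowScale = j(nrow(A), 1, 1);               /* accumulated row factors */
   colScale = j(1, ncol(A), 1);               /* accumulated column factors */
   do k = 1 to 100;
      R = u / X[ ,+];  X = R # X;  rowScale = rowScale # R;   /* scale rows */
      S = v / X[+, ];  X = X # S;  colScale = colScale # S;   /* scale cols */
   end;
   return( X );
finish;
 
X2 = RAS_Factors(A, u, v, rowScale, colScale);
maxDiff = max( abs( X2 - rowScale # A # colScale ) );  /* = diag(rowScale)*A*diag(colScale) */
print maxDiff;                                         /* ~0 */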

The RAS algorithm is not the only algorithm for matrix balancing, but it is the simplest. Another algorithm, called iterative proportional fitting (IPF), will be discussed in a second article.

The post Matrix balancing: Update matrix cells to match row and column sums appeared first on The DO Loop.

July 27, 2020
 

The Kronecker product (also called the direct product) is a binary operation that combines two matrices to form a new matrix. The Kronecker product appears in textbooks about the design of experiments and multivariate statistics. The Kronecker product seems intimidating at first, but often one of the matrices in the product has a special form, such as being a matrix of all 1s. In these special cases, the Kronecker product becomes much easier to understand. This article demonstrates some of the most common uses of the Kronecker product, including using it to concatenate matrices and to form block matrices.

What is the Kronecker product?

The Kronecker product is best understood by thinking about block matrices. If A is an n x p matrix and B is an m x q matrix, then the Kronecker product A ⊗ B is a block matrix that is formed from blocks of B, where each block is multiplied by an element of A:

The SAS/IML language uses the "at sign" (@) to represent the binary operator. Thus, the SAS/IML expression A @ B forms the Kronecker product A ⊗ B.

The Kronecker product for vectors of 1s and special matrices

If A or B has a special form, the Kronecker product simplifies. First, consider only the case where A is a vector of all 1s or a special matrix. The following statements follow directly from the definition of the Kronecker product.

  • If A is a row vector of all 1s, then A ⊗ B is the horizontal concatenation of p copies of B.
  • If A is a column vector of all 1s, then A ⊗ B is the vertical concatenation of n copies of B.
  • If A is a matrix of all 1s, then A ⊗ B is a block matrix that contains blocks of the matrix B.
  • If A is the identity matrix, then A ⊗ B is a block diagonal matrix that contains copies of the matrix B along the diagonal.

Next, let A be an arbitrary matrix but consider only special forms for B:

  • If B is a row vector of all 1s, then A ⊗ B contains q copies of the first column of A, followed by q copies of the second column of A, and so forth. I call this sequential replication of columns of A.
  • If B is a column vector of all 1s, then A ⊗ B contains m copies of the first row of A, followed by m copies of the second row of A, and so forth. This is sequential replication of rows of A.
  • If B is a matrix of all 1s, then A ⊗ B is a block matrix where the (i,j)th block is the constant matrix where all elements are A[i,j].
  • If B is an identity matrix, then A ⊗ B is a block matrix where the (i,j)th block is the matrix A[i,j]*I.

The next sections show concrete examples of these eight ways to use the Kronecker product operator.

Use the Kronecker product for horizontal or vertical concatenation

You can use the Kronecker product to perform horizontal or vertical concatenation. For example, the following SAS/IML program defines two vectors that contain only 1s. The vector w is a row vector and the vector h is a column vector. The program computes w ⊗ B and h ⊗ B for various choices of B. The ODS LAYOUT GRIDDED statement arranges the output in a grid.

proc iml;
/* specify RHS vectors */
w  = {1  1};       /* w for 'wide' (row vector) */
h  = {1, 1};       /* h for 'high' (col vector) */
 
/* specify data matrices */
r = {1  2  3};      /* row vector    */
c = {1, 2, 3};      /* column vector */
x = {1 2 3,         /* 2 x 3 matrix  */
     4 5 6};
 
/******************************/
/* vector @ B : Concatenation */
/******************************/
wr = w @ r;         /* horiz concat. Same as repeat(r, 1, ncol(w)) */
wx = w @ x;         /* horiz concat. Same as repeat(x, 1, ncol(w)) */
hc = h @ c;         /* vert  concat. Same as repeat(c, nrow(h), 1) */
hx = h @ x;         /* vert  concat. Same as repeat(x, nrow(h), 1) */
 
ods layout gridded columns=2 advance=table;
print wr[L='w@r'], wx[L='w@x'], hc[L='h@c'], hx[L='h@x'];
ods layout end;

The output indicates that the Kronecker product with w on the left results in horizontal concatenation. Using h on the left results in vertical concatenation.

Use the Kronecker product for sequential replication

The previous section showed how to take a sequence of numbers such as {1,2,3} and repeat the entire sequence to form a new sequence such as {1,2,3,1,2,3}. But sometimes you want to repeat each element before repeating the next element. In other words, you might be interested in forming a sequence such as {1,1,2,2,3,3}. You can perform this kind of "sequential replication" by putting the data on the left and the vectors of 1s on the right, as follows:

/***************************************/
/* A @ vector : Sequential replication */
/***************************************/
rw = r @ w;         /* repeat cols sequentially */
xw = x @ w;
ch = c @ h;         /* repeat rows sequentially */
xh = x @ h;
 
ods layout gridded columns=2 advance=table;
print rw[L='r@w'], xw[L='x@w'], ch[L='c@h'], xh[L='x@h'];
ods layout end;

It is worth mentioning that the second row of output is the transpose of the first row. That is because the transpose operation "distributes" over the Kronecker product operator. That is, for any matrices A and B, it is true that (A ⊗ B)` = A` ⊗ B`. Since r`=c and w`=h, then (r ⊗ w)` = (r` ⊗ w`) = c ⊗ h.
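
A quick numerical check of this identity, using the vectors that were defined earlier:

/* Check: (r@w)` equals c@h because r` = c and w` = h */
diff = max( abs( (r @ w)` - c @ h ) );
print diff;                            /* 0 if the identity holds */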

Use the Kronecker product to construct block matrices

The Kronecker product is essentially an operation that forms block matrices. When one of the components is a vector of all 1s, then "forming a block matrix" is the same as concatenation. But if A is a binary matrix, then A ⊗ B is a block matrix that has the same structure as A, but each nonzero block is a copy of B.

Probably the most important subcase is that the matrix I(n) ⊗ B is a block diagonal matrix, where each diagonal block is a copy of B. The following list gives some other examples that use the identity or a constant matrix. Let I(n) be the n x n identity matrix and let J(n) be the n x n matrix of all 1s. Then

  • I(n) ⊗ B is a block diagonal matrix with copies of B on the diagonal.
  • J(n) ⊗ B is a block matrix with a copy of B in each block.
  • A ⊗ I(n) is a block matrix where each block is a diagonal matrix.
  • A ⊗ J(n) is a block matrix where each block is a constant matrix.

Ix = I(2) @ x;      /* diagonal block matrix */    
Jx = J(2) @ x;      /* 2 x 2 block matrix    */
xI = x @ I(2);      /* blocks of diagonal matrices */     
xJ = x @ J(2);      /* blocks of constant matrices */
 
ods layout gridded columns=2 advance=table;
print Ix[L='I@x'], Jx[L='J@x'], xI[L='x@I'], xJ[L='x@J'];
ods layout end;

How is the Kronecker product used in statistics?

Recall that (mathematically) you can only add matrices that have the same dimensions. For example, you can't add a matrix and a vector. However, a common operation in multivariate statistics is to center a data matrix by subtracting the mean of each column from that column. If X is the data matrix, "mean(X)" is a row vector where the i_th element of the vector is the mean of the i_th column. The centering operation is sometimes informally written as "X – mean(X)," even though that expression does not make mathematical sense. Formally, you need to vertically concatenate the mean vector n times, where n is the number of rows in the data matrix. As we have seen, if h is a column vector of length n that contains all 1s, then mean(X) ⊗ h is a matrix and the expression X – mean(X) ⊗ h is a proper mathematical expression that indicates the centering operation.

As described in a previous article, in the early days of SAS, statistical programmers had to use the Kronecker product operator to ensure that all matrix operations were mathematically correct. But that changed in SAS version 8 (circa 1999) when the IML language introduced a shorthand notation that makes it easier to perform row and column operations. Back in the "old days," you saw lots of programs that used the Kronecker operator like this:

/* read data matrix into X */
use Sashelp.Class; read all var _num_ into X;  close;
mean = x[:,];            /* row vector of means for each column */
h = j(nrow(X), 1, 1);    /* column vector */
CenterX = X - mean @ h;  /* subtract matrices of same dimension */

The modern SAS/IML programmer will typically use the shorthand notation to avoid the Kronecker product:

CenterX = X - mean;      /* modern code: Subtract the row vector from each row of X */

Although you no longer need to use the Kronecker product for basic centering operations, the Kronecker operator is still useful for other tasks, such as building block matrices and concatenation. Block diagonal matrices are used in mixed models of longitudinal data (Chapter 12 of Wicklin, 2013).

Summary

This article shows eight ways that you can use the Kronecker product. The Kronecker operator enables you to perform horizontal and vertical concatenation, perform sequential repetition of rows and columns, and build block matrices with special structure. You can use this article as a "cheat sheet" to remind you about some common uses of the Kronecker operator.

Do you have a favorite way to use the Kronecker product? Leave a comment.

The post 8 ways to use the Kronecker product appeared first on The DO Loop.

June 24, 2020
 

If you have ever run a Kolmogorov-Smirnov test for normality, you have encountered the Kolmogorov D statistic. The Kolmogorov D statistic is used to assess whether a random sample was drawn from a specified distribution. Although it is frequently used to test for normality, the statistic is "distribution free" in the sense that it compares the empirical distribution to any specified distribution. I have previously written about how to interpret the Kolmogorov D statistic and how to compute it in SAS.

If you can compute a statistic, you can use simulation to approximate its sampling distribution and to approximate its critical values. Although I am a fan of simulation, in this article I show how to compute the exact distribution and critical values for the Kolmogorov D statistic. The technique used in this article is from Facchinetti (2009), "A Procedure to Find Exact Critical Values of Kolmogorov-Smirnov Test."
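
For comparison, here is a minimal Monte Carlo sketch that approximates the distribution of D for n=20 by simulating samples from the uniform reference distribution. The exact method described in this article replaces this kind of approximation:

proc iml;
/* Monte Carlo sketch: approximate the distribution of the Kolmogorov D
   statistic for n=20 by simulating samples from U(0,1) */
call randseed(1);
n = 20;   B = 10000;                   /* sample size; number of simulated samples */
Dsim = j(B, 1, .);
i = T(1:n);
do k = 1 to B;
   x = randfun(n, "Uniform");          /* random sample from the reference distribution */
   call sort(x);
   Dsim[k] = max( i/n - x, x - (i-1)/n );   /* D = largest gap between ECDF and CDF */
end;
call qntl(crit, Dsim, 0.95);           /* approximate critical value for alpha = 0.05 */
print crit;                            /* should be close to the exact value 0.294 */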

An overview of the Facchinetti method

Suppose you have a sample of size n and you perform a Kolmogorov-Smirnov test that compares the empirical distribution to a reference distribution. You obtain the test statistic D. To know whether you should reject the null hypothesis (that the test came from the reference distribution), you need to be able to compute the probability CDF(D) = Pr(Dn ≤ D). Facchinetti shows that you can compute the probability for any D by solving a system of linear equations. If a random sample contains n observations, you can form a certain (2n+2) x (2n+2) matrix whose entries are composed of binomial probabilities. You can use a discrete heat map to display the structure of the matrix that is used in the Facchinetti method. One example is shown below. For n=20, the matrix is 42 x 42. As Facchinetti shows in Figure 6 of her paper, the matrix has a block structure, with triangular matrices in four blocks.

The right-hand side of the linear system is also composed of binomial probabilities. The system is singular, but you can use a generalized inverse to obtain a solution. From the solution, you can compute the probability. For the details, see Facchinetti (2009).

Facchinetti includes a MATLAB program in the appendix of her paper. I converted the MATLAB program into SAS/IML and improved the efficiency by vectorizing some of the computations. You can download the SAS/IML program that computes the CDF of the Kolmogorov D statistic and all the graphs in this article. The name of the SAS/IML function is KolmogorovCDF, which requires two arguments, n (sample size) and D (value of the test statistic).

Examples of using the CDF of the Kolmogorov distribution

Suppose you have a sample of size 20. You can use a statistical table to look up the critical value of the Kolmogorov D statistic. For α = 0.05, the (two-sided) critical value for n=20 is D = 0.294. If you load and call the KolmogorovCDF function, you can confirm that the CDF for the Kolmogorov D distribution is very close to 0.95 when D=0.294:

proc iml;
load  module=KolmogorovCDF;      /* load the function that implements the Facchinetti method */
/* simple example of calling the function */
n = 20;
D = 0.29408;
cdf = KolmogorovCDF(n, D);
print n D cdf;

The interpretation of this result is that if you draw a random samples of size 20 from the reference distribution, there is a 5% chance that the Kolmogorov D value will exceed 0.294. A test statistic that exceeds 0.294 implies that you should reject the null hypothesis at the 0.05 significance level.

Visualize the Kolmogorov D distribution

If you can compute the CDF for one value of D, you can compute it for a sequence of D values. In this way, you can visualize the CDF curve. The following SAS/IML statements compute the CDF for the Kolmogorov D distribution when n=20 by running the computation on a sequence of D values in the open interval (0, 1):

D = do(0.01, 0.99, 0.01);             /* sequence of D values */
CDF = j(1, ncol(D), .);
do i = 1 to ncol(D);
   CDF[i] = KolmogorovCDF(n, D[i]);   /* CDF for each D */ 
end;
title "Kolmogorov CDF, n=20";
call series(D, CDF) grid={x y};       /* create a graph of the CDF */

The graph enables you to read probabilities and quantiles for the D20 distribution. The graph shows that almost all random samples of size 20 (drawn from the reference distribution) will have a D statistic of less than 0.4. Geometrically, the D statistic represents the largest gap between the empirical distribution and the reference distribution. A large gap indicates that the sample was probably not drawn from the reference distribution.

Visualize a family of Kolmogorov D distributions

The previous graph was for n=20, but you can repeat the computation for other values of n to obtain a family of CDF curves for the Kolmogorov D statistic. (Facchinetti writes Dn to emphasize that the statistic depends on the sample size.) The computation is available in the program that accompanies this article. The results are shown below:

This graph enables you to estimate probabilities and quantiles for several Dn distributions.

Visualize a family of density curves

Because the PDF is the derivative of the CDF, you can use a forward difference approximation of the derivative to also plot the density of the distribution. The following density curves are derived from the previous CDF curves:
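
In code, the forward-difference approximation is just the difference of adjacent CDF values divided by the grid spacing. The following sketch reuses the D and CDF vectors that were computed earlier for n=20:

/* Sketch: approximate the density by a forward difference of the CDF (n=20) */
k   = ncol(D);
dD  = D[2] - D[1];                                   /* spacing of the D grid */
PDF = ( colvec(CDF[2:k]) - colvec(CDF[1:(k-1)]) ) / dD;
call series( colvec(D[1:(k-1)]), PDF ) grid={x y} label={"D" "Approximate Density"};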

A table of critical values

If you can compute the CDF, you can compute quantiles by using the inverse CDF. You can use a root-finding technique to obtain a quantile.

In the SAS/IML language, the FROOT function can find univariate roots. The critical values of the Dn statistic are usually tabulated by listing the sample sizes down the rows and the significance level (α) or confidence level (1-α) across the columns. Facchinetti (p. 352) includes a table of critical values in her paper. The following SAS/IML statements show how to recreate the table. For simplicity, I show only a subset of the table for n=20, 25, 30, and 35, and for four common values of α.

/* A critical value is the root of the function
   f(D) = KolmogorovCDF(n, D) - (1-alpha)
   for any specified n and alpha value.
*/
start KolmogorovQuantile(D) global (n, alpha);
   return  KolmogorovCDF(n, D) - (1-alpha);
finish;
 
/* create a table of critical values for the following vector of n and alpha values */
nVals = do(20, 35, 5);
alphaVals = {0.001 0.01 0.05 0.10};
CritVal = j(ncol(nVals), ncol(alphaVals), .);
do i = 1 to ncol(nVals);
   n = nVals[i];
   do j = 1 to ncol(alphaVals);
      alpha = alphaVals[j];
      CritVal[i,j] = froot("KolmogorovQuantile", {0.01, 0.99});
   end;
end;
print CritVal[r=nVals c=alphaVals format=7.5 L="Critical Values (n,alpha)"];

For each sample size, the corresponding row of the table shows the critical values of D for which CDF(D) = 1-α. Of course, a printed table is unnecessary when you have a programmatic way to compute the quantiles.

Summary

Facchinetti (2009) shows how to compute the CDF for the Kolmogorov D distribution for any value of n and D. The Kolmogorov D statistic is often used for goodness-of-fit tests. Facchinetti's method consists of forming and solving a system of linear equations. To make her work available to the SAS statistical programmer, I translated her MATLAB program into a SAS/IML program that computes the CDF of the Kolmogorov D distribution. This article demonstrates how to use the KolmogorovCDF in SAS/IML to compute and visualize the CDF and density of the Kolmogorov D statistic. It also shows how to use the computation in conjunction with a root-finding method to obtain quantiles and exact critical values of the Kolmogorov D distribution.

The post The Kolmogorov D distribution and exact critical values appeared first on The DO Loop.

June 22, 2020
 

Sometimes in matrix computations, it is important to display the nonzero elements of a matrix. This can be useful for visualizing the structure of a sparse matrix (one that has many zeros) and it is also useful when describing a matrix algorithm (such as Gaussian elimination) that introduces zeros at each step of the algorithm.

Let A be the matrix you want to visualize. You can visualize the nonzero elements of the matrix A in a graph or in a table:

  • Use a custom SAS format that prints an asterisk (*) for nonzero values and a blank for zero values.
  • Create a discrete heat map of the binary matrix A^=0, which replaces the nonzero cells of A with the value 1.

A text display method

For small matrices, you can use a custom SAS format to print the matrix. The format is defined by using PROC FORMAT. This format displays a star (*) for nonzero values and displays a blank for zero values:

proc format;
value SparseFmt   
      0 = " "      /* display blank instead of 0 */
      other = "*"; /* display '*' instead of number */
run;
 
proc iml;
A = {1.0  0.2  0.0  0.0  0.0  0.0 1e-6,
     0.2  1.0  0.2  0.0  0.0  0.0 0.0,
     0.0  0.2  1.0  0.2  0.0  0.0 0.0,
     0.0  0.0  0.2  1.0  0.2  0.0 0.0,
     1e-6 0.0  0.0  0.2  1.0  0.2 0.0,
     2e-6 1e-6 0.0  0.0  0.2  1.0 0.2,
     3e-6 2e-6 1e-6 0.0  0.0  0.2 1.0 };
 
/* use a custom format to print * or blank in each cell */
print A[format=SparseFmt.];
Visualize a sparse matrix by using a custom format that prints a blank or an asterisk

You can see the banded structure of the matrix and also see that elements near the corners are nonzero. This textual visualization is concise and is used in textbooks about numerical linear algebra.

A graphical method

For larger matrices, a graphical method is probably better. You can form the binary logical matrix B=(A^=0) and then plot B by using a two-color heat map. In the SAS/IML language, you can use the HEATMAPDISC call to create a discrete heat map of a matrix. I like to use white for the zeroes. The following SAS/IML statements use a discrete heat map to display the nonzero values of the 7 x 7 matrix that was defined earlier:

/* discrete heat map of binary matrix A^=0 */
ods graphics / width=400px height=400px;
call heatmapdisc(A^=0) colorramp={white steelblue}
         title="Discrete heat map of binary matrix";
[Figure: discrete heat map of the binary matrix A^=0; zero cells are white and nonzero cells are colored]

The information in the heat map is identical to the printed display. However, for large matrices, the heat map is easier to view. As shown, you can also use the ODS GRAPHICS statement to control the dimensions of the graph. So, for example, if you are visualizing a square matrix, you can make the graph square.
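
For example, the following sketch creates a larger random sparse matrix and displays its nonzero structure in a square heat map. The 50 x 50 size and the 10% density are arbitrary choices for illustration:

n = 50;                                        /* arbitrary size for illustration */
call randseed(1);
S = j(n, n, 0);
idx = sample(1:n##2, ceil(0.10*n##2), "NoReplace");  /* choose about 10% of the cells */
S[idx] = 1;                                    /* make the chosen cells nonzero */
ods graphics / width=500px height=500px;       /* square graph for a square matrix */
call heatmapdisc(S^=0) colorramp={white steelblue}
         title="Nonzero structure of a random sparse matrix";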

The post Visualize the structure of a sparse matrix appeared first on The DO Loop.

December 16, 2019
Rockin' around the Christmas tree
At the Christmas party hop.
     – Brenda Lee
[Figure: Christmas tree image with added noise]

Last Christmas, I saw a fun blog post that used optimization methods to de-noise an image of a Christmas tree. Although there are specialized algorithms that remove random noise from an image, I am not going to use a specialized de-noising algorithm. Instead, I will use the SVD algorithm, which is a general-purpose matrix algorithm that has many applications. One application is to construct a low-rank approximation of a matrix. This article applies the SVD algorithm to a noisy image of a Christmas tree. The result is recognizable as a Christmas tree, but much of the noise has been eliminated.

A noisy Christmas tree image

The image to the right is a heat map of a binary (0/1) matrix that has 101 rows and 100 columns. First, I put the value 1 in certain cells so that the heat map resembles a Christmas tree. About 41% of the cells have the value 1. We will call these cells "the tree".

To add random noise to the image, I randomly switched 10% of the cells. The result is an image that is recognizable as a Christmas tree but is very noisy.
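
Although the complete program is available for download at the end of this article, the noise step is easy to sketch. The following statements assume that Tree is the original 0/1 tree matrix (the name is illustrative); they randomly flip 10% of its cells:

call randseed(2019);                      /* arbitrary seed for illustration */
nCells = nrow(Tree) * ncol(Tree);
flip = sample(1:nCells, ceil(0.10*nCells), "NoReplace");  /* choose 10% of the cells */
M = Tree;                                 /* M is the noisy image */
M[flip] = ^M[flip];                       /* switch 0 <-> 1 at the chosen cells */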

Apply the SVD to a matrix

As mentioned earlier, I will attempt to de-noise the matrix (image) by using the general-purpose SVD algorithm, which is available in SAS/IML software. If M is the 101 x 100 binary matrix, then the following SAS/IML statements factor the matrix by using a singular-value decomposition. A series plot displays the first 20 singular values (ordered by magnitude) of the noisy Christmas tree matrix:

call svd(U, D, V, M);                             /* M = U*diag(D)*V` */
call series(1:20, D[1:20]) grid={x y} xvalues=1:20 label={"Component" "Singular Value"};

The graph (which is similar to a scree plot in principal component analysis) indicates that the first four singular values are somewhat larger than the remainder. The following statements approximate M by using a rank-4 matrix. The rank-4 matrix is not a binary matrix, but you can visualize the rank-4 approximation matrix by using a continuous heat map. For convenience, the matrix elements are shifted and scaled into the interval [0,1].
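
In matrix notation, the rank-k approximation keeps only the first k columns of U and V and the k largest singular values:
\({\bf A}_k = {\bf U}_k \, {\bf D}_k \, {\bf V}_k^{\prime}\)
where \({\bf U}_k\) and \({\bf V}_k\) contain the first k columns of U and V, respectively, and \({\bf D}_k = \mbox{diag}(d_1, \ldots, d_k)\). The statements that follow use k = 4.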

keep = 4;        /* how many components to keep? */
idx = 1:keep;
Ak = U[ ,idx] * diag(D[idx]) * V[ ,idx]`;
Ak = (Ak - min(Ak)) / range(Ak); /* for plotting, standardize into [0,1] */
ramp = {white green};            /* white-to-green ramp (one reasonable choice; the downloadable program defines its own ramp) */
s = "Rank " + strip(char(keep)) + " Approximation of Noisy Christmas Tree Image";
call heatmapcont(Ak) colorramp=ramp displayoutlines=0 showlegend=0 title=s range={0 1};

Apply a threshold cutoff to classify pixels

The continuous heat map shows a dark image of the Christmas tree surrounded by a light-green "fog". The dark-green pixels represent cells that have values near 1. The light-green pixels have smaller values. You can use a histogram to show the distribution of values in the rank-4 matrix:

ods graphics / width=400px height=300px;
call histogram(Ak);

You can think of the cell values as being a "score" that predicts whether each cell belongs to the Christmas tree (green) or not (white). The histogram indicates that the values in the matrix appear to be a mixture of two distributions. The high values in the histogram correspond to dark-green pixels (cells); the low values correspond to light-green pixels. To remove the light-green "fog" in the rank-4 image, you can use a "threshold value" to convert the continuous values to binary (0/1) values. Essentially, this operation predicts which cells are part of the tree and which are not.

For every choice of the threshold parameter, there will be correct and incorrect predictions:

  • A "true positive" is a cell that belongs to the tree and is colored green.
  • A "false positive" is a cell that does not belong to the tree but is colored green.
  • A "true negative" is a cell that does not belong to the tree and is colored white.
  • A "false negative" is a cell that belongs to the tree but is colored white.

By looking at the histogram, you might guess that a threshold value of 0.5 will do a good job of separating the low and high values. The following statements use 0.5 to convert the rank-4 matrix into a binary matrix, which you can think of as the predicted values of the de-noising process:

t = 0.5;  /* threshold parameter */
s = "Discrete Filter: Threshold = " + strip(char(t,4,2)) ;
call HeatmapDisc(Ak > t) colorramp=ramp displayoutlines=0 showlegend=0 title=s;

Considering that 10% of the original image was corrupted, the heat map of the reconstructed matrix is impressive. It looks like a Christmas tree, but has much less noise. The reconstructed matrix agrees with the original (pre-noise) matrix for 98% of the cells. In addition:

  • There are only a few false positives (green dots) that are far from the tree.
  • There are some false negatives in the center of the tree, but many fewer than were in the original noisy image.
  • The locations where the image is the most ragged are along the edge and trunk of the Christmas tree. In that region, the de-noising could not tell the difference between tree and non-tree.
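
You can quantify these statements by comparing the thresholded matrix with the pre-noise matrix. The following is a hedged sketch that assumes the original 0/1 matrix is named Tree (as in the earlier sketch); the actual matrix is defined in the downloadable program:

Pred = (Ak > 0.5);                 /* predicted tree cells at the 0.5 threshold */
agree = (Pred = Tree)[:];          /* proportion of cells classified correctly */
TP = (Pred & Tree)[+];             /* tree cells correctly colored green */
FP = (Pred & ^Tree)[+];            /* non-tree cells incorrectly colored green */
TN = (^Pred & ^Tree)[+];           /* non-tree cells correctly colored white */
FN = (^Pred & Tree)[+];            /* tree cells incorrectly colored white */
print agree TP FP TN FN;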

The effect of the threshold parameter

You might wonder what the reconstruction would look like for different choices of the cutoff threshold. The following statements create two other heat maps, one for the threshold value 0.4 and the other for the threshold value 0.6. As you might have guessed, smaller threshold values result in more false positives (green pixels away from the tree), whereas higher threshold values result in more false negatives (white pixels inside the tree).

do t = 0.4 to 0.6 by 0.2;
   s = "Discrete Filter: Threshold = " + strip(char(t,4,2)) ;
   call HeatmapDisc(Ak > t) colorramp=ramp displayoutlines=0 showlegend=0 title=s;
end;

Summary

Although I have presented this experiment in terms of an image of a Christmas tree, it is really a result about matrices and low-rank approximations. I started with a binary 0/1 matrix, and I added noise to 10% of the cells. The SVD factorization enables you to approximate the matrix by using a rank-4 approximation. By using a cutoff threshold, you can approximate the original matrix before the random noise was added. Although the SVD algorithm is a general-purpose algorithm that was not designed for de-noising images, it does a good job eliminating the noise and estimating the original matrix.

If you are interested in exploring these ideas for yourself, you can download the complete SAS program that creates the graphs in this article.

The post Math-ing around the Christmas tree: Can the SVD de-noise an image? appeared first on The DO Loop.

December 11, 2019

Binary matrices are used for many purposes. I have previously written about how to use binary matrices to visualize missing values in a data matrix. They are also used to indicate the co-occurrence of two events. In ecology, binary matrices are used to indicate which species of an animal are present in which ecological site. For example, if you remember your study of Darwin's finches in high school biology class, you might recall that certain finches (species) are present or absent on various Galapagos islands (sites). You can use a binary matrix to indicate which finches are present on which islands.

Recently I was involved in a project that required performing a permutation test on rows of a binary matrix. As part of the project, I had to solve three smaller problems involving rows of a binary matrix:

  1. Given two rows, find the locations at which the rows differ.
  2. Given two binary matrices, determine how similar the matrices are by computing the proportion of elements that are the same.
  3. Given two rows, swap some of the elements that differ.

This article shows how to solve each problem by using the SAS/IML matrix language. A future article will discuss permutation tests for binary matrices. For clarity, I introduce the following macro that uses a temporary variable to swap two sets of values:

/* swap values of A and B. You can use this macro in the DATA step or in the SAS/IML language */
%macro SWAP(A, B, TMP=_tmp);
   &TMP = &A; &A = &B; &B = &TMP;
%mend;
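
For example, the following sketch uses the macro in the DATA step to swap two scalar values (the variable names and values are arbitrary):

data _null_;
   a = 1;  b = 99;
   %SWAP(a, b);
   put "After swap: " a= b=;    /* writes a=99 b=1 to the log */
run;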

Where do binary matrices differ?

The SAS/IML matrix language enables you to treat matrices as high-level objects. You often can answer questions about matrices without writing loops that iterate over the elements of the matrices. For example, if you have two matrices of the same dimensions, you can determine the cells at which the matrices are unequal by using the "not equal" (^=) operator. The following SAS/IML statements define two 2 x 10 matrices and use the "not equal" operator to find the cells that are different:

proc iml;
A = {0 0 1 0 0 1 0 1 0 0 ,
     1 0 0 0 0 0 1 1 0 1 };
B = {0 0 1 0 0 0 0 1 0 1 ,
     1 0 0 0 0 1 1 1 0 0 };
colLabel = "Sp1":("Sp"+strip(char(ncol(A))));
rowLabel = "A1":("A"+strip(char(nrow(A))));
 
/* 1. Find locations where binary matrices differ */
Diff = (A^=B);
print Diff[c=colLabel r=rowLabel];

The matrices A and B are similar. They both have three 1s in the first row and four 1s in the second row. However, the output shows that the matrices are different for the four elements in the sixth and tenth columns. Although I used entire matrices for this example, the same operations work on row vectors.

The proportion of elements in common

You can use various statistics to measure the similarity between the A and B matrices. A simple statistic is the proportion of elements that are in common. These matrices each have 20 elements, and they agree for 16 of them, so the proportion in common is 16/20 = 0.8. You can compute the proportion in common by using the expression (A=B)[:], or you can use the following statements if you have previously computed the Diff matrix:

/* 2. the proportion of elements in B that are the same as in A */
propDiff = 1 - Diff[:];
print propDiff;

As a reminder, the mean subscript reduction operator ([:]) computes the mean value of the elements of the matrix. For a binary matrix, the mean value is the proportion of ones.

Swap elements of rows

The first two tasks were easy. A more complicated task is swapping values that differ between rows. The swapping operation is not difficult, but it requires finding the k locations where the rows differ and then swapping all or some of those values. In a permutation test, the number of elements that you swap is a random integer between 1 and k, but for simplicity, this example uses the SWAP macro to swap two cells that differ. For clarity, the following example uses temporary variables such as x1, x2, d1, and d2 to swap elements in the matrix A:

/* specify the rows whose value you want to swap */
i1 = 1;                         /* index of first row to compare and swap */
i2 = 2;                         /* index of second row to compare and swap */
 
/* For clarity, use temp vars x1 & x2 instead of A[i1, ] and A[i2, ] */
x1 = A[ i1, ];                  /* get one row */
x2 = A[ i2, ];                  /* get another row */
idx = loc( x1^=x2 );            /* find the locations where rows differ */
if ncol(idx) > 0 then do;       /* do the rows differ? */
   d1 = x1[ ,idx];              /* values at the locations that differ */
   d2 = x2[ ,idx];
   print (d1//d2)[r={'r1' 'r2'} L='Original'];
 
   /* For a permutation test, choose a random number of locations to swap */
   /* numSwaps = randfun(1, "Integer", 1, ncol(idx));
      idxSwap = sample(1:ncol(idx), numSwaps, "NoReplace");  */
   idxSwap = {2 4};              /* for now, hard-code locations to swap */
   %SWAP(d1[idxSwap], d2[idxSwap]);
   print (d1//d2)[r={'d1' 'd2'} L='New'];
   /* update matrix */
   A[ i1, idx] = d1;
   A[ i2, idx] = d2;
end;
print A[c=colLabel r=rowLabel];

The vectors x1 and x2 are the rows of A to compare. The vectors d1 and d2 are the subvectors that contain only the elements of x1 and x2 that differ. The example swaps the second and fourth columns of d1 and d2. The new values are then inserted back into the matrix A. You can compare the final matrix to the original to see that the process swapped two elements in each of two rows.
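
If you plan to repeat this process many times, as in a permutation test, you might wrap the logic in a module. The following is a hedged sketch (it is not part of the original program); the module name is arbitrary, and the random number of swaps follows the commented-out RANDFUN and SAMPLE statements in the program above:

start SwapRandomCells(A, i1, i2);
   B = A;                                      /* copy so that the argument is not modified */
   idx = loc( B[i1, ] ^= B[i2, ] );            /* locations where the two rows differ */
   if ncol(idx) = 0 then return(B);            /* nothing to swap */
   numSwaps = randfun(1, "Integer", 1, ncol(idx));   /* random number of cells to swap */
   s = sample(idx, numSwaps, "NoReplace");     /* which of those cells to swap */
   tmp = B[i1, s];   B[i1, s] = B[i2, s];   B[i2, s] = tmp;
   return(B);
finish;
 
call randseed(123);
A2 = SwapRandomCells(A, 1, 2);
print A2[c=colLabel r=rowLabel];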

Although the examples are for binary matrices, these techniques work for any numerical matrices.

The post Swap elements in binary matrices appeared first on The DO Loop.