Matrix Computations

November 28, 2018

I remember the first time I used PROC GLM in SAS to include a classification effect in a regression model. I thought I had done something wrong because the parameter estimates table was followed by a scary-looking note:

Note: The X'X matrix has been found to be singular, and a generalized inverse 
      was used to solve the normal equations. Terms whose estimates are 
      followed by the letter 'B' are not uniquely estimable. 

Singular matrix? Generalized inverse? Estimates not unique? "What's going on?" I thought.

In spite of the ominous note, nothing is wrong. The note merely tells you that the GLM procedure has computed one particular estimate; other valid estimates also exist. This article explains what the note means in terms of the matrix computations that are used to estimate the parameters in a linear regression model.

The GLM parameterization is a singular parameterization

The note is caused by the fact that the GLM model includes a classification variable. Recall that a classification variable in a regression model is a categorical variable whose levels are used as explanatory variables. Examples of classification variables (called CLASS variables in SAS) are gender, race, and treatment. The levels of the categorical variable are encoded in the columns of a design matrix. The columns are often called dummy variables. The design matrix is used to form the "normal equations" for least squares regression. In terms of matrices, the normal equations are written as (X`*X)*b = X`*Y, where X is a design matrix, Y is the vector of observed responses, and b is the vector of parameter estimates, which must be computed.

There are many ways to construct a design matrix from classification variables. If X is a design matrix that has linearly dependent columns, the crossproduct matrix X`X is singular. Some ways of creating the design matrix always result in linearly dependent columns; these constructions are said to use a singular parameterization.

The simplest and most common parameterization encodes each level of a categorical variable by using a binary indicator column. This is known as the GLM parameterization. It is a singular parameterization because if X1, X2, ..., Xk are the binary columns that indicate the k levels, then Σ Xi = 1 for each observation.
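For example, the following SAS/IML sketch uses the DESIGN function to generate the indicator columns for a hypothetical three-level variable. Because every row sum equals 1, the dummy columns are linearly dependent with an intercept column:

proc iml;
/* sketch: GLM (indicator) coding for a hypothetical three-level variable */
levels = {1, 2, 3, 1, 2, 3};   /* values of a CLASS variable */
X = design(levels);            /* one binary column per level */
rowSum = X[ ,+];               /* sum of the dummy columns is always 1 */
print X rowSum;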

Not surprisingly, the GLM procedure in SAS uses the GLM parameterization. Here is an example that generates the "scary" note. The data are a subset of the Sashelp.Heart data set. The levels of the BP_Status variable are "Optimal", "Normal", and "High":

data Patients;
   set Sashelp.Heart;
   keep BP_Status Cholesterol;
   if ^cmiss(BP_Status, Cholesterol); /* discard any missing values */
run;

proc glm data=Patients plots=none;
   class BP_Status;
   model Cholesterol =  BP_Status / solution;
run; quit;

If you change the reference levels, you get a different estimate

If you have linearly dependent columns among the explanatory variables, the parameter estimates are not unique. The easiest way to see this is to change the reference level for a classification variable. In PROC GLM, you can use the REF=FIRST or REF=LAST option on the CLASS statement to change the reference level. However, the following example uses PROC GLMSELECT (without variable selection) because you can simultaneously use the OUTDESIGN= option to write the design matrix to a SAS data set. The first call writes the design matrix that PROC GLM uses (internally) for the default reference levels. The second call writes the design matrix for an alternate reference level:

/* GLMSELECT can fit the data and output a design matrix in one step */
title "Estimates for GLM Parameterization";
title2 "Default (Last) Reference Levels";
ods select ParameterEstimates(persist);
proc glmselect data=Patients outdesign(fullmodel)=GLMDesign1;
   class BP_Status;
   model Cholesterol =  BP_Status / selection=none;  
run;
/* Change reference levels. Different reference levels result in different estimates. */ 
title2 "Custom Reference Levels";
proc glmselect data=Patients outdesign(fullmodel)=GLMDesign2;
   class BP_Status(ref='Normal');
   model Cholesterol =  BP_Status / selection=none;  
run;
ods select all;
/* compare a few rows of the design matrices */
proc print data=GLMDesign1(obs=10 drop=Cholesterol); run;
proc print data=GLMDesign2(obs=10 drop=Cholesterol); run;

The output shows that changing the reference level results in different parameter estimates. (However, the predicted values are identical for the two estimates.) If you use PROC PRINT to display the first few observations in each design matrix, you can see that the matrices are the same except for the order of two columns. Thus, if you have linearly dependent columns, the GLM estimates might depend on the order of the columns.

The SWEEP operator produces a generalized inverse that is not unique

You might wonder why the parameter estimates change when you change reference levels (or, equivalently, change the order of the columns in the design matrix). The mathematical answer is that there is a whole family of solutions that satisfy the (singular) regression equations, and from among the infinitely many solutions, the GLM procedure chooses the solution for which the estimate of the reference level is exactly zero.

Last week I discussed generalized inverses, including the SWEEP operator and the Moore-Penrose inverse. The SWEEP operator is used by PROC GLM to obtain the parameter estimates. The SWEEP operator produces a generalized inverse that is not unique. In particular, the SWEEP operator computes a generalized inverse that depends on the order of the columns in the design matrix.

The following SAS/IML statements read in the design matrices for each GLM parameterization and use the SWEEP function to reproduce the parameter estimates that are computed by the GLM procedure. For each design matrix, the program computes solutions to the normal equations (X`*X)*b = (X`*Y). The program also computes the Moore-Penrose solution for each design matrix.

proc iml;
/* read design matrix and form X`X and X`*Y */
use GLMDesign1; read all var _NUM_ into Z[c=varNames]; close;
p = ncol(Z);
X = Z[, 1:(p-1)];  Y = Z[, p];  vNames = varNames[,1:(p-1)];
A = X`*X;  c = X`*Y;
/* obtain G2 and Moore-Penrose solution for this design matrix */
Sweep1 = sweep(A)*c;
GInv1  = ginv(A)*c;
print Sweep1[r=vNames], GInv1;
/* read other design matrix and form X`X and X`*Y */
use GLMDesign2; read all var _NUM_ into Z[c=varNames]; close;
p = ncol(Z);
X = Z[, 1:(p-1)];  Y = Z[, p]; vNames = varNames[,1:(p-1)];
A = X`*X;  c = X`*Y;
/* obtain G2 and Moore-Penrose solution for this design matrix */
Sweep2 = sweep(A)*c;
GInv2 = ginv(A)*c;
print Sweep2[r=vNames], GInv2;

The results demonstrate that the SWEEP solution depends on the order of columns in a linearly dependent design matrix. However, the Moore-Penrose solution does not depend on the order. The Moore-Penrose solution is the same no matter which reference levels you choose for the GLM parameterization of classification effects.

In summary, the scary note that PROC GLM produces reminds you of the following mathematical facts:

  • When you include classification effects in a linear regression model and use the GLM parameterization to construct the design matrix, the design matrix has linearly dependent columns.
  • The X`X matrix is singular when X has linearly dependent columns. Consequently, the parameter estimates for least squares regression are not unique.
  • From among the infinitely many solutions to the normal equations, the solution that PROC GLM (and other SAS procedures) computes is based on a generalized inverse that is computed by using the SWEEP operator.
  • The solution obtained by the SWEEP operator depends on the reference levels for the CLASS variables.

If the last fact bothers you (it shouldn't), an alternative estimate is available by using the GINV function to compute the Moore-Penrose inverse. The corresponding estimate is unique and does not depend on the reference level.

The post Singular parameterizations, generalized inverses, and regression estimates appeared first on The DO Loop.

November 21, 2018

A data analyst asked how to compute parameter estimates in a linear regression model when the underlying data matrix is rank deficient. This situation can occur if one of the variables in the regression is a linear combination of other variables. It also occurs when you use the GLM parameterization of a classification variable. In the GLM parameterization, the columns of the design matrix are linearly dependent. As a result, the matrix of crossproducts (the X`X matrix) is singular. In either case, you can understand the computation of the parameter estimates by learning about generalized inverses in linear systems. This article presents an overview of generalized inverses. A subsequent article will specifically apply generalized inverses to the problem of estimating parameters for regression problems with collinearities.

What is a generalized inverse?

Recall that the inverse matrix of a square matrix A is a matrix G such that A*G = G*A = I, where I is the identity matrix. When such a matrix exists, it is unique and A is said to be nonsingular (or invertible). If there are linear dependencies in the columns of A, then an inverse does not exist. However, you can define a series of weaker conditions that are known as the Penrose conditions:

  1. A*G*A = A
  2. G*A*G = G
  3. (A*G)` = A*G
  4. (G*A)` = G*A

Any matrix, G, that satisfies the first condition is called a generalized inverse (or sometimes a "G1" inverse) for A. A matrix that satisfies the first and second condition is called a "G2" inverse for A. The G2 inverse is used in statistics to compute parameter estimates for regression problems (see Goodnight (1979), p. 155). A matrix that satisfies all four conditions is called the Moore-Penrose inverse or the pseudoinverse. When A is square but singular, there are infinitely many matrices that satisfy the first two conditions, but the Moore-Penrose inverse is unique.

Computations with generalized inverses

In regression problems, the parameter estimates are obtained by solving the "normal equations." The normal equations are the linear system (X`*X)*b = (X`*Y), where X is the design matrix, Y is the vector of observed responses, and b is the vector of parameter estimates to be computed. The matrix A = X`*X is symmetric. If the columns of the design matrix are linearly dependent, then A is singular. The following SAS/IML program defines a symmetric singular matrix A and a right-hand-side vector c, which you can think of as X`*Y in the regression context. The call to the DET function computes the determinant of the matrix. A zero determinant indicates that A is singular and that there are infinitely many vectors b that solve the linear system:

proc iml;
A = {100  50 20 10,
      50 106 46 23,
      20  46 56 28,
      10  23 28 14 };
c = {130, 776, 486, 243};
det = det(A);         /* demonstrate that A is singular */
print det;

For nonsingular matrices, you can use either the INV or the SOLVE function in SAS/IML to compute the unique solution of the linear system. However, both functions give errors when called with a singular matrix. SAS/IML provides several ways to compute a generalized inverse, including the SWEEP function and the GINV function. The SWEEP function is an efficient way to use Gaussian elimination to solve the symmetric linear systems that arise in regression. The GINV function computes the Moore-Penrose pseudoinverse. The following SAS/IML statements compute two different solutions for the singular system A*b = c:

b1 = ginv(A)*c;       /* solution even if A is not full rank */
b2 = sweep(A)*c;
print b1 b2;
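As a quick numerical check (a sketch that reuses the matrix A defined above), you can verify the Penrose conditions for the Moore-Penrose inverse. All four quantities below are zero up to rounding error; if you replace GINV(A) with SWEEP(A), only the first two are zero:

/* numerically verify the four Penrose conditions for G = GINV(A) */
G = ginv(A);
c1 = max(abs(A*G*A - A));      /* condition 1 */
c2 = max(abs(G*A*G - G));      /* condition 2 */
c3 = max(abs((A*G)` - A*G));   /* condition 3 */
c4 = max(abs((G*A)` - G*A));   /* condition 4 */
print c1 c2 c3 c4;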

The SAS/IML language also provides a way to obtain any of the other infinitely many solutions to the singular system A*b = c. Because A is a rank-3 matrix (the rank deficiency is 1), it has a one-dimensional kernel (null space). The HOMOGEN function in SAS/IML computes a basis for the null space. That is, it computes a vector that is mapped to the zero vector by A. The following statements compute the unit basis vector for the kernel. The output shows that the vector is mapped to the zero vector:

xNull = homogen(A);   /* basis for nullspace of A */
print xNull (A*xNull)[L="A*xNull"];

All solutions to A*b = c are of the form b + α*xNull, where b is any particular solution.

Properties of the Moore-Penrose solution

You can verify that the Moore-Penrose matrix GINV(A) satisfies the four Penrose conditions, whereas the G2 inverse (SWEEP(A)) only satisfies the first two conditions. I mentioned that the singular system has infinitely many solutions, but the Moore-Penrose solution (b1) is unique. It turns out that the Moore-Penrose solution is the solution that has the smallest Euclidean norm. Here is a computation of the norm for three solutions to the system A*b = c:

/* GINV gives the estimate that has the smallest L2 norm: */
GINVnorm  = norm(b1);
sweepNorm = norm(b2);
/* you can add alpha*xNull to any solution to get another solution */
b3 = b1 + 2*xNull;  /* here's another solution (alpha=2) */
otherNorm = norm(b3);
print ginvNorm sweepNorm otherNorm;

Because all solutions are of the form b1 + α*xNull, where xNull is the basis for the nullspace of A, you can graph the norm of the solutions as a function of α. The graph is shown below and indicates that the Moore-Penrose solution is the minimal-norm solution:

alpha = do(-2.5, 2.5, 0.05);
norm = j(1, ncol(alpha), .);
do i = 1 to ncol(alpha);
   norm[i] = norm(b1 + alpha[i] * xNull);
end;
title "Euclidean Norm of Solutions b + alpha*xNull";
title2 "b = Solution from Moore-Penrose Inverse";
title3 "xNull = Basis for Nullspace of A";
call series(alpha, norm) 
     other="refline 0 / axis=x label='b=GINV';refline 1.78885 / axis=x label='SWEEP';";
Graph of norm of solutions to the singular system A*b=c. The norm is plotted for vectors b + alpha*x_Null where b is the Moore-Penrose solution and x_Null is a basis for the nullspace of A.

In summary, a singular linear system has infinitely many solutions. You can obtain a particular solution by using the sweep operator or by finding the Moore-Penrose solution. You can use the HOMOGEN function to obtain the full family of solutions. The Moore-Penrose solution is expensive to compute but has an interesting property: it is the solution that has the smallest Euclidean norm. The sweep solution is more efficient to compute and is used in SAS regression procedures such as PROC GLM to estimate parameters in models that include classification variables and use a GLM parameterization. The next blog post explores regression estimates in more detail.

The post Generalized inverses for matrices appeared first on The DO Loop.

October 22, 2018

A SAS programmer asked how to rearrange elements of a matrix. The rearrangement he wanted was rather complicated: certain blocks of data needed to move relative to other blocks, but the values within each block were to remain unchanged. It turned out that the mathematical operation he needed is called a block transpose. The BTRAN function in SAS/IML performs block-transpose operations, so the complicated rearrangement was easy to implement with a judicious call to the BTRAN function.

This article discusses the block-transpose operation and gives an example. It also shows how a block transpose can conveniently transform wide data into long data and vice versa.

The block-transpose operation

In general, a block matrix can contain blocks of various sizes. However, the BTRAN function requires that all blocks be the same size. Specifically, suppose A is a block matrix where each block is an n x m matrix. That means that A is an (r n) x (c m) matrix for some whole numbers r and c. The BTRAN function transposes the blocks to create a (c n) x (r m) block matrix.

The idea is more easily explained by a picture. The following image shows a matrix A that is composed of six blocks (of the same size). The block-transpose operation rearranges the blocks but leaves the elements within the blocks unchanged.

A block-transpose operation transposes the blocks but leaves the contents of each block unchanged

The BTRAN function in SAS/IML

Let's see how the block transpose works on an example in the SAS/IML language. The following statements create a 4 x 12 matrix. I have overlaid some grid lines to help you visualize this matrix as a 2 x 3 block matrix, where each block is 2 x 4.

proc iml;
x = repeat(1:12,2);
x = x // (10+x);
print x;
A 4x12 matrix visualized as a 2x3 block matrix. Each block is 2x4.

You can use the BTRAN function to apply a block transpose. The first argument is the matrix that contains the data. The second and third arguments specify the size of the blocks:

y = btran(x, 2, 4);    /* block size is 2x4 */
print y;
A 6x6 matrix visualized as a 3x2 block matrix. Each block is 2x4.

The output is a 6 x 6 matrix, visualized as a 3 x 2 block matrix.

Block transpose for wide-to-long transforms

If the number of rows in the block equals the number of rows in the data, then the BTRAN function stacks columns. In particular, if the block size is n x 1, the BTRAN function stacks multiple variables into a single column, just like the SHAPECOL function. For example, the Sashelp.Iris data contains four variables named SepalLength, SepalWidth, PetalLength, and PetalWidth. The following SAS/IML statements stack the four variables into a single column and create a new column that identifies the name of the original variable:

proc iml;
/* wide to long */
varNames = {"SepalLength" "SepalWidth" "PetalLength" "PetalWidth"};
use Sashelp.Iris;
   read all var "Species";        /* 150 x 1 */
   read all var varNames into X;  /* 150 x 4 */
n = nrow(X); m = ncol(X);         /* n = 150; m = 4 */
ID = repeat(Species, m);          /* repeat the vars that are not transposed */
Var = shapecol(repeat(varNames, n), n*m); /* create column that contains variable names */
Value = btran(X, n, 1);           /* vector with 150*4 rows */ 
create test var {ID "Var" "Value"}; append; close;
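As a quick check, you can peek at the first few rows of the stacked vectors before or after writing them to the data set:

/* quick check: view the first few rows of the long data */
print (ID[1:3])[L="ID"] (Var[1:3])[L="Var"] (Value[1:3])[L="Value"];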

Block transpose for long-to-wide transforms

In a similar way, you can transpose balanced data from long form to wide form. (Recall that "balanced data" means that each group has the same number of observations.) For example, the Sashelp.Iris data contains a Species variable. The first 50 observations have the value 'Setosa', the next 50 have the value 'Versicolor', and the last 50 have the value 'Virginica'. You can use a WHERE clause or a BY statement to analyze each species separately, but suppose you want to create a new data set that contains only 50 observations and has variables named SepalLengthSetosa, SepalLengthVersicolor, SepalLengthVirginica, and so forth for the other variables. The BTRAN function makes this easy: just specify a block size of 50 x 4, as follows.

/* create pairwise combinations of var names and group levels */
BlockRows = 50;
w = btran(X, BlockRows, m);                  /* 50 x 12 matrix */
s = Species[ do(1, n, BlockRows) ];          /* s = {'Setosa', 'Versicolor', 'Virginica'} */
vNames = rowcatc(expandgrid(varNames, s));   /* combine var names and species */
print vNames, w[c=vNames L="Measurements (cm)"];

In summary, the BTRAN function is a useful function when you need to rearrange blocks of data without changing the values in a block. For balanced designs, you can use the BTRAN function to convert data between wide and long formats.

The post Transpose blocks to reshape data appeared first on The DO Loop.

October 3, 2018

It is sometimes necessary for researchers to simulate data with thousands of variables. It is easy to simulate thousands of uncorrelated variables, but more difficult to simulate thousands of correlated variables. For that, you can generate a correlation matrix that has special properties, such as a Toeplitz matrix or a first-order autoregressive (AR(1)) correlation matrix. I have previously written about how to generate a large Toeplitz matrix in SAS. This article describes three useful results for an AR(1) correlation matrix:

  • How to generate an AR(1) correlation matrix in the SAS/IML language
  • How to use a formula to compute the explicit Cholesky root of an AR(1) correlation matrix.
  • How to efficiently simulate multivariate normal variables with AR(1) correlation.

Generate an AR(1) correlation matrix in SAS

The AR(1) correlation structure is used in statistics to model observations that have correlated errors. (For example, see the documentation of PROC MIXED in SAS.) If Σ is an AR(1) correlation matrix, then its elements are constant along diagonals. The (i,j)th element of an AR(1) correlation matrix has the form Σ[i,j] = ρ^|i – j|, where ρ is a constant that determines the geometric rate at which correlations decay between successive time intervals. The exponent for each term is the L1 distance between the row number and the column number. As I have shown in a previous article, you can use the DISTANCE function in SAS/IML 14.3 to quickly evaluate functions that depend on the distance between two sets of points. Consequently, the following SAS/IML function computes the correlation matrix for a p x p AR(1) matrix:

proc iml;
/* return p x p matrix whose (i,j)th element is rho^|i - j| */
start AR1Corr(rho, p); 
   return rho##distance(T(1:p), T(1:p), "L1");
finish;
/* test on 10 x 10 matrix with rho = 0.8 */
rho = 0.8;  p = 10;
Sigma = AR1Corr(rho, p);
print Sigma[format=Best7.];

A formula for the Cholesky root of an AR(1) correlation matrix

Every covariance matrix has a Cholesky decomposition, which represents the matrix as the crossproduct of a triangular matrix with itself: Σ = R`*R, where R is upper triangular. In SAS/IML, you can use the ROOT function to compute the Cholesky root of an arbitrary positive definite matrix. However, the AR(1) correlation matrix has an explicit formula for the Cholesky root in terms of ρ. This explicit formula does not appear to be well known by statisticians, but it is a special case of a general formula developed by V. Madar (Section 5.1, 2016), who presented a poster at a Southeast SAS Users Group (SESUG) meeting a few years ago. An explicit formula means that you can compute the Cholesky root matrix, R, in a direct and efficient manner, as follows:

/* direct computation of Cholesky root for an AR(1) matrix */
start AR1Root(rho, p);
   R = j(p,p,0);                /* allocate p x p matrix */
   R[1,] = rho##(0:p-1);        /* formula for 1st row */
   c = sqrt(1 - rho**2);        /* scaling factor: c^2 + rho^2 = 1 */
   R2 = c * R[1,];              /* formula for 2nd row */
   do j = 2 to p;               /* shift elements in 2nd row for remaining rows */
      R[j, j:p] = R2[,1:p-j+1]; 
   end;
   return R;
finish;
R = AR1Root(rho, p);   /* compare to R = root(Sigma), which requires forming Sigma */
print R[L="Cholesky Root" format=Best7.];

You can compute an AR(1) covariance matrix from the correlation matrix by multiplying the correlation matrix by a positive scalar, σ^2.
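For example, the following sketch (with a hypothetical standard deviation σ = 2) scales the correlation matrix into a covariance matrix:

/* sketch: covariance = sigma^2 times the AR(1) correlation matrix */
sigma = 2;                         /* hypothetical standard deviation */
Cov = sigma##2 # AR1Corr(rho, p);  /* elementwise scaling */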

Efficient simulation of multivariate normal variables with AR(1) correlation

An efficient way to simulate data from a multivariate normal population with covariance Σ is to use the Cholesky decomposition to induce correlation among a set of uncorrelated normal variates. This is the technique used by the RandNormal function in SAS/IML software. Internally, the RandNormal function calls the ROOT function, which can compute the Cholesky root of an arbitrary positive definite matrix.

When there are thousands of variables, the Cholesky decomposition might take a second or more. If you call the RandNormal function thousands of times during a simulation study, you pay that one-second penalty during each call. For the AR(1) covariance structure, you can use the explicit formula for the Cholesky root to save a considerable amount of time. You also do not need to explicitly form the p x p matrix, Σ, which saves RAM. The following SAS/IML function is an efficient way to simulate N observations from a p-dimensional multivariate normal distribution that has an AR(1) correlation structure with parameter ρ:

/* simulate multivariate normal data from a population with AR(1) correlation */
start RandNormalAR1( N, Mean, rho );
    mMean = rowvec(Mean);
    p = ncol(mMean);
    U = AR1Root(rho, p);      /* use explicit formula instead of ROOT(Sigma) */
    Z = j(N,p);
    call randgen(Z,'NORMAL');
    return (mMean + Z*U);
finish;
call randseed(12345);
p = 1000;                  /* big matrix */
mean = j(1, p, 0);         /* mean of MVN distribution */
/* simulate data from MVN distribs with different values of rho */
v = do(0.01, 0.99, 0.01);  /* choose rho from list 0.01, 0.02, ..., 0.99 */
t0 = time();               /* time it! */
do i = 1 to ncol(v);
   rho = v[i];
   X = randnormalAR1(500, mean, rho);  /* simulate 500 obs from MVN with p vars */
end;
t_SimMVN = time() - t0;     /* total time to simulate all data */
print t_SimMVN;

The previous loop generates samples that contain N=500 observations and p=1000 variables. Each sample is from a multivariate normal distribution that has an AR(1) correlation, but each sample is generated for a different value of ρ, where ρ = 0.01, 0.02, ..., 0.99. On my desktop computer, this simulation of 99 correlated samples takes about 4 seconds. This is about 25% of the time for the same simulation that explicitly forms the AR(1) correlation matrix and calls RandNormal.

In summary, the AR(1) correlation matrix is an easy way to generate a symmetric positive definite matrix. You can use the DISTANCE function in SAS/IML 14.3 to create such a matrix, but for some applications you might only require the Cholesky root of the matrix. The AR(1) correlation matrix has an explicit Cholesky root that you can use to speed up simulation studies such as generating samples from a multivariate normal distribution that has an AR(1) correlation.

The post Fast simulation of multivariate normal data with an AR(1) correlation structure appeared first on The DO Loop.

July 9, 2018

Back in high school, you probably learned to find the intersection of two lines in the plane. The intersection requires solving a system of two linear equations. There are three cases: (1) the lines intersect in a unique point, (2) the lines are parallel and do not intersect, or (3) the lines are coincident. Thus, for two lines, the intersection problem has either 1, 0, or infinitely many solutions. Most students quickly learn that the lines always intersect when their slopes are different, whereas the special cases (parallel or coincident) occur when the lines have the same slope.

Recently I had to find the intersection between two line segments in the plane. Line segments have finite extent, so segments with different slopes may or may not intersect. For example, the following panel of graphs shows three pairs of line segments in the plane. In the first panel, the segments intersect. In the second panel, the segments have the same slopes as in the first panel, but these segments do not intersect. In the third panel, the segments intersect in an interval. This article shows how to construct a linear system of equations that distinguishes between the three cases and compute an intersection point, if it exists.

Parameterization of line segments

Let p1 and p2 be the endpoints of one segment and let q1 and q2 be the endpoints of the other. Recall that a parametrization of the first segment is (1-s)*p1 + s*p2, where s ∈ [0,1] and the endpoints are treated as 2-D vectors. Similarly, a parametrization of the second segment is (1-t)*q1 + t*q2, where t ∈ [0,1]. Consequently, the segments intersect if and only if there exists values for (s,t) in the unit square such that
(1-s)*p1 + s*p2 = (1-t)*q1 + t*q2
You can rearrange the terms to rewrite the equation as
(p2-p1)*s + (q1-q2)*t = q1 - p1

This is a vector equation which can be rewritten in terms of matrices and vectors. Define the 2 x 2 matrix A whose first column contains the elements of (p2-p1) and whose second column contains the elements of (q1-q2). Define b = q1 - p1 and x = (s,t). If the solution of the linear system A*x = b is in the unit square, then the segments intersect. If the solution is not in the unit square, the segments do not intersect. If the segments have the same slope, then the matrix A is singular and you need to perform additional tests to determine whether the segments intersect.
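Written out in coordinates (with subscripts x and y denoting the components of each point), the 2 x 2 system is

\[
\begin{pmatrix} p_{2x}-p_{1x} & q_{1x}-q_{2x} \\ p_{2y}-p_{1y} & q_{1y}-q_{2y} \end{pmatrix}
\begin{pmatrix} s \\ t \end{pmatrix} =
\begin{pmatrix} q_{1x}-p_{1x} \\ q_{1y}-p_{1y} \end{pmatrix}
\]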

A vector solution for the intersection of segments

As shown above, the intersection of two planar line segments is neatly expressed in terms of a matrix-vector system. In SAS, the SAS/IML language provides a natural syntax for expressing and solving matrix-vector equations. The following SAS/IML function constructs and solves a linear system. For simplicity, this version does not handle the degenerate case of two segments that have the same slope. That case is handled in the next section.

start IntersectSegsSimple(p1, p2, q1, q2);
   b = colvec(q1 - p1); 
   A = colvec(p2-p1) || colvec(q1-q2); /* nonsingular when segments have different slopes */
   x = solve(A, b);                    /* x = (s,t) */
   if all(0<=x && x<=1) then           /* if x is in [0,1] x [0,1] */
      return (1-x[1])*p1 + x[1]*p2;    /* return intersection */
   else                                /* otherwise, segments do not intersect */
      return ({. .});                  /* return missing values */
finish;
/* Test 1: intersection at (0.95, 1.25) */
p1 = {1.8 2.1};   p2 = {0.8 1.1};
q1 = {1 1.25};    q2 = {0 1.25};
z = IntersectSegsSimple(p1,p2,q1,q2);
print z;
/* Test 2: no intersection */
p1 = {-1 0.5};  p2 = {1 0.5};
q1 = {0 1};     q2 = {0 2};
v = IntersectSegsSimple(p1, p2, q1, q2);
print v;

The function contains only a few statements. The function is called to solve the examples in the first two panels of the previous graph. The SOLVE function solves the linear system (assuming that a solution exists), and the IF-THEN statement tests whether the solution is in the unit square [0,1] x [0,1]. If so, the function returns the point of intersection. If not, the function returns a pair of missing values.

The general case: Handling overlapping line segments

For many applications, the function in the previous section is sufficient because it handles the generic cases. For completeness the following module also handles segments that have identical slopes. The DET function determines whether the segments have the same slope. If so, the segments could be parallel or collinear. To determine whether collinear segments intersect, you can test for three conditions:

  • The endpoint q1 is inside the segment [p1, p2].
  • The endpoint q2 is inside the segment [p1, p2].
  • The endpoint p1 is inside the segment [q1, q2].

Notice that the condition "p2 is inside [q1,q2]" does not need to be checked separately because it is already handled by the existing checks. If any of the three conditions are true, there are infinitely many solutions (or the segments share an endpoint). If none of the conditions hold, the segments do not intersect. For overlapping segments, the following function returns an endpoint of the intersection interval.

/* handle all cases: determine intersection of two planar line segments 
   [p1, p2] and [q1, q2] */
start Intersect2DSegs(p1, p2, q1, q2);
   b = colvec(q1 - p1); 
   A = colvec(p2-p1) || colvec(q1-q2);
   if det(A)^=0 then do;         /* nonsingular system: 0 or 1 intersection */
      x = solve(A, b);           /* x = (s,t) */
      if all(0<=x && x<=1) then  /* if x is in [0,1] x [0,1] */
         return (1-x[1])*p1 + x[1]*p2;  /* return intersection */
      else                       /* segments do not intersect */
         return ({. .});         /* return missing values */
   end;
   /* segments are collinear: 0 or infinitely many intersections */
   denom = choose(p2-p1=0, ., p2-p1);  /* protect against division by 0 */
   s = (q1 - p1) / denom;        /* Is q1 in [p1, p2]? */
   if any(0<=s && s<=1) then
      return q1;
   s = (q2 - p1) / denom;        /* Is q2 in [p1, p2]? */
   if any(0<=s && s<=1) then
      return q2;
   denom = choose(q2-q1=0, ., q2-q1);  /* protect against division by 0 */
   s = (p1 - q1) / denom;        /* Is p1 in [q1, q2]? */
   if any(0<=s && s<=1) then
      return p1;
   return ({. .});               /* segments are disjoint */
finish;
/* test overlapping segments; return endpoint of one segment */
p1 = {-1 1};  p2 = {1 1};
q1 = {0 1};   q2 = {2 1};
w = Intersect2DSegs(p1, p2, q1, q2);
print w;

In summary, by using matrices, vectors, and linear algebra, you can easily solve for the intersection of two line segments or determine that the segments do not intersect. The general case needs some special logic to handle degenerate configurations, but the code that solves the generic cases is straightforward when expressed in a vectorized language such as SAS/IML.

The post The intersection of two line segments appeared first on The DO Loop.

April 18, 2018

The sweep operator performs elementary row operations on a system of linear equations. The sweep operator enables you to build regression models by "sweeping in" or "sweeping out" particular rows of the X`X matrix. As you do so, the estimates for the regression coefficients, the error sum of squares, and the generalized inverse of the system are updated simultaneously. This is true not only for a single response variable but also for multiple responses.

You might remember elementary row operations from when you learned Gaussian elimination, which is a general technique for solving linear systems of equations. Sweeping a matrix is similar to Gaussian elimination, but the sweep operator is most often used to solve least squares problems for the normal equations (X`*X)*b = X`*Y. Therefore, in practice, the sweep operator is usually applied to the rows of a symmetric positive definite system.

How to use the sweep operator to obtain least squares estimates

Before we discuss details of the sweep operator, let's show how it produces regression estimates. The following program uses the SWEEP function in SAS/IML on a six-observation data set from J. Goodnight (1979, p. 153). The first three columns of the matrix M contain the design matrix for the explanatory variables; the first column of M represents the intercept parameter. The last column of the augmented matrix is the response variable. The sweep operator is called on the uncorrected sum of squares and crossproducts matrix (the USSCP matrix), which is computed as M`*M. In block form, the USSCP matrix contains the submatrices X`X, X`Y, Y`X, and Y`Y. The following program sweeps the first three rows of the USSCP matrix:

proc iml;
/*Intercept X1  X2  Y */
M  = {  1   1   1   1,
        1   2   1   3,
        1   3   1   3,
        1   1  -1   2,
        1   2  -1   2,
        1   3  -1   1};
S = sweep(M`*M, 1:3);     /* sweep first three rows (Intercept, X1, X2) */
print S;
The sweep operator applied to all rows of a USSCP matrix to obtain regression coefficients, SSE, and generalized inverse

The resulting matrix (S) is divided into blocks by black lines. The leading 3x3 block is the (generalized) inverse of the X`X matrix. The first three elements of the last column (outlined in red) are the least squares estimates for the regression model. The lower right cell (outlined in blue) is the sum of squared errors (SSE) for the regression model.

You can compare the result with the results from PROC REG, as follows. The parameter estimates and error sum of squares are highlighted for easy comparison:

data Have;
input X1 X2 Y @@;
datalines;
1   1   1   2   1   3   3   1   3
1  -1   2   2  -1   2   3  -1   1
;
proc reg data=Have USSCP plots=none;
   model Y = X1 X2;
   ods select USSCP ANOVA ParameterEstimates;
run; quit;
Regression estimates and SSE for least squares analysis in SAS PROC REG

As claimed, the sweep operator produces the same parameter estimates and SSE statistic as PROC REG. The next sections discuss additional details of the sweep operator.

The basics of the sweep operator

Goodnight (1979) defines the sweep operator as the following sequence of row operations. Given a symmetric positive definite matrix A, SWEEP(A, k) modifies the matrix A by using the pivot element A[k,k] and the k_th row, as follows:

  1. Let D = A[k,k] be the k_th diagonal element.
  2. Divide the k_th row by D.
  3. For every other row i ≠ k, let B = A[i,k] be the i_th element of the k_th column. Subtract B x (row k) from row i. Then set A[i,k] = –B/D.
  4. Set A[k,k] = 1/D.

You could program these steps in a matrix language such as SAS/IML, but as shown previously, the SAS/IML language supports the SWEEP function as a built-in function.
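For instance, here is a direct (and unoptimized) sketch of those four steps; it is meant only to illustrate the definition, and the final statement checks it against the built-in SWEEP function:

proc iml;
/* sketch: SWEEP(A, k) implemented directly from the four steps above */
start SweepRow(A, k);
   S = A;
   D = S[k,k];                 /* 1. pivot element */
   S[k,] = S[k,] / D;          /* 2. divide the k_th row by D */
   do i = 1 to nrow(S);        /* 3. row operations on the other rows */
      if i ^= k then do;
         B = S[i,k];
         S[i,] = S[i,] - B*S[k,];
         S[i,k] = -B/D;
      end;
   end;
   S[k,k] = 1/D;               /* 4. replace the pivot by its reciprocal */
   return S;
finish;

A = {4 2, 2 3};                /* small SPD test matrix */
maxDiff = max(abs( SweepRow(A,1) - sweep(A,1) ));
print maxDiff;                 /* 0, up to rounding error */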

Sweeping in effects

The following program uses the same data as the first section. The USSCP matrix (S0) represents the normal equations (plus a little extra information). If you sweep the first row of the S0 matrix, you solve the regression problem for an intercept-only model:

proc iml;
/*Intercept X1  X2  Y */
M  = {  1   1   1   1,    1   2   1   3,    1   3   1   3,
        1   1  -1   2,    1   2  -1   2,    1   3  -1   1};
S0 = M`*M;          print S0;   /* USSCP matrix */
/* sweep in 1st row (Intercept) */
S1 = sweep(S0, 1);   print S1[F=BEST6.];
The sweep operator applied to the first row of a USSCP matrix to obtain an intercept-only model

The first element of the fourth column (S1[1,4], outlined in red) is the parameter estimate for the intercept-only model. The last element (S1[4,4], outlined in blue) is the sum of squared errors (SSE) for the intercept-only model.

If you "sweep in" the second row, you solve the least squares problem for a model that includes the intercept and X1. The corresponding elements of the last column contain the parameter estimates for the model. The lower-right element again contains the SSE for the revised model. If you proceed to "sweep in" the third row, the last column again contains the parameter estimates and the SSE, this time for the model that includes the intercept, X1, and X2:

/* sweep in 2nd row (X1) and 3rd row (X2) */
S2 = sweep(S1, 2);   print S2[F=BEST6.]; 
S3 = sweep(S2, 3);   print S3[F=BEST6.];
The sweep operator applied to subsequent rows of a USSCP matrix to 'sweep in' additional effects

Sweeping out effects

One of the useful features of the sweep operators is that you can remove an effect from a model as easily as you can add it. The sweep operator has a reversibility property: if you sweep a row a second time you "undo" the first sweep. In symbols, the operator has the property that SWEEP( SWEEP(A,k), k) = A for any row k. For example, if you want to take the X1 variable out of the model, merely sweep on the second row again:

S4 = sweep(S3, 2);   print S4[F=BEST6.]; /* model with Intercept + X2 */
The sweep operator applied twice 'sweeps out' an effect from a regression model

Of course, as shown in an earlier section, you do not need to sweep in one row at a time. You can sweep in multiple rows with one call by specifying a vector of indices. In fact, the sweep operator also commutes with itself, which means that you can specify the rows in any order. The call SWEEP(S0, 1:3) is equivalent to SWEEP(S0, {3 1 2}) or SWEEP(S0, {2 3 1}) or any other permutation of {1 2 3}.
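A quick check (using the S0 matrix defined above) confirms the commutativity:

/* sweeps commute: the order of the pivot rows does not matter */
d = max(abs( sweep(S0, 1:3) - sweep(S0, {3 1 2}) ));
print d;    /* 0, up to rounding error */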


The SWEEP operator (which is implemented in the SWEEP function in SAS/IML) enables you to construct least squares model estimates from the USSCP matrix, which is a block matrix that contains the normal equations. You can "sweep in" rows to add effects to a model or "sweep out" rows to remove effects. After each operation, the parameter estimates appear in the portion of the adjusted USSCP matrix that corresponds to the block for X`Y. Furthermore, the residual sum of squares appears in the portion of the adjusted USSCP matrix that corresponds to Y`Y.

Further reading

The sweep operator is known by different names in different fields and has been rediscovered several times. In statistics, the operator was popularized in a TAS article by J. Goodnight (1979) who mentions Ralston (1960) and Beaton (1964) as early researchers. Outside of statistics, the operator is known as the Principal Pivot Transform (Tucker, 1960), gyration (Duffy, Hazony, and Morrison, 1966), or exchange (Stewart and Stewart, 1998). Tsatsomeros (2000) provides an excellent review of the literature and the history of the sweep operator.

The post The sweep operator: A fundamental operation in regression appeared first on The DO Loop.

April 11, 2018

Sometimes it is important to ensure that a matrix has unique rows. When the data are all numeric, there is an easy way to detect (and delete!) duplicate rows in a matrix.

The main idea is to subtract one row from another. Start with the first row and subtract it from every row beneath it. If any row of the difference matrix is identically zero, then you have found a row that is identical to the first row. Then do the same thing for the second row: subtract it from the third and higher rows and see if you obtain a row of zeros. Repeat this process for the third and higher rows.

The following SAS/IML program implements this algorithm for an arbitrary numeric matrix. It returns a binary indicator variable (a column vector). The i_th row of the returned vector is 1 for rows that are duplicates of some previous row:

proc iml;
/* return binary vector that indicates which rows in a 
   numerical matrix are duplicates */
start DupRows(A);
   N = nrow(A);
   dupRows = j(N, 1, 0);         /* binary indicator matrix */
   do i = 1 to N-1;
      if dupRows[i] = 0 then do; /* skip rows that are known to be duplicates */
         r = i+1:N;              /* remaining rows */ 
         M = A[r, ]-A[i, ];      /* subtract current row from remaining rows */
         b = M[ ,##];            /* sum of squares = 0 iff duplicate row */
         idx = loc( b=0 );       /* any duplicate rows for current row? */
         if ncol(idx) > 0 then
            dupRows[r[idx]] = 1; /* if so, flag them */
      end;
   end;
   return dupRows;
finish;

To test the function, consider the following 6 x 3 matrix. You can see by inspection that the matrix has three duplicate rows: the third, fifth, and sixth rows. You can call the DupRows function and print the matrix adjacent to the binary vector that indicates the duplicate rows:

A = {1 1 1,
     1 2 3,
     1 1 1,  /* duplicate row */
     3 2 1,
     3 2 1,  /* duplicate row */
     1 1 1};  /* duplicate row */
DupRows = DupRows(A);
print A DupRows;

You can use the DupRows function to write a function that excludes the duplicate rows in a matrix and returns only the unique rows, as follows:

/* return the unique rows in a numerical matrix */
start UniqueRows(A);
   uniqRows = loc(DupRows(A)=0);
   return(A[uniqRows, ]);   /* return rows that are NOT duplicates */
finish;
U = UniqueRows(A);
print U;

I like this algorithm because it uses subtraction, which is a very fast operation. However, this algorithm has a complexity of O(n(n-1)/2), where n is the number of rows in the matrix, so it is best for small to moderate values of n.

For long and skinny matrices (or for character data), it might be more efficient to sort the matrix as I showed in a previous article about how to find (and count) the unique rows of a matrix. However, the sorting algorithm requires that you sort the data by ALL columns, which can be inefficient for very wide and short data tables. For small and medium-sized data, you can use either method to find the unique rows in a matrix.

This article discusses how to remove duplicates records for numerical matrices in SAS/IML. In Base SAS, you can use PROC SORT or PROC SQL to remove duplicate records from a SAS DATA set, as shown in Lafler (2017) and other references.

The post Find the unique rows of a numeric matrix appeared first on The DO Loop.

March 21, 2018
Pseudocode for the conjugate gradient method for solving a linear system (from Wikipedia)

I often claim that the "natural syntax" of the SAS/IML language makes it easy to implement an algorithm or statistical formula as it appears in a textbook or journal. The other day I had an opportunity to test the truth of that statement. A SAS programmer wanted to implement the conjugate gradient algorithm, which is an iterative method for solving a system of equations with certain properties. I looked up the Wikipedia article about the conjugate gradient method and saw the following text:

"The algorithm is detailed below for solving Ax = b where A is a real, symmetric, positive-definite matrix. The input vector x0 can be an approximate initial solution or 0." The text was followed by pseudocode (in the box; click to enlarge)

The conjugate gradient method in SAS/IML

I used the pseudocode to implement the conjugate gradient method in SAS/IML. (The method is explained further in the next section.) I chose not to use the 'k' and 'k+1' notation but merely to overwrite the old values of variables with new values. In the program, the variables x, r, and p are vectors and the variable A is a matrix.

/* Linear conjugate gradient method as presented in Wikipedia:
   Solve the linear system A*x = b, where A is a symmetric positive definite matrix.
   The algorithm converges in at most n iterations, where n is the dimension of A.
   This function requires an initial guess, x0, which can be the zero vector.
   The function does not verify that the matrix A is symmetric positive definite.
   This module returns a matrix mX= x1 || x2 || ... || xn 
   whose columns contain the iterative path from x0 to the approximate solution vector xn.  */
proc iml;
start ConjGrad( x0, A, b, Tolerance=1e-6 );
  x = x0;                            /* initial guess */
  r = b - A*x0;                      /* residual */
  p = r;                             /* initial direction */
  mX = j(nrow(A), nrow(A), .);       /* optional: each column is result of an iteration */
  do k = 1 to ncol(mX) until ( done );
    rpr = r`*r;  Ap = A*p;           /* store partial results */
    alpha =  rpr / (p`*Ap);          /* step size */
    x = x + alpha*p;   mX[,k] = x;   /* remember new guess */
    r = r - alpha*Ap;                /* new residual */
    done = (sqrt(rpr) < Tolerance);  /* stop if ||r|| < Tol */
    beta = (r`*r) / rpr;             /* new stepsize */
    p = r + beta*p;                  /* new direction */
  end;
  return ( mX );     /* I want to return the entire iteration history, not just x */
finish;

The SAS/IML program is easy to read and looks remarkably similar to the pseudocode in the Wikipedia article. This is in contrast to lower-level languages such as C in which the implementation looks markedly different from the pseudocode.

Convergence of the conjugate gradient method

The conjugate gradient method is an iterative method to find the solution of a linear system A*x=b, where A is a symmetric positive definite n x n matrix, b is a vector, and x is the unknown solution vector. (Recall that symmetric positive definite matrices arise naturally in statistics as the crossproduct matrix (or covariance matrix) of a set of variables.) The beauty of the conjugate gradient method is twofold: it is guaranteed to find the solution (in exact arithmetic) in at most n iterations, and it requires only simple operations, namely matrix-vector multiplications and vector additions. This makes it ideal for solving large sparse systems because you can implement the algorithm without explicitly forming the coefficient matrix.

It is fun to look at how the algorithm converges from the initial guess to the final solution. The following example converges gradually, but I know of examples for which the algorithm seems to make little progress for the first n – 1 iterations, only to make a huge jump on the final iteration straight to the solution!

Recall that you can use a Toeplitz matrix to construct a symmetric positive definite matrix. The following statements define a banded Toeplitz matrix with 5 on the diagonal and specify the right-hand side of the system. The zero vector is used as an initial guess for the algorithm. The call to the ConjGrad function returns a matrix whose columns contain the iteration history for the method. For this problem, the method requires five iterations to converge, so the fifth column contains the solution vector. You can check that the solution to this system is (x1, x2, x3, x4, x5) = (-0.75, 0, 1.5, 0.5, -0.75), either by performing matrix multiplication or by using the SOLVE function in IML to compute the solution vector.

A = {5 4 3 2 1,          /* SPD Toeplitz matrix */
     4 5 4 3 2, 
     3 4 5 4 3,
     2 3 4 5 4,
     1 2 3 4 5};
b = {1, 3, 5, 4, 2};     /* right hand side */
n = ncol(A);
x0 = j(n,1,0);           /* the zero vector */
traj = ConjGrad( x0, A, b );
x = traj[ ,n];           /* for this problem, solution is in last column */
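As a sanity check, you can compare the final CG iterate with the direct solution from the SOLVE function (mentioned above); the difference should be tiny:

/* sanity check: compare the CG solution to the direct solution */
xDirect = solve(A, b);
diff = max(abs(x - xDirect));
print diff;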

It is instructive to view how the iteration progresses from an initial guess to the final solution. One way to view the iterations is to compute the Euclidean distance between each partial solution and the final solution. You can then graph the distance between each iteration and the final solution, which decreases monotonically (Hestenes and Stiefel, 1952).

Distance = sqrt( (traj - x)[##, ] ); /* || x[j] - x_Soln || */
Iteration = 1:n;
title "Convergence of Conjugate Gradient Method";
call series(Iteration, Distance) grid={x y} xValues=Iteration 
     option="markers" label={"" "Distance to Solution"};

Notice in the distance graph that the fourth iteration almost equals the final solution. You can try different initial guesses to see how the guess affects the convergence.

In addition to the global convergence, you can visualize the convergence in each coordinate. Because the vectors live in high-dimensional space, it is impossible to draw a scatter plot of the iterative solutions. However, you can visualize each vector in "parallel coordinates" as a sequence of line segments that connect the coordinates in each variable. In the following graph, each "series plot" represents a partial solution. The curves are labeled by the iteration number. The blue horizontal line represents the initial guess (iteration 0). The partial solution after the first iteration is shown in red and so on until the final solution, which is displayed in black. You can see that the third iteration (for this example) is close to the final solution. The fourth partial solution is so close that it cannot be visually distinguished from the final solution.

Convergence of the conjugate gradient method

In summary, this post shows that the natural syntax of the SAS/IML language makes it easy to translate pseudocode into a working program. The article focuses on the conjugate gradient method, which solves a symmetric, positive definite, linear system in at most n iterations. The article shows two ways to visualize the convergence of the iterative method to the solution.

The post The conjugate gradient method appeared first on The DO Loop.

March 5, 2018

Many people know that a surface can contain a saddle point, but did you know that you can define the saddle point of a matrix? Saddle points in matrices are somewhat rare, which means that if you choose a random matrix you are unlikely to choose one that has a saddle point. This article shows how to

  • Locate a saddle point in a matrix
  • Use simulation to estimate the probability of finding a saddle point in a random matrix.
  • Use simulation to explore the distribution of the location of a saddle point in a random matrix.

A saddle point of a matrix

You might remember from multivariable calculus that a critical point (x0, y0) is a saddle point of a function f if it is a local minimum of the surface in one direction and a local maximum in another direction. The canonical example is f(x,y) = x^2 – y^2 at the point (x0, y0) = (0, 0). Along the horizontal line y=0, the point (0, 0) is a local minimum. Along the vertical line x=0, the point (0, 0) is a local maximum.

The definition of a saddle point of a matrix is similar. For an n x p matrix M, the cell (i, j) is a saddle point of the matrix if M[i, j] is simultaneously the minimum value of the i_th row and the maximum value of the j_th column.

Consider the following 3 x 3 matrices. The minimum value for each row is highlighted in blue. The maximum value for each column is highlighted in red. Two of the matrices have saddle points, which are highlighted in purple.

  • For the matrix M1, the (3, 1) cell is a saddle point because 7 is the smallest value in row 3 and the largest value in column 1.
  • For the matrix M2, the (2, 2) cell is a saddle point because 5 is the smallest value in row 2 and the largest value in column 2.
  • The matrix M3 does not have a saddle point. No cell is simultaneously the smallest value in its row and the largest value in its column.
The saddle point of a matrix

Compute the location of a saddle point in a matrix

In a matrix language such as SAS/IML, you can use row and column operators to find the location of a saddle point, if it exists. The SAS/IML language supports subscript reduction operators, which enable you to compute certain statistics for all rows or all columns without writing a loop. For this problem, you can use the "index of the minimum operator" (>:<) to find the elements that are the minimum for each row and the "index of the maximum operator" (<:>) to find the elements that are the maximum for each column. To find whether there is an element in common, you can use the XSECT function to find the intersection of the two sets. (SAS/IML supports several functions that perform set operations.)

A function that computes the location of the saddle point follows. It uses the SUB2NDX function to convert subscripts to indices.

proc iml;
start LocSaddlePt(M);
   dim = dimension(M);  n = dim[1];  p = dim[2];
   minRow = M[ ,>:<];                    /* for each row, find column that contains min */
   minSubscripts = T(1:n) || minRow;     /* location as (i,j) pair */
   minIdx = sub2ndx(dim, minSubscripts); /* convert location to index in [1, np] */
   maxCol = T(M[<:>, ]);                 /* for each column, find row that contains max */
   maxSubscripts = maxCol || T(1:p);     /* location as (i,j) pair */
   maxIdx = sub2ndx(dim, maxSubscripts); /* convert location to index in [1, np] */
   xsect = xsect(minIdx, maxIdx);        /* intersection; might be empty matrix */
   if ncol(xsect) > 0 then
      return ( xsect );  
   else
      return ( . );
finish;
M1 = {1 2 3, 4 5 6, 7 8 9};  idx1 = LocSaddlePt(M1);
M2 = {9 1 2, 8 5 7, 3 4 6};  idx2 = LocSaddlePt(M2);
M3 = {8 1 9, 7 2 6, 3 4 5};  idx3 = LocSaddlePt(M3);
print idx1 idx2 idx3;
Find the location of the saddle point of a matrix

The indices of the saddle points are shown for the three example matrices. The 7th element of the M1 matrix is a saddle point, which corresponds to the (3, 1) cell for a 3 x 3 matrix in row-major order. The 5th element (cell (2,2)) of the M2 matrix is a saddle point. The M3 matrix does not contain a saddle point.

The probability of a saddle point in a 3x3 random matrix

A matrix contains either zero or one saddle point. There is a theorem (Ord, 1979) that says that the probability of finding a saddle point in a random n x p matrix is (n! p!) / (n+p-1)! when the elements of the matrix are randomly chosen from any continuous probability distribution.

Although the theorem holds for any continuous probability distribution, it suffices to prove it for the uniform distribution. Notice that the existence and location of a saddle point is invariant under any monotonic transformation because those transformations do not change the relative order of elements. In particular, if the elements are drawn from a distribution that has CDF F, the existence and location of the saddle point is invariant under the monotonic transformation F, which transforms the numbers into a sample from the uniform distribution. In fact, you can replace the numbers by their ranks, which converts the random matrix into a matrix that contains the consecutive integers 1, 2, ..., np. Thus the previous integer matrices represent the ranks of a random matrix whose elements are drawn from any continuous distribution.
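If you want to convince yourself numerically, the following sketch uses the RANK function to replace the elements of a random matrix by their ranks and confirms that the saddle-point location (or its absence) is unchanged:

/* sketch: the saddle-point location is invariant under ranking */
call randseed(1);
M = j(3, 3);
call randgen(M, "Uniform");
R = rank(M);                  /* replace the values by their ranks 1-9 */
print (LocSaddlePt(M))[L="Loc(M)"] (LocSaddlePt(R))[L="Loc(Ranks)"];  /* identical results */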

For small values of n and p, you can use the ALLPERM function to enumerate all possible matrices that have integers in the range [1, np]. You can reshape the permutation into an n x p matrix and compute whether the matrix contains a saddle point. If you use a 0/1 indicator array to encode the result, then the exact probability of finding a saddle point in a random n x p matrix is found by computing the mean of the indicator array. The following example computes the exact probability of finding a saddle point in a random 3 x 3 matrix:

A = allperm(9);            /* all permutations of 1:9 */
saddle = j(nrow(A), 1);    /* allocate vector for results */
do i = 1 to nrow(A);       /* for each permutation (row) ... */
   M = shape(A[i,], 3);    /*    reshape row into 3x3 matrix */
   idx = LocSaddlePt( M ); /*    find location of saddle */
   saddle[i] = (idx > 0);  /*    0/1 indicator variable */
end;
prob = saddle[:];          /* (sum of 1s) / (number of matrices) */
print prob;
The probability of a saddle point in a 3 x 3 matrix

As you can see, the computational enumeration shows that the probability of a saddle point is 0.3 for 3 x 3 matrices. This agrees with the theoretical formula (3! 3!) / 5! = 36 / 120 = 3/10.

The probability of a saddle point in a random matrix

Of course, when n and p are larger, it becomes infeasible to enumerate all n x p matrices completely. However, you can adopt a Monte Carlo approach: generate a large number of random matrices and compute the proportion that contain a saddle point. The following program simulates one million random 4 x 4 matrices from the uniform distribution and computes the proportion that have a saddle point. This proportion is a Monte Carlo estimate of the true probability.

NSim = 1e6;
n = 4; p=4;                  /* random 4x4 matrices from uniform distribution */
A = j(NSim, n*p);
call randgen(A, "Uniform");
saddleLoc = j(NSim, 1, .);   /* store location of the saddle point */
saddle = j(NSim, 1);         /* binary indicator variable */
do i = 1 to NSim;            /* Monte Carlo estimation of probability */
   M = shape(A[i,], n, p);
   saddleLoc[i] = LocSaddlePt( M );  /* save the location 1-16 */
   saddle[i] = (saddleLoc[i] > 0 );  /* 0/1 binary variable */
end;
estProb = saddle[:];
Prob = fact(n)*fact(p) / fact(n+p-1);
print estProb Prob;
The probability of a saddle point of a 4 x 4 matrix

The output shows that the estimated probability agrees with the exact probability to four decimal places.

The distribution of locations of saddle points

The saddle point can occur in any cell in the matrix. Consider, for example, the integer matrix M1 from the earlier 3 x 3 example. If you swap the first and third rows of M1, you obtain a new matrix that has a saddle point in the (1,1) cell. If you swap the second and third rows of M1, you obtain a saddle point in the (2,1) cell. Similarly, you can swap columns to move a saddle point to a different column. In general, you can move the saddle point to any cell in the matrix by a sequence of row and column swaps. Thus each cell in the matrix has an equal probability of containing a saddle point, and the probability that the saddle point is in cell (i,j) is 1/(np) times the probability that a saddle point exists.
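The following statements (which again assume that the LocSaddlePt module is defined in the session) demonstrate the first of these swaps:

/* swap the first and third rows of M1: the saddle point moves to cell (1,1) */
M1 = {1 2 3, 4 5 6, 7 8 9};
M  = M1[{3 2 1}, ];      /* reverse the row order */
idx = LocSaddlePt(M);    /* returns 1, which is the (1,1) cell */
print idx;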

In the preceding SAS/IML simulation, the saddleLoc variable contains the location of the saddle point in the 4 x 4 matrix. The following statements compute the empirical proportion of times that a saddle point appeared in each of the 16 cells of the matrix:

call tabulate(location, freq, saddleLoc);  /* frequencies for cells 1-16; missing values (no saddle point) are ignored */
counts = shape(freq, n, p);                /* shape into matrix     */
EstProbLoc = counts / (NSim);              /* empirical proportions */
print EstProbLoc[r=('1':'4') c=('1':'4') F=7.5];
The probabilities that a saddle point of a matrix is in the (i,j)th position

The output indicates that the estimated probability for each cell is approximately 0.007 (the "James Bond" probability!). Theoretically, you can prove that the probability for each cell is given by the value of the complete beta function B(n, p), which for n=p=4 yields B(4,4) = 0.0071429 (Hofri, 2006).
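To see the connection, note that B(n, p) = (n-1)! (p-1)! / (n+p-1)!, which is exactly the total probability n! p! / (n+p-1)! divided by np. You can confirm the values numerically by using the Base SAS BETA function, as in the following quick check:

/* per-cell probability is B(4,4); total probability is 16*B(4,4) */
b = beta(4, 4);      /* = 3!*3!/7! = 0.0071429 */
total = 16 * b;      /* = 4!*4!/7! = 0.1142857 */
print b total;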

In summary, this article has explored three topics that are dear to my heart: probability, simulation, and properties of matrices. Along the way, we encountered factorials, permutations, and the complete beta function, and we got to write some fun SAS/IML programs. For more information about the probability of a saddle point in random matrices, see "The probability that a matrix has a saddle point" (Thorp, 1979) and "On the distribution of a saddle point value in a random matrix" (Hofri, 2006).

The post The probability of a saddle point in a matrix appeared first on The DO Loop.

December 11, 2017
Self-similar Christmas tree created in SAS

Happy holidays to all my readers! My greeting card to you is an image of a self-similar Christmas tree. The image was created in SAS by using two features that I blog about regularly: matrix computations and ODS statistical graphics.

Self-similarity in Kronecker products

I have previously shown that the repeated Kronecker product of a binary matrix leads to self-similar structures.

Specifically, if M is a binary matrix then the Kronecker product M@M is a block matrix in which each 1 in the original matrix is replaced with a copy of M and each 0 is replaced by a zero block. (Here '@' is the Kronecker product (or direct product) operator, which is implemented in SAS/IML software.) If you repeat this process three or more times and create a heat map of the product matrix, the self-similar structure is apparent.
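If you have not seen this construction before, the following minimal sketch (run inside a PROC IML session) shows the block structure for a tiny 2 x 2 binary matrix:

/* each 1 in M0 is replaced by a copy of M0; each 0 becomes a 2 x 2 zero block */
M0 = {1 0,
      1 1};
K = M0 @ M0;   /* a 4 x 4 block matrix */
print K;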

For example, the following 5 x 7 binary matrix has a 1 in positions that approximate the shape of a Christmas tree. The direct product M@M@M is a matrix that has 5^3 = 125 rows and 7^3 = 343 columns. If you use the HEATMAPDISC subroutine in SAS/IML to create a heat map of the direct product, you obtain the following image in which a green cell represents a 1:

proc iml;
M = {0 0 0 1 0 0 0,
     0 0 1 1 1 0 0,
     0 1 1 1 1 1 0,
     1 1 1 1 1 1 1,
     0 0 0 1 0 0 0};
Tree = M @ M @ M;     /* M is 5 x 7, so Tree is 125 x 343 */
ods graphics / width=300px height=450px  ANTIALIASMAX=50000;
call heatmapdisc(Tree) title="Happy Holidays to All my Readers!"
     colorramp={White Green} displayoutlines=0 ShowLegend=0;

That's not a bad result for three SAS/IML statements! You can see that the "tree" has a self-similar structure: the large tree is composed of 17 smaller trees, each of which is composed of 17 mini trees.
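You can verify the count by tallying the 1s in M (assuming M is still defined in the session):

numOnes = ncol(loc(M = 1));   /* count the 1s in M */
print numOnes;                /* 17: each 1 spawns a copy of the tree */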

Adding a star

My readers are worth extra effort, so I want to put a star on the top of the tree. One way to do this is to use a scatter plot and plot a special observation by using a star symbol. To plot the "tree" by using a scatter plot, it is necessary to "unpack" the 125 x 343 matrix into a list of (x, y) values (this is a wide-to-long conversion of the matrix). You can use the NDX2SUB function to extract the row and column of each 1 in the matrix, as follows:

/* extract (row, column) pairs for matrix elements that are '1' */
idx = loc(Tree = 1);                /* indices for the 1s */
s = ndx2sub(dimension(Tree), idx);  /* convert indices to subscripts */
create Tree from s[c={"RNames" "CNames"}];
append from s;
close Tree;                           /* close the data set so later steps can read it */
call symputx("XPos", range(s[,2])/2); /* half the range of the columns = horiz position of star */

The previous statements create a SAS data set called TREE that contains the (x, y) positions of the 1s in the direct product matrix. The statements also save a horizontal position for the star (half the range of the column values) to a macro variable. The following SAS procedures create the location of the star, concatenate the two data sets, and create a scatter plot of the result, which is shown at the top of this article.

data star;                 /* the star at the top of the tree */
x = &XPos; y = 0;
run;
data Tree2;                /* concatenate the tree and the star */
set Tree star;
run;
ods graphics / width=600px height=800px  ANTIALIASMAX=50000;
title "Happy Holidays";
proc sgplot data=Tree2 noborder noautolegend;
scatter x=CNames y=RNames / markerattrs=(color=ForestGreen symbol=SquareFilled size=3);
scatter x=x y=y / markerattrs=(symbol=StarFilled size=20)             /* yellow star */
                  filledoutlinedmarkers markerfillattrs=(color=yellow);
yaxis reverse display=none;   /* reverse the axis so that row 1 is at the top */
xaxis display=none;
run;

Previous SAS-created Christmas cards

All year long I blog about how to use SAS for serious pursuits like statistical modeling, data analysis, optimization, and simulation. Consequently, I enjoy my occasional forays into fun and frivolous topics such as how to create a greeting card with SAS. If you like seeing geeky SAS-generated images, you can browse my efforts from previous years.

Wherever you are and whatever holiday you celebrate, may you enjoy peace and happiness this season and in the New Year.

The post A self-similar Christmas tree appeared first on The DO Loop.