Tips and Techniques

8月 052020

Have you ever seen the "brain teaser" for children that shows a 4 x 4 grid and asks "how many squares of any size are in this grid?" To solve this problem, the reader must recognize that there are sixteen 1 x 1 squares, nine 2 x 2 squares, four 3 x 3 squares, and one 4 x 4 square. Hence there are a total of (16+9+4+1) = 30 squares of any size.

Recently a SAS statistical programmer asked a similar question about submatrices of a matrix. Specifically, the programmer wanted to form square submatrices of consecutive rows and columns. This article shows a few tricks in the SAS/IML language that enable you to generate and compute with all k x k submatrices of a matrix, where the submatrices are formed by using consecutive rows and columns of the original matrix. Although it is possible to consider more general submatrices, in this article, a "submatrix" always refers to one that is formed by using consecutive rows and columns.

Use subscripts to form submatrices

I've previously written about several ways to extract submatrices from a matrix. Recall the "colon operator" (formally called the index creation operator) generates a vector of consecutive integers. You can use the vectors in the row and column position of the subscripts to extract a submatrix. For example, the following SAS/IML statements define a 4 x 4 matrix and extract the four 3 x 3 submatrices:

A = {4 3 1 6,
     2 4 3 1,
     0 2 4 3,
     5 0 2 4};
/* extract the four 3 x 3 submatrices */
A11 = A[1:3, 1:3];
A12 = A[1:3, 2:4];
A21 = A[2:4, 1:3];
A22 = A[2:4, 2:4];

Manually specifying the row and column numbers is tedious. Let's write a SAS/IML function that will return a k x k submatrix of A by starting at the (i,j)th position of A. Call k the order of the submatrix.

/* Given A, return the square submatrix of order k starting at index (i,j) */
start submat(A, i, j, k);
   return A[ i:(i+k-1), j:(j+k-1) ];
/* example : 3 x 3 submatrices of A */
A11 = submat(A, 1, 1, 3);
A12 = submat(A, 1, 2, 3);
A21 = submat(A, 2, 1, 3);
A22 = submat(A, 2, 2, 3);
/* */
ods layout gridded columns=4 advance=table;
print A11, A12, A21, A22;
ods layout end;

The number of submatrices

If A is an n x m matrix, we can ask how many submatrices of order k are in A. Recall that we are only considering submatrices formed by consecutive rows and columns. There are (n – k + 1) sequences of consecutive rows of length k, such as 1:k, 2:(k+1), and so forth. Similarly, there are (m – k + 1) sequences of consecutive columns of length k. So the following SAS/IML function counts the number of submatrices of order k. You can call the function for different k=1, 2, 3, and 4 in order to solve the "Counting Squares" brain teaser:

/* count all submatrices of order k for the matrix A */
start CountAllSubmat(A, k);
   numRows = nrow(A)-k+1;
   numCols = ncol(A)-k+1;
   return numRows * numCols;          
n = nrow(A);
numSubmat = j(n, 1, .);
do k = 1 to n;
   numSubmat[k] = CountAllSubmat(A, k);
order = T(1:n);
print order numSubmat;

If you add up the number of squares each size, you get a total of 30 squares.

Iterate over submatrices of a specified order

You can iterate over all submatrices of a specified order. You can perform a matrix computation on each submatrix (for example, compute its determinant), or you can pack it into a list for future processing. The following SAS/IML function iterates over submatrices and puts each submatrix into a list. The STRUCT subroutine from the ListUtil package can be used to indicate the contents of the list in a concise form.

start GetListSubmat(A, k);
   numSub = CountAllSubmat(A, k);
   L = ListCreate( numSub );           /* create list with numSub items */
   cnt = 1;
   do i = 1 to nrow(A)-k+1;            /* for each row */
      do j = 1 to ncol(A)-k+1;         /* and for each column */
         B = submat(A, i, j, k);       /* compute submatrix of order k from (i,j)th cell */
         call ListSetItem(L, cnt, B);  /* put in list (or compute with B here) */
         cnt = cnt + 1;
   return L;
/* Example: get all matrices of order k=2 */
L = GetListSubmat(A, 2);
package load ListUtil;
call Struct(L);

The output from the STRUCT call shwos that there are nine 2 x 2 matrices. The Value1–Valuie4 columns show the values (in row-major order). For example, the first 2 x 2 submatrix (in the upper left corner of A) is {4 3, 2 4}. The second submatrix (beginning with A[1,2]) is {3 1, 4 3}.

It is straightforward to modify the function to compute the determinant or some other matrix computation.


This article shows some tips and techniques for dealing with submatrices of a matrix. The submatrices in this article are formed by using consecutive rows and columns. Thus, they are similar to the "Counting Squares" brain teaser, which requires that you consider sub-squares of order 1, 2, 3, and so forth.

The post Submatrices of matrices appeared first on The DO Loop.

1月 062020

Last year, I wrote more than 100 posts for The DO Loop blog. The most popular articles were about SAS programming tips for data analysis, statistical analysis, and data visualization. Here are the most popular articles from 2019 in each category.

SAS programming tips

The Essential Guide to Binning

  • Create training, testing, and validation data sets:: This post shows how to create training, validation, and test data sets in SAS. This technique is popular in data science and machine learning because you typically want to fit the model on one set of data and then evaluate the goodness of fit by using a different set of data.
  • 5 reasons to use PROC FORMAT to recode variables in SAS: Often SAS programmers use PROC SQL or the SAS DATA step to create new data variables to recode raw data. This is not always necessary. It is more efficient to use PROC FORMAT to recode the raw data. Learn five reasons why you should use PROC FORMAT to recode variables.
  • Conditionally append observations to a data set: In preparing data for graphing. you might want to add additional data to the end of a data set. (For example, to plot reference lines or text.) You can do this by using one DATA step to create the new observations and another DATA step to merge the new observations to the end of the original data. However, you can also use the END= option and end-of-file processing to append new observations to data. Read about how to use the END= option to append data. The comments at the end of the article show how to perform similar computations by using a hash table, PROC SQL, and a DOW loop.
  • Use PROC HPBIN to bin numerical variables: In machine learning, it is common to bin numerical variables into a set of discrete values. You can use PROC HPBIN to bin multiple variables in a single pass through the data. PROC HPBIN can bin data into equal-length bins (called bucket binning) or by using quantiles of the data. This article and the PROC FORMAT article are both referenced in my Essential Guide to Binning in SAS.

Statistical analyses

Geometric Meaning of Kolmogorov's D

Data visualization

Visualize an Interaction Effect

I always enjoy learning new programming methods, new statistical ideas, and new data visualization techniques. If you like to learn new things, too, read (or re-read!) these popular articles from 2019. Then share this page with a friend. I hope we both have many opportunities to learn and share together in the new year.

The post Top posts from <em>The DO Loop</em> in 2019 appeared first on The DO Loop.

10月 302019

Every year at Halloween, I post an article that shows a SAS trick that is a real treat. This article shows how to use the INTNX function to find dates that are related to a specified date. The INTNX function is a sweet treat, indeed.

I previously wrote an article about how to use the INTCK and INTNX functions to compute with dates. These Base SAS functions are very powerful and deserve to be known more widely. In particular, the INTNX function enables you to compute the next or previous date that is a certain number of time units away from a known date. The "time unit" is usually a day, week, a month, or a year, but the function supports many options.

Recently, a SAS programmer asked how to get "the first and last days of the previous month." The programmer added that the expression needs to be used in a WHERE clause. Because the expression needs to appear in a WHERE clause, it should be concise and not require several statements or temporary variables.

The first day of a previous (or subsequent) time interval

Finding the first day of the previous month is an ideal situation for using the INTNX function. The basic syntax of the INTNX function is
      INTNX(timeUnit, startDate, numberOfUnits)
This form of the INTNX function returns the first day of the specified time unit. For example, the following statements give dates relative to the bombing of Pearl Harbor on 07DEC1941. The time interval is 'month'. Notice that you can ask for dates after the given date (a positive number of time units) or before the given date (a negative number of units). If you specify 0 for the third argument, you get the current month.

data Months;
date = '07DEC1941'd;
FirstDayCurrMonth = intnx('month', Date,  0);   /*  0 = current month */
FirstDayPrevMonth = intnx('month', Date, -1);   /* -1 = previous month */
FirstDayNextMonth = intnx('month', Date,  1);   /*  1 = next month */
FirstDay6Months   = intnx('month', Date,  6);   /*  6 = six months later */
format _ALL_ date9.;
proc print data=Months noobs;

Because the time unit is 'month' for this example, the calculated dates are the first day of the months relative to 07DEC1941. If you change 'month' to 'year', all the calculated dates will be 01JAN of some year relative to 1941.

The last (or same) day of a previous (or subsequent) time interval

A cool fact about the INTNX function is that it supports an optional fourth argument that enables you to specify whether you want the calculated date to be at the beginning, the middle, or the end of the specified time interval. You can even specify that you want the "same" characteristics as the source date, which is useful for finding anniversaries of an event. For example, the following statements vary the fourth argument, which can be one of four values:

data FirstLastMiddle;
date = '07DEC1941'd;
FirstDayPrevMonth = intnx('month', Date, -1, 'B');   /* B = beginning */
LastDayPrevMonth  = intnx('month', Date, -1, 'E');   /* E = end */
MiddlePrevMonth   = intnx('month', Date, -1, 'M');   /* M = middle */
FirstAnniv        = intnx('year',  Date,  1, 'S');   /* S = same */
format _ALL_ date9.;
proc print data=FirstLastMiddle noobs;

The program shows that you can find the first day of the previous month, the last day of the previous month, the middle of the previous month, or an anniversary of the specified date. In particular, the program answers the programmer's question by showing a concise "one-liner" that you can use to get the first and last days of the previous month.

In summary, the INTNX function is a powerful tool for working with dates. It enables you to find dates that are related to a specified date. You can use the first argument to specify the time unit (day, week, month, year,...) and the third argument to specify the number of time units before or after the specified data. The optional fourth argument determines whether you want the first, last, middle, or "same" portion of the time interval.

Whether you work with dates in SAS every day or whether you work with them occasionally, the INTNX function is a sweet treat to remember. No tricks required.

The post Compute the first or last day of a month or year appeared first on The DO Loop.

10月 282019

A common task in SAS programming is to specify a list of variables that satisfy some pattern. You can specify lists for the KEEP= or DROP= data set options, and you can use lists of variables on many SAS statements such as the VAR and MODEL statements. Although SAS has built-in support for some patterns (like variables that start with the same prefix), you might want to match variable names to less-common patterns. In those situations, you can use regular expressions to match variable names by using the PRXMATCH function in Base SAS. The PRXMATCH function is one of several functions in SAS that support Perl regular expressions (PRX).

Built-in support for specifying variables in SAS

In a previous article, I discussed six different ways to create a list of variable names in SAS. Of these, the most common are

  • The colon operator for specifying names that have a common prefix. For example, AGE: specifies variables that begin with the prefix "age" (recall that SAS variable names are case insensitive).
  • The dash operator for specifying a sequence of variable names that have the same prefix and a numerical suffix. For example, X1-X5 matches the variables X1, X2, X3, X4, and X5.

For example, the following table shows variables in the Sashelp.Heart data set:

proc contents data=Sashelp.Heart short varnum; run;

You can see that several variables begin with the prefix 'Age'. By using the colon operator (Age:) you can match all variables that match the prefix, such as in the following example:

title "Colon (Prefix) Variable Names";
data PrefixVars;
   set Sashelp.Heart(keep=Age:);
proc contents short varnum; run;

Specifying variables that have a common suffix

Several of the variables in the Sashelp.Heart data end with the suffix 'Status'. Unfortunately, there is no built-in operator in SAS to match a suffix such as 'Status'. However, Bruno Mueller and Mark Jordan have provided a SAS macro that uses regular expressions to select variables that match any pattern. The list of variables is returned as a text string, so you can use it in the usual places, such as the KEEP= options or a VAR statement. The following example assumes you have downloaded and run the definition of the %varListPattern macro.

title "Suffix Variable Names";
data SuffixVars;
   set Sashelp.Heart(keep=%varListPattern(sashelp.Heart,*status));
proc contents short varnum; run;

This macro fills a need, and I like it a lot. In addition to matching variables that share a common suffix, you can also use the macro to find variables that match other patterns. For example, you can use the pattern "*at*" to match variables that have the string "at" anywhere in their name. In addition to the "status" variables found above, that pattern also matches the variables DeathCause, AgeAtStart, and AgeAtDeath.

Some programmers don't realize that all SAS procedures support data set options (such as KEEP= and DROP=) when you specify the name of a data set in a procedure. That means that you can use the built-in pattern matching and the %varListPattern macro throughout SAS. For example, here is how you would read a set of prefix-variables and a set of suffix-variables into SAS/IML matrices:

proc iml;
use Sashelp.Heart(keep=Age:);       /* use the coon operator to match prefixes */
   read all var _ALL_ into X[colname=prefixVars];
print prefixVars;
use Sashelp.Heart(keep=%varListPattern(sashelp.Heart,*status)); /* use macro to match suffixes */
   read all var _ALL_ into C[colname=suffixVars];
print suffixVars;

Extract a vector of strings that match a pattern

At its heart, the %varListPattern macro calls the PRXMATCH function. You can use the PRXMATCH function to determine whether a variable name (in fact, any string!) matches a pattern that is specified by a regular expression. You can use the PRXMATCH function when the names of variables are in a data set. You can then use PROC SQL to create a macro variable that contains a space-separated list of all variables that match the pattern.

I needed this functionality recently when I was writing a SAS/IML function and needed to select all strings that had a common suffix. To understand the next example, recall the following SAS/IML programming features:

The following SAS/IML functions construct patterns like /^PREFIX/i and /.*SUFFIX$/i and use the PRXMATCH function to find strings that match the patterns. To demonstrate the function, the program uses the variable names in the Sashelp.Heart data.

proc iml;
/* Use PRXMATCH to find strings with a common prefix */
start MatchPrefix(prefix, str);
   re = '/^' + strip(prefix) + '/i';    /* beginning of word, case insensitive */
   idx = loc(prxmatch(re, str));
   if ncol(idx)=0 then return( {} );    /* return empty matrix */
   return( str[idx] );
/* Use PRXMATCH to find strings with a common suffix */
start MatchSuffix(suffix, str);
   re = '/.*' + strip(suffix) + '$/i';  /* end of word, case insensitive */
   idx = loc(prxmatch(re, right(str))); /* shift right so no blank spaces at end */
   if ncol(idx)=0 then return( {} );    /* return empty matrix */
   return( str[idx] );
Variable = contents( "Sashelp", "Heart" );   /* get variable names in data order */
prefixVars = MatchPrefix("Age", Variable);
suffixVars = MatchSuffix("status", Variable);
print prefixVars, suffixVars;

I emphasize that although this example uses variable names as the strings, you can use the PRXMATCH function to search for patterns in arbitrary strings. In a similar way, you can construct other functions that find strings that match any regular expression that you can construct.

In summary, regular expressions in SAS are a powerful feature for matching strings that contain some pattern of characters. The examples in this article use simple regular expressions to find all variables that contain a common prefix (same as the built-in colon operator) or a common suffix. For many SAS procedures that require a list of variable names, you can use the %varListPattern macro to generate the variable list. For other applications, you can call the PRXMATCH function directly.

For an introduction to regular expressions in SAS, see "An Introduction to Perl Regular Expressions in SAS 9" (Cody, 2004) and "The Basics of the PRX Functions" (Cassell, 2007).

The post Use regular expressions to specify variable names in SAS appeared first on The DO Loop.

8月 072019
2-D binning of counts

Do you want to bin a numeric variable into a small number of discrete groups? This article compiles a dozen resources and examples related to binning a continuous variable. The examples show both equal-width binning and quantile binning. In addition to standard one-dimensional techniques, this article also discusses various techniques for 2-D binning.

SAS procedures that support binning include the HPBIN, IML, KDE, RANK, and UNIVARIATE procedures.

Equal-width binning in SAS

The simplest binning technique is to form equal-width bins, which is also known as bucket binning. If a variable has the range [Min, Max] and you want to split the data into k equal-width bins (or buckets), each bin will have width (Max - Min) / k.

Quantile binning in SAS

In bucket binning, some bins have more observations than others. This enables you to estimate the density of the data, as in a histogram. However, you might want all bins to contain about the same number of observations. In that case, you can use quantiles of the data as cutpoints. If you want four bins, use the 25th, 50th, and 75th percentiles as cutpoints. If you want 10 bins, use the sample deciles as cutpoints. Here are several resources for quantile binning:

Binning by using arbitrary cutpoints in SAS

Sometimes you need to bin based on scientific standards or business rules. For example, the Saffir-Simpson hurricane scale uses specific wind speeds to classify a hurricane as Category 1, Category 2, and so forth. In these cases, you need to be able to define custom cutpoints and assign observations to bins based on those cutpoints.

2-D binning and bivariate histograms in SAS

A histogram is a visualization of a univariate equal-width binning scheme. You can perform similar computations and visualizations for two-dimensional data. If your goal is to understand the density of continuous bivariate data, you might want to use a bivariate histogram rather than a scatter plot (which, for large samples, suffers from overplotting).

In summary, this guide provides many links to programs and examples that bin data in SAS. Whether you want to use equal-width bins, quantile bins, or two-dimensional bins, hopefully, you will find an example to get you started. If I've missed an important topic, or if you have a favorite binning method that I have not covered, leave a comment.

The post The essential guide to binning in SAS appeared first on The DO Loop.

5月 132019

In SAS/IML programs, a common task is to write values in a matrix to a SAS data set. For some programs, the values you want to write are in a matrix and you use the CREATE FROM/APPEND FROM syntax to create the data set, as follows:

proc iml;
X = {1  2  3, 
     4  5  6, 
     7  8  9, 
    10 11 12};
create MyData from X[colname={'A' 'B' 'C'}];  /* create data set and variables */
append from X;                                /* write all rows of X */
close;                                        /* close the data set */

In other programs, the results are computed inside an iterative DO loop. If you can figure out how many observations are generated inside the loop, it is smart to allocate room for the results prior to the loop, assign the rows inside the loop, and then write to a data set after the loop.

However, sometimes you do not know in advance how many results will be generated inside a loop. Examples include certain kinds of simulations and algorithms that iterate until convergence. An example is shown in the following program. Each iteration of the loop generates a different number of rows, which are appended to the Z matrix. If you do not know in advance how many rows Z will eventually contain, you cannot allocate the Z matrix prior to the loop. Instead, a common technique is to use vertical concatenation to append each new result to the previous results, as follows:

/* sometimes it is hard to determine in advance how many rows are in the final result */
free Z;
do n = 1 to 4;
   k = n + floor(n/2);      /* number of rows */
   Y = j(k , 3, n);         /* k x 3 matrix */
   Z = Z // Y;              /* vertical concatenation of results */
create MyData2 from Z[colname={'A' 'B' 'C'}];  /* create data set and variables */
append from Z;                                 /* write all rows */
close;                                         /* close the data set */

Concatenation within a loop tends to be inefficient. As I like to say, "friends don't let friends concatenate results inside a loop!"

If your ultimate goal is to write the observations to a data set, you can write each sub-result to the data set from inside the DO loop! The APPEND FROM statement writes whatever data are in the specified matrix, and you can call the APPEND FROM statement multiple times. Each call will write the contents of the matrix to the open data set. You can update the matrix or even change the number of rows in the matrix. For example, the following program opens the data set prior to the DO loop, appends to the data set multiple times (each time with a different number of rows), and then closes the data set after the loop ends.

/* alternative: create data set, write to it during the loop, then close it */
Z = {. . .};                /* tell CREATE stmt that data will contain three numerical variables */
create MyData3 from Z[colname={'A' 'B' 'C'}];   /* open before the loop. The TYPE of the variables are known. */
do n = 1 to 4;
   k = n + floor(n/2);      /* number of rows */
   Z = j(k , 3, n);         /* k x 3 matrix */
   append from Z;           /* write each block of data */
close;                      /* close the data set */

The following output shows the contents of the MyData3 data set, which is identical to the MyData2 data set:

Notice that the CREATE statement must know the number and type (numerical or character) of the data set variables so that it can set up the data set for writing. If you are writing character variables, you also need to specify the length of the variables. I typically use missing values to tell the CREATE statement the number and type of the variables. These values are not written to the data set. It is the APPEND statement that writes data.

I previously wrote about this technique in the article "Writing data in chunks," which was focused on writing large data set that might not fit into RAM. However, the same technique is useful for writing data when the total number of rows is not known until run time. I also use it when running simulations that generate multivariate data. This technique provides a way to write data from inside a DO loop and to avoid concatenating matrices within the loop.

The post Write to a SAS data set from inside a SAS/IML loop appeared first on The DO Loop.

4月 152019

A quadratic form is a second-degree polynomial that does not have any linear or constant terms. For multivariate polynomials, you can quickly evaluate a quadratic form by using the matrix expression
x` A x
This computation is straightforward in a matrix language such as SAS/IML. However, some computations in statistics require that you evaluate a quadratic form that looks like
x` A-1 x
where A is symmetric and positive definite. Typically, you know A, but not the inverse of A. This article shows how to compute both kinds of quadratic forms efficiently.

What is a quadratic form?

For multivariate polynomials, you can represent the purely quadratic terms by a symmetric matrix, A. The polynomial is q(x) = x` A x, where x is an d x 1 vector and A is a d x d symmetric matrix. For example, if A is the matrix {9 -2, -2 5} and x = {x1, x2} is a column vector, the expression x` A x equals the second degree polynomial q(x1, x2) = 9*x12 - 4*x1 x2 + 5*x22. A contour plot of this polynomial is shown below.

Contour plot of the quadratic form x` A x for A = {9 -2, -2 5}

Probably the most familiar quadratic form is the squared Euclidean distance. If you let A be the d-dimensional identity matrix, then the squared Euclidean distance of a vector x from the origin is x` Id x = x` x = Σi xi2. You can obtain a weighted squared distance by using a diagonal matrix that has positive values. For example, if W = diag({2, 4}), then x` W x = 2*x12 + 4*x22. If you add in off-diagonal elements, you get cross terms in the polynomial.

Efficient evaluation of a quadratic form

If the matrix A is dense, then you can use matrix multiplication to evaluate the quadratic form. The following symmetric 3 x 3 matrix defines a quadratic polynomial in 3 variables. The multiplication evaluates the polynomial at (x1, x2, x3) = (-1. 2. 0.5).

proc iml;
   q(x1, x2, x3) = 4*x1**2 + 6*x2**2 + 9*x3**2 +
                   2*3*x1*x2 + 2*2*x1*x3 + 2*1*x2*x3
A = {4 3 2,
     3 6 1,
     2 1 9};
x = {-1, 2, 0.5};
q1 = x`*A*x;
print q1;

When you are dealing with large matrices, always remember that you should never explicitly form a large diagonal matrix. Multiplying with a large diagonal matrix is a waste of time and memory. Instead, you can use elementwise multiplication to evaluate the quadratic form more efficiently:

w = {4, 6, 9};
q2 = x`*(w#x);     /* more efficient than q = x`*diag(w)*x; */
print q2;

Quadratic forms with positive definite matrices

In statistics, the matrix is often symmetric positive definite (SPD). The matrix A might be a covariance matrix (for a nondegenerate system), the inverse of a covariance matrix, or the Hessian evaluated at the minimum of a function. (Recall that the inverse of a symmetric positive definite (SPD) matrix is SPD.) An important example is the squared Mahalanobis distance x` Σ-1 x, which is a quadratic form.

As I have previously written, you can use a trick in linear algebra to efficiently compute the Mahalanobis distance. The trick is to compute the Cholesky decomposition of the SPD matrix. I'll use a large Toeplitz matrix, which is guaranteed to be symmetric and positive definite, to demonstrate the technique. The function EvalSPDQuadForm evaluates a quadratic form defined by the SPD matrix A at the coordinates given by x:

/* Evaluate quadratic form q = x`*A*x, where A is symmetric positive definite.
   Let A = U`*U be the Cholesky decomposition of A.
   Then q = x`(U`*U)*x = (U*x)`(Ux)
   So define z = U*x and compute q = z`*z
start EvalSPDQuadForm(A, x);
   U = root(A);           /* Cholesky root */
   z = U*x;
   return (z` * z);       /* dot product of vectors */
/* Run on large example */
call randseed(123);
N = 1000;
A = toeplitz(N:1);
x = randfun(N, "Normal");
q3 = EvalSPDQuadForm(A, x);  /* efficient */
qCheck = x`*A*x;             /* check computation by using a slower method */
print q3 qCheck;

You can use a similar trick to evaluate the quadratic form x` A-1 x. I previously used this trick to evaluate the Mahalanobis distance efficiently. It combines a Cholesky decomposition (the ROOT function in SAS/IML) and the TRISOLV function for solving triangular systems.

/* Evaluate quadratic form x`*inv(A)*x, where  A is symmetric positive definite.
   Let w be the solution of A*w = x and A=U`U be the Cholesky decomposition.
   To compute  q = x` * inv(A) * x = x` * w, you need to solve for w.
   (U`U)*w = x, so
        First solve triangular system U` z = x   for z,
        then solve triangular system  U w = z   for w 
start EvalInvQuadForm(A, x);
   U = root(A);              /* Cholesky root */
   z = trisolv(2, U, x);     /* solve linear system */
   w = trisolv(1, U, z);  
   return (x` * w);          /* dot product of vectors */
/* test the function */
q4 = EvalInvQuadForm(A, x);  /* efficient */
qCheck = x`*Solve(A,x);      /* check computation by using a slower method */
print q4 qCheck;

You might wonder how much time is saved by using the Cholesky root and triangular solvers, rather than by computing the general solution by using the SOLVE function. The following graph compares the average time to evaluate the same quadratic form by using both methods. You can see that for large matrices, the ROOT-TRISOLV method is many times faster than the straightforward method that uses SOLVE.

In summary, you can use several tricks in numerical linear algebra to efficiently evaluate a quadratic form, which is a multivariate quadratic polynomial that contains only second-degree terms. These techniques can greatly improve the speed of your computational programs.

The post Efficient evaluation of a quadratic form appeared first on The DO Loop.

4月 032019

I've previously written about how to deal with nonconvergence when fitting generalized linear regression models. Most generalized linear and mixed models use an iterative optimization process, such as maximum likelihood estimation, to fit parameters. The optimization might not converge, either because the initial guess is poor or because the model is not a good fit to the data. SAS regression procedures for which this might happen include PROC LOGISTIC, GENMOD, MIXED, GLMMIX, and NLMIXED.

For mixed models, several problems can occur if you have a misspecified model. One issue results in the following note in the SAS log:
NOTE: Estimated G matrix is not positive definite.
This article describes what the note means, why it might occur, and what to do about it. If you encounter this note during a BY-group analysis or simulation study, this article shows how to identify the samples that have the problem.

There are two excellent references that discuss issues related to convergence in the SAS mixed model procedures:

What is the G matrix?

Before we discuss convergence, let's review what the G matrix is and why it needs to be positive definite. The matrix formulation of a mixed model is
Y = X β + Z γ + ε
where β is a vector of fixed-effect parameters. The random effects are assumed to be random realizations from multivariate normal distributions. In particular, γ ~ MVN(0, G) and ε ~ MVN(0, R), where G and R are covariance matrices. The variance-covariance matrix G is often used to specify subject-specific effects, whereas R specifies residual effects. A goal of mixed models is to specify the structure of the G and/or R matrices and estimate the variance-covariance parameters.

Because G is a covariance matrix, G must be positive semidefinite. A nondegenerate covariance matrix will be fully positive definite. However, estimates of G might not have this property. SAS alerts you if the estimate is not positive definite.

As stated in Kiernan (2018, p. ), "It is important that you do not ignore this message."

Reasons the estimated G matrix is not positive definite

A SAS Usage Note states that the PROC MIXED message means that "one or more variance components on the RANDOM statement is/are estimated to be zero and could/should be removed from the model." Kiernan, Tao, and Gibbs (2012) and Kiernan (2018) describe several reasons why an estimated G matrix can fail to be positive definite. Two of the more common reasons include:

  • There is not enough variation in the response. After controlling for the fixed effects, there isn't any (or much) variation for the random effects. You might want to remove the random effect from the model. (KTG, p. 9)
  • The model is misspecified. You might need to modify the model or change the covariance structure to use fewer parameters. (Kiernan, p. 18)

Convergence issues in simulation studies or BY-group analyses

If you encounter the note "Estimated G matrix is not positive definite" for real data, you should modify the model or collect more data. In a simulation study, however, there might be simulated samples that do not fit the model even when the data is generated from the model! This can happen for very small data sets and for studies in which the variance components are very small. In either case, you might want to identify the samples for which the model fails to converge, either to exclude them from the analysis or to analyze them by using a different model. The same situation can occur for few BY groups during a traditional BY-group analysis.

The following SAS program illustrates how to detect the samples for which the estimated G matrix is not positive definite. The 'SP' data are from an example in the PROC MIXED documentation. The data set named 'ByData' is constructed for this blog post. It contains four copies of the real data, along with an ID variable with values 1, 2, 3, and 4. For the first copy, the response variable is set to 0, which means that there is no variance in the response. For the third copy, the response variable is simulated from a normal distribution. It has variance, but does not conform to the model. The call to PROC MIXED fits the same random-effects model to all four samples:

/* Example from PROC MIXED documentation:  */
data sp;
input Block A B Y @@;
1 1 1  56  1 1 2  41  1 2 1  50  1 2 2  36  1 3 1  39  1 3 2  35
2 1 1  30  2 1 2  25  2 2 1  36  2 2 2  28  2 3 1  33  2 3 2  30
3 1 1  32  3 1 2  24  3 2 1  31  3 2 2  27  3 3 1  15  3 3 2  19
4 1 1  30  4 1 2  25  4 2 1  35  4 2 2  30  4 3 1  17  4 3 2  18
/* concatenate four copies of the data, but modify Y for two copies */
data ByData;
call streaminit(1);
set sp(in=a1) sp(in=a2) sp(in=a3) sp(in=a4);
SampleID = a1 + 2*a2 + 3*a3 + 4*a4;
if a1 then Y = 0;                         /* response is constant (no variation) */
if a3 then Y = rand("Normal", 31, 10);    /* response is indep of factors */
/* suppress ODS output: */
%macro ODSOff(); ods graphics off; ods exclude all; ods noresults;  %mend;
%macro ODSOn();  ods graphics on; ods exclude none; ods results;    %mend;
proc mixed data=ByData;
   by sampleID;
   class A B Block;
   model Y = A B A*B;
   random Block A*Block;
   ods output ConvergenceStatus=ConvergeStatus;

The SAS log displays various notes about convergence. The second and fourth samples (the doc example) converged without difficulty. The log shows a WARNING for the first group. The third group displays the note Estimated G matrix is not positive definite.

NOTE: An infinite likelihood is assumed in iteration 0 because of a nonpositive residual variance estimate.
WARNING: Stopped because of infinite likelihood.
NOTE: The above message was for the following BY group:
NOTE: Convergence criteria met.
NOTE: The above message was for the following BY group:
NOTE: Convergence criteria met.
NOTE: Estimated G matrix is not positive definite.
NOTE: The above message was for the following BY group:
NOTE: Convergence criteria met.
NOTE: The above message was for the following BY group:

Notice that the ODS OUTPUT statement writes the ConvergenceStatus table for each BY group to a SAS data set called ConvergeStatus. For each BY group, the ConvergenceStatus table is appended to a SAS data set called ConvergeStatus. The result of the PROC PRINT statement shows the columns of the table for each of the four groups:

proc print data=ConvergeStatus;  run;
Convergence status table in PROC MIXED, which shows which BY groups had convergence or modeling problems

The SampleID column is the value of the BY-group variable. The Reason column explains why the optimization process terminated. The Status column is 0 if the model converged and nonzero otherwise. Note that the third model converged, even though the G matrix was not positive definite!

To detect nonpositive definite matrices, you need to look at the pdG column, The pdG indicates which models had a positive definite G matrix (pdG=1) or did not (pdG=0). Although I do not discuss it in this article, the pdH column is an indicator variable that has value 0 if the SAS log displays the message NOTE: Convergence criteria met but final hessian is not positive definite.

The pdG column tells you which models did not have a positive definite variance matrix. You can merge the ConverenceStatus table with the original data and exclude (or keep) the samples that did not converge or that had invalid variance estimates, as shown in the following DATA step:

/* keep samples that converge and for which gradient and Hessian are positive definite */
data GoodSamples;
merge ByData ConvergeStatus;
by SampleID;
if Status=0 & pdG & pdH;

If you run PROC MIXED on the samples in the GoodSamples data set, all samples will converge without any scary-looking notes.

In summary, this article discusses the note Estimated G matrix is not positive definite, which sometimes occurs when using PROC MIXED or PROC GLMMIX in SAS. You should not ignore this note. This note can indicate that the model is misspecified. This article shows how you can detect this problem programmatically during a simulation study or a BY-group analysis. You can use the ConvergenceStatus table to identify the BY groups for which this the problem occurs. Kiernan, Tao, and Gibbs (2012) and Kiernan (2018) provide more insight into this note and other problems that you might encounter when fitting mixed models.

The post Convergence in mixed models: When the estimated G matrix is not positive definite appeared first on The DO Loop.

1月 092019

Numbers don't lie, but sometimes they don't reveal the full story. Last week I wrote about the most popular articles from The DO Loop in 2018. The popular articles are inevitably about elementary topics in SAS programming or statistics because those topics have broad appeal. However, I also write about advanced topics, which are less popular but fill an important niche in the SAS community. Not everyone needs to know how to fit a Pareto distribution in SAS or how to compute distance-based measures of correlation in SAS. Nevertheless, these topics are interesting to think about.

I believe that learning should not stop when we leave school. If you, too, are a lifelong learner, the following topics deserve a second look. I've included articles from four different categories.

Data Visualization

  • Fringe plot: When fitting a logistic model, you can plot the predicted probabilities versus a continuous covariate or versus the empirical probability. You can use a fringe plot to overlay the data on the plot of predicted probabilities. The SAS developer of PROC LOGISTIC liked this article a lot, so look for fringe plots in a future release of SAS/STAT software!
  • Order variables in a correlation matrix or scatter plot matrix: When displaying a graph that shows many variables (such as a scatter plot matrix), you can make the graph more understandable by ordering the variables so that similar variables are adjacent to each other. The article uses single-link clustering to order the variables, as suggested by Hurley (2004).
  • A stacked band plot, created in SAS by using PROC SGPLOT
  • Stacked band plot: You can use PROC SGPLOT to automatically create a stacked bar plot. However, when the bars represent an ordered categorical variable (such as months or years), you might want to create a stacked band plot instead. This article shows how to create a stacked band plot in SAS.

Statistics and Data Analysis

Random numbers and resampling methods

Process flow diagram shows how to resample data to create a bootstrap distribution.


These articles are technical but provide tips and techniques that you might find useful. Choose a few topics that are unfamiliar and teach yourself something new in this New Year!

Do you have a favorite article from 2018 that I did not include on the list? Share it in a comment!

The post 10 posts from 2018 that deserve a second look appeared first on The DO Loop.

1月 022019

Last year, I wrote more than 100 posts for The DO Loop blog. Of these, the most popular articles were about data visualization, SAS programming tips, and statistical data analysis. Here are the most popular articles from 2018 in each category.

Data Visualization

General SAS programming techniques

Statistics and Data Analysis

I write this blog because I love to learn new things and share what I know with others. If you want to learn something new, read (or re-read!) these popular articles from 2018. Then share this page with one of your colleagues. Happy New Year! I hope we both have many opportunities to learn and share in 2019!

The post Top posts from <em>The DO Loop</em> in 2018 appeared first on The DO Loop.