Tips and Techniques

1月 182021

Have you ever heard of the DOLIST syntax? You might know the syntax even if you are not familiar with the name. The DOLIST syntax is a way to specify a list of numerical values to an option in a SAS procedure. Applications include:

  • Specify the end points for bins of a histogram
  • Specify percentiles to be output to a data set
  • Specify tick marks for a custom axis on a graph
  • Specify the location of reference lines on a graph
  • Specify a list of parameters for an algorithm. Examples include smoothing parameters (the SMOOTH= option in PROC LOESS), sample sizes (the NTOTAL= option in PROC POWER), and initial guess for parameters in an optimization (the PARMS statement in PROC NLMIXED and PROC NLIN)

This article demonstrates how to use the DOLIST syntax to specify a list of values in SAS procedures. It shows how to use a single statement to specify individual values and also a sequence of values.

The DOLIST syntax enables you to write a single statement that specifies individual values and one or more sequences of values. The DOLIST syntax should be in the toolbox of every intermediate-level SAS programmer!

The DOLIST syntax in the SAS DATA step

According to the documentation of PROC POWER, the syntax described in this article is sometimes called the DOLIST syntax because it is based on the syntax for the iterative DO loop in the DATA step.

The most common syntax for a DO loop is DO x = start TO stop BY increment. For example, DO x = 10 TO 90 BY 20; iterates over the sequence of values 10, 30, 50, 70, and 90. If the increment is 1, you can omit the BY increment portion of the statement. However, you can also specify values as a common-separated list, such as DO x = 10, 30, 50, 70, 90;, which generates the same values. What you might not know is that you can combine these two methods. For example, in the following DATA step, the values are specified by using two comma-separated lists and three sequences. For clarity, I have placed each list on a separate line, but that is not necessary:

/* the DOLIST syntax for a DO loop in the DATA step */
data A;
do pctl = 5,                  /* individual value(s)  */
          10 to 50 by 20,     /* a sequence of values */
          54.3, 69.1,         /* individual value(s)  */
          80 to 90 by 5,      /* another sequence     */
          60 to 40 by -20;    /* yet another sequence */
proc print; run;

The output (not shown) is a list of values: 5, 10, 30, 50, 54.3, 69.1, 80, 85, 90, 60, 40. Notice that the values do not need to be in sorted order, although they often are.

The expressions to the right of the equal sign are what I mean by the "DOLIST syntax." You can use the same syntax to specify a list of options in many SAS procedures. When the SAS documentation says that an option takes a "list of values," you can often use a comma-separated list, a space-separated list, and the syntax start TO stop BY increment. (Or a combination of these expressions!) The following sections provide a few examples, but there are literally hundreds of options in SAS that support the DOLIST syntax!

Some procedures (for example, PROC SGPLOT) require the DOLIST values to be in parentheses. Consequently, I have adopted the convention of always using parentheses around DOLIST values, even if the parentheses are not strictly required. As far as I know, it is never wrong to put the DOLIST inside parentheses, and it keeps me from having to remember whether parentheses are required. The examples in this article all use parentheses to enclose DOLIST values.

Histogram bins and percentiles

You can use the DOLIST syntax to specify the endpoints of bins in a histogram. For example, in PROC UNIVARIATE, the ENDPOINTS= option in the HISTOGRAM statement supports a DOLIST. Because histograms use evenly spaced bins, usually you will specify only one sequence, as follows:

proc univariate;
   var weight;
   histogram weight / endpoints=(1800 to 7200 by 600);   /* DOLIST sequence expression */

You can also use the DOLIST syntax to specify percentiles. For example, the PCTLPTS= option on the OUTPUT statement enables you to specify which percentiles of the data should be written to a data set:

proc univariate;
   var MPG_City;
   output out=UniOut pctlpre=P_  pctlpts=(50 75, 95 to 100 by 2.5);  /* DOLIST */

Notice that this example specifies both individual percentiles (50 and 75) and a sequence of percentiles (95, 97.5, 100).

Tick marks and reference lines

The SGPLOT procedure enables you to specify the locations of tick marks on the axis of a graph. Most of the time you will specify an evenly spaced set of values, but (just for fun) the following example shows how you can use the DOLIST syntax to combine evenly spaced values and a few custom values:

title "Specify Ticks on the Y Axis";
proc sgplot;
   scatter x=Weight y=Mpg_City;
   yaxis grid values=(10 to 40 by 5, 50 60); /* DOLIST; commas optional */

As shown in the previous example, the GRID option on the XAXIS and YAXIS statements enables you to display reference lines at each tick location. However, sometimes you want to display reference lines independently from the tick marks. In that case, you can use the REFLINE statement, as follows:

title "Many Reference Lines";
proc sgplot;
   scatter x=Weight y=MPG_City;
   refline (1800 to 6000 by 600, 7000) / axis=x;  /* many reference lines */

Statistical procedures

Many statistical procedures have options that support lists. In most cases, you can use the DOLIST syntax to provide values for the list.

I have already written about how to use the DOLIST syntax to specify initial guesses for the PARM statement in PROC NLMIXED and PROC NLIN. The documentation for the POWER procedure discusses how to specify lists of values and uses the term "DOLIST" in its discussion.

Some statistical procedures enable you to specify multiple parameter values, and the analysis is repeated for each parameter in the list. One example is the SMOOTH= option in the MODEL statement of the LOESS procedure. The SMOOTH= option specifies values of the loess smoothing parameter. The following call to PROC LOESS fits four loess smoothers to the data. The call to PROC SGPLOT overlays the smoothers on a scatter plot of the data:

proc loess plots=none;
   model MPG_City = Weight / smooth=(0.1 to 0.5 by 0.2, 0.75); /* value-list */
   output out=LoessOut P=Pred;
proc sort data=LoessOut; by SmoothingParameter Weight; run;
proc sgplot data=LoessOut;
   scatter x=Weight y=MPG_City / transparency=0.9;
   series x=Weight y=Pred / group=SmoothingParameter 
          curvelabel curvelabelpos=min;


In summary, this article describes the DOLIST syntax in SAS, which enables you to simultaneously specify individual values and evenly spaced sequences of values. A sequence is specified by using the start TO step BY increment syntax. The DOLIST syntax is valid in many SAS procedures and in the DATA step. In some procedures (such as PROC SGPLOT), the syntax needs to be inside parentheses. For readability, you can use commas to separate individual values and sequences.

Many SAS procedures accept the special syntax even if it is not explicitly mentioned in the documentation. In the documentation for an option, look for terms such as value-list or numlist or value-1 <...value-n>, which indicate that the option supports the DOLIST syntax.

The post The DOLIST syntax: Specify a list of numerical values in SAS appeared first on The DO Loop.

1月 112021

On The DO Loop blog, I write about a diverse set of topics, including statistical data analysis, machine learning, statistical programming, data visualization, simulation, numerical analysis, and matrix computations. In a previous article, I presented some of my most popular blog posts from 2020. The most popular articles often deal with elementary or familiar topics that are useful to almost every data analyst.

However, among last year's 100+ articles are many that discuss advanced topics. Did you make a New Year's resolution to learn something new this year? Here is your chance! The following articles were fun to write and deserve a second look.

Machine learning concepts

Relationship between a threshold value and true/false negatives and positives

Statistical smoothers

Bilinear interpolation of 12 data values

I write a lot about scatter plot smoothers, which are typically parametric or nonparametric regression models. But a SAS customer wanted to know how to get SAS to perform various classical interpolation schemes such as linear and cubic interpolations:

SAS Viya and parallel computing

SAS is devoting tremendous resources to SAS Viya, which offers a modern analytic platform that runs in the cloud. One of the advantages of SAS Viya is the opportunity to take advantage of distributed computational resources. In 2020, I wrote a series of articles that demonstrate how to use the iml action in Viya 3.5 to implement custom parallel algorithms that use multiple nodes and threads on a cluster of machines. Whereas many actions in SAS Viya perform one and only one task, the iml action supports a general framework for custom, user-written, parallel computations:

The map-reduce functionality in the iml action

  • The map-reduce paradigm is a two-step process for distributing a computation. Every thread runs a function and produces a result for the data that it sees. The results are aggregated and returned. The iml action supports the MAPREDUCE function, which implements the map-reduce paradigm.
  • The parallel-tasks paradigm is a way to run independent computations concurrently. The iml action supports the PARTASKS function, which implements the map-reduce paradigm.

Simulation and visualization

Decomposition of a convex polygon into triangles

Generate random points in a polygon

Your turn

Did I omit one of your favorite blog posts from The DO Loop in 2020? If so, leave a comment and tell me what topic you found interesting or useful. And if you missed some of these articles when they were first published, consider subscribing to The DO Loop in 2021.

The post Blog posts from 2020 that deserve a second look appeared first on The DO Loop.

1月 042021

Last year, I wrote more than 100 posts for The DO Loop blog. In previous years, the most popular articles were about SAS programming tips, statistical analysis, and data visualization. But not in 2020. In 2020, when the world was ravaged by the coronavirus pandemic, the most-read articles were related to analyzing and visualizing the tragic loss and suffering of the pandemic. Here are some of the most popular articles from 2019 in several categories.

The coronavirus pandemic

Relationship between new cases and cumulative cases in COVID-19 infections

Statistical analysis: Regression

Visualization of collinearity diagnostics

Other statistical analyses

Data visualization

Many articles in the previous sections included data visualization, but two popular articles are specifically about data visualization:

Reference lines at values of statistical estimates

Many people claim they want to forget 2020, but these articles provide a few tips and techniques that you might want to remember. So, read (or re-read!) these popular articles from 2020. And if you made a resolution to learn something new this year, consider subscribing to The DO Loop so you don't miss a single article!

The post Top posts from <em>The DO Loop</em> in 2020 appeared first on The DO Loop.

8月 052020

Have you ever seen the "brain teaser" for children that shows a 4 x 4 grid and asks "how many squares of any size are in this grid?" To solve this problem, the reader must recognize that there are sixteen 1 x 1 squares, nine 2 x 2 squares, four 3 x 3 squares, and one 4 x 4 square. Hence there are a total of (16+9+4+1) = 30 squares of any size.

Recently a SAS statistical programmer asked a similar question about submatrices of a matrix. Specifically, the programmer wanted to form square submatrices of consecutive rows and columns. This article shows a few tricks in the SAS/IML language that enable you to generate and compute with all k x k submatrices of a matrix, where the submatrices are formed by using consecutive rows and columns of the original matrix. Although it is possible to consider more general submatrices, in this article, a "submatrix" always refers to one that is formed by using consecutive rows and columns.

Use subscripts to form submatrices

I've previously written about several ways to extract submatrices from a matrix. Recall the "colon operator" (formally called the index creation operator) generates a vector of consecutive integers. You can use the vectors in the row and column position of the subscripts to extract a submatrix. For example, the following SAS/IML statements define a 4 x 4 matrix and extract the four 3 x 3 submatrices:

A = {4 3 1 6,
     2 4 3 1,
     0 2 4 3,
     5 0 2 4};
/* extract the four 3 x 3 submatrices */
A11 = A[1:3, 1:3];
A12 = A[1:3, 2:4];
A21 = A[2:4, 1:3];
A22 = A[2:4, 2:4];

Manually specifying the row and column numbers is tedious. Let's write a SAS/IML function that will return a k x k submatrix of A by starting at the (i,j)th position of A. Call k the order of the submatrix.

/* Given A, return the square submatrix of order k starting at index (i,j) */
start submat(A, i, j, k);
   return A[ i:(i+k-1), j:(j+k-1) ];
/* example : 3 x 3 submatrices of A */
A11 = submat(A, 1, 1, 3);
A12 = submat(A, 1, 2, 3);
A21 = submat(A, 2, 1, 3);
A22 = submat(A, 2, 2, 3);
/* */
ods layout gridded columns=4 advance=table;
print A11, A12, A21, A22;
ods layout end;

The number of submatrices

If A is an n x m matrix, we can ask how many submatrices of order k are in A. Recall that we are only considering submatrices formed by consecutive rows and columns. There are (n – k + 1) sequences of consecutive rows of length k, such as 1:k, 2:(k+1), and so forth. Similarly, there are (m – k + 1) sequences of consecutive columns of length k. So the following SAS/IML function counts the number of submatrices of order k. You can call the function for different k=1, 2, 3, and 4 in order to solve the "Counting Squares" brain teaser:

/* count all submatrices of order k for the matrix A */
start CountAllSubmat(A, k);
   numRows = nrow(A)-k+1;
   numCols = ncol(A)-k+1;
   return numRows * numCols;          
n = nrow(A);
numSubmat = j(n, 1, .);
do k = 1 to n;
   numSubmat[k] = CountAllSubmat(A, k);
order = T(1:n);
print order numSubmat;

If you add up the number of squares each size, you get a total of 30 squares.

Iterate over submatrices of a specified order

You can iterate over all submatrices of a specified order. You can perform a matrix computation on each submatrix (for example, compute its determinant), or you can pack it into a list for future processing. The following SAS/IML function iterates over submatrices and puts each submatrix into a list. The STRUCT subroutine from the ListUtil package can be used to indicate the contents of the list in a concise form.

start GetListSubmat(A, k);
   numSub = CountAllSubmat(A, k);
   L = ListCreate( numSub );           /* create list with numSub items */
   cnt = 1;
   do i = 1 to nrow(A)-k+1;            /* for each row */
      do j = 1 to ncol(A)-k+1;         /* and for each column */
         B = submat(A, i, j, k);       /* compute submatrix of order k from (i,j)th cell */
         call ListSetItem(L, cnt, B);  /* put in list (or compute with B here) */
         cnt = cnt + 1;
   return L;
/* Example: get all matrices of order k=2 */
L = GetListSubmat(A, 2);
package load ListUtil;
call Struct(L);

The output from the STRUCT call shwos that there are nine 2 x 2 matrices. The Value1–Valuie4 columns show the values (in row-major order). For example, the first 2 x 2 submatrix (in the upper left corner of A) is {4 3, 2 4}. The second submatrix (beginning with A[1,2]) is {3 1, 4 3}.

It is straightforward to modify the function to compute the determinant or some other matrix computation.


This article shows some tips and techniques for dealing with submatrices of a matrix. The submatrices in this article are formed by using consecutive rows and columns. Thus, they are similar to the "Counting Squares" brain teaser, which requires that you consider sub-squares of order 1, 2, 3, and so forth.

The post Submatrices of matrices appeared first on The DO Loop.

1月 062020

Last year, I wrote more than 100 posts for The DO Loop blog. The most popular articles were about SAS programming tips for data analysis, statistical analysis, and data visualization. Here are the most popular articles from 2019 in each category.

SAS programming tips

The Essential Guide to Binning

  • Create training, testing, and validation data sets:: This post shows how to create training, validation, and test data sets in SAS. This technique is popular in data science and machine learning because you typically want to fit the model on one set of data and then evaluate the goodness of fit by using a different set of data.
  • 5 reasons to use PROC FORMAT to recode variables in SAS: Often SAS programmers use PROC SQL or the SAS DATA step to create new data variables to recode raw data. This is not always necessary. It is more efficient to use PROC FORMAT to recode the raw data. Learn five reasons why you should use PROC FORMAT to recode variables.
  • Conditionally append observations to a data set: In preparing data for graphing. you might want to add additional data to the end of a data set. (For example, to plot reference lines or text.) You can do this by using one DATA step to create the new observations and another DATA step to merge the new observations to the end of the original data. However, you can also use the END= option and end-of-file processing to append new observations to data. Read about how to use the END= option to append data. The comments at the end of the article show how to perform similar computations by using a hash table, PROC SQL, and a DOW loop.
  • Use PROC HPBIN to bin numerical variables: In machine learning, it is common to bin numerical variables into a set of discrete values. You can use PROC HPBIN to bin multiple variables in a single pass through the data. PROC HPBIN can bin data into equal-length bins (called bucket binning) or by using quantiles of the data. This article and the PROC FORMAT article are both referenced in my Essential Guide to Binning in SAS.

Statistical analyses

Geometric Meaning of Kolmogorov's D

Data visualization

Visualize an Interaction Effect

I always enjoy learning new programming methods, new statistical ideas, and new data visualization techniques. If you like to learn new things, too, read (or re-read!) these popular articles from 2019. Then share this page with a friend. I hope we both have many opportunities to learn and share together in the new year.

The post Top posts from <em>The DO Loop</em> in 2019 appeared first on The DO Loop.

10月 302019

Every year at Halloween, I post an article that shows a SAS trick that is a real treat. This article shows how to use the INTNX function to find dates that are related to a specified date. The INTNX function is a sweet treat, indeed.

I previously wrote an article about how to use the INTCK and INTNX functions to compute with dates. These Base SAS functions are very powerful and deserve to be known more widely. In particular, the INTNX function enables you to compute the next or previous date that is a certain number of time units away from a known date. The "time unit" is usually a day, week, a month, or a year, but the function supports many options.

Recently, a SAS programmer asked how to get "the first and last days of the previous month." The programmer added that the expression needs to be used in a WHERE clause. Because the expression needs to appear in a WHERE clause, it should be concise and not require several statements or temporary variables.

The first day of a previous (or subsequent) time interval

Finding the first day of the previous month is an ideal situation for using the INTNX function. The basic syntax of the INTNX function is
      INTNX(timeUnit, startDate, numberOfUnits)
This form of the INTNX function returns the first day of the specified time unit. For example, the following statements give dates relative to the bombing of Pearl Harbor on 07DEC1941. The time interval is 'month'. Notice that you can ask for dates after the given date (a positive number of time units) or before the given date (a negative number of units). If you specify 0 for the third argument, you get the current month.

data Months;
date = '07DEC1941'd;
FirstDayCurrMonth = intnx('month', Date,  0);   /*  0 = current month */
FirstDayPrevMonth = intnx('month', Date, -1);   /* -1 = previous month */
FirstDayNextMonth = intnx('month', Date,  1);   /*  1 = next month */
FirstDay6Months   = intnx('month', Date,  6);   /*  6 = six months later */
format _ALL_ date9.;
proc print data=Months noobs;

Because the time unit is 'month' for this example, the calculated dates are the first day of the months relative to 07DEC1941. If you change 'month' to 'year', all the calculated dates will be 01JAN of some year relative to 1941.

The last (or same) day of a previous (or subsequent) time interval

A cool fact about the INTNX function is that it supports an optional fourth argument that enables you to specify whether you want the calculated date to be at the beginning, the middle, or the end of the specified time interval. You can even specify that you want the "same" characteristics as the source date, which is useful for finding anniversaries of an event. For example, the following statements vary the fourth argument, which can be one of four values:

data FirstLastMiddle;
date = '07DEC1941'd;
FirstDayPrevMonth = intnx('month', Date, -1, 'B');   /* B = beginning */
LastDayPrevMonth  = intnx('month', Date, -1, 'E');   /* E = end */
MiddlePrevMonth   = intnx('month', Date, -1, 'M');   /* M = middle */
FirstAnniv        = intnx('year',  Date,  1, 'S');   /* S = same */
format _ALL_ date9.;
proc print data=FirstLastMiddle noobs;

The program shows that you can find the first day of the previous month, the last day of the previous month, the middle of the previous month, or an anniversary of the specified date. In particular, the program answers the programmer's question by showing a concise "one-liner" that you can use to get the first and last days of the previous month.

In summary, the INTNX function is a powerful tool for working with dates. It enables you to find dates that are related to a specified date. You can use the first argument to specify the time unit (day, week, month, year,...) and the third argument to specify the number of time units before or after the specified data. The optional fourth argument determines whether you want the first, last, middle, or "same" portion of the time interval.

Whether you work with dates in SAS every day or whether you work with them occasionally, the INTNX function is a sweet treat to remember. No tricks required.

The post Compute the first or last day of a month or year appeared first on The DO Loop.

10月 282019

A common task in SAS programming is to specify a list of variables that satisfy some pattern. You can specify lists for the KEEP= or DROP= data set options, and you can use lists of variables on many SAS statements such as the VAR and MODEL statements. Although SAS has built-in support for some patterns (like variables that start with the same prefix), you might want to match variable names to less-common patterns. In those situations, you can use regular expressions to match variable names by using the PRXMATCH function in Base SAS. The PRXMATCH function is one of several functions in SAS that support Perl regular expressions (PRX).

Built-in support for specifying variables in SAS

In a previous article, I discussed six different ways to create a list of variable names in SAS. Of these, the most common are

  • The colon operator for specifying names that have a common prefix. For example, AGE: specifies variables that begin with the prefix "age" (recall that SAS variable names are case insensitive).
  • The dash operator for specifying a sequence of variable names that have the same prefix and a numerical suffix. For example, X1-X5 matches the variables X1, X2, X3, X4, and X5.

For example, the following table shows variables in the Sashelp.Heart data set:

proc contents data=Sashelp.Heart short varnum; run;

You can see that several variables begin with the prefix 'Age'. By using the colon operator (Age:) you can match all variables that match the prefix, such as in the following example:

title "Colon (Prefix) Variable Names";
data PrefixVars;
   set Sashelp.Heart(keep=Age:);
proc contents short varnum; run;

Specifying variables that have a common suffix

Several of the variables in the Sashelp.Heart data end with the suffix 'Status'. Unfortunately, there is no built-in operator in SAS to match a suffix such as 'Status'. However, Bruno Mueller and Mark Jordan have provided a SAS macro that uses regular expressions to select variables that match any pattern. The list of variables is returned as a text string, so you can use it in the usual places, such as the KEEP= options or a VAR statement. The following example assumes you have downloaded and run the definition of the %varListPattern macro.

title "Suffix Variable Names";
data SuffixVars;
   set Sashelp.Heart(keep=%varListPattern(sashelp.Heart,*status));
proc contents short varnum; run;

This macro fills a need, and I like it a lot. In addition to matching variables that share a common suffix, you can also use the macro to find variables that match other patterns. For example, you can use the pattern "*at*" to match variables that have the string "at" anywhere in their name. In addition to the "status" variables found above, that pattern also matches the variables DeathCause, AgeAtStart, and AgeAtDeath.

Some programmers don't realize that all SAS procedures support data set options (such as KEEP= and DROP=) when you specify the name of a data set in a procedure. That means that you can use the built-in pattern matching and the %varListPattern macro throughout SAS. For example, here is how you would read a set of prefix-variables and a set of suffix-variables into SAS/IML matrices:

proc iml;
use Sashelp.Heart(keep=Age:);       /* use the coon operator to match prefixes */
   read all var _ALL_ into X[colname=prefixVars];
print prefixVars;
use Sashelp.Heart(keep=%varListPattern(sashelp.Heart,*status)); /* use macro to match suffixes */
   read all var _ALL_ into C[colname=suffixVars];
print suffixVars;

Extract a vector of strings that match a pattern

At its heart, the %varListPattern macro calls the PRXMATCH function. You can use the PRXMATCH function to determine whether a variable name (in fact, any string!) matches a pattern that is specified by a regular expression. You can use the PRXMATCH function when the names of variables are in a data set. You can then use PROC SQL to create a macro variable that contains a space-separated list of all variables that match the pattern.

I needed this functionality recently when I was writing a SAS/IML function and needed to select all strings that had a common suffix. To understand the next example, recall the following SAS/IML programming features:

The following SAS/IML functions construct patterns like /^PREFIX/i and /.*SUFFIX$/i and use the PRXMATCH function to find strings that match the patterns. To demonstrate the function, the program uses the variable names in the Sashelp.Heart data.

proc iml;
/* Use PRXMATCH to find strings with a common prefix */
start MatchPrefix(prefix, str);
   re = '/^' + strip(prefix) + '/i';    /* beginning of word, case insensitive */
   idx = loc(prxmatch(re, str));
   if ncol(idx)=0 then return( {} );    /* return empty matrix */
   return( str[idx] );
/* Use PRXMATCH to find strings with a common suffix */
start MatchSuffix(suffix, str);
   re = '/.*' + strip(suffix) + '$/i';  /* end of word, case insensitive */
   idx = loc(prxmatch(re, right(str))); /* shift right so no blank spaces at end */
   if ncol(idx)=0 then return( {} );    /* return empty matrix */
   return( str[idx] );
Variable = contents( "Sashelp", "Heart" );   /* get variable names in data order */
prefixVars = MatchPrefix("Age", Variable);
suffixVars = MatchSuffix("status", Variable);
print prefixVars, suffixVars;

I emphasize that although this example uses variable names as the strings, you can use the PRXMATCH function to search for patterns in arbitrary strings. In a similar way, you can construct other functions that find strings that match any regular expression that you can construct.

In summary, regular expressions in SAS are a powerful feature for matching strings that contain some pattern of characters. The examples in this article use simple regular expressions to find all variables that contain a common prefix (same as the built-in colon operator) or a common suffix. For many SAS procedures that require a list of variable names, you can use the %varListPattern macro to generate the variable list. For other applications, you can call the PRXMATCH function directly.

For an introduction to regular expressions in SAS, see "An Introduction to Perl Regular Expressions in SAS 9" (Cody, 2004) and "The Basics of the PRX Functions" (Cassell, 2007).

The post Use regular expressions to specify variable names in SAS appeared first on The DO Loop.

8月 072019
2-D binning of counts

Do you want to bin a numeric variable into a small number of discrete groups? This article compiles a dozen resources and examples related to binning a continuous variable. The examples show both equal-width binning and quantile binning. In addition to standard one-dimensional techniques, this article also discusses various techniques for 2-D binning.

SAS procedures that support binning include the HPBIN, IML, KDE, RANK, and UNIVARIATE procedures.

Equal-width binning in SAS

The simplest binning technique is to form equal-width bins, which is also known as bucket binning. If a variable has the range [Min, Max] and you want to split the data into k equal-width bins (or buckets), each bin will have width (Max - Min) / k.

Quantile binning in SAS

In bucket binning, some bins have more observations than others. This enables you to estimate the density of the data, as in a histogram. However, you might want all bins to contain about the same number of observations. In that case, you can use quantiles of the data as cutpoints. If you want four bins, use the 25th, 50th, and 75th percentiles as cutpoints. If you want 10 bins, use the sample deciles as cutpoints. Here are several resources for quantile binning:

Binning by using arbitrary cutpoints in SAS

Sometimes you need to bin based on scientific standards or business rules. For example, the Saffir-Simpson hurricane scale uses specific wind speeds to classify a hurricane as Category 1, Category 2, and so forth. In these cases, you need to be able to define custom cutpoints and assign observations to bins based on those cutpoints.

2-D binning and bivariate histograms in SAS

A histogram is a visualization of a univariate equal-width binning scheme. You can perform similar computations and visualizations for two-dimensional data. If your goal is to understand the density of continuous bivariate data, you might want to use a bivariate histogram rather than a scatter plot (which, for large samples, suffers from overplotting).

In summary, this guide provides many links to programs and examples that bin data in SAS. Whether you want to use equal-width bins, quantile bins, or two-dimensional bins, hopefully, you will find an example to get you started. If I've missed an important topic, or if you have a favorite binning method that I have not covered, leave a comment.

The post The essential guide to binning in SAS appeared first on The DO Loop.

5月 132019

In SAS/IML programs, a common task is to write values in a matrix to a SAS data set. For some programs, the values you want to write are in a matrix and you use the CREATE FROM/APPEND FROM syntax to create the data set, as follows:

proc iml;
X = {1  2  3, 
     4  5  6, 
     7  8  9, 
    10 11 12};
create MyData from X[colname={'A' 'B' 'C'}];  /* create data set and variables */
append from X;                                /* write all rows of X */
close;                                        /* close the data set */

In other programs, the results are computed inside an iterative DO loop. If you can figure out how many observations are generated inside the loop, it is smart to allocate room for the results prior to the loop, assign the rows inside the loop, and then write to a data set after the loop.

However, sometimes you do not know in advance how many results will be generated inside a loop. Examples include certain kinds of simulations and algorithms that iterate until convergence. An example is shown in the following program. Each iteration of the loop generates a different number of rows, which are appended to the Z matrix. If you do not know in advance how many rows Z will eventually contain, you cannot allocate the Z matrix prior to the loop. Instead, a common technique is to use vertical concatenation to append each new result to the previous results, as follows:

/* sometimes it is hard to determine in advance how many rows are in the final result */
free Z;
do n = 1 to 4;
   k = n + floor(n/2);      /* number of rows */
   Y = j(k , 3, n);         /* k x 3 matrix */
   Z = Z // Y;              /* vertical concatenation of results */
create MyData2 from Z[colname={'A' 'B' 'C'}];  /* create data set and variables */
append from Z;                                 /* write all rows */
close;                                         /* close the data set */

Concatenation within a loop tends to be inefficient. As I like to say, "friends don't let friends concatenate results inside a loop!"

If your ultimate goal is to write the observations to a data set, you can write each sub-result to the data set from inside the DO loop! The APPEND FROM statement writes whatever data are in the specified matrix, and you can call the APPEND FROM statement multiple times. Each call will write the contents of the matrix to the open data set. You can update the matrix or even change the number of rows in the matrix. For example, the following program opens the data set prior to the DO loop, appends to the data set multiple times (each time with a different number of rows), and then closes the data set after the loop ends.

/* alternative: create data set, write to it during the loop, then close it */
Z = {. . .};                /* tell CREATE stmt that data will contain three numerical variables */
create MyData3 from Z[colname={'A' 'B' 'C'}];   /* open before the loop. The TYPE of the variables are known. */
do n = 1 to 4;
   k = n + floor(n/2);      /* number of rows */
   Z = j(k , 3, n);         /* k x 3 matrix */
   append from Z;           /* write each block of data */
close;                      /* close the data set */

The following output shows the contents of the MyData3 data set, which is identical to the MyData2 data set:

Notice that the CREATE statement must know the number and type (numerical or character) of the data set variables so that it can set up the data set for writing. If you are writing character variables, you also need to specify the length of the variables. I typically use missing values to tell the CREATE statement the number and type of the variables. These values are not written to the data set. It is the APPEND statement that writes data.

I previously wrote about this technique in the article "Writing data in chunks," which was focused on writing large data set that might not fit into RAM. However, the same technique is useful for writing data when the total number of rows is not known until run time. I also use it when running simulations that generate multivariate data. This technique provides a way to write data from inside a DO loop and to avoid concatenating matrices within the loop.

The post Write to a SAS data set from inside a SAS/IML loop appeared first on The DO Loop.

4月 152019

A quadratic form is a second-degree polynomial that does not have any linear or constant terms. For multivariate polynomials, you can quickly evaluate a quadratic form by using the matrix expression
x` A x
This computation is straightforward in a matrix language such as SAS/IML. However, some computations in statistics require that you evaluate a quadratic form that looks like
x` A-1 x
where A is symmetric and positive definite. Typically, you know A, but not the inverse of A. This article shows how to compute both kinds of quadratic forms efficiently.

What is a quadratic form?

For multivariate polynomials, you can represent the purely quadratic terms by a symmetric matrix, A. The polynomial is q(x) = x` A x, where x is an d x 1 vector and A is a d x d symmetric matrix. For example, if A is the matrix {9 -2, -2 5} and x = {x1, x2} is a column vector, the expression x` A x equals the second degree polynomial q(x1, x2) = 9*x12 - 4*x1 x2 + 5*x22. A contour plot of this polynomial is shown below.

Contour plot of the quadratic form x` A x for A = {9 -2, -2 5}

Probably the most familiar quadratic form is the squared Euclidean distance. If you let A be the d-dimensional identity matrix, then the squared Euclidean distance of a vector x from the origin is x` Id x = x` x = Σi xi2. You can obtain a weighted squared distance by using a diagonal matrix that has positive values. For example, if W = diag({2, 4}), then x` W x = 2*x12 + 4*x22. If you add in off-diagonal elements, you get cross terms in the polynomial.

Efficient evaluation of a quadratic form

If the matrix A is dense, then you can use matrix multiplication to evaluate the quadratic form. The following symmetric 3 x 3 matrix defines a quadratic polynomial in 3 variables. The multiplication evaluates the polynomial at (x1, x2, x3) = (-1. 2. 0.5).

proc iml;
   q(x1, x2, x3) = 4*x1**2 + 6*x2**2 + 9*x3**2 +
                   2*3*x1*x2 + 2*2*x1*x3 + 2*1*x2*x3
A = {4 3 2,
     3 6 1,
     2 1 9};
x = {-1, 2, 0.5};
q1 = x`*A*x;
print q1;

When you are dealing with large matrices, always remember that you should never explicitly form a large diagonal matrix. Multiplying with a large diagonal matrix is a waste of time and memory. Instead, you can use elementwise multiplication to evaluate the quadratic form more efficiently:

w = {4, 6, 9};
q2 = x`*(w#x);     /* more efficient than q = x`*diag(w)*x; */
print q2;

Quadratic forms with positive definite matrices

In statistics, the matrix is often symmetric positive definite (SPD). The matrix A might be a covariance matrix (for a nondegenerate system), the inverse of a covariance matrix, or the Hessian evaluated at the minimum of a function. (Recall that the inverse of a symmetric positive definite (SPD) matrix is SPD.) An important example is the squared Mahalanobis distance x` Σ-1 x, which is a quadratic form.

As I have previously written, you can use a trick in linear algebra to efficiently compute the Mahalanobis distance. The trick is to compute the Cholesky decomposition of the SPD matrix. I'll use a large Toeplitz matrix, which is guaranteed to be symmetric and positive definite, to demonstrate the technique. The function EvalSPDQuadForm evaluates a quadratic form defined by the SPD matrix A at the coordinates given by x:

/* Evaluate quadratic form q = x`*A*x, where A is symmetric positive definite.
   Let A = U`*U be the Cholesky decomposition of A.
   Then q = x`(U`*U)*x = (U*x)`(Ux)
   So define z = U*x and compute q = z`*z
start EvalSPDQuadForm(A, x);
   U = root(A);           /* Cholesky root */
   z = U*x;
   return (z` * z);       /* dot product of vectors */
/* Run on large example */
call randseed(123);
N = 1000;
A = toeplitz(N:1);
x = randfun(N, "Normal");
q3 = EvalSPDQuadForm(A, x);  /* efficient */
qCheck = x`*A*x;             /* check computation by using a slower method */
print q3 qCheck;

You can use a similar trick to evaluate the quadratic form x` A-1 x. I previously used this trick to evaluate the Mahalanobis distance efficiently. It combines a Cholesky decomposition (the ROOT function in SAS/IML) and the TRISOLV function for solving triangular systems.

/* Evaluate quadratic form x`*inv(A)*x, where  A is symmetric positive definite.
   Let w be the solution of A*w = x and A=U`U be the Cholesky decomposition.
   To compute  q = x` * inv(A) * x = x` * w, you need to solve for w.
   (U`U)*w = x, so
        First solve triangular system U` z = x   for z,
        then solve triangular system  U w = z   for w 
start EvalInvQuadForm(A, x);
   U = root(A);              /* Cholesky root */
   z = trisolv(2, U, x);     /* solve linear system */
   w = trisolv(1, U, z);  
   return (x` * w);          /* dot product of vectors */
/* test the function */
q4 = EvalInvQuadForm(A, x);  /* efficient */
qCheck = x`*Solve(A,x);      /* check computation by using a slower method */
print q4 qCheck;

You might wonder how much time is saved by using the Cholesky root and triangular solvers, rather than by computing the general solution by using the SOLVE function. The following graph compares the average time to evaluate the same quadratic form by using both methods. You can see that for large matrices, the ROOT-TRISOLV method is many times faster than the straightforward method that uses SOLVE.

In summary, you can use several tricks in numerical linear algebra to efficiently evaluate a quadratic form, which is a multivariate quadratic polynomial that contains only second-degree terms. These techniques can greatly improve the speed of your computational programs.

The post Efficient evaluation of a quadratic form appeared first on The DO Loop.