Statistical Programming

12月 132010
 
There are three kinds of programming errors: parse-time errors, run-time errors, and logical errors. It doesn't matter what language you are using (SAS/IML, MATLAB, R, C/C++, Java,....), these errors creep up everywhere. Two of these errors cause a program to report an error, whereas the third is more insidious because the program might run to completion while silently delivering the wrong answer.

This article describes ways to find and fix each kind of error.

Error, Errors, Everywhere

Suppose you are trying to write a SAS/IML program that computes the factorial of a number, n. The following statements might represent your initial attempt:

proc iml;
n = 200;  
fact = 1  
do k = 1 to n;
   fact = fact * n;
end;
print fact;

The program contains three errors—one of each kind—for an impressive error-to-line ratio of 50%. Can you find the three errors?

Parse-Time Errors

A parse-time error occurs when the syntax of the program is incorrect. (This is also called a compile-time error for languages such as C/C++ and Java.) A parse-time error is the easiest error to correct because the parser (or compiler) tells you exactly what is wrong and on what line the problem occurs.

Common parse-time errors include mistyping a statement, forgetting a semicolon, or failing to close a set of parentheses. In strongly typed languages such as C/C++, Java, and IMLPlus, you also get a parse-time error when you try to use a variable of one type when a different type is expected. For example, it is an error to pass an integer into a function that is expecting an array or an object of a class.

In PROC IML, parse-time errors are reported in the SAS log. In SAS/IML Studio, you can select Program > Check Syntax to check your program for parse-time errors.

For the example program, SAS/IML Studio reports the following error:

ERROR: The program contains a syntax error. 
IMLPlus did not expect the following text: do (4, 1)

There is nothing wrong with the DO statement on Line 4, but there is a missing semicolon at the end of Line 3. As a result, the IMLPlus parser sees the statement fact = 1 do k = 1 to n; which is invalid syntax.

Fix #1: To fix the parse-time error, insert a semicolon at the end of Line 3.

Run-Time Errors

A run-time error does not occur until the program is actually run. Common SAS/IML run-time errors include adding matrices that are different sizes, taking the logarithm of a negative value, and using the matrix index operator to specify indices that do not exist.

A previous blog post shows how to interpret error messages that appear in the SAS log when SAS/IML software encounters a run-time error.

SAS/IML Studio has some nice features for finding and fixing parse-time and run-time errors. When you run the revised test program, the SAS log reports the following error:

»ERROR: Overflow error in *.

You can jump directly to the location of the error in a program window, by doing the following:

  1. Right-click the error message in the Error Log window. A pop-up menu is displayed.
  2. Select Go to Source.


SAS/IML Studio positions the cursor in the Program window at the location of the error. By using the techniques in the previous blog post and by using the Auxiliary Input window to interrogate the value of k at the time of the error, you discover that the numerical overflow error occurs when k is 134. "Hmmmm," you think to yourself, "200! is a big number (in fact, it's 374 digits long!), perhaps I should try a more manageable value."

Fix #2: To fix the run-time error, decrease the value of n. For definiteness, set n=20.

Logical Errors

By far, the most difficult error to find is the logical error. A program can have a logical error due to a mistyped formula or due to an incorrectly implemented algorithm. The savvy statistical programmer can use the following techniques to find and eliminate logical errors:

  • Test the program on simple cases for which the result of the program is known.
  • Break down the program into a sequence of basic steps and independently test each component.
  • Favor clarity and simplicity when you initially write the program. After the program is working, you can profile the code and go back to optimize sections that are performance bottlenecks.

If you run the program with n=20, the program prints the following value:

fact

1.0486E26

Is this the correct value for 20!? I don't know. Maybe I should test my program on a simpler case? I know that 3!=6, so I'll change the value of n to n=3 and re-run the program. The program prints the following value:

fact

27

I know that 27 isn't correct, and I recognize that 27 = 33 so I review the logic of my program statements. Sure enough, I've made a mistake. The statement inside the loop should involve the counter k, not the value n.

Fix #3: To fix the logical error, the statement inside the loop should be fact = fact * k.

Conclusions

SAS/IML Studio contains features that help you to find and fix the three types of programming errors. You can use the techniques in this article to find errors.

However, an even better strategy is to avoid errors by "being a code samurai," which means that you should think hard about the problem and research it before you write any code. For example, if I had gone to support.sas.com and searched for "factorial," I would have discovered that SAS has a built-in FACT function, which reduces the program to a single line:

fact = fact(20);

It doesn't get much simpler than that.

12月 082010
 
Sample covariance matrices and correlation matrices are used frequently in multivariate statistics. This post shows how to compute these matrices in SAS and use them in a SAS/IML program. There are two ways to compute these matrices:
  1. Compute the covariance and correlation with PROC CORR and read the results into PROC IML
  2. Compute the matrices entirely with PROC IML

Computing a Covariance and Correlation Matrix with PROC CORR

You can use PROC CORR to compute the correlation matrix (or, more correctly, the "Pearson product-moment correlation matrix," since there are other measures of correlation which you can also compute with PROC CORR). The following statements compute the covariance matrix and the correlation matrix for the three numerical variables in the SASHELP.CLASS data set.

ods select Cov PearsonCorr;
proc corr data=sashelp.class noprob outp=OutCorr /** store results **/
          nomiss /** listwise deletion of missing values **/
          cov;   /**  include covariances **/
var Height Weight Age;
run;

The OutCorr data set contains various statistics about the data, as shown by running PROC PRINT:

proc print data=OutCorr; run;

Obs

_TYPE_

_NAME_

Height

Weight

Age

1

COV

Height

26.287

102.493

6.2099

2

COV

Weight

102.493

518.652

25.1857

3

COV

Age

6.210

25.186

2.2281

4

MEAN

62.337

100.026

13.3158

5

STD

5.127

22.774

1.4927

6

N

19.000

19.000

19.0000

7

CORR

Height

1.000

0.878

0.8114

8

CORR

Weight

0.878

1.000

0.7409

9

CORR

Age

0.811

0.741

1.0000

If you want to use the covariance or correlation matrix in PROC IML, you can read the appropriate values into a SAS/IML matrix by using a WHERE clause on the USE statement:


proc iml;
use OutCorr where(_TYPE_="COV");
read all var _NUM_ into cov[colname=varNames];

use OutCorr where(_TYPE_="CORR");
read all var _NUM_ into corr[colname=varNames];
close OutCorr;

Notice that this SAS/IML code is independent of the number of variables in the data set.

Computation of the Covariance and Correlation Matrix in PROC IML

If the data are in SAS/IML vectors, you can compute the covariance and correlation matrices by using matrix multiplication to form the matrix that contains the corrected sum of squares of cross products (CSSCP).

Suppose you are given p SAS/IML vectors x1, x2, ..., xp. To form the covariance matrix for these data:

  1. Use the horizontal concatenation operator to concatenate the vectors into a matrix whose columns are the vectors.
  2. Center each vector by subtracting the sample mean.
  3. Form the CSSCP matrix (also called the "X-prime-X matrix") by multiplying the matrix transpose and the matrix.
  4. Divide by n-1 where n is the number of observations in the vectors.
This process assumes that there are no missing values in the data. Otherwise, it needs to be slightly amended. Formulas for various matrix quantities are given in the SAS/STAT User's Guide.

The following SAS/IML statements define a SAS/IML module that computes the sample covariance matrix of a data matrix. For this example the data are read from a SAS data set, so Step 1 (horizontal concatenation of vectors) is skipped.

proc iml;
start Cov(A);             /** define module to compute a covariance matrix **/
   n = nrow(A);           /** assume no missing values **/
   C = A - A[:,];         /** subtract mean to center the data **/
   return( (C` * C) / (n-1) );
finish;

/** read or enter data matrix into X **/
varNames = {"Height" "Weight" "Age"};
use sashelp.class; read all var varNames into X; close sashelp.class;

cov = Cov(X);
print cov[c=varNames r=varNames];

cov

Height

Weight

Age

Height

26.286901

102.49342

6.2099415

Weight

102.49342

518.65205

25.185673

Age

6.2099415

25.185673

2.2280702

Computing the Pearson correlation matrix requires the same steps, but also that the columns of the centered data matrix be scaled to have unit standard deviation. SAS/IML software already has a built-in CORR function, so it is not necessary to define a Corr module, but it is nevertheless instructive to see how such a module might be written:

start MyCorr(A);
   n = nrow(A);                   /** assume no missing values     **/
   C = A - A[:,];                 /** center the data              **/
   stdCol = sqrt(C[##,] / (n-1)); /** std deviation of columns     **/
   stdC = C / stdCol;             /** assume data are not constant **/
   return( (stdC` * stdC) / (n-1) );
finish;

corr = MyCorr(X);
print corr[c=varNames r=varNames];

corr

Height

Weight

Age

Height

1

0.8777852

0.8114343

Weight

0.8777852

1

0.7408855

Age

0.8114343

0.7408855

1

You should use the built-in CORR function instead of the previous module, because the built-in function handles the case of constant data.

Computation of the Covariance and Correlation Matrix in PROC IML (post-9.2)

In November 2010, SAS released the 9.22 (pronounced "nine point twenty-two") release of SAS/IML software. This release includes the following:

  • a built-in COV function which handles missing values in either of two ways
  • new features for the built-in CORR function including
    • handling missing values in either of two ways
    • support for different measures of correlation, including rank-based correlations
12月 082010
 
I enjoy reading about the Le Monde puzzles (and other topics!) at Christian Robert's blog. Recently he asked how to convert a number with s digits into a numerical vector where each element of the vector contains the corresponding digit (by place value). For example, if the number is 4321, then s=4 and the desired vector is {4, 3, 2, 1}.

Here's my SAS/IML solution, which converts the number to a string, and uses the SUBSTR function to grab each digit:

proc iml;
start ConvertNumberToArray(n);
   s = ceil(log10(n+1)); /** find magnitude of number **/
   c = char(n,s);  /** convert to string **/
   x = j(1,s);     /** allocate vector **/
   do i = 1 to s;
      x[i] = num(substr(c,i,1)); /** substring, as numeric **/
   end;
   return (x);
finish;

x = ConvertNumberToArray( 4321 );
print x;

x

4

3

2

1

11月 232010
 
I am thankful to be a statistical programmer.

When I wake up in the morning, I am eager to start my day. I love statistics, programming, and working at SAS, and I write my blog to share that joy. This a Golden Age for statistical programmers because theoretical ideas and computational power have converged during the last 30 years so that it is now possible to attack problems that were previously intractable.

Here, then, is my list of ten areas of computational statistics that make me thankful to be a statistical programmer:

  1. Bootstrapping and resampling techniques: I remember reading a Scientific American article on the bootstrap in 1983 when I was in high school. It seemed like magic. Now, more than 25 years later, it still seems like magic, but it is magic that I can easily program and use.
  2. Nonparametric modeling: From kernel density estimation to multivariate regression splines, nonparametric techniques not only fit a wide variety of data, but also reduce the modeling biases of parametric models. Two of my favorite tools are PROC UNIVARIATE for density estimation and PROC LOESS for scatter plot smoothing, but I have used all of the SAS/STAT nonparametric regression procedures and I physically salivate if you mention the new EFFECT statement available in SAS/STAT 9.22.
  3. Variable selection techniques: If you have a response variable and p explanatory variables, there are 2p possible subsets that you can use to model the relationship between the response and the explanatory variables. Which subset to use? Some people want the data to decide. Variable selection techniques find a subset that optimizes some fit statistic. Programming an algorithm that finds the subset quickly requires a range of skills from computer science, statistics, and optimization. Fortunately, SAS has developers that excel in these areas, and PROC GLMSELECT (which includes popular techniques such as LAR and LASSO) is rapidly becoming a favorite tool with SAS customers because of its speed and flexibility.
  4. Robust estimation: Outliers in data can influence estimates of location, scale, regression coefficients, and so on. Robust techniques are often computationally more intensive than their classical counterparts, but in many cases they produce superior estimates. I enjoy using the ROBUSTREG procedure for robust regression and the MCD subroutine in PROC IML for robust estimation of covariance matrices.
  5. Bayesian computations: Alan Gelfand writes that "Markov chain Monte Carlo (MCMC) has revolutionized the way that statistical models are fitted" [Statistics in the 21st Century, p. 341]. I couldn't agree more. SAS customers flock to every presentation and paper on PROC MCMC and other SAS procedures that support Bayesian analysis. The enthusiasm does not seem to be waning.
  6. Imputing missing values: Missing values are a way of life, but new techniques are helping statisticians to impute missing values under certain conditions. Before I started studying statistics, I had never heard of a missing value. And I certainly was not aware that you can use multiple imputation techniques to make valid statistical inferences in the presence of missing values. In SAS/STAT software, the MI and MIANALYZE procedures do exactly that.
  7. Maximum likelihood estimation: Not only is MLE useful for the practicing data analyst, but it can be fun and challenging for the statistical programmer because it requires understanding statistics, numerical analysis, and optimization. There have been hundreds of papers written just on MLE for the three-parameter Weibull distribution! For a statistical programmer, MLE means "job security," and who isn't thankful for that?
  8. Simulation: Although simulation is hardly new, languages such as SAS/IML make it easy to investigate characteristics of a statistical technique by applying it to simulated data with known properties. In fact, I'll be describing how to do this at my "data simulation" seminar at SAS Global Forum in Las Vegas, April 4–7, 2011.
  9. Statistical graphics: I am a visual learner. I graph my data to discover relationships between variables and to examine unusual observations. I graph after I build a statistical model in order to assess the model fit. I'm thankful that most SAS/STAT procedures automatically create suitable graphs by using ODS Statistical Graphics. When I need to create specialized graphs, I can use the SGPLOT procedure or program custom, dynamically linked graphs in SAS/IML Studio.
  10. High-level computational languages: In my youth, there was FORTRAN. Powerful? Yes, but programming in FORTRAN is like eating overcooked Brussels sprouts: it leaves me nauseous and with a bad taste in my mouth. For me, statistical programming with SAS/IML software is the cure for my Brussels sprouts blues. Whether you program with SAS/STAT procedures and the DATA step, or whether you dive into matrix languages such as the SAS/IML language, R, or MATLAB, statistical programming is more fun and more productive when you use a high-level language.

Did I omit the statistical area that excites you the most? What makes you thankful to be a statistical programmer?

11月 152010
 
I am pleased to announce that the fine folks at SAS Press have made Chapter 2 of my book, Statistical Programming with SAS/IML Software available as a free PDF document.

The chapter is titled "Getting Started with the SAS/IML Matrix Programming Language," and it features

  • More than 60 fully functional examples with output
  • 25 tips for writing effective programs in the SAS/IML language
  • Basic syntax for interacting with SAS/IML matrices, including creating matrices, indexing matrices, and comparing matrices
  • Basic syntax for statistical programming in the SAS/IML language, including conditional statements, loops, and logical operators

If you are new to the SAS/IML language, this chapter is for you. However, I've been told by experienced programmers that they learned a few new tricks as well. Specifically, "old-timers" might want to look at Table 2.1, since some of these features were added to the SAS/IML language in SAS 9.2.

Enjoy!

11月 052010
 
Sampling with replacement is a useful technique for simulations and for resampling from data. Over at the SAS/IML Discussion Forum, there was a recent question about how to use SAS/IML software to sample with replacement from a set of events. I have previously blogged about efficient sampling, but this topic is so important that I'm happy to revisit it.

In Chapter 12 of my book, Statistical Programming with SAS/IML Software, I present a complete module for sampling with replacement. The module handles sampling with equal or unequal probabilities. After the book is published, you can download the code for that module.

For now, consider the case of sampling with replacement with equal probability. That is, suppose you have a box with n different items and you want to draw from this box k times. After each draw you replace the item that you took out, so that the probability of drawing a particular item is fixed at 1/n.

After you have drawn one sample of size k, you might want to repeat the procedure by drawing additional samples. This enables you to deal with statistical variation and to estimate probabilities of certain events.

To be concrete, suppose you want to draw a random sample of size k=5 from a set of size n=8. One way to simulate this is to generate 5 random numbers from the uniform distribution. Each of these numbers is in the interval (0,1), so you can map these numbers to the integers 1–8 by multiplying each number by 8 and rounding up the result. Here is a SAS/IML module for random sampling with replacement with uniform probability:

proc iml;
/** Module for random sampling with replacement and uniform probability.
    A is row vector with n elements and each sample contains k elements. 
    The result is an nSamples x k matrix. Each row contains one sample. **/
start SampleReplaceUni(A, nSamples, k);
   results = j(nSamples, k);            /** allocate result matrix **/
   call randgen(results, "Uniform");    /** fill with random U(0,1) **/
   n = ncol(A);                         /** number of elements **/
   results = ceil(n*results);           /** convert to integers 1,2,...n **/
   return(shape(A[results], nSamples)); /** reshape and return from A **/
finish;

To test the module:

call randseed(1234);
/** Draw 5 times from {1,2,...,8}. Repeat 10 times. **/
s = SampleReplaceUni(1:8, 10, 5);
print s; 

s

7

6

3

3

8

2

3

5

2

5

2

6

4

3

8

7

4

8

1

3

8

8

6

8

7

3

2

6

1

5

2

1

8

1

8

4

1

3

6

8

5

5

6

8

2

8

2

1

1

1

When you have multiple samples, you can ask questions such as

  • What is the average number of ones that appeared in each sample?

    ones = (s=1)[,+];
    aveOnes = ones[:];
    print aveOnes;
    

    aveOnes

    0.8

  • In what percentage of the samples did a two appear on the second draw?

    numTwos = (s=2)[+,2];
    pctTwos = numTwos / nrow(s);
    print pctTwos[format=PERCENT7.2];
    

    pctTwos

    20.0%

Evaluating an Iterated Integral

 Numerical Analysis, Statistical Programming  Evaluating an Iterated Integral已关闭评论
10月 272010
 
The SAS/IML language provides the QUAD function for evaluating one-dimensional integrals. You can also use the QUAD function to compute a double integral as an iterated integral.

A One-Dimensional Integral

Suppose you want to evaluate the following integral:


To evaluate this integral in the SAS/IML language:

  1. Define a function module that evaluates the integrand at an arbitrary point y.
  2. Call the QUAD routine to numerically integrate the function over the domain of integration.

The module needs to be a function of a single variable (the integration variable). If the integral contains parameters, pass those in by using the GLOBAL clause to the START statement. For example, the following statements evaluate the integral for the parameter a=2:

proc iml;
start inner(y) global(a); 
   return(a+y);
finish; 

/** evaluate integral for the parameter value, a=2 **/
a = 2;
yinterval = 0 || a;
call quad(w, "inner", yinterval); 
print w;

w

6

Iterated Integrals

An iterated integral is a double integral in which you treat the variable in the outer integral as a constant while you evaluate the inner integral. If the domain of integration is not rectangular, the limits of integration for the inner integral are functions of the outer variable.

For example, consider the following iterated integral:


Notice that the inner integral is exactly the integral that you computed in the previous section, except that the parameter is called x instead of a. You already know how to evaluate the inner integral, so to evaluate the double integral:

  1. Define a function module that evaluates the integrand of the outer integral at an arbitrary point x. This module must internally evaluate the inner integral, which is a function of x.
  2. Call the QUAD routine to numerically integrate the function over the domain of integration.

The following statements implement this approach:

start outer(x) global(a); 
   a = x;
   yinterval = 0 || a;
   /** evaluate inner integral for the parameter value, a=x **/ 
   call quad(w, "inner", yinterval);
   return (w);
finish; 

xinterval= {0 3};
call quad(v, "outer", xinterval); /** outer integral **/ 
print v;

v

13.5

10月 252010
 
I was recently asked how to create a tridiagonal matrix in SAS/IML software. For example, how can you easily specify the following symmetric tridiagonal matrix without typing all of the zeros?

proc iml;
m = {1 6 0 0 0,
     6 2 7 0 0,
     0 7 3 8 0,
     0 0 8 4 9,
     0 0 0 9 5 };

This matrix can be created in two steps. The first step is to use the DIAG function to create a matrix that contains specific values on the diagonal and zeros elsewhere:

/** create diagonal matrix **/
d  = 1:5;
m = diag(d);

The second step relies on the fact that SAS/IML matrices are stored in row-major order. Therefore, the indices of the upper diagonal of m are 2, 8, 14, and 20. In general, for an n x n matrix, the following statements compute the indices of the upper diagonal and assign values to those locations:

n = ncol(m);                /** size of matrix **/
d1 = 6:9;                   /** values for upper diag **/
uppIdx = do(2, n*n, n+1);   /** indices of upper diag **/
m[uppIdx]= d1;              /** set upper diag **/

Similarly, the following statements assign the lower diagonal:

lowIdx = do(n+1, n*n, n+1); /** indices of lower diag **/
m[lowIdx] = d1;             /** set lower diag **/
print m;

Of course, you can also use the indices of the upper or lower diagonal to extract the values, rather than to assign the values.

Looping versus LOC-ing Revisited

 Sampling and Simulation, Statistical Programming  Looping versus LOC-ing Revisited已关闭评论
10月 222010
 
In a previous post, I discussed how to use the LOC function to eliminate loops over observations. Dale McLerran chimed in to remind me that another way to improve efficiency is to use subscript reduction operators.

I ended my previous post by issuing a challenge: can you write an efficient module that finds the rows in a matrix for which all elements are positive?

The submissions I received basically fall into three categories: loop over rows, loop over columns, or use the subscript reduction operators to eliminate the loop. Let's time each algorithm and plot the relative performance as the number of columns varies.

Looping over Rows

This will be the slowest method when there are millions of rows and few columns. However, as the number of columns increases, this method becomes competitive with the others. The following module is a typical implementation:

proc iml;

/** test for positive entries: loop over rows **/
start TestAllVarsPosByRows(x);
   b = j(nrow(x),1);
   do i = 1 to nrow(x);
      b[i] = all(x[i,]>0);
   end;
   return( loc(b) );
finish;

Looping over Columns

This will be fast provided that there are many more rows than columns. The following module is typical:

/** test for positive entries: loop over columns **/
start TestAllVarsPosByCols(x);
   b = (x[,1]>0); /** initialize vector of 0's and 1's **/
   do j = 2 to ncol(x);
      b = b & (x[,j]>0); /** logical AND operator **/
   end;
   return( loc(b) );
finish;

Subscript Reduction Operators

For the problem I posed, you can use the "minimum value" subscript reduction operator to eliminate all loops. Although you cannot use this trick every time that you want to find the rows that satisfy a condition, you should use subscript reduction operators when you can. Dale's implementation is as follows:

/** no loop - use subscript reduction operators (Dale McLerran) **/
start TestAllVarsPosBySubRed(x);
   return( loc(x[ ,>< ] >0) );
finish;

Comparing the Performance of the Algorithms

The performance of an algorithm depends on the size of the data that is processed. If you fix the size of the data at, say, one million numbers, the algorithms might depend on the number of rows and columns.

The following program generates one million random numbers from the standard normal distribution. These numbers are reshaped into matrices with 1, 2, 3, ..., 25 columns, and the time required for each module to run is recorded.

y = j(1e6,1);
call randseed(12345);
call randgen(y, "Normal");  /** generate random numbers **/

p = 1:25;
n = round(1e6 / p);
results = j(ncol(n),5);
do i = 1 to ncol(n);
   x = shape(y, n[i], p[i]);  /** reshape data **/
   /** time each algorithm **/
   t0 = time();  r = TestAllVarsPosByRows(x);    tRows = time()-t0;
   t0 = time();  c = TestAllVarsPosByCols(x);    tCols = time()-t0;
   t0 = time();  s = TestAllVarsPosBySubRed(x);  tSubs = time()-t0;
   
   results[i,] = n[i] || p[i] || tRows || tCols || tSubs;
end;

Interpreting the Results

The graph (which is created in SAS/IML Studio) shows the following:

  • The algorithm that loops over rows is substantially slower than the other algorithms when there are fewer than 10 columns.
  • For more than 10 columns, the row- and column-wise algorithms are comparable. Each algorithm processes the one million numbers in about 0.1 seconds.
  • The subscript reduction algorithm processes the data almost instantly for all shapes of matrices.
The lesson is that your programs will run faster if you can reduce the use of loops. (This is not unique to SAS/IML programs, but is true of all high-level languages such as MATLAB and R.) In the SAS/IML language, both the LOC function and subscript reduction operators help you to eliminate loops in your program.