Statistical Programming

October 14, 2011

I previously wrote about using SAS/IML for nonlinear optimization, and demonstrated optimization by maximizing a likelihood function. Many well-known optimization algorithms require derivative information during the optimization, including the conjugate gradient method (implemented in the NLPCG subroutine) and the Newton-Raphson method (implemented in the NLPNRA subroutine).

You should specify analytic derivatives whenever you can, because they are more accurate and more efficient than numerical derivatives. Computing a derivative is not hard, but it is important to get it right. A mistake in the coding of derivative information can be difficult to find, and can lead to wrong answers in the optimization.

That's why I suggest two techniques to make sure that derivatives are correct:

  • Use symbolic derivative software to compute complex derivatives.
  • Use finite difference computations to check that the derivatives are correct. In SAS/IML software, you can use the NLPFDD subroutine to approximate derivatives with finite-differences.
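The finite-difference idea is easy to sketch in any language. The following Python fragment (illustrative only; this is not how the NLPFDD subroutine is implemented) checks an analytic gradient against a central-difference approximation:

```python
def numgrad(f, p, h=1e-6):
    """Approximate the gradient of f at point p with central differences."""
    g = []
    for i in range(len(p)):
        up, dn = list(p), list(p)
        up[i] += h
        dn[i] -= h
        g.append((f(up) - f(dn)) / (2 * h))
    return g

# check the analytic gradient of f(x,y) = x^2 * y at the point (3, 4)
f = lambda p: p[0]**2 * p[1]
analytic = [2 * 3.0 * 4.0, 3.0**2]      # [2xy, x^2]
approx   = numgrad(f, [3.0, 4.0])
maxdiff  = max(abs(a - b) for a, b in zip(analytic, approx))
```

If `maxdiff` is tiny, the analytic gradient is almost certainly coded correctly; a large value points to a bug in the derivative code.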

Example: Derivatives for maximum likelihood estimation

Suppose that you have a real-valued function of several variables. To optimize the function, you might want to compute the gradient function, which is the vector of first derivatives. (The gradient is zero at a local optimum.)

To be concrete, let's reuse the log-likelihood example from my previous blog post. The following SAS/IML module defines the log-likelihood function for fitting a normal density curve to data. Given a data vector (x1, x2, ..., xn), the function can be used to find parameters (μ, σ) such that the data are most likely to be a sample from a normal distribution with those parameters.

proc iml;
/* the log-likelihood function for the normal distribution N(mu, sigma) */
start LogLik(param) global (x);
   mu = param[1];
   sigma2 = param[2]##2;
   n = nrow(x);
   c = x - mu;
   f = -n/2*log(sigma2) - 1/2/sigma2*sum(c##2);
   return ( f );
finish;

Tip 1: Compute symbolic derivatives

The previous log-likelihood function is simple enough that you can manually compute the derivatives of the function with respect to the parameters mu and sigma. However, you might be interested in knowing that there are online tools that you can use to compute symbolic derivatives. The one I use is Wolfram Alpha, which is powered by the same software as Mathematica®. In Mathematica, the D operator is used to compute derivatives. The same operator works in Wolfram Alpha. So, for example, to compute the derivative with respect to sigma, you can type
D[ -n/2*log(sigma^2) - 1/2/sigma^2*Sum[(x_i-mu)^2], sigma]
and Wolfram Alpha will return the derivative
-n/sigma + Sum[(x_i-mu)^2] / sigma^3

I was less successful computing the derivative with respect to mu. Perhaps the Sum[] function confused the software. However, you can take the derivative of the quantity inside the summation:
D[ -1/2/sigma^2*(x_i-mu)^2, mu]
and Wolfram Alpha will return the derivative
(x_i-mu) / sigma^2

The gradient of the log-likelihood function is therefore as follows:

start GradLogLik(param) global (x);
   mu = param[1];
   sigma = param[2];
   sigma2 = sigma##2;
   n = nrow(x);
   c = x - mu;
   dfdmu = sum(c) / sigma2;
   dfdsigma = -n/sigma + sum(c##2)/sigma##3;
   return ( dfdmu || dfdsigma );
finish;

Tip 2: Use the NLPFDD subroutine to check derivatives

Is the gradient function programmed correctly? Let's find out by calling the GradLogLik function and also the NLPFDD subroutine, which will compute a finite difference approximation to the derivatives of the LogLik function:

/* Use the NLPFDD subroutine to check analytic derivatives */
use Sashelp.Iris;
read all var {SepalWidth} into x;
close Sashelp.Iris;
p = {30 4}; /* evaluate derivative at this (mu, sigma) */
grad = GradLogLik(p);
call nlpfdd(func, FDDGrad, hessian, "LogLik", p);  
diff = max(abs(grad - FDDGrad));
print grad, FDDGrad, diff;

Success! The gradient from the GradLogLik function and the finite difference approximation (in the FDDGrad vector) are essentially the same. Of course, we might have gotten lucky; we evaluated the derivative only at a single set of parameter values. If you want to be extra cautious, the following statements use the Define2DGrid module to generate a whole grid of parameter values. The exact gradient and the finite difference derivatives are evaluated at each of these parameter values.

/* optional: evaluate derivatives on a grid */
params = Define2DGrid( 27, 36, 11, 
                       3,   6, 11 );
exactgrad = j(nrow(params),2);
fddgrad   = j(nrow(params),2);
do i = 1 to nrow(params);
   exactgrad[i,] = GradLogLik( params[i,] );
   call nlpfdd(func, grad, hessian, "LogLik", params[i,] ); 
   fddgrad[i,] = grad;
end;
diff = max(abs( exactgrad - fddgrad ));
print diff;

This was probably more work than was necessary, but it shows that the analytical derivatives and the finite difference derivatives agree to five decimal places. I am now confident that my analytical derivatives are correct.

tags: Numerical Analysis, Statistical Programming, Tips and Techniques
October 12, 2011

A popular use of SAS/IML software is to optimize functions of several variables. One statistical application of optimization is estimating parameters that optimize the maximum likelihood function. This post gives a simple example for maximum likelihood estimation (MLE): fitting a parametric density estimate to data.

Which density curve fits the data?

If you plot a histogram for the SepalWidth variable in the famous Fisher's iris data, the data look normally distributed. The normal distribution has two parameters: μ is the mean and σ is the standard deviation. For each possible choice of (μ, σ), you can ask the question, "If the true population is N(μ, σ), how likely is it that I would get the SepalWidth sample?" Maximum likelihood estimation is a technique that enables you to estimate the "most likely" parameters. This is commonly referred to as fitting a parametric density estimate to data.

Visually, you can think of overlaying a bunch of normal curves on the histogram and choosing the parameters for the best-fitting curve. For example, the following graph shows four normal curves overlaid on the histogram of the SepalWidth variable:

proc sgplot data=Sashelp.Iris;
histogram SepalWidth;
density SepalWidth / type=normal(mu=35 sigma=5.5);
density SepalWidth / type=normal(mu=32.6 sigma=4.2);
density SepalWidth / type=normal(mu=30.1 sigma=3.8);
density SepalWidth / type=normal(mu=30.5 sigma=4.3);
run;

It is clear that the first curve, N(35, 5.5), does not fit the data as well as the other three. The second curve, N(32.6, 4.2), fits a little better, but mu=32.6 seems too large. The remaining curves fit the data better, but it is hard to determine which is the best fit. If I had to guess, I'd choose the brown curve, N(30.5, 4.3), as the best fit among the four.

The method of maximum likelihood provides an algorithm for choosing the best set of parameters: choose the parameters that maximize the likelihood function for the data. Parameters that maximize the log-likelihood also maximize the likelihood function (because the log function is monotone increasing), and it turns out that the log-likelihood is easier to work with. (For the normal distribution, you can explicitly solve the likelihood equations, which provides an analytical solution against which to check the numerical solution.)
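For the normal distribution, the closed-form MLE is the sample mean and the unadjusted (divide-by-n) standard deviation. Here is a quick Python sketch with hypothetical data (not the iris data), showing that the closed-form estimates have a higher log-likelihood than nearby parameter values:

```python
import math

def normal_mle(x):
    """Closed-form MLE for N(mu, sigma): sample mean and divide-by-n std."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((xi - mu)**2 for xi in x) / n)
    return mu, sigma

def loglik(mu, sigma, x):
    """Normal log-likelihood, dropping the constant term."""
    n = len(x)
    return -n/2*math.log(sigma**2) - sum((xi - mu)**2 for xi in x)/(2*sigma**2)

x = [31.0, 35.0, 30.0, 32.0, 28.0, 33.0]   # hypothetical data values
mu, sigma = normal_mle(x)
best = loglik(mu, sigma, x)
```

Any perturbation of the parameters away from `(mu, sigma)` lowers the log-likelihood, which is what "maximum likelihood" means.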

The likelihood and log-likelihood functions

The UNIVARIATE procedure uses maximum likelihood estimation to fit parametric distributions to data. The UNIVARIATE procedure supports fitting about a dozen common distributions, but you can use SAS/IML software to fit any parametric density to data. There are three steps:

  1. Write a SAS/IML module that computes the log-likelihood function. Optionally, since many numerical optimization techniques use derivative information, you can provide a module for the gradient of the log-likelihood function. If you do not supply a gradient module, SAS/IML software automatically uses finite differences to approximate the derivatives.
  2. Set up any constraints for the parameters. For example, the scale parameters for many distributions are restricted to be positive values.
  3. Call one of the SAS/IML nonlinear optimization routines. For this example, I call the NLPNRA subroutine, which uses a Newton-Raphson algorithm to optimize the log-likelihood.

Step 1: Write a module that computes the log-likelihood

A general discussion of log-likelihood functions, including formulas for some common distributions, is available in the documentation for the GENMOD procedure. The following module computes the log-likelihood for the normal distribution:

proc iml;
/* write the log-likelihood function for Normal dist */
start LogLik(param) global (x);
   mu = param[1];
   sigma2 = param[2]##2;
   n = nrow(x);
   c = x - mu;
   f = -n/2*log(sigma2) - 1/2/sigma2*sum(c##2);
   return ( f );
finish;

Notice that the argument to this module is a single vector, param, which contains the two parameters mu and sigma. The data vector (which is constant during the optimization) is specified as the global variable x. When you use an optimization method, SAS/IML software requires that the arguments to the function be the quantities to optimize. Pass in other parameters by using the GLOBAL parameter list.

It's always a good idea to test your module. Remember those four curves that were overlaid on the histogram of the SepalWidth data? Let's evaluate the log-likelihood function at those same parameters. We expect the log-likelihood to be low for parameter values that do not fit the data well, and high for parameter values that do fit the data.

use Sashelp.Iris;
read all var {SepalWidth} into x;
close Sashelp.Iris;
/* optional: test the module */
params = {35   5.5,
          32.6 4.2,
          30.1 3.8,
          30.5 4.3};
LogLik = j(nrow(params),1);
do i = 1 to nrow(params);
   p = params[i,];
   LogLik[i] = LogLik(p);
end;
print Params[c={"Mu" "Sigma"} label=""] LogLik;

The log-likelihood values confirm our expectations. A normal density curve with parameters (35, 5.5) does not fit the data as well as the other parameters, and the curve with parameters (30.5, 4.3) fits the data best because its log-likelihood is largest.

Step 2: Set up constraints

The SAS/IML User's Guide describes how to specify constraints for nonlinear optimization. For this problem, specify a matrix with two rows and k columns, where k is the number of parameters in the problem. For this example, k=2.

  • The first row of the matrix specifies the lower bounds on the parameters. Use a missing value (.) if the parameter is not bounded from below.
  • The second row of the matrix specifies the upper bounds on the parameters. Use a missing value (.) if the parameter is not bounded from above.

The only constraint on the parameters for the normal distribution is that sigma is positive. Therefore, the following statement specifies the constraint matrix:

/*     mu-sigma constraint matrix */
con = { .   0,  /* lower bounds: -infty < mu; 0 < sigma */
        .   .}; /* upper bounds:  mu < infty; sigma < infty */

Step 3: Call an optimization routine

You can now call an optimization routine to find the MLE estimate for the data. You need to provide an initial guess to the optimization routine, and you need to tell it whether you are finding a maximum or a minimum. There are other options that you can specify, such as how much printed output you want.

p = {35 5.5};/* initial guess for solution */
opt = {1,    /* find maximum of function   */
       4};   /* print a LOT of output      */
call nlpnra(rc, result, "LogLik", p, opt, con);

The following table and graph summarize the optimization process. Starting from the initial condition, the Newton-Raphson algorithm takes six steps before it reaches a maximum of the log-likelihood function. (The exact numbers might vary in different SAS releases.) At each step of the process, the log-likelihood function increases. The process stops when the log-likelihood stops increasing, or when the gradient of the log-likelihood is zero, which indicates a maximum of the function.
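To see what a Newton-Raphson step does, here is a minimal Python sketch (illustrative only, not the NLPNRA implementation) that applies Newton's method to the normal log-likelihood; the Hessian entries are obtained by differentiating the analytic gradient of the log-likelihood:

```python
import math

def newton_normal_mle(x, mu, sigma, steps=25):
    """Newton-Raphson for f = -n/2*log(sigma^2) - sum((x-mu)^2)/(2*sigma^2),
       using the analytic gradient and Hessian."""
    n = len(x)
    for _ in range(steps):
        c  = [xi - mu for xi in x]
        s1 = sum(c)
        s2 = sum(ci * ci for ci in c)
        # gradient of the log-likelihood
        g1 = s1 / sigma**2
        g2 = -n/sigma + s2/sigma**3
        # Hessian (second partial derivatives)
        h11 = -n / sigma**2
        h12 = -2*s1 / sigma**3
        h22 = n/sigma**2 - 3*s2/sigma**4
        # Newton step: solve H * delta = -g by Cramer's rule (2x2 system)
        det = h11*h22 - h12*h12
        d1  = (-g1*h22 + g2*h12) / det
        d2  = (-h11*g2 + h12*g1) / det
        mu, sigma = mu + d1, sigma + d2
    return mu, sigma

# hypothetical data; the MLE is (mean, divide-by-n std) = (5.5, sqrt(8.25))
x = [float(i) for i in range(1, 11)]
mu_hat, sigma_hat = newton_normal_mle(x, 6.0, 3.5)
```

Starting from a reasonable initial guess, the iterates travel "uphill" and converge quadratically to the closed-form solution.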

You can summarize the process graphically by plotting the path of the optimization on a contour plot of the log-likelihood function. You can see that the optimization travels "uphill" until it reaches a maximum.

Check the optimization results

As I mentioned earlier, you can explicitly solve for the parameters that maximize the likelihood equations for the normal distribution. The optimal value of the mu parameter is the sample mean of the data. The optimal value of the sigma parameter is the unadjusted standard deviation of the sample. The following statements compute these quantities for the SepalWidth data:

OptMu = x[:];
OptSigma = sqrt( ssq(x-OptMu)/nrow(x) );

The values found by the NLPNRA subroutine agree with these exact values through seven decimal places.

tags: Data Analysis, Statistical Programming
October 5, 2011

When you misspell a word on your mobile device or in a word-processing program, the software might "autocorrect" your mistake. This can lead to some funny mistakes, such as the following:

I hate Twitter's autocorrect, although changing "extreme couponing" to "extreme coupling" did make THAT tweet more interesting. [@AnnMariaStat]

When you type "hte," how does the software know that you probably meant "the"? Even more impressively, when you type "satitsisc" into Microsoft Word, how does it create a list of the words you might have meant, a list that includes "statistics"?

For that matter, have you ever mistyped the name of a SAS procedure or an option, only to discover that your program still ran? For example, if you type dat a; run; the program will run, although the SAS log will warn you of the misspelling: "WARNING 14-169: Assuming the symbol DATA was misspelled as dat." How can SAS determine that what you typed is close enough to an actual keyword that it assumes that the keyword was misspelled?

There are various techniques for correcting misspellings, but most of them involve a notion of the distance between words. Given a string of letters, you can define a number that relates how "close" the string is to other words (for example, in a dictionary). The string that is typed is called the keyword. The words in the dictionary are called query words. If the keyword is not in the dictionary, then you can list the query words that are close to the keyword. Often the keyword is a misspelling of one of the words on this list.

Base SAS has several functions that you can use to compare how close words are to each other. In this post, I describe the SPEDIS function, which I assume is an acronym for "spelling distance." Similar SAS functions include the COMPLEV function, the COMPGED function, and the SOUNDEX function.

What words are close to "satitsisc"?

The SPEDIS function computes the "cost" of converting the keyword to a query word. Each operation—deletion, insertion, transposition, and so forth—is assigned a certain cost.

As an example, type the letters "satitsisc" into Microsoft Word and look at the top words that Word suggests for the misspelling. The words are statistics, statistic, and sadistic. The words satisfy and Saturday are not on the list of suggested words. By using the SAS DATA step or the SAS/IML language, you can examine how close each of these words is to the garbled keyword, as measured by the SPEDIS function:

proc iml;
keyword = "satitsisc";
query = {"statistics" "statistic" "sadistic" "satisfy" "Saturday"};
cost = spedis(query, keyword);
print cost;

According to the SPEDIS function, the words suggested by MS Word (the first three query words) have a lower cost than satisfy and Saturday, which were not suggested.

The cost returned by the SPEDIS function is not symmetric. For example, the cost of converting sadistic to statistic is different from the cost of converting statistic to sadistic, because inverse operations have different costs. For example, the cost of inserting a letter is different from the cost of deleting a letter.
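The exact cost table that SPEDIS uses is not reproduced here, but the asymmetry is easy to demonstrate with any weighted edit distance whose insertion and deletion costs differ. The following Python sketch uses hypothetical costs (not SPEDIS's):

```python
def edit_cost(src, dst, ins=2, dele=1, sub=3):
    """Weighted edit distance: cost of converting src into dst.
       Unequal insert/delete costs make the measure asymmetric."""
    m, n = len(src), len(dst)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1): d[i][0] = i * dele   # delete all of src
    for j in range(1, n + 1): d[0][j] = j * ins    # insert all of dst
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = 0 if src[i-1] == dst[j-1] else sub
            d[i][j] = min(d[i-1][j] + dele,     # delete a src character
                          d[i][j-1] + ins,      # insert a dst character
                          d[i-1][j-1] + same)   # match or substitute
    return d[m][n]
```

Because deleting is cheaper than inserting under these costs, converting the longer word "statistic" to the shorter "sadistic" costs less than the reverse conversion, which mirrors the asymmetry of the SPEDIS cost matrix.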

This lack of symmetry means that the SPEDIS function does not return a true distance between words, although some people refer to the cost matrix as an "asymmetric distance." You can use the SAS/IML language to compute the cost matrix, in which the ijth element is the cost of converting the ith word into the jth word:

words = query || keyword;
n = ncol(words);
/* compute (n x n) cost matrix (not symmetric) */
cost = j(n,n,0);
do i = 1 to ncol(words);
   key = words[i];
   cost[i,] = spedis(words, key);
end;
print cost[c=words r=words];

It is possible to visualize the distance between words by using the MDS procedure to compute how to position the words in the plane so that the planar distances most nearly match the (scaled) values in the cost matrix:

/* write SAS/IML cost matrix to data set */
create cost(type=DISTANCE) from cost[c=words r=words];
append from cost[r=words];
close cost;
quit;
/* find 2D configuration of points */
ods graphics on;
proc mds data=cost dimension=2 shape=square;
subject words;
run;

PROC MDS creates a graph that is similar to the one shown. Notice that the cluster in the upper right consists of the keyword and the words that MS Word suggests. These words are close to the keyword. On the other hand, the two words that were not suggested by MS Word are situated far from the keyword.

Further Reading

This post was inspired by Charlie Huang's interesting article titled Top 10 most powerful functions for PROC SQL. The SPEDIS and SOUNDEX functions are listed in item #6.

If you want to read funny examples of mistakes in text messages that were caused by the autocorrect feature of the iPhone, do an internet search for "funny iPhone autocorrect mistakes." Be aware, however, that many of the examples involve adult humor.

You can use the SPEDIS function to do fuzzy matching, as shown in the following papers:

tags: Data Analysis, Statistical Programming
September 29, 2011

I previously wrote about an intriguing math puzzle that involves 5-digit numbers with certain properties. This post presents my solution in the SAS/IML language.

It is easy to generate all 5-digit perfect squares, but the remainder of the problem involves looking at the digits of the squares. For this reason, I converted the set of all 5-digit squares into an n x 5 array. I used the PUTN function to convert the numbers to strings, and then used the SUBSTR function to extract the first, second, third, fourth, and fifth digits into columns. (I then used the NUM function to change the character array back into a numeric matrix, but this is not necessary.)
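If you want to prototype the digit-extraction step outside of SAS, string conversion plays the role of PUTN and SUBSTR. Here is a Python sketch of the generate-and-filter steps:

```python
# all 5-digit perfect squares: 100^2 = 10000 up to 316^2 = 99856
squares = [k * k for k in range(100, 317)]

# split each square into its five digits (string conversion replaces PUTN/SUBSTR)
digits = [[int(ch) for ch in str(s)] for s in squares]

# keep only the squares whose digits all come from {1,2,3,4,5}
good = [s for s, d in zip(squares, digits) if set(d) <= {1, 2, 3, 4, 5}]
```

Filtering on the digit set leaves only nine candidate squares, the same nine that the ELEMENT function isolates in the SAS/IML solution below.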

The solution enables me to highlight three new functions in SAS 9.3:
  • The ELEMENT function enables you to find which elements in one set are contained in another set. I use this function to get rid of all 5-digit perfect squares that contain any of the digits {6,7,8,9,0}.
  • The ALLCOMB function generates all combinations of n elements taken k at a time. After I reduced the problem to a set of nine 5-digit numbers, I used the ALLCOMB function to look at all triplets of the candidate numbers.
  • The TABULATE subroutine computes the frequency distribution of elements in a vector or matrix. I used this subroutine to check the frequency of the digits in the triplets of numbers.

Here is my commented solution:

proc iml;
/* generate all 5-digit squares */
f0 = ceil(sqrt(10000));
f1 = floor(sqrt(99999));
AllSquares = T(f0:f1)##2;
/* convert to (n x 5) character array */
c = putn(AllSquares,"5.0");
m = j(nrow(c), 5," ");
do i = 1 to 5;   m[,i] = substr(c,i,1);  end;
m = num(m); /* convert to (n x 5) numerical matrix */
/* The numbers are clearly 1,2,3,4,5, since there 
   are 15 digits and each appears a unique number of times.
   Get rid of any rows that don't have these digits. */
bad = (6:9) || 0;
b = element(m, bad);   /* ELEMENT: SAS/IML 9.3 function */
hasBad = b[ ,+];       /* sum across columns */
g = m[loc(hasBad=0),]; /* only nine perfect squares left! */
/* Look at all 3-way combinations */
k = allcomb(nrow(g),3);/* ALLCOMB: SAS/IML 9.3 function */
SolnNumber=0;          /* how many solutions found? */
do i = 1 to nrow(k);
   soln = g[k[i,], ]; /* 3x5 matrix */
   /* The frequencies of the digits should be 1,2,3,4,5
      and the freq of a digit cannot equal the digit */
   call tabulate(levels, freq, soln); /* TABULATE: SAS/IML 9.3 function */
   if ncol(unique(freq))=5 then do;   /* are there five freqs? */
      if ^any(freq=(1:5)) then do;    /* any freq same as digit? */
         SolnNumber = SolnNumber+1;
         print "********" SolnNumber "********",
          soln[r={"Square1" "Square2" "Square3"}], freq;
      end;
   end;
end;

At first, I didn't understand the last clue, so I printed out all seven triplets of numbers that satisfied the first five conditions. When I looked at the output, I finally made sense of the last clue, which is "If you knew which digit I have used just once, you could deduce my three squares with certainty." This told me to look closely at the FREQ vectors in the output. Of the seven solutions, one has frequency vector {3 5 4 2 1}, which means that the 1 digit appears three times, the 2 digit appears five times, and so on to the 5 digit, which appears once. In all of the other solutions, it is the 3 digit that appears once. Therefore, there is a unique solution in which the 5 digit appears only one time. The solution is as follows:

The SAS/IML language gave me some powerful tools that I used to solve the math puzzle. I'm particularly pleased that I only used two loops to solve this problem. I was able to vectorize the other computations.

Can you improve my solution? Use the comments to post (or link to) your program that solves the problem. The original post on the SAS Discussion Forum includes other ways to solve the problem in SAS.

tags: Just for Fun, Statistical Programming
September 19, 2011

The other day I encountered a SAS Knowledge Base article that shows how to count the number of missing and nonmissing values for each variable in a data set. However, the code is a complicated macro that is difficult for a beginning SAS programmer to understand. (Well, it was hard for me to understand!) The code not only counts the number of missing values for each variable, but also creates a SAS data set with the complete results. That's a nice bonus feature, but it contributes to the complexity of the macro.

This article simplifies the process and shows an alternative way to count the number of missing and nonmissing values for each variable in a data set.

The easy case: Count missing values for numeric variables

If you are only interested in the number of missing values for numeric variables, then a single call to the MEANS procedure computes the answer:

/* create sample data */
data one;
  input a $ b $ c $ d e;
datalines;
a . a 1 3
. b . 2 4
a a a . 5
. . b 3 5
a a a . 6
a a a . 7
a a a 2 8
;
run;
 
proc means data=one NMISS N; run;

In many SAS procedures, including PROC MEANS, you can omit the VAR statement in order to operate on all relevant variables. For the MEANS procedure, "relevant" means "numeric."
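Outside of SAS, the same per-variable tally is straightforward in most languages. Here is a dependency-free Python sketch with hypothetical data, where None plays the role of a SAS missing value:

```python
# rows of a small table; None marks a missing value
rows = [
    ["a",  None, "a",  1,    3],
    [None, "b",  None, 2,    4],
    ["a",  "a",  "a",  None, 5],
]
ncols = len(rows[0])

# count missing and nonmissing values in each column
nmiss = [sum(r[j] is None for r in rows) for j in range(ncols)]
n     = [len(rows) - m for m in nmiss]
```

Each entry of `nmiss` is the number of missing values for one column, and `n` holds the corresponding nonmissing counts, which is the same summary that PROC MEANS reports for numeric variables.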

Count missing values for all variables

The MEANS procedure computes statistics for numeric variables, but other SAS procedures enable you to count the number of missing values for character and numeric variables.

The FREQ procedure is a SAS workhorse that I use almost every day. To get the FREQ procedure to count missing values, use three tricks:

  1. Specify a format for the variables so that the missing values all have one value and the nonmissing values have another value. PROC FREQ groups a variable's values according to the formatted values.
  2. Specify the MISSING and MISSPRINT options on the TABLES statement.
  3. Use the _CHAR_ and _NUM_ keywords on the TABLES statement to specify that the FREQ procedure should compute statistics for all character or all numeric variables.

The following statements count the number of missing and nonmissing values for every variable: first the character variables and then the numeric ones.

/* create a format to group missing and nonmissing */
proc format;
 value $missfmt ' '='Missing' other='Not Missing';
 value  missfmt  . ='Missing' other='Not Missing';
run;
proc freq data=one; 
format _CHAR_ $missfmt.; /* apply format for the duration of this PROC */
tables _CHAR_ / missing missprint nocum nopercent;
format _NUMERIC_ missfmt.;
tables _NUMERIC_ / missing missprint nocum nopercent;
run;

Using the SAS/IML language to count missing values

In the SAS/IML Language, you can use the COUNTN and COUNTMISS functions that were introduced in SAS/IML 9.22. Strictly speaking, you need to use only one of the functions, since the result of the other is determined by knowing the number of observations in the data set. For the sake of the example, I'll be inefficient and use both of the functions.

As is the case for the PROC FREQ example, the trick is to use the _CHAR_ and _NUM_ keywords to read in and operate on the character and numeric variables in separate steps:

proc iml;
use one;
read all var _NUM_ into x[colname=nNames]; 
n = countn(x,"col");
nmiss = countmiss(x,"col");
read all var _CHAR_ into x[colname=cNames]; 
close one;
c = countn(x,"col");
cmiss = countmiss(x,"col");
/* combine results for num and char into a single table */
Names = cNames || nNames;
rNames = {"    Missing", "Not Missing"};
cnt = (cmiss // c) || (nmiss // n);
print cnt[r=rNames c=Names label=""];

This is similar to the output produced by the macro in the SAS Knowledge Base article. You can also write the cnt matrix to a data set, if necessary.

tags: Getting Started, SAS Programming, Statistical Programming
September 14, 2011

Polynomials are used often in data analysis. Low-order polynomials are used in regression to model the relationship between variables. Polynomials are used in numerical analysis for numerical integration and Taylor series approximations. It is therefore important to be able to evaluate polynomials in an efficient manner.

My favorite evaluation technique is known as Horner's scheme. This algorithm is amazingly simple to describe and to implement. It has the advantage that a polynomial of degree d is evaluated with d additions and d multiplications. With this scheme, you never have to explicitly raise the variable, x, to a power.

To explain Horner's scheme, I will use the polynomial
p(x) = 1x^3 - 2x^2 - 4x + 3.
Horner's scheme rewrites the polynomial as a set of nested linear terms:
p(x) = ((1x - 2)x - 4)x + 3.
To evaluate the polynomial, simply evaluate each linear term. The partial answer at each step of the iteration is used as the "coefficient" for the next linear evaluation.
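Horner's scheme is equally simple in other languages. Here is a direct Python translation of the nested form above (a scalar sketch; the SAS/IML module that follows is vectorized):

```python
def polyval(coef, x):
    """Evaluate a polynomial at x using Horner's scheme.
       coef holds the coefficients in descending powers,
       so the evaluation uses d additions and d multiplications."""
    y = coef[0]                 # leading coefficient
    for c in coef[1:]:
        y = y * x + c           # one linear term per remaining coefficient
    return y

# p(x) = x^3 - 2x^2 - 4x + 3, evaluated at a root (x = 3) and at x = 0
root_value     = polyval([1, -2, -4, 3], 3)
constant_value = polyval([1, -2, -4, 3], 0)
```

At each pass through the loop, the partial result plays the role of the "coefficient" for the next linear term, exactly as described above.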

In SAS/IML software, it is customary to store polynomial coefficients in a vector in order of descending powers of the polynomial. For example, the coefficients of the example polynomial are stored in the vector {1, -2, -4, 3}. The following SAS/IML module implements Horner's scheme for evaluating a polynomial at a set of points:

start Polyval(coef, v);
/*  Polyval(c,x) uses Horner's scheme to return the value of a polynomial of 
degree n evaluated at x. The input argument c is a column 
vector of length n+1 whose elements are the coefficients 
in descending powers of the polynomial to be evaluated.
The input argument x can be a vector of values. */
   c = colvec(coef); x = colvec(v); /* make column vectors */
   y = j(nrow(x), 1, c[1]); /* initialize to c[1] */
   do j = 2 to nrow(c); 
      y = y # x + c[j];
   end;
   return ( y );
finish;
/* 1*x##3 - 2*x##2 -4*x + 3 */
c = {1, -2, -4, 3};
x = do(-2, 3.5, 0.5); /* evaluate at these points */
y = Polyval(c, x);
print (x`)[label="x"] y;

The polynomial is evaluated at a set of points in the interval [-2, 3.5]. The function values show three sign changes, indicating that the polynomial has roots (zeros) in the intervals [-2, -1.5] and [0.5, 1], and at the point x=3. As shown in my post on finding the roots of univariate functions, you can use the POLYROOT function in SAS/IML to locate the roots.

The Polyval module is efficient not only because it uses Horner's scheme to evaluate the polynomial, but also because it is vectorized: it can evaluate the polynomial at a set of points with a single call. A few comments on the implementation:

  • The COLVEC module, which is part of the IMLMLIB library, converts the arguments to column vectors. That way, you can pass in row vectors, column vectors, or even matrices, and the module works correctly. Alternatively, you can have the module always output a matrix that is the same dimensions as x.
  • The output vector, y, is initialized to contain c[1], the leading coefficient of the polynomial. The y vector contains the same number of elements as the x vector.
  • The fact that the module accepts a vector of x values is almost hidden. The only hints are that y is allocated to be a vector and the elementwise multiplication operator (#) is used within the DO loop.
September 12, 2011

You can extend the capability of the SAS/IML language by writing modules. A module is a user-defined function. You can define a module by using the START and FINISH statements.

Many people, including myself, define modules at the top of the SAS/IML program in which they are used. You can do this explicitly or you can use the %INCLUDE statement.

An alternative is to store modules in a SAS catalog and to load the modules when you need to use them. As stated in the SAS/IML User's Guide:

Modules are stored in the form of their compiled code. Once modules are loaded, they do not need to be parsed again, making their use very efficient.

Storing modules

Storing a SAS/IML module requires two steps:

  1. Decide where the module will be stored.
  2. Store the module by using the STORE statement.

As an example of storing a module, consider the following module that finds the rows in a matrix for which all variables are nonmissing. This module is from my book, Statistical Programming with SAS/IML Software.

proc iml;
/* create a module to return the nonmissing rows of a matrix */
start LocNonMissingRows(x);
   r = countmiss(x, "row");     /* 9.22 function: number missing in row */
   nonMissingIdx = loc(r=0);    /* rows that do not contain missing */
   return ( nonMissingIdx );
finish;

You can store the module in any SAS library such as SASUSER or a user-defined libref. Use the RESET STORAGE statement to set the name of a storage location. (By default, the module will be stored in the WORK library, which vanishes when you exit SAS!) The following statements set up a libref and store the module in a catalog called BlogModules:

libname BlogDir "C:\Users\userid\Documents\My SAS Files\Blog";
reset storage=BlogDir.BlogModules;  /* set location for storage */
store module=LocNonMissingRows;

In PROC IML, the STORE statement stores the LocNonMissingRows module in a SAS catalog in the BlogDir library. By using the RESET STORAGE statement, you ensure that the module is stored in a permanent location.

Loading modules

Loading a module is similar. Suppose that it is a few months later and you want to use the LocNonMissingRows module in a program that you are writing. You can use the RESET STORAGE statement and the LOAD statement to load the module definition:

proc iml;
libname BlogDir "C:\Users\userid\Documents\My SAS Files\Blog";
reset storage=BlogDir.BlogModules;  /* set location for storage */
load module=LocNonMissingRows;

You can now use the module in your program to find all rows in a matrix for which no entry is a missing value:

z = {1 2, 3 ., 5 6, 7 8, . 10, 11 12};
nonMissing = LocNonMissingRows(z);
if ncol(nonMissing)>0 then
   z = z[nonMissing, ];
print z;
September 9, 2011

Do you know someone who has a birthday in mid-September? Odds are that you do: the middle of September is when most US babies are born, according to data obtained from the National Center for Health Statistics (NCHS) Web site (see Table 1-16).

There's an easy way to remember this fact:
Q: When are most babies born?
A: After Labor day!

The following graph shows the number of weekly births in 2002, normalized so that the mean number of births is represented by 100. This is called the "index of occurrence" by researchers in the Office of Vital Statistics (Martin, et al., 2006). Notice that the peak number of births occurs in weeks 37, 38, and 39, which corresponds to September 9–29, 2002.

Other years show similar behavior, as noted in Martin, et al., 2006:

Historically, the number of births peaks during the summer, and is at its lowest during the winter.... Observed birth rates... were at their highest in September and lowest in December.

Perhaps an even more interesting phenomenon is that there is a "man-made" day-of-the-week effect, due to the large number of US deliveries that are induced or delivered by a scheduled cesarean delivery. The following graph, which is taken from the book Statistical Programming with SAS/IML Software, shows the distribution of the 4,021,726 live births in the US for the year 2002. (Click to enlarge.) You can also download the data from the book's Web site.

The shapes and colors of the markers correspond to days of the week. Most US babies are born Tuesday through Friday. Compared with the other weekdays, Mondays have fewer births. The fewest babies are born on weekends and holidays.

Martin, et al., 2006, comment on these patterns and attribute the day-of-week effect to induced labor and cesarean deliveries:

Patterns in the average number of births by day of the week may be influenced by the scheduling of induction of labor and cesarean delivery. . . . The relatively narrow range for spontaneous vaginal births contrasts sharply with that of . . . cesarean deliveries that ranged from [fewest deliveries] on Sunday to [most deliveries] on Tuesday.

The birth data also indicate a "holiday effect" in which there are fewer babies born on US holidays. In particular, in a great instance of statistical irony, there are fewer babies born on Labor Day (Monday, 02SEP2002) than would be expected for a Monday in September. Can you find the blue circle for Labor Day, 2002, near the bottom of the plot? That value is about 65% less than what would be expected for a non-holiday Monday in September.

September 7, 2011

Looping is essential to statistical programming. Whether you need to iterate over parameters in an algorithm or indices in an array, a loop is often one of the first programming constructs that a beginning programmer learns.

Today is the first anniversary of this blog, which is named The DO Loop, so it seems appropriate to blog about DO loops in SAS. I'll describe looping in the SAS DATA step and compare it with looping in the SAS/IML language.

Loops in SAS

Loops are fundamental to programming because they enable you to repeat a computation for various values of parameters. Different languages use different keywords to define the iteration statement. The most well-known statement is the "for loop," which is used by C/C++, MATLAB, R, and other languages. Older languages, such as FORTRAN and SAS, call the iteration statement a "do loop," but it is exactly the same concept.

DO loops in the DATA step

The basic iterative DO statement in SAS has the syntax DO value = start TO stop. An END statement marks the end of the loop, as shown in the following example:

data A;
do i = 1 to 5;
   y = i**2; /* values are 1, 4, 9, 16, 25 */
   output;
end;
run;

By default, each iteration of a DO statement increments the value of the counter by 1, but you can use the BY option to increment the counter by other amounts, including non-integer amounts. For example, each iteration of the following DATA step increments the value i by 0.5:

data A;
do i = 1 to 5 by 0.5;
   y = i**2; /* values are 1, 2.25, 4, 6.25, ..., 25 */
   output;
end;
run;

You can also iterate "backwards" by using a negative value for the BY option: do i=5 to 1 by -0.5.
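For example, the following DATA step counts down by halves (the data set name is mine, for illustration):

```sas
data Countdown;
do i = 5 to 1 by -0.5;
   y = i**2; /* values are 25, 20.25, 16, ..., 1 */
   output;
end;
run;
```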

DO loops in SAS/IML Software

A basic iterative DO statement in the SAS/IML language has exactly the same syntax as in the DATA step, as shown in the following PROC IML statements:

proc iml;
x = 1:4; /* vector of values {1 2 3 4} */
do i = 1 to 5;
   z = sum(x##i); /* 10, 30, 100, 354, 1300 */
end;

In the body of the loop, z is the sum of powers of the elements of x. During the ith iteration, the elements of x are raised to the ith power. As mentioned in the previous section, you can also use the BY option to increment the counter by non-unit values and by negative values.

Variations on the DO loop: DO WHILE and DO UNTIL

On occasion, you might want to stop iterating if a certain condition occurs. There are two ways to do this: you can use the WHILE clause to iterate as long as a certain condition holds, or you can use the UNTIL clause to iterate until a certain condition holds.

You can use the DO statement with a WHILE clause to iterate while a condition is true. The condition is checked before each iteration, which implies that you should initialize the stopping condition prior to the loop. The following statements extend the DATA step example and iterate as long as the value of y is less than 20:

data A;
y = 0;
do i = 1 to 5 by 0.5 while(y < 20);
   y = i**2; /* values are 1, 2.25, 4, 6.25, ..., 16, 20.25 */
   output;
end;
run;

You can use the iterative DO statement with an UNTIL clause to iterate until a condition becomes true. The UNTIL condition is evaluated at the end of the loop, so you do not have to initialize the condition prior to the loop. The following statements extend the PROC IML example. The iteration stops after the value of z exceeds 200.

proc iml;
x = 1:4;
do i = 1 to 5 until(z > 200);
   z = sum(x##i); /* 10, 30, 100, 354 */
end;

In these examples, the iteration stopped because the WHILE or UNTIL condition was satisfied. If the condition is not satisfied when i=5 (the last value for the counter), the loop stops anyway. Consequently, the examples have two stopping conditions: a maximum number of iterations and the WHILE or UNTIL criterion. SAS also supports a DO WHILE and DO UNTIL syntax that does not involve using a counter variable.
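As a minimal sketch of the counter-free form (the data set and variable names are mine), the following loop doubles x until its value exceeds 100:

```sas
data Doubling;
x = 1;
do until(x > 100);  /* condition is checked at the bottom of the loop */
   x = 2*x;
   output;          /* values are 2, 4, 8, 16, 32, 64, 128 */
end;
run;
```

Because the UNTIL condition is evaluated at the bottom of the loop, the body executes at least once, even though x could in principle already satisfy the condition.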

Looping over a set of items (foreach)

Some languages support a "foreach loop" that iterates over objects in a collection. SAS doesn't support that syntax directly, but there is a variant of the DO loop in which you can iterate over values in a specified list. The syntax in the DATA step is to specify a list of values (numeric or character) after the equal sign. The following example iterates over a few terms in the Fibonacci sequence:

data A;
do v = 1, 1, 2, 3, 5, 8, 13, 21;
   y = v/lag(v);
   output;
end;
run;

The ratio of adjacent values in a Fibonacci sequence converges to the golden ratio, which is 1.61803399....

The SAS/IML language does not support this syntax, but does enable you to iterate over values that are contained in a vector (or matrix). The following statements create a vector, v, that contains the Fibonacci numbers. An ordinary DO loop is used to iterate over the elements of the vector. At the end of the loop, the vector z contains the same values as the variable Y that was computed in the DATA step.

proc iml;
v = {1, 1, 2, 3, 5, 8, 13, 21};
z = j(nrow(v),1,.); /* initialize ratios to missing values */
do i = 2 to nrow(v);
   z[i] = v[i]/v[i-1];
end;

Avoid unnecessary loops in the SAS/IML Language

I have some advice on using DO loops in the SAS/IML language: look carefully to determine whether you really need a loop. The SAS/IML language is a matrix/vector language, so statements that operate on a few long vectors run much faster than equivalent statements that involve many scalar quantities. Experienced SAS/IML programmers rarely operate on each element of a vector. Rather, they manipulate the vector as a single quantity. For example, the previous SAS/IML loop can be eliminated:

proc iml;
v = {1, 1, 2, 3, 5, 8, 13, 21};
idx = 2:nrow(v);
z = v[idx]/v[idx-1];

This computation, which computes the nonmissing ratios, is more efficient than looping over elements. For other tips and techniques that make your SAS/IML programs more efficient, see my book Statistical Programming with SAS/IML Software.

August 31, 2011

I previously showed how to generate random numbers in SAS by using the RAND function in the DATA step or by using the RANDGEN subroutine in SAS/IML software. These functions generate a stream of random numbers. (In statistics, the random numbers are usually a sample from a distribution such as the uniform or the normal distribution.) You can control the stream by setting the seed for the random numbers. The random number seed is set by using the STREAMINIT subroutine in the DATA step or the RANDSEED subroutine in the SAS/IML language.

A random number seed enables you to generate the same set of random numbers every time that you run the program. This seems like an oxymoron: if they are the same every time, then how can they be random? The resolution to this paradox is that the numbers that we call "random" should more accurately be called "pseudorandom numbers." Pseudorandom numbers are generated by an algorithm, but have statistical properties of randomness. A good algorithm generates pseudorandom numbers that are indistinguishable from truly random numbers. The random number generator used in SAS is the Mersenne-Twister random number generator (Matsumoto and Nishimura, 1998), which is known to have excellent statistical properties.

Why would you want a reproducible sequence of random numbers? Documentation and testing are two important reasons. When I write SAS code and publish it on this blog, in a book, or in SAS documentation, it is important that SAS customers be able to run the code and obtain the same results.

Random number streams in the DATA step

The STREAMINIT subroutine sets the random number seed for the RAND function in the DATA step. The seed value controls the sequence of random numbers. Call the STREAMINIT subroutine one time per DATA step, prior to the first invocation of the RAND function. This ensures that the DATA step produces the same pseudorandom numbers each time that you run it.

If you start a new DATA step, you can specify a new seed value. If you use a seed value of 0, or if you do not specify a seed value, then the system time is used to determine the seed value. In this case, the random number stream is not reproducible.

To see how random number streams work, each of the following DATA steps creates five random observations. The first and third data sets use the same random number seed (123), so the random numbers are identical. The second and fourth variables both use the system time (at the time that the RAND function is first called) to set the seed. Consequently, those random number streams are different. The last data set contains random numbers generated by a different seed (456). This stream of numbers is different from the other streams.

data A(drop=i);
  call streaminit(123);
  do i = 1 to 5;
    x123 = rand("Uniform"); output;
  end;
run;
data B(drop=i);
  call streaminit(0);
  do i = 1 to 5;
    x0 = rand("Uniform"); output;
  end;
run;
data C(drop=i);
  call streaminit(123);
  do i = 1 to 5;
    x123_2 = rand("Uniform"); output;
  end;
run;
data D(drop=i);
  /* no call to streaminit */
  do i = 1 to 5;
    x0_2 = rand("Uniform"); output;
  end;
run;
data E(drop=i);
  call streaminit(456);
  do i = 1 to 5;
    x456 = rand("Uniform"); output;
  end;
run;
data AllRand; merge A B C D E; run; /* combine side by side */
proc print data=AllRand; run;

Notice that the STREAMINIT subroutine, if called, is called exactly one time at the beginning of the DATA step. It does not make sense to call STREAMINIT multiple times within the same DATA step; subsequent calls are ignored. In the one DATA step (D) that does not call STREAMINIT, the first call to the RAND function implicitly calls STREAMINIT with 0 as an argument.

If a single program contains multiple DATA steps that generate random numbers (as above), use a different seed in each DATA step or else the streams will not be independent. This is also important if you are writing a macro function that generates random numbers. Do not hard-code a seed value. Rather, enable the user to specify the seed value in the syntax of the function.
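For example, a macro that generates uniform random values might expose the seed as a parameter, as in the following hypothetical sketch (the macro name and parameters are illustrative, not from SAS documentation):

```sas
/* hypothetical macro: the caller chooses the seed */
%macro RandUniform(n, seed=0, out=Rand);
data &out(drop=i);
   call streaminit(&seed);   /* seed=0 ==> stream is not reproducible */
   do i = 1 to &n;
      x = rand("Uniform");
      output;
   end;
run;
%mend;

%RandUniform(5, seed=123);   /* reproducible stream */
```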

Random number streams in PROC IML

So that it is easier to compare random numbers generated in SAS/IML with random numbers generated by the SAS DATA step, I display the table of SAS/IML results first:

These numbers are generated by the RANDGEN and RANDSEED subroutines in PROC IML. The numbers are generated by five procedure calls, and the random number seeds are identical to those used in the DATA step example. The first and third variables were generated from the seed value 123, the second and fourth variables were generated by using the system time, and the last variable was generated by using the seed 456. The following program generates the data sets, which are then merged side by side.

proc iml;
  call randseed(123);
  x = j(5,1); call randgen(x, "Uniform");
  create A from x[colname="x123"]; append from x; close A;
quit;
proc iml;
  call randseed(0);
  x = j(5,1); call randgen(x, "Uniform");
  create B from x[colname="x0"]; append from x; close B;
quit;
proc iml;
  call randseed(123);
  x = j(5,1); call randgen(x, "Uniform");
  create C from x[colname="x123_2"]; append from x; close C;
quit;
proc iml;
  /* no call to randseed */
  x = j(5,1); call randgen(x, "Uniform");
  create D from x[colname="x0_2"]; append from x; close D;
quit;
proc iml;
  call randseed(456);
  x = j(5,1); call randgen(x, "Uniform");
  create E from x[colname="x456"]; append from x; close E;
quit;
data AllRandgen; merge A B C D E; run;
proc print data=AllRandgen; run;

Notice that the numbers in the two tables are identical for columns 1, 3, and 5. The DATA step and PROC IML use the same algorithm to generate random numbers, so they produce the same stream of random values when given the same seed.

In summary:
  • To generate random numbers, use the RAND function (for the DATA step) and the RANDGEN call (for PROC IML).
  • To create a reproducible stream of random numbers, call the STREAMINIT (for the DATA step) or the RANDSEED (for PROC IML) subroutine prior to calling RAND or RANDGEN. Pass a positive value (called the seed) to the routines.
  • To initialize a stream of random numbers that is not reproducible, call STREAMINIT or RANDSEED with the seed value 0.
  • To ensure independent streams within a single program, use a different seed value in each DATA step or procedure.