Statistical Programming

September 29, 2011
 

I previously wrote about an intriguing math puzzle that involves 5-digit numbers with certain properties. This post presents my solution in the SAS/IML language.

It is easy to generate all 5-digit perfect squares, but the remainder of the problem involves looking at the digits of the squares. For this reason, I converted the set of all 5-digit perfect squares into an n x 5 array. I used the PUTN function to convert the numbers to strings, and then used the SUBSTR function to extract the first, second, third, fourth, and fifth digits into columns. (I then used the NUM function to change the character array back into a numeric matrix, but this is not necessary.)

The solution enables me to highlight three new functions in SAS 9.3:
  • The ELEMENT function enables you to find which elements in one set are contained in another set. I use this function to get rid of all 5-digit perfect squares that contain the digits {6,7,8,9,0}.
  • The ALLCOMB function generates all combinations of n elements taken k at a time. After I reduced the problem to a set of nine 5-digit numbers, I used the ALLCOMB function to look at all triplets of the candidate numbers.
  • The TABULATE subroutine computes the frequency distribution of elements in a vector or matrix. I used this subroutine to check the frequency of the digits in the triplets of numbers.
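To make the three functions concrete before they appear in the full solution, here is a minimal sketch on toy inputs. The matrices x and s and the small vector passed to TABULATE are made-up values for illustration only; they are not part of the puzzle.

```sas
proc iml;
/* ELEMENT: flag the entries of x that appear in the set s */
x = {1 6 3,
     9 2 5};
s = {6 7 8 9 0};
b = element(x, s);          /* 0/1 matrix with the same shape as x: {0 1 0, 1 0 0} */

/* ALLCOMB: all ways to choose 2 items out of 4 */
k = allcomb(4, 2);          /* 6 x 2 matrix; each row is one combination of indices */

/* TABULATE: unique values and how often each one occurs */
call tabulate(levels, freq, {1 2 2 3 3 3});   /* levels = {1 2 3}, freq = {1 2 3} */

print b, k, levels freq;
quit;
```

In the solution, the same pattern is applied to the digit matrix: ELEMENT flags unwanted digits, ALLCOMB enumerates row indices, and TABULATE counts digits within each candidate triplet.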

Here is my commented solution:

proc iml;
/* generate all 5-digit squares */
f0 = ceil(sqrt(10000));
f1 = floor(sqrt(99999));
AllSquares = T(f0:f1)##2;
 
/* convert to (n x 5) character array */
c = putn(AllSquares,"5.0");
m = j(nrow(c), 5," ");
do i = 1 to 5;   m[,i] = substr(c,i,1);  end;
m = num(m); /* convert to (n x 5) numerical matrix */
 
/* The digits must be 1,2,3,4,5, since there are
   15 digits and each appears a unique number of times.
   Get rid of any rows that contain other digits. */
bad = (6:9) || 0;
b = element(m, bad);   /* ELEMENT: SAS/IML 9.3 function */
hasBad = b[ ,+];       /* sum across columns */
g = m[loc(hasBad=0),]; /* only nine perfect squares left! */
 
/* Look at all 3-way combinations */
k = allcomb(nrow(g),3);/* ALLCOMB: SAS/IML 9.3 function */
SolnNumber=0;          /* how many solutions found? */
do i = 1 to nrow(k);
   soln = g[k[i,], ]; /* 3x5 matrix */
   /* The frequencies of the digits should be 1,2,3,4,5
      and the freq of a digit cannot equal the digit */
   call tabulate(levels, freq, soln); /* TABULATE: SAS/IML 9.3 subroutine */
   if ncol(unique(freq))=5 then do;   /* are there five freqs? */
      if ^any(freq=(1:5)) then do;    /* any freq same as digit? */
         SolnNumber = SolnNumber+1;
         print "********" SolnNumber "********",
         soln[r={"Square1" "Square2" "Square3"}], freq;
      end;
   end;
end;

At first, I didn't understand the last clue, so I printed out all seven triplets of numbers that satisfied the first five conditions. When I looked at the output, I finally made sense of the last clue, which is "If you knew which digit I have used just once, you could deduce my three squares with certainty." This told me to look closely at the FREQ vectors in the output. Of the seven solutions, one has frequency vector {3 5 4 2 1}, which means that the 1 digit appears three times, the 2 digit appears five times, and so on to the 5 digit, which appears once. In all of the other solutions, it is the 3 digit that appears once. Therefore, there is a unique solution in which the 5 digit appears only one time. The solution is as follows:

The SAS/IML language gave me some powerful tools that I used to solve the math puzzle. I'm particularly pleased that I only used two loops to solve this problem. I was able to vectorize the other computations.

Can you improve my solution? Use the comments to post (or link to) your program that solves the problem. The original post on the SAS Discussion Forum includes other ways to solve the problem in SAS.

tags: Just for Fun, Statistical Programming
September 19, 2011
 

The other day I encountered a SAS Knowledge Base article that shows how to count the number of missing and nonmissing values for each variable in a data set. However, the code is a complicated macro that is difficult for a beginning SAS programmer to understand. (Well, it was hard for me to understand!) The code not only counts the number of missing values for each variable, but also creates a SAS data set with the complete results. That's a nice bonus feature, but it contributes to the complexity of the macro.

This article simplifies the process and shows an alternative way to count the number of missing and nonmissing values for each variable in a data set.

The easy case: Count missing values for numeric variables

If you are only interested in the number of missing values for numeric variables, then a single call to the MEANS procedure computes the answer:

/* create sample data */
data one;
  input a $ b $ c $ d e;
cards;
a . a 1 3
. b . 2 4
a a a . 5
. . b 3 5
a a a . 6
a a a . 7
a a a 2 8
;
run;
 
proc means data=one NMISS N; run;

In many SAS procedures, including PROC MEANS, you can omit the VAR statement in order to operate on all relevant variables. For the MEANS procedure, "relevant" means "numeric."

Count missing values for all variables

The MEANS procedure computes statistics for numeric variables, but other SAS procedures enable you to count the number of missing values for character and numeric variables.

The FREQ procedure is a SAS workhorse that I use almost every day. To get the FREQ procedure to count missing values, use three tricks:

  1. Specify a format for the variables so that the missing values all have one value and the nonmissing values have another value. PROC FREQ groups a variable's values according to the formatted values.
  2. Specify the MISSING and MISSPRINT options on the TABLES statement.
  3. Use the _CHAR_ and _NUMERIC_ keywords on the TABLES statement to specify that the FREQ procedure should compute statistics for all character or all numeric variables.

The following statements count the number of missing and nonmissing values for every variable: first the character variables and then the numeric ones.

/* create a format to group missing and nonmissing */
proc format;
 value $missfmt ' '='Missing' other='Not Missing';
 value  missfmt  . ='Missing' other='Not Missing';
run;
 
proc freq data=one; 
format _CHAR_ $missfmt.; /* apply format for the duration of this PROC */
tables _CHAR_ / missing missprint nocum nopercent;
format _NUMERIC_ missfmt.;
tables _NUMERIC_ / missing missprint nocum nopercent;
run;

Using the SAS/IML language to count missing values

In the SAS/IML Language, you can use the COUNTN and COUNTMISS functions that were introduced in SAS/IML 9.22. Strictly speaking, you need to use only one of the functions, since the result of the other is determined by knowing the number of observations in the data set. For the sake of the example, I'll be inefficient and use both of the functions.

As in the PROC FREQ example, the trick is to read in and operate on the character and numeric variables in separate steps. In the READ statement, the relevant keywords are _CHAR_ and _NUM_:

proc iml;
use one;
read all var _NUM_ into x[colname=nNames]; 
n = countn(x,"col");
nmiss = countmiss(x,"col");
 
read all var _CHAR_ into x[colname=cNames]; 
close one;
c = countn(x,"col");
cmiss = countmiss(x,"col");
 
/* combine results for num and char into a single table */
Names = cNames || nNames;
rNames = {"    Missing", "Not Missing"};
cnt = (cmiss // c) || (nmiss // n);
print cnt[r=rNames c=Names label=""];

This is similar to the output produced by the macro in the SAS Knowledge Base article. You can also write the cnt matrix to a data set, if necessary.
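Writing the counts to a data set can be sketched as follows. To keep the example self-contained, the cnt matrix is typed in by hand with the counts for the sample data set above; the data set name MissCounts is an arbitrary choice for this illustration.

```sas
proc iml;
/* counts for variables a, b, c, d, e in the sample data (entered by hand) */
cnt = {2 2 1 3 0,      /* number of missing values    */
       5 5 6 4 7};     /* number of nonmissing values */
Names  = {"a" "b" "c" "d" "e"};
rNames = {"Missing", "Not Missing"};

/* write the matrix to a data set; the row labels become a character variable */
create MissCounts from cnt[rowname=rNames colname=Names];
append from cnt[rowname=rNames];
close MissCounts;
quit;

proc print data=MissCounts noobs; run;
```

In a real program you would skip the hand-typed matrix and issue the CREATE, APPEND, and CLOSE statements directly after computing cnt.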

tags: Getting Started, SAS Programming, Statistical Programming
September 14, 2011
 

Polynomials are used often in data analysis. Low-order polynomials are used in regression to model the relationship between variables. Polynomials are used in numerical analysis for numerical integration and Taylor series approximations. It is therefore important to be able to evaluate polynomials in an efficient manner.

My favorite evaluation technique is known as Horner's scheme. This algorithm is amazingly simple to describe and to implement. It has the advantage that a polynomial of degree d is evaluated with d additions and d multiplications. With this scheme, you never have to explicitly raise the variable, x, to a power.

To explain Horner's scheme, I will use the polynomial
p(x) = 1x^3 - 2x^2 - 4x + 3.
Horner's scheme rewrites the polynomial as a set of nested linear terms:
p(x) = ((1x - 2)x - 4)x + 3.
To evaluate the polynomial, simply evaluate each linear term. The partial answer at each step of the iteration is used as the "coefficient" for the next linear evaluation.
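As a worked illustration, here is the nested evaluation at x = 2 (a point chosen only for this example). Each line multiplies the previous partial answer by x and adds the next coefficient:

```latex
\begin{align*}
y_0 &= 1 \\
y_1 &= y_0 \cdot 2 - 2 = 0 \\
y_2 &= y_1 \cdot 2 - 4 = -4 \\
y_3 &= y_2 \cdot 2 + 3 = -5 = p(2)
\end{align*}
```

A direct evaluation gives the same value, 8 - 8 - 8 + 3 = -5, but Horner's scheme needs only three multiplications and three additions.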

In SAS/IML software, it is customary to store polynomial coefficients in a vector in order of descending powers of the polynomial. For example, the coefficients of the example polynomial are stored in the vector {1, -2, -4, 3}. The following SAS/IML module implements Horner's scheme for evaluating a polynomial at a set of points:

start Polyval(coef, v);
/*  Polyval(c,x) uses Horner's scheme to return the value of a polynomial of 
degree n evaluated at x. The input argument c is a column 
vector of length n+1 whose elements are the coefficients 
in descending powers of the polynomial to be evaluated.
The input argument x can be a vector of values. */
   c = colvec(coef); x = colvec(v); /* make column vectors */
   y = j(nrow(x), 1, c[1]); /* initialize to c[1] */
   do j = 2 to nrow(c); 
      y = y # x + c[j];
   end;
   return(y);
finish;
 
/* 1*x##3 - 2*x##2 -4*x + 3 */
c = {1, -2, -4, 3};
x = do(-2, 3.5, 0.5); /* evaluate at these points */
y = Polyval(c, x);
print (x`)[label="x"] y;

The polynomial is evaluated at a set of points in the interval [-2, 3.5]. The function values show three sign changes, indicating that the polynomial has roots (zeros) in the intervals [-2, -1.5] and [0.5, 1], and at the point x=3. As shown in my post on finding the roots of univariate functions, you can use the POLYROOT function in SAS/IML to locate the roots.
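A minimal sketch of the POLYROOT call for this polynomial follows. The polynomial factors as (x - 3)(x^2 + x - 1), so the roots are 3 and (-1 ± sqrt(5))/2, which are approximately 0.618 and -1.618 and agree with the intervals identified above. The column labels in the PRINT statement are only for readability.

```sas
proc iml;
c = {1 -2 -4 3};       /* coefficients in descending powers */
r = polyroot(c);       /* 3 x 2 matrix: column 1 = real part, column 2 = imaginary part */
print r[colname={"Real" "Imag"}];
quit;
```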

The Polyval module is efficient not only because it uses Horner's scheme to evaluate the polynomial, but also because it is vectorized: it can evaluate the polynomial at a set of points with a single call. A few comments on the implementation:

  • The COLVEC module, which is part of the IMLMLIB library, converts the arguments to column vectors. That way, you can pass in row vectors, column vectors, or even matrices, and the module works correctly. Alternatively, you can have the module always output a matrix that is the same dimensions as x.
  • The output vector, y, is initialized to contain c[1], the leading coefficient of the polynomial. The y vector contains the same number of elements as the x vector.
  • The fact that the module accepts a vector of x values is almost hidden. The only hints are that y is allocated to be a vector and the elementwise multiplication operator (#) is used within the DO loop.
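The alternative mentioned in the first comment, returning a result with the same dimensions as x, might look like the following sketch. The module name PolyvalSameDim is hypothetical; the only change from Polyval is that x is left in its original shape and y is allocated to match it.

```sas
proc iml;
start PolyvalSameDim(coef, x);
   c = colvec(coef);                 /* coefficients as a column vector */
   y = j(nrow(x), ncol(x), c[1]);    /* same dimensions as x, initialized to c[1] */
   do j = 2 to nrow(c);
      y = y # x + c[j];              /* Horner step, applied elementwise */
   end;
   return(y);
finish;

x = {-2 0,
      1 3};
y = PolyvalSameDim({1, -2, -4, 3}, x);   /* p(-2)=-5, p(0)=3, p(1)=-2, p(3)=0 */
print x y;
quit;
```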
September 12, 2011
 

You can extend the capability of the SAS/IML language by writing modules. A module is a user-defined function. You can define a module by using the START and FINISH statements.

Many people, including myself, define modules at the top of the SAS/IML program in which they are used. You can do this explicitly or you can use the %INCLUDE statement.

An alternative is to store modules in a SAS catalog and to load the modules when you need to use them. As stated in the SAS/IML User's Guide:

Modules are stored in the form of their compiled code. Once modules are loaded, they do not need to be parsed again, making their use very efficient.

Storing modules

Storing a SAS/IML module requires two steps:

  1. Decide where the module will be stored.
  2. Store the module by using the STORE statement.

As an example of storing a module, consider the following module that finds the rows in a matrix for which all variables are nonmissing. This module is from my book, Statistical Programming with SAS/IML Software.

proc iml;
/* create a module to return the nonmissing rows of a matrix */
start LocNonMissingRows(x);
   r = countmiss(x, "row");     /* 9.22 function: number missing in row */
   nonMissingIdx = loc(r=0);    /* rows that do not contain missing */
   return ( nonMissingIdx );
finish;

You can store the module in any SAS library such as SASUSER or a user-defined libref. Use the RESET STORAGE statement to set the name of a storage location. (By default, the module will be stored in the WORK library, which vanishes when you exit SAS!) The following statements set up a libref and store the module in a catalog called BlogModules:

libname BlogDir "C:\Users\userid\Documents\My SAS Files\Blog";
reset storage=BlogDir.BlogModules;  /* set location for storage */
store module=LocNonMissingRows;

In PROC IML, the STORE statement stores the LocNonMissingRows module in a SAS catalog in the BlogDir library. By using the RESET STORAGE statement, you ensure that the module is stored to a permanent location.

Loading modules

Similarly, suppose it is a few months later and you want to use the LocNonMissingRows module in a program that you are writing. You can use the RESET STORAGE statement and a LOAD statement to load the module definition:

proc iml;
libname BlogDir "C:\Users\userid\Documents\My SAS Files\Blog";
reset storage=BlogDir.BlogModules;  /* set location for storage */
load module=LocNonMissingRows;

You can now use the module in your program to find all rows in a matrix for which no entry is a missing value:

z = {1 2, 3 ., 5 6, 7 8, . 10, 11 12};
nonMissing = LocNonMissingRows(z);
if ncol(nonMissing)>0 then
   z = z[nonMissing, ];
print z;
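The store-and-load round trip can be sketched in one self-contained program. This sketch makes two assumptions for the sake of a quick test: it uses a throwaway module named AddOne and a catalog named DemoModules in the WORK library, so nothing here persists after the SAS session ends. The SHOW STORAGE statement lists what the catalog contains.

```sas
proc iml;
/* hypothetical module, defined only for this demonstration */
start AddOne(x);
   return( x + 1 );
finish;

reset storage=work.DemoModules;   /* temporary catalog for the demo */
store module=AddOne;
show storage;                     /* list the modules in the catalog */
quit;

/* a new PROC IML session: the module is not defined until it is loaded */
proc iml;
reset storage=work.DemoModules;
load module=AddOne;
y = AddOne(1:3);                  /* {2 3 4} */
print y;
quit;
```

For real work, point RESET STORAGE at a permanent libref, as in the BlogDir example.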
September 9, 2011
 

Do you know someone who has a birthday in mid-September? Odds are that you do: the middle of September is when most US babies are born, according to data obtained from the National Center for Health Statistics (NCHS) Web site (see Table 1-16).

There's an easy way to remember this fact:
Q: When are most babies born?
A: After Labor Day!

The following graph shows the number of weekly births in 2002, normalized so that the mean number of births is represented by 100. This is called the "index of occurrence" by researchers in the Office of Vital Statistics (Martin, et al., 2006). Notice that the peak number of births occurs in weeks 37, 38, and 39, which corresponds to September 9–29, 2002.

Other years show similar behavior, as noted in Martin, et al., 2006:

Historically, the number of births peaks during the summer, and is at its lowest during the winter.... Observed birth rates... were at their highest in September and lowest in December.

Perhaps an even more interesting phenomenon is that there is a "man-made" day-of-the-week effect, due to the large number of US deliveries that are induced or delivered by a scheduled cesarean delivery. The following graph, which is taken from the book Statistical Programming with SAS/IML Software, shows the distribution of the 4,021,726 live births in the US for the year 2002. You can also download the data from the book's Web site.

The shapes and colors of the markers correspond to days of the week. Most US babies are born Tuesday through Friday. Compared with the other weekdays, Mondays have fewer births. The fewest babies are born on weekends and holidays.

Martin, et al., 2006, comment on these patterns and attribute the day-of-week effect to induced labor and cesarean deliveries:

Patterns in the average number of births by day of the week may be influenced by the scheduling of induction of labor and cesarean delivery. . . . The relatively narrow range for spontaneous vaginal births contrasts sharply with that of . . . cesarean deliveries that ranged from [fewest deliveries] on Sunday to [most deliveries] on Tuesday.

The birth data also indicate a "holiday effect" in which there are fewer babies born on US holidays. In particular, in a great instance of statistical irony, there are fewer babies born on Labor Day (Monday, 02SEP2002) than would be expected for a Monday in September. Can you find the blue circle for Labor Day, 2002, near the bottom of the plot? That value is about 65% less than what would be expected for a non-holiday Monday in September.

September 7, 2011
 

Looping is essential to statistical programming. Whether you need to iterate over parameters in an algorithm or indices in an array, a loop is often one of the first programming constructs that a beginning programmer learns.

Today is the first anniversary of this blog, which is named The DO Loop, so it seems appropriate to blog about DO loops in SAS. I'll describe looping in the SAS DATA step and compare it with looping in the SAS/IML language.

Loops in SAS

Loops are fundamental to programming because they enable you to repeat a computation for various values of parameters. Different languages use different keywords to define the iteration statement. The most well-known statement is the "for loop," which is used by C/C++, MATLAB, R, and other languages. Older languages, such as FORTRAN and SAS, call the iteration statement a "do loop," but it is exactly the same concept.

DO loops in the DATA step

The basic iterative DO statement in SAS has the syntax DO value = start TO stop. An END statement marks the end of the loop, as shown in the following example:

data A;
do i = 1 to 5;
   y = i**2; /* values are 1, 4, 9, 16, 25 */
   output;
end;
run;

By default, each iteration of a DO statement increments the value of the counter by 1, but you can use the BY option to increment the counter by other amounts, including non-integer amounts. For example, each iteration of the following DATA step increments the value i by 0.5:

data A;
do i = 1 to 5 by 0.5;
   y = i**2; /* values are 1, 2.25, 4, 6.25, ..., 25 */
   output;
end;
run;

You can also iterate "backwards" by using a negative value for the BY option: do i=5 to 1 by -0.5.
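As a short sketch of the backwards iteration (the data set name B is arbitrary), the following DATA step counts down from 5 to 1 in steps of 0.5:

```sas
data B;
do i = 5 to 1 by -0.5;
   y = i**2; /* values are 25, 20.25, 16, ..., 1 */
   output;
end;
run;
```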

DO loops in SAS/IML Software

A basic iterative DO statement in the SAS/IML language has exactly the same syntax as in the DATA step, as shown in the following PROC IML statements:

proc iml;
x = 1:4; /* vector of values {1 2 3 4} */
do i = 1 to 5;
   z = sum(x##i); /* 10, 30, 100, 354, 1300 */
end;

In the body of the loop, z is the sum of powers of the elements of x. During the ith iteration, the elements of x are raised to the ith power. As mentioned in the previous section, you can also use the BY option to increment the counter by non-unit values and by negative values.
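For example, a minimal sketch of a PROC IML loop that uses a negative BY value to count down from 3 to 1:

```sas
proc iml;
x = 1:4;
do i = 3 to 1 by -1;
   z = sum(x##i);   /* 100, 30, 10 */
   print i z;
end;
quit;
```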

Variations on the DO loop: DO WHILE and DO UNTIL

On occasion, you might want to stop iterating if a certain condition occurs. There are two ways to do this: you can use the WHILE clause to iterate as long as a certain condition holds, or you can use the UNTIL clause to iterate until a certain condition holds.

You can use the DO statement with a WHILE clause to iterate while a condition is true. The condition is checked before each iteration, which implies that you should initialize the stopping condition prior to the loop. The following statements extend the DATA step example and iterate as long as the value of y is less than 20. Because the condition is checked before y is updated, the last value written is 20.25, which is the first value that is not less than 20:

data A;
y = 0;
do i = 1 to 5 by 0.5 while(y < 20);
   y = i**2; /* values are 1, 2.25, 4, 6.25, ..., 16, 20.25 */
   output;
end;
run;

You can use the iterative DO statement with an UNTIL clause to iterate until a condition becomes true. The UNTIL condition is evaluated at the end of the loop, so you do not have to initialize the condition prior to the loop. The following statements extend the PROC IML example. The iteration stops after the value of z exceeds 200.

proc iml;
x = 1:4;
do i = 1 to 5 until(z > 200);
   z = sum(x##i); /* 10, 30, 100, 354 */
end;

In these examples, the iteration stopped because of the WHILE or UNTIL clause: the WHILE condition became false or the UNTIL condition became true. If that has not happened by the time i=5 (the last value for the counter), the loop stops anyway. Consequently, the examples have two stopping conditions: a maximum number of iterations and the WHILE or UNTIL criterion. SAS also supports a DO WHILE and DO UNTIL syntax that does not involve using a counter variable.
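The counter-free forms can be sketched as follows. The doubling computation and the names C, y, and z are made up for this illustration. Note that the stopping variable must be updated inside the loop, or the loop never ends.

```sas
/* DO WHILE in the DATA step: condition is checked at the top of the loop */
data C;
y = 1;
do while (y < 100);
   y = 2*y;   /* values are 2, 4, 8, ..., 128 */
   output;
end;
run;

/* DO UNTIL in PROC IML: condition is checked at the bottom of the loop */
proc iml;
z = 1;
do until (z > 100);
   z = 2*z;
end;
print z;      /* 128 */
quit;
```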

Looping over a set of items (foreach)

Some languages support a "foreach loop" that iterates over objects in a collection. SAS doesn't support that syntax directly, but there is a variant of the DO loop in which you can iterate over values in a specified list. The syntax in the DATA step is to specify a list of values (numeric or character) after the equal sign. The following example iterates over a few terms in the Fibonacci sequence:

data A;
do v = 1, 1, 2, 3, 5, 8, 13, 21;
   y = v/lag(v);
   output;
end;
run;

The ratio of adjacent values in a Fibonacci sequence converges to the golden ratio, which is 1.61803399....

The SAS/IML language does not support this syntax, but does enable you to iterate over values that are contained in a vector (or matrix). The following statements create a vector, v, that contains the Fibonacci numbers. An ordinary DO loop is used to iterate over the elements of the vector. At the end of the loop, the vector z contains the same values as the variable Y that was computed in the DATA step.

proc iml;
v = {1, 1, 2, 3, 5, 8, 13, 21};
z = j(nrow(v),1,.); /* initialize ratio to missing values */
do i = 2 to nrow(v);
   z[i] = v[i]/v[i-1];
end;

Avoid unnecessary loops in the SAS/IML Language

I have some advice on using DO loops in the SAS/IML language: look carefully to determine if you really need a loop. The SAS/IML language is a matrix/vector language, so statements that operate on a few long vectors run much faster than equivalent statements that involve many scalar quantities. Experienced SAS/IML programmers rarely operate on each element of a vector. Rather, they manipulate the vector as a single quantity. For example, the previous SAS/IML loop can be eliminated:

proc iml;
v = {1, 1, 2, 3, 5, 8, 13, 21};
idx = 2:nrow(v);
z = v[idx]/v[idx-1];

This computation, which computes the nonmissing ratios, is more efficient than looping over elements. For other tips and techniques that make your SAS/IML programs more efficient, see my book Statistical Programming with SAS/IML Software.

August 31, 2011
 

I previously showed how to generate random numbers in SAS by using the RAND function in the DATA step or by using the RANDGEN subroutine in SAS/IML software. These functions generate a stream of random numbers. (In statistics, the random numbers are usually a sample from a distribution such as the uniform or the normal distribution.) You can control the stream by setting the seed for the random numbers. The random number seed is set by using the STREAMINIT subroutine in the DATA step or the RANDSEED subroutine in the SAS/IML language.

A random number seed enables you to generate the same set of random numbers every time that you run the program. This seems like an oxymoron: if they are the same every time, then how can they be random? The resolution to this paradox is that the numbers that we call "random" should more accurately be called "pseudorandom numbers." Pseudorandom numbers are generated by an algorithm, but have statistical properties of randomness. A good algorithm generates pseudorandom numbers that are indistinguishable from truly random numbers. The random number generator used in SAS is the Mersenne-Twister random number generator (Matsumoto and Nishimura, 1998), which is known to have excellent statistical properties.

Why would you want a reproducible sequence of random numbers? Documentation and testing are two important reasons. When I write SAS code and publish it on this blog, in a book, or in SAS documentation, it is important that SAS customers be able to run the code and obtain the same results.

Random number streams in the DATA step

The STREAMINIT subroutine is used to set the random number seed for the RAND function in the DATA step. The seed value controls the sequence of random numbers. Syntactically, you should call the STREAMINIT subroutine one time per DATA step, prior to the first invocation of the RAND function. This ensures that when you run the DATA step later, it produces the same pseudorandom numbers.

If you start a new DATA step, you can specify a new seed value. If you use a seed value of 0, or if you do not specify a seed value, then the system time is used to determine the seed value. In this case, the random number stream is not reproducible.

To see how random number streams work, each of the following DATA steps creates five random observations. The first and third data sets use the same random number seed (123), so the random numbers are identical. The second and fourth data sets both use the system time (at the time that the RAND function is first called) to set the seed. Consequently, those random number streams are different. The last data set contains random numbers generated by a different seed (456). This stream of numbers is different from the other streams.

data A(drop=i);
  call streaminit(123);
  do i = 1 to 5;
    x123 = rand("Uniform"); output;
  end;
run;
data B(drop=i);
  call streaminit(0);
  do i = 1 to 5;
    x0 = rand("Uniform"); output;
  end;
run;
data C(drop=i);
  call streaminit(123);
  do i = 1 to 5;
    x123_2 = rand("Uniform"); output;
  end;
run;
data D(drop=i);
  /* no call to streaminit */
  do i = 1 to 5;
    x0_2 = rand("Uniform"); output;
  end;
run;
data E(drop=i);
  call streaminit(456);
  do i = 1 to 5;
    x456 = rand("Uniform"); output;
  end;
run;
data AllRand;  merge A B C D E; run; /* concatenate */
proc print data=AllRand; run;

Notice that the STREAMINIT subroutine, if called, is called exactly one time at the beginning of the DATA step. It does not make sense to call STREAMINIT multiple times within the same DATA step; subsequent calls are ignored. In the one DATA step (D) that does not call STREAMINIT, the first call to the RAND function implicitly calls STREAMINIT with 0 as an argument.

If a single program contains multiple DATA steps that generate random numbers (as above), use a different seed in each DATA step or else the streams will not be independent. This is also important if you are writing a macro function that generates random numbers. Do not hard-code a seed value. Rather, enable the user to specify the seed value in the syntax of the function.
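A minimal sketch of that advice follows. The macro name RandUnif and its parameters are hypothetical; the point is only that the seed is an argument supplied by the caller and is not fixed inside the macro.

```sas
/* hypothetical macro: the caller chooses the data set name, sample size, and seed */
%macro RandUnif(dsname, n, seed);
data &dsname(drop=i);
   call streaminit(&seed);   /* seed comes from the caller, not hard-coded */
   do i = 1 to &n;
      u = rand("Uniform"); output;
   end;
run;
%mend RandUnif;

%RandUnif(Sample1, 5, 123);
%RandUnif(Sample2, 5, 456);  /* different seed, so a different stream */
```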

Random number streams in PROC IML

So that it is easier to compare random numbers generated in SAS/IML with random numbers generated by the SAS DATA step, I display the table of SAS/IML results first:

These numbers are generated by the RANDGEN and RANDSEED subroutines in PROC IML. The numbers are generated by five procedure calls, and the random number seeds are identical to those used in the DATA step example. The first and third variables were generated from the seed value 123, the second and fourth variables were generated by using the system time, and the last variable was generated by using the seed 456. The following program generates the data sets, which are then concatenated together.

proc iml;
  call randseed(123);
  x = j(5,1); call randgen(x, "Uniform");
  create A from x[colname="x123"]; append from x;
proc iml;
  call randseed(0);
  x = j(5,1); call randgen(x, "Uniform");
  create B from x[colname="x0"]; append from x;
proc iml;
  call randseed(123);
  x = J(5,1); call randgen(x, "Uniform");
  create C from x[colname="x123_2"]; append from x;
proc iml;
  /* no call to randseed */
  x = J(5,1); call randgen(x, "Uniform");
  create D from x[colname="x0_2"]; append from x;
proc iml;
  call randseed(456);
  x = J(5,1); call randgen(x, "Uniform");
  create E from x[colname="x456"]; append from x;
quit;
data AllRandgen; merge A B C D E; run;
proc print data=AllRandgen; run;

Notice that the numbers in the two tables are identical for columns 1, 3, and 5. The DATA step and PROC IML use the same algorithm to generate random numbers, so they produce the same stream of random values when given the same seed.

Summary

  • To generate random numbers, use the RAND function (for the DATA step) and the RANDGEN call (for PROC IML).
  • To create a reproducible stream of random numbers, call the STREAMINIT (for the DATA step) or the RANDSEED (for PROC IML) subroutine prior to calling RAND or RANDGEN. Pass a positive value (called the seed) to the routines.
  • To initialize a stream of random numbers that is not reproducible, call STREAMINIT or RANDSEED with the seed value 0.
  • To ensure independent streams within a single program, use a different seed value in each DATA step or procedure.
August 24, 2011
 

In SAS, you can generate a set of random numbers that are uniformly distributed by using the RAND function in the DATA step or by using the RANDGEN subroutine in SAS/IML software. (These same functions also generate samples from other common distributions such as binomial and normal.) The syntax is simple. The following DATA step creates a data set that contains 10 random uniform numbers in the range [0,1]:

data A;
call streaminit(123); /* set random number seed */
do i = 1 to 10;
   u = rand("Uniform"); /* u ~ U[0,1] */
   output;
end;
run;

The syntax for the SAS/IML program is similar, except that you can avoid the loop (vectorize) by allocating a vector and then filling all elements by using a single call to RANDGEN:

proc iml;
call randseed(123); /* set random number seed */
u = j(10,1); /* allocate */
call randgen(u, "Uniform"); /* u ~ U[0,1] */

Random uniform on the interval [a,b]

If you want to generate random numbers on the interval [a,b], you have to scale and translate the values that are produced by RAND and RANDGEN. The width of the interval [a,b] is b-a, so the following statements produce random values in the interval [a,b]:

   a = -1; b = 1;  /* example values */
   x = a + (b-a)*u;

The same expression is valid in the DATA step and the SAS/IML language.

Random integers

You can use the FLOOR or CEIL functions to transform (continuous) random values into (discrete) random integers. In statistical programming, it is common to generate random integers in the range 1 to Max for some value of Max, because you can use those values as observation numbers (indices) to sample from data. The following statements generate random integers in the range 1 to 10:

   Max = 10; 
   k = ceil( Max*u );  /* uniform integer in 1..Max */

If you want random integers between 0 and Max or between Min and Max, the FLOOR function is more convenient:

   Min = 5;
   n = floor( (1+Max)*u ); /* uniform integer in 0..Max */
   m = min + floor( (1+Max-Min)*u ); /* uniform integer in Min..Max */

Again, the same expressions are valid in the DATA step and the SAS/IML language.
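As a sketch of the SAS/IML side, the same expressions can be applied to a whole vector of uniform values at once. The sample size of 1000 and the choice to print only the first five rows are arbitrary decisions for this illustration.

```sas
proc iml;
call randseed(123);
u = j(1000, 1);                      /* allocate a 1000 x 1 vector */
call randgen(u, "Uniform");          /* fill it with uniform values */

a = -1; b = 1;
Min = 5; Max = 10;
x = a + (b-a)*u;                     /* U[a,b] */
k = ceil( Max*u );                   /* uniform integer in 1..Max */
n = floor( (1+Max)*u );              /* uniform integer in 0..Max */
m = Min + floor( (1+Max-Min)*u );    /* uniform integer in Min..Max */

result = u || x || k || n || m;
print (result[1:5, ])[colname={"u" "x" "k" "n" "m"}];
quit;
```

No loop is required because each expression operates on every element of u.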

Putting it all together

The following DATA step demonstrates all the ideas in this blog post and generates 1,000 random uniform values with various properties:

%let NObs = 1000;
data Unif(keep=u x k n m);
call streaminit(123);
a = -1; b = 1;
Min = 5; Max = 10;
do i = 1 to &NObs;
   u = rand("Uniform");    /* U[0,1] */
   x = a + (b-a)*u;        /* U[a,b] */
   k = ceil( Max*u );      /* uniform integer in 1..Max */
   n = floor( (1+Max)*u ); /* uniform integer in 0..Max */
   m = min + floor((1+Max-Min)*u); /* uniform integer in Min..Max */
   output;
end;
run;

You can use the UNIVARIATE and FREQ procedures to see how closely the statistics of the sample match the characteristics of the populations. The PROC UNIVARIATE output is not shown, but the histograms show that the sample data for the u and x variables are, indeed, uniformly distributed on [0,1] and [-1,1], respectively. The PROC FREQ output shows that the k, n, and m variables contain integers that are uniformly distributed within their respective ranges. Only the output for the m variable is shown.

proc univariate data=Unif;
var u x;
histogram u/ endpoints=0 to 1 by 0.05;
histogram x/ endpoints=-1 to 1 by 0.1;
run;
 
proc freq data=Unif;
tables k n m / chisq;
run;
August 22, 2011
 

When I write SAS/IML programs, I usually do my development in the SAS/IML Studio environment. Why? There are many reasons, but the one that I will discuss today is the fact that the application is multithreaded and supports multiple programming workspaces.

The advantages of multiple programming workspaces

I am always multitasking, which means that usually I am writing and running multiple programs that are in various stages of completion. For example, today I am working on three projects: a program for my blog (ContaminatedNormal), a customer's program that I am trying to analyze for SAS Technical Support (TechSupportG01832), and a program that I am writing for an upcoming presentation (7_MVNormal). When I am working on my upcoming presentation, my SAS/IML Studio environment might look like the following:

In the image, I drew a red ellipse on the workspace bar, which shows that there are three workspaces. The following image is a close-up of the workspace bar:

The names of the programs appear on the buttons on the workspace bar, and I can switch to a different program by clicking a button.

Each workspace is a separate SAS/IML programming environment. It has its own program window and an independent WORK library for storing temporary data sets. Each program runs independently: a variable named x in one program has no relationship to a variable of the same name in a different program. Consequently, I can develop many SAS/IML programs simultaneously.

The advantages of a multithreaded application

Even more useful is that I can run multiple programs simultaneously! Because the SAS/IML Studio application is multithreaded, I can run arbitrarily many SAS/IML programs concurrently (although there is only one SAS process). This is great if I am running a simulation that requires several minutes to complete. I can run the simulation in one programming workspace and then switch to a different workspace to work on a different program.

For example, suppose that I am hard at work when I get a phone call. The caller asks if I can come up to his office in ten minutes. No problem. I immediately open a prewritten program called Alarm and run it in SAS/IML Studio. I then switch to the Technical Support issue and start working on that program. My workspace bar looks like the following:

The blue triangle on the Alarm button lets me know that the program is running in a separate thread, even while I investigate the Technical Support issue. (If I have access to several SAS servers, I can even tell SAS/IML Studio to run programs on a remote server.) After ten minutes, the Alarm program finishes running and makes a beeping sound that reminds me of my appointment.

Notice that a multithreaded interface does not imply that the computations are multithreaded. However, SAS also provides multithreaded computations in Base SAS procedures (Ray, 2003) and in analytical procedures (Cohen, 2002).

A SAS/IML program that sounds an alarm

A blog post without any code is unsatisfying, so here is my SAS/IML program that beeps after a specified length of time. You can run this program in SAS/IML Studio, then switch to a new workspace and continue your work:

/* Demonstrate that SAS/IML Studio is multithreaded */
/* no "proc iml" statement; run in SAS/IML Studio */
delay = 10; /* specify time in minutes (use 1/12 to demo program) */
 
AlarmTime = time()+60*delay;
print "Alarm will ring at " AlarmTime[format=time9.];
printnow; /* force printing to happen immediately */
 
/* pause for this many seconds */
call sleep(60*delay, 1); 
 
/* Time to sound the alarm */
tone = repeat(do(400,800,50), 3);
call sound(tone, 0.05);
August 17, 2011
 

I've previously described ways to solve systems of linear equations, A*b = c. While discussing the relative merits of solving a system for a particular right-hand side versus solving for the inverse matrix, I asserted that it is faster to solve a particular system than it is to compute an inverse and use the inverse to solve the system.

In particular, in terms of the SAS/IML SOLVE function and INV function, I asserted that it is faster to run b = solve(A,c); than Ainv = inv(A); b = Ainv * c;.
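
To make the two approaches concrete, here is a small sketch on a 2 x 2 system. The matrix and right-hand side are made up for illustration and are chosen so that the exact solution is b = (1, 2). The names b1 and b2 are likewise only for this example.

```sas
proc iml;
A = {2 1,
     1 3};             /* illustrative coefficient matrix */
c = {4, 7};            /* right-hand side; exact solution is (1, 2) */

b1 = solve(A, c);      /* solve A*b = c directly */

Ainv = inv(A);         /* form the inverse explicitly... */
b2 = Ainv * c;         /* ...and then multiply by the right-hand side */

print b1 b2;           /* both columns should equal (1, 2) up to rounding */
quit;
```

Both computations return the same answer; the difference is in how much work each one does along the way.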

A colleague asked a good question: "How much faster?"

The following SAS/IML program answers this question. The program generates a random n x n matrix for a range of values of n. For each matrix, the program times how long each method takes to solve the linear system. The program is adapted from Chapter 15 of Wicklin (2010), Statistical Programming with SAS/IML Software.

proc iml;
/* this program computes the solution to a linear system in two 
   different ways and compares the performance of each method */
size = T(do(100, 1000, 100)); /* 100, 200, ... 1000 */
results = j(nrow(size), 2);   /* allocate room for results */
do i = 1 to nrow(size);
   n = size[i];
   A = rannor(j(n,n,1));           /* n x n matrix */
   b = rannor(j(n,1,1));           /* n x 1 vector */
 
   /* use the INV function to solve a linear system Ax=b */
   t0 = time();                    /* begin timing INV */
      AInv = inv(A);               /* compute inverse of A */
      x = AInv*b;                  /* solve linear equation A*x=b */
   results[i,1] = time() - t0;     /* end timing */
 
   /* use the SOLVE function to solve the same linear system */
   t0 = time();                    /* begin timing SOLVE */
      x = solve(A,b);              /* solve linear equation directly */
   results[i,2] = time() - t0;     /* end timing */
end;

You can save the times to a SAS data set and use the SGPLOT procedure to compare the performance of the two methods:

/* write results to a data set */
y = size || results;
create Performance from y[colname={"Size" "INV" "SOLVE"}];
append from y;
quit;
 
title "Time Required to Solve a Linear System: INV versus SOLVE";
title2 "From Wicklin (2010), Statistical Programming with SAS/IML Software";
proc sgplot data=Performance;
  series x=Size y=INV / curvelabel;
  series x=Size y=SOLVE /curvelabel;
  yaxis grid label="Time (s)";
  xaxis grid label="Size of Matrix";
run;

For a 1000 x 1000 matrix, it takes about 0.8 seconds to solve the system by computing the matrix inverse, whereas it takes 0.2 seconds to solve the system directly. That ratio is fairly typical: it takes about four times as long to solve a linear system with INV as with SOLVE.

This result is not unique to SAS/IML software. Although the algorithm that is used to compute each solution will affect the shape of the curves, solving a linear system directly should be faster than first computing a matrix inverse and then multiplying.
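
Speed is only part of the comparison; it is also worth confirming that the two methods agree. The following sketch is not from the original program. It solves one random 500 x 500 system both ways and prints the largest difference between the two solutions, along with the largest residual of each. The seed, the matrix size, and the variable names are arbitrary choices for this example.

```sas
proc iml;
call randseed(1);                  /* illustrative seed */
n = 500;
A = j(n, n);  b = j(n, 1);         /* allocate matrix and right-hand side */
call randgen(A, "Normal");         /* fill with standard normal variates */
call randgen(b, "Normal");

xInv   = inv(A) * b;               /* solution by way of the inverse */
xSolve = solve(A, b);              /* solution by solving directly */

maxDiff  = max(abs(xInv - xSolve));    /* largest elementwise difference */
resInv   = max(abs(A*xInv - b));       /* largest residual for INV */
resSolve = max(abs(A*xSolve - b));     /* largest residual for SOLVE */
print maxDiff resInv resSolve;
quit;
```

For a well-conditioned matrix, all three numbers should be tiny relative to the size of the solution, which indicates that the faster method does not sacrifice accuracy.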