sas programming

11月 232011
 

The SAS macro variable "inspector" is a custom task that plugs into SAS Enterprise Guide 4.3. You can use it to view the current values for all SAS macro variables that are defined within your SAS session. You can also evaluate "immediate" macro expressions in a convenient quick view window. If you develop or run SAS macro programs, this task can be a valuable debugging and learning tool.

UPDATE 28Nov2011: I've received several comments about this task, both here on the blog and in e-mail. Based on some of your very good suggestions, I've updated the task with additional features. If you downloaded the task before 28Nov2011, you might want to refresh your copy.

UPDATE 16Dec2011: The download package for this task now also includes a SAS Options viewer task, described on this separate blog post.

I've been working with SAS macros quite a bit lately, and I decided that a task like this could come in handy as a sort of "watch" window for macro values. I built the task using the custom task APIs and Microsoft .NET. I hope that you find the task useful. If you try it out and have suggestions for how to make it better, please share by adding a comment to the blog.

The custom task and an accompanying README.pdf file (containing description and detailed installation instructions) are available for download in this ZIP file. Installation is simple: copy the DLL to a designated folder on your PC, and SAS Enterprise Guide will detect the task automatically.

This add-in offers the following main features:

Always-visible window: Once you open the task from the Tools menu, you can leave it open for your entire SAS Enterprise Guide session. The window uses a "modeless" display, so you can still interact with other SAS Enterprise Guide features while the window is visible. This makes it easy to switch between SAS programs and other SAS Enterprise Guide windows and the macro variable viewer to see results.

Select active SAS server: If your SAS environment contains multiple SAS workspace connections, you can switch among the different servers to see macro values on multiple systems.

One-click refresh: Refresh the list of macro variables by clicking on the Refresh button in the toolbar.

View by scope or as a straight list: View the macro variables in their scope categories (for example, Global and Automatic) or as a straight list, sorted by variable name or current value. Click on the column headers to sort the list.

Filter results (NEW!): Type a string of characters in the "Filter results" field, and the list of macro variable results will be instantly filtered to those that contain the sequence that you type. The filtered results will match on variable names as well as values, and the search is case-insensitive. This is a convenient way to narrow the list to just a few variables that you're interested in. To clear the filter, click on the X button next to the "Filter results" field, or "blank out" the text field. (Note: this is in the 28Nov2011 update!)

Set window transparency: You can make the window appear "see-through" so that it doesn't completely obscure your other windows as you work with it.

Copy macro variables as %LET statements: Select one or more macro variables within the window, right-click and select Copy assignments. This generates a series of %LET statements -- one for each macro variable/value pair -- which you can then paste into a SAS program.

Macro expression "quick view": Have you ever wanted to test out a macro expression before using it in a longer program? This window allows you to get immediate feedback on a macro expression, whether a simple macro reference or a more complex expression with nested functions. If the expression generates a SAS warning or error, the feedback window shows that as well. Note: the expression can be any macro expression that is valid for the right-side of a macro variable assignment (%let statement).

More macro productivity in SAS Enterprise Guide is just a few clicks away! Download the task today and let me know what you think!

tags: .net, debugging, macro programming, SAS custom tasks, SAS Enterprise Guide, SAS programming
11月 182011
 

Rick posted a tip today about using abbreviations in the SAS program editor window (often referred to as the "enhanced editor"). Defining abbreviations is a great way to save keystrokes and re-use "templates" of code that you've squirreled away. (One of Rick's readers also picked up on the tip, and added it to his blog.)

If you use SAS Enterprise Guide 4.3, you can define those abbreviations even easier, and you can call them up just like any other part of the SAS syntax, using the autocomplete features of the program editor. Here's how.

1. Open a SAS program window (by opening an existing program or select File->New->Program)

2. Select Program->Add Abbreviation Macro

3. In the Add Abbreviation Macro window, type a short name for your code snippet, and then add the code you want to substitute when the abbreviation is triggered. Bonus: the code field in this window also features the SAS program editor that helps you to complete the valid syntax.

4. To use the abbreviation, simply type the abbreviation within your program. The editor will automatically "suggest" the abbreviation as a possibility as you type, and you can press the spacebar to commit the selection, just as with any other suggested keyword. If you hover the mouse cursor over the suggestion, you can see a preview of the text that will be substituted in. Note that the abbreviation entry shows as a special item (green diamond instead of blue square in this case), as another hint that this element is different than the built-in syntax. (Were the autocomplete icons inspired by our bowl of Lucky Charms one morning? I'll never tell.)

You don't have to rely on the autocomplete feature of the editor to get to your abbreviation. You can do as Rick suggests and assign a shortcut key by selecting Program->Enhanced Editor Keys. However, I'd stay away from Ctrl+I, as that currently maps to what I call the "indenter servant" -- the SAS code formatter. (But hey, you can change that key assignment too!)

tags: abbreviations, SAS Enterprise Guide, SAS program editor, SAS programming, SAS tips
11月 162011
 

I've been working with date-time data on a recent project, and I've come across a few SAS programs that have "opportunity for improvement" when it comes time to create reports.

(Or maybe I haven't, and I contrived this entire blog post so that I could reference one of my favorite quotes from the movie Animal House.)

Suppose we have a data set that contains a collection of measurements over a one-year period, each with a time stamp that shows when the measure was captured. Here is a DATA step program that creates a sample data set, generating just over 300,000 rows of random measurements (with a bit of variation built in for visual interest later):

/* Create a year's worth of timestamp data */
data measurements (keep=timestamp measure);
 now = datetime();
 /* go back one year - minus number of seconds in a year */
 /* remember, SAS datetime value is number of seconds */
 /* since 01JAN1960 */
 yearago = now-31556926;
 length timestamp 8;
 format timestamp datetime20.;
 do val = yearago to now;
   if (ranuni(0) < 0.01) then
    do;
     /* add a few more records on certain days of week */
     if (weekday(datepart(val)) in (1,2) and ranuni(0)<.5) then do;
      timestamp = val;
      measure = 5 + log(ranuni(0));
      output;
     end;
     timestamp = val;
     measure = 5 - log(ranuni(0));
     output;
   end;
 end;
run;

The "classic" method for reporting on this data is to use a line plot to show the time series, as shown by this program and plot:

ods graphics / width=800 height=300;
proc sgplot data=measurements;
 series x=timestamp y=measure ;
 xaxis minor label="Time of measure";
 yaxis min=-8 max=20 label="Measurement";
run;


But sometimes we want to report further on the characteristics of the data, to answer questions such as "how many measurements were captured for each month? for each quarter? or for each day of the week?"

Creating categories the hard way

Sometimes while creating these reports, programmers decide to prepare the data further by creating new categorical variables (such as a variable for the day of the week, or for the month and year). For example, I've seen snippets such as this:

  /* numeric day of week */
  day = weekday(datepart(timestamp)); 
  /* value such as "2011_3" for March 2011 */
  year_month = cats(year(datepart(timestamp)),'_',month(datepart(timestamp)));

But there are problems with this approach, especially with the "year_month" example. First, it requires another pass through the data and additional storage space for the output. And in the "year_month" case, the result is a character variable, which no longer retains the ability for sorting the output in chronological order. (Yes, with a more clever construction we could make sure that there is a leading zero for single-digit months, and that would help...but there is an easier way.)

Use SAS formats instead of creating new variables

The person who wrote the above code forgot about the power of SAS formats. Using SAS formats, you can "recast" your variable for analysis to create a category from values that are otherwise continuous. You don't have to modify the original data at all, and it doesn't require you to make an additional copy.

For example, to create summary statistics for the measurements classified into month "buckets", you can apply the DTMONYY5. format within a PROC MEANS step. The DTMONYYw. format takes a date-time value and converts it to a month-year appearance. SAS procedures such as MEANS, FREQ, SGPLOT, and most others will use the formatted value for classification. You can see how this program produces a report of statistics classified by month-year:

title "Measures by month";
proc means data=measurements min max mean n;
 /* tells the procedure to bucket values into monthly bins */
 format timestamp dtmonyy5.;
 label timestamp = "Month/Year";
 class timestamp;
 var measure;
run;


With a minor change to use DTYYQC4. instead, we can produce a report of stats by quarter:

title "Measures by quarter";
proc means data=measurements min max mean n;
 /* tells the procedure to bucket values into quarterly bins */
 format timestamp dtyyqc4.;
 label timestamp = "Quarter";
 class timestamp;
 var measure;
run;


This isn't limited to tabular output. Applying a format can also help to simplify plots as well. Here is an example of an SGPLOT step that creates a frequency report of the number of measures for each day of the week, thanks to the DTWKDATX8. format:

title "Number of Measures by Day of Week";
ods graphics / width=400 height=300;
proc sgplot data=measurements;
 /* tells the procedure to bucket values into day-of-week bins */
 format timestamp dtwkdatx8.;
 label timestamp = "Day";
 vbar timestamp;
run;

The next time that you find yourself creating a new variable to serve as a category, check whether there might be a SAS format that you can apply instead. Even if there isn't, it might still be more efficient to create a new format rather than calculate a new variable, especially if it's a category that you need to use for multiple data sources.

tags: formats, SAS programming, SAS tips, SGPLOT
9月 222011
 

SAS programming is taught in schools all over the world, including in high schools.  Occasionally, I receive questions via my blog such as this one:

Can somebody help me on this?
Write a short DATA _NULL_ step to determine the largest integer you can store on your computer in 3, 4, 5, 6, and 7 bytes.

This sounds like a homework assignment to me, not a typical "how do I" programming question from a person trying to get a job done.  I'm pretty sure that the professor who assigned it would not sanction a lazy web approach.

Besides that, given the brief problem statement above, how do we know what the professor wants?  Is the student supposed to write a program to calculate the largest integer that you can safely store within SAS?  Or is it okay to be clever and simply use built-in SAS functions to tell you the answer?  The first approach is an exercise in programming logic and math.  The second approach is a test of your resourcefulness -- can you find the answer without a "brute force" approach, perhaps by becoming familiar with SAS documentation?  Both approaches are valid, and will provide you with practice in skills that will help you with your ongoing SAS programming endeavors.

At the risk of rewarding lazy behavior, I will present one method to find the answer.  I'm sharing it only because it sheds light on the useful CONSTANT function, and other readers might find that helpful in their own work.

In this example, we'll use the EXACTINT constant. From the SAS documentation:

The exact integer is the largest integer k such that all integers less than or equal to k in absolute value have an exact representation in a SAS numeric variable of length nbytes. This information can be useful to know before you trim a SAS numeric variable from the default 8 bytes of storage to a lower number of bytes to save storage.

Here's the program:

data _null_;
  array len{8} _numeric_ BYTES_1-BYTES_8;
  do i=3 to 8;
   len{i} = constant('exactint',i);
  end;
  pi = constant('pi');
  put "Decimal:" (BYTES_3-BYTES_8) (=/comma32.0);
  put "Hexidecimal:" (BYTES_3-BYTES_8) (=/hex14.);
  put "Binary:" (BYTES_3-BYTES_8) (=/binary64.);
  put "and a slice of Pi:" (pi) (=/16.14);
run;

You'll have to run the program in SAS to see the results. As my high school history teacher used to say, "I'm not going to spoon-feed the answers to you." (Although I suppose that's what I just did...)

Bonus (even though you don't deserve it)

Here are a couple of resources that may be helpful in future assignments:

tags: CONSTANT function, homework, SAS functions, SAS programming
9月 212011
 

Sometimes a population of individuals is modeled as a combination of subpopulations. For example, if you want to model the heights of individuals, you might first model the heights of males and females separately. The height of the population can then be modeled as a combination of the male and female densities. If the subpopulations are not equally distributed, then the relative proportion of males and females in the population is used to determine the mixture density.

The population of heights is an example of a mixture distribution. The subpopulations (male and female) are the mixture components. This article describes how to sample from a mixture distribution.

Suppose that you want to model the length of time required to answer a random call received by a call center. From the data, you know that the calls typically fall into three categories. You decide to use a normal distribution to model the time required to answer each type of call, as follows:

  1. Easy calls are modeled by a N(4,1) distribution (mean 4 and standard deviation 1). These account for 50% of all calls.
  2. Specialized calls are modeled by a N(8,2) distribution and account for 30% of all calls.
  3. Hard calls are modeled by a N(10,3) distribution and account for the remaining 20% of calls,

If you want to simulate this call center, the model suggests that you sample according to the following scheme:

  • Randomly generate a category of call by using the TABLE distribution. (For details, see how to simulate categorical data.)
  • After the category is determined, sample the time required to answer a call from the appropriate normal distribution.

Sampling from a mixture distribution in the DATA step

In the SAS DATA step, the following program statements us the RAND function to sample 250 calls from the call center model:

/* sample from a mixture distribution */
%let N = 250;
data Calls(drop=i);
call streaminit(12345);
array prob [3] _temporary_ (0.5 0.3 0.2);
do i = 1 to &N;
   type = rand("Table", of prob[*]);
   if type=1 then      Time = rand("Normal", 4, 1); /* easy calls */
   else if type=2 then Time = rand("Normal", 8, 2); /* specialized */
   else                Time = rand("Normal", 10, 3); /* hard */             
   output;
end;
run;

You can use the UNIVARIATE procedure to visualize the resulting sample and to estimate the population density:

proc univariate data=Calls;
ods select Histogram;
histogram Time / vscale=proportion
         kernel(lower=0 c=SJPI);              
run;

In the histogram and in the kernel density estimate, you can clearly see the high density near three minutes and near 8 minutes, which correspond to the "easy" and "specialized" calls, respectively. The density of the "hard" calls is merged with the more numerous "specialized" calls, but the "hard" calls are visible as a small bump in the density estimate.

Because a normal distribution is used to model each type of call, you can simplify the previous program by eliminating the compound IF-THEN/ELSE statements. If you put the means and standard deviations in temporary arrays, you can use the statement Time = rand("Normal", mean[type], std[type]);

Sampling from a mixture distribution in the SAS/IML Language

The following SAS/IML program samples from the same mixture distribution:

/* sample from a mixture distribution */
proc iml;
N = &N;
call randseed(12345);
prob = {0.5 0.3 0.2};
Type = j(N, 1); /* allocate vector */
call randgen(Type, "Table", prob); /* fill with 1s, 2s, and 3s */
 
mean = {4 8 10};
sd   = {1 2  3};
Time = j(N,1); /* allocate vector */
do i = 1 to ncol(prob);
   idx = loc(Type=i);
   if ncol(idx) > 0 then do;
      x = j(ncol(idx),1);
      call randgen(x, "Normal", mean[i], sd[i]);
      Time[idx] = x; /* fill with times */
   end;
end;

Notice the following features of the program:

The distribution of the simulated times is similar to the distribution shown in the previous histogram. Simulations like these can be used to ensure that adequate resources are in place to handle the expected number of calls for the various categories.

tags: Sampling and Simulation, SAS Programming
9月 192011
 

The other day I encountered a SAS Knowledge Base article that shows how to count the number of missing and nonmissing values for each variable in a data set. However, the code is a complicated macro that is difficult for a beginning SAS programmer to understand. (Well, it was hard for me to understand!) The code not only counts the number of missing values for each variable, but also creates a SAS data set with the complete results. That's a nice bonus feature, but it contributes to the complexity of the macro.

This article simplifies the process and shows an alternative way to count the number of missing and nonmissing values for each variable in a data set.

The easy case: Count missing values for numeric variables

If you are only interested in the number of missing values for numeric variables, then a single call to the MEANS procedure computes the answer:

/* create sample data */
data one;
  input a $ b $ c $ d e;
cards;
a . a 1 3
. b . 2 4
a a a . 5
. . b 3 5
a a a . 6
a a a . 7
a a a 2 8
;
run;
 
proc means data=one NMISS N; run;

In many SAS procedures, including PROC MEANS, you can omit the VAR statement in order to operate on all relevant variables. For the MEANS procedure, "relevant" means "numeric."

Count missing values for all variables

The MEANS procedure computes statistics for numeric variables, but other SAS procedures enable you to count the number of missing values for character and numeric variables.

The FREQ procedure is a SAS workhorse that I use almost every day. To get the FREQ procedure to count missing values, use three tricks:

  1. Specify a format for the variables so that the missing values all have one value and the nonmissing values have another value. PROC FREQ groups a variable's values according to the formatted values.
  2. Specify the MISSING and MISSPRINT options on the TABLES statement.
  3. Use the _CHAR_ and _NUM_ keywords on the TABLES statement to specify that the FREQ procedure should compute statistics for all character or all numeric variables.

The following statements count the number of missing and nonmissing values for every variable: first the character variables and then the numeric ones.

/* create a format to group missing and nonmissing */
proc format;
 value $missfmt ' '='Missing' other='Not Missing';
 value  missfmt  . ='Missing' other='Not Missing';
run;
 
proc freq data=one; 
format _CHAR_ $missfmt.; /* apply format for the duration of this PROC */
tables _CHAR_ / missing missprint nocum nopercent;
format _NUMERIC_ missfmt.;
tables _NUMERIC_ / missing missprint nocum nopercent;
run;

Using the SAS/IML language to count missing values

In the SAS/IML Language, you can use the COUNTN and COUNTMISS functions that were introduced in SAS/IML 9.22. Strictly speaking, you need to use only one of the functions, since the result of the other is determined by knowing the number of observations in the data set. For the sake of the example, I'll be inefficient and use both of the functions.

As is the case for the PROC FREQ example, the trick is to use the _CHAR_ and _NUM_ keywords to read in and operate on the character and numeric variables in separate steps:

proc iml;
use one;
read all var _NUM_ into x[colname=nNames]; 
n = countn(x,"col");
nmiss = countmiss(x,"col");
 
read all var _CHAR_ into x[colname=cNames]; 
close one;
c = countn(x,"col");
cmiss = countmiss(x,"col");
 
/* combine results for num and char into a single table */
Names = cNames || nNames;
rNames = {"    Missing", "Not Missing"};
cnt = (cmiss // c) || (nmiss // n);
print cnt[r=rNames c=Names label=""];

This is similar to the output produced by the macro in the SAS Knowledge Base article. You can also write the cnt matrix to a data set, if necessary.

tags: Getting Started, SAS Programming, Statistical Programming
9月 072011
 

Looping is essential to statistical programming. Whether you need to iterate over parameters in an algorithm or indices in an array, a loop is often one of the first programming constructs that a beginning programmer learns.

Today is the first anniversary of this blog, which is named The DO Loop, so it seems appropriate to blog about DO loops in SAS. I'll describe looping in the SAS DATA step and compare it with looping in the SAS/IML language.

Loops in SAS

Loops are fundamental to programming because they enable you to repeat a computation for various values of parameters. Different languages use different keywords to define the iteration statement. The most well-known statement is the "for loop," which is used by C/C++, MATLAB, R, and other languages. Older languages, such as FORTRAN and SAS, call the iteration statement a "do loop," but it is exactly the same concept.

DO loops in the DATA step

The basic iterative DO statement in SAS has the syntax DO value = start TO stop. An END statement marks the end of the loop, as shown in the following example:

data A;
do i = 1 to 5;
   y = i**2; /* values are 1, 4, 9, 16, 25 */
   output;
end;
run;

By default, each iteration of a DO statement increments the value of the counter by 1, but you can use the BY option to increment the counter by other amounts, including non-integer amounts. For example, each iteration of the following DATA step increments the value i by 0.5:

data A;
do i = 1 to 5 by 0.5;
   y = i**2; /* values are 1, 2.25, 4, 6.25, ..., 25 */
   output;
end;
run;

You can also iterate "backwards" by using a negative value for the BY option: do i=5 to 1 by -0.5.

DO loops in SAS/IML Software

A basic iterative DO statement in the SAS/IML language has exactly the same syntax as in the DATA step, as shown in the following PROC IML statements:

proc iml;
x = 1:4; /* vector of values {1 2 3 4} */
do i = 1 to 5;
   z = sum(x##i); /* 10, 30, 100, 354, 1300 */
end;

In the body of the loop, z is the sum of powers of the elements of x. During the ith iteration, the elements of x are raised to the ith power. As mentioned in the previous section, you can also use the BY option to increment the counter by non-unit values and by negative values.

Variations on the DO loop: DO WHILE and DO UNTIL

On occasion, you might want to stop iterating if a certain condition occurs. There are two ways to do this: you can use the WHILE clause to iterate as long as a certain condition holds, or you can use the UNTIL clause to iterate until a certain condition holds.

You can use the DO statement with a WHILE clause to iterate while a condition is true. The condition is checked before each iteration, which implies that you should intialize the stopping condition prior to the loop. The following statements extend the DATA step example and iterate as long as the value of y is less than 20:

data A;
y = 0;
do i = 1 to 5 by 0.5 while(y < 20);
   y = i**2; /* values are 1, 2.25, 4, 6.25, ..., 16 */
   output;
end;
run;

You can use the iterative DO statement with an UNTIL clause to iterate until a condition becomes true. The UNTIL condition is evaluated at the end of the loop, so you do not have to initialize the condition prior to the loop. The following statements extend the PROC IML example. The iteration stops after the value of z exceeds 200.

proc iml;
x = 1:4;
do i = 1 to 5 until(z > 200);
   z = sum(x##i); /* 10, 30, 100, 354 */
end;

In these examples, the iteration stopped because the WHILE or UNTIL condition was satisfied. If the condition is not satisfied when i=5 (the last value for the counter), the loop stops anyway. Consequently, the examples have two stopping conditions: a maximum number of iterations and the WHILE or UNTIL criterion. SAS also supports a DO WHILE and DO UNTIL syntax that does not involve using a counter variable.

Looping over a set of items (foreach)

Some languages support a "foreach loop" that iterates over objects in a collection. SAS doesn't support that syntax directly, but there is a variant of the DO loop in which you can iterate over values in a specified list. The syntax in the DATA step is to specify a list of values (numeric or character) after the equal sign. The following example iterates over a few terms in the Fibonacci sequence:

data A;
do v = 1, 1, 2, 3, 5, 8, 13, 21;
   y = v/lag(v);
   output;
end;
run;

The ratio of adjacent values in a Fibonacci sequence converges to the golden ratio, which is 1.61803399....

The SAS/IML language does not support this syntax, but does enable you to iterate over values that are contained in a vector (or matrix). The following statements create a vector, v, that contains the Fibonacci numbers. An ordinary DO loop is used to iterate over the elements of the vector. At the end of the loop, the vector z contains the same values as the variable Y that was computed in the DATA step.

proc iml;
v = {1, 1, 2, 3, 5, 8, 13, 21};
z = j(nrow(v),1,.); /* initialize ratio to missing values */
do i = 2 to nrow(v);
   z[i] = v[i]/v[i-1];
end;

Avoid unnecessary loops in the SAS/IML Language

I have some advice on using DO loops in SAS/IML language: look carefully to determine if you really need a loop. The SAS/IML language is a matrix/vector language, so statements that operate on a few long vectors run much faster than equivalent statements that involve many scalar quantities. Experienced SAS/IML programmers rarely operate on each element of a vector. Rather, they manipulate the vector as a single quantity. For example, the previous SAS/IML loop can be eliminated:

proc iml;
v = {1, 1, 2, 3, 5, 8, 13, 21};
idx = 2:nrow(v);
z = v[idx]/v[idx-1];

This computation, which computes the nonmissing ratios, is more efficient than looping over elements. For other tips and techniques that make your SAS/IML programs more efficient, see my book Statistical Programming with SAS/IML Software.

8月 312011
 

I previously showed how to generate random numbers in SAS by using the RAND function in the DATA step or by using the RANDGEN subroutine in SAS/IML software. These functions generate a stream of random numbers. (In statistics, the random numbers are usually a sample from a distribution such as the uniform or the normal distribution.) You can control the stream by setting the seed for the random numbers. The random number seed is set by using the STREAMINIT subroutine in the DATA step or the RANDSEED subroutine in the SAS/IML language.

A random number seed enables you to generate the same set of random numbers every time that you run the program. This seems like an oxymoron: if they are the same every time, then how can they be random? The resolution to this paradox is that the numbers that we call "random" should more accurately be called "pseudorandom numbers." Pseudorandom numbers are generated by an algorithm, but have statistical properties of randomness. A good algorithm generates pseudorandom numbers that are indistinguishable from truly random numbers. The random number generator used in SAS is the Mersenne-Twister random number generator (Matsumoto and Nishimura, 1998), which is known to have excellent statistical properties.

Why would you want a reproducible sequence of random numbers? Documentation and testing are two important reasons. When I write SAS code and publish it on this blog, in a book, or in SAS documentation, it is important that SAS customers be able to run the code and obtain the same results.

Random number streams in the DATA step

The STREAMINIT subroutine is used to set the random number seed for the RAND function in the DATA step. The seed value controls the sequence of random numbers. Syntactically, you should call the STREAMINIT subroutine one time per DATA step, prior to the first invocation of the RAND function. This ensures that when you run the DATA step later, it produces the same pseudorandom numbers.

If you start a new DATA step, you can specify a new seed value. If you use a seed value of 0, or if you do not specify a seed value, then the system time is used to determine the seed value. In this case, the random number stream is not reproducible.

To see how random number streams work, each of the following DATA step creates five random observations. The first and third data sets use the same random number seed (123), so the random numbers are identical. The second and fourth variables both use the system time (at the time that the RAND function is first called) to set the seed. Consequently, those random number streams are different. The last data set contains random numbers generated by a different seed (456). This stream of numbers is different from the other streams.

data A(drop=i);
  call streaminit(123);
  do i = 1 to 5;
    x123 = rand("Uniform"); output;
  end;
run;
data B(drop=i);
  call streaminit(0);
  do i = 1 to 5;
    x0 = rand("Uniform"); output;
  end;
run;
data C(drop=i);
  call streaminit(123);
  do i = 1 to 5;
    x123_2 = rand("Uniform"); output;
  end;
run;
data D(drop=i);
  /* no call to streaminit */
  do i = 1 to 5;
    x0_2 = rand("Uniform"); output;
  end;
run;
data E(drop=i);
  call streaminit(456);
  do i = 1 to 5;
    x456 = rand("Uniform"); output;
  end;
run;
data AllRand;  merge A B C D E; run; /* concatenate */
proc print data=AllRand; run;

Notice that the STREAMINIT subroutine, if called, is called exactly one time at the beginning of the DATA step. It does not make sense to call STREAMINIT multiple times within the same DATA step; subsequent calls are ignored. In the one DATA step (D) that does not call STREAMINIT, the first call to the RAND function implicitly calls STREAMINIT with 0 as an argument.

If a single program contains multiple DATA steps that generate random numbers (as above), use a different seed in each DATA step or else the streams will not be independent. This is also important if you are writing a macro function that generates random numbers. Do not hard-code a seed value. Rather, enable the user to specify the seed value in the syntax of the function.

Random number streams in PROC IML

So that it is easier to compare random numbers generated in SAS/IML with random numbers generated by the SAS DATA step, I display the table of SAS/IML results first:

These numbers are generated by the RANDGEN and RANDSEED subroutines in PROC IML. The numbers are generated by five procedure calls, and the random number seeds are identical to those used in the DATA step example. The first and third variables were generated from the seed value 123, the second and fourth variables were generated by using the system time, and the last variable was generated by using the seed 456. The following program generates the data sets, which are then concatenated together.

proc iml;
  call randseed(123);
  x = j(5,1); call randgen(x, "Uniform");
  create A from x[colname="x123"]; append from x;
proc iml;
  call randseed(0);
  x = j(5,1); call randgen(x, "Uniform");
  create B from x[colname="x0"]; append from x;
proc iml;
  call randseed(123);
  x = J(5,1); call randgen(x, "Uniform");
  create C from x[colname="x123_2"]; append from x;
proc iml;
  /* no call to randseed */
  x = J(5,1); call randgen(x, "Uniform");
  create D from x[colname="x0_2"]; append from x;
proc iml;
  call randseed(456);
  x = J(5,1); call randgen(x, "Uniform");
  create E from x[colname="x456"]; append from x;
quit;
data AllRandgen; merge A B C D E; run;
proc print data=AllRandgen; run;

Notice that the numbers in the two tables are identical for columns 1, 3, and 5. The DATA step and PROC IML use the same algorithm to generate random numbers, so they produce the same stream of random values when given the same seed.

Summary

  • To generate random numbers, use the RAND function (for the DATA step) and the RANDGEN call (for PROC IML).
  • To create a reproducible stream of random numbers, call the STREAMINIT (for the DATA step) or the RANDSEED (for PROC IML) subroutine prior to calling RAND or RANDGEN. Pass a positive value (called the seed) to the routines.
  • To initialize a stream of random numbers that is not reproducible, call STREAMINIT or RANDSEED with the seed value 0.
  • To ensure independent streams within a single program, use a different seed value in each DATA step or procedure.
8月 292011
 

One of the highly visible changes in SAS 9.3 is the fact that the old LISTING destination is no longer the default destination for ODS output. Instead, the HTML destination is the default.

One positive consequence of this is that ODS graphics and tables are interlaced in the output. Another is that complicated tables are easier to read because SAS can use styles (colors, fonts, and cell shading) to make tables more readable. For example, the row and column headings of a table might be a different color than the cells of the table.

However, if you are like me, you occasionally like to clear the output window so that it doesn't get too crowded. In SAS 9.2, clearing the Output Window (by which I mean the LISTING destination) in the SAS Windowing Environment was easy: select Edit > Clear All or select the Output Window and click the New icon (or File > New).

I assumed that I could clear the HTML Results Viewer in SAS 9.3 the same way, but as the following image shows, the Edit > Clear All menu item is disabled!

No worries, though. You can clear the Results Viewer programmatically by submitting the following statements:

ods html close; /* close previous */
ods html; /* open new */

The Results Viewer will not immediately look empty, but the next time that you generate output you will see that the Results Viewer no longer includes the older content.

There is a SAS Usage Note that describes other ways to clear the contents of the Results Viewer. Thanks to the fine folks on the SAS-L Mailing List for asking this question and for linking to the SAS Knowledge Base article.

8月 262011
 

Exploring correlation between variables is an important part of exploratory data analysis. Before you start to model data, it is a good idea to visualize how variables related to one another. Zach Mayer, on his Modern Toolmaking blog, posted code that shows how to display and visualize correlations in R. This is such a useful task that I want to repeat it in SAS software.

Basic correlations and a scatter plot matrix

Mayer's used Fisher's iris data for his example, so I will, too. The following statement uses the CORR procedure to compute the correlation matrix and display a scatter plot matrix for the numerical variables in the data:

proc corr data=sashelp.iris plots=matrix(histogram); 
run;

Notice that by omitting the VAR statement, the CORR procedure analyzes all numerical variables in the data set.

The PLOTS=MATRIX option displays a scatter plot matrix. In SAS 9.3, ODS graphics are turned on by default. (In SAS 9.2, you need to submit ODS graphics on; prior to the PROC CORR statement.) The result is a "quick-and-dirty" visualization of pairwise relationships and the distribution of each variable (along the diagonal). This is the beauty of ODS graphics: the procedures automatically create graphs that are appropriate for an analysis.

A fancier a scatter plot matrix

Mayer also showed some fancier graphs. You can use the SGSCATTER procedure to re-plot the data, but with observations colored according to values of the Species variable, and with a kernel density estimate overlaid on the histogram.

proc sgscatter data=sashelp.iris; 
matrix SepalLength--PetalLength /group=Species diagonal=(histogram kernel);
run;

Notice how I specified the variables. Did you know that you can specify a range of consecutive variables by using a double-dash? This SAS syntax isn't widely known, but can be very useful.

More options, more details

I don't usually add smoothers to my scatter plot matrices because I think it gives the false impression that certain variables are response variables. I prefer to focus on correlation first and save modeling for later in the analysis. However, Mayer showed some loess smoothers on his plots, so I feel obligated to show SAS users how to produce similar displays.

The observant reader will have noticed that there are no scales or tick marks on the scatter plot matrices that I've shown so far. The reason is that axes and scales can distract from the primary goal of the exploratory analysis, which is to give an overview of the data and to see potential pairwise relationships. In Tufte's jargon, the scatter plot matrices that I've shown have a large data-to-ink ratio (Ch. 4, The Visual Display of Quantitative Information).

However, scatter plot matrices also can serve another purpose. During the modeling phase of data analysis they can serve as small multiples that enable you to quickly compare and contrast a sequence of related displays. In this context, scales, tick marks, and statistical smoothers are more relevant.

In general, you can use the SGPANEL procedure to display small multiples. However, I'll use the SGSCATTER procedure again to show how you can add more details to the display. Instead of using the MATRIX statement, I will use the PLOT statement to control exactly with pairs of variables I want to plot. If I think that PetalWidth variable explains the other variables, I can use the LOESS option to add a loess smoother to the scatter plots, as shown in the following example:

proc sgscatter data=sashelp.iris; 
plot (SepalLength SepalWidth PetalLength)*PetalWidth /
   group=Species loess rows=1 grid;
run;

Notice that the loess smoothers are added for each group because the GROUP= option is specified. If, instead, you want to smooth the data regardless of the group variable, you can specify the LOESS=(NOGROUP) option, which produces smoothers similar to those shown by Mayer.