sas programming

8月 262011
 

Exploring correlation between variables is an important part of exploratory data analysis. Before you start to model data, it is a good idea to visualize how variables related to one another. Zach Mayer, on his Modern Toolmaking blog, posted code that shows how to display and visualize correlations in R. This is such a useful task that I want to repeat it in SAS software.

Basic correlations and a scatter plot matrix

Mayer's used Fisher's iris data for his example, so I will, too. The following statement uses the CORR procedure to compute the correlation matrix and display a scatter plot matrix for the numerical variables in the data:

proc corr data=sashelp.iris plots=matrix(histogram); 
run;

Notice that by omitting the VAR statement, the CORR procedure analyzes all numerical variables in the data set.

The PLOTS=MATRIX option displays a scatter plot matrix. In SAS 9.3, ODS graphics are turned on by default. (In SAS 9.2, you need to submit ODS graphics on; prior to the PROC CORR statement.) The result is a "quick-and-dirty" visualization of pairwise relationships and the distribution of each variable (along the diagonal). This is the beauty of ODS graphics: the procedures automatically create graphs that are appropriate for an analysis.

A fancier a scatter plot matrix

Mayer also showed some fancier graphs. You can use the SGSCATTER procedure to re-plot the data, but with observations colored according to values of the Species variable, and with a kernel density estimate overlaid on the histogram.

proc sgscatter data=sashelp.iris; 
matrix SepalLength--PetalLength /group=Species diagonal=(histogram kernel);
run;

Notice how I specified the variables. Did you know that you can specify a range of consecutive variables by using a double-dash? This SAS syntax isn't widely known, but can be very useful.

More options, more details

I don't usually add smoothers to my scatter plot matrices because I think it gives the false impression that certain variables are response variables. I prefer to focus on correlation first and save modeling for later in the analysis. However, Mayer showed some loess smoothers on his plots, so I feel obligated to show SAS users how to produce similar displays.

The observant reader will have noticed that there are no scales or tick marks on the scatter plot matrices that I've shown so far. The reason is that axes and scales can distract from the primary goal of the exploratory analysis, which is to give an overview of the data and to see potential pairwise relationships. In Tufte's jargon, the scatter plot matrices that I've shown have a large data-to-ink ratio (Ch. 4, The Visual Display of Quantitative Information).

However, scatter plot matrices also can serve another purpose. During the modeling phase of data analysis they can serve as small multiples that enable you to quickly compare and contrast a sequence of related displays. In this context, scales, tick marks, and statistical smoothers are more relevant.

In general, you can use the SGPANEL procedure to display small multiples. However, I'll use the SGSCATTER procedure again to show how you can add more details to the display. Instead of using the MATRIX statement, I will use the PLOT statement to control exactly with pairs of variables I want to plot. If I think that PetalWidth variable explains the other variables, I can use the LOESS option to add a loess smoother to the scatter plots, as shown in the following example:

proc sgscatter data=sashelp.iris; 
plot (SepalLength SepalWidth PetalLength)*PetalWidth /
   group=Species loess rows=1 grid;
run;

Notice that the loess smoothers are added for each group because the GROUP= option is specified. If, instead, you want to smooth the data regardless of the group variable, you can specify the LOESS=(NOGROUP) option, which produces smoothers similar to those shown by Mayer.

8月 242011
 

In SAS, you can generate a set of random numbers that are uniformly distributed by using the RAND function in the DATA step or by using the RANDGEN subroutine in SAS/IML software. (These same functions also generate samples from other common distributions such as binomial and normal.) The syntax is simple. The following DATA step creates a data set that contains 10 random uniform numbers in the range [0,1]:

data A;
call streaminit(123); /* set random number seed */
do i = 1 to 10;
   u = rand("Uniform"); /* u ~ U[0,1] */
   output;
end;
run;

The syntax for the SAS/IML program is similar, except that you can avoid the loop (vectorize) by allocating a vector and then filling all elements by using a single call to RANDGEN:

proc iml;
call ranseed(123); /* set random number seed */
u = j(10,1); /* allocate */
call randgen(u, "Uniform"); /* u ~ U[0,1] */

Random uniform on the interval [a,b]

If you want generate random numbers on the interval [a,b], you have to scale and translate the values that are produced by RAND and RANDGEN. The width of the interval [a,b] is b-a, so the following statements produce random values in the interval [a,b]:

   a = -1; b = 1;  /* example values */
   x = a + (b-a)*u;

The same expression is valid in the DATA step and the SAS/IML language.

Random integers

You can use the FLOOR or CEIL functions to transform (continuous) random values into (discrete) random integers. In statistical programming, it is common to generate random integers in the range 1 to Max for some value of Max, because you can use those values as observation numbers (indices) to sample from data. The following statements generate random integers in the range 1 to 10:

   Max = 10; 
   k = ceil( Max*u );  /* uniform integer in 1..Max */

If you want random integers between 0 and Max or between Min and Max, the FLOOR function is more convenient:

   Min = 5;
   n = floor( (1+Max)*u ); /* uniform integer in 0..Max */
   m = min + floor( (1+Max-Min)*u ); /* uniform integer in Min..Max */

Again, the same expressions are valid in the DATA step and the SAS/IML language.

Putting it all together

The following DATA step demonstrates all the ideas in this blog post and generates 1,000 random uniform values with various properties:

%let NObs = 1000;
data Unif(keep=u x k n m);
call streaminit(123);
a = -1; b = 1;
Min = 5; Max = 10;
do i = 1 to &NObs;
   u = rand("Uniform");    /* U[0,1] */
   x = a + (b-a)*u;        /* U[a,b] */
   k = ceil( Max*u );      /* uniform integer in 1..Max */
   n = floor( (1+Max)*u ); /* uniform integer in 0..Max */
   m = min + floor((1+Max-Min)*u); /* uniform integer in Min..Max */
   output;
end;
run;

You can use the UNIVARIATE and FREQ procedures to see how closely the statistics of the sample match the characteristics of the populations. The PROC UNIVARIATE output is not shown, but the histograms show that the sample data for the u and x variables are, indeed, uniformly distributed on [0,1] and [-1,1], respectively. The PROC FREQ output shows that the k, n, and m variables contain integers that are uniformly distributed within their respective ranges. Only the output for the m variable is shown.

proc univariate data=Unif;
var u x;
histogram u/ endpoints=0 to 1 by 0.05;
histogram x/ endpoints=-1 to 1 by 0.1;
run;
 
proc freq data=Unif;
tables k n m / chisq;
run;
8月 032011
 

One of the great innovations with SAS 9.3 is the focus on ODS statistical graphics.

"Wait a minute," you're thinking, "weren't ODS graphics added in SAS 9.2?"

Yes, that's true.  But with SAS 9.3 there is even more capability: more analytical SAS procedures support the graphs, and there are more features in the specialized graph procedures such as SGPLOT.

But even more remarkable is the change in SAS display manager to bring ODS graphics into focus.  Now HTML output (not text listing) is the default output from SAS display manager.  And a new ODS style, named "HTMLBlue", has been designed to show the graphs in a crisp and colorful way.

Example of HTMLBlue style from PROC REG

If you are using SAS Enterprise Guide 4.3, you don't have HTMLBlue in your list of available styles.  But you can add it with a few simple steps.

1. Use SAS to create a copy of the HTMLBlue stylesheet.

filename out temp;
/* change the location of the css fileref as necessary */
filename css "c:\temp\htmlblue.css";
ods html file=out style=htmlblue stylesheet=css;
 proc print data=sashelp.class;
 run;
ods html close;

You can run this code in a SAS window or in SAS Enterprise Guide, making sure that you are able to write to the location that the CSS fileref is mapped to.  After running the program, the htmlblue.css file will appear in the location specified in the program.

2. Copy the htmlblue.css file to the Styles subfolder in your SAS Enterprise Guide application directory.

Depending on how you installed SAS Enterprise Guide, that location might be:

C:\Program Files\SAS\EnterpriseGuide\4.3\Styles
or
C:\Program Files\SASHome\x86\SASEnterpriseGuide\4.3\Styles

3. In SAS Enterprise Guide, change the preferences for your results output to use the new HTMLBlue style.

The style will automatically appear in the list of available styles.  For example, to change the default output for SAS Report, select Tools->Options->Results->SAS Report, and select "htmlblue" from the list of styles.

Results, Style selector in SAS Enterprise Guide

And even though the style is named "HTMLBlue", you don't have to use it for just HTML.  It's just as attractive when using SAS Report.  Of course you can also use it for RTF and PDF output as well, although you might prefer a cleaner style such as Journal for printed output.  The SAS OnlineDoc for statistical graphics contains a comparison of the various styles, with considerable focus on the details for the graphics.

 

7月 122011
 

It seems like such a simple problem: how can you reliably compute the age of someone or something? Susan lamented the subtle issues using the YRDIF function exactly 1.0356164384 years ago.

Sure, you could write your own function for calculating such things, as I suggested 0.1753424658 years ago.

Or you could ask your colleagues in the discussion forums, as I noticed some people doing just about 3.7890410959 years ago.

Or, now that SAS 9.3 is available, you can take advantage of the new AGE basis that the YRDIF function supports. For example:

data _null_;
  x = yrdif('29JUN2010'd,today(),'AGE');
  y = yrdif('09MAY2011'd,today(),'AGE');
  z = yrdif('27SEP2007'd,today(),'AGE');
  put x= y= z=;
run;

Yields:

x=1.0356164384 y=0.1753424658 z=3.7890410959

This new feature and many more were added in direct response to customer feedback from Susan and others. And that's a practice that never gets old.

7月 122011
 
It seems like such a simple problem: how can you reliably compute the age of someone or something? Susan lamented the subtle issues using the YRDIF function exactly 1.0356164384 years ago.

Sure, you could write your own function for calculating such things, as I suggested 0.1753424658 years ago.

Or you could ask your colleagues in the discussion forums, as I noticed some people doing just about 3.7890410959 years ago.

Or, now that SAS 9.3 is available, you can take advantage of the new AGE basis that the YRDIF function supports. For example:


data _null_;
  x = yrdif('29JUN2010'd,today(),'AGE');
  y = yrdif('09MAY2011'd,today(),'AGE');
  z = yrdif('27SEP2007'd,today(),'AGE');
  put x= y= z=;
run;
 
Yields:

x=1.0356164384 y=0.1753424658 z=3.7890410959
 
This new feature and many more were added in direct response to customer feedback from Susan and others. And that's a practice that never gets old.
7月 112011
 

SAS Enterprise Guide sets values for several useful SAS macro variables when it connects to a SAS session, including one macro variable, &_CLIENTPROJECTPATH, that contains the name and path of the current SAS Enterprise Guide project file.

(To learn about this and other macro variables that SAS Enterprise Guide assigns, look in Help->SAS Enterprise Guide help, and search the Index for "macro variables".)

If your SAS program knows the path of your project file, then you can use that information to make your project more "portable", and reference data resources using paths that are relative to the project location, instead of having to hard-code absolute paths into your program.

For example, I've been working with some data from DonorsChoose.org, and I've captured that work in a project file. After saving the project, I can easily get the name of the project file by checking the value of &_CLIENTPROJECTPATH:

15    %put &_clientprojectpath;
'C:\DataSources\DonorsChoose\DonorsChoose.egp'

If you can get the full path of the project file, then you can build that information into a SAS program so that as you move the project file and its "assets" (such as data), you don't need to make changes to the project to accommodate a new project folder "home". Here is the bit of code that takes the project file location and distills a path from it, and then uses it to assign a path-based SAS library.
/* a bit of code to detect the local project path                          */
/* NOTE: &_CLIENTPROJECTPATH is set only if you saved the project already! */
%let localProjectPath =
%sysfunc(substr(%sysfunc(dequote(&_CLIENTPROJECTPATH)), 1,
%sysfunc(findc(%sysfunc(dequote(&_CLIENTPROJECTPATH)), %str(/), -255 )))); 

libname DC "&localProjectPath";

Here's the output:
21  libname DC "&localProjectPath";
NOTE: Libref DC was successfully assigned as follows:
Engine:        V9
Physical Name: C:\DataSources\DonorsChoose

This allows me to keep the project together with the data that it consumes and creates. And I can easily share the project and data with a colleague, who can store it in a different folder on his/her machine, and it should work just the same without any changes.

7月 112011
 
SAS Enterprise Guide sets values for several useful SAS macro variables when it connects to a SAS session, including one macro variable, &_CLIENTPROJECTPATH, that contains the name and path of the current SAS Enterprise Guide project file.

(To learn about this and other macro variables that SAS Enterprise Guide assigns, look in Help->SAS Enterprise Guide help, and search the Index for "macro variables".)

If your SAS program knows the path of your project file, then you can use that information to make your project more "portable", and reference data resources using paths that are relative to the project location, instead of having to hard-code absolute paths into your program.

For example, I've been working with some data from DonorsChoose.org, and I've captured that work in a project file. After saving the project, I can easily get the name of the project file by checking the value of &_CLIENTPROJECTPATH:


15    %put &_clientprojectpath;
'C:\DataSources\DonorsChoose\DonorsChoose.egp'
 
If you can get the full path of the project file, then you can build that information into a SAS program so that as you move the project file and its "assets" (such as data), you don't need to make changes to the project to accommodate a new project folder "home". Here is the bit of code that takes the project file location and distills a path from it, and then uses it to assign a path-based SAS library.

/* a bit of code to detect the local project path                          */
/* NOTE: &_CLIENTPROJECTPATH is set only if you saved the project already! */
%let localProjectPath =
   %sysfunc(substr(%sysfunc(dequote(&_CLIENTPROJECTPATH)), 1,
   %sysfunc(findc(%sysfunc(dequote(&_CLIENTPROJECTPATH)), %str(/\), -255 ))));

libname DC "&localProjectPath";
 
Here's the output:

21  libname DC "&localProjectPath";
NOTE: Libref DC was successfully assigned as follows:
      Engine:        V9
      Physical Name: C:\DataSources\DonorsChoose
 
This allows me to keep the project together with the data that it consumes and creates. And I can easily share the project and data with a colleague, who can store it in a different folder on his/her machine, and it should work just the same without any changes.
7月 082011
 
The recent issue of InformationWeek features a Q&A session with Ken Thompson, one of the creators of the Unix operating system. (He collaborated with Dennis Ritchie, of C language fame. Since much of SAS is written in C, I daresay there are a few copies of K&R around here.)

One of the questions hit on the topic of code reviews:

Interviewer: Was there any concept of looking at each other's code or doing code reviews?

Thompson: [Shaking head] We were all pretty good coders.

The implication: only bad coders need code reviews.

This is an outdated attitude towards software development. Unfortunately, I've encountered this attitude among some software professionals that I've worked with (outside of SAS as well as within). Why do some programmers (especially those from the "good ol' days") place little value on code reviews?

Perhaps peer code reviews were less important 30 years ago. The constraints around running code were much tighter. There were fewer layers in the stack where bugs could lurk. Memory was scarce, instruction sets were limited. Computer applications were like my 1973 AMC Gremlin: fewer moving parts, so it was easier to see which parts weren't working correctly and predict when they would fail. (Most often, it was the carburetor -- another outmoded concept.)

Also, the discipline of programming was newer back then. Perhaps there was less variability among the approaches you could use to solve a problem with code. If you had the skills to code the solution, maybe your solution would not differ much from that of any other similarly skilled colleague.

Well, those days are gone. Now more than ever, we need code reviews to be a formal part of how we work as SAS professionals (and as software professionals in general). Mark Chu-Carroll spells out many of the reasons in his blog post, Things Everyone Should Do: Code Review. As Mark says, it's not about the problems that a reviewer will catch; it's more about the programmer who prepares to have his/her work reviewed. Knowing that you have to explain this to another person, that someone else will be looking at your work...that can help to keep you on your toes.

In addition to reasons that Mark cites, those of us who work with SAS have a special motivation: we need to pass on the knowledge of a rich legacy code base to our younger workers. And because the SAS language expands with each new release, old-time SAS programmers need exposure to people who can apply new techniques to old problems. Code reviews are one way to help make that happen: improve quality, ensure continuity, and keep it fresh. Who can argue against that?

6月 262011
 
About a year ago (wow, has it been that long?), I posted an example program that lets you report on the contents of a SAS information map. Using my example, you can see the data items, filters, and folder structure within a given information map.

Last week a reader posted a comment, wondering how to also learn the names of the SAS tables that contribute to the SAS information map's definition. I know that you can't get to that information using the information map dictionary tables, but I did learn that it is possible using PROC INFOMAPS and the LIST statement.

Here is an example that relates to the information map that I reported on earlier:


proc infomaps ;
 update infomap "Cars" mappath="/Shared Data/Maps";
 list datasources;
run;
 
The output goes to the SAS log:

Total datasources: 1
Data source: Sample Data.CARS_1993
ID: CARS_1993
Name: CARS_1993
Description:
 
Wow, I guess that was a pretty simple information map, with only one data source. I'm sure that you can come up with more complex examples.