sas programming

10月 292012
 

I was looking at some SAS documentation when I saw a Base SAS function that I never knew existed. The NWKDOM function returns the date for the nth occurrence of a weekday for the specified month and year. I surely could have used that function last spring when I blogged about how to compute the onset and end of daylight saving time (DST) in SAS!

As any handyman will tell you, having the right tool makes a job so much easier. Instead of nine complicated lines of SAS DATA step code, my new algorithm consists of two calls to the NWKDOM function! In the US, DST ends on the first Sunday in November. Here's all you need to do to figure out that date in 2012: nwkdom(1, 1, 11, 2012). The first argument specifies the week (first=1). The second argument specifies the day of the week as a number 1–7 (Sunday=1). The third argument specifies the month as a number 1–12 (November=11). The last argument is the year.

Therefore, the following DATA step code computes the beginning and end of DST for several years:

/* compute the beginning and end of daylight saving time for several years */
data DST;
format dst_beg dst_end DATE5.;
do year=2012 to 2022;
   dst_beg = nwkdom(2, 1, 3, year);/*DST begins 2nd Sun in March */
   dst_end = nwkdom(1, 1,11, year);/*DST ends 1st Sun in Nov */
   output;
end;
run;
 
proc print noobs; 
var year dst_beg dst_end;
run;

With the right tools, every task is easier.

tags: SAS Programming
10月 222012
 

What's in a name? As Shakespeare's Juliet said, "That which we call a rose / By any other name would smell as sweet." A similar statement holds true for the names of colors in SAS: "Rose" by any other name would look as red!

SAS enables you to specify a color in several ways. This article shows how to use color names and RGB colors to specify colors in the SAS statistical graphics procedures, such as PROC SGPLOT. It gathers together color resources in a single location, so you might want to bookmark this article for future reference.

Pre-defined SAS color names

SAS knows dozens of names of colors. There's "red" and "brown," of course, but also designer colors such as "aquamarine," "chartreuse," "khaki," and "teal." The following call to the SGPLOT procedure creates a scatter plot for which the marker color is set to "rose":

proc sgplot data=sashelp.class;
scatter x=height y=weight / markerattrs=(size=14 symbol=CircleFilled
                            color="Rose" );
run;

If you want to see some of the color names that SAS recognizes, run the following statements, which will print a list of colors to the SAS log:

proc registry list startat="COLORNAMES";
run;

The SAS color-naming convention

Some of the more "imaginative" names that SAS supports (such as "Peru," "Thistle," and "Gainsboro") might be unfamiliar. Fortunately, SAS has a color-naming scheme that enables you to specify a descriptive name for many colors, such as "Red," "Light Gray," or "Dark Green." To use the color-naming scheme, choose one value from each of the following categories:

  1. Lightness: Black (*), Very Dark, Dark, Medium (default), Light, Very Light, or White (*)
  2. Saturation: Gray (*), Grayish, Moderate, Strong, Vivid (default)
  3. Hue: Blue, Purple, Red, Orange or Brown, Yellow, Green

So, for example, valid names for colors include the following:

  • "Very Dark Grayish Blue"
  • "Dark Moderate Purple"
  • "Strong Red" (default value for Lightness)
  • "Light Brown" (default value for Saturation)

A few exceptions to the "choose one from each category" rule are indicated by asterisks. If you use Black or White, you cannot specify Saturation or Hue. Similarly, if you specify Gray, you cannot specify a Hue.

You can also specify a mixture of hues by specifying two adjacent hues in the list. Optionally, you can use the -ish suffix to diminish a hue, as in the following examples:

  • "Medium Strong Red Orange"
  • "Light Yellowish Green"
  • "Light Strong Brownish Orange"

For example, the following scatter plot sets the color of markers by using the SAS color-naming scheme:

proc sgplot data=sashelp.class;
scatter x=height y=weight / markerattrs=(size=14 symbol=CircleFilled
                            color="Light Strong Brownish Orange" );
run;

Representing colors by their RGB values

SAS also enables you to specify a color by using the red-green-blue (RGB) color model. The RGB color model is an additive model in which a color is specified in terms of red, green, and blue components. Each component ranges from 0 (which indicates the absence of a component) to 255 (which indicates the full presence of a component). A color is therefore a triplet of values that indicates the relative proportions of red, green, and blue. For example, the RGB triplet (255, 96, 96) represents a color that has a maximum amount of red and a small amount of green and blue.

The drawback of the RGB model is readability. If I write a program and specify a color as the RGB triplet (255, 96, 96), it is difficult to know what color is going to be produced. It turns out that this color is exactly "Rose," but you might have had a hard time predicting that in advance.

Most of the time, RGB triplets are compactly represented by a hexadecimal integer. The prefix CX tells SAS to interpret the integer as a color. You can find the hexadecimal representation of a color such as "Rose" by using the handy reference chart that lists SAS colors and their hexadecimal values.

The chart says that the RGB color for "Rose" is CXFF6060. This value is read two digits at a time:

  • CX means that the value is a color
  • FF is the hexadecimal (base-16) value for 16*15 + 15 = 255, which is the red component
  • 60 is the hexadecimal value for 16*6 + 0 = 96, which is the green component
  • 60 is the blue component

Consequently, the following statements reproduce the first image in this article:

/* "Rose" = RGB(255,96,96) = cxFF6060 */
proc sgplot data=sashelp.class;
scatter x=height y=weight / markerattrs=(size=14 symbol=CircleFilled
                            color=cxFF6060 );
run;

If you are a hard-core programmer, you're probably already thinking about how to write a SAS macro that converts a triplet of RGB values into a hexadecimal color. If you want to try to construct the macro yourself, stop reading now!

You can use the following SAS macro to create a hexadecimal value from a triplet of RGB colors. It's easy to use the PUT function to construct the hexadecimal value inside of a DATA step, but it is slightly more challenging to use the macro language to construct an "in-line" string that can be used outside of the DATA step. You can use the following macro to specify the COLOR= option for statement in the SGPLOT procedure:

/* convert integer 0--255 to 2 digit hex 00-FF */
%macro hex2(n);
  %local digits n1 n2;
  %let digits = 0123456789ABCDEF;
  %let n1 = %substr(&digits, &n / 16 + 1, 1);
  %let n2 = %substr(&digits, &n - &n / 16 * 16 + 1, 1);
  &n1&n2
%mend hex2;
 
/* convert RGB triplet (r,g,b) to SAS color in hexadecimal. 
   The r, g, and b parameters are integers in the range 0--255 */
%macro RGB(r,g,b);
  %cmpres(CX%hex2(&r)%hex2(&g)%hex2(&b))
%mend RGB;
 
data A;
put "Rose = %RGB(255,96,96)";
run;
 
proc sgplot data=sashelp.class;
scatter x=height y=weight / markerattrs=(size=14 symbol=CircleFilled
                            color=%RGB(255,96,96) );
run;

Because (255, 96, 96) specifies the same color as "rose," the image is identical to the first graph in this article.

So what's in a name? A lot of information! But in SAS, "rose" by the name of CXFF6060 looks just as sweet!

tags: Getting Started, SAS Programming, Statistical Graphics
10月 222012
 
SAS provides solutions that have analytics integrated into reports that deliver insight into what will happen in the future, not just report on what has happened in the past.  Who wouldn't love that? However, what makes the SAS programming environment so useful to end users is how efficient and effective [...]
10月 202012
 

According to the Daily Writing Tips blog, describing a thing as "somewhat unique" is bad form. Unique means "one of a kind", so either it is or it is not. The famous example (which the style police will use to chide you) is that you can't have something as "somewhat unique" any more than a woman can be "somewhat pregnant".

But when programming and dealing with computers, we often deal with the concept of "unique enough". And to fill that niche, we have the Globally Unique Identifiers, or GUIDs. (Most people pronounce this "GOO-ids".) These are also called UUIDs, for "universal unique identifiers". (But the Universe is pretty big, isn't it? Therefore using this term makes a pretty bold claim.)

A GUID is a 128-bit value, allowing for 2128 (or 3.4028237e+38) possibilities. Computer processes generate GUIDs behind the scenes all of the time for lots of purposes, and among geeks we joke that they will become a scarce resource, like helium. But rest assured: it will take us a long time to go through all of the possible GUID values.

When we see GUID values in the wild, they are usually expressed as a series of 32 hexadecimal characters, often containing hyphens like a crazy sort of phone number (as if you might attempt to commit it to memory):

   efb40385-6b5c-4e7f-9f19-1daeb7e97ed9

I created the above GUID fresh for this blog (it's never been seen before!) using a tool on my PC called uuidgen, which is part of the Windows SDK.

In the SAS world, I can use the UUIDGEN function to create these as needed from within my SAS program:

data wastedGuids;
do x=1 to 10;
  guid = uuidgen();
  output;
end;
run;

Result:

The UUIDGEN function relies on code libraries that are system-dependent, as there are different algorithms for creating GUID values. When I run the above program on a Linux version of SAS, I get a different pattern of results:

I suppose -- now that I've promoted this function on the blog -- we'll have SAS programmers cranking GUIDs out day-and-night. At this rate, will there be any GUIDs left for our grandchildren? Well, I can't worry about that -- live for today, I say.

Bonus joke: How do you catch a unique rabbit? (Reply in the comments, please.)

tags: SAS programming, uuidgen
10月 172012
 

Sometimes a graph is more interpretable if you assign specific colors to categories. For example, if you are graphing the number of Olympic medals won by various countries at the 2012 London Olympics, you might want to assign the colors gold, silver, and bronze to represent first-, second-, and third-place medals. A good choice of colors can reduce the time that a reader spends studying the legend, and increase the time spent studying the data. In this example, I assign a "traffic light" scheme to visualize data about gains and losses: green means positive gains, gray indicates little or no change, and red indicates negative change.

The SGPLOT procedure in SAS makes it easy to display a different color for each level of a grouping variable. By default, when you specify the GROUP= option for a graph, colors are assigned to groups based on the current ODS style. But what can you do if you want to assign specific colors for group categories?

There are several ways to assign colors to categories, including using PROC TEMPLATE to write your own ODS style. However, in SAS 9.3, the easiest option is to use an "attribute map." An attribute map is a small data set that describes how each category level should be rendered. Dan Heath wrote a blog post about the attribute map. In this article I give a simple example of using an attribute map with the SGPLOT procedure, and I show that you can use a format to help you create an attribute map.

In my last article, I created a scatter plot that shows the gains made by women in the workplace. This graph is reproduced below, and you can download the SAS program that contains the data and that creates the plot. The colors make it clear that the proportion of women has increased in most job categories.

The program begins by using PROC format to define a mapping between a continuous variable (the difference between the proportion of women in 1980 and the proportion in 2010) and five discrete levels:

proc format;
value Gain  low-< -10 ="Large Loss" /* [low,-10) */
           -10 -<  -5 ="Loss"       /* [-10, -5) */
            -5 -    5 ="No Change"  /* [ -5,  5] */
             5 -   10 ="Gains"      /* (  5, 10] */
            10  - high="Large Gains";/*( 10, high] */
run;

Notice that because I use a format, there is not an explicit categorical variable in the data set. Instead, there is a continuous variable (the difference) and a format on that variable.

Nevertheless, you can assign a color to each category by creating an attribute map that assigns formatted values to colors. To do this, define two variables, one named Value and the other MarkerColor. The values of the Value variable define the categories; the corresponding values of the MarkerColor variable define the colors of markers in the scatter plot. (There are other columns that you could define, such as LineColor and MarkerSymbol.) You could hard-code the values "Large Loss," "Loss", and so forth, but why not use the fact that there is a format? That way, if the labels in the format change, the graph will pick up the new labels. The following DATA step uses the PUTN function to apply the user-defined format to values that are representative of each category:

data Attrs;
length Value $20 MarkerColor $20;
ID = "Jobs"; 
Value = putn(-15,"Gain."); MarkerColor = "DarkRed     "; output;
Value = putn( -8,"Gain."); MarkerColor = "DarkOrange  "; output;
Value = putn(  0,"Gain."); MarkerColor = "MediumGray  "; output;
Value = putn(  8,"Gain."); MarkerColor = "GrayishGreen"; output;
Value = putn( 15,"Gain."); MarkerColor = "DarkGreen   "; output;
run;

Notice also that the attribute map includes an ID variable that is used to identify the map, because you can define multiple attribute maps in a single data set.

Equivalently, you could use arrays to store the "cut points" for each category, and use a loop to iterate over all possible categories. This is more general, and I think that the cut points make the program easier to understand than the "strange" numbers (15 and 8) in the previous attribute map.

data Attrs2;
length Value $20 MarkerColor $20;
ID = "Jobs";              
array cutpts{6} _temporary_(-100 -10 -5 5 10 100);
array colors{5} $20 _temporary_ 
      ("DarkRed" "DarkOrange" "MediumGray" "GrayishGreen" "DarkGreen");
drop i v;
do i = 1 to dim(cutpts)-1;
   v = (cutpts[i+1] + cutpts[i]) / 2; /* midpoint of interval */
   Value = putn(v,"Gain.");           /* formatted value for this interval */
   MarkerColor = colors[i];           /* color for this interval */
   output;
end;
run;

To use an attribute map, you have to specify two pieces of information: the data set (Attrs) and the ID variable (Jobs). You specify the data set by using the DATTRMAP= option on the PROC SGPLOT statement. You specify the map by using the ATTRID= option on the SCATTER statement, as follows:

proc sgplot data=Jobs DATTRMAP=Attrs; 
scatter x=PctFemale1980 y=PctFemale2010 / group=Diff ATTRID=Jobs; 
run;

Attribute maps are a useful feature of the SG procedures in SAS 9.3. Most of the time, the default colors that are defined in ODS styles are sufficient for visualizing grouped data. However, you can use attribute maps when you need to assign specific colors to categories. As I've shown in this article, you can also assign a color to a category that is defined by a format. In either case, an attribute map can make it easier for your audience to understand the data.

tags: Data Analysis, SAS Programming, Statistical Graphics
10月 032012
 

As citizens of the Internet, we are all familiar with IP addresses -- probably more so than our Internet founding fathers had ever intended. These addresses are typically represented in a 4-piece segmented list of numbers separated by dots. Here is an example: "149.173.5.120".

Each segment is called an octet because it contains 8 (count 'em, eight!) bits. The four-octect IP address is part of the IPv4 standard.

Note: There is a newer IPv6 standard (featuring 16 octets) that many newer networks use and which allows for more addresses. This has become necessary because all new consumer products are required by law to connect to the Internet. (I think that each of my daughter's "Polly Pocket" pieces can connect to WiFi.) But in this article I'm ignoring IPv6.

The easy-to-read segmented IP address is actually a 32-bit number, and sometimes it is useful to convert the display value into its numeric form. For example, consider the databases that help you to map an IP address to a geographic location in the world. These databases use a numerical range to map an address to a country or city. (For more on range-based geocoding, see this topic in the PROC GEOCODE documentation.)

Here is a SAS program that converts a character-based, 4-segment IP address into its equivalent numeric value. It uses the SCAN function, a DATA step ARRAY, and a little bit of math to do the work:

/* Calculate the numerical IP from "segmented" IP address            */
/* Example: (from Right to Left)                                     */
/* 1.2.3.4 = 4 + (3 * 256) + (2 * 256 * 256) + (1 * 256 * 256 * 256) */
/*   is 4 + 768 + 13,1072 + 16,777,216 = 16,909,060                  */
data ip_numbers (keep=ip_address ip_numeric);
  infile datalines dsd;
  length ip_address $ 20 ip_numeric 8;
  input ip_address;
  array ip_part {4};
  do i = 1 to 4;
    ip_part{i} = scan(ip_address,i,'.');
  end;
  ip_numeric = ip_part{4} +
    (ip_part{3} * 256) +
    (ip_part{2} * 256 * 256) +
    (ip_part{1} * 256 * 256 * 256);
datalines;
115.85.65.148
117.203.114.198
118.96.201.156
119.247.220.11
12.201.116.58
128.2.38.96
128.204.197.27
128.204.207.83
134.102.237.2
141.155.113.98
169.2.124.79
172.16.26.231
172.16.30.229
173.234.211.69
176.63.76.232
178.157.198.132
178.32.145.44
178.32.177.184
178.33.174.213
178.63.199.204
184.82.208.149
188.165.187.71
;
run;

Here's the output:

With this mapping, I can then combine my collection of IP addresses with one of the IP-to-geolocation databases that are available. (SAS provides a macro to help work with MaxMind, which you can learn about in this topic.) Here's a sample result:

tags: geocode, IP addresses, IPv4, SAS programming, SAS tips
9月 282012
 
Ron Cody’s at it again. It’s like he just can’t help himself; he knows SAS and wants to help others know it too. His new book, Cody's Collection of Popular SAS Programming Tasks and How to Tackle Them gathers often-used programming tasks and techniques in one handy volume, allowing users [...]
9月 012012
 
SAS users world-wide have turned to Susan Slaughter, Lora Delwiche, and The Little SAS Book  to learn SAS programming. This week's SAS tip is from their bestselling fourth edition of the book (the fifth edition is now available for preorder). Whichever version of The Little SAS Book you use, you'll benefit from the friendly [...]
8月 222012
 

Regular expressions provide a powerful method to find patterns in a string of text. However, the syntax for regular expressions is somewhat cryptic and difficult to devise. This is why, by my reckoning, approximately 97% of the regular expressions used in code today were copied and pasted from somewhere else. (Who knows who the original authors were? Some experts believe they were copied from ancient cave paintings in New Mexico.)

You can use regular expressions in your SAS programs, via the PRX* family of functions. These include PRXPARSE and PRXMATCH, among others. The classic example for regular expressions is to validate and standardize data values that might have been entered in different ways, such as a phone number or a zip code (with or without the plus-4 notation).

In this post I'll present another, less-trodden example. It's a regular expression that validates the syntax of a SAS variable name. (Now, I'm talking about the regular old traditional SAS variable names, and not those abominations that you can use with OPTIONS VALIDVARNAME=ANY.)

SAS variable names, as you know, can be 1 to 32 characters long, begin with a letter or underscore, and then contain letters, numbers, or underscores in any combination after that. If all you need is a way to validate such names, stop reading here and go learn about the NVALID function, which does exactly this. But if you want to learn a little bit about regular expressions, read on.

The following program shows the regular expression used in context:

data vars (keep=varname valid);
  length varname $ 50;
  input varname 1-50 ;
  re = prxparse('/^(?=.{1,32}$)([a-zA-Z_][a-zA-Z0-9_]*)$/');
  pos = prxmatch(re, trim(varname));
  valid = ifc(pos>0, "YES","NO");
cards;
var1
no space
1var
_temp
thisVarIsOver32charactersInLength
thisContainsFunkyChar!
_
yes_underscore_is_valid_name
run;

And the results:

Here's a closer look at the regular expression in "diagram" form, lightly annotated. (This reminds me a little of how we used to diagram sentences in 4th grade. Does anybody do that anymore?)


Among the more interesting aspects of this expression is the lookahead portion, which checks that the value is between 1 and 32 characters right out of the gate. If that test fails, the expression fails to match immediately. Of course, you could use the LENGTHN function to check that, but it's nice to include all of the tests in a single expression. I mean, if you're going to write a cryptic expression, you might as well go all the way.

Unless you live an alternate reality where normal text looks like cartoon-style swear words, there really isn't much that's "regular" about regular expressions. But they are extremely powerful as a data validation and cleansing tool, and they exist in just about every programming language, so it's a transferable skill. If you take the time to learn how to use them effectively, it's sure to pay off.

tags: prxmatch, regular expressions, SAS programming
8月 152012
 

The project that I'm currently working on requires several input data tables, and those tables must have a specific schema. That is, each input table must contain columns of a specific name, type, and length in order for the rest of the system to function correctly. The schema requirements aren't too fancy; there are no specs for keys, indexes, or constraints. It's simply "your data sets need columns that look like this".

Even though the schema is "published", the data tables themselves will be prepared by humans using Who Knows What Processes. Therefore, we must adopt a "trust but verify" approach before accepting the data and turning the crank on what could be a long-running analysis.

I could "eyeball" the differences by capturing the data column attributes from each table and comparing them with what I know I need, but that approach doesn't scale well.

I wrote a simple SAS macro that I use to create a report that shows whether the input data passes a simple schema "sniff test". It compares a "candidate" table against a table that has a known "good" column layout, and reports on the substantive differences.

DOWNLOAD: %check_reference_data from here.

The output of the macro is another data set that contains one record per column-pair, showing a list of potential problems with the input data, ranked by severity. The severity assignments are a bit arbitrary, but effective enough for my project:

  • "SEV 1" is assigned if a required column is missing, or is present but has a different type (numeric vs. character).
  • "SEV-2" is assigned if a required column has a greater-than-expected length, as it might lead to truncation of data values downstream.
  • "SEV-3" is assigned if the column has a different SAS format, as it could indicate possible misinterpretation (a DATE format vs. DATETIME, for example).
  • And "SEV-4" is more of an "FYI" when the input data contains additional columns that aren't required.

Sample uses:

%check_reference_data(SASHELP,CLASS,SASHELP,CLASSFIT,WORK.CHECK);
title "Does CLASSFIT have everything that CLASS has?  Yes, and more";
proc print data=work.check;
run;

%check_reference_data(SASHELP,CLASSFIT,SASHELP,CLASS,WORK.CHECK);
title "Does CLASS have everything that CLASSFIT has?  No, it does not.";
proc print data=work.check;
run;

If you examine the code behind the macro, you might notice that it creates a temporary view of the input data. Why? During work on my project, I noticed that some of the supplied input data differed from the schema only by one or two column names. Sometimes it's easier to "fix" that data on the way in than to send it back to the customer to regenerate. The temp view allows me to apply simple data set options like this:

%check_reference_data(REF,CLASS,CUST,CLASS(rename=(GENDER=SEX)),WORK.CHECK);
title "Fixed for customer-provided data that was SEX-less";
proc print data=work.check;
run;

And by the way, for this check to work I don't need to keep a complete copy of "good" data lying around. I can create an empty table with the desired schema "on the fly" like so, and then use it as the reference table in my macro:

/* create reference table attributes */
/* using SQL, but DATA step works too */
proc sql;
create table work.PATIENT (
  SEX                    VARCHAR(1) FORMAT=$1.,
  YOB                    NUMERIC(8) FORMAT=4.,
  ZIP_CODE               VARCHAR(5),
  PATIENT_ID             VARCHAR(15)
);
quit;

This macro provides a simple, "good-enough" approach for verifying the input data. It provides instant feedback as to whether the data can be used, and gives us confidence that we can proceed to the next step of the analysis.

tags: data sets, SAS programming, SAS tips, sql