sas programming

6月 242016

One thing that we have a lot of at SAS: installations of SAS software that we can run. I have SAS for Windows on my laptop, and I have access to many centralized instances of SAS that run on Linux and Windows servers. (I also have access to mainframe SAS, though it's been a while since I've used it. When I log in, I picture a Rube-Goldberg style mechanism that pokes an intern to mount a tape so my profile can be reloaded.)

I often develop programs using my local instance of SAS and SAS Enterprise Guide, but deploy them for use on a central server. I might run them as batch jobs or interactively with SAS Enterprise Guide or SAS Studio or even in SAS/IntrNet.

Our IT department wants SAS employees to have seamless access to their files whether on Windows or on Unix-style file systems, and so they make it easy to access the same network path from Windows (using UNC notation, or "\serverpath" syntax) and Unix (using "/node/usr/path" syntax). As I develop my SAS programs, I want the programs to work the same whether run from Windows or Unix, and I don't want to have to change LIBNAME paths each time. Fortunately, SAS programs are usually portable across different operating systems, and while SAS data sets might have different encodings across systems, SAS can always read a data set that was created by a different version.

I have a simple technique that references the proper path for the operating system that I'm using. I build a SAS macro variable by using the IFC function and the &SYSSCP automatic variable to check whether I'm running on Windows, then assign the path accordingly.

/* Use the IFC function as a shorthand for if-then, returning a character string */
%let tgtpath = %sysfunc(
  ifc(&SYSSCP. = WIN,
libname tgt "&tgtpath.";

When I run this on SAS for Linux, I see this in the log:

NOTE: Libref TGT was successfully assigned as follows: 
      Engine:        V9 
      Physical Name: /r/node/vol/vol01/mydept/project

And on Windows:

NOTE: Libref TGT was successfully assigned as follows: 
      Engine:        V9 
      Physical Name: \sasprodrootdeptmydeptproject
tags: SAS programming, SAS tips

The post Assign a SAS library to a different path depending on your OS appeared first on The SAS Dummy.

6月 202016

Every beginning SAS programmer learns the simple IF-THEN/ELSE statement for conditional processing in the SAS DATA step. The basic If-THEN statement handles two cases: if a condition is true, the program does one thing, otherwise the program does something else.

Of course, you can handle more cases by using multiple ELSE IF statements. I have seen SAS programs that contain contains dozens of ELSE clauses. Sometimes a long sequence of IF-THEN/ELSE statements is necessary, such as when you are testing complex logical conditions.

Flow chart diagram for a switch statement (SELECT statement)

An alternative control statement in SAS is the SELECT-WHEN statement. The SELECT-WHEN statement (sometimes simply called the SELECT statement) enables you to conditionally execute statements based on the value of a single categorical variable. Usually the variable can have three or more valid values that you want to handle.

The following example uses the Sashelp.Heart data set, which contains data for 5,167 patients in a medical study. The Smoking_Status variable is a categorical variable that encodes the average number of cigarettes that each patient smokes per day. The following DATA step view implements a recoding scheme, which is sometimes the easiest way to force levels of a nominal variable to appear in a natural order during a SAS analysis.

/* example of using the SELECT statement */
data Heart / view=Heart;
set sashelp.heart;
select (Smoking_Status);
   when ('Non-smoker')        Smoking_Cat=1;
   when ('Light (1-5)')       Smoking_Cat=2;
   when ('Moderate (6-15)')   Smoking_Cat=3;
   when ('Heavy (16-25)')     Smoking_Cat=4;
   when ('Very Heavy (> 25)') Smoking_Cat=5;
   otherwise                  Smoking_Cat=.;

The SELECT-WHEN statement is easy to read. You specify the name of a variable on the SELECT statement. You then list a sequence of WHEN statements. Each WHEN statement specifies a particular value for the variable. If the variable has that value, the program conditionally executes a statement, which in this example assigns a value to the Smoking_Cat variable.

Notice that you can use the OTHERWISE keyword to handle missing values, invalid data, or default actions.

You can also combine categories in a WHEN statement. For example, in a statistical analysis you might want to combine the 'Heavy' and 'Very Heavy' categories into a single group. In the WHEN statement you can specify multiple values in a comma-separated list:

   /* combine the 'Heavy' and 'Very Heavy' categories */
   when ('Heavy (16-25)', 'Very Heavy (> 25)') Smoking_Cat=4;

If the WHEN condition is true, the program will execute one statement. This is the same rule that the IF-THEN statement follows. To execute more than one statement, use a DO-END block, which groups statements together:

   when ('Non-smoker') do;       /* execute multiple statements */
      IsSmoker = 0;

I use the SELECT-WHEN statement as a "table lookup" when a program needs to branch according to the value of a single categorical variable that has three or more valid values. The basic SELECT-WHEN statement is not as flexible as the IF-THEN/ELSE statement, but, when applicable, it results in very clean and easy-to-read programs.

Other languages have similar branching statements. The SQL language supports a CASE-WHEN statement. The C/C++ and Java/Javascript languages support a switch-case statement. Whereas the CASE-WHEN statement in SAS executes one statement, the switch-case statement implements fallthrough, so C-programmers often use the break statement to exit the switch block.

Some languages do not support a special switch statement, but instead require that you use IF-THEN/ELSE statements. Python and the SAS/IML language fall into this category.

There is an alternative syntax for the SELECT-WHEN statement that does not specify an expression in the SELECT statement. Instead, you specify logical conditions in the WHEN statements. This alternate syntax is essentially equivalent to an IF-THEN/ELSE statement, so which syntax you use is a matter of personal preference. Personally, I use SELECT-WHEN for branching on a known set of discrete values, and I use the IF-THEN/ELSE statement to handle more complex situations.

tags: Getting Started, SAS Programming

The post The SELECT statement in the SAS DATA step appeared first on The DO Loop.

5月 122016

These days SAS programmers have more choices than ever before about how to run SAS.  They can use the old Display Manager interface, or SAS Enterprise Guide, or the new kid on the block: SAS StudioAll of these are included with Base SAS.


SAS Display Manager


SAS Enterprise Guide


SAS Studio

Once upon a time, the only choices were Display Manager (officially named the SAS windowing environment), or batch.  Then along came SAS Enterprise Guide.  (Ok, I know there were a few others, but I don’t count SAS/ASSIST which was rightly spurned by SAS users, or the Analyst application which was just a stopover on the highway to SAS Enterprise Guide.)

I recently asked a SAS user, “Which interface do you use for SAS programming?”

She replied, “Interface?  I just install SAS and use it.”

“You’re using Display Manager,” I explained, but she had no idea what I was talking about.

Trust me.  This person is an extremely sophisticated SAS user who does a lot of leading-edge mathematical programming, but she didn’t realize that Display Manager is not SAS.  It is just an interface to SAS.

This is where old timers like me have an advantage.  If you can remember running SAS in batch, then you know that Display Manager, SAS Enterprise Guide, and SAS Studio are just interfaces to SAS–wonderful, manna from heaven–but still just interfaces.  They are optional.  You could write SAS programs in Word or Notepad or some other editor, and submit them in batch–but why would you?  (I know someone is going to tell me that they do, in fact, do that, but the point is that it is not mainstream.  Only mega-nerds with the instincts of a true hacker do that these days.)

Each of these interfaces has advantages and disadvantages.  I’m not going to list them all here, because this is a blog not an encyclopedia, but the tweet would be

“DM is the simplest, EG has projects, SS runs in browsers.”

Personally, I think all of these interfaces are keepers.  At least for the near future, all three of these interfaces will continue to be used.  What we are seeing here is a proliferation of choices, not displacement of one with another.

So what’s your SAS interface?


5月 092016

Last week I analyzed 12 million records of taxi cap transactions in New York City. As part of that analysis, I used a DATA step view to create a new variable, which was the ratio of the tip amount to the fare amount.

A novice SAS programmer told me that he has never heard of a "DATA step view." He asked, "What is a DATA step view?"

Simply put, a "view" is a SAS DATA step program that is stored for later execution. When the SAS language processor encounters the RUN statement, the program is compiled and saved, but not executed. When a procedure uses a data view, the program runs and serves data to the procedure as if the procedure were reading a regular SAS data set. Thus you can use a view to manipulate data "on the fly."

I like to create views when I want to construct a new variable in a huge data set, but I don't want to physically copy the data. When I analyze the data view by using another procedure, the constructed variable is computed on the fly.

Here's an example. Suppose that you have a large data set that includes heights and weights for millions of patients. Some (but perhaps not all) analysts at your company need to analyze the body-mass index (BMI) for these patients. You have two options. The first option is to create a new data set that has the new column in it. This requires that you duplicate the original data and add a new column. The second option is to keep the original data unchanged, but create a view that computes the BMI. Any analyst that needs the BMI can access the data by using the view.

Let's see how this works by using a small data set. The Sashelp.Class data set contains height and weight (in English measurements) for 19 children. The BMI formula for children is slightly more complicated than for adults, but for simplicity the following SAS program simply uses the adult formula. The following DATA step creates a data view by using the VIEW= option on the DATA statement. The MEANS procedure then analyzes the newly calculated BMI variable:

data BMI / view=BMI;                 /* define DATA step view */
   set Sashelp.Class;
   BMI = weight / height**2 * 703;   /* BMI formula for adults (pounds and inches) */
proc means data=BMI;
   var BMI;

As you can see, the syntax for the MEANS procedure does not change. In fact, the only syntax that changes is the VIEW= option in the DATA step.

When SAS encounters the RUN statement in the DATA step, it saves the program. The program is not executed until it is used in the DATA= option in PROC MEANS. At that point, SAS executes the program and computes the BMI variable, which the procedure consumes and analyzes.

There are three main advantage of a data view: reduced storage space, not cluttering up a data set with extra variables, and if the view uses data that is updated regularly (for example, nightly sales data) then the view always reads the current data.

Three advantages to DATA step views #SAStip
Click To Tweet

The main disadvantage to DATA step views is that the computed columns must be recomputed every time that a procedure uses the view. If you specify DATA=BMI for additional procedures, the BMI variable is recomputed each time, which is slower than reading pre-computed data.

For more information about SAS data views, see

Do you use DATA step views in your company? Leave a comment.

tags: SAS Programming

The post What is a DATA step view and why is it important? appeared first on The DO Loop.

4月 072016

I know what you're thinking: two "Boaty McBoatface" articles within two weeks? And we're past April Fool's Day?

But since I posted my original analysis about the "Name our ship" phenomenon that's happening in the UK right now, a new contender has appeared: Poppy-Mai.

The cause of Poppy-Mai, a critically ill infant who has captured the imagination of many British citizens (and indeed, of the world), has made a very large dent in the lead that Boaty McBoatface holds.

Yes, "Boaty" still has a-better-than 4:1 lead. But that's a lot closer than the 10:1 lead (over "Henry Worsley") from just over a week ago. Check out the box plot now: you can actually make out a few more dots. Voting is open for another 10 days -- and as we have seen, a lot can happen in that time.

As I take this second look at the submissions (now almost 6300) and voting data (almost 350,000 votes cast), I've found a few more entries that made me chuckle. Some of them struck me by their word play, and others cater to my nerdy sensibilities. Here they are (capitalization retained):

While I'm on this topic, I want to give a shout-out to regex101, the online regular expression tester. I was able to develop and test my regular expressions before dropping them into a PRXPARSE function call. I found that I had to adjust my regular expression to cast a wider net for valid titles from the names submissions data. Previously, I wasn't capturing all of the punctuation. While that's probably because I didn't expect punctuation to be part of a ship's name, that assumption doesn't stop people from suggesting and voting on such names. My new regex match:

  title_regex = prxparse("/'title':s?""([a-zA-Z0-9'.-_#s$%&()@!]+)/");

I could probably optimize by specifying an exception pattern instead of an inclusion pattern...but this isn't the sort of project where I worry about that.

Will I write about Boaty McBoatface again? What will my next Boaty article reveal? Stay tuned!

tags: Boaty McBoatface, regular expressions, SAS programming, SGPLOT

The post Boaty McBoatface is on the run appeared first on The SAS Dummy.

3月 262016

In a voting contest, is it possible for a huge population to get behind a ridiculous candidate with such force that no other contestant can possibly catch up? The answer is: Yes.

Just ask the folks at NERC, the environmental research organization in the UK. They are commissioning a new vessel for polar research, and they decided to crowdsource the naming process. Anyone in the world is welcome to visit their NameOurShip web site and suggest a name or vote on an existing name submission.

As of today, the leading name is "RRS Boaty McBoatface." ("RRS" is standard prefix for a Royal Research Ship.) This wonderfully creative name is winning the race by more than just a little bit: it has 10 times the number of votes as the next highest vote getter, "RRS Henry Worsley".

I wondered whether the raw data for this poll might be available, and I was pleased to find it embedded in the web page that shows the current entries. The raw data is in JSON format, embedded in the source of the HTML page. I saved the web page source to my local machine, copied out just the JSON line with the submissions data, then used SAS to parse the results. Here's my code:

filename records "c:projectsvotedata.txt";

data votes (keep=title likes);
 length likes 8;
 format likes comma20.;
 label likes="Votes";
 length len 8;
 infile records;
  if _n_ = 1 then
      retain likes_regex title_regex;
      likes_regex = prxparse("/'likes':s?([0-9]*)/");
      title_regex = prxparse("/'title':s?""([a-zA-Z0-9's]+)/");

 position = prxmatch(likes_regex,_infile_);
  if (position ^= 0) then
      call prxposn(likes_regex, 1, start, len);
      likes = substr(_infile_,start,len);
 start=0; len=0;

 position = prxmatch(title_regex,_infile_);
  if (position ^= 0) then
      call prxposn(title_regex, 1, start, len);
      title = substr(_infile_,start,len);

With the data in SAS, I used PROC FREQ to show the current tally:

title "Vote tally for NERC's Name Our Ship campaign";
proc freq data=votes order=freq;
table title;
weight likes;

The numbers are compelling: good ol' Boaty Mac has over 42% of the nearly 200,000 votes. The arguably more-respectable "Henry Worsley" entry is tracking at just 4%. I'm not an expert on polling and sample sizes, but even I can tell that Boaty McBoatface is going to be tough to beat.

To drive the point home a bit more, let's look at a box plot of the votes distribution.

title "Distribution of votes for ALL submissions";
proc sgplot data=votes;
hbox likes;
xaxis valueattrs=(size=12pt);

In this output, we have a clear outlier:
If we exclude Boaty, then it shows a slightly closer race among the other runners up (which include some good serious entries, plus some whimsical entries, such as "Boatimus Prime"):

title "Distribution of votes for ALL submissions except Boaty McBoatface";
proc sgplot data=votes(where=(title^="Boaty McBoatface"));
hbox likes;
xaxis valueattrs=(size=12pt);

See the difference between the automatic axis values between the two graphs? The tick marks show 80,000 vs. 8,000 as the top values.

Digging further, I wondered whether there were some recurring themes in the entries. I decided to calculate word frequencies using a technique I found on our SAS Support Communities (thanks to Cynthia Zender for sharing):

/* Tally the words across all submissions */
data wdcount(keep=word);
    set votes;
    i = 1;
    origword = scan(title,i);
    word = compress(lowcase(origword),'?');
    wordord = i;
    do until (origword = ' ');
        /* exclude the most common words */
        if word not in ('a','the','of','and') then output;
        i + 1;
        wordord = i;
        origword = scan(title,i);
        word = compress(lowcase(origword),'?');
proc sql;
   create table work.wordcounts as 
   select t1.word, 
          /* count_of_word */
            (count(t1.word)) as word_count
      from work.wdcount t1
      group by t1.word
      order by word_count desc;
title "Frequently occurring words in boat name submissions";
proc print data=wordcounts(obs=25);

The top words evoke the northern, cold nature of the boat's mission. Here are the top 25 words and their counts:

  1    polar         352 
  2    ice           193 
  3    explorer      110 
  4    arctic         86 
  5    red            69 
  6    sir            55 
  7    john           54 
  8    lady           46 
  9    sea            42 
 10    ocean          42 
 11    scott          41 
 12    bear           39 
 13    aurora         38 
 14    artic          37 
 15    queen          37 
 16    captain        36 
 17    james          36 
 18    endeavour      35 
 19    william        35 
 20    star           34 
 21    spirit         34 
 22    new            26 
 23    antarctic      26 
 24    boat           25 
 25    cold           25 

I don't know when voting closes, so maybe whimsy will yet be outvoted by a more serious entry. Or maybe NERC will exercise their right to "take this under advisement" and set a certain standard for the finalist names. Whatever the outcome, I'm sure we haven't heard the last of Boaty...

tags: regular expressions, SAS programming, SGPLOT

The post And it's Boaty McBoatface by an order of magnitude appeared first on The SAS Dummy.

3月 052016

In previous articles, I've shared tips about how you can work with SAS and ZIP files without requiring an external tool like WinZip, gzip, or 7-Zip. I've covered:

But a customer approached me the other day with one scenario I missed: how to add SAS data sets to an existing ZIP file. It's a variation of a tip that I've already shared, but with two differences. First, in order to add a data set to a ZIP file, you have to know its physical filename -- not just the LIBNAME.MEMBER reference that you use in SAS procedure steps. And second, I had not shown how to add a new file to an existing ZIP archive -- though it turns out that's pretty simple.

Find the file name for a SAS data set

There are several ways to do this. For my approach, I used the output from PROC CONTENTS. Notice that I had to capture the ODS output (not the OUT= data set) to grab the file name. I wrapped it in a macro for easy reuse. And since I ultimately need a SAS fileref to map to the path, I've assigned one (data_fn) in my macro.

/* macro to assign a fileref to a SAS data set in a Base library */
%macro assignFilerefToDataset(_dataset_name);
    %local outDsName;
    ods output EngineHost=File;
    proc contents data=&_dataset_name.;
    proc sql noprint;
        select cValue1 into: outDsName 
            from work.file where Label1="Filename";
    filename data_fn "&outDsName.";

How to add a new member to a ZIP file

Now that I have the source file, I need to designate a destination file in a ZIP archive. The FILENAME ZIP method will create a new ZIP file if one does not yet exist, or it can add to an existing ZIP. To ensure I'm starting from scratch, I assign a simple fileref to my target destination and then delete the file.

/* Assign the fileref - basic file method */
filename projzip "&projectDir./";
/* Start with a clean slate - delete ZIP if it exists */
data _null_;

To create a new ZIP file and designate a path and file name within it, I used the FILENAME ZIP method with the MEMBER= option. Note that I specified the "data/" subfolder in the MEMBER= value; this will place the file into a named subfolder within the archive.

/* Use FILENAME ZIP to add a new member -- CLASS */
/* Put it in the data subfolder */
filename addfile zip "&projectDir./" 

Then finally, I need to actually "copy" the file into the archive. I do this by streaming the source file into the target fileref byte-by-byte:

/* byte-by-byte copy */
/* "copies" the new file into the ZIP archive */
data _null_;
    infile data_fn recfm=n;
    file addfile recfm=n;
    input byte $char1. @;
    put  byte $char1. @;
filename addfile clear;

That's it! I now have a ZIP file with one member entry. Now I can "press repeat" to add a second entry:

/* Use FILENAME ZIP to add a new member -- CARS */
/* Put it in the data subfolder */
filename addfile zip "&projectDir./" 
/* byte-by-byte copy */
/* "copies" the new file into the ZIP archive */
data _null_;
    infile data_fn recfm=n;
    file addfile recfm=n;
    input byte $char1. @;
    put  byte $char1. @;
filename addfile clear;

Optional: Report on the ZIP file contents

If I want to report on the total contents of the ZIP file now, here's a DATA step and PROC CONTENTS step that does the job:

/* OPTIONAL for reporting */
/* Report on the contents of the ZIP file */
/* Assign a fileref wth the ZIP method */
filename inzip zip "&projectDir./";
/* Read the "members" (files) from the ZIP file */
data contents(keep=memname);
    length memname $200;
    if fid=0 then
    do i=1 to memcount;
/* create a report of the ZIP contents */
title "Files in the ZIP file";
proc print data=contents noobs N;


Files in the ZIP file 

N = 2

I hope that this helps to make the FILENAME ZIP method more useful to those who want to try it out. I'm sure that there will be more scenarios that people will ask about; someday, if I write enough blog posts, I'll have it all covered!

Sample program: You can view/download the entire SAS program (containing the snippets I've featured and more) from my GitHub profile.

tags: FILENAME ZIP, SAS 9.4, SAS programming, ZIP files

The post Add files to a ZIP archive with FILENAME ZIP appeared first on The SAS Dummy.

2月 292016

One of the first things SAS programmers learn is that SAS data sets can be specified in two ways. You can use a two-level name such as "sashelp.class" which uses a SAS libref (SASHELP) and a member name (CLASS) to specify the location of the data set. Alternatively, you can use a one-level name such as "TempData," and SAS searches for the data set in a default location.

In many SAS environments, one-level data set names are like the seven little dwarves: Heigh-Ho, heigh-ho, it's off to WORK they go! In other words, the WORK directory is the default location for one-level names. Consequently, one-level names often imply "temporary data," because data sets in WORK are deleted when you exit SAS.

However, it is possible to use the OPTIONS statement to change the libref that SAS searches when you specify a one-level SAS data set name. The option name is USER. The following statements specify a new libref that SAS should use as the default location for one-level data set names:

libname DEFLIB "C:/Temp";    /* define any libref */
options user=DEFLIB;         /* set the default location for one-level names */

For example, the following DATA step uses a one-level name for the data set. Consequently, the data set is created in the USER directory and PROC DATASETS lists the data sets in USER rather than WORK:

data TempData; x=1; y=2; z=3; run;  /* create data set using one-level name */
proc datasets; run;                 /* note that it is in the USER libref! */
Default libref for one-level name

Personally, I never do this because data sets in USER are not deleted when SAS exits. However, this example shows that one-level names are not always stored in WORK.

Discover the default storage location

If a one-level data set name is not necessarily in WORK, can you programmatically discover the libref where the data set is? Yes! The GETOPTION function returns the value for any SAS option, so you can retrieve the value of the USER option. For example, the following DATA step discovers the libref and data set name for a specified data set. For a two-level name, the name contains a period, which you can find by using the FINDC function. You can then use the SUBSTR function to extract the name of the libref and data set. If the data set name is a one-level name, then the GETOPTION function obtains the default libref. (If the USER option is not set, GETOPTION returns a blank string.)

%let MyData = TempData;       /* specify one-level or two-level data set name */
data _null_;
dsName = "&MyData";
LocDot = findc(dsName, ".");          /* Does name contain a period (.)?     */
if LocDot > 0 then do;                /*   Yes: it is a two-level name       */
   lib = substr(dsName, 1, LocDot-1); /*     get substring before the period */
   member = substr(dsName, LocDot+1); /*     get substring after the period  */
else do;                              /*   No: it is a one-level name        */
   lib = getoption("user");           /*   Has this option been defined?     */
   if lib = ' ' then lib = "work";    /*     No: use WORK                    */
   member = dsName;
put lib=;
put member=;

In summary, although one-level data set names are usually stored in WORK, that is not always the case. However, a programmer can use the GETOPTION function to discover the libref where one-level data sets are stored.

An application to SAS/IML programming

The reason I was interested in the GETOPTION function is that I was trying to write a function in SAS/IML that would accept a one- or two-level data set name and return the names of the variables in the data. The CONTENTS function in SAS/IML almost does what I want, but the CONTENTS function has two different signatures, one for two-level names and one for one-level names:

  • For two-level names, use two arguments: varNames = contents(lib, name);
  • For one-level names, use one argument: varNames = contents(name);

I wanted to write a function that accepts a single string (a one-level or two-level data set name) and calls the appropriate signature of the CONTENTS function. The following SAS/IML function does the job:

proc iml;
/* new CONTENTS function that handles one- and two-level data set names */
start ContentsEx( dsName );              /* "Ex" means "extended" */
   LocDot = findc(dsName, ".");          /* Does name contain a period (.)?     */
   if LocDot > 0 then do;                /*   Yes: it is a two-level name       */
      lib = substr(dsName, 1, LocDot-1); /*     get substring before the period */
      member = substr(dsName, LocDot+1); /*     get substring after the period  */
      return( contents(lib, member) );
   return( contents(dsName) );           /*   No: it is a one-level name        */
dsName = "&MyData";
varNames =  ContentsEx( dsName );
print varNames;

Have you ever had the need to use the USER option to override the default storage location for one-level data set names? Leave a comment.

tags: SAS Programming

The post One-level data set names in SAS are not always stored in WORK appeared first on The DO Loop.

2月 222016

A dummy variable (also known as indicator variable) is a numeric variable that indicates the presence or absence of some level of a categorical variable. The word "dummy" does not imply that these variables are not smart. Rather, dummy variables serve as a substitute or a proxy for a categorical variable, just as a "crash-test dummy" is a substitute for a crash victim, or a "sewing dummy" is a dressmaker's proxy for the human body.

In regression and other statistical analyses, a categorical variable can be replaced by dummy variables. For example, a categorical variable with levels "Low," "Moderate," and "High" can be represented by using three binary dummy variables. The first dummy variable has the value 1 for observations that have the level "Low," and 0 for the other observations. The second dummy variable has the value 1 for observations that have the level "Moderate," and zero for the others. The third dummy variable encodes the "High" level.

There are many ways to construct dummy variables in SAS. Some programmers use the DATA step, but there is an easier way. This article discusses the GLMMOD procedure, which produces basic binary dummy variables. A subsequent blog post discusses other SAS procedures that provide alternative methods for representating categorical variables.

Why generate dummy variables in SAS?

Many programmers never have to generate dummy variables in SAS because most SAS procedures that model categorical variables contain a CLASS statement. If a procedure contains a CLASS statement, then the procedure will automatically create and use dummy variables as part of the analysis.

However, it can be useful to create a SAS data set that explicitly contains a design matrix, which is a numerical matrix that use dummy variables to represent categorical variables. A design matrix also includes columns for continuous variables, the intercept term, and interaction effects. A few reasons to generate a design matrix are:

  • Students might need to create a design matrix so that they can fully understand the connections between regression models and matrix computations.
  • If a SAS procedure does not support a CLASS statement, you can use often use dummy variables in place of a classification variable. An example is PROC REG, which does not support the CLASS statement, although for most regression analyses you can use PROC GLM or PROC GLMSELECT. Another example is the MCMC procedure, whose documentation includes an example that creates a design matrix for a Bayesian regression model.
  • In simulation studies of regression models, it is easy to generate responses by using matrix computations with a numerical design matrix. It is harder to use classification variables directly.

PROC GLMMOD: Design matrices that use the GLM parameterization

The following DATA step create a data set with 10 observations. It has one continuous variable (Cholesterol) and two categorical variables. One categorical variable (Sex) has two levels and the other (BP_Status) has three levels.

data Patients;
   keep Cholesterol Sex BP_Status;
   set sashelp.heart;
   if 18 <= _N_ <= 27;
proc print;  var Cholesterol Sex BP_Status;  run;
Original data with categorical variables

The GLMMOD procedure can create dummy variables for each categorical variable. If a categorical variable contains k levels, the GLMMOD procedure creates k binary dummy variables. The GLMMOD procedure uses a syntax that is identical to the MODEL statement in PROC GLM, so it is very easy to use to create interaction effects.

The following call to PROC GLMMOD creates an output data set that contains the dummy variables. The output data set is named by using the OUTDESIGN= option. The OUTPARAM= option creates a second data set that associates each dummy variable to a level of a categorical variable:

proc glmmod data=Patients outdesign=GLMDesign outparm=GLMParm;
   class sex BP_Status;
   model Cholesterol = Sex BP_Status;
proc print data=GLMDesign; run;
proc print data=GLMParm; run;
Dummy variables in SAS for each level of the categorical variables

The OUTDESIGN= data set contains the design matrix, which includes variables named COL1, COL2, COL3, and so forth. The OUTPARM= data set associates levels of the original variables to the dummy variables. For these data, the GLMMOD procedure creates six binary columns. The first is the intercept column. The next two encode the Sex variable. The last three encode the BP_Status variable. If you specify interactions between the original variables, additional dummy variables are created. Notice that the order of the columns is the sort order of the values of their levels. For example, the "Female" column appears before the "Male" column.

When you use this design matrix in a regression analysis, the parameter estimates of main effects estimate the difference in the effects of each level compared to the last level (in alphabetical order). The following statements show that using the dummy variables in PROC REG give the same parameter estimates as are obtained by using the original classification variables in PROC GLM:

ods graphics off;
/* regression analysis by using dummy variables */
proc reg data=GLMDesign;
   DummyVars: model Cholesterol = COL2-COL6; /* dummy variables except intercept */
   ods select ParameterEstimates;
/* same analysis by using the CLASS statement */
proc glm data=Patients;
   class sex BP_Status;              /* generates dummy variables internally */
   model Cholesterol = Sex BP_Status / solution;
   ods select ParameterEstimates;
PROC REG output for dummy variables in SAS

The parameter estimates from PROC REG is shown. The parameter estimates from PROC GLM are identical. Notice that the parameter estimates for the last level are set to zero and the standard errors are assigned missing values. This occurs because the dummy variable for each categorical variable is redundant. For example, the second dummy variable for the Sex variable ("Males") is a linear combination of the intercept column and the dummy variable for "Females"). Similarly, the last dummy variable for the BP_Status variable ("Optimal") is a linear combination of the intercept column and the "High" and "Normal" dummy variables. By setting the parameter estimate to zero, the last column for each set of dummy variables does not contribute to the model.

For this reason, the GLM encoding is called a singular parameterization. In my next blog post I will present ways to parameterize levels of the categorical variables. These different parameterizations lead to nonsingular design matrices.

tags: Getting Started, SAS Programming

The post Create dummy variables in SAS appeared first on The DO Loop.