sas programming

1月 182019

It seems that everyone knows about GitHub -- the service that hosts many popular open source code projects. The underpinnings of GitHub are based on Git, which is itself an open-source implementation of a source management system. Git was originally built to help developers collaborate on Linux (yet another famous open source project) -- but now we all use it for all types of projects.

There are other free and for-pay services that use Git, like Bitbucket and GitLab. And there are countless products that embed Git for its versioning and collaboration features. In 2014, SAS developers added built-in Git support for SAS Enterprise Guide.

Since then, Git (and GitHub) have grown to play an even larger role in data science operations and DevOps in general. Automation is a key component for production work -- including check-in, check-out, commit, and rollback. In response, SAS has added Git integration to more SAS products, including:

  • the Base SAS programming language, via a collection of SAS functions.
  • SAS Data Integration Studio, via a new source control plugin
  • SAS Studio (experimental in v3.8)

You can use this Git integration with any service that supports Git (GitHub, GitLab, etc.), or with your own private Git servers and even just local Git repositories.

SAS functions for Git

Git infrastructure and functions were added to SAS 9.4 Maintenance 6. The new SAS functions all have the helpful prefix of "GITFN_" (signifying "Git fun!", I assume). Here's a partial list:

GITFN_CLONE  Clones a Git repository (for example, from GitHub) into a directory on the SAS server.
GITFN_COMMIT  Commits staged files to the local repository
GITFN_DIFF Returns the number of diffs between two commits in the local repository and creates a diff record object for the local repository.
GITFN_PUSH  Pushes the committed files in the local repository to the remote repository.
GITFN_NEW_BRANCH  Creates a Git branch


The function names make sense if you're familiar with Git lingo. If you're new to Git, you'll need to learn the terms that go with the commands: clone, repo, commit, stage, blame, and more. This handbook provided by GitHub is friendly and easy to read. (Or you can start with this xkcd comic.)

You can

data _null_;
 version = gitfn_version();
 put version=;             
 rc = gitfn_clone("",
 put rc=;

In one line, this function fetches an entire collection of code files from your source control system. Here's a more concrete example that fetches the code to a work space, then runs a program from that repository. (This is safe for you to try -- here's the code that will be pulled/run. It even works from SAS University Edition.)

options dlcreatedir;
%let repoPath = %sysfunc(getoption(WORK))/sas-dummy-blog;
libname repo "&repoPath.";
libname repo clear;
/* Fetch latest code from GitHub */
data _null_;
 rc = gitfn_clone("",
 put rc=;
/* run the code in this session */
%include "&repoPath./";

You could use the other GITFN functions to stage and commit the output from your SAS jobs, including log files, data sets, ODS results -- whatever you need to keep and version.

Using Git in SAS Data Integration Studio

SAS Data Integration Studio has supported source control integration for many years, but only for CVS and Subversion (still in wide use, but they aren't media darlings like GitHub). By popular request, the latest version of SAS Data Integration Studio adds support for a Git plug-in.

Example of Git in SAS DI Studio

See the documentation for details:

Read more about setup and use in the available here as part of our "Custom Tasks Tuesday" series.

Using Git in SAS Enterprise Guide

This isn't new, but I'll include it for completeness. SAS Enterprise Guide supports built-in Git repository support for SAS programs that are stored in your project file. You can use this feature without having to set up any external Git servers or repositories. Also, SAS Enterprise Guide can recognize when you reference programs that are managed in an external Git repository. This integration enables features like program history, compare differences, commit, and more. Read more and see a demo of this in action here.

program history

If you use SAS Enterprise Guide to edit and run SAS programs that are managed in an external Git repository, here's an important tip. Change your project file properties to "Use paths relative to the project for programs and importable files." You'll find this checkbox in File->Project Properties.

With this enabled, you can store the project file (EGP) and any SAS programs together in Git, organized into subfolders if you want. As long as these are cloned into a similar structure on any system you use, the file paths will resolve automatically.

The post Using built-in Git operations in SAS appeared first on The SAS Dummy.

1月 022019

Last year, I wrote more than 100 posts for The DO Loop blog. Of these, the most popular articles were about data visualization, SAS programming tips, and statistical data analysis. Here are the most popular articles from 2018 in each category.

Data Visualization

General SAS programming techniques

Statistics and Data Analysis

I write this blog because I love to learn new things and share what I know with others. If you want to learn something new, read (or re-read!) these popular articles from 2018. Then share this page with one of your colleagues. Happy New Year! I hope we both have many opportunities to learn and share in 2019!

The post Top posts from <em>The DO Loop</em> in 2018 appeared first on The DO Loop.

12月 212018

Good news -- the SAS program that you wrote and put into production 10 years ago still works. Hey, it's SAS, so you probably take that for granted. But are those techniques from 2008 still the best way to accomplish your task? SAS 9.4, first released in 2013 and now refreshed with its sixth maintenance release, continues to extend the SAS programming language. New features allow you to simplify your code, make it run faster, and erase some of that technical debt you've been carrying due to previous workarounds or limitations.

The reason that I'm writing this post now is to recognize the next chapter for my long-time colleague, Rick Langston. After 38 years at SAS (and more time as a SAS user before that) Rick is retiring from his role. Many of you know him as a major steward of the SAS programming language. And many of the tips that I've shared on this blog are made possible by Rick's work. Here's an interview that I hosted with Rick in 2013, just before SAS 9.4 was first released.

Five cool features of the SAS language: the details

In the above video, Rick talks about three SAS features that were first introduced in 2013. I've added a couple of more recent items to the list to round it out.

FILENAME ZIP access method

This brings the ability to read and write compressed ZIP files, and GZIP files, directly into the SAS language. Use this feature to replace the clunky (and not always feasible) calls into external tools such as gzip, WinZip, or 7-Zip. In SAS, a "native" FILENAME access method is more portable and robust than calling out to an external tool with FILENAME PIPE.

For more information, check out the many SAS blog posts with examples that I've shared over the years.

DOSUBL function

Have you ever wanted to run another SAS procedure from inside of a DATA step? Rick calls this "submitting SAS code on the side", as it allows you to run a SAS step or statement from within a currently running step. You can learn more this SAS Global Forum paper by Rick. I've also written a post with a specific example in SAS Enterprise Guide.

LOCKDOWN system option and statement

This one will excite SAS administrators. You can set the LOCKDOWN system option in a batch SAS session or SAS Workspace server to limit some of the "dangerous" functions of SAS and, more importantly, limit the file areas in which the SAS session will operate. Read more in this article, Fencing in your SAS users with LOCKDOWN.

Creating and managing directories within SAS

This technique combines two features into a one-two punch of file folder management. Use the DLCREATEDIR option and the LIBNAME statement to create a new directory, and then use the DLGCDIR function to change the current directory of the SAS session.

In the past, you would have had to issue operating system commands to create a new directory and then switch ('cd') into it. That approach is not portable across different operating systems, and it requires access to the operating system shell -- not available in many SAS sessions these days. See these blog posts for more information and examples about the new techniques:

Using %IF/%THEN/%ELSE in open code

Perhaps the most life-changing of all of these SAS language updates, you can now use simple if-then-else logic for program flow, outside of the confines of the SAS macro language. It's what makes defensive programming like this possible -- without having to wrap the logic in %MACRO/%MEND:

%if %symexist(config_root) %then %do;
  filename config "&config_root./config.json";
  libname config json fileref=config;
  data _null_;
   set config.root;
   call symputx('tenant_id',tenant_id,'G');
   call symputx('client_id',client_id,'G');
   call symputx('redirect_uri',redirect_uri,'G');
   call symputx('resource',resource,'G');
%else %do;
  %put ERROR: You must define the CONFIG_ROOT macro variable.; 

Read more here: Using %IF/%THEN/%ELSE in SAS programs.

A legacy in the making

I'm going to miss having Rick Langston as a SAS colleague. There aren't many other people who know how to spin up a version of SAS from 30 years ago to help me track down a curious question. However, I'm not worried about the future of the SAS language. Rick has been excellent about sharing his knowledge for decades (just check his annual contributions to SAS Global Forum), and his team is well-suited to carry on the work of extending the SAS programming language for the next generation of SAS users. Thanks to Rick for helping to build such a solid foundation.

The post Five SAS programming language features you should be using appeared first on The SAS Dummy.

11月 192018

You might know that you can use the ODS SELECT statement to display only some of the tables and graphs that are created by a SAS procedure. But did you know that you can use a WHERE clause on the ODS SELECT statement to display tables that match a pattern? This article shows how to use wildcards, regular expressions, and pattern matching to select ODS tables in SAS.

ODS SELECT: Filter the output from SAS procedures

A SAS procedure might produce a dozen or more tables. You might be interested in displaying a subset of those tables. Recall that you can use the ODS TRACE ON statement to obtain a list of all the tables and graphs that a procedure creates. You can then use the ODS SELECT or the ODS EXCLUDE statement to control which tables and graphs are displayed.

Here's an example from the SAS/STAT documentation. The following PROC LOGISTIC call creates 27 tables and graphs, most of which are related to ROC curves. The ODS TRACE ON statement displays the names of each output object in the SAS log:

data roc;
   input alb tp totscore popind @@;
   totscore = 10 - totscore;
3.0 5.8 10 0   3.2 6.3  5 1   3.9 6.8  3 1   2.8 4.8  6 0
3.2 5.8  3 1   0.9 4.0  5 0   2.5 5.7  8 0   1.6 5.6  5 1
3.8 5.7  5 1   3.7 6.7  6 1   3.2 5.4  4 1   3.8 6.6  6 1
4.1 6.6  5 1   3.6 5.7  5 1   4.3 7.0  4 1   3.6 6.7  4 0
2.3 4.4  6 1   4.2 7.6  4 0   4.0 6.6  6 0   3.5 5.8  6 1
3.8 6.8  7 1   3.0 4.7  8 0   4.5 7.4  5 1   3.7 7.4  5 1
3.1 6.6  6 1   4.1 8.2  6 1   4.3 7.0  5 1   4.3 6.5  4 1
3.2 5.1  5 1   2.6 4.7  6 1   3.3 6.8  6 0   1.7 4.0  7 0
3.7 6.1  5 1   3.3 6.3  7 1   4.2 7.7  6 1   3.5 6.2  5 1
2.9 5.7  9 0   2.1 4.8  7 1   2.8 6.2  8 0   4.0 7.0  7 1
3.3 5.7  6 1   3.7 6.9  5 1   3.6 6.6  5 1
ods graphics on;
ods trace on;
proc logistic data=roc;
   model popind(event='0') = alb tp totscore / nofit;
   roc 'Albumin' alb;
   roc 'K-G Score' totscore;
   roc 'Total Protein' tp;
   roccontrast reference('K-G Score') / estimate e;

The SAS log displays the names of the tables and graphs. A portion of log is shown below:

Output Added:
Name:       OddsRatios
Label:      Odds Ratios
Template:   Stat.Logistic.OddsRatios
Path:       Logistic.ROC3.OddsRatios

Output Added:
Name:       ROCCurve
Label:      ROC Curve
Template:   Stat.Logistic.Graphics.ROC
Path:       Logistic.ROC3.ROCCurve

Output Added:
Name:       ROCOverlay
Label:      ROC Curves
Template:   Stat.Logistic.Graphics.ROCOverlay
Path:       Logistic.ROCComparisons.ROCOverlay

Output Added:
Name:       ROCAssociation
Label:      ROC Association Statistics
Template:   Stat.Logistic.ROCAssociation
Path:       Logistic.ROCComparisons.ROCAssociation

Output Added:
Name:       ROCContrastCoeff
Label:      ROC Contrast Coefficients
Template:   Stat.Logistic.ROCContrastCoeff
Path:       Logistic.ROCComparisons.ROCContrastCoeff

Only a few of the 27 ODS objects are shown here. Notice that each ODS object has four properties: a name, a label, a template, and a path. Most of the time, the name is used on the ODS SELECT statement to filter the output. For example, if you want to display only the ROC curves and the overlay of the ROC curves, you can put the following statement prior to the RUN statement in the procedure:

   ods select ROCCurve ROCOverlay;     /* specify the names literally */

Use a WHERE clause in the ODS SELECT statement

Often the ODS objects that you want to display are related to each other. In the LOGISTIC example, you might want to display all the information about ROC curves. Fortunately, the SAS developers often use a common prefix or suffix, such as 'ROC', in the names of the ODS objects. That means that you can display all ROC-related tables and graphs be selecting the ODS objects whose name (or path) contains 'ROC' as a substring.

You can use the WHERE clause to select ODS objects whose name (or label or path) matches a particular pattern. The object's name is available in a special variable named _NAME_. Similarly, the object's label and path are available in variables named _LABEL_ and _PATH_, respectively. You cannot match patterns in the template string; there is no _TEMPLATE_ variable.

In SAS, the following operators and functions are useful for matching strings:

  • The CONTAINS keyword matches strings that contains a specified substring. The question mark (?) is an equivalent way to specify the CONTAINS operator.
  • The LIKE keyword matches strings to a pattern. The underscore (_) is a wildcard that matches any character. The percent sign (%) is a wildcard that matches one or more characters.
  • The "begins with" operators (=: and in:) match strings that begin with a certain pattern or set of patterns, respectively.
  • SAS functions such as FIND, INDEX, and SUBSTR can be used to match patterns.
  • The SAS PRXMATCH function, which enables you to use Perl regular expressions to match patterns.

For example, the following statements select ODS tables and graphs from the previous PROC LOGISTIC call. You can put one of these statements before the RUN statement in the procedure:

   /* use any one of the following statements inside the PROC LOGISTIC call */
   ods select where=(_name_ =: 'ROC');                 /* name starts with 'ROC' */
   ods select where=(_name_ like 'ROC%');              /* name starts with 'ROC' */
   ods select where=(_path_ ? 'ROC');                  /* path contains 'ROC' */
   ods select where=(_label_ ? 'ROC');                 /* label contains 'ROC' */
   ods select where=(_name_ in: ('Odds', 'ROC'));      /* name starts with 'Odds' or 'ROC' */
   ods select where=(substr(_name_,4,8)='Contrast');   /* name has subtring 'Contrast' at position 4 */

For additional examples of using pattern matching to select ODS objects, see Warren Kuhfeld's graphics-focused blog post and the section of the SAS/STAT User's Guide that discusses selecting ODS graphics.

Use PRXMATCH to match regular expressions

Although the CONTAIN and LIKE operators are often sufficient for selecting a table, SAS provides the powerful PRXMATCH function for more complex pattern-matching tasks. The PRXMATCH function uses Perl regular expressions to match strings. SAS provides a Perl Regular Expression "cheat sheet" that summarizes the syntax and commons search queries for the PRXMATCH function.

You can put any of the following statements inside the PROC LOGISTIC call:

   /* use any one of the following PRXMATCH expressions inside the PROC LOGISTIC call */
   ods select where=(prxmatch('/ROC/',       _name_)); /* name contains 'ROC' anywhere */
   ods select where=(prxmatch('/^ROC/',      _name_)); /* name starts with 'ROC' */
   ods select where=(prxmatch('/Odds|^ROC/', _name_)); /* name contains 'Odds' anywhere or 'ROC' at the beginning */
   ods select where=(prxmatch('/ROC/',      _name_)=0);     /* name does NOT contain 'ROC' anywhere */
   ods select where=(prxmatch('/Logistic\.ROC2/', _path_)); /* escape special wildcard character '.' */

In summary, the WHERE= option on the ODS SELECT (and ODS EXCLUDE) statement is quite powerful. Many SAS programmers know how to list the names of tables and graphs on the ODS SELECT statement to display only a subset of the output. However, the WHERE= option enables you to use wildcards and regular expressions to select objects whose names or paths match a certain pattern. This can be a quick and efficient way to select tables that are related to each other and share a common prefix or suffix in their name.

The post Select ODS tables by using wildcards and regular expressions in SAS appeared first on The DO Loop.

11月 132018

In my previous blog post I demonstrated how to create your own CAS actions and action sets.  In this post, we will explore how to create your own CAS functions using the CAS Language (CASL).  A function is a component of the CASL programming language that can accept arguments, perform a computation or other operation, and return a value.  The value that is returned can be used in an assignment statement or elsewhere in expressions.

About SAS functions

SAS provides two types of supplied functions: built-in functions and common functions.  Built-in functions contain functionality that is unique to CASL.  These allow you to perform operations on your result tables, arrays, and dictionaries, and provide run-time support for your CASL programs.  Built-in functions cannot be replaced with user-defined functions.

Conversely, common functions provide functionality that is common to other SAS functions.  When used in a CASL program, SAS functions take a CASL value and a CASL value is returned.  Unlike built-in functions, you can replace these functions with user-defined functions.

Since the capabilities of built-in functions are unique to CASL, let’s look at these in-depth and demonstrate with an example.  Save the following FedSQL code in an external file called  This code will be read into CAS and stored as a variable.

The execDirect action executes FedSQL code in CAS.  The READPATH built-in function reads the FedSQL code saved in and stores it in the CASL variable sqlcode which is used as input to the query parameter.

The fetch action displays the first 20 rows from the output table hmeq.out.

If you don’t feel like looking through the documentation for a built-in or common function, a list of each can be generated programmatically.  Run the following code to see a list of built-in functions.

Partial list of CASL built-in functions

Run the following code to see a list of common functions.

Partial list of common functions

User-defined CASL functions

In addition to the customizable capabilities of built-in functions supplied by SAS, you can also create your own functions using the FUNCTION statement.  User-defined functions can be called in expressions using CASL and they provide a large amount of flexibility.  The following example creates four different functions for temperature conversion.

After creating these functions, they can be called immediately, or you can store them in an external file and call them via a %include statement.  In this example, the user-defined functions have been stored in an external file called  You can call one, all, or any number of your user-defined functions.

The output from each function call is displayed in the log.

Lastly, if you want to see all user-defined functions, run the FUNCTIONLIST statement.  A list will be printed to the log.

More about CASL programming and using functions in CASL

Check out these resources for further information on programming in the CASL language and using functions in CASL.

Customize your CASL code with built-in and user-defined functions was published on SAS Users.

10月 312018

A useful feature in PROC SGPLOT is the ability to easily visualize subgroups of data. Most statements in the SGPLOT procedure support a GROUP= option that enables you to overlay plots of subgroups. When you use the GROUP= option, observations are assigned attributes (colors, line patterns, symbols, ...) that indicate the value of the grouping variable. This article reviews the GROUP= option and shows how to trick PROC SGPLOT into performing a group analysis for statements that do not support the GROUP= option.

Three ways to plot data by groups

It is common to use colors or symbols to indicate which observations belong to each category of a grouping variable. Typical grouping variables include gender (male and female), political affiliation (democrats, republicans, and independents), race, education level, and so forth. When you use the SAS SG procedures to plot subsets of the data, there are three ways to arrange the plots. You can plot each group individually, you can create a panel of graphs, or you can overlay the groups on a single graph:

  • If you use the BY statement in PROC SGPLOT, each subgroup is plotted independently in its own graph. The axes are scaled based only on the data in that subgroup.
  • If you use the PANELBY statement in PROC SGPANEL, each subgroup is plotted in a cell of a lattice in which the axes are scaled to a common range.
  • If you use the GROUP= option, the plots for each subgroup are overlaid in a single graph.

The following SAS statements demonstrate each approach. Only the GROUP= overlay is displayed because that is the topic of this article:

proc sgplot data=Sashelp.Iris;        /* BY-group visualization. Three independent graphs.  */
   by Species;
   histogram SepalLength;
   density SepalLength / type=kernel;
proc sgpanel data=Sashelp.Iris;       /* Panel visualization. Shared common axis. */
   panelby Species / columns=1 onepanel;
   histogram SepalLength;
   density SepalLength / type=kernel;
proc sgplot data=Sashelp.Iris;       /* Overlay three plots in one graph  */
   histogram SepalLength / GROUP=Species binstart=42 binwidth=3 transparency=0.5;
   density SepalLength / type=kernel GROUP=Species;

How to emulate the GROUP= option

Many SGPLOT statements (such as the SERIES and SCATTER statements) have supported the GROUP= option since the early days of ODS graphics. For other statements, support for the GROUP= option was added more recently. For example, the GROUP= option was added to the HISTOGRAM and DENSITY statements in SAS 9.4M2.

Here is a trick (shown to me by my colleague, Paul) that you can use to emulate the GROUP= option. If a statement in the SGPLOT procedure does not support the GROUP= option, but the statement DOES support the FREQ= option, you can often use the FREQ= option to construct a graph that overlays the subgroups. You need to do two things. First, you need to create binary indicators variables (sometimes called dummy variables) for each level of the categorical variable. You then use multiple statements, each with a different frequency variable, to overlay the subgroups. These two steps are shown by the following DATA step and call to PROC SGPLOT, which uses the FREQ= trick to overlay three histograms:

/* emulate a GROUP= option for SGPLOT statements that do not support GROUP= */
data IrisFreq;
   set sashelp.Iris;
   Freq1 = (Species='Setosa');     /* Binary. Equals 1 if observation is in 'Setosa' group     */
   Freq2 = (Species='Versicolor'); /* Binary. Equals 1 if observation is in 'Versicolor' group */
   Freq3 = (Species='Virginica');  /* Binary. Equals 1 if observation is in 'Verginica' group  */
title "Overlay Histograms by Using the FREQ= Option";
%let binOpts = binstart=42 binWidth=3 transparency=0.5; /* ensure common bins */
proc sgplot data=IrisFreq;
   histogram SepalLength / freq=Freq1 &binOpts;    /* only the 'Setosa' group     */
   histogram SepalLength / freq=Freq2 &binOpts;    /* only the 'Versicolor' group */
   histogram SepalLength / freq=Freq3 &binOpts;    /* only the 'Virginica' group  */

The graph overlays three histograms, one for each value of the Species variable. The result is similar to the earlier graph that used the GROUP= option. You can use the same trick on the DENSITY statement, although you will need to manually set the line attributes so that they match the attributes for the corresponding histograms.

You can use this technique in old versions of SAS to emulate the GROUP= option on the HISTOGRAM statement. You can also use it for statements that do not support the GROUP= option.

Although this example uses the DATA step to manually create the dummy variables that are used as frequencies, you can also create the dummy variables automatically by generating the "design matrix" for the Species variable. The GLMMOD procedure is the simplest way to create dummy variables in SAS, but other procedures provide additional features.

Generate prediction ellipses for groups

Several years ago I showed how you can overlay prediction ellipses for each group on a scatter plot. (Note that the ELLIPSE statement does not support a GROUP= option.) The technique requires that you transpose the data from long to wide form by creating new variables, one for each group of the categorical variable. Paul recognized that creating a dummy variable and using the FREQ= option is a simpler way to overlay prediction ellipses on a scatter plot:

title "Prediction Ellipses for Iris Data";
proc sgplot data=IrisFreq;
   scatter x=PetalLength y=PetalWidth / group=Species;
   ellipse x=PetalLength y=PetalWidth / freq=Freq1 legendlabel="Setosa";
   ellipse x=PetalLength y=PetalWidth / freq=Freq2 legendlabel="Versicolor";
   ellipse x=PetalLength y=PetalWidth / freq=Freq3 legendlabel="Virginica";

Advantages and disadvantages of the FREQ= trick

The main advantage of using the FREQ= option for group processing is that it enables you to overlay subgroups even when a statement does not support the GROUP= option. A secondary advantage is that this technique gives you complete control of the attributes of each subgroup. Although you can use the STYLEATTRS statement to control many group attributes, the STYLEATTTRS statement does not enable you to control marker sizes or line widths, to name two examples.

The FREQ= trick does have some disadvantages:

  • You can't use the FREQ= trick for statements that produce graphs of categorical variables. The SGPLOT documentation states, “If your plot is overlaid with other categorization plots, then the first FREQ variable that you specified is used for all of the plots.” [My emphasis.]
  • As mentioned earlier, if you are trying to produce multiple grouped plots, you might need to manually assign attributes to obtain consistency among the levels of the grouping variables. By default, most ODS styles use different attributes for each statement. If you want the attributes for the fourth statement to match the attributes for the first statement, you need to use an option such as LINEATTRS=GraphData1 on the fourth statement.

In conclusion, if a statement supports the GROUP= option, you should probably use that option to overlay plots of the groups. But if a statement does NOT support the GROUP= option (such as the ELLIPSE and HEATMAP statements), you can use the FREQ= trick to emulate the GROUP= behavior.

I thank my colleague, Paul, for showing me the ellipse example. I hope you agree that this trick is a real treat, not just on Halloween, but every day!

The post A trick to plot groups in PROC SGPLOT appeared first on The DO Loop.

8月 102018

Hash tables are a very powerful and flexible data structure. Most SAS applications of hash tables focus on just one of their many powerful facilities: table lookup. Hash tables are a fantastic table lookup tool and their use for that should never be diminished. However, hash tables can do so [...]

The post Five things you (probably) don’t know you can do with a hash table appeared first on SAS Learning Post.

7月 302018

A programmer recently asked a question on a SAS discussion forum about design matrices for categorical variables. He had generated a design matrix by using PROC GLMMOD and wanted to use the design columns in a subsequent procedure. However, the columns were named COL1, COL2, COL3,..., so he couldn't tell which dummy variables correspond to each categorical variable. The following example illustrates his situation for the Weight_Status and Smoking_Status variables in the Sashelp.Heart data set:

proc glmmod data=Sashelp.Heart outdesign=GLMDesign;
   class Weight_Status Smoking_Status;
   model Cholesterol = Weight_Status Smoking_Status;
   ods select Parameters;
proc contents data=GLMDesign varnum short; run;
The association betwen columns in a design matrix and levels of the original categorical variables

The "Parameters" table shows the association between columns of the design matrix and levels of the categorical variables in the model. The output from PROC CONTENTS shows that the columns of the design matrix (as stored in the GLMDesign data set) are named COL1, COL2, and so forth. On the discussion forum, I showed how to save the "Parameters" table to a SAS data and use a DATA _NULL_ step to form macro variables that you can use to associate the design columns to the original variables.

In retrospect, I missed a golden opportunity to mention that the GLMMOD procedure (which always produces a singular design matrix) is not as friendly or powerful as other procedures in SAS that can generate design matrices. In particular, the GLMSELECT and TRANSREG procedures can create design matrices for many different parameterizations of the classification variables. Furthermore, these procedures automatically create macro variables that tell you the names of the columns in the design matrix.

As a reminder, a major reason to create a design matrix is to perform an analysis with a SAS procedure that does not support a CLASS statement. For example, the documentation of the MCMC procedure shows how to use PROC TRANSREG to create a design matrix as preparation for performing a Bayesian regression analysis. For details of the many ways to generate design matrices in SAS, see my previous article, "Four ways to create a design matrix in SAS."

GLMSELECT: An easier way to associate columns of a design matrix

The GLMSELECT procedure supports the OUTDESIGN= option, which enables you to output a design matrix for the variables in a regression model. The GLMSELECT procedure has the following advantages of the GLMMOD procedure:

  • The procedure supports the EFFECT statement, which you can use to define spline effects, collection effects, and more. The OUTDESIGN= option creates columns for any effect that you can define.
  • The procedure supports nonsingular parameterizations such as the 'effect' or 'reference' parameterizations (PARAM=EFFECT or PARAM=REF, respectively).
  • The dummy variables have meaningful names of the form VarName_Level, where VarName is the name of a categorical variable and Level is one of its values.
  • The procedure automatically creates a macro variable (called _GLSMod) that contains the names of the columns of the design matrix.

The following statements create a design matrix for the Weight_Status and Smoking_Status variables. The design matrix uses the 'effect coding' parameterization and is written to the GLSDesign data set. The value of the _GLSMod macro is displayed in the SAS log and can be used in a subsequent procedure call:

proc glmselect data=Sashelp.Heart outdesign(fullmodel)=GLSDesign noprint;
   class Weight_Status(ref='Normal')
         Smoking_Status(ref='Non-smoker') / param=effect;
   model Cholesterol = Weight_Status Smoking_Status / selection=none; /* include ALL dummy variables */
ods select Position;
proc contents data=GLSDesign varnum; run;   /* display all variables in OUTDESIGN= data set */
%put &_GLSMod;                              /* names of columns in design matrix */
--- SAS Log ---
Weight_Status_Overweight Weight_Status_Underweight
Smoking_Status_Heavy__16_25_ Smoking_Status_Light__1_5_
Smoking_Status_Moderate__6_15_ Smoking_Status_Very_Heavy____25_

The output from PROC CONTENTS show the variables in the GLSDesign data set, which are INTERCEPT, CHOLESTEROL (the response variable in the model), and the dummy variables listed in the _GLSMod macro variable. By default, the dummy variables have the pattern VarName_Level, although you can use the OUTDESIGN(PREFIX=prefix) option to generate column names of the form prefix1, prefix2, and so forth, where prefix is a user-specified prefix. You can see that values such as "Heavy (16-25)" are transformed into valid names for SAS variables.

In this example, the _GLSMOD macro contains the names of ALL dummy variables because the SELECTION=NONE option is used. If you use an option (such as SELECTION=LASSO) to perform variable selection, the _GLSMOD variable will contain the names of the dummy variables that are selected in the final model. This means that you can use the _GLSMOD variable to perform additional analyses. For example, if you want to use POC REG to output collinearity diagnostics for the variables in the final model, you could execute the following:

/* use the variables in the final model in a procedure that does not support a CLASS statement */
proc reg data=GLSDesign plots=none; 
   model Cholesterol = &_GLSMod / collin;

TRANSREG: Dummy variables and transformed variable

In a similar way, the TRANSREG procedure supports the DESIGN option, which adds dummy variables to the output data set, as shown in the following call. The syntax is different, but the dummy variables in the TRGDesign data set are formed by the same 'effect coding' as in the previous example. The procedure automatically creates a macro variable (called _TRGInd) that contains the names of the columns of the design matrix.

proc transreg data=Sashelp.Heart design;
   model class(Weight_Status Smoking_Status / EFFECTS 
          zero="Normal"      "Non-smoker");
   output out=TRGDesign;
%put &_TRGInd;
Weight_StatusOverweight Weight_StatusUnderweight Smoking_StatusHeavy__16_25_ Smoking_StatusLight__1_5_
Smoking_StatusModerate__6_15_ Smoking_StatusVery_Heavy____25_

The dummy variables contain the same values as in the previous example, but the two procedures construct slightly different names. The GLMSELECT names contain an extra underscore. For example, a dummy variable in the GLSDesign data set is Weight_Status_Overweight whereas the corresponding variable in the TRGDesign data set is Weight_StatusOverweight.

It is worth noting the PROC TRANSREG also supports macro variables that specify the names of dependent variables and transformed variables. You can even use the MACRO option to specify the name of the macro variables that are created. For details, see the documentation for the PROC TRANSREG statement.

In summary, when you generate a design matrix by using the GLMSELECT or TRANSREG procedures, the procedures create dummy variables that have meaningful names of the form VarName_Level or VarNameLevel. The procedures each create a macro variable that contains the names of the dummy variables. The _GLSMOD macro that is created by PROC GLMSELECT contains the names of dummy variables in the final selected model, so use SELECTION=NONE if you want all names. These macros make it easy for a programmer to refer to columns of the design matrix.

The post Meaningful names for columns of a design matrix appeared first on The DO Loop.