Now is your chance to learn even more about SAS hash tables with four additional articles on the subject.
Last year, I wrote more than 100 posts for The DO Loop blog. Of these, the most popular articles were about data visualization, SAS programming tips, and statistical data analysis. Here are the most popular articles from 2018 in each category.
Data Visualization
- Visualize repetition in song lyrics: In one of my favorite posts of the year, I use SAS to visualize the "repetition matrix" that shows how words and phrases are repeated in the lyrics of some popular songs. I had so much fun writing the post that I revisited it in December to visualize repetition in Christmas songs.
- Rating US Presidents: 166 experts in political science ranked the 44 US presidents on a "greatness scale." You can use SAS to visualize the ranking of US presidents according to these experts.
- Customize legends in statistical graphs: In two articles, I showed ways to customize the legends in SAS statistical graphs. First, I showed how to combine symbols and line patterns in a legend by using the new LEGENDITEM statement. Later, I showed five ways to customize a legend in the SGPLOT procedure.
General SAS programming techniques
- CLASS versus BY variables: What is the difference between the CLASS statement and BY variables in SAS? When should you use one instead of the other in a SAS procedure?
- How to create a "Top 10" table and graph: Often it is useful to visualize the largest (most frequently occurring) categories of a categorical variable. This article shows how to use PROC FREQ to display only the top 10 levels (or five, or three,...) of a categorical variable.
- How to use DATA _NULL_ step: When you use the keyword '_NULL_' as the name of a data set, the output is not written. This article shows six reasons to use the _NULL_ data set in your SAS programs.
- FIRST.variable and LAST.variable in a BY-group: This article gives several examples of how to use the FIRST.variable and LAST.variable indicator variables for BY-group analysis in the SAS DATA step.
- Specify lists of variables in SAS: The SAS language provides syntax that enables you to specify lists of variables. This article describes six ways to specify a list of variables in SAS, including SAS keywords and syntax such as hyphens, colons, and double hyphens.
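As a minimal sketch of the BY-group technique mentioned in the list above, the following DATA step counts the observations in each BY group by using the FIRST.variable and LAST.variable indicators. The Sashelp.Class data set and the names ClassSorted, GroupCounts, and nObs are chosen only for illustration; they are not taken from the original article.

```sas
/* BY-group processing requires data that are sorted by the BY variable */
proc sort data=Sashelp.Class out=ClassSorted;
   by Sex;
run;

data GroupCounts;
   set ClassSorted;
   by Sex;
   if first.Sex then nObs = 0;   /* reset the counter at the start of each group */
   nObs + 1;                     /* sum statement: nObs is retained across rows  */
   if last.Sex then output;      /* write one row per group                      */
   keep Sex nObs;
run;

proc print data=GroupCounts noobs;
run;
```

The sum statement implicitly retains nObs, which is why the counter must be reset when FIRST.Sex equals 1.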
Statistics and Data Analysis
- New random number generators in SAS: Have you heard that SAS has a slew of new random number generators? Read about why you might want to use a new random number generator (RNG) instead of the default RNG. Or, you can watch a video about the new SAS random number generators.
- Calibration plots: A calibration plot for a binary response model can help you to rank subjects according to risk. It also is a way to diagnose lack of fit in a model. Learn about calibration curves and how to use a smooth curve to compare the predicted and empirical probabilities for the data. The SAS developer of PROC LOGISTIC liked this article a lot, so look for calibration plots in a future release of SAS/STAT software!
- Bootstrap methods in SAS: I published six bootstrap articles in 2018. The most popular articles demonstrate how to bootstrap the statistics in a two-sample t test. You can use the BOOTSTRAP statement in PROC TTEST or you can go "old school" and manually perform a bootstrap analysis for the t test.
- Chi-square tests in SAS: The chi-square test is a fundamental hypothesis test for testing the proportions of categorical variables. My first article shows how to compute a two-variable chi-square test for association in SAS. A second article shows how to specify the proportions for a one-way chi-square test of proportions.
I write this blog because I love to learn new things and share what I know with others. If you want to learn something new, read (or re-read!) these popular articles from 2018. Then share this page with one of your colleagues. Happy New Year! I hope we both have many opportunities to learn and share in 2019!
Good news -- the SAS program that you wrote and put into production 10 years ago still works. Hey, it's SAS, so you probably take that for granted. But are those techniques from 2008 still the best way to accomplish your task? SAS 9.4, first released in 2013 and now refreshed with its sixth maintenance release, continues to extend the SAS programming language. New features allow you to simplify your code, make it run faster, and erase some of that technical debt you've been carrying due to previous workarounds or limitations.
The reason that I'm writing this post now is to recognize the next chapter for my long-time colleague, Rick Langston. After 38 years at SAS (and more time as a SAS user before that) Rick is retiring from his role. Many of you know him as a major steward of the SAS programming language. And many of the tips that I've shared on this blog are made possible by Rick's work. Here's an interview that I hosted with Rick in 2013, just before SAS 9.4 was first released.
Five cool features of the SAS language: the details
In the above video, Rick talks about three SAS features that were first introduced in 2013. I've added a couple of more recent items to the list to round it out.
FILENAME ZIP access method
This brings the ability to read and write compressed ZIP files, and GZIP files, directly into the SAS language. Use this feature to replace the clunky (and not always feasible) calls into external tools such as gzip, WinZip, or 7-Zip. In SAS, a "native" FILENAME access method is more portable and robust than calling out to an external tool with FILENAME PIPE.
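Here is a minimal sketch of the idea, assuming a SAS 9.4 session that can write to its WORK directory. The archive name demo.zip, the member name class.csv, and the macro variable zipfile are made up for this illustration. The first step writes a CSV member into a ZIP archive and the second step reads it back, without calling any external tool.

```sas
/* Hypothetical location: put the archive in the WORK directory of this session */
%let zipfile = %sysfunc(getoption(work))/demo.zip;

/* Write: the MEMBER= option names the file inside the ZIP archive */
filename outzip ZIP "&zipfile" member="class.csv";
data _null_;
   set Sashelp.Class;
   file outzip dlm=',';
   put Name Sex Age Height Weight;
run;

/* Read: point a second fileref at the same member and use INFILE as usual */
filename inzip ZIP "&zipfile" member="class.csv";
data ClassCopy;
   infile inzip dlm=',' truncover;
   input Name :$8. Sex :$1. Age Height Weight;
run;

filename outzip clear;
filename inzip clear;
```

For a single gzip-compressed file, the same access method accepts a GZIP option in place of MEMBER= in recent SAS 9.4 maintenance releases.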
For more information, check out the many SAS blog posts with examples that I've shared over the years.
DOSUBL function
Have you ever wanted to run another SAS procedure from inside of a DATA step? Rick calls this "submitting SAS code on the side", as it allows you to run a SAS step or statement from within a currently running step. You can learn more from this SAS Global Forum paper by Rick. I've also written a post with a specific example in SAS Enterprise Guide.
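The function that provides this "on the side" behavior is DOSUBL. The following is a small sketch rather than an example from Rick's paper: the DATA step calls DOSUBL, which runs a complete PROC MEANS step and then returns control to the DATA step. The variable name rc is arbitrary.

```sas
data _null_;
   /* The quoted text is submitted and executed immediately, while this DATA step is still running */
   rc = dosubl('proc means data=Sashelp.Class n mean; var Height; run;');
   put 'Return code from DOSUBL: ' rc=;   /* 0 indicates that the code was submitted successfully */
run;
```

Compare this with CALL EXECUTE, which only queues the code to run after the DATA step ends.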
LOCKDOWN system option and statement
This one will excite SAS administrators. You can set the LOCKDOWN system option in a batch SAS session or SAS Workspace server to limit some of the "dangerous" functions of SAS and, more importantly, limit the file areas in which the SAS session will operate. Read more in this article, Fencing in your SAS users with LOCKDOWN.
Creating and managing directories within SAS
This technique combines two features into a one-two punch of file folder management. Use the DLCREATEDIR option and the LIBNAME statement to create a new directory, and then use the DLGCDIR function to change the current directory of the SAS session.
In the past, you would have had to issue operating system commands to create a new directory and then switch ('cd') into it. That approach is not portable across different operating systems, and it requires access to the operating system shell -- not available in many SAS sessions these days. See these blog posts for more information and examples about the new techniques:
- SAS trick: get the LIBNAME statement to create folders for you
- Manage the current directory within your SAS program
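A minimal sketch of the one-two punch follows. It assumes a recent SAS 9.4 maintenance release in which the DLGCDIR function is available. The folder name reports and the libref proj are hypothetical, and the example uses the WORK directory as the parent so that the parent folder is sure to exist.

```sas
/* Step 1: let the LIBNAME statement create the folder if it does not exist */
options dlcreatedir;
libname proj "%sysfunc(getoption(work))/reports";   /* hypothetical folder name */

/* Step 2: make the new folder the current directory of the SAS session */
data _null_;
   rc = dlgcdir("%sysfunc(getoption(work))/reports");
   put rc=;                                         /* 0 indicates success */
run;
```

As far as I know, DLCREATEDIR creates only the final level of the path, so the parent folder must already exist.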
Using %IF/%THEN/%ELSE in open code
Perhaps the most life-changing of all of these SAS language updates, you can now use simple if-then-else logic for program flow, outside of the confines of the SAS macro language. It's what makes defensive programming like this possible -- without having to wrap the logic in %MACRO/%MEND:
%if %symexist(config_root) %then %do;
  filename config "&config_root./config.json";
  libname config json fileref=config;
  data _null_;
    set config.root;
    call symputx('tenant_id',tenant_id,'G');
    call symputx('client_id',client_id,'G');
    call symputx('redirect_uri',redirect_uri,'G');
    call symputx('resource',resource,'G');
  run;
%end;
%else %do;
  %put ERROR: You must define the CONFIG_ROOT macro variable.;
%end;
Read more here: Using %IF/%THEN/%ELSE in SAS programs.
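Here is a second, smaller sketch of the same pattern, assuming a SAS release that supports %IF in open code. It checks whether a data set exists before running a step. The choice of Sashelp.Class and the macro variable name dsn are for illustration only.

```sas
%let dsn = Sashelp.Class;   /* hypothetical data set to check */

%if %sysfunc(exist(&dsn)) %then %do;
   proc print data=&dsn(obs=5);
   run;
%end;
%else %do;
   %put WARNING: The data set &dsn does not exist.;
%end;
```

In open code, each %THEN and %ELSE clause must be followed by a %DO/%END block, and %IF statements cannot be nested. Logic that needs nesting still belongs inside a macro definition.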
A legacy in the making
I'm going to miss having Rick Langston as a SAS colleague. There aren't many other people who know how to spin up a version of SAS from 30 years ago to help me track down a curious question. However, I'm not worried about the future of the SAS language. Rick has been excellent about sharing his knowledge for decades (just check his annual contributions to SAS Global Forum), and his team is well-suited to carry on the work of extending the SAS programming language for the next generation of SAS users. Thanks to Rick for helping to build such a solid foundation.
The post Five SAS programming language features you should be using appeared first on The SAS Dummy.
You might know that you can use the ODS SELECT statement to display only some of the tables and graphs that are created by a SAS procedure. But did you know that you can use a WHERE clause on the ODS SELECT statement to display tables that match a pattern? This article shows how to use wildcards, regular expressions, and pattern matching to select ODS tables in SAS.
ODS SELECT: Filter the output from SAS procedures
A SAS procedure might produce a dozen or more tables. You might be interested in displaying a subset of those tables. Recall that you can use the ODS TRACE ON statement to obtain a list of all the tables and graphs that a procedure creates. You can then use the ODS SELECT or the ODS EXCLUDE statement to control which tables and graphs are displayed.
Here's an example from the SAS/STAT documentation. The following PROC LOGISTIC call creates 27 tables and graphs, most of which are related to ROC curves. The ODS TRACE ON statement displays the names of each output object in the SAS log:
data roc;
   input alb tp totscore popind @@;
   totscore = 10 - totscore;
   datalines;
3.0 5.8 10 0   3.2 6.3 5 1   3.9 6.8 3 1   2.8 4.8 6 0
3.2 5.8 3 1    0.9 4.0 5 0   2.5 5.7 8 0   1.6 5.6 5 1
3.8 5.7 5 1    3.7 6.7 6 1   3.2 5.4 4 1   3.8 6.6 6 1
4.1 6.6 5 1    3.6 5.7 5 1   4.3 7.0 4 1   3.6 6.7 4 0
2.3 4.4 6 1    4.2 7.6 4 0   4.0 6.6 6 0   3.5 5.8 6 1
3.8 6.8 7 1    3.0 4.7 8 0   4.5 7.4 5 1   3.7 7.4 5 1
3.1 6.6 6 1    4.1 8.2 6 1   4.3 7.0 5 1   4.3 6.5 4 1
3.2 5.1 5 1    2.6 4.7 6 1   3.3 6.8 6 0   1.7 4.0 7 0
3.7 6.1 5 1    3.3 6.3 7 1   4.2 7.7 6 1   3.5 6.2 5 1
2.9 5.7 9 0    2.1 4.8 7 1   2.8 6.2 8 0   4.0 7.0 7 1
3.3 5.7 6 1    3.7 6.9 5 1   3.6 6.6 5 1
;

ods graphics on;
ods trace on;
proc logistic data=roc;
   model popind(event='0') = alb tp totscore / nofit;
   roc 'Albumin' alb;
   roc 'K-G Score' totscore;
   roc 'Total Protein' tp;
   roccontrast reference('K-G Score') / estimate e;
run;
The SAS log displays the names of the tables and graphs. A portion of the log is shown below:
Output Added:
-------------
Name:       OddsRatios
Label:      Odds Ratios
Template:   Stat.Logistic.OddsRatios
Path:       Logistic.ROC3.OddsRatios
-------------
Output Added:
-------------
Name:       ROCCurve
Label:      ROC Curve
Template:   Stat.Logistic.Graphics.ROC
Path:       Logistic.ROC3.ROCCurve
-------------
Output Added:
-------------
Name:       ROCOverlay
Label:      ROC Curves
Template:   Stat.Logistic.Graphics.ROCOverlay
Path:       Logistic.ROCComparisons.ROCOverlay
-------------
Output Added:
-------------
Name:       ROCAssociation
Label:      ROC Association Statistics
Template:   Stat.Logistic.ROCAssociation
Path:       Logistic.ROCComparisons.ROCAssociation
-------------
Output Added:
-------------
Name:       ROCContrastCoeff
Label:      ROC Contrast Coefficients
Template:   Stat.Logistic.ROCContrastCoeff
Path:       Logistic.ROCComparisons.ROCContrastCoeff
-------------
Only a few of the 27 ODS objects are shown here. Notice that each ODS object has four properties: a name, a label, a template, and a path. Most of the time, the name is used on the ODS SELECT statement to filter the output. For example, if you want to display only the ROC curves and the overlay of the ROC curves, you can put the following statement prior to the RUN statement in the procedure:
ods select ROCCurve ROCOverlay; /* specify the names literally */
Use a WHERE clause in the ODS SELECT statement
Often the ODS objects that you want to display are related to each other. In the LOGISTIC example, you might want to display all the information about ROC curves. Fortunately, the SAS developers often use a common prefix or suffix, such as 'ROC', in the names of the ODS objects. That means that you can display all ROC-related tables and graphs by selecting the ODS objects whose name (or path) contains 'ROC' as a substring.
You can use the WHERE clause to select ODS objects whose name (or label or path) matches a particular pattern. The object's name is available in a special variable named _NAME_. Similarly, the object's label and path are available in variables named _LABEL_ and _PATH_, respectively. You cannot match patterns in the template string; there is no _TEMPLATE_ variable.
In SAS, the following operators and functions are useful for matching strings:
- The CONTAINS keyword matches strings that contain a specified substring. The question mark (?) is an equivalent way to specify the CONTAINS operator.
- The LIKE keyword matches strings to a pattern. The underscore (_) is a wildcard that matches any single character. The percent sign (%) is a wildcard that matches zero or more characters.
- The "begins with" operators (=: and in:) match strings that begin with a certain pattern or set of patterns, respectively.
- SAS functions such as FIND, INDEX, and SUBSTR can be used to match patterns.
- The PRXMATCH function enables you to use Perl regular expressions to match patterns.
For example, the following statements select ODS tables and graphs from the previous PROC LOGISTIC call. You can put one of these statements before the RUN statement in the procedure:
/* use any one of the following statements inside the PROC LOGISTIC call */
ods select where=(_name_ =: 'ROC');                /* name starts with 'ROC' */
ods select where=(_name_ like 'ROC%');             /* name starts with 'ROC' */
ods select where=(_path_ ? 'ROC');                 /* path contains 'ROC' */
ods select where=(_label_ ? 'ROC');                /* label contains 'ROC' */
ods select where=(_name_ in: ('Odds', 'ROC'));     /* name starts with 'Odds' or 'ROC' */
ods select where=(substr(_name_,4,8)='Contrast');  /* name has substring 'Contrast' at position 4 */
For additional examples of using pattern matching to select ODS objects, see Warren Kuhfeld's graphics-focused blog post and the section of the SAS/STAT User's Guide that discusses selecting ODS graphics.
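The same WHERE= syntax works for other procedures and for the ODS EXCLUDE statement. As a self-contained sketch that does not depend on the ROC data, the following calls use PROC UNIVARIATE on Sashelp.Class, which is my choice for illustration and not part of the LOGISTIC example. By default, PROC UNIVARIATE creates tables named Moments, BasicMeasures, TestsForLocation, Quantiles, and ExtremeObs.

```sas
/* Display only the tables whose names begin with 'Basic' or 'Quant' */
proc univariate data=Sashelp.Class;
   var Height;
   ods select where=(_name_ in: ('Basic', 'Quant'));
run;

/* Display everything EXCEPT tables whose names contain 'Extreme' */
proc univariate data=Sashelp.Class;
   var Height;
   ods exclude where=(_name_ ? 'Extreme');
run;
```

If you are unsure of the names that a procedure uses, run it once with ODS TRACE ON and read the names from the log before you write the pattern.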
Use PRXMATCH to match regular expressions
Although the CONTAINS and LIKE operators are often sufficient for selecting a table, SAS provides the powerful PRXMATCH function for more complex pattern-matching tasks. The PRXMATCH function uses Perl regular expressions to match strings. SAS provides a Perl Regular Expression "cheat sheet" that summarizes the syntax and common search queries for the PRXMATCH function.
You can put any of the following statements inside the PROC LOGISTIC call:
/* use any one of the following PRXMATCH expressions inside the PROC LOGISTIC call */
ods select where=(prxmatch('/ROC/', _name_));              /* name contains 'ROC' anywhere */
ods select where=(prxmatch('/^ROC/', _name_));             /* name starts with 'ROC' */
ods select where=(prxmatch('/Odds|^ROC/', _name_));        /* name contains 'Odds' anywhere or 'ROC' at the beginning */
ods select where=(prxmatch('/ROC/', _name_)=0);            /* name does NOT contain 'ROC' anywhere */
ods select where=(prxmatch('/Logistic\.ROC2/', _path_));   /* escape special wildcard character '.' */
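If you want to experiment with a regular expression before you put it in an ODS statement, you can test it in a DATA step. PRXMATCH returns the position at which the pattern first matches, or 0 if there is no match, and the WHERE= clause treats any nonzero value as true. In the following sketch, the first four strings are object names from the log shown earlier. The name 'MyROCTable' and the variable names are invented to show a match that does not occur at the start of the string.

```sas
data _null_;
   length name $32;
   do name = 'OddsRatios', 'ROCCurve', 'ROCOverlay', 'ROCContrastCoeff',
             'MyROCTable';                              /* hypothetical name */
      anywhere  = prxmatch('/ROC/',       trim(name));  /* position of 'ROC', or 0 */
      atStart   = prxmatch('/^ROC/',      trim(name));  /* 1 if name starts with 'ROC', else 0 */
      oddsOrROC = prxmatch('/Odds|^ROC/', trim(name));  /* 'Odds' anywhere or 'ROC' at start */
      put name= anywhere= atStart= oddsOrROC=;
   end;
run;
```

The log shows one line per name, which makes it easy to see which objects a given pattern would select.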
In summary, the WHERE= option on the ODS SELECT (and ODS EXCLUDE) statement is quite powerful. Many SAS programmers know how to list the names of tables and graphs on the ODS SELECT statement to display only a subset of the output. However, the WHERE= option enables you to use wildcards and regular expressions to select objects whose names or paths match a certain pattern. This can be a quick and efficient way to select tables that are related to each other and share a common prefix or suffix in their name.
The post Select ODS tables by using wildcards and regular expressions in SAS appeared first on The DO Loop.
SAS is launching the SAS Student Series, which bundles thousands of dollars’ worth of training and certification materials into an affordable package for university students and recent graduates.
The post Check out the SAS Student Series during winter break appeared first on SAS Learning Post.
In my previous blog post I demonstrated how to create your own CAS actions and action sets. In this post, we will explore how to create your own CAS functions using the CAS Language (CASL). A function is a component of the CASL programming language that can accept arguments, perform a computation or other operation, and return a value. The value that is returned can be used in an assignment statement or elsewhere in expressions.
About SAS functions
SAS provides two types of supplied functions: built-in functions and common functions. Built-in functions contain functionality that is unique to CASL. These allow you to perform operations on your result tables, arrays, and dictionaries, and provide run-time support for your CASL programs. Built-in functions cannot be replaced with user-defined functions.
Conversely, common functions provide functionality that is common to other SAS functions. When used in a CASL program, SAS functions take a CASL value and a CASL value is returned. Unlike built-in functions, you can replace these functions with user-defined functions.
Since the capabilities of built-in functions are unique to CASL, let’s look at these in-depth and demonstrate with an example. Save the following FedSQL code in an external file called hmeqsql.sas. This code will be read into CAS and stored as a variable.
The execDirect action executes FedSQL code in CAS. The READPATH built-in function reads the FedSQL code saved in hmeqsql.sas and stores it in the CASL variable sqlcode which is used as input to the query parameter.
The fetch action displays the first 20 rows from the output table hmeq.out.
If you don’t feel like looking through the documentation for a built-in or common function, a list of each can be generated programmatically. Run the following code to see a list of built-in functions.
Partial list of CASL built-in functions
Run the following code to see a list of common functions.
Partial list of common functions
User-defined CASL functions
In addition to the customizable capabilities of built-in functions supplied by SAS, you can also create your own functions using the FUNCTION statement. User-defined functions can be called in expressions using CASL and they provide a large amount of flexibility. The following example creates four different functions for temperature conversion.
After creating these functions, they can be called immediately, or you can store them in an external file and call them via a %include statement. In this example, the user-defined functions have been stored in an external file called FunctionStore.sas. You can call one, all, or any number of your user-defined functions.
The output from each function call is displayed in the log.
Lastly, if you want to see all user-defined functions, run the FUNCTIONLIST statement. A list will be printed to the log.
More about CASL programming and using functions in CASL
Check out these resources for further information on programming in the CASL language and using functions in CASL.
- Documentation: SAS® Cloud Analytic Services 3.4: CASL Programmer’s Guide
- Documentation: CASL Built-In Functions
- Documentation: User-Defined Functions
- SAS Global Forum paper: Management and Usage of User-Defined Formats in the SAS® Cloud Analytic Services Server
- Tech Talk Video: SAS programming in the cloud: CASL code
Customize your CASL code with built-in and user-defined functions was published on SAS Users.
A useful feature in PROC SGPLOT is the ability to easily visualize subgroups of data. Most statements in the SGPLOT procedure support a GROUP= option that enables you to overlay plots of subgroups. When you use the GROUP= option, observations are assigned attributes (colors, line patterns, symbols, ...) that indicate the value of the grouping variable. This article reviews the GROUP= option and shows how to trick PROC SGPLOT into performing a group analysis for statements that do not support the GROUP= option.
Three ways to plot data by groups
It is common to use colors or symbols to indicate which observations belong to each category of a grouping variable. Typical grouping variables include gender (male and female), political affiliation (democrats, republicans, and independents), race, education level, and so forth. When you use the SAS SG procedures to plot subsets of the data, there are three ways to arrange the plots. You can plot each group individually, you can create a panel of graphs, or you can overlay the groups on a single graph:
- If you use the BY statement in PROC SGPLOT, each subgroup is plotted independently in its own graph. The axes are scaled based only on the data in that subgroup.
- If you use the PANELBY statement in PROC SGPANEL, each subgroup is plotted in a cell of a lattice in which the axes are scaled to a common range.
- If you use the GROUP= option, the plots for each subgroup are overlaid in a single graph.
The following SAS statements demonstrate each approach. Only the GROUP= overlay is displayed because that is the topic of this article:
proc sgplot data=Sashelp.Iris;   /* BY-group visualization. Three independent graphs. */
   by Species;
   histogram SepalLength;
   density SepalLength / type=kernel;
run;

proc sgpanel data=Sashelp.Iris;  /* Panel visualization. Shared common axis. */
   panelby Species / columns=1 onepanel;
   histogram SepalLength;
   density SepalLength / type=kernel;
run;

proc sgplot data=Sashelp.Iris;   /* Overlay three plots in one graph */
   histogram SepalLength / GROUP=Species binstart=42 binwidth=3 transparency=0.5;
   density SepalLength / type=kernel GROUP=Species;
run;
How to emulate the GROUP= option
Many SGPLOT statements (such as the SERIES and SCATTER statements) have supported the GROUP= option since the early days of ODS graphics. For other statements, support for the GROUP= option was added more recently. For example, the GROUP= option was added to the HISTOGRAM and DENSITY statements in SAS 9.4M2.
Here is a trick (shown to me by my colleague, Paul) that you can use to emulate the GROUP= option. If a statement in the SGPLOT procedure does not support the GROUP= option, but the statement DOES support the FREQ= option, you can often use the FREQ= option to construct a graph that overlays the subgroups. You need to do two things. First, you need to create binary indicator variables (sometimes called dummy variables) for each level of the categorical variable. You then use multiple statements, each with a different frequency variable, to overlay the subgroups. These two steps are shown by the following DATA step and call to PROC SGPLOT, which uses the FREQ= trick to overlay three histograms:
/* emulate a GROUP= option for SGPLOT statements that do not support GROUP= */
data IrisFreq;
   set sashelp.Iris;
   Freq1 = (Species='Setosa');      /* Binary. Equals 1 if observation is in 'Setosa' group */
   Freq2 = (Species='Versicolor');  /* Binary. Equals 1 if observation is in 'Versicolor' group */
   Freq3 = (Species='Virginica');   /* Binary. Equals 1 if observation is in 'Virginica' group */
run;

title "Overlay Histograms by Using the FREQ= Option";
%let binOpts = binstart=42 binWidth=3 transparency=0.5;  /* ensure common bins */
proc sgplot data=IrisFreq;
   histogram SepalLength / freq=Freq1 &binOpts;  /* only the 'Setosa' group */
   histogram SepalLength / freq=Freq2 &binOpts;  /* only the 'Versicolor' group */
   histogram SepalLength / freq=Freq3 &binOpts;  /* only the 'Virginica' group */
run;
The graph overlays three histograms, one for each value of the Species variable. The result is similar to the earlier graph that used the GROUP= option. You can use the same trick on the DENSITY statement, although you will need to manually set the line attributes so that they match the attributes for the corresponding histograms.
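The following sketch shows one way to add the density estimates, assuming that the default style assigns attributes from the GraphData1, GraphData2, and GraphData3 style elements. The DATA step is repeated so that the example runs on its own. The explicit FILLATTRS= and LINEATTRS= options are my additions to make each density curve match the histogram for the same species.

```sas
data IrisFreq;
   set Sashelp.Iris;
   Freq1 = (Species='Setosa');
   Freq2 = (Species='Versicolor');
   Freq3 = (Species='Virginica');
run;

title "Overlay Histograms and Densities by Using the FREQ= Option";
%let binOpts = binstart=42 binwidth=3 transparency=0.5;   /* ensure common bins */
proc sgplot data=IrisFreq;
   /* assign a style element to each histogram explicitly */
   histogram SepalLength / freq=Freq1 &binOpts fillattrs=GraphData1 legendlabel="Setosa";
   histogram SepalLength / freq=Freq2 &binOpts fillattrs=GraphData2 legendlabel="Versicolor";
   histogram SepalLength / freq=Freq3 &binOpts fillattrs=GraphData3 legendlabel="Virginica";
   /* reuse the same style elements so that each curve matches its histogram */
   density SepalLength / type=kernel freq=Freq1 lineattrs=GraphData1 legendlabel="Setosa (KDE)";
   density SepalLength / type=kernel freq=Freq2 lineattrs=GraphData2 legendlabel="Versicolor (KDE)";
   density SepalLength / type=kernel freq=Freq3 lineattrs=GraphData3 legendlabel="Virginica (KDE)";
run;
```

Without the LINEATTRS= options, the three DENSITY statements would be the fourth, fifth, and sixth statements in the plot and would typically pick up different colors than the histograms.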
You can use this technique in old versions of SAS to emulate the GROUP= option on the HISTOGRAM statement. You can also use it for statements that do not support the GROUP= option.
Although this example uses the DATA step to manually create the dummy variables that are used as frequencies, you can also create the dummy variables automatically by generating the "design matrix" for the Species variable. The GLMMOD procedure is the simplest way to create dummy variables in SAS, but other procedures provide additional features.
Generate prediction ellipses for groups
Several years ago I showed how you can overlay prediction ellipses for each group on a scatter plot. (Note that the ELLIPSE statement does not support a GROUP= option.) The technique requires that you transpose the data from long to wide form by creating new variables, one for each group of the categorical variable. Paul recognized that creating a dummy variable and using the FREQ= option is a simpler way to overlay prediction ellipses on a scatter plot:
title "Prediction Ellipses for Iris Data";
proc sgplot data=IrisFreq;
   scatter x=PetalLength y=PetalWidth / group=Species;
   ellipse x=PetalLength y=PetalWidth / freq=Freq1 legendlabel="Setosa";
   ellipse x=PetalLength y=PetalWidth / freq=Freq2 legendlabel="Versicolor";
   ellipse x=PetalLength y=PetalWidth / freq=Freq3 legendlabel="Virginica";
run;
Advantages and disadvantages of the FREQ= trick
The main advantage of using the FREQ= option for group processing is that it enables you to overlay subgroups even when a statement does not support the GROUP= option. A secondary advantage is that this technique gives you complete control of the attributes of each subgroup. Although you can use the STYLEATTRS statement to control many group attributes, the STYLEATTRS statement does not enable you to control marker sizes or line widths, to name two examples.
The FREQ= trick does have some disadvantages:
- You can't use the FREQ= trick for statements that produce graphs of categorical variables. The SGPLOT documentation states, “If your plot is overlaid with other categorization plots, then the first FREQ variable that you specified is used for all of the plots.” [My emphasis.]
- As mentioned earlier, if you are trying to produce multiple grouped plots, you might need to manually assign attributes to obtain consistency among the levels of the grouping variables. By default, most ODS styles use different attributes for each statement. If you want the attributes for the fourth statement to match the attributes for the first statement, you need to use an option such as LINEATTRS=GraphData1 on the fourth statement.
In conclusion, if a statement supports the GROUP= option, you should probably use that option to overlay plots of the groups. But if a statement does NOT support the GROUP= option (such as the ELLIPSE and HEATMAP statements), you can use the FREQ= trick to emulate the GROUP= behavior.
I thank my colleague, Paul, for showing me the ellipse example. I hope you agree that this trick is a real treat, not just on Halloween, but every day!
Hash tables are a very powerful and flexible data structure. Most SAS applications of hash tables focus on just one of their many powerful facilities: table lookup. Hash tables are a fantastic table lookup tool and their use for that should never be diminished. However, hash tables can do so [...]
The post Five things you (probably) don’t know you can do with a hash table appeared first on SAS Learning Post.
A programmer recently asked a question on a SAS discussion forum about design matrices for categorical variables. He had generated a design matrix by using PROC GLMMOD and wanted to use the design columns in a subsequent procedure. However, the columns were named COL1, COL2, COL3,..., so he couldn't tell which dummy variables correspond to each categorical variable. The following example illustrates his situation for the Weight_Status and Smoking_Status variables in the Sashelp.Heart data set:
proc glmmod data=Sashelp.Heart outdesign=GLMDesign;
   class Weight_Status Smoking_Status;
   model Cholesterol = Weight_Status Smoking_Status;
   ods select Parameters;
run;

proc contents data=GLMDesign varnum short;
run;
The "Parameters" table shows the association between columns of the design matrix and levels of the categorical variables in the model. The output from PROC CONTENTS shows that the columns of the design matrix (as stored in the GLMDesign data set) are named COL1, COL2, and so forth. On the discussion forum, I showed how to save the "Parameters" table to a SAS data set and use a DATA _NULL_ step to form macro variables that you can use to associate the design columns to the original variables.
In retrospect, I missed a golden opportunity to mention that the GLMMOD procedure (which always produces a singular design matrix) is not as friendly or powerful as other procedures in SAS that can generate design matrices. In particular, the GLMSELECT and TRANSREG procedures can create design matrices for many different parameterizations of the classification variables. Furthermore, these procedures automatically create macro variables that tell you the names of the columns in the design matrix.
As a reminder, a major reason to create a design matrix is to perform an analysis with a SAS procedure that does not support a CLASS statement. For example, the documentation of the MCMC procedure shows how to use PROC TRANSREG to create a design matrix as preparation for performing a Bayesian regression analysis. For details of the many ways to generate design matrices in SAS, see my previous article, "Four ways to create a design matrix in SAS."
GLMSELECT: An easier way to associate columns of a design matrix
The GLMSELECT procedure supports the OUTDESIGN= option, which enables you to output a design matrix for the variables in a regression model. The GLMSELECT procedure has the following advantages over the GLMMOD procedure:
- The procedure supports the EFFECT statement, which you can use to define spline effects, collection effects, and more. The OUTDESIGN= option creates columns for any effect that you can define.
- The procedure supports nonsingular parameterizations such as the 'effect' or 'reference' parameterizations (PARAM=EFFECT or PARAM=REF, respectively).
- The dummy variables have meaningful names of the form VarName_Level, where VarName is the name of a categorical variable and Level is one of its values.
- The procedure automatically creates a macro variable (called _GLSMod) that contains the names of the columns of the design matrix.
The following statements create a design matrix for the Weight_Status and Smoking_Status variables. The design matrix uses the 'effect coding' parameterization and is written to the GLSDesign data set. The value of the _GLSMod macro is displayed in the SAS log and can be used in a subsequent procedure call:
proc glmselect data=Sashelp.Heart outdesign(fullmodel)=GLSDesign noprint;
   class Weight_Status(ref='Normal') Smoking_Status(ref='Non-smoker') / param=effect;
   model Cholesterol = Weight_Status Smoking_Status / selection=none;  /* include ALL dummy variables */
run;

ods select Position;
proc contents data=GLSDesign varnum; run;  /* display all variables in OUTDESIGN= data set */
%put &_GLSMod;                             /* names of columns in design matrix */
--- SAS Log --- Weight_Status_Overweight Weight_Status_Underweight Smoking_Status_Heavy__16_25_ Smoking_Status_Light__1_5_ Smoking_Status_Moderate__6_15_ Smoking_Status_Very_Heavy____25_
The output from PROC CONTENTS show the variables in the GLSDesign data set, which are INTERCEPT, CHOLESTEROL (the response variable in the model), and the dummy variables listed in the _GLSMod macro variable. By default, the dummy variables have the pattern VarName_Level, although you can use the OUTDESIGN(PREFIX=prefix) option to generate column names of the form prefix1, prefix2, and so forth, where prefix is a user-specified prefix. You can see that values such as "Heavy (16-25)" are transformed into valid names for SAS variables.
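As a hedged sketch of the PREFIX= suboption, the following call repeats the GLMSELECT step but requests generic column names. The prefix D and the data set name GLSDesign2 are arbitrary choices for this illustration, and I am assuming that the FULLMODEL and PREFIX= suboptions can be listed together inside the parentheses, separated by a space.

```sas
proc glmselect data=Sashelp.Heart outdesign(fullmodel prefix=D)=GLSDesign2 noprint;
   class Weight_Status(ref='Normal') Smoking_Status(ref='Non-smoker') / param=effect;
   model Cholesterol = Weight_Status Smoking_Status / selection=none;
run;

/* compare these column names with the default VarName_Level names */
proc contents data=GLSDesign2 varnum short;
run;
%put &_GLSMod;
```

Generic names are shorter, but they lose the link to the original levels, so the default names are usually easier to work with.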
In this example, the _GLSMOD macro contains the names of ALL dummy variables because the SELECTION=NONE option is used. If you use an option (such as SELECTION=LASSO) to perform variable selection, the _GLSMOD variable will contain the names of the dummy variables that are selected in the final model. This means that you can use the _GLSMOD variable to perform additional analyses. For example, if you want to use POC REG to output collinearity diagnostics for the variables in the final model, you could execute the following:
/* use the variables in the final model in a procedure that does not support a CLASS statement */
proc reg data=GLSDesign plots=none;
   model Cholesterol = &_GLSMod / collin;
run;
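The variable-selection case described above can be sketched as follows. This is only an illustration under stated assumptions: the LassoDesign data set name and the extra continuous regressors (Height, Weight, AgeAtStart) are not part of the original example, and which columns end up in _GLSMod depends on what the LASSO method selects.

```sas
/* Sketch: let GLMSELECT choose effects, then inspect _GLSMod.          */
/* LassoDesign and the extra regressors are illustrative assumptions.   */
proc glmselect data=Sashelp.Heart outdesign=LassoDesign noprint;
   class Weight_Status Smoking_Status;
   model Cholesterol = Weight_Status Smoking_Status Height Weight AgeAtStart
         / selection=lasso;
run;

/* Without the FULLMODEL suboption, only columns for selected effects are written */
%put Selected design columns: &_GLSMod;
```

Because FULLMODEL is omitted here, the OUTDESIGN= data set and the macro variable should both reflect only the selected model, which is what makes the macro variable convenient for a follow-up procedure call.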
TRANSREG: Dummy variables and transformed variable
In a similar way, the TRANSREG procedure supports the DESIGN option, which adds dummy variables to the output data set, as shown in the following call. The syntax is different, but the dummy variables in the TRGDesign data set are formed by the same 'effect coding' as in the previous example. The procedure automatically creates a macro variable (called _TRGInd) that contains the names of the columns of the design matrix.
proc transreg data=Sashelp.Heart design;
   model class(Weight_Status Smoking_Status / EFFECTS zero="Normal" "Non-smoker");
   output out=TRGDesign;
run;
%put &_TRGInd;
Weight_StatusOverweight Weight_StatusUnderweight Smoking_StatusHeavy__16_25_ Smoking_StatusLight__1_5_ Smoking_StatusModerate__6_15_ Smoking_StatusVery_Heavy____25_
The dummy variables contain the same values as in the previous example, but the two procedures construct slightly different names. The GLMSELECT names contain an extra underscore. For example, a dummy variable in the GLSDesign data set is Weight_Status_Overweight whereas the corresponding variable in the TRGDesign data set is Weight_StatusOverweight.
It is worth noting that PROC TRANSREG also supports macro variables that specify the names of dependent variables and transformed variables. You can even use the MACRO option to specify the names of the macro variables that are created. For details, see the documentation for the PROC TRANSREG statement.
In summary, when you generate a design matrix by using the GLMSELECT or TRANSREG procedures, the procedures create dummy variables that have meaningful names of the form VarName_Level or VarNameLevel. The procedures each create a macro variable that contains the names of the dummy variables. The _GLSMOD macro that is created by PROC GLMSELECT contains the names of dummy variables in the final selected model, so use SELECTION=NONE if you want all names. These macros make it easy for a programmer to refer to columns of the design matrix.
Attention SAS administrators! SAS batch jobs, whether run on schedule or manually, usually produce date-stamped SAS logs, which are essential for automated system maintenance and troubleshooting. Similar log files are created by various SAS infrastructure services (Metadata server, Mid-tier servers, etc.). However, as time goes on, the relevance of such logs diminishes while clutter stockpiles. In some cases, this may even lead to disk space problems.
There are multiple ways to solve this problem, either by deleting older log files or by stashing them away for auditing purposes (zipping and archiving). One solution would be using Unix/Linux or Windows scripts run on schedule. The other is much "SAS-sier."
Let SAS clean up its "mess"
We are going to write SAS code that you can run manually or on schedule and that, for a specified directory (folder), deletes all .log files that are older than 30 days.
First, we need to capture the contents of that directory, then select those file names with extension .log, and finally, subset that file selection to a sub-list where Date Modified is less than Today's Date minus 30 days.
Perhaps the easiest way to get the contents of a directory is to use the X statement, submitting the DOS DIR command from within SAS with its output redirected (>) to a file, e.g.
x 'dir > dirlist.txt';
or by using the PIPE option in the FILENAME statement:
filename DIRLIST pipe 'dir "C:\Documents and Settings"';
However, SAS administrators know that in many organizations, due to cyber-security concerns, IT department policies disable the X statement by setting the SAS XCMD system option to NOXCMD (the XCMD system option also applies on Unix). This is usually done system-wide for the whole SAS Enterprise client-server installation via SAS configuration. In this case, no operating system command can be executed from within SAS. Try running any X statement in your environment; if it is disabled, you will get the following ERROR in the SAS log:
ERROR: Shell escape is not valid in this SAS session.
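If you would rather check the setting than trigger the error, one option is to query the XCMD system option with the GETOPTION function. The following is a minimal sketch; the macro name check_xcmd is made up for this illustration.

```sas
/* Sketch: report whether operating system commands are allowed.   */
/* check_xcmd is a hypothetical helper name.                        */
%macro check_xcmd;
   %if %sysfunc(getoption(XCMD)) = NOXCMD %then
      %put NOTE: NOXCMD is in effect, so the X statement and pipes are disabled.;
   %else
      %put NOTE: XCMD is in effect, so the X statement and pipes are enabled.;
%mend check_xcmd;

%check_xcmd
```

GETOPTION returns either XCMD or NOXCMD, so the same test could guard any code that depends on shell access.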
To avoid that potential roadblock, we’ll use a different technique of capturing the contents of a directory along with file date stamps.
Macro to delete old log files in a directory/folder
The following SAS macro cleans up a Unix directory or a Windows folder by removing old .log files. I must admit that this statement is a little misleading. The macro is much more powerful. Not only can it delete old .log files, it can also remove files of ANY type specified by their extension.
%macro mr_clean(dirpath=,dayskeep=30,ext=.log);
   data _null_;
      length memname $256;
      deldate = today() - &dayskeep;                /* cutoff date */
      rc = filename('indir',"&dirpath");            /* fileref for the directory */
      did = dopen('indir');
      if did then
      do i=1 to dnum(did);
         memname = dread(did,i);
         /* skip members whose names do not end with the extension */
         if reverse(trim(memname)) ^=: reverse("&ext") then continue;
         rc = filename('inmem',"&dirpath/"!!memname);
         fid = fopen('inmem');
         if fid then
         do;
            moddate = input(finfo(fid,'Last Modified'),date9.);
            rc = fclose(fid);
            if . < moddate <= deldate then rc = fdelete('inmem');
         end;
      end;
      rc = dclose(did);
      rc = filename('inmem');                       /* un-assign filerefs */
      rc = filename('indir');
   run;
%mend mr_clean;
This macro has 3 parameters:
- dirpath - directory path (required);
- dayskeep - days to keep (optional, default 30);
- ext - file extension (optional, default .log).
This macro works in both Windows and Linux/Unix environments. Please note that dirpath and ext parameter values are case-sensitive.
Here are examples of the macro invocation:
1. Using defaults
%let dir_to_clean = C:\PROJECTS\Automatically deleting old SAS logs\Logs;
%mr_clean(dirpath=&dir_to_clean)
With this macro call, all files with extension .log (default) which are older than 30 days (default) will be deleted from the specified directory.
2. Using default extension
%let dir_to_clean = C:\PROJECTS\Automatically deleting old SAS logs\Logs;
%mr_clean(dirpath=&dir_to_clean,dayskeep=20)
With this macro call, all files with extension .log (default) which are older than 20 days will be deleted from the specified directory.
3. Using explicit parameters
%let dir_to_clean = C:\PROJECTS\Automatically deleting old SAS logs\Logs;
%mr_clean(dirpath=&dir_to_clean,dayskeep=10,ext=.xls)
With this macro call, all files with extension .xls (Excel files) which are older than 10 days will be deleted from the specified directory.
Old file deletion SAS macro code explanation
The above SAS macro logic and actions are done within a single data _NULL_ step. First, we calculate the cutoff date, on or before which files are deleted: deldate = today() - &dayskeep. Then we assign fileref indir to the specified directory &dirpath:
rc = filename('indir',"&dirpath");
Then we open that directory:
did = dopen('indir');
and if it opened successfully (did>0) we loop through its members which can be either files or directories:
do i=1 to dnum(did);
In that loop, first we grab the directory member name:
memname = dread(did,i);
and look for our candidates for deletion, i.e., determine if that name (memname) ends with "&ext". In order to do that, we reverse both character strings and compare their leading characters (the colon modifier in ^=: truncates the longer string to the length of the shorter one before comparing). If they don’t match, then we are not going to touch that member - the continue statement skips to the end of the loop. If they do match, it means that the member name does end with "&ext" and it’s a candidate for deletion. We assign fileref inmem to that member:
rc = filename('inmem',"&dirpath/"!!memname);
Note that the forward slash (/), the Unix/Linux path separator used in the above statement, is also a valid path separator in Windows. Windows will convert it to a backslash (\) for display purposes, but it interprets the forward slash as a valid path separator along with the backslash.
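The suffix test used to pick deletion candidates (reverse both strings, then compare with the colon modifier) can be tried in isolation. The sketch below uses made-up member names and a hypothetical data set name, suffix_demo, purely for illustration.

```sas
/* Sketch: which names end with the extension? (names are made up) */
data suffix_demo;
   length memname $256 ext $16;
   ext = '.log';
   do memname = 'job_2019.log', 'job_2019.log.bak', 'notes.txt';
      /* 1 if the reversed name starts with the reversed extension */
      ends_with = (reverse(trim(memname)) =: reverse(trim(ext)));
      output;
   end;
run;

proc print data=suffix_demo noobs;
   var memname ends_with;
run;
```

Only job_2019.log gets ends_with=1. One caveat worth keeping in mind: because the colon modifier compares only as many characters as the shorter string has, a member name that is shorter than the extension and matches its tail (for example, a file named just log) would presumably pass the test as well.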
Then we open that file using the FOPEN function:
fid = fopen('inmem');
If inmem is a directory, the opening will fail (fid=0) and we will skip the following do-group that is responsible for the file deletion. If it is a file and is opened successfully (fid>0), then we go through the deletion do-group where we first grab the file Last Modified date as moddate, close the file, and if moddate <= deldate we delete that file:
rc = fdelete('inmem');
Then we close the directory and un-assign filerefs for the members and directory itself.
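Since deletion is irreversible, it may be worth running the same logic once without the FDELETE call to see which files would be removed. The following dry-run sketch mirrors the steps above but writes the candidates to a data set instead of deleting them. The data set name would_delete is hypothetical, and the WORK directory is used only so that the sketch runs anywhere; substitute your own log directory.

```sas
/* Dry-run sketch: list deletion candidates without deleting anything. */
/* The settings below are illustrative assumptions.                     */
%let dirpath  = %sysfunc(pathname(work));
%let dayskeep = 30;
%let ext      = .log;

data would_delete(keep=memname moddate);
   length memname $256;
   format moddate date9.;
   deldate = today() - &dayskeep;
   rc  = filename('indir', "&dirpath");
   did = dopen('indir');
   if did then do i = 1 to dnum(did);
      memname = dread(did, i);
      if reverse(trim(memname)) ^=: reverse("&ext") then continue;
      rc  = filename('inmem', "&dirpath/" !! memname);
      fid = fopen('inmem');
      if fid then do;
         moddate = input(finfo(fid, 'Last Modified'), date9.);
         rc = fclose(fid);
         if . < moddate <= deldate then output;   /* record instead of FDELETE */
      end;
   end;
   rc = dclose(did);
   rc = filename('inmem');
   rc = filename('indir');
run;

proc print data=would_delete noobs;
run;
```

If the listing looks right, the same parameter values can be passed to %mr_clean.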
Deleting old files across multiple directories/folders
The %mr_clean macro is flexible enough to address various SAS administrators' needs. You can use this macro to delete old files of various types across multiple directories/folders. First, let’s create a driver table as follows:
data delete_instructions;
   length days 8 extn $9 path $256;
   infile datalines truncover;
   input days 1-2 extn $ 4-12 path $ 14-270;
   datalines;
30 .log      C:\PROJECTS\Automatically deleting old files\Logs1
20 .log      C:\PROJECTS\Automatically deleting old files\Logs2
25 .txt      C:\PROJECTS\Automatically deleting old files\Texts
35 .xls      C:\PROJECTS\Automatically deleting old files\Excel
30 .sas7bdat C:\PROJECTS\Automatically deleting old files\SAS_Backups
;
This driver table specifies how many days to keep files of certain extensions in each directory. In this example, perhaps the most beneficial deletion applies to the SAS_Backups folder since it contains SAS data tables (extension .sas7bdat). Data files typically have much larger size than SAS log files, and therefore their deletion frees up much more of the valuable disk space.
Then we can use this driver table to loop through its observations and dynamically build macro invocations using CALL EXECUTE. Wrapping each call in %NRSTR defers the macro's execution until after the current DATA step ends:
data _null_;
   set delete_instructions;
   s = cats('%nrstr(%mr_clean(dirpath=',path,',dayskeep=',days,',ext=',extn,'))');
   call execute(s);
run;
Alternatively, we can use DOSUBL() function to dynamically execute our macro at every iteration of the driver table:
data _null_;
   set delete_instructions;
   s = cats('%mr_clean(dirpath=',path,',dayskeep=',days,',ext=',extn,')');
   rc = dosubl(s);
run;
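Before handing generated calls to CALL EXECUTE or DOSUBL, it can be reassuring to look at the text that CATS builds. The sketch below constructs one call from a single made-up driver row and prints it to the log without executing anything. The data set name preview_calls and the path C:\temp\Logs are assumptions for illustration only.

```sas
/* Sketch: preview the generated macro call instead of executing it. */
/* preview_calls and the path are hypothetical.                       */
data preview_calls;
   length days 8 extn $9 path $256 s $400;
   days = 30;
   extn = '.log';
   path = 'C:\temp\Logs';
   /* CATS strips leading and trailing blanks from each piece */
   s = cats('%mr_clean(dirpath=', path, ',dayskeep=', days, ',ext=', extn, ')');
   put s=;
run;
```

The log should show s=%mr_clean(dirpath=C:\temp\Logs,dayskeep=30,ext=.log). Declaring s with an explicit length, as done here, also avoids truncating the call text when paths are long.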
Put it on autopilot
When it comes to cleaning your old files (logs, backups, etc.), the best practice for SAS administrators is to schedule your cleaning job to automatically run on a regular basis. Then you can forget about this chore around your "SAS house" as the %mr_clean macro will do it quietly for you without the noise and fuss of a Roomba.
Your turn, SAS administrators
Would you use this approach in your SAS environment? Any suggestions for improvement? How do you deal with old log files? Other old files? Please share below.
SAS administrators tip: Automatically deleting old SAS logs was published on SAS Users.