tips & techniques

July 23, 2020
 

Splitting one into smaller pieces

In his blog post, How to split one data set into many, Chris Hemedinger showed how to subset or split SAS data sets based on the values of categorical variables. For example, based on a value of variable REGION you may split a data set MARKETING into MARKETING_ASIA, MARKETING_AMERICA, MARKETING_EUROPE, and so on.

In some cases, however, we need to split a large data set into many, not by the values of a subsetting variable but by the number of observations, in order to produce smaller, more manageable data sets. Such an approach can be dictated by restrictions on data set size imposed by hardware (memory size, transmission channel bandwidth, etc.), processing time, or user interface convenience (e.g. search results displayed by pages).

We might need to split a data set into smaller tables of K observations or less each; or to split a data set into S equal (or approximately equal) pieces.

We might need to split a data set into sequentially selected subsets where the first K observations go into the first data set, the second K observations go into the second data set, and so on. Alternatively, we might need to randomly select observations from a data set while splitting it into smaller tables.

This blog post provides possible coding solutions for such scenarios.

Splitting a data set into smaller data sets sequentially

Let’s say we need to split the data set SASHELP.CARS (number of observations N=428) into several smaller data sets. We will consider the following two sequential observation selection scenarios:

  1. Each smaller data set should have a maximum of K observations.
  2. There should be S smaller data sets of approximately the same size.

Ideally, we would like to split a data set into pieces of exactly K observations each, but this is not always possible because the number of observations N in the original data set is not always a multiple of K. Therefore, we will split it into several smaller data sets of K observations each, and the last smaller data set will hold the remainder of the division of N by K (or a full K observations when N divides evenly).

Similarly, in scenario 2, we will split the source data set into S smaller data sets of K = ceil(N/S) observations each, except for the last one, which will hold whatever observations remain after the first S-1 data sets are filled.
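As a quick sanity check on this arithmetic outside of SAS, here is a minimal Python sketch. It is not part of the macro below; the function name split_sizes is hypothetical, and it only mirrors the ceil-based K and S logic to list the expected size of each piece.

```python
import math

def split_sizes(n, split_num, split_def):
    """Return the expected number of observations in each sequential piece."""
    split_def = split_def.upper()
    if split_def == "NOBS":      # pieces of at most split_num observations
        k = split_num
        s = math.ceil(n / split_num)
    elif split_def == "SETS":    # split_num pieces of roughly equal size
        s = split_num
        k = math.ceil(n / split_num)
    else:
        raise ValueError("split_def must be either SETS or NOBS")
    # piece i (0-based) receives observations k*i+1 .. k*(i+1), capped at n
    return [max(0, min(k, n - k * i)) for i in range(s)]

print(split_sizes(428, 100, "NOBS"))  # [100, 100, 100, 100, 28]
print(split_sizes(428, 3, "SETS"))    # [143, 143, 142]
```

For N=428, these sizes agree with the observation counts reported in the SAS logs further down.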

Below is a SAS macro code that covers both these scenarios.

%macro split (SRC_DATASET=, OUT_PREFIX=, SPLIT_NUM=, SPLIT_DEF=);
/* Parameters:                                     */
/*   SRC_DATASET - name of the source data set     */
/*   OUT_PREFIX - prefix of the output data sets   */
/*   SPLIT_NUM - split number                      */
/*   SPLIT_DEF - split definition (=SETS or =NOBS) */
 
   %local I K S TLIST;
 
   /* number of observations &K, number of smaller datasets &S */
   data _null_;
      if 0 then set &SRC_DATASET nobs=N;
      if upcase("&SPLIT_DEF")='NOBS' then
         do;
            call symputx('K',&SPLIT_NUM); 
            call symputx('S',ceil(N/&SPLIT_NUM));
            put "***MACRO SPLIT: Splitting into datasets of no more than &SPLIT_NUM observations";
         end;
         else if upcase("&SPLIT_DEF")='SETS' then
         do;
            call symputx('S',&SPLIT_NUM); 
            call symputx('K',ceil(N/&SPLIT_NUM));
            put "***MACRO SPLIT: Splitting into &SPLIT_NUM datasets";
        end;
         else put "***MACRO SPLIT: Incorrect SPLIT_DEF=&SPLIT_DEF value. Must be either SETS or NOBS.";
      stop; 
   run;
 
 
   /* terminate macro if nothing to split */
   %if (&K le 0) or (&S le 0) %then %return;
 
    /* generate list of smaller dataset names */
   %do I=1 %to &S;
      %let TLIST = &TLIST &OUT_PREFIX._&I;
   %end;
 
   /* split source dataset into smaller datasets */
   data &TLIST;
      set &SRC_DATASET;
      select;
         %do I=1 %to &S;
            when(_n_ <= &K * &I) output &OUT_PREFIX._&I; 
         %end;
      end;
   run;
 
%mend split;

The following are examples of the macro invocations:

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=SET);
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=NOBS);
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=3, SPLIT_DEF=SETS);

These invocations will produce the following SAS logs:

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=SET);
***MACRO SPLIT: Incorrect SPLIT_DEF=SET value. Must be either SETS or NOBS.
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=NOBS);
***MACRO SPLIT: Splitting into datasets of no more than 100 observations
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS_1 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_2 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_3 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_4 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_5 has 28 observations and 15 variables.
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=3, SPLIT_DEF=SETS);
***MACRO SPLIT: Splitting into 3 datasets
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS_1 has 143 observations and 15 variables.
NOTE: The data set WORK.CARS_2 has 143 observations and 15 variables.
NOTE: The data set WORK.CARS_3 has 142 observations and 15 variables.

Splitting a data set into smaller data sets randomly

For randomly splitting a data set into many smaller data sets we can use the same approach as above with a slight modification. In essence, we are going to randomly shuffle observations of our source data set first, and then apply the sequential splitting.

In order to implement this, we just need to replace the last data step in the above macro with the following 3 steps:

/* generate random numbers, R */
   data;
      set &SRC_DATASET;
      call streaminit(1234);
      R = rand('uniform');
   run;
 
   /* sort data in R order */
   proc sort;
      by R;
   run;
 
   /* split source dataset into smaller datasets */
   data &TLIST (drop=R);
      set;
      select;
         %do I=1 %to &S;
            when(_n_ <= &K * &I) output &OUT_PREFIX._&I; 
         %end;
      end;
   run;

This modified code will produce similar results (with the same information in the SAS log); however, the smaller data sets will have their observations randomly selected from the source data set.
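For readers who want to see the shuffle-then-split idea in isolation, here is a minimal Python sketch under stated assumptions: random_split is a hypothetical helper, the rows are just integers standing in for observations, and Python's random generator is not the same stream as SAS' RAND function, so the actual ordering will differ from the SAS output.

```python
import random

def random_split(rows, k, seed=1234):
    """Shuffle a copy of rows, then cut it into consecutive chunks of at most k."""
    if k <= 0:
        raise ValueError("k must be positive")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    return [shuffled[i:i + k] for i in range(0, len(shuffled), k)]

pieces = random_split(range(428), 100)
print([len(p) for p in pieces])  # [100, 100, 100, 100, 28]
```

As in the SAS version, every observation lands in exactly one piece; only the assignment to pieces is randomized.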

DATAn naming convention

You may have noticed that in this random splitting code I have not specified data set names in the DATA statement of the first DATA step, in the PROC SORT, or even in the SET statement of the last DATA step. Not only are these shortcuts possible thanks to SAS’ DATAn naming convention, but they are also a very robust way of dynamically assigning temporary data set names. This method is especially useful and appropriate for SAS macros, as it guarantees that you do not accidentally overwrite a data set with the same name in the SAS program that invokes your macro. Think about it: if you are a macro developer, you need to make sure that the names of whatever temporary data sets you create within your macro are unique for a SAS session, so that they do not interfere with any data sets that may be created in the calling SAS program outside of your macro.

Here are defaults in SAS’ DATAn naming convention:

  • If you do not specify a name for the output data set in a DATA statement, SAS automatically assigns the default names WORK.DATA1, WORK.DATA2, and so on, to each successive data set that you create.
  • If you do not specify a name for the input data set in a SET statement, SAS automatically uses the last data set that was created. SAS keeps track of the most recently created data set through the reserved name _LAST_. When you execute a DATA or PROC step without specifying an input data set, by default, SAS uses the _LAST_ data set.

For more information on this useful SAS coding technique see special data set names and examples and warning on using special data set names.

Your thoughts?

Do you find this post useful? Have you ever split data sets into smaller ones based on a number of observations? Do you use special data set names and DATAn naming convention in your SAS coding? Please share your thoughts in the comments section below.

Splitting a data set into smaller data sets was published on SAS Users.

June 4, 2020
 

Learning never stops. When SAS had to change this year’s SAS Global Forum (SGF) to a virtual event, everyone was disappointed. I am, however, super excited about all of the papers and stream of video releases over the last month (and I encourage you to register for the upcoming live event in June). For now, I made a pact with myself to read or watch one piece of SGF related material per day. While I haven’t hit my goal 100%, I sure have learned a lot from all the reading and viewing. One particular paper, Using Jupyter to Boost Your Data Science Workflow, and its accompanying video by Hunter Glanz caught my eye this week. This post elaborates on one piece of his material: how to save Jupyter notebooks in other file formats.

Hunter’s story

Hunter is a professor who teaches multiple classes using SAS® University Edition, which comes equipped with an integrated Jupyter notebook. His focus is on SAS programming and he requires his students to create notebooks to complete assignments; however, he wants to see the results of their work, not to run their raw code. The notebooks include text, code, images, reports, etc. Let's explore how the students can transform their native notebooks into other, more consumable formats. We'll also discuss other use cases in which SAS users may want to create a copy of their work from a notebook as, say, a .pdf, .html, or .py file, just to name a few.

What you’ll find here and what you won’t

This post will not cover how to use Jupyter notebooks with SAS or other languages. There is a multitude of other resources, starting with Hunter’s work, to explore those topics. This post will cover how to produce other file formats in SAS, Python, and R. I’ll outline multiple methods including a point-and-click method, how to write inline code directly in the notebook, and finally using the command line.

Many of the processes discussed below are language agnostic. When there are distinct differences, I’ll make a note.

A LITTLE about Jupyter notebooks

A Jupyter notebook is a web application that allows users to run commands, view responses, include images, and write inline text all in one place. The all-encompassing notebook helps users tell a complete story without having to use multiple apps. Jupyter notebooks were originally created for the Python language, and are now available for many other programming languages. JupyterLab, the notebooks’ cousin, is a later, more sophisticated version, but for this writing, we’ll focus on the notebook. The functionality in this use case is similar.

Where do we start? First, we need to install the notebook, unless you're working in SAS University Edition.

Install Anaconda

The easiest way to get started with the Jupyter Notebook App is by installing Anaconda (this will also install JupyterLab). Anaconda is an open source distribution tool for the management and deployment of scientific computing. Out-of-the-box, the notebook from the Anaconda install includes the Python kernel. For use with other languages, you need to install additional kernels.

Install additional language kernels

In this post, we’ll focus on Python, R, and SAS. The Python kernel is readily available after the Anaconda install. For the R language, follow the instructions on the GitHub R kernel repository. I also found the instructions on How to Install R in Jupyter with IRKernel in 3 Steps quite straightforward and useful. Further, here are the official install instructions for the SAS kernel and a supporting SAS Community Library article.

With the additional kernels in place, you should see all available languages when creating a new notebook as pictured below.

Available kernels list

File conversion methods

Now we’re ready to dive into the export process. Let’s look at three approaches in detail.

Download (Export) option

Once you’ve opened your notebook and run the code, select File-> Download As (appears as Export Notebook As… in JupyterLab).

"Download As"  option in Jupyter notebook

"Export Notebook As" option in JupyterLab

HTML format output

Notice the list of options, some more familiar than others. Select the HTML option and Jupyter converts your entire notebook (text, commands, figures, images, etc.) into a file with a .html extension. The resulting file opens in a browser as expected. See the images below for a comparison of the .ipynb and .html files.

SAS code in a Jupyter notebook

Corresponding SAS code notebook in html form

SAS (aka script) format output

Using the Download As -> SAS option renders a .sas file, which is depicted in Enterprise Guide below. Note: when using a different kernel, say Python or R, you have the option to save in that language-specific script format.

SAS code saved from a notebook displayed in Enterprise Guide

One thing to note here is only the code appears in the output file. The markdown code, figures, etc., from the original notebook, are not display options in EG, so they are removed.

PDF format output

There is one (two, actually) special case I need to mention. If you want to create a PDF (or LaTeX, which is used to create PDF files) output of your notebook, you need additional software. For converting to PDF, Jupyter uses the TeX document preparation ecosystem. If you attempt to download without TeX, the conversion fails, and you get a message to download TeX. Depending on your OS, the TeX software will have a different name, but the name will include TeX. You may also need Pandoc for certain formats. I suggest installing both to be safe. Install TeX from its download site, and do the same for Pandoc.

Once I’ve completed creating the files, the new files appear in my File Explorer.

New SAS file in Windows File Explorer

Cheaters may never win, but they can create a PDF quickly

Well, now that we’ve covered how to properly convert and download a .pdf file, there may be an easier way. While in the notebook, press the Ctrl + P keys. In the Print window, select the Save to PDF option, choose a file destination and save. It works, but I felt less accomplished afterward. Your choice.

Inline code option

Point-and-click is a perfectly valid option, but let’s say you want to introduce automation into your world. The jupyter nbconvert command provides the capability to transform the current notebook into any format mentioned earlier. All you must do is pass the command with a couple of parameters in the notebook.

In Python, you can issue the nbconvert command through the system function of the os library. The following lines are representative of the general structure.

import os
os.system("jupyter nbconvert myNotebook.ipynb --to html")
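As an alternative sketch, the same command can be assembled as an argument list and run through Python's standard subprocess module, which raises an error on failure instead of returning a code you have to inspect. The helper names nbconvert_command and convert are hypothetical; the --to and --output options are standard nbconvert options, and jupyter is assumed to be on your PATH.

```python
import subprocess

def nbconvert_command(notebook, to_format, output=None):
    """Build the argument list for a 'jupyter nbconvert' call."""
    cmd = ["jupyter", "nbconvert", "--to", to_format, notebook]
    if output is not None:
        cmd += ["--output", output]   # optional base name for the result file
    return cmd

def convert(notebook, to_format, output=None):
    """Run the conversion; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(nbconvert_command(notebook, to_format, output), check=True)

# Only prints the command; call convert(...) to actually run it.
print(nbconvert_command("myNotebook.ipynb", "html"))
```

Passing the arguments as a list avoids shell quoting problems when a notebook name contains spaces.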

An example with Python

The example below is from a Python notebook. The 0 in the output is the return code and indicates success.

Code to create a PDF file from a Python notebook

An example with SAS

As you see with the Python example, the code is just that: Python. Generally, you cannot run Python code in a Jupyter notebook running the SAS kernel. Luckily we have Jupyter magics, which allow us to write and run Python code inside a SAS kernel. The magics are a two-way street and you can also run SAS code inside a Python shell. See the SASPy documentation for more information.

The code below is from a SAS notebook, but is running Python code (triggered by the %%python magic).

Code to create a PDF file from a SAS notebook

The EmployeeChurnSASCode.pdf file is created in the same directory as the original notebook file:

Jupyter file system display in a web browser

An example with R

Things are fairly straightforward in an R notebook. However, you must install and load the nbconvert package.

Code to create an HTML file from an R notebook

The first line installs the package, the second line loads the package, and the third actually does the conversion. Double-check your paths if you run into trouble.

The command line

The last method we look at is the command line. This option is the same regardless of the language with which you’re working. The possibilities are endless for this option. You could include it in a script, use it in code to run and display in a web app, or create the file and email it to a colleague. The examples below were all run on a Windows OS machine using the Anaconda command prompt.

An example with a SAS notebook

Convert sasNotebook.ipynb to a SAS file.

>> ls -la |grep sasNotebook
-rw-r--r-- 1 jofurb 1049089  448185 May 29 14:34 sasNotebook.ipynb
 
>> jupyter nbconvert --to script sasNotebook.ipynb
[NbConvertApp] Converting notebook sasNotebook.ipynb to script
[NbConvertApp] Writing 351 bytes to sasNotebook.sas
 
>> ls -la |grep sasNotebook
-rw-r--r-- 1 jofurb 1049089  448185 May 29 14:34 sasNotebook.ipynb
-rw-r--r-- 1 jofurb 1049089     369 May 29 14:57 sasNotebook.sas

An example with a Python notebook

Convert 1_load_data.ipynb to a PDF file

>> ls -la |grep 1_load
-rw-r--r-- 1 jofurb 1049089   6004 May 29 07:37 1_load_data.ipynb
 
>> jupyter nbconvert 1_load_data.ipynb --to pdf
[NbConvertApp] Converting notebook 1_load_data.ipynb to pdf
[NbConvertApp] Writing 27341 bytes to .\notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: ['xelatex', '.\\notebook.tex', '-quiet']
[NbConvertApp] Running bibtex 1 time: ['bibtex', '.\\notebook']
[NbConvertApp] WARNING | b had problems, most likely because there were no citations
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 32957 bytes to 1_load_data.pdf
 
>> ls -la |grep 1_load
-rw-r--r-- 1 jofurb 1049089   6004 May 29 07:37 1_load_data.ipynb
-rw-r--r-- 1 jofurb 1049089  32957 May 29 15:23 1_load_data.pdf

An example with an R notebook

Convert HR_R.ipynb to an R file.

>> ls -la | grep HR
-rw-r--r-- 1 jofurb 1049089   5253 Nov 19  2019 HR_R.ipynb
 
>> jupyter nbconvert HR_R.ipynb --to script
[NbConvertApp] Converting notebook HR_R.ipynb to script
[NbConvertApp] Writing 981 bytes to HR_R.r
 
>> ls -la | grep HR
-rw-r--r-- 1 jofurb 1049089   5253 Nov 19  2019 HR_R.ipynb
-rw-r--r-- 1 jofurb 1049089   1021 May 29 15:44 HR_R.r
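To illustrate the "include it in a script" idea, here is a minimal Python sketch that prepares one nbconvert command per notebook in a folder. The function names batch_commands and run_all are hypothetical, and the sketch assumes jupyter is on your PATH; nothing is converted until run_all is called.

```python
import subprocess
from pathlib import Path

def batch_commands(folder, to_format="html"):
    """Return one 'jupyter nbconvert' argument list per .ipynb file in folder."""
    notebooks = sorted(Path(folder).glob("*.ipynb"))
    return [["jupyter", "nbconvert", "--to", to_format, str(nb)] for nb in notebooks]

def run_all(folder, to_format="html"):
    """Convert every notebook in folder, stopping at the first failure."""
    for cmd in batch_commands(folder, to_format):
        subprocess.run(cmd, check=True)

# Dry run: show what would be executed for the current directory.
for cmd in batch_commands("."):
    print(" ".join(cmd))
```

Keeping the command building separate from the execution makes it easy to review the list before converting anything.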

Wrapping things up

Whether you’re a student of Hunter’s, an analyst creating a report, or a data scientist monitoring data streaming models, you may need to transform your work from a Jupyter notebook into a more consumable asset. Regardless of the language of your notebook, you have multiple choices for saving your work, including menu options, inline code, and the command line. This is a great way to show off your creation in a very consumable mode.

How to save Jupyter notebooks in assorted formats was published on SAS Users.

April 23, 2020
 

SAS macro function

SAS® Macro Language immensely empowers SAS programmers with versatility and efficiency of their code development. It allows SAS users to modularize programming code with “once written – many times used” components, and in many cases automatically generate data-driven SAS code.

Macro language and macro processor

Generally, SAS software processes your SAS program step by step, first scanning it for macro language objects: macro variables referenced as &somename, and macros referenced as %somename. If any are found, SAS software activates the macro processor, which resolves and substitutes those macro references according to the macro language syntax before SAS compiles and executes your programming steps.

SAS macro language vs. SAS programming language

A SAS program usually consists of two, often interwoven layers – macro layer and non-macro layer, each with its own syntax and its own timing of compilation and execution. In other words, SAS code is a combination of two distinct languages:

  • SAS programming language (comprised of DATA steps, PROC steps and global statements such as LIBNAME, OPTIONS, TITLE etc.)
  • SAS macro language (comprised of %LET, %IF, %DO, macro functions etc.) which is processed separately from and before SAS compiler executes SAS programming language code.

The difference between them is like a difference between cooking a meal and eating the meal. In this analogy meal=code, cooking=SAS macro language, eating=SAS programming language. Clear understanding of this difference is the key to becoming a successful SAS programmer.

Two types of SAS macros

There are two distinct types of SAS macros:

  1. Macros that generate some SAS programming language code which can span across SAS statements or steps;
  2. Macros that generate some string values which can be used as part of SAS programming language code or data values, but they are not complete SAS statements or steps. This type does not generate any SAS executable code, just a value.

What is a SAS macro function?

A SAS macro function is a SAS macro that generates a value. In other words, it is the type 2 macro described above. Like any SAS macro, a SAS macro function can have any number (zero or more) of positional and/or named parameters (arguments). SAS users may define their own macro functions, but in doing so they may not use any SAS language syntax; only SAS macro language syntax is allowed. You can use existing macro functions in your own macro function definition. Among others, one of the most powerful is the %SYSFUNC macro function, which brings a wealth of SAS language functions into the SAS macro language.

Sources of SAS macro functions

SAS macro functions may come from the following three sources.

1. Pre-built macro functions

Pre-built macro functions that are part of the macro processor. These are such macro functions as %eval, %length, %quote, %scan, %str, %sysfunc, %upcase, etc. Here is a complete list of the pre-built SAS macro functions.

2. Auto-call macro functions

Auto-call macros come in both flavors: some are type 1 (macros), and some are type 2 (macro functions), such as %cmpres, %left, %lowcase, %trim, %verify, etc. These macro functions supplement the pre-built macro functions. The main difference from the pre-built macro functions is that the auto-call macro functions are themselves written in SAS macro language, just like user-defined macro functions, and are made available to you without your having to define or include them in your programs. The auto-call macro functions come with your SAS software installation and are usually pre-configured for you through the MAUTOSOURCE and SASAUTOS= system options. They may include several macro libraries depending on the SAS products licensed at your site. For example, for my SAS BASE installation the auto-call macro library is in the following folder:

C:\Program Files\SASHome\SASFoundation\9.4\core\sasmacro

Here is a selected list of auto-call macros provided with SAS software.

From the usage standpoint, you will not notice any difference between the pre-built and the auto-call macro functions. For example, macro function %upcase() is pre-built, while macro function %lowcase() is auto-call macro function. They belong to entirely different families, but we use them as if they are complementary siblings.
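A quick way to see this for yourself is to call both side by side. The macro variable name here is just an example.

```sas
%let name = Sas Users;
%put %upcase(&name);    /* pre-built macro function: writes SAS USERS */
%put %lowcase(&name);   /* auto-call macro function: writes sas users */
```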

3. User-defined macro functions

Finally, there are user-defined macro functions that do not come with SAS installation. These are the macro functions that you define on your own. Usually, they are kept separately from the auto-call macros mainly in order to distinguish them from SAS-supplied ones.

To enable access to your own SAS macro library in addition to the auto-call macro library (or libraries), you can use the INSERT= system option:

options insert=(sasautos="path_to_your_own_macro_library_folder");

Instead of replacing the SASAUTOS value, this option inserts an additional value into the existing SASAUTOS option as the first value, thus allowing you to tap into your own macro library first, and then also into pre-set SAS auto-call libraries.
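As a sketch of how you might check the result, the code below adds a folder and then prints the current SASAUTOS search order to the log. The folder path is hypothetical; for auto-call lookup to work, each macro is expected to live in its own file named after the macro (for example, nobs.sas for %nobs).

```sas
/* Hypothetical folder; replace with the location of your own macro library */
options insert=(sasautos="c:\projects\macros");

/* Display the resulting search order of the auto-call libraries */
%put %sysfunc(getoption(sasautos));
```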

Creating user-defined macro function

Let’s consider the following example. Suppose, we want to create a macro function that takes a data set name as an argument and returns a value equal to the number of observations in that data set.

We know that the following code calculates the number of observations in a data set:

data _null_;
   if 0 then set SASHELP.CARS nobs=n; /* never executes; NOBS= is populated at compile time */
   call symputx('NOBS',n);
   stop;
run;
 
%put &=NOBS;
NOBS=428

Can we create a SAS macro function by enclosing this code into macro? Something like this:

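%macro nobs(dset=,result=);
   data _null_;
      if 0 then set &dset nobs=n;     /* NOBS= is populated at compile time */
      call symputx("&result",n,'G');  /* 'G' makes the result visible outside the macro */
      stop;
   run;
%mend nobs;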

The answer is “No”. Yes, we created a valid macro; we can invoke this macro to produce the result:

%nobs(dset=SASHELP.CARS, result=NOBS);
%put &=NOBS;
NOBS=428

But this is not a macro function. Remember the type 2 macro that does not generate any SAS programming language code, just a value? This macro does generate SAS code (a DATA step), which assigns a value to the macro variable named in the second argument (result=NOBS).

In order to create a valid macro function, our macro should not have any SAS language code in it – neither a DATA step, nor a PROC step. It may only be comprised of the SAS macro language code. Here it is:

%macro nobs(dset);
   %local dsid n rc;
   %let dsid = %sysfunc(open(&dset));
   %if &dsid %then
   %do;
      %let n = %sysfunc(attrn(&dsid,nlobs));
      %let rc = %sysfunc(close(&dsid));
   %end;
   %else %put %sysfunc(sysmsg());
   &n
%mend nobs;

When the macro processor executes this macro, the only thing that gets passed to the SAS language compiler is the value in the line right before the %mend. This is the calculated number of observations (denoted by &n). It is the only thing visible to the SAS language compiler; the rest is macro language code, which is visible to and handled by the SAS macro processor.

IMPORTANT: When defining SAS macro function always use %local statement to list ALL macro variables that are created in your macro to ensure they will not accidentally overwrite same-named macro variables in the calling environment. You don’t need to declare %local for macro parameters as they are always local automatically.

SAS macro functions usage examples

When a macro function is defined this way, wherever you place its invocation %nobs(SASHELP.CARS) in your SAS code it will be evaluated and replaced with the corresponding value (in this case it is number 428) by the SAS macro processor. That way you can avoid substandard hard-coding and make your SAS code dynamic and powerful. You can use macro functions in many SAS coding contexts. For example:

  • Assignment statements for macro variable: %let NOBS=%nobs(SASHELP.CARS);
  • Assignment statement in a DATA step: x = %nobs(SASHELP.CARS);
  • As a value of the iterative do loop: do i=1 to %nobs(SASHELP.CARS);
  • As part of condition in IF statement: if %nobs(SASHELP.CARS) > 500 then do;

And so on.
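The same pattern works in macro-level conditions too. Below is a minimal sketch of a second macro function and a caller that uses it; the names %dsexist, %print_if_exists and WORK.NO_SUCH_TABLE are hypothetical and not part of the original example.

```sas
/* Macro function: generates 1 if the data set exists, 0 otherwise */
%macro dsexist(dset);
   %sysfunc(exist(&dset))
%mend dsexist;

/* Type 1 macro that uses the macro function in a %IF condition */
%macro print_if_exists(dset);
   %if %dsexist(&dset) %then
   %do;
      proc print data=&dset (obs=5);
      run;
   %end;
   %else %put NOTE: Data set &dset does not exist.;
%mend print_if_exists;

%print_if_exists(SASHELP.CARS)
%print_if_exists(WORK.NO_SUCH_TABLE)
```

Note that %dsexist contains no semicolon after the %sysfunc call, so the only text it generates is the value itself.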

Your thoughts?

Do you find this post useful? Do you use SAS macro functions? Can you suggest other usage examples? Please share with us in the Comments below.

How to create and use SAS macro functions was published on SAS Users.

April 7, 2020
 

As many of us are learning to navigate the changing world we are living in amid the COVID-19 outbreak, and as we care for our loved ones, friends, and our community, many of us now find ourselves working and studying from home much more than we did before. As an employee at SAS and an instructor at NC State University, I have found myself splitting my time between setting up my temporary home office and an at-home routine while also trying to help my students feel safe and comfortable as we move to a virtual classroom. But as my commute and previous social time become FaceTime calls and text messages, I’ve found myself with more downtime than I previously had, time I want to dedicate to the training I’ve been wanting to do for the past year.

At SAS, we are striving to care for our users during this time—in that spirit, I wanted to share with you some free SAS offerings, as well as coping techniques I am doing from home.

Take care of yourself and your family

First and foremost, make sure you and your family are taking time for self-care, whether it be meditation using a mobile app or YouTube video, or getting some exercise together. I am finding that my daily walk is something I need to relax my mind and get back to focusing on my tasks.

Retrain on skills you haven’t touched in a while

Sometimes we all need that reboot on tools or methods we use every day. I love SAS’ Statistics 1 course and SAS Programming 1, both of which are free. It’s a great refresher to those who haven’t taken a math course in a few years, or for those just wanting to start out with data science using SAS. Understanding the fundamentals and getting some refreshers on SAS language tricks is something I try to push through with my work constantly. I am also preparing to take the SAS Certification Exam at the end of the year, so the practice exam is something I also plan to use often during my study sessions.

Learn a new skill

It is also a great time to learn something you have always wanted to do. I started by taking some free online photography classes. I also have on my list enrolling in the SAS Academy for Data Science, which is free until the end of May 2020. The advanced analytics professional courses are something I have wanted to complete for a long time, so I am excited to get started on learning more about data modeling. The SAS e-books collection is now also free until April 30, 2020, so I’ve downloaded some great additional materials using code FREEBOOKS at checkout. 

Be kind to others

Being stuck inside can sometimes make you feel like you are spending more time with family than you are used to, or maybe you are spending more time alone. Using this time to connect with those I haven’t made time to talk to has been something I am really thankful for. I call my sister who is a nurse in Florida and check on her. As this outbreak affects us all differently, this is a great time to come together to connect. It is also a great time to think of others affected by the outbreak and who don’t have the ability to continue working. There are some great ways you can help others by getting involved in volunteer work or donating to a helpful cause.

Do fun activities

Though we are stuck at home, this time has been great for enjoying things I often don’t get to during my normal busy schedule. Besides taking free training, I’ve been playing some new video games (#ACNH) and some games I’ve neglected for far too long. I’ve also used this time to find what brings me joy from my home. Making time for reading is bringing great joy to my life right now.

As we all move through this turbulent time, make sure to take care of yourself and others. I hope some of these free tools and training will come in handy as you work towards your personal goals while remaining safe and healthy.

Working remotely? A list of 5 ways to spend your down time was published on SAS Users.

April 3, 2020
 

Whether you like it or not, Microsoft Excel is still a big hit in the data analysis world. Customers small and large still find it a good fit for daily routines such as filtering, generating plots, calculating items in ad hoc analysis, or even running statistical models. Whenever I talk to customers, there is always someone who will ask either “Can this be exported to Excel?” or “Can we import data from Excel?” Recently, other questions have started to come up more often: Can we run Python within SAS? How do I allow my team to choose their language of preference? How do I provide an interface that looks like Microsoft Excel, but has SAS functionality?

Well… good news is: we can answer YES to all of these questions. With the increase in number of users performing analytics and the number of analytical tools available, for me it was clear that we would end up having lots of disparate processes. For a while this was a problem, but naturally, companies started developing ways to integrate these siloed teams.

At the beginning of the last decade, SAS developed SAS Add-in for Microsoft Office. The tool allows customers to run and embed SAS analytic capabilities inside Microsoft Office applications. More recently, SAS released a new version of PROC FCMP that allows users to write Python code and call it, as if it were a function, inside SAS programs.

These advancements provide users the ability to run Python inside Excel. When I say inside, I really mean from within Excel's interface.

Before we jump to how we can do it, you may ask yourself: Why is this relevant to me? If I know SAS, I import the dataset and work with the data in SAS; If I know Python, I open a Jupyter notebook, import the data set and do my thing. Well… you are kind of right, but let me tell you a story.

The use case

Recently I worked with a customer whose business process was like this: I have a team of data scientists that is highly technical and knowledgeable in Python and SAS. Additionally, I have a team of analysts who have little Python knowledge but are always working with Excel to summarize data, create filters, graphs, etc. My teams need to communicate and collaborate. The normal chain of events is as follows:

  1. the Python team works on the data, and exports the results to Excel
  2. the analytics team picks up the data set, and runs SAS scripts and excel formulas

This is a problem of inefficiency for the customer. Why can't the data scientist pass his or her code to the analyst to execute it on the same project without having to wait on the Python specialist to run the code?

I know this sounds overly complicated, but as my SAS colleague Mike Zizzi concludes in his post SAS or Python? Why not use both? Using Python functions inside SAS programs, at the end of the day what matters is that you get your work done. No matter which language, software or IDE you are using. I highly recommend Mike's article if you want a deep dive on what PROC FCMP has to offer.

The process

Let's walk through a data scoring scenario similar to my customer's story. Imagine I am a SAS programmer using Excel to explore data. I am also part of a team that uses Python, creating scoring code from analytical models developed in Python. My job is to score and analyze the data in Excel and pass the results to the service representative, so they can forward the response to the customer.

Importing data

The data set we'll work with in this example will help me analyze which customers are more likely to default on a loan. The data and all code used in this article are in the associated GitHub repository. The data dictionary for the data set is located here. First, I open the data set as seen on Sheet1 below in Excel.

Upload data to SAS

Before we jump to the coding part with SAS and Python, I need to send the data to SAS. We'll use the SAS Add-in in Excel to send data to the local server. I cover the steps in detail below.

I start by selecting the cells I want to upload to the library.

Next, I move to the SAS tab and select the Copy to SAS Server task.

A popup shows up where I confirm the selected cells.

After I click OK, I configure column, table, naming and location options.

SAS uploads the table to the requested library. Additionally, a new worksheet with the library.table name displays the results. As you can see in the image below, the sheet created follows the name WORK.IMPORTED_DATA we set up in the previous step. This represents the table in the SAS library memory. Notice, however, we are still working in Excel.

The next step is to incorporate the code sent from my teammate.

The Python code

The code our colleague sent is pure Python. I don't necessarily have to understand the code details, just what it does. The Python code below imports and scores a model and returns a score. Note: if you're attempting this in your own environment, make sure to update the hmeq_model.sav file location in the # Import model pickle file section.

def score_predictions(CLAGE, CLNO, DEBTINC,DELINQ, DEROG, LOAN, MORTDUE, NINQ,VALUE, YOJ):
	"Output: scored"
	# Importing libraries
	import pandas as pd
	from sklearn.preprocessing import OneHotEncoder
	from sklearn.compose import ColumnTransformer
	from sklearn.externals import joblib
 
	# Create pandas dataframe with input vars
	dataset = pd.DataFrame({'CLAGE':CLAGE, 'CLNO':CLNO, 'DEBTINC':DEBTINC, 'DELINQ':DELINQ, 'DEROG':DEROG, 'LOAN':LOAN, 'MORTDUE':MORTDUE, 'NINQ':NINQ, 'VALUE':VALUE, 'YOJ':YOJ}, index=[0])
 
	X = dataset.values
 
	# Import model pickle file
	loaded_model = joblib.load("C://assets/hmeq_model.sav")
 
	# Score the input dataframe and get the predicted class (0 or 1)
	scored = int(loaded_model.predict(X)[0])
 
	# Return the predicted class
	return scored

My SAS code calls this Python code from a SAS function defined in the next section.

The SAS code

Turning back to Excel, in the SAS Add-in side of the screen, I click on Programs. This displays a code editor, and as explained on this video, is like any other SAS code editor.

We will use this code editor to write, run and view results from our code.

The code below defines an FCMP function called Score_Python that imports the Python script from my colleague, and then calls that function from a SAS DATA step. The output table, HMEQ_SCORED, is saved in the WORK library in SAS. Note: if you're attempting this in your own environment, make sure to update the script.py file location in the /* Getting Python file */ section.

/* Defining FCMP function */
proc fcmp outlib=work.fcmp.pyfuncs;
	/* Defining name and arguments of the Python function to be called */
 
	function Score_Python(CLAGE, CLNO, DEBTINC, DELINQ, DEROG, LOAN, MORTDUE, NINQ, VALUE, YOJ);
		/* Python object */
		declare object py(python);
 
		/* Getting Python file  */
		rc = py.infile("C:\assets\script.py");
 
		/* Send code to Python interpreter */
		rc = py.publish();
 
		/* Call python function with arguments */
		rc = py.call("score_predictions",CLAGE, CLNO, DEBTINC, DELINQ, DEROG, LOAN, MORTDUE, NINQ, VALUE, YOJ);
 
		/* Pass Python results to SAS variable */
		MyFCMPResult = py.results["scored"];
 
		return(MyFCMPResult);
	endsub;
run;
 
options cmplib=work.fcmp;
 
/* Calling FCMP function from data step */
data work.hmeq_scored;
	set work._excelexport;
	scored_bad = Score_Python(CLAGE, CLNO, DEBTINC, DELINQ, DEROG, LOAN, MORTDUE, NINQ, VALUE, YOJ);
	put scored_bad=;
run;

 

I place my code in the editor.

We're now ready to run the code. Select the 'Running man' icon in the editor.

The output

The result represents the outcome of the Python model scoring code. Once the code completes, the WORK.HMEQ_SCORED worksheet updates with a new column, scored_bad.

The binary value represents if the customer is likely (a '1') or unlikely (a '0') to default on his or her loan. I could now use any built in Excel features to filter or further analyze the data. For instance, I could filter all the customers likely to default on their loans and pass a report on to the customer management team.
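If you prefer to do that filtering on the SAS side before returning to Excel, a minimal sketch could look like the following. It assumes the WORK.HMEQ_SCORED table and the scored_bad column created above; the output table name WORK.HMEQ_LIKELY_DEFAULT is hypothetical.

```sas
/* Count customers by predicted class (1 = likely to default, 0 = unlikely) */
proc freq data=work.hmeq_scored;
   tables scored_bad / missing;
run;

/* Keep only the customers flagged as likely to default */
data work.hmeq_likely_default;   /* hypothetical output table name */
   set work.hmeq_scored;
   where scored_bad = 1;
run;
```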

Final thoughts

In this article we've explored how collaboration between teams with different skills can streamline processes for efficient data analysis. Each team focuses on what they're good at and all of the work is organized and completed in one place. It is a win-win for everyone.

Extending Excel with Python and SAS Viya was published on SAS Users.

March 11, 2020
 

Automating SAS applications development

SAS variable labels are unique features of SAS data tables (aka data sets) that allow SAS users to enhance reading and interpretation of tables and reports.

Whether you use a SAS data table as a data source in a reporting procedure or in an interactive interface such as SAS Visual Analytics, you will benefit from pre-assigning meaningful labels during the data preparation process. Besides being more efficient, such early label assignment ensures that data element descriptions (labels) stay consistent across different developers.

The most direct way of creating column labels is by explicitly assigning them to the data variables. You can do it during the data table creation in a DATA step using either the LABEL statement or the ATTRIB statement. Alternatively, you can do it after your data table is already created by using PROC DATASETS with the MODIFY and LABEL statements.
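For reference, here is a minimal sketch of the three explicit approaches. The table WORK.CUSTOMERS and its values are made up for illustration.

```sas
data work.customers;                 /* hypothetical table */
   length FIRST_NAME LAST_NAME $20;
   FIRST_NAME = 'Peter';
   LAST_NAME  = 'Birn';
   /* 1. LABEL statement while the table is being created */
   label FIRST_NAME = 'First Name';
   /* 2. ATTRIB statement: label (and other attributes) in one place */
   attrib LAST_NAME label='Last Name';
run;

/* 3. PROC DATASETS: relabel an existing table without rewriting its data */
proc datasets lib=work nolist;
   modify customers;
      label FIRST_NAME = 'Customer First Name';
quit;
```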

However, in many situations this tedious and voluminous process of creating column labels can be automated. Let’s look at one approach that I found useful for bulk column labeling. Plus, we are going to explore a SAS coding technique that uses the _DATA_ and _LAST_ special data sets.

Deriving variable labels from variable names

This method is suitable when variable names are well-formed, for example CUSTOMER_ADDRESS, FIRST_NAME, LAST_NAME, COMPANY_NAME, PLACE_OF_BIRTH, etc. Kudos to data designer!

We can transform these names into labels by replacing underscores with space characters and converting words from upper case to proper case. These are the labels we will get: Customer Address, First Name, Last Name, Company Name, Place Of Birth.
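The transformation itself can be tried on a single name in a short DATA step. This is only a sketch; the variable names name and lbl are chosen for illustration.

```sas
data _null_;
   length name lbl $32;
   name = 'PLACE_OF_BIRTH';
   /* translate() swaps underscores for blanks; propcase() capitalizes each word */
   lbl  = propcase(translate(name,' ','_'));
   put lbl=;   /* writes lbl=Place Of Birth to the log */
run;
```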

Let’s say our original data table is DEMO:

data DEMO;
   input CUSTOMER_CITY $ 1-15 FIRST_NAME $ 16-26 LAST_NAME $27-37 COMPANY_NAME $38-50 COUNTRY_OF_BIRTH $51-65;
   datalines;
Washington     Peter      Birn       Citibank     USA
Denver         Lisa       Roth       IBM          UK
Cary           Antony     Bessen     SAS          Spain
;

Then the following macro will create variable labels out of the variable names as described above:

options mprint; 
%macro ilabel (dataset);
   %local i lbref dsname vname vlabel nvars;
 
   %if %index(&dataset,.) %then
   %do; /* 2-level dataset name */
      %let lbref  = %scan(&dataset,1,.);
      %let dsname = %scan(&dataset,2,.);
   %end;
   %else
   %do; /* 1-level dataset name */
      %let lbref  = WORK;
      %let dsname = &dataset;
   %end;
 
   /* get variable names */
   proc contents data=&dataset out=_data_(keep=name) noprint;
   run;
 
   /* create name/label pairs */
   data _null_;
      set _last_ end=eof nobs=n;
      call symput('vname'!!strip(put(_n_,best.)),name);
      lbl = propcase(translate(name,' ','_'));
      call symput('vlabel'!!strip(put(_n_,best.)),trim(lbl));
      if eof then call symputx('nvars',n);
   run;
 
   /* modify variable labels */
   proc datasets lib=&lbref nolist;
      modify &dsname;
         label
            %do i=1 %to &nvars;
               &&vname&i = "&&vlabel&i"
            %end;
         ;
   quit;
 
%mend ilabel;

You can invoke this macro by either one line of code:

%ilabel(DEMO)

or

%ilabel(WORK.DEMO)

Here is how our DEMO table looks before and after the %ilabel macro assigns the labels based on the column names:

BEFORE:

Data table showing column names

 

AFTER:

Data table showing column labels
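If you want to confirm the result in the SAS log or output rather than in a table viewer, one option is to query the column metadata. This sketch assumes the DEMO table lives in the WORK library, as in the example above.

```sas
/* List each column of WORK.DEMO together with its label */
proc sql;
   select name, label
      from dictionary.columns
      where libname = 'WORK' and memname = 'DEMO';
quit;
```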

Macro code highlights

In this macro, we:

  1. Define local macro variables to make sure their names will not interfere with possible namesakes in the calling program.
  2. Determine libref and one-level data set name for the input data set.
  3. Create a table containing variable names in the input data set using PROC CONTENTS.
  4. Use DATA _NULL_ step to read through the variable names, and derive labels as

    lbl = propcase(translate(name,' ','_'));

    Here, the translate() function replaces underscores with blanks, then the propcase() function converts every word in an argument to proper case (upper case for the first character and lower case for the remaining characters). We also create macro variables for each name/label pair (vname1, vlabel1, vname2, vlabel2, …) and macro variable nvars representing the number of such pairs.

  5. Use PROC DATASETS with MODIFY and LABEL statements to assign generated column labels to the source data set.

If some of the labels assigned by this macro are not what you need, you may run another PROC DATASETS to individually adjust (re-assign) them according to your wishes or specification. But when you need to label data set columns on a large scale (many tables with dozens or hundreds of columns), this can be a good first draft that can save you time and effort.

_DATA_ and _LAST_ special data sets

You might notice that I used _data_ data set name in the out= option of the PROC CONTENTS. This is not an explicit data set name; it is a keyword, a special data set that allows SAS to assign one of the available data set names dynamically. The created output data set will have a name that looks something like DATA1 or DATA2, etc. Try running this code:

data _data_;
   x=1;
run;

and look in the SAS LOG at what data set is created. I got:

NOTE: The data set WORK.DATA1 has 1 observations and 1 variables.

Special data set name _data_ tells SAS to create a data set in the WORK library from a list of names DATA1, DATA2, … according to the DATAn naming convention. These names (as well as WORK library) are unique for a given SAS session. The first time you use _data_ within a SAS session it will create data set named WORK.DATA1, the second time you use _data_ it will create WORK.DATA2, and so on.

Correspondingly, I used the special data set name _last_ in the SET statement of the DATA step following the PROC CONTENTS. Again, here _last_ is a keyword, not a data set name; it is a special data set that refers to the most recently created data set in your SAS session. That causes SAS to use the latest data set created prior to the _last_ reference.
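Here is a small sketch that pairs the two keywords outside of the macro. The automatic macro variable SYSLAST holds the name of the most recently created data set, so you can see what _last_ currently points to.

```sas
data _data_;              /* SAS picks the next free DATAn name */
   x = 1;
run;

proc print data=_last_;   /* reads the table created just above */
run;

%put Most recently created data set: &syslast;
```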

Special data sets _data_ and _last_ are reserved names (or SAS keywords) along with special data set _null_ that is used in the DATA _NULL_ statement and causes SAS to execute the DATA step without creating a data set. (By the way, using DATA _NULL_ can increase your code efficiency when you use the DATA step for custom report writing or creating macro variables or other processing for which the output data set is not needed as it does not consume computer resources for writing and storing the output data set.)

If I used an explicit table name in this macro instead, and your calling program happened to use the same table name, the macro would overwrite your table, which would wreak havoc on your program. Using the _data_ and _last_ special data sets protects your SAS program from inadvertently overwriting another data set of the same name when it executes the %ilabel macro. It is similar to using %LOCAL for macro variable names to protect same-named macro variables in the calling environment from being overwritten.

A WORD OF CAUTION: Remember, that the _data_ keyword creates table names that are unique only within a SAS session, so it works perfectly for the WORK data library which itself is a unique instance for a SAS session. While it is syntactically correct to use special data set notation _data_ for creating permanent data sets such as libref._data_ (including SASUSER._data_), I have to warn you against using it as it will not guarantee the name uniqueness in the permanent data library, and you may end up overwriting data sets that already exist there.

Your thoughts?

Do you find this post useful? How do you handle the task of assigning variable labels on a mass scale? Do you use _data_ and _last_ special data sets in your SAS coding? Please share in the comments section below.

Automating SAS variable labels creation was published on SAS Users.

January 27, 2020
 

Workday calendar including weekends and holidays

Way too often, SAS programmers run into a task where a given date (e.g. an event date) must be shifted (incremented or decremented) by a specified number of days excluding weekends and holidays; in other words, moved by a given number of workdays. It does not matter how many days off are in our date span as long as it contains exactly the required number of workdays.

For the purpose of this blog post, we will use the following words as synonyms: workdays, work days, working days, business days; as opposed to their antonym: days off.

In the ideal world

If not for gifts from governments and employers called holidays, shifting (incrementing or decrementing) a date by a number of workdays using SAS would be a piece of cake. It’s literally a single line of code using INTNX function with the WEEKDAY date interval. For example, the following DATA Step code:

data _null_;
   event_date = '02JAN2020'd;
   shift_date = intnx('weekday', event_date, -10);
   put shift_date= date9.;
run;

produces in the SAS log:

shift_date=19DEC2019

Here:

  • 'weekday' is date interval covering Monday through Friday;
  • event_date is starting date point;
  • -10 is number of workdays to shift by (positive number means increment; negative number means decrement).

Note, that the WEEKDAY date interval can be modified to accommodate different weekend days. For example:

  • WEEKDAY17W - five-day work week with a Sunday (1) and Saturday (7) weekend (same as WEEKDAY);
  • WEEKDAY1W - six-day week with Sunday (1) as the only weekend day;
  • WEEKDAY67W - five-day week with Friday (6) and Saturday (7) as weekend days, etc.
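To compare these variants, the following sketch advances the same starting date by one workday under three different weekend definitions. The variable names are illustrative only.

```sas
data _null_;
   event_date = '02JAN2020'd;                        /* a Thursday */
   sat_sun  = intnx('weekday',    event_date, 1);    /* weekend = Saturday, Sunday */
   fri_sat  = intnx('weekday67w', event_date, 1);    /* weekend = Friday, Saturday */
   sun_only = intnx('weekday1w',  event_date, 1);    /* weekend = Sunday only      */
   format sat_sun fri_sat sun_only date9.;
   put sat_sun= fri_sat= sun_only=;
run;
```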

Holidays schedule

In the real world, however, weekends defined by the WEEKDAY interval are not the only days off, as they do not account for holidays. In the example above, when we shifted our starting date (2 January 2020) by -10 we arrived at 19 December 2019 which means we miscounted several holidays as workdays.

Which holidays (and how many) we miscounted depends on the jurisdiction (country, state, business), as holiday schedules vary. For example, for US federal agencies we would miss three: (1) New Year’s Day, 1Jan2020; (2) Christmas Day, 25Dec2019; and (3) Christmas Eve, 24Dec2019 (although this is not an official US federal holiday, most federal employees were given that day off by presidential executive order).

For SAS Institute (USA), we would miscount 6 non-weekend holiday days (Winter Holiday 25Dec2019 – 27Dec2019 and 30Dec2019 – 1Jan2020).

In other countries or businesses, this holidays schedule might be quite different, and that is why this date-shifting task that would account for holidays schedule causes so much confusion. Let’s straighten it out with the help of our old friend – SAS user-defined format. But first, let’s create a workday calendar – a data table listing all OUR work days as well as days off.

Workday calendar

Practically every organization has (or must have) a workday calendar that defines the valid working days and consists of a repeating pattern of days on and days off, as well as exceptions to that pattern. While such a calendar may span multiple years, for our purposes, we can use a subset of that calendar, which reliably covers the date range of our interest.

Let’s create an example of the workday calendar as a SAS data table:

data DAYS_OF_WEEK;
   format DATE date9.;
   do DATE='01JAN2019'd to '31JAN2020'd;
      WEEK_DAY = weekday(DATE);
      DAY_NAME = put(DATE,downame.);
      WORK_DAY = 1<WEEK_DAY<7;
      output;
   end;
run;
 
data DAYS_HOLIDAY;
   format DATE date9.;
   input DATE date9.;
   WORK_DAY = 0;
   datalines;
01JAN2019
21JAN2019
18FEB2019
27MAY2019
04JUL2019
02SEP2019
11NOV2019
28NOV2019
24DEC2019
25DEC2019
01JAN2020
20JAN2020
; 
 
/* Overlay holidays onto weekdays */
data DAYS_WEEKENDS_AND_HOLIDAYS;
   merge
      DAYS_OF_WEEK
      DAYS_HOLIDAY;
   by DATE;
run;

Here is a fragment of the resulting workday calendar table:
Workday calendar table

If date shifting is needed on an individual-level, then workday calendars should be created for every person and must include working days, weekends, holidays as well as personal days off such as vacations, sick days etc.

SAS format to distinguish workdays from days off

Now, for the date range of interest, we want to create a SAS user-defined format that lists all the days off while workdays are lumped into the other category. It’s just more efficient that way, as the number of days off is usually smaller than the number of work days, so our explicit list of dates will be shorter. For example:

proc format;
   value dayoff
   '01DEC2019'd = 'Y'
   '07DEC2019'd = 'Y'
   '08DEC2019'd = 'Y'
   . . .
   '24DEC2019'd = 'Y'
   '25DEC2019'd = 'Y'
   '01JAN2020'd = 'Y'
   '20JAN2020'd = 'Y'
   other = 'N'
   ;
run;

In this user-defined SAS format values labeled 'Y' mean day off, and values labeled 'N' mean workday. That includes and takes care of both weekends and holidays.

The PROC FORMAT step above only illustrates the kind of format we are going to create. By no means do I suggest implementing it this hard-coded way. Quite the contrary, we are going to create the format dynamically and 100% data-driven. Here is how we do it:

data WORK.DAYSOFF (rename=(DATE=START));
   set DAYS_WEEKENDS_AND_HOLIDAYS(where=(WORK_DAY=0)) end=last;
   retain FMTNAME 'dayoff' TYPE 'n' LABEL 'Y';
   output;
   if last then do;
      HLO = 'O';
      LABEL = 'N';
      output;
   end;
run;
 
proc format cntlin=WORK.DAYSOFF;
run;

In the above code, HLO='O' and LABEL='N' are responsible for generating the OTHER category for the dayoff format.
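Before relying on the format, it can be worth spot-checking it on a few dates. The sketch below assumes the dayoff format has been created from the workday calendar above; with that calendar, 24 and 25 December 2019 should come back as 'Y' and the two surrounding days as 'N'.

```sas
data _null_;
   do d = '23DEC2019'd to '26DEC2019'd;
      day_off = put(d, dayoff.);   /* 'Y' = day off, 'N' = workday */
      put d= date9. day_off=;
   end;
run;
```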

Shifting dates by a number of workdays using dayoff format

With the dayoff user-defined format at hand, we can easily increment or decrement dates by a number of workdays. Here is how:

/* data table of some dates */
data EVENTS;
   do EVENT_DATE='01DEC2019'd to '31DEC2019'd;
      output;
   end;
   format EVENT_DATE date9.;
run;
 
/* Calculating new dates shifted by a number of workdays */
data EVENTS_WITH_SHIFTS;
   set EVENTS;
 
   /* Decrement EVENT_DATE by 10 workdays */ 
   d = EVENT_DATE;
   do i=1 to 10;
      d = d - 1;
      if put(d, dayoff.)='Y' then i = i - 1;
   end;
   BEFORE_DATE = d;
 
   /* Increment EVENT_DATE by 12 workdays */ 
   d = EVENT_DATE;
   do i=1 to 12;
      d = d + 1;
      if put(d, dayoff.)='Y' then i = i - 1;
   end;
   AFTER_DATE = d;
 
   format BEFORE_DATE AFTER_DATE date9.;
   drop d i;
run;

In this code, we decrement (d=d-1) or increment (d=d+1) the date on every iteration of the do-loop, which runs while counter i does not exceed the required number of workdays. Within the loop, however, we set i back to i-1 every time we land on a day off, as determined by the condition put(d,dayoff.)='Y'. This effectively excludes days off from the count. As a result, the loop iterates once for each workday plus once for each day off it encounters, moving date d across a span that contains exactly the given number of workdays (plus however many days off fall in between, which we don’t care about). It is worth pausing for a moment to let this sink in.
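To watch the counter adjustment at work, it can help to trace a single shift in the log. The sketch below is only an illustration under stated assumptions: it uses a hypothetical mini-format minioff that marks just one weekend (7 and 8 December 2019) as days off, and it shifts Thursday, 5 December 2019 forward by 3 workdays.

```sas
/* Hypothetical mini-calendar: a single weekend marked as days off */
proc format;
   value minioff
      '07DEC2019'd, '08DEC2019'd = 'Y'
      other = 'N';
run;

data _null_;
   d = '05DEC2019'd;                              /* Thursday */
   do i = 1 to 3;                                 /* shift forward by 3 workdays */
      d = d + 1;
      if put(d, minioff.) = 'Y' then i = i - 1;   /* a day off does not count */
      put i= d= date9.;                           /* trace counter and date */
   end;
run;
```

Following the trace line by line, the counter stays at 1 while the date passes over Saturday and Sunday, and it reaches 3 only on Tuesday, 10 December: the loop body runs five times to cover three workdays.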

This simple technique can be modularized by implementing it as a SAS user-defined function or a SAS data-step macro.

User-defined function to shift a date by a number of workdays

Here is the user-defined function shiftwd() that shifts the beginning date given in the first argument, from_date, by the number of workdays given in the second argument, shift_by. The second argument can be positive or negative: a positive value advances the date (incrementing), and a negative value moves it back (decrementing). Both arguments can be either variable names or numerals representing whole numbers.

libname funclib 'c:\projects\shift\functions';
proc fcmp outlib=funclib.funcs.dates; 
   function shiftwd(from_date, shift_by); 
      d = from_date; 
      do i=1 to abs(shift_by); 
         d = d + sign(shift_by); 
         if put(d,dayoff.)='Y' then i = i - 1; 
      end; 
      return(d); 
   endfunc; 
run;

Function usage example:

libname funclib 'c:\projects\shift\functions';
options cmplib= funclib.funcs;
 
data EVENTS_WITH_SHIFTS;
   set EVENTS;
   BEFORE_DATE = shiftwd(EVENT_DATE,-10); /* Decrement EVENT_DATE by 10 workdays */ 
   AFTER_DATE  = shiftwd(EVENT_DATE, 12); /* Increment EVENT_DATE by 12 workdays */ 
   format BEFORE_DATE AFTER_DATE date9.;
run;
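For a quick experiment that does not require a permanent library, the same logic can be stored in the WORK library and exercised on a few known dates. This is a sketch under assumptions, not the author's setup: the format wkndoff and the function shiftdemo are hypothetical names chosen so that nothing here overwrites dayoff or shiftwd(). Note that the OPTIONS statement replaces any CMPLIB setting already in effect.

```sas
/* Hypothetical mini-calendar: one weekend marked as days off */
proc format;
   value wkndoff
      '07DEC2019'd, '08DEC2019'd = 'Y'
      other = 'N';
run;

/* Same logic as shiftwd(), stored in WORK so no permanent library is needed */
proc fcmp outlib=work.funcs.dates;
   function shiftdemo(from_date, shift_by);
      d = from_date;
      do i = 1 to abs(shift_by);
         d = d + sign(shift_by);
         if put(d, wkndoff.) = 'Y' then i = i - 1;
      end;
      return(d);
   endfunc;
run;

options cmplib=work.funcs;

data _null_;
   start = '06DEC2019'd;            /* Friday */
   fwd   = shiftdemo(start,  1);    /* one workday forward, across the weekend */
   back  = shiftdemo(fwd,   -1);    /* one workday back, across the weekend again */
   same  = shiftdemo(start,  0);    /* zero shift: the loop body never runs */
   put start= date9. fwd= date9. back= date9. same= date9.;
run;
```

Shifting forward and then back by the same number of workdays is a handy sanity check when the starting date is itself a workday. With a zero shift the function simply returns the starting date, whether or not it is a day off.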

SAS macro to shift a date by a number of workdays

Similarly, the same can be implemented as a data-step macro:

%macro shiftwd (fromvar=,endvar=,wdays=,sign=);
   &endvar = &fromvar;
   do i=1 to &wdays;
      &endvar = &endvar &sign 1;
      if put(&endvar, dayoff.)='Y' then i = i - 1;  
   end;
   drop i;
%mend;

This macro has 4 required parameters:

  • fromvar - variable name of the beginning date;
  • endvar - variable name of the ending date;
  • wdays - variable name or numeral representing number of workdays to move from the beginning date;
  • sign - operation sign defining direction of the date move (+ for incrementing, - for decrementing).

Macro usage example:

data EVENTS_WITH_SHIFTS;
   set EVENTS;
   %shiftwd(fromvar=EVENT_DATE,endvar=BEFORE_DATE,wdays=10,sign=-); /* Decrement EVENT_DATE by 10 workdays */ 
   %shiftwd(fromvar=EVENT_DATE,endvar=AFTER_DATE, wdays=12,sign=+); /* Increment EVENT_DATE by 12 workdays */ 
   format BEFORE_DATE AFTER_DATE date9.;
run;

Related materials

Calculating the number of working days between two dates (Blog post)

Custom Time Intervals (SAS Documentation)

Your thoughts?

Do you find this material useful? How do you handle the task of adding or subtracting workdays from a date? Please share in the comments section below.

Shifting a date by a given number of workdays was published on SAS Users.

January 23, 2020
 

When I attend a conference, one of the first things I do is look at the agenda. This gives me a good overview of how my time will be spent.

The next thing I do is find the detailed breakdown of sessions, so I can start building out my own personal agenda. I know my areas of interest, and I want to make sure my time is spent learning as much as possible. I’ve done this at industry conferences, as well as every SAS Global Forum I’ve attended (and I’ve attended a lot).

I am happy to report that the session catalog for SAS Global Forum 2020 helps me understand what sessions are available so I can make the most of my conference experience. Here are my tips to make the most of the session catalog:

Use filters

For the first time, all programs are represented in the catalog. And anyone can attend any session. You can start broad, then filter by level of expertise, topic of interest, industry, product, or style of presentation. Filters are your friend…use them!

Use the interest list

You can bookmark sessions to your interest list, but beware that these do not automatically save for you. Be sure to use the export feature. Then you can share your chosen sessions with friends, print them, and cross-check them as you’re building your agenda in the mobile app starting March 1. It’s a great tool.

Review sample agendas

Speaking of agendas, it’s sometimes hard to narrow down your options. Start with one of the custom-tailored agendas within the session catalog and use these as a starting point to help complete your personal agenda. We keep adding new sample agendas, so you will have many great sessions to pick from!

Solicit the help of your SAS point of contact

If you’re a first-timer, or someone who is overwhelmed by too many choices, let your SAS contact help. They can help select sessions that they feel will be beneficial to your learning and networking.

Moral of the story

If you can’t tell, I’m a big fan of the session catalog. It’s totally customizable. The way I use it may not be the way you use it and that’s fine. That’s why there are so many ways to search, filter and save. Just give it a try. Your SAS Global Forum experience will be all the better when you include the session catalog.

And remember, anyone can check out the session catalog. Not planning to attend? Well, boo, go ahead and see what you’re missing. If you’re on the fence about the conference, check out all the great things SAS Global Forum has to offer, then go register!

Make the most of SAS Global Forum with the session catalog was published on SAS Users.

December 9, 2019
 

Building on my last post, How to create checklist tables in SAS®, this one shows you how to compare SAS data sets that include common and uncommon columns. You'll learn how to visualize commonalities and differences in data table columns side by side.

As before, we're working with a comparison matrix (aka checklist table) where check-marks / x-marks indicate included / excluded columns.

Data tables will be the comparable products, while their columns (variables) will represent product features. We'll add background color to highlight which attributes differ in the common columns. Since a given column might differ in several attributes, we will use the hierarchy type → length → label and indicate only the highest mismatched level. For example:

  • If same-named columns have different types (Numeric vs. Character), their corresponding check-marks will be shown on a light-red background, which indicates the highest degree of mismatch.
  • If same-named columns have the same type, a yellow background will indicate any difference in variable length.
  • When the type and length of same-named variables match, a light-green background marks any difference in variable labels.

SAS code to create color-enhanced comparison matrix

Let’s compare variable attributes in two data tables: one is the SAS-supplied SASHELP.CARS, and the other is WORK.NEWCARS, which I derive from the first one by slightly scrambling its column definitions:

data WORK.NEWCARS (drop=temp:);
   set SASHELP.CARS (rename=(Origin=Region EngineSize=temp1 Make=temp2));
   length EngineSize $3 Make $20;
   EngineSize = put(temp1,3.1);
   Make = temp2; 
   label Type='New Car Type';
run;

In this NEWCARS data table, I did the following:

  • Replaced column name Origin with Region
  • Changed type of column EngineSize from Numeric to Character
  • Changed length of column Make from $13 to $20
  • Changed label of column Type from blank to “New Car Type”
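One way to confirm these four changes before building the matrix is to look at the column metadata directly. The sketch below assumes WORK.NEWCARS has been created by the step above and queries the DICTIONARY.COLUMNS table available in PROC SQL. The list of five column names is chosen only to cover the changes just described.

```sas
proc sql;
   /* Attributes of the affected columns in both tables, side by side */
   select libname, memname, name, type, length, label
      from dictionary.columns
      where ( (libname='SASHELP' and memname='CARS') or
              (libname='WORK'    and memname='NEWCARS') )
        and upcase(name) in ('ORIGIN','REGION','ENGINESIZE','MAKE','TYPE')
      order by name, libname;
quit;
```

Sorting by name puts the SASHELP and WORK versions of each column on adjacent rows, which makes differences in type, length, and label easy to spot.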

Now let’s build the comparison matrix:

proc contents data=SASHELP.CARS noprint out=DS1(keep=Name Type Length Label);
run;
 
proc contents data=WORK.NEWCARS noprint out=DS2(keep=Name Type Length Label);
run;
 
data comparison_matrix;
   merge
      DS1(in=in1 rename=(Type=Typ1 Length=Len1 Label=Lab1))
      DS2(in=in2 rename=(Type=Typ2 Length=Len2 Label=Lab2));
   by Name;
 
   /* set symbol shape: 1=V; 0=X */
   ds1 = 1; ds2 = 1;
   if in1 and not in2 then ds2 = 0; else
   if in2 and not in1 then ds1 = 0;
 
   /* add background color */
   if ds1=ds2=1 then
   select;
      when(Typ1^=Typ2) do; ds1=2; ds2=2; end;
      when(Len1^=Len2) do; ds1=3; ds2=3; end;
      when(Lab1^=Lab2) do; ds1=4; ds2=4; end;
      otherwise; 
   end;
 
   label
      Name = 'Column Name'
      ds1 = 'SASHELP.CARS'
      ds2 = 'WORK.NEWCARS'
      ;
run;
 
proc format;
   value chmark
      0   = '(*ESC*){unicode "2718"x}'
      1-4 = '(*ESC*){unicode "2714"x}'
      ;
   value chcolor
      0   = red
      1-4 = green
      ;
   value bgcolor
      2 = 'cxffccbb'
      3 = 'cxffe177'
      4 = 'cxd4f8d4' 
      ;
run;
 
ods html path='c:\temp' file='comp_matrix.html' style=Seaside;
ods escapechar='^';
title 'Data set columns comparison matrix';
 
proc odstext;
   p '<div align="center">Mismatch Legend:'||
     '<span style="background-color:#ffccbb;margin-left:17px">^_^_^_^_</span> Type'||
     '<span style="background-color:#ffe177;margin-left:17px">^_^_^_^_</span> Length'||
     '<span style="background-color:#d4f8d4;margin-left:17px">^_^_^_^_</span> Label</div>'
   / style=[fontsize=9pt];
run;
 
title; 
proc print data=comparison_matrix label noobs;
   var Name / style={fontweight=bold width=100px};
   var ds1 ds2 / style={color=chcolor. backgroundcolor=bgcolor. just=center fontweight=bold width=120px};
   format ds1 ds2 chmark.;
run;
 
ods html close;

Here is a brief explanation of the code:

  1. Two PROC CONTENTS steps produce alphabetical lists (as data sets) of the data table column names, along with their attributes (type, length, label).
  2. The DATA step merges these two lists and creates variables DS1 and DS2, whose values indicate whether a column is present in the corresponding table (1 through 4) or absent (0). Values 2, 3, and 4 flag a type, length, or label mismatch in a common column.
  3. PROC FORMAT creates three user-defined formats, chmark, chcolor, and bgcolor, responsible for checkmark shape, checkmark color, and background color respectively. For checkmark shape, we use Unicode characters; for colors, we use both color names (e.g. red, green) and hexadecimal RGB color notation (e.g. 'cxFFCCBB').
  4. PROC ODSTEXT’s P statement is used to display color legend for the comparison matrix.
  5. Finally, PROC PRINT with user-defined formats produces our color-enhanced comparison matrix.
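The merge in step 2 relies on both lists arriving in the order that BY processing expects, and on same-named columns being spelled with identical case, since BY values are compared as-is. If you prefer not to depend on either, a defensive variant is to match on an upper-cased key and sort explicitly. The following is only a sketch of that idea: the variable NameKey and the data set names LIST1, LIST2, and MERGED_LISTS are hypothetical, and SASHELP.CLASS is used as the second table simply because it is SAS-supplied.

```sas
/* Column lists for two SAS-supplied tables (any pair would do) */
proc contents data=SASHELP.CARS  noprint out=LIST1(keep=Name Type Length Label);
run;
proc contents data=SASHELP.CLASS noprint out=LIST2(keep=Name Type Length Label);
run;

/* NameKey: hypothetical helper, an upper-cased copy of Name used only for matching */
data LIST1; set LIST1; length NameKey $32; NameKey = upcase(Name); run;
data LIST2; set LIST2; length NameKey $32; NameKey = upcase(Name); run;

/* Explicit sort, so the merge does not depend on the order PROC CONTENTS wrote */
proc sort data=LIST1; by NameKey; run;
proc sort data=LIST2; by NameKey; run;

data MERGED_LISTS;
   merge LIST1(in=in1 rename=(Name=Name1 Type=Typ1 Length=Len1 Label=Lab1))
         LIST2(in=in2 rename=(Name=Name2 Type=Typ2 Length=Len2 Label=Lab2));
   by NameKey;
   in_first  = in1;   /* 1 if the column exists in the first table  */
   in_second = in2;   /* 1 if the column exists in the second table */
run;
```

The original spelling of each name is kept in Name1 and Name2, so either one can still be used for display in the final report.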

Data tables comparison matrix – OUTPUT

The above code will generate the following HTML output with the comparison matrix for variables in two data sets:

Comparison matrix for common/uncommon variables in 2 datasets

Adding more detail to the comparison matrix chart

We can further enhance the comparison matrix by adding detailed descriptive information about differences between variable attributes. For a comprehensive view, we can add a COMMENTS column that spells out the differences (attribute mismatches). Whereas the color highlighting above shows only the single highest-level mismatch, the comments can list all discrepancies found. Simply add the following two pieces of SAS code:

1. Add the following group of statements to the above DATA step (right after the END statement that closes the SELECT group):

   length Comments $200;
   if ds1>1 then
   do;
      if Typ1^=Typ2 then Comments = catx(' ', Comments, 'Type1=',   Typ1, '; Type2=',   Typ2, ';');
      if Len1^=Len2 then Comments = catx(' ', Comments, 'Length1=', Len1, '; Length2=', Len2, ';');
      if Lab1^=Lab2 then Comments = catx(' ', Comments, 'Label1=',  Lab1, '; Label2=',  Lab2, ';');
   end;

Depending on your needs, the Comments column can be populated unconditionally; you would just need to remove the IF-THEN logic, keeping only:

length Comments $200;
Comments = catx(' ', Comments, 'Type1=',   Typ1, '; Type2=',   Typ2, ';');
Comments = catx(' ', Comments, 'Length1=', Len1, '; Length2=', Len2, ';');
Comments = catx(' ', Comments, 'Label1=',  Lab1, '; Label2=',  Lab2, ';');

2. Add the following statement to the above PROC PRINT (right before the FORMAT statement):

var comments / style={width=250px};

Then your HTML output will look as follows:

Detailed comparison matrix for common/uncommon variables in 2 datasets
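In the PROC CONTENTS output data set, TYPE is stored as a number (1 for numeric, 2 for character), so the type comments show 1 and 2 rather than words. If words are preferred, a small format can translate them. The sketch below is one possible approach, not part of the original code: the format name typname is hypothetical, and the hard-coded Typ1 and Typ2 values stand in for the merged data.

```sas
/* Hypothetical format translating PROC CONTENTS type codes into words */
proc format;
   value typname
      1 = 'Num'
      2 = 'Char';
run;

data _null_;
   Typ1 = 1;                     /* stand-in values for illustration */
   Typ2 = 2;
   length Comments $200;
   if Typ1 ^= Typ2 then
      Comments = catx(' ', Comments,
                      'Type1=',   put(Typ1, typname.),
                      '; Type2=', put(Typ2, typname.), ';');
   put Comments=;                /* inspect the resulting text in the log */
run;
```

In the comparison matrix DATA step, the same put(Typ1, typname.) and put(Typ2, typname.) calls could replace the bare Typ1 and Typ2 arguments.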

Conclusion

Comparison matrix charts are a convenient tool for data development and metadata validation when you're comparing a data table’s metadata against requirements descriptions.

They allow us to quickly identify tables’ common and uncommon variables, as well as inconsistencies in common variables by type, length and other attributes, such as labels and formats.

We can easily add detailed descriptive information when needed.

On a related note

While this post focused on visualizing SAS data sets comparison vis-à-vis common and uncommon columns, it's worth noting SAS websites have plenty of info on finding common variables (or columns) in data sets. For example:

Your thoughts?

Do you find this material useful? What other usages of the checklist tables and color-enhanced comparison matrices can you suggest?

How to compare SAS data tables for common/uncommon columns was published on SAS Users.