Tech

December 2016
 

Problem Solvers

Being able to access information about your operating system from within SAS can be incredibly useful. For example, if you have a directory that contains an unknown number of CSV files that you would like to read into SAS, you certainly would not want to write an IMPORT procedure for each of the files. It would be great if SAS just knew the names of all the files as well as how many there were. That would make this task much less tedious.

Fortunately, there is a straightforward way to accomplish this task within SAS. One of the easiest ways to gather information from your operating system from within SAS is to use a pipe (|). Pipes enable your SAS application to do the following:

  • Receive input from any operating-system command that writes to standard output.
  • Route output to any operating-system command that reads from standard input.

In other words, a pipe can be used to capture the output of an operating-system command and put that output into SAS, which gives you the ability to easily run PROC IMPORT for each CSV file in your directory.

Before we take a look at an example, let’s look at the syntax needed to use a pipe. To use a pipe on a directory-based operating system, you just need to issue a FILENAME statement with the following syntax:

FILENAME fileref PIPE 'operating-system-command' option-list;

Here is additional information about the syntax above:

fileref: Can be any valid fileref, as described in Referencing External Files.

PIPE: Is the device-type keyword that tells SAS that you want to use an unnamed pipe.

‘operating-system-command’: Is the name of an operating-system command, executable program, or shell script that you want to route output to or that you want to read input from. You must enclose the command in quotation marks.

option-list: Can be any of the options that are valid in the FILENAME statement, such as the LRECL= or RECFM= options.  Options that affect the operating-system command can also be listed here.
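For example, a minimal FILENAME statement with a pipe, followed by a DATA step that reads the piped output, might look like this (the fileref DIRLIST, the c:\temp directory, and the DIR command's /B option, which returns a bare list of filenames, are used purely for illustration):

filename dirlist pipe 'dir "c:\temp" /b' lrecl=32767;

data filelist;
  infile dirlist truncover;
  input fname $256.;
run;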

So let’s take a look at a pipe in action.

Using a pipe to return the output of an operating system command to SAS® software

In the following example, I have an unknown number of CSV files in the MYFILES subdirectory. I want to read in each file and create an output data set that has the same name as the CSV file. Also, I am only interested in the files that were created within the last year.

To do this import, I am going to use a pipe, which enables me to use the DIR operating-system command to list all the files in the directory. The output of the DIR command shows each filename and its create date. Based on the create date, I can pass each filename to a macro that contains PROC IMPORT. Then, PROC IMPORT reads in each file and outputs it to a data set with the desired name.

Here is the code:

Note: The code below contains numbers that are enclosed in parentheses; these numbers correspond to further explanation below the code block.

%let mydir=c:\myfiles;                             /* (1) */

%macro mymac(fname,dsname);                        /* (2) */

proc import datafile="&mydir\&fname" out=&dsname
  dbms=csv replace;
  getnames=yes;
run;

%mend;

filename myfiles pipe "dir &mydir /T:C";           /* (3) */

data _null_;
  infile myfiles truncover;
  input;                                           /* (4) */
  if index(lowcase(_infile_),'.csv') then do;      /* (5) */
    date=input(scan(_infile_,1,' '),mmddyy10.);    /* (6) */
    fname=substr(_infile_,38);                     /* (7) */
    dsname=compress(tranwrd(strip(scan(fname,1,'.')),'_',' '),,'p');  /* (8) */
    dsname=tranwrd(strip(dsname),' ','_');
    if intck('year',date,today())<=1 then do;      /* (9) */
      call=cats('%mymac(',fname,',',dsname,')');
      call execute(call);                          /* (10) */
    end;
  end;
run;

  1. The MYDIR macro variable lists the directory that contains the CSV files.
  2. The MYMAC macro contains the PROC IMPORT code needed to read in the CSV files. The macro contains the parameters FNAME and DSNAME. FNAME is the name of the CSV file being read in, and DSNAME is the name of the output data set being created.
  3. The FILENAME statement uses a pipe, which enables me to use the DIR operating-system command. This command shows the filename and the create date. Note that the ‘/T:C’ option is needed to show the create date. The default date returned by the DIR operating-system command is the last-modified date.
  4. The INPUT statement does not list any variables, so it loads the entire record into the input record buffer.
  5. _INFILE_ is an automatic variable in the DATA step that contains the contents of the current input record buffer. This buffer is created by the INPUT statement. In this example, I use the INDEX function to determine if the file has a .csv extension.
  6. When you run the DIR command, the date is always the first value listed. I use the SCAN function to extract the date and the INPUT function with the MMDDYY10. informat to convert the extracted text to a SAS date value.
  7. Before you can extract the filename, you have to search the output from the DIR command to find the column number that contains the filename value. In this example, the filename starts in the 38th column of the output file. The SUBSTR function specifies this column number and enables me to extract the filename.
  8. DATA-step logic is used to create the output data-set name. To make sure that the name contains no invalid characters, such as spaces or punctuation other than underscores, I use the COMPRESS, TRANWRD, and STRIP functions to remove or replace the characters that should not be included.
  9. I use the INTCK function to select only the files whose create date is a year or less before today's date before I execute the macro.
  10. CALL EXECUTE invokes the MYMAC macro. CALL EXECUTE enables me to pass in the values of FNAME and DSNAME as the values of the macro’s parameters and conditionally execute the macro based on DATA-step logic.
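To make step 10 more concrete: if the directory contained a file named sales_2016.csv (a hypothetical name used only for illustration), the text built by the CATS function and pushed onto the input stack by CALL EXECUTE would be the following macro call, which runs after the DATA step finishes:

%mymac(sales_2016.csv,sales_2016)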

As this example shows, using a pipe to capture the output of the DIR operating-system command can be extremely useful. In this example, it enables me to dynamically run code based on an unknown number of files that meet a certain condition. Using a pipe to get the file information for you can drastically reduce the amount of time it would take to perform this task manually.

 

 

tags: Problem Solvers, SAS Programmers

Using a pipe to return the output of an operating system command to SAS® software was published on SAS Users.

August 2016
 

DataFlux Data Quality profiling metrics in SAS Visual Analytics

In this post, I continue the journey of getting data profiling results into SAS Visual Analytics. In my first post I described the process of collecting DataFlux Data Quality profiling metrics to load a datamart. Now we load this datamart into memory (LASR) and then plug a VA report on top to share data quality results widely and wisely.

Overall Process

The process has two steps:

  1. Extract data from the DataFlux repository and build a DQ profiling datamart (discussed in the first post).

  2. Load the DQ profiling datamart into memory and plug VA reports on top of it.

Here is the master DataFlux Process Job (called _loadProfilingInVA) that orchestrates the 2 steps.

[Image: the master DataFlux process job _loadProfilingInVA, which orchestrates the two steps]

The DQ profiling datamart is made of five tables, which are SAS data sets. The goal is to load them into LASR. Since we did the first step using a DataFlux process job, we will use another process job for this step as well, so that we have a consistent process that can be orchestrated in the master DataFlux process job shown above.

Loading data into LASR is very simple, especially if we assume that those tables have already been loaded manually once in SAS Visual Analytics Administrator, in order to create all the necessary metadata (tables, jobs, etc.). If this is not the case, it's not a big deal to do it automatically using PROC METALIB (a sketch follows below). Once this initial step is done, we just want to refresh these tables in memory each time we have new data profiling results to share. That means we have to unload the old data from memory and then load the new data into memory.
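As a rough sketch only (DQMART is an assumed libref for the datamart library, already assigned and registered in the SAS Metadata Server), registering or updating the table definitions with PROC METALIB could look like this:

/* DQMART is a hypothetical libref, already assigned and registered in metadata */
proc metalib;
  omr (library="DQMART");
run;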

Even though we want to do that from DataFlux Data Management Studio, we need a SAS engine somewhere to execute SAS statements. DataFlux DMS does not provide nodes to load data directly into LASR, but it does provide nodes (available in process jobs, not in data jobs) to execute SAS code against an existing SAS Workspace Server.

[Image: DataFlux process job nodes for executing SAS code on a SAS Workspace Server]

My technical environment is quite simple: everything is on the same box (DataFlux Data Management Studio, LASR, VA, the SAS Workspace Server, the DQ profiling datamart SAS data sets, and so on). No risk! But in many SAS environments it may not be as simple, and the following questions might arise:

  • Where are my DQ profiling datamart SAS data sets? They are on the same box as DataFlux Data Management Studio/Server, which is probably not the same box as the SAS Workspace Server.
  • Is my SAS Workspace Server able to read those remote tables? A data transfer is probably needed.
  • Is my SAS Workspace Server able to load data into LASR? There are limitations when the SAS Workspace Server is not on the same box as the LASR master node, but I won't go into the details here; resources are already available that deal with this.

Back to my basic mono-box configuration. Here is an example of how to set up the SAS Workspace Server to be called from DataFlux Data Management Studio/Server:
[Image: SAS Workspace Server connection settings in DataFlux Data Management Studio]

Here is the SAS code to unload data from memory:

[Image: SAS code that unloads the datamart tables from memory]
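The screenshot is not reproduced here, but as a minimal sketch (the libref, host, port, tag, and table name below are assumptions, not the values from my environment), unloading a table can be done by deleting it through the SASIOLA engine:

/* Assumed connection details for the LASR Analytic Server */
libname valasr sasiola host="localhost" port=10010 tag="vapublic";

/* Drop the old in-memory copy of one of the datamart tables */
proc datasets lib=valasr nolist;
  delete profile_summary;   /* placeholder table name */
quit;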

And here is the code to load data into memory:

[Image: SAS code that loads the datamart tables into memory]
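Again, the screenshot is not reproduced here; a minimal sketch of reloading a table into memory (the on-disk libref, path, and table name are assumptions) looks like this:

/* Assumed libref pointing to the on-disk DQ profiling datamart */
libname dqmart "C:\dq_datamart";

/* Load the new version of the table into LASR through the SASIOLA engine */
data valasr.profile_summary;
  set dqmart.profile_summary;
run;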

Now, I’m done with the data preparation phase. I’ve executed my master DataFlux Process Job (called _loadProfilingInVA) which:

  • Extracts data profiling results from the DataFlux repository
  • Creates a set of 5 output tables
  • Unloads the 5 old tables from LASR
  • Reloads the 5 new tables into LASR

I’m ready for the reporting phase, and I have prepared a VA report with multiple sections that leverage this DQ profiling datamart. Let’s have a look at what it looks like:

An amazing welcome screen

[Image: welcome screen of the VA report]

An overview of all the profiling metrics from all the profile job runs

[Image: overview of profiling metrics across all profile job runs]

Details of profiling metrics by table for a specific run

[Image: profiling metrics by table for a specific run]

An historical view of profiling metrics

[Image: historical view of profiling metrics]

A view of value and pattern frequency distribution

[Image: value and pattern frequency distributions]

A view of outliers and percentiles

[Image: outliers and percentiles]

tags: data management, SAS Professional Services, SAS Visual Analytics

Data Governance Series: Share DataFlux Data Quality profiling metrics in SAS Visual Analytics (Part 2) was published on SAS Users.

May 2016
 

In my previous post, Introducing data-driven loops, I suggested a way of implementing programming loops with a list of index variables pulled from an external data table. These ordinary programming loops iterate during code execution while processing some data elements of an input data table.

SAS macro loops, on the other hand, are completely different creatures, as they do not iterate during execution time but rather during code compilation time. That makes SAS macro loops a powerful code-generation tool, allowing you to produce multiple variations of a SAS code snippet while actually writing it just once.

Syntactically speaking, while SAS programming loops (do-loops) always reside within a SAS data step, SAS macro loops (%do-loops) are located within SAS macros. They can be placed either within a data step (or proc step), generating multiple data/proc step statements, or outside a data/proc step, generating multiple data/proc steps or global statements.
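As a quick illustration of the second case (a simple sketch, separate from the data-driven examples below), the following macro generates three complete TITLE statements and PROC PRINT steps at compilation time, one per %do-loop iteration:

%macro printloop(n);
  %local i;
  %do i=1 %to &n;
    /* each iteration generates a complete TITLE statement and PROC PRINT step */
    title "First &i observation(s) of sashelp.class";
    proc print data=sashelp.class(obs=&i);
    run;
  %end;
%mend printloop;

%printloop(3)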

Implementing SAS macro loops

To make a macro loop driven by data, we can use two index macro variables: the first one (the primary index) iterates from 1 to n, incrementing by 1, effectively going through the observations of a driver table; the other macro variable (the secondary index) gets its values from the driver variable and serves as the true data-driven index for our macro loop. The following figure illustrates this concept.

[Figure: a primary index iterating from 1 to n over the driver table observations, and a secondary index taking its values from the driver variable]

SAS macro loops containing data or proc steps

For example, we have the data table sashelp.prdsale, which looks like this:

[Image: sample observations of sashelp.prdsale]

Suppose we need to produce, in a separate HTML file for each country, a chart showing actual product sales by year.

Here is how this can be done the data-driven way without any hard-coding:

/* output files location */
filename odsout "C:\PROJECTS_BLOG_SAS\data-driven-macro-loops\html";

/* get unique countries */
proc sort data=sashelp.prdsale(keep=COUNTRY) out=work.countries nodupkey;
  by COUNTRY;
run;

%macro loop;

  %local num i cntry;

  /* get number of countries */
  %let dsid = %sysfunc(open(work.countries));
  %let num  = %sysfunc(attrn(&dsid,nlobs));
  %let rc   = %sysfunc(close(&dsid));

  %do i=1 %to &num;

    data _null_;
      p = &i;
      set work.countries point=p;
      call symputx('cntry',COUNTRY);
      stop;
    run;

    ods html path=odsout file="report_&cntry..html"  style=styles.seaside;
    goptions reset=all device=actximg colors=() htext=9pt hsize=5in vsize=3in;
    title1 "Product sales by year for &cntry";
    axis1 minor=none label=('Actual Sales');

    proc gchart data=sashelp.prdsale(where=(COUNTRY eq "&cntry"));
      vbar YEAR /
      sumvar = ACTUAL
      width = 10
      outside = sum
      raxis = axis1
      cframe = white nozero discrete
      ;
      format ACTUAL dollar12.0;
    run;
    quit;

    ods html close;

  %end;

%mend loop;
%loop;

The highlights of this code are:

  1. Using proc sort with the nodupkey option, we create a table work.countries of unique COUNTRY values. This can be done using proc sql as well.
  2. We determine the number of unique COUNTRY values, &num.
  3. Within the macro called %loop, we use the primary index – macro variable &i – to iterate from 1 to &num with an increment of 1.
  4. We use a data _null_ step within that loop to sequentially read the values of COUNTRY, using direct access to the observations of the work.countries table by means of the point= option. For each iteration &i of the %do-loop, we create the secondary index – macro variable &cntry – which is used as the true index for our loop.
  5. During code compilation, the SAS macro processor loops through the %do-loop &num times, repeatedly generating the SAS code within it, each time with a new value of &cntry, thus accomplishing our task.

This implementation of the macro %do-loop works perfectly fine, except in situations when we need to use it within a data/proc step. The problem is the data _null_ step that converts the primary index &i into the secondary index &cntry, since we can’t use a data step within another data step.

SAS macro loop within data or proc step

Let’s solve the following coding problem. Suppose we have to create a SALE_CATEGORY variable on our sashelp.prdsale table, something that you would routinely code like this:

data work.prdsale;
  set sashelp.prdsale;
  if ACTUAL < 50 then SALE_CATEGORY = 'A'; else
  if ACTUAL < 200 then SALE_CATEGORY = 'B'; else
  if ACTUAL < 500 then SALE_CATEGORY = 'C'; else
  if ACTUAL < 700 then SALE_CATEGORY = 'D'; else
  if ACTUAL < 900 then SALE_CATEGORY = 'E'; else
  if ACTUAL < 2000 then SALE_CATEGORY = 'F';
run;

What is wrong with this code? Nothing. Except that when a category definition changes, you have to find every place in your code where that change applies. Besides, if the number of categories is large, the code becomes large too.

Let’s implement this the data-driven way, without any hard-coded values. Notice that in the code above we have multiple if-then-else statements of the same pattern repeated over and over, and thus they can be generated via a %do-loop.

Let’s create the following driver table that contains boundary and sale category definitions that match the above hard-coded data step:

[Image: the work.salecategory driver table, with columns UPBOUNDARY and SALECAT]
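The driver table appears above only as a screenshot; here is a sketch of how an equivalent work.salecategory table could be created with DATALINES, using the variable names that the macro below reads (UPBOUNDARY and SALECAT) and the boundaries from the hard-coded step above:

data work.salecategory;
  length upboundary 8 salecat $1;
  input upboundary salecat $;
  datalines;
50 A
200 B
500 C
700 D
900 E
2000 F
;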

The data-driven macro loop can be implemented using the following code:

%macro mloop;

  /* get observations number - num, variable numbers - vnum1, vnum2, */
  /* variable type - vtype2, getfunc = getvarC or getvarN            */
  %let dsid = %sysfunc(open(work.salecategory));
  %let num  = %sysfunc(attrn(&dsid,nlobs));
  %let vnum1 = %sysfunc(varnum(&dsid,upboundary));
  %let vnum2 = %sysfunc(varnum(&dsid,salecat));
  %let vtype2 = %sysfunc(vartype(&dsid,&vnum2));
  %let getfunc = getvar&vtype2;

data work.prdsale;
  set sashelp.prdsale;

  %do i=1 %to &num;

    /* get upboundary and salecat values from the driver table work.salecategory */
    /* and assign them to the upper and categ macro variables                    */
    %let rc = %sysfunc(fetchobs(&dsid,&i));
    %let upper = %sysfunc(getvarn(&dsid,&vnum1));
    %let categ = %sysfunc(&getfunc(&dsid,&vnum2));
    %if &vtype2 eq C %then %let categ = "&categ";

    /* generate if ... then ...; else statements */  
    if ACTUAL < &upper then SALE_CATEGORY = &categ;
    %if (&i ne &num) %then %str(else);

  %end;
  %let rc = %sysfunc(close(&dsid));

run;

%mend mloop;
%mloop;

With a little overhead from several %sysfunc calls to SAS Component Language (SCL) functions, we effectively generate a set of if-then-else statements based on the values in the driver table. Notably, even if the number of categories grows to hundreds, the code does not have to change at all.

Of course, this approach can be used for any SAS code-generation effort where a SAS code pattern repeats.

Bonus

As a bonus to those who were patient enough to bear with me to the end, here is a macro equivalent of the data _null_ step used in the first section, SAS macro loops containing data or proc steps:

Data step:

data _null_;
  p = &i;
  set work.countries point=p;
  call symputx('cntry',COUNTRY);
  stop;
run;

Macro equivalent:

%let dsid  = %sysfunc(open(work.countries));
%let vnum  = %sysfunc(varnum(&dsid,COUNTRY));
%let rc    = %sysfunc(fetchobs(&dsid,&i));
%let cntry = %sysfunc(getvarc(&dsid,&vnum));
%let rc    = %sysfunc(close(&dsid));

Thoughts

Please share your thoughts and comments.

tags: SAS Macro, SAS Professional Services, SAS Programmers

Data-driven SAS macro loops was published on SAS Users.