6月 152021
 

Dealing with big dataIn this fast-paced data age, when the sheer volume of data (generated, collected, and waiting to be processed and analyzed) grows at a breathtaking rate, the speed of data processing becomes critically important. In many cases, if data is not processed within an allotted time frame, we lose all its value as it becomes obsolete and ultimately irrelevant. That is why computing power becomes of the essence.

However, computing power itself does not guarantee timely processing. How we use that power makes all the difference. Way too often good old sequential processing just does not cut it anymore and different computing methods are required. One  such method is parallel processing.

In my previous post Using shell scripts for massively parallel processing I demonstrated a script-centered technique of running in parallel multiple independent SAS processes in SAS environments lacking SAS/CONNECT.

In this post, we will take a shot at a slightly different task and solution. Instead of having several totally independent processes, now we have some common “pre-processing” part, then we run several independent processes in parallel, and then we combine the results of parallel processing in the “post-processing” portion of our program.

Problem: monthly data ingestion use case

For simplification, we are going to use a scenario similar to one in the previous blog post:

Each month, shortly after the end of the previous month we needed to ingest a number of CSV files pertinent to transactions during the previous month and produce daily SAS data tables for each day of the previous month. Only now, we will go a step further: combining all those daily tables into a monthly table.

Solution: combining sequential and parallel processing

The solution is comprised of the three major components:

  • Shell script running the main SAS program.
  • Main SAS program, consisting of three parts: pre-parallel processing, parallel processing, and post-parallel processing.
  • Single thread SAS program responsible for a single day data ingestion.

1. Shell script running main SAS program

Below shell script mainprog.sh runs the main SAS program mainprog.sas:

#!/bin/sh
 
# HOW TO CALL:
# nohup sh /path/mainprog.sh YYYYMM &
 
now=$(date +%Y.%m.%d_%H.%M.%S)
 
# getting YYYYMM as a parameter in script call
ym=$1
 
pgmname=/path/mainprog.sas
logname=/path/saslogs/mainprog_$now.log
sas $pgmname -log $logname -set inDate $ym -set logname $logname

The script is run a background mode as indicated by the ampersand at the end of its invocation command:

nohup sh /path/mainprog.sh YYYYMM &

We pass a parameter YYYYMM (e.g. 202106) indicating year and month of our request.

When we call SAS program mainprog.sas within the script we indicate the name of the SAS log file to be created (-log $logname) and also pass on inDate parameter (-set inDate $ym, which has the same value YYYYMM as parameter specified in the script calling command), and logname parameter (-set logname $logname). As you will see further, we are going to use these two parameters within mainprog.sas program.

2. Main SAS program

Here is an abridged version of the mainprog.sas program:

/* ======= pre-processing ======= */
 
/* parameters passed from shell script */
%let inDate = %sysget(inDate);
%let logname = %sysget(logname);
 
/* year and month */
%let yyyy = %substr(inDate,1,4);
%let mm = %substr(inDate,5,2);
 
/* output data library */
libname SASDL '/data/target';
 
/* number of days in month mm of year yyyy */
%let days = %sysfunc(day(%sysfunc(mdy(&mm+1,1,&yyyy))-1));
 
/* ======= parallel processing ======= */
%macro loop;
   %local threadprog looplogdir logdt workpath tasklist i z threadlog cmd;
   %let threadprog = /path/thread.sas;
   %let looplogdir = %substr(&logname,1,%length(&logname)-4)_logs;
   x "mkdir &looplogdir"; *<- directory for loop logs;
   %let logdt = %substr(&logname,%length(&logname)-22,19);
   %let workpath = %sysfunc(pathname(WORK));
   %let tasklist=;
   %do i=1 %to &days;
      %let z = %sysfunc(putn(&i,z2.));
      %let threadlog = &looplogdir/thread_&z._&logdt..log;
      %let tasklist = &tasklist DAY&i;
      %let cmd = sas &threadprog -log &threadlog -set i &i -set workpath &workpath -set inDate &inDate;
      systask command "&cmd" taskname=DAY&i;
   %end;
 
   waitfor _all_ &tasklist;
 
%mend loop;
%loop
 
/* ======= post-processing ======= */
 
/* combine daily tables into one monthly table */
data SASDL.TARGET_&inDate;
   set WORK.TARGET_&inDate._1 - WORK.TARGET_&inDate._&days;
run;

The key highlights of this program are:

  • We capture values of the parameters passed to the program (inDate and logname).
  • Based on these parameters, assign source directory and target data library SASDL.
  • Calculate number of days in a specific month defined by year and month.
  • Create a directory to hold SAS logs of all parallel threads; the directory name is matching the log name of the mainprog.sas.
  • Capture the WORK library location of the main SAS session running mainprog.sas as:

    %let workpath = %sysfunc(pathname(WORK));We use that location in the thread sessions to pass back to the main session data produced by the thread sessions.

  • Macro %do-loop generates a series of SYSTASK statements to spawn additional SAS sessions in the background mode, each ingesting data for a single day of a month:

    systask command "&cmd" taskname=DAY&i;The SYSTASK statement enables you to execute host-specific commands from within your SAS session or application. Unlike the X statement, the SYSTASK statement runs these commands as asynchronous tasks, which means that these tasks execute independently of all other tasks that are currently running. Asynchronous tasks run in the background, so you can perform additional tasks (including launching other asynchronous tasks) while the asynchronous task is still running.

    Restriction: SYSTASK statement is not supported on the CAS server.

  • Also, we generate a cumulative list of all tasknames assigned to each thread sessions:

    %let tasklist = &tasklist DAY&i;

  • Outside the macro %do-loop we use WAITFOR statement which suspends execution of the main SAS session until the specified tasks finish executing. Since we created a list of all daily thread sessions (&tasklist), this will synchronize all our parallel threads and continue mainprog.sas session only when all threads finished executing.
  • At the end of the main SAS session we concatenate all our daily data tables that have been created by parallel threads in the location of the WORK library of the main SAS session.

Using SAS macro loop to generate a series of SYSTASK statements for parallel processing is not the only method available. Alternatively, you can achieve this within a data step using CALL EXECUTE. In this case, each data step iteration will generate a single global SYSTASK statement and push it out of the data step boundaries where they will be sequentially executed (just like in the case of macro implementation). Since option NOWAIT is the default for SYSTASK statements, despite all of them being launched sequentially, their corresponding OS commands will be still running in parallel.

3. Single thread SAS program

Here is an abridged version of the thread.sas program:

/* inDate parameter */
%let inDate = %sysget(inDate);
 
/* parent program's WORK library */
%let workpath = %sysget(workpath);
libname MAINWORK "&workpath";
 
/* thread number */
%let i = %sysget(i);
 
/* year and month */
%let yyyy = %substr(inDate,1,4);
%let mm = %substr(inDate,5,2);
 
/* source data directory */
%let srcdir = /datapath/&yyyy/&mm;
 
/* create varlist macro variable to list all input variable names */
proc sql noprint;
   select name into :varlist separated by ' ' from SASHELP.VCOLUMN
   where libname='PARMSDL' and memname='DATA_TEMPLATE';
quit;
 
/* create fileref inf for the source file */
filename inf "&srcdir/source_data_&inDate._day&i..cvs";
 
/* create daily output data set */
data MAINWORK.TARGET_&inDate._&i; 
   if 0 then set PARMSDL.DATA_TEMPLATE;
   infile inf missover dsd encoding='UTF-8' firstobs=2 obs=max;
   input &varlist;
run;

This program ingests a single .csv file corresponding to the &i-th day of &inDate (year and month) and creates a SAS data table MAINWORK.TARGET_&inDate._&i. To be available in the main SAS session the MAINWORK library is defined here in the same physical location as the WORK library of the main parental SAS session.

We also use a pre-created SAS data template PARMSDL.DATA_TEMPLATE - a zero-observations data set that contains descriptions of all the variables and their attributes.

Additional resources

Thoughts? Comments?

Do you find this post useful? Do you have processes that may benefit from parallelization? Please share with us below.

Using SYSTASK and SAS macro loops for massively parallel processing was published on SAS Users.

 Leave a Reply

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

(required)

(required)