parallel processing

June 15, 2021
 

In this fast-paced data age, when the sheer volume of data (generated, collected, and waiting to be processed and analyzed) grows at a breathtaking rate, the speed of data processing becomes critically important. In many cases, if data is not processed within an allotted time frame, we lose all its value as it becomes obsolete and ultimately irrelevant. That is why computing power is of the essence.

However, computing power itself does not guarantee timely processing. How we use that power makes all the difference. All too often, good old sequential processing just does not cut it anymore and different computing methods are required. One such method is parallel processing.

In my previous post Using shell scripts for massively parallel processing I demonstrated a script-centered technique of running in parallel multiple independent SAS processes in SAS environments lacking SAS/CONNECT.

In this post, we will take a shot at a slightly different task and solution. Instead of having several totally independent processes, now we have some common “pre-processing” part, then we run several independent processes in parallel, and then we combine the results of parallel processing in the “post-processing” portion of our program.

Problem: monthly data ingestion use case

For simplification, we are going to use a scenario similar to one in the previous blog post:

Each month, shortly after the previous month ends, we need to ingest a number of CSV files with that month's transactions and produce a SAS data table for each day of that month. Only now we will go a step further and combine all those daily tables into a monthly table.

Solution: combining sequential and parallel processing

The solution comprises three major components:

  • Shell script running the main SAS program.
  • Main SAS program, consisting of three parts: pre-parallel processing, parallel processing, and post-parallel processing.
  • Single thread SAS program responsible for a single day data ingestion.

1. Shell script running main SAS program

The shell script mainprog.sh below runs the main SAS program mainprog.sas:

#!/bin/sh
 
# HOW TO CALL:
# nohup sh /path/mainprog.sh YYYYMM &
 
now=$(date +%Y.%m.%d_%H.%M.%S)
 
# getting YYYYMM as a parameter in script call
ym=$1
 
pgmname=/path/mainprog.sas
logname=/path/saslogs/mainprog_$now.log
sas $pgmname -log $logname -set inDate $ym -set logname $logname

The script is run in background mode, as indicated by the ampersand at the end of its invocation command:

nohup sh /path/mainprog.sh YYYYMM &

We pass a parameter YYYYMM (e.g. 202106) indicating year and month of our request.

When we call SAS program mainprog.sas within the script we indicate the name of the SAS log file to be created (-log $logname) and also pass on inDate parameter (-set inDate $ym, which has the same value YYYYMM as parameter specified in the script calling command), and logname parameter (-set logname $logname). As you will see further, we are going to use these two parameters within mainprog.sas program.

2. Main SAS program

Here is an abridged version of the mainprog.sas program:

/* ======= pre-processing ======= */
 
/* parameters passed from shell script */
%let inDate = %sysget(inDate);
%let logname = %sysget(logname);
 
/* year and month */
%let yyyy = %substr(&inDate,1,4);
%let mm = %substr(&inDate,5,2);
 
/* output data library */
libname SASDL '/data/target';
 
/* number of days in month mm of year yyyy (day number of the month's last day; also works for December) */
%let days = %sysfunc(day(%sysfunc(intnx(month,%sysfunc(mdy(&mm,1,&yyyy)),0,e))));
 
/* ======= parallel processing ======= */
%macro loop;
   %local threadprog looplogdir logdt workpath tasklist i z threadlog cmd;
   %let threadprog = /path/thread.sas;
   %let looplogdir = %substr(&logname,1,%length(&logname)-4)_logs;
   x "mkdir &looplogdir"; *<- directory for loop logs;
   %let logdt = %substr(&logname,%length(&logname)-22,19);
   %let workpath = %sysfunc(pathname(WORK));
   %let tasklist=;
   %do i=1 %to &days;
      %let z = %sysfunc(putn(&i,z2.));
      %let threadlog = &looplogdir/thread_&z._&logdt..log;
      %let tasklist = &tasklist DAY&i;
      %let cmd = sas &threadprog -log &threadlog -set i &i -set workpath &workpath -set inDate &inDate;
      systask command "&cmd" taskname=DAY&i;
   %end;
 
   waitfor _all_ &tasklist;
 
%mend loop;
%loop
 
/* ======= post-processing ======= */
 
/* combine daily tables into one monthly table */
data SASDL.TARGET_&inDate;
   set WORK.TARGET_&inDate._1 - WORK.TARGET_&inDate._&days;
run;

The key highlights of this program are:

  • We capture values of the parameters passed to the program (inDate and logname).
  • Based on these parameters, assign source directory and target data library SASDL.
  • Calculate number of days in a specific month defined by year and month.
  • Create a directory to hold SAS logs of all parallel threads; the directory name matches the log name of mainprog.sas.
  • Capture the WORK library location of the main SAS session running mainprog.sas as:

    %let workpath = %sysfunc(pathname(WORK));

    We use that location in the thread sessions to pass the data they produce back to the main session.

  • Macro %do-loop generates a series of SYSTASK statements to spawn additional SAS sessions in the background mode, each ingesting data for a single day of a month:

    systask command "&cmd" taskname=DAY&i;

    The SYSTASK statement enables you to execute host-specific commands from within your SAS session or application. Unlike the X statement, the SYSTASK statement runs these commands as asynchronous tasks, which means that these tasks execute independently of all other tasks that are currently running. Asynchronous tasks run in the background, so you can perform additional tasks (including launching other asynchronous tasks) while the asynchronous task is still running.

    Restriction: SYSTASK statement is not supported on the CAS server.

  • Also, we generate a cumulative list of all task names assigned to the thread sessions:

    %let tasklist = &tasklist DAY&i;

  • Outside the macro %do-loop we use the WAITFOR statement, which suspends execution of the main SAS session until the specified tasks finish executing. Since we created a list of all daily thread sessions (&tasklist), this synchronizes all our parallel threads and resumes the mainprog.sas session only when all threads have finished executing.
  • At the end of the main SAS session we concatenate all our daily data tables that have been created by parallel threads in the location of the WORK library of the main SAS session.

Using SAS macro loop to generate a series of SYSTASK statements for parallel processing is not the only method available. Alternatively, you can achieve this within a data step using CALL EXECUTE. In this case, each data step iteration will generate a single global SYSTASK statement and push it out of the data step boundaries where they will be sequentially executed (just like in the case of macro implementation). Since option NOWAIT is the default for SYSTASK statements, despite all of them being launched sequentially, their corresponding OS commands will be still running in parallel.
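The CALL EXECUTE alternative described above might look like the following minimal sketch. It assumes that &days and &inDate are already set as in the pre-processing part of mainprog.sas, and it uses a hypothetical, already existing log directory (/path/saslogs/threads) with simplified thread log names; it is an illustration, not the author's code.

```sas
/* Sketch only: &days and &inDate are assumed to be set as in mainprog.sas; */
/* the log directory below is a hypothetical location that must exist.      */
%let threadprog = /path/thread.sas;
%let looplogdir = /path/saslogs/threads;
%let workpath   = %sysfunc(pathname(WORK));

data _null_;
   length cmd $500 tasklist $1000;
   do i = 1 to &days;
      /* build the OS command for one daily thread session */
      cmd = catx(' ', "sas &threadprog",
                 '-log', cats("&looplogdir/thread_", put(i, z2.), '.log'),
                 '-set i', i,
                 "-set workpath &workpath",
                 "-set inDate &inDate");
      /* queue one global SYSTASK statement; it runs after this step ends */
      call execute(cats('systask command "', cmd, '" taskname=DAY', i, ';'));
      /* accumulate the task names for the WAITFOR statement */
      tasklist = catx(' ', tasklist, cats('DAY', i));
   end;
   /* queue WAITFOR after all SYSTASK statements */
   call execute(catx(' ', 'waitfor _all_', tasklist, ';'));
run;
```

Because the WAITFOR statement is queued last, the post-processing code that follows this data step still starts only after all daily tasks have finished.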

3. Single thread SAS program

Here is an abridged version of the thread.sas program:

/* inDate parameter */
%let inDate = %sysget(inDate);
 
/* parent program's WORK library */
%let workpath = %sysget(workpath);
libname MAINWORK "&workpath";
 
/* thread number */
%let i = %sysget(i);
 
/* year and month */
%let yyyy = %substr(&inDate,1,4);
%let mm = %substr(&inDate,5,2);
 
/* source data directory */
%let srcdir = /datapath/&yyyy/&mm;
 
/* create varlist macro variable to list all input variable names */
proc sql noprint;
   select name into :varlist separated by ' ' from SASHELP.VCOLUMN
   where libname='PARMSDL' and memname='DATA_TEMPLATE';
quit;
 
/* create fileref inf for the source file */
filename inf "&srcdir/source_data_&inDate._day&i..csv";
 
/* create daily output data set */
data MAINWORK.TARGET_&inDate._&i; 
   if 0 then set PARMSDL.DATA_TEMPLATE;
   infile inf missover dsd encoding='UTF-8' firstobs=2 obs=max;
   input &varlist;
run;

This program ingests a single .csv file corresponding to the &i-th day of &inDate (year and month) and creates a SAS data table MAINWORK.TARGET_&inDate._&i. To make that table available in the main SAS session, the MAINWORK library is defined here in the same physical location as the WORK library of the parent SAS session.

We also use a pre-created SAS data template PARMSDL.DATA_TEMPLATE - a zero-observations data set that contains descriptions of all the variables and their attributes.
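For illustration, a zero-observation template of this kind could be created roughly as follows. This is only a sketch: the library path and all variable names, lengths, and formats are hypothetical, since the post does not show the actual template.

```sas
/* Sketch only: the library path and every variable below are hypothetical */
libname PARMSDL '/path/parms';

data PARMSDL.DATA_TEMPLATE;
   length TRANS_ID 8 TRANS_DATE 8 AMOUNT 8 CUSTOMER_NAME $ 40;
   informat TRANS_DATE yymmdd10.;          /* used later by list input     */
   format   TRANS_DATE yymmdd10. AMOUNT 12.2;
   call missing(of _all_);                 /* avoids "uninitialized" notes */
   stop;                                   /* write zero observations      */
run;
```

With `if 0 then set PARMSDL.DATA_TEMPLATE;` in the thread program, the variables' types, lengths, formats, and informats are picked up at compile time without reading any rows.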

Thoughts? Comments?

Do you find this post useful? Do you have processes that may benefit from parallelization? Please share with us below.

Using SYSTASK and SAS macro loops for massively parallel processing was published on SAS Users.

April 6, 2016
 

One of the hidden gems of SAS Studio is the ability to run process flows in parallel. This feature really shines when used in a grid environment. Let’s discuss this one step at a time.

First, what is a process flow? When working in the Visual Programmer perspective, you have access to process flows. A process flow is a graphical representation of a process, where each object, be it a SAS program, a SAS Studio task, a query, and so on, is represented by a node. Nodes are connected by links that instruct SAS Studio how to move from one node to the next.

Display 1. SAS Studio Process Flow

On the Properties tab of the current process flow, you can set the execution mode of the nodes. With the default setting, SAS Studio runs the nodes in the order in which they were added to the process flow. If node 2 is dependent on node 1, node 1 must run completely before node 2 will run.

You can change the execution mode to Parallel as shown in Display 2. When this value is set, SAS Studio uses multiple workspace servers to run the nodes concurrently, always enforcing the correct dependencies.

Display 2. Setting the Execution Mode to Parallel

When you use this feature in a SAS Grid environment, if the administrator has configured the workspace server sessions to be grid launched, you can achieve the benefits and the performance improvements of multi-machine parallel load balancing, without having to code any SAS/CONNECT statement. It’s a real point-and-click parallel execution engine!

Display 3 shows the process flow presented in Display 1 running in parallel execution mode. The pane is grayed out because it is not possible to interact with it until the execution is complete.

We can see that the List Data node is still running, while the Partition Data node has already finished. Thus, the two Filter Data nodes were able to start in parallel.

Display 3. Tasks Running in Parallel in a Process Flow

In this scenario, we would guess that three workspace server sessions are concurrently running our code. However, if we monitor what is happening on the back-end hosts, we notice something unexpected. There are actually five workspace server sessions running. Why? As soon as you sign in to SAS Studio, it starts two SAS sessions. These are used only for the default execution mode. If a process flow is run in parallel mode, up to three additional SAS sessions are started, for a total of five. Once the process flow is finished, the three additional SAS processes terminate if there is no further activity for 30 seconds, in order to release resources.

An administrator can use a configuration property, webdms.maxParallelWorkspaces, to specify the maximum number of workspaces that can be used when SAS is running in parallel mode. The default value is 3. The maximum value is 8.

I hope you enjoy running multiple tasks concurrently. If you have already started using parallel processing, you might want to check out my earlier blog, How to avoid the pitfalls of parallel jobs!

 

 

tags: parallel processing, SAS Professional Services, SAS Programmers, sas studio

SAS Studio Parallel Process Flows was published on SAS Users.

March 2, 2016
 

I recently received a call from a colleague who is using parallel processing in a grid environment; he lamented that SAS Enterprise Guide did not show in the WORK library any of the tables that were successfully created in his project.

The issue was very clear in my mind, but I was not able to find any simple description or picture to show him: so why not put it all down in a blog post so everyone can benefit?

Parallel processing can speed up your projects by an incredible factor, especially when programs consist of subtasks that are independent units of work and can be distributed across a grid and executed in parallel. But when these parallel execution environments are not kept in sync, it can also introduce unforeseen problems.

This specific issue of “disappearing” temporary tables can happen using different client interfaces, because it does not depend upon using a certain software, but rather on the business logic that is implemented. Let’s look at two practical examples.

SAS Studio

We want to run an analysis – here a simple proc print – on two independent subsets of the same table. We decide to use two parallel grid sessions to partition the data and then we run the analysis in the parent session.  The code we submit in SAS Studio could be similar to the following:

%let rc = %sysfunc( grdsvc_enable(_all_, server= SASApp));
signon grid1;
signon grid2;
proc datasets library=work noprint;
delete sedan SUV;
run;
rsubmit grid1 wait=no ;
data sedan;
set sashelp.cars;
where Type="Sedan";
run;
endrsubmit;
rsubmit grid2 wait=no ;
data SUV;
set sashelp.cars;
where Type="SUV";
run;
endrsubmit;
waitfor _ALL_ grid1 grid2;
proc print data=sedan;
run;
proc print data=SUV;
run;

After submitting the code by pressing F3 or clicking Run, we do not get the expected RESULT window, and the LOG window shows errors because the SEDAN and SUV tables cannot be found.

SAS Enterprise Guide

Suppose we have a project similar to the following.

[Image: SAS Enterprise Guide project process flow]

Many items can be created independently of others. The orange arrows illustrate the potential tasks which can be executed in parallel.

Let’s try to run these project tasks in parallel. Open File, Project Properties. Select Code Submission and flag “Allow parallel execution on the same server.”

This property enables SAS Enterprise Guide to create one or more additional workspace server connections so that parallel process flow paths can be run in parallel.

Note: Despite the description, when used in a grid environment the additional workspace server sessions do not always execute on the same server. The grid master server decides where these sessions start.

After we select Run, Run Process Flow to submit the code, SAS Enterprise Guide submits the tasks in parallel, respecting the required dependencies. Unfortunately, some tasks fail and red X’s appear over the top right corner of the task icons. In the log summary we may find the following:

ERROR: File WORK.QUERY_FOR_PRODUCTS.DATA does not exist.

What's going on?

The WORK library is the temporary library that is automatically defined by SAS at the beginning of each SAS session or job. The WORK library stores temporary SAS files that are written by a data step or a procedure and then read as input of subsequent steps. After enabling the parallel execution on the grid, tasks run in multiple SAS sessions, and each grid session has its own dedicated WORK library that is not shared with any other grid session, or with the parent session which started it.

In the SAS Studio example, the DATA steps write their results – the SEDAN and SUV tables – to the WORK libraries of the grid sessions, then PROC PRINT tries to read those tables from the WORK library of the parent session. Obviously the tables are not there, and the task fails.

This is quite a common issue when dealing with multiple sessions – even without a grid. One simple solution is to avoid using the WORK library and any other non-shared resources. It is possible to assign a common library in many ways, such as in autoexec files or in metadata.
With a shared library in place, the issue is solved.
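As a minimal sketch of that fix, the SAS Studio example could write to a library that every session can see instead of WORK. The libref SHARED and its path are hypothetical, and the path is assumed to be on a file system visible to all grid nodes; assigning the library in each RSUBMIT block is just one of the possible ways mentioned above.

```sas
%let rc = %sysfunc(grdsvc_enable(_all_, server=SASApp));
signon grid1;
signon grid2;

/* parent session: hypothetical shared location visible to all grid nodes */
libname SHARED '/shared/project/data';

rsubmit grid1 wait=no;
   libname SHARED '/shared/project/data';   /* same location, remote session */
   data SHARED.sedan;
      set sashelp.cars;
      where Type="Sedan";
   run;
endrsubmit;

rsubmit grid2 wait=no;
   libname SHARED '/shared/project/data';
   data SHARED.SUV;
      set sashelp.cars;
      where Type="SUV";
   run;
endrsubmit;

waitfor _all_ grid1 grid2;

/* the parent session now reads the tables from the shared library */
proc print data=SHARED.sedan;
run;
proc print data=SHARED.SUV;
run;

signoff _all_;
```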

No shared resources, am I safe?

Well, maybe not. Coming back to the original issue presented at the opening of this post, sometimes we overlook what ‘shared’ means. I’ll show you with this very simple SAS Enterprise Guide project: it’s just a simple query that writes a result table in the WORK library. You can test it on your laptop, without any grid.

After running it, the FILTER_FOR_AIR table appears in the Servers pane.

Now, let’s say we have to prepare for a more complex project and we again follow the steps to enable “Allow parallel execution on the same server.” Just to be safe, we resubmit the project to test what happens. All seems unchanged, so we save everything and close SAS Enterprise Guide. Say I forgot to ask you to write down something about the results. We reopen the project and, knowing the result table in the WORK library was temporary, we rerun the project to recreate it.

This time something is wrong.

The Output Data pane shows an error, and the Servers pane does not list the FILTER_FOR_AIR table anymore.
Even if we rerun the project, the table will not reappear.

The reason lies, again, in the realm of shared vs. local libraries.

As soon as we enable “Allow parallel execution on the same server,” SAS Enterprise Guide starts at least one additional SAS session to process the code, even if there is nothing to parallelize. Results are saved only there, but SAS Enterprise Guide always tries to read them from the original, parent session. So we are again in the trap of local WORK libraries.

Why didn’t we uncover the issue the first time we ran the project? If you run your code, at least once, without the “Allow parallel execution on the same server” option, the results are saved in the parent session. And, they remain there even after enabling parallelization. As a result, we actually have two copies of the FILTER_FOR_AIR table!
As soon as we close SAS Enterprise Guide, both tables are deleted. So, on the next run, there is nothing in the parent session WORK library to send to SAS Enterprise Guide!

The solution? Same as before – only use shared libraries.

Is this all?

As you might have guessed, the answer is no. Libraries are not the only objects that should be shared across sessions. Every local setting – be it the value of an option, a macro, a format – has to be shared across all parallel sessions. Not difficult, but we have to remember to do it!
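For example, a macro variable defined in the parent session can be copied to a SAS/CONNECT session with %SYSLPUT before the remote code uses it. The sketch below is an illustration under assumptions: the macro variable region, the session name grid1, and the option being set are all made up for the example.

```sas
%let rc = %sysfunc(grdsvc_enable(_all_, server=SASApp));

%let region = Europe;    /* hypothetical local setting in the parent session */

signon grid1;

/* copy the parent's macro variable into the remote session */
%syslput region=&region / remote=grid1;

rsubmit grid1 wait=no;
   options nocenter;     /* options must be set again in each session */
   data work.subset;
      set sashelp.cars;
      where Origin="&region";   /* resolved in the remote session */
   run;
endrsubmit;

waitfor grid1;
signoff grid1;
```

Formats and macro definitions need similar care, for instance by storing them in a shared library or including them in every session.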

 

 

 

 

 

 

 

tags: parallel processing, sas enterprise guide, SAS Grid Manager, SAS Professional Services, SAS Programmers, sas studio

Avoid the pitfalls of parallel jobs was published on SAS Users.

July 21, 2015
 

I'm gearing up to teach the next "DS2 Programming Essentials with Hadoop" class, and thinking about Warp Speed DATA Steps with DS2 where I first demonstrated parallel processing using threads in base SAS. But how about DATA step processing at maximum warp? For that, we'll need a massively parallel processing […]

The post Jedi SAS Tricks - Maximum Warp with Hadoop appeared first on The SAS Training Post.

September 17, 2014
 

Scalability is the key objective of high-performance software solutions. “Scaling out” means adding more server machines to a solution so that multiple processes can run concurrently in dedicated environments. This blog post will briefly touch on several scalability concepts that affect SAS.

Functional roles

At SAS, we have a number of different approaches to tackle the ability to scale our software across multiple machines. As we often see with our SAS Enterprise Business Intelligence solution components, we’ll split up the various functional roles of SAS software to run on specific hosts. In one of the most common examples, we’ll set aside one machine for the metadata services, another for the analytic computing workload, and a third for web services.

While this is more complicated than deploying everything to a single machine, it allows for a lot of flexibility in providing responsive resources which are optimized for each role. Now, we’re not limited to just three machines, of course.

Read more:
SAS® 9.4 Intelligence Platform: Overview

Clusters

For each of these functional roles – Meta, Compute, and Web – we can scale them out independently of the others. Depending on the technology involved, different techniques must be employed. The Meta and Web functional roles, in particular, are well-equipped to function as clusters.

Generally speaking, a software cluster consists of services that present as peers to the outside world. Clusters offer scalability and improved availability: any node of the cluster can perform the requested work, and the cluster can continue to offer service when one or more nodes fail (depending on configuration), among other features.

Grids

The Compute functional role has some built-in ability to act as a cluster if the necessary SAS software is licensed and properly configured – which is pretty great already – but this ability can be extended even further to act as a grid. A grid is a distributed collection of machines that process many concurrent jobs by coordinating the efficient utilization of resources, which may vary from host to host.

With proper implementation and administration, grids are very tolerant of diverse workloads and a mix of resources. For example, it’s possible to inform your grid that certain machines have certain resources available and others do not. Then, when you submit a job to the grid, you can declare parameters on the job that dictate the use of those resources. The grid will then ensure that only machines with those resources are utilized for the job. This simple illustration can be implemented in different ways depending on the kind of resources and with a high-degree of flexibility and control.

Another common component of clusters and grids is the use of a clustered file system. A clustered file system is visible to and accessed by each machine in the grid (or cluster) – typically at the exact same physical path. This is primarily used to ensure that all nodes are able to work with the same set of physical files. Those files might range from shared work product to software configuration and backups, even to shared executable binaries. The exact use of the clustered file system can of course vary from site to site.

Massively Parallel Processing

Extending grid computing even further is the concept of massively parallel processing (or MPP). As we see with Hadoop technology and the SAS In-Memory solutions, a number of benefits can be realized through the use of carefully planned MPP clusters.

One common assumption behind MPP (especially in the implementation of the SAS In-Memory solutions) has historically been that all participating machines are as identical as possible. They have the same physical attributes (RAM, CPU, disk, network) as well as the same software components.

The premise of working in an MPP environment is that any given job (that is, something like a statistical computation or data to store for later) is simply broken into equal-size chunks that are evenly distributed to all nodes. Each node works on the problem individually, sharing none of its own CPU, RAM, etc. with the others. Since the ideal is for all nodes to be identical and for each to get the same amount of work without competing for any resources, complex workload management capabilities (such as those described for grids above) are not as crucial. This assumption keeps the required administrative overhead for workload management to a minimum.

Hadoop and YARN

Looking forward, one of the challenges of assuming dedicated, identical nodes and equal-size chunks of work in MPP has been that it’s actually quite difficult to keep everything equal on all nodes all of the time. For one thing, this often assumes that all of the hardware is exclusive for MPP use all of the time – which might not be desirable for systems which sit idle overnight, on weekends, etc. Further, while breaking workload up into equal-size bits is possible, it’s sometimes tough to keep the workload perfectly equal and distributed when there exists competition for finite resources.

For these and many other reasons, Hadoop 2.0 introduces an improvement to the workload management of a Hadoop cluster called YARN (Yet Another Resource Negotiator).

The promise of YARN is to better manage resources in a way accessible to Hadoop as well as various other consumers (like SAS). This will help mature the MPP platform, evolving it from the old Map-Reduce framework to a more flexible platform to handle a wider variety of different workload and resource management challenges.

And of course, SAS solutions are already integrating with YARN to take advantage of the capabilities it offers.

tags: grid, Hadoop, parallel processing, performance, SAS Administrators, SAS Professional Services, scalability, YARN
October 2, 2013
 

Do your SAS programs read extra-large volumes of data? Do they run multiple DATA steps and procedures one after the other for hours at a time? Two papers from MWSUG 2013 show how you can speed up those long-running SAS jobs. Although their approaches and environments differed, both authors made use of hardware with multiple CPUs and the MP CONNECT feature of SAS/CONNECT software to run tasks in parallel. Both papers are excellent introductions to the topic.

Parallel processing can improve performance for SAS programs that are I/O intensive or can be broken into multiple, independent subtasks that can be run concurrently. According to both authors, one trick to gaining the most improvement is finding the optimum number of subtasks and how much work should occur in each one.

Simulations can take a long time, a very long time, to process. When a statistician approached SAS programmer Jack Fuller for help, Fuller had to do some searching through SAS/CONNECT and SAS Grid Manager documentation to pull all the pieces together. Fuller’s first step was to move the analysis off PC SAS to the organization’s grid-enabled network. With a little work, a little benchmarking and some adjusting, Fuller helped reduce the time needed to run the analysis from approximately 5 hours to 30 minutes or less.

Fuller’s paper Beating Gridlock shared some of the lessons he learned:  when to use parallel processing, how to use it and other factors to consider. He reminds us that certain sections of code may not be parallelized (such as initialization and finalization steps). They’ll become the limiting factor on the amount of speed you’ll gain. A corollary:  if your program has a large amount of code that can’t be parallelized, then it may not be a candidate for this technique at all. To arrive at the optimum number of subtasks, Fuller often starts by breaking the code into 5, 10 or 15 subtasks. He then monitors performance during multiple trials to arrive at the optimum number of tasks.
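The limiting effect of non-parallelizable code is commonly expressed by Amdahl's law, which the post itself does not name. As a rough sketch, the data step below computes the theoretical speedup for an assumed parallelizable share p of the run time and several subtask counts n; the 90% share is an illustrative assumption, not a figure from either paper.

```sas
/* Sketch: Amdahl's law with an assumed 90% parallelizable share of run time */
data amdahl;
   p = 0.90;                      /* assumed fraction that can run in parallel */
   do n = 1, 5, 10, 15;           /* number of parallel subtasks               */
      speedup = 1 / ((1 - p) + p / n);
      output;
   end;
   format speedup 6.2;
run;

proc print data=amdahl noobs;
run;
```

With p = 0.90 the speedup stays below 1/(1 - p) = 10 no matter how many subtasks are used, which is why a program dominated by serial code may not be worth parallelizing.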

In Advanced Multithreading Techniques for Performance Improvement, Viraj Kumbhakarna uses an empirical approach, basing his recommendations on a case study that he ran both serially and in parallel using the multithreading capabilities of MP CONNECT. Kumbhakarna found that, as a rule of thumb, the processing time for parallelized tasks can be improved by a factor no greater than the number of CPUs available for processing. When he broke test programs into subtasks that exceeded the number of CPUs available, processing performance actually degraded.

Kumbhakarna also noted that in his research, multithreading yielded higher returns where CPU time and real elapsed time were not far apart. Like Fuller, he points out that the number of threads into which jobs are broken is an important factor in performance improvement. He suggests that programmers can determine the most effective number of subtasks by executing each job multiple times and selecting the job (and its number of subtasks) with the least difference between real time and CPU time.

Both authors provide lots of sample code showing how to create test data, how to break a program into subtasks and how to submit parallelized programs for processing.

Editor’s Note: You can find these papers and more at the MWSUG 2013 Proceedings.

tags: MWSUG, parallel processing, SAS Administrators, SAS/CONNECT