6月 192018
 

CAS DATA StepCloud Analytic Services (CAS) is really exciting. It’s open. It’s multi-threaded. It’s distributed. And, best of all for SAS programmers, it’s SAS. It looks like SAS. It feels like SAS. In fact, you can even run DATA Step in CAS. But, how does DATA Step work in a multi-threaded, distributed context? What’s new? What’s different? If I’m a SAS programming wizard, am I automatically a CAS programming wizard?

While there are certain _n_ automatic variable as shown below:

DATA tableWithUniqueID;
SET tableWithOutUniqueID; 
 
        uniqueID = _n_;
 
run;

CAS DATA Step

Creating a unique ID in CAS DATA Step is a bit more complicated. Each thread maintains its own _n_. So, if we just use _n_, we’ll get duplicate IDs. Each thread will produce an uniqueID field value of 1, 2..and so on. …. When the thread output is combined, we’ll have a bunch of records with an uniqueID of 1 and a bunch with an uniqueID of 2…. This is not useful.

To produce a truly unique ID, you need to augment _n_ with something else. _threadID_ automatic variable can help us get our unique ID as shown below:

DATA tableWithUniqueID;
SET tableWithOutUniqueID;
 
        uniqueID = put(_threadid_,8.) || || '_' || Put(_n_,8.);
 
run;

While there are surely other ways of doing it, concatenating _threadID_ with _n_ ensures uniqueness because the _threadID_ uniquely identifies a single thread and _n_ uniquely identifies a single row output by that thread.

Aggregation with DATA Step

Now, let’s look at “whole table” aggregation (no BY Groups).

SAS DATA Step

Aggregating an entire table in SAS DATA Step usually looks something like below. We create an aggregator field (totSalesAmt) and then add the detail records’ amount field (SaleAmt) to it as we process each record. Finally, when there are no more records (eof), we output the single aggregate row.

DATA aggregatedTable ;
SET detailedTable end=eof;
 
      retain totSalesAmt 0;
      totSalesAmt = totSalesAmt + SaleAmt;
      keep totSalesAmt;
      if eof then output;
 
run;

CAS DATA Step

While the above code returns one row in single-engine SAS, the same code returns multiple rows in CAS — one per thread. When I ran this code against a table in my environment, I got 28 rows (because CAS used 28 threads in this example).

As with the unique ID logic, producing a total aggregate is just a little more complicated in CAS. To make it work in CAS, we need a post-process step to bring the results together. So, our code would look like this:

DATA aggregatedTable ;
SET detailedTable end=eof;
 
      retain threadSalesAmt 0;
      threadSalesAmt = threadSalesAmt + SaleAmt;
      keep threadSalesAmt;
      if eof then output;
 
run;
 
DATA aggregatedTable / single=yes;
SET aggregatedTable end=eof;
 
      retain totSalesAmt 0;
      totSalesAmt = totSalesAmt + threadSalesAmt;
      if eof then output;
 
run;

In the first data step in the above example, we ran basically the same code as in the SAS DATA Step example. In that step, we let CAS do its distributed, multi-threaded processing because our table is large. Spreading the work over multiple threads makes the aggregation much quicker. After this, we execute a second DATA Step but here we force CAS to use only one thread with the single=yes option. This ensures we only get one output row because CAS only uses one thread. Using a single thread in this case is optimal because we’ll only have a few input records (one per thread from the previous step).

BY-GROUP Aggregation

Individual threads are then assigned to individual BY-Groups. Since each BY-Group is processed by one and only one thread, when we aggregate, we won’t see multiple output rows for a BY-Group. So, there shouldn’t be a need to consolidate the thread results like there was with “whole table” aggregation above.

Consequently, BY-Group aggregation DATA Step code should look exactly the same in CAS and SAS (at least for the basic stuff).

Concluding Thoughts

Coding DATA Step in CAS is very similar to coding DATA Step in SAS. If you’re a wizard in one, you’re likely a wizard in the other. The major difference is accounting for CAS’ massively parallel processing capabilities (which manifest as threads). For more insight into data processing with CAS, check out the SAS Global Forum paper.

Threads and CAS DATA Step was published on SAS Users.

 Leave a Reply

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

(required)

(required)