SAS programmers

January 7, 2022
 

Welcome to the sixth installment in my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS server, how to execute CAS actions, and how to work with the results. Now it's time to generate simple descriptive statistics of a CAS table.

Let's begin by confirming the cars table is loaded into memory. With a connection to CAS established, execute the tableInfo action to view available in-memory tables. If the table is not there yet, you can first load the sashelp.cars table into memory by running the SAS Studio code linked under Additional and related resources.

conn.tableinfo(caslib="casuser")

The results show the cars table is loaded into memory and available for processing. Next, reference the cars table in the variable tbl. Then use the print function to show the value of the variable.

tbl = conn.CASTable('cars', caslib='casuser')
print(tbl)
CASTable('cars', caslib='casuser')

The results show that the tbl variable references the cars table in the CAS server.

Preview the CAS Table

First things first. Remember, the SWAT package blends the world of Pandas and CAS into one. So you can begin with the traditional head method to preview the CAS table.

tbl.head()

The SWAT head method returns five rows from the CAS server to the client as expected.

The Describe Method

Next, let's retrieve descriptive statistics of all numeric columns by using the familiar describe method on the CAS table.

tbl.describe()

The SWAT describe method returns the same descriptive statistics as the Pandas describe method. The only difference is that the SWAT version uses the CAS API to convert the describe method into CAS actions behind the scenes to process the data on the distributed CAS server. CAS processes the data and returns summarized results back to the client as a SASDataFrame, which is a subclass of the Pandas DataFrame. You can now work with the results as you would a Pandas DataFrame.

Summary CAS Action

Instead of using the familiar describe method, let's use a CAS action to do something similar. Here I'll use the summary CAS action.

tbl.summary()

The summary action returns a CASResults object (a Python dictionary) to the client. The CASResults object contains a single key named Summary with a SASDataFrame as the value. The SASDataFrame shows a variety of descriptive statistics. While the summary action does not return exactly the same statistics as the describe method, it can provide additional insights into your data.

What if we don't want all the statistics for all of the data?

Selecting Columns and Summary Statistics with the Summary Action

Let's add additional parameters to the summary action. I'll add the inputs parameter to specify the columns to analyze in the CAS server.

tbl.summary(inputs = ['MPG_City','MPG_Highway'])

The results show only the MPG_City and MPG_Highway columns were analyzed.

Next, I'll use the subSet parameter to specify the summary statistics to produce. Here I'll obtain the MEAN, MIN and MAX.

tbl.summary(inputs = ['MPG_City','MPG_Highway'],
                       subSet = ['mean','min','max'])

The results processed only the MPG_City and MPG_Highway columns, and returned only the specified summary statistics to the client.

Creating a Calculated Column

Lastly, let's create a calculated column within the summary action. There are a variety of ways to do this. I like to add it as a parameter to the CASTable object. You can do that by specifying the tbl object, then the computedVarsProgram parameter. Within computedVarsProgram you can use SAS assignment statements with most SAS functions. Here we will create a new column named MPG_Avg that takes the mean of MPG_City and MPG_Highway. Finally, add the new column to the inputs parameter.

tbl.computedVarsProgram = 'MPG_Avg = mean(MPG_City, MPG_Highway);'
tbl.summary(inputs = ['MPG_City','MPG_Highway', 'MPG_Avg'],
                       subSet = ['mean','min','max'])

In the results I see the calculated column and requested summary statistics.
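To make the semantics of that last call concrete, here is a small pure-Python sketch of what inputs, subSet, and the computed column amount to. It does not use SWAT or CAS: the three rows are made-up values (not data from the cars table), and sas_mean and summarize are hypothetical helpers written for this illustration. It assumes the SAS MEAN function averages only its nonmissing arguments.

```python
from statistics import mean

# Made-up rows for illustration only; None stands in for a SAS missing value.
rows = [
    {"MPG_City": 20, "MPG_Highway": 30},
    {"MPG_City": 18, "MPG_Highway": None},
    {"MPG_City": 25, "MPG_Highway": 35},
]


def sas_mean(*args):
    """Average of the nonmissing arguments, or None if all are missing."""
    present = [a for a in args if a is not None]
    return mean(present) if present else None


# Analog of computedVarsProgram: derive MPG_Avg for every row.
for row in rows:
    row["MPG_Avg"] = sas_mean(row["MPG_City"], row["MPG_Highway"])


def summarize(table, inputs, subset):
    """Analog of summary(inputs=..., subSet=...): chosen stats for chosen columns."""
    stats = {"mean": mean, "min": min, "max": max}
    result = {}
    for col in inputs:
        values = [r[col] for r in table if r[col] is not None]
        result[col] = {name: stats[name](values) for name in subset}
    return result


summary = summarize(rows, ["MPG_City", "MPG_Highway", "MPG_Avg"],
                    ["mean", "min", "max"])
for col, col_stats in summary.items():
    print(col, col_stats)
```

In CAS the same work happens on the server and only the small summary table travels back to the client, which is the point of using the action rather than pulling the rows into Python.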

Summary

The SWAT package blends the world of Pandas and CAS. You can use many of the familiar Pandas methods within the SWAT package, or the flexible, highly optimized CAS actions like summary to easily obtain summary statistics of your data in the massively parallel processing CAS engine.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Index
SWAT API Reference
CAS Action Documentation
SAS® Cloud Analytic Services: Fundamentals
SAS Scripting Wrapper for Analytics Transfer (SWAT)
CAS Action! - a series on fundamentals
Execute the following code in SAS Studio to load the sashelp.cars table into memory

Getting Started with Python Integration to SAS® Viya® - Part 6 - Descriptive Statistics was published on SAS Users.

October 14, 2021
 

Trimming strings left and right

I am pretty sure you have never heard of the TRIMS function, and I would be genuinely surprised if you told me otherwise. This is because this function does not exist (at least at the time of this writing).

But don’t worry, the difference between "nonexistence" and "existence" is only a matter of time, and from now it is less than a blog away. Let me explain. Recently, I published two complementary blog posts:

[1] Removing leading characters from SAS strings

[2] Removing trailing characters from SAS strings

While working on these pieces and researching “prior art” I stumbled upon a multipurpose function in the SAS FedSQL Language that alone does either one or both of these things – remove leading and/or trailing characters from SAS strings.

FedSQL Language and Proc FedSQL

The FedSQL language is the SAS proprietary implementation of the ANSI SQL:1999 core standard. Expectedly, the FedSQL language is implemented in SAS by means of the FedSQL procedure (PROC FEDSQL). This procedure enables you to submit FedSQL language statements from a Base SAS session, and it is supported in both SAS 9.4 and SAS Viya.

Using the FEDSQL procedure, you can submit FedSQL language statements to SAS and third-party data sources that are accessed with SAS and SAS/ACCESS library engines. Or, if you have SAS Cloud Analytic Services (CAS) configured, you can submit FedSQL language statements to the CAS server.

FedSQL TRIM function

The FedSQL language has its own vast FedSQL Functions library with hundreds of functions, many of which replicate SAS 9.4 Functions. Many, but not all. Deep inside this FedSQL functions library, there is a unique treasure modestly called the TRIM Function, which is quite different from the Base SAS language TRIM() function.

While the capabilities of the SAS 9.4 Base TRIM() function are quite limited (it removes just trailing blanks from a character string), the FedSQL TRIM() function is far more powerful. This triple-action function can remove not just trailing blanks, but also leading blanks, as well as both leading and trailing blanks. On top of that, it can remove not just blanks, but any character (although only one character at a time). See for yourself; this function has the following pretty self-explanatory syntax:

TRIM( [BOTH | LEADING | TRAILING] [trim-character] FROM column)

Here trim-character specifies one character (in single quotation marks) to remove from column. If trim-character is not specified, the function removes blanks.

While being called a function, it does not look like a regular SAS function where arguments are separated by commas. It looks more like an SQL statement (which it understandably is – it is part of the FedSQL language). However, this function is available only in PROC FEDSQL; it’s not available in SAS DATA steps or other PROC steps. Still, it gives us a pretty good idea of what such a universal function may look like.
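As a rough way to see what the three modes do, here is a Python model of the behavior described above. It is not FedSQL and not part of the original post: fedsql_trim is a hypothetical name, and defaulting to BOTH when no keyword is given is an assumption based on standard SQL rather than something stated here.

```python
def fedsql_trim(column, trim_character=" ", where="BOTH"):
    """Model of TRIM([BOTH | LEADING | TRAILING] [trim-character] FROM column)."""
    if len(trim_character) != 1:
        # The FedSQL function accepts exactly one character to trim.
        raise ValueError("trim_character must be a single character")
    where = where.upper()
    if where == "LEADING":
        return column.lstrip(trim_character)
    if where == "TRAILING":
        return column.rstrip(trim_character)
    if where == "BOTH":
        return column.strip(trim_character)
    raise ValueError("where must be BOTH, LEADING, or TRAILING")


print(fedsql_trim("**ab*c**", "*"))              # both ends trimmed
print(fedsql_trim("**ab*c**", "*", "LEADING"))   # only leading asterisks removed
print(fedsql_trim("**ab*c**", "*", "TRAILING"))  # only trailing asterisks removed
```

Note that the asterisk in the middle of the string is never touched; only runs at the ends are removed.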

User-defined function TRIMS to remove leading and/or trailing characters in SAS strings

Let’s build such a function with PROC FCMP for use outside FedSQL (it is worth noting that the FCMP procedure is not supported for FedSQL). To avoid confusion with the existing TRIM function, we will call our new function TRIMS (with an ‘S’ at the end), which suits it well by hinting at its plural purpose. First, we define what we are going to create.

User-defined TRIMS function

TRIMS Function

Removes leading characters, trailing characters, or both from a character string.

Syntax

TRIMS(function-modifier, string, trim-list, trim-list-modifier)

Required Arguments

  • function-modifier is a case-insensitive character constant, variable, or expression that specifies one of three possible operations:
    'L' or 'l' – removes leading characters.
    'T' or 't' – removes trailing characters.
    'B' or 'b' – removes both, leading and trailing characters.
  • string is a case-sensitive character constant, variable, or expression that specifies the character string to be trimmed.
  • trim-list is a case-sensitive character constant, variable, or expression that specifies character(s) to remove from the string.
  • trim-list-modifier is a case-insensitive character constant, variable, or expression that supplements the trim-list.
    The valid values are those modifiers of the FINDC function that “add” groups of characters (e.g. 'a' or 'A', 'c' or 'C', 'd' or 'D', etc.) to the trim-list.

The following user-defined function implementation is based on the coding techniques described in the two previous posts, [1] and [2] that I mentioned above. Here goes.

 
libname funclib 'c:\projects\functions';
 
/* delete previous function definition during debugging */
options cmplib=funclib.userfuncs;
proc fcmp outlib=funclib.userfuncs.package1;
   deletefunc trims;
run;
 
/* new function definition */
proc fcmp outlib=funclib.userfuncs.package1;
   function trims(f $, str $, clist $, mod $) $32767;
      from = 1;
      last = length(str);
      if upcase(f) in ('L', 'B') then from = findc(str, clist, 'K'||mod);
      if from=0 then return('');
      if upcase(f) in ('T', 'B') then last = findc(str, clist, 'K'||mod, -last); 
      if last=0 then return('');
      return(substr(str, from, last-from+1));      
   endfunc; 
run;

Code highlights

  • In the function definition, we first assign initial values of the target substring positions as from=1 and last=length(str).
  • Then for Leading or Both character removal, we calculate an adjusted value of from as the position of the first character in str that is not listed in clist and not defined by the mod modifier.
  • If from=0 then we return blank and stop further calculations as this means that ALL characters are to be removed.
  • Then for Trailing or Both character removal, we calculate an adjusted value of last as the position of the last character in str that is not listed in clist and not defined by the mod modifier.
  • If last=0 then we return blank and stop further calculations as this means that ALL characters are to be removed.
  • And finally, we return a substring of str starting at the from position and ending at the last position, that is with the length of last-from+1.

TRIMS function usage

Let’s define SAS data set SOURCE as follows:

data SOURCE;
   input X $ 1-30;
   datalines;
*00It's done*2*1**-
*--*1****9*55
94*Clean record-*00
;

In the following DATA step, we will create three new variables with removed leading (variable XL), trailing (variable XT) and both - leading and trailing (variable XB) characters '*' and '-' as well as any digits:

options cmplib=funclib.userfuncs;
data TARGET;
   set SOURCE;
   length XB XL XT $30;
   XB = trims('b', X, '*-', 'd');
   XL = trims('L', X, '*-', 'd');
   XT = trims('t', X, '*-', 'd');
run;

In this code we use the TRIMS function three times, each time with a different first argument to illustrate how this affects the outcome.

Arguments usage highlights

  • The first argument of the TRIMS function specifies whether we remove characters from both leading and trailing positions ('b'), from leading positions only ('L'), or from trailing positions only ('t'). This argument is case-insensitive. (I prefer using capital 'L' for clarity since lowercase 'l' looks like digit '1').
  • The second argument specifies the name of the variable (X) that we are going to remove characters from (variable X is coming from the dataset SOURCE).
  • The third argument '*-' specifies which character (or characters) to remove. In our example we are removing '*' and '-'. If you do not need to explicitly specify any character here, you still must supply a null value ('') since it is a required argument. In this case, the fourth argument (trim-list-modifier) will determine the set of characters to be removed.
  • And finally, the fourth argument (case-insensitive) of the TRIMS function specifies the FINDC function modifier(s) to remove certain characters in bulk (in our example 'd' will remove all digits). If such modifier is not needed, you still must supply a null value ('') since all four arguments of the TRIMS function are positional and required.

Here is the output data table TARGET showing the original string X and the resulting strings XB (Both leading and trailing characters removed), XL (Leading characters removed) and XT (Trailing characters removed) side by side:

Result of leading and trailing characters trimming
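For readers who want to trace the logic without a SAS session, here is a Python model of the TRIMS function applied to the three SOURCE records. It is a sketch under assumptions, not the FCMP implementation: the Python trims function and the MODIFIER_CLASSES table are written for this illustration, only the 'a', 'd', 'l' and 'u' modifiers are modeled (FINDC supports more), and the strings are treated as ASCII.

```python
import string

# Character groups for a few FINDC modifiers (partial list, for illustration).
MODIFIER_CLASSES = {
    "a": string.ascii_letters,
    "d": string.digits,
    "l": string.ascii_lowercase,
    "u": string.ascii_uppercase,
}


def trims(f, s, clist, mod=""):
    """Python model of TRIMS(function-modifier, string, trim-list, trim-list-modifier)."""
    remove = set(clist)
    for m in mod.lower():
        if m not in MODIFIER_CLASSES:
            raise ValueError(f"modifier {m!r} is not modeled in this sketch")
        remove.update(MODIFIER_CLASSES[m])
    s = s.rstrip(" ")          # LENGTH() ignores trailing blanks
    start, end = 0, len(s)     # 0-based analog of from=1, last=length(str)
    mode = f.upper()
    if mode in ("L", "B"):     # skip leading characters that are in the removal set
        while start < end and s[start] in remove:
            start += 1
    if mode in ("T", "B"):     # skip trailing characters that are in the removal set
        while end > start and s[end - 1] in remove:
            end -= 1
    return s[start:end]


source = ["*00It's done*2*1**-", "*--*1****9*55", "94*Clean record-*00"]
for x in source:
    xb = trims("b", x, "*-", "d")
    xl = trims("L", x, "*-", "d")
    xt = trims("t", x, "*-", "d")
    print(f"X={x!r} XB={xb!r} XL={xl!r} XT={xt!r}")
```

Under this model the second record comes back empty in all three variables, because every one of its characters is an asterisk, a hyphen, or a digit.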

Conclusion

The new TRIMS function presented in this blog post goes well beyond the ubiquitous LEFT and TRIM functions, which remove only leading (LEFT) or trailing (TRIM) blanks. The TRIMS function handles ANY characters, not just blanks. It also expands the character deletion functionality of the powerful FedSQL TRIM function beyond just removing any single leading and/or trailing character. The TRIMS function single-handedly removes any number of explicitly specified characters from leading, trailing, or both (leading and trailing) positions. Plus, it removes in bulk many implicitly specified characters. For example, the 'd' modifier removes all digits; the 'du' modifier removes all digits ('d') and all uppercase letters ('u'); the 'dup' modifier removes all digits ('d'), all uppercase letters ('u') and all punctuation marks ('p'); and so on, as described by the FINDC function modifiers. The order in which modifier characters are listed does not matter.

Questions? Thoughts? Comments?

Do you find this post useful? Please share your thoughts with us below.

Introducing TRIMS function to remove any leading and/or trailing characters from SAS strings was published on SAS Users.

October 1, 2021
 

I'll admit it - I'm often impatient while waiting for results, so my code needs to run as fast as possible! In this video, I show you how you can get faster results, too.

I'll demonstrate several different techniques that produce identical results, and compare processing speeds. For a more robust assessment, I'll test the techniques while reading from both SAS data sets and database tables. Analysis of the results clearly shows that when the techniques produce identical results, these choices usually produce faster run times:

  1. Use a WHERE statement instead of a subsetting IF statement.
  2. Use the KEEP= dataset option on input data sets instead of a KEEP statement.
  3. Should I use SQL or DATA step?
    • If producing a single result, use SQL.
    • If producing multiple results, use the DATA step.
  4. If your process is CPU bound:
    • If you have access to CAS, run it in CAS.
    • Otherwise, refactor in DS2.

The programs I used to create the video are available for download from GitHub if you want to experiment.

Until next time, may the SAS be with you!
Mark

Jedi SAS Tricks: 5 Ways to Make Your SAS Code Run Faster was published on SAS Users.

September 2, 2021
 

In my previous post, we solved the task of removing specified leading characters from SAS strings. In this post, we tackle the complementary task of removing trailing characters.

While removing trailing blanks is well covered in SAS by the TRIM() and TRIMN() functions, removing non-blank trailing characters remains a bit of a mystery that can pop up during text string processing.

For example, you may need to clean up the following strings by removing all trailing x's from them:

012345x
012345xxx
012345xx

These extra characters can result from data entry errors, prior data manipulations, etc. No matter how you get them in, you want them out.

How to remove any trailing characters

For instance, let’s remove all occurrences of the arbitrary trailing character '*'. The following diagram illustrates what we are going to achieve and how:
Diagram: Deleting trailing characters
In order to remove a specified character (in this example '*') from all trailing positions in a string, we need to search our string from right to left, starting from the rightmost non-blank character position, and find the position p of the first character in that string that is not equal to the specified character. Note that despite our right-to-left direction of search, the position p=10 is still counted from left to right. Then we can extract the substring starting from position 1 with the length equal to the found position p.

Unlike in our leading characters removal solution, out of two contenders for our search functionality, VERIFY and FINDC, the VERIFY function has to be dropped from the competition as it does not provide right-to-left search functionality. However, the FINDC function stays on track. Here is a possible solution using the FINDC() function.

Using FINDC() function with negative start-position

The FINDC(X, C, 'K', -LENGTH(X)) function searches string X from right to left starting from the last non-blank character position determined by the optional start-position argument equal to LENGTH(X), and returns the position P of the first character that does not appear in C.

Here we use the K modifier that switches the default behavior of searching for any character that appears in C to searching for any character that does not appear in C.

The direction of search is defined by the minus sign in front of the start-position (a negative start-position argument translates into searching from right to left.)

Then we can apply the SUBSTR(X, 1, P) function that extracts a substring of X starting from position 1 with a length of P which is effectively a substring of the first P characters in X.

Special considerations

Before we proceed to the code implementation of the outlined solution let’s consider the following edge case.

If our string X consists of all '*' characters and nothing else, the FINDC() function will find no character (and therefore no position) that is not equal to '*'. In this case it will return 0. However, 0 is not a valid third argument value for the SUBSTR(X, 1, P) function. Valid values are 1 through VLENGTH(X) – the length attribute of X. Having a 0 value for the third argument will set the automatic DATA step variable _ERROR_=1 and generate the following note in the SAS log:

NOTE: Invalid third argument to function SUBSTR at line ## column #.

Therefore, we need to handle this special case separately, conditionally using SUBSTR(X, 1, P) for P>0 and assigning blank ('') otherwise.

Code implementation for removing trailing characters

Now we are ready to put everything together. First, let’s create a test data table:

data TEST;
   input X $ 1-20;
   datalines;
*It's done***
*********
**01234*ABC**
No trailing *'s
;

Then we apply the logic described above. The following DATA step illustrates our coding solution for deleting trailing characters:

data CLEAN (keep=X Y);
   set TEST;
   C = '*'; *<- trailing character(s) to be removed;
 
   P = findc(X, C, 'K', -length(X));
   if P then Y = substr(X, 1, P); 
        else Y = '';
 
   put _n_= / X= / P= / Y= /;
run;

The SAS log will show interim and final results by the DATA step iterations:

_N_=1
X=*It's done***
P=10
Y=*It's done
 
_N_=2
X=*********
P=0
Y=
 
_N_=3
X=**01234*ABC**
P=11
Y=**01234*ABC
 
_N_=4
X=No trailing *'s
P=15
Y=No trailing *'s

Here is the output data table CLEAN showing the original and the resulting strings X and Y side by side:
Removing any trailing characters in SAS strings
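The position arithmetic can be checked with a short Python model of the same steps. This is a sketch, not SAS: findc_k_from_right and remove_trailing are hypothetical names, positions are 1-based to match SAS, and each test string is padded to 20 characters on the assumption that it mimics the blank-padded $20 variable X.

```python
def findc_k_from_right(x, c):
    """Model of FINDC(X, C, 'K', -LENGTH(X)).

    Returns the 1-based position of the rightmost character of x that is
    not in c, starting at the last non-blank character, or 0 if none.
    """
    s = x.rstrip(" ")                  # LENGTH(X) ignores trailing blanks
    for pos in range(len(s), 0, -1):   # right-to-left, positions counted from the left
        if s[pos - 1] not in c:
            return pos
    return 0


def remove_trailing(x, c):
    """Model of: if P then Y = substr(X, 1, P); else Y = '';"""
    p = findc_k_from_right(x, c)
    return x[:p] if p else ""


test = ["*It's done***", "*********", "**01234*ABC**", "No trailing *'s"]
for n, value in enumerate(test, start=1):
    x = value.ljust(20)                # mimic a blank-padded $20 variable
    print(n, repr(value), findc_k_from_right(x, "*"), repr(remove_trailing(x, "*")))
```

The guard on p plays the role of the IF-THEN-ELSE in the DATA step: the all-asterisk record yields position 0 and is mapped to an empty string instead of being passed to the substring step.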

Conclusion

The solution presented in this blog post expands trailing character deletion functionality beyond solely blanks (which are handled by the TRIM and TRIMN functions). Moreover, using this coding technique, we can simultaneously remove a variety of trailing characters. For example, if we have a string X='012345xxx.%' and specify C = 'x.%' (the order of characters listed within the value of C does not matter), then all three characters 'x', '.', and '%' will be removed from all trailing positions of X. The resulting string will be Y='012345'.

In addition, numerous modifiers of the FINDC() function allow specifying many characters in bulk, without explicitly listing them one by one. For example, we may augment a list of characters being removed by adding the D modifier as in P = FINDC(X, C, 'KD', -LENGTH(X)) which will remove all trailing digits in addition to those characters specified in C. Similarly, we may throw in the U modifier as in P = FINDC(X, C, 'KDU', -LENGTH(X)) which adds all uppercase letters to the list of trailing characters to be removed. And so on.

Questions? Thoughts? Comments?

Do you find this post useful? Do you have questions, concerns, comments? Please share with us below.

Removing trailing characters from SAS strings was published on SAS Users.

August 23, 2021
 

As in many other programming languages, there is a very useful SAS function that removes leading blanks in character strings. It is the ubiquitous LEFT function.

The LEFT(x) function left-aligns a character string x, which effectively removes leading blanks.

However, in many SAS applications we need a similar but more versatile data cleansing functionality allowing for removal of other leading characters, not just blanks. For example, consider some bank account numbers that are stored as the following character strings:

123456789
0123456789
000123456789

These strings represent the same account number recorded with either no, one, or several leading zeros. One way of standardizing this data is by removing the leading 0's. And while we're at it, why don’t we address the leading character removal functionality for any leading characters, not just zeros.

How to remove any leading characters

For example, let’s remove all occurrences of the arbitrary leading character '*'. The following diagram illustrates what we are going to achieve and how:

In order to remove a specified character (in this example '*') from all leading positions in a string, we need to search our string from left to right and find the position of the first character in that string that is not equal to the specified character. In this case, it’s a blank character in position 4. Then we can extract a substring starting from that position till the end of the string.

I can see two possible solutions.

Solution 1: Using VERIFY() function

The VERIFY(X, C) function searches string X from left to right and returns the position P of the first character that does not appear in the value of C.

Then we can apply the SUBSTR(X,P) function that extracts a substring of X starting from position P till the end of the string X.

Solution 2: Using FINDC() function


The FINDC(X, C, 'K') function also searches string X from left to right and returns the position P of the first character that does not appear in C. (The modifier 'K' switches the default behavior of searching for any character that appears in C to searching for any character that does not appear in C.)

Then, as with the VERIFY() function, we can apply the SUBSTR(X,P) function that extracts a substring of X starting from position P till the end of the string X.

Special considerations

So far so good, and everything will be just hunky-dory, right? Not really - unless we cover our bases by handling edge cases.

Have we thought of what would happen if our string X consisted of all '*' characters and nothing else? In this special case, both the verify() function and findc() function will find no position of the character that is not equal to '*' and thus return 0.

However, 0 is not a valid second argument value for the SUBSTR(X,P) function. Valid values are 1 through length(X). Having a 0 value for the second argument will set the automatic DATA step variable _ERROR_=1 and generate the following note in the SAS log:

NOTE: Invalid second argument to function SUBSTR at line ## column #.

Therefore, we need to handle this special case separately, conditionally using SUBSTR(X,P) for P>0 and assigning blank ('') otherwise.

Code implementation for removing leading characters

Let’s put everything together. First, we'll create a test data table:

data TEST;
   input X $ 1-20;
   datalines;
*** It's done*
*********
**01234*ABC**
No leading *'s
;

Then we apply the logic described above. The following DATA step illustrates our two implemented coding solutions for removing leading characters:

data CLEAN (keep=X Y Z);
   set TEST;
   C = '*'; *<- leading character(s) to be removed;
 
   P1 = verify(X,C); *<- Solution 1;
   if P1 then Y = substr(X, P1);    
         else Y = '';
 
   P2 = findc(X,C,'K'); *<- Solution 2;
   if P2 then Z = substr(X, P2); 
         else Z = '';
 
   put _n_= / X= / P1= / Y= / P2= / Z= /;
run;

Alternatively, we can replace the IF-THEN-ELSE construct with this IFC() function one-liner:

data CLEAN (keep=X Y Z);
   set TEST;
   C='*'; *<- leading character(s) to be removed;
 
   P1 = verify(X,C); *<- Solution 1;
   Y = ifc(P1, substr(X, P1), '');
 
   P2 = findc(X,C,'K'); *<- Solution 2;
   Z = ifc(P2, substr(X, P2), '');
 
   put _n_= / X= / P1= / Y= / P2= / Z= /;
run;

The SAS log will show interim and final results by the DATA step iterations:

_N_=1
X=*** It's done*
P1=4
Y=It's done*
P2=4
Z=It's done*
 
_N_=2
X=*********
P1=10
Y=
P2=10
Z=
 
_N_=3
X=**01234*ABC**
P1=3
Y=01234*ABC**
P2=3
Z=01234*ABC**
 
_N_=4
X=No leading *'s
P1=1
Y=No leading *'s
P2=1
Z=No leading *'s

Here is the output data table CLEAN showing the original string X, and resulting strings Y (solution 1) and Z (solution 2) side by side:
Removing any leading characters in SAS strings
As you can see, both solutions (1 & 2) produce identical results.
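The log above shows P1=10 rather than 0 for the all-asterisk record. A short Python model suggests why: X is a 20-character variable, so the nine asterisks are followed by blank padding, and the first blank is a character that is not in C. This is a sketch under that assumption; verify and remove_leading are hypothetical Python names that only mimic the SAS logic, with 1-based positions.

```python
WIDTH = 20  # assumed to match the $20 variable created by: input X $ 1-20;


def verify(x, c):
    """Model of VERIFY(X, C): 1-based position of the first character of x
    that is not in c, or 0 if every character of x is in c."""
    for pos, ch in enumerate(x, start=1):
        if ch not in c:
            return pos
    return 0


def remove_leading(x, c):
    """Model of: if P then Y = substr(X, P); else Y = '';"""
    p = verify(x, c)
    return x[p - 1:] if p else ""


test = ["*** It's done*", "*********", "**01234*ABC**", "No leading *'s"]
for n, value in enumerate(test, start=1):
    x = value.ljust(WIDTH)  # blank-padded, like a fixed-length SAS variable
    print(n, repr(value), verify(x, "*"), repr(remove_leading(x, "*").rstrip()))

# The zero case needs a value that fills the whole variable with asterisks.
print(verify("*" * WIDTH, "*"))
```

So the P=0 guard matters mostly when a value occupies the full length of the variable; with shorter values the padding blanks keep the position above zero and the result is simply blank.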

Conclusion

Compared to the LEFT() function, the solution presented in this blog post not only expands leading character removal beyond the blank character, but also lets us remove a variety of leading characters simultaneously (including but not limited to blanks). For example, if we have a string X=' 0.000 12345' and specify C = ' 0.' (the order of characters listed within the value of C does not matter), then all three characters ' ', '0', and '.' will be removed from all leading positions of X. The resulting string will be Y='12345'.

Questions? Thoughts? Comments?

Do you find this post useful? Do you have questions, concerns, comments? Please share with us below.

Removing leading characters from SAS strings was published on SAS Users.

June 18, 2021
 

SAS Global Certification is pleased to announce two (yes two!) new SAS Viya programming certifications. Traditional SAS programmers need to migrate their code and data to SAS Viya environments, and there are important skills required to make a successful transition. SAS Viya includes Cloud Analytic Services (CAS), where data is stored in-memory and programs execute in parallel on distributed nodes. To leverage CAS, you may need to update your DATA step and SAS SQL code. Also, there are new techniques to manage data stored in CAS. Once you have mastered these skills, you can take an exam and add the credential to your résumé, CV, and LinkedIn profile!

Behind the Scenes

A cross-functional team here at SAS has worked since the beginning of 2021 to design, write, and validate these new credentials. Drawing from their experience working with SAS customers, this diverse team from Professional Services, Technical Support, Documentation, and Education overcame the challenges of distance and time zones (and virtual meetings!) to create these credentials. We appreciate the work that this team accomplished and we had a great time working with you!

SAS Viya Programming Associate

The first credential, the SAS Certified Associate: Fundamentals of Programming Using SAS Viya is designed for SAS programmers who simply need to migrate their code and data to SAS Viya. They plan to work with CAS tables while still using traditional SAS programming techniques. One helpful tool is the CASUTIL procedure, which provides a familiar SAS procedure framework to help SAS programmers manage data in CAS. Also, we cover important skills to update DATA step code and convert PROC SQL code to PROC FEDSQL to execute in CAS. To learn these skills and prepare for the exam, the recommended training is the one-day Programming for SAS Viya course.

SAS Viya Programming Specialist

The second credential, the SAS Certified Specialist: Intermediate Programming Using SAS Viya, builds on the skills introduced in the Associate credential, adding assessment of CAS Language (CASL) programming and CAS actions. With CASL programming skills, you can more precisely control accessing and processing of CAS data. There is also extensive coverage of CAS actions, which are the smallest units of work for the CAS server. CAS actions can load data, transform data, compute statistics, perform analytics, and create output. In addition to the Programming for SAS Viya course, the second recommended training class is the new three-day High-Performance Data Processing with CASL in SAS Viya.

Next Steps

You can learn more about these new credentials at the links above. There you will find detailed exam content guides, free sample questions, and free practice exams (yes - free full length practice exams!). If you have questions, please ask in the comments below. I'll start with a few easy ones:

  • Are there prerequisites for these credentials? No. While they assume a good understanding of traditional programming in a SAS 9 environment, these credentials do not require a prior SAS certification. You can also take the SAS Viya Programming Specialist exam without taking the SAS Viya Programming Associate exam.
  • Are these exams performance-based? No, these exams are not performance-based but are rather traditional format exams with multiple choice and short-answer fill-in-the-blank questions.
  • What exactly is on the exam? For detailed exam topics, hit the "Exam Content Guide" links on each credential's web page. The content guides contain exam sections, objectives, and expanded detail about the important topics.
  • What are the exam numbers? The Associate exam is A00-415 and the Specialist exam is A00-420.

We hope you tackle the recommended training, develop your skills, and register soon to be the first candidates to earn these credentials. Good luck with your preparations!

Two New SAS® Viya® Programming Certifications was published on SAS Users.

June 15, 2021
 

In this fast-paced data age, when the sheer volume of data (generated, collected, and waiting to be processed and analyzed) grows at a breathtaking rate, the speed of data processing becomes critically important. In many cases, if data is not processed within an allotted time frame, we lose all its value as it becomes obsolete and ultimately irrelevant. That is why computing power becomes of the essence.

However, computing power itself does not guarantee timely processing. How we use that power makes all the difference. All too often, good old sequential processing just does not cut it anymore, and different computing methods are required. One such method is parallel processing.

In my previous post Using shell scripts for massively parallel processing I demonstrated a script-centered technique of running in parallel multiple independent SAS processes in SAS environments lacking SAS/CONNECT.

In this post, we will take a shot at a slightly different task and solution. Instead of having several totally independent processes, now we have some common “pre-processing” part, then we run several independent processes in parallel, and then we combine the results of parallel processing in the “post-processing” portion of our program.

Problem: monthly data ingestion use case

For simplification, we are going to use a scenario similar to one in the previous blog post:

Each month, shortly after the end of the previous month we needed to ingest a number of CSV files pertinent to transactions during the previous month and produce daily SAS data tables for each day of the previous month. Only now, we will go a step further: combining all those daily tables into a monthly table.

Solution: combining sequential and parallel processing

The solution comprises three major components:

  • Shell script running the main SAS program.
  • Main SAS program, consisting of three parts: pre-parallel processing, parallel processing, and post-parallel processing.
  • Single thread SAS program responsible for a single day data ingestion.

1. Shell script running main SAS program

The shell script mainprog.sh below runs the main SAS program mainprog.sas:

#!/bin/sh
 
# HOW TO CALL:
# nohup sh /path/mainprog.sh YYYYMM &
 
now=$(date +%Y.%m.%d_%H.%M.%S)
 
# getting YYYYMM as a parameter in script call
ym=$1
 
pgmname=/path/mainprog.sas
logname=/path/saslogs/mainprog_$now.log
sas $pgmname -log $logname -set inDate $ym -set logname $logname

The script is run in background mode, as indicated by the ampersand at the end of its invocation command:

nohup sh /path/mainprog.sh YYYYMM &

We pass a parameter YYYYMM (e.g. 202106) indicating year and month of our request.

When we call SAS program mainprog.sas within the script we indicate the name of the SAS log file to be created (-log $logname) and also pass on inDate parameter (-set inDate $ym, which has the same value YYYYMM as parameter specified in the script calling command), and logname parameter (-set logname $logname). As you will see further, we are going to use these two parameters within mainprog.sas program.
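Since everything downstream depends on the YYYYMM value, it may be worth checking it before SAS is even started. Here is a minimal sketch of such a guard; the is_valid_ym helper is hypothetical and is not part of the original mainprog.sh:

```shell
#!/bin/sh
# Hypothetical helper (not in the original script):
# succeeds only if the argument is six digits with a month of 01-12.
is_valid_ym() {
    case "$1" in
        [0-9][0-9][0-9][0-9]0[1-9] | [0-9][0-9][0-9][0-9]1[0-2]) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: check the value before handing it to SAS via -set inDate
ym=202106
if is_valid_ym "$ym"; then
    echo "OK: $ym"
else
    echo "Usage: mainprog.sh YYYYMM" >&2
fi
```

In mainprog.sh, a check like this would sit right after ym=$1, so that a mistyped month stops the job before any SAS session is launched.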

2. Main SAS program

Here is an abridged version of the mainprog.sas program:

/* ======= pre-processing ======= */
 
/* parameters passed from shell script */
%let inDate = %sysget(inDate);
%let logname = %sysget(logname);
 
/* year and month (note the & needed to resolve inDate) */
%let yyyy = %substr(&inDate,1,4);
%let mm = %substr(&inDate,5,2);
 
/* output data library */
libname SASDL '/data/target';
 
/* number of days in month mm of year yyyy:                     */
/* day of the last date of that month (also works for December) */
%let days = %sysfunc(day(%sysfunc(intnx(month,%sysfunc(mdy(&mm,1,&yyyy)),0,e))));
 
/* ======= parallel processing ======= */
%macro loop;
   %local threadprog looplogdir logdt workpath tasklist i z threadlog cmd;
   %let threadprog = /path/thread.sas;
   %let looplogdir = %substr(&logname,1,%length(&logname)-4)_logs;
   x "mkdir &looplogdir"; *<- directory for loop logs;
   %let logdt = %substr(&logname,%length(&logname)-22,19);
   %let workpath = %sysfunc(pathname(WORK));
   %let tasklist=;
   %do i=1 %to &days;
      %let z = %sysfunc(putn(&i,z2.));
      %let threadlog = &looplogdir/thread_&z._&logdt..log;
      %let tasklist = &tasklist DAY&i;
      %let cmd = sas &threadprog -log &threadlog -set i &i -set workpath &workpath -set inDate &inDate;
      systask command "&cmd" taskname=DAY&i;
   %end;
 
   waitfor _all_ &tasklist;
 
%mend loop;
%loop
 
/* ======= post-processing ======= */
 
/* combine daily tables into one monthly table */
data SASDL.TARGET_&inDate;
   set WORK.TARGET_&inDate._1 - WORK.TARGET_&inDate._&days;
run;

The key highlights of this program are:

  • We capture values of the parameters passed to the program (inDate and logname).
  • Based on these parameters, assign source directory and target data library SASDL.
  • Calculate number of days in a specific month defined by year and month.
  • Create a directory to hold SAS logs of all parallel threads; the directory name matches the log name of mainprog.sas.
  • Capture the WORK library location of the main SAS session running mainprog.sas as:

    %let workpath = %sysfunc(pathname(WORK));

    We use that location in the thread sessions to pass the data they produce back to the main session.

  • Macro %do-loop generates a series of SYSTASK statements to spawn additional SAS sessions in the background mode, each ingesting data for a single day of a month:

    systask command "&cmd" taskname=DAY&i;

    The SYSTASK statement enables you to execute host-specific commands from within your SAS session or application. Unlike the X statement, the SYSTASK statement runs these commands as asynchronous tasks, which means that these tasks execute independently of all other tasks that are currently running. Asynchronous tasks run in the background, so you can perform additional tasks (including launching other asynchronous tasks) while the asynchronous task is still running.

    Restriction: SYSTASK statement is not supported on the CAS server.

  • Also, we generate a cumulative list of all task names assigned to the thread sessions:

    %let tasklist = &tasklist DAY&i;

  • Outside the macro %do-loop we use the WAITFOR statement, which suspends execution of the main SAS session until the specified tasks finish executing. Since we created a list of all daily thread sessions (&tasklist), this synchronizes all our parallel threads and continues the mainprog.sas session only when all threads have finished executing.
  • At the end of the main SAS session we concatenate all our daily data tables that have been created by parallel threads in the location of the WORK library of the main SAS session.
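The number of days drives how many threads get launched, so it can be handy to cross-check it outside SAS. Below is a minimal sketch in shell, assuming a hypothetical days_in_month helper (it is not part of the original programs) that applies the Gregorian leap-year rules:

```shell
#!/bin/sh
# Hypothetical helper: prints the number of days in a month.
# $1 = four-digit year, $2 = month (1-12, a leading zero is allowed)
days_in_month() {
    dim_y=$1
    dim_m=${2#0}    # drop one leading zero, so "06" becomes "6"
    case "$dim_m" in
        1|3|5|7|8|10|12) echo 31 ;;
        4|6|9|11)        echo 30 ;;
        2)
            # leap year: divisible by 4, except centuries not divisible by 400
            if [ $((dim_y % 4)) -eq 0 ] && { [ $((dim_y % 100)) -ne 0 ] || [ $((dim_y % 400)) -eq 0 ]; }; then
                echo 29
            else
                echo 28
            fi ;;
        *) return 1 ;;   # not a valid month
    esac
}

days_in_month 2021 06   # prints 30
```

The value printed for a given YYYYMM should match the &days macro variable reported in the main SAS log.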

Using SAS macro loop to generate a series of SYSTASK statements for parallel processing is not the only method available. Alternatively, you can achieve this within a data step using CALL EXECUTE. In this case, each data step iteration will generate a single global SYSTASK statement and push it out of the data step boundaries where they will be sequentially executed (just like in the case of macro implementation). Since option NOWAIT is the default for SYSTASK statements, despite all of them being launched sequentially, their corresponding OS commands will be still running in parallel.

3. Single thread SAS program

Here is an abridged version of the thread.sas program:

/* inDate parameter */
%let inDate = %sysget(inDate);
 
/* parent program's WORK library */
%let workpath = %sysget(workpath);
libname MAINWORK "&workpath";
 
/* thread number */
%let i = %sysget(i);
 
/* year and month (note the & needed to resolve inDate) */
%let yyyy = %substr(&inDate,1,4);
%let mm = %substr(&inDate,5,2);
 
/* source data directory */
%let srcdir = /datapath/&yyyy/&mm;
 
/* create varlist macro variable to list all input variable names */
proc sql noprint;
   select name into :varlist separated by ' ' from SASHELP.VCOLUMN
   where libname='PARMSDL' and memname='DATA_TEMPLATE';
quit;
 
/* create fileref inf for the source file */
filename inf "&srcdir/source_data_&inDate._day&i..csv";
 
/* create daily output data set */
data MAINWORK.TARGET_&inDate._&i; 
   if 0 then set PARMSDL.DATA_TEMPLATE;
   infile inf missover dsd encoding='UTF-8' firstobs=2 obs=max;
   input &varlist;
run;

This program ingests a single .csv file corresponding to the &i-th day of &inDate (year and month) and creates a SAS data table MAINWORK.TARGET_&inDate._&i. To make that table available in the main SAS session, the MAINWORK library is defined here in the same physical location as the WORK library of the parent SAS session.
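A thread whose input file is missing will only fail after its SAS session has started, so a pre-flight check in the calling shell script can save time. Here is a rough sketch; the list_missing_inputs function is hypothetical, and it assumes the input files are named source_data_<YYYYMM>_day<i>.csv inside the source directory:

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of the original programs).
# Assumed file naming: <srcdir>/source_data_<YYYYMM>_day<i>.csv
# $1 = source directory, $2 = YYYYMM, $3 = number of days in the month
list_missing_inputs() {
    lmi_i=1
    while [ "$lmi_i" -le "$3" ]; do
        lmi_file="$1/source_data_${2}_day${lmi_i}.csv"
        # print the name of every expected file that is not there
        [ -f "$lmi_file" ] || echo "$lmi_file"
        lmi_i=$((lmi_i + 1))
    done
}
```

Calling list_missing_inputs /datapath/2021/06 202106 30 prints nothing when all thirty files are in place, and otherwise prints one missing file name per line.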

We also use a pre-created SAS data template PARMSDL.DATA_TEMPLATE - a zero-observations data set that contains descriptions of all the variables and their attributes.

Additional resources

Thoughts? Comments?

Do you find this post useful? Do you have processes that may benefit from parallelization? Please share with us below.

Using SYSTASK and SAS macro loops for massively parallel processing was published on SAS Users.

May 11, 2021
 

It’s safe to say that SAS Global Forum is a conference designed for users, by users. As your conference chair, I am excited by this year’s top-notch user sessions. More than 150 sessions are available, many by SAS users just like you. Wherever you work or whatever you do, you’ll find sessions relevant to your industry or job role. New to SAS? Been using SAS forever and want to learn something new? Managing SAS users? We have you covered. Search for sessions by industry or topic, then add those sessions to your agenda and personal calendar.

Creating a customizable agenda and experience

Besides two full days of amazing sessions, networking opportunities and more, many user sessions will be available on the SAS Users YouTube channel on May 20, 2021 at 10:00am ET. After you register, build your agenda and attend the sessions that most interest you when the conference begins. Once you’ve viewed a session, you can chat with the presenter. Don’t know where to start? Sample agendas are available in the Help Desk.

For the first time, proceedings will live on SAS Support Communities. Presenters have been busy adding their papers to the community. Everything is there, including full paper content, video presentations, and code on GitHub. It all premieres on “Day 3” of the conference, May 20. Have a question about the paper or code? You’ll be able to post a question on the community and ask the presenter.

Want training or help with your code?

Code Doctors are back this year. Check out the agenda for the specific times they’re available and make your appointment, so you’ll be sure to catch them and get their diagnosis of code errors. If you’re looking for training, you’ll be quite happy. Training is also back this year and it’s free! SAS instructor-led demos will be available on May 20, along with the user presentations on the SAS Users YouTube channel.

Chat with attendees and SAS

It is hard to replicate the buzz of a live conference, but we’ve tried our best to make you feel like you’re walking the conference floor. And we know networking is always an important component to any conference. We’ve made it possible for you to network with colleagues and SAS employees. Simply make your profile visible (by clicking on your photo) to connect with others, and you can schedule a meeting right from the attendee page. That’s almost easier than tracking down someone during the in-person event.

We know the exhibit hall is also a big draw for many attendees. This year’s Innovation Hub (formerly known as The Quad) has industry-focused booths and technology booths, where you can interact in real-time with SAS experts. There will also be a SAS Lounge where you can learn more about various SAS services and platforms such as SAS Support Communities and SAS Analytics Explorers.

Get started now

I’ve highlighted a lot in this blog post, but I encourage you to view this 7-minute Innovation Hub video. It goes in depth on the Hub and all its features.

This year there is no reason not to register for SAS Global Forum…and attend as few or as many sessions as you want. Why? Because the conference is FREE!

Where else can you get such quality SAS content and learning opportunities? Nowhere, which is why I encourage you to register today. See you soon!

SAS Global Forum: Your experience, your way was published on SAS Users.

April 20, 2021
 

I can’t believe it’s true, but SAS Global Forum is just over a month away. I have some exciting news to share with you, so let’s start with the theme for this year:

New Day. New Answers. Inspired by Curiosity.

What a fitting theme for this year! Technology continues to evolve, so each new day is a chance to seek new answers to what can sometimes feel like impossible challenges. Our curiosity as humans drives us to seek out better ways to do things. And I hope your curiosity will drive you to register for this year’s SAS Global Forum.

We are excited to offer a global event across three regions. If you’re in the Americas, the conference is May 18-20. In Asia Pacific? Then we’ll see you May 19-20. And we didn’t forget about Europe. Your dates are May 25-26. We hope these region-specific dates and the virtual nature of the conference mean more SAS users than ever will join us for an inspiring event. Curious about the exciting agenda? It’s all on the website, so check it out.

Keynote speakers that you’ll talk about for months to come

Want to be inspired to chase your “impossible” dreams? Or hear more about the future of AI? How about learning about work-life balance and your mental health? We have you covered. SAS executives are gearing up to host an exciting lineup of extremely smart, engaging and thought-provoking keynote speakers like Adam Grant, Ayesha Khanna and Hakeem Oluseyi.

And who knows, we might have a few more surprises up our sleeve. You’ll just have to register and attend to find out.

Papers and proceedings: simplified and easy to find

Have you joined the SAS Global Forum online community? You should, because that’s where you’ll find all the discussion around the conference…before, during and after. It’s also where you’ll find a link to the 2021 proceedings, when they become available. Authors are busy preparing their presentations now and they are hard at work staging their proceedings in the community. Join the community so you can connect with other attendees and know when the proceedings become available.

Stay tuned for even more details

SAS Global Forum is the place where creativity meets curiosity, and amazing analytics happens! I encourage you to regularly check the conference website, as we’re continually adding new sessions and events. You don’t want to miss this year’s conference, so don’t forget to register for SAS Global Forum. See you soon!

Registration is open for a truly inspiring SAS Global Forum 2021 was published on SAS Users.

April 14, 2021
 

Until recently, I used UNIX/Linux shell scripts in a very limited capacity, mostly as a vehicle for submitting SAS batch jobs. All heavy lifting (conditional processing logic, looping, macro processing, etc.) was done in SAS and by SAS. If there was a need for parallel processing and synchronization, it was also implemented in SAS. I even wrote a blog post, Running SAS programs in parallel using SAS/CONNECT®, which I proudly shared with my customers.

The post caught their attention and I was asked if I could implement the same approach to speed up processes that were taking too long to run.

However, it turned out that SAS/CONNECT was not licensed at their site and procuring the license wasn’t going to happen any time soon. Bummer!

Or boon? You should never be discouraged by obstacles. In fact, encountering an obstacle might be a stroke of luck. Just add a mixture of curiosity, creativity, and tenacity – and you get a recipe for new opportunity and success. That’s exactly what happened when I turned to exploring shell scripting as an alternative way of implementing parallel processing.

Running several batch jobs in parallel

UNIX/Linux OS allows running several scripts in parallel. Let’s say we have three SAS batch jobs controlled by their own scripts script1.sh, script2.sh, and script3.sh. We can run them concurrently (in parallel) by submitting these shell scripts one after another in background mode using & at the end. Just put them in a wrapper “parent” script allthree.sh and run it in background mode as:

$ nohup allthree.sh &

Here is what is inside allthree.sh:

#!/bin/sh
script1.sh &
script2.sh &
script3.sh &
wait

With such an arrangement, the allthree.sh “parent” script starts all three background tasks (and corresponding SAS programs), which the server will run concurrently (as far as resources allow). Depending on the server capacity (mainly, the number of CPUs), these jobs will run in parallel or quasi-parallel, competing for the server’s shared resources, with the operating system taking charge of orchestrating their co-existence and load balancing.

The wait command at the end is responsible for the “parent” script’s synchronization. Since no process ID or job ID is specified with the wait command, it will wait for all current “child” processes to complete. Once all three tasks have completed, the parent script allthree.sh will continue past the wait command.
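A bare wait tells us when everything is done, but not whether any child failed. As a variation on allthree.sh (a sketch, not the script used in this post), we can remember each child's process ID from $! and wait on them one by one, counting non-zero exit statuses. The run_all function name and the stand-in commands are illustrative only:

```shell
#!/bin/sh
# Sketch: start each command in the background, remember its PID,
# then wait for every PID individually and count the failures.
run_all() {
    ra_pids=""
    for ra_cmd in "$@"; do
        sh -c "$ra_cmd" &
        ra_pids="$ra_pids $!"      # $! is the PID of the last background job
    done
    ra_failed=0
    for ra_pid in $ra_pids; do
        # "wait PID" returns the exit status of that particular child
        wait "$ra_pid" || ra_failed=$((ra_failed + 1))
    done
    echo "$ra_failed"
}

# Stand-in commands; in practice these would be script1.sh, script2.sh, script3.sh
run_all "true" "exit 3" "true"    # prints 1 (one child failed)
```

The children still run concurrently; only the collection of their exit statuses is sequential.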

Get the UNIX/Linux server information

To evaluate the server capabilities as they relate to parallel processing, we would like to know the number of CPUs.

To get this information we can run the lscpu command, as it provides an overview of the CPU architectural characteristics such as number of CPUs, number of CPU cores, vendor ID, model, model name, speed of each core, and lots more. Here is what I got:

Ha! 56 CPUs! This is not bad, not bad at all! I don’t even have to usurp the whole server after all. I can just grab about 50% of its capacity and be a nice guy leaving another 50% to all other users.
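On a smaller server, the "take about half" rule could be enforced rather than just hoped for. The sketch below is one possible way to do it and is not what the driver script in this post does (that script starts all days at once). It assumes nproc (GNU coreutils) and the -P option of xargs (available in GNU and BSD xargs, but not in POSIX); run_throttled is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: cap concurrency at roughly half of the available CPUs.
# nproc comes with GNU coreutils; fall back to 2 CPUs if it is missing.
cpus=$(nproc 2>/dev/null || echo 2)
max_jobs=$((cpus / 2))
[ "$max_jobs" -ge 1 ] || max_jobs=1
echo "Running at most $max_jobs jobs at a time"

# Hypothetical helper: run "command [args] <task number>" for tasks 1..N,
# with no more than MAX of them running at any moment.
# $1 = MAX, $2 = N, remaining arguments = command and its fixed arguments
run_throttled() {
    rt_max=$1
    rt_n=$2
    shift 2
    seq 1 "$rt_n" | xargs -P "$rt_max" -n 1 "$@"
}

# Stand-in task: echo instead of a SAS batch command
run_throttled "$max_jobs" 5 echo "processing day"
```

With a real workload, the echo would be replaced by a small wrapper script that takes the day number as its last argument.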

Problem: monthly data ingestion use case

Here is a simplified description of the problem I was facing.

Each month, shortly after the end of the previous month we needed to ingest a number of CSV files pertinent to transactions during the previous month and produce daily SAS data tables for each day of the previous month.  The existing process sequentially looped through all the CSV files, which (given the data volume) took about an hour to run.

This task was a perfect candidate for parallel processing since data ingestions of individual days were fully independent of each other.

Solution: massively parallel process

The solution comprises two parts:

  • Single thread SAS program responsible for a single day data ingestion.
  • Shell script running multiple instances of this SAS program concurrently.

Single thread SAS process

The first thing I did was to rewrite the SAS program from looping through all of the days to ingesting just a single day of a given month and year. Here is a bare-bones version of the SAS program:

/* capture parameter &sysparm passed from OS command */ 
%let YYYYMMDD = &sysparm;
 
/* create varlist macro variable to list all input variable names */
proc sql noprint;
   select name into :varlist separated by ' ' from SASHELP.VCOLUMN
   where libname='PARMSDL' and memname='DATA_TEMPLATE';
quit;
 
/* create fileref inf for the source file */
filename inf "/cvspath/rawdata&YYYYMMDD..csv";
 
/* create daily output data set */
data SASDL.DATA&YYYYMMDD; 
   if 0 then set PARMSDL.DATA_TEMPLATE;
   infile inf missover dsd encoding='UTF-8' firstobs=2 obs=max;
   input &varlist;
run;

This SAS program (let’s call it oneday.sas) can be run in batch using the following OS command:

sas oneday.sas -log oneday.log -sysparm 20210301

Note that we pass a parameter (e.g. 20210301 means year 2021, month 03, day 01) defining the requested date YYYYMMDD as the -sysparm value.

That value becomes available in the SAS program as a macro variable reference &sysparm.

We also use a pre-created data template PARMSDL.DATA_TEMPLATE - a zero-observations data set that contains descriptions of all the variables and their attributes (see Simplify data preparation using SAS data templates).

Shell script running the whole process in parallel

The shell script month_parallel_driver.sh below puts everything together. It spawns and runs concurrently as many daily processes as there are days in the specified month and year, and synchronizes all single-day processes (threads) at the end by waiting for them all to complete. It logs all its threads and calculates (and prints) the total processing duration. As you can see, shell script as a programming language is quite versatile and powerful. Here it is:

#!/bin/sh
 
# HOW TO RUN (YYYYMM is the requested year and month):
# cd /projpath/scripts
# nohup sh month_parallel_driver.sh YYYYMM &
 
# Project path
proj=/projpath
 
# Program file name
prgm=oneday
pgmname=$proj/programs/$prgm.sas
 
# Current date/time stamp
now=$(date +%Y.%m.%d_%H.%M.%S)
echo 'Start time:'$now
 
# Reset timer
SECONDS=0
 
# Get YYYYMM as the script parameter
par=$1
 
# Extract year and month from $par
y=${par:0:4}
m=${par:4:2}
 
# Get number of days in month $m of year $y
days=$(cal $m $y | awk 'NF {DAYS = $NF}; END {print DAYS}')
 
# Create log directory
logdir=$proj/saslogs/${prgm}_${y}${m}_${now}_logs
mkdir $logdir
 
# Loop through all days of month $m of year $y
for i in $(seq -f "%02g" 1 $days)
do
   # Assign log name for a single day thread
   logname=$logdir/${prgm}_${y}${m}_thread${i}_$now.log
 
   # Run single day thread
   /SASHome/SASFoundation/9.4/sas $pgmname -log $logname -sysparm $par$i &
done
 
# Wait until all threads are finished
wait
 
# Calculate and print duration
end=$(date +%Y.%m.%d_%H.%M.%S)
echo 'End time:'$end
hh=$(($SECONDS/3600))
mm=$(( $(($SECONDS - $hh * 3600)) / 60 ))
ss=$(($SECONDS - $hh * 3600 - $mm * 60))
printf " Total Duration: %02d:%02d:%02d\n" $hh $mm $ss
echo '------- End of job -------'

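This script is documented by detailed comments. Note that the ${par:0:4} substring expansion and the SECONDS timer are bash features rather than POSIX sh features, so sh must resolve to bash (or a compatible shell) on the server. The script can be run as follows, with YYYYMM being the requested year and month (e.g. 202103):

cd /projpath/scripts
nohup sh month_parallel_driver.sh YYYYMM &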
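The duration arithmetic at the end of the script is easy to get wrong, so here is the same calculation isolated into a small function that can be tried on its own. This is only a sketch; fmt_duration is a hypothetical name and the script above does not use it:

```shell
#!/bin/sh
# Sketch: convert a number of elapsed seconds into HH:MM:SS.
# $1 = elapsed seconds (a non-negative integer)
fmt_duration() {
    fd_hh=$(( $1 / 3600 ))
    fd_mm=$(( ($1 % 3600) / 60 ))
    fd_ss=$(( $1 % 60 ))
    printf '%02d:%02d:%02d\n' "$fd_hh" "$fd_mm" "$fd_ss"
}

fmt_duration 3725    # prints 01:02:05 (1 hour, 2 minutes, 5 seconds)
```

In a bash script, fmt_duration "$SECONDS" would produce the same HH:MM:SS value that the driver script prints after "Total Duration:".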

Results

The results were as expected as they were stunning. The overall duration was cut roughly by a factor of 25, so now this whole task completes in about two minutes vs. one hour before. Actually, now it is even fun to watch how SAS logs and output data sets are being updated in real time.
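With about thirty logs written at once, nobody is going to read them all by eye. A quick way to spot troubled threads is to list the logs that contain error lines. The sketch below assumes that SAS error messages appear on log lines beginning with the word ERROR; list_error_logs is a hypothetical helper and is not part of the driver script:

```shell
#!/bin/sh
# Hypothetical helper: print the names of the per-thread logs that
# contain at least one line starting with "ERROR".
# $1 = directory holding the *.log files
list_error_logs() {
    grep -l '^ERROR' "$1"/*.log 2>/dev/null
}
```

Pointed at the $logdir directory created by the driver script, it prints nothing (and returns a non-zero status) when every thread log is free of such lines.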

What is more, this script-centric approach can be used for running not just SAS processes, but non-SAS, open source and/or hybrid processes as well. This makes it a powerful amplifier and integrator for heterogeneous software applications development.

SAS Consulting Services

The solution presented in this post is a stripped-down version of the original production quality solution. This better serves our educational objective of communicating the key concepts and coding techniques. If you believe your organization’s computational powers are underutilized and may benefit from a SAS Consulting Services engagement, please reach out to us through your SAS representative, and we will be happy to help.

Additional resources

Thoughts? Comments?

Do you find this post useful? Do you have processes that may benefit from parallelization? Please share with us below.

Using shell scripts for massively parallel processing was published on SAS Users.