Programming Tips

6月 262019
 

"There's a way to do it better - find it." - Thomas A. Edison

Finding a better SAS code

When it comes to SAS coding, this quote by Thomas A. Edison is my best advisor. Time permitting, I love finding better ways of implementing SAS code.

But what code feature means “better” – brevity, clarity or efficiency? It all depends on the purpose of your code. When code is to illustrate a coding concept or technique, clarity is a paramount. However, when processing large data volumes in near real-time, code efficiency becomes critical, not just a luxury or convenience. And brevity won’t hurt in either case. Ideally, your code should be a combination of all three features - brevity, clarity and efficiency.

Parsing a character string

In this blog post we will solve a problem of parsing a character string to find a position of n-th occurrence of a group of characters (substring) in that string.

The closest out-of-box solution to this problem is SAS’ FIND() function. Except this function searches only for a single/first instance of specified substring of characters within a character string. Close enough, and with some do-looping we can easily construct what we want.

After some internet and soul searching to find the Nth occurrence of a substring within a string, I came up with the following DATA STEP code snippet:

   p = 0;
   do i=1 to n until(p=0); 
      p = find(s, x, p+1);
   end;

Here, s is a text string (character variable) to be parsed; x is a character variable holding a group of characters that we are searching for within s; p is a position of x value found within s; n is an instance number.

If there is no n-th instance of x within s found, then the code returns p=0.

In this code, each do-loop iteration searches for x within s starting from position p+1 where p is position found in prior iteration: p = find(s,x,p+1);.

Notice, if there is no prior-to n instance of x within s, the do-loop ends prematurely, based on until(p=0) condition, thus cutting the number of loops to the minimal necessary.

Reverse string search

Since find() function allows for a string search in a reverse direction (from right to left) by making the third augment negative, the above code snippet can be easily modified to do just that: find Nth instance (from right to left) of a group of characters within a string. Here is how you can do that:

   p = length(s) + 1;
   do i=1 to n until(p=0); 
      p = find(s, x, -p+1);
   end;

The difference here is that we start from position length(s)+1 instead of 0, and each iteration searches substring x within string s starting from position –(p-1)=-p+1 from right to left.

Testing SAS code

You can run the following SAS code to test and see how these searches work:

data a;
   s='AB bhdf +BA s Ab fs ABC Nfm AB ';
   x='AB';
   n=3;
 
   /* from left to right */
   p = 0;
   do i=1 to n until(p=0); 
      p = find(s, x, p+1);
   end;
   put p=;
 
   /* from right to left */
   p = length(s) + 1;
   do i=1 to n until(p=0); 
      p = find(s, x, -p+1);
   end;
   put p=;
run;

FINDNTH() function

We can also combine the above left-to-right and right-to-left searches into a single user-defined SAS function by means of SAS Function Compiler (PROC FCMP) procedure:

proc fcmp outlib=sasuser.functions.findnth;
   function findnth(str $, sub $, n);
      p = ifn(n>=0,0,length(str)+1);
      do i=1 to abs(n) until(p=0);
         p = find(str,sub,sign(n)*p+1);
      end;
      return (p);
   endsub;
run;

We conveniently named it findnth() to match the Tableau FINDNTH(string, substring, occurrence) function that returns the position of the nth occurrence of substring within the specified string, where the occurrence argument defines n.

Except our findnth() function allows for both, positive (for left-to-right searches) as well as negative (for right-to-left searches) third argument while Tableau’s function only allows for left-to-right searches.

Here is an example of the findnth() function usage:

options cmplib=sasuser.functions;
data a;
   s='AB bhdf +BA s Ab fs ABC Nfm AB ';
   x='AB';
   n=3;
 
   /* from left to right */
   p=findnth(s,x,n);
   put p=;
 
   /* from right to left */
   p=findnth(s,x,-n);
   put p=;
run;

Using Perl regular expression

As an alternative solution I also implemented SAS code for finding n-th occurrence of a substring within a string using Perl regular expression (regex or prx):

data a;
   s='AB bhdf +BA s Ab fs ABC Nfm AB ';
   x='AB';
   n=3;
 
   /* using regex */
   xid = prxparse('/'||x||'/o');
   p = 0;
   do i=1 to n until(p=0);
      from = p + 1;
      call prxnext(xid, p + 1, length(s), s, p, len);
   end;
   put p=;
run;

However, efficiency benchmarking tests demonstrated that the above solutions using FIND() function or FINDNTH() SAS user-written function run roughly twice faster than this regex solution.

Challenge

Can you come up with an even better solution to the problem of finding Nth instance of a sub-string within a string? Please share your thoughts and solutions with us. Thomas A. Edison would have been proud of you!

Finding n-th instance of a substring within a string was published on SAS Users.

5月 082019
 

Legendary unicorn surrounded by bubbles representing zeros

To those of you who have not read my previous post, Dividing by zero with SAS, it's not too late to go back and make it up. You missed a lot of fun, deep thought and opportunity to solve an unusual SAS coding challenge.

For those who have already read it, let’s get serious for a second.

I have found a lot of misconceptions and misunderstandings percolated among SAS online communities and discussion groups. The goal of this post is to dispel all those fallacies and delusions, also known in civilized societies as myths.

Myth

When SAS encounters division by zero during program execution it generates an ERROR message in the SAS log and stops. (A more histrionic version of this myth is, “Your computer disconnects from the Internet and burns down.”)

Reality

SAS does not generate an ERROR message in the SAS log and does not stop executing the current step and subsequent steps in the event of division by zero.

Myth

When SAS encounters division by zero during program execution it generates a WARNING message in the SAS log and continues execution.

Reality

Consider yourself officially warned that SAS does not generate even a WARNING message in the SAS log in the event of division by zero.

When SAS encounters division by zero during SAS data step execution it does the following:

1. For each occurrence, it generates a NOTE in the SAS log:

NOTE: Division by zero detected at line XXX column XX.

2. For each occurrence, it sets the result of division by zero to a missing value and dumps the Program Data Vector (PDV) including data step variables and automatic variables (_N_ and _ERROR_) to the SAS log, e.g.

a=0 b=0 c=. _ERROR_=1 _N_=1

3. Finally, it generates a summary NOTE to the SAS log, e.g.

NOTE: Mathematical operations could not be performed at the following places. The results of the operations have been set to missing values.
Each place is given by: (Number of times) at (Line):(Column).
3 at 244:8

You can run the following SAS code snippet to confirm reality over myth for yourself:

data a;
   input n d;
   datalines;
2  0
-2 0
0  0
;
 
data b;
   set a;
   c = n/d;
run;

Myth

SAS programmers don’t care what messages SAS generates in the SAS log about division by zero and just ignore them.

Reality

Good programming practices and good SAS programmers do care about the possibility of dividing by zero and try to eliminate those messages by taking control over the situation. For example, instead of blindly dividing by a variable that can potentially have zero values, you can write the following “clean” code:

 
if d ne 0
   then r = n/d;
   else r = .;

In this case the division by zero event will never happen during program execution, and you will not receive any nastygrams in the SAS log, but still the result will be the same – a missing value.

Myth

To avoid those “Division by zero detected” NOTEs, one can use ifn() function as in the following example:

r = ifn(d=0, ., n/d);

The assumption is that SAS will first evaluate the first argument (logical expression d=0); if it is true, then return the second argument (missing value); if false, then evaluate and return the third argument (n/d).

Reality

Despite the ifn() function being one of my favorites, the reality is that it works somewhat differently than the above assumption – it evaluates all its arguments before deciding which argument to return. If d does in fact equal 0, evaluating the third argument, n/d, will trigger an attempt to divide by 0, resulting in the “Division by zero detected” NOTE and the PDV dump in the SAS log; that disqualifies this function from being a graceful handler of division by zero events.

Myth

SAS does not have an effective solution for graceful handling of division by zero events; therefore, SAS programmers are compelled to write additional special programming logic.

Reality

SAS doesn’t just have an effective solution to gracefully handle division by zero events. It has a perfect solution! That perfect solution is even conveniently named the divide() function.

The divide(n,d) function has two arguments, each is either a numeric constant, variable, or expression. It divides the first argument by the second argument and returns a non-missing quotient in cases when none of the arguments is missing and the second argument is not zero. Otherwise, it returns missing value. No ugly “Division by zero detected” NOTES in the SAS log, just what we want.

So, instead of using:

r = n/d;

you would just use

r = divide(n,d);

Moreover, it gives you extended functionality by providing additional information about its argument’s composition. In the case of a zero divisor, it returns three different types of missing values. If the dividend (the first argument) is positive, it returns a special missing value of .I (I for infinity); if the dividend is negative, it returns special missing value of .M (M for minus infinity); if dividend is zero, it returns . as an ordinary missing value.

The icing on the cake

SAS’ missing() function equates all those ordinary and special missing values:

missing(.) = missing(.I) = missing(.M) = 1.

If for some reason you don’t like having different missing values X in your results after using the X=divide(n,d) function, you can easily recode them all to the generic missing value:

if missing(X) then X=.;

Your Comments

How do you prefer handling division by zero? Please share with us.

Dividing by zero with SAS - myths and realities was published on SAS Users.