"There's a way to do it better - find it." - Thomas A. Edison

When it comes to SAS coding, this quote by Thomas A. Edison is my best advisor. Time permitting, I love finding better ways of implementing SAS code.

But what code feature means “better” – brevity, clarity or efficiency? It all depends on the purpose of your code. When code is to illustrate a coding concept or technique, clarity is a paramount. However, when processing large data volumes in near real-time, code efficiency becomes critical, not just a luxury or convenience. And brevity won’t hurt in either case. Ideally, your code should be a combination of all three features - brevity, clarity and efficiency.

## Parsing a character string

In this blog post we will solve a problem of parsing a character string to find a position of n-th occurrence of a group of characters (substring) in that string.

The closest out-of-box solution to this problem is SAS’ FIND() function. Except this function searches only for a single/first instance of specified substring of characters within a character string. Close enough, and with some do-looping we can easily construct what we want.

After some internet and soul searching to find the Nth occurrence of a substring within a string, I came up with the following DATA STEP code snippet:

p = 0; do i=1 to n until(p=0); p = find(s, x, p+1); end; |

Here, s is a text string (character variable) to be parsed; x is a character variable holding a group of characters that we are searching for within s; p is a position of x value found within s; n is an instance number.

If there is no n-th instance of x within s found, then the code returns p=0.

In this code, each do-loop iteration searches for x within s starting from position p+1 where p is position found in prior iteration: p = find(s,x,p+1);.

Notice, if there is no prior-to n instance of x within s, the do-loop ends prematurely, based on until(p=0) condition, thus cutting the number of loops to the minimal necessary.

## Reverse string search

Since **find() function** allows for a string search in a reverse direction (from right to left) by making the third augment negative, the above code snippet can be easily modified to do just that: find Nth instance (from right to left) of a group of characters within a string. Here is how you can do that:

p = length(s) + 1; do i=1 to n until(p=0); p = find(s, x, -p+1); end; |

The difference here is that we start from position length(s)+1 instead of 0, and each iteration searches substring x within string s starting from position –(p-1)=-p+1 from right to left.

## Testing SAS code

You can run the following SAS code to test and see how these searches work:

data a; s='AB bhdf +BA s Ab fs ABC Nfm AB '; x='AB'; n=3; /* from left to right */ p = 0; do i=1 to n until(p=0); p = find(s, x, p+1); end; put p=; /* from right to left */ p = length(s) + 1; do i=1 to n until(p=0); p = find(s, x, -p+1); end; put p=; run; |

## FINDNTH() function

We can also combine the above left-to-right and right-to-left searches into a single user-defined SAS function by means of SAS Function Compiler (PROC FCMP) procedure:

proc fcmp outlib=sasuser.functions.findnth; function findnth(str $, sub $, n); p = ifn(n>=0,0,length(str)+1); do i=1 to abs(n) until(p=0); p = find(str,sub,sign(n)*p+1); end; return (p); endsub; run; |

We conveniently named it **findnth()** to match the Tableau FINDNTH(string, substring, occurrence) function that returns the position of the nth occurrence of substring within the specified string, where the occurrence argument defines n.

Except our **findnth()** function allows for both, positive (for left-to-right searches) as well as negative (for right-to-left searches) third argument while Tableau’s function only allows for left-to-right searches.

Here is an example of the **findnth()** function usage:

options cmplib=sasuser.functions; data a; s='AB bhdf +BA s Ab fs ABC Nfm AB '; x='AB'; n=3; /* from left to right */ p=findnth(s,x,n); put p=; /* from right to left */ p=findnth(s,x,-n); put p=; run; |

## Using Perl regular expression

As an alternative solution I also implemented SAS code for finding n-th occurrence of a substring within a string using Perl regular expression (regex or prx):

data a; s='AB bhdf +BA s Ab fs ABC Nfm AB '; x='AB'; n=3; /* using regex */ xid = prxparse('/'||x||'/o'); p = 0; do i=1 to n until(p=0); from = p + 1; call prxnext(xid, p + 1, length(s), s, p, len); end; put p=; run; |

However, efficiency benchmarking tests demonstrated that the above solutions using **FIND() function** or **FINDNTH()** SAS user-written function run roughly twice faster than this regex solution.

## Challenge

Can you come up with an even better solution to the problem of finding Nth instance of a sub-string within a string? Please share your thoughts and solutions with us. Thomas A. Edison would have been proud of you!

Finding n-th instance of a substring within a string was published on SAS Users.