character data

2月 222021
 

Removing a piece from character string In my previous post, we addressed the problem of inserting substrings into SAS character strings. In this post we will solve a reverse problem of deleting substrings from SAS strings.

These two complementary tasks are commonly used for character data manipulation during data cleansing and preparation to transform data to a shape suitable for analysis, text mining, reporting, modeling and decision making.

As in the previous case of substring insertion, we will cover substring deletion for both, character variables and macro variables as both data objects are strings.

The following diagram illustrates what we are going to achieve by deleting a substring from a string:

Removing a substring from SAS string illustration

Have you noticed a logical paradox? We take away a “pieceof” cake and get the whole thing as result! 😊

Now, let’s get serious.

Deleting all instances of a substring from a character variable

Let’s suppose we have a variable STR whose values are sprinkled with some undesirable substring ‘<br>’ which we inherited from some HTML code where tag <br> denotes a line break. For our purposes, we want to remove all instances of those pesky <br>’s. First, let’s create a source data set imitating the described “contaminated” data:

data HAVE;
   infile datalines truncover;
   input STR $100.;
   datalines;
Some strings<br> have unwanted sub<br>strings in them<br>
<br>A s<br>entence must not be cont<br>aminated with unwanted subs<br>trings
Several line<br> breaks<br> are inserted here<br><br><br>
<br>Resulting st<br>ring must be n<br>eat and f<br>ree from un<br>desirable substrings
Ugly unwanted substrings<br><br> must <br>be<br> removed
<br>Let's remove them <br>using S<br>A<br>S language
Ex<br>periment is a<br>bout to b<br>egin
<br>Simpli<br>city may sur<br>prise you<br><br>
;

This DATA step creates WORK.HAVE data set that looks pretty ugly and is hardly usable:
Source data to be cleansed
The following code, however, cleans it up removing all those unwanted substrings ‘<br>’:

data WANT (keep=NEW_STR);
   length NEW_STR $100;
   SUB = '<br>';
   set HAVE;
   NEW_STR = transtrn(STR,'<br>',trimn(''));
run;

After this code runs, the data set WANT will look totally clean and usable:
Cleaned data

Code highlights

  • We use .

The TRANSTRN function is similar to TRANWRD function which replaces all occurrences of a substring in a character string. While TRANWRD uses a single blank when the replacement string has a length of zero, TRANSTRN does allow the replacement string to have a length of zero which essentially means removing.

  • TRIM() function which removes trailing blanks from a character string and returns one blank if the string is missing. However, when it comes to removing (which is essentially replacement with zero length substring) the ability of TRIMN function to return a zero-length string makes all the difference.

Deleting all instances of a substring from a SAS macro variable

For macro variables, I can see two distinct methods of removing all occurrences of undesirable substring.

Method 1: Using SAS data step

Here is a code example:

%let STR = Some strings<br> have unwanted sub<br>strings in them<br>;
%let SUB = <br>;
 
data _null_;
   NEW_STR = transtrn("&STR","&SUB",trimn(''));
   call symputx('NEW',NEW_STR);
run;
 
%put &=STR;
%put &=NEW;

In this code, we stick our macro variable value &STR in double quotes in the transtrn() function as the first argument (source). The macro variable value &SUB, also double quoted, is placed as a second argument. After variable NEW_STR is produced free from the &SUB substrings, we create a macro variable NEW using

%let STR = Some strings<br> have unwanted sub<br>strings in them<br>;
%let SUB = <br>;
 
%let NEW = %sysfunc(transtrn(&STR,&SUB,%sysfunc(trimn(%str()))));
 
%put &=STR;
%put &=NEW;

Deleting selected instance of a substring from a character variable

In many cases we need to remove not all substring instances form a string, but rather a specific occurrence of a substring. For example, in the following sentence (which is a quote by Albert Einstein) “I believe in intuitions and inspirations. I sometimes feel that I am right. I sometimes do not know that I am.” the second word “sometimes” was added by mistake. It needs to be removed. Here is a code example presenting two solutions of how such a deletion can be done:

data A;
   length STR STR1 STR2 $250;
   STR = 'I believe in intuitions and inspirations. I sometimes feel that I am right. I sometimes do not know that I am.';
   SUB = 'sometimes';
   STR_LEN = length(STR);
   SUB_LEN = length(SUB);
   POS = find(STR,SUB,-STR_LEN);
   STR1 = catx(' ', substr(STR,1,POS-1), substr(STR,POS+SUB_LEN)); /* solution 1 */
   STR2 = kupdate(STR,POS,SUB_LEN+1);                              /* solution 2 */
   put STR1= / STR2=;
run;

The code will produce two correct identical values of this quote in the SAS log (notice, that the second instance of word “sometimes” is gone):

STR1=I believe in intuitions and inspirations. I sometimes feel that I am right. I do not know that I am.
STR2=I believe in intuitions and inspirations. I sometimes feel that I am right. I do not know that I am.

Code highlights

  • FIND() function determines position POS of the substring SUB to be deleted in the string STR. In this particular example, we used the fact, that the second occurrence of word “sometimes” is the first occurrence of this word when counted from right to left. That is indicated by the negative 3-rd argument (-STR_LEN) which means that FIND function searches STR for SUB starting from position STR_LEN from right to left.

Solution 1

This is the most traditional solution that cuts out two pieces of the string – before and after the substring being deleted – and then concatenates them together thus removing that substring:

  • substr(STR,1,POS-1) extracts the first part of the source string STR before the substring to be deleted: from position 1 to position POS-1.
  • substr(STR,POS+SUB_LEN) extracts the second part of the source string STR after the substring to be deleted: from position POS+SUB_LEN till the end of STR value (since the third argument, length, is not specified).
  • Solution 2

    Finding n-th instance of a substring within a string .

Deleting selected instance of a substring from a SAS macro variable

Here is a code example of how to solve the same problem as it relates to SAS macro variables. For brevity, we provide just one solution using %sysfunc and KUPDATE() function:

%let STR = I believe in intuitions and inspirations. I sometimes feel that I am right. I sometimes do not know that I am.;
%let SUB = sometimes;
%let POS = %sysfunc(find(&STR,&SUB,-%length(&STR)));
%let STR2 = %sysfunc(kupdate(&STR,&POS,%eval(%length(&SUB)+1)));
%put "&STR2";

This should produce the following corrected Einstein’s quote in the SAS log:

"I believe in intuitions and inspirations. I sometimes feel that I am right. I do not know that I am."

Additional Resources for SAS character strings processing

Your thoughts?

Have you found this blog post useful? Please share your thoughts and feedback in the comments section below.

Deleting a substring from a SAS string was published on SAS Users.

2月 162021
 

Insert a piece into a stringSAS provides an extensive set of tools for data cleansing and preparation – transforming data to a shape suitable for analysis, text mining, reporting, modeling and ultimately decision making.

In this post we will cover one of the common tasks of character data manipulation – inserting a substring into a SAS character string.

A diagram below illustrates what we are going to achieve:

Illustration for inserting a substring into a string

SAS character strings come in two different incarnations: character variables and macro variables. Since these two are quite different SAS language objects, let’s cover them one by one separately.

Inserting a substring into a character variable

Here is our task: we have a SAS character variable (string) and we want to insert in it a value of another character variable (substring) starting at a particular specified position.

Let’s say we have a string BASE in which we want to insert a COUNTRY name right before word "stays" to make different variation of the resultant phrase. Here is an example of how this can be easily done:

data COUNTRIES;
   length COUNTRY $20;
   input COUNTRY;
   datalines;
Spain
Argentina
Slovenia
Romania
USA
Luxembourg
Egypt
Switzerland
;
 
data NEW (keep=COUNTRY PHRASE);
   BASE = 'The rain in stays mainly in the plain';
   INSPOS = find(BASE,'stays');
   set COUNTRIES;
   length PHRASE $50;
   PHRASE = catx(' ',substr(BASE,1,INSPOS-1),COUNTRY,substr(BASE,INSPOS));
run;

This code dynamically creates variable PHRASE out of values of variable BASE and the values of variable COUNTRY, thus making it data-driven.

After this code runs, the data set NEW will look like this:

Results after inserting a substring into a character string

Here are the code highlights:

  • maximum length of varying-length character variables is 536,870,911 characters (UTF-8 encoding).

    Inserting a substring into a SAS macro variable

    Let’s solve a similar task, but now instead of SAS variables we will operate with SAS macro variables, since they are strings too.

    Here is our problem to solve: we have a SAS macro variable (string) and we want to insert in it a value of another macro variable (substring) starting at a particular specified position.

    Let’s say we have a macro variable BASE with value of The rain in stays mainly in the plain in which we want to insert a country name defined by macro variable COUNTRY with value of Spain right before word stays. Here is an example of how this can be done:

    %let BASE = The rain in stays mainly in the plain;
    %let COUNTRY = Spain;
    %let W = stays;
     
    %let INSPOS = %index(&BASE,&W);
    %let PHRASE = %substr(&BASE,1,%eval(&INSPOS-1))&COUNTRY %substr(&BASE,&INSPOS);
    %put ***&PHRASE***;

    This code will insert the country name in the appropriate place within the BASE macro variable which will be printed in the SAS log by %put statement:

    ***The rain in Spain stays mainly in the plain***

    Here are the code highlights:

    • %substr() macro function to extract two parts of its first argument (&BASE) - before and after insertion:
      • %substr(&BASE,1,%eval(&INSPOS-1))captures the first part of &BASE (before insertion): substring of &BASE starting from the position 1 with a length of %eval(&INSPOS-1).
      • %substr(&BASE,&INSPOS) captures the second part of &BASE (after insertion): substring of &BASE starting from the position &INSPOS till the end of &BASE (since the third argument is not specified).
    • In case of macro variables, we don’t need any concatenation functions – we just list the component pieces of the macro variable value in a proper order with desired separators (blanks in this case).

    NOTE: Unlike for SAS variables, you don’t need to assign the length of SAS macro variables which are automatically defined by their assigned values. The maximum length of SAS macro variables is 65,534 bytes.

    Inserting multiple instances of a substring into a SAS character string

    Sometimes you need to insert a substring into several places (positions p1, p2, …, pn) of a character string. In this case you can use the above strategy repeatedly or iteratively with one little caveat: start inserting from the highest position and moving backwards to the lowest position. This will preserve your pre-determined positions because positions are counted from left to right and inserting a substring at a higher position won’t change the lower position number. Otherwise, after insertion of a substring into lower position, all your higher positions will shift by the length of the inserted substring.

    Additional Resources for SAS character strings processing

    Your thoughts?

    Have you found this blog post useful? Please share your thoughts and feedback in the comments section below.

    Inserting a substring into a SAS string was published on SAS Users.

  • 1月 272015
     

    SAS Technical Support Problem Solvers“Here’s Johnny!!!” and well sometimes John and sometimes Jonathan and sometimes Jon.

    In the real world, you sometimes need to make matching character strings more flexible. This is especially common when merging data sets. Variables, especially names, are not always exactly the same in all sources of your data. When matching data, you need to be able to programmatically determine if ‘John Doe’ is the same as ‘Johnny Doe’. The term most often associated with this type of matching is ‘fuzzy matching’. Fortunately within SAS, there are several functions that allow you to perform a fuzzy match. I’ll show you the most common of these functions and then I will show you an example that uses my favorite from this list.

    COMPARE Function

    The COMPARE function returns the position of the leftmost character by which two strings differ, or returns 0 if there is no difference. This function is most useful when comparing two strings that should be the same, but may only differ in case or due to leading or trailing blanks. This function takes the place of the following code:

    if strip(upcase(name1))=strip(upcase(name2)) then do;
    

    The COMPARE function can be used to make the same comparison.

    if compare(name1,name2,’il’)=0 then do;
    

    COMPGED Function

    The COMPGED function returns the generalized edit distance between two strings. Specifically, the COMPGED function returns a generalization of the Levenshtein edit distance, which is a measure of dissimilarity between two strings. The Levenshtein edit distance is the number of operations (deletions, insertions, or replacement) of a single characters that are required to transform string-1 into string-2. 

    Each operation basically ‘costs’ a certain value. For example, if string-1 is the result of inserting a space into string-2, this has a cost of 10. The more dramatic the operation, the greater the cost.  The COMPGED will return the total cost for all operations that occur. The costs returned by COMPGED can be altered by using CALL COMPCOST so that the cost are specific to your needs. A common use I have seen for using the COMPGED function is using it to compare email addresses.

    email1='JohnDoe@abc.com';                                                                                                        
    email2='John_Doe@abc.com';                                                                                                          
    cost=compged(email1,email2);
    

    The value of COST will be 30 which is the cost of adding punctuation to a string.

    COMPLEV Function

    The COMPLEV function is very similar to the COMPGED function. The difference is that the Levenshtein edit distance that is computed by COMPLEV is a special case of the generalized edit distance that is computed by COMPGED. The result is the COMPLEV executes much more quickly than COMPGED. However, the COMPLEV function is not as powerful or versatile as the COMPGED function. The COMPLEV function is generally most useful when comparing simple strings and when speed of comparison is important.

    SPEDIS Function

    The SPEDIS function is the oldie-but-a-goodie of the bunch. The SPEDIS function determines the likelihood of two words matching, expressed as the asymmetric spelling distance between the two words. SPEDIS is similar to COMPGED in that it assigns a cost to the each operation such as swap, append and delete. SPEDIS will sum the costs and then divide the sum by the length of the first argument. It is important to remember this because this makes the order of the arguments important. This means that SPEDIS(X,Y) does not always equal SPEDIS(Y,X).

    Summary: Review your data

    Any process that is described as ‘fuzzy’ is obviously not an exact science. With each of these functions using different algorithms, each is going to have its own strengths.

    • If you are comparing complex strings and need the most control, then I would recommend looking at the COMPGED function.
    • If you are comparing fairly simple strings and within large data sets, then COMPLEV may be a better choice.

    The bottom line is that you will have to review your data and what differences are important to you. The results returned by these functions are subjective. You have to determine what an acceptable difference is. I suggest that you simply test out each function to see which one works best for you.

    COMPGED and SOUNDEX Example

    Of the above functions, COMPGED is the one I tend to use most often because, for the scenarios brought to me by various customers, COMPGED has produced the most precise results. I have been able to get the best results by combining the COMPGED function with the SOUNDEX function. Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

    The below example encodes each of the first names that is to be compared with the SOUNDEX function and then evaluates the results of using SOUNDEX with the COMPGED function. The goal of this example is to determine if the value of FNAME is the same as ‘Johnathan’ or ‘William’.

    /* Sample data that contains names and the class being taken */                                                                         
    data class;                                                                                                                             
      input fname : $12. lname $ class : $9.;                                                                                               
      datalines;                                                                                                                            
    Jon Smith Math                                                                                                                          
    John Smith Math                                                                                                                         
    Johnny Smith Math                                                                                                                       
    James Smith Math                                                                                                                        
    Will Miller Chemistry                                                                                                                   
    Willy Miller Chemistry                                                                                                                  
    Willie Miller Chemistry                                                                                                                 
    Bonny Davis Gym                                                                                                                         
    Milly Wilson Biology                                                                                                                    
    ;                                                                                                                                       
                                                                                                                                            
    /*  Data set that contains the name and the grade for the class */                                                                      
    data grade;                                                                                                                             
      input fname : $12. lname $ grade $;                                                                                                   
      datalines;                                                                                                                            
    Johnathan Smith A                                                                                                                       
    William Miller B                                                                                                                        
    ;                                                                                                                                       
                                                                                                                                            
                                                                                                                                            
    /* This code uses both the SOUNDEX and COMPGED functions.  I have found */                                                              
    /* that by comparing the strings produced by SOUNDEX, I have been able */                                                               
    /* to get a tighter comparison.                                        */                                                               
    data c;                                                                                                                                 
      set class;                                                                                                                            
      tmp1=soundex(fname);                                                                                                                  
      do i=1 to nobs;                                                                                                                       
        set grade(rename=(fname=fname2)) point=i nobs=nobs;                                                                                 
        tmp2=soundex(fname2);                                                                                                               
        dif=compged(tmp1,tmp2);                                                                                                             
        if dif<=50 then do;                                                                                                                 
          possible_match='Yes';                                                                                                             
          drop  i tmp1 tmp2  fname2;                                                                                                        
          output;                                                                                                                           
        end;                                                                                                                                
      end;                                                                                                                                  
    run;                                                                                                                                    
    proc print; run;
    
    

    One of the great things about SAS is the number of different ways you can accomplish a given goal.  If you have a technique to make fuzzy comparisons that you like to use, please share.  I would love to see some of the techniques being used!

    tags: character data, fuzzy match, Problem Solvers, SAS functions, soundex