text mining

2月 222021
 

Removing a piece from character string In my previous post, we addressed the problem of inserting substrings into SAS character strings. In this post we will solve a reverse problem of deleting substrings from SAS strings.

These two complementary tasks are commonly used for character data manipulation during data cleansing and preparation to transform data to a shape suitable for analysis, text mining, reporting, modeling and decision making.

As in the previous case of substring insertion, we will cover substring deletion for both, character variables and macro variables as both data objects are strings.

The following diagram illustrates what we are going to achieve by deleting a substring from a string:

Removing a substring from SAS string illustration

Have you noticed a logical paradox? We take away a “pieceof” cake and get the whole thing as result! 😊

Now, let’s get serious.

Deleting all instances of a substring from a character variable

Let’s suppose we have a variable STR whose values are sprinkled with some undesirable substring ‘<br>’ which we inherited from some HTML code where tag <br> denotes a line break. For our purposes, we want to remove all instances of those pesky <br>’s. First, let’s create a source data set imitating the described “contaminated” data:

data HAVE;
   infile datalines truncover;
   input STR $100.;
   datalines;
Some strings<br> have unwanted sub<br>strings in them<br>
<br>A s<br>entence must not be cont<br>aminated with unwanted subs<br>trings
Several line<br> breaks<br> are inserted here<br><br><br>
<br>Resulting st<br>ring must be n<br>eat and f<br>ree from un<br>desirable substrings
Ugly unwanted substrings<br><br> must <br>be<br> removed
<br>Let's remove them <br>using S<br>A<br>S language
Ex<br>periment is a<br>bout to b<br>egin
<br>Simpli<br>city may sur<br>prise you<br><br>
;

This DATA step creates WORK.HAVE data set that looks pretty ugly and is hardly usable:
Source data to be cleansed
The following code, however, cleans it up removing all those unwanted substrings ‘<br>’:

data WANT (keep=NEW_STR);
   length NEW_STR $100;
   SUB = '<br>';
   set HAVE;
   NEW_STR = transtrn(STR,'<br>',trimn(''));
run;

After this code runs, the data set WANT will look totally clean and usable:
Cleaned data

Code highlights

  • We use .

The TRANSTRN function is similar to TRANWRD function which replaces all occurrences of a substring in a character string. While TRANWRD uses a single blank when the replacement string has a length of zero, TRANSTRN does allow the replacement string to have a length of zero which essentially means removing.

  • TRIM() function which removes trailing blanks from a character string and returns one blank if the string is missing. However, when it comes to removing (which is essentially replacement with zero length substring) the ability of TRIMN function to return a zero-length string makes all the difference.

Deleting all instances of a substring from a SAS macro variable

For macro variables, I can see two distinct methods of removing all occurrences of undesirable substring.

Method 1: Using SAS data step

Here is a code example:

%let STR = Some strings<br> have unwanted sub<br>strings in them<br>;
%let SUB = <br>;
 
data _null_;
   NEW_STR = transtrn("&STR","&SUB",trimn(''));
   call symputx('NEW',NEW_STR);
run;
 
%put &=STR;
%put &=NEW;

In this code, we stick our macro variable value &STR in double quotes in the transtrn() function as the first argument (source). The macro variable value &SUB, also double quoted, is placed as a second argument. After variable NEW_STR is produced free from the &SUB substrings, we create a macro variable NEW using

%let STR = Some strings<br> have unwanted sub<br>strings in them<br>;
%let SUB = <br>;
 
%let NEW = %sysfunc(transtrn(&STR,&SUB,%sysfunc(trimn(%str()))));
 
%put &=STR;
%put &=NEW;

Deleting selected instance of a substring from a character variable

In many cases we need to remove not all substring instances form a string, but rather a specific occurrence of a substring. For example, in the following sentence (which is a quote by Albert Einstein) “I believe in intuitions and inspirations. I sometimes feel that I am right. I sometimes do not know that I am.” the second word “sometimes” was added by mistake. It needs to be removed. Here is a code example presenting two solutions of how such a deletion can be done:

data A;
   length STR STR1 STR2 $250;
   STR = 'I believe in intuitions and inspirations. I sometimes feel that I am right. I sometimes do not know that I am.';
   SUB = 'sometimes';
   STR_LEN = length(STR);
   SUB_LEN = length(SUB);
   POS = find(STR,SUB,-STR_LEN);
   STR1 = catx(' ', substr(STR,1,POS-1), substr(STR,POS+SUB_LEN)); /* solution 1 */
   STR2 = kupdate(STR,POS,SUB_LEN+1);                              /* solution 2 */
   put STR1= / STR2=;
run;

The code will produce two correct identical values of this quote in the SAS log (notice, that the second instance of word “sometimes” is gone):

STR1=I believe in intuitions and inspirations. I sometimes feel that I am right. I do not know that I am.
STR2=I believe in intuitions and inspirations. I sometimes feel that I am right. I do not know that I am.

Code highlights

  • FIND() function determines position POS of the substring SUB to be deleted in the string STR. In this particular example, we used the fact, that the second occurrence of word “sometimes” is the first occurrence of this word when counted from right to left. That is indicated by the negative 3-rd argument (-STR_LEN) which means that FIND function searches STR for SUB starting from position STR_LEN from right to left.

Solution 1

This is the most traditional solution that cuts out two pieces of the string – before and after the substring being deleted – and then concatenates them together thus removing that substring:

  • substr(STR,1,POS-1) extracts the first part of the source string STR before the substring to be deleted: from position 1 to position POS-1.
  • substr(STR,POS+SUB_LEN) extracts the second part of the source string STR after the substring to be deleted: from position POS+SUB_LEN till the end of STR value (since the third argument, length, is not specified).
  • Solution 2

    Finding n-th instance of a substring within a string .

Deleting selected instance of a substring from a SAS macro variable

Here is a code example of how to solve the same problem as it relates to SAS macro variables. For brevity, we provide just one solution using %sysfunc and KUPDATE() function:

%let STR = I believe in intuitions and inspirations. I sometimes feel that I am right. I sometimes do not know that I am.;
%let SUB = sometimes;
%let POS = %sysfunc(find(&STR,&SUB,-%length(&STR)));
%let STR2 = %sysfunc(kupdate(&STR,&POS,%eval(%length(&SUB)+1)));
%put "&STR2";

This should produce the following corrected Einstein’s quote in the SAS log:

"I believe in intuitions and inspirations. I sometimes feel that I am right. I do not know that I am."

Additional Resources for SAS character strings processing

Your thoughts?

Have you found this blog post useful? Please share your thoughts and feedback in the comments section below.

Deleting a substring from a SAS string was published on SAS Users.

2月 162021
 

Insert a piece into a stringSAS provides an extensive set of tools for data cleansing and preparation – transforming data to a shape suitable for analysis, text mining, reporting, modeling and ultimately decision making.

In this post we will cover one of the common tasks of character data manipulation – inserting a substring into a SAS character string.

A diagram below illustrates what we are going to achieve:

Illustration for inserting a substring into a string

SAS character strings come in two different incarnations: character variables and macro variables. Since these two are quite different SAS language objects, let’s cover them one by one separately.

Inserting a substring into a character variable

Here is our task: we have a SAS character variable (string) and we want to insert in it a value of another character variable (substring) starting at a particular specified position.

Let’s say we have a string BASE in which we want to insert a COUNTRY name right before word "stays" to make different variation of the resultant phrase. Here is an example of how this can be easily done:

data COUNTRIES;
   length COUNTRY $20;
   input COUNTRY;
   datalines;
Spain
Argentina
Slovenia
Romania
USA
Luxembourg
Egypt
Switzerland
;
 
data NEW (keep=COUNTRY PHRASE);
   BASE = 'The rain in stays mainly in the plain';
   INSPOS = find(BASE,'stays');
   set COUNTRIES;
   length PHRASE $50;
   PHRASE = catx(' ',substr(BASE,1,INSPOS-1),COUNTRY,substr(BASE,INSPOS));
run;

This code dynamically creates variable PHRASE out of values of variable BASE and the values of variable COUNTRY, thus making it data-driven.

After this code runs, the data set NEW will look like this:

Results after inserting a substring into a character string

Here are the code highlights:

  • maximum length of varying-length character variables is 536,870,911 characters (UTF-8 encoding).

    Inserting a substring into a SAS macro variable

    Let’s solve a similar task, but now instead of SAS variables we will operate with SAS macro variables, since they are strings too.

    Here is our problem to solve: we have a SAS macro variable (string) and we want to insert in it a value of another macro variable (substring) starting at a particular specified position.

    Let’s say we have a macro variable BASE with value of The rain in stays mainly in the plain in which we want to insert a country name defined by macro variable COUNTRY with value of Spain right before word stays. Here is an example of how this can be done:

    %let BASE = The rain in stays mainly in the plain;
    %let COUNTRY = Spain;
    %let W = stays;
     
    %let INSPOS = %index(&BASE,&W);
    %let PHRASE = %substr(&BASE,1,%eval(&INSPOS-1))&COUNTRY %substr(&BASE,&INSPOS);
    %put ***&PHRASE***;

    This code will insert the country name in the appropriate place within the BASE macro variable which will be printed in the SAS log by %put statement:

    ***The rain in Spain stays mainly in the plain***

    Here are the code highlights:

    • %substr() macro function to extract two parts of its first argument (&BASE) - before and after insertion:
      • %substr(&BASE,1,%eval(&INSPOS-1))captures the first part of &BASE (before insertion): substring of &BASE starting from the position 1 with a length of %eval(&INSPOS-1).
      • %substr(&BASE,&INSPOS) captures the second part of &BASE (after insertion): substring of &BASE starting from the position &INSPOS till the end of &BASE (since the third argument is not specified).
    • In case of macro variables, we don’t need any concatenation functions – we just list the component pieces of the macro variable value in a proper order with desired separators (blanks in this case).

    NOTE: Unlike for SAS variables, you don’t need to assign the length of SAS macro variables which are automatically defined by their assigned values. The maximum length of SAS macro variables is 65,534 bytes.

    Inserting multiple instances of a substring into a SAS character string

    Sometimes you need to insert a substring into several places (positions p1, p2, …, pn) of a character string. In this case you can use the above strategy repeatedly or iteratively with one little caveat: start inserting from the highest position and moving backwards to the lowest position. This will preserve your pre-determined positions because positions are counted from left to right and inserting a substring at a higher position won’t change the lower position number. Otherwise, after insertion of a substring into lower position, all your higher positions will shift by the length of the inserted substring.

    Additional Resources for SAS character strings processing

    Your thoughts?

    Have you found this blog post useful? Please share your thoughts and feedback in the comments section below.

    Inserting a substring into a SAS string was published on SAS Users.

  • 3月 052020
     

    Have you heard that SAS offers a collection of new, high-performance CAS procedures that are compatible with a multi-threaded approach? The free e-book Exploring SAS® Viya®: Data Mining and Machine Learning is a great resource to learn more about these procedures and the features of SAS® Visual Data Mining and Machine Learning. Download it today and keep reading for an excerpt from this free e-book!

    In SAS Studio, you can access tasks that help automate your programming so that you do not have to manually write your code. However, there are three options for manually writing your programs in SAS® Viya®:

    1. SAS Studio provides a SAS programming environment for developing and submitting programs to the server.
    2. Batch submission is also still an option.
    3. Open-source languages such as Python, Lua, and Java can submit code to the CAS server.

    In this blog post, you will learn the syntax for two of the new, advanced data mining and machine learning procedures: PROC TEXTMINE and PROCTMSCORE.

    Overview

    The TEXTMINE and TMSCORE procedures integrate the functionalities from both natural language processing and statistical analysis to provide essential functionalities for text mining. The procedures support essential natural language processing (NLP) features such as tokenizing, stemming, part-of-speech tagging, entity recognition, customized stop list, and so on. They also support dimensionality reduction and topic discovery through Singular Value Decomposition.

    In this example, you will learn about some of the essential functionalities of PROC TEXTMINE and PROC TMSCORE by using a text data set containing 1,830 Amazon reviews of electronic gaming systems. The data set is named Amazon. You can find similar data sets of Amazon reviews at http://jmcauley.ucsd.edu/data/amazon/.

    PROC TEXTMINE

    The Amazon data set has already been loaded into CAS. The review content is stored in the variable ReviewBody, and we generate a unique review ID for each review. In the proc call shown in Program 1 we ask PROC TEXTMINE to do three tasks:

    1. parse the documents in table reviews and generate the term by document matrix
    2. perform dimensionality reduction via Singular Value Decomposition
    3. perform topic discovery based on Singular Value Decomposition results

    Program 1: PROC TEXTMINE

    data mycaslib.amazon;
        set mylib.amazon;
    run;
    
    data mycaslib.engstop;
        set mylib.engstop;
    run;
    
    proc textmine data=mycaslib.amazon;
        doc_id id;
        var reviewbody;
    
     /*(1)*/  parse reducef=2 entities=std stoplist=mycaslib.engstop 
              outterms=mycaslib.terms outparent=mycaslib.parent
              outconfig=mycaslib.config;
    
     /*(2)*/  svd k=10 svdu=mycaslib.svdu outdocpro=mycaslib.docpro
              outtopics=mycaslib.topics;
    
    run;

    (1) The first task (parsing) is specified in the PARSE statement. Parameter “reducef” specifies the minimum number of times a term needs to appear in the text to be included in the analysis. Parameter “stop” specifies a list of terms to be excluded from the analysis, such as “the”, “this”, and “that”. Outparent is the output table that stores the term by document matrix, and Outterms is the output table that stores the information of terms that are included in the term by document matrix. Outconfig is the output table that stores configuration information for future scoring.

    (2) Tasks 2 and 3 (dimensionality reduction and topic discovery) are specified in the SVD statement. Parameter K specifies the desired number of dimensions and number of topics. Parameter SVDU is the output table that stores the U matrix from SVD calculations, which is needed in future scoring. Parameter OutDocPro is the output table that stores the new matrix with reduced dimensions. Parameter OutTopics specifies the output table that stores the topics discovered.

    Click the Run shortcut button or press F3 to run Program 1. The terms table shown in Output 1 stores the tagging, stemming, and entity recognition results. It also stores the number of times each term appears in the text data.

    Output 1: Results from Program 1

    PROC TMSCORE

    PROC TEXTMINE is used with large training data sets. When you have new documents coming in, you do not need to re-run all the parsing and SVD computations with PROC TEXTMINE. Instead, you can use PROC TMSCORE to score new text data. The scoring procedure parses the new document(s) and projects the text data into the same dimensions using the SVD weights derived from the original training data.

    In order to use PROC TMSCORE to generate results consistent with PROC TEXTMINE, you need to provide the following tables generated by PROC TEXTMINE:

    • SVDU table – provides the required information for projection into the same dimensions.
    • Config table – provides parameter values for parsing.
    • Terms table – provides the terms that should be included in the analysis.

    Program 2 shows an example of TMSCORE. It uses the same input data layout used for PROC TEXTMINE code, so it will generate the same docpro and parent output tables, as shown in Output 2.

    Program 2: PROC TMSCORE

    Proc tmscore data=mycaslib.amazon svdu=mycaslib.svdu
            config=mycaslib.config terms=mycaslib.terms
            svddocpro=mycaslib.score_docpro outparent=mycaslib.score_parent;
        var reviewbody;
        doc_id id;
    run;

     

    Output 2: Results from Program 2

    To learn more about advanced data mining and machine learning procedures available in SAS Viya, including PROC FACTMAC, PROC TEXTMINE, and PROC NETWORK, you can download the free e-book, Exploring SAS® Viya®: Data Mining and Machine Learning. Exploring SAS® Viya® is a series of e-books that are based on content from SAS® Viya® Enablement, a free course available from SAS Education. You can follow along with examples in real time by watching the videos.

     

    Learn about new data mining and machine learning procedures in SAS Viya was published on SAS Users.

    5月 302018
     

    Maybe you’ve heard of text analytics (or natural language processing) as a way to analyze consumer sentiment. Businesses often use these techniques to analyze customer complaints or comments on social media, to identify when a response is needed. But text analytics has far more to offer than examining posts on [...]

    5 remarkable uses for text analytics was published on SAS Voices by Tom Sabo

    6月 262017
     

    The role of analytics in combating terrorism Earlier this spring, I found myself walking through a quiet and peaceful grove of spruce trees south of the small hamlet of Foy outside of Bastogne, Belgium.  On travel in Europe, I happened to have some extra time before heading to London.  I [...]

    The analytics of evil was published on SAS Voices by Steve Bennett

    6月 262017
     

    The role of analytics in combating terrorism Earlier this spring, I found myself walking through a quiet and peaceful grove of spruce trees south of the small hamlet of Foy outside of Bastogne, Belgium.  On travel in Europe, I happened to have some extra time before heading to London.  I [...]

    The analytics of evil was published on SAS Voices by Steve Bennett

    8月 222016
     

    A huge proportion of big data is unstructured text (such as client interactions, product reviews, call center logs, emails, blogs and tweets). Organizations starting to invest in advanced analytics often overlook the value text analytics could add to the process. But when data scientists or analysts get to work exploring […]

    3 Steps to big data success with text analytics was published on SAS Voices.

    7月 062016
     

    When a person feels sufficiently wronged to lodge a complaint with the Consumer Financial Protection Bureau (CFPB), there’s likely to be some negative sentiment involved. But is there a connection between the language they use and the likelihood they will be compensated by the offending company? At the upcoming Sentiment […]

    The post Sentiment analysis, machine learning open up world of possibilities appeared first on The Text Frontier.

    5月 042016
     

    How many of you have read The Cuckoo’s Calling by the previously unknown author Robert Galbraith? The answer is not many, until it came out that Robert Galbraith was none other than blockbuster best-selling author JK Rowling. Sales then skyrocketed. Rowling recently published some of the rejection letters she received as […]

    Could a recommendation engine pick 'The Cuckoo’s Calling' as a best-seller? was published on SAS Voices.

    10月 272015
     

    In my last post, I talked about why SAS utilizes a rotated Singular Value Decomposition (SVD) approach for topic generation, rather than using Latent Dirichlet Allocation (LDA).  I noted that LDA has undergone a variety of improvements in the last seven years since SAS opted to use the SVD method.  So, the […]

    The post Topical advice about topics, redux appeared first on The Text Frontier.

     Posted by at 3:13 上午