SAS programmers

2月 222021
 

Removing a piece from character string In my previous post, we addressed the problem of inserting substrings into SAS character strings. In this post we will solve a reverse problem of deleting substrings from SAS strings.

These two complementary tasks are commonly used for character data manipulation during data cleansing and preparation to transform data to a shape suitable for analysis, text mining, reporting, modeling and decision making.

As in the previous case of substring insertion, we will cover substring deletion for both, character variables and macro variables as both data objects are strings.

The following diagram illustrates what we are going to achieve by deleting a substring from a string:

Removing a substring from SAS string illustration

Have you noticed a logical paradox? We take away a “pieceof” cake and get the whole thing as result! 😊

Now, let’s get serious.

Deleting all instances of a substring from a character variable

Let’s suppose we have a variable STR whose values are sprinkled with some undesirable substring ‘<br>’ which we inherited from some HTML code where tag <br> denotes a line break. For our purposes, we want to remove all instances of those pesky <br>’s. First, let’s create a source data set imitating the described “contaminated” data:

data HAVE;
   infile datalines truncover;
   input STR $100.;
   datalines;
Some strings<br> have unwanted sub<br>strings in them<br>
<br>A s<br>entence must not be cont<br>aminated with unwanted subs<br>trings
Several line<br> breaks<br> are inserted here<br><br><br>
<br>Resulting st<br>ring must be n<br>eat and f<br>ree from un<br>desirable substrings
Ugly unwanted substrings<br><br> must <br>be<br> removed
<br>Let's remove them <br>using S<br>A<br>S language
Ex<br>periment is a<br>bout to b<br>egin
<br>Simpli<br>city may sur<br>prise you<br><br>
;

This DATA step creates WORK.HAVE data set that looks pretty ugly and is hardly usable:
Source data to be cleansed
The following code, however, cleans it up removing all those unwanted substrings ‘<br>’:

data WANT (keep=NEW_STR);
   length NEW_STR $100;
   SUB = '<br>';
   set HAVE;
   NEW_STR = transtrn(STR,'<br>',trimn(''));
run;

After this code runs, the data set WANT will look totally clean and usable:
Cleaned data

Code highlights

  • We use .

The TRANSTRN function is similar to TRANWRD function which replaces all occurrences of a substring in a character string. While TRANWRD uses a single blank when the replacement string has a length of zero, TRANSTRN does allow the replacement string to have a length of zero which essentially means removing.

  • TRIM() function which removes trailing blanks from a character string and returns one blank if the string is missing. However, when it comes to removing (which is essentially replacement with zero length substring) the ability of TRIMN function to return a zero-length string makes all the difference.

Deleting all instances of a substring from a SAS macro variable

For macro variables, I can see two distinct methods of removing all occurrences of undesirable substring.

Method 1: Using SAS data step

Here is a code example:

%let STR = Some strings<br> have unwanted sub<br>strings in them<br>;
%let SUB = <br>;
 
data _null_;
   NEW_STR = transtrn("&STR","&SUB",trimn(''));
   call symputx('NEW',NEW_STR);
run;
 
%put &=STR;
%put &=NEW;

In this code, we stick our macro variable value &STR in double quotes in the transtrn() function as the first argument (source). The macro variable value &SUB, also double quoted, is placed as a second argument. After variable NEW_STR is produced free from the &SUB substrings, we create a macro variable NEW using

%let STR = Some strings<br> have unwanted sub<br>strings in them<br>;
%let SUB = <br>;
 
%let NEW = %sysfunc(transtrn(&STR,&SUB,%sysfunc(trimn(%str()))));
 
%put &=STR;
%put &=NEW;

Deleting selected instance of a substring from a character variable

In many cases we need to remove not all substring instances form a string, but rather a specific occurrence of a substring. For example, in the following sentence (which is a quote by Albert Einstein) “I believe in intuitions and inspirations. I sometimes feel that I am right. I sometimes do not know that I am.” the second word “sometimes” was added by mistake. It needs to be removed. Here is a code example presenting two solutions of how such a deletion can be done:

data A;
   length STR STR1 STR2 $250;
   STR = 'I believe in intuitions and inspirations. I sometimes feel that I am right. I sometimes do not know that I am.';
   SUB = 'sometimes';
   STR_LEN = length(STR);
   SUB_LEN = length(SUB);
   POS = find(STR,SUB,-STR_LEN);
   STR1 = catx(' ', substr(STR,1,POS-1), substr(STR,POS+SUB_LEN)); /* solution 1 */
   STR2 = kupdate(STR,POS,SUB_LEN+1);                              /* solution 2 */
   put STR1= / STR2=;
run;

The code will produce two correct identical values of this quote in the SAS log (notice, that the second instance of word “sometimes” is gone):

STR1=I believe in intuitions and inspirations. I sometimes feel that I am right. I do not know that I am.
STR2=I believe in intuitions and inspirations. I sometimes feel that I am right. I do not know that I am.

Code highlights

  • FIND() function determines position POS of the substring SUB to be deleted in the string STR. In this particular example, we used the fact, that the second occurrence of word “sometimes” is the first occurrence of this word when counted from right to left. That is indicated by the negative 3-rd argument (-STR_LEN) which means that FIND function searches STR for SUB starting from position STR_LEN from right to left.

Solution 1

This is the most traditional solution that cuts out two pieces of the string – before and after the substring being deleted – and then concatenates them together thus removing that substring:

  • substr(STR,1,POS-1) extracts the first part of the source string STR before the substring to be deleted: from position 1 to position POS-1.
  • substr(STR,POS+SUB_LEN) extracts the second part of the source string STR after the substring to be deleted: from position POS+SUB_LEN till the end of STR value (since the third argument, length, is not specified).
  • Solution 2

    Finding n-th instance of a substring within a string .

Deleting selected instance of a substring from a SAS macro variable

Here is a code example of how to solve the same problem as it relates to SAS macro variables. For brevity, we provide just one solution using %sysfunc and KUPDATE() function:

%let STR = I believe in intuitions and inspirations. I sometimes feel that I am right. I sometimes do not know that I am.;
%let SUB = sometimes;
%let POS = %sysfunc(find(&STR,&SUB,-%length(&STR)));
%let STR2 = %sysfunc(kupdate(&STR,&POS,%eval(%length(&SUB)+1)));
%put "&STR2";

This should produce the following corrected Einstein’s quote in the SAS log:

"I believe in intuitions and inspirations. I sometimes feel that I am right. I do not know that I am."

Additional Resources for SAS character strings processing

Your thoughts?

Have you found this blog post useful? Please share your thoughts and feedback in the comments section below.

Deleting a substring from a SAS string was published on SAS Users.

2月 162021
 

Insert a piece into a stringSAS provides an extensive set of tools for data cleansing and preparation – transforming data to a shape suitable for analysis, text mining, reporting, modeling and ultimately decision making.

In this post we will cover one of the common tasks of character data manipulation – inserting a substring into a SAS character string.

A diagram below illustrates what we are going to achieve:

Illustration for inserting a substring into a string

SAS character strings come in two different incarnations: character variables and macro variables. Since these two are quite different SAS language objects, let’s cover them one by one separately.

Inserting a substring into a character variable

Here is our task: we have a SAS character variable (string) and we want to insert in it a value of another character variable (substring) starting at a particular specified position.

Let’s say we have a string BASE in which we want to insert a COUNTRY name right before word "stays" to make different variation of the resultant phrase. Here is an example of how this can be easily done:

data COUNTRIES;
   length COUNTRY $20;
   input COUNTRY;
   datalines;
Spain
Argentina
Slovenia
Romania
USA
Luxembourg
Egypt
Switzerland
;
 
data NEW (keep=COUNTRY PHRASE);
   BASE = 'The rain in stays mainly in the plain';
   INSPOS = find(BASE,'stays');
   set COUNTRIES;
   length PHRASE $50;
   PHRASE = catx(' ',substr(BASE,1,INSPOS-1),COUNTRY,substr(BASE,INSPOS));
run;

This code dynamically creates variable PHRASE out of values of variable BASE and the values of variable COUNTRY, thus making it data-driven.

After this code runs, the data set NEW will look like this:

Results after inserting a substring into a character string

Here are the code highlights:

  • maximum length of varying-length character variables is 536,870,911 characters (UTF-8 encoding).

    Inserting a substring into a SAS macro variable

    Let’s solve a similar task, but now instead of SAS variables we will operate with SAS macro variables, since they are strings too.

    Here is our problem to solve: we have a SAS macro variable (string) and we want to insert in it a value of another macro variable (substring) starting at a particular specified position.

    Let’s say we have a macro variable BASE with value of The rain in stays mainly in the plain in which we want to insert a country name defined by macro variable COUNTRY with value of Spain right before word stays. Here is an example of how this can be done:

    %let BASE = The rain in stays mainly in the plain;
    %let COUNTRY = Spain;
    %let W = stays;
     
    %let INSPOS = %index(&BASE,&W);
    %let PHRASE = %substr(&BASE,1,%eval(&INSPOS-1))&COUNTRY %substr(&BASE,&INSPOS);
    %put ***&PHRASE***;

    This code will insert the country name in the appropriate place within the BASE macro variable which will be printed in the SAS log by %put statement:

    ***The rain in Spain stays mainly in the plain***

    Here are the code highlights:

    • %substr() macro function to extract two parts of its first argument (&BASE) - before and after insertion:
      • %substr(&BASE,1,%eval(&INSPOS-1))captures the first part of &BASE (before insertion): substring of &BASE starting from the position 1 with a length of %eval(&INSPOS-1).
      • %substr(&BASE,&INSPOS) captures the second part of &BASE (after insertion): substring of &BASE starting from the position &INSPOS till the end of &BASE (since the third argument is not specified).
    • In case of macro variables, we don’t need any concatenation functions – we just list the component pieces of the macro variable value in a proper order with desired separators (blanks in this case).

    NOTE: Unlike for SAS variables, you don’t need to assign the length of SAS macro variables which are automatically defined by their assigned values. The maximum length of SAS macro variables is 65,534 bytes.

    Inserting multiple instances of a substring into a SAS character string

    Sometimes you need to insert a substring into several places (positions p1, p2, …, pn) of a character string. In this case you can use the above strategy repeatedly or iteratively with one little caveat: start inserting from the highest position and moving backwards to the lowest position. This will preserve your pre-determined positions because positions are counted from left to right and inserting a substring at a higher position won’t change the lower position number. Otherwise, after insertion of a substring into lower position, all your higher positions will shift by the length of the inserted substring.

    Additional Resources for SAS character strings processing

    Your thoughts?

    Have you found this blog post useful? Please share your thoughts and feedback in the comments section below.

    Inserting a substring into a SAS string was published on SAS Users.

  • 1月 272021
     

    SASensei logoSAS offers myriad ways to level up your SAS skills (scroll to the bottom to see a list of SAS-provided learning paths and channels). In this post, I introduce you to SASensei, an independent, third-party online SAS learning resource that I enjoy a lot.

    Learning: dull or fun?

    Learning is not always associated with fun. Sometimes it feels difficult and exhausting. New concepts often contradict our prior knowledge and experience, compelling us to rethink, adjust, change and adapt to new paradigms.

    Learning new ideas, skills and technologies can be intimidating, challenging and demanding. While learning, you are stretching out of your comfort zone. But that feeling is only transient. As a matter of attitude, learning is not about pushing yourself out of your comfort zone, it’s about expanding your comfort zone. And that is long lasting. The more you learn, the more comfortable and self-confident you become.

    Learning does not have to be tedious. Look at pre-school kids. They learn basic life skills like walking (rolling, crawling), talking (in one or more languages), asking questions (a lot) – all without taking classes, just through their natural curiosity and ... playing games.

    What is SASensei? Gamified SAS learning

    When I first discovered the SASensei online SAS learning game/application I was pleasantly surprised by its non-traditional approach to learning such a serious and well-established platform as SAS.

    As stated on its website, “Sasensei is a question based learning system. You must demonstrate your command of SAS® to earn Tokens - which should be wisely invested, to enable you to unlock new levels within the game...”

    The following screenshot shows the main page of the SASensei website that displays a dashboard of the top players (they call it leaderboard). You can filter it geographically - by Country, Continent, or World, as well as by the timeline – by Past Month, Past Year, or All Time.

    SASensei leaderboard

    Privacy or prominence

    Users have full control of their privacy or prominence. As you can see in the screenshot above, registered users are displayed by their screen names. This allows the users to either remain anonymous by selecting some fictitious obscure screen name or use their real name. Users can change their screen name at any time.

    Rules of the game

    In this blog post I provide just an overview of the main functionality and features of the SASensei learning platform. For detailed rules of the game, see SASensei Documentation.

    Play and learn

    Users are offered a variety of learning activities:

    • Viewing, reviewing and submitting SAS-learning flashcards;
    • Playing, reviewing and submitting questions by different SAS-related topics;
    • Taking and creating public, private, multi-player and custom quizzes;
    • Providing feedback on questions and flashcards by voting and commenting.

    Users can challenge themselves by delving into different topics. Your successes and failures will provide you an honest and objective estimation of your SAS strengths as well as weaknesses. A healthy competition with other users encourages you to learn more and hone your SAS skills. When you fail a question, you can review the explanation of the correct answer and thus learn why you failed and acquire new knowledge, tips and tricks quickly and efficiently.

    Invest, score, win and build a reputation

    To play you will need to earn and spend tokens which are essentially the game’s currency. To motivate you further, you also earn reputation points, which is your ultimate score, a level of achievement in demonstrating SAS skills. Your reputation score is prominently displayed in your public profile. As you progress in the game and your reputation grows, additional functionality unlocks and becomes available to you. Your reputation score determines your SASensei standing level which is derived from those used in martial arts:

    Sasensei title Sasamurai title

    • White Belt (new players)
    • Yellow Belt (reputation ≥ 50)
    • Green Belt (reputation ≥ 100)
    • Black Belt (reputation ≥ 200)
    • Sasamurai (reputation ≥ 500)
    • Assassin (reputation ≥ 1000)
    • Sasensei (reputation ≥ 5000)

    Sample SASensei question

    When you play a question, you select a topic, and then you are presented with a randomly selected multiple-choice question of a specified time limit (30, 60, 90 or 120 seconds). Here is a sample of such question:

    Question:

    What is wrong with the following LIBNAME statement?
    libname fruits (apples oranges tomatoes);

    Answers:

    • Incorrect syntax
    • You cannot mix apples and oranges in LIBNAME statement
    • Nothing is wrong, valid LIBNAME statement
    • Tomatoes are not fruits, therefore the statement is not correct

    Correct answer:

    Nothing is wrong, valid LIBNAME statement

    Explanation:

    According to Combine and conquer with SAS for examples of usage.

    Question
    Try tackling a question on your own in the SASensei environment to get real life experience: Sample Question.

    Take a SASensei sample quiz

    There are various quizzes available at SASensei: public quizzes, multiplayer quiz games, private quizzes (tests) for students.

    A public quiz contains 12 questions with a total time cap of 12 minutes, and costs eight tokens to play. You can choose a single topic (sas statements, sas macro, procedures, options, etc.), and if you pass (75% or more) you get 12 tokens back, plus 20 point to your reputation. If you get 100%, you get 30 reputation points plus Top Student badge. A count of passed sessions (by topic) is displayed on your public profile.

    QuizAlthough public quizzes are unlocked at the SASamurai level, for the readers of this blog, I have created a special custom quiz sample so you can experience it firsthand right here, right now. Just click on this link, Sample Quiz, register, and enjoy your ride.

    See you at the top of the SASensei dashboard!

    Credit

    Big THANKS to Allan Bowe (United Kingdom) – SAS innovator and entrepreneur who created and founded SASensei learning platform.

    Other SAS learning resources

    Game on! SASensei: a fun way to learn SAS was published on SAS Users.

    1月 132021
     

    Running SAS programs in parallel reduces run time

    As earth completes its routine annual circle around the sun and a new (and hopefully better) year kicks in, it is a perfect occasion to reflect on the idiosyncrasy of time.

    While it is customary to think that 3+2=5, it is only true in sequential world. In parallel world, however, 3+2=3. Think about it: if you have two SAS programs one of which runs 3 hours, and the second one runs 2 hours, their total duration will be 5 hours if you run them one after another sequentially, but it will take only 3 hours if you run them simultaneously, in parallel.

    I am sure you remember those “filling up a swimming pool” math problems from elementary school. They clearly and convincingly demonstrate that two pipes will fill up a swimming pool faster than one. That’s the power of running water in parallel.

    The same principle of parallel processing (or parallel computing) is applicable to SAS programs (or non-SAS programs) by running their different independent pieces in separate SAS sessions at the same time (in parallel).  Divide and conquer.

    You might be surprised at how easily this can be done, and at the same time how powerful it is. Let’s take a look.

    SAS/CONNECT

    MP CONNECT) parallel processing functionality was added to SAS/CONNECT enabling you to execute multiple SAS sessions asynchronously. When a remote SAS session kicks off asynchronously, a portion of your SAS program is sent to the server session for execution and control is immediately returned to the client session. The client session can continue with its own processing or spawn one or more additional asynchronous remote server sessions.

    Running programs in parallel on a single machine

    Sometimes, what comes across as new is just well forgotten old. They used to be Central Processing Units (CPU), but now they are called just processors. Nowadays, practically every single computer is a “multi-machine” (or to be precise “multi-processor”) device. Even your laptop. Just open Task Manager (Ctrl-Alt-Delete), click on the Performance tab and you will see how many physical processors (or cores) and logical processors your laptop has:

    Parallel processing on a single machine

    That means that this laptop can run eight independent SAS processes (sessions) at the same time. All you need to do is to say nicely “Dear Mr. & Mrs. SAS/CONNECT, my SAS program consists of several independent pieces. Would you please run each piece in its own SAS session, and run them all at the same time?” And believe me, SAS/CONNECT does not care how many logical processors you have, whether your logical processors are far away from each other “remote machines” or they are situated in a single laptop or even in a single chip.

    Here is how you communicate your request to SAS/CONNECT in SAS language.

    Spawning multiple SAS sessions using MP Connect

    Suppose you have a SAS code that consists of several pieces – DATA or PROC steps that are independent of each other, i.e. they do not require to be run in a specific sequence. For example, each of the two pieces generates its own data set.

    Then we can create these two data sets in two separate “remote” SAS sessions that run in parallel. Here is how you do this.  (For illustration purposes, I just create two dummy data sets.)

    options sascmd="sas";
     
    /* Current datetime */
    %let _start_dt = %sysfunc(datetime());
     
    /* Prosess 1 */
    signon task1;
    rsubmit task1 wait=no;
     
       libname SASDL 'C:\temp';
     
       data SASDL.DATA_A (keep=str);
          length str $1000;
          do i=1 to 1150000;
             str = '';
             do j=1 to 1000;
                str = cats(str,'A');
             end;
             output;
          end;
       run;
     
    endrsubmit;
     
    /* Process 2 */
    signon task2;
    rsubmit task2 wait=no;
     
       libname SASDL 'C:\temp';
     
       data SASDL.DATA_B (keep=str);
          length str $1000;
          do i=1 to 750000;
             str = '';
             do j=1 to 1000;
                str = cats(str,'B');
             end;
             output;
          end;
       run;
     
    endrsubmit;
     
    waitfor _all_;
    signoff _all_;
     
    /* Print total duration */
    data _null_;
       dur = datetime() - &_start_dt;
       put 30*'-' / ' TOTAL DURATION:' dur time13.2 / 30*'-';
    run;

    In this code, the key elements are:

    SIGNON Statement - initiates a connection between a client session and a server session.

    ENDRSUBMIT statement - marks the end of a block of statements that a client session submits to a server session for execution.

    SIGNOFF Statement - ends the connection between a client session and a server session.

    Parallel processing vs. threaded processing

    There is a distinction between parallel processing described above and threaded processing (aka multithreading). Parallel processing is achieved by running several independent SAS sessions, each processing its own unit of SAS code.

    Threaded processing, on the other hand, is achieved by developing special algorithms and implementing executable codes that run on multiple processors (threads) within the same SAS session. Amdahl's law, which provides theoretical background and estimation of potential time saving achievable by parallel computing on multiple processors.

    Passing information to and from “remote” SAS sessions

    Besides passing pieces of SAS code from client sessions to server sessions, MP CONNECT allows you to pass some other SAS objects.

    Passing data library definitions

    For example, if you have a data library defined in your client session, you may pass that library definition on to multiple server sessions without re-defining them in each server session.

    Let’s say you have two data libraries defined in your client session:

    libname SRCLIB oracle user=myusr1 password=mypwd1 path=mysrv1;
    libname TGTLIB '/sas/data/datastore1';

    In order to make these data libraries available in the remote session all you need is to add

    rsubmit task1 wait=no inheritlib=(SRCLIB TGTLIB);

    This will allow libraries that are defined in the client session to be inherited by and available in the server session. As an option, each client libref can be associated with a libref that is named differently in the server session:

    rsubmit task1 wait=no inheritlib=(SRCLIB=NEWSRC TGTLIB=NEWTGT);

    Passing macro variables from client to server session

    options sascmd="sas";
    %let run_dt = %sysfunc(datetime());
    signon task1;
    %syslput rem_run_dt=&run_dt / remote=task1;
    rsubmit task1 wait=no;
       %put &=rem_run_dt;
    endrsubmit;
     
    waitfor task1;
    signoff task1;

    Passing macro variables from server to client session

    Similarly,

  • %SYSRPUT_USER_ </LIKE=‘character-string’>;
  • (/LIKE=<‘character-string’ >specifies a subset of macro variables whose names match a user-specified character sequence, or pattern.)

    Here is a code example that passes two macro variables, rem_start and rem_year from the remote session and outputs them to the SAS log in the client session:

    options sascmd="sas";
    signon task1;
    rsubmit task1 wait=no;
       %let start_dt = %sysfunc(datetime());
       %sysrput rem_start=&start_dt;
       %sysrput rem_year=2021;
    endrsubmit;
     
    waitfor task1;
    signoff task1;
    %put &=rem_start &=rem_year;

    Summary

    SAS’ Multi-Process Connect is a simple and efficient tool enabling parallel execution of independent programming units. Compared to sequential processing of time-intensive programs, it allows to substantially reduce overall duration of your program execution.

    Additional resources

    Running SAS programs in parallel using SAS/CONNECT® was published on SAS Users.

    12月 172020
     

    There’s nothing worse than being in the middle of a task and getting stuck. Being able to find quick tips and tricks to help you solve the task at hand, or simply entertain your curiosity, is key to maintaining your efficiency and building everyday skills. But how do you get quick information that’s ALSO engaging? By adding some personality to traditionally routine tutorials, you can learn and may even have fun at the same time. Cue the SAS Users YouTube channel.

    With more than 50 videos that show personality published to-date and over 10,000 hours watched, there’s no shortage of learning going on. Our team of experts love to share their knowledge and passion (with personal flavor!) to give you solutions to those everyday tasks.

    What better way to round out the year than provide a roundup of our most popular videos from 2020? Check out these crowd favorites:

    Most viewed

    1. How to convert character to numeric in SAS
    2. How to import data from Excel to SAS
    3. How to export SAS data to Excel

    Most hours watched

    1. How to import data from Excel to SAS
    2. How to convert character to numeric in SAS
    3. Simple Linear Regression in SAS
    4. How to export SAS data to Excel
    5. How to Create Macro Variables and Use Macro Functions
    6. The SAS Exam Experience | See a Performance-Based Question in Action
    7. How it Import CSV files into SAS
    8. SAS Certification Exam: 4 tips for success
    9. SAS Date Functions FAQs
    10. Merging Data Sets in SAS Using SQL

    Latest hits

    1. Combining Data in SAS: DATA Step vs SQL
    2. How to Concatenate Values in SAS
    3. How to Market to Customers Based on Online Behavior
    4. How to Plan an Optimal Tour of London Using Network Optimization
    5. Multiple Linear Regression in SAS
    6. How to Build Customized Object Detection Models

    Looking forward to 2021

    We’ve got you covered! SAS will continue to publish videos throughout 2021. Subscribe now to the SAS Users YouTube channel, so you can be notified when we’re publishing new videos. Be on the lookout for some of the following topics:

    • Transforming variables in SAS
    • Tips for working with SAS Technical Support
    • How to use Git with SAS

    2020 roundup: SAS Users YouTube channel how to tutorials was published on SAS Users.

    11月 102020
     

    As a SAS consultant I have been an avid user of SAS Enterprise Guide for as long as I can remember. It has been not just my go-to tool, but that of many of the SAS customers I have worked with over the years.

    It is easy to use, the interface intuitive, a Swiss Army knife when it comes to data analysis. Whether you’re looking to access SAS data or import good old Excel locally, join data together or perform data analysis, a few clicks and ta-dah, you’re there! Alternatively, if you insist on coding or like me, use a bit of both, the ta-dah point still holds.

    SAS Enterprise Guide, or EG as it is commonly known as, is a mature SAS product with many years of R&D, an established user base, a reliable and trusted product. So why move to SAS Studio? Why should I leave the comfort of what works?

    For the last nine months I have been working with one of the UK’s largest supermarket answering that exact question as they make that journey from SAS Enterprise Guide to SAS Studio. EG is used widely across several supermarket operations, including:

    • supply chain (to look at wastage and stock availability)
    • marketing analytics (to look at customer behaviour and build successful campaigns)
    • fraud detection (to detect misuse of vouchers).

    What is SAS Studio?

    Firstly, let's answer the "what is SAS Studio" question. It is the browser-based interface for SAS programmers to run code or use predefined tasks to automatically generate SAS code. Since there is nothing to install on your desktop, you can access it from almost any machine: Windows or Mac. And SAS Studio is brought to you by the same SAS R&D developers who maintain SAS Enterprise Guide.

    SAS Studio with Ignite (dark) theme

    1. Still does the regular stuff

    It allows you to access your data, libraries and existing programs and import a range of data sources including Excel and CSV. You can code or use the tasks to perform analysis. You can build queries to join data, create simple and complex expressions, filter and sort data.

    But it does much more than that... So what cool things can you do with SAS Studio?

    2. Use the processing power of SAS Viya

    SAS Studio (v5.2 onwards) works on SAS Viya. Previously SAS 9 had the compute server aka the workspace server as the processing engine. SAS Viya has CAS, the next generation SAS run time environment which makes use of both memory and disk. It is distributed, fault tolerant, elastic and can work on problems larger than the available RAM. It is all centrally managed, secure, auditable and governed.

    3. Cool new functionality

    SAS Studio comes with many enhancements and cool new functionality:

    • Custom tasks. You can easily build your own custom tasks (software developer skills not required) so others without SAS coding skills can utilise them. Learn more in this Ask the Expert session.
    • Code snippets. It comes with pre-defined code snippets, commonly used bits of code that you can save and reuse. Additionally, you can create your own which you can share with colleagues. Coders love that these code snippets can be used with keystroke abbreviations.
    • Background submit.  This allows you to run code in the background whilst you continue to work.
    • DATA step debugger. First added into SAS Enterprise Guide, SAS Studio now offers an interactive DATA step debugger as well.
    • Flexible layout for your workspace, You can have multiple tabs open for each program, and open multiple datasets and items.
    • FEDSQL. The query window

      DATA step debugger in SAS Studio

    4. Seamlessly access the full suite of SAS Viya capabilities

    A key benefit of SAS Studio is the ease of which you can move from writing code to doing some data discovery, visualisation and model building. Previously in the SAS 9 world you may have used EG to access and join your data and then move to SAS Enterprise Miner, a different interface, installed separately to build a model. Those days are long gone.

    To illustrate the point, if I wanted to build a campaign to see who would respond to a supermarket voucher, I could access my customer data and join that to my transaction and products data in SAS Studio. I could then move into SAS Visual Analytics to identify the key variables I would need to build an analytical model and even the best model to build. From there I would move to SAS Visual Data Mining and Machine Learning to build the model. I could very easily use the intuitive point-and-click pipeline interface to build several models, incorporating an R or Python to find the best model. This would all be done within one browser-based interface and the data being loaded only once.

    This tutorial from Christa Cody illustrates this coding workflow in action.

    The Road to SAS Studio

    SAS Studio clearly has a huge number of benefits, it does the regular stuff you would expect, but additionally brings a host of cool new functionality and the processing power of SAS Viya, not to mention allowing you to move seamlessly to the next steps of the analytical and decisioning journey including model building, creating visualisations, etc.

    Change management + technical enablement = success

    Though adoption of modern technology can bring significant benefits to enterprise organisations as this supermarket is seeing, it is not without its challenges. Change is never easy and the transition from EG to Studio will take time. Especially with a mature, well liked and versatile product like EG.

    The cultural challenge that new technology provides should not be underestimated and can provide a barrier to adoption. Newer technology requires new approaches, a different way of working across diverse user communities many of whom have well established working practices that may in some cases, resist change. The key is to invest the time with the communities, explain how newer technology can support their activities more efficiently and provide them with broader capability.

    Learn more

    Visit the Learn and Support center for SAS Studio.

    Moving from SAS Enterprise Guide to SAS Studio was published on SAS Users.

    11月 042020
     

    Removing duplicate charactersIn this blog post we are going to tackle a data cleansing task of removing unwanted repeated characters in SAS character variables.

    Character repetition can stem from various stages of data life cycle: from data collection, to data transmission, to data transformation. It can be accidental or intentional by design. It can be sporadic or consistent. In either case, it needs to be addressed by robust data cleansing processes to ensure adequate data quality that is imperative for the data usability.

    Character repetition examples

    Example 1. Data entry, especially manual data entry, can be a high-risk factor for accidental character duplication. Have you ever pressed a key on your computer keyboard for a bit longer than intended, so it started automatically typing multiple characters???????????????

    Keyboard properties adjustment

     Tip: You can adjust your Keyboard Properties to control “Repeat delay” and “Repeat rate” settings (on Windows computer, search for “Keyboard” and click on Keyboard in Control Panel).

    Example 2. Recently, I had to deal with the data that contained multiple consecutive double quotation marks all over the character string values. Even though we don’t know the exact cause of it, still for each of these duplicated quotation marks occurrences we needed to replace them with a single quotation mark.

    Removing repeated blanks

    There is a very useful Removing unwanted characters from text strings by Amadeus Software we developed a prototype using

    data D;
       c = ','; *<- character to un-duplicate;
       cc = c||c; *<- double character;
       string = 'Many,,,,,, commas,,,,, in,,, this,, sentence.,'; *<- source string;
       put 'BEFORE:' string=; *<- output initial string;
       do while (find(string,cc)); *<- loop through while there are doubles;
          string = tranwrd(string,cc,c); *<- replace double with a single character;
       end;
       put 'AFTER: ' string=; *<- output unduplicated string;
    run;

    This code will produce the following in the SAS log:

    BEFORE:string=Many,,,,,, commas,,,,, in,,, this,, sentence.,
    AFTER: string=Many, commas, in, this, sentence.,

    which shows that this approach correctly un-duplicates the source string removing and replacing all repeated characters (commas in our example) with a single one.

    User-defined SAS function for removing any repeated characters

    Let’s use

    libname funclib 'c:\projects\functions';
    proc fcmp outlib=funclib.userfuncs.package1;
       function undupc(str $, clist $) $;
          length x $32767 c $1 cc $2;
          x = str; 
          do i=1 to length(clist);
             c = char(clist,i);
             cc = c||c;
             do while (find(trim(x),cc));
                x = tranwrd(x,cc,c);
             end;
          end;
          return (x); 
       endfunc; 
    run;

    Code highlights

    • We introduce an interim variable x to which we will iteratively apply replacing double characters with a single one.
    • We assign length attribute of this variable to be maximum allowable character length of 32767 bytes to accommodate any character length used in the calling program.
    • Outer do-loop loops through the clist containing characters we want to unduplicate.
    • Variable c is assigned a single character from clist, variable cc is assigned double of the cc value.
    • Inner do-loop iterates through trimmed characters in x while doubles are found; using trim(x) is essential as it not only speeds up processing while searching through a shorter string (without trailing blanks), it also prevents from falling into an infinite loop in case clist contains blank character to unduplicate (cc equals to double blanks which are always going to be found among trailing blanks).

    Let’s test our newly minted UNDUPC function on the following data:

    data SOURCE;
       infile datalines truncover;
       input str $50.;
       datalines;
    """Repeated "double quotes""""
    Repeated,,,,,commas,,,,,,,,,,,
    [[[""Mixed""]]   characters,,,
    ;

    Since our user-defined function is permanently stored in the location specified in the

    options cmplib=funclib.userfuncs;
    data TARGET;
       set SOURCE;
       length new_str $50;
       new_str = undupc(str, ' ,"][');
    run;

    This code will remove and replace all repeated sequences of characters ' ',',', '"', ']', and '['. The order of these characters listed in the second argument doesn’t matter. Here is what we get:

    Duplicate characters removal results
    As you can see, we get what we wanted including the functionality of the COMPBL function.

    User-defined CALL routine for removing any repeated characters

    As much as I love user-defined functions, I have an issue with the above undupc user-defined function implementation. It has to do with how the PROC FCMP handles interim character variables length attribute assignment. It does not implicitly inherit their length attribute from another variable as SAS data step does. For example, if you run the following data step:

    data a;
       length x $99;
       y = x;
    run;

    variable y will have the length attribute $99 implicitly inherited from the x variable.

    In PROC CFMP function, you can either assign the length attribute to a character variable explicitly with LENGTH or ATTRIB statement (as we did by using length x $32767 ), or it will be set to $33 if you use any other way of implicit assignment. (I leave it up to you guessing why 33 and why not any other number.) Since we wanted to accommodate SAS character strings of any length, we had to explicitly assign our interim variable x length attribute the maximum valid value of $32767. This will inevitably take tall on the function performance as we will have to process longer strings.

    However, we can avoid this issue by using CALL routine instead:

    libname funclib 'c:\projects\functions';
    proc fcmp outlib=funclib.usercalls.package1;
       subroutine undupc(str $, clist $, x $);
          outargs x;
          length c $1 cc $2;
          x = str;
          do i=1 to length(clist);
             c = char(clist,i);
             cc = c||c;
             do while (find(trim(x),cc));
                x = tranwrd(x,cc,c);
             end;
          end;
       endsub; 
    run;

    This code is very similar to the user-defined function above with a slight difference. Here, x variable is listed as an argument in the subroutine definition and refers to a SAS data step variable whose length attribute is assigned in the calling data step. Unlike SAS function, SAS subroutine does not return a value; instead, it uses

    options cmplib=funclib.usercalls;
    data TARGET;
       set SOURCE;
       length new_str $50;
       call undupc(str, ' ,"][', new_str);
    run;

    And we will get the same results as with the UNDUPC function above.

    Store user-defined functions and subroutines separately

    You can create and have both, user-defined function and call routine with the same name. However, to avoid confusion (and errors) do not store their definitions in the same data table (outlib= option of the PROC FCMP). If they are stored in the same data table, then when used in a DATA step, SAS will pull the latest definition by its name only and that may not be the entity you want.

    Performance benchmarking

    To compare performances of the UNDUPC function vs UNDUPC subroutine we created a rather large data table (1 Million observations) with randomly generated strings (1000 characters long):

    libname SASDL 'C:\PROJECTS\TESTDATA';
     
    data SASDL.TESTDATA (keep=str);
       length str $1000;
       do i=1 to 1000000;
          str = '';
          do j=1 to 1000;
             str = cats(str,byte(int(rank(' ')+38*rand('UNIFORM'))));
          end;
          output;
       end;
    run;

    Then we ran the following 2 data steps, one using the undupc() function, and the other using undupc() call routine:

    options cmplib=funclib.userfuncs;
     
    data SASDL.TESTDATA_UNDUPC_FUNC;
       set SASDL.TESTDATA;
       length new_str $1000;
       new_str = undupc(str, '#+');
    run;
     
    options cmplib=subrlib.usercalls;
     
    data SASDL.TESTDATA_UNDUPC_CALL;
       set SASDL.TESTDATA;
       length new_str $1000;
       call undupc(str, '#+', new_str);
    run;

    A quick SAS log inspection reveals that CALL UNDUPC works as much as 3 times faster than UNDUPC function (10 seconds vs. 30 seconds). The time savings may vary depending on your data composition and computing environment, but in any case, if you process high volumes of data you may consider using CALL routine over function. This is not a blanket statement, as it only pertains to this particular algorithm of eliminating character repetitions where we had to explicitly assign the highest possible length attribute value to the interim variable in the function, but not in the CALL routine.

    When we reduced declared length of x from $32767 to $1000 within the user-defined function definition its performance became on par with the CALL routine.

    Additional Resources for SAS character strings processing

    Your thoughts?

    Have you found this blog post useful? Would you vote for implementing UNDUPC as a native built-in SAS function? Please share your thoughts and feedback in the comments section below.

    Removing repeated characters in SAS strings was published on SAS Users.

    10月 152020
     

    SAS Microsoft partnershipYou might have heard about SAS - Microsoft partnership announced in June 2020 that officially joined the powers of SAS analytics with Microsoft’s cloud technology to further advance Artificial Intelligence (AI).

    This partnership did not just happen out of nowhere. SAS has a long and deep history of integrating with Microsoft technologies. Examples include:

    In this post we will look at a lesser known, but quite useful feature in SAS that allows SAS users to bring many Microsoft Excel functions right to their SAS programs. I hope that many SAS users (not just MS Excel aficionados) will love to discover this functionality within SAS.

    Excel functions as SAS user-defined functions

    SAS has a wide variety of built-in functions, however there are still many Microsoft Excel functions that are not intrinsically implemented in SAS. Luckily, many of them are made available in SAS via

    proc fcmp inlib=SASHELP.SLKWXL listall;
    run;

    You can also capture the list of available Excel functions in a SAS data table using ODS OUTPUT with CODELIST= option:

    ods noresults;
    ods output codelist=WORK.EXCEL_FUNCTIONS_LIST (keep=COL1 COL2);
    proc fcmp inlib=SASHELP.SLKWXL listall;
    run;
    ods output close;
    ods results;

    From this data table you can produce a nice looking HTML report listing all these functions:

    data WORK.EXCEL_SAS_FUNCTIONS (keep=exc sas arg);
       label exc='Excel Function' sas='SAS Function' arg='Arguments';
       set WORK.EXCEL_FUNCTIONS_LIST (rename=(col2=arg));
       sas = tranwrd(col1,'Function ','');
       exc = tranwrd(sas,'_slk','');
    run;
     
    ods html path='c:\temp' file='excel_sas_functions.html';
    title 'List of Excel functions available in SAS (via SASHELP.SLKWXL)';
    proc print data=EXCEL_SAS_FUNCTIONS label;
    run;
    ods html close;

    When you run this code, you should get the following list of Excel functions along with their SAS equivalents:

    List of Excel functions available in SAS (via SASHELP.SLKWXL)
    Obs Excel Function SAS Function Arguments
    1 even even_slk ( x )
    2 odd odd_slk ( x )
    3 factdouble factdouble_slk ( x )
    4 product product_slk ( nums )
    5 multinomial multinomial_slk ( nums )
    6 floor floor_slk ( n, sg )
    7 datdif4 datdif4_slk ( start, end )
    8 amorlinc amorlinc_slk ( cost, datep, fperiod, salvage, period, rate, basis )
    9 amordegrc amordegrc_slk ( cost, datep, fperiod, salvage, period, rate, basis )
    10 disc disc_slk ( settlement, maturity, pr, redemp, basis )
    11 tbilleq tbilleq_slk ( settlement, maturity, discount )
    12 tbillprice tbillprice_slk ( settlement, maturity, discount )
    13 tbillyield tbillyield_slk ( settlement, maturity, par )
    14 dollarde dollarde_slk ( fdollar, frac )
    15 dollarfr dollarfr_slk ( ddollar, frac )
    16 effect effect_slk ( nominal_rate, npery )
    17 coupnum coupnum_slk ( settlement, maturity, freq, basis )
    18 coupncd coupncd_slk ( settlement, maturity, freq, basis )
    19 coupdaysnc coupdaysnc_slk ( settlement, maturity, freq, basis )
    20 couppcd couppcd_slk ( settlement, maturity, freq, basis )
    21 coupdays coupdays_slk ( settlement, maturity, freq, basis )
    22 db db_slk ( cost, salvage, life, period, month )
    23 yield yield_slk ( settlement, maturity, rate, pr, redemp, freq, basis )
    24 yielddisc yielddisc_slk ( settlement, maturity, pr, redemp, basis )
    25 coupdaybs coupdaybs_slk ( settlement, maturity, freq, basis )
    26 oddfprice oddfprice_slk ( settlement, maturity, issue, fcoupon, rate, yield, redemp, freq, basis )
    27 oddfyield oddfyield_slk ( settlement, maturity, issue, fcoupon, rate, pr, redemp, freq, basis )
    28 oddlyield oddlyield_slk ( settlement, maturity, linterest, rate, pr, redemp, freq, basis )
    29 oddlprice oddlprice_slk ( settlement, maturity, linterest, rate, yield, redemp, freq, basis )
    30 price price_slk ( settlement, maturity, rate, yield, redemp, freq, basis )
    31 pricedisc pricedisc_slk ( settlement, maturity, discount, redemp, basis )
    32 pricemat pricemat_slk ( settlement, maturity, issue, rate, yld, basis )
    33 yieldmat yieldmat_slk ( settlement, maturity, issue, rate, pr, basis )
    34 received received_slk ( settlement, maturity, investment, discount, basis )
    35 accrint accrint_slk ( issue, finterest, settlement, rate, par, freq, basis )
    36 accrintm accrintm_slk ( issue, maturity, rate, par, basis )
    37 duration duration_slk ( settlement, maturity, coupon, yld, freq, basis )
    38 mduration mduration_slk ( settlement, maturity, coupon, yld, freq, basis )
    39 avedev avedev_slk ( data )
    40 devsq devsq_slk ( data )
    41 varp varp_slk ( data )

     
    NOTE: Excel functions that are made available in SAS are named from their Excel parent functions, suffixing them with _SLK to distinguish them from their Excel incarnations, as well as from native SAS functions.

    Examples of Microsoft Excel functions usage in SAS

    In order to use any of these Excel functions in your SAS code, all you need to do is to specify the functions definition data table in the CMPLIB= option:

    options cmplib=SASHELP.SLKWXL;

    Let’s consider several examples.

    ODD function

    This function returns number rounded up to the nearest odd integer:

    options cmplib=SASHELP.SLKWXL;
    data _null_;
       x = 6.4;
       y = odd_slk(x);
       put 'odd( ' x ') = ' y;
    run;

    SAS log:
    odd( 6.4 ) = 7

    EVEN function

    This function returns number rounded up to the nearest even integer:

    options cmplib=SASHELP.SLKWXL;
    data _null_;
       x = 6.4;
       y = even_slk(x);
       put 'even( ' x ') = ' y;
    run;

    SAS log:
    odd( 6.4 ) = 8

    FACTDOUBLE function

    This function returns the double factorial of a number. If number is not an integer, it is truncated.
    Double factorial (or semifactorial) of a number n, denoted by n!!, is the product of all the integers from 1 up to n that have the same parity as n.
    For even n, the double factorial is n!!=n(n-2)(n-4)…(4)(2), and for odd n, the double factorial is n!! = n(n-2)(n-4)…(3)(1).

    Here is a SAS code example using the factdouble() Excel function:

    options cmplib=SASHELP.SLKWXL;
    data _null_;
       n = 6;
       m = 7;
       nn = factdouble_slk(n);
       mm = factdouble_slk(m);
       put n '!! = ' nn / m '!! = ' mm;
    run;

    It will produce the following SAS log:
    6 !! = 48
    7 !! = 105

    Indeed, 6!! = 2 x 4 x 6 = 48 and 7!! = 1 x 3 x 5 x 7 = 105.

    PRODUCT function

    This function multiplies all elements of SAS numeric array given as its argument and returns the product:

    options cmplib=SASHELP.SLKWXL;
    data _null_;
       array x x1-x5 (5, 7, 1, 2, 2);
       p = product_slk(x);
       put 'x = ( ' x1-x5 ')';
       put 'product(x) = ' p;
    run;

    SAS log:
    x = ( 5 7 1 2 2 )
    product(x) = 140

    Indeed 5*7*1*2*2 = 140.

    MULTINOMIAL function

    This function returns the ratio of the factorial of a sum of values to the product of factorials:

    MULTINOMIAL(a1, a2, ... , an) = (a1 + a2 + ... + an)! : (a1! a2! ... an!)

    In SAS, the argument to this function is specified as numeric array name:

    options cmplib=SASHELP.SLKWXL;
    data _null_;
       array a a1-a3 (1, 3, 2);
       m = multinomial_slk(a);
       put 'a = ( ' a1-a3 ')';
       put 'multinomial(a) = ' m;
    run;

    SAS log:
    a = ( 1 3 2 )
    multinomial(a) = 60

    Indeed (1+3+2)!  :  (1! + 3! + 2!) = 720 : 12 = 60.

    Other Microsoft Excel functions available in SAS

    You can explore other Excel functions available in SAS via SASHELP.SLKWXL user-defined functions by cross-referencing them with the corresponding Microsoft Excel functions documentation (alphabetical or by categories) As you can see in the above List of Excel functions available in SAS, besides mathematical functions exemplified in the previous section, there are also many Excel financial functions related to securities trading that are made available in SAS.

    Additional Resources on SAS user-defined functions

    Your thoughts?

    Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments below.

    Using Microsoft Excel functions in SAS was published on SAS Users.

    10月 032020
     

    If you're a SAS programmer who now uses SAS Viya and CAS, it's worth your time to optimize your existing programs to take advantage of the new environment. This post is a continuation of my SAS Global Forum 2020 paper Best Practices for Converting SAS® Code to Leverage SAS® Cloud Analytic Services and my SGF 2020 Super Demo.

    The best approach for refactoring SAS code for SAS Viya has a few steps:

    • First, "lift and shift" your existing code to run successfully in the compute server for SAS Viya.
    • Next, create CASLIB statements to all of your data sources: i.e. sas7bdat, CSV files, parquet files, RDBMs, cloud data sources, etc.
    • Finally, identify the longest running steps so you know where you have the biggest opportunities. For example, look at steps where the "real time" is 30 minutes or longer, as well as steps that are CPU bound. CPU-bound steps are steps where the CPU time is equal to or greater than the real time for that step.

    To help us identify those steps we can leverage a new utility to analyze SAS logs and create reports to help us understand the Real Time and CPU Time for each step. Read on the learn more about this final step in the code refactoring process.

    Utility: SAS Log Parser

    To generate these reports, I created a SAS program that will read all SAS log files in a directory and create one report per SAS log as well as a descending Real Time (Clock Time) and CPU Time reports. Figure 1 is an example of the report that is generated for each SAS log. In this report we see each step’s procedure or DATA Step’s Real Time and CPU Time. It's derived by picking up on SAS log entries like this:

    NOTE: PROCEDURE SGPLOT used (Total process time):
          real time           2.79 seconds
          cpu time            0.08 seconds
    
    NOTE: The SAS System used:
          real time           1:08.86
          cpu time            1:18.18
    

    Note, the Total Time and Total CPU Time are fields that are populated when the SAS log note “NOTE: The SAS System used:” is encountered. SAS programs that are ran in batch or using an RSUBMIT process via

    Figure 1. sampleETS.log report

    Descending Real Time Report

    Figure 2 contains an example of the descending real time report. In this report we observe in the Step column that the longest running step is a PROC LOGISTIC that takes over 14 hours (Real Time column) from the SAS log called Sample3.log (File Name column). The best way to use this report is to focus on steps that take longer than 30 minutes. In our case we have 9 steps from 3 SAS logs. Now that we know that we can review the details of each step and then benchmark if that step would run faster by leveraging SAS® Cloud Analytic Services (CAS). Note, for CAS to process data all source and target tables, all data must be in CAS tables and the step must be coded using CAS-enabled steps.

    Figure 2. Descending Real Time Report

    Descending CPU Time Report

    Figure 3 contains an example of the descending CPU times report where we observe that the most CPU intensive step takes over 13 hours (CPU Time column) and is from the Samples3.log (File Name column). Note, if you review the Real Time and CPU Time columns you should notice that only observation 11 (Obs column) has a CPU Time that exceeds the Real Time making it CPU bound.  However, we would not focus on this step since the Real Time is less than 30 minutes.

    Figure 3. Descending CPU Time Report

    Source code for the SAS Log Parser

    I've bundled my SAS code for these steps in my GitHub repository for SAS Global Forum 2020. You'll find these programs along with the other code that supports my previous topics of adapting SAS 9 code for SAS Viya.

    sasLogParserMacros.sas contains three macros that drive the process. The macro program %LIST_FILES lists all files with a given extension, %CHECK checks for the existence of a file and deletes it if found, and %SASLOG parses a SAS log and provides the values found in the reports. When you save this file ensure you name it sasLogParserMacros.sas and in the same directory that you save sasLogParser.sas.

    sasLogParser.sas is the program we submit to produce the reports. This code includes the code sasLogParserMacros.sas and then generate the repots. The only two statements we need to modify are the first two %LET statements in this program. The first %LET statement points to the location of the two SAS programs sasLogParserMacros.sas and sasLogParser.sas. The second %LET statement points to the directory containing all the SAS logs we want to parse. Note, outside of the two %let statements in sasLogParser.sas do not change any other statement in either program.

    Conclusion

    In order to understand which steps are good candidates for leveraging the in-memory engine CAS, we must first understand the real time and CPU time of each step. Then we can benchmark which engine in SAS Viya is appropriate for that step -- the compute server or CAS.  The code that I've shared for benchmarking can run within SAS 9 or SAS Viya, on both Windows and Linux.

    9月 172020
     

    Unquote by removing matching quotesBefore we delve into unquoting SAS character variables let’s briefly review existing SAS functionality related to the character strings quoting/unquoting.

    %QUOTE and %UNQUOTE macro functions

    Don’t be fooled by these macro functions’ names. They have nothing to do with quoting or un-quoting character variables’ values. Moreover, they have nothing to do with quoting or un-quoting even macro variables’ values. According to the %QUOTE Macro Function documentation it masks special characters and mnemonic operators in a resolved value at macro execution.  %UNQUOTE Macro Function unmasks all special characters and mnemonic operators so they are interpreted as macro language elements instead of as text. There are many other SAS “macro quoting functions” (%SUPERQ, %BQUOTE, %NRBQUOTE, all macro functions whose name starts with %Q: %QSCAN, %QSUBSTR, %QSYSFUNC, etc.) that perform some action including masking.

    Historically, however, SAS Macro Language uses terms “quote” and “unquote” to denote “mask” and “unmask”. Keep that in mind when reading SAS Macro documentation.

    QUOTE function

    Most SAS programmers are familiar with the QUOTE function that adds quotation marks around a character value. It can add double quotation marks (by default) or single quotation marks if you specify that in its second argument.

    This function goes even further as it doubles any quotation mark that already existed within the value to make sure that an embedded quotation mark is escaped (not treated as an opening or closing quotation mark) during parsing.

    DEQUOTE function

    There is also a complementary DEQUOTE function that removes matching quotation marks from a character string that begins with a quotation mark. But be warned that it also deletes all characters to the right of the first matching quotation mark. In my view, deleting those characters is overkill because when writing a SAS program, we may not know what is going to be in the data and whether it’s okay to delete its part outside the first matching quotes. That is why you need to be extra careful if you decide to use this function. Here is an example of what I mean. If you run the following code:

    data a;
       input x $ 1-50;
       datalines;
    'This is what you get. Let's be careful.'
    ;
     
    data _null_;
       set a;
       y = dequote(x);
       put x= / y=;
    run;

    you will get the following in the SAS log:

    y=This is what you get. Let

    This is hardly what you really wanted as you have just lost valuable information – part of the y character value got deleted: 's be careful. I would rather not remove the quotation marks at all than remove them at the expense of losing meaningful information.

    $QUOTE informat

    The $QUOTE informat does exactly what the DEQUOTE() function does, that is removes matching quotation marks from a character string that begins with a quotation mark. You can use it in the example above by replacing

    y = dequote(x);

    with the INPUT() function

    y = input(x, $quote50.);

    Or you can use it directly in the INPUT statement when reading raw data from datalines or an external file:

    input x $quote50.;

    Both, $QUOTE informat and DEQUOTE() function, in addition to removing all characters to the right of the closing quotation mark do the following unconventional, peculiar things:

    • Remove a lone quotation mark (either double or single) when it’s the only character in the string; apparently, the lone quotation mark is matched to itself.
    • Match single quotation mark with double quotation mark as if they are the same.
    • Remove matching quotation marks from a character string that begins with a quotation mark; if your string has one or more leading blanks (that is, a quotation mark is not the first character), nothing gets removed (un-quoted).

    If the described behavior matches your use case, you are welcome to use either $QUOTE informat or DEQUOTE() function. Otherwise, please read on.

    UNQUOTE function definition

    Up to this point such a function did not exist, but we are about to create one to justify the title. Let’s keep it simple and straightforward. Here is what I propose our new unquote() function to do:

    • If first and last non-blank characters of a character string value are matching quotation marks, we will remove them. We will not consider quotation marks matching if one of them is a single quotation mark and another is a double quotation mark.
    • We will remove those matching quotation marks whether they are both single quotation marks OR both double quotation marks.
    • We are not going to remove or change any other quotation marks that may be present within those matching quotation marks that we remove.
    • We will remove leading and trailing blanks outside the matching quotation marks that we delete.
    • However, we will not remove any leading or trailing blanks within the matching quotation marks that we delete. You may additionally apply the STRIP() function if you need to do that.

    To summarize these specifications, our new UNQUOTE() function will extract a character substring within matching quotation marks if they are the first and the last non-blank characters in a character string. Otherwise, it returns the character argument unchanged.

    UNQUOTE function implementation

    Here is how such a function can be implemented using PROC FCMP:

    libname funclib 'c:\projects\functions';
     
    proc fcmp outlib=funclib.userfuncs.v1; /* outlib=libname.dataset.package */
       function unquote(x $) $32767;
          pos1 = notspace(x); *<- first non-blank character position;
          if pos1=0 then return (x); *<- empty string;
     
          char1 = char(x, pos1); *<- first non-blank character;
          if char1 not in ('"', "'") then return (x); *<- first non-blank character is not " or ' ;
     
          posL = notspace(x, -length(x)); *<- last non-blank character position;
     
          if pos1=posL then return (x); *<- single character string;
     
          charL = char(x, posL); *<- last non-blank character;
          if charL^=char1 then return (x); *<- last non-blank character does not macth first;
     
          /* at this point we should have matching quotation marks */
          return (substrn(x, pos1 + 1, posL - pos1 - 1)); *<- remove first and last quotation character;
       endfunc; 
    run;

    Here are the highlights of this implementation:

    We use multiple RETURN statements: we sequentially check for different special conditions and if one of them is met we return the argument value intact. The RETURN statement does not just return the value, but also stops any further function execution.

    At the very end, after making sure that none of the special conditions is met, we strip the argument value from the matching quotation marks along with the leading and trailing blanks outside of them.

    NOTE: SAS user-defined functions are stored in a SAS data set specified in the outlib= option of the PROC FCMP. It requires a 3-level name (libref.datsetname.packagename) for the function definition location to allow for several versions of the same-name function to be stored there.

    However, when a user-defined function is used in a SAS DATA Step, only a 2-level name can be specified (libref.datasetname). If that data set has several same-name functions stored in different packages the DATA Step uses the latest function definition (found in a package closest to the bottom of the data set).

    UNQUOTE function results

    Let’s use the following code to test our newly minted user-defined function UNQUOE():

    libname funclib 'c:\projects\functions';
    options cmplib=funclib.userfuncs;
     
    data A;
       infile datalines truncover;
       input @1 S $char100.;
       datalines;
    '
    "
    How about this?
        How about this?
    "How about this?"
    'How about this?'
    "How about this?'
    'How about this?"
    "   How about this?"
    '      How about this?'
    '      How "about" this?'
    '      How 'about' this?'
       "     How about this?"
       "     How "about" this?"
       "     How 'about' this?"
       '     How about this?'
    ;
     
    data B;
       set A;
       length NEW_S $100;
       label NEW_S = 'unquote(S)';
       NEW_S = unquote(S);
    run;

    This code produces the following output table:

    Example of character string unquoting
    As you can see it does exactly what we wanted it to do – removing matching first and last quotation marks as well as stripping out blanks outside the matching quotation marks.

    DSD (Delimiter-Sensitive Data) option

    This INFILE statement’s option is particularly and extremely useful when using LIST input to read and un-quote comma-delimited raw data. In addition to removing enclosing quotation marks from character values, the DSD option specifies that when data values are enclosed in quotation marks, delimiters within the value are masked, that is treated as character data (not as delimiters). It also sets the default delimiter to a comma and treats two consecutive delimiters as a missing value.

    In contrast with the above UNQUOTE() function, the DSD option will not remove enclosing quotation marks if there are same additional quotation marks present inside the character value.  When DSD option does strip enclosing quotation marks it also strips leading and trailing blanks outside and within the removed quotation marks.

    Additional Resources

    Your thoughts?

    Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments below.

    How to unquote SAS character variable values was published on SAS Users.