Leonid Batkhan

1月 132021
 

Running SAS programs in parallel reduces run time

As earth completes its routine annual circle around the sun and a new (and hopefully better) year kicks in, it is a perfect occasion to reflect on the idiosyncrasy of time.

While it is customary to think that 3+2=5, it is only true in sequential world. In parallel world, however, 3+2=3. Think about it: if you have two SAS programs one of which runs 3 hours, and the second one runs 2 hours, their total duration will be 5 hours if you run them one after another sequentially, but it will take only 3 hours if you run them simultaneously, in parallel.

I am sure you remember those “filling up a swimming pool” math problems from elementary school. They clearly and convincingly demonstrate that two pipes will fill up a swimming pool faster than one. That’s the power of running water in parallel.

The same principle of parallel processing (or parallel computing) is applicable to SAS programs (or non-SAS programs) by running their different independent pieces in separate SAS sessions at the same time (in parallel).  Divide and conquer.

You might be surprised at how easily this can be done, and at the same time how powerful it is. Let’s take a look.

SAS/CONNECT

MP CONNECT) parallel processing functionality was added to SAS/CONNECT enabling you to execute multiple SAS sessions asynchronously. When a remote SAS session kicks off asynchronously, a portion of your SAS program is sent to the server session for execution and control is immediately returned to the client session. The client session can continue with its own processing or spawn one or more additional asynchronous remote server sessions.

Running programs in parallel on a single machine

Sometimes, what comes across as new is just well forgotten old. They used to be Central Processing Units (CPU), but now they are called just processors. Nowadays, practically every single computer is a “multi-machine” (or to be precise “multi-processor”) device. Even your laptop. Just open Task Manager (Ctrl-Alt-Delete), click on the Performance tab and you will see how many physical processors (or cores) and logical processors your laptop has:

Parallel processing on a single machine

That means that this laptop can run eight independent SAS processes (sessions) at the same time. All you need to do is to say nicely “Dear Mr. & Mrs. SAS/CONNECT, my SAS program consists of several independent pieces. Would you please run each piece in its own SAS session, and run them all at the same time?” And believe me, SAS/CONNECT does not care how many logical processors you have, whether your logical processors are far away from each other “remote machines” or they are situated in a single laptop or even in a single chip.

Here is how you communicate your request to SAS/CONNECT in SAS language.

Spawning multiple SAS sessions using MP Connect

Suppose you have a SAS code that consists of several pieces – DATA or PROC steps that are independent of each other, i.e. they do not require to be run in a specific sequence. For example, each of the two pieces generates its own data set.

Then we can create these two data sets in two separate “remote” SAS sessions that run in parallel. Here is how you do this.  (For illustration purposes, I just create two dummy data sets.)

options sascmd="sas";
 
/* Current datetime */
%let _start_dt = %sysfunc(datetime());
 
/* Prosess 1 */
signon task1;
rsubmit task1 wait=no;
 
   libname SASDL 'C:\temp';
 
   data SASDL.DATA_A (keep=str);
      length str $1000;
      do i=1 to 1150000;
         str = '';
         do j=1 to 1000;
            str = cats(str,'A');
         end;
         output;
      end;
   run;
 
endrsubmit;
 
/* Process 2 */
signon task2;
rsubmit task2 wait=no;
 
   libname SASDL 'C:\temp';
 
   data SASDL.DATA_B (keep=str);
      length str $1000;
      do i=1 to 750000;
         str = '';
         do j=1 to 1000;
            str = cats(str,'B');
         end;
         output;
      end;
   run;
 
endrsubmit;
 
waitfor _all_;
signoff _all_;
 
/* Print total duration */
data _null_;
   dur = datetime() - &_start_dt;
   put 30*'-' / ' TOTAL DURATION:' dur time13.2 / 30*'-';
run;

In this code, the key elements are:

SIGNON Statement - initiates a connection between a client session and a server session.

ENDRSUBMIT statement - marks the end of a block of statements that a client session submits to a server session for execution.

SIGNOFF Statement - ends the connection between a client session and a server session.

Parallel processing vs. threaded processing

There is a distinction between parallel processing described above and threaded processing (aka multithreading). Parallel processing is achieved by running several independent SAS sessions, each processing its own unit of SAS code.

Threaded processing, on the other hand, is achieved by developing special algorithms and implementing executable codes that run on multiple processors (threads) within the same SAS session. Amdahl's law, which provides theoretical background and estimation of potential time saving achievable by parallel computing on multiple processors.

Passing information to and from “remote” SAS sessions

Besides passing pieces of SAS code from client sessions to server sessions, MP CONNECT allows you to pass some other SAS objects.

Passing data library definitions

For example, if you have a data library defined in your client session, you may pass that library definition on to multiple server sessions without re-defining them in each server session.

Let’s say you have two data libraries defined in your client session:

libname SRCLIB oracle user=myusr1 password=mypwd1 path=mysrv1;
libname TGTLIB '/sas/data/datastore1';

In order to make these data libraries available in the remote session all you need is to add

rsubmit task1 wait=no inheritlib=(SRCLIB TGTLIB);

This will allow libraries that are defined in the client session to be inherited by and available in the server session. As an option, each client libref can be associated with a libref that is named differently in the server session:

rsubmit task1 wait=no inheritlib=(SRCLIB=NEWSRC TGTLIB=NEWTGT);

Passing macro variables from client to server session

options sascmd="sas";
%let run_dt = %sysfunc(datetime());
signon task1;
%syslput rem_run_dt=&run_dt / remote=task1;
rsubmit task1 wait=no;
   %put &=rem_run_dt;
endrsubmit;
 
waitfor task1;
signoff task1;

Passing macro variables from server to client session

Similarly,

  • %SYSRPUT_USER_ </LIKE=‘character-string’>;
  • (/LIKE=<‘character-string’ >specifies a subset of macro variables whose names match a user-specified character sequence, or pattern.)

    Here is a code example that passes two macro variables, rem_start and rem_year from the remote session and outputs them to the SAS log in the client session:

    options sascmd="sas";
    signon task1;
    rsubmit task1 wait=no;
       %let start_dt = %sysfunc(datetime());
       %sysrput rem_start=&start_dt;
       %sysrput rem_year=2021;
    endrsubmit;
     
    waitfor task1;
    signoff task1;
    %put &=rem_start &=rem_year;

    Summary

    SAS’ Multi-Process Connect is a simple and efficient tool enabling parallel execution of independent programming units. Compared to sequential processing of time-intensive programs, it allows to substantially reduce overall duration of your program execution.

    Additional resources

    Running SAS programs in parallel using SAS/CONNECT® was published on SAS Users.

    11月 042020
     

    Removing duplicate charactersIn this blog post we are going to tackle a data cleansing task of removing unwanted repeated characters in SAS character variables.

    Character repetition can stem from various stages of data life cycle: from data collection, to data transmission, to data transformation. It can be accidental or intentional by design. It can be sporadic or consistent. In either case, it needs to be addressed by robust data cleansing processes to ensure adequate data quality that is imperative for the data usability.

    Character repetition examples

    Example 1. Data entry, especially manual data entry, can be a high-risk factor for accidental character duplication. Have you ever pressed a key on your computer keyboard for a bit longer than intended, so it started automatically typing multiple characters???????????????

    Keyboard properties adjustment

     Tip: You can adjust your Keyboard Properties to control “Repeat delay” and “Repeat rate” settings (on Windows computer, search for “Keyboard” and click on Keyboard in Control Panel).

    Example 2. Recently, I had to deal with the data that contained multiple consecutive double quotation marks all over the character string values. Even though we don’t know the exact cause of it, still for each of these duplicated quotation marks occurrences we needed to replace them with a single quotation mark.

    Removing repeated blanks

    There is a very useful Removing unwanted characters from text strings by Amadeus Software we developed a prototype using

    data D;
       c = ','; *<- character to un-duplicate;
       cc = c||c; *<- double character;
       string = 'Many,,,,,, commas,,,,, in,,, this,, sentence.,'; *<- source string;
       put 'BEFORE:' string=; *<- output initial string;
       do while (find(string,cc)); *<- loop through while there are doubles;
          string = tranwrd(string,cc,c); *<- replace double with a single character;
       end;
       put 'AFTER: ' string=; *<- output unduplicated string;
    run;

    This code will produce the following in the SAS log:

    BEFORE:string=Many,,,,,, commas,,,,, in,,, this,, sentence.,
    AFTER: string=Many, commas, in, this, sentence.,

    which shows that this approach correctly un-duplicates the source string removing and replacing all repeated characters (commas in our example) with a single one.

    User-defined SAS function for removing any repeated characters

    Let’s use

    libname funclib 'c:\projects\functions';
    proc fcmp outlib=funclib.userfuncs.package1;
       function undupc(str $, clist $) $;
          length x $32767 c $1 cc $2;
          x = str; 
          do i=1 to length(clist);
             c = char(clist,i);
             cc = c||c;
             do while (find(trim(x),cc));
                x = tranwrd(x,cc,c);
             end;
          end;
          return (x); 
       endfunc; 
    run;

    Code highlights

    • We introduce an interim variable x to which we will iteratively apply replacing double characters with a single one.
    • We assign length attribute of this variable to be maximum allowable character length of 32767 bytes to accommodate any character length used in the calling program.
    • Outer do-loop loops through the clist containing characters we want to unduplicate.
    • Variable c is assigned a single character from clist, variable cc is assigned double of the cc value.
    • Inner do-loop iterates through trimmed characters in x while doubles are found; using trim(x) is essential as it not only speeds up processing while searching through a shorter string (without trailing blanks), it also prevents from falling into an infinite loop in case clist contains blank character to unduplicate (cc equals to double blanks which are always going to be found among trailing blanks).

    Let’s test our newly minted UNDUPC function on the following data:

    data SOURCE;
       infile datalines truncover;
       input str $50.;
       datalines;
    """Repeated "double quotes""""
    Repeated,,,,,commas,,,,,,,,,,,
    [[[""Mixed""]]   characters,,,
    ;

    Since our user-defined function is permanently stored in the location specified in the

    options cmplib=funclib.userfuncs;
    data TARGET;
       set SOURCE;
       length new_str $50;
       new_str = undupc(str, ' ,"][');
    run;

    This code will remove and replace all repeated sequences of characters ' ',',', '"', ']', and '['. The order of these characters listed in the second argument doesn’t matter. Here is what we get:

    Duplicate characters removal results
    As you can see, we get what we wanted including the functionality of the COMPBL function.

    User-defined CALL routine for removing any repeated characters

    As much as I love user-defined functions, I have an issue with the above undupc user-defined function implementation. It has to do with how the PROC FCMP handles interim character variables length attribute assignment. It does not implicitly inherit their length attribute from another variable as SAS data step does. For example, if you run the following data step:

    data a;
       length x $99;
       y = x;
    run;

    variable y will have the length attribute $99 implicitly inherited from the x variable.

    In PROC CFMP function, you can either assign the length attribute to a character variable explicitly with LENGTH or ATTRIB statement (as we did by using length x $32767 ), or it will be set to $33 if you use any other way of implicit assignment. (I leave it up to you guessing why 33 and why not any other number.) Since we wanted to accommodate SAS character strings of any length, we had to explicitly assign our interim variable x length attribute the maximum valid value of $32767. This will inevitably take tall on the function performance as we will have to process longer strings.

    However, we can avoid this issue by using CALL routine instead:

    libname funclib 'c:\projects\functions';
    proc fcmp outlib=funclib.usercalls.package1;
       subroutine undupc(str $, clist $, x $);
          outargs x;
          length c $1 cc $2;
          x = str;
          do i=1 to length(clist);
             c = char(clist,i);
             cc = c||c;
             do while (find(trim(x),cc));
                x = tranwrd(x,cc,c);
             end;
          end;
       endsub; 
    run;

    This code is very similar to the user-defined function above with a slight difference. Here, x variable is listed as an argument in the subroutine definition and refers to a SAS data step variable whose length attribute is assigned in the calling data step. Unlike SAS function, SAS subroutine does not return a value; instead, it uses

    options cmplib=funclib.usercalls;
    data TARGET;
       set SOURCE;
       length new_str $50;
       call undupc(str, ' ,"][', new_str);
    run;

    And we will get the same results as with the UNDUPC function above.

    Store user-defined functions and subroutines separately

    You can create and have both, user-defined function and call routine with the same name. However, to avoid confusion (and errors) do not store their definitions in the same data table (outlib= option of the PROC FCMP). If they are stored in the same data table, then when used in a DATA step, SAS will pull the latest definition by its name only and that may not be the entity you want.

    Performance benchmarking

    To compare performances of the UNDUPC function vs UNDUPC subroutine we created a rather large data table (1 Million observations) with randomly generated strings (1000 characters long):

    libname SASDL 'C:\PROJECTS\TESTDATA';
     
    data SASDL.TESTDATA (keep=str);
       length str $1000;
       do i=1 to 1000000;
          str = '';
          do j=1 to 1000;
             str = cats(str,byte(int(rank(' ')+38*rand('UNIFORM'))));
          end;
          output;
       end;
    run;

    Then we ran the following 2 data steps, one using the undupc() function, and the other using undupc() call routine:

    options cmplib=funclib.userfuncs;
     
    data SASDL.TESTDATA_UNDUPC_FUNC;
       set SASDL.TESTDATA;
       length new_str $1000;
       new_str = undupc(str, '#+');
    run;
     
    options cmplib=subrlib.usercalls;
     
    data SASDL.TESTDATA_UNDUPC_CALL;
       set SASDL.TESTDATA;
       length new_str $1000;
       call undupc(str, '#+', new_str);
    run;

    A quick SAS log inspection reveals that CALL UNDUPC works as much as 3 times faster than UNDUPC function (10 seconds vs. 30 seconds). The time savings may vary depending on your data composition and computing environment, but in any case, if you process high volumes of data you may consider using CALL routine over function. This is not a blanket statement, as it only pertains to this particular algorithm of eliminating character repetitions where we had to explicitly assign the highest possible length attribute value to the interim variable in the function, but not in the CALL routine.

    When we reduced declared length of x from $32767 to $1000 within the user-defined function definition its performance became on par with the CALL routine.

    Additional Resources for SAS character strings processing

    Your thoughts?

    Have you found this blog post useful? Would you vote for implementing UNDUPC as a native built-in SAS function? Please share your thoughts and feedback in the comments section below.

    Removing repeated characters in SAS strings was published on SAS Users.

    10月 152020
     

    SAS Microsoft partnershipYou might have heard about SAS - Microsoft partnership announced in June 2020 that officially joined the powers of SAS analytics with Microsoft’s cloud technology to further advance Artificial Intelligence (AI).

    This partnership did not just happen out of nowhere. SAS has a long and deep history of integrating with Microsoft technologies. Examples include:

    In this post we will look at a lesser known, but quite useful feature in SAS that allows SAS users to bring many Microsoft Excel functions right to their SAS programs. I hope that many SAS users (not just MS Excel aficionados) will love to discover this functionality within SAS.

    Excel functions as SAS user-defined functions

    SAS has a wide variety of built-in functions, however there are still many Microsoft Excel functions that are not intrinsically implemented in SAS. Luckily, many of them are made available in SAS via

    proc fcmp inlib=SASHELP.SLKWXL listall;
    run;

    You can also capture the list of available Excel functions in a SAS data table using ODS OUTPUT with CODELIST= option:

    ods noresults;
    ods output codelist=WORK.EXCEL_FUNCTIONS_LIST (keep=COL1 COL2);
    proc fcmp inlib=SASHELP.SLKWXL listall;
    run;
    ods output close;
    ods results;

    From this data table you can produce a nice looking HTML report listing all these functions:

    data WORK.EXCEL_SAS_FUNCTIONS (keep=exc sas arg);
       label exc='Excel Function' sas='SAS Function' arg='Arguments';
       set WORK.EXCEL_FUNCTIONS_LIST (rename=(col2=arg));
       sas = tranwrd(col1,'Function ','');
       exc = tranwrd(sas,'_slk','');
    run;
     
    ods html path='c:\temp' file='excel_sas_functions.html';
    title 'List of Excel functions available in SAS (via SASHELP.SLKWXL)';
    proc print data=EXCEL_SAS_FUNCTIONS label;
    run;
    ods html close;

    When you run this code, you should get the following list of Excel functions along with their SAS equivalents:

    List of Excel functions available in SAS (via SASHELP.SLKWXL)
    Obs Excel Function SAS Function Arguments
    1 even even_slk ( x )
    2 odd odd_slk ( x )
    3 factdouble factdouble_slk ( x )
    4 product product_slk ( nums )
    5 multinomial multinomial_slk ( nums )
    6 floor floor_slk ( n, sg )
    7 datdif4 datdif4_slk ( start, end )
    8 amorlinc amorlinc_slk ( cost, datep, fperiod, salvage, period, rate, basis )
    9 amordegrc amordegrc_slk ( cost, datep, fperiod, salvage, period, rate, basis )
    10 disc disc_slk ( settlement, maturity, pr, redemp, basis )
    11 tbilleq tbilleq_slk ( settlement, maturity, discount )
    12 tbillprice tbillprice_slk ( settlement, maturity, discount )
    13 tbillyield tbillyield_slk ( settlement, maturity, par )
    14 dollarde dollarde_slk ( fdollar, frac )
    15 dollarfr dollarfr_slk ( ddollar, frac )
    16 effect effect_slk ( nominal_rate, npery )
    17 coupnum coupnum_slk ( settlement, maturity, freq, basis )
    18 coupncd coupncd_slk ( settlement, maturity, freq, basis )
    19 coupdaysnc coupdaysnc_slk ( settlement, maturity, freq, basis )
    20 couppcd couppcd_slk ( settlement, maturity, freq, basis )
    21 coupdays coupdays_slk ( settlement, maturity, freq, basis )
    22 db db_slk ( cost, salvage, life, period, month )
    23 yield yield_slk ( settlement, maturity, rate, pr, redemp, freq, basis )
    24 yielddisc yielddisc_slk ( settlement, maturity, pr, redemp, basis )
    25 coupdaybs coupdaybs_slk ( settlement, maturity, freq, basis )
    26 oddfprice oddfprice_slk ( settlement, maturity, issue, fcoupon, rate, yield, redemp, freq, basis )
    27 oddfyield oddfyield_slk ( settlement, maturity, issue, fcoupon, rate, pr, redemp, freq, basis )
    28 oddlyield oddlyield_slk ( settlement, maturity, linterest, rate, pr, redemp, freq, basis )
    29 oddlprice oddlprice_slk ( settlement, maturity, linterest, rate, yield, redemp, freq, basis )
    30 price price_slk ( settlement, maturity, rate, yield, redemp, freq, basis )
    31 pricedisc pricedisc_slk ( settlement, maturity, discount, redemp, basis )
    32 pricemat pricemat_slk ( settlement, maturity, issue, rate, yld, basis )
    33 yieldmat yieldmat_slk ( settlement, maturity, issue, rate, pr, basis )
    34 received received_slk ( settlement, maturity, investment, discount, basis )
    35 accrint accrint_slk ( issue, finterest, settlement, rate, par, freq, basis )
    36 accrintm accrintm_slk ( issue, maturity, rate, par, basis )
    37 duration duration_slk ( settlement, maturity, coupon, yld, freq, basis )
    38 mduration mduration_slk ( settlement, maturity, coupon, yld, freq, basis )
    39 avedev avedev_slk ( data )
    40 devsq devsq_slk ( data )
    41 varp varp_slk ( data )

     
    NOTE: Excel functions that are made available in SAS are named from their Excel parent functions, suffixing them with _SLK to distinguish them from their Excel incarnations, as well as from native SAS functions.

    Examples of Microsoft Excel functions usage in SAS

    In order to use any of these Excel functions in your SAS code, all you need to do is to specify the functions definition data table in the CMPLIB= option:

    options cmplib=SASHELP.SLKWXL;

    Let’s consider several examples.

    ODD function

    This function returns number rounded up to the nearest odd integer:

    options cmplib=SASHELP.SLKWXL;
    data _null_;
       x = 6.4;
       y = odd_slk(x);
       put 'odd( ' x ') = ' y;
    run;

    SAS log:
    odd( 6.4 ) = 7

    EVEN function

    This function returns number rounded up to the nearest even integer:

    options cmplib=SASHELP.SLKWXL;
    data _null_;
       x = 6.4;
       y = even_slk(x);
       put 'even( ' x ') = ' y;
    run;

    SAS log:
    odd( 6.4 ) = 8

    FACTDOUBLE function

    This function returns the double factorial of a number. If number is not an integer, it is truncated.
    Double factorial (or semifactorial) of a number n, denoted by n!!, is the product of all the integers from 1 up to n that have the same parity as n.
    For even n, the double factorial is n!!=n(n-2)(n-4)…(4)(2), and for odd n, the double factorial is n!! = n(n-2)(n-4)…(3)(1).

    Here is a SAS code example using the factdouble() Excel function:

    options cmplib=SASHELP.SLKWXL;
    data _null_;
       n = 6;
       m = 7;
       nn = factdouble_slk(n);
       mm = factdouble_slk(m);
       put n '!! = ' nn / m '!! = ' mm;
    run;

    It will produce the following SAS log:
    6 !! = 48
    7 !! = 105

    Indeed, 6!! = 2 x 4 x 6 = 48 and 7!! = 1 x 3 x 5 x 7 = 105.

    PRODUCT function

    This function multiplies all elements of SAS numeric array given as its argument and returns the product:

    options cmplib=SASHELP.SLKWXL;
    data _null_;
       array x x1-x5 (5, 7, 1, 2, 2);
       p = product_slk(x);
       put 'x = ( ' x1-x5 ')';
       put 'product(x) = ' p;
    run;

    SAS log:
    x = ( 5 7 1 2 2 )
    product(x) = 140

    Indeed 5*7*1*2*2 = 140.

    MULTINOMIAL function

    This function returns the ratio of the factorial of a sum of values to the product of factorials:

    MULTINOMIAL(a1, a2, ... , an) = (a1 + a2 + ... + an)! : (a1! a2! ... an!)

    In SAS, the argument to this function is specified as numeric array name:

    options cmplib=SASHELP.SLKWXL;
    data _null_;
       array a a1-a3 (1, 3, 2);
       m = multinomial_slk(a);
       put 'a = ( ' a1-a3 ')';
       put 'multinomial(a) = ' m;
    run;

    SAS log:
    a = ( 1 3 2 )
    multinomial(a) = 60

    Indeed (1+3+2)!  :  (1! + 3! + 2!) = 720 : 12 = 60.

    Other Microsoft Excel functions available in SAS

    You can explore other Excel functions available in SAS via SASHELP.SLKWXL user-defined functions by cross-referencing them with the corresponding Microsoft Excel functions documentation (alphabetical or by categories) As you can see in the above List of Excel functions available in SAS, besides mathematical functions exemplified in the previous section, there are also many Excel financial functions related to securities trading that are made available in SAS.

    Additional Resources on SAS user-defined functions

    Your thoughts?

    Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments below.

    Using Microsoft Excel functions in SAS was published on SAS Users.

    9月 172020
     

    Unquote by removing matching quotesBefore we delve into unquoting SAS character variables let’s briefly review existing SAS functionality related to the character strings quoting/unquoting.

    %QUOTE and %UNQUOTE macro functions

    Don’t be fooled by these macro functions’ names. They have nothing to do with quoting or un-quoting character variables’ values. Moreover, they have nothing to do with quoting or un-quoting even macro variables’ values. According to the %QUOTE Macro Function documentation it masks special characters and mnemonic operators in a resolved value at macro execution.  %UNQUOTE Macro Function unmasks all special characters and mnemonic operators so they are interpreted as macro language elements instead of as text. There are many other SAS “macro quoting functions” (%SUPERQ, %BQUOTE, %NRBQUOTE, all macro functions whose name starts with %Q: %QSCAN, %QSUBSTR, %QSYSFUNC, etc.) that perform some action including masking.

    Historically, however, SAS Macro Language uses terms “quote” and “unquote” to denote “mask” and “unmask”. Keep that in mind when reading SAS Macro documentation.

    QUOTE function

    Most SAS programmers are familiar with the QUOTE function that adds quotation marks around a character value. It can add double quotation marks (by default) or single quotation marks if you specify that in its second argument.

    This function goes even further as it doubles any quotation mark that already existed within the value to make sure that an embedded quotation mark is escaped (not treated as an opening or closing quotation mark) during parsing.

    DEQUOTE function

    There is also a complementary DEQUOTE function that removes matching quotation marks from a character string that begins with a quotation mark. But be warned that it also deletes all characters to the right of the first matching quotation mark. In my view, deleting those characters is overkill because when writing a SAS program, we may not know what is going to be in the data and whether it’s okay to delete its part outside the first matching quotes. That is why you need to be extra careful if you decide to use this function. Here is an example of what I mean. If you run the following code:

    data a;
       input x $ 1-50;
       datalines;
    'This is what you get. Let's be careful.'
    ;
     
    data _null_;
       set a;
       y = dequote(x);
       put x= / y=;
    run;

    you will get the following in the SAS log:

    y=This is what you get. Let

    This is hardly what you really wanted as you have just lost valuable information – part of the y character value got deleted: 's be careful. I would rather not remove the quotation marks at all than remove them at the expense of losing meaningful information.

    $QUOTE informat

    The $QUOTE informat does exactly what the DEQUOTE() function does, that is removes matching quotation marks from a character string that begins with a quotation mark. You can use it in the example above by replacing

    y = dequote(x);

    with the INPUT() function

    y = input(x, $quote50.);

    Or you can use it directly in the INPUT statement when reading raw data from datalines or an external file:

    input x $quote50.;

    Both, $QUOTE informat and DEQUOTE() function, in addition to removing all characters to the right of the closing quotation mark do the following unconventional, peculiar things:

    • Remove a lone quotation mark (either double or single) when it’s the only character in the string; apparently, the lone quotation mark is matched to itself.
    • Match single quotation mark with double quotation mark as if they are the same.
    • Remove matching quotation marks from a character string that begins with a quotation mark; if your string has one or more leading blanks (that is, a quotation mark is not the first character), nothing gets removed (un-quoted).

    If the described behavior matches your use case, you are welcome to use either $QUOTE informat or DEQUOTE() function. Otherwise, please read on.

    UNQUOTE function definition

    Up to this point such a function did not exist, but we are about to create one to justify the title. Let’s keep it simple and straightforward. Here is what I propose our new unquote() function to do:

    • If first and last non-blank characters of a character string value are matching quotation marks, we will remove them. We will not consider quotation marks matching if one of them is a single quotation mark and another is a double quotation mark.
    • We will remove those matching quotation marks whether they are both single quotation marks OR both double quotation marks.
    • We are not going to remove or change any other quotation marks that may be present within those matching quotation marks that we remove.
    • We will remove leading and trailing blanks outside the matching quotation marks that we delete.
    • However, we will not remove any leading or trailing blanks within the matching quotation marks that we delete. You may additionally apply the STRIP() function if you need to do that.

    To summarize these specifications, our new UNQUOTE() function will extract a character substring within matching quotation marks if they are the first and the last non-blank characters in a character string. Otherwise, it returns the character argument unchanged.

    UNQUOTE function implementation

    Here is how such a function can be implemented using PROC FCMP:

    libname funclib 'c:\projects\functions';
     
    proc fcmp outlib=funclib.userfuncs.v1; /* outlib=libname.dataset.package */
       function unquote(x $) $32767;
          pos1 = notspace(x); *<- first non-blank character position;
          if pos1=0 then return (x); *<- empty string;
     
          char1 = char(x, pos1); *<- first non-blank character;
          if char1 not in ('"', "'") then return (x); *<- first non-blank character is not " or ' ;
     
          posL = notspace(x, -length(x)); *<- last non-blank character position;
     
          if pos1=posL then return (x); *<- single character string;
     
          charL = char(x, posL); *<- last non-blank character;
          if charL^=char1 then return (x); *<- last non-blank character does not macth first;
     
          /* at this point we should have matching quotation marks */
          return (substrn(x, pos1 + 1, posL - pos1 - 1)); *<- remove first and last quotation character;
       endfunc; 
    run;

    Here are the highlights of this implementation:

    We use multiple RETURN statements: we sequentially check for different special conditions and if one of them is met we return the argument value intact. The RETURN statement does not just return the value, but also stops any further function execution.

    At the very end, after making sure that none of the special conditions is met, we strip the argument value from the matching quotation marks along with the leading and trailing blanks outside of them.

    NOTE: SAS user-defined functions are stored in a SAS data set specified in the outlib= option of the PROC FCMP. It requires a 3-level name (libref.datsetname.packagename) for the function definition location to allow for several versions of the same-name function to be stored there.

    However, when a user-defined function is used in a SAS DATA Step, only a 2-level name can be specified (libref.datasetname). If that data set has several same-name functions stored in different packages the DATA Step uses the latest function definition (found in a package closest to the bottom of the data set).

    UNQUOTE function results

    Let’s use the following code to test our newly minted user-defined function UNQUOE():

    libname funclib 'c:\projects\functions';
    options cmplib=funclib.userfuncs;
     
    data A;
       infile datalines truncover;
       input @1 S $char100.;
       datalines;
    '
    "
    How about this?
        How about this?
    "How about this?"
    'How about this?'
    "How about this?'
    'How about this?"
    "   How about this?"
    '      How about this?'
    '      How "about" this?'
    '      How 'about' this?'
       "     How about this?"
       "     How "about" this?"
       "     How 'about' this?"
       '     How about this?'
    ;
     
    data B;
       set A;
       length NEW_S $100;
       label NEW_S = 'unquote(S)';
       NEW_S = unquote(S);
    run;

    This code produces the following output table:

    Example of character string unquoting
    As you can see it does exactly what we wanted it to do – removing matching first and last quotation marks as well as stripping out blanks outside the matching quotation marks.

    DSD (Delimiter-Sensitive Data) option

    This INFILE statement’s option is particularly and extremely useful when using LIST input to read and un-quote comma-delimited raw data. In addition to removing enclosing quotation marks from character values, the DSD option specifies that when data values are enclosed in quotation marks, delimiters within the value are masked, that is treated as character data (not as delimiters). It also sets the default delimiter to a comma and treats two consecutive delimiters as a missing value.

    In contrast with the above UNQUOTE() function, the DSD option will not remove enclosing quotation marks if there are same additional quotation marks present inside the character value.  When DSD option does strip enclosing quotation marks it also strips leading and trailing blanks outside and within the removed quotation marks.

    Additional Resources

    Your thoughts?

    Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments below.

    How to unquote SAS character variable values was published on SAS Users.

    9月 172020
     

    Unquote by removing matching quotesBefore we delve into unquoting SAS character variables let’s briefly review existing SAS functionality related to the character strings quoting/unquoting.

    %QUOTE and %UNQUOTE macro functions

    Don’t be fooled by these macro functions’ names. They have nothing to do with quoting or un-quoting character variables’ values. Moreover, they have nothing to do with quoting or un-quoting even macro variables’ values. According to the %QUOTE Macro Function documentation it masks special characters and mnemonic operators in a resolved value at macro execution.  %UNQUOTE Macro Function unmasks all special characters and mnemonic operators so they are interpreted as macro language elements instead of as text. There are many other SAS “macro quoting functions” (%SUPERQ, %BQUOTE, %NRBQUOTE, all macro functions whose name starts with %Q: %QSCAN, %QSUBSTR, %QSYSFUNC, etc.) that perform some action including masking.

    Historically, however, SAS Macro Language uses terms “quote” and “unquote” to denote “mask” and “unmask”. Keep that in mind when reading SAS Macro documentation.

    QUOTE function

    Most SAS programmers are familiar with the QUOTE function that adds quotation marks around a character value. It can add double quotation marks (by default) or single quotation marks if you specify that in its second argument.

    This function goes even further as it doubles any quotation mark that already existed within the value to make sure that an embedded quotation mark is escaped (not treated as an opening or closing quotation mark) during parsing.

    DEQUOTE function

    There is also a complementary DEQUOTE function that removes matching quotation marks from a character string that begins with a quotation mark. But be warned that it also deletes all characters to the right of the first matching quotation mark. In my view, deleting those characters is overkill because when writing a SAS program, we may not know what is going to be in the data and whether it’s okay to delete its part outside the first matching quotes. That is why you need to be extra careful if you decide to use this function. Here is an example of what I mean. If you run the following code:

    data a;
       input x $ 1-50;
       datalines;
    'This is what you get. Let's be careful.'
    ;
     
    data _null_;
       set a;
       y = dequote(x);
       put x= / y=;
    run;

    you will get the following in the SAS log:

    y=This is what you get. Let

    This is hardly what you really wanted as you have just lost valuable information – part of the y character value got deleted: 's be careful. I would rather not remove the quotation marks at all than remove them at the expense of losing meaningful information.

    $QUOTE informat

    The $QUOTE informat does exactly what the DEQUOTE() function does, that is removes matching quotation marks from a character string that begins with a quotation mark. You can use it in the example above by replacing

    y = dequote(x);

    with the INPUT() function

    y = input(x, $quote50.);

    Or you can use it directly in the INPUT statement when reading raw data from datalines or an external file:

    input x $quote50.;

    Both, $QUOTE informat and DEQUOTE() function, in addition to removing all characters to the right of the closing quotation mark do the following unconventional, peculiar things:

    • Remove a lone quotation mark (either double or single) when it’s the only character in the string; apparently, the lone quotation mark is matched to itself.
    • Match single quotation mark with double quotation mark as if they are the same.
    • Remove matching quotation marks from a character string that begins with a quotation mark; if your string has one or more leading blanks (that is, a quotation mark is not the first character), nothing gets removed (un-quoted).

    If the described behavior matches your use case, you are welcome to use either $QUOTE informat or DEQUOTE() function. Otherwise, please read on.

    UNQUOTE function definition

    Up to this point such a function did not exist, but we are about to create one to justify the title. Let’s keep it simple and straightforward. Here is what I propose our new unquote() function to do:

    • If first and last non-blank characters of a character string value are matching quotation marks, we will remove them. We will not consider quotation marks matching if one of them is a single quotation mark and another is a double quotation mark.
    • We will remove those matching quotation marks whether they are both single quotation marks OR both double quotation marks.
    • We are not going to remove or change any other quotation marks that may be present within those matching quotation marks that we remove.
    • We will remove leading and trailing blanks outside the matching quotation marks that we delete.
    • However, we will not remove any leading or trailing blanks within the matching quotation marks that we delete. You may additionally apply the STRIP() function if you need to do that.

    To summarize these specifications, our new UNQUOTE() function will extract a character substring within matching quotation marks if they are the first and the last non-blank characters in a character string. Otherwise, it returns the character argument unchanged.

    UNQUOTE function implementation

    Here is how such a function can be implemented using PROC FCMP:

    libname funclib 'c:\projects\functions';
     
    proc fcmp outlib=funclib.userfuncs.v1; /* outlib=libname.dataset.package */
       function unquote(x $) $32767;
          pos1 = notspace(x); *<- first non-blank character position;
          if pos1=0 then return (x); *<- empty string;
     
          char1 = char(x, pos1); *<- first non-blank character;
          if char1 not in ('"', "'") then return (x); *<- first non-blank character is not " or ' ;
     
          posL = notspace(x, -length(x)); *<- last non-blank character position;
     
          if pos1=posL then return (x); *<- single character string;
     
          charL = char(x, posL); *<- last non-blank character;
          if charL^=char1 then return (x); *<- last non-blank character does not macth first;
     
          /* at this point we should have matching quotation marks */
          return (substrn(x, pos1 + 1, posL - pos1 - 1)); *<- remove first and last quotation character;
       endfunc; 
    run;

    Here are the highlights of this implementation:

    We use multiple RETURN statements: we sequentially check for different special conditions and if one of them is met we return the argument value intact. The RETURN statement does not just return the value, but also stops any further function execution.

    At the very end, after making sure that none of the special conditions is met, we strip the argument value from the matching quotation marks along with the leading and trailing blanks outside of them.

    NOTE: SAS user-defined functions are stored in a SAS data set specified in the outlib= option of the PROC FCMP. It requires a 3-level name (libref.datsetname.packagename) for the function definition location to allow for several versions of the same-name function to be stored there.

    However, when a user-defined function is used in a SAS DATA Step, only a 2-level name can be specified (libref.datasetname). If that data set has several same-name functions stored in different packages the DATA Step uses the latest function definition (found in a package closest to the bottom of the data set).

    UNQUOTE function results

    Let’s use the following code to test our newly minted user-defined function UNQUOE():

    libname funclib 'c:\projects\functions';
    options cmplib=funclib.userfuncs;
     
    data A;
       infile datalines truncover;
       input @1 S $char100.;
       datalines;
    '
    "
    How about this?
        How about this?
    "How about this?"
    'How about this?'
    "How about this?'
    'How about this?"
    "   How about this?"
    '      How about this?'
    '      How "about" this?'
    '      How 'about' this?'
       "     How about this?"
       "     How "about" this?"
       "     How 'about' this?"
       '     How about this?'
    ;
     
    data B;
       set A;
       length NEW_S $100;
       label NEW_S = 'unquote(S)';
       NEW_S = unquote(S);
    run;

    This code produces the following output table:

    Example of character string unquoting
    As you can see it does exactly what we wanted it to do – removing matching first and last quotation marks as well as stripping out blanks outside the matching quotation marks.

    DSD (Delimiter-Sensitive Data) option

    This INFILE statement’s option is particularly and extremely useful when using LIST input to read and un-quote comma-delimited raw data. In addition to removing enclosing quotation marks from character values, the DSD option specifies that when data values are enclosed in quotation marks, delimiters within the value are masked, that is treated as character data (not as delimiters). It also sets the default delimiter to a comma and treats two consecutive delimiters as a missing value.

    In contrast with the above UNQUOTE() function, the DSD option will not remove enclosing quotation marks if there are same additional quotation marks present inside the character value.  When DSD option does strip enclosing quotation marks it also strips leading and trailing blanks outside and within the removed quotation marks.

    Additional Resources

    Your thoughts?

    Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments below.

    How to unquote SAS character variable values was published on SAS Users.

    8月 122020
     

    CVP engine as a magnifying glassIn my earlier blog post, Changing variable type and variable length in SAS datasets, I showed how you can effectively change variables lengths in a SAS data set. That approach works fine when you need to change length attribute for few variables, on a case by case basis. But what if you need to change lengths for all character variables in a data set? Or if you need to do this for all data sets in a data library? For example, you need to expand (increase) all your character variables lengths by 50%. Well, then the case-by-case approach becomes too laborious and inefficient.

    What is a character variable’s length attribute?

    Before reading any further, let’s take a quick quiz:

    Q: A character variable length attribute represents a number of:

    1. Bits
    2. Bytes
    3. Centimeters
    4. Characters

    If your answer is anything but B, it’s incorrect. According to the SAS documentation, length refers to the number of bytes used to store each of the variable's values in a SAS data set. You can use a LENGTH statement to set the length of both numeric and character variables.

    It is true though that for some older encoding systems (ASCII, ISO/IEC 8859, EBCIDIC, etc.) there was no difference between the number of bytes and the number of characters as those systems were based on exactly one byte per character encoding. They are even called Single Byte Character Sets (SBCS) for that reason. The problem is they can accommodate only a maximum of 28=256 symbols which is not nearly enough to cover all the variety of natural languages, special characters, emojis etc.

    Why would we want to expand character variable lengths?

    Use case 1. Expanding character values range

    For this scenario, let’s consider Internet traffic analysis where your data contains multiple character columns for Internet Protocol addresses (IP addresses) in 32-bit version 4 (IPv4, e.g. ‘125.255.501.780’). You transition to a newer 128-bit IPv6 standard (e.g. ‘2001:0000:3238:DFE1:0063:0000:0000:FEFB’) and need to modify your data structure to accommodate the new standard with longer character values.

    Use case 2. Migrating SAS data to multi-byte encoding environment

    In this scenario, you migrate/move SAS data sets from older SBCS environments to newer Multi-Byte-Character Set (MBCS) encoding environments. For such a case, the ability to increase character variables lengths in bulk with a simple action becomes especially significant and critical.

    Currently, the most commonly used MBCS is Unicode which is supported by all modern operating systems, databases and web browsers. Out of different flavors of Unicode (UTF-8, UTF-16, UTF-32) the most popular is UTF-8. UTF-8 (8-bit Unicode Transformation Format) is a variable-width character set that uses from 1 to 4 one-byte (8-bit) code units per character; it is capable of encoding 1,112,064 various characters that covers most modern languages, including Arabic and Hebrew characters, hieroglyphs, emojis as well as many other special characters.

    Since each UTF-8 encoded character may require somewhere between one and four bytes, and not all SBCS characters are represented by one byte in UTF-8, data migration from SBCS to UTF-8 may cause data truncation and subsequently data loss.

    When SAS reads an SBCS-encoded data set and writes its records into UTF-8-encoded data set it throws an ERROR message in the log and stops execution:

    ERROR: Some character data was lost during transcoding in the dataset LIBREF.DSNAME. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.

    When SAS reads an SBCS-encoded data set and produces a UTF-8-encoded printed report only (without generating a UTF-8-encoded output data set) it generates a WARNING message (with identical description as the above ERROR message) while continuing execution:

    WARNING: Some character data was lost during transcoding in the dataset LIBREF.DSNAME. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.

    Either ERROR or WARNING is unacceptable and must be properly addressed.

    How to expand all character variables lengths?

    Regardless of character transcoding, SAS’ CVP Engine is short and effective answer to this question. CVP stands for Character Variable Padding which is exactly what this special-purpose engine does – it pads or expands, increases character variables by a number of bytes. CVP engine is part of Base SAS and does not require any additional licensing.

    The CVP engine is a read-only engine for SAS data sets only. You can think of it as of a magnifying glass: it creates an expanded view of the character data descriptors (lengths) without changing them. Still we can use the CVP Engine to actually change a data set or a whole data library to their expanded character variables version. All we need to do is to define our source library as CVP library, for example:

    libname inlib cvp 'c:\source_folder';

    Then use PROC COPY to create expanded versions of our original data sets in a target library:

    libname outlib 'c:\target_folder';
    proc copy in=inlib out=outlib noclone;
       select dataset1 dataset2;
    run;

    Or, if we need to expand character variable lengths for the whole library, then we use the same PROC COPY without the SELECT statement:

    proc copy in=inlib out=outlib noclone;
    run;

    It’s that easy. And the icing on the cake is that CVP engine automatically adjusts the variables format widths to meet the expanded byte lengths for all converted character variables.

    Avoiding character data truncation by using the CVP Engine

    CVP Engine is a near-perfect SAS solution to the problem of potential data truncation when data is transcoded during migration or move from SBCS-based to MBCS-based systems.

    To avoid data loss from possible data truncation during transcoding we can use the above code with a slight but important modification – define the target library with outencoding='UTF-8' option. It will result in our target data not only expanded lengthwise but properly encoded as well. Then we run this modified code in the old SBCS environment before moving/migrating our data sets to the new MBCS environment:

    libname inlib cvp 'c:\source_folder';
    libname outlib 'c:\utf8_target_folder' outencoding='UTF-8';
    proc copy in=inlib out=outlib noclone;
       select dataset1 dataset2;
    run;

    Again, if you need to expand character variable lengths for the whole library, then you can use the same PROC COPY without the SELECT statement:

    proc copy in=inlib out=outlib noclone;
    run;

    After that we can safely move our expanded, UTF-8-encoded data to the new UTF-8 environment.

    Code notes

    • The code above will create a different version of your original data sets with desired encoding and expanded by 50% (default) character variables lengths. As shown below, this default behavior can be changed by using CVPBYTES= or CVPMULTIPLIER= options which explicitly define bytes expansion rate.
    • It is important to note that CVP option is specified on the input library since the CVP engine is read-only engine, thus available for input (read) processing only.
    • For the output library you specify your desired encoding option, in this case outencoding='UTF-8'.
    • The noclone option specifies not to copy data set attributes. This is needed to make sure the attributes are recreated rather than duplicated.
    • If you want to migrate your data sets using PROC MIGRATE, you should expand column lengths before using PROC COPY as shown above since the CVP engine is not currently supported with PROC MIGRATE.
    • The CVP engine supports only SAS data files (no SAS views, catalogs, item stores, and so on).

    CVP Engine options

    There are several options available with the CVP Engine. Here are the most widely used:

    CVPBYTES=bytes - specifies the number of bytes by which to expand character variable lengths. The lengths of character variables are increased by adding the specified bytes value to the current length.

    Example: libname inlib 'SAS data-library' cvpbytes=5;

    The CVPBYTES= option implicitly specifies the CVP engine, that is if you specify the CVPBYTES= option you don’t have to specify CVP engine explicitly as SAS will use it automatically.

    CVPMULTIPLIER=multiplier - specifies a multiplier value that expands character variable. The lengths of character variables are increased by multiplying the current length by the specified multiplier value. You can specify a multiplier value from 1 to 5, or you can specify 0 and then the CVP engine determines the multiplier automatically.

    Example: libname inlib 'SAS data-library' cvpmultiplier=2.5;

    The CVPMULTIPLIER= option also implicitly specifies the CVP engine, that is if you specify the CVPMULTIPLIER= option, you don’t have to specify CVP engine explicitly as SAS will use it automatically.

    Note:

    • You cannot specify both the CVPMULTIPLIER= option and the CVPBYTES= option. Specify only one of these options.
    • If you explicitly assign the CVP engine but do not specify either CVPBYTES= or CVPMULTIPLIER=, then SAS defaults to using CVPMULTIPLIER=1.5 to increase the lengths of the character variables.

    Additional Resources

    Your thoughts?

    Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments section below.

    Expanding lengths of all character variables in SAS data sets was published on SAS Users.

    7月 232020
     

    Splitting one into smaller pieces

    In his blog post, How to split one data set into many, Chris Hemedinger showed how to subset or split SAS data sets based on the values of categorical variables. For example, based on a value of variable REGION you may split a data set MARKETING into MARKETING_ASIA, MARKETING_AMERICA, MARKETING_EUROPE, and so on.

    In some cases, however, we need to split a large data set into many – not by a subsetting variable values, but by a number of observations in order to produce smaller, better manageable data sets. Such an approach can be dictated by restrictions on the data set size imposed by hardware (memory size, transmission channel bandwidth etc.), processing time, or user interface convenience (e.g. search results displayed by pages).

    We might need to split a data set into smaller tables of K observations or less each; or to split a data set into S equal (or approximately equal) pieces.

    We might need to split a data set into sequentially selected subsets where the first K observations go into the first data set, the second K observations go into the second data set, and so on. Alternatively, we might need to randomly select observations from a data set while splitting it into smaller tables.

    This blog post provides possible coding solutions for such scenarios.

    Splitting a data set into smaller data sets sequentially

    Let’s say we need to split a data set SASHELP.CARS (number of observation N=428) into several smaller datasets. We will consider the following two sequential observation selection scenarios:

    1. Each smaller data set should have maximum of K observations.
    2. There should be S smaller data sets of approximately same size.

    Ideally, we would like to split a data set into K observations each, but it is not always possible to do as the quotient of dividing the number of observations in the original dataset N by K is not always going to be a whole number. Therefore, we will split it into several smaller data sets of K observations each, but the last smaller data set will have the number of observations equal to the remainder of the division N by K.

    Similarly, with the scenario 2, we will split the source data set into several smaller data sets of the same size, but the last smaller data set will have the number of observations equal to the remainder of the division N by K.

    Below is a SAS macro code that covers both these scenarios.

    %macro split (SRC_DATASET=, OUT_PREFIX=, SPLIT_NUM=, SPLIT_DEF=);
    /* Parameters:
    /*   SRC_DATASET - name of the source data set     */
    /*   OUT_PREFIX - prefix of the output data sets   */
    /*   SPLIT_NUM - split number                      */
    /*   SPLIT_DEF - split definition (=SETS or =NOBS) */
     
       %local I K S TLIST;
     
       /* number of observations &K, number of smaller datasets &S */
       data _null_;
          if 0 then set &SRC_DATASET nobs=N;
          if upcase("&SPLIT_DEF")='NOBS' then
             do;
                call symputx('K',&SPLIT_NUM); 
                call symputx('S',ceil(N/&SPLIT_NUM));
                put "***MACRO SPLIT: Splitting into datasets of no more than &SPLIT_NUM observations";
             end;
             else if upcase("&SPLIT_DEF")='SETS' then
             do;
                call symputx('S',&SPLIT_NUM); 
                call symputx('K',ceil(N/&SPLIT_NUM));
                put "***MACRO SPLIT: Splitting into &SPLIT_NUM datasets";
            end;
             else put "***MACRO SPLIT: Incorrect SPLIT_DEF=&SPLIT_DEF value. Must be either SETS or NOBS.";
          stop; 
       run;
     
     
       /* terminate macro if nothing to split */
       %if (&K le 0) or (&S le 0) %then %return;
     
        /* generate list of smaller dataset names */
       %do I=1 %to &S;
          %let TLIST = &TLIST &OUT_PREFIX._&I;
       %end;
     
       /* split source dataset into smaller datasets */
       data &TLIST;
          set &SRC_DATASET;
          select;
             %do I=1 %to &S;
                when(_n_ <= &K * &I) output &OUT_PREFIX._&I; 
             %end;
          end;
       run;
     
    %mend split;

    The following are examples of the macro invocations:

    %split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=SET);
     
    %split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=NOBS);
     
    %split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=3, SPLIT_DEF=SETS);

    These invocations will produce the following SAS logs:

    %split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=SET);
    ***MACRO SPLIT: Incorrect SPLIT_DEF=SET value. Must be either SETS or NOBS.
     
    %split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=NOBS);
    ***MACRO SPLIT: Splitting into datasets of no more than 100 observations
    NOTE: There were 428 observations read from the data set SASHELP.CARS.
    NOTE: The data set WORK.CARS_1 has 100 observations and 15 variables.
    NOTE: The data set WORK.CARS_2 has 100 observations and 15 variables.
    NOTE: The data set WORK.CARS_3 has 100 observations and 15 variables.
    NOTE: The data set WORK.CARS_4 has 100 observations and 15 variables.
    NOTE: The data set WORK.CARS_5 has 28 observations and 15 variables.
     
    %split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=3, SPLIT_DEF=SETS);
    ***MACRO SPLIT: Splitting into 3 datasets
    NOTE: There were 428 observations read from the data set SASHELP.CARS.
    NOTE: The data set WORK.CARS_1 has 143 observations and 15 variables.
    NOTE: The data set WORK.CARS_2 has 143 observations and 15 variables.
    NOTE: The data set WORK.CARS_3 has 142 observations and 15 variables.

    Splitting a data set into smaller data sets randomly

    For randomly splitting a data set into many smaller data sets we can use the same approach as above with a slight modification. In essence, we are going to randomly shuffle observations of our source data set first, and then apply the sequential splitting.

    In order to implement this, we just need to replace the last data step in the above macro with the following 3 steps:

    /* generate random numbers, R */
       data;
          set &SRC_DATASET;
          call streaminit(1234);
          R = rand('uniform');
       run;
     
       /* sort data in R order */
       proc sort;
          by R;
       run;
     
       /* split source dataset into smaller datasets */
       data &TLIST (drop=R);
          set;
          select;
             %do I=1 %to &S;
                when(_n_ <= &K * &I) output &OUT_PREFIX._&I; 
             %end;
          end;
       run;

    This modified code will produce similar results (with the same information in the SAS log), however, smaller data sets will have their observations randomly selected from the source data set.

    DATAn naming convention

    You may have noticed that in this random splitting code I have not specified data set names neither in the DATA statement of the first DATA step, nor in the PROC SORT and not even in the SET statement of the last DATA step. Not only these shortcuts possible due to SAS’ DATAn naming convention, but it is a very robust way of dynamically assigning temporary data set names. This method is especially useful and appropriate for SAS macros as it guarantees that you do not accidentally overwrite a data set with the same name in SAS program that invokes your macro. Think about it: if you are a macro developer you need to make sure that whatever temporary data sets you create within your macro their names must be unique for a SAS session in order not to interfere with any data sets that may be created in the calling SAS program outside of your macro.

    Here are defaults in SAS’ DATAn naming convention:

    • If you do not specify a name for the output data set in a DATA statement, SAS automatically assigns the default names WORK.DATA1, WORK.DATA2, and so on, to each successive data set that you create.
    • If you do not specify a name for the input data set in a SET statement, SAS automatically uses the last data set that was created. SAS keeps track of the most recently created data set through the reserved name _LAST_. When you execute a DATA or PROC step without specifying an input data set, by default, SAS uses the _LAST_ data set.

    For more information on this useful SAS coding technique see special data set names and examples and warning on using special data set names.

    Your thoughts?

    Do you find this post useful? Have you ever split data sets into smaller ones based on a number of observations? Do you use special data set names and DATAn naming convention in your SAS coding? Please share your thoughts in the comments section below.

    Splitting a data set into smaller data sets was published on SAS Users.

    5月 282020
     

    SAS toolbox: macro functions
    Did you know you could have a single universal function that can replace all the functions in the world? All those sin(x), log(x), … whatever(x) can all be replaced by a single super function f(x). Don’t believe me? Just make those functions names – sin, log, … whatever to be another argument to that all-purpose function f, just like that: f(x, sin), f(x, log), … f(x, whatever). Now, we only must deal with a single function instead of many, and its second argument will define what transformation needs to be done with the first argument in order to arrive at this almighty function’s value.

    How many functions there are in SAS

    Last time I counted there were more than 600 SAS functions, and that is excluding call routines and macro functions. But even that huge number grossly under-represents the actual number of functions available in SAS. That is because there are some functions that are built like the universal multi-purpose super function described above. For example, look at the following functions:

    finance() function represents several dozen various financial functions;

    finfo() function represents multiple functions returning various information items about files (file size, date created, date modified, access permission, etc.);

    dinfo() function returns similar information items about directories;

    attrn() function returns numeric attributes of a data set (number of observations, number of variables, etc.)

    attrc() function returns character attributes of a data set (engine name, encoding name, character set, etc.)

    Each of these functions represents not a single function, but a group of functions, and one of their arguments stipulates specific functionality (an information item or an attribute) that is being requested. You can think of this argument as a function modifier.

    %sysfunc SAS macro function

    %sysfunc() is a super macro function that brings a wealth of SAS functions into SAS macro language. With very few exceptions, most SAS functions are available in SAS macro language thanks to the %sysfunc().

    Moreover, we can build our own user-defined macro functions using SAS-supplied macro functions (such as %eval, %length, %quote, %scan, etc.), as well as hundreds of the SAS non-macro functions wrapped into the %sysfunc() super macro function.

    Building a super macro function to retrieve information about data sets

    Armed with such a powerful arsenal, let’s build a multi-purpose macro function that taps into the data tables’ metadata and extracts various information items about those tables.

    Let’s make this macro function return any of the following most frequently used values:

    • Number of observations
    • Number of variables
    • Variables list (positional, separated by spaces)
    • Variables list (positional, separated by commas)

    Obviously, we can create much more of these information items and attributes, but here I am just showing how to do this so that you can create your own list depending on your needs.

    In my earlier blog post, How to create and use SAS macro functions, we had already built a macro function for getting the number of observations; let’s expand on that.

    Here is the SAS Macro code that handles extraction of all four specified metadata items:

    %macro dsinfo(dset,info);
    /* dset - data set name                             */
    /* info - modifier (NOBS, NVARS, VARLIST, VARLISTC) */      
       %local dsid result infocaps i;
       %let infocaps = %upcase(&info);
       %let dsid = %sysfunc(open(&dset));
       %if &dsid %then
       %do;
          %if &infocaps=NOBS %then %let result = %sysfunc(attrn(&dsid,nlobs));
          %else %if &infocaps=NVARS %then %let result = %sysfunc(attrn(&dsid,nvars));
          %else %if &infocaps=VARLIST %then
          %do i=1 %to %sysfunc(attrn(&dsid,nvars));
             %let result = &result %sysfunc(varname(&dsid,&i));
          %end;
          %else %if &infocaps=VARLISTC %then
          %do i=1 %to %sysfunc(attrn(&dsid,nvars));
             %if &i eq 1 %then %let result = %sysfunc(varname(&dsid,&i));
             %else %let result = &result,%sysfunc(varname(&dsid,&i));
          %end;
          %let dsid = %sysfunc(close(&dsid));
       %end;
       %else %put %sysfunc(sysmsg());
       &result
    %mend dsinfo;

    The SAS log will show:

    %put NOBS=***%dsinfo(SASHELP.CARS,NOBS)***;
    NOBS=***428***
    %put NVARS=***%dsinfo(SASHELP.CARS,NVARS)***;
    NVARS=***15***
    %put VARLIST=***%dsinfo(SASHELP.CARS,VARLIST)***;
    VARLIST=***Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length***
    %put VARLISTC=***%dsinfo(SASHELP.CARS,VARLISTC)***;
    VARLISTC=***Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length***

    Macro function code highlights

    We used the following statement to make our macro function case-insensitive regarding the info argument:

    %let infocaps = %upcase(&info);

    Then depending on the up-cased second argument of our macro function (modifier) we used the attrn(), varnum() and varname() functions within %sysfunc() to retrieve and construct our result macro variable.

    We stick that result macro variable value, &result, right before the %mend statement so that the value is returned to the calling environment.

    While info=VARLIST (space-separated variable list) is useful in DATA steps, info=VARLISTC (comma-separated variable list) is useful in PROC SQL.

    Usage example

    Having this %dsinfo macro function at hands, we can use it in multiple programming scenarios. For example:

    /* ending SAS session if no observations to process */
    %if %dsinfo(SASHELP.CARS,NOBS)=0 %then %do; endsas; %end;
     
    /* further processing */
    data MYNEWDATA (keep=%dsinfo(SASHELP.CARS,VARLIST));
       retain %dsinfo(SASHELP.CARS,VARLIST);
       set SASHELP.CARS;
       if _n_=1 then put %dsinfo(SASHELP.CARS,VARLIST);
       /* ... */
    run;

    Here we first check if there is at least one observation in a data set. If not (0 observations) then we stop the SAS session and don’t do any further processing. Otherwise, when there are some observations to process, we continue.

    If SAS code needs multiple calls to the same macro function with the same argument, we can shorten the code by first assigning that macro function’s result to a macro variable and then reference that macro variable instead of repeating macro function invocation. Here is an example:

    /* further processing */
    %let vlist = %dsinfo(SASHELP.CARS,VARLIST);
    data MYNEWDATA (keep=&vlist);
       retain &vlist;
       set SASHELP.CARS;
       if _n_=1 then put &vlist;
       /* ... */
    run;

    Additional resources

    Your thoughts?

    Do you see the benefits of these multi-purpose SAS macro functions? Can you suggest other scenarios of their usage? Please share your thoughts in the comments section below.

    Multi-purpose macro function for getting information about data sets was published on SAS Users.

    4月 232020
     

    SAS macro function

    SAS® Macro Language immensely empowers SAS programmers with versatility and efficiency of their code development. It allows SAS users to modularize programming code with “once written – many times used” components, and in many cases automatically generate data-driven SAS code.

    Macro language and macro processor

    Generally, SAS software processes your SAS program step by step, first scanning it for macro language objects - macro variables referenced as &somename, and macros referenced as %somename. If found, SAS software activates macro processor which resolves and substitutes those macro references according to the macro language syntax before SAS compiles and executes your programming steps.

    SAS macro language vs. SAS programming language

    A SAS program usually consists of two, often interwoven layers – macro layer and non-macro layer, each with its own syntax and its own timing of compilation and execution. In other words, SAS code is a combination of two distinct languages:

    • SAS programming language (comprised of DATA steps, PROC steps and global statements such as LIBNAME, OPTIONS, TITLE etc.)
    • SAS macro language (comprised of %LET, %IF, %DO, macro functions etc.) which is processed separately from and before SAS compiler executes SAS programming language code.

    The difference between them is like a difference between cooking a meal and eating the meal. In this analogy meal=code, cooking=SAS macro language, eating=SAS programming language. Clear understanding of this difference is the key to becoming a successful SAS programmer.

    Two types of SAS macros

    There are two distinct types of SAS macros:

    1. Macros that generate some SAS programming language code which can span across SAS statements or steps;
    2. Macros that generate some string values which can be used as part of SAS programming language code or data values, but they are not complete SAS statements or steps. This type does not generate any SAS executable code, just a value.

    What is a SAS macro function?

    SAS macro function is a SAS macro that generates a value. In other words, it is the type 2 macro described above. As any SAS macros, SAS macro functions can have any number (zero or more) of positional or/and named parameters (arguments). SAS users may define their own macro functions, but in doing so you may not utilize any SAS language syntax; only SAS macro language syntax is allowed. You can use existing macro functions in your own macro function definition. Among others, one of the most powerful is %SYSFUNC macro function which brings a wealth of SAS language functions into SAS macro language.

    Sources of SAS macro functions

    SAS macro functions may come from the following three sources.

    1. Pre-built macro functions

    Pre-built macro functions that are part of the macro processor. These are such macro functions as %eval, %length, %quote, %scan, %str, %sysfunc, %upcase, etc. Here is a complete list of the pre-built SAS macro functions.

    2. Auto-call macro functions

    Auto-call macros, some of them are type 1 (macros), and some – type 2 (macro functions) such as %cmpres, %left, %lowcase, %trim, %verify, etc. These macro functions supplement the pre-built macro functions. The main difference from the pre-built macro functions is that the auto-call macro functions are program samples of the user-defined macro functions that are written in SAS macro language and made available to you without having to define or include them in your programs. The auto-call macro functions come with your SAS software installation and usually pre-configured for you by setting MAUTOSOURCE and SASAUTOS= macro system options. They may include several macro libraries depending on the SAS products licensed at your site. For example, for my SAS BASE installation the auto-call macro library is in the following folder:

    C:\Program Files\SASHome\SASFoundation\9.4\core\sasmacro

    Here is a selected list of auto-call macros provided with SAS software.

    From the usage standpoint, you will not notice any difference between the pre-built and the auto-call macro functions. For example, macro function %upcase() is pre-built, while macro function %lowcase() is auto-call macro function. They belong to entirely different families, but we use them as if they are complementary siblings.

    3. User-defined macro functions

    Finally, there are user-defined macro functions that do not come with SAS installation. These are the macro functions that you define on your own. Usually, they are kept separately from the auto-call macros mainly in order to distinguish them from SAS-supplied ones.

    To enable access to your own SAS macro library in addition to the auto-call macro library (or libraries), you can use the INSERT= system option:

    options insert=(sasautos="path_to_your_own_macro_library_folder");

    Instead of replacing the SASAUTOS value, this option inserts an additional value into the existing SASAUTOS option as the first value, thus allowing you to tap into your own macro library first, and then also into pre-set SAS auto-call libraries.

    Creating user-defined macro function

    Let’s consider the following example. Suppose, we want to create a macro function that takes a data set name as an argument and returns a value equal to the number of observations in that data set.

    We know that the following code calculates the number of observations in a data set:

    data _null_;
       set SASHELP.CARS (obs=0) nobs=n;
       call symputx('NOBS',n);
    run;
     
    %put &=NOBS;
    NOBS=428

    Can we create a SAS macro function by enclosing this code into macro? Something like this:

    %macro nobs(dset=,result=);
       data _null_;
          set &dset (obs=0) nobs=n;
          call symputx("&result",n);
       run;
    %mend nobs;

    The answer is “No”. Yes, we created a valid macro; we can invoke this macro to produce the result:

    %nobs(dset=SASHELP.CARS, result=NOBS);
    %put &=NOBS;
    NOBS=428

    But this is not a macro function. Remember type 2 macro that does not generate any SAS programming language code, just a value? But this macro does generate SAS code which assigns a value to the macro variable specified as the second argument (result=NOBS).

    In order to create a valid macro function, our macro should not have any SAS language code in it – neither a DATA step, nor a PROC step. It may only be comprised of the SAS macro language code. Here it is:

    %macro nobs(dset);
       %local dsid n rc;
       %let dsid = %sysfunc(open(&dset));
       %if &dsid %then
       %do;
          %let n = %sysfunc(attrn(&dsid,nlobs));
          %let dsid = %sysfunc(close(&dsid));
       %end;
       %else %put %sysfunc(sysmsg());
       &n
    %mend nobs;

    When macro processor executes this macro, the only object that gets passed to the SAS language compiler is the value shown in the line right before the %mend. This is the calculated value of the number of observations (denoted by &n ). This is the only thing that is visible by the SAS language compiler, the rest is the macro language code visible and being handled by SAS macro processor.

    IMPORTANT: When defining SAS macro function always use %local statement to list ALL macro variables that are created in your macro to ensure they will not accidentally overwrite same-named macro variables in the calling environment. You don’t need to declare %local for macro parameters as they are always local automatically.

    SAS macro functions usage examples

    When a macro function is defined this way, wherever you place its invocation %nobs(SASHELP.CARS) in your SAS code it will be evaluated and replaced with the corresponding value (in this case it is number 428) by the SAS macro processor. That way you can avoid substandard hard-coding and make your SAS code dynamic and powerful. You can use macro functions in many SAS coding contexts. For example:

    • Assignment statements for macro variable: %let NOBS=%nobs(SASHELP.CARS);
    • Assignment statement in a DATA step: x = %nobs(SASHELP.CARS);
    • As a value of the iterative do loop: do i=1 to %nobs(SASHELP.CARS);
    • As part of condition in IF statement: if %nobs(SASHELP.CARS) > 500 then do;

    And so on.

    Your thoughts?

    Do you find this post useful? Do you use SAS macro functions? Can you suggest other usage examples? Please share with us in the Comments below.

    Additional resources

     

    How to create and use SAS macro functions was published on SAS Users.

    3月 112020
     

    Automating SAS applications development

    SAS variable labels are unique features of SAS data tables (aka data sets) that allow SAS users to enhance reading and interpretation of tables and reports.

    Whether you use SAS data table as a data source in any of the reporting procedures or interactive interface such as SAS Visual Analytics, you will benefit from pre-assigning meaningful labels during the data preparation process. Besides being more efficient, such an early label assignment secures consistency of the data elements descriptions (labels) across different developers.

    The most direct way of creating column labels is by explicitly assigning them to the data variables. You can do it during the data table creation in a DATA step using either LABEL statement or ATTRIB statement. Alternatively, you can do it after your data table is already created by using PROC DATASETS’ MODIFY statement with the LABEL= option.

    However, in many situations there are ways of automating this tedious and voluminous process of column labels creation. Let’s look at one of them that I found useful for bulk column labeling. Plus, we are going to explore SAS coding technique using _DATA_and_LAST_special data sets.

    Deriving variable labels from variable names

    This method is suitable when variable names are well-formed, for example CUSTOMER_ADDRESS, FIRST_NAME, LAST_NAME, COMPANY_NAME, PLACE_OF_BIRTH, etc. Kudos to data designer!

    We can transform these names into labels by replacing underscores with space characters and converting words from upper case to proper case. These are the labels we will get: Customer Address, First Name, Last Name, Company Name, Place Of Birth.

    Let’s say our original data table is DEMO:

    data DEMO;
       input CUSTOMER_CITY $ 1-15 FIRST_NAME $ 16-26 LAST_NAME $27-37 COMPANY_NAME $38-50 COUNTRY_OF_BIRTH $51-65;
       datalines;
    Washington     Peter      Birn       Citibank     USA
    Denver         Lisa       Roth       IBM          UK
    Cary           Antony     Bessen     SAS          Spain
    ;

    Then the following macro will create variable labels out of the variable names as described above:

    options mprint; 
    %macro ilabel (dataset);
       %local lbref dsname vname vlabel nvars;
     
       %if %index(&dataset,.) %then
       %do; /* 2-level dataset name */
          %let lbref  = %scan(&dataset,1,'.');
          %let dsname = %scan(&dataset,2,'.');
       %end;
       %else
       %do; /* 1-level dataset name */
          %let lbref  = WORK;
          %let dsname = &dataset;
       %end;
     
       /* get variable names */
       proc contents data=&dataset out=_data_(keep=name) noprint;
       run;
     
       /* create name/label pairs */
       data _null_;
          set _last_ end=eof nobs=n;
          call symput('vname'!!strip(put(_n_,best.)),name);
          lbl = propcase(translate(name,' ','_'));
          call symput('vlabel'!!strip(put(_n_,best.)),trim(lbl));
          if eof then call symputx('nvars',n);
       run;
     
       /* modify variable labels */
       proc datasets lib=&lbref nolist;
          modify &dsname;
             label
                %do i=1 %to &nvars;
                   &&vname&i = "&&vlabel&i"
                %end;
             ;
       quit;
     
    %mend ilabel;

    You can invoke this macro by either one line of code:

    %ilabel(DEMO)

    or

    %ilabel(WORK.DEMO)

    Here are how our DEMO table looks before and after %ilabel macro modifies/assigns the labels based on the column names:

    BEFORE:

    Data table showing column names

     

    AFTER:

    Data table showing column labels

    Macro code highlights

    In this macro, we:

    1. Define local macro variables to make sure their names will not interfere with possible namesakes in the calling program.
    2. Determine libref and one-level data set name for the input data set.
    3. Create a table containing variable names in the input data set using PROC CONTENTS.
    4. Use DATA _NULL_ step to read through the variable names, and derive labels as

      lbl = propcase(translate(name,' ','_'));

      Here, transalate() function replaces underscores with blanks, then propcase() function converts every word in an argument to proper case (upper case for the first character and lower case for the remaining characters). We also create macro variables for each name/label pair (vname1, vlabel1, vname2, vlabel2, …) and macro variable nvars representing the number of such pairs.

    5. Use PROC DATASETS with MODIFY and LABEL statements to assign generated column labels to the source data set.

    If some of the labels assigned by this macro are not what you need you may run another PROC DATASETS to individually adjust (re-assign) them according to your wishes or specification. But when you need to label data set columns on a large scale (many tables with dozens or hundreds of columns) this can be a good first draft that can save you time and efforts.

    _DATA_ and _LAST_ special data sets

    You might notice that I used _data_ data set name in the out= option of the PROC CONTENTS. This is not an explicit data set name; it is a keyword, a special data set that allows SAS to assign one of the available data set names dynamically. The created output data set will have a name that looks something like DATA1 or DATA2, etc. Try running this code:

    data _data_;
       x=1;
    run;

    and look in the SAS LOG at what data set is created. I got:

    NOTE: The data set WORK.DATA1 has 1 observations and 1 variables.

    Special data set name _data_ tells SAS to create a data set in the WORK library from a list of names DATA1, DATA2, … according to the DATAn naming convention. These names (as well as WORK library) are unique for a given SAS session. The first time you use _data_ within a SAS session it will create data set named WORK.DATA1, the second time you use _data_ it will create WORK.DATA2, and so on.

    Consequently, I used special data set name _last_ in the SET statement of the DATA step following the PROC CONTENTS. Again, here _last_ is a keyword, not a data set name; it is a special data set that refers to the name of the last created data set during your SAS session. That causes SAS to use the latest data set created prior to the _last_ reference.

    Special data sets _data_ and _last_ are reserved names (or SAS keywords) along with special data set _null_ that is used in the DATA _NULL_ statement and causes SAS to execute the DATA step without creating a data set. (By the way, using DATA _NULL_ can increase your code efficiency when you use the DATA step for custom report writing or creating macro variables or other processing for which the output data set is not needed as it does not consume computer resources for writing and storing the output data set.)

    If I were using an explicit table name in this macro instead, and your calling program accidentally were using the same table name, then the macro would overwrite your table which would wreak havoc to your program. Using _data_ and _last_ special data sets protect your SAS program from a possibility of inadvertently overwriting your other data set with the same name by executing the %ilabel macro. It is similar to using %LOCAL for macro variable names for protecting from possible overwrites of your %GLOBAL macro variables with the same names.

    A WORD OF CAUTION: Remember, that the _data_ keyword creates table names that are unique only within a SAS session, so it works perfectly for the WORK data library which itself is a unique instance for a SAS session. While it is syntactically correct to use special data set notation _data_ for creating permanent data sets such as libref._data_ (including SASUSER._data_), I have to warn you against using it as it will not guarantee the name uniqueness in the permanent data library, and you may end up overwriting data sets that already exist there.

    Your thoughts?

    Do you find this post useful? How do you handle the task of assigning variable labels on a mass scale? Do you use _data_ and _last_ special data sets in your SAS coding? Please share in the comments section below.

    Automating SAS variable labels creation was published on SAS Users.