SAS programmers

9月 172020
 

Unquote by removing matching quotesBefore we delve into unquoting SAS character variables let’s briefly review existing SAS functionality related to the character strings quoting/unquoting.

%QUOTE and %UNQUOTE macro functions

Don’t be fooled by these macro functions’ names. They have nothing to do with quoting or un-quoting character variables’ values. Moreover, they have nothing to do with quoting or un-quoting even macro variables’ values. According to the %QUOTE Macro Function documentation it masks special characters and mnemonic operators in a resolved value at macro execution.  %UNQUOTE Macro Function unmasks all special characters and mnemonic operators so they are interpreted as macro language elements instead of as text. There are many other SAS “macro quoting functions” (%SUPERQ, %BQUOTE, %NRBQUOTE, all macro functions whose name starts with %Q: %QSCAN, %QSUBSTR, %QSYSFUNC, etc.) that perform some action including masking.

Historically, however, SAS Macro Language uses terms “quote” and “unquote” to denote “mask” and “unmask”. Keep that in mind when reading SAS Macro documentation.

QUOTE function

Most SAS programmers are familiar with the QUOTE function that adds quotation marks around a character value. It can add double quotation marks (by default) or single quotation marks if you specify that in its second argument.

This function goes even further as it doubles any quotation mark that already existed within the value to make sure that an embedded quotation mark is escaped (not treated as an opening or closing quotation mark) during parsing.

DEQUOTE function

There is also a complementary DEQUOTE function that removes matching quotation marks from a character string that begins with a quotation mark. But be warned that it also deletes all characters to the right of the first matching quotation mark. In my view, deleting those characters is overkill because when writing a SAS program, we may not know what is going to be in the data and whether it’s okay to delete its part outside the first matching quotes. That is why you need to be extra careful if you decide to use this function. Here is an example of what I mean. If you run the following code:

data a;
   input x $ 1-50;
   datalines;
'This is what you get. Let's be careful.'
;
 
data _null_;
   set a;
   y = dequote(x);
   put x= / y=;
run;

you will get the following in the SAS log:

y=This is what you get. Let

This is hardly what you really wanted as you have just lost valuable information – part of the y character value got deleted: 's be careful. I would rather not remove the quotation marks at all than remove them at the expense of losing meaningful information.

$QUOTE informat

The $QUOTE informat does exactly what the DEQUOTE() function does, that is removes matching quotation marks from a character string that begins with a quotation mark. You can use it in the example above by replacing

y = dequote(x);

with the INPUT() function

y = input(x, $quote50.);

Or you can use it directly in the INPUT statement when reading raw data from datalines or an external file:

input x $quote50.;

Both, $QUOTE informat and DEQUOTE() function, in addition to removing all characters to the right of the closing quotation mark do the following unconventional, peculiar things:

  • Remove a lone quotation mark (either double or single) when it’s the only character in the string; apparently, the lone quotation mark is matched to itself.
  • Match single quotation mark with double quotation mark as if they are the same.
  • Remove matching quotation marks from a character string that begins with a quotation mark; if your string has one or more leading blanks (that is, a quotation mark is not the first character), nothing gets removed (un-quoted).

If the described behavior matches your use case, you are welcome to use either $QUOTE informat or DEQUOTE() function. Otherwise, please read on.

UNQUOTE function definition

Up to this point such a function did not exist, but we are about to create one to justify the title. Let’s keep it simple and straightforward. Here is what I propose our new unquote() function to do:

  • If first and last non-blank characters of a character string value are matching quotation marks, we will remove them. We will not consider quotation marks matching if one of them is a single quotation mark and another is a double quotation mark.
  • We will remove those matching quotation marks whether they are both single quotation marks OR both double quotation marks.
  • We are not going to remove or change any other quotation marks that may be present within those matching quotation marks that we remove.
  • We will remove leading and trailing blanks outside the matching quotation marks that we delete.
  • However, we will not remove any leading or trailing blanks within the matching quotation marks that we delete. You may additionally apply the STRIP() function if you need to do that.

To summarize these specifications, our new UNQUOTE() function will extract a character substring within matching quotation marks if they are the first and the last non-blank characters in a character string. Otherwise, it returns the character argument unchanged.

UNQUOTE function implementation

Here is how such a function can be implemented using PROC FCMP:

libname funclib 'c:\projects\functions';
 
proc fcmp outlib=funclib.userfuncs.v1; /* outlib=libname.dataset.package */
   function unquote(x $) $32767;
      pos1 = notspace(x); *<- first non-blank character position;
      if pos1=0 then return (x); *<- empty string;
 
      char1 = char(x, pos1); *<- first non-blank character;
      if char1 not in ('"', "'") then return (x); *<- first non-blank character is not " or ' ;
 
      posL = notspace(x, -length(x)); *<- last non-blank character position;
 
      if pos1=posL then return (x); *<- single character string;
 
      charL = char(x, posL); *<- last non-blank character;
      if charL^=char1 then return (x); *<- last non-blank character does not macth first;
 
      /* at this point we should have matching quotation marks */
      return (substrn(x, pos1 + 1, posL - pos1 - 1)); *<- remove first and last quotation character;
   endfunc; 
run;

Here are the highlights of this implementation:

We use multiple RETURN statements: we sequentially check for different special conditions and if one of them is met we return the argument value intact. The RETURN statement does not just return the value, but also stops any further function execution.

At the very end, after making sure that none of the special conditions is met, we strip the argument value from the matching quotation marks along with the leading and trailing blanks outside of them.

NOTE: SAS user-defined functions are stored in a SAS data set specified in the outlib= option of the PROC FCMP. It requires a 3-level name (libref.datsetname.packagename) for the function definition location to allow for several versions of the same-name function to be stored there.

However, when a user-defined function is used in a SAS DATA Step, only a 2-level name can be specified (libref.datasetname). If that data set has several same-name functions stored in different packages the DATA Step uses the latest function definition (found in a package closest to the bottom of the data set).

UNQUOTE function results

Let’s use the following code to test our newly minted user-defined function UNQUOE():

libname funclib 'c:\projects\functions';
options cmplib=funclib.userfuncs;
 
data A;
   infile datalines truncover;
   input @1 S $char100.;
   datalines;
'
"
How about this?
    How about this?
"How about this?"
'How about this?'
"How about this?'
'How about this?"
"   How about this?"
'      How about this?'
'      How "about" this?'
'      How 'about' this?'
   "     How about this?"
   "     How "about" this?"
   "     How 'about' this?"
   '     How about this?'
;
 
data B;
   set A;
   length NEW_S $100;
   label NEW_S = 'unquote(S)';
   NEW_S = unquote(S);
run;

This code produces the following output table:

Example of character string unquoting
As you can see it does exactly what we wanted it to do – removing matching first and last quotation marks as well as stripping out blanks outside the matching quotation marks.

DSD (Delimiter-Sensitive Data) option

This INFILE statement’s option is particularly and extremely useful when using LIST input to read and un-quote comma-delimited raw data. In addition to removing enclosing quotation marks from character values, the DSD option specifies that when data values are enclosed in quotation marks, delimiters within the value are masked, that is treated as character data (not as delimiters). It also sets the default delimiter to a comma and treats two consecutive delimiters as a missing value.

In contrast with the above UNQUOTE() function, the DSD option will not remove enclosing quotation marks if there are same additional quotation marks present inside the character value.  When DSD option does strip enclosing quotation marks it also strips leading and trailing blanks outside and within the removed quotation marks.

Additional Resources

Your thoughts?

Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments below.

How to unquote SAS character variable values was published on SAS Users.

9月 172020
 

Unquote by removing matching quotesBefore we delve into unquoting SAS character variables let’s briefly review existing SAS functionality related to the character strings quoting/unquoting.

%QUOTE and %UNQUOTE macro functions

Don’t be fooled by these macro functions’ names. They have nothing to do with quoting or un-quoting character variables’ values. Moreover, they have nothing to do with quoting or un-quoting even macro variables’ values. According to the %QUOTE Macro Function documentation it masks special characters and mnemonic operators in a resolved value at macro execution.  %UNQUOTE Macro Function unmasks all special characters and mnemonic operators so they are interpreted as macro language elements instead of as text. There are many other SAS “macro quoting functions” (%SUPERQ, %BQUOTE, %NRBQUOTE, all macro functions whose name starts with %Q: %QSCAN, %QSUBSTR, %QSYSFUNC, etc.) that perform some action including masking.

Historically, however, SAS Macro Language uses terms “quote” and “unquote” to denote “mask” and “unmask”. Keep that in mind when reading SAS Macro documentation.

QUOTE function

Most SAS programmers are familiar with the QUOTE function that adds quotation marks around a character value. It can add double quotation marks (by default) or single quotation marks if you specify that in its second argument.

This function goes even further as it doubles any quotation mark that already existed within the value to make sure that an embedded quotation mark is escaped (not treated as an opening or closing quotation mark) during parsing.

DEQUOTE function

There is also a complementary DEQUOTE function that removes matching quotation marks from a character string that begins with a quotation mark. But be warned that it also deletes all characters to the right of the first matching quotation mark. In my view, deleting those characters is overkill because when writing a SAS program, we may not know what is going to be in the data and whether it’s okay to delete its part outside the first matching quotes. That is why you need to be extra careful if you decide to use this function. Here is an example of what I mean. If you run the following code:

data a;
   input x $ 1-50;
   datalines;
'This is what you get. Let's be careful.'
;
 
data _null_;
   set a;
   y = dequote(x);
   put x= / y=;
run;

you will get the following in the SAS log:

y=This is what you get. Let

This is hardly what you really wanted as you have just lost valuable information – part of the y character value got deleted: 's be careful. I would rather not remove the quotation marks at all than remove them at the expense of losing meaningful information.

$QUOTE informat

The $QUOTE informat does exactly what the DEQUOTE() function does, that is removes matching quotation marks from a character string that begins with a quotation mark. You can use it in the example above by replacing

y = dequote(x);

with the INPUT() function

y = input(x, $quote50.);

Or you can use it directly in the INPUT statement when reading raw data from datalines or an external file:

input x $quote50.;

Both, $QUOTE informat and DEQUOTE() function, in addition to removing all characters to the right of the closing quotation mark do the following unconventional, peculiar things:

  • Remove a lone quotation mark (either double or single) when it’s the only character in the string; apparently, the lone quotation mark is matched to itself.
  • Match single quotation mark with double quotation mark as if they are the same.
  • Remove matching quotation marks from a character string that begins with a quotation mark; if your string has one or more leading blanks (that is, a quotation mark is not the first character), nothing gets removed (un-quoted).

If the described behavior matches your use case, you are welcome to use either $QUOTE informat or DEQUOTE() function. Otherwise, please read on.

UNQUOTE function definition

Up to this point such a function did not exist, but we are about to create one to justify the title. Let’s keep it simple and straightforward. Here is what I propose our new unquote() function to do:

  • If first and last non-blank characters of a character string value are matching quotation marks, we will remove them. We will not consider quotation marks matching if one of them is a single quotation mark and another is a double quotation mark.
  • We will remove those matching quotation marks whether they are both single quotation marks OR both double quotation marks.
  • We are not going to remove or change any other quotation marks that may be present within those matching quotation marks that we remove.
  • We will remove leading and trailing blanks outside the matching quotation marks that we delete.
  • However, we will not remove any leading or trailing blanks within the matching quotation marks that we delete. You may additionally apply the STRIP() function if you need to do that.

To summarize these specifications, our new UNQUOTE() function will extract a character substring within matching quotation marks if they are the first and the last non-blank characters in a character string. Otherwise, it returns the character argument unchanged.

UNQUOTE function implementation

Here is how such a function can be implemented using PROC FCMP:

libname funclib 'c:\projects\functions';
 
proc fcmp outlib=funclib.userfuncs.v1; /* outlib=libname.dataset.package */
   function unquote(x $) $32767;
      pos1 = notspace(x); *<- first non-blank character position;
      if pos1=0 then return (x); *<- empty string;
 
      char1 = char(x, pos1); *<- first non-blank character;
      if char1 not in ('"', "'") then return (x); *<- first non-blank character is not " or ' ;
 
      posL = notspace(x, -length(x)); *<- last non-blank character position;
 
      if pos1=posL then return (x); *<- single character string;
 
      charL = char(x, posL); *<- last non-blank character;
      if charL^=char1 then return (x); *<- last non-blank character does not macth first;
 
      /* at this point we should have matching quotation marks */
      return (substrn(x, pos1 + 1, posL - pos1 - 1)); *<- remove first and last quotation character;
   endfunc; 
run;

Here are the highlights of this implementation:

We use multiple RETURN statements: we sequentially check for different special conditions and if one of them is met we return the argument value intact. The RETURN statement does not just return the value, but also stops any further function execution.

At the very end, after making sure that none of the special conditions is met, we strip the argument value from the matching quotation marks along with the leading and trailing blanks outside of them.

NOTE: SAS user-defined functions are stored in a SAS data set specified in the outlib= option of the PROC FCMP. It requires a 3-level name (libref.datsetname.packagename) for the function definition location to allow for several versions of the same-name function to be stored there.

However, when a user-defined function is used in a SAS DATA Step, only a 2-level name can be specified (libref.datasetname). If that data set has several same-name functions stored in different packages the DATA Step uses the latest function definition (found in a package closest to the bottom of the data set).

UNQUOTE function results

Let’s use the following code to test our newly minted user-defined function UNQUOE():

libname funclib 'c:\projects\functions';
options cmplib=funclib.userfuncs;
 
data A;
   infile datalines truncover;
   input @1 S $char100.;
   datalines;
'
"
How about this?
    How about this?
"How about this?"
'How about this?'
"How about this?'
'How about this?"
"   How about this?"
'      How about this?'
'      How "about" this?'
'      How 'about' this?'
   "     How about this?"
   "     How "about" this?"
   "     How 'about' this?"
   '     How about this?'
;
 
data B;
   set A;
   length NEW_S $100;
   label NEW_S = 'unquote(S)';
   NEW_S = unquote(S);
run;

This code produces the following output table:

Example of character string unquoting
As you can see it does exactly what we wanted it to do – removing matching first and last quotation marks as well as stripping out blanks outside the matching quotation marks.

DSD (Delimiter-Sensitive Data) option

This INFILE statement’s option is particularly and extremely useful when using LIST input to read and un-quote comma-delimited raw data. In addition to removing enclosing quotation marks from character values, the DSD option specifies that when data values are enclosed in quotation marks, delimiters within the value are masked, that is treated as character data (not as delimiters). It also sets the default delimiter to a comma and treats two consecutive delimiters as a missing value.

In contrast with the above UNQUOTE() function, the DSD option will not remove enclosing quotation marks if there are same additional quotation marks present inside the character value.  When DSD option does strip enclosing quotation marks it also strips leading and trailing blanks outside and within the removed quotation marks.

Additional Resources

Your thoughts?

Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments below.

How to unquote SAS character variable values was published on SAS Users.

9月 022020
 

SAS offering free learning resources in celebration of programmers

For more than 40 years, SAS programmers have crafted software and solutions that transform the world. From statistics to data science, to analytics and artificial intelligence, people writing code have architected a new economy with incredible opportunities. SAS Programmer Week honors those people by offering free learning resources available for everyone, from students to early career professionals to SAS veterans.

Running from Sept. 7-11, SAS Programmer Week leads up to the international Day of the Programmer on Saturday, Sept. 12. Training resources will be available for free through a variety of YouTube and video tutorials, webinars, blogs and documentation.

There will be three different tracks for new, experienced and analytics-focused users, with new content released each day. The week culminates with SAS certification prep content that will have participants ready to pursue a valuable SAS credential.

For instance, Tech Republic named SAS as one of 7 data science certifications to boost your resume and salary. CIO Magazine puts SAS among the top 11 big data and analytics certifications for 2020.

Since SAS programmers are busy and may not have all day to engage with the materials, SAS Programmer Week is flexible. Participants can access the material when they want to learn a specific skill related to the day’s topic or consume the material in snippets when they have time.

Interested participants can visit the SAS Programmer Week website to register today, preview the materials and schedule, and jump-start their career journeys.

 

All hail the SAS programmer! was published on SAS Users.

8月 122020
 

CVP engine as a magnifying glassIn my earlier blog post, Changing variable type and variable length in SAS datasets, I showed how you can effectively change variables lengths in a SAS data set. That approach works fine when you need to change length attribute for few variables, on a case by case basis. But what if you need to change lengths for all character variables in a data set? Or if you need to do this for all data sets in a data library? For example, you need to expand (increase) all your character variables lengths by 50%. Well, then the case-by-case approach becomes too laborious and inefficient.

What is a character variable’s length attribute?

Before reading any further, let’s take a quick quiz:

Q: A character variable length attribute represents a number of:

  1. Bits
  2. Bytes
  3. Centimeters
  4. Characters

If your answer is anything but B, it’s incorrect. According to the SAS documentation, length refers to the number of bytes used to store each of the variable's values in a SAS data set. You can use a LENGTH statement to set the length of both numeric and character variables.

It is true though that for some older encoding systems (ASCII, ISO/IEC 8859, EBCIDIC, etc.) there was no difference between the number of bytes and the number of characters as those systems were based on exactly one byte per character encoding. They are even called Single Byte Character Sets (SBCS) for that reason. The problem is they can accommodate only a maximum of 28=256 symbols which is not nearly enough to cover all the variety of natural languages, special characters, emojis etc.

Why would we want to expand character variable lengths?

Use case 1. Expanding character values range

For this scenario, let’s consider Internet traffic analysis where your data contains multiple character columns for Internet Protocol addresses (IP addresses) in 32-bit version 4 (IPv4, e.g. ‘125.255.501.780’). You transition to a newer 128-bit IPv6 standard (e.g. ‘2001:0000:3238:DFE1:0063:0000:0000:FEFB’) and need to modify your data structure to accommodate the new standard with longer character values.

Use case 2. Migrating SAS data to multi-byte encoding environment

In this scenario, you migrate/move SAS data sets from older SBCS environments to newer Multi-Byte-Character Set (MBCS) encoding environments. For such a case, the ability to increase character variables lengths in bulk with a simple action becomes especially significant and critical.

Currently, the most commonly used MBCS is Unicode which is supported by all modern operating systems, databases and web browsers. Out of different flavors of Unicode (UTF-8, UTF-16, UTF-32) the most popular is UTF-8. UTF-8 (8-bit Unicode Transformation Format) is a variable-width character set that uses from 1 to 4 one-byte (8-bit) code units per character; it is capable of encoding 1,112,064 various characters that covers most modern languages, including Arabic and Hebrew characters, hieroglyphs, emojis as well as many other special characters.

Since each UTF-8 encoded character may require somewhere between one and four bytes, and not all SBCS characters are represented by one byte in UTF-8, data migration from SBCS to UTF-8 may cause data truncation and subsequently data loss.

When SAS reads an SBCS-encoded data set and writes its records into UTF-8-encoded data set it throws an ERROR message in the log and stops execution:

ERROR: Some character data was lost during transcoding in the dataset LIBREF.DSNAME. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.

When SAS reads an SBCS-encoded data set and produces a UTF-8-encoded printed report only (without generating a UTF-8-encoded output data set) it generates a WARNING message (with identical description as the above ERROR message) while continuing execution:

WARNING: Some character data was lost during transcoding in the dataset LIBREF.DSNAME. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.

Either ERROR or WARNING is unacceptable and must be properly addressed.

How to expand all character variables lengths?

Regardless of character transcoding, SAS’ CVP Engine is short and effective answer to this question. CVP stands for Character Variable Padding which is exactly what this special-purpose engine does – it pads or expands, increases character variables by a number of bytes. CVP engine is part of Base SAS and does not require any additional licensing.

The CVP engine is a read-only engine for SAS data sets only. You can think of it as of a magnifying glass: it creates an expanded view of the character data descriptors (lengths) without changing them. Still we can use the CVP Engine to actually change a data set or a whole data library to their expanded character variables version. All we need to do is to define our source library as CVP library, for example:

libname inlib cvp 'c:\source_folder';

Then use PROC COPY to create expanded versions of our original data sets in a target library:

libname outlib 'c:\target_folder';
proc copy in=inlib out=outlib noclone;
   select dataset1 dataset2;
run;

Or, if we need to expand character variable lengths for the whole library, then we use the same PROC COPY without the SELECT statement:

proc copy in=inlib out=outlib noclone;
run;

It’s that easy. And the icing on the cake is that CVP engine automatically adjusts the variables format widths to meet the expanded byte lengths for all converted character variables.

Avoiding character data truncation by using the CVP Engine

CVP Engine is a near-perfect SAS solution to the problem of potential data truncation when data is transcoded during migration or move from SBCS-based to MBCS-based systems.

To avoid data loss from possible data truncation during transcoding we can use the above code with a slight but important modification – define the target library with outencoding='UTF-8' option. It will result in our target data not only expanded lengthwise but properly encoded as well. Then we run this modified code in the old SBCS environment before moving/migrating our data sets to the new MBCS environment:

libname inlib cvp 'c:\source_folder';
libname outlib 'c:\utf8_target_folder' outencoding='UTF-8';
proc copy in=inlib out=outlib noclone;
   select dataset1 dataset2;
run;

Again, if you need to expand character variable lengths for the whole library, then you can use the same PROC COPY without the SELECT statement:

proc copy in=inlib out=outlib noclone;
run;

After that we can safely move our expanded, UTF-8-encoded data to the new UTF-8 environment.

Code notes

  • The code above will create a different version of your original data sets with desired encoding and expanded by 50% (default) character variables lengths. As shown below, this default behavior can be changed by using CVPBYTES= or CVPMULTIPLIER= options which explicitly define bytes expansion rate.
  • It is important to note that CVP option is specified on the input library since the CVP engine is read-only engine, thus available for input (read) processing only.
  • For the output library you specify your desired encoding option, in this case outencoding='UTF-8'.
  • The noclone option specifies not to copy data set attributes. This is needed to make sure the attributes are recreated rather than duplicated.
  • If you want to migrate your data sets using PROC MIGRATE, you should expand column lengths before using PROC COPY as shown above since the CVP engine is not currently supported with PROC MIGRATE.
  • The CVP engine supports only SAS data files (no SAS views, catalogs, item stores, and so on).

CVP Engine options

There are several options available with the CVP Engine. Here are the most widely used:

CVPBYTES=bytes - specifies the number of bytes by which to expand character variable lengths. The lengths of character variables are increased by adding the specified bytes value to the current length.

Example: libname inlib 'SAS data-library' cvpbytes=5;

The CVPBYTES= option implicitly specifies the CVP engine, that is if you specify the CVPBYTES= option you don’t have to specify CVP engine explicitly as SAS will use it automatically.

CVPMULTIPLIER=multiplier - specifies a multiplier value that expands character variable. The lengths of character variables are increased by multiplying the current length by the specified multiplier value. You can specify a multiplier value from 1 to 5, or you can specify 0 and then the CVP engine determines the multiplier automatically.

Example: libname inlib 'SAS data-library' cvpmultiplier=2.5;

The CVPMULTIPLIER= option also implicitly specifies the CVP engine, that is if you specify the CVPMULTIPLIER= option, you don’t have to specify CVP engine explicitly as SAS will use it automatically.

Note:

  • You cannot specify both the CVPMULTIPLIER= option and the CVPBYTES= option. Specify only one of these options.
  • If you explicitly assign the CVP engine but do not specify either CVPBYTES= or CVPMULTIPLIER=, then SAS defaults to using CVPMULTIPLIER=1.5 to increase the lengths of the character variables.

Additional Resources

Your thoughts?

Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments section below.

Expanding lengths of all character variables in SAS data sets was published on SAS Users.

7月 232020
 

Splitting one into smaller pieces

In his blog post, How to split one data set into many, Chris Hemedinger showed how to subset or split SAS data sets based on the values of categorical variables. For example, based on a value of variable REGION you may split a data set MARKETING into MARKETING_ASIA, MARKETING_AMERICA, MARKETING_EUROPE, and so on.

In some cases, however, we need to split a large data set into many – not by a subsetting variable values, but by a number of observations in order to produce smaller, better manageable data sets. Such an approach can be dictated by restrictions on the data set size imposed by hardware (memory size, transmission channel bandwidth etc.), processing time, or user interface convenience (e.g. search results displayed by pages).

We might need to split a data set into smaller tables of K observations or less each; or to split a data set into S equal (or approximately equal) pieces.

We might need to split a data set into sequentially selected subsets where the first K observations go into the first data set, the second K observations go into the second data set, and so on. Alternatively, we might need to randomly select observations from a data set while splitting it into smaller tables.

This blog post provides possible coding solutions for such scenarios.

Splitting a data set into smaller data sets sequentially

Let’s say we need to split a data set SASHELP.CARS (number of observation N=428) into several smaller datasets. We will consider the following two sequential observation selection scenarios:

  1. Each smaller data set should have maximum of K observations.
  2. There should be S smaller data sets of approximately same size.

Ideally, we would like to split a data set into K observations each, but it is not always possible to do as the quotient of dividing the number of observations in the original dataset N by K is not always going to be a whole number. Therefore, we will split it into several smaller data sets of K observations each, but the last smaller data set will have the number of observations equal to the remainder of the division N by K.

Similarly, with the scenario 2, we will split the source data set into several smaller data sets of the same size, but the last smaller data set will have the number of observations equal to the remainder of the division N by K.

Below is a SAS macro code that covers both these scenarios.

%macro split (SRC_DATASET=, OUT_PREFIX=, SPLIT_NUM=, SPLIT_DEF=);
/* Parameters:
/*   SRC_DATASET - name of the source data set     */
/*   OUT_PREFIX - prefix of the output data sets   */
/*   SPLIT_NUM - split number                      */
/*   SPLIT_DEF - split definition (=SETS or =NOBS) */
 
   %local I K S TLIST;
 
   /* number of observations &K, number of smaller datasets &S */
   data _null_;
      if 0 then set &SRC_DATASET nobs=N;
      if upcase("&SPLIT_DEF")='NOBS' then
         do;
            call symputx('K',&SPLIT_NUM); 
            call symputx('S',ceil(N/&SPLIT_NUM));
            put "***MACRO SPLIT: Splitting into datasets of no more than &SPLIT_NUM observations";
         end;
         else if upcase("&SPLIT_DEF")='SETS' then
         do;
            call symputx('S',&SPLIT_NUM); 
            call symputx('K',ceil(N/&SPLIT_NUM));
            put "***MACRO SPLIT: Splitting into &SPLIT_NUM datasets";
        end;
         else put "***MACRO SPLIT: Incorrect SPLIT_DEF=&SPLIT_DEF value. Must be either SETS or NOBS.";
      stop; 
   run;
 
 
   /* terminate macro if nothing to split */
   %if (&K le 0) or (&S le 0) %then %return;
 
    /* generate list of smaller dataset names */
   %do I=1 %to &S;
      %let TLIST = &TLIST &OUT_PREFIX._&I;
   %end;
 
   /* split source dataset into smaller datasets */
   data &TLIST;
      set &SRC_DATASET;
      select;
         %do I=1 %to &S;
            when(_n_ <= &K * &I) output &OUT_PREFIX._&I; 
         %end;
      end;
   run;
 
%mend split;

The following are examples of the macro invocations:

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=SET);
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=NOBS);
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=3, SPLIT_DEF=SETS);

These invocations will produce the following SAS logs:

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=SET);
***MACRO SPLIT: Incorrect SPLIT_DEF=SET value. Must be either SETS or NOBS.
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=NOBS);
***MACRO SPLIT: Splitting into datasets of no more than 100 observations
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS_1 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_2 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_3 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_4 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_5 has 28 observations and 15 variables.
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=3, SPLIT_DEF=SETS);
***MACRO SPLIT: Splitting into 3 datasets
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS_1 has 143 observations and 15 variables.
NOTE: The data set WORK.CARS_2 has 143 observations and 15 variables.
NOTE: The data set WORK.CARS_3 has 142 observations and 15 variables.

Splitting a data set into smaller data sets randomly

For randomly splitting a data set into many smaller data sets we can use the same approach as above with a slight modification. In essence, we are going to randomly shuffle observations of our source data set first, and then apply the sequential splitting.

In order to implement this, we just need to replace the last data step in the above macro with the following 3 steps:

/* generate random numbers, R */
   data;
      set &SRC_DATASET;
      call streaminit(1234);
      R = rand('uniform');
   run;
 
   /* sort data in R order */
   proc sort;
      by R;
   run;
 
   /* split source dataset into smaller datasets */
   data &TLIST (drop=R);
      set;
      select;
         %do I=1 %to &S;
            when(_n_ <= &K * &I) output &OUT_PREFIX._&I; 
         %end;
      end;
   run;

This modified code will produce similar results (with the same information in the SAS log), however, smaller data sets will have their observations randomly selected from the source data set.

DATAn naming convention

You may have noticed that in this random splitting code I have not specified data set names neither in the DATA statement of the first DATA step, nor in the PROC SORT and not even in the SET statement of the last DATA step. Not only these shortcuts possible due to SAS’ DATAn naming convention, but it is a very robust way of dynamically assigning temporary data set names. This method is especially useful and appropriate for SAS macros as it guarantees that you do not accidentally overwrite a data set with the same name in SAS program that invokes your macro. Think about it: if you are a macro developer you need to make sure that whatever temporary data sets you create within your macro their names must be unique for a SAS session in order not to interfere with any data sets that may be created in the calling SAS program outside of your macro.

Here are defaults in SAS’ DATAn naming convention:

  • If you do not specify a name for the output data set in a DATA statement, SAS automatically assigns the default names WORK.DATA1, WORK.DATA2, and so on, to each successive data set that you create.
  • If you do not specify a name for the input data set in a SET statement, SAS automatically uses the last data set that was created. SAS keeps track of the most recently created data set through the reserved name _LAST_. When you execute a DATA or PROC step without specifying an input data set, by default, SAS uses the _LAST_ data set.

For more information on this useful SAS coding technique see special data set names and examples and warning on using special data set names.

Your thoughts?

Do you find this post useful? Have you ever split data sets into smaller ones based on a number of observations? Do you use special data set names and DATAn naming convention in your SAS coding? Please share your thoughts in the comments section below.

Splitting a data set into smaller data sets was published on SAS Users.

7月 202020
 

I think we can all agree that lifelong learning is the future, for all of us. We know that we need to learn and develop all the time, simply to stay abreast. The world is changing fast, and we must change with it. Investing in analytics talent is an investment in your future.

Create a corporate learning culture

Organizations may embrace the concept of data in their work cultures, yet they fail to commit to developing the analytics skills their teams need to harness the data. Organizations often don’t know what skills their team has, the skills they’re lacking, or even where they need their employees to go. Corporate training brings new avenues for career development and professional training, and that can have a huge impact on retention.

Companies that understand the importance of continued skill development for their employees will set out a vision and develop a corporate learning culture. Leaders are needed who can drive a cultural change towards continued learning in the organization. Hiring curious people and rewarding that curiosity develops the right mindset. Revealing the knowledge gap that employees have is a good trigger to get the learning started.

When an organization wants to close the talent gap for Analytics and AI, SAS can offer support on the whole journey.

Company objectives and goals will determine how the skills development program will be shaped. Once the roles and desired skills levels are determined, SAS will provide a Learning Needs Assessment to uncover the skills gaps.

Based on the findings from the assessment, a program for learning and development will be proposed for the different target audiences and to work towards company goals.

With readily available digital assets, the first steps to a learning platform can be realized quickly. By dynamically developing more assets and creating domain and company specific materials, continuous learning is encouraged and facilitated.

Power of online learning

Today most of us are forced to work from home due to COVID-19. We know the future of work will look different; we are getting used to the digital channel for communicating and interacting. Though the digitization of learning has been going on for many years, learning content is now moving to the cloud, becoming accessible across multiple devices and teaching environments and often being generated, shared, and continually updated.

Millennials feel most comfortable with this digitization, but the wider workforce will ultimately benefit from access to digital learning assets. Integrated cloud-based platforms enable more than just new computer programs or smartphone apps. Smart organizations are now expanding their use of cloud-based learning to run personalized online courses, small tailored live web sessions, instructional videos, e-coaching, communities, virtual classrooms, and simulations games. SAS is responding to this need by offering several free options for learning including e-Learning, online tutorials, lab time, SAS Academy for Data Science, and a SAS Learning Subscription that offers more than 100 e-Learning courses. There is something for everyone!

Finding analytics talent within your workforce can be challenging. It can be hard to assess the skills and identify the talent you need for specific solutions. With an unparalleled depth of understanding in the industry, SAS can help identify, cultivate and grow analytics talent in your organization. In addition, we’ll help you to build a talent pipeline in partnership with local academic institutions, if needed.

Do you want to start today with learning essential skills? Take a look at how SAS can help your organization to get ahead, through our learning solutions customized to help you train your team. Or transform it.

Shaping the future of lifelong learning was published on SAS Users.

6月 242020
 

A lookup table is a programming technique where one or more values can be used to retrieve another value. For example, many years ago, I had benzene exposure estimates for 10 years (1940 to 1949) for each of five locations in a factory. Given a year and a job location, I needed to know the benzene concentration.

I would be terribly embarrassed today if anyone saw the first program I wrote to solve the problem! This blog shows a better way that uses temporary arrays to create an n-way lookup table. To keep the example simple, let's use five years of data (1944 to 1948) and four locations (1 to 4).

Temporary arrays

Before we get into the program, let's discuss temporary arrays, one of my favorite SAS tools. Here is an example of a one-dimensional temporary array:

Data Pass_Fail;
   input ID $ Grade1 - Grade5;
   array PF[5] _temporary_ (65 70 55 65 55);
   array Grade[5]; *If you leave off the variable list SAS will use the
                    array name with numbers 1-5 added. In this example
                    the variables will be Grade1, Grade2, etc.;
   array Pass_or_Fail[5] $ 4;
   do i = 1 to 5;
      if Grade[i] ge PF[i] then Pass_or_Fail[i] = 'Pass';
      else if not missing(Grade[i]) then Pass_or_Fail[i] = 'Fail';
   end;
   drop i;
datalines;
001 90 68 52 70 72
002 56 69 72 75 88
;
Title "Listing of Data Set Pass_Fail";
Proc print data=Pass_Fail noobs;
Run;

In this example, the temporary array is called PF (pass fail values), and it has 5 elements. There are no actual variables PF1, PF2, and so on, only array elements PF[1], PF[2], and so on. The initial values of the five passing grades are placed in parentheses following the key word _temporary_. In many situations, you load the values of the temporary array from a data file.

To keep this first example easy to understand, we will put the initial values in the array statement. You can now compare each student's grade for every test and assign a value of "Pass" or "Fail."

Here is the output:

Note: You can read a blog that I wrote years ago on temporary arrays for another example. 

Example

Now for the two-way table lookup example.

*Two-dimensional table lookup using a temporary array;
data Lookup;
   array Benzene[1944:1948,4] _temporary_; ①
 
   /* Populate the array */
   if _n_ = 1 then do Year = 1944 to 1948; ②
       do Location = 1 to 4;
	      input Benzene[Year,Location] @; ③
	   end;
	end;
 
   input Subj $ Year Location;
   Benzene_Level = Benzene[Year, Location]; ④
datalines;
250 200 150 130
90 180 155 90
95 35 170 140 
80 50 45 100 
40 50 25 15
001 1944 3
002 1948 1
003 1945 4
;
title "Listing od Data Set Lookup";
proc print data=Lookup noobs;
run;

① This ARRAY statement creates an array with two dimensions (you use a comma to create multiple dimensions). To make programming easier to understand, the first dimension of the array uses subscripts 1944 to 1948, rather than 1 to 5 (the colon enables you to specify the lower and upper bounds of an array). Also notice that there are no initial values in this statement—they will be read from data.
② This section of code populates the values in the Benzene temporary array. You use the statement if _n_ = 1 to ensure that this section of code executes only once.
③ The INPUT statement reads in a value for Year and Location. The single trailing @ sign prevents SAS from going to a new line each time to DO loop iterates.
④ Notice how easy it is to retrieve an exposure value, given a value of Year and Location. The first five lines of data are the values used to populate the temporary array.

You can read more about temporary arrays in my book, Learning SAS by Example: A Programmers Guide, Second Edition.

Comments on this blog are welcome.

Multi-way lookup tables was published on SAS Users.

4月 232020
 

SAS macro function

SAS® Macro Language immensely empowers SAS programmers with versatility and efficiency of their code development. It allows SAS users to modularize programming code with “once written – many times used” components, and in many cases automatically generate data-driven SAS code.

Macro language and macro processor

Generally, SAS software processes your SAS program step by step, first scanning it for macro language objects - macro variables referenced as &somename, and macros referenced as %somename. If found, SAS software activates macro processor which resolves and substitutes those macro references according to the macro language syntax before SAS compiles and executes your programming steps.

SAS macro language vs. SAS programming language

A SAS program usually consists of two, often interwoven layers – macro layer and non-macro layer, each with its own syntax and its own timing of compilation and execution. In other words, SAS code is a combination of two distinct languages:

  • SAS programming language (comprised of DATA steps, PROC steps and global statements such as LIBNAME, OPTIONS, TITLE etc.)
  • SAS macro language (comprised of %LET, %IF, %DO, macro functions etc.) which is processed separately from and before SAS compiler executes SAS programming language code.

The difference between them is like a difference between cooking a meal and eating the meal. In this analogy meal=code, cooking=SAS macro language, eating=SAS programming language. Clear understanding of this difference is the key to becoming a successful SAS programmer.

Two types of SAS macros

There are two distinct types of SAS macros:

  1. Macros that generate some SAS programming language code which can span across SAS statements or steps;
  2. Macros that generate some string values which can be used as part of SAS programming language code or data values, but they are not complete SAS statements or steps. This type does not generate any SAS executable code, just a value.

What is a SAS macro function?

SAS macro function is a SAS macro that generates a value. In other words, it is the type 2 macro described above. As any SAS macros, SAS macro functions can have any number (zero or more) of positional or/and named parameters (arguments). SAS users may define their own macro functions, but in doing so you may not utilize any SAS language syntax; only SAS macro language syntax is allowed. You can use existing macro functions in your own macro function definition. Among others, one of the most powerful is %SYSFUNC macro function which brings a wealth of SAS language functions into SAS macro language.

Sources of SAS macro functions

SAS macro functions may come from the following three sources.

1. Pre-built macro functions

Pre-built macro functions that are part of the macro processor. These are such macro functions as %eval, %length, %quote, %scan, %str, %sysfunc, %upcase, etc. Here is a complete list of the pre-built SAS macro functions.

2. Auto-call macro functions

Auto-call macros, some of them are type 1 (macros), and some – type 2 (macro functions) such as %cmpres, %left, %lowcase, %trim, %verify, etc. These macro functions supplement the pre-built macro functions. The main difference from the pre-built macro functions is that the auto-call macro functions are program samples of the user-defined macro functions that are written in SAS macro language and made available to you without having to define or include them in your programs. The auto-call macro functions come with your SAS software installation and usually pre-configured for you by setting MAUTOSOURCE and SASAUTOS= macro system options. They may include several macro libraries depending on the SAS products licensed at your site. For example, for my SAS BASE installation the auto-call macro library is in the following folder:

C:\Program Files\SASHome\SASFoundation\9.4\core\sasmacro

Here is a selected list of auto-call macros provided with SAS software.

From the usage standpoint, you will not notice any difference between the pre-built and the auto-call macro functions. For example, macro function %upcase() is pre-built, while macro function %lowcase() is auto-call macro function. They belong to entirely different families, but we use them as if they are complementary siblings.

3. User-defined macro functions

Finally, there are user-defined macro functions that do not come with SAS installation. These are the macro functions that you define on your own. Usually, they are kept separately from the auto-call macros mainly in order to distinguish them from SAS-supplied ones.

To enable access to your own SAS macro library in addition to the auto-call macro library (or libraries), you can use the INSERT= system option:

options insert=(sasautos="path_to_your_own_macro_library_folder");

Instead of replacing the SASAUTOS value, this option inserts an additional value into the existing SASAUTOS option as the first value, thus allowing you to tap into your own macro library first, and then also into pre-set SAS auto-call libraries.

Creating user-defined macro function

Let’s consider the following example. Suppose, we want to create a macro function that takes a data set name as an argument and returns a value equal to the number of observations in that data set.

We know that the following code calculates the number of observations in a data set:

data _null_;
   set SASHELP.CARS (obs=0) nobs=n;
   call symputx('NOBS',n);
run;
 
%put &=NOBS;
NOBS=428

Can we create a SAS macro function by enclosing this code into macro? Something like this:

%macro nobs(dset=,result=);
   data _null_;
      set &dset (obs=0) nobs=n;
      call symputx("&result",n);
   run;
%mend nobs;

The answer is “No”. Yes, we created a valid macro; we can invoke this macro to produce the result:

%nobs(dset=SASHELP.CARS, result=NOBS);
%put &=NOBS;
NOBS=428

But this is not a macro function. Remember type 2 macro that does not generate any SAS programming language code, just a value? But this macro does generate SAS code which assigns a value to the macro variable specified as the second argument (result=NOBS).

In order to create a valid macro function, our macro should not have any SAS language code in it – neither a DATA step, nor a PROC step. It may only be comprised of the SAS macro language code. Here it is:

%macro nobs(dset);
   %local dsid n rc;
   %let dsid = %sysfunc(open(&dset));
   %if &dsid %then
   %do;
      %let n = %sysfunc(attrn(&dsid,nlobs));
      %let dsid = %sysfunc(close(&dsid));
   %end;
   %else %put %sysfunc(sysmsg());
   &n
%mend nobs;

When macro processor executes this macro, the only object that gets passed to the SAS language compiler is the value shown in the line right before the %mend. This is the calculated value of the number of observations (denoted by &n ). This is the only thing that is visible by the SAS language compiler, the rest is the macro language code visible and being handled by SAS macro processor.

IMPORTANT: When defining SAS macro function always use %local statement to list ALL macro variables that are created in your macro to ensure they will not accidentally overwrite same-named macro variables in the calling environment. You don’t need to declare %local for macro parameters as they are always local automatically.

SAS macro functions usage examples

When a macro function is defined this way, wherever you place its invocation %nobs(SASHELP.CARS) in your SAS code it will be evaluated and replaced with the corresponding value (in this case it is number 428) by the SAS macro processor. That way you can avoid substandard hard-coding and make your SAS code dynamic and powerful. You can use macro functions in many SAS coding contexts. For example:

  • Assignment statements for macro variable: %let NOBS=%nobs(SASHELP.CARS);
  • Assignment statement in a DATA step: x = %nobs(SASHELP.CARS);
  • As a value of the iterative do loop: do i=1 to %nobs(SASHELP.CARS);
  • As part of condition in IF statement: if %nobs(SASHELP.CARS) > 500 then do;

And so on.

Your thoughts?

Do you find this post useful? Do you use SAS macro functions? Can you suggest other usage examples? Please share with us in the Comments below.

Additional resources

 

How to create and use SAS macro functions was published on SAS Users.

4月 072020
 

As many of us are learning to navigate the changing world we are living in amid the COVID-19 outbreak, and as we care for our loved ones, friends, and our community, many of us now find ourselves working and studying from home much more than we did before. As an employee at SAS and an instructor at NC State University, I have found myself splitting my time between setting up my temporary home office and an at-home routine while also trying to help my students feel safe and comfortable as we move to a virtual classroom. But, as my commute and previous social time becomes Facetime calls and text messages, I’ve found myself with more downtime then I previously had, the time I want to dedicate to the training I’ve been wanting to do for the past year.

At SAS, we are striving to care for our users during this time—in that spirit, I wanted to share with you some free SAS offerings, as well as coping techniques I am doing from home.

Take care of yourself and your family

First and foremost, make sure you and your family are taking time for self-care. Whether it be meditation, using a mobile app or YouTube video, or getting some exercise together. I am finding my daily walk something I need to help relax my mind and get myself back to focus on my tasks.

Retrain on skills you haven’t touched in awhile

Sometimes we all need that reboot on tools or methods we use every day. I love SAS’ Statistics 1 course and SAS Programming 1, both of which are free. It’s a great refresher to those who haven’t taken a math course in a few years, or for those just wanting to start out with data science using SAS. Understanding the fundamentals and getting some refreshers on SAS language tricks is something I try to push through with my work constantly. I am also preparing to take the SAS Certification Exam at the end of the year, so the practice exam is something I also plan to use often during my study sessions.

Learn a new skill

It is also a great time to learn something you have always wanted to do. I started by taking some free online photography classes. I also have on my list enrolling in the SAS Academy for Data Science, which is free until the end of May 2020. The advanced analytics professional courses are something I have wanted to complete for a long time, so I am excited to get started on learning more about data modeling. The SAS e-books collection is now also free until April 30, 2020, so I’ve downloaded some great additional materials using code FREEBOOKS at checkout. 

Be kind to others

Being stuck inside can sometimes make you feel like you are spending more time with family than you are used to, or maybe you are spending more time alone. Using this time to connect with those I haven’t made time to talk to has been something I am really thankful for. I call my sister who is a nurse in Florida and check on her. As this outbreak affects us all differently, this is a great time to come together to connect. It is also a great time to think of others affected by the outbreak and who don’t have the ability to continue working. There are some great ways you can help others by getting involved in volunteer work or donating to a helpful cause.

Do fun activities

Though we are stuck at home, this time has been great for enjoying things I often don’t get to during my normal busy schedule. Besides taking free training, I’ve been playing some new video games (#ACNH) and some games I’ve neglected for far too long. I’ve also used this time to find what brings me joy from my home. Making time for reading is bringing great joy to my life right now.

As we all move through this turbulent time, make sure to take care of yourself and others. I hope some of these free tools and training will come in handy as you work towards your personal goals while remaining safe and healthy.

Working remotely? A list of 5 ways to spend your down time was published on SAS Users.

4月 032020
 

Whether you like it or not, Microsoft Excel is still a big hit in the data analysis world. From small to big customers, we still see fit for daily routines such as filtering, generating plots, calculating items on ad-hoc analysis or even running statistical models. Whenever I talk to customers, there is always someone who will either ask: Can this be exported to excel or can we import data from excel?. Recently, other questions started to come up more often: Can we run Python within SAS? How do I allow my team to choose their language of preference? How do I provide an interface that looks like Microsoft Excel, but has SAS functionalities?.

Well… good news is: we can answer YES to all of these questions. With the increase in number of users performing analytics and the number of analytical tools available, for me it was clear that we would end up having lots of disparate processes. For a while this was a problem, but naturally, companies started developing ways to integrate these siloed teams.

In the beginning of last decade, SAS developed SAS Add-in for Microsoft Office. The tool allows customers to run/embed SAS analytic capabilities inside Microsoft Office applications. More recently, SAS released a new version of PROC FCMP allowing users to write Python code and call, if as a function, inside SAS programs.

These advancements provide users the ability to run Python inside Excel. When I say inside, I really mean from within Excel's interface.

Before we jump to how we can do it, you may ask yourself: Why is this relevant to me? If I know SAS, I import the dataset and work with the data in SAS; If I know Python, I open a Jupyter notebook, import the data set and do my thing. Well… you are kind of right, but let me tell you a story.

The use case

Recently I worked with a customer and his business process was like this: I have a team of data scientists that is highly technical and knowledgeable in Python and SAS. Additionally, I have a team of analysts with little Python knowledge, but are always working with Excel to summarize data, create filters, graphs, etc. My teams need to communicate and collaborate. The normal chain of events follows:

  1. the Python team works on the data, and exports the results to Excel
  2. the analytics team picks up the data set, and runs SAS scripts and excel formulas

This is a problem of inefficiency for the customer. Why can't the data scientist pass his or her code to the analyst to execute it on the same project without having to wait on the Python specialist to run the code?

I know this sounds overly complicated, but as my SAS colleague Mike Zizzi concludes in his post SAS or Python? Why not use both? Using Python functions inside SAS programs, at the end of the day what matters is that you get your work done. No matter which language, software or IDE you are using. I highly recommend Mike's article if you want a deep dive on what PROC FCMP has to offer.

The process

Let's walk through a data scoring scenario similar to my customer's story. Imagine I am a SAS programmer using Excel to explore data. I am also part of a team that uses Python, creating scoring data code using analytical models developed in Python. My job is to score and analyze the data on Excel and pass the results to the service representative, so they can forward the response to the customer.

Importing data

The data set we'll work with in this example will help me analyze which customers are more likely to default on a loan. The data and all code used in this article are in the associated GitHub repository. The data dictionary for the data set is located here. First, I open the data set as seen on Sheet1 below in Excel.

Upload data to SAS

Before we jump to the coding part with SAS and Python, I need to send the data to SAS. We'll use the SAS add-in, in Excel to send data to the local server. I cover the steps in detail below.

I start by selecting the cells I want to upload to the library.

Next, I move to the SAS tab and select the Copy to SAS Server task.

A popup shows up where I confirm the selected cells.

After I click OK, I configure column, table, naming and location options.

SAS uploads the table to the requested library. Additionally, a new worksheet with the library.table name displays the results. As you can see on the image below, the sheet created follows the name WORK.IMPORTED_DATA we setup on the previous step. This represents the table in the SAS library memory. Notice, however, we are still working in Excel.

The next step is to incorporate the code sent from my teammate.

The Python code

The code our colleague sent is pure Python. I don't necessarily have to understand the code details, just what it does. The Python code below imports and scores a model and returns a score. Note: if you're attempting this in your own environment, make sure to update the hmeq_model.sav file location in the # Import model pickle file section.

def score_predictions(CLAGE, CLNO, DEBTINC,DELINQ, DEROG, LOAN, MORTDUE, NINQ,VALUE, YOJ):
	"Output: scored"
	# Imporing libraries
	import pandas as pd
	from sklearn.preprocessing import OneHotEncoder
	from sklearn.compose import ColumnTransformer
	from sklearn.externals import joblib
 
	# Create pandas dataframe with input vars
	dataset = pd.DataFrame({'CLAGE':CLAGE, 'CLNO':CLNO, 'DEBTINC':DEBTINC, 'DELINQ':DELINQ, 'DEROG':DEROG, 'LOAN':LOAN, 'MORTDUE':MORTDUE, 'NINQ':NINQ, 'VALUE':VALUE, 'YOJ':YOJ}, index=[0])
 
	X = dataset.values
 
	# Import model pickle file
	loaded_model = joblib.load("C://assets/hmeq_model.sav")
 
	# Score the input dataframe and get 0 or 1 
	scored = int(loaded_model.predict_proba(X)[0,1])
 
	# Return scored dataframe
	return scored

My SAS code calls this Python code from a SAS function defined in the next section.

The SAS code

Turning back to Excel, in the SAS Add-in side of the screen, I click on Programs. This displays a code editor, and as explained on this video, is like any other SAS code editor.

We will use this code editor to write, run and view results from our code.

The code below defines a FCMP function called Score_Python, that imports the Python script from my colleague and calls it from a SAS datastep. The output table, HMEQ_SCORED, is saved on the WORK library in SAS. Note: if you're attempting this in your own environment, make sure to update the script.py file location in the /* Getting Python file */ section.

proc fcmp outlib=work.fcmp.pyfuncs;
 
/* Defining FCMP function */
proc fcmp outlib=work.fcmp.pyfuncs;
	/* Defining name and arguments of the Python function to be called */
 
	function Score_Python(CLAGE, CLNO, DEBTINC, DELINQ, DEROG, LOAN, MORTDUE, NINQ, VALUE, YOJ);
		/* Python object */
		declare object py(python);
 
		/* Getting Python file  */
		rc = py.infile("C:\assets\script.py");
 
		/* Send code to Python interpreter */
		rc = py.publish();
 
		/* Call python function with arguments */
		rc = py.call("score_predictions",CLAGE, CLNO, DEBTINC, DELINQ, DEROG, LOAN, MORTDUE, NINQ, VALUE, YOJ);
 
		/* Pass Python results to SAS variable */
		MyFCMPResult = py.results["scored"];
 
		return(MyFCMPResult);
	endsub;
run;
 
options cmplib=work.fcmp;
 
/* Calling FCMP function from data step */
data work.hmeq_scored;
	set work._excelexport;
	scored_bad = Score_Python(CLAGE, CLNO, DEBTINC, DELINQ, DEROG, LOAN, MORTDUE, NINQ, VALUE, YOJ);
	put scored_bad=;
run;

 

I place my code in the editor.

We're now ready to run the code. Select the 'Running man' icon in the editor.

The output

The result represents the outcome of the Python model scoring code. Once the code completes the WORK.HMEQ_SCORED worksheet updates with a new column, scored_bad.

The binary value represents if the customer is likely (a '1') or unlikely (a '0') to default on his or her loan. I could now use any built in Excel features to filter or further analyze the data. For instance, I could filter all the customers likely to default on their loans and pass a report on to the customer management team.

Final thoughts

In this article we've explored how collaboration between teams with different skills can streamline processes for efficient data analysis. Each team focuses on what they're good at and all of the work is organized and completed in one place. It is a win-win for everyone.

Related resources

Extending Excel with Python and SAS Viya was published on SAS Users.