SAS functions

4月 212018
 

Have you ever been working in the macro facility and needed a macro function, but you could not locate one that would achieve your task? With the %SYSFUNC macro function, you can access most SAS® functions. In this blog post, I demonstrate how %SYSFUNC can help in your programming needs when a macro function might not exist. I also illustrate the formatting feature that is built in to %SYSFUNC. %SYSFUNC also has a counterpart called %QSYSFUNC that masks the returned value, in case special characters are returned.
%SYSFUNC enables the execution of SAS functions and user-written functions, such as those created with the FCMP procedure. Within the DATA step, arguments to the functions require quotation marks, but because %SYSFUNC is a macro function, you do not enclose the arguments in quotation marks. The examples here demonstrate this.

%SYSFUNC has two possible arguments. The first argument is the SAS function, and the second argument (which is optional) is the format to be applied to the value returned from the function. Suppose you had a report and within the title you wanted to issue today’s date in word format:

   title "Today is %sysfunc(today(),worddate20.)";

The title appears like this:

   "Today is               July 4, 2018"

Because the date is right-justified, there are leading blanks before the date. In this case, you need to introduce another function to remove the blank spaces. Luckily %SYSFUNC enables the nesting of functions, but each function that you use must have its own associated %SYSFUNC. You can rewrite the above example by adding the STRIP function to remove any leading or trailing blanks in the value:

   title "Today is %sysfunc(strip(%sysfunc(today(),worddate20.)))";

The title now appears like this:

    "Today is July 4, 2018"

The important thing to notice is the use of two separate functions. Each function is contained within its own %SYSFUNC.

Suppose you had a macro variable that contained blank spaces and you wanted to remove them. There is no macro COMPRESS function that removes all blanks. However, with %SYSFUNC, you have access to one. Here is an example:

   %let list=a    b    c; 
   %put %sysfunc(compress(&list));

The value that is written to the log is as follows:

   abc

In this last example, I use %SYSFUNC to work with SAS functions where macro functions do not exist.

The example checks to see whether an external file is empty. It uses the following SAS functions: FILEEXIST, FILENAME, FOPEN, FREAD, FGET, and FCLOSE. There are other ways to accomplish this task, but this example illustrates the use of SAS functions within %SYSFUNC.

   %macro test(outf);
   %let filrf=myfile;
 
   /* The FILEEXIST function returns a 1 if the file exists; else, a 0
   is returned. The macro variable &OUTF resolves to the filename
   that is passed into the macro. This function is used to determine
   whether the file exists. In this case you want to find the file
   that is contained within &OUTF. Notice that there are no quotation
   marks around the argument, as you will see in all cases below. If
   the condition is false, the %ELSE portion is executed, and a
   message is written to the log stating that the file does not
   exist.*/
 
   %if %sysfunc(fileexist(&outf)) %then %do;
 
   /* The FILENAME function returns 0 if the operation was successful; 
   else, a nonzero is returned. This function can assign a fileref
   for the external file that is located in the &OUTF macro 
   variable. */
 
   %let rc=%sysfunc(filename(filrf,&outf));
 
   /* The FOPEN function returns 0 if the file could not be opened; 
   else, a nonzero is returned. This function is used to open the
   external file that is associated with the fileref from &FILRF. */
 
   %let fid=%sysfunc(fopen(&filrf));
 
   /* The %IF macro checks to see whether &FID has a value greater
   than zero, which means that the file opened successfully. If the
   condition is true, we begin to read the data in the file. */
 
   %if &fid > 0 %then %do;
 
   /* The FREAD function returns 0 if the read was successful; else, a
   nonzero is returned. This function is used to read a record from
   the file that is contained within &FID. */
 
   %let rc=%sysfunc(fread(&fid));
 
   /* The FGET function returns a 0 if the operation was successful. A
   returned value of -1 is issued if there are no more records
   available. This function is used to copy data from the file data 
   buffer and place it into the macro variable, specified as the
   second argument in the function. In this case, the macro variable
   is MYSTRING. */   
 
   %let rc=%sysfunc(fget(&fid,mystring));
 
   /* If the read was successful, the log will write out the value
   that is contained within &MYSTRING. If nothing is returned, the
   %ELSE portion is executed. */
 
   %if &rc = 0 %then %put &mystring;
   %else %put file is empty;
 
   /* The FCLOSE function returns a 0 if the operation was successful;
   else, a nonzero value is returned. This function is used to close
   the file that was referenced in the FOPEN function. */
 
   %let rc=%sysfunc(fclose(&fid));
   %end;
 
   /* The FILENAME function is used here to deassign the fileref 
   FILRF. */
 
   %let rc=%sysfunc(filename(filrf));
   %end;
   %else %put file does not exist;
   %mend test;
   %test(c:\testfile.txt)

There are times when the value that is returned from the function used with %SYSFUNC contains special characters. Those characters then need to be masked. This can be done easily by using %SYSFUNC’s counterpart, %QSYSFUNC. Suppose we run the following example:

   %macro test(dte);
   %put &dte;
   %mend test;
 
   %test(%sysfunc(today(), worddate20.))

The above code would generate an error in the log, similar to the following:

   1  %macro test(dte);
   2  %put &dte;
   3  %mend test;
   4
   5  %test(%sysfunc(today(), worddate20.))
   MLOGIC(TEST):  Beginning execution.
   MLOGIC(TEST):  Parameter DTE has value July 20
   ERROR: More positional parameters found than defined.
   MLOGIC(TEST):  Ending execution.

The WORDDATE format would return the value like this: July 20, 2017. The comma, to a parameter list, represents a delimiter, so this macro call is pushing two positional parameters. However, the definition contains only one positional parameter. Therefore, an error is generated. To correct this problem, you can rewrite the macro invocation in the following way:

   %test(%qsysfunc(today(), worddate20.))

The %QSYSFUNC macro function masks the comma in the returned value so that it is seen as text rather than as a delimiter.

For a list of the functions that are not available with %SYSFUNC, see the “How to expand the number of available SAS functions within the macro language was published on SAS Users.

4月 052018
 

Sir Tim Berners-Lee is famous for inventing the World Wide Web and for the construction of URLs -- a piece of syntax that every 8-year-old is now familiar with. According to the lore, when Sir Tim invented URLs he did not imagine that Internet surfers of all ages and backgrounds would be expected to type these cryptic schemes into their browser windows...but here we are. Today's young people can navigate the Web with URLs in the same way that migratory birds can find their way South for the winter -- by pure instinct.

URLs are syntax, the language of Web navigation. And HTML is syntax too, the language of the Web page. Each of these represent instructions to our web browser, telling it to how to go somewhere or display something. As we navigate the web programmatically, it is sometimes necessary to encode information in a URL or in HTML in a way that it won't be mistakenly interpreted as syntax. And of course, SAS has some functions for that.

Here's a table with links to the SAS documentation for these functions. You'll find they are intuitively named. The names are the same or similar to corresponding functions in other programming languages -- you can get only so creative with basic functions like these.

HTMLDECODE Function   Decodes a string that contains HTML numeric character references or HTML character entity references, and returns the decoded string.
HTMLENCODE Function   Encodes characters using HTML character entity references, and returns the encoded string.
URLDECODE Function   Returns a string that was decoded using the URL escape syntax.
URLENCODE Function   Returns a string that was encoded using the URL escape syntax.

Of these four functions, I use URLENCODE the most often. I need it when I need to pass syntax for instructions to a REST API, in which the API call itself is a URL. Here's an example from my paper about accessing Google Analytics with SAS:

%let workdate='01Oct2017'd;
%let urldate=%sysfunc(putn(&workdate.,yymmdd10.));
%let metrics=%sysfunc(urlencode(%str(ga:pageviews,ga:sessions)));
%let id=%sysfunc(urlencode(%str(ga:XXXXXX)));
filename garesp temp;
proc http
  url="https://www.googleapis.com/analytics/v3/data/ga?ids=&id.%str(&)start-date=&urldate.%str(&)end-date=&urldate.%str(&)metrics=&metrics.%str(&)max-results=20000"
  method="GET" out=garesp;
  headers 
    "Authorization"="Bearer &access_token."
    "client-id:"="&client_id.";
 run;

In the previous example, I need to include "ga:pageviews,ga:sessions" as an instruction on the Google Analytics API, but the colon character has special meaning within the URL syntax. I need to "escape" the colon character so that the URL parser ignores it and simply passes the value to the API. The URLENCODE function converts this segment to "ga%3Apageviews,ga%3Asessions". The colon has been replaced by its hexadecimal code, set off by a percent sign: %3A.

An aside: the percent (%) and ampersand (&) characters have special meaning in URLs. They also have special meaning in the SAS macro language. We can claim that the SAS macro language preceded URL syntax by decades, but there are only so many characters on the keyboard that syntax designers can use to set off code instructions. I use the

HTMLENCODE is useful when you need to represent HTML syntax in your output, but prevent the web browser from interpreting the HTML as code. Here's a simple example.

filename out temp;
ods html5 file=out;
data link;
  site = "sas.com";
  link = "<a href='https://www.sas.com'>sas.com</a>";
  code_as = htmlencode(link);
run;
 
proc print data=link;
run;
ods html5 close;

When produced using the HTML5 destination, the link variable is formatted as a live link, while the code_as variable shows the syntax that went into it.

As you might expect, the URLDECODE function "unescapes" the URL hex characters and restores the original URL syntax. HTMLDECODE does the same for HTML content. If you are writing code that implements an API endpoint in SAS (as you might do with a SAS stored process on the back end of a web service), you'll find these functions useful to unpack the information that was encoded on an API call.

HTMLENCODE and URLENCODE are not interchangeable. More than once, I have written programs that mistakenly use HTMLENCODE when it was URLENCODE that was needed. Those mistakes can be tricky to debug, so pay attention!

Cover image by Fabio Lanari, Internet2, CC BY-SA 4.0

The post SAS functions to encode and decode data for the Web appeared first on The SAS Dummy.

11月 022017
 

The purpose of this blog post is to demonstrate a SAS coding technique that allows for calculations with multiple variables across a SAS dataset, whether or not their values belong to the same or different observations.

calculations across observations of a SAS data table

What do we want?

As illustrated in the picture on the right, we want to be able to hop, or jump, back and forth, up and down across observations of a data table in order to implement calculations not just on different variables, but with their values from different observations of a data table.

In essence, we want to access SAS dataset variable values similar to accessing elements of a matrix (aij), where rows represent dataset observations, and columns represent dataset variables.

Combine and Conquer

In the spirit of my earlier post Combine and Conquer with SAS, this technique combines the functionality of the LAG function, which allows us to retrieve the variable value of a previous observation from a queue, with an imaginary, non-existent in SAS, LEAD function that reads a subsequent observation in a data set while processing the current observation during the same iteration of the DATA step.

LAG function

LAG<n> function in SAS is not your usual, ordinary function. While it provides a mechanism of retrieving previous observations in a data table, it does not work “on demand” to arbitrarily read a variable value from a previous observation n steps back. If you want to use it conditionally in some observations of a data step, you still need to call it in every iteration of that data step. That is because it retrieves values from a queue that is built sequentially for each invocation of the LAG<n> function. In essence, in order to use the LAG function even just once in a data step, you need to call it every time in each data step iteration until that single use.

Moreover, if you need to use each of the LAG1, LAG2, . . . LAGn functions just once, in order to build these queues, you have to call each of them in every data step iteration even if you are going to use them in some subsequent iterations.

LEAD function

The LEAD function is implemented in Oracle SQL and it returns data from the next or subsequent row of a data table. It allows you to query more than one row in a table at a time without having to join the table to itself.

There is no such function in SAS. However, the POINT= option of the SET statement in a SAS data step allows retrieving any observation by its number from a data set using random (direct) access to read a SAS data set. This will allow us to simulate a LEAD function in SAS.

HOP function

But why do we need two separate functions like LAG and LEAD in order to retrieve non-current observations. In essence, these two functions do the same thing, just in opposite directions. Why can’t we get by with just one function that does both backwards and forwards “hopping?”

Let’s combine and conquer.

Ideally, we would like to construct a new single function - let’s call it HOP(x, j) - that combines the best qualities of both LAG and LEAD functions. The two arguments of the HOP function would be as follows:

x – SAS variable name (numeric or character) the value of we are retrieving;

j – hop distance (numeric) – an offset from the current observation; negative values being lagging (hopping back), positive values being leading (hopping forward), and a zero-value meaning staying within the current observation.

The sign of the second argument defines whether we lag (minus) or lead (plus). The absolute value of this second argument defines how far from the current observation we hop.

Alternatively, we could have the first argument as a column number, and the second argument as a row/observation number, if we wanted this function to deal with the data table more like with a matrix. But relatively speaking, the method doesn’t really matter as long as we can unambiguously identify a data element or a cell. To stay within the data step paradigm, we will stick with the variable name and offset from the current observation (_n_) as arguments.

Let’s say we have a data table SAMPLE, where for each event FAIL_FLAG=1 we want to calculate DELTA as the difference between DATE_OUT, one observation after the event, and DATE_IN, two observations before the event:

Calculations across observations of a SAS data table

That is, we want to calculate DELTA in the observation where FAIL_FLAG = 1 as

26MAR2017 18JAN2017 = 67 (as shown in light-blue highlighting in the above figure).

With the HOP() function, that calculation in the data step would look like this:

data SAMPLE;
   set SAMPLE;
   if FAIL_FLAG then DELTA = hop(DATE_OUT,1) - hop(DATE_IN,-2);
run;

It would be reasonable to suggest that the hop() function should return a missing value when the second argument produces an observation number outside of the dataset boundary, that is when

_n_ + j < 0 or _n_ + j > num, where _n_ is the current observation number of the data step iteration; num is the number of observations in the dataset; j is the offset argument value.

Do you see anything wrong with this solution? I don’t. Except that the HOP function exists only in my imagination. Hopefully, it will be implemented soon if enough SAS users ask for it. But until then, we can use its surrogate in the form of a %HOP macro.

%HOP macro

The HOP macro grabs the value of a specified variable in an offset observation relative to the current observation and assigns it to another variable. It is used within a SAS data step, but it cannot be used in an expression; each invocation of the HOP macro can only grab one value of a variable across observations and assign it to another variable within the current observation.

If you need to build an expression to do calculations with several variables from various observations, you would need to first retrieve all those values by invoking the %hop macro as many times as the number of the values involved in the expression.

Here is the syntax of the %HOP macro, which has four required parameters:

%HOP(d,x,y,j)

d – input data table name;

x – source variable name;

y – target variable name;

j – integer offset relative to the current observation. As before, a negative value means a previous observation, a positive value means a subsequent observation, and zero means the current observation.

Using this %HOP macro we can rewrite our code for calculating DELTA as follows:

 data SAMPLE (drop=TEMP1 TEMP2);
   set SAMPLE;
   if FAIL_FLAG then
   do;
      %hop(SAMPLE,DATE_OUT,TEMP1, 1)
      %hop(SAMPLE,DATE_IN, TEMP2,-2)
      DELTA = TEMP1 - TEMP2;
   end;
run;

Note that we should not have temporary variables TEMP1 and TEMP2 listed in a RETAIN statement, as this could mess up our calculations if the j-offset throws an observation number out of the dataset boundary.

Also, the input data table name (d parameter value) is the one that is specified in the SET statement, which may or may not be the same as the name specified in the DATA statement.

In case you are wondering where you can download the %HOP macro from, here it is in its entirety:

%macro hop(d,x,y,j);
   _p_ = _n_ + &j;
   if (1 le _p_ le _o_) then set &d(keep=&x rename=(&x=&y)) point=_p_ nobs=_o_;
%mend hop;

Of course, it is “free of charge” and “as is” for your unlimited use.

Your turn

Please provide your feedback and share possible use cases for the HOP function/macro in the Comment section below. This is your chance for your voice to be heard!

Hopping for the best - calculations across SAS dataset observations was published on SAS Users.

5月 082017
 

In his recent article Perceptions of probability, Rick Wicklin explores how vague statements about "likeliness" translate into probabilities that we can express numerically. It's a fun, informative post -- I recommend it! You'll "Almost Certainly" enjoy it.

To prepare the article, Rick first had to download the source data from the study he cited. The data was shared as a CSV file on GitHub. Rick also had to rename the variables (column names) from the data table so that they are easier to code within SAS. Traditionally, SAS variable names must adhere to a few common programming rules: they must be alphanumeric, begin with a letter, and contain no spaces or special characters. The complete rules are documented in the this method for reading data from a cloud service like DropBox and GitHub. It's still my favorite technique for reading data from the Internet. You'll find lots of papers and examples that use FILENAME URL for the same job in fewer lines of code, but PROC HTTP is more robust. It runs faster, and it allows you to separate the step of fetching the file from the subsequent steps of processing that file.

You can see the contents of the CSV file at this friendly URL: https://github.com/zonination/perceptions/blob/master/probly.csv. But that's not the URL that I need for PROC HTTP or any programmatic access. To download the file via a script, I need the "Raw" file URL, which I can access via the Raw button on the GitHub page.

GitHub preview

In this case, that's https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv. Here's the PROC HTTP step to download the CSV file into a temporary fileref.

/* Fetch the file from the web site */
filename probly temp;
proc http
 url="https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv"
 method="GET"
 out=probly;
run;

A note for SAS University Edition users: this step won't work for you, as the free software does not support access to secure (HTTPS) sites. You'll have to manually download the file via your browser and then continue with the remaining steps.

Step 2. Import the data into SAS with PROC IMPORT

SAS can process data with nonstandard variable names, including names that contain spaces and special characters. You simply have to use the VALIDVARNAME= system option to put SAS into the right mode (oops, almost wrote "mood" there but it's sort of the same thing).

With 'crime against nature'n.)

For this step, I'll set VALIDVARNAME=ANY to allow PROC IMPORT to retain the original column names from the CSV file. The same trick would work if I was importing from an Excel file, or any other data source that was a little more liberal in its naming rules.

/* Tell SAS to allow "nonstandard" names */
options validvarname=any;
 
/* import to a SAS data set */
proc import
  file=probly
  out=work.probly replace
  dbms=csv;
run;

Step 3. Create RENAME and LABEL statements with PROC SQL

This is one of my favorite SAS tricks. You can use PROC SQL SELECT INTO to create SAS programming statements for you, based on the data you're processing. Using this technique, I can build the parts of the LABEL statement and the RENAME statement dynamically, without knowing the variable names ahead of time.

The LABEL statement is simple. I'm going to build a series of assignments that look like this:

  'original variable name'n = 'original variable name'

I used the SELECT INTO clause to build a label assignment for each variable name. I used the CAT function to assemble the label assignment piece-by-piece, including the special literal syntax, the variable name, the assignment operator, and the label value within quotes. I'm fetching the variable names from SASHELP.VCOLUMN, one of the built-in dictionary tables that SAS provides to surface table and column metadata.

  select cat("'",trim(name),"'n","=","'",trim(name),"'") 
     into :labelStmt separated by ' '  
  from sashelp.vcolumn where memname="PROBLY" and libname="WORK";

Here's part of the value of &labelStmt:

'Almost Certainly'n='Almost Certainly' 
'Highly Likely'n='Highly Likely' 
'Very Good Chance'n='Very Good Chance' 
'Probable'n='Probable' 
'Likely'n='Likely' 
'Probably'n='Probably' 
'We Believe'n='We Believe' 

The RENAME statement is a little trickier, because I have to calculate a new valid variable name. For this specific data source that's easy, because the only SAS "rule" that these column names violate is the ban on space characters. I can create a new name by using the COMPRESS function to remove the spaces. To be a little safer, I used the "kn" modifier on the COMPRESS function to keep only English letters, numbers, and underscores. That should cover all cases except for variable names that are too long (greater than 32 characters) or that begin with a number (or that don't contain any valid characters to begin with).

Some of the column names are one-word names that are already valid. If I include those in the RENAME statement, SAS will generate an error (you cannot "rename" a variable to its current name). I used the

/* Generate new names to comply with SAS rules.                          */
/* Assumes names contain spaces, and can fix with COMPRESS               */
/* Other deviations (like special chars, names that start with a number) */
/* would need different adjustments                                      */
/* NVALID() function can check that a name is a valid V7 name           */
proc sql noprint;
 
  /* retain original names as labels */
  select cat("'",trim(name),"'n","=","'",trim(name),"'") 
     into :labelStmt separated by ' '  
  from sashelp.vcolumn where memname="PROBLY" and libname="WORK";
 
  select cat("'",trim(name),"'n","=",compress(name,,'kn')) 
     into :renameStmt separated by ' '  
  from sashelp.vcolumn where memname="PROBLY" and libname="WORK"
  /* exclude those varnames that are already valid */
  AND not NVALID(trim(name),'V7');
quit;

Step 4. Modify the data set with new names and labels using PROC DATASETS

With the body of the LABEL and RENAME statements built, it's time to plug them into a PROC DATASETS step. PROC DATASETS can change data set attributes such as variable names, labels, and formats without requiring a complete rewrite of the data -- it's a very efficient operation.

I include the LABEL statement first, since it references the original variable names. Then I include the RENAME statement, which changes the variable names to their new V7-compliant values.

Finally, I reset the VALIDVARNAME= option to the normal V7 sanity. (Unless you're running in SAS Enterprise Guide, in which case the option is already set to ANY by default. Check this blog post for a less disruptive method of setting/restoring options.)

proc datasets lib=work nolist ;
  modify probly / memtype=data;
  label &labelStmt.;
  rename &renameStmt.;
  /* optional: report on the var names/labels */
  contents data=probly nodetails;
quit;
 
/* reset back to the old rules */
options validvarname=v7;

Here's the CONTENTS output from the PROC DATASETS step, which shows the final variable attributes. I now have easy-to-code variable names, and they still have their descriptive labels. My data dictionary dreams are coming true!

DATASETS rename output

Download the entire program example from my public Gist: import_renameV7.sas.

The post How to download and convert CSV files for use in SAS appeared first on The SAS Dummy.

5月 082017
 

In his recent article Perceptions of probability, Rick Wicklin explores how vague statements about "likeliness" translate into probabilities that we can express numerically. It's a fun, informative post -- I recommend it! You'll "Almost Certainly" enjoy it.

To prepare the article, Rick first had to download the source data from the study he cited. The data was shared as a CSV file on GitHub. Rick also had to rename the variables (column names) from the data table so that they are easier to code within SAS. Traditionally, SAS variable names must adhere to a few common programming rules: they must be alphanumeric, begin with a letter, and contain no spaces or special characters. The complete rules are documented in the this method for reading data from a cloud service like DropBox and GitHub. It's still my favorite technique for reading data from the Internet. You'll find lots of papers and examples that use FILENAME URL for the same job in fewer lines of code, but PROC HTTP is more robust. It runs faster, and it allows you to separate the step of fetching the file from the subsequent steps of processing that file.

You can see the contents of the CSV file at this friendly URL: https://github.com/zonination/perceptions/blob/master/probly.csv. But that's not the URL that I need for PROC HTTP or any programmatic access. To download the file via a script, I need the "Raw" file URL, which I can access via the Raw button on the GitHub page.

GitHub preview

In this case, that's https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv. Here's the PROC HTTP step to download the CSV file into a temporary fileref.

/* Fetch the file from the web site */
filename probly temp;
proc http
 url="https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv"
 method="GET"
 out=probly;
run;

A note for SAS University Edition users: this step won't work for you, as the free software does not support access to secure (HTTPS) sites. You'll have to manually download the file via your browser and then continue with the remaining steps.

Step 2. Import the data into SAS with PROC IMPORT

SAS can process data with nonstandard variable names, including names that contain spaces and special characters. You simply have to use the VALIDVARNAME= system option to put SAS into the right mode (oops, almost wrote "mood" there but it's sort of the same thing).

With 'crime against nature'n.)

For this step, I'll set VALIDVARNAME=ANY to allow PROC IMPORT to retain the original column names from the CSV file. The same trick would work if I was importing from an Excel file, or any other data source that was a little more liberal in its naming rules.

/* Tell SAS to allow "nonstandard" names */
options validvarname=any;
 
/* import to a SAS data set */
proc import
  file=probly
  out=work.probly replace
  dbms=csv;
run;

Step 3. Create RENAME and LABEL statements with PROC SQL

This is one of my favorite SAS tricks. You can use PROC SQL SELECT INTO to create SAS programming statements for you, based on the data you're processing. Using this technique, I can build the parts of the LABEL statement and the RENAME statement dynamically, without knowing the variable names ahead of time.

The LABEL statement is simple. I'm going to build a series of assignments that look like this:

  'original variable name'n = 'original variable name'

I used the SELECT INTO clause to build a label assignment for each variable name. I used the CAT function to assemble the label assignment piece-by-piece, including the special literal syntax, the variable name, the assignment operator, and the label value within quotes. I'm fetching the variable names from SASHELP.VCOLUMN, one of the built-in dictionary tables that SAS provides to surface table and column metadata.

  select cat("'",trim(name),"'n","=","'",trim(name),"'") 
     into :labelStmt separated by ' '  
  from sashelp.vcolumn where memname="PROBLY" and libname="WORK";

Here's part of the value of &labelStmt:

'Almost Certainly'n='Almost Certainly' 
'Highly Likely'n='Highly Likely' 
'Very Good Chance'n='Very Good Chance' 
'Probable'n='Probable' 
'Likely'n='Likely' 
'Probably'n='Probably' 
'We Believe'n='We Believe' 

The RENAME statement is a little trickier, because I have to calculate a new valid variable name. For this specific data source that's easy, because the only SAS "rule" that these column names violate is the ban on space characters. I can create a new name by using the COMPRESS function to remove the spaces. To be a little safer, I used the "kn" modifier on the COMPRESS function to keep only English letters, numbers, and underscores. That should cover all cases except for variable names that are too long (greater than 32 characters) or that begin with a number (or that don't contain any valid characters to begin with).

Some of the column names are one-word names that are already valid. If I include those in the RENAME statement, SAS will generate an error (you cannot "rename" a variable to its current name). I used the

/* Generate new names to comply with SAS rules.                          */
/* Assumes names contain spaces, and can fix with COMPRESS               */
/* Other deviations (like special chars, names that start with a number) */
/* would need different adjustments                                      */
/* NVALID() function can check that a name is a valid V7 name           */
proc sql noprint;
 
  /* retain original names as labels */
  select cat("'",trim(name),"'n","=","'",trim(name),"'") 
     into :labelStmt separated by ' '  
  from sashelp.vcolumn where memname="PROBLY" and libname="WORK";
 
  select cat("'",trim(name),"'n","=",compress(name,,'kn')) 
     into :renameStmt separated by ' '  
  from sashelp.vcolumn where memname="PROBLY" and libname="WORK"
  /* exclude those varnames that are already valid */
  AND not NVALID(trim(name),'V7');
quit;

Step 4. Modify the data set with new names and labels using PROC DATASETS

With the body of the LABEL and RENAME statements built, it's time to plug them into a PROC DATASETS step. PROC DATASETS can change data set attributes such as variable names, labels, and formats without requiring a complete rewrite of the data -- it's a very efficient operation.

I include the LABEL statement first, since it references the original variable names. Then I include the RENAME statement, which changes the variable names to their new V7-compliant values.

Finally, I reset the VALIDVARNAME= option to the normal V7 sanity. (Unless you're running in SAS Enterprise Guide, in which case the option is already set to ANY by default. Check this blog post for a less disruptive method of setting/restoring options.)

proc datasets lib=work nolist ;
  modify probly / memtype=data;
  label &labelStmt.;
  rename &renameStmt.;
  /* optional: report on the var names/labels */
  contents data=probly nodetails;
quit;
 
/* reset back to the old rules */
options validvarname=v7;

Here's the CONTENTS output from the PROC DATASETS step, which shows the final variable attributes. I now have easy-to-code variable names, and they still have their descriptive labels. My data dictionary dreams are coming true!

DATASETS rename output

Download the entire program example from my public Gist: import_renameV7.sas.

The post How to download and convert CSV files for use in SAS appeared first on The SAS Dummy.

1月 272015
 

SAS Technical Support Problem Solvers“Here’s Johnny!!!” and well sometimes John and sometimes Jonathan and sometimes Jon.

In the real world, you sometimes need to make matching character strings more flexible. This is especially common when merging data sets. Variables, especially names, are not always exactly the same in all sources of your data. When matching data, you need to be able to programmatically determine if ‘John Doe’ is the same as ‘Johnny Doe’. The term most often associated with this type of matching is ‘fuzzy matching’. Fortunately within SAS, there are several functions that allow you to perform a fuzzy match. I’ll show you the most common of these functions and then I will show you an example that uses my favorite from this list.

COMPARE Function

The COMPARE function returns the position of the leftmost character by which two strings differ, or returns 0 if there is no difference. This function is most useful when comparing two strings that should be the same, but may only differ in case or due to leading or trailing blanks. This function takes the place of the following code:

if strip(upcase(name1))=strip(upcase(name2)) then do;

The COMPARE function can be used to make the same comparison.

if compare(name1,name2,’il’)=0 then do;

COMPGED Function

The COMPGED function returns the generalized edit distance between two strings. Specifically, the COMPGED function returns a generalization of the Levenshtein edit distance, which is a measure of dissimilarity between two strings. The Levenshtein edit distance is the number of operations (deletions, insertions, or replacement) of a single characters that are required to transform string-1 into string-2. 

Each operation basically ‘costs’ a certain value. For example, if string-1 is the result of inserting a space into string-2, this has a cost of 10. The more dramatic the operation, the greater the cost.  The COMPGED will return the total cost for all operations that occur. The costs returned by COMPGED can be altered by using CALL COMPCOST so that the cost are specific to your needs. A common use I have seen for using the COMPGED function is using it to compare email addresses.

email1='JohnDoe@abc.com';                                                                                                        
email2='John_Doe@abc.com';                                                                                                          
cost=compged(email1,email2);

The value of COST will be 30 which is the cost of adding punctuation to a string.

COMPLEV Function

The COMPLEV function is very similar to the COMPGED function. The difference is that the Levenshtein edit distance that is computed by COMPLEV is a special case of the generalized edit distance that is computed by COMPGED. The result is the COMPLEV executes much more quickly than COMPGED. However, the COMPLEV function is not as powerful or versatile as the COMPGED function. The COMPLEV function is generally most useful when comparing simple strings and when speed of comparison is important.

SPEDIS Function

The SPEDIS function is the oldie-but-a-goodie of the bunch. The SPEDIS function determines the likelihood of two words matching, expressed as the asymmetric spelling distance between the two words. SPEDIS is similar to COMPGED in that it assigns a cost to the each operation such as swap, append and delete. SPEDIS will sum the costs and then divide the sum by the length of the first argument. It is important to remember this because this makes the order of the arguments important. This means that SPEDIS(X,Y) does not always equal SPEDIS(Y,X).

Summary: Review your data

Any process that is described as ‘fuzzy’ is obviously not an exact science. With each of these functions using different algorithms, each is going to have its own strengths.

  • If you are comparing complex strings and need the most control, then I would recommend looking at the COMPGED function.
  • If you are comparing fairly simple strings and within large data sets, then COMPLEV may be a better choice.

The bottom line is that you will have to review your data and what differences are important to you. The results returned by these functions are subjective. You have to determine what an acceptable difference is. I suggest that you simply test out each function to see which one works best for you.

COMPGED and SOUNDEX Example

Of the above functions, COMPGED is the one I tend to use most often because, for the scenarios brought to me by various customers, COMPGED has produced the most precise results. I have been able to get the best results by combining the COMPGED function with the SOUNDEX function. Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

The below example encodes each of the first names that is to be compared with the SOUNDEX function and then evaluates the results of using SOUNDEX with the COMPGED function. The goal of this example is to determine if the value of FNAME is the same as ‘Johnathan’ or ‘William’.

/* Sample data that contains names and the class being taken */                                                                         
data class;                                                                                                                             
  input fname : $12. lname $ class : $9.;                                                                                               
  datalines;                                                                                                                            
Jon Smith Math                                                                                                                          
John Smith Math                                                                                                                         
Johnny Smith Math                                                                                                                       
James Smith Math                                                                                                                        
Will Miller Chemistry                                                                                                                   
Willy Miller Chemistry                                                                                                                  
Willie Miller Chemistry                                                                                                                 
Bonny Davis Gym                                                                                                                         
Milly Wilson Biology                                                                                                                    
;                                                                                                                                       
                                                                                                                                        
/*  Data set that contains the name and the grade for the class */                                                                      
data grade;                                                                                                                             
  input fname : $12. lname $ grade $;                                                                                                   
  datalines;                                                                                                                            
Johnathan Smith A                                                                                                                       
William Miller B                                                                                                                        
;                                                                                                                                       
                                                                                                                                        
                                                                                                                                        
/* This code uses both the SOUNDEX and COMPGED functions.  I have found */                                                              
/* that by comparing the strings produced by SOUNDEX, I have been able */                                                               
/* to get a tighter comparison.                                        */                                                               
data c;                                                                                                                                 
  set class;                                                                                                                            
  tmp1=soundex(fname);                                                                                                                  
  do i=1 to nobs;                                                                                                                       
    set grade(rename=(fname=fname2)) point=i nobs=nobs;                                                                                 
    tmp2=soundex(fname2);                                                                                                               
    dif=compged(tmp1,tmp2);                                                                                                             
    if dif<=50 then do;                                                                                                                 
      possible_match='Yes';                                                                                                             
      drop  i tmp1 tmp2  fname2;                                                                                                        
      output;                                                                                                                           
    end;                                                                                                                                
  end;                                                                                                                                  
run;                                                                                                                                    
proc print; run;

One of the great things about SAS is the number of different ways you can accomplish a given goal.  If you have a technique to make fuzzy comparisons that you like to use, please share.  I would love to see some of the techniques being used!

tags: character data, fuzzy match, Problem Solvers, SAS functions, soundex
9月 222011
 

SAS programming is taught in schools all over the world, including in high schools.  Occasionally, I receive questions via my blog such as this one:

Can somebody help me on this?
Write a short DATA _NULL_ step to determine the largest integer you can store on your computer in 3, 4, 5, 6, and 7 bytes.

This sounds like a homework assignment to me, not a typical "how do I" programming question from a person trying to get a job done.  I'm pretty sure that the professor who assigned it would not sanction a lazy web approach.

Besides that, given the brief problem statement above, how do we know what the professor wants?  Is the student supposed to write a program to calculate the largest integer that you can safely store within SAS?  Or is it okay to be clever and simply use built-in SAS functions to tell you the answer?  The first approach is an exercise in programming logic and math.  The second approach is a test of your resourcefulness -- can you find the answer without a "brute force" approach, perhaps by becoming familiar with SAS documentation?  Both approaches are valid, and will provide you with practice in skills that will help you with your ongoing SAS programming endeavors.

At the risk of rewarding lazy behavior, I will present one method to find the answer.  I'm sharing it only because it sheds light on the useful CONSTANT function, and other readers might find that helpful in their own work.

In this example, we'll use the EXACTINT constant. From the SAS documentation:

The exact integer is the largest integer k such that all integers less than or equal to k in absolute value have an exact representation in a SAS numeric variable of length nbytes. This information can be useful to know before you trim a SAS numeric variable from the default 8 bytes of storage to a lower number of bytes to save storage.

Here's the program:

data _null_;
  array len{8} _numeric_ BYTES_1-BYTES_8;
  do i=3 to 8;
   len{i} = constant('exactint',i);
  end;
  pi = constant('pi');
  put "Decimal:" (BYTES_3-BYTES_8) (=/comma32.0);
  put "Hexidecimal:" (BYTES_3-BYTES_8) (=/hex14.);
  put "Binary:" (BYTES_3-BYTES_8) (=/binary64.);
  put "and a slice of Pi:" (pi) (=/16.14);
run;

You'll have to run the program in SAS to see the results. As my high school history teacher used to say, "I'm not going to spoon-feed the answers to you." (Although I suppose that's what I just did...)

Bonus (even though you don't deserve it)

Here are a couple of resources that may be helpful in future assignments:

tags: CONSTANT function, homework, SAS functions, SAS programming