Kevin Russell

6月 232020
 

When you execute code on the SAS® Cloud Analytic Services (CAS) server, one way you can improve performance is by partitioning a table. A partitioned table enables the data to load faster when that data needs to be grouped by the common values of a variable.

This post explains what it means to partition a table and describes the advantages of a partitioned table. I'll illustrate these concepts with example code that shows improved processing time when you use a partitioned table.

SAS BY-group requires SORT

BY-group processing enables you to group your data by unique variable values. This processing is used, for example, in tasks like merging data sets by a common variable and producing reports that contain data that is grouped by the value of a classification variable. You also use BY-group processing when you execute code in CAS. A key difference in creating BY-groups in SAS versus in CAS is that SAS requires a SORT procedure to sort the data by the specified BY variable in order to create the BY groups.

BY-group sorting implicit in CAS

This step is not required in CAS. When you perform BY-group processing on a CAS table, an implicit sorting action takes place, and each BY-group is distributed to the available threads. This implicit sort process takes place each time that the table is accessed and the BY-groups are requested.

Partitioning a CAS table permanently stores the table such that the values of the BY variable are grouped. Using a partitioned CAS table enables you to skip the implicit sort process each time the table is used, which can greatly improve performance.

Partition CAS action example

You create a partitioned CAS table by using the partition CAS action. The following example shows how to partition the CARS table (created from the SASHELP.CARS data set) by the MAKE variable.

caslib _all_ assign;    ①
data casuser.cars;      
set sashelp.cars;  ②
run;
proc cas;
table.partition /     ③                                              
casout={caslib="casuser", name="cars2"} ④                            
table={caslib="casuser", name="cars", groupby={name="make"}}; ⑤
quit;

In this code:

①  The CASLIB statement creates SAS librefs that point to all existing caslibs. CASUSER, which is used in the subsequent DATA step, is one of the librefs that are created by this statement.

②  The DATA step creates the CARS CAS table in the CASUSER caslib.

③  The partition action in the CAS procedure is part of the TABLE action set.

④  The casout= parameter contains the caslib= parameter, which points to the caslib where the partitioned CAS table named CARS2 will be stored.

⑤ The table= parameter contains the name= parameter, which lists the name of the table that is being partitioned. It also contains the CASLIB= option, which points to the caslib in which the table is stored. The groupby= parameter contains the name= option, which names the variable by which to partition the table.

You can confirm that the table has been partitioned by submitting the following CAS procedure with the tabledetails action.

proc cas;
table.tabledetails result=r / level='partition' ①
name='cars2'
caslib='casuser';
run;
describe r; ②
quit;

In this code:
① The LEVEL= parameter specifies the aggregation level of the TABLEDETAILS output.
② The DESCRIBE statement writes the output of the TABLEDETAILS action to the log.

The following output is displayed in the resulting log. The KEY column shows the variable the table has been partitioned by.

As mentioned earlier, the purpose of partitioning a table is to improve performance. The following example uses two CAS tables that consist of an ID variable with 10 possible values and 10 character variables. Each table contains 5,000,000 rows. This example illustrates how much performance improvement you can gain by partitioning the tables. In this case, the ID variable merges two tables.

First, you create the tables by submitting the following DATA steps:

data casuser.one;
array vars(10) $8 x1-x10;
do j=1 to 5000000;
id=put(rand('integer',1,10),8.);
do i=1 to 10;
vars(i)=byte(rand('integer',65,90));
end;
output;
end;
drop i j;
run;
 
data casuser.two;
array vars(10) $8 x1-x10;
do j=1 to 5000000;
id=put(rand('integer',1,10),8.);
do i=1 to 10;
vars(i)=byte(rand('integer',65,90));
end;
output;
end;
drop i j;
run;

The DATA steps above show how to merge non-partitioned tables. In the log output shown below, you can see that the total, real time (highlighted) took almost 45 seconds to run.

Partitioned tables example

The next example runs the same DATA step code, but it uses partitioned tables. The first step is to partition the tables, as shown below:

proc cas;
table.partition /                                                   
casout={caslib="casuser", name="onepart"}                             
table={caslib="casuser", name="one", groupby={name="id"}};   
run; 
 
table.partition /                                                   
casout={caslib="casuser", name="twopart"}                             
table={caslib="casuser", name="two", groupby={name="id"}};   
quit;

To merge the two tables and to product the log, submit the following code. (The real time is highlighted in the log.)

data casuser.nopart;
merge casuser.onepart casuser.twopart;
by id;
run;

The output for this code is show below:

This time, the DATA Step took only 25.43 seconds to execute. That is a 43% improvement in execution time!

Partition for improved performance

If your analysis requires you to use the same table multiple times to perform BY-group processing, then I strongly recommend that you partition the table. As the last example shows, partitioning your table can greatly improve performance!

Partition your CAS tables to greatly improve performance was published on SAS Users.

6月 212019
 

For every project in SAS®, the first step is almost always making your data available. This blog shows you how to load three of the most common input data types—a data set, a text file, and a Microsoft Excel file—into SAS® Cloud Analytic Services (CAS) tables.

The three methods that I show here are the three easiest ways to load each data type into CAS. Multiple tools can load data into CAS, but I am showing the tools that I consider the easiest to use and that are probably the most familiar to SAS programmers.

You need to place your data in a location that can be accessed by the programming environment that is used to access CAS. The most common programming environment that accesses CAS is SAS® Studio. Input data files that are used in CAS are going to be very large. You will need to use an SFTP tool to move your data from your PC to a directory that can be accessed by your SAS Studio session. (Check with your system administrator to see what the preferred tool is at your site.)

After your data is in a location that can be accessed by the programming environment, you need to start a CAS session. This is done with the CAS statement; here is the syntax:

cas session-name <option(s)>;

The options that you specify depend on how your system administrator configured your environment. For example, I asked my system administrator to set it up so that the only thing I need to do is issue the following statement:

cas;

That statement then creates a CAS session with the default name of CASAUTO, with an active caslib of CASUSER.

After you establish your CAS session, you can start loading data.

Load a SAS data set

The easiest way to load SAS data into CAS is to use a DATA step. The basic syntax is the same as it is when you are creating a SAS data set. The key difference is that the libref that is listed in the DATA step must point to a caslib.

The following example accesses SASHELP.CARS and then creates the table CARS in the CASUSER caslib.

cas;                   /* log on to the CAS session */
   caslib _all_ assign;   /* create a libref for each active 
                             caslib */
 
   data casuser.cars;     /* create the table in the caslib */
      set sashelp.cars;
   run;

The one thing to note about this code is that this DATA step is running in SAS and not CAS. A DATA step runs in CAS only if the input and output librefs are both using the CAS engine and only if it uses language elements that CAS supports. In this example, the SASHELP libref was not created with the CAS engine, so the step must run in SAS.

There are no calculations in this step, so there is no effect on performance.

When you load a data set into a CAS table, one task that you might want to perform is to promote the table. The following DATA step shows how to promote a table as you create it:

data casuser.cars(promote=yes);
      set sashelp.cars;
   run;

Promoting a table gives it a global scope. You can then access the table in multiple sessions, and the table is also available in SAS® Visual Analytics. The PROMOTE= data set option can be used with all three of the examples in this blog.

Load a delimited text file

The easiest way to load a delimited text file and store it as a CAS table is to use another familiar SAS step, the IMPORT procedure. The syntax is going to be basically the same as it is in SAS. Again, the key difference is that you need to point to a caslib in the OUT= option of the PROC IMPORT statement.

If you are running SAS® Studio 5.1, a common location for a text file that needs to be loaded into CAS is the SAS Content folder. This folder is a predefined repository for text files that you need to access from SAS Studio.

In order to access the files from SAS Content, you need to use the FILESRVC access method with the FILENAME statement. This method enables you to store and retrieve content using the SAS® Viya® Files service. Here is the basic syntax:

filename fileref filesrvc folderpath='path'
            filename='name';

For more information about this access method, see SAS 9.4 Global Statements Reference.

In the following example, PROC IMPORT and the FILESRVC access method are used to load the class.csv file from the SAS Content folder. The resulting ALLCLASS table is written to the CASUSER caslib.

filename myfile filesrvc folderpath='/Users/saskir'
            filename='class.csv';
   proc import datafile=myfile out=casuser.allclass(promote=yes)
        dbms=csv;
   run;

Load an Excel file

This section shows you how to load an Excel file into a CAS table by using PROC IMPORT. Loading an Excel file using PROC IMPORT requires that you have SAS/ACCESS® Interface to PC Files. If you are unsure whether you have this product, you can write all of your licensed products to the log by using the SETINIT procedure:

proc setinit;
   run;

You should see the following in the log when the product is licensed:

---SAS/ACCESS Interface to PC Files

After you confirm that you have this product, you can adapt the following PROC IMPORT code. This example loads the ReportTest.xlsx file and stores it in a table named ReportTest in the CASUSER caslib.

cas;
   caslib _all_ assign;
 
   proc import datafile='/viyashare/ReportTest.xlsx'
        out=casuser.ReportTest dbms=xlsx;
   run;

There are other methods

The purpose of this blog was to show you the easiest ways to load a SAS data set, text file, and Excel file into CAS. Although there are multiple ways to accomplish these tasks, both programmatically and interactively, these methods are the easiest and most straightforward ways to get data into CAS.

Learn the three easiest ways to load data into CAS tables was published on SAS Users.

8月 242018
 

I know a lot of you have been programming in SAS for a long time, which is awesome! However, when you do something for a long time, sometimes you get set in your ways and you miss out on new ways of doing things.

Although the COUNT and CAT functions have been around for a while now, I see a lot of customer code that is counting and concatenating text strings the "old-fashioned" way. In this article, I would like to introduce you to the COUNT, COUNTW, CATS and CATX functions. These functions make certain tasks much simpler, like counting words in a string and concatenating text together.

Counting words or text occurrences

First let's take a look at the COUNT and COUNTW functions.
The

Data a;
  Contributors='The Big Company INC, The Little Company, ACME Incorporated,    Big Data Co, Donut Inc.';
  Num=count(contributors,'inc','i');  /* the 'i' modifier means to ignore case*/
  Put num=;
Run;

When we examine the SAS log, we can see that NUM has a value of 3.

Num=3
NOTE: The data set WORK.A has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds

The

/* DON'T USE - use COUNTW instead */
data a(drop=done i);
  x='a#b#c#d#e';
  do until(done);
    i+1;
    y=scan(x,i,'#');
    if y='' then done=1;
    else output;
  end;
  run;

I realize this code isn't terrible, but I try to avoid DO UNTIL/WHILE loops if I can. There is always the possibility of going into an infinite loop.
The COUNTW function eliminates the need for a DO UNTIL/WHILE loop.

Here is an example of logic that I use all the time. In this example, I have a macro variable that contains a list of values that I want to loop through. I can use the COUNTW function to easily loop through each file listed in the resolved value of &FILE_NAMES. The code then uses the file name on the DATA statement and the INFILE statement.

%let file_names=01JAN2018.csv 01FEB2018.csv 01MAR2018.csv 01APR2018.csv;
%macro test(files);
 
%do i=1 %to %sysfunc(countw(&file_names,%str( )));
  %let file=%scan(&file_names,&i,%str( ));
  data _%scan(&file,1,.);
    infile "c:\my files\&file";
    input region $ manager $ sales;
  run;
%end;
%mend;
%test(&file_names)

The log is too large to list here, but you can see one of the generated DATA steps in the MPRINT output of this snapshot of the log.

MPRINT(TEST):   data _01JAN2018;
MPRINT(TEST):   infile "c:\my files\01JAN2018.csv";
MPRINT(TEST):   input region $ manager $ sales;
MPRINT(TEST):   run;

This data step will be generated for each file listed.

Counting strings within another text string should be easy to do. The COUNT functions definitely make this a reality!

Concatenating strings in SAS

Now that we know how to COUNT text in SAS, let me show you how to CAT in SAS with the CATS and CATX functions.

Back in the old days, I had hair(!) and we concatenated text strings using double pipes syntax.

  X=var1||var2||var3;

This syntax is not too bad, but what if VARn has trailing blanks? Prior to SAS Version 9 you had to remove the trailing blanks from each value. Also, if the text was right justified, you had to left justify the text. This complicates the syntax:

X=trim(left(var1))||trim(left(var2))||trim(left(var3));

You can now accomplish the same thing using CATS. The
data a;
  length var1 var2 var3 $12;
  var1='abc';
  var2='123';
  var3='xyz';
  x=cats(var1,var2,var3);
  put x=;
run;

VAR1-VAR3 have a length of 12, which means each value contains trailing blanks. By using the CATS function, all trailing blanks are removed and the text is concatenated without any spaces between the text. Here is the result of the above PUT statement.

x=abc123xyz

Another common need when concatenating text together is to create a delimited string. This can now be done using the CATX function. The

data a;
  length var1 var2 var3 $12;
  var1='abc';
  var2='123';
  var3='xyz';
  x=catx(',',var1,var2,var3);
  put x=;
run;

This syntax creates a comma separated list with all leading and trailing blanks removed. Here is the result of the PUT statement.

x=abc,123,xyz

Just how the COUNT functions making counting text in SAS easier, the CAT functions make concatenating strings so much easier. For more explanation and examples of these CAT* functions, see this paper by Louise Hadden, Purrfectly Fabulous Feline Functions (because they are CAT functions, get it?).

Before I let you go, let me point out that in addition to the COUNT, COUNTW, CATS and CATX functions, there are also the COUNTC, CAT, CATQ and CATT functions that provide even more functionality. These functions are not used as often, so I haven't discussed them here. Please How to COUNT CATs in SAS was published on SAS Users.

8月 212017
 

The stored compiled macro facility enables you to compile and save your macro definition in a permanent catalog in a library that you specify. The macro is compiled only once. When you call the macro in the current and subsequent SAS® sessions, SAS executes the compiled code from the macro catalog that you created when you compiled the macro.

The stored compiled facility has two main purposes. The first is that it enables your code to run faster because the macro code does not need to be compiled each time it is executed. The second purpose is to help you protect your code. Sometimes you need to share code that you’ve written with other users, but you do not want them to be able to see the code that is being executed. The stored compiled macro facility enables you to share the program without revealing the code. Compiling the macro with the SECURE option prevents the output of the SYMBOLGEN, MPRINT, and MLOGIC macro debugging options from being written to the log when the macro executes. This means that no code is written to the log when the code executes. After the macro has been compiled, there is no way to decompile it to retrieve the source code that created the catalog entry. This behavior prevents the user from being able to retrieve the code. However, it also prevents you from being able to recover the code.

It is very important to remember that there is no way to get back the code from a stored compiled macro. Because of this behavior, you should ALWAYS save your code when creating a stored compiled macro catalog. In order to update a stored compiled macro, you must recompile the macro. The only way to do this is to submit the macro definition again. Another important fact is that a stored compiled macro catalog can be used only on the same operating system and release of SAS that it was created on. So, in order to use a stored compiled macro on another operating system or release of SAS, that macro must be compiled in the new environment. Again, the only way to compile the macro is to resubmit the macro definition.

Save the Macro Source Code

To make it easier for you to save your code, the %MACRO statement contains the SOURCE option. When you create a stored compiled macro, the SOURCE option stores the macro definition as part of a catalog entry in the SASMACR catalog in the permanent SAS library listed on the SASMSTORE= system option.

Here is the syntax needed to create a stored compiled macro with the SOURCE option set:

libname mymacs 'c:\my macro library';   ❶                                                                                                
options mstored sasmstore=mymacs;       ❷                                                                                              
 
%macro test / store source;             ❸                                                                                                          
 
  libname mylib1 'path-to-my-first-library';                                                                                            
  libname mylib2 'path-to-my-second-library';                                                                                           
 
%mend;

 

❶ The LIBNAME statement points to the SAS library that will contain my stored compiled macro catalog.

❷ The MSTORED system option enables the stored compiled facility. The SASMSTORE= option points to the libref that points to the macro library.

❸ The STORE option instructs the macro processor to store the compiled version of TEST in the SASMACR catalog in the library listed in the SASMSTORE= system option. The SOURCE option stores the TEST macro definition in the same SASMACR catalog.

Note that the contents of the SASMACR catalog do not contain an entry for the macro source. The source has been combined with the macro entry that contains the compiled macro. To verify that the source has been saved, add the DES= option to the %MACRO statement. The DES= option enables you specify a description for the macro entry in the SASMACR catalog. So for example, you could add the following description when compiling the macro to indicate that the source code has been saved:

%macro test / store source des=’Source code saved with entry’;

 

You can look at the contents of the macro catalog using the CATALOG procedure:

proc catalog cat=a.sasmacr;                                                                                                            
contents;                                                                                                                               
run;                                                                                                                                    
quit;

 

You see the description indicating that the source code was saved with the macro entry in the output from PROC CATALOG:

Retrieve the Macro Source Code

When you need to update the macro or re-create the catalog on another machine, you can retrieve the macro source code using the %COPY statement. The %COPY statement enables you to retrieve the macro source code and write the code to a file. Here is the syntax:

%copy test / source outfile='c:\my macro library\test.sas';

 

This %COPY statement writes the source code for the TEST macro to the TEST.SAS file. Using TEST.SAS, you are now able to update the macro or compile the macro on another machine.

Remember, you should always save your source code when creating a stored compiled macro. Without the source code, you will not be able to update the macro or move the macro to a new environment.

Here are the relevant links for this article:

Always save your code when creating a stored compiled macro was published on SAS Users.

3月 122016
 

ProblemSolversBeing able to access information about your operating system from within SAS can be incredibly useful. For example, if you have a directory that contains an unknown number of CSV files that you would like to read into SAS, you certainly would not want to have to write an IMPORT procedure for each of the files. It would be great if SAS just knew the names of all the files as well how many there were. That would make this task much less tedious.

Fortunately, there is a straightforward way to accomplish this task within SAS. One of the easiest ways to gather information from your operating system from within SAS is to use a pipe (|). Pipes enable your SAS application to do the following:

  • Receive input from any operating-system command that writes to standard output.
  • Route output to any operating-system command that reads from standard input.

In other words, a pipe can be used to capture the output of an operating-system command and put that output into SAS, which gives you the ability to easily run PROC IMPORT for each CSV file in your directory.

Before we take a look at an example, let’s look at the syntax needed to use a pipe. To use a pipe on a directory-based operating system, you just need to issue a FILENAME statement with the following syntax:

FILENAME fileref PIPE 'operating-system-commandoption-list;

Here is additional information about the syntax above:

fileref: Can be any valid fileref, as described in Referencing External Files.

PIPE: Is the device-type keyword that tells SAS that you want to use an unnamed pipe.

‘operating-system-command’: Is the name of an operating-system command, executable program, or shell script that you want to route output to or that you want to read input from. You must enclose the command in quotation marks.

option-list: Can be any of the options that are valid in the FILENAME statement, such as the LRECL= or RECFM= options.  Options that affect the operating-system command can also be listed here.

So let’s take a look at a pipe in action.

Using a pipe to return the output of an operating system command to SAS® software

In the following example, I have an unknown number of CSV files in the MYFILES subdirectory. I want to read in each file and create an output data set that has the same name as the CSV file. Also, I am only interested in the files that were created within the last year.

To do this import, I am going to use a pipe, which enables me to use the DIR operating-system command to list all the files in the directory. The output of the DIR command shows the filename and its create date. Based on the create date, I can pass in each file =name to a macro that contains PROC IMPORT. Then, PROC IMPORT reads in each file and outputs it to a data set with the desired name.

Here is the code:

Note: The code below contains numbers that are enclosed in parentheses; these numbers correspond to further explanation below the code block.

%let mydir=c:myfiles;       (1)                                                                                                             
                                                                                                                                        
%macro mymac(fname,dsname);   (2)                                                                                                          
                                                                                                                                        
proc import datafile="&amp;mydir&amp;fname" out=&amp;dsname                                                                                        
dbms=csv replace;                                                                                                                       
getnames=yes;                                                                                                                           
run;                                                                                                                                    
                                                                                                                                        
%mend;                                                                                                                                  
                                                                                                                                        
filename myfiles pipe "dir &amp;mydir /T:C";   (3)                                                                                                  
                                                                                                                                        
data _null_;                                                                                                                            
  infile myfiles truncover;                                                                                                             
  input;      (4)                                                                                                                          
  if index(lowcase(_infile_),'.csv') then do;    (5)                                                                                       
    date=input(scan(_infile_,1,' '),mmddyy10.);  (6)                                                                                       
    fname=substr(_infile_,38);                   (7)                                                                                                        
    dsname=compress(tranwrd(strip(scan(fname,1,'.')),'_',' '),,'p');  (8)                                                                 
    dsname=tranwrd(strip(dsname),' ','_');                                                                                              
    if intck('year',date,today())&lt;=1 then do;    (9)                                                                                         
      call=cats('%mymac(',fname,',',dsname,')');                                                                                        
      call execute(call);                        (10)                                                                                       
    end;                                                                                                                                
  end;                                                                                                                                  
run;

  1. The MYDIR macro variable lists the directory that contains the CSV files.
  2. The MYMAC macro contains the PROC IMPORT code needed to read in the CSV files. The macro contains the parameters FNAME and DSNAME. FNAME is the name of the CSV file being read in, and DSNAME is the name of the output data set being created.
  3. The FILENAME statement uses a pipe, which enables me to use the DIR operating-system command. This command shows the filename and the create date. Note that the ‘/T:C’ option is needed to show the create date. The default date returned by the DIR operating-system command is the last-modified date.
  4. The INPUT statement does not list any variables, so it loads the entire record into the input record buffer.
  5. _INFILE_ is an automatic variable in the DATA step that contains the contents of the current input record buffer. This buffer is created by the INPUT statement. In this example, I use the INDEX function to determine if the file has a .csv extension.
  6. When you run the DIR command, dates are always the first value listed. I use the SCAN function to extract the date and the INPUT function to convert the extracted date to a SAS date format.
  7. Before you can extract the filename, you have to search the output from the DIR command to find the column number that contains the filename value. In this example, the filename starts in the 38th column of the output file. The SUBSTR function specifies this column number and enables me to extract the filename.
  8. DATA-step logic is used to create the output data-set name. To make sure that there are no invalid characters in the filename like spaces and punctuation other than an underscore, I use the TRANWRD and STRIP functions and specify what characters should not be included.
  9. I use the INTCK function to select values that contain a create date that are a year or less from today’s date before I execute the macro.
  10. CALL EXECUTE invokes the MYMAC macro. CALL EXECUTE enables me to pass in the values of FNAME and DSNAME as the values of the macro’s parameters and conditionally execute the macro based on DATA-step logic.

As this example shows, using a pipe to capture the output of the DIR operating-system command can be extremely useful. In this example, it enables me to dynamically run code based on an unknown number of files that meet a certain condition. Using a pipe to get the file information for you can drastically reduce the amount of time it would take to perform this task manually.

 

 

tags: Problem Solvers, SAS Programmers

Using a pipe to return the output of an operating system command to SAS® software was published on SAS Users.

1月 272015
 

SAS Technical Support Problem Solvers“Here’s Johnny!!!” and well sometimes John and sometimes Jonathan and sometimes Jon.

In the real world, you sometimes need to make matching character strings more flexible. This is especially common when merging data sets. Variables, especially names, are not always exactly the same in all sources of your data. When matching data, you need to be able to programmatically determine if ‘John Doe’ is the same as ‘Johnny Doe’. The term most often associated with this type of matching is ‘fuzzy matching’. Fortunately within SAS, there are several functions that allow you to perform a fuzzy match. I’ll show you the most common of these functions and then I will show you an example that uses my favorite from this list.

COMPARE Function

The COMPARE function returns the position of the leftmost character by which two strings differ, or returns 0 if there is no difference. This function is most useful when comparing two strings that should be the same, but may only differ in case or due to leading or trailing blanks. This function takes the place of the following code:

if strip(upcase(name1))=strip(upcase(name2)) then do;

The COMPARE function can be used to make the same comparison.

if compare(name1,name2,’il’)=0 then do;

COMPGED Function

The COMPGED function returns the generalized edit distance between two strings. Specifically, the COMPGED function returns a generalization of the Levenshtein edit distance, which is a measure of dissimilarity between two strings. The Levenshtein edit distance is the number of operations (deletions, insertions, or replacement) of a single characters that are required to transform string-1 into string-2. 

Each operation basically ‘costs’ a certain value. For example, if string-1 is the result of inserting a space into string-2, this has a cost of 10. The more dramatic the operation, the greater the cost.  The COMPGED will return the total cost for all operations that occur. The costs returned by COMPGED can be altered by using CALL COMPCOST so that the cost are specific to your needs. A common use I have seen for using the COMPGED function is using it to compare email addresses.

email1='JohnDoe@abc.com';                                                                                                        
email2='John_Doe@abc.com';                                                                                                          
cost=compged(email1,email2);

The value of COST will be 30 which is the cost of adding punctuation to a string.

COMPLEV Function

The COMPLEV function is very similar to the COMPGED function. The difference is that the Levenshtein edit distance that is computed by COMPLEV is a special case of the generalized edit distance that is computed by COMPGED. The result is the COMPLEV executes much more quickly than COMPGED. However, the COMPLEV function is not as powerful or versatile as the COMPGED function. The COMPLEV function is generally most useful when comparing simple strings and when speed of comparison is important.

SPEDIS Function

The SPEDIS function is the oldie-but-a-goodie of the bunch. The SPEDIS function determines the likelihood of two words matching, expressed as the asymmetric spelling distance between the two words. SPEDIS is similar to COMPGED in that it assigns a cost to the each operation such as swap, append and delete. SPEDIS will sum the costs and then divide the sum by the length of the first argument. It is important to remember this because this makes the order of the arguments important. This means that SPEDIS(X,Y) does not always equal SPEDIS(Y,X).

Summary: Review your data

Any process that is described as ‘fuzzy’ is obviously not an exact science. With each of these functions using different algorithms, each is going to have its own strengths.

  • If you are comparing complex strings and need the most control, then I would recommend looking at the COMPGED function.
  • If you are comparing fairly simple strings and within large data sets, then COMPLEV may be a better choice.

The bottom line is that you will have to review your data and what differences are important to you. The results returned by these functions are subjective. You have to determine what an acceptable difference is. I suggest that you simply test out each function to see which one works best for you.

COMPGED and SOUNDEX Example

Of the above functions, COMPGED is the one I tend to use most often because, for the scenarios brought to me by various customers, COMPGED has produced the most precise results. I have been able to get the best results by combining the COMPGED function with the SOUNDEX function. Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

The below example encodes each of the first names that is to be compared with the SOUNDEX function and then evaluates the results of using SOUNDEX with the COMPGED function. The goal of this example is to determine if the value of FNAME is the same as ‘Johnathan’ or ‘William’.

/* Sample data that contains names and the class being taken */                                                                         
data class;                                                                                                                             
  input fname : $12. lname $ class : $9.;                                                                                               
  datalines;                                                                                                                            
Jon Smith Math                                                                                                                          
John Smith Math                                                                                                                         
Johnny Smith Math                                                                                                                       
James Smith Math                                                                                                                        
Will Miller Chemistry                                                                                                                   
Willy Miller Chemistry                                                                                                                  
Willie Miller Chemistry                                                                                                                 
Bonny Davis Gym                                                                                                                         
Milly Wilson Biology                                                                                                                    
;                                                                                                                                       
                                                                                                                                        
/*  Data set that contains the name and the grade for the class */                                                                      
data grade;                                                                                                                             
  input fname : $12. lname $ grade $;                                                                                                   
  datalines;                                                                                                                            
Johnathan Smith A                                                                                                                       
William Miller B                                                                                                                        
;                                                                                                                                       
                                                                                                                                        
                                                                                                                                        
/* This code uses both the SOUNDEX and COMPGED functions.  I have found */                                                              
/* that by comparing the strings produced by SOUNDEX, I have been able */                                                               
/* to get a tighter comparison.                                        */                                                               
data c;                                                                                                                                 
  set class;                                                                                                                            
  tmp1=soundex(fname);                                                                                                                  
  do i=1 to nobs;                                                                                                                       
    set grade(rename=(fname=fname2)) point=i nobs=nobs;                                                                                 
    tmp2=soundex(fname2);                                                                                                               
    dif=compged(tmp1,tmp2);                                                                                                             
    if dif<=50 then do;                                                                                                                 
      possible_match='Yes';                                                                                                             
      drop  i tmp1 tmp2  fname2;                                                                                                        
      output;                                                                                                                           
    end;                                                                                                                                
  end;                                                                                                                                  
run;                                                                                                                                    
proc print; run;

One of the great things about SAS is the number of different ways you can accomplish a given goal.  If you have a technique to make fuzzy comparisons that you like to use, please share.  I would love to see some of the techniques being used!

tags: character data, fuzzy match, Problem Solvers, SAS functions, soundex
4月 182014
 

Some very common questions we receive into Technical Support are, “I need to be able to send the data for every hospital listed in my data set to a separate .csv file. How can I do that?” Or, “How can I run a PROC REPORT for each region listed in my data set?” And, “How can I send an email to each manager listed in my data set?”

Each one of these questions has something in common. Each question is asking how to execute a task based on a data set variable. Within SAS, the best way to execute a repetitive task is to use the SAS Macro language. The best way to invoke a macro based on a data set variable is to use CALL EXECUTE. The two most important aspects of CALL EXECUTE are that it allows you to:

  • conditionally execute a macro based on DATA step logic
  • pass in a data set variable’s value in as the value to a macro parameter

Two common errors when invoking macro statements based on DATA step logic
Because CALL EXECUTE is a DATA step call routine, it can be conditionally executed using DATA step logic. It is important to understand the basic timing of a macro statement. A macro statement executes a DATA step at compile time. This means that macro code executes before any DATA step logic is ever evaluated. Here is an example of logic that will not work. Because of the timing of macro statements, this macro invocation will always execute, regardless of the IF condition. CALL EXECUTE can be used to accommodate for this difference in timing.

If first.hospital_id then do;
  %mymacro(hospital_id)
End;

The above example also illustrates another common mistake when attempting to invoke a macro based on DATA step logic. This example is attempting to pass in the value of the data set variable HOSPITAL_ID as the parameter to the macro. It is important to know that the macro language is simply a text language. When the MYMACRO macro executes, the text ‘hospital_id’ is passed in as the value of the macro parameter, not the value of the data set variable HOSPITAL_ID. Again the desired result can be achieved using CALL EXECUTE.

How to invoke a macro using CALL EXECUTE
So now let’s see the correct way to invoke the macro using CALL EXECUTE. In the following example, we are going to dynamically create a .csv file for each hospital listed in our data set. This will be accomplished by placing a PROC EXPORT within a macro. CALL EXECUTE will allow to invoke the macro for each hospital listed in the HOSPITALS data set.

/* sample data set that contains the hospital ids */ 
data hospitals;                                                
  input hospital_id $ quarter number_of_patients;                               
  datalines;                                      
A100 1 125                                    
A100 2 115                                                                                                                              
A100 3 130                                                                                                                              
A100 4 110                                                                                                                              
A200 1 200                                                                        
A200 2 195 
A200 3 180                                                        
A200 4 190                                       
;                                                

/* this macro uses PROC EXPORT to write data to a .csv file */

%macro mymacro(hosp_id);                                                                                                                

proc export data=hospitals(where=(hospital_id="&hosp_id"))                                                                                  
  outfile="c:\hospitals\&hosp_id..csv"                                                                                                  
  dbms=csv                                                                                                                              
  replace;                                                                                                                              
run;                                                                                                                                    

%mend;                                                                                                                                  

data _null_;                                                                                                                            
  set hospitals;                                                                                                                        
  by hospital_id;

/* this logic allows us to call the MYMACRO macro 
each time the value of HOSPITAL_ID changes */   

  if first.hospital_id then do;
    call execute(cats('%mymacro(',hospital_id,');')); 
  end;                                                                  
run;

This logic will work regardless of the number of hospitals listed and of the number of quarters listed for each hospital. When this code is executed, the following macro invocations are generated.

First time the IF condition is true:
%mymacro(A100)
Second time IF condition is true:
%mymacro(A200)

How the CALL EXECUTE routine processes macros
It is important to know what happens behind the scenes when using CALL EXECUTE. When the argument to CALL EXECUTE is a macro invocation, the macro will execute immediately. This means that all macro statements are executed and all macro variable references are resolved. All non-macro code generated by the macro is stored and does not execute until after the step boundary that terminates the DATA step that contains the CALL EXECUTE is encountered. The timing of when statements are executed can sometimes lead to unresolved macro variables, which in turn can lead to unexpected syntax errors. This issue is discussed in detail in SAS Usage Note 23134. The note discusses the cause of the potential problem as well as the solution.

CALL EXECUTE is an invaluable tool that should be added to your macro programming toolbox. CALL EXECUTE allows you to make your macro invocations data driven, which can make your programming more efficient and adaptable.

tags: CALL EXECUTE, macro, Problem Solvers, SAS Programmers