PROC FORMAT

9月 072020
 

Locale-specific SAS® format catalogs make reporting in multiple languages more dynamic. It is easy to generate reports in different languages when you use both the LOCALE option in the FORMAT procedure and the LOCALE= system option to create these catalogs. If you are not familiar with the LOCALE= system option, see the "Resources" section below for more information.

This blog post, inspired by my work on this topic with a SAS customer, focuses on how to create and use locale-specific informats to read in numeric values from a Microsoft Excel file and then transform them into SAS character values. I incorporated this step into a macro that transforms ones and zeroes from the Excel file into meaningful information for multilingual readers.

Getting started: Creating the informats

The first step is to submit the LOCALE= system option with the value fr_FR. For the example in this article, I chose the values fr_FR and en_US for French and English from this table of LOCALE= values. (That is because I know how to say “yes” and “no” in both English and French — I need to travel more!)

   options locale=fr_fr;

The following code uses both the INVALUE statement and the LOCALE option in PROC FORMAT to create an informat that is named $PT_SURVEY:

   proc format locale library=work;
      invalue $pt_survey 1='oui' 0='non'; run;

Now, toggle the LOCALE= system option and create a second informat using labels in a different language (in this example, it is English):
options locale=en_us;

   proc format locale library=work;
      invalue $pt_survey 1='yes' 0='no';
   run;

In the screenshot below, which shows the output from the DATASETS procedure, you can see that PROC FORMAT created two format catalogs using the specified locale values, which are preceded by underscore characters. If the format catalogs already exist, PROC FORMAT simply adds the $PT_SURVEY informat entry type to them.

   proc datasets memtype=catalog; 
   quit;

Before you use these informats for a report, you must tell SAS where the informats are located. To do so, specify /LOCALE after the libref name within the FMTSEARCH= system option. If you do not add the /LOCALE specification, you see an error message stating either that the $PT_SURVEY informat does not exist or that it cannot be found. In the next two OPTIONS statements, SAS searches for the locale-specific informat in the FORMATS_FR_FR catalog, which PROC FORMAT created in the WORK library:

   options locale=fr_fr;
   options fmtsearch=(work/locale);

If you toggle the LOCALE= system option to have the en_US locale value, SAS then searches for the informat in the other catalog that was created, which is the FORMATS_EN_US catalog.

Creating the Excel file for this example

For this example, you can create an Excel file by using the ODS EXCEL destination from the REPORT procedure output. Although you can create the Excel file in various ways, the reason that I chose the ODS EXCEL statement was to show you some options that can be helpful in this scenario and are also useful at other times.
Use the ODS EXCEL destination to create a file from PROC REPORT. I specify the TAGATTR= style attribute using “TYPE:NUMBER” for the Q_1 variable:

   %let  path=%sysfunc(getoption(WORK));
   filename temp "&path\surveys.xlsx"; 
   ods excel file=temp;
 
 
   data one;
      infile datalines truncover;
      input ptID Q_1;
      datalines;
   111 0
   112 1
   ;
   run;
 
   proc report data=one;
      define ptID / display style(column)={tagattr="type:String"};
      define Q_1 / style(column)={tagattr="type:Number"};
   run;
 
   ods excel close;

Now you have a file that looks like this screenshot when it is opened in Excel. Note that the data value for the Q_1 column is numeric:

The IMPORT procedure uses the DBSASTYPE= data set option to convert the numeric Excel data into SAS character values. Then I can apply the locale-specific character informat to a character variable.

As you will see below, in the macro, I use DBMS=EXCEL in PROC IMPORT to read the Excel file because my SAS and Microsoft Office versions are both 64-bit. (You might have to use the PCFILES LIBNAME Engine to connect to Excel through the SAS PC Files Server if you are not set up this way.)

Using the informats in a macro to create the multilingual reports

The final step is to run the macro with parameters to produce the two reports in French and English, using the locale-specific catalogs. When the macro is called, depending on the parameter value for the macro variable LOCALE, the LOCALE= system option changes, and the $PT_SURVEY informat from the locale-specific catalog is applied. These two tabular reports are produced:

Here is the full code for the example:

   %let  path=%sysfunc(getoption(WORK));
   filename temp "&path\surveys.xlsx";
   ods excel file=temp;
 
   data one;
      infile datalines truncover;
      input ptID Q_1;
      datalines;
   111 0
   112 1
   ;
   run;
 
   proc report data=one;
      define ptID / display style(column)={tagattr="type:String"};
      define Q_1 / style(column)={tagattr="type:Number"};
   run;
 
   ods excel close;
   options locale=fr_fr;
 
   proc format locale library=work;
      invalue $pt_survey 1='oui' 0='non';
   run;
 
   options locale=en_us;
 
   proc format locale library=work;
      invalue $pt_survey 1='yes' 0='no';
   run;
 
   /* Set the FMTSEARCH option */
   options fmtsearch=(work/locale);
 
   /* Compile the macro */
   %macro survey(locale,out);
      /* Set the LOCALE system option */
      options locale=&locale;
 
      /* Import the Excel file  */
      filename survey "&path\surveys.xlsx";
 
      proc import dbms=excel datafile=survey out=work.&out replace;
         getnames=yes;
         dbdsopts="dbsastype=(Q_1='char(8)')";
      run;
 
      data work.&out;
         set work.&out;
 
         /* Create a new variable for the report whose values are assigned by specifying the locale-specific informat in the INPUT function */
         newvar=input(Q_1, $pt_survey.);
         label newvar='Q_1';
      run;
 
      options missing='0';
 
      /*  Create the tabular report */
      proc tabulate data=&out;
         class ptID newvar;
 
         table ptID='Patient ID', newvar*n=' '/box="&locale";
      run;
 
   %mend survey;
 
   /* Call the macros */
   %survey(fr_fr,fr)
   %survey(en_us,en)

For a different example that does not involve an informat, you can create a format in a locale-specific catalog to print a data set in both English and Romanian. See Example 19: Creating a Locale-Specific Format Catalog in the Base SAS® 9.4 Procedures Guide.

Resources

For more information about the LOCALE option:

For more information about reading and writing Excel files:

For more information about creating macros and using the macro facility in SAS:

Using locale-specific format catalogs to create reports in multiple languages was published on SAS Users.

12月 042017
 

During my 35 years of using SAS® software, I have found the CNTLIN and CNTLOUT options in the FORMAT procedure to be among the most useful features that I routinely suggest to other SAS users. The CNTLIN option enables you to create user-defined formats from a SAS data set (input control data set). The CNTLOUT option enables you to create a SAS data set (output control data set) containing format details from an entry in a SAS format catalog.

In this blog post, I provide a few examples demonstrating how to use the CNTLIN option. I also mention how to use the CNTLOUT option to store your format information in case you need to move to a new operating environment.

You can store all the format details from a SAS format catalog in a CNTLOUT data set and later restore them in a format catalog in your new operating environment using the CNTLIN option. For details, see SAS Usage Note 22194: “How to use the CNTLOUT= and CNTLIN= options in PROC FORMAT to move formats from one platform to another.”

A data set for the CNTLIN option contains variables that give specific information about ranges and values. At a minimum, the data set must contain the following variables:

FMTNAME specifies a character variable whose value is the format or informat name.
START specifies a variable that gives the range's starting value.
LABEL specifies a variable whose value is associated with a format or an informat.

For details about input and output control data sets, see the “FORMAT Procedure” section of Base SAS® 9.4 Procedures Guide, Seventh Edition.

Create a Numeric Format

The following simple example using the CNTLIN option creates a numeric format named respf:

 data test;                                         
    input response desc $20.;                       
 datalines;                                         
 1  Strongly Disagree                               
 2  Disagree                                        
 3  Neutral                                         
 4  Agree                                           
 5  Stongly Agree                                   
 ;                                                  
 run;                                               
 
 data crfmt;                                        
    set test;                                       
    start=response;                                 
    label=desc;                                     
    fmtname='respf';                                
 run;                                               
 
 proc format library=work cntlin=crfmt fmtlib;      
    select respf;                                   
 run;

Controlling Your Formats

Reveal Data Set Variables

To see the other variables that are included in data sets created by the CNTLIN and CNTLOUT options, use CNTLOUT to create a data set for the respf format created above:

 proc format library=work cntlout=outfmt;       
    select respf;                               
 run;                                      
 proc print data=outfmt;                        
 run;

Add Additional Ranges

To add another range to the respf format, you can use DATA step processing with the data set created by the CNTLOUT option. Then, re-create the format using the CNTLIN option:

data infmt;                                               
    set outfmt end=last;                                   
    output;                                                
    if last then do;                                       
       HLO='O';  /* indicates a special other range  */      
       label='NA';                                         
       output;                                             
    end;                                                   
 run;                                                     
 
 proc format library=work cntlin=infmt fmtlib;             
    select respf;                                          
 run;

Convert a State Name to Its Postal Abbreviation

One use for the CNTLIN option is to create a format that converts a state name to its 2-letter postal abbreviation. For example, this option can convert 'North Carolina' to 'NC'.  Because SAS does not have a function or format to convert state names to postal abbreviations, this is an excellent use of the CNTLIN option.

We can use data from the SASHELP.ZIPCODE data set to create a user-defined format using the CNTLIN option, as shown below:

proc sql noprint;                               
    create table crfmt as                        
    select distinct statename as start,          
           statecode as label,                   
           '$mystate' as fmtname                 
    from sashelp.zipcode;                        
 quit;                                           
 
 proc format library=work cntlin=crfmt fmtlib;   
    select $mystate;                             
 run;

Identify State Capitals

In a similar manner, we can use the MAPS.USCITY data set to create a user-defined format that identifies state capitals from the 2-letter state abbreviation. See the sample code and partial results below:

proc sql noprint;                                 
   create table crfmt as                          
   select distinct statecode as start,            
          city as label,                          
          '$mycity' as fmtname                    
   from maps.uscity                               
   where capital='Y';                             
 quit;                                            
 
proc format library=work cntlin=crfmt fmtlib;     
   select $mycity;                                
run;

Use External Data Sources

You can gather information from external data sources and read that information into a data set created by the CNTLIN option to create user-defined formats.

The following example uses ICD10 medical diagnosis codes. I downloaded a list of ICD10 codes and their descriptions into a Microsoft Excel file from the Center for Medicare & Medicaid Services website. Then, I created a user-defined format from the first 25 records:
Note: You can also download the codes as a text file.

/* This code reads in the Excel file.   */                                                                  
proc import out==myicd10                                              
   datafile= "C:\Section111ValidICD10-2017.xlsx"   
   dbms=excelcs replace;                                                
   range="'Valid ICD10 2017 & NF Exclude$'";                            
   scantext=yes;                                                        
   usedate=yes;                                                         
   scantime=yes;                                                        
run;                                                                    
 
 
data crfmt;                                         
   set myicd10 (obs=25);                         
   fmtname='$myicd';                                
   start=code;                                      
   label=short_description;                         
run;  
 
title1 'ICD10 Format';                                                      
title3 'FMTLIB results only display the first 40 characters of the label';  
proc format library=work cntlin=crfmt fmtlib;       
   select $myicd;                                   
run;

A more complicated example that uses other data set variables created by the CNTLIN option is included in the linked sample program in Sample 47312: “Create a user-defined format containing decile ranges from PROC UNIVARIATE results.”

If you can think of a scenario in which the CNTLIN format would be helpful, give it a try. If you have questions, you can ask via  SAS Communities or contact us in SAS Technical Support.

Controlling your formats was published on SAS Users.

12月 202016
 

Joining tables with PROC FORMAT

The title of this post borrows from Stanley Kubrick’s 1964 comedy “Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb.” It stars the great Peter Sellers as the titular character as well as George C. Scott and Slim Pickens. The black and white film is strange and brilliant just like Kubrick was. Similarly, as I was experiencing the issue I outline below and was told of this solution, I thought two things. The first was “what a strange but brilliant solution” and the second one was “I’ll try anything as long as it works.”   Thus, a title was born. But enough about that. Why are we here?

Problem

You want to add a couple of columns of information to your already large dataset, but each time you try to join the tables you run out of memory!  For example, you want to append latitude and longitude values from Table B to an existing list of customer phone numbers in Table A.

You’ve tried this and got nowhere fast:

proc sort data = demo.tablea;
by npa nxx;
proc sort data = demo.tableb;
by npa nxx;
run;
 
data demo.aunionb;
merge demo.tablea (in=a) demo.tableb (in=b);
by npa nxx;
if a;
run;

And then you tried this and also got nowhere (albeit a little slower):

proc sql;
   	create table demo.aunionb as 
   	select *,
	from demo.tablea a
left join demo.tableb b on (a.npa = b.npa) and (a.nxx = b.nxx);
quit;

Solution - Joining tables with PROC FORMAT

Use PROC FORMAT!

Here’s how:

First, take Table B and create character equivalents of the fields required in your join (assuming they aren’t characters already). In this example, NPA and NXX are the two fields that you are joining on. They will be your key once you concatenate them.  Next, create character equivalents of the fields that you want appended.

data work.tableb (keep = npa_nxx--nxx_c); 
set demo.tableb; 
 
npa_c = compress(put(npa, best10.));
nxx_c = compress(put(nxx, best10.));
 
npa_nxx = catx('_',npa_c, nxx_c);
 
lat_c = compress(put(latitude, best14.3)); 
long_c = compress(put(longitude, best14.3)); 
run;

Next, make sure that you have only unique values of your key. Use PROC SORT with OPT=noduprecs turned on.

Now, create a table that will be used as the input into PROC FORMAT. In this example, you are creating a table that will contain the formats for the latitude column.

proc sort data = work.tableb noduprecs;
by npa_nxx;
 
data demo.tableb_lat_fmt(keep=fmtname type start label); 
retain fmtname 'lat_f' type 'C'; 
set work.tableb; 
 
if npa_nxx = '._.' then start = 'Other  ';
else start = npa_nxx; 
label = lat_c; 
run;
proc sort data = demo.tableb_fmt;
by start;
run;

This step creates a table that includes the format name (lat_f), the format type (C), the key field (start) and its corresponding latitude value (label).  Sort this table by the ‘start’ column and then repeat this step for every column you wish to append, with each column getting its own unique format and table.

Now run PROC FORMAT using the CNTLIN option pointing to the tables that you just created in order to create your format.

proc format cntlin=demo.tableb_lat_fmt; 
run; 
proc format cntlin=demo.tableb_long_fmt; 
run;

Now all you have to do is run your data step to create the resultant dataset with the appended values.

data demo.aunionb (drop = npa_nxx); 
set demo.tablea; 
 
npa_nxx = catx('_',compress(put(npa,best10.)),compress(put(nxx, best10.)));
 
latitude = input(put(npa_nxx, $lat_f.), BEST.); 
longitude = input(put(npa_nxx, $long_f.), BEST.);
 
run;

This step creates 3 columns: npa_nxx, latitude, and longitude. Npa_nxx is the key built from the NPA  and NXX values. Latitude and longitude are then populated with the formatted value of npa_nxx, which in this case is the character equivalent of the original latitude or longitude. It also formats the value back into a numeric field.

The result is a clever way to add columns to a dataset, much like a VLOOKUP function works in Microsoft Excel, without the hassle of running out of memory space.

Notes:

  1. The author realizes there are other, more boring ways of tackling this issue like indexing and using WHERE statements, but where’s the fun in that?
  2. This solution may not be right for you. See your doctor if you experience any of the following symptoms:  memory loss, headache, frustration, Cartesian rage, cranial-keyboard embedment or memory loss.
tags: Global Technology Practice, Joining tables, PROC FORMAT, SAS Programmers, tips & techniques

Dr. Strangeformat or: How I Learned to Stop Joining Tables and Love the PROC was published on SAS Users.

11月 152016
 

Rick Wicklin showed us how to visualize the ages of US Presidents at the time of their inaugurations. That's a pretty relevant thing to do, as the age of the incoming president can indirectly influence aspects of the president's term, thanks to health and generational factors.

As part of his post, Rick supplied the complete data set for US Presidents and their birthdays. He challenged his readers to create their own interesting visualizations, and that's what I'm going to do here. I'm going to show you the distribution of US Presidents by their astrological signs.

Now, you might think that "your sign" is not as relevant of a factor as Age, and I certainly hope that you're correct about that. But past presidents have sought the advice of astrologers, and zodiac signs can influence the counsel such astrologers might offer. (Famously, Richard Nixon took advice from celebrity psychic Jeane Dixon. First Lady Nancy Reagan also sought her advice, and we know that Mrs. Reagan in turn influenced President Reagan.)

Like any good analyst, I mostly reused existing work to produce my results. First, I used the DATA step that Rick provided to create the data set of presidents and birthdays. Next, I reused my own work to create a SAS format that displays a zodiac sign for each date. And finally, I wrote write a tiny bit of PROC FREQ code to create my table and frequency plot.

data signs;
 /* So this column appears first */
 retain President;
 length sign 8;
 /* SIGN. format created earlier with PROC FORMAT */
 format sign sign.;
 set presidents (keep=President BirthDate InaugurationDate);
 /* convert birthday to our normalized SIGN date */
 sign = mdy(month(birthdate),day(birthdate),2000);
run;
 
ods graphics on;
proc freq data=signs order=freq;
tables sign / plots=freqplot;
run;

To keep things a bit fresh, I did all of this work in SAS University Edition using the Jupyter Notebook interface. Here's a glimpse of what it looks like:

procprintsigns
And here's the distribution you've all been waiting to see. When he takes office, Donald Trump will join George H. W. Bush and JFK in the Gemini column.

signspres
I've shared the Jupyter Notebook file as a public gist on GitHub. You can download and import into your own instance if you have SAS and Jupyter Notebook working together. (Having trouble rendering the notebook file? Try looking at it through the nbviewer service. That usually works.)

tags: Jupyter, proc format, SAS University Edition, zodiac

The post Zodiac signs of US Presidents appeared first on The SAS Dummy.

6月 112014
 

Now, the Star-Belly Sneetches
Had bellies with stars.
The Plain-Belly Sneetches
Had none upon thars.
- from "The Sneetches", by Dr. Seuss

Recently a user on the SAS-L mailing list had this challenge: "I would like to display stars in a table (created by PROC REPORT) based on variable values. For example, if value=3, then display five stars with 3 in yellow, and the other two in grey."

In her original post, the user had sketched out an approach that used a custom SAS format with uppercase and lowercase Xs representing the desired "star" states. It turns out that this was very close to a working approach. All she needed to do was apply the same technique that I used to add Harvey Balls to a SAS report. Like the Fix-It-Up Chappie, I'm here to offer the solution.

In my Harvey Balls example, I used ODS ESCAPECHAR and Unicode character codes to add extended characters to my SAS format labels. There is a Unicode character for a star (HEX 2605), so the same approach can work. In addition, the ESCAPECHAR sequence can set off style instructions, such as a color definition. This allows you to control the character color "mid-stream". For example, this sequence specifies a red star followed by a gray star:

 ~{style [color=red] ~{unicode '2605'x}} ~{style [color=ygr] ~{unicode '2605'x}}

As it happens, I keep some movie rating data that I downloaded from my Netflix account history (a very rich source of trivial SAS examples). When I apply a custom "star" format to my data, I can produce a report like this:
starsthars
Here's my SAS format and the PROC PRINT code that references it. The format and report look perfect in HTML, PDF and RTF output:

/* ODS EXCAPECHAR needed to set style/unicode cues */
ods escapechar='~';
 
/* Captured these in macro variables for readability and */
/* easy maintenance                                      */
%let graystar = ~{style [color=ygr] ~{unicode '2605'x}};
%let redstar =  ~{style [color=red] ~{unicode '2605'x}};
 
/* using a 4-star system where 1 means "zero stars" */
/* and 5 means the full "4 stars"                   */
proc format lib=work;
value stars
 1 = "&graystar.&graystar.&graystar.&graystar."
 2 = "&redstar.&graystar.&graystar.&graystar." 
 3 = "&redstar.&redstar.&graystar.&graystar." 
 4 = "&redstar.&redstar.&redstar.&graystar." 
 5 = "&redstar.&redstar.&redstar.&redstar." 
;
 
title "Movie titles with STARS on THARS";
proc print data=work.movies noobs;
  format rating stars.;
run;

If you want to try it yourself, you can download my program with test data:

>> Complete SAS program with test data

This program should work in SAS Display Manager, SAS Enterprise Guide (select HTML output), and even the SAS University Edition (SAS Studio).

tags: ods escapechar, proc format
11月 022012
 
This week's SAS tip is from Ron Cody and his book Cody's Data Cleaning Techniques Using SAS, Second Edition. Ron is the popular author of several bestselling SAS books and has been a SAS user since 1977. Visit Ron's author page for lots of bonus content, including a free chapter from his [...]
3月 212012
 
learn proc format:




data test;
input start $ end $ label;
cards;
nextag nextag 1
#sina #sina 2
#1000 #1000 3
;
run;

data fmt0;
retain fmtname'$testfmt' ;
set test end=last;
start=start;
end=end;
label=label;
output;
if last then do;
hlo='O';
label=0;
output;
end;
run;

proc print data=fmt0;
run;

proc format cntlin=fmt0;
select $testfmt;
run;

data test;
input test $20.;
cards;
nextag
#sina
&ok
#1000
wokong
;
run;

data test_result;
set test;
id=put(test, $testfmt.)+0;
run;

proc print data=test_result;
run;



The output is:



Obs test id

1 nextag 1
2 #sina 2
3 &ok 0
4 #1000 3
5 wokong 0


 Posted by at 2:04 下午
3月 162012
 
The question is: after calculate the percentage, we want to count from how many obs are in 0-1%, how many in 1%-2%, ..., how many in 99%-100%. For this simple example, we can use round or floor or ceil to get the result. But here shows how to use proc format to get it.

After we get cumulative percentage, we format the percentage with put function. Take care in the proc format don't forget to add hlo, otherwise it will gives error since the first start number is missing.



data test;
do i=1 to 1000;
x=ranpoi(8,8);
output;
end;
run;

data test;
set test;
y+x;
run;

proc sql;
create table test as
select x, y, y/max(y) as pct
from test;
quit;

data test2;
do i=1 to 100;
x=i/100;
y=lag(x);
output;
end;
run;

data a;
set test2;
fmtname='fmttestf';
start=y;
end=x;
label=i;
eexcl='Y';
if _n_=1 then hlo='L';
run;

proc format cntlin=a;
select fmttestf;
run;

data final;
set test;
rank=put(pct, fmttestf.)+0;
run;

proc print data=final;
run;

proc sql;
select rank, count(1) as cnt
from final
group by rank
order by rank;
quit;

 Posted by at 2:12 下午
9月 142011
 

Rick Wicklin and I are engaged in an arms race of birthday-related blog posts.  To recap:

Now I have no choice but to respond again.  This isn't my fault. I didn't start this.

Today, I'm going to take the data that Rick supplied and attempt to answer the deep scientific question, "What is our (zodiac) sign?"

You might as well ask the ever-reliable desktop scientist, the Magic 8-Ball: "Reply hazy, try again."  As you can see from the PROC FREQ output below, there doesn't appear to be a clear dominant horoscope that might be influencing our collective fate.

Attributes of Virgo (meticulous and reliable) and Taurus (warm-hearted and loving) are definitely reflected in our corporate culture, but I'm not sure that these signs are concentrated enough within our population to affect that.  Perhaps the most useful thing that came out of this exercise is my user-written SAS format that equates birthdates to signs of the zodiac.  Here's the complete program, which you ought to be able to run as-is from SAS 9.2 or later, or in SAS Enterprise Guide.

filename bdays url 
    "http://blogs.sas.com/content/iml/files/2011/09/SASBirthdays.csv"
  /* behind a corporate firewall? don't forget the PROXY= option here */
  ;
 
/* SAS format for zodiac signs in a given year */
proc format lib=work;
	value sign
	'21Mar2000'd - '19Apr2000'd = 'Aries'
	'20Apr2000'd - '20May2000'd = 'Taurus'
	'21May2000'd - '20Jun2000'd = 'Gemini'
	'21Jun2000'd - '22Jul2000'd = 'Cancer'
	'23Jul2000'd - '22Aug2000'd = 'Leo'
	'23Aug2000'd - '22Sep2000'd = 'Virgo'
	'23Sep2000'd - '22Oct2000'd = 'Libra'
	'23Oct2000'd - '21Nov2000'd = 'Scorpio'
	'22Nov2000'd - '21Dec2000'd = 'Sagittarious'
	/* split Capricorn to make two valid ranges */
	/* that don't span the calendar boundary */
	'22Dec2000'd - '31Dec2000'd = 'Capricorn'
	'01Jan2000'd - '19Jan2000'd = 'Capricorn'
	'20Jan2000'd - '18Feb2000'd = 'Aquarius'
	'19Feb2000'd - '20Mar2000'd = 'Pisces'
	other = 'Unknown';
run;
 
data bdays;
	infile bdays dsd firstobs=2;
	input mon day;
	length birthdate 8 birthsign 8;
	format birthdate date5.;
	format birthsign sign.;
	label birthsign="Zodiac sign";
	/* make sure we pick a leap year, so 29Feb is valid */
	birthdate = mdy(mon,day,2000);
	birthsign = birthdate;
run;
 
ods graphics on / height=400 width=800;
title "What's our sign?";
ods noproctitle;
proc freq data=bdays order=data;
	tables birthsign /plots=freqplot(scale=percent);
run;
 
/* clear the filename */
filename bdays;

I'm sure that the question has occurred to you: what about my Facebook friends?  How are they distributed among the stars?  Of those that report their birthdays, here is how they fall:

As you can see, there really isn't a dominant sign among them.  However, Aquarius might be a bit underrepresented.  That's a shame, because I could probably use more honest and loyal people among my friends.

tags: facebook, magic 8-ball, proc format, proc freq, zodiac
11月 292010
 

Introduction: Merging two or multiple datasets is essential for many ‘data people’. Yes, it is a dirty and routine job. Everyone wants to get it done quick and accurate. Actually, SAS has many ways to tackle this job[3]. In two competing papers from SAS Global Conference 2009, Qinfeng Liang[1] described five ways to marge a base table and a lookup table regarding the healthcare industry, while David Franklin[2] pictured eight methods to combine patient and effect datasets in a typical pharmaceutical scenario. Here I would like to extend the discussion further: one base table and two lookup tables. I would like to see which one of the solutions would cost less hardware resource and, most importantly, system time.

Method: The base table was generated with 10 million sequential numbers. Two subset tables were randomly chosen from the base table and kept unsorted until a method was applied, and each contains 1 million records. Five methods, Proc SQL, Data step Merge, Proc Format, Data step Hash object and Data step key-set were utilized and compared. System time was summed for each method. The requirement for memory was recorded.

Result: As expected, methods related to data step consume less memory. Key-data structure methods, including Proc SQL, Proc Format and Data step Hash object, ask much more memory. However, in-memory processing does not deliver much help to the time. Proc Format and Data step Hash object still spend above-average time, while obviously Data step set-key costs most time in waiting.

Discussion: Amazingly, even as an ancient technology (maybe 40 years old), Data step Merge is the winner in this competition with both satisfied time and least memory usage. It is also code-efficient. Proc SQL is the second choice. Hash object doesn’t show its edge as other authors suggested. Proc format and Data step key-set are the least favored ones. I also tried Data step Array, and I found that it was very difficult to load the lookup table and eventually I gave up the attempts. The solution by Proc Format is hard to code for multiple table joining, since each has to build an individual format and Proc Format has no batch mode. In conclusion, the choice of the best method depends on specific needs or situation. Old methods, like Data step Merge, can still perform as well as others do, sometimes even better.

References: 1. Qingfeng Liang. Choosing the Right Technique to Merge Large Data Sets Efficiently. SAS Global Forum 2009
2. David Franklin. Merging Data Eight Different Ways. SAS Global Forum 2009
3. SAS® Certification Prep Guide: Advanced Programming for SAS®9. SAS Institute Inc. 2007

********************GENERATE THREE TABLES TO JOIN BY SIMULATION****************;
********************1. GENERATE THE BASE TABLE;
data base;
do number=1 to 1E7;
output;
end;
run;

********************2. GENERATE THE SUB TABLE TO BE JOINED;
proc sql outobs=1000000;
create table sub1 as
select number
from base
order by ranuni(43234);
create table sub2 as
select number
from base
order by ranuni(45954);
quit;
********************END OF SIMULATION ****************;


*******************MERGE STEPS******************;
*******************1. PROC SQL;
proc sql;
create table sqlmerge as
select a.number, b.number as var1, c.number as var2
from base as a left join sub1 as b on a.number = b.number
left join sub2 as c on a.number=c.number;
quit;

*******************2. DATA STEP MERGE;
proc sort data=sub1 out=sub1_s;
by number;
run;
proc sort data=sub2 out=sub2_s;
by number;
run;
data datastepmerge;
merge base(in=a) sub1_s(in=b) sub2_s(in=c);
by number;
if b then var1=number;
if c then var2=number;
if a;
run;
/*ALTERNATIVE 1 -- USE DATA SET INDEX TO APPLY DATA STEP MERGE*/
/*
proc sql;
create index number on base;
create index number on sub1;
create index number on sub2;
quit;
data datastepmerge;
merge base(in=a) sub1(in=b) sub2(in=c);
by number;
if b then var1=number;
if c then var2=number;
if a;
run;
*/
/*ALTERNATIVE 2 -- USE DATA SET VIEW TO APPLY DATA STEP MERGE*/
/*
proc sql;
create view Vbase as select * from base order by number ;
create view Vsub1 as select * from sub1 order by number;
create view Vsub2 as select * from sub2 order by number;
quit;
data datastepmerge;
merge Vbase(in=a) Vsub1(in=b) Vsub2(in=c);
by number;
if b then var1=number;
if c then var2=number;
if a;
run;
*/

*******************3. PROC FORMAT;
data sub1_fmt;
length start $12;
retain fmtname "sub1_fmt";
set sub1 end=lastobs;
do _n_=1 until (lastobs);
start=input(number, $12.);
label=start;
drop number;
output;
end;
if lastobs then do;
start ='other' ;
label= '.';
output;
end;
run;
proc format cntlin=sub1_fmt;
run;

data sub2_fmt;
length start $12;
retain fmtname "sub2_fmt";
set sub2 end=lastobs;
do _n_=1 until (lastobs);
start=input(number, $12.);
label=start;
drop number;
output;
end;
if lastobs then do;
start ='other' ;
label= '.';
output;
end;
run;
proc format cntlin=sub2_fmt;
run;

data formatmerge;
set base;
var1=put(number, sub1_fmt.);
var2=put(number, sub2_fmt.);
run;

*******************4. DATA STEP HASH OBJECT;
data hashobjmerge;
if _n_=1 then do;
declare hash sub1_h(dataset: 'sub1');
sub1_h.defineKey('number');
sub1_h.defineDone();
declare hash sub2_h(dataset: 'sub2');
sub2_h.defineKey('number');
sub2_h.defineDone();
end;
set base;
var1=number;
var2=number;
if sub1_h.find() then call missing(var1);
if sub2_h.find() then call missing(var2);
run;

**********5. DATA STEP KEY-SET;
proc sql;
create table sub1_ix as
select *, number as var1
from sub1;
create index number on sub1_ix;
quit;
data sub2_ix(index=(number /unique /nomiss));
set sub2;
var2=number;
run;
data keysetmerge;
set base;
set sub1_ix KEY=number /unique;
if _IORC_ then do;
_ERROR_=0;
var1='';
end;
set sub2_ix key=number/unique;
if _IORC_ then do;
_ERROR_=0;
var2='';
end;
run;
quit;
*******************END OF COMPARISION************;