2月 052016
 

TL;DR

The next time that you find yourself writing a PROC SORT step, verify that you're working with the SAS Base engine and not a database. If your data is in a database, skip the SORT!

The details: When to skip the PROC SORT step

Many SAS procedures allow you to group multiple analyses into a single step through use of the BY statement. The BY statement groups your data records by each unique combination of your BY variables (yes, you can have more than one), and performs the PROC's work on each distinct group.

When using the SAS Base data engine (that's your SAS7BDAT files), BY-group processing requires data records to be sorted, or at least pre-grouped, according to the values of the BY variables. The reason for this is that the Base engine accesses data records sequentially. When a SAS procedure is performing a grouped analysis, it expects to encounter all records of a group in a contiguous sequence. What happens when records are out of order? You might see an error like this:

ERROR: Data set SASHELP.CLASS is not sorted in ascending sequence. 
The current BY group has Sex = M and the next BY group has Sex = F.

I first described this in 2010: Getting out of SORTs with SAS data.

In a recent post, Rick Wicklin discussed a trick you can use to tell SAS that your data are already grouped, but the group values might not be in a sorted order. The NOTSORTED option lets you avoid a SORT step when you can promise that SAS won't encounter different BY group values interleaved across the data records.

Sorting data is expensive. In data tables that have lots of records, the sort processing requires tremendous amounts of temporary disk space for its I/O operations -- and I/O usually the slowest part of any data processing. But here's an important fact for SAS programmers: a SORT step is required only for SAS data sets that you access using the Base engine*. If your data resides in database, you do not need to sort or group your data in order to use BY group processing. And if you do sort the data first (as many SAS programmers do, out of habit), you're wasting time.

I'm going to use a little SAS trick to illustrate this in a program. Imagine two copies of the ubiquitous CLASS data set, one in a Base library and one in a database library. In my example I'll use the SPDE engine as the database, even though it's not a separate database server. (Yes! You can do this too! SPDE is part of Base SAS.)

/* these 3 lines will create a temp space for your */
/* SPDE data */
/* See: http://blogs.sas.com/content/sasdummy/use-dlcreatedir-to-create-folders/ */
options dlcreatedir;
libname t "%sysfunc(getoption(WORK))/spde"; 
libname t clear;
 
/* assign an SPDE library. Works like a database! */
libname spde SPDE "%sysfunc(getoption(WORK))/spde";
/* copy a table to the new library */
data spde.class;
 set sashelp.class;
run;
 
/* THIS step produces an error, because CLASS */
/* is not sorted by SEX */
proc reg data=sashelp.class;
    by sex;
    model age=weight;
run;
quit;
 
/* THIS step works correctly.  An implicit          */
/* ORDER BY clause is pushed to the database engine */
proc reg data=spde.class;
    by sex;
    model age=weight;
run;
quit;

Why does the second PROC REG step succeed? It's because the requirement for sorted/grouped records is passed through to the database using an implicit ORDER BY clause. You don't see it happening in your SAS log, but it's happening under the covers. Most SAS procedures are optimized to push these commands to the database. Most databases don't really have the concept of sorted data records; they return records in whatever sequence you request. Returning sorted data from a database doesn't have the same performance implications as a SAS-based PROC SORT step.

How does SAS Enterprise Guide generate optimized code?

Do you use SAS Enterprise Guide tasks to build your analyses? If so, you might have noticed that the built-in tasks go to great lengths to guarantee that your task will encounter data that are properly sorted. Consider this setup for the Linear Regression task:
Linear Regression task
Here's an example of the craziness that you'll see from the Linear Regression task when you have a BY variable in a Base SAS data set. There is "defensive" keep-and-sort code in there, because we want the task to work properly for any data scenario.

/* -------------------------------------------------------------------
   Determine the data set's type attribute (if one is defined)
   and prepare it for addition to the data set/view which is
   generated in the following step.
   ------------------------------------------------------------------- */
DATA _NULL_;
 dsid = OPEN("SASHELP.CLASS", "I");
 dstype = ATTRC(DSID, "TYPE");
 IF TRIM(dstype) = " " THEN
  DO;
  CALL SYMPUT("_EG_DSTYPE_", "");
  CALL SYMPUT("_DSTYPE_VARS_", "");
  END;
 ELSE
  DO;
  CALL SYMPUT("_EG_DSTYPE_", "(TYPE=""" || TRIM(dstype) || """)");
  IF VARNUM(dsid, "_NAME_") NE 0 AND VARNUM(dsid, "_TYPE_") NE 0 THEN
   CALL SYMPUT("_DSTYPE_VARS_", "_TYPE_ _NAME_");
  ELSE IF VARNUM(dsid, "_TYPE_") NE 0 THEN
   CALL SYMPUT("_DSTYPE_VARS_", "_TYPE_");
  ELSE IF VARNUM(dsid, "_NAME_") NE 0 THEN
   CALL SYMPUT("_DSTYPE_VARS_", "_NAME_");
  ELSE
   CALL SYMPUT("_DSTYPE_VARS_", "");
  END;
 rc = CLOSE(dsid);
 STOP;
RUN;
 
/* -------------------------------------------------------------------
   Sort data set SASHELP.CLASS
   ------------------------------------------------------------------- */
PROC SORT
 DATA=SASHELP.CLASS(KEEP=Age Height Sex &_DSTYPE_VARS_)
 OUT=WORK.SORTTempTableSorted &_EG_DSTYPE_
 ;
 BY Sex;
RUN;
TITLE;
TITLE1 "Linear Regression Results";
PROC REG DATA=WORK.SORTTempTableSorted
  PLOTS(ONLY)=ALL
 ;
 BY Sex;
 Linear_Regression_Model: MODEL Age = Height
  /  SELECTION=NONE
 ;
RUN;
QUIT;

This verbose code drives experienced SAS programmers crazy. But unlike a SAS programmer, the SAS Enterprise Guide code generator does not understand all of the nuances of your data, and thus can't guess what steps can be skipped. (SAS macro programmers: you know what I'm talking about. Think about all of the scenarios you have to code for/defend against in a generalized macro.)

And when running the same task with a database table? SAS Enterprise Guide detects the table source is a database, and builds a much more concise version:

TITLE;
TITLE1 "Linear Regression Results";
PROC REG DATA=SPDE.CLASS
  PLOTS(ONLY)=ALL
 ;
 BY Sex;
 Linear_Regression_Model: MODEL Age = Weight
  /  SELECTION=NONE
 ;
RUN;
QUIT;

Consider the data source, and -- if you can -- skip the SORT!

* The Base engine is not the only sequential data engine in SAS, but it's the most common.

tags: in-database, SAS Enterprise Guide, SAS programming

The post Sorting data in SAS: can you skip it? appeared first on The SAS Dummy.

 Leave a Reply

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

(required)

(required)