Kim Wilson

12月 192017
 

Compressing a data setCompressing a data set is a process that reduces the number of bytes that are required to represent each observation in a file. You might choose to enable compression to reduce the storage requirements for the file and to lessen the number of I/O operations that are needed to read from or write to the data during processing.

Compression is enabled by the COMPRESS= system option, the COMPRESS= option in the LIBNAME statement, and the COMPRESS= data set option. The COMPRESS= system option compresses all data set sets that are created during a SAS session, and the COMPRESS= option in the LIBNAME statement compresses all data sets for a particular SAS® library. The COMPRESS= data set option is the most popular of these methods because you compress data sets individually as they are created.

The COMPRESS= data set option can be set to CHAR (or YES), NO, and BINARY. The following example illustrates using COMPRESS=YES:

data new(compress=yes);
set old;
run;

 

While compression is a useful tool in your programming toolbox, it isn't a tool that you should use on every data set. When you request compression by using the COMPRESS= option, SAS considers the following information:

  • The header information of the data set to determine how many variables are in the program data vector
  • whether the variables are character or numeric
  • the lengths of the variables

SAS doesn't consider data values at all. The compression overhead for Microsoft 32-bit Windows and 64-bit Windows is 12 bytes, whereas 64-bit UNIX hosts require 24 bytes of overhead. When SAS determines that it is possible to recoup the 12 or 24 bytes of overhead per observation that compression requires, then SAS attempts to compress the data. If that 12 or 24 bytes per observation can't be recouped, the data set size is increased when the compression is completed. So, you should determine ahead of time whether your data set is a good candidate for compression.

In the following example, a data set is created in the Windows operating environment with two variables having lengths, respectively, of 3 and 5 bytes. Because it is impossible to recoup the 12 bytes that are needed per observation for compression overhead, SAS automatically disables compression and a note is written to the SAS log that indicates the same.

571  data new(compress=char);
572     x='abc';
573     y='defgh';
574  run;
 
NOTE: Compression was disabled for data set WORK.NEW because compression overhead would increase
      the size of the data set.
NOTE: The data set WORK.NEW has 1 observations and 2 variables.

 

The compression process doesn’t recognize individual variables within an observation. Instead, the process sees each observation as a large collection of bytes that are run together end to end. In the COMPRESS= data set option, you enable compression by specifying either CHAR (YES) and BINARY. These values for the option differ slightly in the types of data values that they target for compression.

Using the COMPRESS=CHAR|YES option

Specifying COMPRESS=CHAR (or YES) targets data with repeating single characters and variables with stored lengths that are longer than most of the values. As a result, blank spaces pad the end of values that are shorter than the number of bytes of storage.

In thinking about conserving space, customers often shorten the storage lengths of variables by using a LENGTH statement. When you shorten the lengths of your variables, you remove the best opportunity for SAS to compress. For example, if a numeric variable can be stored accurately in 4 bytes, the remaining 4 bytes (in an 8-byte variable) will all be zeros. This situation is perfect for compression. However, when you shorten the length to 4 bytes, the layout of the value is no longer suitable for compression. The only reason to truncate the storage length by using the LENGTH statement is to save disk space. All values are expanded to the full size of 8 bytes in the program data vector to perform computations in DATA and PROC steps. You'll use extra CPU resources to uncompress the data set as well as to expand variables back to 8 bytes.

Using the COMPRESS=BINARY option

When you use COMPRESS=BINARY, patterns of multiple characters across the entire observation are compressed. Binary compression uses two techniques at the same time. This option searches for the following:

  1. Repeating byte sequences (for example, 10 blank spaces or 10 zero bytes in a row)
  2. Repeating byte patterns (for example, the repeated pattern in the hexadecimal value 0102030405FAF10102030405FBF20102030405FCF3)

With that in mind, you can see that the bytes in a numeric variable are just as likely to be compressed as those in a character variable because the compression process does not consider those bytes to be numeric or character. They are just viewed as bytes. Consider a missing value that is represented in hexadecimal notation as FFFF000000000001. In the middle of that value is a string of five zero bytes (0x00) that can be replaced by two compression code-bytes. So, what starts as a sequence of 8 bytes ends up as a sequence of 5 bytes.

Keep in mind

As mentioned earlier, although compression saves space and is a great tool to keep handy in your SAS toolbox, it’s not meant for all your data sets. Some data sets are not going to compress well and the data set will grow larger, so know your data. Also, you’ll want to consider the extra CPU resources that are required to read a compressed file due to the overhead of uncompressing each observation.

What can compression do for you? was published on SAS Users.

2月 192016
 

ProblemSolversPROC DS2 (DS2) is a new SAS programming language that extends the power of the traditional DATA step. Some customers have wondered if this new functionality is simply the “DATA step on steroids,” meaning that we simply added more robust functionality so that it’s a replacement for the DATA step. While it includes extra functionality that allows you to do more advanced data processing and manipulation, the addition of methods, packages, additional data types, and the power to run code inside databases boosts the power of this new procedure. For a more in-depth discussion, see the “Similarities between DS2 and the DATA Step” web page and the “Differences between DS2 and the DATA Step” web page. DS2 is not a replacement for the DATA step as INFILE, INPUT, FILE, PUT, and DATALINES statements aren’t permitted. Although most DS2 customers are using DS2 against database tables, it can also be used against traditional SAS data sets.

If DS2 is new to you, you are probably asking why you should invest time to learn about it rather than staying with the ever-reliable DATA step. Let me discuss several reasons why others are finding that DS2 is a welcomed addition to Base SAS®:

  • Additional precision gained from using new data types
  • Threaded processing for computationally intensive processing
  • Methods and packages for easy deployment of reusable code modules
  • Embedded PROC FedSQL – DATA step power and SQL simplicity all in the same program
  • Code can be run inside massively parallel databases like Teradata, Hadoop and Greenplum

Additional Precision Gained from Using New Data Types

While Base SAS allows only character and numeric data types, DS2 offers numerous others: BIGINT, BINARY, CHAR, DATE, DECIMAL, DOUBLE, FLOAT, INTEGER, NCHAR, NVARCHAR, REAL, SMALLINT, TIME, TIMESTAMP, TINYINT, VARBINARY, and VARCHAR. You can read more about each type on the “What Are the Data Types?” web page.

If you’re using DS2 to write to a SAS data set, you will still be limited to storing the standard character and numeric data types, even if you’ve processed other DS2 data types in the code. If your data is stored in one of the supported data sources listed on this web page, you can read from and write to them while using data types native to those databases. For example, DS2 can provide the extra precision you need to get accurate, precise answers by using the DECIMAL data type instead of a standard SAS 8-byte numeric variable.

Threaded Processing for Computationally Intensive Processing

Unless thread programs are created and run inside DS2, the program runs sequentially. When your SAS program is computationally intense, it might be advantageous to create a thread program so the data is spread out over multiple processes to become more efficient. Enclose the DS2 code between THREAD and ENDTHREAD statements, use the DECLARE THREAD statement to create an instance of the thread in the DS2 program, and execute the thread with a SET FROM statement. If one computation thread can keep up with the I/O thread, then that single thread is used for all processing. It’s also important to remember that a single reader feeds all threads. If you’d like to know how the observations were spread out over the threads, you can use the _THREADID_ variable. Here’s an example of a thread program, including the _THREADID_ variable:

data test; 
do x=1 to 12316;
 j=x*x; 
output; 
end; 
run; 
 
proc ds2; 
  thread newton/overwrite=yes; 
    dcl double y count; 
    dcl bigint thisThread;
    drop count; 
    method run(); 
      set test; 
      y=(x*j*x)/(j+x); 
      thisThread=_threadid_; 
      count+1;
    end; 
    method term();
      put '**Thread' _threadid_ 'processed'  count 'rows:';
    end;
  endthread; 
  run; 
quit; 
 
proc ds2; 
  data _null_; 
    dcl thread newton frac; 
    method run(); 
         set from frac threads=3; 
    end; 
  enddata;
  run; 
quit;

This is the output in the SAS log that shows the number of rows processed by each thread.

**Thread 1 processed 4124 rows
**Thread 0 processed 4096 rows
**Thread 2 processed 4096 rows

Methods and Packages for Easy Deployment of Reusable Code Modules

In DS2, all code must reside within a method. The three system methods are INIT, RUN, and TERM, and at least one must be specified in each DS2 program. You can write custom methods that can be stored for future use and that alleviate the need to recreate the code each time. Packages are collections of methods and variables. Although you can create custom packages, DS2 provides some packages that are predefined with methods to aid in your programming needs. Predefined packages include FCMP, Hash, HTTP, JSON, Logger, Matrix, SQLSTMT, and TZ. You can read more about them on the “Predefined DS2 Packages” web page.

Embedded PROC FedSQL – DATA Step Power and SQL Simplicity All in the Same Program

SAS FedSQL is a SAS proprietary implementation of the ANSI SQL:1999 core standard, and it allows vendor-neutral access to many data sources. This functionality can be used in the new procedure, PROC FedSQL, but you can also embed FedSQL queries in the SET statement within DS2. When a query is used in the SET statement, the resulting rows become the input to the remaining DS2 code. For example,

  set {select a.id,a.income,b.expenses 
         from lib.dset as a left join lib2.dset2 as b
         on a.id=b.id
         where a.income >40000
         order by a.id  };

The embedded FedSQL code gets converted to the native database code as needed.

Code Can Be Run Inside Massively Parallel Databases Like Teradata, Hadoop, and Greenplum

The SAS® In-Database Code Accelerator enables you to publish a DS2 thread program to the database and execute that thread program in parallel inside the database. DS2 thread programs can run inside Hadoop, Greenplum, and Teradata, while DS2 data programs can run inside Hadoop and Teradata. This has been a favorite aspect of DS2 for many customers thus far. To take advantage of this feature, you must license the SAS In-Database Code Accelerator, and the SAS/ACCESS® Interface to your database (Hadoop, Greenplum, or Teradata) in addition to Base SAS. The SAS® Embedded Process must be installed and configured on your database. Read more about the SAS In-Database Code Accelerator on the “Using the SAS In-Database Code Accelerator” web page.

While DS2 isn’t meant to replace the DATA step, there are numerous reasons why a customer might choose to use this new functionality. If you’re unsure if you should give it a try, or if you’d like to read more about it, here’s the link to the SAS® 9.4 DS2 Language Reference.

As always, we in Technical Support are here to answer your questions and offer tips and suggestions as you use our software. Thanks for using SAS!

tags: Problem Solvers, PROC DS2, SAS Programmers

Reasons to love PROC DS2 was published on SAS Users.

1月 152016
 

ProblemSolversAs support analysts in the SAS Technical Support division, we answer many phone calls from SAS customers.  As members of the SAS Foundation team, we get questions that vary significantly in content from all of the areas that we support.  We offer coding tips and suggestions as well as point customers to SAS Notes and documentation. A common question that we are frequently asked is how to collapse a data set with many observations for the same BY variable value into a data set that has one observation per BY variable value.  Another way to phrase this question is: how do you reshape the data from “long to wide?”

Resources to Reshape Data

The following SAS Note illustrates how to use two TRANSPOSE procedure steps to collapse the observations for multiple BY variables:

Sample 44637: Double PROC TRANSPOSE method for reshaping your data set with multiple BY variables

If you prefer to use the DATA step rather than PROC TRANSPOSE, the following SAS Note provides a code sample that accomplishes the same purpose:

Sample 24579: Collapsing observations within a BY-Group into a single observation (when data set has 3 or more variables)

If the data set is “wide” and you’d like to expand each observation into multiple observations to make it a “long” data set, the following SAS Note is helpful:

Sample 24794: Expand single observations into multiple observations

Brief Overview of Some Support.sas.com Resources

Since we’ve been discussing SAS Notes from the support.sas.com site, here is a brief overview of how to use the site to find other helpful items for your future coding projects.

Documentation

  1. Go to support.sas.com
  2. On the left side of the page in the Knowledge Base section, select Documentation, Product Index A-Z, and then select your product of interest to access user guides and other documentation resources.

SAS Notes

  1. Go to support.sas.com
  2. On the left side of the page in the Knowledge Base section, select Samples & SAS Notes.
  3. From this page, you can type keywords in the Search box to get a list of relevant notes from both categories.
    Note: If you’re interested in specific types of notes (for example, samples or problem notes), select the type of note from the choices given on the left side of the page underneath Samples & SAS Notes.

It was a pleasure to work with you in 2015 when you contacted us for assistance. We look forward to another great year in 2016.


tags: data, Problem Solvers, SAS Programmers

New year refresher on reshaping data was published on SAS Users.

2月 212014
 

Most SAS programmers would agree that they use the SET statement without giving much thought to the syntax, because it’s such a widely used statement of choice. We routinely name the expected data sets and possibly a few options, and away we go. A visit to the documentation can be saved for more complex concerns such as arrays, hash tables, and regular expressions.  Maybe a review of how the Program Data Vector (PDV) stores and outputs variable values could interest even the most experienced SAS programmer.

Often in Technical Support, we hear from customers who are referencing more than one data set on the SET statement and their results “aren’t correct”.  Once we see the code, it’s obvious that this is a case of SAS behaving by design but different than expected by that customer. The following DATA steps and resulting output use missing values to illustrate one common situation where the PDV causes the DATA step to “misbehave”:

Reading datasets A and B with SET statement

Unexpected output in FINAL

Why the IN= data set option helps, but isn’t the total solution

Because Z doesn’t exist in data set A, all observations from data set A meet the IF condition, thus all Y values are set to 999.  It’s likely that most users want only the observations from data set B evaluated by the IF statement. The addition of the IN= data set option seems like a reasonable fix.

Results when reading data set B with IN= data set option

Don’t forget how the SET statement changes PDV processing

Observations from data set A are no longer affected by the IF statement. But wait, the eighth observation has a Y value of 999 although Z isn’t missing.  What happened? The main points to remember when reading data sets with a SET statement are these:

  • The SET statement does not reset the values in the program data vector to missing, except for variables whose values are calculated or assigned during the DATA step.
  • Variables that are created by the DATA step are set to missing at the beginning of each iteration of the DATA step.
  • Variables that are read from a data set are not.

These differences are key to what happens when data sets are combined with a SET statement. The Combining SAS Data Sets: Methods section of the SAS 9.4 Language Reference: Concepts manual is very helpful.

How to override the automatic retain in the PDV

To take the code sample one step further, observations six and seven correctly show Y values of 999, but since Z isn’t missing on the eighth observation, Y should be a missing value.

You can manually set Y to missing at the top of the DATA step to prevent the automatic retain.  Each iteration of the DATA step sets Y to missing, and the SET and IF statements execute and populate the PDV accordingly.  Now the resulting data set values are as expected.

Override the automatic retain of the PDV

Overriding the automatic retain in the PDV

Usage Note 48147: Variables read using SET, MERGE, and UPDATE statements are automatically retained is an excellent note to bookmark for future use when reading data sets with the SET, MERGE, or UPDATE statements.

tags: missing values, Problem Solvers, Program Data Vector, SAS Programmers, SET statement