Data compression

12月 192017
 

Compressing a data setCompressing a data set is a process that reduces the number of bytes that are required to represent each observation in a file. You might choose to enable compression to reduce the storage requirements for the file and to lessen the number of I/O operations that are needed to read from or write to the data during processing.

Compression is enabled by the COMPRESS= system option, the COMPRESS= option in the LIBNAME statement, and the COMPRESS= data set option. The COMPRESS= system option compresses all data set sets that are created during a SAS session, and the COMPRESS= option in the LIBNAME statement compresses all data sets for a particular SAS® library. The COMPRESS= data set option is the most popular of these methods because you compress data sets individually as they are created.

The COMPRESS= data set option can be set to CHAR (or YES), NO, and BINARY. The following example illustrates using COMPRESS=YES:

data new(compress=yes);
set old;
run;

 

While compression is a useful tool in your programming toolbox, it isn't a tool that you should use on every data set. When you request compression by using the COMPRESS= option, SAS considers the following information:

  • The header information of the data set to determine how many variables are in the program data vector
  • whether the variables are character or numeric
  • the lengths of the variables

SAS doesn't consider data values at all. The compression overhead for Microsoft 32-bit Windows and 64-bit Windows is 12 bytes, whereas 64-bit UNIX hosts require 24 bytes of overhead. When SAS determines that it is possible to recoup the 12 or 24 bytes of overhead per observation that compression requires, then SAS attempts to compress the data. If that 12 or 24 bytes per observation can't be recouped, the data set size is increased when the compression is completed. So, you should determine ahead of time whether your data set is a good candidate for compression.

In the following example, a data set is created in the Windows operating environment with two variables having lengths, respectively, of 3 and 5 bytes. Because it is impossible to recoup the 12 bytes that are needed per observation for compression overhead, SAS automatically disables compression and a note is written to the SAS log that indicates the same.

571  data new(compress=char);
572     x='abc';
573     y='defgh';
574  run;
 
NOTE: Compression was disabled for data set WORK.NEW because compression overhead would increase
      the size of the data set.
NOTE: The data set WORK.NEW has 1 observations and 2 variables.

 

The compression process doesn’t recognize individual variables within an observation. Instead, the process sees each observation as a large collection of bytes that are run together end to end. In the COMPRESS= data set option, you enable compression by specifying either CHAR (YES) and BINARY. These values for the option differ slightly in the types of data values that they target for compression.

Using the COMPRESS=CHAR|YES option

Specifying COMPRESS=CHAR (or YES) targets data with repeating single characters and variables with stored lengths that are longer than most of the values. As a result, blank spaces pad the end of values that are shorter than the number of bytes of storage.

In thinking about conserving space, customers often shorten the storage lengths of variables by using a LENGTH statement. When you shorten the lengths of your variables, you remove the best opportunity for SAS to compress. For example, if a numeric variable can be stored accurately in 4 bytes, the remaining 4 bytes (in an 8-byte variable) will all be zeros. This situation is perfect for compression. However, when you shorten the length to 4 bytes, the layout of the value is no longer suitable for compression. The only reason to truncate the storage length by using the LENGTH statement is to save disk space. All values are expanded to the full size of 8 bytes in the program data vector to perform computations in DATA and PROC steps. You'll use extra CPU resources to uncompress the data set as well as to expand variables back to 8 bytes.

Using the COMPRESS=BINARY option

When you use COMPRESS=BINARY, patterns of multiple characters across the entire observation are compressed. Binary compression uses two techniques at the same time. This option searches for the following:

  1. Repeating byte sequences (for example, 10 blank spaces or 10 zero bytes in a row)
  2. Repeating byte patterns (for example, the repeated pattern in the hexadecimal value 0102030405FAF10102030405FBF20102030405FCF3)

With that in mind, you can see that the bytes in a numeric variable are just as likely to be compressed as those in a character variable because the compression process does not consider those bytes to be numeric or character. They are just viewed as bytes. Consider a missing value that is represented in hexadecimal notation as FFFF000000000001. In the middle of that value is a string of five zero bytes (0x00) that can be replaced by two compression code-bytes. So, what starts as a sequence of 8 bytes ends up as a sequence of 5 bytes.

Keep in mind

As mentioned earlier, although compression saves space and is a great tool to keep handy in your SAS toolbox, it’s not meant for all your data sets. Some data sets are not going to compress well and the data set will grow larger, so know your data. Also, you’ll want to consider the extra CPU resources that are required to read a compressed file due to the overhead of uncompressing each observation.

What can compression do for you? was published on SAS Users.