Kim Wilson

12月 142018
Several years ago, I wrote a paper about the top-ten questions about the DATA step that SAS Technical Support receives from customers. Those topics are still popular among people who contact us for help. In this blog, I’m sharing some additional questions that we’re asked on a regular basis. Those questions cover SAS dates, arrays, and how to reference local PC files from SAS® Enterprise Guide® and SAS® Studio when those applications connect to a SAS® server in UNIX operating environments.

About SAS® dates

Let’s begin with dates. We regularly hear customers say something similar to this: "I have a date, but I’m not sure how to use it or whether it’s even a SAS date yet." No worries--we can figure it out! A SAS date is a numeric variable whose value represents the number of days between January 1, 1960 and a specific date. For example, assume that you have a variable named X that has a value of 12398, but you’re not sure what that value represents. Is it a SAS date? Or does it represent January 23, 1998?
To determine what the value represents, you first need to run the CONTENTS procedure on the data set and determine whether the variable in question is character or numeric.
For this example, here is the partial output from the PROC CONTENTS step:

Alphabetic List of Variables and Attributes
#    Variable    Type    Len    Format

1    x           Num       8
2    y           Char      3
3    z           Num       8    Z5.

If X is a numeric variable, is a format shown in the FORMAT column for that variable? In this case, the answer is no. However, if the variable is numeric and there is no assigned format, this might be a SAS date that needs to be formatted to make sense of the value. If you run a simple DATA step to add any date format to that SAS date value, you will see that 12398 represents the date December 11, 1993.
data a;
format mydate worddate.;

If you print the results of this program with the PRINT procedure, the output for data set A is as shown below:
Obs         mydate

 1     December 11, 1993

Is this a valid date in the context of this data sample? If you’re unsure, look at the other date values to see whether most of them are similarly structured. Most of the time, if a variable is stored as a SAS date, the variable is already assigned a date format, which is shown in the PROC CONTENTS output. If the value 12398 is a numeric variable such that the digits represent the month, day, and year of a given date (for example, January 23, 1998), you can convert it to a SAS date by running the following DATA step:
data a;
format y date9.;

The PROC PRINT output from this step shows that the variable Y has a formatted value of 23JAN1998.
Obs      x              y

 1     12398    23JAN1998

The format that you assign to the variable can be any SAS format or custom-date format.
If the original variable is a character variable, you can convert it to a SAS date by using the INPUT function and the MMDDYY6. informat.
data a;
format y date9.;

Using arrays in SAS

Many customers aren’t quite sure that they understand how to use arrays. Arrays are a common construct in many programming languages. Arrays can seem less complex when you remember that they are a temporary grouping of variables. When you perform the same operation on multiple variables, you have less to program if you can refer to a group of variables by a single name. You simply execute a DO loop that processes each variable in turn, and the task is complete!

We often see arrays used for "reshaping data" or transposing a data set from wide-to-long (or long-to-wide). For example, assume that you want to reshape a data set, comprised of three variables and four observations, into a data set that contains twelve variables. Using an array approach makes the programming much easier, as shown below:

In this example:

    1. The variables X, Y, and Z are loaded into an array named VARS, which means that they can be referred to as VARS(1) – VARS(3) or by the variable names X, Y, and Z.
    2. A multidimensional array named ALL is created with twelve variables. The first number in parentheses represents rows, and the second represents columns.
    3. A DO loop processes each variable in the VARS array.
    4. The ALL array is populated one observation at a time by the value of I and the value of J as the DO loop increments.

Because the ALL array is populated by each observation as it is read from data set One, the END= option in the SET statement creates the variable LAST as a flag. This variable indicates when the last observation is read, and the IF statement tests variable LAST. If the variable has a value of 1 (which evaluates to "true"), the statement prints the contents of the program data vector to the output data set. Here's the starting data set and the reshaped result:

Managing PC files in client/server environments

When I began working in Technical Support many years ago, the only interface to Base SAS® software was the Display Manager System, which has separate Program Editor, Log, and Output windows. Now, you can run SAS in various ways, and many of our customers use SAS Enterprise Guide and SAS Studio as their interfaces. One of the most frequently asked questions from customers is about how to access local PC files from these applications that access SAS through a UNIX server.

SAS Enterprise Guide offers built-in tasks to upload and download data sets and other files. You can find these tasks on the Tasks->Data menu.

Two of the tasks, Upload Data Files to Server and Download Data Files to PC, allow you to copy SAS data sets directly between your local PC and your SAS libraries. The third task, Copy Files, allow you to copy any file (or group of files) between your local PC and the file system of the SAS session. See this article to learn how to apply a common pattern with this task: export and download any file from SAS Enterprise Guide. (Note: The Copy Files task was added in SAS Enterprise Guide 7.13. For earlier releases, you can follow the steps in this article.)

If you’re using the SAS Studio interface, you can upload and download files between the server and your PC.

Upload File and Download File buttons in SAS Studio

To download a file from the SAS server to your computer:

    1. Select the file that you want to download from the folder tree.
    2. Click the download button and save the file according to the information in your browser dialog box.

To upload one or more files from your local computer:

    1. Select the folder to which you want to upload the files and click the upload button.
    2. In the Upload Files window, click Choose Files to browse for the files that you want to upload.
    3. Select one or more files from your computer and click Open. The selected files are displayed as well as their size. An error message is displayed when you try to upload files where the total size exceeds 10 MB.
    4. Click Upload to complete the upload process.

Always go back to the basics

The three topics that are discussed here don't represent new features or challenges. However, these topics generate many calls to Technical Support. It's a reminder that even as SAS continues to add new features and technology, we still need to know how to tackle the basic building blocks of our SAS programs.

FAQs about SAS dates, arrays and managing local PC files was published on SAS Users.

12月 192017

Compressing a data setCompressing a data set is a process that reduces the number of bytes that are required to represent each observation in a file. You might choose to enable compression to reduce the storage requirements for the file and to lessen the number of I/O operations that are needed to read from or write to the data during processing.

Compression is enabled by the COMPRESS= system option, the COMPRESS= option in the LIBNAME statement, and the COMPRESS= data set option. The COMPRESS= system option compresses all data set sets that are created during a SAS session, and the COMPRESS= option in the LIBNAME statement compresses all data sets for a particular SAS® library. The COMPRESS= data set option is the most popular of these methods because you compress data sets individually as they are created.

The COMPRESS= data set option can be set to CHAR (or YES), NO, and BINARY. The following example illustrates using COMPRESS=YES:

data new(compress=yes);
set old;


While compression is a useful tool in your programming toolbox, it isn't a tool that you should use on every data set. When you request compression by using the COMPRESS= option, SAS considers the following information:

  • The header information of the data set to determine how many variables are in the program data vector
  • whether the variables are character or numeric
  • the lengths of the variables

SAS doesn't consider data values at all. The compression overhead for Microsoft 32-bit Windows and 64-bit Windows is 12 bytes, whereas 64-bit UNIX hosts require 24 bytes of overhead. When SAS determines that it is possible to recoup the 12 or 24 bytes of overhead per observation that compression requires, then SAS attempts to compress the data. If that 12 or 24 bytes per observation can't be recouped, the data set size is increased when the compression is completed. So, you should determine ahead of time whether your data set is a good candidate for compression.

In the following example, a data set is created in the Windows operating environment with two variables having lengths, respectively, of 3 and 5 bytes. Because it is impossible to recoup the 12 bytes that are needed per observation for compression overhead, SAS automatically disables compression and a note is written to the SAS log that indicates the same.

571  data new(compress=char);
572     x='abc';
573     y='defgh';
574  run;
NOTE: Compression was disabled for data set WORK.NEW because compression overhead would increase
      the size of the data set.
NOTE: The data set WORK.NEW has 1 observations and 2 variables.


The compression process doesn’t recognize individual variables within an observation. Instead, the process sees each observation as a large collection of bytes that are run together end to end. In the COMPRESS= data set option, you enable compression by specifying either CHAR (YES) and BINARY. These values for the option differ slightly in the types of data values that they target for compression.

Using the COMPRESS=CHAR|YES option

Specifying COMPRESS=CHAR (or YES) targets data with repeating single characters and variables with stored lengths that are longer than most of the values. As a result, blank spaces pad the end of values that are shorter than the number of bytes of storage.

In thinking about conserving space, customers often shorten the storage lengths of variables by using a LENGTH statement. When you shorten the lengths of your variables, you remove the best opportunity for SAS to compress. For example, if a numeric variable can be stored accurately in 4 bytes, the remaining 4 bytes (in an 8-byte variable) will all be zeros. This situation is perfect for compression. However, when you shorten the length to 4 bytes, the layout of the value is no longer suitable for compression. The only reason to truncate the storage length by using the LENGTH statement is to save disk space. All values are expanded to the full size of 8 bytes in the program data vector to perform computations in DATA and PROC steps. You'll use extra CPU resources to uncompress the data set as well as to expand variables back to 8 bytes.

Using the COMPRESS=BINARY option

When you use COMPRESS=BINARY, patterns of multiple characters across the entire observation are compressed. Binary compression uses two techniques at the same time. This option searches for the following:

  1. Repeating byte sequences (for example, 10 blank spaces or 10 zero bytes in a row)
  2. Repeating byte patterns (for example, the repeated pattern in the hexadecimal value 0102030405FAF10102030405FBF20102030405FCF3)

With that in mind, you can see that the bytes in a numeric variable are just as likely to be compressed as those in a character variable because the compression process does not consider those bytes to be numeric or character. They are just viewed as bytes. Consider a missing value that is represented in hexadecimal notation as FFFF000000000001. In the middle of that value is a string of five zero bytes (0x00) that can be replaced by two compression code-bytes. So, what starts as a sequence of 8 bytes ends up as a sequence of 5 bytes.

Keep in mind

As mentioned earlier, although compression saves space and is a great tool to keep handy in your SAS toolbox, it’s not meant for all your data sets. Some data sets are not going to compress well and the data set will grow larger, so know your data. Also, you’ll want to consider the extra CPU resources that are required to read a compressed file due to the overhead of uncompressing each observation.

What can compression do for you? was published on SAS Users.

2月 192016

ProblemSolversPROC DS2 (DS2) is a new SAS programming language that extends the power of the traditional DATA step. Some customers have wondered if this new functionality is simply the “DATA step on steroids,” meaning that we simply added more robust functionality so that it’s a replacement for the DATA step. While it includes extra functionality that allows you to do more advanced data processing and manipulation, the addition of methods, packages, additional data types, and the power to run code inside databases boosts the power of this new procedure. For a more in-depth discussion, see the “Similarities between DS2 and the DATA Step” web page and the “Differences between DS2 and the DATA Step” web page. DS2 is not a replacement for the DATA step as INFILE, INPUT, FILE, PUT, and DATALINES statements aren’t permitted. Although most DS2 customers are using DS2 against database tables, it can also be used against traditional SAS data sets.

If DS2 is new to you, you are probably asking why you should invest time to learn about it rather than staying with the ever-reliable DATA step. Let me discuss several reasons why others are finding that DS2 is a welcomed addition to Base SAS®:

  • Additional precision gained from using new data types
  • Threaded processing for computationally intensive processing
  • Methods and packages for easy deployment of reusable code modules
  • Embedded PROC FedSQL – DATA step power and SQL simplicity all in the same program
  • Code can be run inside massively parallel databases like Teradata, Hadoop and Greenplum

Additional Precision Gained from Using New Data Types

While Base SAS allows only character and numeric data types, DS2 offers numerous others: BIGINT, BINARY, CHAR, DATE, DECIMAL, DOUBLE, FLOAT, INTEGER, NCHAR, NVARCHAR, REAL, SMALLINT, TIME, TIMESTAMP, TINYINT, VARBINARY, and VARCHAR. You can read more about each type on the “What Are the Data Types?” web page.

If you’re using DS2 to write to a SAS data set, you will still be limited to storing the standard character and numeric data types, even if you’ve processed other DS2 data types in the code. If your data is stored in one of the supported data sources listed on this web page, you can read from and write to them while using data types native to those databases. For example, DS2 can provide the extra precision you need to get accurate, precise answers by using the DECIMAL data type instead of a standard SAS 8-byte numeric variable.

Threaded Processing for Computationally Intensive Processing

Unless thread programs are created and run inside DS2, the program runs sequentially. When your SAS program is computationally intense, it might be advantageous to create a thread program so the data is spread out over multiple processes to become more efficient. Enclose the DS2 code between THREAD and ENDTHREAD statements, use the DECLARE THREAD statement to create an instance of the thread in the DS2 program, and execute the thread with a SET FROM statement. If one computation thread can keep up with the I/O thread, then that single thread is used for all processing. It’s also important to remember that a single reader feeds all threads. If you’d like to know how the observations were spread out over the threads, you can use the _THREADID_ variable. Here’s an example of a thread program, including the _THREADID_ variable:

data test; 
do x=1 to 12316;
proc ds2; 
  thread newton/overwrite=yes; 
    dcl double y count; 
    dcl bigint thisThread;
    drop count; 
    method run(); 
      set test; 
    method term();
      put '**Thread' _threadid_ 'processed'  count 'rows:';
proc ds2; 
  data _null_; 
    dcl thread newton frac; 
    method run(); 
         set from frac threads=3; 

This is the output in the SAS log that shows the number of rows processed by each thread.

**Thread 1 processed 4124 rows
**Thread 0 processed 4096 rows
**Thread 2 processed 4096 rows

Methods and Packages for Easy Deployment of Reusable Code Modules

In DS2, all code must reside within a method. The three system methods are INIT, RUN, and TERM, and at least one must be specified in each DS2 program. You can write custom methods that can be stored for future use and that alleviate the need to recreate the code each time. Packages are collections of methods and variables. Although you can create custom packages, DS2 provides some packages that are predefined with methods to aid in your programming needs. Predefined packages include FCMP, Hash, HTTP, JSON, Logger, Matrix, SQLSTMT, and TZ. You can read more about them on the “Predefined DS2 Packages” web page.

Embedded PROC FedSQL – DATA Step Power and SQL Simplicity All in the Same Program

SAS FedSQL is a SAS proprietary implementation of the ANSI SQL:1999 core standard, and it allows vendor-neutral access to many data sources. This functionality can be used in the new procedure, PROC FedSQL, but you can also embed FedSQL queries in the SET statement within DS2. When a query is used in the SET statement, the resulting rows become the input to the remaining DS2 code. For example,

  set {select,a.income,b.expenses 
         from lib.dset as a left join lib2.dset2 as b
         where a.income >40000
         order by  };

The embedded FedSQL code gets converted to the native database code as needed.

Code Can Be Run Inside Massively Parallel Databases Like Teradata, Hadoop, and Greenplum

The SAS® In-Database Code Accelerator enables you to publish a DS2 thread program to the database and execute that thread program in parallel inside the database. DS2 thread programs can run inside Hadoop, Greenplum, and Teradata, while DS2 data programs can run inside Hadoop and Teradata. This has been a favorite aspect of DS2 for many customers thus far. To take advantage of this feature, you must license the SAS In-Database Code Accelerator, and the SAS/ACCESS® Interface to your database (Hadoop, Greenplum, or Teradata) in addition to Base SAS. The SAS® Embedded Process must be installed and configured on your database. Read more about the SAS In-Database Code Accelerator on the “Using the SAS In-Database Code Accelerator” web page.

While DS2 isn’t meant to replace the DATA step, there are numerous reasons why a customer might choose to use this new functionality. If you’re unsure if you should give it a try, or if you’d like to read more about it, here’s the link to the SAS® 9.4 DS2 Language Reference.

As always, we in Technical Support are here to answer your questions and offer tips and suggestions as you use our software. Thanks for using SAS!

tags: Problem Solvers, PROC DS2, SAS Programmers

Reasons to love PROC DS2 was published on SAS Users.

1月 152016

ProblemSolversAs support analysts in the SAS Technical Support division, we answer many phone calls from SAS customers.  As members of the SAS Foundation team, we get questions that vary significantly in content from all of the areas that we support.  We offer coding tips and suggestions as well as point customers to SAS Notes and documentation. A common question that we are frequently asked is how to collapse a data set with many observations for the same BY variable value into a data set that has one observation per BY variable value.  Another way to phrase this question is: how do you reshape the data from “long to wide?”

Resources to Reshape Data

The following SAS Note illustrates how to use two TRANSPOSE procedure steps to collapse the observations for multiple BY variables:

Sample 44637: Double PROC TRANSPOSE method for reshaping your data set with multiple BY variables

If you prefer to use the DATA step rather than PROC TRANSPOSE, the following SAS Note provides a code sample that accomplishes the same purpose:

Sample 24579: Collapsing observations within a BY-Group into a single observation (when data set has 3 or more variables)

If the data set is “wide” and you’d like to expand each observation into multiple observations to make it a “long” data set, the following SAS Note is helpful:

Sample 24794: Expand single observations into multiple observations

Brief Overview of Some Resources

Since we’ve been discussing SAS Notes from the site, here is a brief overview of how to use the site to find other helpful items for your future coding projects.


  1. Go to
  2. On the left side of the page in the Knowledge Base section, select Documentation, Product Index A-Z, and then select your product of interest to access user guides and other documentation resources.

SAS Notes

  1. Go to
  2. On the left side of the page in the Knowledge Base section, select Samples & SAS Notes.
  3. From this page, you can type keywords in the Search box to get a list of relevant notes from both categories.
    Note: If you’re interested in specific types of notes (for example, samples or problem notes), select the type of note from the choices given on the left side of the page underneath Samples & SAS Notes.

It was a pleasure to work with you in 2015 when you contacted us for assistance. We look forward to another great year in 2016.

tags: data, Problem Solvers, SAS Programmers

New year refresher on reshaping data was published on SAS Users.

2月 212014

Most SAS programmers would agree that they use the SET statement without giving much thought to the syntax, because it’s such a widely used statement of choice. We routinely name the expected data sets and possibly a few options, and away we go. A visit to the documentation can be saved for more complex concerns such as arrays, hash tables, and regular expressions.  Maybe a review of how the Program Data Vector (PDV) stores and outputs variable values could interest even the most experienced SAS programmer.

Often in Technical Support, we hear from customers who are referencing more than one data set on the SET statement and their results “aren’t correct”.  Once we see the code, it’s obvious that this is a case of SAS behaving by design but different than expected by that customer. The following DATA steps and resulting output use missing values to illustrate one common situation where the PDV causes the DATA step to “misbehave”:

Reading datasets A and B with SET statement

Unexpected output in FINAL

Why the IN= data set option helps, but isn’t the total solution

Because Z doesn’t exist in data set A, all observations from data set A meet the IF condition, thus all Y values are set to 999.  It’s likely that most users want only the observations from data set B evaluated by the IF statement. The addition of the IN= data set option seems like a reasonable fix.

Results when reading data set B with IN= data set option

Don’t forget how the SET statement changes PDV processing

Observations from data set A are no longer affected by the IF statement. But wait, the eighth observation has a Y value of 999 although Z isn’t missing.  What happened? The main points to remember when reading data sets with a SET statement are these:

  • The SET statement does not reset the values in the program data vector to missing, except for variables whose values are calculated or assigned during the DATA step.
  • Variables that are created by the DATA step are set to missing at the beginning of each iteration of the DATA step.
  • Variables that are read from a data set are not.

These differences are key to what happens when data sets are combined with a SET statement. The Combining SAS Data Sets: Methods section of the SAS 9.4 Language Reference: Concepts manual is very helpful.

How to override the automatic retain in the PDV

To take the code sample one step further, observations six and seven correctly show Y values of 999, but since Z isn’t missing on the eighth observation, Y should be a missing value.

You can manually set Y to missing at the top of the DATA step to prevent the automatic retain.  Each iteration of the DATA step sets Y to missing, and the SET and IF statements execute and populate the PDV accordingly.  Now the resulting data set values are as expected.

Override the automatic retain of the PDV

Overriding the automatic retain in the PDV

Usage Note 48147: Variables read using SET, MERGE, and UPDATE statements are automatically retained is an excellent note to bookmark for future use when reading data sets with the SET, MERGE, or UPDATE statements.

tags: missing values, Problem Solvers, Program Data Vector, SAS Programmers, SET statement