sas programming

March 2, 2012
 

About once a month, a customer approaches SAS and asks a question of significance. By "significance", I don't necessarily mean "of great importance", but instead I mean "of how SAS handles large numbers, or floating-point values with many significant digits".

In response, we always first ask why they asked. This is not just a ploy to buy ourselves time. Often we learn that the customer has a legacy database system that represents some value as a very long series of digits. He wants to know if SAS can preserve the integrity of that value. Drilling in, we discover that the value represents something like a customer ID or account number, and no math will be performed using the value. In that case, the answer is easy: read it into SAS as a character value (regardless of its original form), and all will be well.

For the cases where the number is a number and the customer expects to use it in calculations, we must then dust off our lesson about floating-point math. We cite IEEE standards. As we've seen in this blog, floating-point nuances are good for "stupid math tricks", but they can also cause confusion when two values that appear to be the same actually compare as different under equality tests.

If we're lucky, this approach of citing standards makes us sound so smart that the customer is satisfied with the answer (or is too intimidated to pursue more questioning). But sometimes (quite often, actually), the customer is smarter, and fires back with deeper questions about exponents, mantissas and hardware capabilities.

That's when we tap someone like Bill Brideson, a systems developer at SAS who not only knows much of what there is to know about floating-point arithmetic standards, but who also knows how to use SAS to demonstrate these standards at work when you display, compare, add, divide, and multiply floating-point values in SAS programs.

The remainder of this post comes from Bill. It's a thorough (!) explanation that he provided for a customer who was concerned that when he read the value "122000015596951" from a database as a decimal, it compared differently to the same value when read as an integer. He also provides a SAS program (with lots of comments) to support the explanation, as well as the output from the program when it was run using the example value.

Executive Summary (from Bill Brideson)

  • SAS fully exploits the hardware on which it runs to calculate correct and complete results using numbers of high precision and large magnitude.
  • By rounding to 15 significant digits, the w.d format maps a range of base-2 values to each base-10 value. This is usually what you want, but not when you're digging into very small differences.
  • Numeric operations in the DATA step use all of the range and precision supported by the hardware. This can be more than you had in mind, and includes more precision than the w.d format displays.
  • SAS has a rich set of tools that the savvy SAS consultant can employ to diagnose unexpected behavior with floating-point numbers.

I wrote a SAS program that shows what SAS and the hardware actually do with floating-point numbers.  The program has the details; I’ll try to keep those to a minimum here. I hope you find some of the techniques in the SAS program useful for tackling problems in the future. (If you’re not familiar with hexadecimal, have a quick look at http://en.wikipedia.org/wiki/Hexadecimal.)

How the heck does floating-point really work?

The first DATA step and the PROC PRINT output titled 1: Software Calculation of IEEE Floating-Point Values, Intel Format use SAS to decompose and re-assemble floating-point numbers. Don't read the program in detail unless you're interested in this sort of thing, but do have a quick look at http://en.wikipedia.org/wiki/IEEE_754-1985 which shows the components of the IEEE floating-point format.

To talk about whether SAS can handle "large" numbers, we have to talk in terms of the native floating-point format that SAS uses on Intel hardware. Different manufacturers use different floating-point formats. Here, though, we’ll just worry about Intel.

The columns of the PROC PRINT output are:

  • crbx is the hexadecimal representation of the floating-point number as it is stored in memory and used in calculations.
  • x is the same floating-point number, but printed using the w.d format.
  • value is the result of decomposing and reconstructing the floating-point number. The algorithms in the DATA step are correct if x and value are equal.
  • signexp is the sign and exponent portion of the floating-point number, extracted from the rightmost two bytes (four hexadecimal characters) of crbx and reordered so the bits read left-to-right in decreasing significance.
  • sign shows the setting of the sign bit in the floating-point number.
  • exponent shows the 11 exponent bits, right-justified. They’re a little hard to see in crbx and signexp.
  • exp_biased is the decimal value of the exponent. The exponent doesn’t have its own sign bit, so biasing makes effectively negative exponents possible. See the code and the Wikipedia reference for details.
  • exp_use is the decimal value of the exponent that is used to determine the value of the floating-point number. As you glance down the PROC PRINT output, notice how the exponent and mantissa change. (The mantissa is the first twelve and the fourteenth hexadecimal characters of crbx.)
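For readers who want to experiment, the decomposition described in this list can be sketched in a few lines. This is a minimal illustration, not Bill's program. It assumes an IEEE 754 platform such as Intel and a normal (nonzero, nonmissing) value. Unlike crbx, the HEX16. format writes the bytes most-significant first, so the sign and exponent are the first three hexadecimal characters. The names hex, mant, rebuilt, and match are hypothetical.

```sas
data _null_;
  x = 122000015596951;
  hex = put(x, hex16.);                        /* floating-point bits, most-significant byte first */
  signexp    = input(substr(hex, 1, 3), hex3.);  /* 1 sign bit + 11 exponent bits */
  sign       = (signexp >= 2048);                /* the top bit of the 12 is the sign */
  exp_biased = mod(signexp, 2048);               /* the remaining 11 bits */
  exp_use    = exp_biased - 1023;                /* remove the bias */
  mant       = input(substr(hex, 4, 13), hex13.); /* 52 mantissa bits, read as an integer */
  /* normal numbers have an implied leading 1 in front of the mantissa */
  rebuilt    = ifn(sign, -1, 1) * (1 + mant / 2**52) * 2**exp_use;
  match      = (rebuilt = x);
  put hex= / sign= exp_biased= exp_use= / rebuilt= 20. match=;
run;
```

If the arithmetic is right, rebuilt should compare equal to x, which is the same check that the x and value columns provide in the PROC PRINT output.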

The one-bit difference between the original values

The second DATA step assigns both of the customer's original values to variables and subtracts one from the other. The PROC PRINT output titled 2: The Two Original Values and Their Difference shows:

  • x and crbx show the original value that evaluates to exactly an integer. x is the output of the w.d format, and crbx is the floating-point representation of x (shown in hexadecimal).
  • y shows the original value that looked like an integer when displayed by the w.d format, and crby shows the floating-point representation of that value. The values of x and y look the same via the w.d format but, as crbx and crby show, the actual values of x and y are different.
  • d shows the difference between x and y in base 10.
  • crbd shows the difference between x and y in IEEE floating-point format. The hardware automatically normalizes numbers so that the value is positioned toward the most-significant mantissa bits. Here, no mantissa bits are set because the difference is exactly a power of 2 (see next bullet).
  • signexp, exponent, exp_biased, and exp_use have the same meanings as above and show that the exponent is effectively -6. This value of the exponent, combined with the zero mantissa, gets us to the same value of d by a different route: 2**-6 is 1 divided by 2**6. In base 10, this is 1/64 = 0.015625.
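The situation can be reproduced with a short sketch. It assumes that the second value differs from the integer by exactly 1/64, as the output described above indicates; building y by adding 1/64 is my shortcut, not how the customer obtained it.

```sas
data _null_;
  x = 122000015596951;        /* exactly an integer */
  y = x + 1/64;               /* one least-significant mantissa bit larger */
  d = y - x;                  /* 0.015625, which is 2**-6 */
  equal = (x = y);            /* 0: the values are not equal */
  /* the w.d format shows the same digits; HEX16. shows the differing bit */
  put x= 20. y= 20. / x= hex16. / y= hex16. / d= equal=;
run;
```

The two values print identically with the 20. format, but the last hexadecimal character differs and the equality test fails.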

Small, but real, differences that the w.d format does not show

The customer could have had many different values that would have looked the same but compared unequal. The PROC PRINT output titled 3: Values Immediately Adjacent to the Customer’s Integer shows distinct values that display the same via the w.d format but are not, in fact, equal:

  • x and crbx (decimal via the w.d format, and the floating-point format in hexadecimal, respectively) show the original value that evaluates to exactly an integer (again, x is displayed by the w.d format and crbx is the same value shown in its floating-point representation).
  • y (decimal, via w.d) and crby (floating-point, in hexadecimal) show values produced by the DATA step incrementing the mantissa one bit at a time. While the w.d format shows x and y to 15 significant decimal digits, crbx and crby show the actual floating-point values that the hardware uses, to all 52 bits of precision in the mantissa. This shows crby changing when the displayed value of y does not.
  • lsb is the binary representation of the least-significant byte of the mantissa. Because Intel is byte-swapped and little-endian, crby can be hard to read; lsb makes it easier to see how the value increments by one bit each time. This value is the same as the first two hexadecimal characters of crby.
  • d shows the difference between the value of y and the value of x on each observation, in base 10.
  • lagd shows the difference between the value of d on the current observation and the value of d on the previous observation. This checks for consistent behavior.

How many bits get lost to rounding?

The next DATA step and the PROC PRINT output titled 4: Incrementing the Least-Significant of 15 Decimal Digits show how many base-2 values occur between values that are rounded to 15 significant decimal digits:

  • y and crby show the values that are being incremented by one. Here, the w.d format shows y changing value in base 10 because the change is within the 15 most significant digits.
  • ls_12bits_hex extracts the least significant 12 bits of the mantissa for easier reading. These are shown in the first four hexadecimal characters of crby, but crby is hard to read anyway and especially so in a situation like this. The least-significant 12 bits are the only ones that change value through this series.
  • ls_12bits_bin is the same value as ls_12bits_hex, but in binary. This shows more clearly which bit positions change value and which bit positions do not.
  • d_10 is the base-10 difference between successive values in the least-significant bits of the mantissa. The ls_12bits_bin column shows clearly that the least-significant 6 bits of the mantissa never change; 2**6 is 64, so there are 64 possible base-2 values between each successive value of y.
  • exp_use is decoded from y using the algorithm from the first step. Here, we’re looking for where the radix point belongs in the mantissa; in other words, where "1" is in the mantissa. The mantissa has 52 bits, so 52 – 46 tells us that there are six bits to the right of the radix point in these floating-point values. Combining this with the ls_12bits_bin column, it is clear that none of the bit positions that would represent a fractional value are changing.
  • "Lowest 12 Bits Decoded" uses the algorithm from the first step to show that when just these bits of the mantissa are decoded, values are obtained that increase by one on each observation. The individual values are irrelevant because they’re just a piece of the mantissa, but the fact that they increment by one on each observation confirms the math from step 1.
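The count of 64 can also be derived directly from the magnitude of the value. The following sketch assumes a positive value and the 52-bit IEEE mantissa discussed above. The names frac_bits, steps, and spacing are hypothetical.

```sas
data _null_;
  y = 122000015596951;
  exp_use   = floor(log2(y));   /* power of 2 of the leading bit: 46 for this value */
  frac_bits = 52 - exp_use;     /* mantissa bits to the right of the radix point */
  steps     = 2**frac_bits;     /* base-2 values per unit step in y */
  spacing   = 1 / steps;        /* gap between adjacent representable values */
  put exp_use= frac_bits= steps= spacing=;
run;
```

For this value the sketch gives six fractional bits, 64 steps, and a spacing of 0.015625, which agrees with the one-bit difference found in step 2.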

What about actual math operations?

The PROC PRINT output titled 5: Results of Miscellaneous Math Operations shows a few of the ways in which small differences between values can play out in calculations:

  • operation shows the name of the operation that will be performed.
  • x and crbx show the “exactly an integer” value (in decimal via w.d, and in floating-point via hexadecimal, respectively) that participates in the operation.
  • y and crby (w.d and floating-point, like x and crbx) show the "integer plus 1/64 (least-significant mantissa bit)" value that participates in the operation.
  • w and crbw (again, w.d and floating-point, respectively) show the result of the operation.
  • d shows the difference between the results of the operation, and what the result would have been if both values participating in the operation had been x.

"Divide" shows that the w.d format displays the quotient (w) as 1.0, but crbw shows that the least-significant bit that was present in y is still present in the quotient (as it should be). While the quotient would not compare equal to 1.0, it would be displayed by the w.d format as 1.0 because the difference is 2.2E-16, which is less than 1E-15.

"Multiply" shows the result of multiplying 122,000,015,596,951.0 by 122,000,015,596,951.015625, and compares that result to multiplying 122,000,015,596,951.0 by itself. The difference is about 2.3E12, and 122,000,015,596,951.0 * 0.015625 = 1.9E12. This rounding error shows how a DATA step programmer can be surprised if they don't pay attention to the actual precision of values they use in calculations. Applying the round() function to operands before they’re used in a calculation can prevent this kind of problem.

"Add" shows that that least-significant bit got lost when the two values were added. The fourth hexadecimal character from the right of crbw shows that the value of the exponent increased by one (from D to E). This effectively shifted-out the least-significant bit of the mantissa. This could be an unexpected "rounding error."

"Fuzz" shows that the fuzz() function is useless to resolve nuisance differences in numbers of this magnitude because fuzz() only rounds values that are within 1E-12 of an integer. Since 0.015625 is much larger than 1E-12, fuzz() doesn’t round the non-integer value.

"Round" shows that round() does, however, turn a potentially problematic value into a nice, clean value.

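The "Add", "Fuzz", and "Round" observations can be checked with a small sketch. As before, y is assumed to be the integer plus 1/64, and the flag names are hypothetical.

```sas
data _null_;
  x = 122000015596951;
  y = x + 1/64;
  add_lost    = (x + y = 2*x);    /* 1 if the extra bit was shifted out of the sum */
  fuzz_equal  = (fuzz(y) = x);    /* fuzz() changes y only if it is within 1E-12 of an integer */
  round_equal = (round(y) = x);   /* round() with one argument rounds to the nearest integer */
  put add_lost= fuzz_equal= round_equal=;
run;
```

Based on the discussion above, fuzz_equal should be 0 and round_equal should be 1, and add_lost reports whether the sum lost the least-significant bit on your hardware.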
Coming back up for air

Thank you to Bill for providing this tremendous detail.  If you're a SAS programmer who is sometimes curious about numeric representation, this content should keep you chewing for a while.  Here are a few other resources that you might find useful:

 

tags: formats, precision, SAS programming
February 20, 2012
 

The SAS DATA step supports a special syntax for determining whether a value is contained in an interval:

y = (-2 < x < 2);
This expression creates an indicator variable with the value 1 if x is in the interval (-2,2) and 0 otherwise.

The documentation for the AND operator states that "two comparisons with a common variable linked by [the AND operator] can be condensed" into a single statement with an implied AND operator. For example, the following two statements are equivalent:

y = (-2 < x < 2);
y = (-2<x & x<2);
The second syntax is more familiar to programmers in languages such as C/C++ that do not support an "implied AND" comparison operator.
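To confirm the equivalence in the DATA step, you can evaluate both forms over a range of values. This is a small sketch; the data set and variable names are hypothetical.

```sas
data interval;
  do x = -3 to 3;
    y_implied  = (-2 < x < 2);        /* implied AND */
    y_explicit = (-2 < x & x < 2);    /* explicit AND */
    same = (y_implied = y_explicit);  /* 1 on every observation */
    output;
  end;
run;

proc print data=interval noobs;
run;
```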

The reason that I mention this syntax is that it is NOT available in the SAS/IML language (nor in R, nor in MATLAB). Sometimes experienced DATA step programmers expect the implied AND operator to work in their SAS/IML programs. The syntax does parse and execute, but it gives a different value than in the DATA step! For example, consider the following SAS/IML program:

proc iml;
x = -3:3;   /* the vector {-3 -2 -1 0 1 2 3} */
y = (-2 < x < 2);
There are no errors when you run this program. The expression for y is parsed as
y = ((-2 < x) < 2);
Regardless of the values in x, the variable y is a vector of ones. Why? The previous statement is equivalent to the following two statements:
v = (-2 < x);  /* {0 0 1 1 1 1 1} */
y = (v < 2);   /* {1 1 1 1 1 1 1} */
In the first statement, v is an indicator variable: v[i]=1 when the expression is true, and 0 otherwise. In the second statement, the zeros and ones in v are compared with the value 2. Not surprisingly, all of the zeros and ones are less than 2, so y is a vector of all ones.

Conclusion: Don't use the DATA step syntax in your SAS/IML programs. Instead, use an explicit AND operator such as (-2<x & x<2).
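For completeness, here is the explicit form applied to the same vector as a minimal sketch. The & operator works elementwise in SAS/IML, so y is 1 only where both comparisons are true.

```sas
proc iml;
x = -3:3;                 /* the vector {-3 -2 -1 0 1 2 3} */
y = (-2 < x & x < 2);     /* {0 0 1 1 1 0 0} */
print x, y;
quit;
```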

I am told that Python and Perl 6 (but not Perl 5) also support this implied AND operator. The SQL procedure in SAS software supports it, although other implementations of SQL do not. Do you know of other languages that also support a compact syntax for testing whether a value is within an interval?

tags: Getting Started, SAS Programming, Tips and Techniques
February 11, 2012
 
As SAS user Marje Fecht said, "We all want a 'SAS programming assistant' to help us complete our jobs more quickly." Fecht, Senior Partner at Prowerk Consulting, then went on to say, "In her book SAS Macro Programming Made Easy, Second Edition, Michele Burlew encourages us to take advantage of the SAS [...]
February 6, 2012
 

Have you ever wanted to run a sample program from the SAS documentation or wanted to use a data set that appears in the SAS documentation? You can! All programs and data sets in the documentation are distributed with SAS; you just have to know where to look.

Sample data in the SASHELP library

The data that are used in the SAS documentation are obtained in one of two ways: from a data set in the SASHELP library, or from a DATA step.

Many documentation examples use data in the SASHELP library, such as the Class, Iris, and Cars data sets. To use these data sets, specify the two-level name (libref.DataSetName) on the DATA= option of a procedure. For example, to display descriptive statistics about numerical variables in the Iris data, use the following call to the MEANS procedure:

title "Descriptive statistics for SASHELP.Iris";
proc means data=sashelp.iris;
run;

In other examples, the data are presented in a DATA step. However, sometimes the DATA step is so long that not all of the data appear in the documentation. For example, the documentation of the QUANTREG procedure contains the following truncated DATA step:

data ozone;
  days = _n_;
  input ozone @@;
datalines;
0.0060 0.0060 0.0320 0.0320 0.0320 0.0150 0.0150 0.0150 0.0200 0.0200
0.0160 0.0070 0.0270 0.0160 0.0150 0.0240 0.0220 0.0220 0.0220 0.0185
0.0150 0.0150 0.0110 0.0070 0.0070 0.0240 0.0380 0.0240 0.0265 0.0290
   ... more lines ...   
0.0220 0.0210 0.0210 0.0130 0.0130 0.0130 0.0330 0.0330 0.0330 0.0325
0.0320 0.0320 0.0320 0.0120 0.0200 0.0200 0.0200 0.0320 0.0320 0.0250
0.0180 0.0180 0.0270 0.0270 0.0290
;

The complete data is contained in the sample program for this example. The next sections show how to access sample programs.

Access sample programs from the Help menu

You can access SAS sample programs by using the Help menu inside the SAS Windowing Environment, which is the default GUI environment. The following image shows a way to access the SAS/STAT sample programs by using the SAS GUI:

Access sample programs from the installation directory

Michael A. Raithel, in a SAS Tip on sasCommunity.org, points out that sample programs exist for all products, and gives examples of directories in which you can find them. For example, on my desktop PC, which is running SAS 9.3, I can browse to

C:\Program Files\SASHome\SASFoundation\9.3\ProductName\sample\
in order to see the sample programs.

You need to replace ProductName with the name of a SAS product, such as stat or ets. But how can you find the path of the "root directory" for your SAS installation? You can submit the following macro statement, which writes the value of the SASROOT environment variable to the SAS log. (Because %PUT is a macro statement, it does not need to be wrapped in a DATA step.)

%put %sysget(sasroot);
     C:\Program Files\SASHome\SASFoundation\9.3

This is the value of sasroot that Michael refers to in his article. As Michael points out, it can be useful to browse programs in the sample directory, especially for products that you are trying to learn. The names of the files can be somewhat cryptic, so you should use a search tool to find the correct file. For example, if you search the SAS/STAT sample directory for the term "data ozone," you will discover that the DATA step for the PROC QUANTREG example is in the file qregex4.sas. Therefore, the complete path for the PROC QUANTREG example on my PC is

     C:\Program Files\SASHome\SASFoundation\9.3\stat\sample\qregex4.sas

I'd like to be able to end this post by saying "the examples for PROC IML are great and help you learn how to program in the SAS/IML language." Unfortunately, many of the SAS/IML examples (which are based on the documentation) are old and do not contain many comments or explanatory text. But don't despair. For Getting Started examples, read The DO Loop every Monday.

tags: Getting Started, SAS Programming, Tips and Techniques
January 30, 2012
 

The other day I encountered the following SAS DATA step for generating three normally distributed variables. Study it, and see if you can discover what is unnecessary (and misleading!) about this program:

data points;
drop i;
do i=1 to 10;
   x=rannor(34343);
   y=rannor(12345);
   z=rannor(54321);
   output;
end;
run;

The program creates the POINTS data set. The data set contains three variables, each containing random numbers from the standard normal distribution. I'm guessing that the author of the program thinks that using rannor(12345) to define the y variable makes y independent from the x variable, which is defined by rannor(34343).

Sorry, but that is not correct.

The x, y, and z variables are, indeed, independent samples from a normal distribution, but that fact does not depend on using different seeds in the RANNOR function. In fact, in this DATA step, all random number seeds except the first one are completely ignored! Don't believe me? Run the following DATA step and compare the two data sets, as follows:

data points2;
drop i;
/* change all random number seeds except the first */
x=rannor(34343); y=rannor(1); z=rannor(2); output;
do i=2 to 10;
   x=rannor(10+i);
   y=rannor(100+i);
   z=rannor(1000+i);
   output;
end;
run;
 
proc compare base=points compare=points2;
run;
                           The COMPARE Procedure
                Comparison of WORK.POINTS with WORK.POINTS2
                               (Method=EXACT)
                               
NOTE: No unequal values were found. All values compared are exactly equal.

All values compared are exactly equal. Every observation, every variable, down to the last bit. But except for the first observation of the x variable, the second DATA step uses completely different random number seeds! How can the POINTS2 data set be identical to the POINTS data set?

As I explained in a previous post on random number seeds in SAS, the random number seed for a DATA step (or SAS/IML program) is set by the first call. SAS ignores subsequent seeds within the same DATA step or PROC step. In my previous post, I used the newer (and better) STREAMINIT function and the RAND function instead of the older RANNOR function, but the fact remains that the first random number seed determines the random number stream for the entire DATA step.
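For comparison, here is a sketch of the same simulation written with the newer functions. The seed is set once with CALL STREAMINIT, and every call to RAND draws from that single stream. The data set name points3 is hypothetical, and the values will not match those from RANNOR because the two functions use different generators.

```sas
data points3;
  call streaminit(34343);    /* one seed for the entire DATA step */
  drop i;
  do i = 1 to 10;
    x = rand("Normal");      /* standard normal draws from the same stream */
    y = rand("Normal");
    z = rand("Normal");
    output;
  end;
run;
```

Written this way, the program no longer suggests that each variable has its own seed.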

For further details, see the SAS documentation, which shows an example similar to mine in which three data sets (imaginatively named A, B, and C) contain the same pseudorandom numbers.

Now that I've ranted against using different random number seeds, I will reveal that the DATA step at the beginning of my post is from an example in the SAS Knowledge Base! Yes, even experienced SAS programmers are sometimes confused by the subtleties of random number streams. There is nothing wrong with a program that uses multiple seeds, but such a program makes the reader think that all those seeds are actually doing something. They’re not.

Are you someone who uses different random number seeds for each variable in the same DATA step or PROC IML program? If so, you can safely stop. Multiple seeds do not make your random variables any more "random." Only the first seed matters.
tags: Getting Started, Sampling and Simulation, SAS Programming
January 28, 2012
 

So many of us struggle with this mountain. In fact, 68.27% of us get within sight of reaching the summit (while 95.45% of us are at least on a perceivable slope). We run, walk, crawl and sometimes slide our way uphill (from one direction or the other) until we finally reach the top.

That is, the top of the "bell" curve.

I came across this t-shirt design over at shirt.woot.com, one of a number of entries that celebrate dubious honors. (Also worth a look: Least Noticed Person Ever and Duck, Duck, Goose Champion.)

The design inspired me. I thought, "I'm an average sort of SAS programmer; this is a summit that I can actually reach." And with a middling amount of effort, I met my objective.

And you can do it too. But please, don't knock yourself out. I sure-as-heck didn't. In fact, I reached this dubious peak by climbing over the backs of others, lifting the code for generating a normal distribution from Rick, and the code for vector plots from the book by Sanjay and Dan.

Here's my less-than-original SAS program that yields a less-than-original design:

ods graphics /width=500 height=500;
 
data normal;
  do x = -3 to 3 by 0.1;
    y = pdf("Normal", x);
    output;
  end;
  x0 = 0;
  y0 = .43;
  you="YOU'VE ARRIVED!";
  output;
run;
 
proc sgplot data=normal noautolegend;
  title "Congratulations!";
  title2 "You've reached...";
  footnote "MEDIOCRITY";
  series x=x y=y;
  vector x=x0 y=y0 /
    xorigin=x0 yorigin=.5  
    arrowdirection=out 
    lineattrs=(color=red thickness=1) 
    datalabel=you;
  xaxis grid display=(novalues);
  yaxis grid display=(novalues);
  refline 0 / axis=y;
run;
tags: SAS dummy, SAS programming, SGPLOT
January 23, 2012
 

Statistical programmers often need mathematical constants such as π (3.14159...) and e (2.71828...). Programmers of numerical algorithms often need to know machine-specific constants such as the machine precision constant (2.22E-16 on my Windows PC) or the largest representable double-precision value (1.798E308 on my Windows PC).

Some computer languages build these constants into the language as reserved (or semi-reserved) words. This can lead to problems. For example, statisticians use the symbol pi to represent mixture probabilities in mixture distributions. In some languages, you can assign the symbol pi to a vector, which overrides the built-in value and causes cos(pi) to have an unexpected value. In other languages, the symbol pi is a reserved keyword and is unavailable for assignment.

In SAS, constants are not built into the language. Instead, they are available through a call to the CONSTANT function. You can call the CONSTANT function from the DATA step, from SAS/IML programs, and from SAS procedures (such as PROC FCMP, PROC MCMC, and PROC NLIN) that enable you to write SAS statements inside the procedure.
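In the DATA step, a call might look like the following minimal sketch. The variable names are arbitrary, and the BEST16. format is used only to display more digits than the default.

```sas
data _null_;
  pi     = constant("pi");
  maceps = constant("maceps");   /* machine precision */
  put pi= best16. maceps=;       /* write the values to the SAS log */
run;
```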

For example, the following SAS/IML statements get two mathematical and two machine-specific constants and assign them to program variables:

proc iml;
pi = constant("pi");
e = constant("e");
maceps = constant("maceps");
big = constant("big");
print pi e maceps big;

The constants are printed by using the default BEST9. format; you can use a format such as BEST16. to see that the values contain additional precision. For example, the value that is actually stored in the pi variable is 3.14159265358979.

Most programmers understand the need for constants like "pi." However, I often use machine-specific constants in order to write robust numerical algorithms that avoid numerical overflow and underflow. For example, suppose that you have some data and you want to apply an exponential transformation to the data. (In other words, you want to compute exp(x) for each value of x.) The exponential function increases very quickly, so exp(x) cannot be stored in a double-precision value when x is moderately large. As a result, the following computation causes a numerical overflow error:

x = do(0, 1000, 200); /* {0, 200, 400, ..., 1000} */
expx = exp(x); /* ERROR: numerical overflow */
ERROR: (execution) Invalid argument to function.

 count     : number of occurrences is 2
 operation : EXP at line 2154 column 11
 operands  : x

<...more error information (omitted)..>

The largest value of x that can be exponentiated is found by calling the CONSTANT function with the "LOGBIG" argument. You can use the "LOGBIG" value to locate values of x that are too large to be transformed. You can then assign missing values to exp(x) for those "large" values of x, as shown in the following statements:

logbig = constant("logbig");
idx = loc(x<logbig);  /* locate values for which exp(x) does not overflow */
expx = j(1,ncol(x), .); /* allocate result vector of missing values */
expx[idx] = exp( x[idx] ); /* exp(x) only for "sufficiently small" x */
print logbig, expx;

In this way, you can write robust code that does not fail when there are large values in the data. Similarly, you can use this idea to protect against overflow when raising a value to a power by using the '**' operator in the DATA step or the '##' operator in SAS/IML.
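The same guard can be written in the DATA step. The following sketch applies it to the exponential transformation for the same x values as the SAS/IML example; the data set name safe_exp is hypothetical.

```sas
data safe_exp;
  logbig = constant("logbig");          /* largest x for which exp(x) is representable */
  do x = 0 to 1000 by 200;
    if x < logbig then expx = exp(x);   /* safe to exponentiate */
    else expx = .;                      /* assign a missing value instead of overflowing */
    output;
  end;
run;

proc print data=safe_exp noobs;
run;
```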

This blog post was inspired by a tweet from @RLangTip on Twitter: "Constants built into #rstats: letters, months and pi," which linked to documentation on constants that are built into the R language.

tags: Getting Started, SAS Programming, Tips and Techniques
January 19, 2012
 

In the past, getting your hands on SAS for learning purposes required one of two fortunate situations:

  • being a student enrolled in a college course (or high school!) where SAS is taught
  • working for an employer who is willing to sponsor your training, either in an official course or on-the-job.

Now there is an affordable way for professionals to get hands-on access to SAS: SAS OnDemand for Professionals: Enterprise Guide (in the USA and Canada only, for now).

This new SAS OnDemand offering complements the SAS OnDemand for Academics offering that has evolved over the past few years.  It's SAS, running on "the cloud", and you use a supplied version of SAS Enterprise Guide to access it (along with a good collection of sample data).  With SAS Enterprise Guide you can exercise most of the SAS features that you would need to practice for any career objective: learn SAS programming, hone your skills in business analytics, or use high-end statistical methods to analyze data.

CHOOSE YOUR PATH THROUGH THIS BLOG POST:

  • If you want to learn how to use SAS to query and transform data, calculate summary statistics, build graphs, create reports, and dabble in higher-end analytics -- but you don't want to have to write or understand programming code...continue on to the immediately following section, "Learning SAS without programming".
  • If you want to learn how to program in SAS, using DATA step, SAS macro language, SAS procedures and more, and you don't want to be held back by a point-and-click interface...skip this next section and go directly to read "Programming SAS with SAS Enterprise Guide".

Learning SAS without programming

SAS Enterprise Guide is often positioned as "the point-and-click interface to SAS".  Many of us were raised on the idea that "using SAS requires programming".  But it doesn't.  SAS Enterprise Guide has over 90 built-in tasks for accessing data, summarizing data, creating reports, and performing statistical analysis.

There are many popular books and training courses that show you how to get to the power of SAS without having to learn the syntax of SAS.  For example, we have books like SAS For Dummies, Little SAS Book for Enterprise Guide, and Basic Statistics using SAS Enterprise Guide: A Primer.  And there is a whole boatload of training courses on the topic.

If you've read some of the SAS Enterprise Guide tips that I've published on this blog and want to try them out, you can probably do it with SAS OnDemand.

NOTE: If you don't want to know anything about SAS programming, skip the next section and read "SAS OnDemand for Professionals: Learn what you want, how you want"

Programming SAS with SAS Enterprise Guide

So, you're a SAS programmer?  Or you want to be one?  Perhaps you're pursuing a SAS programming certification?  With SAS Enterprise Guide, you can skip the point-and-click stuff and jump right into programming with File->New->Program.  If you've read some of the SAS programming tips that I've published on this blog, you can try them for yourself using the SAS OnDemand environment.

For some experienced SAS programmers, SAS Enterprise Guide presents a different SAS environment than what you're accustomed to.  But you can accomplish most programming tasks here, and might even find yourself more productive with the super program editor and the process flow approach for organizing your code.

You can learn more about programmer productivity in SAS Enterprise Guide by watching this SAS Talks webinar.  Or if you really want a leg up, take the course.

SAS OnDemand for Professionals: Learn what you want, how you want

Regardless of your path -- point-and-click, SAS programming, or a mix of each -- SAS OnDemand for Professionals: Enterprise Guide provides a good learning environment to gain and practice your SAS skills.

But the learning resources don't stop there.  You can use these blogs, discussion forums, and the entire SAS community to supplement your knowledge as you learn.  SAS professionals love to share their knowledge.  And you'll be proud to share what you know too, when you join their ranks.

tags: SAS Enterprise Guide, SAS OnDemand, SAS programming
January 4, 2012
 

In the immortal words of Britney Spears: Oops! I did it again.

At least, I'm afraid that I did. I think I might have helped a SAS student with a homework assignment, or perhaps provided an answer in preparation for a SAS certification exam. Or maybe it was a legitimate work-related question; I'd like to think so, anyway.

This time, the question came to me via LinkedIn. (By the way, LinkedIn contains a rich network of SAS professionals; in her blog post, Tricia provides some helpful guidance for making use of that network.)

The question pertains to some confusing behavior of the LAG function. Within a DATA step, the LAG function is thought to provide a peek into the value of a variable within a previous observation. But in this program, the LAG function didn't seem to be doing its job:

data test;
  infile datalines dlm=',' dsd;
  input a b c;
  datalines;
4272451,17878,17878 
4272451,17878,17878 
4272451,17887,17887 
4272454,17878,17878 
4272454,17881,17881 
4272454,17893,17893 
4272455,17878,17878 
4272455,17878,18200 
run;
 
data testLags;
  retain e f ( 1 1);
  set test;
  if a=lag(a) and b>lag(b) then
    e=e+1;
  else if a^=lag(a) or lag(a)=. then
      e=1;
  if a^=lag(a) or lag(a)=. then
      f=1;
  else if a=lag(a) and b>lag(b) then
      f=f+1;
run;
 
proc print data=testLags;
run;

The questioner thought that the e and f variables should have the same values in each record of output, but they don't. The two variables are calculated using the exact same statements, but with the seemingly-exclusive IF/THEN conditions reversed. Here's the output:

Obs    e    f       a         b        c

 1     1    1    4272451    17878    17878
 2     1    1    4272451    17878    17878
 3     2    2    4272451    17887    17887
 4     1    1    4272454    17878    17878
 5     2    1    4272454    17881    17881
 6     3    2    4272454    17893    17893
 7     1    1    4272455    17878    17878
 8     1    1    4272455    17878    18200

There is a SAS note that warns of the effect of using the LAG function conditionally. But in this example, every LAG function appears in an IF condition (before the THEN clause), so each one looks as though it's used unconditionally. Or is it?

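Let's review how the LAG function works. Each LAG call in the program keeps its own queue of previous values. When a call executes, it stores the current value on its queue and returns the value that it stored the last time it executed; on an iteration where the call doesn't execute, its queue doesn't move. The trick here is that this program does not execute every LAG call with each iteration of the DATA step! The condition on an ELSE IF is evaluated only when the preceding IF is false, so the LAG calls in that condition run for some observations and are skipped for others.

  if a^=lag(a) or lag(a)=. then       /* evaluated for every observation */
      f=1;
  else if a=lag(a) and b>lag(b) then  /* evaluated only when the IF above is false */
      f=f+1;

And then the next time around, when a skipped LAG call finally runs again, it's "behind" on its queue: it returns the value from the last observation where it ran, not the value from the previous observation. That is consistent with the output above. In observation 4 the first IF for f is true, so the ELSE IF is skipped; in observation 5 its LAG(a) returns the value from observation 3 (4272451), the comparison with 4272454 fails, and f stays at 1. Compound conditions deserve the same caution. In logic-speak, FALSE AND (ANY value) is always FALSE, so a program can save work by not bothering to evaluate the remainder of the expression; don't count on a LAG call in the second half of an AND running every time.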
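Here's a minimal sketch of a conditionally executed LAG call, stripped of everything else. The data set name (lagdemo) and the variable names are made up for illustration; they aren't part of the original question.

```sas
/* Hypothetical data set and variable names, for illustration only */
data lagdemo;
  input x;
  prev_every = lag(x);      /* this LAG call executes on every observation */
  if mod(_n_, 2) = 0 then
    prev_even = lag(x);     /* this one executes only on even-numbered observations */
  datalines;
10
20
30
40
;
run;

proc print data=lagdemo;
run;
```

The two LAG calls keep separate queues. On the fourth observation, prev_every should be 30, while prev_even should be 20: the value from observation 2, which was the last time that call ran.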

One way to solve the issue (and remove the logical ambiguity): set two temporary variables to LAG(a) and LAG(b) at the start of the DATA step, and use those variables in the subsequent comparisons. With the LAG function now being called with each iteration no matter what, the results are consistent. Here's an example of the modified program:

data testlags2(drop=laga lagb);
  retain e f (1 1);
  set test;
  /* Call LAG exactly once per variable on every iteration, */
  /* so each queue advances with every observation.         */
  laga = lag(a);
  lagb = lag(b);
  if a=laga and b>lagb then
    e=e+1;
  else if a^=laga or laga=. then
    e=1;
  if a^=laga or laga=. then
    f=1;
  else if a=laga and b>lagb then
    f=f+1;
run;

Here are the new results when printed:

Obs    e    f       a         b        c

 1     1    1    4272451    17878    17878
 2     1    1    4272451    17878    17878
 3     2    2    4272451    17887    17887
 4     1    1    4272454    17878    17878
 5     2    2    4272454    17881    17881
 6     3    3    4272454    17893    17893
 7     1    1    4272455    17878    17878
 8     1    1    4272455    17878    18200
tags: lag, LinkedIn, SAS programming
December 17, 2011
 

On the heels of the release of the popular SAS macro variable viewer from last month, I'm providing another custom task that I hope will prove just as useful. This one is a SAS options viewer, similar in concept to the OPTIONS window in SAS display manager.

You can download the new task from this location. (The download is a ZIP file with a DLL and a README.pdf that explains how to install it. In fact, it's the same download package as the macro viewer task; I've packaged them both in the same DLL, so if you install one, you get them both. You're welcome! Both tasks require SAS Enterprise Guide 4.3, with a SAS 9.2 or 9.3 environment.)

If you've tried the macro variable viewer, then the user interface for the options viewer will look familiar. It shares many of the same features: the window "floats" as a toolbox window so you can keep working while it's visible, you can filter the results (important, given the hundreds of options!), and you can view options as a straight list or grouped by category. In addition, there are important features specific to SAS options, such as the ability to see details about how each option was set and where it is valid. Here is the complete set of features:

Always-visible window: Once you open the task from the Tools menu, you can leave it open for your entire SAS Enterprise Guide session. The window uses a "modeless" display, so you can still interact with other SAS Enterprise Guide features while the window is visible. This makes it easy to switch between SAS programs and other SAS Enterprise Guide windows and the options viewer to see results.

Select active SAS server: If your SAS environment contains multiple SAS workspace connections, you can switch among the different servers to see options values on multiple systems.

One-click refresh: Refresh the list of option values by clicking on the Refresh button in the toolbar.

View by group or as a straight list: View the option values in their group categories (such as MEMORY, GRAPHICS, and EMAIL) or as a straight list, sorted by option name or current value. Click on the column headers to sort the list.

Set window transparency: You can make the window appear "see-through" so that it doesn't completely obscure your other windows as you work with it.

Filter results: Type a string of characters in the "Filter results" field, and the list of options will be instantly filtered to those that contain the sequence that you type. The filtered results will match on option names as well as values, and the search is case-insensitive. This is a convenient way to narrow the list to just a few options that you're interested in. To clear the filter, click on the X button next to the "Filter results" field, or "blank out" the text field.

Show option details: The "About" pane at the bottom of the window shows details about the currently selected option, including the current value, its value at session startup (SAS 9.3 only), where the option can be set (in OPTIONS statement or startup) and how the current value was set. You can show or hide this pane by clicking a toggle button at the top of the window.
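For comparison, you can get at similar information from a SAS program. The sketch below is not how the task works internally; it uses PROC OPTIONS and the SASHELP.VOPTION view. The MEMSIZE option, the MEMORY group, and the 'MEM' filter string are arbitrary examples.

```sas
/* Describe one option: what it does, its current value, and how it was set. */
/* MEMSIZE is only an example option name.                                   */
proc options option=memsize define value;
run;

/* List the options in a single group */
proc options group=memory;
run;

/* Filter on option names and values, similar in spirit to the */
/* "Filter results" field ('MEM' is an arbitrary search string) */
proc print data=sashelp.voption noobs;
  where upcase(optname) contains 'MEM'
     or upcase(setting) contains 'MEM';
  var optname setting optdesc;
run;
```

The output goes to the SAS log (PROC OPTIONS) or to a report (PROC PRINT), so it's a one-time snapshot rather than a window that you can leave open and refresh.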

I can imagine several more features that might be useful, but I decided that these were enough for a first version. Try it out and leave feedback for me in the comments here. (That worked pretty well with the macro viewer task; in fact, this options viewer was one of your suggestions!)

tags: SAS custom tasks, SAS Enterprise Guide, SAS options, SAS programming