sas programming

April 13, 2017
 
The Penitent Magdalene

Titian (Tiziano Vecellio) (Italian, about 1487 - 1576) The Penitent Magdalene, 1555 - 1565, Oil on canvas 108.3 × 94.3 cm (42 5/8 × 37 1/8 in.) The J. Paul Getty Museum, Los Angeles; Digital image courtesy of the Getty's Open Content Program.

Even if you are a traditional SAS programmer and have nothing to do with cybersecurity, you still probably have to deal with this issue in your day-to-day work.

The world has changed, and what you do as a SAS programmer is not just between you and your computer anymore. However, I have found that many of us are still careless, negligent or reckless enough to be dangerous.

Would you scotch-tape your house key to the front door next to the lock and go on vacation? Does anybody do that? Still, some of us have no problem explicitly embedding passwords in our code.

That single deadly sin, the thing that SAS programmers (or any programmers) must not do under any circumstances, is placing unmasked passwords into their code. I must confess, I too have sinned, but I have seen the light and hope you will too.

Password usage examples

Even if SAS syntax calls for a password, never type it or paste it into your SAS programs. Ever.

If you connect to a database using a SAS/ACCESS LIBNAME statement, your libname statement might look like:

libname mydblib oracle path=airdb_remote schema=hrdept
	user=myusr1 password=mypwd1;

If you specify the LIBNAME statement options for the metadata engine to connect to the metadata server, it may look like:

libname myeng meta library=mylib
	repname=temp metaserver='a123.us.company.com' port=8561 
 		user=idxyz pw=abcdefg;

If you use LIBNAME statement options for the metadata engine to connect to a database, it may look like:

libname oralib meta library=oralib dbuser=orauser dbpassword=orapw;

In all of the above examples, some password is “required” to be embedded in the SAS code. But it does not mean you should put it there “as is,” unmasked. SAS program code is usually saved as a text file, which is stored on your laptop or somewhere on a server. Anyone who can get access to such a file would immediately get access to those passwords, which are key to accessing databases that might contain sensitive information. It is your obligation and duty to protect this sensitive data.

Hiding passwords in a macro variable or a macro?

I’ve seen some shrewd SAS programmers who do not openly put passwords in their SAS code. Instead of placing the passwords directly where the SAS syntax calls for them, they assign them to a macro variable in AUTOEXEC.SAS, an external SAS macro file, a compiled macro or some other SAS program file included in their code, and then use a macro reference, e.g.

/* in autoexec.sas or external macro */
%let opw = mypwd1;
 
/* or */
 
%macro opw;
	mypwd1
%mend opw;
/* in SAS program */
libname mydblib oracle user=myusr1 password=&opw
        path=airdb_remote schema=hrdept;
 
/* or */
 
libname mydblib oracle user=myusr1 password=%opw
        path=airdb_remote schema=hrdept;

Clever! But it’s no more secure than leaving your house key under the doormat. In fact, it is much less secure. Someone who wants to look up your password does not even need to look under the doormat (that is, at the program file where the password is assigned to a macro variable or a macro). For a macro variable, a simple %put &opw; statement prints the password’s actual value in the SAS log. In the case of a macro, %put %opw; gives the same result.

In other words, hiding the passwords does not actually protect them.

What to do

What should you do instead of openly placing or concealing those required passwords in your SAS code? The answer is short: encrypt passwords.

SAS software provides a powerful procedure for encrypting passwords that renders them practically unusable outside of the SAS system.

This is PROC PWENCODE, and it is very easy to use. In its simplest form, in order to encrypt (encode) your password abc123 you would need to submit just the following two lines of code:

proc pwencode in="abc123";
run;

The encrypted password is printed in the SAS log:

1 proc pwencode in=XXXXXXXX;
2 run;

{SAS002}3CD4EA1E5C9B75D91A73A37F

Now, you can use this password {SAS002}3CD4EA1E5C9B75D91A73A37F in your SAS programs. The SAS System will seamlessly take care of decrypting the password during compilation.

The above code examples can be re-written as follows:

libname mydblib oracle path=airdb_remote schema=hrdept
	user=myusr1 password="{SAS002}9746E819255A1D2F154A26B7";
 
libname myeng meta library=mylib
	repname=temp metaserver='a123.us.company.com' port=8561 
 		user=idxyz pw="{SAS002}9FFC53315A1596D92F13B4CA";
 
libname oralib meta library=oralib dbuser=orauser
dbpassword="{SAS002}9FFC53315A1596D92F13B4CA";

Encryption methods

The {SAS002} prefix indicates the encoding method. This SAS-proprietary method, which uses a 32-bit key, is the default, so you don’t have to specify it in PROC PWENCODE.

There are other, stronger encryption methods supported in SAS/SECURE:

{SAS003} – uses a 256-bit key plus 16-bit salt to encode passwords,

{SAS004} – uses a 256-bit key plus 64-bit salt to encode passwords.

If you want to encode your password with one of these stronger encryption methods you must specify it in PROC PWENCODE:

proc pwencode in="abc123" method=SAS003;
run;

SAS Log: {SAS003}50374C8380F6CDB3C91281FF2EF57DED10E6

proc pwencode in="abc123" method=SAS004;
run;

SAS Log: {SAS004}1630D14353923B5940F3B0C91F83430E27DA19164FC003A1
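If you would rather not have the encoded value pass through the SAS log, PROC PWENCODE can write it to a file by using the OUT= option. The following is a minimal sketch, not a recommended production setup: the fileref pwfile, the temporary file, and the macro variable name encpw are assumptions chosen only for illustration.

```sas
/* Sketch: write the encoded password to a file instead of the log.        */
/* "pwfile" is a hypothetical fileref; TEMP is used only for illustration. */
filename pwfile temp;

proc pwencode in="abc123" out=pwfile;   /* default method (SAS002) */
run;

/* Read the encoded value back into a macro variable (illustrative name) */
data _null_;
   infile pwfile truncover;
   input encoded $200.;
   call symputx("encpw", encoded);
run;

/* Show only the method prefix, not the full encoded value */
%put NOTE: Encoded value starts with %substr(&encpw, 1, 8);
```

In practice you would point the fileref at a file whose operating-system permissions restrict who can read it.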

Beyond encryption

There are other methods of obscuring passwords to protect access to sensitive information that are available in the SAS Business Intelligence Platform. These are the AUTHDOMAIN= SAS/ACCESS option supported in LIBNAME statements, as well as PROC SQL CONNECT statements, SAS Token Authentication, and Integrated Windows Authentication. For more details, see Five strategies to eliminate passwords from your SAS programs.
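As a rough illustration of the AUTHDOMAIN= option, the earlier Oracle LIBNAME statement might be written without any password at all. This is only a sketch under stated assumptions: the authentication domain name OracleAuth is hypothetical, and the statement works only when the SAS session is connected to a metadata server in which credentials for that domain are stored.

```sas
/* Sketch: no user ID or password appears in the program.                   */
/* "OracleAuth" is a hypothetical authentication domain defined in metadata; */
/* the session must already be connected to the SAS Metadata Server.         */
libname mydblib oracle path=airdb_remote schema=hrdept
        authdomain="OracleAuth";
```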

Conclusion

Never place an unencrypted password into your SAS program. Encrypt it!

Place this sticker in front of you until you memorize it by heart.

PROC PWENCODE sticker

 

One deadly sin SAS programmers should stop committing was published on SAS Users.

April 13, 2017
 

I recently asked a SAS user, “Which interface do you use for SAS?” She replied, “Interface? I just install SAS and use it.” “You’re using the SAS windowing environment,” I explained, but she had no idea what I was talking about. This person is an extremely sophisticated SAS user who [...]

The post What’s your SAS interface? appeared first on SAS Learning Post.

March 27, 2017
 

Did you know that you can check a SAS macro variable to see if ODS graphics is enabled? The other day I wanted to write a SAS program that creates a graph only if ODS graphics is enabled. The solution is to check the SYSODSGRAPHICS macro variable, which is automatically updated by the SAS system whenever ODS graphics is enabled or disabled. (Thanks to Warren Kuhfeld for this tip!) The SYSODSGRAPHICS variable has the value 0 for "off" and 1 for "on."

For example, the following SAS/IML program checks to see if ODS graphics is on. If so, it creates a bar chart of a categorical variable. If not, it computes the count for each category and prints a frequency table:

ods graphics on;         /* -OR-  ods graphics off */
proc iml;
use sashelp.cars;  read all var "Origin";  close;
 
if &SYSODSGRAPHICS then  /* is ODS graphics enabled? */
   call bar(Origin);     /* optionally display a graphical summary */
/* always display a tabular summary */
call tabulate(Categories, Counts, Origin);
print Counts[colname=Categories];

The SYSODSGRAPHICS automatic macro variable is a recent addition to the automatic macro variables in SAS. You can read Rick Langston's 2015 paper to discover other (relatively) new macro features in SAS 9.3 and SAS 9.4.
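The same check can be made outside of SAS/IML. The following sketch uses the macro language to choose between a graph and a table; the macro name PlotOrTable is an illustrative assumption, and PROC SGPLOT and PROC FREQ stand in for whatever graphical and tabular summaries you prefer.

```sas
/* Sketch: branch on SYSODSGRAPHICS in the macro language.   */
/* The macro name PlotOrTable is hypothetical.                */
%macro PlotOrTable;
   %if &SYSODSGRAPHICS %then %do;       /* ODS graphics is on: draw a bar chart */
      proc sgplot data=sashelp.cars;
         vbar Origin;
      run;
   %end;
   %else %do;                           /* ODS graphics is off: print a table */
      proc freq data=sashelp.cars;
         tables Origin;
      run;
   %end;
%mend PlotOrTable;

%PlotOrTable
```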

SAS has many other automatic macro variables. In general, you can use automatic macro variables in SAS to check the status of the SAS system at run time. Some of my other favorite automatic macro variables (which some people call system macro variables) are the following:

  • SYSERR, SYSERRORTEXT, and SYSWARNINGTEXT: Provide information about errors and warnings from SAS procedures. In my book Statistical Programming with SAS/IML I show how to use these macro variables in conjunction with the SUBMIT/ENDSUBMIT statements in SAS/IML.
  • SYSVER and SYSVLONG: Provide information about your version of SAS. You can use these macro variables to execute newer, more efficient, code if the user has a recent version of SAS.
  • SYSSCP and SYSSCPL: Provide information about the operating system. For example, is SAS running on Windows or Linux?
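To see what these variables contain in your own session, you can write them to the log. The following sketch also branches on the release number; it assumes a SAS 9 session in which SYSVER holds a numeric-looking value such as 9.4, and the macro name CheckRelease is hypothetical.

```sas
/* Sketch: display a few automatic macro variables in the log */
%put SAS release: &SYSVER (&SYSVLONG);
%put Operating system: &SYSSCP (&SYSSCPL);

/* Run a step, then inspect the return code that it set */
proc means data=sashelp.class;
run;
%put &=SYSERR;

/* Assumes SYSVER looks like a number (for example, 9.4) */
%macro CheckRelease;
   %if %sysevalf(&SYSVER >= 9.4) %then
      %put NOTE: Running SAS 9.4 or later.;
   %else
      %put NOTE: Running a release earlier than SAS 9.4.;
%mend CheckRelease;

%CheckRelease
```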

Do you have a favorite automatic variable? Leave a comment and tell me how you use automatic variables in your SAS programs.

The post Is ODS graphics enabled? Use automatic macro variables to determine the state of SAS appeared first on The DO Loop.

March 20, 2017
 

SAS formats are very useful and can be used in a myriad of creative ways. For example, you can use formats to display decimal values as a fraction. However, SAS supports so many formats that it is difficult to remember details about the format syntax, such as the default field width. I often use the "Formats by Category" page in the SAS documentation to look up the range of valid values of the field width (w) and decimal places (d) that are associated with a format such as PERCENTw.d or DATETIMEw.d. (Recall that the field width specifies the width for the formatted output.)

The documentation provides the minimum, maximum, and default values of the field width, but did you know that you can discover these values programmatically? In SAS 9.4m3 you can call the FMTINFO function, which provides information about SAS formats and informats. The FMTINFO function takes two arguments (the name of a format or informat, and a keyword) and returns a character value. For brevity, I will refer to the first argument as the "format," even though the function also supports informats.

You can use the FMTINFO function to create a personalized "cheat sheet" of the formats/informats that you use most often. The following SAS DATA step uses the FMTINFO function to retrieve information about SAS formats, including a short description, default parameter values, and the minimum and maximum values of the width and decimal parameters. You can modify the DATALINES statement to produce a table for your favorite formats.




data FormatInfo;
length Name $9. Type $8. Category $4. Desc $40. 
       DefW $5. MinW $5. MaxW $5. DefD $2. MinD $2. MaxD $2.; 
input Name @@;
Category = fmtinfo(Name, "Cat");  /* numeric, character, date, ... */
Type = fmtinfo(Name, "Type");     /* format, informat, or both */
Desc = fmtinfo(Name, "Desc");     /* short description of the format */
DefW = fmtinfo(Name, "DefW");     /* default width if you omit w. Example: BEST. */
MinW = fmtinfo(Name, "MinW");     /* minimum width */
MaxW = fmtinfo(Name, "MaxW");     /* maximum width */
DefD = fmtinfo(Name, "DefD");     /* default decimal digits */
MinD = fmtinfo(Name, "MinD");     /* minimum decimal digits */
MaxD = fmtinfo(Name, "MaxD");     /* maximum decimal digits */
datalines;
ANYDTDTE BEST   
DATETIME DOLLAR 
FRACT    PERCENT
PVALUE   $UPCASE
;
 
proc print data=FormatInfo noobs;
   var Name Type Category Desc;
run;
 
proc print data=FormatInfo noobs;
   var Name DefW MinW MaxW DefD MinD MaxD;
run;
Attributes of SAS formats as output by the FMTINFO function
Field width and decimal places of SAS formats as output by the FMTINFO function

The first table shows the name of a few SAS formats and informats. The TYPE column shows whether the name is a format, an informat, or both. The CATEGORY column shows the general category of data to which the format applies. The DESC column gives a brief description of the format.

The second table shows default, minimum, and maximum values of the field width (w) and the decimal places (d) that are displayed. The columns that display field width information are the most valuable for me. You can see that the default and minimum field widths vary quite a bit among the formats. In contrast, most of the formats in the table display zero decimal places by default. (The exception is the PVALUE. format, which displays numbers between 0 and 1.) If the maximum number of decimal places is zero, it means that the format does not support a decimal value. For example, character formats do not support decimal places.

Other than creating a cheat sheet, I don't think that the casual SAS programmer will need this function very often. It seems most useful for advanced applications such as validating input from a GUI, but maybe I'm wrong. What do you think? Do you anticipate using the FMTINFO function in your work? Leave a comment.
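As one possible reading of the "validating input" idea, the following sketch checks whether a requested field width is within the range that FMTINFO reports for a format. The variable names and the requested width are illustrative assumptions; only the MinW and MaxW keywords come from the DATA step shown earlier.

```sas
/* Sketch: validate a requested width against the limits reported by FMTINFO. */
/* FmtName and ReqW are hypothetical inputs, e.g., values typed into a GUI.    */
data _null_;
   length FmtName $32;
   FmtName = "DOLLAR";
   ReqW    = 40;
   /* FMTINFO returns character values, so convert them to numbers */
   MinW = input(fmtinfo(FmtName, "MinW"), 12.);
   MaxW = input(fmtinfo(FmtName, "MaxW"), 12.);
   if MinW <= ReqW <= MaxW then
      put "Width " ReqW "is valid for the " FmtName "format.";
   else
      put "Width " ReqW "is outside the range " MinW "to " MaxW "for the " FmtName "format.";
run;
```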

The post Discover information about SAS formats... programatically appeared first on The DO Loop.

March 15, 2017
 

SAS programmers who have experience with other programming languages sometimes wonder whether the SAS language supports statements that are equivalent to the "break" and "continue" statements in other languages. The answer is yes. The LEAVE statement in the SAS DATA step is equivalent to the "break" statement. It provides a way to immediately exit from an iterative loop. The CONTINUE statement in the SAS DATA step skips over any remaining statements in the body of a loop and starts the next iteration.

Not all languages in SAS support those statements. For example, the SAS/IML language does not. However, you can use an alternative syntax to implement the same logical behavior, as shown in this article.

To review the syntax of various DO, DO-WHILE, and DO-UNTIL loops in SAS, see "Loops in SAS."

The LEAVE statement

The LEAVE statement exits a DO loop, usually as part of an IF-THEN statement to test whether a certain condition is met.

To illustrate the LEAVE statement, consider a DATA step that simulates tossing a coin until "heads" appears. (Represent "tails" by a 0 and "heads" by a 1, chosen at random.) In the following SAS DATA step, the DO WHILE condition is always true, so the program potentially contains an infinite loop. However, the LEAVE statement is used to break out of the loop when "heads" appears. This coin-tossing experiment is repeated 100 times, which is equivalent to 100 draws from the geometric distribution:

/* toss a coin until "heads" (1) */
data Toss;
call streaminit(321);
do trial = 1 to 100;               /* simulate an experiment 100 times */
   count = 0;                      /* how many tosses until heads? */
   do while (1);                   /* loop forever */
      coin = rand("Bernoulli", 0.5);  /* random 0 or 1 */
      if coin = 1 then LEAVE;      /* exit loop when "heads" */
      count + 1;                   /* otherwise increment count */
   end;
   output;
end;
keep trial count;
run;

Some people like this programming paradigm (set up an infinite loop, break out when a condition is satisfied), but I personally prefer a DO UNTIL loop because the exit condition is easier to see when I read the program. For example, the code for each trial could be rewritten as

   count  = 0;                     /* how many tosses until heads? */
   done = 0;                       /* initialize flag variable */
   do until (done);                /* exit loop when "heads" */
      coin = rand("Bernoulli", 0.5);  /* random 0 or 1 */
      done = (coin = 1);           /* update flag variable */
      if ^done then                /* otherwise increment count */
         count + 1;
   end;
   output;

Notice that the LEAVE statement exits only the inner loop. In this example, the LEAVE statement does not affect the iteration of the DO TRIAL loop.
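A tiny deterministic example makes the scope of LEAVE easy to see. In the following sketch (the loop variables i and j are chosen only for illustration), the inner loop is abandoned as soon as j reaches 2, but the outer loop still runs all three iterations:

```sas
/* Sketch: LEAVE exits only the innermost enclosing loop */
data _null_;
do i = 1 to 3;                /* outer loop runs three times */
   do j = 1 to 3;
      if j = 2 then leave;    /* exit the inner loop only */
      put i= j=;              /* executes only when j=1 */
   end;
end;
run;
```

The log shows one line for each value of i, each with j=1.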

The CONTINUE statement

The CONTINUE statement tells the DATA step to skip the remaining body of the loop and go to the next iteration. It is used to skip processing when a condition is true. To illustrate the CONTINUE statement, let's simulate a coin-tossing experiment in which you toss a coin until "heads" appears OR until you have tossed the coin five times. In the following SAS DATA step, if tails (0) appears the CONTINUE statement executes, which skips the remaining statements and begins the next iteration of the loop. Consequently, the DONE=1 assignment is not executed when coin=0. Only if heads (1) appears does the DONE variable get assigned to a nonzero value, thereby ending the DO-UNTIL loop:

data Toss2;
call streaminit(321);
do trial = 1 to 100;               /* simulate an experiment 100 times */
   done = 0;                       /* initialize flag variable */
   do count = 0 to 4 until (done); /* iterate at most 5 times */
      coin = rand("Bernoulli", 0.5);  /* random 0 or 1 */
      if coin = 0 then CONTINUE;   /* tails: go to next iteration */
      done = 1;                    /* exit loop when "heads" */
   end;
   output;
end;
keep trial count;
run;

The CONTINUE statement is not strictly necessary, although it can be convenient. You can always use an IF-THEN statement to bypass the remainder of the loop, as follows:

   coin = rand("Bernoulli", 0.5);  /* random 0 or 1 */
   /* wrap the remainder of the body in an IF-THEN statement */
   if coin ^= 0 then do;           /* heads: abort the loop */
      done = 1;                    /* exit loop when "heads" */
      /* other computations, as needed */
   end;

The CONTINUE and the LEAVE statements are examples of "jump statements" that tell the program the location of the next statement to execute. Both statements jump to a statement that might be far away. Consequently, programs that contain these statements are less structured than programs that avoid them. I try to avoid these statements in my programs, although sometimes the LEAVE statement is the simplest way to abort a loop when the program has to check for multiple exit conditions.

While we are on the topic, another jump statement in SAS is the GOTO statement, which can be used to emulate the behavior of LEAVE and CONTINUE. I avoid the GOTO statement because I think programs are easier to read and maintain when the logical flow of the program is controlled by using the DO-WHILE, DO-UNTIL, and IF-THEN statements. I also use those control statements in the SAS/IML language, which does not support the CONTINUE or LEAVE statements (although it does support the GOTO statement).
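For completeness, here is how the coin-tossing loop might look in SAS/IML, which has no LEAVE statement. This is a sketch of one trial only; it uses a flag variable and a DO-UNTIL loop, and the seed is arbitrary.

```sas
/* Sketch: emulate LEAVE in SAS/IML by using a flag and DO-UNTIL */
proc iml;
call randseed(321);
count = 0;                              /* how many tails before heads? */
done = 0;                               /* flag variable */
coin = j(1, 1, .);                      /* allocate a scalar for RANDGEN */
do until (done);
   call randgen(coin, "Bernoulli", 0.5);   /* random 0 or 1 */
   done = (coin = 1);                   /* "heads" ends the loop */
   if ^done then
      count = count + 1;                /* otherwise count the tails */
end;
print count;
quit;
```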

What are your views? Do you use the CONTINUE or LEAVE statement to simplify the error-handling logic of your programs? Or do you avoid them in favor of a more structured programming style? Why?

The post LEAVE and CONTINUE: Two ways to control the flow in a SAS DO loop appeared first on The DO Loop.

March 10, 2017
 

I recently needed to solve a fun programming problem. I challenge other SAS programmers to solve it, too! The problem is easy to state: Given a long sequence of digits, can you write a program to count how many times a particular subsequence occurs? For example, if I give you a sequence of 1,000 digits, can you determine whether the five-digit pattern {1 2 3 4 5} appears somewhere as a subsequence? How many times does it appear?

If the sequence is stored in a data set with one digit in each row, then SAS DATA step programmers might suspect that the LAG function will be useful for solving this problem. The LAG function enables a DATA step to examine the values of several digits simultaneously.
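To make the LAG idea concrete, here is a minimal DATA step sketch for a short three-digit pattern. It is only one possible approach and it hard-codes the pattern {1 2 3}; the data set names Small and Match and the variable Start are illustrative assumptions.

```sas
/* Sketch: find the pattern {1 2 3} in a short sequence by using LAG */
data Small;
input Digit @@;
datalines;
1 1 2 3 3 1 2 3
;

data Match;
set Small;
d1 = lag1(Digit);              /* previous digit */
d2 = lag2(Digit);              /* digit two rows back */
/* call LAG unconditionally; test the three-digit window afterwards */
if d2 = 1 and d1 = 2 and Digit = 3 then do;
   Start = _N_ - 2;            /* row where the pattern begins */
   output;
end;
keep Start;
run;

proc print data=Match noobs;
run;
```

For this sequence, the Match data set contains two rows, with Start equal to 2 and 6.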

The SAS/IML language also has a LAG function which enables you to form a matrix of lagged values. This leads to an interesting way to use vectorized computations to solve this problem. The following SAS/IML program defines a small eight-digit set of numbers in which the pattern {1 2 3} appears twice. The LAG function in SAS/IML accepts a vector of lags and creates a matrix where each column is a lagged version of the input sequence:

/* Find a certain pattern in sequence of digits */
proc iml;
Digit = {1,1,2,3,3,1,2,3};      /* digits to search */
target = {1 2 3};               /* the pattern to find */
p = ncol(target);               /* length of target sequence */
D = lag(Digit, (p-1):0);        /* columns shift the digits */
print D;

The output shows a three-column matrix (D) that contains the second, first, and zeroth lag (in that order) for the input sequence. Notice that if I am searching for a particular three-digit pattern, this matrix is very useful. The rows of this matrix are all three-digit patterns that appear in the original sequence. Consequently, to search for a three-digit pattern, I can use the rows of the matrix D.

To make the task easier, you can delete the first two rows, which contain missing values. You can also form a binary matrix X that has the value X[i,j]=1 when the j_th element of the pattern equals the j_th element of the i_th row, and 0 otherwise, as shown in the following:

D = D[p:nrow(Digit),];          /* delete first p-1 rows (missing values) */
X = (D=target);                 /* binary matrix */
print X;

Notice that in SAS/IML, the comparison operator (=) can perform a vector comparison. The binary comparison operator detects that the matrix on the left (D) and the vector on the right (target) both contain three columns. Therefore the operator creates the three-column logical matrix X, as shown. The X matrix has a wonderful property: a row of X contains all 1s if and only if the corresponding row of D matches the target pattern. So to find matches, you just need to sum the values in the rows of X. If the row sum equals the number of digits in the pattern, then that row indicates a place where the target pattern appears in the original sequence.

You can program this test in PROC IML as follows. The subscript reduction operator [,+] is shorthand for "compute the sum of each row over all columns".

/* sum across columns. Which rows contain all 1s? */
b = (X[,+] = p);                /* b[i]=1 if i_th row matches target */
NumRepl = sum(b);               /* how many times does target appear? */
if NumRepl=0 then 
   print "The target does not appear in the digits";
else
   print "The target appears at location " (loc(b)[1]),  /* print 1st location */
         "The target appears" (NumRepl) "times.";

The program discovered that the target pattern appears in the sequence twice. The first appearance begins with the second digit in the sequence. The pattern also appears in the sequence at the sixth position, although that information is not printed.

Notice that you can solve this problem in SAS/IML without writing any loops. Instead, you can use the LAG function to convert the N-digit sequence into a matrix with N-p+1 rows and p columns (after you delete the rows that contain missing values). You can then test whether the target pattern matches one of the rows.

Your turn! Can you solve this problem?

Now that I have shown one way to solve the problem, I invite SAS programmers to write their own program to determine whether a specified pattern appears in a numeric sequence. You can use the DATA step, DS2, SQL, or any other SAS language.

Post a comment to submit your best SAS program. Extra points for programs that count all occurrences of the pattern and display the location of the first occurrence.

To help you debug your program, here is test data that we can all use. It contains 10 million random digits:

data Have(keep=Digit);
call streaminit(31488);
do i = 1 to 1e7;
   Digit = floor(10*rand("uniform"));
   output;
end;
run;

To help you determine if your program is correct, you can use the following results. In this sequence of digits:

  • The five-digit pattern {1 2 3 4 5} occurs 101 times and the first appearance begins at row 34417
  • The six-digit pattern {6 5 4 3 2 1} occurs 15 times and the first appearance begins at row 120920

You can verify these facts by using PROC PRINT as follows:

proc print data=Have(firstobs=34417 obs=34421); run;
proc print data=Have(firstobs=120920 obs=120925); run;

Happy programming!

The post Find a pattern in a sequence of digits appeared first on The DO Loop.

March 9, 2017
 

Editor’s note: This is the first in a series of posts to help current SAS programmers add SAS Viya to their analytics skillset. In this post, SAS instructors Stacey Syphus and Marc Huber introduce you to the new Transitioning from Programming in SAS 9 to SAS Viya video library, designed to show SAS programmers [...]

The post Transitioning from programming in SAS 9 to SAS Viya appeared first on SAS Learning Post.

March 8, 2017
 

Suppose you have several discrete variables. You want to conduct a frequency analysis of these variables and print the results, but ONLY for variables that have three or more levels. In other words, you want to conditionally display some results, but you don't know which variables satisfy the condition until after you run the analysis.

An experienced SAS programmer can probably think of several ways to solve this problem. The simplest solution requires going through the data twice. During the first pass you use PROC SQL or PROC FREQ to count the number of distinct levels for each variable. You then create a list of the variables that have three or more levels and call PROC FREQ on those variables and show the one-way frequency tables that result.
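The two-pass approach can be sketched as follows. To keep the example self-contained it reads Sashelp.Cars directly; the data set name Levels and the macro variable VarList are illustrative assumptions, and the sketch assumes that at least one variable meets the threshold.

```sas
/* Pass 1: count the levels of each variable, but display nothing */
ods exclude all;
ods output NLevels=Levels;             /* save number of levels to data set */
proc freq data=sashelp.cars nlevels;
   tables origin cylinders type drivetrain;
run;
ods exclude none;

/* Build a list of the variables that have three or more levels */
proc sql noprint;
   select TableVar into :VarList separated by ' '
   from Levels
   where NLevels >= 3;
quit;
%put &=VarList;

/* Pass 2: display one-way tables for only those variables */
proc freq data=sashelp.cars;
   tables &VarList;
run;
```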

That is a fine solution. However, I read the question just after I finished writing an article about how to select and reorder output with PROC DOCUMENT. It occurred to me that a more efficient solution is to let PROC FREQ compute tables for all the variables, but use PROC DOCUMENT to display only the tables that satisfy the condition. If you don't mind extra complexity, you can even use the DATA step and CALL EXECUTE to automate some of the replaying, a technique that I learned from a 2016 paper by Warren Kuhfeld. (He uses similar ideas in his free e-book Advanced ODS Graphics Examples.)

To demonstrate this technique, I will create a modified version of the Sashelp.Cars data. The following DATA step copies the data and adds two new character variables, one with one level and another with two levels:

data Have;
set sashelp.cars;
c1 = "A";
if _N_ < 100 then c2 = "A"; 
             else c2 = "B";
run;

Step 1: Store the output in a document

The goal is to print ONLY frequency tables for variables that have three or more levels. The following ODS statements suppress output to all open destinations, open the DOCUMENT destination (named "RDoc"), and select only the OneWayFreqs table. The ODS OUTPUT destination is used to save the "NLevels" table of PROC FREQ, which contains information about the number of levels in each variable.

ods exclude all;                          /* suppress output */
ods document name=RDoc(write);            /* write to document */
ods document select OneWayFreqs;          /* these tables go into the doc */
ods output NLevels=Levels;                /* save number of levels to data set */
   proc freq data=Have nlevels;
     tables origin c1 cylinders c2 type;  /* specify variables to analyze */
   run;
ods document close;
ods exclude none;

If the preceding statements seem confusing, try running just the PROC FREQ statement. It produces five frequency tables and an output data set (Levels) which contains the number of levels for each variable. The other ODS statements just ensure that only the DOCUMENT destination receives the OneWayFreqs tables.

Step 2: Examine the names of the objects in the document

While developing the program, you will want to see the contents of the Levels data and the RDoc document, as follows. These statements will not appear in the final program.

proc print data=Levels noobs;
   var TableVar NLevels;
run;
 
proc document name=RDoc;
   list ^ (where=(_TYPE_="Table")) / levels=all;  /* list all tables */
run; quit;

The first table shows which variables have three or more levels. The second table lists the names of the tables in the document. The tables are stored in the same order as the variables in the Levels data set.

Step 3: Display the output for certain variables

If you were doing this task manually, you would look at the Levels data set and conclude that the first, third, and fifth variables have three or more levels. You could then use the REPLAY statement in PROC DOCUMENT to display those tables. The manual code would look like the following:

/* No automation: Print only OneWayFreqs tables w/ 3 or more levels */
proc document name=RDoc(read);
   replay \Freq#1\Table1#1\OneWayFreqs#1;   /* display Table1 */
   replay \Freq#1\Table3#1\OneWayFreqs#1;   /* display Table3 */
   replay \Freq#1\Table5#1\OneWayFreqs#1;   /* display Table5 */
run; quit;

The observant programmer will notice that these statements are just the result of an algorithm:

  1. Loop over each row in the Levels data set
  2. If the NLevels variable is greater than some threshold, output the corresponding table.

You can program that algorithm in the SAS DATA step and generate the corresponding PROC DOCUMENT statements. One way is to write the statements to a text file and then use the %INCLUDE statement to execute the statements. An alternative approach is to use the CALL EXECUTE subroutine to buffer up the statements so that they run when the DATA step terminates, as shown by the following program:

%let L = 3;              /* print only OneWayFreqs tables w/ L or more levels */
options source;          /* show the statements submitted by CALL EXECUTE*/
title "Replay only the tables that contain &L or more levels";
data _NULL_;
set Levels end=EOF;     /* implicit loop over rows of the data */
if _N_ = 1 then         /* first statement */
   call execute('proc document name=RDoc(read);');
if NLevels >= &L then   /* replay tables that satisfy condition */
   call execute('replay \Freq#1\Table'|| strip(putn(_N_,3)) ||'#1\OneWayFreqs#1;');
if EOF then             /* last statement */
   call execute('run; quit;');
run;

The DATA step generates the complete call to PROC DOCUMENT, which executes after the DATA step exits. The result is that one-way frequency tables are conditionally printed. Although PROC FREQ analyzed all the variables, only the tables that have three or more levels are displayed.
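The %INCLUDE alternative mentioned earlier might look like the following sketch. It assumes that the Levels data set and the RDoc document from the previous steps already exist; the fileref code and the variable stmt are hypothetical names.

```sas
/* Sketch: write the PROC DOCUMENT statements to a temporary file, then run them. */
/* Assumes the Levels data set and the RDoc document were created earlier.        */
filename code temp;                     /* hypothetical fileref */

data _NULL_;
set Levels end=EOF;
file code;
length stmt $200;
if _N_ = 1 then                         /* first statement */
   put 'proc document name=RDoc(read);';
if NLevels >= 3 then do;                /* replay tables that satisfy condition */
   stmt = cats('replay \Freq#1\Table', _N_, '#1\OneWayFreqs#1;');
   put stmt;
end;
if EOF then                             /* last statement */
   put 'run; quit;';
run;

%include code / source2;                /* SOURCE2 echoes the statements to the log */
```

Compared with CALL EXECUTE, this variation leaves a file that you can inspect before you run it, which can help when you debug the generated statements.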

If you haven't seen this technique before, it might be a little jarring because you are using a SAS program to write a SAS program. This is an advanced technique, to be sure, but one that can be very useful. It can be adapted to many other situations in which you want to conditionally display certain tables, but you must run the analysis before you know which tables satisfy the condition.

The post Display output conditionally with PROC DOCUMENT appeared first on The DO Loop.

March 6, 2017
 

After reading my article about how to use BY-group processing to run 1000 regression models, a SAS programmer asked whether it is possible to reorder the output of a BY-group analysis. The answer is yes: you can use the DOCUMENT procedure to replay a portion of your output in any order you like. It is a feature of the SAS ODS system that is not very well known to statisticians. Kevin Smith (2012) has called PROC DOCUMENT "the most underutilized tool in the ODS toolbox."

The end of this article contains links to several papers that explain how PROC DOCUMENT works. I was first introduced to PROC DOCUMENT by Warren Kuhfeld, who uses it to modify the graphical output that comes from statistical procedures.

Reordering output (the topic of this article) is much easier than modifying the output. There is a SAS Sample that shows how to use PROC DOCUMENT to reorder BY groups, but I will show a simpler variation. The SAS Sample mentions that you can also use SAS macros to order the output from BY-group analyses, but the macro method is less efficient, computationally speaking.

Use PROC DOCUMENT to reorder SAS output

Suppose you use multiple procedures in an analysis and each procedure uses BY groups. The SAS output will consist of all the BY groups for the first procedure followed by all the BY groups for the second procedure. Sometimes it is preferable to reorder the output so that the results from all procedures are grouped by the BY-group levels.

A basic reordering of the SAS output uses the following steps:

  1. Use the ODS DOCUMENT statement to store the tables and graphs from the analyses into a destination called the document.
  2. Use the LIST statement in PROC DOCUMENT to examine the names of the tables and graphs within the document.
  3. Use the REPLAY statement in PROC DOCUMENT to send the tables and graphs to other ODS destinations in whatever order you choose.

Step 1: Store the output in a document

The following statements generate descriptive statistics for six clinical variables and for each level of the Smoking_Status variable. The Smoking_Status variable indicates whether a patient in this study is a non-smoker, a light smoker, and so on. PROC MEANS generates statistics for the continuous variables; PROC FREQ generates counts for the levels of the discrete variables. The ODS DOCUMENT statement writes all of the ODS output information to a SAS document (technically an item store) called 'doc':

proc sort data=sashelp.heart out=heart; /* sort for BY group processing */
   by smoking_status;
run;
 
ods document name=doc(write);      /* wrap ODS DOCUMENT around the output */
   proc means data=heart;
      by smoking_status;
      var Cholesterol Diastolic Systolic;         /* 1 table, results for three vars */
   run;
 
   proc freq data=heart;
      by smoking_status;
      tables BP_Status Chol_Status Weight_Status; /* 3 tables, one for each of three vars */
   run;
ods document close;                /* close the document; the output is now stored */

Step 2: Examine the names of the objects in the document

At this point, the output is captured in an ODS document. Each object is assigned a name that is closely related to the ODS name that is displayed if you use ODS TRACE ON. You can use PROC DOCUMENT to list the contents of the SAS item store. This will tell you the names of the ODS objects in the document.
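
If you have not used ODS TRACE before, the following minimal sketch shows how to display the ODS names that the document names are based on. The choice of PROC FREQ and the BP_Status variable is only for illustration. The names are written to the SAS log:

```sas
ods trace on;                  /* write the name of each ODS object to the log */
proc freq data=sashelp.heart;
   tables BP_Status;
run;
ods trace off;                 /* stop tracing */
```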

PROC DOCUMENT is an interactive procedure, which means that the procedure continues running until you submit the QUIT statement. You can submit a group of statements followed by a RUN statement to execute the procedure interactively. The following statements run a LIST statement but do not exit the procedure:

proc document name=doc(read);
   list / levels=all bygroups;  /* add column for each BY group var */
run;

The output shows the table names. The tables from PROC MEANS begin with "Means" and end with the name of the table, which is "Summary." (The references explain the purpose of the many "#1" strings, which are called sequence numbers.) Similarly, the tables from PROC FREQ begin with "Freq" and end with "OneWayFreqs." Notice that the listing includes the BY-group variable as a column. You can use this variable to list or replay only certain BY groups. For example, to list only the tables for non-smokers, you can run the following PROC DOCUMENT statements:

   /* caret (^) is "current directory," not "negation"! */
   list ^ (where=(Smoking_Status="Non-smoker")) / levels=all bygroups;
run;

The output shows that there are four tables that correspond to the output for the "Non-smoker" level of the BY group.

Step 3: Replay the output in any order

The previous section showed that you can use a WHERE clause to list all the tables (and graphs) in a BY group. The REPLAY statement also supports a WHERE clause, so you can use it to send only those tables to the open ODS destinations. The DOCUMENT procedure is still running, so the following statements produce the output:

   replay ^ (where=(Smoking_Status="Non-smoker")); /* levels=all is default */
run;

The output is not shown, but it consists of all tables for which Smoking_Status="Non-smoker", regardless of which procedure created the output.

Notice that the "Non-smoker" group was the fifth level (in alphabetical order) of the Smoking_Status variable. You can use multiple REPLAY statements to send output for other levels in any order. For example, you might want to output the tables only for smokers, and order the output according to how much tobacco the patients consume. The following statements output the tables for smokers, grouped by tobacco usage:

   /* replay output in a different order */
   replay ^(where=(Smoking_Status="Light (1-5)"));
   replay ^(where=(Smoking_Status="Moderate (6-15)"));
   replay ^(where=(Smoking_Status="Heavy (16-25)"));
   replay ^(where=(Smoking_Status="Very Heavy (> 25)"));
run;
quit;
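
Because REPLAY writes to whatever ODS destinations are open, you can also route a reordered subset to a file. The following is a minimal sketch that assumes the 'doc' document from Step 1 still exists in the current session; the file name smokers.pdf is hypothetical:

```sas
ods pdf file="smokers.pdf";            /* hypothetical output file name */
proc document name=doc(read);
   replay ^(where=(Smoking_Status="Light (1-5)"));
   replay ^(where=(Smoking_Status="Heavy (16-25)"));
run;
quit;
ods pdf close;                         /* finish writing the PDF file */
```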

For brevity I have left out many details about the syntax of PROC DOCUMENT, but I hope the main idea is clear: You can store the results of multiple computations in a document and then replay any subset of the results in any order. The references in the next section contain many useful tips and techniques to help you get the most out of PROC DOCUMENT.

References

For an innovative use of PROC DOCUMENT to modify statistical graphics, see the papers and books by Warren Kuhfeld, such as Kuhfeld (2016) "Highly Customized Graphs Using ODS Graphics."

The post Reorder the output from a BY-group analysis in SAS appeared first on The DO Loop.

February 27, 2017
 

Longtime SAS programmers know that the SAS DATA step and SAS procedures are very tolerant of typographical errors. You can misspell most keywords and SAS will "guess" what you mean. For example, if you mistype "PROC" as "PRC," SAS will run the program but write a warning to the log: WARNING 14-169: Assuming the symbol PROC was misspelled as PRC.

This feature provided a big productivity boost in the days before GUI program editors. Imagine submitting a program from a command line in the early 1980s. If you mistyped one keyword you would have to retype the entire statement. As a convenience, SAS implemented an algorithm that checks the "spelling distance" between the tokens that you submit and a list of valid keywords for the procedure that you are calling. DATA step programmers might be familiar with the SPEDIS function, which measures how close two words are to each other in the English language. The SAS language parser uses the same algorithm.
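
To get a feel for spelling distances, you can call the SPEDIS function yourself. The following minimal sketch compares a few hypothetical misspellings to the keyword "PROC" and writes the distances to the log. It shows only the function; the thresholds that the SAS parser uses to accept a misspelling are not described here:

```sas
/* Sketch: spelling distance between some typed tokens and the keyword PROC */
data _null_;
   length typo $8;
   do typo = "PRC", "PORC", "PROCC", "FREQ";
      dist = spedis(strip(typo), "PROC");   /* SPEDIS(query, keyword) */
      put typo= dist=;
   end;
run;
```

Smaller values indicate that the typed token is closer to the keyword.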

Not everyone wants this feature. Many companies in regulated industries (such as pharmaceuticals) turn off the autocorrect feature in SAS because they want to force their programmers to type every keyword correctly. You can determine whether the AUTOCORRECT option is enabled on your system by running PROC OPTIONS:

proc options option=AUTOCORRECT value;  run;

The AUTOCORRECT option is turned on by default. You can turn off the option by submitting the statement OPTIONS NOAUTOCORRECT; or by putting -NOAUTOCORRECT in a configuration file.
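
For example, the following minimal sketch turns the option off for the current session, confirms the setting in the log, and then restores the default:

```sas
options noautocorrect;                  /* do not guess at misspelled keywords */
proc options option=autocorrect value;  /* confirm the current setting in the log */
run;
options autocorrect;                    /* restore the default behavior */
```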

Today I've invited two people to argue for and against using this feature. Larry Literal is a programmer who believes that no program should ever accept a syntax error. Annie Intel sees nothing wrong with programs that self-correct. She argues that it is desirable for programs to interpret the intention of the programmer. Which do you agree with? Do you have something to add? Leave a comment.

Point: A program should not allow ambiguity

My name is Larry Literal and I believe that computer programming should be an exact science. There is no room for ambiguity. A program that runs because it is "close to" a correct program is an abomination. I do not want a computer to change the code that I write!

When my system administrator installs a new version of SAS, the first thing I do is turn off the autocorrect feature. (I've also turned off the autocorrect feature on my phone. What a pain!) My main argument against the AUTOCORRECT option is that it makes code unreadable. Take a look at the following program:

/* The correct program is:
   proc freq data=sashelp.class order=freq;
      table sex / chisq;
   run;
*/
prc freq dta=sashelp.class ordor=freqq;
   tble sex / chsq;
runn;

Every keyword in this program is mistyped. The only tokens that are specified correctly are the name of the procedure, the name of the data set, and the name of the variable. The program looks more like the Klingon language than the SAS language, yet this program runs if you use the AUTOCORRECT option!

And what happens if SAS introduces a new keyword that is closer to a mistyped word than a previous keyword? Then the procedure might do something different even though I have not changed the program! The autocorrect feature is an abomination and should never be used!

Counterpoint: Computers should interpret what you say

Really, Larry? "An abomination"? What century are you living in?

My name is Annie Intel, but my friends call me "A.I." I think the SAS autocorrect feature was way ahead of its time. Today we have autocorrecting logic on smartphones and word processors. Applying the same techniques to computer programs is no different. In fact, if you use a modern SAS program editor, the editor will suggest valid keywords and flag any keyword that is not valid.

Let's be real: Larry's example is not realistic. No programmer is going to use that garbled call to PROC FREQ in a production job. The autocorrect feature does not "make code unreadable." It is a convenience while developing a program, not an excuse to write nonsense. Any competent programmer will check the log for warning messages and correct the typos.

Larry claims that he doesn't want a computer munging and altering the code he writes. But optimizing compilers have been doing exactly that for decades! Programmers write instructions in a high-level language and an optimizing compiler maps the code to a set of machine instructions. The compiler will sometimes rearrange the structure of the program to get better performance. If it is okay for a compiler to map a program into an optimal version of itself, why is it not okay for a parser to do the same by correcting misspellings?

I want computers to recognize my intentions. When I give a voice command to my smartphone or personal home device, the audio signal is mapped to an action. I am allowed a certain amount of flexibility. "Turn on the lights" and "turn da light on" are equivalent phrases that should be understood and mapped to the same action. The SAS AUTOCORRECT feature is similar. The interpreter has a context (the name of the procedure) which is used to standardize your input. I think it is very cool. In the future, I think more programming languages will accept ambiguities.

The post Point/Counterpoint: Should a programming language accept misspelled keywords? appeared first on The DO Loop.