sas programming

March 15, 2017
 

SAS programmers who have experience with other programming languages sometimes wonder whether the SAS language supports statements that are equivalent to the "break" and "continue" statements in other languages. The answer is yes. The LEAVE statement in the SAS DATA step is equivalent to the "break" statement. It provides a way to immediately exit from an iterative loop. The CONTINUE statement in the SAS DATA step skips over any remaining statements in the body of a loop and starts the next iteration.

Not all languages in SAS support those statements. For example, the SAS/IML language does not. However, you can use an alternative syntax to implement the same logical behavior, as shown in this article.

To review the syntax of various DO, DO-WHILE, and DO-UNTIL loops in SAS, see "Loops in SAS."

The LEAVE statement

The LEAVE statement exits a DO loop, usually as part of an IF-THEN statement to test whether a certain condition is met.

To illustrate the LEAVE statement, consider a DATA step that simulates tossing a coin until "heads" appears. (Represent "tails" by a 0 and "heads" by a 1, chosen at random.) In the following SAS DATA step, the condition in the DO WHILE statement is always true, so the program potentially contains an infinite loop. However, the LEAVE statement is used to break out of the loop when "heads" appears. This coin-tossing experiment is repeated 100 times, which is equivalent to 100 draws from the geometric distribution:

/* toss a coin until "heads" (1) */
data Toss;
call streaminit(321);
do trial = 1 to 100;               /* simulate an experiment 100 times */
   count = 0;                      /* how many tosses until heads? */
   do while (1);                   /* loop forever */
      coin = rand("Bernoulli", 0.5);  /* random 0 or 1 */
      if coin = 1 then LEAVE;      /* exit loop when "heads" */
      count + 1;                   /* otherwise increment count */
   end;
   output;
end;
keep trial count;
run;
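
As a rough cross-check on the claim that the experiment amounts to geometric draws, the following sketch samples the geometric distribution directly. It assumes that RAND("Geometric", p) returns the number of trials up to and including the first success, so subtracting 1 gives the number of tails before the first heads. The data set name Geom is made up for illustration. The individual values will not match the Toss data set because the random stream is consumed differently; only the two distributions should be comparable.

```sas
/* sketch: draw the number of tails before the first "heads" directly */
data Geom;                          /* hypothetical data set name */
call streaminit(321);
do trial = 1 to 100;
   /* assumes RAND("Geometric") counts trials through the first success */
   count = rand("Geometric", 0.5) - 1;
   output;
end;
run;

/* compare the two distributions of COUNT */
proc freq data=Toss;  tables count;  run;
proc freq data=Geom;  tables count;  run;
```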

Some people like this programming paradigm (set up an infinite loop, break out when a condition is satisfied), but I personally prefer a DO UNTIL loop because the exit condition is easier to see when I read the program. For example, the code for each trial could be rewritten as

   count  = 0;                     /* how many tosses until heads? */
   done = 0;                       /* initialize flag variable */
   do until (done);                /* exit loop when "heads" */
      coin = rand("Bernoulli", 0.5);  /* random 0 or 1 */
      done = (coin = 1);           /* update flag variable */
      if ^done then                /* otherwise increment count */
         count + 1;
   end;
   output;

Notice that the LEAVE statement exits only the inner loop. In this example, the LEAVE statement does not affect the iteration of the DO TRIAL loop.

The CONTINUE statement

The CONTINUE statement tells the DATA step to skip the remaining body of the loop and go to the next iteration. It is used to skip processing when a condition is true. To illustrate the CONTINUE statement, let's simulate a coin-tossing experiment in which you toss a coin until "heads" appears OR until you have tossed the coin five times. In the following SAS DATA step, if tails (0) appears the CONTINUE statement executes, which skips the remaining statements and begins the next iteration of the loop. Consequently, the DONE=1 assignment is not executed when coin=0. Only if heads (1) appears does the DONE variable get assigned to a nonzero value, thereby ending the DO-UNTIL loop:

data Toss2;
call streaminit(321);
do trial = 1 to 100;               /* simulate an experiment 100 times */
   done = 0;                       /* initialize flag variable */
   do count = 0 to 4 until (done); /* iterate at most 5 times */
      coin = rand("Bernoulli", 0.5);  /* random 0 or 1 */
      if coin = 0 then CONTINUE;   /* tails: go to next iteration */
      done = 1;                    /* exit loop when "heads" */
   end;
   output;
end;
keep trial count;
run;

The CONTINUE statement is not strictly necessary, although it can be convenient. You can always use an IF-THEN statement to bypass the remainder of the loop, as follows:

   coin = rand("Bernoulli", 0.5);  /* random 0 or 1 */
   /* wrap the remainder of the body in an IF-THEN statement */
   if coin ^= 0 then do;           /* heads: abort the loop */
      done = 1;                    /* exit loop when "heads" */
      /* other computations, as needed */
   end;

The CONTINUE and the LEAVE statements are examples of "jump statements" that tell the program the location of the next statement to execute. Both statements jump to a statement that might be far away. Consequently, programs that contain these statements are less structured than programs that avoid them. I try to avoid these statements in my programs, although sometimes the LEAVE statement is the simplest way to abort a loop when the program has to check for multiple exit conditions.

While we are on the topic, another jump statement in SAS is the GOTO statement, which can be used to emulate the behavior of LEAVE and CONTINUE. I avoid the GOTO statement because I think programs are easier to read and maintain when the logical flow of the program is controlled by using the DO-WHILE, DO-UNTIL, and IF-THEN statements. I also use those control statements in the SAS/IML language, which does not support the CONTINUE or LEAVE statements (although it does support the GOTO statement).
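
For readers who want to see the flag-variable style in SAS/IML, here is a minimal sketch of a single coin-tossing trial written without LEAVE or CONTINUE. It mirrors the DO UNTIL version of the DATA step shown earlier. The use of RANDSEED and RANDGEN and the variable names are my choices for illustration, not something prescribed by the text above.

```sas
proc iml;
call randseed(321);
coin = j(1, 1, .);                 /* allocate a scalar for RANDGEN to fill */
count = 0;                         /* how many tosses until heads? */
done = 0;                          /* initialize flag variable */
do until (done);                   /* exit loop when "heads" */
   call randgen(coin, "Bernoulli", 0.5);   /* random 0 or 1 */
   done = (coin = 1);              /* update flag variable */
   if ^done then                   /* tails: increment count */
      count = count + 1;
end;
print count;
quit;
```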

What are your views? Do you use the CONTINUE or LEAVE statement to simplify the error-handling logic of your programs? Or do you avoid them in favor of a more structured programming style? Why?

The post LEAVE and CONTINUE: Two ways to control the flow in a SAS DO loop appeared first on The DO Loop.

March 10, 2017
 

I recently needed to solve a fun programming problem. I challenge other SAS programmers to solve it, too! The problem is easy to state: Given a long sequence of digits, can you write a program to count how many times a particular subsequence occurs? For example, if I give you a sequence of 1,000 digits, can you determine whether the five-digit pattern {1 2 3 4 5} appears somewhere as a subsequence? How many times does it appear?

If the sequence is stored in a data set with one digit in each row, then SAS DATA step programmers might suspect that the LAG function will be useful for solving this problem. The LAG function enables a DATA step to examine the values of several digits simultaneously.

The SAS/IML language also has a LAG function, which enables you to form a matrix of lagged values. This leads to an interesting way to use vectorized computations to solve this problem. The following SAS/IML program defines a small eight-digit set of numbers in which the pattern {1 2 3} appears twice. The LAG function in SAS/IML accepts a vector of lags and creates a matrix where each column is a lagged version of the input sequence:

/* Find a certain pattern in sequence of digits */
proc iml;
Digit = {1,1,2,3,3,1,2,3};      /* digits to search */
target = {1 2 3};               /* the pattern to find */
p = ncol(target);               /* length of target sequence */
D = lag(Digit, (p-1):0);        /* columns shift the digits */
print D;

The output shows a three-column matrix (D) that contains the second, first, and zeroth lag (in that order) for the input sequence. Notice that if I am searching for a particular three-digit pattern, this matrix is very useful. The rows of this matrix are all three-digit patterns that appear in the original sequence. Consequently, to search for a three-digit pattern, I can use the rows of the matrix D.

To make the task easier, you can delete the first two rows, which contain missing values. You can also form a binary matrix X that has the value X[i,j]=1 when the j_th element of the pattern equals the j_th element of the i_th row, and 0 otherwise, as shown in the following:

D = D[p:nrow(Digit),];          /* delete first p-1 rows (missing values) */
X = (D=target);                 /* binary matrix */
print X;

Notice that in SAS/IML, the comparison operator (=) can perform a vector comparison. The binary comparison operator detects that the matrix on the left (D) and the vector on the right (target) both contain three columns. Therefore the operator creates the three-column logical matrix X, as shown. The X matrix has a wonderful property: a row of X contains all 1s if and only if the corresponding row of D matches the target pattern. So to find matches, you just need to sum the values in the rows of X. If the row sum equals the number of digits in the pattern, then that row indicates a place where the target pattern appears in the original sequence.

You can program this test in PROC IML as follows. The subscript reduction operator [,+] is shorthand for "compute the sum of each row over all columns".

/* sum across columns. Which rows contain all 1s? */
b = (X[,+] = p);                /* b[i]=1 if i_th row matches target */
NumRepl = sum(b);               /* how many times does target appear? */
if NumRepl=0 then 
   print "The target does not appear in the digits";
else
   print "The target appears at location " (loc(b)[1]),  /* print 1st location */
         "The target appears" (NumRepl) "times.";

The program discovered that the target pattern appears in the sequence twice. The first appearance begins with the second digit in the sequence. The pattern also appears in the sequence at the sixth position, although that information is not printed.

Notice that you can solve this problem in SAS/IML without writing any loops. Instead, you can use the LAG function to convert the N-digit sequence into a matrix with N-p+1 rows and p columns. You can then test whether the target pattern matches one of the rows.
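
If you want every starting position rather than only the first, the same steps can be wrapped in a module. The following is a sketch: the module name FindPattern and the variable idx are hypothetical, and the module assumes that the digits are a column vector that is at least as long as the row-vector target.

```sas
proc iml;
/* hypothetical helper: return all rows at which target begins */
start FindPattern(Digit, target);
   p = ncol(target);               /* length of target sequence */
   D = lag(Digit, (p-1):0);        /* columns shift the digits */
   D = D[p:nrow(Digit), ];         /* drop rows that contain missing values */
   X = (D = target);               /* binary matrix of matches */
   b = (X[,+] = p);                /* 1 if the entire row matches */
   return( loc(b) );               /* empty matrix if there is no match */
finish;

Digit = {1,1,2,3,3,1,2,3};
idx = FindPattern(Digit, {1 2 3});
if ncol(idx) > 0 then
   print idx[label="Start positions"];
else
   print "The target does not appear in the digits";
quit;
```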

Your turn! Can you solve this problem?

Now that I have shown one way to solve the problem, I invite SAS programmers to write their own program to determine whether a specified pattern appears in a numeric sequence. You can use the DATA step, DS2, SQL, or any other SAS language.

Post a comment to submit your best SAS program. Extra points for programs that count all occurrences of the pattern and display the location of the first occurrence.

To help you debug your program, here is test data that we can all use. It contains 10 million random digits:

data Have(keep=Digit);
call streaminit(31488);
do i = 1 to 1e7;
   Digit = floor(10*rand("uniform"));
   output;
end;
run;

To help you determine if your program is correct, you can use the following results. In this sequence of digits:

  • The five-digit pattern {1 2 3 4 5} occurs 101 times and the first appearance begins at row 34417
  • The six-digit pattern {6 5 4 3 2 1} occurs 15 times and the first appearance begins at row 120920

You can verify these facts by using PROC PRINT as follows:

proc print data=Have(firstobs=34417 obs=34421); run;
proc print data=Have(firstobs=120920 obs=120925); run;
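
To get you started, here is one possible DATA step sketch for the five-digit pattern. It uses the LAG functions mentioned earlier and hard-codes the pattern {1 2 3 4 5}. The variable names d1-d4, NumRepl, and FirstLoc are made up for illustration. I have not tuned it for speed; compare its results against the numbers listed above.

```sas
/* sketch: count occurrences of {1 2 3 4 5} in the Have data set */
data _null_;
set Have end=EOF;
retain FirstLoc .;                 /* row where the first match begins */
/* call the LAG functions unconditionally so that the queues stay in sync */
d1 = lag4(Digit);  d2 = lag3(Digit);
d3 = lag2(Digit);  d4 = lag1(Digit);
if d1=1 & d2=2 & d3=3 & d4=4 & Digit=5 then do;
   NumRepl + 1;                    /* sum statement: retained, starts at 0 */
   if missing(FirstLoc) then
      FirstLoc = _N_ - 4;          /* pattern started four rows earlier */
end;
if EOF then put NumRepl= FirstLoc=;   /* write results to the log */
run;
```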

Happy programming!

The post Find a pattern in a sequence of digits appeared first on The DO Loop.

March 9, 2017
 

Editor’s note: This is the first in a series of posts to help current SAS programmers add SAS Viya to their analytics skillset. In this post, SAS instructors Stacey Syphus and Marc Huber introduce you to the new Transitioning from Programming in SAS 9 to SAS Viya video library, designed to show SAS programmers [...]

The post Transitioning from programming in SAS 9 to SAS Viya appeared first on SAS Learning Post.

March 8, 2017
 

Suppose you have several discrete variables. You want to conduct a frequency analysis of these variables and print the results, but ONLY for variables that have three or more levels. In other words, you want to conditionally display some results, but you don't know which variables satisfy the condition until after you run the analysis.

An experienced SAS programmer can probably think of several ways to solve this problem. The simplest solution requires going through the data twice. During the first pass you use PROC SQL or PROC FREQ to count the number of distinct levels for each variable. You then create a list of the variables that have three or more levels and call PROC FREQ on those variables and show the one-way frequency tables that result.

That is a fine solution. However, I read the question just after I finished writing an article about how to select and reorder output with PROC DOCUMENT. It occurred to me that a more efficient solution is to let PROC FREQ compute tables for all the variables, but use PROC DOCUMENT to display only the tables that satisfy the condition. If you don't mind extra complexity, you can even use the DATA step and CALL EXECUTE to automate some of the replaying, a technique that I learned from a 2016 paper by Warren Kuhfeld. (He uses similar ideas in his free e-book Advanced ODS Graphics Examples.)

To demonstrate this technique, I will create a modified version of the Sashelp.Cars data. The following DATA step copies the data and adds two new character variables, one with one level and another with two levels:

data Have;
set sashelp.cars;
c1 = "A";
if _N_ < 100 then c2 = "A"; 
             else c2 = "B";
run;
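
For comparison, the two-pass approach described at the start of this article might look like the following sketch. The first pass saves the NLevels table, PROC SQL builds a list of the variables that qualify, and the second pass calls PROC FREQ on that list. The data set name PassOne and the macro variable VarList are hypothetical names. The sketch assumes that at least one variable satisfies the condition; otherwise the macro variable is never created.

```sas
/* pass 1: count the levels of each variable, display nothing */
ods exclude all;
ods output NLevels=PassOne;        /* hypothetical data set name */
proc freq data=Have nlevels;
   tables origin c1 cylinders c2 type;
run;
ods exclude none;

/* build a list of the variables that have three or more levels */
proc sql noprint;
   select TableVar into :VarList separated by ' '
   from PassOne
   where NLevels >= 3;
quit;

/* pass 2: frequency tables for only those variables */
proc freq data=Have;
   tables &VarList;
run;
```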

Step 1: Store the output in a document

The goal is to print ONLY frequency tables for variables that have three or more levels. The following ODS statements suppress output to all open destinations, open the DOCUMENT destination (named "RDoc"), and select only the OneWayFreqs table. The ODS OUTPUT destination is used to save the "NLevels" table of PROC FREQ, which contains information about the number of levels in each variable.

ods exclude all;                          /* suppress output */
ods document name=RDoc(write);            /* write to document */
ods document select OneWayFreqs;          /* these tables go into the doc */
ods output NLevels=Levels;                /* save number of levels to data set */
   proc freq data=Have nlevels;
     tables origin c1 cylinders c2 type;  /* specify variables to analyze */
   run;
ods document close;
ods exclude none;

If the preceding statements seem confusing, try running just the PROC FREQ statement. It produces five frequency tables and an output data set (Levels) which contains the number of levels for each variable. The other ODS statements just ensure that only the DOCUMENT destination receives the OneWayFreqs tables.

Step 2: Examine the names of the objects in the document

While developing the program, you will want to see the contents of the Levels data and the RDoc document, as follows. These statements will not appear in the final program.

proc print data=Levels noobs;
   var TableVar NLevels;
run;
 
proc document name=RDoc;
   list ^ (where=(_TYPE_="Table")) / levels=all;  /* list all tables */
run; quit;

The first table shows which variables have three or more levels. The second table lists the names of the tables in the document. The tables are stored in the same order as the variables in the Levels data set.

Step 3: Display the output for certain variables

If you were doing this task manually, you would look at the Levels data set and conclude that the first, third, and fifth variables have three or more levels. You could then use the REPLAY statement in PROC DOCUMENT to display those tables. The manual code would look like the following:

/* No automation: Print only OneWayFreqs tables w/ 3 or more levels */
proc document name=RDoc(read);
   replay Freq#1Table1#1OneWayFreqs#1;   /* display Table1 */
   replay Freq#1Table3#1OneWayFreqs#1;   /* display Table3 */
   replay Freq#1Table5#1OneWayFreqs#1;   /* display Table5 */
run; quit;

The observant programmer will notice that these statements are just the result of an algorithm:

  1. Loop over each row in the Levels data set
  2. If the NLevels variable is greater than some threshold, output the corresponding table.

You can program that algorithm in the SAS DATA step and generate the corresponding PROC DOCUMENT statements. One way is to write the statements to a text file and then use the %INCLUDE statement to execute the statements. An alternative approach is to use the CALL EXECUTE subroutine to buffer up the statements so that they run when the DATA step terminates, as shown by the following program:

%let L = 3;              /* print only OneWayFreqs tables w/ L or more levels */
options source;          /* show the statements submitted by CALL EXECUTE*/
title "Replay only the tables that contain &L or more levels";
data _NULL_;
set Levels end=EOF;     /* implicit loop over rows of the data */
if _N_ = 1 then         /* first statement */
   call execute('proc document name=RDoc(read);');
if NLevels >= &L then   /* replay tables that satisfy condition */
   call execute('replay Freq#1Table'|| strip(putn(_N_,3)) ||'#1OneWayFreqs#1;');
if EOF then             /* last statement */
   call execute('run; quit;');
run;

The DATA step generates the complete call to PROC DOCUMENT, which executes after the DATA step exits. The result is that one-way frequency tables are conditionally printed. Although PROC FREQ analyzed all the variables, only the tables for variables that have three or more levels are displayed.
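
The text-file alternative mentioned earlier can be sketched in a similar way. The DATA step writes the PROC DOCUMENT statements to a temporary file, and the %INCLUDE statement then submits them. The fileref gencode is a hypothetical name, and the sketch assumes that the Levels data set and the RDoc document from the previous steps still exist.

```sas
%let L = 3;                        /* print tables w/ L or more levels */
filename gencode temp;             /* hypothetical fileref; temporary file */
data _NULL_;
set Levels end=EOF;
file gencode;                      /* PUT statements write to the file */
if _N_ = 1 then
   put 'proc document name=RDoc(read);';
if NLevels >= &L then              /* +(-1) removes the blank after _N_ */
   put '   replay Freq#1Table' _N_ +(-1) '#1OneWayFreqs#1;';
if EOF then
   put 'run; quit;';
run;

%include gencode / source2;        /* SOURCE2 echoes the statements to the log */
filename gencode clear;
```

An advantage of this variation is that you can open the generated file and inspect the statements before you run them.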

If you haven't seen this technique before, it might be a little jarring because you are using a SAS program to write a SAS program. This is an advanced technique, to be sure, but one that can be very useful. It can be adapted to many other situations in which you want to conditionally display certain tables, but you must run the analysis before you know which tables satisfy the condition.

The post Display output conditionally with PROC DOCUMENT appeared first on The DO Loop.

March 6, 2017
 

After reading my article about how to use BY-group processing to run 1000 regression models, a SAS programmer asked whether it is possible to reorder the output of a BY-group analysis. The answer is yes: you can use the DOCUMENT procedure to replay a portion of your output in any order you like. It is a feature of the SAS ODS system that is not very well known to statisticians. Kevin Smith (2012) has called PROC DOCUMENT "the most underutilized tool in the ODS toolbox."

The end of this article contains links to several papers that explain how PROC DOCUMENT works. I was first introduced to PROC DOCUMENT by Warren Kuhfeld, who uses it to modify the graphical output that comes from statistical procedures.

Reordering output (the topic of this article) is much easier than modifying the output. There is a SAS Sample that shows how to use PROC DOCUMENT to reorder BY groups, but I will show a simpler variation. The SAS Sample mentions that you can also use SAS macros to order the output from BY-group analyses, but the macro method is less efficient, computationally speaking.

Use PROC DOCUMENT to reorder SAS output

Suppose you use multiple procedures in an analysis and each procedure uses BY groups. The SAS output will consist of all the BY groups for the first procedure followed by all the BY groups for the second procedure. Sometimes it is preferable to reorder the output so that the results from all procedures are grouped by the BY-group levels.

A basic reordering of the SAS output uses the following steps:

  1. Use the ODS DOCUMENT statement to store the tables and graphs from the analyses into a destination called the document.
  2. Use the LIST statement in PROC DOCUMENT to examine the names of the tables and graphs within the document.
  3. Use the REPLAY statement in PROC DOCUMENT to send the tables and graphs to other ODS destinations in whatever order you choose.

Step 1: Store the output in a document

The following statements generate descriptive statistics for six clinical variables and for each level of the Smoking_Status variable. The Smoking_Status variable indicates whether a patient in this study is a non-smoker, a light smoker, and so on. PROC MEANS generates statistics for the continuous variables; PROC FREQ generates counts for the levels of the discrete variables. The ODS DOCUMENT statement writes all of the ODS output information to a SAS document (technically an item store) called 'doc':

proc sort data=sashelp.heart out=heart; /* sort for BY group processing */
   by smoking_status;
run;
 
ods document name=doc(write);      /* wrap ODS DOCUMENT around the output */
   proc means data=heart;
      by smoking_status;
      var Cholesterol Diastolic Systolic;         /* 1 table, results for three vars */
   run;
 
   proc freq data=heart;
      by smoking_status;
      tables BP_Status Chol_Status Weight_Status; /* 3 tables, one for each of three vars */
   run;
ods document close;                /* close the document */

Step 2: Examine the names of the objects in the document

At this point, the output is captured in an ODS document. Each object is assigned a name that is closely related to the ODS name that is displayed if you use ODS TRACE ON. You can use PROC DOCUMENT to list the contents of the SAS item store. This will tell you the names of the ODS objects in the document.

PROC DOCUMENT is an interactive procedure, which means that the procedure continues running until you submit the QUIT statement. You can submit a group of statements followed by a RUN statement to execute the procedure interactively. The following statements run a LIST statement, but do not exit the procedure:

proc document name=doc(read);
   list / levels=all bygroups;  /* add column for each BY group var */
run;

The output shows the table names. The tables from PROC MEANS begin with "Means" and end with the name of the table, which is "Summary." (The references explain the purpose of the many "#1" strings, which are called sequence numbers.) Similarly, the tables from PROC FREQ begin with "Freq" and end with "OneWayFreqs." Notice that the listing includes the BY-group variable as a column. You can use this variable to list or replay only certain BY groups. For example, to list only the tables for non-smokers, you can run the following PROC DOCUMENT statements:

   /* caret (^) is "current directory," not "negation"! */
   list ^ (where=(Smoking_Status="Non-smoker")) / levels=all bygroups;
run;

The output shows that there are four tables that correspond to the output for the "Non-smoker" level of the BY group.

Step 3: Replay the output in any order

The previous section showed that you can use a WHERE clause to list all the tables (and graphs) in a BY group. The REPLAY statement also supports a WHERE clause, and you can use the REPLAY statement to send only those tables to the open ODS destinations. The DOCUMENT procedure is still running, so the following statements produce the output:

   replay ^ (where=(Smoking_Status="Non-smoker")); /* levels=all is default */
run;

The output is not shown, but it consists of all tables for which Smoking_Status="Non-smoker," regardless of which procedure created the output.

Notice that the "Non-smoker" group was the fifth level (in alphabetical order) of the Smoking_Status variable. You can use multiple REPLAY statements to send output for other levels in any order. For example, you might want to output the tables only for smokers, and order the output according to how much tobacco the patients consume. The following statements output the tables for smokers, grouped by tobacco usage:

   /* replay output in a different order */
   replay ^(where=(Smoking_Status="Light (1-5)"));
   replay ^(where=(Smoking_Status="Moderate (6-15)"));
   replay ^(where=(Smoking_Status="Heavy (16-25)"));
   replay ^(where=(Smoking_Status="Very Heavy (> 25)"));
run;
quit;

For brevity I have left out many details about the syntax of PROC DOCUMENT, but I hope the main idea is clear: You can store the results of multiple computations in a document and then replay any subset of the results in any order. The references in the next section contain many useful tips and techniques to help you get the most out of PROC DOCUMENT.
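
If you want the results from all procedures grouped by BY-group level, as described at the beginning of this article, you do not have to type one REPLAY statement per level. The following sketch generates the statements with CALL EXECUTE. The data set name Lvls is hypothetical, and the sketch assumes that the 'doc' document and the sorted heart data set from Step 1 still exist. It replays the nonmissing levels in alphabetical order; sort the Lvls data set differently to get a different order.

```sas
/* get the distinct (nonmissing) levels of the BY variable */
proc freq data=heart noprint;
   tables smoking_status / out=Lvls;      /* hypothetical data set name */
run;

/* generate one REPLAY statement for each level */
data _NULL_;
set Lvls end=EOF;
if _N_ = 1 then
   call execute('proc document name=doc(read);');
call execute(cats('replay ^(where=(Smoking_Status="',
                  Smoking_Status, '"));'));
if EOF then
   call execute('run; quit;');
run;
```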

References

For an innovative use of PROC DOCUMENT to modify statistical graphics, see the papers and books by Warren Kuhfeld, such as Kuhfeld (2016) "Highly Customized Graphs Using ODS Graphics."

The post Reorder the output from a BY-group analysis in SAS appeared first on The DO Loop.

February 27, 2017
 

Longtime SAS programmers know that the SAS DATA step and SAS procedures are very tolerant of typographical errors. You can misspell most keywords and SAS will "guess" what you mean. For example, if you mistype "PROC" as "PRC," SAS will run the program but write a warning to the log: WARNING 14-169: Assuming the symbol PROC was misspelled as PRC.

This feature provided a big productivity boost in the days before GUI program editors. Imagine submitting a program from a command line in the early 1980s. If you mistyped one keyword you would have to retype the entire statement. As a convenience, SAS implemented an algorithm that checks the "spelling distance" between the tokens that you submit and a list of valid keywords for the procedure that you are calling. DATA step programmers might be familiar with the SPEDIS function, which measures how close two words are to each other in the English language. The SAS language parser uses the same algorithm.
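
If you are curious about the spelling distance, you can experiment with the SPEDIS function in a DATA step. The following sketch compares two of the misspellings discussed in this article with the keywords they resemble. The first argument is the query and the second is the keyword. The variable names are arbitrary, and I make no claim about which threshold the parser uses.

```sas
/* sketch: spelling distance between mistyped tokens and valid keywords */
data _null_;
d1 = spedis("PRC",   "PROC");      /* query, keyword */
d2 = spedis("FREQQ", "FREQ");
d3 = spedis("PROC",  "PROC");      /* identical words for reference */
put d1= d2= d3=;                   /* write the distances to the log */
run;
```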

Not everyone wants this feature. Many companies in regulated industries (such as pharmaceuticals) turn off the autocorrect feature in SAS because they want to force their programmers to type every keyword correctly. You can determine whether the AUTOCORRECT option is enabled on your system by running PROC OPTIONS:

proc options option=AUTOCORRECT value;  run;

The AUTOCORRECT option is turned on by default. You can turn off the option by submitting options NOAUTOCORRECT or by putting -NOAUTOCORRECT in a configuration file.
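
As a small illustration, the following statements turn the option off for the current session, confirm the setting in the log, and then turn it back on. The last statement assumes that you want to return to the default behavior.

```sas
options noautocorrect;                        /* misspelled keywords become errors */
proc options option=AUTOCORRECT value;  run;  /* confirm the current setting */

options autocorrect;                          /* assumption: restore the default */
```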

Today I've invited two people to argue for and against using this feature. Larry Literal is a programmer who believes that no program should ever accept a syntax error. Annie Intel sees nothing wrong with programs that self-correct. She argues that it is desirable for programs to interpret the intention of the programmer. Which do you agree with? Do you have something to add? Leave a comment.

Point: A program should not allow ambiguity

My name is Larry Literal and I believe that computer programming should be an exact science. There is no room for ambiguity. A program that runs because it is "close to" a correct program is an abomination. I do not want a computer to change the code that I write!

When my system administrator installs a new version of SAS, the first thing I do is turn off the autocorrect feature. (I've also turned off the autocorrect feature on my phone. What a pain!) My main argument against the AUTOCORRECT option is that it makes code unreadable. Take a look at the following program:

/* The correct program is:
   proc freq data=sashelp.class order=freq;
      table sex / chisq;
   run;
*/
prc freq dta=sashelp.class ordor=freqq;
   tble sex / chsq;
runn;

Every keyword in this program is mistyped. The only tokens that are specified correctly are the name of the procedure, the name of the data set, and the name of the variable. The program looks more like the Klingon language than the SAS language, yet this program runs if you use the AUTOCORRECT option!

And what happens if SAS introduces a new keyword that is closer to a mistyped word than a previous keyword? Then the procedure might do something different even though I have not changed the program! The autocorrect feature is an abomination and should never be used!

Counterpoint: Computers should interpret what you say

Really, Larry? "An abomination"? What century are you living in?

My name is Annie Intel, but my friends call me "A.I." I think the SAS autocorrect feature was way ahead of its time. Today we have autocorrecting logic on smartphones and word processors. Applying the same techniques to computer programs is no different. In fact, if you use a modern SAS program editor, the editor will suggest valid keywords and flag any keyword that is not valid.

Let's be real: Larry's example is not realistic. No programmer is going to use that garbled call to PROC FREQ in a production job. The autocorrect feature does not "make code unreadable." It is a convenience while developing a program, not an excuse to write nonsense. Any competent programmer will check the log for warning messages and correct the typos.

Larry claims that he doesn't want a computer munging and altering the code he writes. But optimizing compilers have been doing exactly that for decades! Programmers write instructions in a high-level language and an optimizing compiler maps the code to a set of machine instructions. The compiler will sometimes rearrange the structure of the program to get better performance. If it is okay for a compiler to map a program into an optimal version of itself, why is it not okay for a parser to do the same by correcting misspellings?

I want computers to recognize my intentions. When I give a voice command to my smartphone or personal home device, the audio signal is mapped to an action. I am allowed a certain amount of flexibility. "Turn on the lights" and "turn da light on" are equivalent phrases that should be understood and mapped to the same action. The SAS AUTOCORRECT feature is similar. The interpreter has a context (the name of the procedure) which is used to standardize your input. I think it is very cool. In the future, I think more programming languages will accept ambiguities.

The post Point/Counterpoint: Should a programming language accept misspelled keywords? appeared first on The DO Loop.

February 21, 2017
 

What?!?  You mean a period (.) isn't the only SAS numeric missing value? Well, there are 27 others: .A .B, to .Z and ._ (period underscore). Your first question might be: "Why would you need more than one missing value?"  One situation where multiple missing values are useful involves survey data. Suppose [...]

The post The Other 27 SAS Numeric Missing Values appeared first on SAS Learning Post.

February 8, 2017
 

Colors are the subject of many romantic poems and songs, but there isn't much romance to be found in their hexadecimal values. With apologies to Van Morrison:

...Skipping and a jumping
In the misty morning fog with
Our hearts a thumpin' and you
My cx662F14 eyed girl

When it comes to specifying colors within a SAS program, you can always rely on the simple color names: red, blue, yellow, and so on. (You know, the colors you might remember from your first box of Crayola crayons.) You can even predict a few more exotic names such as "lightgreen" and "darkyellow" and even "olivedrab". Are you familiar with HTML color name standards? Most of those names work as well. But for true color precision, you might want to use the hexadecimal values or at least the super-descriptive SAS color names.
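
For instance, here is a quick sketch that uses the hex value from the song lyric as a marker color. The choice of the Sashelp.Class data and the marker options is mine; any place that accepts a color value should work the same way.

```sas
/* sketch: specify a marker color by its hexadecimal (cx) value */
title "Markers Colored with cx662F14";
proc sgplot data=sashelp.class;
   scatter x=height y=weight /
           markerattrs=(color=cx662F14 symbol=CircleFilled size=12);
run;
title;
```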


In SAS Enterprise Guide, when you type a piece of SAS syntax that expects a color value, you'll find that the program editor pops up a helpful "color picker," displaying a long list of acceptable color names and their hex values. You can scroll through the list or use "type ahead" to find the color you want, then click or press Enter to accept it.

There's a keyboard shortcut that will invoke the color picker at any time: Ctrl+Shift+C. Use that when working on a SAS macro program or at any place the SAS program editor might not otherwise predict. By default, the editor will drop in the color name. You can change that behavior by visiting Program->Editor Options, Autocomplete tab. Select between the SAS color name or the more-obscure hex value. (Guaranteed to make your program more difficult to read, and thus helpful for job security.)

Are you using SAS Studio? You can also color your world with just a few keystrokes. This screenshot is from SAS Studio 3.6:


The post Tip for coding your color values in SAS Enterprise Guide appeared first on The SAS Dummy.

February 6, 2017
 

Suppose you create a scatter plot in SAS with PROC SGPLOT. What color does PROC SGPLOT use for the markers? If you specify the GROUP= option so that markers are colored by a grouping variable, what colors are used to represent the various groups? The following scatter plot shows the colors that are used by default for the HTMLBlue style. They are shades of blue, red, green, brown, and magenta.

data A;        /* example data with groups 1, 2, ..., 5 */
do Color = 1 to 5;
   x = Color; y = Color;  output;
end;
run;
 
title "Marker Colors Used for GROUP= Option";
title2 "HTMLBlue Style";
proc sgplot data=A;
xaxis grid; yaxis grid;
scatter x=x y=y / group=Color markerattrs=(size=24 symbol=SquareFilled);
run;

Notice that these marker colors are not fully saturated colors, so they are not the SAS color names RED, BLUE, GREEN, BROWN, and MAGENTA. So what colors are these? What are their RGB values?




Colors come from styles

Colors are defined by styles, and you can use ODS style elements to set marker colors. A style defines elements called GraphDataDefault, GraphData1, GraphData2, GraphData3, and so forth. Each element contains several attributes such as colors and line patterns. The complete list of style elements and attributes for ODS graphics is in the documentation, but for this article, the important fact is that the GraphDatan:ContrastColor attribute (where n = 1, 2, 3, ...) determines the marker color for the nth group. These are the colors in the previous scatter plot.
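For a single, ungrouped plot you can also point directly at one of these style elements instead of naming a color. The following sketch assumes the Sashelp.Class data and arbitrarily picks GraphData2; the markers then take the contrast color that the active style assigns to the second group.

```sas
/* Refer to a style element rather than a hard-coded color.        */
/* GraphData2 is an arbitrary choice; the color shown depends on   */
/* whichever ODS style is active when the graph is rendered.       */
proc sgplot data=sashelp.class;
   scatter x=height y=weight / markerattrs=GraphData2(symbol=CircleFilled size=12);
run;
```

Because no color is hard-coded, the same statement picks up different colors when you switch styles.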

Display an ODS style template

Styles are defined by ODS templates. You can use the SOURCE statement in PROC TEMPLATE to display a template. In the style template, the GraphDatan:ContrastColor attributes are set by using keywords named gcdata1, gcdata2, gcdata3, etc.

If you display the styles.HTMLBlue template, you will see that the HTMLBlue style inherits from the Statistical style, and it is the Statistical style that defines the contrast colors. The following statements write the contents of the Statistical template to the SAS log:

proc template;
source styles.statistical;
quit;

The template is long, and it is hard to scroll through the log to discover which colors are associated with each attribute. But that's no problem: you can use SAS to find and display only the information about contrast colors.

Display the marker colors as RGB and hexadecimal values

For years Warren Kuhfeld has been showing SAS customers how to view, edit, and use ODS templates to customize the graphs that are produced by SAS statistical procedures. A powerful technique that he uses is to write a template to a file and then use the DATA step to modify the template.

I will not modify the template but merely display information from it. The following program uses PROC TEMPLATE to write the template to a text file and then uses the DATA step to find all lines in that file that contain the keyword 'gcdata'. For each such line, the program extracts the keyword and its color. The keyword-value pairs are saved to a data set, which is sorted and displayed:

/* The path C:/temp is an example; use any folder that you can write to */
proc template;
source styles.statistical / file='C:/temp/temp.tmp'; /* write template to text file */
quit;
 
data Colors;
keep Num Name Color R G B;
length Name Color $8;
infile 'C:/temp/temp.tmp';            /* read from the same text file */
input;
/* example string:  'gcdata1' = cx445694 */
k = find(_infile_,'gcdata','i');      /* if k=0 then string not found */
if k > 0 then do;                     /* Found line that contains 'gcdata' */
   s = substr(_infile_, k);           /* substring from 'gcdata' to end of line */
   j = index(s, "'");                 /* index of closing quote  */
   Name = substr(s, 1, j-1);          /* keyword                 */
   if j = 7 then Num = 0;             /* string is 'gcdata'      */
   else                               /* extract number 1, 2, ... for strings */
      Num = inputn(substr(s, 7, j-7), "best2.");  /* gcdata1, gcdata2,...     */
   j = index(s, "=");                 /* index of equal sign     */
   Color = compress(substr(s, j+1));  /* color value for keyword */
   R = inputn(substr(Color, 3, 2), "HEX2.");   /* convert hex to RGB */
   G = inputn(substr(Color, 5, 2), "HEX2.");
   B = inputn(substr(Color, 7, 2), "HEX2.");
end;
if k > 0;
run;
 
proc sort data=Colors; by Num; run;
 
proc print data=Colors; 
var Name Color R G B;
run;

Success! The output shows the contrast colors for the HTMLBlue style. The 'gcdata' color is the fill color (a dark blue) for markers when no GROUP= option is specified. The 'gcdatan' colors are used for markers that are colored by group membership. Obviously you could use this same technique to display other style attributes, such as line patterns or bar colors ('gdata').
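The hex-to-RGB conversion at the heart of that DATA step can be checked in isolation. The following sketch converts the single value cx445694 (the 'gcdata1' example from the program comments) to decimal components and then rebuilds the hex string; the variable name Hex is introduced here only for illustration.

```sas
/* Convert one CXrrggbb color to decimal R, G, B and back again */
data _null_;
   length Color $8;
   Color = "cx445694";
   R = inputn(substr(Color, 3, 2), "HEX2.");   /* hex 44 = 68  */
   G = inputn(substr(Color, 5, 2), "HEX2.");   /* hex 56 = 86  */
   B = inputn(substr(Color, 7, 2), "HEX2.");   /* hex 94 = 148 */
   /* rebuild the color value from the decimal components */
   Hex = cats("cx", put(R, hex2.), put(G, hex2.), put(B, hex2.));
   put Color= R= G= B= Hex=;                   /* write values to the log */
run;
```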

If you prefer a visual summary of the attributes for an ODS style, see section "ODS Style Comparisons" in the SAS/STAT documentation. That section is part of the chapter "Statistical Graphics Using ODS," which could have been titled "Everything you always wanted to know about ODS graphics but were afraid to ask."

An application of setting marker colors

I prefer to use style elements and discrete attribute maps to set colors for markers. But if you are rushed for time, you might want to use the STYLEATTRS statement to set the colors that are used for the GROUP= option. The STYLEATTRS statement requires a color list of hexadecimal colors or SAS color names. The following call to PROC SGPLOT uses the RGB/hex values for GraphData1:ContrastColor and so forth:

/* use colors for HTMLBlue style */
%let gcdata1 = cx445694;        
%let gcdata2 = cxA23A2E;
%let gcdata3 = cx01665E;
title "Origin in {Europe, USA}";
proc sgplot data=sashelp.cars;
where origin^='Asia' and type^="Hybrid";                /* omit first category */
   styleattrs DataContrastColors = (&gcdata2 &gcdata3); /* use 2nd and 3rd colors */
   scatter x=weight y=mpg_city / group=Origin markerattrs=(symbol=CircleFilled);
   keylegend / location=inside position=TopRight across=1;
run;

It would be great if you could specify a style-independent syntax such as

styleattrs DataContrastColors=(GraphData2:ContrastColor GraphData3:ContrastColor);

Unfortunately, that syntax is not supported. The STYLEATTRS statement requires a list of color values or SAS color names.

Although this trick is interesting, in general I prefer to use styles (rather than hard-coded color values) in production code. However, if you want to know the RGB/hex values for a style, this trick shows how you can get them from an ODS template.
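For comparison, here is a rough sketch of the discrete attribute map approach mentioned earlier. The data set name ColorMap and the ID value OriginMap are hypothetical names chosen for this example. The map still contains the hex values found above, but it ties each color to a group value rather than to the order in which the groups appear.

```sas
/* Discrete attribute map: one row per group value.        */
/* ColorMap and OriginMap are names made up for this demo. */
data ColorMap;
   length ID $10 Value $10 MarkerColor $10 MarkerSymbol $16;
   ID = "OriginMap";  MarkerSymbol = "CircleFilled";
   Value = "Europe";  MarkerColor = "cxA23A2E";  output;
   Value = "USA";     MarkerColor = "cx01665E";  output;
run;

title "Origin in {Europe, USA}";
proc sgplot data=sashelp.cars dattrmap=ColorMap;
   where origin ne 'Asia' and type ne "Hybrid";
   /* ATTRID= names the map (the ID value) to use for the GROUP= variable */
   scatter x=weight y=mpg_city / group=Origin attrid=OriginMap;
   keylegend / location=inside position=TopRight across=1;
run;
```

With a map, "Europe" keeps its color even if the data are filtered or sorted differently, which is not true of a positional color list.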

tags: SAS Programming, Statistical Graphics

The post What colors does PROC SGPLOT use for markers? appeared first on The DO Loop.

Jan 24, 2017
 

In a previous blog, Random Sampling: What's Efficient?, I discussed the efficiency of various techniques for selecting a simple random sample from a large SAS dataset. PROC SURVEYSELECT easily does the job:

proc surveyselect data=large out=sample
   method=srs    /* simple random sample */
   rate=.01;     /* 1% sample rate */
run;

Note: […]

The post Stratified random sample: What's efficient? appeared first on SAS Learning Post.