sas programming

5月 312023
 

Inserting various programming languages into SAS programSAS consistently expands its support for running non-SAS code inside SAS programs. It’s been a while since SAS introduced explicit SQL pass-through facility allowing SAS users to embed native Database Management Systems (DBMS) Sequel (SQL) code into its PROC SQL and PROC DS2.

Similarly, now you can run R code within PROC IML.

SAS Viya added PROC PYTHON enabling you to embed increasingly popular Python programming language code within your SAS programs.

However, while SAS developers welcome such cross-language functionality, its implementation often leads to making SAS programs bulky and overloaded, not easily readable and difficult to follow. In the spirit of modular programming, it would have been nice just referencing those “foreign” modules without actually inserting their code into SAS programs. Somewhat like %INCLUDE statement brings into SAS program pieces of SAS code stored in files.

Limitations of %INCLUDE

The problem with the %INCLUDE for bringing non-SAS code into a SAS program stems from the fact that % INCLUDE is not a SAS macro statement as it seems to be. That’s right. Despite prominently displaying % sign of distinction in front of itself, it has nothing to do with SAS macro facility. I know, it is misleading and it is a misnomer.  Let’s call it an exception. However, the fact is that %INCLUDE is a SAS global statement, and as such it is not processed by the SAS macro processor.

Being a global statement, %INCLUDE must be placed between other SAS statements (or be the very first or last statement in your SAS program). It cannot be positioned within and be part of other SAS statements.

In case of native SQL, its code resides within a SAS statement, namely inside parentheses of the FROM CONNECTION TO dbase ( ) clause of the explicit SQL pass-through construct. Therefore, placing %INCLUDE where the native SQL code is expected will generate a syntax error. Bummer!

Be %INCLUDE a true macro object the described above problem would not exist.

%EMBED macro function

Well, let’s leave the %INCLUDE alone and instead create a macro function (we'll call it %EMBED) that will do what %INCLUDE does - bringing the contents of an external code file into a SAS program.  The key difference though will be that we will make it a true macro function, which runs by the SAS macro processor during SAS program compilation (before execution). Then by the SAS program execution time, the contents of the external code file will be already injected where this macro function is called. Therefore, one can invoke (place) it inside other SAS statements.

%EMBED macro function does not care whether it embeds SAS code or non-SAS code; the contents of the external file determine what’s embedded. It is up to you, SAS developer, to place this macro call strategically where code from the external file is appropriate.

Here is how we can easily implement such a macro function:

/* -------------------------------------------------------------------
| DESCRIPTION | Macro to embed any code from a file into SAS program.
|---------------------------------------------------------------------
| INVOCATION  | %embed(full-file-name) or %embed(file-reference)
|---------------------------------------------------------------------
| AUTHOR      | Leonid Batkhan, May 2023
|-------------------------------------------------------------------*/
%macro embed(f);
  %local p ref rc fid;
  %let p = %sysfunc(findc(&f,/\:.));
  %if &p %then %let rc = %sysfunc(filename(ref, &f));
         %else %let ref = &f;
  %let fid = %sysfunc(fopen(&ref));
  %if &fid>0 %then
    %do %while(%sysfunc(fread(&fid))=0);
      %let rc = %sysfunc(fget(&fid, line, 32767));
      &line
    %end;
    %else %put ERROR: Macro &sysmacroname - file "&f" cannot be opened.;
  %let rc = %sysfunc(fclose(&fid));
  %if &p %then %let rc = %sysfunc(filename(ref));
%mend embed;

%EMBED macro function has a single argument (parameter) representing either a full-path file name (including extension) or a fileref assigned with a FILENAME statement or function. It is “smart enough” to distinguish between the file name and fileref and process them accordingly.

It returns as its value the contents of the external file specified as its argument (parameter). This value may contain multiple lines of code.

%EMBED macro function code highlights

The first %let p= statement determines whether argument f represents full-path file name (p>0) or a file reference (p=0). We deduce it from the fact that file name must contain at least one of the /\:. symbols, but fileref may not contain any of them.

Then the %if-%then-%else statement creates either its own fileref using %sysfunc(filename(ref, &f)) or assigns macro variable ref to &f (if &p>0). Since we do not provide an initial value for the ref macro variable, filename function will assign a unique system-generated fileref, which starts with #LN followed by 5 digits (e.g. #LN00009). This ensures that the fileref would not interfere with other potential filesref’s assigned outside of this macro.

Then we open this file. If file opening fails (fid=0) then we output an ERROR message in the SAS log.

If the file opens successfully (fid>0) then we loop through this file using fread( ) function (while fread=0) which loads one record per iteration into the file data buffer (FDB) and fget( ) function which copies data from the FDB to a macro variable line. Note, that in %sysfunc(fget(&fid, line, 32767)) second argument (line) do not need & in front of it.

The key here is the following line of code:

&line

This macro variable reference (not a %put &line) just "injects" the value of macro variable line into the return value of this macro function. For each iteration of the do-loop, we read a line of code from the external file and add it to the return value of the %embed macro function.

After the loop, we close the file and conditionally de-assign fileref (if we assigned it within the macro); we do not de-assign the fileref if it is assigned outside the macro in the calling program.

%EMBED macro function usage

While we developed %EMBED macro function to address limitations of %INCLUDE for inserting native SQL code into SAS pass-through clause, its usage is much broader. One can use %EMBED for embedding/including/inserting/injecting/delivering any code, SAS or non-SAS, anywhere in the SAS program. In essence, %embed is a more advanced and robust alternative to the %include.

Let’s explore several scenarios of the %embed macro function usage.

Embedding native DBMS SQL Code into SAS program

Suppose you have a native DBMS SQL query code stored in file C:\project\query1.sql. (File extension is arbitrary, and in this case just indicates what type of code we are going to embed.)

Then you can use SQL Procedure Pass-Through Facility of the SAS PROC SQL and %embed macro function as in the following example:

proc sql;
   connect to odbc (dsn="db_name");
   create table WORK.ABC as
   select * 
      from connection to odbc ( %embed(C:\project\query1.sql) );
   disconnect from odbc;
quit;

During compilation phase, SAS macro processor will replace %embed(C:\project\query1.sql) with the contents of the external file containing DBMS-specific native SQL code (instead of SAS PROC SQL code). Then during execution time, PROC SQL will pass this query code on to the DBMS for processing.  The DBMS will return to SAS the result of this query and SAS will create data table WORK.ABC.

You can also use %embed macro function for "injecting" SQL code from a file into explicit FedSQL pass-through and explicit SQL pass-through in DS2.

Embedding Python or Lua code into SAS program

In PROC PYTHON and PROC LUA, you can use their optional INFILE= clause to reference an external file that contains their statements to run within a SAS session. Alternatively, you can use %embed macro function instead. For example, if you have your Python code stored in a file C:\project\program.py then you can place your %embed(C:\project\program.py) macro function call between submit and endsubmit statements of the PROC PYTHON:

proc python;
  submit;
    %embed(C:\project\program.py)
  endsubmit;
run;

Similar for PROC LUA:

proc lua;
  submit;
    %embed(C:\project\program.lua)
  endsubmit;
run;

Embedding R code into SAS program

Here is an example of %embed usage for inserting R code into SAS using PROC IML:

proc iml;
  submit / R;
    %embed(C:\project\program.r)
  endsubmit;
run;

Embedding SAS code into SAS program

Finally, you can embed a SAS code from external file into SAS program. You can use it instead of %include statement, for example:

data TWO;
   set ONE;
run;
 
%embed(c:\project\data_three.sas)

Unlike %include global statement which can be placed only between SAS statements, %embed can be placed anywhere in a SAS program. You can use it not only within DATA or PROC steps, but also within SAS executable statements. That means %embed can bring in SAS code that can be executed conditionally. For example:

data a;
  set b;
  if x>0 then %embed(c:\project\code1.sas);
         else %embed(c:\project\code2.sas);
run;

Obviously, your embedded code must comply with the context of its surrounding.

Conclusion

As you can see, %embed macro function presents an elegant solution for embedding both “foreign” and SAS native code without cluttering the SAS program. Would you agree that %embed macro is a more robust alternative to the %include?

Questions? Thoughts? Comments?

Do you find this post useful? Do you have questions or your own tips and tricks for embedding other code into your programs? Please share with us below.

Additional resources

Embedding any code anywhere into SAS programs was published on SAS Users.

5月 082023
 

Recently, I learned about an elementary programming assignment called the FizzBuzz program. Some companies use this assignment for the first round of interviews with potential programmers. A competent programmer can write FizzBuzz in 5-10 minutes, which leaves plenty of time to discuss other topics. If an applicant can't complete the FizzBuzz program in a required language, the interviewer concludes that they are a weak programmer in that language.

When I heard about the FizzBuzz program, I quickly implemented it in the SAS DATA step. However, it occurred to me that I could think of additional techniques to solve the problem in SAS. Each technique demonstrates different skills and could help an interviewer distinguish between junior-level, intermediate-level, and senior-level SAS programmers. This article introduces the FizzBuzz program for SAS programmers and solves it in the following ways:

  • Junior level: Use the SAS DATA step to transform a set of input data
  • Intermediate level: Use a function that is defined by using PROC FCMP
  • Senior level: Create a user-defined format by using PROC FORMAT
  • Statistical level: Write a vectorized SAS IML program

What is the FizzBuzz algorithm?

The FizzBuzz program is presented on the Rosetta Code website. The Rosetta Code site shows the same program written in hundreds of different programming languages, which makes it a convenient way to compare languages. The description of the FizzBuzz program on the Rosetta Code page is as follows:

Write a program that prints the integers from 1 to 100 (inclusive). But:

  • for multiples of three, print "Fizz" (instead of the number)
  • for multiples of five, print "Buzz" (instead of the number)
  • for multiples of both three and five, print "FizzBuzz" (instead of the number)

If you would like to take a minute to implement the program in SAS (or another language!), do so now. A solution is presented in the next section.

The modified FizzBuzz program in SAS

First, let's slightly adapt the assignment for the SAS programmer. The solution given on the Rosetta Code site uses a DO loop to generate the numbers and the PUT statement to write the result to the log, which is a fine implementation. However, the ability to read and transform existing data is an essential part of SAS programming. Consequently, a better assignment for a SAS programmer would start with a data set of values. The programmer must read the values (whatever they are) and apply the FizzBuzz algorithm to create a new variable in a new data set.

In theory, the input data could be any numerical values, but to stay faithful to the original assignment, you can ask the programmer to create an input data set (Have) that contains the integers 1-100, one per row:

data Have;
do n=1 to 100;
   output;
end;
run;

A junior SAS programmer writes the FizzBuzz program

Ready to write the FizzBuzz program? A junior-level Base SAS programmer would probably write the following DATA step, which reads the Have data and creates a new 8-character variable named Word that contains either "Fizz," "Buzz," "FizzBuzz," or uses the PUT function to convert the number to a character representation:

/* Junior programmer */
data Want;
length Word $8; 
set Have;
if      mod(n,15)=0 then Word = "FizzBuzz";
else if mod(n,5) =0 then Word = "Buzz";
else if mod(n,3) =0 then Word = "Fizz";
else Word = put(n, 8.);
run;
 
proc print data=Want(obs=15) noobs; 
   var n Word;
run;

This is a fine solution. It enables the interviewer to ask about the LENGTH statement, the w.d format, and integer division by using the MOD function. If a programmer omits the LENGTH statement, that indicates a lack of knowledge about character variables in SAS.

Another possibility is that a junior-level programmer could use PROC SQL to write the FizzBuzz program. There is an SQL version of the program at Rosetta Code, and I invite a reader to add the PROC SQL version in a comment to this article.

An intermediate SAS programmer writes the FizzBuzz program

An intermediate-level programmer understands the power of encapsulation. If the FizzBuzz functionality needs to be used several times, can you encapsulate the program into a reusable function?

In SAS, you can use PROC FCMP to define your own library of useful functions. The documentation for PROC FCMP provides the details and several examples. For this exercise, the key is to have the function return a character value, which means you need to specify a dollar sign ($) after the argument list (and optionally specify the length). You also need to use the OUTLIB= option to specify the name of the data set where the function is stored. Lastly, you should use the global CMPLIB= option to make the function known to a DATA step.

/* Intermediate programmer: Use PROC FCMP to define the FizzBuzz function */
/* https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/n1eyyzaux0ze5ln1k03gl338avbj.htm */
proc fcmp outlib=work.functions.NterView;
   function FizzBuzz(n) $ 8;
      length Word $8;
      if      mod(n,15)=0 then Word = "FizzBuzz";
      else if mod(n,5) =0 then Word = "Buzz";
      else if mod(n,3) =0 then Word = "Fizz";
      else Word = put(n, 8.);
      return(Word);
   endsub;
run;
 
options cmplib=(work.functions);   /* make the function available to DATA step */
data Want;
length Word $8; 
set Have;
Word = FizzBuzz(n);
run;
 
proc print data=Want(obs=15) noobs; 
   var n Word;
run;

The output is the same as was shown previously.

A senior SAS programmer writes the FizzBuzz program

A senior-level programmer understands the power of SAS formats and can create a user-defined format to prevent the wasteful work copying of data. Consider the result of the previous intermediate-level program. The entire Have data set is copied merely to add a new eight-character variable. Think about the wastefulness of this approach if the input data set is many gigabytes in size!

One alternative to copying the data is to create a user-defined format that will format a variable in place without recoding it. Senior-level programmers should be able to explain why using PROC FORMAT is better than copying and recoding variables.

Creating a user-defined format uses the FizzBuzz function that we defined by using PROC FCMP. The documentation of PROC FORMAT has an example that shows how to use a user-defined function to define a custom format. The following program shows how to use the FizzBuzz function to define a custom format in PROC FORMAT:

/* Senior programmer: Create a format by using the FCMP function */
/* We don't need a new data set with a new variable. Just apply a format to the existing data! */
proc format; 
   value FBFMT other=[FizzBuzz()]; 
run;
 
/* use the format */
proc print data=Have(obs=15);
   format n FBfmt.;
run;

This solution is very short because it builds on the previous solutions. It can lead to discussions about efficiency.

A SAS statistical programmer writes the FizzBuzz program

Advanced statistical programmers use the high-level SAS IML matrix language to program custom analyses. In a matrix language, the ability to vectorize a computation is important. Vectorization means treating data as vector and matrix objects and using vector operations rather than loops to interact with the data. After you read the data into a vector, you can construct binary (0/1) vectors that indicate whether each row is divisible by 3, by 5, or by both. You can then use the LOC function to identify the rows that satisfy each condition, as follows:

/* SAS IML programmer: Vectorize the FizzBuzz algorithm */
proc iml;
use Have; read all var "n"; close;
F = (mod(n,3)=0);           /* binary variable: is n divisible by 3? */
B = (mod(n,5)=0);           /* binary variable: is n divisible by 5? */
FB = F & B;                 /* binary variable: is n divisible by 3 & 5? */
Words = char(n, 8);         /* default: convert the number into a string */
Words[loc(F)]  = "Fizz";    /* write to the "div by 3" indices */
Words[loc(B)]  = "Buzz";    /* write to the "div by 5" indices */
Words[loc(FB)] = "FizzBuzz";/* write to the "div by 3 & 5" indices */
print n Words;

This program can lead to discussions about efficiency, vectorization, and logical operators on vectors.

Discussion

The FizzBuzz program assignment is more than a programming exercise. It can provide opportunities for discussing related SAS programming topics. For example:

  • Does the implementation handle missing values?
  • Does the program correctly handle negative integers? What about 0?
  • What does the program do if the input data are not integers? For example, what is FizzBuzz(3.2)?
  • How would you modify the program to detect whether the input is not a positive integer and write "Jazz" in that case?
  • Suppose the input data set contains one billion observations. Discuss the efficiency of your implementation of FizzBuzz.

Summary

The FizzBuzz algorithm is an elementary programming assignment that tests whether a programmer has minimal knowledge of a language. It is sometimes used in job interviews to assess the candidate's skills. This article presents a SAS-specific variation on the classic FizzBuzz assignment. It also shows how this elementary problem can be solved by using more sophisticated methods in SAS, such as user-defined functions, user-defined formats, and matrix programming in the SAS IML language. Although the methods might be too difficult for some candidates to write during an interview, a discussion of the enhancements can help assess the candidate's knowledge of advanced techniques in SAS.

In early 2023, many programmers have been impressed by the ability of ChatGPT and Bing Chat to write elementary computer programs. Can an AI chatbot replace a junior-level SAS programmer? In my next blog post, I investigate what you get if you ask Bing Chat to implement the FizzBuzz algorithm in SAS.

The post An interview question for SAS programmers appeared first on The DO Loop.

4月 242023
 

A previous article discusses how to compute the union, intersection, and other subsets of a pair of sets. In that article, I displayed a simple Venn diagram (reproduced to the right) that illustrates the intersection and difference between two sets. The diagram uses a red disk for one set, a blue disk for the other set, and makes the disks semi-transparent so that the intersection is purple.

When I was creating the graph, I thought about how I might create a similar graph that shows the symmetric difference. The symmetric difference between two sets, A and B, is the set of elements that are in either A or B, but not both. This is sometimes called the "exclusive OR" (XOR) operation. To create a symmetric difference graph, or to highlight other subsets, you need to specify the colors of the three areas in the graph. For example, the Wikipedia article for the symmetric difference includes an image (shown below) that uses red for the set differences and white for the intersection. I wanted to create a similar image in SAS.

This article shows how to use the ELLIPSEPARM statement in PROC SGPLOT to create a simple Venn diagram that shows the relationships between sets. It also shows how to use the POLYGON statement to create a diagram for which you have complete control over the colors of each portion of the graph. Lastly, I provide references for SAS papers that show how to construct Venn diagrams when you want the areas of the regions to represent counts in real data.

A simple Venn diagram for sets

A simple Venn diagram is shown at the top of this article. The diagram is an abstract representation of the relationship between sets. The relative sizes of the colored areas are not important in this diagram. You can create this diagram in SAS by using two ELLIPSEPARM statements in PROC SGPLOT. You can create the text by using the TEXT statement.

You need to choose a coordinate system so that you can control the placement of the text relative to the disks. For this example, I used the following coordinates:

  • The left circle has unit radius and is centered at (0, 0).
  • The right circle has unit radius and is centered at (1, 0).
  • The text 'A\B' is placed at (-0.25, 0). The text 'A&B' is placed at (0, 0). The text 'B\A' is placed at (1.25, 0).

First, create a SAS data set that contains the text values and positions. You can then overlay the text on a diagram that displays the disks and uses the TRANSPARENCY= option to make the disks semitransparent, as follows:

/* Visualize Venn diagram for the intersection of two sets */
data Labels;   /* create labels values and locations for the graph */
tx=-0.25; ty=0;   text='A\B'; output;
tx= 0.5;  ty=0;   text='A&B'; output;
tx= 1.25; ty=0;   text='B\A'; output;
tx= 0;    ty=1.2; text='A';   output;
tx= 1;    ty=1.2; text='B';   output;
run;
 
title "Intersection of Sets";
proc sgplot data=Labels noautolegend noborder;
   ellipseparm semimajor=1 semiminor=1 / slope=0 xorigin=0 yorigin=0 
      fill fillattrs=(color=red) transparency=0.5;
   ellipseparm semimajor=1 semiminor=1 / slope=0 xorigin=1 yorigin=0 
      fill fillattrs=(color=royalblue) transparency=0.5;
   text x=tx y=ty text=text / textattrs=(size=18);
   xaxis display=none offsetmax=0.05 offsetmin=0.05;
   yaxis display=none offsetmax=0.05 offsetmin=0.05; 
run;

The diagram is shown at the top of this article. If you want to extend it to three sets, I suggest using a disk centered at (0, -1).

Control over the colors in a Venn diagram

The technique in the previous section enables you to control the colors of the left and right disks, but the color of the intersection is determined by color mixing. In elementary school, we learn that red and yellow mix to become orange, red and blue make purple, and yellow and blue make green. You cannot choose the color of the intersection; the color is determined by the colors of the two disks.

To independently control the colors of the three regions (A\B, A&B, and B\A), you need to represent them as polygons and use the POLYGON statement in PROC SGPLOT. If you allow the center and radius of the circles to be arbitrary, then the geometry of the circle-circle intersection is somewhat complicated. However, I chose the position and the radii so that the geometry is as simple as possible. Namely, the two circles intersect at θ = ±π/3 with respect to the center of the first circle. Relative to the center of the second circle, the intersection points are at the angles φ = {2π/3, 4π/3}.

Recall that you can parameterize the left circle by the central angle by using the parametric equations θ → (cos(θ), sin(θ)). Similarly, if φ is the central angle of the right circle, you can parameterize the circle by using φ → (1+cos(φ), sin(φ). Because we know where the circles intersect, we can parameterize the crescent-shaped and lens-shaped regions by strategically switching from one parametric equation to another. This is carried out in the following SAS DATA step. The results are then plotted by using a POLYGON statement and the GROUP= option. To save typing, I define two macros. The %C1 macro implements the parameterization of the first circle. The %C2 macro implements the parameterization of the second circle. For this graph, I do not attempt to control the colors or to overlay text. I just show the three regions.

%macro C1(theta);
   x = cos(&theta);    /* circle centered at (0,0) */
   y = sin(&theta);
   output;
%mend;
%macro C2(phi);
   x = d + cos(&phi);  /* circle centered at (d,0) */
   y =     sin(&phi);
   output;
%mend;
 
data XOR;
/* Circles have unit radii and are centered at (0,0) and (d,0).
   The circles intersect at %C1(pi/3) and %C1(-pi/3) */
d = 1;        
pi = constant('pi'); 
dt = pi/45;           /* step size for angles */
/* left crescent-shaped portion of the circle "A" */
ID = 'A\B';
do theta = -pi/3 to pi/3 by dt; /* crescent; use circle B params */
   %C2(pi - theta);
end;
do theta = pi/3 to 2*pi-pi/3 by dt; /* use circle A params */
   %C1(theta);
end;
/* lens-shaped intersection A & B */
ID = 'A&B';
do phi = 2*pi/3 to 4*pi/3 by dt;  /* use circle B params */
   %C2(phi);
end;
do theta = -pi/3 to pi/3 by dt;  /* use circle A params */
   %C1(theta);
end;
/* right crescent-shaped portion of the circle B */
ID = 'B\A';
do phi = 4*pi/3 to 2*pi + 2*pi/3 by dt; /* use circle B params */
   %C2(phi);
end;
do theta = pi/3 to -pi/3 by -dt; /* crescent; use circle A params */
   %C1(theta);
end;
drop pi d theta phi;
run;
 
title "Intersection of Sets";
proc sgplot data=XOR;
   polygon x=x y=y ID=ID / group=ID fill;
   xaxis display=none offsetmax=0.1 offsetmin=0.1; /* pad the margins */
   yaxis display=none offsetmax=0.1 offsetmin=0.1; 
run;

The graph consists of three polygons, each in a different color. You can merge the polygon data with the text data and overlay the text on the Venn diagram. Furthermore, you can use the STYLEATTRS statement to assign colors to each region. In this example, I use a reddish color for the two crescent-shaped regions and white for the lens-shaped region in the center.

/* combine the polygons and the text labels */
data All;
set XOR Labels;
run;
 
/* use the reddish color for GraphData1 in the HTMLBlue style. See
   https://blogs.sas.com/content/iml/2017/02/06/group-colors-sgplot.html */
%let gcdata2 = cxA23A2E;        /* a darkish red */
title "Exclusive OR: XOR(A,B)";
proc sgplot data=All noautolegend noborder;
   styleattrs datacolors=(&gcdata2 white &gcdata2);
   polygon x=x y=y ID=ID / group=ID fill outline lineattrs=(thickness=2);
   text x=tx y=ty text=text / textattrs=(size=18);
   xaxis offsetmax=0.05 offsetmin=0.05 display=none;
   yaxis offsetmax=0.05 offsetmin=0.05 display=none;
run;

Success! To change the colors of regions, specify any colors you want for the DATACOLORS= option on the STYLEATTRS statement.

Summary

This article shows two ways to create a Venn diagram that illustrates relationships between sets such as intersection and set difference. The simplest method uses ELLIPSEPARM statements in PROC SGPLOT. You can specify colors for the disks in the diagram. By making the colors semi-transparent, the colors of the intersecting regions are determined by the standard properties of color mixing. A more sophisticated method uses basic geometry to parameterize the regions in the diagram as polygons. By using the POLYGON statement in PROC SGPLOT, you can completely control the color of each region.

Further reading

SAS customers have a long history of using SAS graphics to create Venn diagrams. Some of the diagrams are abstract, such as presented here, but others attempt to control the size and centers of the circles so that the relative areas of the circles and their intersection are proportional to counts in data. The latter graphs are called area-proportional Venn diagrams.

For example, suppose a survey reveals that 40 people like to eat hamburgers, 60 people like to eat ice cream, and 20 people like both. Let A be the set of people that like hamburgers, and B be the set of people that like ice cream. Then you could arrange the size and position of the circles so that Area(B) = 1.5*Area(A) and Area(A&B) = 0.5*Area(A).

For two sets, you can solve this problem by using circles. For three sets, you need to use ellipses to create area-proportional Venn diagrams. A discussion and algorithm are presented in the open-access article, Micallef and Rodgers (2014) "eulerAPE: Drawing Area-Proportional 3-Venn Diagrams Using Ellipses," PLoS ONE.

The following articles are about creating Venn diagrams in SAS. Some of the articles use older SAS/GRAPH routines instead of the newer ODS statistical graphics. Some authors draw a distinction between Venn diagrams and Euler diagrams.

I will close by remarking that a Venn diagram for three or more sets might misrepresent the relationships between sets. The simplest example with three sets is A={1} and B={2} and C={1, 2}. There is no way to use circles (or ellipses) to draw a Venn diagram that correctly represents the relationships between these three sets. (You can, however, use rectangles for this situation.) Two attempts are shown below. The attempt on the left gives the false impression that there are elements of C that are not contained in A or B. The attempt on the right misrepresents the intersection between C and the other sets.

The post Venn diagrams that illustrate relationships between sets appeared first on The DO Loop.

4月 192023
 

The fundamental operations on sets are union, intersection, and set difference, all of which are supported directly in the SAS IML language. While studying another programming language, I noticed that the language supports an additional operation, namely the symmetric difference between two sets. The language also supports query functions to test for subsets, supersets, and disjoint sets. This article shows how to work with sets in the SAS IML language, including how to compute the symmetric difference and how to implement set queries.

Support for sets in SAS

The SAS IML language has four built-in functions that operate on sets:

  1. Union: The UNION function computes the union of sets.
  2. Intersection: The XSECT function computes the intersection of sets.
  3. Difference: The SETDIF function computes the difference between two sets.
  4. Subset: The ELEMENT function returns an indicator variable that specifies which elements of one vector are contained in another.

In a previous article, I showed how to test whether two sets are equal.

To show how the usual set operations work, let's create four sets of integers restricted to the range [0, 9]. The four sets are:

  • The even numbers, which form the set {0, 2, 4, 6, 8}.
  • The odd numbers, which form the set {1, 3, 5, 7, 9}.
  • The prime numbers, which form the set {2, 3, 5, 7}.
  • The Fibonacci numbers. The Fibonacci sequence is 0, 1, 1, 2, 3, 5, 8, ..., so the set of Fibonacci numbers less than 10 is the set {0, 1, 2, 3, 5, 8}. You can use the UNIQUE function to create a set of unique, sorted, values from an arbitrary vector of values.
proc iml;
Evens = {0, 2, 4, 6, 8};
Odds = {1, 3, 5, 7, 9};
Primes = {2, 3, 5, 7};
/* Note: if values are not distinct, use UNIQUE function to create a set */
Fibonaccis = unique({0, 1, 1, 2, 3, 5, 8});
 
/* union */
All = union( Evens, Odds );
 
/* intersection */
EvenPrimes = xsect( Evens, Primes );
 
/* set difference A \ B */
FibsNotPrime = setdif( Fibonaccis, Primes );
PrimesNotFibs = setdif( Primes, Fibonaccis );
print All, EvenPrimes, FibsNotPrime, PrimesNotFibs;

The program demonstrates a few simple set operations:

  • The UNION function computes the union of the even and odd numbers, which results in all the whole numbers (less than 10).
  • The XSECT function computes the intersection of the even numbers and the primes. The number 2 is the only even prime.
  • Whereas unions and intersections are commutative operations, the set difference is not. The first call to the SETDIF function computes the set of elements that are Fibonacci numbers but not prime. This is the set {0, 1, 8}. The second call computes the set of primes (less than 10) that are not Fibonacci numbers. This is the set {7}.

The symmetric difference between sets

The symmetric difference between two sets, A and B, is the set of elements that are in either A or B, but not both. In other words, you take the union of the sets and remove any elements in the intersection: (A ∪ B) \ (A ∩ B). In computer science and logic, this is sometimes called the "exclusive OR" (XOR) operation. Clearly, you can compute this set in SAS IML by using the union, intersection, and set-difference operators, as follows:

/* symmetric difference is the union minus the intersection: (A U B) \ (A & B) */
start SymDiff(A, B);
   U = union(A,B);
   N = xsect(A,B);
   return setdif(U, N);
finish;
 
SymDiff = SymDiff( Fibonaccis, Primes );
print SymDiff;

The symmetric difference is commutative. The example shows the set of integers that are either Fibonacci numbers or are prime, but not both. The result is the set {0, 1, 7, 8}, which is also the union of the differences A\B and B\A.

Subsets, supersets, and disjoint sets

So far, we have not used the ELEMENT function. The ELEMENT function indicates which elements in one set are contained in another set. In other words, it identifies the elements of a set A that are also in another set B. You can use the ELEMENT function to determine whether all the elements in A are also elements of B (that is, whether A ⊆ B). By exchanging the roles of A and B, you can determine whether A ⊇ B. Lastly, two sets are disjoint if their intersection is empty. All of these queries can be written in the SAS IML language in a natural way:

/* isSubset: return 1 (true) if A is a subset of B */
start isSubset(A, B);
   AinB = element(A,B);
   return( all(AinB) );
finish;
 
/* isSubset: return 1 (true) if A is a superset of B */
start isSuperset(A, B);
   return isSubset(B,A);
finish;
 
/* isDisjoint: return 1 (true) if A does not intersect B */
start isDisjoint(A, B);
   return( isEmpty(xsect(A,B)) );
finish;
 
/* run some examples */
bAllPrimesOdd = isSubset( Primes, Odds );     /* are the primes a subset of the odd numbers? (no) */
bIntSuperEven = isSuperset( All, Evens );     /* are the integers a superset of the odds? (yes) */
bOddEvenDisjoint = isDisjoint( Odds, Evens ); /* are the odds and evens disjoint? (yes) */
print bAllPrimesOdd bIntSuperEven bOddEvenDisjoint;

The output shows the following:

  • The prime numbers are not a subset of the odd numbers because 2 is a prime but not an odd number.
  • The set of all integers is a superset of the set of even numbers.
  • The odd numbers and even numbers are disjoint sets.

Summary

The SAS IML language supports several built-in operations for unions, intersections, differences, and set membership. You can use the built-in operations to create new operations such as the symmetric difference (XOR) or to query the relationship between two sets. By using these functions, you can experience the joy of sets.

The post The joy of sets appeared first on The DO Loop.

3月 222023
 

A data analyst wanted to estimate the correlation between two variables, but he was concerned about the influence of a confounding variable that is correlated with them. The correlation might affect the apparent relationship between main two variables in the study. A common confounding variable is age because young people and older people tend to have different attitudes, behaviors, resources, and health issues.

This article gives a brief overview of partial correlation, which is a way to adjust or "control" for the effect of other variables in a study.

In this article, a confounding variable is a variable that is measured in the study and therefore can be incorporated into a model. This is different than a lurking variable, which is a variable that is not measured in the study. These definitions are common, but not universally used. The process of incorporating a confounding variable into a statistical model is called controlling for the variable.

Age as a confounding variable

Age is the classic confounding variable, so let's use a small SAS data set that gives the heights, weights, and ages of 19 school-age children. To make the analysis more general, I define macro variables Y1 and Y2 for the names of the two variables for which we want to find the correlation. For these data, the main variables are Y1=Height and Y2=Weight, and the confounding variable is Age.

To begin, let's compute the correlation between Y1 and Y2 without controlling for any other variables:

%let DSName = sashelp.Class;
%let Y1 = Height;
%let Y2 = Weight;
 
proc corr data=&DSName noprob plots(maxpoints=none)=scatter(ellipse=none);
   var &Y1;
   with &Y2;
run;

The output from the procedure indicates that Y1 and Y2 are strongly correlated, with a correlation of 0.88. But the researcher suspects that the ages of the students are contributing to the strong association between height and weight. You can test this idea graphically by creating a scatter plot of the data and coloring the markers according to the value of a third variable, as follows.

%let ControlVars = Age;         /* list one or more control variables */
 
title "Color Markers by a Confounding Variable";
proc sgplot data=&DSName;
   scatter x=&Y1 y=&Y2 / colorresponse=%scan(&ControlVars, 1)   /* color by first control var */
                         markerattrs=(symbol=CircleFilled size=14) FilledOutlinedMarkers
                         colormodel=TwoColorRamp;
run;

The graph shows that low values of height and weight are generally associated with low values of age, which are light blue in color. Similarly, high values of height and weight are generally associated with high values of age, which are colored dark blue. This indicates that the age variable might be affecting the correlation, or even might be responsible for it. One way to "adjust" for the age variable is to use partial correlation.

What is partial correlation?

A partial correlation is a way to adjust a statistic to account for one or more additional covariates. The partial correlation between variables Y1 and Y2 while adjusting for the covariates X1, X2, X3, ... is computed as follows:

  1. Regress Y1 onto the covariates and calculate the residuals for the model. Let R1 be the variable that contains the residuals for the model.
  2. Regress Y1 onto the covariates. Let R2 be the column of residuals.
  3. Compute the correlation between R1 and R2. This is the partial correlation between Y1 and Y2 after adjusting for the covariates.

How to compute partial correlation in SAS

SAS provides an easy way to compute the partial correlation. PROC CORR supports the PARTIAL statement. On the PARTIAL statement, you list the covariates that you want to account for. PROC CORR automatically computes the regression models and provides the correlation of the residuals. In addition to the partial Pearson correlation, PROC CORR can report the partial versions of Spearman's rank correlation, Kendall's association, and the partial variance and standard deviation. The partial mean is always 0, so it is not reported.

The following call to PROC CORR shows the syntax of the PARTIAL statement:

proc corr data=&DSName plots(maxpoints=none)=scatter(ellipse=none);
   var &Y1;
   with &Y2;
   partial &ControlVars; 
run;

The output includes tables and a scatter plot of the residuals of Y2 versus the residuals of Y1. Only the graph is shown here. The graph includes an inset that tells you that the partial correlation is 0.7, which is less than the unadjusted correlation (0.88). Although the axes are labeled by using the original variable names, the quantities that are graphed are residuals from the two regressions. Notice that both axes are centered around zero, and the range of each axis is different from the original range.

How should you interpret this graph and the partial correlation statistic? The statistic tells you that after adjusting for age, the correlation between the heights and weights of students is about 0.7. The graph shows the scatter plot of the residual values of the heights and weights after regressing those variables onto the age variable.

How to manually calculate partial correlations

You can obtain the partial correlation manually by performing each step of the computation. This is not necessary in SAS, but it enables you to verify the computations that are performed by the PARTIAL statement in PROC CORR. To verify the output, you can manually perform the following steps:

  1. Use PROC REG to regress Y1 onto the covariates. Use the OUTPUT statement to save the residuals for the model.
  2. Use PROC REG to regress Y2 onto the covariates. Save the residuals for this model, too.
  3. Merge the two output data sets.
  4. Use PROC CORR to compute the (raw) correlation between the residual variables. This is the partial correlation between Y1 and Y2 after adjusting for the covariates.

The following SAS statements perform this analysis:

/* 1. Regress Y1 onto covariates and save residuals */
proc reg data=&DSName noprint;
   model &Y1 = &ControlVars;
   output out=RegOut1 R=Part_&Y1;
quit;
/* 2. Regress Y2 onto covariates and save residuals */
proc reg data=&DSName noprint;
   model &Y2 = &ControlVars;
   output out=RegOut2 R=Part_&Y2;
quit;
/* 3. Merge the two sets of residuals */
data RegOut;
merge RegOut1(keep=&ControlVars Part_&Y1)
      RegOut2(keep=Part_&Y2);
label Part_&Y1 = "Residual of &Y1 ~ (&ControlVars)"
      Part_&Y2 = "Residual of &Y2 ~ (&ControlVars)";
run;
/* 4. Display the correlation between the residuals */
proc corr data=RegOut plots=scatter(ellipse=none);
   var Part_&Y1;
   with Part_&Y2;
run;

As expected, the graph of the residual values is the same graph that was created by using the PARTIAL statement. As expected, the inset shows that the correlation between the residuals is the same value that was reported by using the PARTIAL statement.

Summary

It can be useful to account for or "adjust" a statistic to account for other variables that might be strongly correlated with the main variables in the analysis. One way to adjust for confounding variables is to regress the main variables onto the confounding variables and look at the statistic for the residual values. This is called a "partial" statistic and is supported in SAS by using the PARTIAL statement in several procedures. This article shows how to use the PARTIAL statement in PROC CORR to compute the partial correlation. It also shows how to manually reproduce the computation by explicitly performing the regressions and calculating the correlation of the residuals.

Appendix: PROC REG also provides (squared) partial correlations

In SAS, there are often multiple ways to get the same statistic. It turns out that you can get the SQUARED Pearson partial correlation from PROC REG by using the PCORR2 option on the MODEL statement. I don't often use this option, but for completeness, the following statements compute the squared partial correlation between Y2 and Y1:

proc reg data=&DSName plots=none;
   model &Y2 = &Y1 &ControlVars / pcorr2;
   ods select ParameterEstimates;
quit;

The "Type II Squared Partial Correlation" between Y2 and Y1 is 0.49656. If you take the square root of that number, you get ±0.7047. The tabular output from PROC REG does not enable you to determine whether the partial correlation is positive or negative.

The post Partial correlation: controlling for confounding variables appeared first on The DO Loop.

2月 062023
 

This article is about how to use Git to share SAS programs, specifically how to share libraries of SAS IML functions. Some IML programmers might remember an earlier way to share libraries of functions: SAS/IML released "packages" in SAS 9.4m3 (2015), which enable you to create, document, share, and use libraries of IML functions. Unfortunately, the "package" mechanism for SAS 9.4 assumes that programmers can install files on their local SAS workspace server, which is often running on a desktop or laptop PC. This mechanism does not work well for SAS Viya, which runs SAS "in the cloud" on a remote server that is deployed and maintained by an administrator.

Recently, I have been experimenting with using Git to share code with others in a way that will work on SAS Viya as well as on SAS 9.4. Git is not new. It is also not new to many SAS programmers. In fact, for several years Chris Hemedinger has been promoting the use of Git to manage SAS projects by writing blog posts and giving talks at SAS User Groups. The SAS language has supported function that interact with Git since SAS 9.4M6. Many interfaces to SAS, including SAS Studio and SAS Enterprise Guide, provide GUI support for Git operations.

Since I am a programmer, I will show how to use Git function in the DATA step to download and use a SAS IML function that I created in a recent blog post. There are differences between the Git functions in SAS 9.4 and the Git functions in SAS Viya, so I will show both techniques. For this article, I only need one function because my goal is to copy ("clone") files from a remote Git repository.

The task: Download a library of SAS IML functions

The remainder of this article shows how to copy the GitHub repository (a "repo") to a specified location, then use a %INCLUDE statement to include a file into a PROC IML program. On SAS 9.4, I download the repo to my PC. On SAS Viya in the cloud, I download the repo to the WORK libref. The technique in this article is based on Chris Hemedinger's guide to using Git in SAS, which has additional details.

To demonstrate this technique, I show how to include a file that defines the PrintToLog subroutine. The PrintToLog subroutine has been part of SAS IML on SAS Viya since Viya 3.5, but it is not supported in SAS 9.4. However, I like to use the subroutine, so I wrote a module named PrintToLog that reproduces the functionality in SAS 9.4. I wrapped the module in a macro that detects whether the program is running on SAS 9.4 or SAS Viya. On SAS 9.4, the macro defines the module. On SAS Viya, the macro does nothing. Thus, in any version of SAS, you can call the macro and then call the PrintToLog subroutine.

I uploaded the module to a GitHub repository for my blog. You can look at the file that defines the PrintToLog subroutine.

Download functions from GitHub in SAS 9.4

In SAS 9.4, you can copy the files from the Git repo onto your PC or to any directory that SAS can write to. I chose the location C:\Downloads\BlogRepo. I turn on the DLCREATEDIR option to tell SAS that it should create the directory if it does not already exist. If you haven't used Git before, be aware that "cloning a repo" will copy all directories and all files. Even if you want only the PrintToLog module, you have to copy all files and all directories.

In SAS 9.4m6, the function to create a local copy of a remote repo is GITFN_CLONE. (This name was later deprecated in favor of GIT_CLONE, without the 'FN' characters.) The following SAS DATA step copies the files from the Git repo into a specified location on my PC in SAS 9.4m6:

/* Clone the GitHub repo into RepoPath on SAS 9.4M6 */
options dlcreatedir;  /* give permission to create the RepoPath directory, if it doesn't exist */
%let gitURL = https://github.com/sascommunities/the-do-loop-blog/;  /* Git repo to copy */
%let RepoPath = C:\Downloads\BlogRepo;                              /* location to put copy */
 
/* clone repository into RepoPath; if local repository exists, skip download */
data _null_;
if fileexist("&RepoPath.") then do;
   put 'Repository already exists; skipping the clone operation';
end;
else do;
   put "Cloning repository from &gitURL";
   /* NOTE: use GITFN_CLONE function for 9.4M6; use GIT_CLONE function for Viya */
   rc = gitfn_clone("&gitURL", "&RepoPath." ); 
end;
run;

If I navigate to the RepoPath directory on my PC, I can see that the entire repo has been copied, including the PrintToLog directory and SAS file.

I can therefore include the file into a SAS program in the usual way by using the %INCLUDE statement:

proc iml;
/* define the PrintToLog subroutine */
%include "&RepoPath/printtolog/printtolog.sas";
/* call the PrintToLog subroutine */
run PrintToLog("This is a test message.", 0);

If you run the program, it prints the following message to the log:

NOTE: This is a test message.

The GITFN_CLONE function downloads a repo to an empty directory. If the remote repo changes and you want to download the new version, you can delete the local repo and rerun the DATA _NULL_ step.

Download functions from GitHub in SAS Viya

Technically, there is no need to download the PrintToLog function in SAS Viya, since the PrintToLog function is built-in to SAS IML in SAS Viya. Nevertheless, let's examine how to clone a Git repo on SAS Viya.

Here's the main issue: If you are running a program on SAS Viya "in the cloud," the program is probably executing on a remote server. Files on your local machine (on which you are running SAS Studio in a browser) might not be accessible to the server. In addition, you might not have administrative privileges to add new files and directories on the server. So how can you download files that you can %INCLUDE into a program?

Chris Hemedinger's blog post introduced me to a wonderful idea: put the files into the WORK libref or some other libref for which you have write permission. If you use a temporary libref such as WORK, the repo will disappear at the end of the SAS session. If you want the repo to persist, clone it to a permanent libref.

Thus, I need to modify only two lines in the previous program: define the RepoPath macro to point to a writable libref, and call the GIT_CLONE function (note the different name), as follows:

/* Clone the GitHub repo into RepoPath on SAS Viya */
options dlcreatedir;  /* give permission to create the RepoPath directory if it doesn't exist */
%let gitURL = https://github.com/sascommunities/the-do-loop-blog/;  /* Git repo to copy */
%let RepoPath = %sysfunc(getoption(WORK))/BlogRepo;                 /* location to put copy */
 
/* clone repository into RepoPath; if local repository exists, skip download */
data _null_;
if fileexist("&RepoPath.") then do;
   put 'Repository already exists; skipping the clone operation';
end;
else do;
   put "Cloning repository from &gitURL";
   /* NOTE: use GITFN_CLONE for 9.4M5; use GIT_CLONE for 9.4M6 and for Viya */
   rc = git_clone("&gitURL", "&RepoPath." ); 
end;
run;
 
proc iml;
/* define the PrintToLog subroutine */
%include "&RepoPath/printtolog/printtolog.sas";
/* call the PrintToLog subroutine */
run PrintToLog("This is a test message.", 0);
quit;

The %INCLUDE statement successfully reads the printtolog.sas file, which is located in the local copy of the Git repo. You can set the SOURCE2 option (by running OPTIONS SOURCE2) if you want the log to display the code that is read by the %INCLUDE statement.

Best practices to create a SAS IML library

Typically, people want to share a library of SAS IML functions that are related to each other. For example, the functions might all perform computations in a subject area such as computational biology or financial risk management. Here are a few best practices for sharing a library of SAS IML functions with others in a GitHub repo:

  • Create the files so that they can be included in a PROC IML program. That means that each file should consist of a series of START/FINISH statements that define the modules. The file should NOT contain a PROC IML statement or a QUIT statement.
  • Use the STORE statement to store the functions at the end of the file. This enables the user to store the modules. The stored modules are faster to load because they do not need to be parsed when they are loaded into a PROC IML program.
  • If the functions are related, make the module names start with a common prefix. For example, if you distribute functions that compute quantities about polygons, you might choose to use the string "Poly_" as the first few characters of each name.
  • Include documentation and examples of using the functions. The documentation explains the input and output arguments. The examples demonstrate how to call the functions and explain the results.

For example, the following template shows how you might structure a file (named Sim_Defin.sas) that contains many functions that are related to simulations:

/* do not use the PROC IML statement */
start Sim_Normal(n, mean=0, std=1);
   ...
finish;
start Sim_LogNormal(n, mean=0, std=1);
   ...
finish;
start Sim_Exp(n, scale=1);
   ...
finish;
store module=(Sim_Normal Sim_LogNormal Sim_Exp);
/* do not use the QUIT statement */

After downloading the file, a SAS IML programmer can use the functions as follows:

proc iml;
%include "&RepoPath/Sim/Sim_Define.sas";   /* read the function definitions and store them */
x = Sim_Normal(100);
quit;

Because the file ends with a STORE statement, you can use the LOAD statement for subsequent calls:

proc iml;
load module=(Sim_Normal);   /* load individual modules or LOAD MODULE=_ALL_; */
x = Sim_Normal(100);
quit;

Summary

This article discusses how to use functions in SAS to download a Git repository of files. On SAS 9.4m6, you can call the GITFN_CLONE function in a DATA step to copy a repo from a remote site such as GitHub into a local repository. On SAS Viya in the cloud, you can use the GIT_CLONE function to copy a repo into a libref for which you have write permission. In either case, you can then use the %INCLUDE statement to read a file into a SAS program. In SAS IML, the primary application of this technique is to read a file that defines a series of related modules. The article concludes by providing some best practices for sharing a library of SAS IML modules.

The post How to use Git to share SAS programs appeared first on The DO Loop.

2月 012023
 

SAS supports the ColorBrewer system of color palettes from the ColorBrewer website (Brewer and Harrower, 2002). The ColorBrewer color ramps are available in SAS by using the PALETTE function in SAS IML software. The PALETTE function supports all ColorBrewer palettes, but some palettes are not interpretable by people with color deficiencies (usually called "colorblind" people). Fortunately, the ColorBrewer website includes information about which palettes are colorblind-safe and which are not. You should strive to use only the colorblind-safe palettes when you create graphs that you intend to show to others.

Which palettes are colorblind-safe?

As discussed previously, the most common color vision deficiencies are the inability to distinguish red and green. Therefore, as a general rule, you should avoid using palettes that contain both red and green. The ColorBrewer palettes are split into three different types of palettes:

  • A sequential palette enables you to distinguish low and high values for an interval measurement.
  • A diverging palette enables you to compare high and low values to an average.
  • A qualitative palette enables you to visualize different values of a categorical variable, such as sex, political affiliation, or an arm of a clinical study.

For each type, the following list reveals which palettes are colorblind-safe:

  • Sequential palettes: All sequential palettes from ColorBrewer are colorblind-safe. This includes the one-hue palettes (for example, "Blues"), the two-hue palettes (for example, "OrRed," which is an orange-red palette), and three-hue palettes (such as "YlOrRd," which is a yellow-orange-red palette).
  • Diverging palettes: Six diverging palettes are colorblind-safe, including the popular "BrBG," which is a brown-blue-green palette. However, three diverging palettes are not colorblind-safe. Avoid using "RdGy" (red-gray), "RdYlGn" (red-yellow-green), and "Spectral" (red-yellow-blue, which passes through green).
  • Qualitative palettes: Very few of the qualitative palettes are colorblind-safe. For three categories, you can use "Dark2", "Paired", and "Set2". For four categories, only "Paired" is a colorblind-safe palette. Many qualitative palettes use green as one of the colors.

A SAS IML module for colorblind-safe palettes

After discovering which palettes are colorblind-safe, it is easy to write a wrapper for the PALETTE function in SAS IML that adds a bit of colorblind-safe logic. The following function returns the requested palette if it is colorblind-safe. Otherwise, it returns a vector of missing values and writes an error to the log by using the PrintToLog subroutine.

/* In SAS 9.4, the macro defines a function that emulates the PrintToLog call.
   See https://blogs.sas.com/content/iml/2023/01/25/printtolog-iml.html */
%macro DefinePrintToLog;
%if %sysevalf(&sysver = 9.4) %then %do;
start PrintToLog(msg,errCode=-1);
   if      errCode=0 then prefix = "NOTE: ";
   else if errCode=1 then prefix = "WARNING: ";
   else if errCode=2 then prefix = "ERROR: ";
   else prefix = "";
   stmt = '%put ' + prefix + msg + ';';
   call execute(stmt);
finish;
%end;
%mend;
 
/* Support colorblind-safe palettes as reported by https://colorbrewer2.org/
   If the requested palette is not colorblind-safe, return missing values and log a warning. */
proc iml;
%DefinePrintToLog
 
start ColorblindSafe(name, numColors );
/* Create a denylist of palettes that are NOT colorblind-safe. 
   The first column is the name of a palette.
   The second column is 'N' if not safe for any number of colors. 
   Otherwise, '3' or '4' indicates the max number of colors for which the palette is safe. */
NotCBS = {
         RDGY     N,
         RDYLGN   N,
         SPECTRAL N,
         ACCENT   N,
         DARK2    3,  /* only numColors <= 3 */
         PAIRED   4,  /* only numColors <= 4 */
         PASTEL1  N,
         PASTEL2  N,
         SET1     N,
         SET2     3,  /* only numColors <= 3 */
         SET3     N 
         };
 
/* if the requested palette is not colorblind-safe, return a vector of missing values */
ErrPalette = j(1, numColors, .);
idx = loc(NotCBS[,1] = upcase(name));
if ncol(idx)=0 then /* name is not on the denylist. Use it! */
   return palette(name, numColors);
 
isCBS = NotCBS[idx,2];  /* get value in second column */
if isCBS='N' then do;   /* 'N' = not colorblind-safe */
   msg = cat("The palette '", strip(name), "' is not colorblind-safe.");
   call PrintToLog(msg, 1); /* WARNING */
   return ErrPalette;
end;
else do;
   n = num(isCBS);     /* '4' = colorblind-safe only for <= 4 colors */
   if numColors <= n then 
      return palette(name, numColors);
   else do;
      msg = cat("The palette '", strip(name), "' is colorblind-safe only for ",
                strip(isCBS) + " or fewer colors.");
      call PrintToLog(msg, 1); /* WARNING */
      return ErrPalette;
   end;
end;
finish;
 
/* test the ColorblindSafe function on a few examples */
RedBlue = ColorblindSafe("RdBu", 7);
Spectral = ColorblindSafe("Spectral", 5);
Pair4 = ColorblindSafe("Paired", 4);
Pair5 = ColorblindSafe("Paired", 5);
print RedBlue, Spectral, Pair4, Pair5;

The output shows that the 'Spectral' palette is not colorblind-safe. The 'Paired' palette is safe for four or fewer colors. If the caller requests a colorblind-safe palette, it is returned. Otherwise, the function returns a vector of missing values. The program also writes the following messages to the log:

The messages inform the user that some of the requested palettes are not colorblind-safe.

Summary

SAS software supports the ColorBrewer family of color palettes. Not all the palettes are colorblind-safe. Among the palettes, all sequential palettes are safe, and most of the diverging palettes are safe. However, most of the qualitative palettes are not safe. The SAS IML function in this article creates a denylist of the ColorBrewer palettes that are not colorblind-safe. The function returns a colorblind-safe palette when one is requested. However, the function returns a vector of missing values if the palette is on the denylist.

The post Colorblind-safe palettes in SAS appeared first on The DO Loop.

1月 232023
 

Many experienced SAS programmers use the PUT statement to write messages to the log from a DATA step. But did you know that SAS supports the PUTLOG function, which is another way to write a message to the log? I use the PUTLOG statement in the DATA step for the following reasons:

  • It is easy to understand. The name of the statement alerts anyone reading the program that the program is sending a message to the log.
  • It is specific. Whereas the PUT statement can be used to write to any open file, the PUTLOG statement always writes to the log.
  • It is dependable. Suppose you are writing a SAS macro that other people will use. If you use the PUTLOG statement, your message will always appear in the log. If you use the PUT statement, you have to handle the possibility that the user has defined a fileref, which could potentially redirect the messages away from the log.
  • It displays colored messages: f you start the message with the keywords "NOTE:", "WARNING:", or "ERROR", the log displays the message in a special color.

How to use the PUTLOG statement

The PUTLOG statement is easy to use. Typically, it is part of an IF-THEN block and is executed conditionally when a certain condition is encountered. You can display a message in the log, but you can also display the values of any variables in the DATA step.

The following DATA step reads in coefficients for a quadratic equation a x2 + b x + 3 = 0. For each set of coefficients, the program uses the discriminant of the quadratic equation to determine whether the equation has any real roots. If so, the program uses the quadratic formula to find the roots. To demonstrate the three common log messages, the program does the following:

  • Display an ERROR if the coefficient of the quadratic term is 0.
  • Display a WARNING if the quadratic equation does not have any real roots.
  • Display a NOTE if the quadratic equation does has a repeated root.
/* Read coefficients of quadratic equation a*x**2 + b*x + c.
   For a ^= 0, use the quadratic formula to solve for real roots of
   a*x**2 + b*x + c = 0.
*/
data Quadratic;
input a b c;
if a=0 then 
   putlog "ERROR: The coefficient of the quadratic term is 0 for observation " _N_;
else do;
   discrim = b**2 - 4*a*c;
   if discrim < 0 then do;
      putlog "WARNING: No real roots for observation " _N_
             "; the coefficients are " a= b= c=;
      Root1 = .; Root2 = .;
   end;
   else do;
      Root1 = (-b - sqrt(discrim)) / (2*a);
      Root2 = (-b + sqrt(discrim)) / (2*a);
      if Discrim=0 then 
         putlog "NOTE: Repeated root for observation " _N_;
   end;
end;
datalines;
 1 -1 -6
 1  7  6
 0  7  6
 2  7  7
-2  5 12
-2  4 -2
 5  4 10
;
 
proc print; format Root1 Root2 5.2; run;

The SAS log contains the following message, which is color-coded according to whether the message is an error (red color), a warning (green color), or a note (blue color).

The output from the PRINT procedure shows the results of the DATA step. For coefficients that have real roots, the Root1 and Root2 columns show the roots. If the equation is not quadratic or has no real roots, the columns show missing values. Notice that there are missing values for observations 3, 4, and 7, which are the observations that are mentioned in the log messages.

Summary

The PUTLOG statement is an easy way to write messages to the log from a SAS DATA step. You can use the PUTLOG statement to display messages when the data is unusual, and a computation cannot be completed.

In closing, I want to mention that outside of a DATA step, you can use the %PUT statement to write a message to the log. Like the PUTLOG statement, the message will be color-coded if you begin the message with "NOTE:", "WARNING:", or "ERROR:".

The post Use the PUTLOG statement to write errors, warning, and notes to the SAS log appeared first on The DO Loop.

12月 222022
 

The addition of the PYTHON procedure and Python editor in SAS Viya enables users to execute Python code in SAS Studio. This new capability in SAS Viya adds another tool to SAS's existing collection. With this addition I thought, how can I utilize this new found power?

In this example, I'll keep it simple. I want to create a Microsoft Excel report using a combination of SAS, Python and SQL. I'll use data that's stored in a SAS library; however, the library could be using data stored anywhere, like a path, database or in the cloud. I'll write a program that executes the following:

All code used in this post is located on GitHub, here.

Set folder path and file name

To begin, I'll create a macro variable to specify the output folder path and Microsoft Excel workbook name.

%let path=/*Enter your output folder path*/;
%let xlFileName = myExcelReport.xlsx;

Prepare data

Further, I'll prepare the data using the SAS DATA step. I'll use the available sashelp.cars table, create a new column named MPG_Avg, and drop unnecessary columns. Instead of using the DATA step you can use Python or SQL to prepare the data. Whatever tool works best for you.

data work.cars;
    set sashelp.cars;
    MPG_Avg=mean(MPG_City, MPG_Highway);
    drop Wheelbase Weight Length;
run;

Create the Microsoft Excel workbook

After the data is ready, I'll use the ODS EXCEL statement to create the Excel spreadsheet. The following ODS options are used:

  • FILE - specifies the file path and name.
  • STYLE - modifies the appearance of the SAS output
  • EMBEDDED_TITLES - specifies titles should appear in the worksheet
  • SHEET_INTERVAL - enables manual control when to create a new worksheet
ods excel file="&amp;path./&amp;xlFileName" 
		  style=ExcelMidnight   
		  options(embedded_titles="on");

Worksheet 1

Print the data using SAS

With the ODS EXCEL destination open I'll name the first worksheet Data, and manually specify when a new sheet is created. Next, I'll use the PRINT procedure to print the detailed data to Excel. The PRINT procedure will print the entire SAS data set with the associated formats and styles to Excel.

* Sheet 1 - Print the data using SAS *;
ods excel options(sheet_name='Data' sheet_interval='none');
title height=16pt color=white "Detailed Car Data";
proc print data=work.cars noobs;
run;

Worksheet 2

Create violin plots using Python

Next, I want to create violin plots on a new worksheet named Origin_MPG. Now, these can be created in SAS, but I personally found the matplotlib package in Python a bit easier to use . With the PYTHON procedure, I can include the Python code within the SAS program (or you can reference a .py file) to create the visualization. Then I'll use the SAS.pyplot method to save and render the visualization. Since the pyplot callback renders the image in the results tab, it exports the image to the Excel workbook by default.

First I'll use ODS EXCEL to create the new worksheet and the TITLE statement to add a title to the Excel worksheet.

ods excel options(sheet_name='Origin_MPG' sheet_interval='now');
title justify=left height=16pt color=white "Analyzing MPG by Each Car Origin";

Then I'll execute the PYTHON procedure to execute my Python code.

* Create violin plots using Python *;
proc python;
submit;
 
##
## Import packages and options
##
 
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
outpath = SAS.symget('path')
 
##
## Data prep for the visualization
##
 
## Load the SAS table as a DataFrame
df = (SAS
      .sd2df('work.cars')                 ## SAS callback method to load the SAS data set as a DataFrame
      .loc[:,['Origin','MPG_Avg']]        ## Keep the necessary columns
)
 
 
## Create a series of MPG_Avg for each distinct origin for the violin plots
listOfUniqueOrigins = df.Origin.unique().tolist()
 
mpg_by_origin = {}
for origin in listOfUniqueOrigins:
    mpg_by_origin[origin] = df.query(f'Origin == @origin ').MPG_Avg
 
 
##
## Create the violin plots
##
 
## Violin plot
fig, ax = plt.subplots(figsize = (8,6))
ax.violinplot(mpg_by_origin.values(), showmedians=True)
 
## Plot appearance
ax.set_title('Miles per Gallon (MPG) by Origin')
rename_x_axis = {'position': [1,2,3], 'labels':listOfUniqueOrigins}
ax.set_xticks(rename_x_axis['position'])
ax.set_xticklabels(rename_x_axis['labels']);
 
## Save and render the image file
SAS.pyplot(plt, filename='violinPlot',filepath=outpath)
 
endsubmit;
quit;
title;

SQL Aggregation

SQL is an extremely common and useful language for data analysts and scientists. I find using SQL for aggregation easy, so I will create a simple aggregation and add it below the visualization on the same worksheet in the the Excel report.

* SQL Aggregation *;
title justify=left "Average MPG by Car Makes";
proc sql;
select Origin, round(mean(MPG_Avg)) as AverageMPG
	from work.cars
	group by Origin
	order by AverageMPG desc;
quit;
title;

Add text

At the end of the same worksheet I'll add some simple text using the ODSTEXT procedure to give some information about the data.

proc odstext;
   heading 'NOTES';
   p 'Using the SASHELP.CARS data. The following car Origins were analyzed:';
   list ;
      item 'Asia';
      item 'Europe';
      item 'USA';
   end;    
   p 'Created by Peter S';
quit;

Close the Excel workbook

Lastly, I'll close the ODS EXCEL destination since I am done writing out to Excel.

ods excel close;

Results

That's it! Now I'll execute the entire program and view the Excel workbook.

Summary

With the capabilities of SAS and the new ability to execute Python code in SAS Studio, teams have a variety of tools in SAS Viya for their analytic needs.

Additional resources

PYTHON Procedure documentation
SAS opens its code editor interface to Python users
Using PROC PYTHON to augment your SAS programs
ODS Excel Statement

Creating a Microsoft Excel report using SAS, Python and SQL! was published on SAS Users.

12月 022022
 

Welcome back to my SAS Users blog series CAS Action! - a series on fundamentals. If you'd like to start by learning more about the distributed CAS server and CAS actions, please see CAS Actions and Action Sets - a brief intro. Otherwise, let's learn how to rename columns in CAS tables.

In this example, I will use the CAS language (CASL) to execute the alterTable CAS action. Be aware, instead of using CASL, I could execute the same action with Python, R and more with some slight changes to the syntax for the specific language. Refer to the documentation for syntax in other languages.

Load the demonstration data into memory

I'll start by executing the loadTable action to load the WARRANTY_CLAIMS_0117.sashdat file from the Samples caslib into memory in the Casuser caslib. By default the Samples caslib should be available in your SAS Viya environment. Then I'll preview the CAS table using the columnInfo and fetch CAS actions.

* Connect to the CAS server and name the connection CONN *;
cas conn;
 
proc cas;
   * Specify the output CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Load the CAS table *;
   table.loadtable / 
      path = "WARRANTY_CLAIMS_0117.sashdat", caslib = "samples",
      casOut = casTbl;
 
    * Preview the CAS table *;
    table.columnInfo / table = casTbl;
    table.fetch / table = casTbl, to = 5;
quit;

The columnInfo action returns information about each column. Notice that the WARRANTY_CLAIMS CAS table has column names and columns labels.

The fetch CAS action returns five rows.

Notice that by default the fetch action uses columns labels in the header.

Rename columns in a CAS table

To rename columns in a CAS table, use the alterTable CAS action. In the alterTable action, specify the CAS table using the name and caslib parameters. Additionally, use the columns parameter to specify the columns to modify. The columns parameter requires a list of dictionaries, each dictionary specifies the column to modify.

Here, I'll rename the claim_attribute_1, seller_attribute_5 and product_attribute_1 columns. Then I'll execute the columnInfo action to view the updated column information.

proc cas;
   * Reference the CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Rename columns *;
   table.alterTable / 
      name = casTbl['name'], caslib = casTbl['caslib'],
      columns = {
	{name = 'claim_attribute_1', rename = 'Campaign_Type'},
	{name = 'seller_attribute_5', rename = 'Selling_Dealer'},
	{name = 'product_attribute_1', rename = 'Vehicle_Class'}
      };
 
   * View column metadata *;
   table.columnInfo / table = casTbl;
quit;

The results show that the alterTable CAS action renamed the columns to Campaign_Type, Selling_Dealer and Vehicle_Class. While this worked, what if you wanted to rename all columns in the CAS table using the column labels?

Rename all columns using the column labels

I'll dynamically rename the CAS table columns using the column labels. Since the column labels contain spaces, I'll also replace all spaces with an underscore. Now, I could manually specify each column and column label in the alterTable action, but why do all that work? Instead you can dynamically create a list of dictionaries for use in the alterTable action.

proc cas;
* Reference the CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
  * Rename columns with the labels. Spaces replaced with underscores *;
 
   *1. Store the results of the columnInfo action in a dictionary *;
   table.columnInfo result=cr / table = casTbl;
 
   *Loop over the columnInfo result table and create a list of dictionaries *;
   *2*;
   listElementCounter = 0;
   *3*;
   do columnMetadata over cr.ColumnInfo;
	*4.1*; listElementCounter = listElementCounter + 1;
	*4.2*; convertColLabel = tranwrd(columnMetadata['Label'],' ','_');
	*4.3*; renameColumns[listElementCounter] = {name = columnMetadata['Column'], rename = convertColLabel};
   end;
 
   *5. Rename columns *;
   table.alterTable / 
	name = casTbl['Name'], 
	caslib = casTbl['caslib'], 
	columns=renameColumns;
 
   *6. Preview CAS table *;
   table.columnInfo / table = casTbl;
quit;
  1. The columnInfo action will store the results in a dictionary named cr.
  2. The variable listElementCounter will act as a counter that can be used to append each dictionary to the list.
  3. Loop over the result table stored in the cr dictionary. When you loop over a result table, each row is treated as a dictionary. The key is the column name and it returns the value of that column.
  4. In the loop:
    1. accumulate the counter
    2. access the column label and replace all spaces with underscores using the tranwrd function
    3. create a list named renamedColumns that contains each dictionary with the column to rename and it's new name.
  5. The alterTable action will use the list of dictionaries to rename each column.
  6. The columnInfo action will display the new column information.

The results show that each column was dynamically renamed using the column label and the spaces replaced with underscores.

Summary

In summary, using the alterTable CAS action enables you to rename columns in a CAS table.  With some knowledge of lists, dictionaries and loops in the CAS language, you can dynamically use the column labels to rename the columns. When using the alterTable action remember that:

  • The name and caslib parameters specify the CAS table.
  • The columns parameter requires a list of dictionaries.
  • Each dictionary specifies the column to modify.

Want to learn how to do this using Python? Check out my post Getting started with Python integration to SAS® Viya® - Part 11 - Rename Columns.

Additional resources

simple.freq CAS action
SAS® Cloud Analytic Services: CASL Programmer’s Guide 
CAS Action! - a series on fundamentals
Getting Started with Python Integration to SAS® Viya® - Index
SAS® Cloud Analytic Services: Fundamentals

CAS-Action! Rename Columns in a CAS Table was published on SAS Users.