
July 6, 2021
 

Iterative loops are one of the most powerful and imperative features of any programming language, allowing blocks of code to be automatically executed repeatedly with some variations. In SAS we call them DO-loops because they are defined by the iterative DO statements. These statements come in three distinct forms:

  • DO with index variable
  • DO UNTIL
  • DO WHILE

In this blog post we will focus on the versatile iterative DO loops with index variable as used in SAS DATA steps, as opposed to the more modest subset of DO loops available in SAS/IML.

Iterative DO statement with index variable

The syntax of the DATA step’s iterative DO statement with index variable is remarkably simple yet powerful:

DO statement with index-variable

DO index-variable=specification-1 <, ...specification-n>;

...more SAS statements...

END;

It executes a block of code between the DO and END statements repeatedly, controlled by the value of an index variable. Given that angle brackets (< and >) denote “optional”, notice how index-variable requires at least one specification (specification-1) yet allows for multiple additional optional specifications (<, ...specification-n>) separated by commas.

Now, let’s look into the DO statement’s index-variable specifications.

Index-variable specification


Each specification denotes an expression, or a series of expressions as follows:

start-expression <TO stop-expression> <BY increment-expression> <WHILE (expression) | UNTIL (expression)>

Note that only start-expression is required here whereas <TO stop-expression>, <BY increment-expression>, and <WHILE (expression) or UNTIL (expression)> are optional.

Start-expression may be of either Numeric or Character type, while stop-expression and increment-expression may only be Numeric and can only accompany a Numeric start-expression.

Expressions in <WHILE (expression) | UNTIL (expression)> are Boolean Numeric expressions (numeric value other than 0 or missing is TRUE and a value of 0 or missing is FALSE).
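To make the optional clauses concrete, here is a minimal sketch (the variable name and the limits are illustrative choices, not taken from the SAS documentation). The first loop combines TO, BY, and WHILE in a single specification; the second one uses UNTIL:

```sas
data _null_;
   /* start=1, stop=10, increment=2; WHILE is checked before each iteration */
   do i = 1 to 10 by 2 while (i < 8);
      put i=;          /* writes i=1, i=3, i=5, i=7 to the log */
   end;

   /* UNTIL is checked after each iteration, so the body runs for i=1, 2, 3 */
   do i = 1 to 10 until (i >= 3);
      put i=;
   end;
run;
```

In both loops the TO value of 10 is never reached because the WHILE or UNTIL clause ends the iteration first.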

Other iterative DO statements

For comparison, here is a brief description of the other two forms of iterative DO statement:

  • The DO UNTIL statement executes statements in a DO loop repetitively until a condition is true, checking the condition after each iteration of the DO loop. In other words, if the condition is true at the end of the current loop it will not iterate anymore, and processing continues with the next statement after END. Otherwise, it will iterate.
  • The DO WHILE statement executes statements in a DO loop repetitively while a condition is true, checking the condition before each iteration of the DO loop. That is if the condition is true at the beginning of the current loop it will iterate, otherwise it will not, and processing continues with the next statement after the END.
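The practical difference between the two forms shows up when the condition is already decided before the first iteration. The following sketch (with an arbitrary counter n) illustrates that a DO WHILE body may never run, whereas a DO UNTIL body always runs at least once:

```sas
data _null_;
   n = 5;
   do while (n < 5);    /* condition is false up front: body never runs */
      put 'WHILE body: ' n=;
      n = n + 1;
   end;

   n = 5;
   do until (n >= 5);   /* condition is checked at the bottom: body runs once */
      put 'UNTIL body: ' n=;
      n = n + 1;
   end;
run;
```

Only the "UNTIL body" line appears in the log.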

Looping over a list of index variable values/expressions

DO loops can iterate over a list of index variable values. For example, the following DO-loop will iterate its index variable values over a list of 7, 13, 5, 1 in the order they are specified:

data A; 
   do i=7, 13, 5, 1;
      put i=;
      output;
   end;
run;

This is not yet another form of iterative DO loop as it is fully covered by the iterative DO statement with index variable definition. In this case, the first value (7) is the required start expression of the required first specification, and all subsequent values (13, 5 and 1) are required start expressions of the additional optional specifications.

Similarly, the following example illustrates looping over a list of index variable character values:

data A1;
   length j $4;
   do j='a', 'bcd', 'efgh', 'xyz';
      put j=;
      output;
   end;
run;

Since DO loop specifications denote expressions (values are just instances or subsets of expressions), we can expand our example to a list of actual expressions:

data B;
   p = constant('pi');
   do i=round(sin(p)), sin(p/2), sin(p/3);
      put i=;
      output;
   end;
run;

In this code, the DO-loop iterates its index variable over a list of values defined by the following expressions: round(sin(p)), sin(p/2), sin(p/3).

Infinite loops

Since <TO stop> is optional for the index-variable specification, the following code is perfectly syntactically correct:

data C;
   do j=1 by 1;
      output;
   end;
run;

It will result in an infinite (endless) loop in which the resulting data set keeps growing indefinitely.

While unintentional infinite looping is considered to be a bug and programmers’ anathema, sometimes it may be used intentionally. For example, to find out what happens when data set size reaches the disk space capacity… Or instead of supplying a “big enough” hard-coded number (which is not a good programming practice) for the loop’s TO expression, we may want to define an infinite DO-loop and take care of its termination and exit inside the loop. For example, you can use IF exit-condition THEN LEAVE; or IF exit-condition THEN STOP; construct.

The LEAVE statement immediately stops processing the current DO-loop and resumes with the next statement after its END.

The STOP statement immediately stops execution of the current DATA step, and SAS resumes processing statements after the end of the current DATA step.
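The contrast between the two statements can be sketched in one small DATA step (the threshold of 3 and the message texts are made up for illustration):

```sas
data _null_;
   do k = 1 by 1;                 /* no TO: syntactically endless */
      if k > 3 then leave;        /* LEAVE exits only this DO-loop */
      put 'inside loop: ' k=;
   end;
   put 'after LEAVE the step continues: ' k=;

   stop;                          /* STOP ends the whole DATA step here */
   put 'this line is never written';
run;
```

After LEAVE, execution continues with the statement that follows END; after STOP, nothing else in the DATA step is executed.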

The exit-condition may be unrelated to the index-variable and may instead be based on the occurrence of some event. For instance, the following code runs a syntactically “infinite” loop, but the IF-THEN-LEAVE statement limits it to 200 seconds:

data D;
   start = datetime();
   do k=1 by 1;
      if datetime()-start gt 200 then leave;
      /* ... some processing ...*/
      output; 
   end;
run;

You can also create an endless loop using a DO UNTIL(0); or DO WHILE(1); statement, but again you would need to take care of its termination inside the loop.
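As a minimal sketch of that pattern (the accumulator and the limit of 50 are arbitrary), a DO WHILE(1) loop can be terminated from the inside with LEAVE:

```sas
data _null_;
   total = 0;
   do while (1);                    /* condition is always true */
      total = total + 10;
      if total >= 50 then leave;    /* the only way out of the loop */
   end;
   put total=;                      /* writes total=50 to the log */
run;
```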

Changing “TO stop” within DO-loop will not affect the number of iterations

If you think you can break out of your DO loop prematurely by adjusting TO stop expression value from within the loop, you may want to run the following code snippet to prove to yourself it’s not going to happen:

data E;
   n = 4;
   do i=1 to n;
      if i eq 2 then n = 2;
      put i=;
      output;
   end;
run;

This code will execute the DO-loop 4 times even though you change the value of n from 4 to 2 within the loop.

According to the iterative DO statement documentation, any changes to stop made within the DO group do not affect the number of iterations. Instead, to stop the iteration of a DO-loop before the index variable surpasses stop, change the value of the index variable so that it becomes equal to the value of stop, or use the LEAVE statement to jump out of the loop. The following two examples do just that:

data F;
   do i=1 to 4;
      put i=;
      if i eq 2 then i = 4;
      output;
   end;
run;
 
data G;
   do i=1 to 4;
      put i=;
      if i eq 2 then leave;
      output;
   end;
run;

Know thy DO-loop specifications

Here is a little attention/comprehension test for you.

How many times will the following DO-loop iterate?

data H;
   do i=1, 7, 3, 6, 2 until (i>3);
      put i=;
      output;
   end;
run;

If your answer is 2, you need to re-read the whole post from the beginning (I am only partly joking here).

You may easily find out the correct answer by running this code snippet in SAS. If you are surprised by the result, just take a closer look at the DO statement: there are 5 specifications for the index variable here (separated by commas) whereas UNTIL (expression) belongs to the last specification where i=2. Thus, UNTIL only applies to a single value of i=2 (not to any previous specifications of i =1,7,3,6); therefore, it has no effect as it is evaluated at the end of each iteration.
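If the intent really was to stop as soon as i exceeds 3, one option is to test the condition inside the loop, where it is evaluated for every value in the list. This is a sketch of that alternative (the data set name H2 is arbitrary):

```sas
data H2;
   do i=1, 7, 3, 6, 2;
      put i=;
      output;
      if i > 3 then leave;   /* checked after every value, not only the last one */
   end;
run;
```

This version writes i=1 and i=7 to the log and then leaves the loop, so H2 contains two observations.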

Now consider the following DO-loop definition:

data Z;
   pi = constant('pi');
   do x=3 while(x>pi), 10 to 1 by -pi*3, 20, 30 to 35 until(pi);
      put x=;
      output;
   end;
run;

I hope after reading this blog post you can easily identify the index variable list of values the DO-loop will iterate over. Feel free to share your solution and explanation in the comments section below.

Questions? Thoughts? Comments?

Do you find this post useful? Do you have questions, other secrets, tips or tricks about the DO loop? Please share with us below.

Little known secrets of DO-loops with index variables was published on SAS Users.

June 21, 2021
 

A SAS programmer noticed that his SAS output was not displaying multiple blanks in his strings. He had some strings with leading blanks, others with trailing blanks, and others with multiple blanks in the middle. Yet, every time he used SAS to print the strings to the HTML destination, something mysterious happened. The leading and trailing blanks vanished and the multiple blanks in the middle of strings were replaced by a single blank.

No, this isn't a bug in SAS, it is a feature of the HTML renderer. The HTML renderer intentionally "eats" blanks and other whitespace. HTML is a popular destination for SAS output, but be aware the strings you see in an HTML table might not accurately show the blank characters in the underlying text.

This article demonstrates the issue, then shows how a programmer can discover the location of blanks in SAS character values.

HTML compresses multiple blanks

The following DATA step creates strings that have multiple blanks in the middle of a string, at the beginning of a string, and at the end of a string:
data BlankTest;
length str $ 30;
BlankLoc = 'Middle  '; str = '  string   with      blanks   ';  
output;
BlankLoc = 'Leading '; str = '          string   with blanks';
output;
BlankLoc = 'Trailing'; str = 'string with    blanks         ';
output;
run;
 
/* note that the multiple blanks do not appear when you display them in HTML */
ods HTML;
proc print data=BlankTest; run;

The output shows why the SAS programmer was confused: the strings all look the same! Although he had explicitly constructed strings that contained multiple blank characters, his PROC PRINT output (in the HTML destination) did not show the multiple blanks. It is well-known that character values in SAS tables are left-aligned for most destinations, so it is not a surprise that the text strings are flush left. What might be surprising is that the multiple blanks in the middle of the strings have been compressed into a single blank character. It is not SAS that did it: The blanks are still there but the HTML renderer has compressed them.

Viewing the location of blanks in a SAS string

You can use a trick to visualize the location of blank characters in a SAS string. The trick is to use the TRANSLATE function in SAS to replace blanks with a visible character. The following DATA step view replaces each blank with the asterisk ('*') character:

/* you can replace blanks with another character to see that they are there */
data Substitute / view=Substitute;
set BlankTest;
str = translate(str, '*', ' ');
run;
 
proc print data=Substitute; run;

Now the output shows where blanks occur in each string.
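If you prefer numbers to a visual check, you can also count the blanks directly. The following sketch is an addition to the TRANSLATE trick, not part of it; the data set name BlankCount and the variables nLeading and nTrailing are made up for illustration. It relies on the VERIFY, VLENGTH, and LENGTH functions:

```sas
data BlankCount;
   set BlankTest;
   /* VERIFY returns the position of the first character that is not a blank.
      For a completely blank string it returns 0, so nLeading would be -1. */
   nLeading  = verify(str, ' ') - 1;
   /* VLENGTH is the declared length of the variable;
      LENGTH is the position of the last nonblank character */
   nTrailing = vlength(str) - length(str);
run;

proc print data=BlankCount;
   var BlankLoc nLeading nTrailing;
run;
```

Because the new variables are numeric, the counts display correctly in any ODS destination, including HTML.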

Other ODS destinations

Other ODS destinations do not compress multiple blanks, so an alternative is to use a non-HTML destination. For example, here is the output in the RTF destination:

ods RTF;
proc print data=BlankTest; run;
ods RTF close;

Be aware that the font determines how well you can see the extra spaces. Most fonts are proportional-width fonts, which means that a blank character is relatively thin compared to the width of other characters (such as 'W'). The blank characters are most visible when you use a fixed-width font (also called a monospace font), such as Courier.

The SAS LISTING destination

As mentioned previously, SAS left-aligns text in most modern ODS destinations. However, the ancient SAS LISTING destination uses a monospace font and does not left-align the text. This enables you to see the location of all non-trailing blanks:

ods listing;
proc print data=BlankTest; run;
ods listing close;

Summary

A SAS programmer noticed that his SAS output was ignoring multiple blanks in strings. This is not because of SAS; it is a feature of the HTML renderer. You can see the exact location of all blanks by using the TRANSLATE function to convert blanks to a visible character. Alternatively, if you don't mind leading and trailing blanks being stripped, you can send your output to a non-HTML destination such as PDF or RTF. Lastly, you can use the venerable SAS LISTING destination to display strings with leading blanks.

The post The case of the missing blanks: Why SAS output might not show multiple blanks in strings appeared first on The DO Loop.

June 18, 2021
 

SAS Global Certification is pleased to announce two (yes two!) new SAS Viya programming certifications. Traditional SAS programmers need to migrate their code and data to SAS Viya environments, and there are important skills required to make a successful transition. SAS Viya includes Cloud Analytic Services (CAS), where data is stored in-memory and programs execute in parallel on distributed nodes. To leverage CAS, you may need to update your DATA step and SAS SQL code. Also, there are new techniques to manage data stored in CAS. Once you have mastered these skills, you can take an exam and add the credential to your resume, CV, and LinkedIn profile!

Behind the Scenes

A cross-functional team here at SAS has worked since the beginning of 2021 to design, write, and validate these new credentials. Drawing from their experience working with SAS customers, this diverse team from Professional Services, Technical Support, Documentation, and Education overcame the challenges of distance and time zones (and virtual meetings!) to create these credentials. We appreciate the work that this team accomplished and we had a great time working with you!

SAS Viya Programming Associate

The first credential, the SAS Certified Associate: Fundamentals of Programming Using SAS Viya is designed for SAS programmers who simply need to migrate their code and data to SAS Viya. They plan to work with CAS tables while still using traditional SAS programming techniques. One helpful tool is the CASUTIL procedure, which provides a familiar SAS procedure framework to help SAS programmers manage data in CAS. Also, we cover important skills to update DATA step code and convert PROC SQL code to PROC FEDSQL to execute in CAS. To learn these skills and prepare for the exam, the recommended training is the one-day Programming for SAS Viya course.

SAS Viya Programming Specialist

The second credential, the SAS Certified Specialist: Intermediate Programming Using SAS Viya, builds on the skills introduced in the Associate credential, adding assessment of CAS Language (CASL) programming and CAS actions. With CASL programming skills, you can more precisely control accessing and processing of CAS data. There is also extensive coverage of CAS actions, which are the smallest units of work for the CAS server. CAS actions can load data, transform data, compute statistics, perform analytics, and create output. In addition to the Programming for SAS Viya course, the second recommended training class is the new three-day High-Performance Data Processing with CASL in SAS Viya.

Next Steps

You can learn more about these new credentials at the links above. There you will find detailed exam content guides, free sample questions, and free practice exams (yes - free full length practice exams!). If you have questions, please ask in the comments below. I'll start with a few easy ones:

  • Are there pre-requisites for these credentials? No. While they assume good understanding of traditional programming in a SAS 9 environment, these credentials do not require a prior SAS certification. You also can take the SAS Viya Programming Specialist without taking the SAS Viya Programming Associate.
  • Are these exams performance-based? No, these exams are not performance-based but are rather traditional format exams with multiple choice and short-answer fill-in-the-blank questions.
  • What exactly is on the exam? For detailed exam topics, hit the "Exam Content Guide" links on each credential's web page. The content guides contain exam sections, objectives, and expanded detail about the important topics.
  • What are the exam numbers? The Associate exam is A00-415 and the Specialist exam is A00-420.

We hope you tackle the recommended training, develop your skills, and register soon to be the first candidates to earn these credentials. Good luck with your preparations!

Two New SAS® Viya® Programming Certifications was published on SAS Users.

May 19, 2021
 

A SAS programmer noticed that there is not a built-in function in the SAS DATA step that computes the product for each row across a specified set of variables. There are built-in functions for various statistics such as the SUM, MAX, MIN, MEAN, and MEDIAN functions. But no DATA step function for the product.

This article discusses various ways to compute products in SAS. The article shows how to use PROC FCMP to implement a PRODUCT function that can be called from the DATA step.

Wide and long data

If you have numbers in a rectangular data set, you might want to compute a product in either of two directions: down a column or across a row. The SAS programmer was interested in computing across a row, but, for completeness, I will demonstrate both cases.

Let's generate some sample data on which to compute the product. For fun, the data I will use is the first 1,000 terms of the Wallis product, which is an infinite sequence of numbers whose product converges to π/2 ≈ 1.5707963. The product of the first 1,000 terms of the Wallis sequence is 1.5704039. The following two DATA steps generate the Wallis terms. The WallisLong data set contains one variable and 1,000 rows. The WallisWide data set contains 1,000 variables and one row. The names of the variables are term1, term2, ..., term1000.

/* Create sample data. See https://blogs.sas.com/content/iml/2021/03/10/pi-and-products.html */
%let N = 1000;
data WallisLong(drop=n);
do n = 1 to &N;
   term = (2*n/(2*n-1)) * (2*n/(2*n+1));
   output;                     
end;
run;
 
data WallisWide(drop=n);
array term[&N];
do n = 1 to &N;
   term[n] = (2*n/(2*n-1)) * (2*n/(2*n+1));
end;
output;                     
run;

The PROD function in SAS/IML

Before computing the products in the SAS DATA step, I want to mention that SAS/IML software supports a built-in function that can compute the product of the nonmissing elements of its arguments. You can use the PROD function in PROC IML to read the data and compute the products, as follows:

/* the easy way: SAS/IML */
proc iml;
use WallisLong;  read all var "term" into TermLong;  close;
use WallisWide;  read all var _NUM_ into TermWide;   close;
 
prodLong = prod(TermLong);  /* compute product of 1000 x 1 column */
prodWide = prod(TermWide);  /* compute product of 1 x 1000 row */
print prodLong prodWide;
QUIT;

Using a DO loop and an array to compute a product

You can write a program to compute a product in the DATA step. If there are no missing values in the data, the program is straightforward. However, it is best to write programs that can handle missing values in the data. To detect missing values, you can use the MISSING function. The following program computes the product down the rows and excludes any missing values. If all values are missing, then the product is a missing value.

/* DATA step method to compute the product down the rows */
data _null_;
retain Prod 1 n 0;  /* n = number of nonmissing */
set WallisLong end=EOF;
if ^missing(term) then do;
   n + 1;
   Prod = Prod * term;
end;
if EOF then do;
   if n=0 then Prod=.;
   put n= / prod=;
end;
run;
n=1000
Prod=1.570403873
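When every term is strictly positive, as the Wallis terms are, a commonly used alternative for the "down the rows" case is to sum the logarithms and exponentiate the result. This is a different technique from the DATA step above and is sketched here only for comparison; it is not valid for zero or negative values:

```sas
/* product as exp(sum(log(x))); assumes all nonmissing terms are > 0 */
proc sql;
   select exp(sum(log(term))) as Prod format=12.9
   from WallisLong;
quit;
```

The SUM summary function skips missing values, so missing terms are excluded in the same spirit as in the DATA step. Because the computation goes through logarithms, the result can differ from the direct product in the last few digits.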

The DATA step program to compute the product across multiple columns is similar. It uses an ARRAY statement to loop over the variables term1-term1000. Again, the product omits missing values.

/* DATA step method to compute the product across columns */
data _null_;
array term[&N];    /* implicit assumption: Variables are named term1-term1000 */
set WallisWide;
Prod = 1;
n = 0;  /* n = number of nonmissing */
do i = 1 to dim(term);
   if ^missing(term[i]) then do;
      n + 1;
      Prod = Prod * term[i];
   end;
end;
if n=0 then 
   Prod = .;
put n= / Prod=;
run;

For this example, the variables are named term1-term1000, so it is easy to read the variables into the array. For more complicated examples, you can read about ways to specify a list of variable names in SAS.
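For instance, one way to avoid hard-coding the number of variables is a name-prefix list. The sketch below assumes that every variable whose name starts with "term" belongs in the product. The array name t is arbitrary, and the all-missing check is omitted for brevity:

```sas
data _null_;
   set WallisWide;
   /* term: matches every variable already read whose name begins with "term",
      so the ARRAY statement must come after the SET statement */
   array t[*] term:;
   Prod = 1;
   do i = 1 to dim(t);
      if ^missing(t[i]) then Prod = Prod * t[i];
   end;
   put Prod=;
run;
```

The [*] dimension lets SAS count the variables, so the same step works unchanged if the number of term variables changes.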

Define the PRODUCT function in PROC FCMP

Although the program in the previous section shows how to compute the product for each row across multiple variables, it is more useful to encapsulate that logic into a function that you can call from the DATA step. You can use PROC FCMP to define such a function. The following call to PROC FCMP defines the PRODUCT function. The input argument is an array:

proc fcmp outlib=work.funcs.MathFuncs;
function Product(x[*]);
   prod = 1; n=0; /* n = number of nonmissing */
   do i = 1 to dim(x);
      if ^missing(x[i]) then do;
         n + 1;
         prod = prod * x[i];
      end;
   end;
   if n=0 then 
      prod = .;
   return( prod );
endsub;
quit;
 
/* call user-defined PRODUCT function for an array */
options cmplib=(work.funcs);    /* tell DATA step where to search for user-defined functions */
 
data _null_;
array vars[&N] term1-term&N;    /* array of all variables that will be in the product */
set WallisWide;
Prod = Product(vars);           /* send that array to the user-defined function */
put Prod=;
run;
Prod=1.570403873

Handling missing values

Whenever you write a user-defined function, you should practice defensive programming and assume that someday your function will be called on data that have missing values. For completeness, the following program creates data that contains missing values. The next program shows an alternate way to define the array if the variables do not have a common prefix and a numerical suffix:

data Have;
input x1 x2 y qqq ABC;
datalines;
2 3 10 0.1 4
2 2 -1 0.5 3
2 . -1 0.5 3
. .  . .   .
;
 
data Want;
set Have;
array vars[5] ABC qqq x2 x1 y;   /* list variable names in any order */
Prod = Product(vars);
run;
 
proc print; run;

Summary

This article shows how to implement a user-defined function that computes the product of multiple variables for each row in a data set. The function is defined by using PROC FCMP and is callable from any SAS DATA step. Although you might not need to use this particular function, it provides a good example of how to create a user-defined function in SAS. It also shows how to pass an array to an FCMP function.

The post Implement a product function in SAS appeared first on The DO Loop.

May 11, 2021
 

It’s safe to say that SAS Global Forum is a conference designed for users, by users. As your conference chair, I am excited by this year’s top-notch user sessions. More than 150 sessions are available, many by SAS users just like you. Wherever you work or whatever you do, you’ll find sessions relevant to your industry or job role. New to SAS? Been using SAS forever and want to learn something new? Managing SAS users? We have you covered. Search for sessions by industry or topic, then add those sessions to your agenda and personal calendar.

Creating a customizable agenda and experience

Besides two full days of amazing sessions, networking opportunities and more, many user sessions will be available on the SAS Users YouTube channel on May 20, 2021 at 10:00am ET. After you register, build your agenda and attend the sessions that most interest you when the conference begins. Once you’ve viewed a session, you can chat with the presenter. Don’t know where to start? Sample agendas are available in the Help Desk.

For the first time, proceedings will live on SAS Support Communities. Presenters have been busy adding their papers to the community. Everything is there, including full paper content, video presentations, and code on GitHub. It all premieres on “Day 3” of the conference, May 20. Have a question about the paper or code? You’ll be able to post a question on the community and ask the presenter.

Want training or help with your code?

Code Doctors are back this year. Check out the agenda for the specific times they’re available and make your appointment, so you’ll be sure to catch them and get their diagnosis of code errors. If you’re looking for training, you’ll be quite happy. Training is also back this year and it’s free! SAS instructor-led demos will be available on May 20, along with the user presentations on the SAS Users YouTube channel.

Chat with attendees and SAS

It is hard to replicate the buzz of a live conference, but we’ve tried our best to make you feel like you’re walking the conference floor. And we know networking is always an important component to any conference. We’ve made it possible for you to network with colleagues and SAS employees. Simply make your profile visible (by clicking on your photo) to connect with others, and you can schedule a meeting right from the attendee page. That’s almost easier than tracking down someone during the in-person event.

We know the exhibit hall is also a big draw for many attendees. This year’s Innovation Hub (formerly known as The Quad) has industry-focused booths and technology booths, where you can interact in real-time with SAS experts. There will also be a SAS Lounge where you can learn more about various SAS services and platforms such as SAS Support Communities and SAS Analytics Explorers.

Get started now

I’ve highlighted a lot in this blog post, but I encourage you to view this 7-minute Innovation Hub video. It goes in depth on the Hub and all its features.

This year there is no reason not to register for SAS Global Forum…and attend as few or as many sessions as you want. Why? Because the conference is FREE!

Where else can you get such quality SAS content and learning opportunities? Nowhere, which is why I encourage you to register today. See you soon!

SAS Global Forum: Your experience, your way was published on SAS Users.

May 3, 2021
 

A previous article discusses the definition of the Hoeffding D statistic and how to compute it in SAS. The letter D stands for "dependence." Unlike the Pearson correlation, which measures linear relationships, the Hoeffding D statistic tests whether two random variables are independent. Dependent variables have a Hoeffding D statistic that is greater than 0. In this way, the Hoeffding D statistic is similar to the distance correlation, which is another statistic that can assess dependence versus independence.

This article shows a series of examples where Hoeffding's D is compared with the Pearson correlation. You can use these examples to build intuition about how the D statistic performs on real and simulated data.

Hoeffding's D and duplicate values

As a reminder, Hoeffding's D statistic is affected by duplicate values in the data. If a vector has duplicate values, the Hoeffding association of the vector with itself will be less than 1. A vector that has many duplicate values (few distinct values) has an association with itself that might be much less than 1.

For many examples, the Hoeffding D association between two variables is between 0 and 1. However, occasionally you might see a negative value for Hoeffding's D, especially if there are many duplicate values in the data.

Pearson correlation versus Hoeffding's D on real data

Let's compare the Pearson correlation and the Hoeffding D statistic on some real data. The following call to PROC SGSCATTER creates bivariate scatter plots for six variables in the Sashelp.Cars data set:

%let VarList = MSRP Invoice EngineSize Horsepower MPG_City MPG_Highway;
proc sgscatter data=sashelp.cars;
   matrix &VarList;
run;

The graph shows relationships between pairs of variables. Some pairs (MSRP and Invoice) are highly linearly related. Other pairs (MPG_City versus other variables) appear to be related in a nonlinear manner. Some pairs are positively correlated whereas others are negatively correlated.

Let's use PROC CORR in SAS to compare the matrix of Pearson correlations and the matrix of Hoeffding's D statistics for these pairs of variables. The NOMISS option excludes any observations that have a missing value in any of the specified variables:

proc corr data=Sashelp.cars PEARSON HOEFFDING noprob nomiss nosimple;
   label EngineSize= MPG_City= MPG_Highway=; /* suppress labels */
   var &VarList;
run;

There are a few noteworthy differences between the tables:

  • Diagonal elements: For these variables, there are 428 complete cases. The Invoice variable has 425 unique values (only three duplicates) whereas the MPG_City and MPG_Highway variables have only 28 and 33 unique values, respectively. Accordingly, the diagonal elements are closest to 1 for the variables (such as MSRP and Invoice) that have few duplicate values and are smaller for variables that have many duplicate values.
  • Negative correlations: A nice feature of the Pearson correlation is that it reveals positive and negative relationships. Notice the negative correlations between MPG_City and other variables. In contrast, the Hoeffding association assesses dependence/independence. The association between MPG_City and the other variables is small, but the table of Hoeffding statistics does not give information about the direction of the association.
  • Magnitudes: The off-diagonal Hoeffding D statistics are mostly small values between 0.15 and 0.35. In contrast, the same cells for the Pearson correlation are between -0.71 and 0.83. As shown in a subsequent section, the Hoeffding statistic has a narrower range than the Pearson correlation does.
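If you want to check the number of distinct values that drive the diagonal elements, a short query such as the following sketch will count them (the column aliases are arbitrary). Note that it counts distinct values over all observations in the data set, not only over the complete cases that PROC CORR uses with the NOMISS option:

```sas
proc sql;
   select count(distinct MSRP)        as nMSRP,
          count(distinct Invoice)     as nInvoice,
          count(distinct MPG_City)    as nMPG_City,
          count(distinct MPG_Highway) as nMPG_Highway
   from Sashelp.cars;
quit;
```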

Association of exact relationships

A classic example in probability theory shows that correlation and dependence are different concepts. If X is a random variable whose distribution is symmetric about 0 and Y = X², then X and Y are not independent, even though their Pearson correlation is 0. The following example shows that the Hoeffding statistic is nonzero, which indicates dependence between X and Y:

data Example;
do x = -1 to 1 by 0.05;          /* X is in [-1, 1] */
   yLinear = 3*x + 2;            /* Y1 = 3*X + 2    */
   yQuad   = x**2 + 1;           /* Y2 = X**2 + 1   */
   output;
end;
run;
 
proc corr data=Example PEARSON HOEFFDING nosimple noprob;
   var x YLinear yQuad;
run;

Both statistics (Pearson correlation and Hoeffding's D) have the value 1 for the linear dependence between X and YLinear. The Pearson correlation between X and YQuad is 0, whereas the Hoeffding D statistic is nonzero. This agrees with theory: the two random variables are dependent, but their Pearson correlation is 0.

Statistics for correlated bivariate normal data

A previous article about distance correlation shows how to simulate data from a bivariate normal distribution with a specified correlation. Let's repeat that numerical experiment, but this time compare the Pearson correlation and the Hoeffding D statistic. To eliminate some of the random variation in the statistics, let's repeat the experiment 10 times and plot the Monte Carlo average of each experiment. That is, the following SAS/IML program simulates bivariate normal data for several choices of the Pearson correlation. For each simulated data set, the program computes both the Pearson correlation and Hoeffding's D statistic. This is repeated 10 times, and the average of the statistics is plotted:

proc iml;
call randseed(54321);
 
/* helper functions */
start PearsonCorr(x,y);
   return( corr(x||y)[1,2] );
finish;
start HoeffCorr(x,y);
   return( corr(x||y, "Hoeffding")[1,2] );
finish;
 
/* grid of correlations */
rho = {-0.99 -0.975 -0.95} || do(-0.9, 0.9, 0.1) || {0.95 0.975 0.99};  
N = 500;                   /* sample size */
mu = {0 0};  Sigma = I(2); /* parameters for bivariate normal distrib */
PCor = j(1, ncol(rho), .); /* allocate vectors for results */
HoeffD = j(1, ncol(rho), .);
 
/* generate BivariateNormal(rho) data */
numSamples = 10;           /* how many random samples in each study? */
do i = 1 to ncol(rho);     /* for each rho, simulate bivariate normal data */
   Sigma[1,2] = rho[i]; Sigma[2,1] = rho[i]; /* population covariance */
   meanP=0; meanHoeff=0;   /* Monte Carlo average in the study */
   do k = 1 to numSamples;
      Z = RandNormal(N, mu, Sigma);        /* simulate bivariate normal sample */
      meanP = meanP + PearsonCorr(Z[,1], Z[,2]);        /* Pearson correlation */
      meanHoeff = meanHoeff + HoeffCorr(Z[,1], Z[,2]);  /* Hoeffding D */
   end;
   PCor[i] = MeanP /numSamples;             /* MC mean of Pearson correlation */
   HoeffD[i] = meanHoeff / numSamples;      /* MC mean of Hoeffding D */
end;
 
create MVNorm var {"rho" "PCor" "HoeffD"}; 
append; close;
QUIT;
 
title "Average Correlations/Associations of Bivariate Normal Data";
title2 "N = 500";
proc sgplot data=MVNorm;
   label rho="Pearson Correlation of Population"
         PCor="Pearson Correlation"
         HoeffD="Hoeffding's D";
   series x=rho y=PCor / markers name="P";
   series x=rho y=HoeffD / markers name="D";
   lineparm x=0 y=0 slope=1;
   yaxis grid label="Association or Correlation";
   xaxis grid;
   keylegend "P" "D";
run;

For these bivariate normal samples (of size 500), the Monte Carlo means for the Pearson correlations are very close to the diagonal line, which represents the expected value of the correlation. For the same data, the Hoeffding D statistics are different. For correlations in [-0.3, 0.3], the mean of the D statistic is positive and is very close to 0. Nevertheless, the p-values for the Hoeffding test of independence account for the shape of the curve. For bivariate normal data that has a small correlation (say, 0.1), the Pearson test for zero correlation and the Hoeffding test for independence often both accept or both reject their null hypotheses.
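
The earlier PROC CORR call does not display the p-values because it uses the NOPROB option. The following is a minimal sketch of how you might compare the two tests on a single simulated sample. The data set name (BivNorm), the seed, and the DATA step simulation are illustrative choices, not part of the program above:

```sas
/* Sketch: simulate one bivariate normal sample with correlation rho = 0.1.
   The name BivNorm and the seed are arbitrary. */
data BivNorm;
call streaminit(54321);
rho = 0.1;
do i = 1 to 500;
   x = rand("Normal");
   /* y is standard normal and has population correlation rho with x */
   y = rho*x + sqrt(1 - rho**2)*rand("Normal");
   output;
end;
keep x y;
run;

/* omit NOPROB so that PROC CORR displays the p-values for both tests */
proc corr data=BivNorm pearson hoeffding nosimple;
   var x y;
run;
```

Because the sample is random, the two tests might or might not agree for any particular seed.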

Summary

Hoeffding's D statistic provides a test for independence, which is different than a test for correlation. In SAS, you can compute the Hoeffding D statistic by using the HOEFFDING option on the PROC CORR statement. You can also compute it in SAS/IML by using the CORR function. Hoeffding's D statistic can detect nonlinear dependencies between variables. It does not, however, indicate the direction of dependence.

I doubt Hoeffding's D statistic will ever be as popular as Pearson's correlation, but you can use it to detect nonlinear dependencies between pairs of variables.

The post Examples of using the Hoeffding D statistic appeared first on The DO Loop.

April 26, 2021
 

SAS/IML programmers often create and call user-defined modules. Recall that a module is a user-defined subroutine or function. A function returns a value; a subroutine can change one or more of its input arguments. I have written a complete guide to understanding SAS/IML modules, which contains many tips for working with SAS/IML modules. Among the tips are two that can trip up programmers who are new to the SAS/IML language:

  • If you modify a matrix argument in a module, the matrix is also changed in the calling environment.
  • If you pass an expression such as A[1,], SAS/IML creates a temporary matrix.

In the years since I wrote those articles, SAS/IML introduced lists, which are a popular way to pass many arguments to a user-defined module. Lists and matrices behave similarly when passed to a module:

  • If you modify a list in a module, the list is also changed in the calling environment.
  • If you specify an expression such as [A, B], SAS/IML creates a temporary list.

This article shows a subroutine that takes a list and modifies the items in the list. It also looks at what happens if you pass in a temporary list.

Review: Passing a matrix to a module

Before discussing lists, let's review the rules for passing a matrix to a user-defined module. The following user-defined subroutine doubles the elements for its matrix argument:

proc iml;
/* Double the values in X. The matrix X is changed in the calling environment. */
start Double(X);
   X = 2*X;     
finish;
 
r1 = 1:3;
r2 = 4:6;
A = r1 // r2;          /* each row of A is a copy. r1 and r2 do not share memory with A. */
run Double(A);
print A;

As shown by the output, the elements of A are changed after passing A to the Double module. Notice also that although A was constructed from vectors r1 and r2, those vectors are not changed. They were used to initialize the values of A, but they do not share any memory with A.

Now suppose you want to double ONLY the first row of A. The following attempt does NOT double the first row:

A = r1 // r2;          /* each row of A is a copy */
run Double( A[1,] );   /* NOTE: create temporary vector, which is changed but vanishes */
print A;

The result is not shown because the values of A are unchanged. The matrix is not changed because it was not sent into the module. Instead, a temporary matrix (A[1,]) was passed to the Double module. The values of the temporary matrix were changed, but that temporary matrix vanishes, so there is no way to access those modified values. (See the previous article for the correct way to double the first row.)
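
One way to double only the first row is to copy the row into a named vector, pass that vector to the module, and then write the result back. This is a sketch of that idea and is not necessarily the approach in the previous article. The name row1 is an illustrative choice:

```sas
proc iml;
/* Double the values in X. The matrix X is changed in the calling environment. */
start Double(X);
   X = 2*X;
finish;

A = {1 2 3, 4 5 6};
row1 = A[1,];        /* copy the first row into a named ("permanent") vector */
run Double(row1);    /* row1 is changed in the calling environment */
A[1,] = row1;        /* write the doubled values back into A */
print A;             /* first row is {2 4 6}; second row is unchanged */
quit;
```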

Passing lists: The same rules apply

These same rules apply to lists, as I demonstrate in the next example. The following user-defined subroutine takes a list as an argument. It doubles every (numerical) item in the list. This modification affects the list in the calling environment.

/* Double the values of every numerical matrix in a list, Lst. 
   Lst will be changed in the calling environment. */
start DoubleListItems(Lst);
   if type(Lst)^='L' then stop "ERROR: Argument is not a list";
   nItems = ListLen(Lst);
   do i = 1 to nItems;
      if type(Lst$i)='N' then 
         Lst$i = 2*Lst$i;     /* the list is changed in the calling environment */
   end;
finish;
 
A = {1 2 3, 4 5 6};
B = -1:1;
L = [A, B];            /* each item in L is a copy */
run DoubleListItems(L);/* the copies are changed */
print (L$1)[label="L$1"], (L$2)[label="L$2"];

As expected, the items in the list were modified by the DoubleListItems subroutine. The list is modified because we created a "permanent" variable (L) and passed it to the module. If you simply pass in the expression [A, B], then SAS/IML creates a temporary list. The module modifies the items in the temporary list, but you cannot access the modified values because the temporary list vanishes:

/* create temporary list. Items in the list are changed but the list vanishes */
run DoubleListItems( [A, B] );  /* no way to access the modified values! */

Summary

This brief note is a reminder that SAS/IML creates temporary variables for expressions like X[1,] or [A, B]. In most cases, programmers do not need to think about this fact. However, it becomes important if you write a user-defined module that modifies one of its arguments. If you pass a temporary variable, the modifications are made to the temporary variable, which promptly vanishes after the call. To prevent surprises, always pass in a "permanent" variable to a module that modifies its arguments.

The post On passing a list to a SAS/IML module appeared first on The DO Loop.

April 20, 2021
 

I can’t believe it’s true, but SAS Global Forum is just over a month away. I have some exciting news to share with you, so let’s start with the theme for this year:

New Day. New Answers. Inspired by Curiosity.

What a fitting theme for this year! Technology continues to evolve, so each new day is a chance to seek new answers to what can sometimes feel like impossible challenges. Our curiosity as humans drives us to seek out better ways to do things. And I hope your curiosity will drive you to register for this year’s SAS Global Forum.

We are excited to offer a global event across three regions. If you’re in the Americas, the conference is May 18-20. In Asia Pacific? Then we’ll see you May 19-20. And we didn’t forget about Europe. Your dates are May 25-26. We hope these region-specific dates and the virtual nature of the conference mean more SAS users than ever will join us for an inspiring event. Curious about the exciting agenda? It’s all on the website, so check it out.

Keynote speakers that you’ll talk about for months to come

Want to be inspired to chase your “impossible” dreams? Or hear more about the future of AI? How about learning about work-life balance and your mental health? We have you covered. SAS executives are gearing up to host an exciting lineup of extremely smart, engaging and thought-provoking keynote speakers like Adam Grant, Ayesha Khanna and Hakeem Oluseyi.

And who knows, we might have a few more surprises up our sleeve. You’ll just have to register and attend to find out.

Papers and proceedings: simplified and easy to find

Have you joined the SAS Global Forum online community? You should, because that’s where you’ll find all the discussion around the conference…before, during and after. It’s also where you’ll find a link to the 2021 proceedings, when they become available. Authors are busy preparing their presentations now and they are hard at work staging their proceedings in the community. Join the community so you can connect with other attendees and know when the proceedings become available.

Stay tuned for even more details

SAS Global Forum is the place where creativity meets curiosity, and amazing analytics happens! I encourage you to regularly check the conference website, as we’re continually adding new sessions and events. You don’t want to miss this year’s conference, so don’t forget to register for SAS Global Forum. See you soon!

Registration is open for a truly inspiring SAS Global Forum 2021 was published on SAS Users.

March 29, 2021
 

A previous article discusses how to interpret regression diagnostic plots that are produced by SAS regression procedures such as PROC REG. In that article, two of the plots indicate influential observations and outliers. Intuitively, an observation is influential if its presence changes the parameter estimates for the regression by "more than it should." Various researchers have developed statistics that give a precise meaning to "more than it should," and the formulas for several of the popular influence statistics are included in the PROC REG documentation.

This article discusses the Cook's D and leverage statistics. Specifically, how can you use the information in the diagnostic plots to identify which observations are influential? I show two techniques for identifying the observations. The first uses a DATA step and a formula to identify influential observations. The second technique uses the ODS OUTPUT statement to extract the same information directly from a regression diagnostic plot.

These are not the only regression diagnostic plots that can identify influential observations:

  • The DFBETAS statistic estimates the effect that deleting each observation has on the estimates for the regression coefficients.
  • The DFFITS statistic is a measure of how the predicted value changes when an observation is deleted. It is closely related to the Cook's D statistic.

The data and model

As in the previous article, let's use a model that does NOT fit the data very well, which makes the diagnostic plots more interesting. The following DATA step adds a quadratic effect to the Sashelp.Thick data and also adds a variable that is used in a subsequent section to merge the data with the Cook's D and leverage statistics.

data Thick2;
set Sashelp.Thick;
North2 = North**2;   /* add quadratic effect */
Observation = _N_;   /* for merging with other data sets */
run;

Graphs and labels

Rather than create the entire panel of diagnostic plots, you can use the PLOTS(ONLY)= option to create only the graphs for Cook's D statistic and for the studentized residuals versus the leverage. In the following call to PROC REG, the LABEL suboption requests that the influential observations be labeled. By default, the labels are the observation numbers. You can use the ID statement in PROC REG to specify a variable to use for the labels.

proc reg data=Thick2  plots(only label) =(CooksD RStudentByLeverage);
   model Thick = North North2 East; /* can also use INFLUENCE option */
run;

The graph of the Cook's D statistic is shown above. The PROC REG documentation states that the horizontal line is the value 4/n, where n is the number of observations in the analysis. For these data, n = 75. According to this statistic, observations 1, 4, 8, 63, and 65 are influential.

The second graph is a plot of the studentized residual versus the leverage statistic. The PROC REG documentation states that the horizontal lines are the values ±2 and the vertical line is at the value 2p/n, where p is the number of parameters, including the intercept. For this model, p = 4. The observations 1, 4, 8, 10, 16, 63, and 65 are shown in this graph as potential outliers or potential high-leverage points.

From the graphs, we know that certain observations are influential. But how can we highlight those influential observations in plots, print them, or otherwise analyze them? And how can we automate the process so that it works for any model on any data set?

Manual computation of influential observations

There are two ways to determine which observations have large residuals or are high-leverage or have a large value for the Cook's D statistic. The traditional way is to use the OUTPUT statement in PROC REG to output the statistics, then identify the observations by using the same cutoff values that are shown in the diagnostic plots. For example, the following DATA step lists the observations whose Cook's D statistic exceeds the cutoff value 4/n ≈ 0.053.

/* manual identification of influential observations */
proc reg data=Thick2  plots=none;
   model Thick = North North2 East; /* can also use INFLUENCE option */
   output out=RegOut predicted=Pred student=RStudent cookd=CookD H=Leverage;
quit;
 
%let p = 4;  /* number of parameters in model, including intercept */
%let n = 75; /* Number of Observations Used */
title "Influential (Cook's D)";
proc print data=RegOut;
   where CookD > 4/&n;
   var Observation East North Thick CookD;
run;

This technique works well. However, it assumes that you can easily write a formula to identify the influential observations. This is true for regression diagnostics. However, you can imagine a more complicated graph for which "special" observations are not so easy to identify. As shown in the next section, you can leverage (pun intended) the fact that SAS already identified the special observations for you.
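
The manual approach can also be extended so that n is not hard-coded and so that the leverage and outlier cutoffs from the second graph are applied as well. For these data, the cutoffs are 4/n = 4/75 ≈ 0.053 and 2p/n = 8/75 ≈ 0.107. The following is a sketch, not a definitive implementation. It assumes the Thick2 data set from the earlier DATA step, and the names RegOut2, Flagged, HighCookD, HighLeverage, and Outlier are illustrative. It requests the RSTUDENT= statistic because the second graph plots the externally studentized residuals:

```sas
/* Sketch: compute the cutoffs from the data instead of hard-coding n */
proc reg data=Thick2 plots=none;
   model Thick = North North2 East;
   output out=RegOut2 cookd=CookD h=Leverage rstudent=RStudent;
quit;

%let p = 4;                        /* number of parameters, including intercept */
proc sql noprint;
   select count(CookD) into :n trimmed   /* obs that have a Cook's D value */
   from RegOut2;
quit;

data Flagged;
set RegOut2;
HighCookD    = (CookD > 4/&n);          /* Cook's D exceeds 4/n       */
HighLeverage = (Leverage > 2*&p/&n);    /* leverage exceeds 2p/n      */
Outlier      = (abs(RStudent) > 2);     /* |RStudent| exceeds 2       */
run;

title "Observations That Exceed at Least One Cutoff";
proc print data=Flagged noobs;
   where HighCookD or HighLeverage or Outlier;
   var Observation CookD Leverage RStudent HighCookD HighLeverage Outlier;
run;
title;
```

Counting the nonmissing values of CookD is a simple proxy for the number of observations used in the analysis; it assumes that the statistic is missing exactly for the observations that were not used.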

Leveraging the power of the ODS OUTPUT statement

Did you know that you can create a data set from any SAS graphic? Many SAS programmers use ODS OUTPUT to save a table to a SAS data set, but the same technique enables you to save the data underlying any ODS graph. There is a big advantage of using ODS OUTPUT to get to the data in a graph: SAS has already done the work to identify and label the important points in the graph. You don't need to know any formulas!

Let's see how this works by extracting the observations whose Cook's D statistic exceeds the cutoff value. First, generate the graphs and use ODS OUTPUT to save the underlying data, as follows:

/* Let PROC REG do the work. Use ODS OUTPUT to capture information */
ods exclude all;
proc reg data=Thick2  plots(only label) =(CooksD RStudentByLeverage);
   model Thick = North North2 East; 
   ods output CooksDPlot=CookOut         /* output the data for the Cook's D graph */
              RStudentByLeverage=RSOut;  /* output the data for the outlier-leverage plot */
quit;
ods exclude none;
 
/* you need to look at the data! */
proc print data=CookOut(obs=12); 
   where Observation > 0;
run;

When you output the data from a graph, you have to look at how the data are structured. Sometimes the data set has strange variable names or extra observations. In this case, the data begins with three extra observations for which the Cook's D statistic is missing. For the curious, fake observations are often used to set axes ranges or to specify the order of groups in a plot.

Notice that the CookOut data set includes a variable named Observation, which you can use to merge the CookOut data and the original data.
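
If you do not know the ODS name of a graph or the names of the variables in its underlying data, you can ask SAS to tell you. The following sketch uses the ODS TRACE statement to write the name of each ODS object to the log, then uses PROC CONTENTS to list the variables in the output data set. The DATA step is repeated only to make the example self-contained:

```sas
data Thick2;
set Sashelp.Thick;
North2 = North**2;   /* add quadratic effect */
Observation = _N_;   /* for merging with other data sets */
run;

ods trace on;        /* write the name of each ODS table and graph to the log */
proc reg data=Thick2 plots(only label)=(CooksD RStudentByLeverage);
   model Thick = North North2 East;
   ods output CooksDPlot=CookOut;
quit;
ods trace off;

/* list the variables in the graph data in creation order */
proc contents data=CookOut varnum;
run;
```

The names that appear in the log are the names that you specify on the ODS OUTPUT statement.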

From the structure of the CookOut data set, you can infer that the influential observations are those for which the CooksDLabel variable is nonmissing (excepting the fake observations at the top of the data). Therefore, the following DATA step merges the output data sets and the original data. In the same DATA step, you can create other useful variables, such as a binary variable that indicates which observations have a large Cook's D statistic:

data All;
merge Thick2 CookOut RSOut;
by Observation;
/* create a variable that indicates whether the obs has a large Cook's D stat */
CooksDInf = (^missing(CooksD) & CooksDLabel^=.);
label CooksDInf = "Influential (Cook's D)";
run;
 
proc print data=All noobs;
   where CooksDInf > 0; 
   var Observation East North Thick;
run;
 
proc sgplot data=All;
   scatter x=East y=North / group=CooksDInf datalabel=CooksDLabel 
                  markerattrs=(symbol=CircleFilled size=10);
run;

The output from PROC PRINT (not shown) confirms that observations 1, 4, 8, 63, and 65 have a large Cook's D statistic. The scatter plot shows that the influential observations are located at extreme values of the explanatory variables.

Outliers and high-leverage points

The process to extract or visualize the outliers and high-leverage points is similar. The RSOut data set contains the relevant information. You can do the following:

  1. Look at the names of the variables and the structure of the data set.
  2. Merge with the original data by using the Observation variable.
  3. Use one or more variables to identify the special observations.

The following call to PROC PRINT gives you an overview of the data and its structure:

/* you need to look at the data! */
proc print data=RSOut(obs=12) noobs; 
   where Observation > 0;
   var Observation RStudent HatDiagonal RsByLevIndex outLevLabel RsByLevGroup;
run;

For the RSOut data set, the indicator variable is named RsByLevIndex, which has the value 1 for ordinary observations and the value 2, 3, or 4 for influential observations. The meaning of each index value is shown in the RsByLevGroup variable, which has the corresponding values "Outlier," "Leverage," and "Outlier and Leverage" (or a blank string for ordinary observations). You can use these values to identify the outliers and influential observations. For example, you can print all influential observations or you can graph them, as shown in the following statements:

proc print data=All noobs;
   where Observation > 0 & RsByLevIndex > 1;
   var Observation East North Thick RsByLevGroup;
run;
 
proc sgplot data=All;
   scatter x=East y=North / markerattrs=(size=9);        /* ordinary points are not filled */
   scatter x=East y=North / group=RsByLevGroup nomissinggroup /* special points are filled */
           datalabel=OutLevLabel markerattrs=(symbol=CircleFilled size=10);
run;

The PROC PRINT output confirms that we can select the noteworthy observations. In the scatter plot, the color of each marker indicates whether the observation is an outlier, a high-leverage point, both, or neither.

Summary

It is useful to identify and visualize outliers and influential observations in a regression model. One way to do this is to manually compute a cutoff value and create an indicator variable that indicates the status of each observation. This article demonstrates that technique for the Cook's D statistic. However, an alternative technique is to take advantage of the fact that SAS can create graphs that label the outliers and influential observations. You can use the ODS OUTPUT statement to capture the data underlying any ODS graph. To visualize the noteworthy observations, you can merge the original data and the statistics, indicator variables, and label variables.

Although it's not always easy to decipher the variable names and the structure of the data that comes from ODS graphics, this technique is very powerful. Its use goes far beyond the regression example in this article. The technique enables you to incorporate any SAS graph into a second analysis or visualization.

The post Identify influential observations in regression models appeared first on The DO Loop.

March 16, 2021
 

SAS global statements are not executable DATA step statements, so you cannot use them in the SAS IF-THEN/ELSE statement, which executes DATA step statements depending on specified conditions:

IF expression THEN executable-statement1;
<ELSE executable-statement2;>

Try sticking a global statement in there, and SAS will slap you with an ERROR:

data _null_;
   set SASHELP.CARS nobs=n;
   if n=0 then libname outlib 'c:\temp';
run;

SAS log will show:

3       if n=0 then libname outlib 'c:\temp';
                    -------
                    180
ERROR 180-322: Statement is not valid or it is used out of proper order.

But global statements’ “not executable” status only means that they cannot be executed as part of a DATA step execution. Otherwise, “they take effect” (in my mind that equates to “they execute”) right after the compilation phase but before DATA step executes (or processes) its data reads, writes, logic and iterations.

Here is another illustration. Let’s get a little creative and tack a LIBNAME global statement within conditionally executed DO-group of the IF-THEN statement:

data OUTLIB.CARS;
   set SASHELP.CARS nobs=n;
   if n=0 then
   do;
      libname OUTLIB 'c:\temp';
   end;
run;
 
In this case, SAS log will show:
NOTE: Libref OUTLIB was successfully assigned as follows:
      Engine:        V9
      Physical Name: c:\temp
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set OUTLIB.CARS has 428 observations and 15 variables.

As you can see, not only did our LIBNAME statement “execute” (or “take effect”) even though the IF-THEN condition was FALSE, it also successfully assigned the OUTLIB library and applied it to the data OUTLIB.CARS; statement that appears earlier in the code. That is because the LIBNAME global statement took effect (executed) right after the DATA step compilation, before its execution.

For the same reason, you can place the global TITLE statement either in open code before the PROC that produces output with a title or within that PROC. In the first case, the stand-alone TITLE statement is compiled on its own and immediately executed, thus setting the title for the PROCs that follow. In the latter case, it is compiled with the PROC step and then executed immediately, before the PROC step executes.
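
Here is a small sketch of the two TITLE placements. The title text is made up for illustration; SASHELP.CLASS is used only because it is a small data set:

```sas
/* 1. TITLE in open code: takes effect immediately, before the PROC step */
title "Title set in open code";
proc print data=SASHELP.CLASS(obs=3);
run;

/* 2. TITLE inside the PROC step: takes effect before the step executes */
proc print data=SASHELP.CLASS(obs=3);
   title "Title set inside the PROC step";
run;

title;   /* clear the title; a global statement stays in effect until changed */
```

In both cases the title is in place by the time PROC PRINT produces its output.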

Now that we have a solid grasp of the timing habits of global statements, let’s look at the coding techniques that allow us to take full control of when and whether global statements take effect (are executed).

Macro language to conditionally execute SAS global statements

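Since the SAS macro language generates SAS code before that code is compiled and executed, you can use macro %IF-%THEN/%ELSE logic to control which global statements are generated. Here is an example: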

%let dsname = SASHELP.CARS;
/*%let dsname = SASHELP.CLASS;*/
 
%let name = %scan(&dsname,2);
 
%if (&name eq CARS) or (&name eq CLASS) %then
%do;
   options DLCREATEDIR;
   libname outlib "c:\temp\&name";
%end;
%else
%do;
   libname outlib "c:\temp";
%end;
 
data OUTLIB.&name;
   set &dsname;
run;

In this code, if name is either CARS or CLASS the following global statements will be generated and passed on to the SAS compiler:

   options DLCREATEDIR;
   libname outlib "c:\temp\&name";

This will create a directory c:\temp\&name (if it does not exist) and assign libref OUTLIB to that directory.

Otherwise, the following global statement will be generated and passed on to the SAS compiler:

   libname outlib "c:\temp";

The DATA step then creates data set OUTLIB.&name in the corresponding dynamically assigned library. Using this technique, you can conditionally generate global statements for SAS system options, librefs, filerefs, titles, footnotes, etc. SAS compiler will pick up those generated global statements and execute (activate, put in effect) them.
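
One caveat: as far as I know, %IF-%THEN/%ELSE in open code is supported only in SAS 9.4M5 and later releases. On an earlier release, the same logic can be wrapped in a macro definition. The following is a sketch under that assumption; the macro name assign_outlib is hypothetical:

```sas
%macro assign_outlib(dsname);
   %local name;
   %let name = %scan(&dsname, 2);     /* member name, for example CARS */

   %if (&name eq CARS) or (&name eq CLASS) %then
   %do;
      options DLCREATEDIR;            /* create the directory if it does not exist */
      libname outlib "c:\temp\&name";
   %end;
   %else
   %do;
      libname outlib "c:\temp";
   %end;

   data OUTLIB.&name;
      set &dsname;
   run;
%mend assign_outlib;

%assign_outlib(SASHELP.CARS)
```

The macro generates exactly the same global statements as the open-code version; only the packaging differs.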

CALL EXECUTE to conditionally execute SAS global statements

Sometimes, it is necessary to conditionally execute global statements based on values contained in data, whether in raw data or SAS data sets. Such a data-driven approach can be easily implemented by using the CALL EXECUTE routine in a DATA step.

data _null_;
   set SASHELP.CARS;
   by MAKE;
   if first.MAKE then
   do;
      call execute('title "'||trim(MAKE)||' models";');
      call execute('proc print noobs data=SASHELP.CARS(where=(MAKE="'||trim(MAKE)||'"));');
      call execute('   var MAKE MODEL TYPE;');
      call execute('run;');
   end;
run;

In this code, for every block of unique MAKE values (identified by first.MAKE) we have CALL EXECUTE generating lines of SAS code and pushing them outside the DATA step boundary where they compile and execute. The code snippets for TITLE and WHERE clause are data-driven and generated dynamically. The SAS log will show a series of the generated statements:

NOTE: CALL EXECUTE generated line.
1   + title "Acura models";
2   + proc print noobs data=SASHELP.CARS(where=(MAKE="Acura"));
3   +    var MAKE MODEL TYPE;
4   + run;
 
5   + title "Audi models";
6   + proc print noobs data=SASHELP.CARS(where=(MAKE="Audi"));
7   +    var MAKE MODEL TYPE;
8   + run;

. . . and so forth.

In this implementation, global statement TITLE is prepared (“pre-cooked”) conditionally (if first.MAKE is TRUE) within the DATA step in the form of a character value. It’s still not a global statement until CALL EXECUTE pushes it out of the DATA step. There it becomes a global statement as part of the SAS code stream. There it gets compiled and executed, setting a nice data-driven title for the PROC PRINT output (individually for each Make):

PROC PRINT outputs with dynamically generated titles
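
The same routine also handles the kind of condition that failed earlier with IF-THEN and a LIBNAME statement: a global statement that depends on the number of observations. The following sketch uses a TITLE statement so that it does not touch the file system; the title text is made up for illustration:

```sas
data _null_;
   /* NOBS= is assigned at compile time, so no observations are read */
   if 0 then set SASHELP.CARS nobs=n;
   if n = 0 then
      call execute('title "SASHELP.CARS is empty";');
   else
      call execute('title "SASHELP.CARS has ' || strip(put(n, best.)) || ' observations";');
   stop;   /* the SET statement never executes, so end the step explicitly */
run;

/* the generated TITLE statement runs after the DATA step and before this PROC */
proc print data=SASHELP.CARS(obs=5) noobs;
   var MAKE MODEL TYPE;
run;
```

Only one of the two TITLE statements is ever generated, so the condition decides which global statement takes effect.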

Your thoughts?

Have you found this blog post useful? Do you have any questions? Please feel free to ask and share your thoughts and feedback in the comments section below.

How to conditionally execute SAS global statements was published on SAS Users.