do loop

7月 062021
 

Complex loops in SAS programmingIterative loops are one of the most powerful and imperative features of any programming language, allowing blocks of code to be automatically executed repeatedly with some variations. In SAS we call them DO-loops because they are defined by the iterative DO statements. These statements come in three distinct forms:

  • DO with index variable
  • DO UNTIL
  • DO WHILE

In this blog post we will focus on the versatile iterative DO loops with index variable pertaining to SAS DATA steps, as opposed to its modest IML’s DO loops subset.

Iterative DO statement with index variable

The syntax of the DATA step’s iterative DO statement with index variable is remarkably simple yet powerful:

DO statement with index-variable

DO index-variable=specification-1 <, ...specification-n>;

...more SAS statements...

END;

It executes a block of code between the DO and END statements repeatedly, controlled by the value of an index variable. Given that angle brackets (< and >) denote “optional”, notice how index-variable requires at least one specification (specification-1) yet allows for multiple additional optional specifications (<, ...specification-n>) separated by commas.

Now, let’s look into the DO statement’s index-variable specifications.

Index-variable specification


Each specification denotes an expression, or a series of expressions as follows:

start-expression <TO stop-expression> <BY increment-expression> <WHILE (expression) | UNTIL (expression)>

Note that only start-expression is required here whereas <TO stop-expression>, <BY increment-expression>, and <WHILE (expression) or UNTIL (expression)> are optional.

Start-expression may be of either Numeric or Character type, while stop-expression and increment-expression may only be Numeric complementing Numeric start-expression.

Expressions in <WHILE (expression) | UNTIL (expression)> are Boolean Numeric expressions (numeric value other than 0 or missing is TRUE and a value of 0 or missing is FALSE).

Other iterative DO statements

For comparison, here is a brief description of the other two forms of iterative DO statement:

  • The DO UNTIL statement executes statements in a DO loop repetitively until a condition is true, checking the condition after each iteration of the DO loop. In other words, if the condition is true at the end of the current loop it will not iterate anymore, and processing continues with the next statement after END. Otherwise, it will iterate.
  • The DO WHILE statement executes statements in a DO loop repetitively while a condition is true, checking the condition before each iteration of the DO loop. That is if the condition is true at the beginning of the current loop it will iterate, otherwise it will not, and processing continues with the next statement after the END.

Looping over a list of index variable values/expressions

DO loops can iterate over a list of index variable values. For example, the following DO-loop will iterate its index variable values over a list of 7, 13, 5, 1 in the order they are specified:

data A; 
   do i=7, 13, 5, 1;
      put i=;
      output;
   end;
run;

This is not yet another form of iterative DO loop as it is fully covered by the iterative DO statement with index variable definition. In this case, the first value (7) is the required start expression of the required first specification, and all subsequent values (13, 5 and 1) are required start expressions of the additional optional specifications.

Similarly, the following example illustrates looping over a list of index variable character values:

data A1;
   length j $4;
   do j='a', 'bcd', 'efgh', 'xyz';
      put j=;
      output;
   end;
run;

Since DO loop specifications denote expressions (values are just instances or subsets of expressions), we can expand our example to a list of actual expressions:

data B;
   p = constant('pi');
   do i=round(sin(p)), sin(p/2), sin(p/3);
      put i=;
      output;
   end;
run;

In this code DO-loop will iterate its index variable over a list of values defined by the following expressions: round(sin(p)), sin(p/2), sin(p/3).

Infinite loops

Since <TO stop> is optional for the index-variable specification, the following code is perfectly syntactically correct:

data C;
   do j=1 by 1;
      output;
   end;
run;

It will result in an infinite (endless) loop in which resulting data set will be growing indefinitely.

While unintentional infinite looping is considered to be a bug and programmers’ anathema, sometimes it may be used intentionally. For example, to find out what happens when data set size reaches the disk space capacity… Or instead of supplying a “big enough” hard-coded number (which is not a good programming practice) for the loop’s TO expression, we may want to define an infinite DO-loop and take care of its termination and exit inside the loop. For example, you can use IF exit-condition THEN LEAVE; or IF exit-condition THEN STOP; construct.

LEAVE statement immediately stops processing the current DO-loop and resumes with the next statement after its END.

STOP statement immediately stops execution of the current DATA step and SAS resumes processing statements after the end of the current DATA step.

The exit-condition may be unrelated to the index-variable and be based on some events occurrence. For instance, the following code will continue running syntactically “infinite” loop, but the IF-THEN-LEAVE statement will limit it to 200 seconds:

data D;
   start = datetime();
   do k=1 by 1;
      if datetime()-start gt 200 then leave;
      /* ... some processing ...*/
      output; 
   end;
run;

You can also create endless loop using DO UNTIL(0); or DO WHILE(1); statement, but again you would need to take care of its termination inside the loop.

Changing “TO stop” within DO-loop will not affect the number of iterations

If you think you can break out of your DO loop prematurely by adjusting TO stop expression value from within the loop, you may want to run the following code snippet to prove to yourself it’s not going to happen:

data E;
   n = 4;
   do i=1 to n;
      if i eq 2 then n = 2;
      put i=;
      output;
   end;
run;

This code will execute DO-loop 4 times despite that you change value of n from 4 to 2 within the loop.

According to the iterative DO statement documentation, any changes to stop made within the DO group do not affect the number of iterations. Instead, in order to stop iteration of DO-loop before index variable surpasses stop, change the value of index-variable so that it becomes equal to the value of stop, or use LEAVE statement to jump out of the loop. The following two examples will do just that:

data F;
   do i=1 to 4;
      put i=;
      if i eq 2 then i = 4;
      output;
   end;
run;
 
data G;
   do i=1 to 4;
      put i=;
      if i eq 2 then leave;
      output;
   end;
run;

Know thy DO-loop specifications

Here is a little attention/comprehension test for you.

How many times will the following DO-loop iterate?

data H;
   do i=1, 7, 3, 6, 2 until (i>3);
      put i=;
      output;
   end;
run;

If your answer is 2, you need to re-read the whole post from the beginning (I am only partly joking here).

You may easily find out the correct answer by running this code snippet in SAS. If you are surprised by the result, just take a closer look at the DO statement: there are 5 specifications for the index variable here (separated by commas) whereas UNTIL (expression) belongs to the last specification where i=2. Thus, UNTIL only applies to a single value of i=2 (not to any previous specifications of i =1,7,3,6); therefore, it has no effect as it is evaluated at the end of each iteration.

Now consider the following DO-loop definition:

data Z;
   pi = constant('pi');
   do x=3 while(x>pi), 10 to 1 by -pi*3, 20, 30 to 35 until(pi);
      put x=;
      output;
   end;
run;

I hope after reading this blog post you can easily identify the index variable list of values the DO-loop will iterate over. Feel free to share your solution and explanation in the comments section below.

Additional resources

Questions? Thoughts? Comments?

Do you find this post useful? Do you have questions, other secrets, tips or tricks about the DO loop? Please share with us below.

Little known secrets of DO-loops with index variables was published on SAS Users.

7月 142020
 

In my new book, End-to-End Data Science with SAS: A Hands-On Programming Guide, I use the 1.5 IQR rule to adjust multiple variables.  This program utilizes a macro that loops through a list of variables to make the necessary adjustments and creates an output data set.

One of the most popular ways to adjust for outliers is to use the 1.5 IQR rule. This rule is very straightforward and easy to understand. For any continuous variable, you can simply multiply the interquartile range by the number 1.5. You then add that number to the third quartile. Any values above that threshold are suspected as being an outlier. You can also perform the same calculation on the low end. You can subtract the value of IQR x 1.5 from the first quartile to find low-end outliers.

The process of adjusting for outliers can be tedious if you have several continuous variables that are suspected as having outliers. You will need to run PROC UNIVARIATE on each variable to identify its median, 25th percentile, 75th percentile, and interquartile range. You would then need to develop a program that identifies values above and below the 1.5 IQR rule thresholds and overwrite those values with new values at the threshold.

The following program is a bit complicated, but it automates the process of adjusting a list of continuous variables according to the 1.5 IQR rule. This program consists of three distinct parts:

    1. Create a BASE data set that excludes the variables contained in the &outliers global macro. Then create an OUTLIER data set that contains only the unique identifier ROW_NUM and the outlier variables.
    2. Create an algorithm that loops through each of the outlier variables contained in the global variable &outliers and apply the 1.5 IQR rule to cap each variable’s range according to its unique 1.5 IQR value.
    3. Merge the newly restricted outlier variable with the BASE data set.
/*Step 1: Create BASE and OUTLIER data sets*/
 
%let outliers = /*list of variables*/;
 
DATA MYDATA.BASE;
    SET MYDATA.LOAN_ADJUST (DROP=&amp;outliers.);
    ROW_NUM = _N_;
RUN;
 
DATA outliers;
    SET MYDATA.LOAN_ADJUST (KEEP=&amp;outliers. ROW_NUM);
    ROW_NUM = _N_;
RUN;
 
 /*Step 2: Create loop and apply the 1.5 IQR rule*/
 
%MACRO loopit(mylist);
    %LET n = %SYSFUNC(countw(&amp;mylist));
 
    %DO I=1 %TO &amp;n;
        %LET val = %SCAN(&amp;mylist,&amp;I);
 
        PROC UNIVARIATE DATA = outliers ;
            VAR &amp;val.;
            OUTPUT OUT=boxStats MEDIAN=median QRANGE=iqr;
        run;
 
        data _NULL_;
           SET boxStats;
           CALL symput ('median',median);
           CALL symput ('iqr', iqr);
        run;
 
        %PUT &amp;median;
        %PUT &amp;iqr;
 
        DATA out_&amp;val.(KEEP=ROW_NUM &amp;val.);
        SET outliers;
 
       IF &amp;val. ge &amp;median + 1.5 * &amp;iqr THEN
           &amp;val. = &amp;median + 1.5 * &amp;iqr;
       RUN;
 
/*Step 3: Merge restricted value to BASE data set*/
 
       PROC SQL;
           CREATE TABLE MYDATA.BASE AS
               SELECT *
               FROM MYDATA.BASE AS a
               LEFT JOIN out_&amp;val. as b
                   on a.ROW_NUM = b.ROW_NUM;
       QUIT;
 
    %END;
%MEND;
 
%LET list = &amp;outliers;
%loopit(&amp;list);

Notes on the outlier adjustment program:

  • A macro variable is created that contains all of the continuous variables that are suspected of having outliers.
  • Separate data sets were created: one that contains all of the outlier variables and one that excludes the outlier variables.
  • A macro program is developed to contain the process of looping through the list of variables.
  • A macro variable (n) is created that counts the number of variables contained in the macro variable.
  • A DO loop is created that starts at the first variable and runs the following program on each variable contained in the macro variable.
  • PROC UNIVARIATE identifies the variable’s median and interquartile range.
  • A macro variable is created to contain the values of the median and interquartile range.
  • A DATA step is created to adjust any values that exceed the 1.5 IQR rule on the high end and the low end.
  • PROC SQL adds the adjusted variables to the BASE data set.

This program might seem like overkill to you. It could be easier to simply adjust outlier variables one at a time. This is often the case; however, when you have a large number of outlier variables, it is often beneficial to create an algorithm to transform them efficiently and consistently

Adjusting outliers with the 1.5 IQR rule was published on SAS Users.

2月 182020
 

In case you missed the news, there is a new edition of The Little SAS Book! Last fall, we completed the sixth edition of our book, and even though it is actually a few pages shorter than the fifth edition, we managed to add many more topics to the book. See if you can answer this question.

The answer is D – all of the above! We also added new sections on subsetting, summarizing, and creating macro variables using PROC SQL, new sections on the XLSX LIBNAME engine and ODS EXCEL, more on iterative DO statements, a new section on %DO, and more. For a summary of all the changes, see our blog post “The Little SAS Book 6.0: The best-selling SAS book gets even better."

Updating The Little SAS Book meant updating its companion book, Exercises and Projects for The Little SAS Book, as well. The exercises and projects book contains multiple choice and short answer questions as well as programming exercises that cover the same topics that are in The Little SAS Book. The exercises and projects book can be used in a classroom setting, or for anyone wanting to test their SAS knowledge and practice what they have learned.

Here are examples of the types of questions you might find in the exercises and projects book.

Multiple Choice

Short Answer

Programming Exercise

Solutions

In the book, we provide solutions for odd-numbered multiple choice and short answer questions and hints for the programming exercises.

  1. B
  2. Hint: New variables (columns) can be specified in the SELECT clause. Also, see our blog post “Expand your SAS Knowledge by Learning PROC SQL.”

While we don’t provide solutions for even-numbered questions, we can tell you that the iterative DO statement is covered in Section 3.12 of The Little SAS Book, Sixth Edition, “Using Iterative DO, DO WHILE, and DO UNTIL Statements.” The %DO statement is covered in Section 7.7, “Using %DO Loops in Macros.”

For more information about these books, explore the following links to the SAS website:

The Little SAS Book, Sixth Edition

Exercises and Projects for The Little SAS Book, Sixth Edition

Test your SAS skills with the newest edition of Exercises and Projects for The Little SAS Book was published on SAS Users.

1月 062015
 
In the US, it's typical to borrow a fairly substantial portion of the cost of a new house from a bank. The cost of these loans, the mortgage rate, varies over time depending on what the financial wizards see in their crystal balls. What this means over time is that when the mortgage rates go down, the cost of living in your own house magically decreases--you take a new loan at the lower rate and pay off your old loan with it-- then you only have to pay off the new loan at the lower rate. You can find mortgage rate calculators on the web very easily-- if you don't mind their collecting your data and being bombarded with ads if you let their cookies trace you.

Instead, you can use SAS or R to calculate what you might pay for a new loan with various posted rates. There are some sophisticated tools available for either package if you're interested in the remaining principal or the proportion of each payment that's principal. Here, we just want to check the monthly payment.

R
We'll begin by writing a little function to calculate the monthly payment from the principal, interest rate (in per cent), and term (in years) of the loan. This is basic stuff, but the code here is adapted from a function written by Thomas Girke of UC Riverside.

mortgage <- function(principal=300000, rate=3.875, term=30) {
J <- rate/(12 * 100)
N <- 12 * term
M <- principal*J/(1-(1+J)^(-N))
monthPay <<- M
return(monthPay)
}
To compare the monthly costs for a series of loans offered by a local bank, we'll input the bank's loans as a data frame. To save typing, we'll use the rep() function to generate the term of the loan and the points.

offers = data.frame(
principal = rep(275000, times=9),
term = rep(c(30,20,15), each=3),
points = rep(c(0,1,2), times=3),
rate = c(3.875, 3.75, 3.5, 3.625, 3.5, 3.375, 3, 2.875, 2.75))

> offers

principal term points rate
1 275000 30 0 3.875
2 275000 30 1 3.750
3 275000 30 2 3.500
4 275000 20 0 3.625
5 275000 20 1 3.500
6 275000 20 2 3.375
7 275000 15 0 3.000
8 275000 15 1 2.875
9 275000 15 2 2.750
(Points are an up-front cost a borrower can pay to lower the mortgage rate for the loan.) With the data and function in hand, it's easy to add the monthly cost to the data frame:

offers$monthly = with(offers, mortgage(rate=rate, term=term, principal=principal))

> offers

principal term points rate monthly
1 275000 30 0 3.875 1293.152
2 275000 30 1 3.750 1273.568
3 275000 30 2 3.500 1234.873
4 275000 20 0 3.625 1612.610
5 275000 20 1 3.500 1594.889
6 275000 20 2 3.375 1577.282
7 275000 15 0 3.000 1899.100
8 275000 15 1 2.875 1882.611
9 275000 15 2 2.750 1866.210
In theory, each of these costs are fair, and the borrower should choose based on monthly costs they can afford, as well as whether they see a better value in having money in hand to spend on a better quality of life or to invest it in savings or in paying off their house sooner. Financial professionals often discuss things like the total dollars spent or the total spent on interest vs. principal, as well.

SAS
The SAS/ETS package provides the LOAN procedure, which can calculate the detailed analyses mentioned above. For simple calculations like this one, we can use the mort function in the data step. It will find and return the missing one of the four parameters-- principal, payment, rate, and term. To enter the data in a manner similar to R, we'll use array statements and do loops.

data t;
principal = 275000;
array te [3] (30,20,15);
array po [3] (0,1,2);
array ra [9] (.03875, .0375, .035, .03625, .035,
.03375, .03, .02875, .0275);
do i = 1 to 3;
do j = 1 to 3;
term = te[i];
points = po[j];
rate = ra[ 3 * (i-1) +j];
monthly = mort(principal,.,rate/12, term*12);
output;
end;
end;
run;

proc print noobs data = t;
var principal term points rate monthly; run;

principal term points rate monthly

275000 30 0 0.03875 1293.15
275000 30 1 0.03750 1273.57
275000 30 2 0.03500 1234.87
275000 20 0 0.03625 1612.61
275000 20 1 0.03500 1594.89
275000 20 2 0.03375 1577.28
275000 15 0 0.03000 1899.10
275000 15 1 0.02875 1882.61
275000 15 2 0.02750 1866.21
An unrelated note about aggregators:We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission. If you read this on another aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page other than as noted above, the aggregator is violating the terms by which we publish our work.
4月 142014
 
Rick Wicklin showed how to make a Hilbert matrix in SAS/IML. Rick has a nice discussion of these matrices and why they might be interesting; the value of H_{r,c} is 1/(r+c-1). We show how to make this matrix in the data step and in R. We also show that Rick's method for displaying fractions in SAS/IML works in PROC PRINT, and how they can be displayed in R.

SAS
In the SAS data step, we'll use an array to make the columns of the matrix and do loops to put values into each cell in a row and output the row into the data set before incrementing the row value. This is a little awkward, but does at least preserve the simple 1/(r+c-1) function of the cell values. We arrange the approach using a global macro to be more general.
%let n = 5;
data h;
array val [&n] v1 - v&n;
do r = 1 to &n;
do c = 1 to &n;
val[c] = 1/(r+c-1);
end;
output;
end;
run;
To print the resulting matrix, we use the fract format, as Rick demonstrated. Pretty nice! The noobs option in the proc print statement suppresses the row number that would otherwise be shown.
proc print data = h noobs; 
var v1 - v5;
format _all_ fract.;
run;
 v1    v2     v3     v4     v5

1 1/2 1/3 1/4 1/5
1/2 1/3 1/4 1/5 1/6
1/3 1/4 1/5 1/6 1/7
1/4 1/5 1/6 1/7 1/8
1/5 1/6 1/7 1/8 1/9


R
As is so often the case in R, a solution can be generated in one line using nesting. Also as usual, though, its a bit unnatural, and we don't deconstruct it here.
n = 5
1/sapply(1:n,function(x) x:(x+n-1))
A more straightforward approach follows. It's certainly less efficient, though efficiency would seem a non-issue in any application of this code. It's also the kind of code that R aficionados would scoff at. It does have the attractiveness of perfect clarity, however. We begin by defining an empty matrix, then simply loop through the cells of the matrix, assigning values one by one.
n=5
h1 = matrix(nrow=n,ncol=n)
for (r in 1:n) {
for (c in 1:n)
h1[r,c] = 1/(r+c-1)
}
To display the fractions, we use the fractions() function in MASS package that's distributed with R.
> library(MASS)
> fractions(h1)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1/2 1/3 1/4 1/5
[2,] 1/2 1/3 1/4 1/5 1/6
[3,] 1/3 1/4 1/5 1/6 1/7
[4,] 1/4 1/5 1/6 1/7 1/8
[5,] 1/5 1/6 1/7 1/8 1/9


An unrelated note about aggregators:We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
10月 212012
 

DoW

    Once upon a midnight dreary,
While I pondered, weak and weary,
   Over many a quaint and curious
        Volume of SAS-L galore,
    While I coded, nearly snapping,
    Suddenly there came a tapping
    As of someone gently wrapping
      Code around my mental core.
“’Tis’ Do-Whitlock Loop”, I muttered,
  “Wrapped around my brain core” -
     This it was, and so much more! 
  –Paul Dorfman, A Bit of DoW-History

It’s great to have Paul Dorfman’s demo of DoW-loop (with a poem! Thanks to Paul to send me a copy.) in this SESUG 2012 conference. First posted in SAS-L by Whitlock, promoted by Dorfman,  DoW-loop is widely known as “Whitlock Do Loop” or ” Dorfman-Whitlock Do Loop”. In Dorfman’s presentation, the following three forms of DoW-loop were constructed:

    • Henderson-Whitlock Original Form of DoW-loop
    • Dorfman’s Short Form of DoW-loop
    • Double DoW-loop: the DeVenezia-Schreier Form

Using a test data set

data a;
    input id $ var;
datalines;
A 1
A 2
B 3
B 4
B 5
;

the Henderson-Whitlock Original Form of DoW-loop looks like:

data b;
    count= 0;
    sum = 0 ;

    do until ( last.id ) ;
        set a ;
        by id ;

        count+1;
        sum+var;
    end ;

    mean = sum / count ;
run ;

Ian Whitlock first used such kind of do loop in a SAS-L post and after its rising, Paul Dorfman, the main promoter of DoW-loop, found that Don Henderson also made use such DO UNTIL() structure in a NESUG 1988 paper, The SAS Supervisor. That’s why he named it as Henderson-Whitlock form of DoW-loop. I then read from a DoW-loop page in sascommunity.org that Don Henderson taught such concept in class where Ian Whitlock was a student. This is a nice story.

Paul Dorfman himself also contributed a short form:

data c ;
    do n = 1 by 1 until ( last.id ) ;
        set a ;
        by id ;

        count = sum (count, 1) ;
        sum = sum (sum, var) ;
    end ;

    mean = sum / count ;
run ;

or even shorter:

data c ;
    do _n_ = 1 by 1 until ( last.id ) ;
        set a ;
        by id ;

        sum = sum (sum, var) ;
    end ;

    mean = sum / _n_ ;
run ;

These kinds of form of DoW-loop utilizes the SUM function, automatic variable _N_ and an increment trick in DO UNTIL() structure then the initializations before the loop are not needed any more. Besides such a short form, Paul Dorfman’s work on DoW-loop includes the invention of the dynamic file splitting method combining the DoW-loop and the hash object (also showed up in the meeting).

The double DoW-loop is under the name Howard Schreier and Richard DeVenezia (DeVenezia-Schreier Form; I should do more literature research on it!):

data d ;
    do n = 1 by 1 until ( last.id ) ;
        set a ;
        by id ;

        count = sum (count, 1) ;
        sum = sum (sum, var) ;
    end ;

    mean = sum / count ;

    do n = 1 by 1 until ( last.id ) ;
        set a ;
        by id ;
        output;
    end ;
run ;

/*********************/

/*****update 1*********/

Thanks to Quentin’s message, Don Henderson’s The SAS Supervisor can be even traced back to 1983.

/*****update 2*********/

I used a Star Wars style of opening crawl to render Paul Dorfman’s DoW-loop verses simply because the finding of Don Henderson looks slightly like a prequel for me (although I have no intention to make up a Star Wars parody). Actually, as Paul stated, the honor belongs to an American author, Edgar Allan Poe and one of his poems, The Raven:

Once upon a midnight dreary, while I pondered, weak and weary,
Over many a quaint and curious volume of forgotten lore —
While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door.
"’Tis some visiter," I muttered, "tapping at my chamber door —
            Only this and nothing more."

/*****update 3*********/

A double DoW demo in recent SAS-L(Sat, 27 Oct 2012).

5月 142012
 
In example 9.30 we explored the effects of adjusting for multiple testing using the Bonferroni and Benjamini-Hochberg (or false discovery rate, FDR) procedures. At the time we claimed that it would probably be inappropriate to extract the adjusted p-values from the FDR method from their context. In this entry we attempt to explain our misgivings about this practice.

The FDR procedure is described in Benjamini and Hochberg (JRSSB, 1995) as a "step-down" procedure. Put simply, the procedure has the following steps:

0. Choose the familywise alpha
1. Rank order the unadjusted p-values
2. Beginning with the Mth of the ordered p-values p(m),
2a. if p(m) < alpha*(m/M), then reject all tests 1 ... m,
2b. if not, m = m-1
3. Repeat steps 2a and 2b until the condition is met
or p(1) > alpha/M

where M is the number of tests. The "adjusted p-value" based on this procedure is the smallest familywise alpha under which the current test would have been rejected. To calculate this, we can modify the routine above:

1. Rank order the unadjusted p-values
2. For ordered p-values p(m) M to 1,
2a. candidate ap(m) = p(m) *(M/m)
2b. if candidate ap(m) > ap(m+1) then ap(m) = ap(m+1)
2c. else ap(m) = candidate ap(m)

where ap(m) refers to the adjusted p-value corresponding to the mth ordered unadjusted p-value. It's interesting to note that the adjusted p-value for the Mth ordered test is the same as the unadjusted p-value, while the candidate adjusted p-value for the smallest test is the Bonferroni adjusted p-value. The primary difficulty with taking these p-values (as opposed to the test results) out of context is captured in steps 2b and 2c. They imply that the p-value for a given test may be lowered by other observed p-values in the family of tests. It's also true that the adjusted p-value depends on the number of tests included in the family, but this seems somewhat less troubling.

To examine the impact of the procedure on the adjusted p-values for the individual tests, we'll compare the candidate ap(m) from step 2a against the actual ap(m). Our sense is that to the degree these are different, the adjusted p-value should not be extracted from the context of the observed family of tests.


SAS
Our SAS code relies heavily on the array statement (section 1.11.5). We loop through the p-values from largest to smallest, calculating the candidate fdr p-value as above, before arriving at the final adjusted p-value. To compare the values conveniently, we make a new data set with two copies of the original data set, renaming first the candidate and then the adjusted p-values to have the same names. The in = data set option creates a temporary variable which identifies which data set an observation was read from; here it denotes which version of the same data set (and which set of p-values) was used.

data fdr;
array pvals [10] pval1 - pval10
(.001 .001 .001 .001 .001 .03 .035 .04 .05 .05);
array cfdrpvals [10] cfdr1 - cfdr10;
array fdrpvals [10] fdr1 - fdr10;
fdrpvals[10] = pvals[10];
do i = 9 to 1 by -1;
cfdrpvals[i] = pvals[i] * 10/i;
if cfdrpvals[i] > fdrpvals[i+1] then fdrpvals[i] = fdrpvals[i+1];
else fdrpvals[i] = cfdrpvals[i];
end;
run;

data compare;
set fdr (in = cfdr rename=(cfdr1=c1 cfdr2=c2 cfdr3=c3 cfdr4=c4
cfdr5=c5 cfdr6=c6 cfdr7=c7 cfdr8=c8 cfdr9=c9))
fdr (in = fdr rename=(fdr1=c1 fdr2=c2 fdr3=c3 fdr4=c4 fdr5=c5
fdr6=c6 fdr7=c7 fdr8=c8 fdr9=c9));
if cfdr then adjustment = "Candidate fdr";
if fdr then adjustment = "Final fdr";
run;

proc print data = compare; var adjustment c1-c9; run;

adjustment c1 c2 c3 c4 c5 c6 c7 c8 c9

Candidate fdr 0.010 .005 .0033 .0025 .002 .05 .05 .05 .055
Final fdr 0.002 .002 .0020 .0020 .002 .05 .05 .05 .050

(We omit the last p-value because the adjustment does not affect it.) The result shows that for many of the tests in this family, a substantially smaller p-value is obtained with the final FDR p-value than the candidate. To this degree, the FDR p-value is dependent on the observed values of the p-values in the tests in the family, and ought not to be removed from the context of these other tests. We would recommend caution in displaying the FDR p-values in such settings, given readers' propensity to use them as if they were ordinary p-values, safely adjusted for multiple testing.

R
Comparison of the R and SAS code may make SAS programmers weep. The candidate values are easily calculated, and can be presented with the final p-values in one step using the p.adjust() function. Three lines of code, albeit incorporating multiple functions in each line. (And it could sensibly be done in two, calculating the candidate p-values within the rbind() function call.) Note especially the line calculating the candidate p-values, in which vectorization allows a for loop to be avoided in a very natural fashion.

fakeps = c(rep(.2, 5), 6, 7, 8, 10, 10)/200
cfdr = fakeps * 10/(1:10)
rbind(cfdr, fdr=p.adjust(fakeps, "fdr"))[,1:9]

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
cfdr 0.010 0.005 0.0033 0.0025 0.002 0.05 0.05 0.05 0.0556 0.05
fdr 0.002 0.002 0.0020 0.0020 0.002 0.05 0.05 0.05 0.0500 0.05


An unrelated note about aggregatorsWe love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers and PROC-X with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.