A recent

post pointed us to a great

talk that elegantly described how inferences from a trial could be analyzed with a purely resampling-based approach. The talk uses data from a

paper that considered the association between beer consumption and mosquito attraction. We recommend the talk linked above for those thinking about creative ways to teach inference.

In this entry, we demonstrate how straightforward it is to replicate these analyses in R, and show how they can be done in SAS.

**R**We'll repeat the exercise twice in R: first using the

`mosaic` package that Nick and colleagues have been developing to help teach statistics, and then in base R.

For

`mosaic`, we begin by entering the data and creating a dataframe. The

` do()` operator and the

shuffle() function facilitate carrying out a

permutation test (see also section 5.4.5 of the book).

beer = c(27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20)

water = c(21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20)

ds = data.frame(y = c(beer, water),

x = c(rep("beer", length(beer)), rep("water", length(water))))

require(mosaic)

obsdiff = compareMean(y ~ x, data=ds)

nulldist = do(999)*compareMean(y ~ shuffle(x), data=ds)

histogram(~ result, xlab="permutation differences", data=nulldist)

ladd(panel.abline(v=obsdiff, col="red", lwd=2))

> obsdiff

[1] -4.377778

> tally(~ abs(result) > abs(obsdiff), format="percent", data=nulldist)

TRUE FALSE

0.1 99.9

The

` do()` operator evaluates the expression on the right hand side a specified number of times. In this case we shuffle (or permute) the group indicators.

We observe a mean difference of 4.4 attractions (comparing the beer to water groups). The histogram of the results-- plotted with the lattice graphics package that mosaic loads by default-- demonstrates that this result would be highly unlikely to occur by chance: if the null hypothesis that the groups were equal was true, results more extreme than this would happen only 1 time out of 1000. This can be displayed using the tally() function, which adds some functionality to table(). We can calculate the p-value by including the observed statistic in the numerator and the denominator = (1+1)/(999 + 1) = .002.

For those not invested in the mosaic package, base R functions can be used to perform this analysis . We present a version here that begins after making the data vectors.

alldata = c(beer, water)

labels = c(rep("beer", length(beer)), rep("water", length(water)))

obsdiff = mean(alldata[labels=="beer"]) - mean(alldata[labels=="water"])

> obsdiff

[1] -4.377778

The

`sample()` function re-orders the labels, effectively implementing the supposition that the number of bites might have happened under either the water or the beer drinking regimen.

resample_labels = sample(labels)

resample_diff = mean(alldata[resample_labels=="beer"]) -

mean(alldata[resample_labels=="water"])

resample_diff

[1] 1.033333

In a teaching setting, the preceding code could be re-run several times, to mimic the presentation seen in the video linked above. To repeat many times, the most suitable base R tool is

`replicate()`. To use it, we make a function of the resampling procedure shown above.

resamp_means = function(data, labs){

resample_labels = sample(labs)

resample_diff = mean(data[resample_labels=="beer"]) -

mean(data[resample_labels=="water"])

return(resample_diff)

}

nulldist = replicate(9999,resamp_means(alldata,labels))

hist(nulldist, col="cyan")

abline(v = obsdiff, col = "red")

The histogram is shown above. The p-value is obtained by counting the proportion of statistics (including the actual observed difference) among greater than or equal to the observed statistic:

alldiffs = c(obsdiff,nulldist)

p = sum(abs(alldiffs >= obsdiff)/ 10000)

**SAS**The SAS code is relatively cumbersome in comparison. We begin by reading the data in, using the "line hold" double-ampersand and the

`infile datalines` statement that allows us to specify a delimiter (other than a space) when reading data in directly in a data step. This let us copy the data from the R code. To identify the water and beer regimen subjects, we use the

`_n_` implied variable that SAS creates but does not save with the data.

The

`summary` procedure generates the mean for each group and saves the results in a data set with a row for each group; the

`transpose` procedure makes a data set with a single row and a variable for each group mean. Finally, we calculate the observed difference and use

`call symput` to make it into a macro variable for later use.

data bites;;

if _n_ le 18 then drink="water";

else drink="beer";

infile datalines delimiter=',';

input bites @@;

datalines;

21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20

27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24

28, 24, 29, 21, 21, 18, 27, 20

;

run;

proc summary nway data = bites;

class drink;

var bites;

output out=obsmeans mean=mean;

run;

proc transpose data = obsmeans out=om2;

var mean;

id drink;

run;

data om3;

set om2;

obsdiff = beer-water;

call symput('obsdiff',obsdiff);

run;

proc print data = om3; var obsdiff; run;

Obs obsdiff

1 4.37778

(Yes, we could have done this with

`proc ttest` and

`ODS`. But the spirit of the video is that we don't understand t-tests, so we want to avoid them.)

To rerandomize, we can assign a random number to each row, sort on the random number, and assign drink labels based on the new order of the data.

data rerand;

set bites;

randorder = uniform(0);

run;

proc sort data = rerand; by randorder; run;

data rerand2;

set rerand;

if _n_ le 18 then redrink = "water";

else redrink = "beer";

run;

proc summary nway data = rerand2;

class redrink;

var bites;

output out=rerand3 mean=mean;

run;

proc transpose data = rerand3 out=rerand4;

var mean;

id redrink;

run;

data rrdiff;

set rerand4;

rrdiff = beer-water;

run;

proc print data = rrdiff; var rrdiff; run;

Obs rrdiff

1 -1.73778

One way to repeat this a bunch of times would be to make a macro out of the above and collect the resulting

`rrdiff` into a data set. Instead, we use the

`surveyselect` procedure to do this much more efficiently. The

`groups` option sample groups of 18 and 25 from the data, while the

`reps` option requests this be done 9,999 times. We can then use the

`summary` and

`transpose` procedures as before, with the addition of the

`by replicate` statement to make a data set with columns for each group mean and a row for each replicate.

proc surveyselect data = bites groups=(18,25) reps = 9999 out = ssresamp; run;

proc summary nway data = ssresamp;

by replicate;

class groupid;

var bites;

output out=ssresamp2 mean=mean;

run;

proc transpose data = ssresamp2 out=ssresamp3 prefix=group;

by replicate;

var mean;

id groupid;

run;

To get a p-value and make a histogram, we use the macro variable created earlier.

data ssresamp4;

set ssresamp3;

diff = group2 - group1;

exceeds = abs(diff) ge &obsdiff;

run;

proc means data = ssresamp4 sum; var exceeds; run;

The MEANS Procedure

Analysis Variable : exceeds

Sum

9.0000000

proc sgplot data = ssresamp4;

histogram diff;

refline &obsdiff /axis=x lineattrs=(thickness=2 color=red);

run;

The p-value is 0.001 (= (9+1)/10000).

**An unrelated note about aggregators:**We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences.

SAS and R is aggregated by

R-bloggers,

PROC-X, and

statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at

SAS and R. We answer comments there and offer direct subscriptions if you like our content. With exceptions noted above, no one is allowed to profit by this work under our

license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.