simulation studies

June 30, 2014
 
In our last entry, we demonstrated how to simulate data from a logistic regression with an interaction between a dichotomous and a continuous covariate. In this entry we show how to use that simulation to estimate the power to detect the interaction. The idea is simple, elegant, and powerful: simulate data under the alternative, and count the proportion of times the null hypothesis is rejected. That proportion is an estimate of the power. Since we lack the infinite time needed to simulate all possible data sets, we can also generate a confidence interval for the proportion.

R
In R, extending the previous example is almost trivially easy. The coef() function, applied to a glm summary object, returns a matrix with one row per parameter, containing the estimate, standard error, test statistic, and p-value. In one statement, we can extract the p-value for the interaction and return an indicator of whether the null hypothesis was rejected. (This line is commented in the code below.) The whole routine is then wrapped in a small function.
logist_inter = function() {
  c = rep(0:1, each=50)   # sample size is 100
  x = rnorm(100)
  lp = -3 + 2*c*x
  link_lp = exp(lp)/(1 + exp(lp))
  y = (runif(100) < link_lp)

  log.int = glm(y ~ as.factor(c)*x, family=binomial)
  reject = ifelse(coef(summary(log.int))[4,4] < .05, 1, 0)
  # The coef() function above gets the parameter estimates; the [4,4]
  # element is the p-value for the interaction.
  return(reject)
}
Running the function many times is also trivial, using the replicate() function.
pow1 = replicate(100, logist_inter())
The result is an array of 1s and 0s. To get the estimated power and confidence limits, we use the binom.test() function.
binom.test(sum(pow1), 100)
The test gives a p-value against the null hypothesis that the probability of rejection is 0.5, which is not interesting. The interesting part is at the end.

95 percent confidence interval:
0.3219855 0.5228808
sample estimates:
probability of success
0.42
It would be simple to adjust this code to allow changes in the number of subjects, the effect sizes, and so on.
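As an illustration, here is a minimal sketch of one way to parameterize the function (the function name logist_inter_pow, its arguments, and the choice of 200 subjects below are our own assumptions, not part of the original code):

logist_inter_pow = function(n = 100, beta_int = 2, alpha = 0.05) {
  # n is assumed even: half the subjects in each group
  c = rep(0:1, each = n/2)
  x = rnorm(n)
  lp = -3 + beta_int*c*x
  y = runif(n) < exp(lp)/(1 + exp(lp))
  fit = glm(y ~ as.factor(c)*x, family = binomial)
  coef(summary(fit))[4,4] < alpha   # TRUE if the interaction is detected
}

# estimated power and 95% confidence interval with 200 subjects
pow2 = replicate(100, logist_inter_pow(n = 200))
binom.test(sum(pow2), length(pow2))$conf.int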

SAS
In SAS, generating the data is no trouble, but evaluating the power programmatically requires several relatively cumbersome steps. To generate multiple data sets, we include the data generation loop from the previous entry within another loop. (Note that the number of observations has also been reduced vs. the previous entry.)

data test;
  do ds = 1 to 100;    /* 100 data sets */
    do i = 1 to 100;   /* 100 obs/data set */
      c = (i gt 50);
      x = normal(0);
      lp = -3 + 2*c*x;
      link_lp = exp(lp)/(1 + exp(lp));
      y = (uniform(0) lt link_lp);
      output;
    end;
  end;
run;

Then we fit all of the models at once, using the by statement. Here, the ODS system suppresses voluminous output and is also used to capture the needed results in a single data set. The name of the piece of output that holds the parameter estimates (parameterestimates) can be found with the ods trace on statement.
ods select none;
ods output parameterestimates = int_ests;
proc logistic data = test;
  by ds;
  class c (param = ref desc);
  model y(event='1') = x|c;
run;
ods exclude none;

The univariate procedure can be used to count the number of times the null hypothesis of no interaction would be rejected. To do this, we use the loccount option to request a table of location counts, and the mu0 option to specify that the location of interest is 0.05. As above, since our goal is to use the count programmatically, we also extract the result into a data set. If you're following along at home, it's probably worth your while to print out some of this data to see what it looks like.

ods output locationcounts = int_power;
proc univariate data = int_ests loccount mu0=.05;
  where variable = "x*c";
  var probchisq;
run;
For example, while the locationcounts data set reports the number of observations above and below 0.05, it also reports the number not equal to 0.05. This is not so useful, and we need to exclude this row from the next step. We do that with a where statement. Then proc freq gives us the proportion and (95%) confidence limits we need, using the binomial option to get the confidence limits and the weight statement to convey the fact that the count variable represents the number of observations.

proc freq data = int_power;
  where count ne "Num Obs ^= Mu0";
  tables count / binomial;
  weight value;
run;
Finally, we find our results:
                        Binomial Proportion
Count = Num Obs < Mu0

Proportion 0.4000
ASE 0.0490
95% Lower Conf Limit 0.3040
95% Upper Conf Limit 0.4960

Exact Conf Limits
95% Lower Conf Limit 0.3033
95% Upper Conf Limit 0.5028
We estimate our power at only 40%, with a 95% confidence interval of (30%, 50%). This agrees closely enough with the R result: we don't need a narrower interval to know that we'll need a larger sample size.

May 21, 2012
 
In examples 9.30 and 9.31 we explored corrections for multiple testing and then extracting p-values adjusted by the Benjamini and Hochberg (or FDR) procedure. In this post we'll develop a simulation to explore the impact of "strong" versus "weak" control of the family-wise error rate offered by multiple comparison corrections. Loosely put, procedures with only weak control may fail when some of the null hypotheses are actually false, in that the remaining (true) nulls may then be rejected more often than the nominal rate.

For our simulation, we'll develop flexible code to generate some p-values from false nulls and others from true nulls. We'll assume that the true nulls have p-values distributed uniform (0,1); the false nulls will have p-values distributed uniform with a user-determined maximum. We'll also allow the number of tests overall and the number of false nulls to be set.

SAS
In SAS, a macro does the job. It accepts the user parameters described above, then generates false and true nulls for each desired simulation. With the data created, we can use proc multtest to apply the FDR procedure, with the ODS system saving the results. Note how the by statement allows us to replicate the analysis for each simulated set of p-values without creating a separate data set for each one. (Also note that we do not use proc sort before that by statement -- this can be risky, but works fine here because the data are generated in sorted order.)

%macro fdr(nsims=1, ntests=20, nfalse=10, howfalse=.01);
ods select none;
data test;
  do sim = 1 to &nsims;
    do i = 1 to &ntests;
      raw_p = uniform(0) *
        ( ((i le &nfalse) * &howfalse ) + ((i gt &nfalse) * 1 ) );
      output;
    end;
  end;
run;

ods output pvalues = __pv;
proc multtest inpvalues=test fdr;
  by sim;
run;

With the results in hand (still within the macro), we need to do some massaging to make them usable. First we'll recode the rejections (assuming a 0.05 alpha level) so that non-rejections are 0 and rejections are 1/(number of tests). That way we can just sum across the results to get the proportion of rejections. Next, we transpose the data to get each simulation in a row (section 1.5.4). (The data set output by proc multtest has nsims*ntests rows; after transposing, there are nsims rows.) Finally, we can sum across each row to get the proportion of tests rejected in each simulated family of tests. The results are shown in a table made with proc freq.

data __pv1;
  set __pv;
  if falsediscoveryrate lt 0.05 then fdrprop = 1/&ntests;
  else fdrprop = 0;
run;

proc transpose data = __pv1 (keep = sim fdrprop) out = pvals_a;
  by sim;
run;

data pvals;
  set pvals_a;
  prop = sum(of col1 - col&ntests);
run;
ods select all;

proc freq data = pvals;
  tables prop;
run;
%mend fdr;

%fdr(nsims = 1000, ntests = 20, nfalse = 10, howfalse=.001);

                                    Cumulative    Cumulative
 prop    Frequency     Percent       Frequency       Percent
 ------------------------------------------------------------
 0.5          758       75.80             758         75.80
 0.55         210       21.00             968         96.80
 0.6           27        2.70             995         99.50
 0.65           5        0.50            1000        100.00

So at least one true null was rejected in about 24% of the simulated families of tests, which seems like a lot. Multiple comparison procedures with "strong" control of the familywise error rate would reject one or more true nulls in only about 5% of families. Building this simulation as a macro facilitates exploring the effects of the multiple comparison procedures in a variety of settings.

R
As in example 9.31, the R code is rather simpler, though perhaps a bit opaque. To make the p-values, we generate them first for all of the tests with false nulls, then for all of the tests with true nulls. The matrix() function fills its elements by column, by default, meaning that the first nfalse columns get the nsims*nfalse false-null p-values. The apply() function generates the FDR-adjusted p-values for each row of the matrix. The t() function just transposes the result so that we get back a row for each simulation. As in the SAS version, we'll count each rejection as 1/ntests and each non-rejection as 0; we do this with the ifelse() function. Then we sum across the tests within each simulated family with another call to apply() and show the results with a simple table.

checkfdr = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001) {
  raw_p = matrix(c(runif(nfalse * nsims) * howfalse,
                   runif((ntests - nfalse) * nsims)), nrow=nsims)
  fdr = t(apply(raw_p, 1, p.adjust, "fdr"))
  reject = ifelse(fdr < .05, 1/ntests, 0)
  prop = apply(reject, 1, sum)
  prop.table(table(prop))
}

> checkfdr(nsims=1000, ntests=20, nfalse=10, howfalse=.001)
prop
0.5 0.55 0.6 0.65
0.755 0.210 0.032 0.003

The results are reassuringly similar to those from SAS. In this R code, it's particularly simple to try a different test-- just replace "fdr" in the p.adjust() call. Here's the result with the Hochberg test, which has strong control.

checkhoch = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001) {
  pvals = matrix(c(runif(nfalse * nsims) * howfalse,
                   runif((ntests - nfalse) * nsims)), nrow=nsims)
  hochberg = t(apply(pvals, 1, p.adjust, "hochberg"))
  reject = ifelse(hochberg < .05, 1/ntests, 0)
  prop = apply(reject, 1, sum)
  prop.table(table(prop))
}

> checkhoch(nsims=1000, ntests=20, nfalse=10, howfalse=.001)
prop
0.5 0.55 0.6
0.951 0.046 0.003

With this procedure, one or more of the true nulls is rejected in an appropriate 4.9% of the simulated families. For the most part, we feel more comfortable using multiple testing procedures with "strong control".
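As a check on that reading of the tables, here is a minimal sketch (our own addition, following the structure of checkfdr() above; the function name checkfwer is a made-up label) that estimates the familywise error rate directly, as the proportion of simulated families in which at least one true null is rejected:

checkfwer = function(nsims=1000, ntests=20, nfalse=10, howfalse=0.001,
                     method="fdr") {
  raw_p = matrix(c(runif(nfalse * nsims) * howfalse,
                   runif((ntests - nfalse) * nsims)), nrow=nsims)
  adj = t(apply(raw_p, 1, p.adjust, method))
  # columns (nfalse+1):ntests hold the true nulls; flag families with any rejection
  mean(apply(adj[, (nfalse + 1):ntests, drop=FALSE] < .05, 1, any))
}

checkfwer(method = "fdr")       # roughly 0.24, in line with the table above
checkfwer(method = "hochberg")  # roughly 0.05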


April 29, 2011
 
We often simulate data in SAS or R to confirm analytical results. For example, consider the following problem from the excellent text by Rice:

Let U1, U2, and U3 be independent random variables uniform on [0, 1]. What is the probability that the roots of the quadratic U1*x^2 + U2*x + U3 are real?

Recall that for a quadratic equation A*x^2 + B*x + C to have real roots we need the discriminant (B^2-4AC) to be non-negative.

The answer given in the second and third editions of Rice is 1/9. Here's how you might get there:

Since B^2 > 4*A*C <=> B > 2*sqrt(A*C), we need to integrate the density of B over the range 2*sqrt(A*C) to 1, then integrate over all possible values for A and C (each from 0 to 1).
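As a quick numerical illustration (our own addition, not part of the original post): the inner integral over B is 1 - 2*sqrt(A*C), which is negative whenever 2*sqrt(A*C) > 1. Integrating that expression as-is over the unit square gives 1/9, while truncating it at zero gives roughly 0.254. A crude midpoint-rule sketch in R:

# P(real roots) = integral over A, C in (0,1) of P(B > 2*sqrt(A*C))
inner_naive = function(a, c) 1 - 2*sqrt(a*c)           # can go negative
inner_trunc = function(a, c) pmax(0, 1 - 2*sqrt(a*c))  # truncated at zero

n = 1000
u = (seq_len(n) - 0.5)/n            # midpoints of a fine grid on (0,1)
grid = expand.grid(a = u, c = u)
mean(inner_naive(grid$a, grid$c))   # about 1/9 = 0.111
mean(inner_trunc(grid$a, grid$c))   # about 0.254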

Another answer can be found by taking y = b^2 and w = 4ac and integrating over their joint distribution (they're independent, of course). That leads to an answer of approximately 0.254. Here's how to calculate this in R:

f = function(x) {
  A = x[1]; B = x[2]; C = x[3]
  return(as.numeric(B^2 > 4*A*C))   # indicator that the discriminant is positive
}
library(cubature)
adaptIntegrate(f, c(0,0,0), c(1,1,1), tol=0.0001, maxEval=1000000)

which generates the following output:

$integral
[1] 0.2543692

$error
[1] 0.005612558

$functionEvaluations
[1] 999999

$returnCode
[1] -1

We leave the details of the calculations aside for now, but both seem equally plausible, at first glance. A quick simulation can suggest which is correct.

For those who want more details, here's a more complete summary of this problem and solution.

SAS

Neither the SAS nor the R code is especially challenging.


data test;
  do trial = 1 to 10000;
    u1 = uniform(0); u2 = uniform(0); u3 = uniform(0);
    res = u2**2 - 4*u1*u3;
    realroot = (res ge 0);
    output;
  end;
run;

proc print data=test (obs=10);
run;

proc means data=test;
  var realroot;
run;

Leading to the result:

The MEANS Procedure

Analysis Variable : realroot

N Mean Std Dev Minimum Maximum
-----------------------------------------------------------------
10000 0.2556000 0.4362197 0 1.0000000
-----------------------------------------------------------------


R


numsim = 10000
u1 = runif(numsim); u2 = runif(numsim); u3 = runif(numsim)
res = u2^2 - 4*u1*u3
realroots = res>=0
table(realroots)/numsim

With the result

realroots
FALSE TRUE
0.747 0.253

The simulation demonstrates that the first solution is incorrect. Here the simulation serves as a valuable check on a complicated analysis.

Insights into where the 1/9 solution fails would be welcomed in the comments.
February 28, 2011
 
It's been a long winter so far in New England, with many a snow storm. In this entry, we consider a simulation to complement the analytic solution for a probability problem concerning snow.

Consider a company that buys a policy to insure its revenue in the event of major snowstorms that shut down business. The policy pays nothing for the first such snowstorm of the year and $10,000 for each one thereafter, until the end of the year. The number of major snowstorms per year that shut down business is assumed to have a Poisson distribution with mean 1.5. What is the expected amount paid to the company under this policy during a one-year period?

Let SNOW be the number of snowstorms, and PAY the amount paid out by the insurance. The following chart may be useful in discerning the pattern:

SNOW      PAY     10000*(SNOW-1)
   0        0            -10000
   1        0                 0
   2    10000             10000
   3    20000             20000

The analytic solution is straightforward, but requires truncating the payment at zero when there are no snowstorms. Since the random variable SNOW ~ Poisson(1.5), we know that E[SNOW] = 1.5 and E[10000*(SNOW-1)] = 10000*E[SNOW] - 10000 = 15000 - 10000 = 5000.

E[PAY] is equal to E[10000*(SNOW-1)] + 10000*P(SNOW=0), since the formula 10000*(SNOW-1) undercounts the payment by exactly 10000 when SNOW = 0. So the exact answer is

10000*P(snow=0) + 15000 - 10000 =
10000*exp(-1.5) + 15000 - 10000 = $7231

Here the advantage of simulation is that it may provide a useful check on the results, as well as a ready measure of variability. In this situation, the code is quite simple, but the approach is powerful.

R

numsim = 1000000
snow = rpois(numsim, 1.5)
pay = snow - 1 # subtract one
pay[snow==0] = 0 # deal with the pesky P(snow=0)
sim = mean(pay*10000)
analytic = 10000*(dpois(0, 3/2) + 3/2 - 1)

Yielding the following:

> sim
[1] 7249.55
> analytic
[1] 7231.302
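The "ready measure of variability" mentioned above is easy to add; a minimal sketch (our own addition, reusing the pay, numsim, and sim objects defined in the code above):

# Monte Carlo standard error and a rough 95% interval for the simulated mean
se = sd(pay*10000)/sqrt(numsim)
sim + c(-1.96, 1.96)*se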


SAS
The simulation and analytic solutions are also straightforward in SAS. Here the analytic result is calculated only once, in a single extra observation output after the loop.


data snow_insurance;
  do i = 1 to 1000000;
    nsnow = ranpoi(0, 1.5);
    payout = max(nsnow - 1, 0) * 10000;
    output;
  end;
  analytic = 10000 * (cdf("POISSON", 0, 1.5) + 1.5 - 1);
  output;
run;

proc means data=snow_insurance mean;
  var payout analytic;
run;

This results in the following output:

Variable Mean
------------------------
payout 7236.96
analytic 7231.30
------------------------