Monte Carlo experiments

6月 302014
In our last entry, we demonstrated how to simulate data from a logistic regression with an interaction between a dichotomous and a continuous covariate. In this entry we show how to use the simulation to estimate the power to detect that interaction. This is a simple, elegant, and powerful idea: simply simulate data under the alternative, and count the proportion of times the null is rejected. This is an estimate of power. If we lack infinite time to simulate data sets, we can also generate confidence intervals for the proportion.

In R, extending the previous example is almost trivially easy. The coef() function, applied to a glm summary object, returns an array with the parameter estimate, standard error, test statistic, and p-value. In one statement, we can extract the p-value for the interaction and return an indicator of a rejected null hypothesis. (This line is commented on below.) Then the routine is wrapped as a trivial function.
logist_inter = function() {
c = rep(0:1,each=50) # sample size is 100
x = rnorm(100)
lp = -3 + 2*c*x
link_lp = exp(lp)/(1 + exp(lp))
y = (runif(100) < link_lp) = glm(y~as.factor(c)*x, family=binomial)
reject = ifelse( coef(summary([4,4] < .05, 1, 0)
# The coef() function above gets the parameter estimates; the [4,4]
# element is the p-value for the interaction.
Running the function many times is also trivial, using the replicate() function.
pow1 = replicate(100, logist_inter())
The result is an array of 1s and 0s. To get the estimated power and confidence limits, we use the binom.test() function.
binom.test(sum(pow1), 100)
The test gives a p-value against the null hypothesis that the probability of rejection is 0.5, which is not interesting. The interesting part is at the end.

95 percent confidence interval:
0.3219855 0.5228808
sample estimates:
probability of success
It would be simple to adjust this code to allow a change in the number of subjects or of the effect sizes, etc.

In SAS, generating the data is no trouble, but evaluating the power programmatically requires several relatively cumbersome steps. To generate multiple data sets, we include the data generation loop from the previous entry within another loop. (Note that the number of observations has also been reduced vs. the previous entry.)

data test;
do ds = 1 to 100; #100 data sets
do i = 1 to 100; #100 obs/data set
c = (i gt 50);
x = normal(0);
lp = -3 + 2*c*x;
link_lp = exp(lp)/(1 + exp(lp));
y = (uniform(0) lt link_lp);

Then we fit all of the models at once, using the by statement. Here, the ODS system suppresses voluminous output and is also used to capture the needed results in a single data set. The name of the piece of output that holds the parameter estimates (parameterestimates) can be found with the ods trace on statement.
ods select none;
ods output parameterestimates= int_ests;
proc logistic data = test ;
by ds;
class c (param = ref desc);
model y(event='1') = x|c;
ods exclude none;

The univariate procedure can be used to count the number of times the null hypothesis of no interaction would be rejected. To do this, we use the loccount option to request a table of location counts, and the mu0 option to specify that the location of interest is 0.05. As above, since our goal is to use the count programmatically, we also extract the result into a data set. If you're following along at home, it's probably worth your while to print out some of this data to see what it looks like.

ods output locationcounts=int_power;
proc univariate data = int_ests loccount mu0=.05;
where variable = "x*c";
var probchisq;
For example, while the locationcounts data set reports the number of observations above and below 0.05, it also reports the number not equal to 0.05. This is not so useful, and we need to exclude this row from the next step. We do that with a where statement. Then proc freq gives us the proportion and (95%) confidence limits we need, using the binomial option to get the confidence limits and the weight statement to convey the fact that the count variable represents the number of observations.

proc freq data = int_power;
where count ne "Num Obs ^= Mu0";
tables count / binomial;
weight value;
Finally, we find our results:
                        Binomial Proportion
Count = Num Obs < Mu0

Proportion 0.4000
ASE 0.0490
95% Lower Conf Limit 0.3040
95% Upper Conf Limit 0.4960

Exact Conf Limits
95% Lower Conf Limit 0.3033
95% Upper Conf Limit 0.5028
We estimate our power at only 40%, with a confidence limit of (30%, 50%). This agrees closely enough with R: we don't need to narrow the limit to know that we'll need a larger sample size.

An unrelated note about aggregators:We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
12月 132010
In recent weeks, we've explored methods to fit logistic regression models when a state of quasi-complete separation exists. We considered Firth's penalized likelihood approach, exact logistic regression, and Bayesian models using Markov chain Monte Carlo (MCMC).

Today we'll show how to build a Monte Carlo experiment to compare these approaches. Suppose we have 100 observations with x=0 and 100 with x=1, and suppose that the Pr(Y=1|X=0) = 0.001, while the Pr(Y=1|X=1) = 0.05. Thus the true odds ratio is (0.05/0.95)/(0.001/0.999) = 52.8 and the log odds ratio we want to find is 3.96. But we will rarely observe any y=1 when x=0. Which of these approaches is most likely to give us acceptable results?

Note that in all of the MCMC analyses we use only 6000 iterations, which is likely too few to trust in practice.

The code is long enough here that we annotate within rather than write much text around the code.


All the SAS procedures used accept the events/trials syntax (section 4.1.1), so we'll generate example data sets as two observations of binomial random variates with the probabilities noted above. We also make extensive use of the ODS system to suppress all printed output (section A.7.1) and to save desired pieces of output as SAS data sets. The latter usage requires first using the ods trace on/listing statement to find the name of the output before saving it. Finally, we use the by statement (section A.6.2) to replicate the analysis for each simulated data set.

data rlog;
do trial = 1 to 100;
/* each "trial" is a simulated data set with two observations
containing the observed number of events with x=0 or x=1 */
x=0; events = ranbin(0,100,.001); n=100; output;
x=1; events = ranbin(0,100,.05); n=100; output;

ods select none; /* omit _all_ printed output */

ods output parameterestimates=glm; /* save the estimated betas */
proc logist data = rlog;
by trial;
model events / n=x; /* ordinary logistic regression */

ods output parameterestimates=firth; /* save the estimated betas */
/* note the output data set has the same name
as in the uncorrected glm */
proc logist data = rlog;
by trial;
model events / n = x / firth; /* do the firth bias correction */

ods output exactparmest=exact;
/* the exact estimates have a different name under ODS */
proc logist data=rlog;
by trial;
model events / n = x;
exact x / estimate; /* do the exact estimation */

data prior;
input _type_ $ Intercept x;
Var 25 25
Mean 0 0

ods output postsummaries=mcmc;
proc genmod data = rlog;
by trial;
model events / n = x / dist=bin;
bayes nbi=1000 nmc=6000
coeffprior=normal(input=prior) diagnostics=none
/* do the Bayes regression, using the prior made in the
previous data step */

Now I have four data sets with parameter estimates in them. I could use them separately, but I'd like to merge them together. I can do this with the merge statement (section 1.5.7) in a data step. I also need to drop the lines with the estimated intercepts and rename the variables that hold the parameter estimates. The latter is necessary because the names are duplicated across the output data sets and desirable in that it allows names that are meaningful. In any event, I can use the where and rename data set options to include these modifications as I do the merge. I'll also add the number of events when x=0 and when x=1, which requires merging in the original data twice.

data lregsep;
glm (where = (variable = "x") rename = (estimate = glm))
firth (where = (variable = "x") rename = (estimate = firth))
exact (rename = (estimate = exact))
mcmc (where = (parameter = "x") rename = (mean=mcmc))
rlog (where = (x = 1) rename = (events = events1))
rlog (where = (x = 0) rename = (events = events0));
by trial;

ods select all; /* now I want to see the output! */
/* check to make sure the output dataset looks right */
proc print data = lregsep (obs = 5) ;
var trial glm firth exact mcmc;

/* what do the estimates look like? */
proc means data=lregsep;
var glm firth exact mcmc;

With the following output.

Obs trial glm firth exact mcmc

1 1 12.7866 2.7803 2.3186 3.9635
2 2 12.8287 3.1494 2.7223 4.0304
3 3 10.7192 1.6296 0.8885 2.5613
4 4 11.7458 2.2378 1.6906 3.3409
5 5 10.7192 1.6296 0.8885 2.5115

Variable Mean Std Dev
glm 10.6971252 3.4362801
firth 2.2666700 0.5716097
exact 1.8237047 0.5646224
mcmc 3.1388274 0.9620103

The ordinary logistic estimates are entirely implausible, while the three alternate approaches are more acceptable. The MCMC result has the least bias, but it's unclear to what degree this is a happy coincidence between the odds ratio and the prior precision. The Firth approach appears to be less biased than the exact logistic regression

The R version is roughly analogous to the SAS version. The notable differences are that 1) I want the "weights" version of the data (see example 8.15) for the glm() and logistf() functions and need the events/trials syntax for the elrm() function and the expanded (one row per observation) version for the MCMClogit() funtion. The sapply() function (section B.5.3) serves a similar function to the by statement in SAS. Finally, rather than spelunking through the ods trace output to find the parameter estimates, I used the str() function (section 1.3.2) to figure out where they are stored in the output objects and indexes (rather than data set options) to pull out the one estimate I need.

# make sure the needed packages are present
# the runlogist() function generates a dataset and runs each analysis
# the parameter "trial" keeps track of which time we're calling runlogist()
runlogist = function(trial) {
# the result vector will hold the estimates temporarily
result = matrix(0,4)
# generate the number of events once
events.0 =rbinom(1,100, .001) # for x = 0
events.1 = rbinom(1,100, .05) # for x = 1
# following for glm and logistf "weights" format
xw = c(0,0,1,1)
yw = c(0,1,0,1)
ww = c(100 - events.0, events.0, 100 - events.1,events.1)
# run the glm and logistf, grab the estimates, and stick
# them into the results vector
result[1] =
glm(yw ~ xw, weights=ww, binomial)$coefficients[2]
result[2] = logistf(yw ~ xw, weights=ww)$coefficients[2]
# elrm() needs a data frame in the events/trials syntax
elrmdata = data.frame(events=c(events.0,events.1), x =c(0,1),
trials = c(100,100))
# run it and grab the estimate
result[3]=elrm(events/trials ~ x, interest = ~ x, iter = 6000,
burnIn = 1000, data = elrmdata, r = 2)$coeffs
# MCMClogit() needs expanded data
x = c(rep(0,100), rep(1,100))
y = c(rep(0,100-events.0), rep(1,events.0),
rep(0, 100-events.1), rep(1, events.1))
# run it and grab the mean of the MCMC posteriors
result[4] = summary(MCMClogit(y~as.factor(x), burnin=1000,
mcmc=6000, b0=0, B0=.04,
seed = list(c(781306, 78632467, 364981736, 6545634, 7654654,
# send back the four estimates, plus the number of events
# when x=0 and x=1
return(c(trial, events.0, events.1, result))

Note the construction of the seed= option to the MCMClogit() function. This allows a different seed in every call without actually using sequential seeds.

Now we're ready to call the function repeatedly. We'll do that with the sapply() function, but we need to nest that inside a t() function call to get the estimates to appear as columns rather than rows, and we'll also make it a data frame in the same command. Note that the parameters we change within the sapply() function are merely a list of trial numbers. Finally, we'll add descriptive names for the columns with the names() function (section 1.3.4).

res2 =, runlogist)))
names(res2) <- c("trial","events.0","events.1", "glm",
"firth", "exact-ish", "MCMC")
mean(res2[,4:7], na.rm=TRUE)

trial events.0 events.1 glm firth exact-ish MCMC
1 1 0 6 18.559624 2.6265073 2.6269087 3.643560
2 2 1 3 1.119021 0.8676031 1.1822296 1.036173
3 3 0 5 18.366720 2.4489268 2.1308186 3.555314
4 4 0 5 18.366720 2.4489268 2.0452446 3.513743
5 5 0 2 17.419339 1.6295391 0.9021854 2.629160
6 6 0 9 17.997524 3.0382577 2.1573979 4.017105

glm firth exact-ish MCMC
17.333356 2.278344 1.813203 3.268243

The results are notably similar to SAS, except for the unacceptable glm() results.

In most Monte Carlo experimental settings, one would also be interested in examining the confidence limits for the parameter estimates. Notes and code for doing this can be found here. In a later entry we'll consider plots for the results generated above. As a final note, there are few combinations of event numbers with any mass worth considering. One could compute the probability of each of these and the associated parameter estimates, deriving a more analytic answer to the question. However, this would be difficult to replicate for arbitrary event probabilities and Ns, and very awkward for continuous covariates, while the above approach could be extended with trivial ease.