Nick Horton

September 21, 2017
 
A previous entry (http://sas-and-r.blogspot.com/2017/07/options-for-teaching-r-to-beginners.html) describes an approach to teaching graphics in R that also "get[s] students doing powerful things quickly", as David Robinson suggested.

In this guest blog entry, Randall Pruim offers an alternative way based on a different formula interface. Here's Randall: 

For a number of years I and several of my colleagues have been teaching R to beginners using an approach that includes a combination of
  • the lattice package for graphics,
  • several functions from the stats package for modeling (e.g., lm(), t.test()), and
  • the mosaic package for numerical summaries and for smoothing over edge cases and inconsistencies in the other two components.
Important in this approach is the syntactic similarity that the following “formula template” brings to all of these operations.  

    goal ( y ~ x , data = mydata, ... )


Many data analysis operations can be executed by filling in four pieces of information (goal, y, x, and mydata) with the appropriate information for the desired task. This allows students to become fluent quickly with a powerful, coherent toolkit for data analysis.
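For example, with the mosaic package loaded, the same template covers summaries, plots, and models. Here is a minimal sketch using the built-in mtcars data as a stand-in:

library(mosaic)   # loads lattice as well

mean(mpg ~ cyl, data = mtcars)            # goal: means of y for each level of x
bwplot(mpg ~ factor(cyl), data = mtcars)  # goal: side-by-side boxplots (lattice)
lm(mpg ~ cyl, data = mtcars)              # goal: a simple linear model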

Trouble in paradise
As the earlier post noted, the use of lattice has some drawbacks. While basic graphs like histograms, boxplots, scatterplots, and quantile-quantile plots are simple to make with lattice, it is challenging to combine these simple plots into more complex plots or to plot data from multiple data sources. Splitting data into subgroups and either overlaying with multiple colors or separating into sub-plots (facets) is easy, but the labeling of such plots is not as convenient (and takes more space) as in the equivalent plots made with ggplot2. And in our experience, students generally find the look of ggplot2 graphics more appealing.
On the other hand, introducing ggplot2 into a first course is challenging. The syntax tends to be more verbose, so it takes up more of the limited space on projected images and course handouts. More importantly, the syntax is entirely unrelated to the syntax used for other aspects of the course. For those adopting a “Less Volume, More Creativity” approach, ggplot2 is tough to justify.
ggformula: The third-and-a-half way
Danny Kaplan and I recently introduced ggformula, an R package that provides a formula interface to ggplot2 graphics. Our hope is that this provides the best aspects of lattice (the formula interface and lighter syntax) and ggplot2 (modularity, layering, and better visual aesthetics).
For simple plots, the only thing that changes is the name of the plotting function. Each of these functions begins with gf_. Here are two examples, either of which could replace the side-by-side boxplots made with lattice in the previous post.
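For instance (a sketch, assuming the recoded diamonds subset from the previous post is available):

library(ggformula)

gf_boxplot(price ~ col, data = recoded)
gf_violin(price ~ col, data = recoded)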
We can even overlay these two types of plots to see how they compare. To do so, we simply place what I call the "then" operator (%>%, also commonly called a pipe) between the two layers and adjust the transparency so we can see both where they overlap.
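A sketch of such an overlay (the transparency values are arbitrary):

gf_violin(price ~ col, data = recoded, alpha = 0.3) %>%
  gf_boxplot(price ~ col, data = recoded, alpha = 0.3)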

Comparing groups
Groups can be compared either by overlaying multiple groups distinguishable by some attribute (e.g., color)
or by creating multiple plots arranged in a grid rather than overlaying subgroups in the same space. The ggformula package provides two ways to create these facets. The first uses | very much like lattice does. Notice that the gf_lm() layer inherits information from the gf_point() layer in these plots, saving some typing when the information is the same in multiple layers.
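For example (a sketch along the lines just described):

# overlaid groups, distinguished by color
gf_point(price ~ carat, color = ~ col, data = recoded, alpha = 0.2) %>%
  gf_lm()

# facets specified with |, lattice-style
gf_point(price ~ carat | col, data = recoded, alpha = 0.2) %>%
  gf_lm()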


The second way adds facets with gf_facet_wrap() or gf_facet_grid() and can be more convenient for complex plots or when customization of facets is desired.
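A sketch of the same faceted plot using gf_facet_wrap():

gf_point(price ~ carat, data = recoded, alpha = 0.2) %>%
  gf_lm() %>%
  gf_facet_wrap(~ col)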
Fitting into the tidyverse workflow
ggformula also fits into a tidyverse-style workflow (arguably better than ggplot2 itself does). Data can be piped into the initial call to a ggformula function and there is no need to switch between %>% and + when moving from data transformations to plot operations.
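A sketch of that style of workflow (assuming dplyr and ggformula are loaded and the diamonds data are available):

diamonds %>%
  filter(color %in% c("D", "J")) %>%
  gf_boxplot(price ~ color)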
Summary
The “Less Volume, More Creativity” approach is based on a common formula template that has served well for several years, but the arrival of ggformula strengthens this approach by bringing a richer graphical system into reach for beginners without introducing new syntactical structures. The full range of ggplot2 features and customizations remains available, and the ggformula package vignettes and tutorials describe these in more detail.
-- Randall Pruim
July 28, 2017
 
I've been reading David Robinson's excellent blog entry "Teach the tidyverse to beginners" (http://varianceexplained.org/r/teach-tidyverse), which argues that a tidyverse approach is the best way to teach beginners.  He summarizes two competing curricula:

1) "Base R first": teach syntax such as $ and [[]], built in functions like ave() and tapply(), and use base graphics

2) "Tidyverse first": start from scratch with pipes (%>%) and leverage dplyr and use ggplot2 for graphics

If I had to choose one of these approaches, I'd also go with 2) ("Tidyverse first"), since it moves us closer to helping our students "think with data" using more powerful tools (see here for my sermon on this topic).

A third way

Of course, there’s a third option that addresses David’s imperative to "get students doing powerful things quickly".  The mosaic package was written to make R easier to use in introductory statistics courses.  The package is part of Project MOSAIC (http://mosaic-web.org), an NSF-funded initiative to integrate statistics, modeling, and computing. A paper outlining the mosaic package's "Less Volume, More Creativity" approach was recently published in the R Journal (https://journal.r-project.org/archive/2017/RJ-2017-024). To his credit, David mentions the mosaic package in a response to one of the comments on his blog.

Less Volume, More Creativity

One of the big ideas in the mosaic package is that students build on the existing formula interface in R as a mechanism to calculate summary statistics, generate graphical displays, and fit regression models. Randy Pruim has dubbed this approach "Less Volume, More Creativity".

While teaching this formula interface involves adding a new learning outcome (what is "Y ~ X"?), the mosaic approach simplifies calculation of summary statistics by groups and the generation of two or three dimensional displays on day one of an introductory statistics course (see for example Wang et al., "Data Viz on Day One: bringing big ideas into intro stats early and often" (2017), TISE).

The formula interface also prepares students for more complicated models in R (e.g., logistic regression, classification).
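For instance, a first logistic regression fits the very same template (a minimal sketch using the built-in mtcars data, chosen purely for illustration):

glm(am ~ hp, family = binomial, data = mtcars)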

Here's a simple example using the diamonds data from the ggplot2 package.  We model the relationships between two colors (D and J), number of carats, and price.

I'll begin with a bit of data wrangling to generate an analytic dataset with just those two colors. (Early in a course I would either hide the next code chunk or make the recoded dataframe accessible to the students to avoid cognitive overload.)  Note that an R Markdown file with the following commands is available for download at https://nhorton.people.amherst.edu/mosaic-blog.Rmd.

library(mosaic)
recoded <- diamonds %>%
  filter(color=="D" | color=="J") %>%
  mutate(col = as.character(color))

We first calculate the mean price (in US$) for each of the two colors.

mean(price ~ col, data = recoded)
   D    J 
3170 5324

This call is an example of how the formula interface facilitates calculation of a variable's mean for each of the levels of another variable. We see that D color diamonds tend to cost less than J color diamonds.

A particularly handy function in mosaic is favstats(), which provides a full set of summary statistics (including sample size and missing values) by group.

favstats(price ~ col, data = recoded)
col  min   Q1 median   Q3   max mean   sd    n missing
  D  357  911   1838 4214 18693 3170 3357 6775       0
  J  335 1860   4234 7695 18710 5324 4438 2808       0

A similar command can be used to generate side-by-side boxplots. Here we illustrate the use of lattice graphics. (An alternative formula-based graphics system, ggformula, will be the focus of a future post.)

bwplot(col ~ price, data = recoded)


The distributions are skewed to the right (not surprisingly since they are prices). If we wanted to formally compare these sample means we could do so with a two-sample t-test (or in a similar fashion, by fitting a linear model).

t.test(price ~ col, data = recoded)
Welch Two Sample t-test

data:  price by col
t = -20, df = 4000, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2336 -1971
sample estimates:
mean in group D mean in group J 
           3170            5324 


msummary(lm(price ~ col, data = recoded))
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3170.0       45.0    70.4   <2e-16 ***
colJ          2153.9       83.2    25.9   <2e-16 ***

Residual standard error: 3710 on 9581 degrees of freedom
Multiple R-squared:  0.0654, Adjusted R-squared:  0.0653 
F-statistic:  670 on 1 and 9581 DF,  p-value: <2e-16

The results from the two approaches are consistent: the group differences are highly statistically significant. We could conclude that J diamonds tend to cost more than D diamonds in the population of all diamonds.

Let's do a quick review of the mosaic modeling syntax to date:
mean(price ~ col)
bwplot(price ~ col)
t.test(price ~ col)
lm(price ~ col)

See the pattern?

On a statistical note, it's important to remember that the diamonds were not randomized into colors: this is a found (observational) dataset, so there may be other factors at play. The revised GAISE College report reiterates the importance of multivariate thinking in intro stats.

Moving to three dimensions

Let's continue with the "Less Volume, More Creativity" approach to bring in a third variable: the number of carats in each diamond.

xyplot(price ~ carat, groups=col, auto.key=TRUE, type=c("p", "r"), data = recoded)
We see that controlling for the number of carats, the D color diamonds tend to sell for more than the J color diamonds. We can confirm this by fitting a regression model that controls for both variables (and then displaying the resulting predicted values from this parallel slopes model using plotModel()).
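A sketch of that model (plotModel() comes from the mosaic package; the object name is arbitrary):

parallel_model <- lm(price ~ carat + col, data = recoded)
msummary(parallel_model)
plotModel(parallel_model)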
This is a great example of Simpson's paradox: accounting for the number of carats has yielded opposite results from a model that didn't include carats. If we were to move forward with such an analysis we'd need to be sure to undertake an assessment of our model and verify conditions and assumptions (but for the purpose of this blog entry I'll defer that).

Moving beyond mosaic

The revised GAISE College report enunciated the importance of technology when teaching statistics. Many courses still use calculators or web-based applets to incorporate technology into their classes. R is an excellent environment for teaching statistics, but many instructors feel uncomfortable using it (particularly if they feel compelled to teach the $ and [[]] syntax, which many find off-putting). The mosaic approach helps make the use of R feasible for many audiences by keeping things simple.

It's unfortunately true that many introductory statistics courses don't move beyond bivariate relationships (so students may feel paralyzed about what to do about other factors). The mosaic approach has the advantage that it can bring multivariate thinking, modeling, and exploratory data tools together with a single interface (and a modest degree of difficulty in terms of syntax). I've been teaching multiple regression as a descriptive method early in an intro stat course for the past ten years (and it helps to get students excited about material that they haven't seen before).

The mosaic approach also scales well: it's straightforward to teach students dplyr/tidyverse data wrangling by adding in the pipe operator and some key data idioms. (So perhaps the third option should be labeled "mosaic and tidyverse".)

See the following for an example of how favstats() can be replaced by dplyr idioms. 

recoded %>%
  group_by(col) %>%
  summarize(meanval = mean(price, na.rm = TRUE))
col   meanval
D        3170
J        5324
That being said, I suspect that many students (and instructors) will still use favstats() for simple tasks (e.g., to check sample sizes or look for missing data). I know that I do. But the important thing is that, unlike training wheels, mosaic doesn't hold them back when they want to learn new things.

I'm a big fan of ggplot2, but even Hadley agrees that the existing syntax is not what he wants it to be. While it's not hard to learn to use + to glue together multiple graphics commands and to get your head around aesthetics, teaching ggplot2 adds several additional learning outcomes to a course that's already overly pregnant with them.
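For comparison, here is roughly what the earlier xyplot() display looks like in ggplot2 syntax (a sketch):

library(ggplot2)
ggplot(recoded, aes(x = carat, y = price, color = col)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE)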


Side note

I would argue that a lot of what is in mosaic should have been in base R (e.g., a formula interface to mean() and a data= option for mean()). Other parts are more focused on teaching (e.g., plotModel(), xpnorm(), and resampling with the do() function).
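To give a flavor of the latter, here is a minimal sketch (the arguments to xpnorm() are arbitrary, and recoded is the data frame defined above):

xpnorm(700, mean = 500, sd = 100)                    # annotated normal probability
do(1000) * mean(~ price, data = resample(recoded))   # 1000 bootstrap means via do()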

Closing thoughts

In summary, I argue that the mosaic approach is consistent with the tidyverse. It dovetails nicely with David's "Teach tidyverse" as an intermediate step that may be more accessible for undergraduate audiences without a strong computing background.  I'd encourage people to check it out (and let Randy, Danny, and me know if there are ways to improve the package).

Want to learn more about mosaic?  In addition to the R Journal paper referenced above, you can see how we get students using R quickly in the package's "Less Volume, More Creativity" and "Minimal R" vignettes.  We also provide curated examples from commonly used textbooks in the “mosaic resources” vignette and a series of freely downloadable and remixable monographs including The Student’s Guide to R and Start Teaching with R.
July 26, 2017
 
One of the biggest challenges educators face is how to teach statistical thinking integrated with data and computing skills to allow our students to fluidly think with data.  Contemporary data science requires a tight integration of knowledge from statistics, computer science, mathematics, and a domain of application. For example, how can one model high earnings as a function of other features that might be available for a customer? How do the results of a decision tree compare to a logistic regression model? How does one assess whether the underlying assumptions of a chosen model are appropriate?  How are the results interpreted and communicated? 


While there are a lot of other useful textbooks and references out there (e.g., R for Data Science, Practical Data Science with R, Intro to Data Science with Python) we saw a need for a book that incorporates statistical and computational thinking to solve real-world problems with data.  The result was Modern Data Science with R, a comprehensive data science textbook for undergraduates that features meaty, real-world case studies integrated with modern data science methods.  (Figure 8.2 above was taken from a case study in the supervised learning chapter.)

Part I (introduction to data science) motivates the book and provides an introduction to data visualization, data wrangling, and ethics.  Part II (statistics and modeling) begins with fundamental concepts in statistics, supervised learning, unsupervised learning, and simulation.  Part III (topics in data science) reviews dynamic visualization, SQL, spatial data, text as data, network statistics, and moving towards big data.  A series of appendices cover the mdsr package, an introduction to R, algorithmic thinking, reproducible analysis, multiple regression, and database creation.

We believe that several features of the book are distinctive:
  1. minimal prerequisites: while some background in statistics and computing is ideal, appendices provide an introduction to R, how to write a function, and key statistical topics such as multiple regression
  2. ethical considerations are raised early, to motivate later examples
  3. recent developments in the R ecosystem (e.g., RStudio and the tidyverse) are featured
Rather than focus exclusively on case studies or programming syntax, this book illustrates how statistical programming in R/RStudio can be leveraged to extract meaningful information from a variety of data in the service of addressing compelling statistical questions.  


This book is intended to help readers with some background in statistics and modest prior experience with coding develop and practice the appropriate skills to tackle complex data science projects. We've taught a variety of courses using it, ranging from an introduction to data science and a sophomore-level data science course to a component of a senior capstone class.
We've made three chapters freely available for download: data wrangling I, data ethics, and an introduction to multiple regression. An instructor's solutions manual is available, and we're working to create a series of lab activities (e.g., text as data). (The code to generate the above figure can be found in the supervised learning materials at http://mdsr-book.github.io/instructor.html.)
(Book cover: Modern Data Science with R)

An unrelated note about aggregators:
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
January 6, 2016
 

We're pleased to announce that a special issue of the American Statistician on "Statistics and the Undergraduate Curriculum" (November, 2015) is available at http://amstat.tandfonline.com/toc/utas20/69/4.

Johanna Hardin (Pomona College) and Nick were the guest editors. There are a number of excellent and provocative papers that reinforce the importance of computing using tools such as R and SAS that are likely to be of widespread interest to the community.


Table of Contents

Teaching the Next Generation of Statistics Students to “Think With Data”: Special Issue on Statistics and the Undergraduate Curriculum : Nicholas J. Horton & Johanna S. Hardin, DOI:10.1080/00031305.2015.1094283 (freely available)

Mere Renovation is Too Little Too Late: We Need to Rethink our Undergraduate Curriculum from the Ground Up George Cobb, DOI:10.1080/00031305.2015.1093029 (freely available for a limited period)

Teaching Statistics at Google-Scale: Nicholas Chamandy, Omkar Muralidharan & Stefan Wager, DOI:10.1080/00031305.2015.1089790

Explorations in Statistics Research: An Approach to Expose Undergraduates to Authentic Data Analysis by Deborah Nolan & Duncan Temple Lang, DOI:10.1080/00031305.2015.1073624

Beyond Normal: Preparing Undergraduates for the Work Force in a Statistical Consulting Capstone by Byran J. Smucker & A. John Bailer, DOI:10.1080/00031305.2015.1077731

A Framework for Infusing Authentic Data Experiences Within Statistics Courses: Scott D. Grimshaw, DOI:10.1080/00031305.2015.1081106

Fostering Conceptual Understanding in Mathematical Statistics: Jennifer L. Green & Erin E. Blankenship, DOI:10.1080/00031305.2015.1069759

The Second Course in Statistics: Design and Analysis of Experiments? by Natalie J. Blades, G. Bruce Schaalje & William F. Christensen, DOI:10.1080/00031305.2015.1086437

A Data Science Course for Undergraduates: Thinking With Data: Ben Baumer, DOI:10.1080/00031305.2015.1081105

Data Science in Statistics Curricula: Preparing Students to “Think with Data” : J. Hardin, R. Hoerl, Nicholas J. Horton, D. Nolan, B. Baumer, O. Hall-Holt, P. Murrell, R. Peng, P. Roback, D. Temple Lang & M. D. Ward, DOI:10.1080/00031305.2015.1077729

Using Online Game-Based Simulations to Strengthen Students’ Understanding of Practical Statistical Issues in Real-World Data Analysis: Shonda Kuiper & Rodney X. Sturdivant, DOI:10.1080/00031305.2015.1075421

Combating Anti-Statistical Thinking Using Simulation-Based Methods Throughout the Undergraduate Curriculum: Nathan Tintle, Beth Chance, George Cobb, Soma Roy, Todd Swanson & Jill VanderStoep, DOI:10.1080/00031305.2015.1081619

What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum : Tim C. Hesterberg, DOI:10.1080/00031305.2015.1089789

Incorporating Statistical Consulting Case Studies in Introductory Time Series Courses: Davit Khachatryan, DOI:10.1080/00031305.2015.1026611

Developing a New Interdisciplinary Computational Analytics Undergraduate Program: A Qualitative-Quantitative-Qualitative Approach: Scotland Leman, Leanna House & Andrew Hoegh, DOI:10.1080/00031305.2015.1090337

From Curriculum Guidelines to Learning Outcomes: Assessment at the Program Level by Beth Chance & Roxy Peck, DOI:10.1080/00031305.2015.1077730

Program Assessment for an Undergraduate Statistics Major: Allison Amanda Moore & Jennifer J. Kaplan, DOI:10.1080/00031305.2015.1087331

The Cobb paper ("Mere Renovation is Too Little Too Late: We Need to Rethink Our Undergraduate Curriculum from the Ground Up") has an associated discussion with 19 provocative responses plus George's spirited rejoinder. These materials can be found on the TAS website or individually at http://www.amherst.edu/~nhorton/mererenovation/.

Discussion (and rejoinder):

  • Response from Albert and Glickman: Attracting undergraduates to statistics through data science
  • Response from Chance, Peck, and Rossman: Response to mere renovation is too little too late
  • Response from De Veaux and Velleman: Teaching statistics algorithmically or stochastically misses the point: why not teach holistically?
  • Response from Fisher and Bailar: Who, what, when and how: changing the undergraduate statistics curriculum
  • Response from Franklin: We need to rethink the way we teach statistics at K-12
  • Response from Gelman and Loken: Moving forward in statistics education while avoiding overconfidence
  • Response from Gould: Augmenting the vocabulary used to describe data
  • Response from Holcomb, Quinn, and Short: Seeking the niche for traditional mathematics within undergraduate statistics and data science curricula
  • Response from Kass: The gap between statistics education and statistical practice
  • Response from King: Training the next generation of statistical scientist
  • Response from Lane-Getaz: Stirring the curricular pot once again
  • Response from Notz: Vision or bad dream?
  • Response from Ridgway: Data Cowboys and Statistical Indians
  • Response from Temple Lang: Authentic data analysis experience
  • Response from Utts: Challenges, changes and choices in the undergraduate statistics curriculum
  • Response from Ward: Learning communities and the undergraduate statistics curriculum
  • Response from Wickham: Teaching Safe-Stats, not statistical abstinence
  • Response from Wild: Further, faster, wider
  • Response from Zieffler: Teardowns, historical renovation, and paint-and-patch: curricular changes and faculty development
  • Rejoinder by Cobb


An unrelated note about aggregators:
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
February 2, 2015
 
Here's a timely guest entry from Jeffrey Witmer (Oberlin College).

As the “Deflate Gate” saga was unfolding, Warren Sharp analyzed “touches per fumble” for NFL teams before and after 2006, when a rule was changed so that teams playing on the road could provide their own footballs (http://www.sharpfootballanalysis.com/blog/). Sharp noted that, for whatever reason, the Patriots went from being a typical team, as regards fumbling, to a team with a very low fumble rate.

Rather than rely on the data that Sharp provides at his website, I chose to collect and analyze some data on my own. I took a random sample of 30 games played by New England and 30 other games. For each game, I recorded all rushing and passing plays (except for QB kneels), but excluded kicking plays (the NFL, rather than the teams, provides special footballs for those plays). (Data source: http://www.pro-football-reference.com/play-index/play_finder.cgi.) I also recorded the weather for the game. (Data source: http://www.wunderground.com/history/.) Once I had the data (in a file that I called AllBig, which can be downloaded from http://www.amherst.edu/~nhorton/AllBig.csv), I noted whether or not there was a fumble on each play, aided by the grep() command:

grep("Fumb", AllBig$Detail, ignore.case=TRUE)
I labeled each play as Late or not according to whether it happened after the rule change:

AllBig$Late <- ifelse(AllBig$Year > 2006, 1, 0)
Now for the analysis. My data set has 7558 plays including 145 fumbles (1.9%). I used the mosaic package and the tally() command to see how often teams other than the Patriots fumble:

require(mosaic)
tally(~Fumble+Late, data=filter(AllBig,Pats==0))

      Late
Fumble    0    1
     0 2588 2919
     1   54   65
Then I asked for the data in proportion terms:

tally(Fumble~Late, data=filter(AllBig,Pats==0))
and got

      Late
Fumble      0      1
     0 0.9796 0.9782
     1 0.0204 0.0218
For non-Pats there is a tiny increase in fumbles. This can be displayed graphically using a mosaicplot (though it's not a particularly compelling figure).

mosaicplot(Fumble~Late, data=filter(AllBig,Pats==0))

Repeating this for the Patriots shows a different picture:

tally(~Fumble+Late, data=filter(AllBig,Pats==1))
      Late
Fumble   0   1
     0 996 910
     1  19   7


tally(Fumble~Late, data=filter(AllBig,Pats==1))
      Late
Fumble       0       1
     0 0.98128 0.99237
     1 0.01872 0.00763
I fit a logistic regression model with the glm() command:

glm(Fumble~Late*Pats, family=binomial, data=AllBig)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.8697 0.1375 -28.14 <2e-16 ***
Late 0.0650 0.1861 0.35 0.727
Pats -0.0897 0.2693 -0.33 0.739
Late:Pats -0.9733 0.4819 -2.02 0.043 *
I wanted to control for any weather effect, so I coded the weather as Bad if it was raining or snowing and good if not. This led to a model that includes BadWeather and Temperature – which turn out not to make much of a difference:

AllBig$BadWeather <- ifelse(AllBig$Weather %in% c("drizzle","rain","snow"), 1, 0)

glm(formula = Fumble ~ BadWeather + Temp + Late * Pats,
family = binomial, data = AllBig)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.23344 0.43164 -9.81 <2e-16 ***
BadWeather 0.33259 0.29483 1.13 0.26
Temp 0.00512 0.00612 0.84 0.40
Late 0.08871 0.18750 0.47 0.64
Pats -0.14183 0.27536 -0.52 0.61
Late:Pats -0.91062 0.48481 -1.88 0.06 .
Because there was suspicion that something changed starting in 2007, I added home/away status (IsAway) and its three-way interaction with Late and Pats:

glm(formula = Fumble ~ BadWeather + Temp + IsAway * Late * Pats,
family = binomial, data = AllBig)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.51110 0.47707 -9.46 <2e-16 ***
BadWeather 0.34207 0.30013 1.14 0.254
Temp 0.00831 0.00653 1.27 0.203
IsAway 0.14791 0.27549 0.54 0.591
Late 0.13111 0.26411 0.50 0.620
Pats -0.80019 0.54360 -1.47 0.141
IsAway:Late -0.07348 0.37463 -0.20 0.845
IsAway:Pats 0.94335 0.63180 1.49 0.135
Late:Pats 0.51536 0.71379 0.72 0.470
IsAway:Late:Pats -3.14345 1.29480 -2.43 0.015 *
There is some evidence here that the Patriots fumble less than the rest of the NFL and that things changed in 2007. The p-values above are based on asymptotic normality, but there is a cleaner and easier way to think about the Patriots’ fumble rate. I wrote a short simulation that mimics something I do in my statistics classes, where I use a physical deck of cards to show what each step in the R simulation is doing.

#Simulation of deflategate data null hypothesis
Late = rep(1,72) #creates 72 late fumbles
Early = rep(0,73) #creates 73 early fumbles
alldata = append(Late,Early) #puts the two groups together
table(alldata) #check to see that we have what we want

cards =1:length(alldata) # creates 145 cards, one "ID number" per fumble

FumbleLate = NULL  # initializes a vector to hold the results
for (i in 1:10000) {  # starts a loop that will be executed 10,000 times
  cardsgroup1 = sample(cards, 119, replace=FALSE)  # takes a sample of 119 cards
  cardsgroup2 = cards[-cardsgroup1]  # puts the remaining cards in group 2
  NEPats = alldata[cardsgroup2]  # reads the values of the cards in group 2
  FumbleLate[i] = sum(NEPats)  # counts NEPats late fumbles (the only stat we need)
}

table(FumbleLate) #look at the results
hist(FumbleLate, breaks=seq(2.5,23.5)) #graph the results

sum(FumbleLate <= 7)/10000 # How rare is 7 (or fewer)? Answer: around 0.0086
Additional note: kudos to Steve Taylor for the following graphical depiction of the interaction.


An unrelated note about aggregators:
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
November 18, 2014
 
A recent post pointed us to a great talk that elegantly described how inferences from a trial could be analyzed with a purely resampling-based approach. The talk uses data from a paper that considered the association between beer consumption and mosquito attraction. We recommend the talk linked above for those thinking about creative ways to teach inference.

In this entry, we demonstrate how straightforward it is to replicate these analyses in R, and show how they can be done in SAS.

R

We'll repeat the exercise twice in R: first using the mosaic package that Nick and colleagues have been developing to help teach statistics, and then in base R.

For mosaic, we begin by entering the data and creating a dataframe. The do() operator and the shuffle() function facilitate carrying out a permutation test (see also section 5.4.5 of the book). 

beer = c(27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20)
water = c(21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20)

ds = data.frame(y = c(beer, water),
                x = c(rep("beer", length(beer)), rep("water", length(water))))
require(mosaic)
obsdiff = compareMean(y ~ x, data=ds)
nulldist = do(999)*compareMean(y ~ shuffle(x), data=ds)
histogram(~ result, xlab="permutation differences", data=nulldist)
ladd(panel.abline(v=obsdiff, col="red", lwd=2))

> obsdiff
[1] -4.377778
> tally(~ abs(result) > abs(obsdiff), format="percent", data=nulldist)

TRUE FALSE
0.1 99.9

The do() operator evaluates the expression on the right hand side a specified number of times.  In this case we shuffle (or permute) the group indicators. 

We observe a mean difference of 4.4 attractions (comparing the beer to water groups). The histogram of the results-- plotted with the lattice graphics package that mosaic loads by default-- demonstrates that this result would be highly unlikely to occur by chance: if the null hypothesis that the groups were equal were true, results more extreme than this would happen only 1 time out of 1000. This can be displayed using the tally() function, which adds some functionality to table(). We can calculate the p-value by including the observed statistic in both the numerator and the denominator: (1+1)/(999 + 1) = 0.002.
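In code, that calculation is (a sketch using the nulldist and obsdiff objects created above):

(1 + sum(abs(nulldist$result) >= abs(obsdiff))) / (999 + 1)   # should reproduce the 0.002 above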

For those not invested in the mosaic package, base R functions can be used to perform this analysis. We present a version here that begins after making the data vectors.
alldata = c(beer, water)
labels = c(rep("beer", length(beer)), rep("water", length(water)))
obsdiff = mean(alldata[labels=="beer"]) - mean(alldata[labels=="water"])

> obsdiff
[1] -4.377778

The sample() function re-orders the labels, effectively implementing the supposition that the number of bites might have happened under either the water or the beer drinking regimen.
resample_labels = sample(labels)
resample_diff = mean(alldata[resample_labels=="beer"]) -
  mean(alldata[resample_labels=="water"])

resample_diff
[1] 1.033333

In a teaching setting, the preceding code could be re-run several times, to mimic the presentation seen in the video linked above. To repeat many times, the most suitable base R tool is replicate(). To use it, we make a function of the resampling procedure shown above.
resamp_means = function(data, labs){
  resample_labels = sample(labs)
  resample_diff = mean(data[resample_labels=="beer"]) -
    mean(data[resample_labels=="water"])
  return(resample_diff)
}

nulldist = replicate(9999,resamp_means(alldata,labels))

hist(nulldist, col="cyan")
abline(v = obsdiff, col = "red")

The histogram is shown above. The p-value is obtained by counting the proportion of statistics (including the actual observed difference) whose absolute value is greater than or equal to that of the observed statistic:

alldiffs = c(obsdiff, nulldist)
p = sum(abs(alldiffs) >= abs(obsdiff)) / 10000


SAS

The SAS code is relatively cumbersome in comparison. We begin by reading the data in, using the "line hold" double-ampersand (@@) and the infile datalines statement that allows us to specify a delimiter (other than a space) when reading data in directly in a data step. This lets us copy the data from the R code. To identify the water and beer regimen subjects, we use the _n_ implied variable that SAS creates but does not save with the data.

The summary procedure generates the mean for each group and saves the results in a data set with a row for each group; the transpose procedure makes a data set with a single row and a variable for each group mean. Finally, we calculate the observed difference and use call symput to make it into a macro variable for later use.
data bites;
if _n_ le 18 then drink="water";
else drink="beer";
infile datalines delimiter=',';
input bites @@;
datalines;
21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20
27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24
28, 24, 29, 21, 21, 18, 27, 20
;
run;

proc summary nway data = bites;
class drink;
var bites;
output out=obsmeans mean=mean;
run;

proc transpose data = obsmeans out=om2;
var mean;
id drink;
run;

data om3;
set om2;
obsdiff = beer-water;
call symput('obsdiff',obsdiff);
run;

proc print data = om3; var obsdiff; run;

Obs obsdiff
1 4.37778
(Yes, we could have done this with proc ttest and ODS. But the spirit of the video is that we don't understand t-tests, so we want to avoid them.)

To rerandomize, we can assign a random number to each row, sort on the random number, and assign drink labels based on the new order of the data.
data rerand;
set bites;
randorder = uniform(0);
run;

proc sort data = rerand; by randorder; run;

data rerand2;
set rerand;
if _n_ le 18 then redrink = "water";
else redrink = "beer";
run;

proc summary nway data = rerand2;
class redrink;
var bites;
output out=rerand3 mean=mean;
run;

proc transpose data = rerand3 out=rerand4;
var mean;
id redrink;
run;

data rrdiff;
set rerand4;
rrdiff = beer-water;
run;

proc print data = rrdiff; var rrdiff; run;

Obs rrdiff
1 -1.73778
One way to repeat this a bunch of times would be to make a macro out of the above and collect the resulting rrdiff into a data set. Instead, we use the surveyselect procedure to do this much more efficiently. The groups option samples groups of 18 and 25 from the data, while the reps option requests this be done 9,999 times. We can then use the summary and transpose procedures as before, with the addition of the by replicate statement to make a data set with columns for each group mean and a row for each replicate.
proc surveyselect data = bites groups=(18,25) reps = 9999 out = ssresamp; run;

proc summary nway data = ssresamp;
by replicate;
class groupid;
var bites;
output out=ssresamp2 mean=mean;
run;

proc transpose data = ssresamp2 out=ssresamp3 prefix=group;
by replicate;
var mean;
id groupid;
run;
To get a p-value and make a histogram, we use the macro variable created earlier.
data ssresamp4;
set ssresamp3;
diff = group2 - group1;
exceeds = abs(diff) ge &obsdiff;
run;

proc means data = ssresamp4 sum; var exceeds; run;

The MEANS Procedure
Analysis Variable : exceeds
Sum
9.0000000

proc sgplot data = ssresamp4;
histogram diff;
refline &obsdiff /axis=x lineattrs=(thickness=2 color=red);
run;
The p-value is 0.001 (= (9+1)/10000).


An unrelated note about aggregators:
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. With exceptions noted above, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
August 15, 2012
 

For those of you who teach, or are interested in seeing an illustrated series of analyses, there is a new compendium of files to help describe how to fit models for the extended case studies in the Second Edition of the Statistical Sleuth: A Course in Methods of Data Analysis (2002), the excellent text by Fred Ramsey and Dan Schafer. If you are using this book, or would like to see straightforward ways to undertake analyses in R for intro and intermediate statistics courses, these may be of interest.

These files can be found here. The site includes both formatted pdf files as well as the original knitr files which were used to generate the output. Knitr is an elegant, flexible and fast means to undertake reproducible analysis and dynamic report generation within R and RStudio.

This work leverages efforts undertaken by Project MOSAIC, an NSF-funded initiative to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the mosaic package, which was written to simplify the use of R for introductory statistics courses.

An unrelated note about aggregators:
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
July 17, 2012
 

Dynamite plots are a somewhat pejorative term for a graphical display where the height of a bar indicates the mean, and the vertical line on top of it represents the standard deviation (or standard error). These displays are commonly found in many scientific disciplines, as a way of communicating group differences in means.

Many find these displays troubling. One post called them unmitigated evil.
The Vanderbilt University Department of Biostatistics has a formal policy discouraging the use of these plots, stating that:

Dynamite plots often hide important information. This is particularly true of small or skewed data sets. Researchers are highly discouraged from using them, and department members have the option to decline participation in papers in which the lead author requires the use of these plots.

Despite the limitations of the display, we believe that there may be times when the display is helpful as a way to compare groups, assuming distributions that are approximately normal. Samuel Brown also described creation of these figures, as a way to encourage computing in R. We previously demonstrated how to create them in SAS and R, and today discuss code created by Randall Pruim to demonstrate how such graphics can be created using lattice graphics within the mosaic package.

R

library(mosaic)
library(grid)   # for grid.polyline(), grid.text(), arrow(), and unit()

dynamitePlot <- function(height, error, names = as.character(1:length(height)),
                         significance = NA, ylim = c(0, maxLim), ...) {
  if (missing(error)) { error <- 0 }
  maxLim <- 1.2 * max(mapply(sum, height, error))
  mError <- min(error, na.rm = TRUE)   # smallest error bar, used to scale the arrowheads
  barchart(height ~ names, ylim = ylim, panel = function(x, y, ...) {
    panel.barchart(x, y, ...)
    grid.polyline(c(x, x), c(y, y + error), id = rep(x, 2), default.units = 'native',
                  arrow = arrow(angle = 45, length = unit(mError, 'native')))
    grid.polyline(c(x, x), c(y, y - error), id = rep(x, 2), default.units = 'native',
                  arrow = arrow(angle = 45, length = unit(mError, 'native')))
    grid.text(x = x, y = y + error + .05 * maxLim, label = significance,
              default.units = 'native')
  }, ...)
}

Much of the code involves setting up the appropriate axis limits, then drawing the lines and adding the text. We can call this new function with an artificial example with 4 groups:

Values <- c(1,2,5,4)
Errors <- c(0.25, 0.5, 0.33, 0.12)
Names <- paste("Trial", 1:4)
Sig <- c("a", "a", "b", "b")
dynamitePlot(Values, Errors, names=Names, significance=Sig)

We still don't recommend frequent use of these plots (other displays, such as dotplots for very small sample sizes or violin plots, may be better), but having the capability to generate dynamite plots within the lattice framework can be handy.

An unrelated note about aggregators:
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
July 10, 2012
 

While traditional statistics courses teach students to calculate intervals and test for binomial proportions using a normal or t approximation, this method does not always work well. Agresti and Coull ("Approximate is better than 'exact' for interval estimation of binomial proportions", The American Statistician, 1998; 52:119-126) demonstrated this and reintroduced an improved (Plus4) estimator originally due to Wilson (1927).

In this entry, we demonstrate how the coverage varies as a function of the underlying probability and compare four intervals: (1) t interval, (2) Clopper-Pearson, (3) Plus4 (Wilson/Agresti/Coull) and (4) Score, using code contributed by Albyn Jones from Reed College. Here, the coverage probability is defined as the probability that a CI based on an observed value will cover the true binomial parameter that generated that value. The code calculates the coverage probability as a function of a given binomial probability p and a sample size n. Intervals are created for each of the possible outcomes from 0, ..., n, then checked to see if the intervals include the true value. Finally, the sum of the probabilities of observing outcomes in which the binomial parameter is included in the interval determines the exact coverage. Note several distinct probabilities: (1) the binomial parameter "p", the probability of success on a trial; (2) the probability of observing x successes in N trials, P(X=x); (3) the coverage probability as defined above. For distribution quantiles and probabilities, see section 1.10 and table 1.1.
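In its simplest form (an illustrative sketch for a single interval type, the Clopper-Pearson from binom.test(), and a single choice of n and p), the calculation just described is:

n = 50; p = 0.10
covered = sapply(0:n, function(x) {
  ci = binom.test(x, n)$conf.int   # CI based on observing x successes
  ci[1] <= p & p <= ci[2]          # does it cover the true p?
})
sum(dbinom(0:n, n, p) * covered)   # sum of P(X=x) * I(p in CI)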

R

We begin by defining the support functions which will be used to calculate the coverage probabilities for a specific probability and sample size.

CICoverage = function(n, p, alpha=.05) {
  # set up a table to hold the results
  Cover = matrix(0, nrow=n+1, ncol=5)
  colnames(Cover) = c("X", "ClopperPearson", "Plus4", "Score", "T")
  Cover[,1] = 0:n
  zq = qnorm(1-alpha/2)
  tq = qt(1-alpha/2, n-1)
  for (i in 0:n) {
    Phat = i/n
    P4 = (i+2)/(n+4)
    # Calculate T and plus4 intervals manually,
    # use canned functions for the other
    TInt = Phat + c(-1,1)*tq*sqrt(Phat*(1-Phat)/n)
    P4Int = P4 + c(-1,1)*zq*sqrt(P4*(1-P4)/(n+4))
    CPInt = binom.test(i,n)$conf.int
    SInt = prop.test(i,n)$conf.int
    # check to see if the binomial p is in each CI
    Cover[i+1,2] = InInt(p, CPInt)
    Cover[i+1,3] = InInt(p, P4Int)
    Cover[i+1,4] = InInt(p, SInt)
    Cover[i+1,5] = InInt(p, TInt)
  }
  # probability that X=x
  p = dbinom(0:n, n, p)
  ProbCover = rep(0, 4)
  names(ProbCover) = c("ClopperPearson", "Plus4", "Score", "T")
  # sum P(X=x) * I(p in CI from x)
  for (i in 1:4) {
    ProbCover[i] = sum(p*Cover[,i+1])
  }
  list(n=n, p=p, Cover=Cover, PC=ProbCover)
}

In addition, we define a function to determine whether something is in the interval.

InInt = function(p, interval){
  interval[1] <= p && interval[2] >= p
}

Finally, there's a function which summarizes the results.

CISummary = function(n, p) {
  M = matrix(0, nrow=length(n)*length(p), ncol=6)
  colnames(M) = c("n", "p", "ClopperPearson", "Plus4", "Score", "T")
  k = 0
  for (N in n) {
    for (P in p) {
      k = k+1
      M[k,] = c(N, P, CICoverage(N, P)$PC)
    }
  }
  data.frame(M)
}

We then generate the CI coverage plot provided at the start of the entry, which uses sample size n=50 across a variety of probabilities.

lwdval = 2
nvals = 50
probvals = seq(.01, .30, by=.001)
results = CISummary(nvals, probvals)
plot(range(probvals), c(0.85, 1), type="n", xlab="true binomial p",
ylab="coverage probability")
abline(h=0.95, lty=2)
lines(results$p, results$ClopperPearson, col=1, lwd=lwdval)
lines(results$p, results$Plus4, col=2, lwd=lwdval)
lines(results$p, results$Score, col=3, lwd=lwdval)
lines(results$p, results$T, col=4, lwd=lwdval)
tests = c("ClopperPearson", "Plus4", "Score", "T")
legend("bottomright", legend=tests,
col=1:4, lwd=lwdval, cex=0.70)

The resulting plot is quite interesting, and demonstrates how non-linear the coverage is for these methods, and how the t (almost equivalent to the normal, in this case) is anti-conservative in many cases. It also confirms the results of Agresti and Coull, who concluded that for interval estimation of a proportion, coverage probabilities tend to be too large when inverting the "exact" binomial test and too small when inverting the Wald large-sample normal test, with the Plus4 approach yielding coverage probabilities close to the desired level, even for very small sample sizes.

SAS
Calculating the coverage probability for a given N and binomial p can be done in a single data step, summing the probability-weighted coverage indicators over the realized values of the random variate. Once this machinery is developed, we can call it repeatedly, using a macro, to find the results for different binomial p. We comment on the code internally.

%macro onep(n=,p=,alpha=.05);
data onep;
n = &n;
p = &p;
alpha = &alpha;
/* set up collectors of the weighted coverage indicators */
expcontrib_t = 0;
expcontrib_p4 = 0;
expcontrib_s = 0;
expcontrib_cp = 0;
/* loop through the possible observed successes x*/
do x = 0 to n;
probobs = pdf('BINOM',x,p,n); /* probability X=x */
phat = x/n;
zquant = quantile('NORMAL', 1 - alpha/2, 0, 1);
p4 = (x+2)/(n + 4);

/* calculate the half-width of the t and plus4 intervals */
thalf = quantile('T', 1 - alpha/2,(n-1)) * sqrt(phat*(1-phat)/n);
p4half = zquant * sqrt(p4*(1-p4)/(n+4));

/* the score CI in R uses a Yates correction by default, and is
reproduced here */
yates = min(0.5, abs(x - (n*p)));
z22n = (zquant**2)/(2*n);
yl = phat-yates/n;
yu = phat+yates/n;
slower = (yl + z22n - zquant * sqrt( (yl*(1-yl)/n) + z22n / (2*n) )) /
(1 + 2 * z22n);
supper = (yu + z22n + zquant * sqrt( (yu*(1-yu)/n) + z22n / (2*n) )) /
(1 + 2 * z22n);

/* cover = 1 if in the CI, 0 else */
cover_t = ((phat - thalf) < p) and ((phat + thalf) > p);
cover_p4 = ((p4 - p4half) < p) and ((p4 + p4half) > p);
cover_s = (slower < p) and (supper > p);
/* the Clopper-Pearson interval can be easily calculated on the fly */
cover_cp = (quantile('BETA', alpha/2 ,x,n-x+1) < p) and
(quantile('BETA', 1 - alpha/2 ,x+1,n-x) > p);

/* cumulate the weighted probabilities */
expcontrib_t = expcontrib_t + probobs * cover_t;
expcontrib_p4 = expcontrib_p4 + probobs * cover_p4;
expcontrib_s = expcontrib_s + probobs * cover_s;
expcontrib_cp = expcontrib_cp + probobs * cover_cp;
/* only save the last iteration */
if x = N then output;
end;
run;
%mend onep;

The following macro calls the first one for a series of binomial p for a fixed N. Since the macro %do loop can only iterate through integers, we have to do a little division; the %sysevalf function will do this within the macro.

%macro repcicov(n=, lop=, hip=, byp=, alpha= .05);
/* need an empty data set to store the results */
data summ; set _null_; run;
%do stepp = %sysevalf(&lop / &byp, floor) %to %sysevalf(&hip / &byp,floor);
/* note that the p sent to the %onep macro is a
text string like "49 * .001" */
%onep(n = &n, p = &stepp * &byp, alpha = &alpha);
/* tack on the current results to the ones finished so far */
/* this is a simple but inefficient way to add each binomial p into
the output data set */
data summ; set summ onep; run;
%end;
%mend repcicov;

/* same parameters as in R */
%repcicov(n=50, lop = .01, hip = .3, byp = .001);

Finally, we can plot the results. One option shown here and not mentioned in the book is the mode=include option to the symbol statement, which allows the two distinct pieces of the T coverage to display correctly.

goptions reset=all;
legend1 label=none position=(bottom right inside)
mode=share across=1 frame value = (h=2);
axis1 order = (0.85 to 1 by 0.05) minor=none
label = (a=90 h=2 "Coverage probability") value=(h=2);
axis2 order = (0 to 0.3 by 0.05) minor=none
label = (h=2 "True binomial p") value=(h=2);
symbol1 i = j v = none l =1 w=3 c=blue mode=include;
symbol2 i = j v = none l =1 w=3 c=red;
symbol3 i = j v = none l =1 w=3 c=lightgreen;
symbol4 i = j v = none l =1 w=3 c=black;
proc gplot data = summ;
plot (expcontrib_t expcontrib_p4 expcontrib_s expcontrib_cp) * p
/ overlay legend vaxis = axis1 haxis = axis2 vref = 0.95 legend = legend1;
label expcontrib_t = "T approximation" expcontrib_p4 = "P4 method"
expcontrib_s = "Score method" expcontrib_cp = "Exact (CP)";
run; quit;




An unrelated note about aggregators:
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.