From LifeHacker: avoiding basic errors when interpreting stats

October 21, 2010
It's like I was saying earlier: when used for good, statistics can inform your sound decisions and opinions. But stats can be used to mislead, as well. LifeHacker offers some basic guidance on this subject. And what if you already have the numbers you want to share, but you want some irrelevant facts to go with them? Visit and find some facts to go along with your measures. My favorite is the number 1: the population of New Amsterdam, IN. It's also the number of dollars needed to buy everyone in New Amsterdam, IN, a hot dog at the 7-Eleven. That would be -- you guessed it -- one dollar.

The First Step to a New Profile Application

October 21, 2010
Contributed by Marci Russell, Web Project Manager, SAS
For years (more than 10, in fact), we’ve provided a way for users to log in to the and SAS Web sites to make your experience on our sites more relevant to you.

To provide an even more helpful Web experience for our customers, we’ve updated the SAS Profile application. Those of you who have a SAS Profile will notice the cleaner overlay interface the next time that you log in or update your profile information. We’ve also added convenient functionality at the top of nearly every SAS Web page to log in, log out or edit your SAS Profile.

We’ve made changes behind the scenes as well. When you create a new Profile, we’ll e-mail you to confirm your address and ensure you intended to create a SAS Profile. We’ve also created password strength rules to lock down your SAS Profile account. In the coming months when you log in, you will be asked to create a new strong password as we migrate "old" Profiles over to our new application.

You may be asking, why create a SAS Profile? If you’re a power user on our site (and we know that many of you are), having a Profile enables you to:

Analyzing NC State Fair Attendance Using JMP

October 20, 2010
Photo of Krispy Kreme burger from NC State Fair Deep Fried blog
The North Carolina State Fair is in full swing in Raleigh this week. All the talk lately in the break room at the office concerns the Fair. "Did you go to the fair this weekend?" "What day will you go?" "Are you going to try a Krispy Kreme burger?" Fried Snickers bars and chocolate-covered bacon are sooo last year -- this year's outrageous food item is the hamburger with doughnuts where the buns should be (see photo at right, used with permission from Paul Jones of the NC State Fair Deep Fried blog).

If you're a planner, you might want to attend the fair on a day when the crowds aren't as large. That way parking should be more tolerable, and you won't have to wait as long in the Krispy Kreme burger line.

Conveniently, attendance numbers are available on the NC State Fair website. Last week, a SAS colleague analyzed this data using SAS. That motivated me to see what JMP can do with it today, which is World Statistics Day.

The website contains a table of data for each day of the fair. To get the data into JMP, I first tried copying and pasting; unfortunately, the data was not formatted properly. Luckily, someone told me about a handy feature in JMP that solves this issue -- Internet Open (under the File menu). Just enter the Web address where the data resides, and JMP automatically finds it, imports it and even formats it correctly.

Note: There are two tables found on the NC State Fair website. Choose the first table to see the attendance data.

Once the table is imported into JMP, there are two housekeeping details necessary to get the data in workable form. Change "Thu." to be a numeric column, and exclude/hide the row for 2010 if it contains partial data.

The first plot I'm interested in seeing will compare trends in attendance over the various years. The Parallel Plot platform can do this with just a few clicks (be sure to select Scale Uniformly). Below I've colored the rows according to which decade they belong to. For example, in the most recent decade (red lines), you can see a sharp spike in attendance on the second Thursday over previous decades. Can anyone tell me when they started the canned food drive on Thursdays?

A colleague suggested that the raw numbers are hard to compare from year to year since the overall numbers vary so much. He would prefer to know the daily attendance for each day as a percentage within each year. To create this plot, we first need to use Transpose in order to turn Day into a column. After adding a new column for the percent, Graph Builder is easily able to produce a graph of box plots that summarizes attendance by percentages per day.
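
For readers who would rather script this step, here is a minimal sketch of the same percent-within-year calculation in R. The attendance numbers and the object names (`attendance`, `pct`, `pct_long`, `median_by_day`) are made up for illustration; they are not the NC State Fair figures, and this is not what JMP does internally.

```r
# Hypothetical attendance counts (made-up numbers, not the NC State Fair data).
# Rows are years, columns are days of the fair.
attendance <- matrix(
  c(40000, 60000, 90000, 50000,
    45000, 70000, 95000, 40000,
    30000, 50000, 80000, 40000),
  nrow = 3, byrow = TRUE,
  dimnames = list(year = c("2007", "2008", "2009"),
                  day  = c("Thu", "Fri", "Sat", "Sun"))
)

# Share of each year's total attendance that falls on each day.
# margin = 1 divides every row by its own total, so each row sums to 100.
pct <- 100 * prop.table(attendance, margin = 1)

# Long format with one row per year/day pair, similar in spirit to the
# Transpose step: day becomes a column.
pct_long <- as.data.frame(as.table(pct), responseName = "percent")

# Median share per day across the years.
median_by_day <- tapply(pct_long$percent, pct_long$day, median)
median_by_day
```

From here, `boxplot(percent ~ day, data = pct_long)` would give box plots by day, roughly comparable to the Graph Builder view.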

The results are not surprising -- weekends have the heaviest traffic, with the second Saturday being the most popular day to go. I personally like to go on a weeknight. But no matter when I go, no matter what, you won't catch me anywhere near a Krispy Kreme burger!
October 20, 2010
I am not a statistician, but I love statistics. Statistics are facts, and when used for good, they are an important ingredient in sound decision making about almost any issue, whether it's about government policy or your personal behavior. The use of statistics has gone way beyond counting things, computing averages, and predicting trends. We can now apply sophisticated models to answer very specific questions in an objective way.

In an editorial published yesterday, Jim Goodnight highlights the important role of statistics (and of course, statisticians) in the world today. Dr. Goodnight selected some important examples. He mentioned the Dartmouth Atlas Project, which sought to answer a number of questions including, "Does higher spending in end-of-life care lead to better patient outcomes?" The answer was no: there was no evidence that more intense care (which can burn through money quickly) leads to better survival or satisfaction. That's an important finding that might be used to influence policies for how Medicare dollars are spent. (You can view reports and download data at the Dartmouth Atlas Project website.)

But how to approach end-of-life care is not an objective issue; it's one that is full of emotion and intuition. If it's your life, or the life of a loved one, your instinct would be to "spend whatever it takes" to provide the best outcome. Do you care what the statistics tell you?

Here's another example that's closer to home for us in North Carolina. The Wake County Public School System has been embroiled in much debate about student assignment methods -- what's the best way to assign students to schools to achieve the optimal socioeconomic balance, reduce bus commutes, limit overcrowding, and avoid "churn" via reassignments every year? SAS has been a partner with the Wake County Public School System on this problem; some of the findings were published in a SAS Global Forum paper. SAS/OR (OPTMODEL) and JMP were used in the study.

SAS also supports the school system with EVAAS (Education Value-Added Assessment System). This provides a method to measure the effectiveness of schools and school systems across a variety of disciplines and to predict future student achievement. It provides tremendous information, but it's not always what people want to hear. School assignments and how to measure student performance -- these issues are often charged with emotion as well, and that can drown out the facts. Statistics aren't always given a prominent role when making decisions around these issues; they certainly play a minor role in the debates that we hear about in the news.

It's my hope that as we celebrate the first World Statistics Day, we will strive to educate our young people about the power of statistics (and it's even on Facebook, where all the kids hang out these days). It's important that we teach them how to ask critical questions and to demand solid answers, and to know that such answers are achievable when data are available. Human intuition and philosophy help to guide us when the facts are unknowable, but thanks to science and statistics, we've got a lot more in the "knowable" pile than we've ever had.

Top Ten Government Web Sites for Downloading Data

October 20, 2010
Today is World Statistics Day, an event set up to "highlight the role of official statistics and the many achievements of the national statistical system."

I want to commemorate World Statistics Day by celebrating the role of the US government in data collection and dissemination.

Data analysis begins with data. Over the years I have bookmarked several interesting US government Web sites that enable you to download data from samples or surveys. In the last several years, I have seen several of these sites go beyond merely making data available to become sites that offer data visualization in the form of maps, bar charts, or line plots.

Here is my Top 10 list of US government Web sites where you can download interesting data:

  1. Bureau of Transportation Statistics (BTS)
    Are you making airline reservations and want to check whether your plane is likely to be delayed? The BTS site has all sorts of statistics on transportation, and was used as the source for the data for the 2009 ASA Data Expo poster session.

  2. Centers for Disease Control and Prevention (CDC)
    Did you know that 4,316,233 births were registered in 2007 and that 40% of them were to unmarried mothers? Did you know that about one third of those births were by cesarean delivery? At the CDC you can explore trends and analyze hundreds of data sets by race, gender, age, and state of residence.

  3. Environmental Protection Agency (EPA)
    You can download data on air and water pollution, or find out if any industries near your home are incinerating hazardous waste.

  4. Federal Reserve System (The Fed)
    If you want data on the US economy, this is a great place to begin. A server to build custom data sets enables you to create a map of the percentage of prime mortgages that are in foreclosure. Notice the regional variation!

  5. My NASA Data
    The NASA server at this Web site enables you to create your own customized data set from 150 variables in atmospheric and earth science from five NASA scientific projects. This type of data was used for the 2006 ASA Data Expo.

  6. National Oceanic and Atmospheric Administration (NOAA)
    Interested in subsurface oil monitoring data from ships, buoys, and satellites in the aftermath of the Deepwater Horizon spill? More interested in a historical analysis of hurricanes? All this, and more!

  7. National Center for Atmospheric Research (NCAR)
    Everything you wanted to know about weather and climate in North America and beyond. Download data about temperatures, precipitation, arctic ice, and so on.

  8. US Department of Agriculture (USDA)
    Check out the very cool Food Environment Atlas. The Economic Research Service (ERS) branch of the USDA disseminates many data sets on land use, organic farming, and other agricultural concerns. Several USDA researchers use SAS/IML Studio and regularly present research papers at SAS Global Forum.

  9. US Census Bureau
    Do you want to know where people live who satisfy one of hundreds of demographic characteristics, such as the percent change in population from 2000 - 2009? I have two words for you: "thematic maps."

  10. US Geological Survey (USGS)
    Data on scientific topics such as climate change, erosion, earthquakes, volcanoes, and endangered species. What's not to like?
Did I omit YOUR favorite government site that provides raw or summarized data? Post a comment and a link.
October 20, 2010

In honor of World Statistics Day and the read paper that my co-authors Chris Wild, Maxine Pfannkuch, Matt Regan, and I are presenting at the Royal Statistical Society today, we present the R code to generate a combination dotplot/boxplot that is useful for students first learning statistics. One of the overriding themes of the paper is that introductory students (be they in upper elementary or early university) should keep their eyes on the data.

When describing distributions, students often are drawn to the most visually obvious aspects: median, quartiles and extremes. These are the ingredients of the basic boxplot, which is often introduced as a graphical display in introductory courses. Students are taught to calculate the quartiles, and this becomes one of the components of a boxplot.

One limitation of the boxplot is that it loses the individual data points. Given a dotplot (see here for an example of one in Fathom), it is very easy to guesstimate and draw a boxplot by hand. Wild et al. argue that doing so is probably the best way of gaining an appreciation of just what a boxplot actually is. The boxplot provides a natural bridge between operating entirely in terms of what is seen in graphics to reasoning using summary statistics. Retaining the dots in the combination dotplot/boxplot provides a reminder that the box plot is just summarizing the raw data, thus preserving a connection to more concrete foundations.
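
As a small illustration of what the box is summarizing, the sketch below computes Tukey's five-number summary with base R's fivenum() for a made-up sample. The vector `x` and the labels attached to the result are only for illustration.

```r
# Made-up sample; any numeric vector would do.
x <- c(2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 21)

# Tukey's five-number summary: minimum, lower hinge, median,
# upper hinge, maximum. These are the ingredients of a basic boxplot.
five <- fivenum(x)
names(five) <- c("min", "lower", "median", "upper", "max")
five

# The raw values stay visible in a stacked dot plot of the same sample:
# stripchart(x, method = "stack", pch = 16)
```

Marking these five values on top of the dot plot is essentially the by-hand exercise described above.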

Certainly, such a plot has a number of limitations. It breaks down when there are a large number of observations, and has redundancy not suitable for publication. But as a way to motivate informal inference, it has potential value. It's particularly useful when combined in an animation using multiple samples from a known population, as demonstrated here (scroll through the file quickly).
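
The idea behind such an animation can be sketched without any graphics: draw repeated samples from a known population and watch the summaries move. The population below is simulated and its parameters are arbitrary assumptions; nothing here comes from the paper's own examples.

```r
set.seed(2010)

# A hypothetical "known" population (simulated, arbitrary parameters).
population <- rnorm(10000, mean = 50, sd = 10)

# Draw several samples of the same size and summarize each one.
n_samples   <- 20
sample_size <- 30
summaries <- t(replicate(n_samples,
                         fivenum(sample(population, sample_size))))
colnames(summaries) <- c("min", "lower", "median", "upper", "max")

# Each row is one sample; the spread of the medians shows how much
# a boxplot can shift from sample to sample.
range(summaries[, "median"])
```

Plotting one row at a time, with the dots for that sample retained, gives the frame-by-frame effect.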

In addition, the code, drafted by Chris and his colleagues Steve Taylor and Dineika Chandrananda at the University of Auckland, demonstrates a number of useful and interesting techniques.


Because of the length of the code, we'll focus just on the boxpoints3() function shown below. The remaining support code, which is distributed separately, must be run first. The original boxpoints() function has a large number of options and configuration settings, which the function below sets at the default values.

# Create a plot from two variables, one continuous and one categorical.
# For each level of the categorical variable (grps) a stacked dot plot
# and a boxplot summary are created. Derived from code written by
# Christopher Wild, Dineika Chandrananda and Steve Taylor
boxpoints3 = function(x,grps,varnames1,varnames2,labeltext)
{
   # keep only observations where both x and grps are non-missing
   observed = (1:length(x))[!is.na(x) & !is.na(grps)]
   x = x[observed]
   grps = as.factor(grps[observed])
   ngrps = length(levels(grps))

   # begin section to align titles and labels
   xlims = range(x) + c(-0.2,0.1)*(max(x)-min(x))
   top = 1.1
   bottom = -0.2
   plot(xlims, c(bottom,top), type="n", xlab="", ylab="", axes=FALSE)
   yvals = ((1:ngrps)-0.7)/ngrps
   text(mean(xlims), top, paste(varnames1, "by", varnames2, sep=" "),
      cex=1, font=2)
   text(xlims[1],top-0.05,varnames2, cex=1, adj=0)
   # any further addaxis() arguments are assumed to stay at their defaults
   addaxis(xlims, ylev=0, tickheight=0.03, textdispl=0.07, nticks=5)
   text(mean(xlims), -0.2, varnames1, adj=0.5, cex=1)
   # end section for titles and labels

   for (i in 1:ngrps) {
      xi = x[grps==levels(grps)[i]]
      text(xlims[1], yvals[i]+0.2/ngrps,
         substr(as.character(levels(grps)[i]), 1, 12), adj=0, cex=1)
      prettyrange = range(pretty(xi))
      if (min(diff(sort(unique(xi)))) >= diff(prettyrange)/75)
         xbins = xi # They are sufficiently well spaced.
      else {
         # otherwise bin the values into 75 slots so the dots stack
         xbins = round(75 * (xi-prettyrange[1])/diff(prettyrange))
      }
      addpoints(xi, yval=yvals[i], vmax=0.62/ngrps, vadd=0.075/ngrps,
         xbin=xbins, ptcex=1, ptcol='grey50', ptlwd=2)
      bxplt = fivenum(xi)
      if(length(xi) != 0)
         addbox(x5=bxplt, yval=yvals[i], hbxwdth=0.2/ngrps,
            boxcol="black", medcol="black", whiskercol="black",
            boxfill=NA, boxlwidth=2)
   }
}

Most of the function attends to housekeeping, in particular aligning titles and labels. Once this is done, the support functions addpoints() and addbox() are called with the appropriate arguments. We can call the function using data from female subjects at baseline of the HELP study, comparing PCS (physical component scores) for homeless and non-homeless subjects.

ds = read.csv("")
female = subset(ds, female==1)
with(female,boxpoints3(pcs, homeless, "PCS", "Homeless"))

We see that the homeless subjects have lower PCS scores than the non-homeless (though the homeless group also has the highest score of either group in this sample).
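
If the support code is not at hand, a rough approximation of the display can be put together with base R's boxplot() and stripchart(). This is a sketch of a similar plot, not the boxpoints3() implementation; the data are simulated, and the data frame `d` with its columns `score` and `group` is hypothetical rather than the HELP data.

```r
# Simulated scores for two hypothetical groups (not the HELP data).
set.seed(42)
d <- data.frame(
  score = c(rnorm(40, mean = 45, sd = 10), rnorm(60, mean = 50, sd = 10)),
  group = rep(c("homeless", "housed"), times = c(40, 60))
)

# Horizontal boxplots. range = 0 extends the whiskers to the extremes,
# so each box displays the five-number summary of its group.
bp <- boxplot(score ~ group, data = d, horizontal = TRUE, range = 0,
              col = NA, boxwex = 0.3)

# Overlay the raw values as stacked dots just above each box.
# Scores are rounded so that nearby values stack on top of each other.
stripchart(round(score) ~ group, data = d, method = "stack", add = TRUE,
           at = c(1, 2) + 0.2, pch = 16, col = "grey50", offset = 0.4)
```

The value returned by boxplot() holds the plotted summaries in `bp$stats`, one column per group, which is handy for checking the picture against fivenum().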