Cat Truxillo

March 23, 2011
When you think of statistical process control, or SPC for short, what industry first comes to your mind? In the past 10 or 15 years, diverse industries have begun to standardize processes and administrative tasks with statistical process control. While the top two bars of the industrial Pareto chart are probably still manufacturing and machine maintenance, in North America, SPC is being used in far more types of work than many people realize.

One reason that researchers and process managers turn to Walter Shewhart’s trusty techniques to distinguish between common cause and special cause variation—in other words, variation that represents unimportant bumps on the road and variation that means that the wheels have fallen off the bus—is that they work far better than gut instinct in many cases. Furthermore, they are defensible, because they are based on sound, scientific practice. What might surprise you is the extent to which social science, human subject research, and administrative fields are making use of SPC.

1. Health Care. This is the area I hear the most buzz about for SPC. A great deal of interesting work is being done in monitoring process variables such as late payments, the number of available beds, payments to providers, script-writing rates for high-risk drugs, pain levels during recovery from surgery, and so on. I can only scratch the surface here. The writing is on the wall about health care becoming still more expensive, data are becoming more and more plentiful, and it is especially easy for problems to hide in massive data. Unpredictable problems = cost. Process control = savings.

2. Customer Service. How long did you have to wait in line the last time you were in a big box store? When you last called the cable company? Some companies recognize that if customer service is slow, you will take your business elsewhere. Some are even willing to take action on that knowledge. There are plenty of measurable service quality characteristics that can be tapped to identify times of day or product lines that are inconsistent, which translates to a better customer experience and to customer loyalty.

3. Survey Research. This is the one I’m most excited about right now. SPC interest in survey research has been on the rise for the past 5 or so years, and I think it’s an area ripe for this sort of analysis. If the news tells you that 52% of Americans who eat raw oysters only eat them to be polite, someone had to get that information. Enter the heroes of the survey world. Survey researchers, the people behind all the Things We Know about How We Are, are applying statistical process control methods with variables related to survey data collection, such as interview length, nonresponse rate, interviewer driving distances, and so on.

Continue reading "The Human Side of Statistical Process Control: Three Applications of SAS/QC You Might Not Have Thought About"
February 19, 2011
Lunch. For some workers, it’s the sweetest part of an otherwise bitter day at the grindstone. Nothing can turn that sweetness sour like going into the breakroom to discover that someone has taken your lunch and eaten it themselves.

Nothing like that ever happens here at SAS.

But if it did, I would set up a system to repeatedly collect and identify the saliva of the top suspects, and do an elegant chemical analysis. When a lunch goes missing, there’s always some residual spit on the used container.

I could develop a discriminant analysis model to identify each suspect. Then I’d score newly missing lunches with the model, flag the culprit, track them down and make them buy a box of Belgian chocolates for the person whose lunch they pilfered.

But what if I falsely accused someone who was innocent? Oh gosh. That could be an embarrassing and expensive error.

Let’s review how the discriminant analysis would look:

Continue reading "Who Ate My Lunch? Discriminant Thresholds to Reduce False Accusations"
February 3, 2011
So, if you were reading last week, we talked about how to structure your data for a mixed models repeated measures analysis. And as my friend Rick pointed out, there’s more than one way to go about restructuring your data (if you ask real nice, he’ll also do it in PROC IML- the Rockstar of programming languages). Then we played with a data set in which the dependent measurements were not ordered over time. In fact, it wasn’t even the same variable.
The Scene:
In order to increase the amount of money customers deposit in three different account types, a bank designs a factorial experiment with two factors: promotion (Gift or Discount) and minimum balance ($75, $125, or $225). Offers are sent to existing customers, and the sums of their deposits to each of the three account types (savings, checking, and investment) are recorded.
The Classical Approach: MANOVA
Multiple continuous variables observed on the same subject is a textbook-perfect scenario for multivariate analysis of variance (MANOVA). MANOVA takes advantage of the correlation among responses within a subject and constructs a matrix of sums of squares and sums of cross-products (SSCP) to compare between- and within-group variability while accounting for correlation among the dependent variables within a subject and unequal variances across the dependent variables.
/* three responses per customer, tested jointly */
proc glm data = blog.promoexperiment;
   class promotion minbal;
   model savbal checkbal investamt = promotion|minbal;
   manova h=_all_;
run;
quit;

The data set, as we discussed last week, looks like this:

With one row per customer, one column per dependent variable.
Just like multivariate repeated measures analysis (which is really just MANOVA with some fancy contrasts pre-cooked), a little missing data goes a long way to killing your sample size and therefore statistical power. Furthermore, working with covariates can be tricky with repeated measures MANOVA. The MANOVA SSCP matrices require estimation of many bits, which can also eat up your power. There are four multivariate test statistics, which can also complicate matters if you are not certain which one is the best for you to use.
The Modern Approach: Mixed Models
It turns out that it is really easy to fit an equivalent—but not identical—model in the MIXED procedure.
/* unstructured covariance among the responses within a subject */
proc mixed data = blog.promouni;
   class promotion minbal;
   model value1 = promotion|minbal / noint;
   repeated / subject = subject type=un;
run;

The data set looks like this:

One row per observation (a dependent variable within a customer).
More, and Different:
If all we were doing was reproducing MANOVA results with PROC MIXED, I would not be writing this blog. We can do more. Instead of just accommodating unequal variances and covariance within a subject, the mixed models approach directly models the covariance structure of the multiple dependent variables. What’s more is that you can also simplify the structure, buying you more power, and making the interpretation of your model easier. For example, you might suspect that the variances are equal and the covariances between pairs of dependent variables are equal across the three dependent variables.
/* compound symmetry: equal variances, equal covariances */
proc mixed data = blog.promouni;
   class promotion minbal;
   model value1 = promotion|minbal / noint;
   repeated / subject = subject type=cs;
run;

The fit statistics in the mixed model enable model comparison. Since the mean model is identical in both cases, fit statistics based on REML are appropriate.
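One way to put the two sets of fit statistics next to each other is to capture them with ODS. The sketch below is only an illustration: the data set names fit_un, fit_cs, and fit_compare are made up for this example, and the two PROC MIXED steps simply repeat the models shown above.

```sas
/* Fit the same mean model under two covariance structures and keep
   each model's fit statistics (REML is the PROC MIXED default). */
proc mixed data=blog.promouni;
   class promotion minbal;
   model value1 = promotion|minbal / noint;
   repeated / subject=subject type=un;
   ods output FitStatistics=fit_un;   /* hypothetical data set name */
run;

proc mixed data=blog.promouni;
   class promotion minbal;
   model value1 = promotion|minbal / noint;
   repeated / subject=subject type=cs;
   ods output FitStatistics=fit_cs;   /* hypothetical data set name */
run;

/* Stack the two tables and label each row with its structure. */
data fit_compare;
   length structure $ 2;
   set fit_un(in=from_un) fit_cs;
   structure = ifc(from_un, 'UN', 'CS');
run;

proc print data=fit_compare noobs;
run;
```

Smaller values of the information criteria favor a structure. Because compound symmetry is a special case of the unstructured matrix, the difference in -2 Res Log Likelihood can also be read as a likelihood ratio statistic; with three responses that would be on 4 degrees of freedom (6 covariance parameters versus 2).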

Along with changing the covariance structure, other advantages tag along with using a mixed model: more efficient handling of missing data, easy handling of covariates, easy accommodation of multiple levels of nesting (measurements within subjects within sales territories within your wildest imaginings), straightforward modeling of a time component, and heterogeneous groups models, to name a few.
Variation on a Theme: Mixture of Distributions in PROC GLIMMIX
Few days go by that I don’t use the GLIMMIX procedure, and as it happens, there’s a trick in PROC GLIMMIX that makes these types of models even more flexible. Starting in SAS 9.2, you can model a mixture of distributions from the exponential family, such as one gamma and two normal responses. If my data looked like this:

(Notice the column with the distribution name for each variable) then I could fit the model as follows:
/* distrib names the distribution for each row; random intercept per subject */
proc glimmix data = blog.promouni;
   class promotion minbal;
   model value1 = promotion|minbal / noint dist=byobs(distrib);
   random intercept / subject = subject;
run;

Or like this, instead:
/* same mean model, with an unstructured R-side covariance instead */
proc glimmix data = blog.promouni;
   class promotion minbal;
   model value1 = promotion|minbal / noint dist=byobs(distrib);
   random _residual_ / subject = subject type=un;
run;

Those two models are not equivalent, and they both use pseudo likelihood estimation, so you will probably only use this kind of a model in circumstances where nothing else will do the job. Still, it’s quite a bit more than could be done even a couple of years ago.
I know I’m keeping you hanging on for that punchline. So here you are (with my deepest apologies)…
Three correlated responses walk into a bar.
One asks for a pilsner. The second asks for an ale.
The third one tells the bartender, “I’m just not feeling normal today. Better gamma something mixed.”

(edited to fix the automatic underlining in html in the SAS code-- it should be correctly specified now)
January 26, 2011
Next week's blog entry will build on this one, so I want you to take notes, OK?

It's not headline news that in most cases, the best way to handle a repeated measures analysis is with a mixed models approach, especially for Normal responses. (For other distributions in the exponential family, GLIMMIX also has some advantages over GEEs, but that's another blog post for another day.) You have more flexibility in modeling the variances and covariances among the time points, data do not need to be balanced, and a small amount of missingness doesn't spell the end of your statistical power.

But sometimes the data you have are living in the past: arranged as if you were going to use a multivariate repeated measures analysis. This multivariate data structure arranges the data with one row per subject, each time-based response a column in the data set. This enables the GLM procedure to set up H matrices for the model effects, and E and T matrices for testing hypotheses about those effects. It also means that if a subject has any missing time points, you lose the entire subject's data. I've worked on many repeated measures studies in my pre-SAS years, and I tell you, I've been on the phones, email, snail mail, and even knocked on doors to try to bring subjects back to the lab for follow-ups. I mourned over every dropout. To be able to use at least the observations you have for a subject before dropout would be consolation to a weary researcher's broken heart.

Enter the mixed models approach to repeated measures. But, your data need to be restructured before you can use MIXED for repeated measures analysis. This is, coincidentally, the same data structure you would use for a univariate repeated measures, like in the old-olden days of PROC ANOVA with hand-made error terms (well, almost hand-made). Remember those? The good old days. But I digress.

The MIXED and GLIMMIX procedures require the data be in the univariate structure, with one row per measurement. Notice that these procedures still use complete case analysis (CCA), but now the "case" is different. Instead of a subject, which in the context of a mixed model can be many things at once (a person, a clinic, a network...), the "case" is one measurement occurrence.

How do you put your wide (multivariate) data into the long (univariate) structure? Well, there are a number of ways, and to some extent it depends on how you have organized your data. If the multivariate response variable names share a prefix, then this macro will convert your data easily.

What if you want to go back to the wide structure (for example, to create graphs to profile subjects over time)? There's a macro for that as well.

What if your variables do not share a prefix, but instead have different names (such as SavBal, CheckBal, and InvestAmt)? Then you will need an alternative strategy. For example:

This needs some rearrangement, but there are two issues. First, there is no subject identifier, and I will want this in next week's blog when I fit a mixed model. Second, the dependent variables are not named with a common prefix. In fact, they aren't even measured over time! They are three variables measured for one person at a given time. (I'll explain why in next week's blog).

So, my preference is to use arrays to handle this:
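The listing itself is not reproduced here, so what follows is only a minimal sketch of an array-based DATA step, not necessarily the code used in the post. It assumes the wide data set is blog.promoexperiment with the columns SavBal, CheckBal, and InvestAmt, and that the long data set should be blog.promouni with columns named value1 and subject; the column varname and the use of the row number as a subject identifier are choices made for this illustration.

```sas
/* Sketch: reshape one row per customer into one row per response. */
data blog.promouni;
   set blog.promoexperiment;
   length varname $ 32;
   /* no subject ID in the wide data, so use the row number (assumption) */
   subject = _n_;
   array dv{3} savbal checkbal investamt;
   do i = 1 to dim(dv);
      varname = vname(dv{i});   /* which response this row holds */
      value1  = dv{i};          /* the measurement itself */
      output;                   /* one output row per response */
   end;
   drop i savbal checkbal investamt;
run;
```

Each input row produces three output rows that share the same subject value and the same promotion and minbal settings, which is the shape a REPEATED statement with SUBJECT=subject expects.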

Which results in the following:

I tip my hat to SAS Tech Support, who provide the %MAKELONG and %MAKEWIDE macros and to Gerhard Svolba, who authored them. If someone wants to turn my arrays into a macro, feel free. I'll tip my hat to you, too.

Tune in next week for the punchline to the joke:
"Three correlated responses walk into a bar..."
January 6, 2011
A student in my multivariate class last month asked a question about prior probability specifications in discriminant function analysis:
What if I don't know what the probabilities are in my population? Is it best to just use the default in PROC DISCRIM?

First, a quick refresher of priors in discriminant analysis. Consider a problem of classifying 150 cases (let's say, irises) into three categories (let's say, variety). I have four different measurements taken from each of the flowers.

If I walk through the bog and pick another flower and measure its 4 characteristics, how well can I expect to perform in classifying it as the right variety? One way to derive a classification algorithm is to use linear discriminant analysis.

A linear discriminant function to predict group membership is based on the squared Mahalanobis distance from each observation to the centroid of the group plus a function of the prior probability of membership in that group.
This generalized squared distance is then converted into a score of similarity to each group, and the case is classified into the group it is most similar to.

The prior probability is the probability of an observation coming from a particular group in a simple random sample with replacement.

If the prior probabilities are the same for all three of the groups (also known as equal priors), then the function is only based on the squared Mahalanobis distance.

If the prior for group A is larger than for groups B and C, then the function makes it more likely that an observation will be classified as group A, all else being equal.
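In symbols, the description above can be sketched as follows for the linear (pooled covariance) case. The notation is introduced here for illustration and does not come from the post: x is the observation, m_j is the centroid of group j, S is the pooled within-group covariance matrix, and q_j is the prior probability of group j.

```latex
\[
\begin{aligned}
D_j^2(\mathbf{x}) &= (\mathbf{x}-\mathbf{m}_j)^{\top}\,\mathbf{S}^{-1}\,(\mathbf{x}-\mathbf{m}_j) \;-\; 2\ln q_j \\[4pt]
p(j \mid \mathbf{x}) &= \frac{\exp\!\left\{-\tfrac{1}{2}\,D_j^2(\mathbf{x})\right\}}
                             {\sum_{k}\exp\!\left\{-\tfrac{1}{2}\,D_k^2(\mathbf{x})\right\}}
\end{aligned}
\]
```

The first term is the squared Mahalanobis distance and the second is the function of the prior. With equal priors the second term is the same for every group and drops out of the comparison; a larger q_j shrinks the generalized distance to group j and so raises its posterior probability.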

The default in PROC DISCRIM is equal priors. This default makes sense in the context of developing computational software: the function with equal priors is the simplest, and therefore the most computationally efficient.
PRIORS equal;

Alternatives are proportional priors (using priors that are the proportion of observations from each group in the same input data set) and user-specified priors (just what it sounds like: specify them yourself).
PRIORS proportional;
PRIORS 'A' = .5 'B' = .25 'C' = .25;
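For context, a PRIORS statement sits inside a complete PROC DISCRIM step. The sketch below assumes the iris measurements are available as the sample data set sashelp.iris, with a class variable Species whose levels are Setosa, Versicolor, and Virginica; the post does not say which data set it used, and the prior values are arbitrary.

```sas
/* Linear discriminant analysis with user-specified priors. */
proc discrim data=sashelp.iris method=normal pool=yes;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
   /* priors are given by class level and must sum to 1 */
   priors 'Setosa' = 0.5 'Versicolor' = 0.25 'Virginica' = 0.25;
run;
```

Swapping the PRIORS statement for PRIORS equal; or PRIORS proportional; reruns the same analysis under the other two choices, which makes it easy to see how the classification table shifts.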

Of course this kind of problem is far more interesting when you consider something like people making choices, such as kids choosing an action figure of Tinkerbell, Rosetta, or Vidia. Those certainly don't have equal priors, and if your daughter's anything like mine, she doesn't want to be classified into the wrong group.

So back to the original question:
What if I don't know what the probabilities are in my population? Is it best to just use the default in PROC DISCRIM?

In this case, using the default would probably not be a great idea, as it would assign the dolls with equal probability, all else being equal.

So if not the default, then what should you use? This depends on what you're going to be scoring. Your priors should reflect the probabilities in the population that you will be scoring in the future. Some strategies for getting a decent estimate:
1. Go to historical data to see what the probabilities have been in the past.
2. If your input data set is a simple random sample, use proportional priors.
3. Take a simple random sample from the population and count up the number from each group. This can determine the priors.
4. Combine the probabilities you think are correct with the cost of different types of misclassification.

For example, suppose that, among 4-year-olds, the probabilities of wanting the Tinkerbell, Rosetta, and Vidia action figures are really 0.50, 0.35, and 0.15, respectively. After all, not many kids want to be the villain.
PRIORS 'Tink' = .5 'Rosetta' = .35 'Vidia' = .15;

What is the cost of giving a girl the Rosetta doll when she wanted Tinkerbell? What's the cost of giving a girl Vidia when she wanted Rosetta? And so on. A table is shown below (based on a very small sample of three interviews of 4-year-old girls):

Clearly the cost of an error is not the same for all errors. It is far worse to assign Vidia to a girl who doesn't want Vidia than for any other error to occur. Also notice the small detail that Vidia fans would prefer to get Rosetta over Tinkerbell. For birthday party favors, I'd massage those priors to err on the side of giving out Rosetta over Vidia.
PRIORS 'Tink' = .5 'Rosetta' = .4 'Vidia' = .1;
Of course, depending on your tolerance for crying, you might just give everyone Rosetta and be done with it. But then, really, isn't variety the spice of life?

I hope this has helped at least one or two of you out there who were having trouble with priors. The same concepts apply in logistic regression with offset variables, by the way. But that's a party favor for another day.
January 4, 2011
Happy New Year!! This is a good time to think about what was going on here in SAS Education one year ago, and to introduce you to a big project that I'm really excited to "take public."

In January 2010 (as well as throughout 2009), we kept getting cries for help like this one: "Where do we find MBAs with quantitative skills? They need to know predictive modeling, large-scale time series forecasting, cluster analysis, design and analysis of experiments, honest assessment, and model management. Our problems cannot be solved with t-tests and most universities are not teaching the skills business analysts really need in the modern workplace."

Companies want to hire professionals who have a good head for business, an understanding of data analysis, and facility with advanced software for fitting sophisticated models. They have told us that these people are getting harder and harder to find.

Of course, SAS has had training on statistical analysis and software for decades. But analytical directors and business analysts frequently find themselves in a position of needing to know more about how to apply analytics in the business context: how do you make the most effective use of your (massive, messy, opportunistic) data? How do you "think analytically"?

About half of my time in 2010 (with help from some brilliant colleagues, as well) was spent developing a course to address this need: Advanced Analytics for the Modern Business Analyst. The first version was made exclusively available to university professors last August, and many of them are already teaching it in their graduate programs. Degree-seeking students are getting better prepared for handling large-scale data analysis in the so-called "real-world." We were all very excited about launching the course with the universities and worked individually with over 30 professors worldwide to train them and provide the software needed to teach this course.

Starting in 2011 the course will also be available to professionals.

In March, we are offering a nine-week class that combines statistical and software training with the business context-- think of it as a "how to solve problems with data analysis" class. It is like having that extra semester of graduate training that you wish had been offered.

This course is also a chance to try out a better way of learning. Instead of just a weekly live web class, this course models the strategies many universities are using to offer their distance learning curricula, making it easier for the student to retain learning from week to week and using multiple learning modes to reinforce understanding (reading, writing, listening, watching).

Each week, students will participate in:
...a 3.5-hour live web lecture with your instructor and classmates.
...practice activities using software on our servers.
...assigned reading of book chapters and articles to support the class topics, as well as discussion forums to exchange ideas about the readings.

The total time commitment is about 5-6 hours a week, roughly the same as a semester-long course.

You can read a little more about it here in the SAS Training Newsletter.

You can see the course outline here. There are only a few seats left, so if you're interested, you might want to jump on it. Since every teaching experience is also a learning experience, I hope that teaching this class will generate lots of new ideas and blog posts in the coming months. I'll let you know how it's coming along.

Does your organization have people who can make sense of masses of data, forecast future behavior, and find the next best strategy to gain competitive edge?

See you in class!
December 9, 2010
At this time of year as fans debate over the best players in college football, quantum mechanics combines with college sports to produce the Heisman uncertainty principle: you cannot know who has won the trophy until it is announced, and so you have to treat it as if every candidate has won and no candidates have won.**

What does the quantum theory of football have to do with a blog about learning SAS? Well, everyone around here is abuzz about Change the Equation and STEM (if you don’t know what this is, check it out and come back to me. It’s OK, I’ll wait for you.) Here is SAS’ contribution to their video contest:

The winner of the viral video contest is also in a quantum state of uncertainty until the contest ends and the winner is announced.

The SAS Training Post writers and all of the Education Division at SAS have reasons to be interested in motivating the next generation to study STEM—after all, they’re our future users!

But my interest in this topic goes further than that of a SAS instructor. The state of education is often on my mind as I consider the future my 1- and 4-year old knee-gnawers can look forward to. Schroedinger’s quarterback aside, one thing is certain: learning does not end with the school day. A well-rounded continuing education at home is part of the solution to the problem of lagging in math and science.

Last week, Michele Reister asked me to blog about how I ended up with a career in statistics. It’s certainly not where I thought I’d end up, trying to pick a major among theatre arts, psychology, English, physics, and computer science. There were dozens of influences that led ultimately to here and now, but one that makes my point about education is this: As strange as it is, I ended up in statistics partly thanks to William Shakespeare. Iambic pentameter fed a love for how numbers and patterns play into everyday life that later bloomed with academic research in human behavior. Sometimes I’d miss the whole point of a sonnet because the grammatical gymnastics producing the rhythm were so gorgeously executed. I wasn’t a very good actor, but I love a play on numbers.

Another influence was my freshman semester statistics professor at (what is now called) Texas State University. She nurtured our interest in statistics, focusing on the theoretical and applied aspects of statistics rather than on the calculations. We never had to memorize a formula. Competing with a fellow student for top score in the class, we both realized that data analysis, as it pertains to research in behavioral sciences, is far more interesting than anything else we might be doing. One thing leads to another, and we each ended up in a quantitative field (he is now on the faculty at Texas State). One topic informs another, and creativity grows from diversity of information.

To think of knowledge as a siloed system of isolated subjects (math, English, history, physics) is to miss the joy of learning. Learning can be part of everyday life.

In teaching my 4-year old basic math concepts, we play games with the numbers. How many jelly beans do you have if I take 3 away from your handful of 12? What if you give 3 jelly beans to each of 4 kids? How many beans is that? She makes math part of her imaginative play, and it plants the seed of learning that will hopefully serve her for a lifetime.

And just like the quantum state of the Heisman, the quantum state of future STEM professionals requires that we treat it as if we are ahead—and behind—at the same time. Teaching math and science in ways that are fun, and that inform other areas of study, might be the key to motivating students to study STEM, so that future generations can “open the box” in 10 or 20 years to find a “living” in science, technology, engineering and math inside. Now I’m going to play catch in a probability field with my favorite (electron-speed) preschooler.

Thanks for reading!!

** with apologies to my dad, a retired physicist and fair-weather armchair quarterback, who is no doubt shaking his head right now.
December 3, 2010
Have you used multivariate procedures in SAS and wanted to save out scores? Some procedures, such as FACTOR, CANDISC, CANCORR, PRINCOMP, and others, have an OUT= option that saves scores for the observations in the input data set. However, to score a new data set, or to perform scoring with multivariate procedures that do not have an OUT= option, you need another way.

Scoring new data with these procedures is a snap if you know how to use PROC SCORE. The OUTSTAT= data sets from these procedures include the scoring coefficients that the procedure uses to score observations in the DATA= data set. They are contained in the observations where _TYPE_='SCORE'.
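As an illustration of the basic, non-hierarchical case, here is a minimal sketch using PROC PRINCOMP. The data set names train, newdata, pc_stats, and scored, and the variables x1 through x4, are placeholders invented for this example.

```sas
/* Fit on the training data and save the statistics, including the
   scoring coefficients, to an OUTSTAT= data set. */
proc princomp data=train outstat=pc_stats n=2 noprint;
   var x1-x4;
run;

/* Apply those coefficients to observations the procedure never saw. */
proc score data=newdata score=pc_stats out=scored;
   var x1-x4;
run;
```

The scored data set holds the original columns of newdata plus one new column per component. Because the OUTSTAT= data set also carries the training means and standard deviations, PROC SCORE can standardize the new observations the same way before applying the coefficients.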

The only catch, however, is that if you use a hierarchical method such as the VARCLUS procedure, then the score code is included for the 1-, 2-, 3-, and so on cluster solutions, up to the final number of clusters in the analysis. PROC SCORE doesn't know which one to use.

So you have one extra (teeny-tiny) step to perform scoring:
Continue reading "Weekday Morning Quick-trick: How to Score from PROC VARCLUS"
November 19, 2010
You know the old joke,

Q. How can you tell an extroverted statistician from an introverted statistician?

A. The extrovert looks at your shoes when they talk.

Well, the statisticians that I work with every day are a pretty lively bunch, so this joke doesn’t really apply, but it brings up a stereotype that people who work with statistics are dull. Many of us working in the area of applied statistics are expats from other disciplines: psychology, physics, chemistry, education, engineering, mathematics. Something brought each of us together to work in the fields of applied statistics. I think that in many cases, the common denominator is a strong desire to tinker. If you’re reading this blog, you are probably a fellow tinkerer.

I was reading this blog post over at Harvard Business Review about the threats to creativity. There’s no doubt in my mind that a work environment that fosters creativity should have some mix of these three key ingredients. To compete in the modern marketplace, creativity is critical.

The key ingredient that strikes a strong chord with me is related to #1: Smart people who think differently. Amabile describes creative thinkers as people who “[have] deep expertise… as well as broad acquaintance with seemingly unrelated fields.” These are people for whom graduation did not spell the end of being a student; these are people who read, learn, and practice new ideas continuously. In other words, tinkerers! Whether in academia or private industry or government research, stale knowledge is a death knell to progress. It’s the reason many universities won’t hire their own graduates as faculty, and why research sabbaticals can be beneficial to all parties. It’s also the reason my boss does not grumble about buying textbooks for the department to stay fresh on statistical and analytical topics. Tinkerers thrive in a supportive work environment.

Among SAS users, the tinkerers are also the ones who have the greatest impact on our statistical courses here at SAS. Students take classes, ask questions, pose suggestions. New ideas form and make their way into the next revision of the class. With feedback from our students, courses remain fresh. The same is true of our software: feedback from users keeps the software releases fresh.

I’d love to hear your thoughts on this theme. What are some ways that you have found help you stay creative, at the top of your game, fresh? Tell me all about it—I’m over here, the one in the brown and blue cowboy boots.

November 6, 2010
Last week, a student in my Mixed Models Analysis Using SAS class sent in the following text message during a discussion of crossover designs (sometimes known as ABBA designs, where factors vary within subjects, not ABBA designs where you’re like a Super Trouper).

Does it make sense to look at repeated measures (multiple treatments) in the same way as repeated measures (over time)? Is the model essentially the same?

This is a common point of confusion for people learning mixed models, particularly if they have experience with other types of repeated measures analysis. It is also such a good question, one that is central to selecting a covariance structure in a mixed models analysis, that I decided to make a blog post of it.

Continue reading "Is It Random or Repeated?"