February 3, 2011

So, if you were reading last week, we talked about how to structure your data for a mixed models repeated measures analysis. And as my friend Rick pointed out, there’s more than one way to go about restructuring your data (if you ask real nice, he’ll also do it in PROC IML, the rock star of programming languages). Then we played with a data set in which the dependent measurements were not ordered over time. In fact, they weren’t even measurements of the same variable.

**The Scene:** In order to increase the amount of money customers deposit in three different account types, a bank designs a factorial experiment with two factors: promotion (Gift or Discount) and minimum balance ($75, $125, or $225). Offers are sent to existing customers, and the total each customer deposits to each of the three account types (savings, checking, and investment) is recorded.

**The Classical Approach: MANOVA** Observing multiple continuous variables on the same subject is a textbook-perfect scenario for multivariate analysis of variance (MANOVA). MANOVA takes advantage of the correlation among responses within a subject. It constructs matrices of sums of squares and cross-products (SSCP) to compare between- and within-group variability while accounting for correlation among the dependent variables within a subject and unequal variances across the dependent variables.

    proc glm data = blog.promoexperiment;
    class promotion minbal;
    model savbal checkbal investamt = promotion|minbal;
    manova h=_all_;
    run;

The data set, as we discussed last week, has one row per customer and one column per dependent variable.

Just like multivariate repeated measures analysis (which is really just MANOVA with some fancy contrasts pre-cooked), a little missing data goes a long way toward killing your sample size and therefore your statistical power, because a customer with a missing value on any one response is dropped from the analysis entirely. Furthermore, working with covariates can be tricky with repeated measures MANOVA. The MANOVA SSCP matrices require estimation of many variance and covariance parameters, which can also eat up your power. And there are four multivariate test statistics, which can complicate matters if you are not certain which one is best for you to use.
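For reference, here is how those four statistics are usually defined. The notation is an assumption of this sketch rather than something taken from the PROC GLM output: $H$ and $E$ are the hypothesis and error SSCP matrices, and $\lambda_i$ are the eigenvalues of $E^{-1}H$.

```latex
\begin{aligned}
\text{Wilks' lambda:}          \quad & \Lambda = \prod_i \frac{1}{1+\lambda_i} = \frac{|E|}{|H+E|} \\
\text{Pillai's trace:}         \quad & V = \sum_i \frac{\lambda_i}{1+\lambda_i} = \operatorname{tr}\!\left[H(H+E)^{-1}\right] \\
\text{Hotelling-Lawley trace:} \quad & U = \sum_i \lambda_i = \operatorname{tr}\!\left(E^{-1}H\right) \\
\text{Roy's greatest root:}    \quad & \Theta = \lambda_{\max}
\end{aligned}
```

All four are built from the same eigenvalues, so they agree when the hypothesis has a single degree of freedom and can disagree otherwise, which is where the "which one do I use" question comes from.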

**The Modern Approach: Mixed Models** It turns out that it is really easy to fit an equivalent (but not identical) model in the MIXED procedure.

    proc mixed data = blog.promouni;
    class promotion minbal;
    model value1 = promotion|minbal / noint;
    repeated / subject = subject type=un;
    run;

The data set has one row per observation (one dependent variable within a customer).
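To make the long layout concrete, here is a minimal sketch of one way to build it and fit a multivariate-style model. Several names are hypothetical and do not come from the original data sets: the `account` column identifying which response a row holds, the `cust` identifier created from the row number, and the `work.promolong` data set. The sketch also assumes the response identifier enters the model so that each account type gets its own mean structure, which is the usual way to mirror a MANOVA in PROC MIXED.

```sas
/* Sketch only: ACCOUNT, CUST, and WORK.PROMOLONG are hypothetical names. */
data work.promolong;
  set blog.promoexperiment;            /* wide: one row per customer        */
  length account $10;
  cust = _n_;                          /* assumed: row number identifies a customer */
  account = 'savings';    value1 = savbal;    output;
  account = 'checking';   value1 = checkbal;  output;
  account = 'investment'; value1 = investamt; output;
  keep cust promotion minbal account value1;
run;

proc mixed data = work.promolong;
  class cust account promotion minbal;
  /* separate intercept and treatment effects for each response */
  model value1 = account account*promotion account*minbal
                 account*promotion*minbal / noint;
  /* 3 x 3 unstructured covariance among the responses within a customer */
  repeated account / subject = cust type = un;
run;
```

Naming `account` on the REPEATED statement tells PROC MIXED which row goes with which position in the covariance matrix, so a customer with one missing response still contributes the other two.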

**More, and Different:** If all we were doing was reproducing MANOVA results with PROC MIXED, I would not be writing this blog post. We can do more. Instead of just accommodating unequal variances and covariance within a subject, the mixed models approach directly models the covariance structure of the multiple dependent variables. What’s more, you can also simplify the structure, which buys you more power and makes your model easier to interpret. For example, you might suspect that the variances are equal and that the covariances between pairs of dependent variables are equal across the three dependent variables.

    proc mixed data = blog.promouni;
    class promotion minbal;
    model value1 = promotion|minbal / noint;
    repeated / subject = subject type=cs;
    run;
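Written out for three responses, the two structures requested with `type=un` and `type=cs` look like this. The symbols are notation assumed for this sketch; the compound symmetry form follows the parameterization PROC MIXED documents for `type=cs`.

```latex
\Sigma_{\mathrm{UN}} =
\begin{pmatrix}
\sigma_1^2  & \sigma_{12} & \sigma_{13} \\
\sigma_{12} & \sigma_2^2  & \sigma_{23} \\
\sigma_{13} & \sigma_{23} & \sigma_3^2
\end{pmatrix}
\qquad
\Sigma_{\mathrm{CS}} =
\begin{pmatrix}
\sigma^2 + \sigma_1 & \sigma_1            & \sigma_1 \\
\sigma_1            & \sigma^2 + \sigma_1 & \sigma_1 \\
\sigma_1            & \sigma_1            & \sigma^2 + \sigma_1
\end{pmatrix}
```

The unstructured matrix has six free parameters and compound symmetry has two, which is where the potential gain in power comes from.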

The fit statistics in the mixed model enable model comparison. Since the mean model is identical in both cases, fit statistics based on REML are appropriate.
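One way to carry out that comparison is to capture the fit statistics from both fits and form a likelihood ratio test, since compound symmetry is a special case of the unstructured matrix. This is a sketch under assumptions: the `work.fit_un`, `work.fit_cs`, and `work.lrt` data set names are hypothetical, and the code assumes the first row of the `FitStatistics` ODS table is the -2 residual log likelihood, as it is in the default REML output.

```sas
/* Sketch only: WORK.FIT_UN, WORK.FIT_CS, and WORK.LRT are hypothetical names. */
ods output FitStatistics = work.fit_un;
proc mixed data = blog.promouni;
  class promotion minbal;
  model value1 = promotion|minbal / noint;
  repeated / subject = subject type = un;
run;

ods output FitStatistics = work.fit_cs;
proc mixed data = blog.promouni;
  class promotion minbal;
  model value1 = promotion|minbal / noint;
  repeated / subject = subject type = cs;
run;

data work.lrt;
  /* one-to-one merge of the two fit tables, row by row */
  merge work.fit_un(keep = Value rename = (Value = m2rll_un))
        work.fit_cs(keep = Value rename = (Value = m2rll_cs));
  if _n_ = 1;                        /* assumed: row 1 is -2 Res Log Likelihood */
  chisq   = m2rll_cs - m2rll_un;     /* simpler model minus fuller model        */
  df      = 6 - 2;                   /* UN has 6 covariance parameters, CS has 2 */
  p_value = 1 - probchi(chisq, df);
run;

proc print data = work.lrt noobs;
run;
```

A large p-value would suggest the simpler compound symmetry structure fits about as well as the unstructured one. The AIC and BIC rows of the same tables can be compared directly, with smaller values preferred.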

Along with changing the covariance structure, other advantages tag along with using a mixed model: more efficient handling of missing data, easier handling of covariates, easy accommodation of multiple levels of nesting (measurements within subjects within sales territories within your wildest imaginings), straightforward modeling of a time component, and heterogeneous group models, to name a few.

**Variation on a Theme: Mixture of Distributions in PROC GLIMMIX** Few days go by that I don’t use the GLIMMIX procedure, and as it happens, there’s a trick in PROC GLIMMIX that makes these types of models even more flexible. Starting in SAS 9.2, you can model responses with a mixture of distributions from the exponential family, such as one gamma and two normal responses. If my data had a column holding the distribution name for each observation (distrib in the code that follows), then I could fit the model as follows:

    proc glimmix data = blog.promouni;
    class promotion minbal;
    model value1 = promotion|minbal / noint dist=byobs(distrib);
    random intercept / subject = subject;
    run;

Or like this, instead:

    proc glimmix data = blog.promouni;
    class promotion minbal;
    model value1 = promotion|minbal / noint dist=byobs(distrib);
    random _residual_ / subject = subject type=un;
    run;

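Those two models are not equivalent: the first induces the within-customer correlation through a random intercept, while the second models the residual covariance directly with an unstructured matrix. They both use pseudo-likelihood estimation, so you will probably only use this kind of model in circumstances where nothing else will do the job. Still, it’s quite a bit more than could be done even a couple of years ago.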
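The `distrib` column that both GLIMMIX calls read through `dist=byobs(distrib)` could be built along these lines. This is a sketch with assumptions: the `account` column and the `work.promodist` data set are hypothetical names, and treating the investment response as the gamma one is a choice made only for illustration.

```sas
/* Sketch only: ACCOUNT and WORK.PROMODIST are hypothetical names. */
data work.promodist;
  set blog.promouni;
  length distrib $6;
  /* assumed for illustration: investment amounts are the gamma response */
  if account = 'investment' then distrib = 'Gamma';
  else distrib = 'Normal';
run;
```

A gamma response has to be strictly positive, so zero deposits would need to be handled before a fit like this.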

I know I’m keeping you hanging on for that punchline. So here you are (with my deepest apologies)…

*Three correlated responses walk into a bar.*

One asks for a pilsner. The second asks for an ale.

The third one tells the bartender, “I’m just not feeling normal today. Better gamma something mixed.”

(Edited to fix the automatic underlining in the HTML of the SAS code; it should be correctly specified now.)