PROC MIXED

2月 282014
 
One common violation of the assumptions needed for linear regression is heterscedasticity by group membership. Both SAS and R can easily accommodate this setting.

Our data today comes from a real example of vitamin D supplementation of milk. Four suppliers claimed that their milk provided 100 IU of vitamin D. The null hypothesis is that they all deliver this accurately, but there was some question about whether the variance was the same between the suppliers. Unfortunately, there are only four observations for each supplier.

SAS
In SAS, we'll read the data in with a simple input statement.
data milk;
input value milk_cat;
cards;
77 1
85 1
91 1
88 1
93 2
101 2
126 2
103 2
103 3
88 3
109 3
85 3
95 4
83 4
91 4
86 4
;;
run;
To fit the model, we'll use the group option to the repeated statement in proc mixed. This is specifically designed to allow differing values for groups sharing the same covariance structure. In this case, it's a simple structure: no correlation, constant value on the diagonal. The key pieces of output are selected out using ODS. To assess variance terms, it's best to use maximum likelihood, rather than REML fitting.
ods select covparms lrt;
proc mixed data = milk method = ml;
class milk_cat;
model value = milk_cat/solution;
repeated/group=milk_cat type = simple;
run;

  Covariance Parameter Estimates

Cov Parm Group Estimate

Residual milk_cat 1 27.1875
Residual milk_cat 2 150.69
Residual milk_cat 3 100.69
Residual milk_cat 4 21.1875


Null Model Likelihood Ratio Test

DF Chi-Square Pr > ChiSq

3 5.13 0.1623

There's not too much evidence to support different variances, but the power is likely quite small, so we'll retain this model when assessing the null hypothesis of equal means. A REML fit ought to be preferable here, but to agree with the R results, we fit again with ML. Note that proc mixed is not smart enough to accurately determine the degrees of freedom remaining (16 observations - 4 variance parameters - 4 mean parameters) so we need to manually specify the denominator degrees of freedom using the ddf option to the model statement.
ods select solutionf tests3;
proc mixed data = milk method = ml;
class milk_cat;
model value = milk_cat/solution ddf=8;
repeated/group=milk_cat type = simple;
run;
                   Solution for Fixed Effects

Standard
Effect milk_cat Estimate Error DF t Value Pr > |t|

Intercept 88.7500 2.3015 12 38.56 <.0001
milk_cat 1 -3.5000 3.4776 8 -1.01 0.3437
milk_cat 2 17.0000 6.5551 8 2.59 0.0319
milk_cat 3 7.5000 5.5199 8 1.36 0.2113
milk_cat 4 0 . . . .


Type 3 Tests of Fixed Effects

Num Den
Effect DF DF F Value Pr > F

milk_cat 3 8 3.86 0.0563
So there's some reason to suspect that the suppliers may be different.

R
In R, we'll re-type this simple data set and assign the group labels manually.
value = c(77,85,91,88,93,101,126,103,103,88,109,85,95,83,91,86)
mc = as.factor(rep(1:4, each=4))
milk= data.frame(value, mc)
To fit the model with unequal variances, we'll use the gls() function in the nlme library. (Credit department: Ben Bolker discusses this and more complex models in an Rpubs document.) (Note that we use maximum likelihood, as we did in SAS.)
library(nlme)
mod = gls(value~mc, data=milk, weights = varIdent(form = ~1|mc),
method="ML")
To assess the hypothesis of equal variances, we'll compare to the homoscedasticity model using the anova() function.
mod2 = gls(value~mc, data=milk, method="ML")
anova(mod,mod2)

     Model df      AIC      BIC    logLik   Test  L.Ratio p-value
mod 1 8 125.3396 131.5203 -54.66981
mod2 2 5 124.4725 128.3355 -57.23625 1 vs 2 5.132877 0.1623
This result is identical to SAS, although this is a likelihood ratio test and SAS shows a Wald test.

To assess whether the suppliers are different, we need to compare to the model with just an intercept. When using REML for both models, there was a warning message and a different answer than SAS (using REML in SAS as well). So we'll stick with maximum likelihood here.
mod3  = gls(value~1,data=milk, weights = varIdent(form = ~1|mc), method="ML")
anova(mod3,mod)
With ML in both programs we get the same results.
     Model df      AIC      BIC    logLik   Test  L.Ratio p-value
mod3 1 5 126.8858 130.7488 -58.44292
mod 2 8 125.3396 131.5203 -54.66981 1 vs 2 7.546204 0.0564


An unrelated note about aggregators:We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
5月 092012
 


/*********************************************************************
FILENAME: MIXVMEAN.SAS
SUBJECT HEADING: STAT
INITIALS: KBW
DATE: 7/26/96
PROGRAM: SAS
VERSION: 6.11
PLATFORM: WINDOWS 3.11 TS040
TITLE: USING PROC MIXED FOR A PAIRED T-TEST COMPARED TO PROC MEANS

DESCRIPTION: THIS PROGRAM DOES A PAIRED T-TEST, FOR THE SAME DATA,
FIRST USING PROC MEANS, AND THEN USING PROC MIXED.
NOTE THAT THE DATA STRUCTURE FOR THESE TWO METHODS
IS DIFFERENT. THE DATA STEPS AT THE BEGINNING SET
UP THE TWO DIFFERENT DATA STRUCTURES.
NOTE THAT THE T-TEST, THE DEGREES OF FREEDOM AND
THE P-VALUES ARE THE SAME FOR BOTH METHODS.
**********************************************************************/


data test;
input y1 y2;
diff=y1-y2;
Pair+1;
datalines;
13 15
12 14
17 17.2
14 18
11 12
5 4.1
7 9.3
;
run;

proc print data=test;
title 'printout of original data set';
run;

proc means n mean t prt;
var diff;
title 'paired t-test using proc means';
run;

data test2(keep=y pair group);
set test;
y=y1;
group=1;
output;
y=y2;
group=0;
output;
run;

proc print data=test2;
title 'rearranged data for proc mixed';
run;

proc mixed;
class pair group;
model y=group;
random pair;
lsmeans group / pdiff;
title 'paired t-test using proc mixed';
run;

8月 192011
 
Rick Wicklin discussed in his blog the performance in solving a linear system using SOLVE() function and INV() function from IML.

Since regression analysis is an integral part of SAS applications and there are many SAS procedures in SAS/STAT that are capable to conduct various regression analysis, it would be interesting to benchmark their relative performance using OLS regression, the fundamental regression analysis of all.

The analysis will compare REG, GLMSELECT, GENMOD, MIXED, GLIMMIX, GLM, ORTHOREG, HPMIXED and TRANSREG on 10 OLS regressions with 100 to 1000 variables, incremental at 100, and with the number of observations twice the number of variables to avoid possible numerical issues. HPMIXED uses sparse matrix techniques and will be put into great disadvantage in this comparison using large dense matrices. A macro wraps them together:




%macro wrap;
proc printto log='c:\testlog.txt';run;

%let t0=%sysfunc(datetime(), datetime.);
%let procnames=GLM REG GLMSELECT ORTHOREG MIXED GLIMMIX GENMOD ;
%let nproc=%sysfunc(countW(&procnames));
%put Test &nproc PROCEDURES;
%do i=1 %to 10;
%let nobs=%sysevalf(&i*100);
options nonotes;
data _temp;
array x{&nobs};
do i=1 to 2*&nobs;
do j=1 to &nobs;
x[j]=rannor(0);
end;
y=rannor(0);
drop i j;
output;
end;
run;
options notes;
sasfile _temp load;
ods select none;

%do j=1 %to &nproc;
%let proc=%scan(&procnames, &j);
%put &proc;
proc &proc data=_temp;
model y = x1-x&nobs;
run;
%end;
%put TRANSREG;
proc transreg data=_temp;
model identity(y) = identity(x1-x&nobs);
run;
sasfile _temp close;
ods select all;
%end;
proc printto; run;
%mend;
%wrap;

After running all iterations, the SAS log is parsed to obtain procedure names and corresponding real time and CPU time. The following SAS code does this job:




data proc_compare;
infile "c:\testlog.txt";
input;
retain procedure ;
retain realtime cputime realtime2 ;
length procedure $12.;
length realtime cputime $24.;
if _n_=1 then id=0;
x=_infile_;
if index(x, 'PROCEDURE')>0 then do;
procedure=scan(_infile_, 3);
if procedure="REG" then id+1;
end;

if index(x, 'real time')>0 then do;
_t1=index(_infile_, 'real time');
_t2=index(_infile_, 'seconds');
if _t2=0 then _t2=length(_infile_);
realtime=substr(_infile_, _t1+9, _t2-_t1-9);
if index(realtime, ':')>0 then do;
realtime2=scan(realtime, 1, ':')*60;
sec=input(substr(realtime, index(realtime, ':')+1), best.);
realtime2=realtime2+sec;
end;
else realtime2=input(compress(realtime), best.);
end;
if index(x, 'cpu time')>0 then do;
_t1=index(_infile_, 'cpu time');
_t2=index(_infile_, 'seconds');
if _t2=0 then _t2=length(_infile_);
cputime=substr(_infile_, _t1+8, _t2-_t1-8);
if index(cputime, ':')>0 then do;
cputime2=scan(cputime, 1, ':')*60;
sec=input(substr(cputime, index(cputime, ':')+1), best.);
cputime2=cputime2+sec;
end;
else cputime2=input(compress(cputime), best.);
keep id size procedure cputime2 realtime2 ;
size=id*100;
if compress(procedure)^="PRINTTO" then output;
end;
run;

We then visualize the results using the following code:



title "Benchmark Regression PROCs using OLS";
proc sgpanel data=proc_compare;
panelby procedure /rows=2;
series y=cputime2 x=size/ lineattrs=(thickness=2);
label cputime2="CPU Time (sec)"
size="Problem Size"
;;
colaxis grid;
rowaxis grid;
run;
title;

title "Closer Look on REG vs. GLM vs. GLMSELECT";
proc sgplot data=proc_compare uniform=group;
where procedure in ("GLMSELECT", "REG", "GLM");
series x=size y=cputime2/group=procedure curvelabel lineattrs=(thickness=2);
label cputime2="CPU Time (sec)"
size="# of Variables"
;;
yaxis grid ;
xaxis grid ;
run;
title;






It is found that PROC GLM and GLMSELECT beat all other procedures with large margin while HPMIXED is the slowest followed by GLIMMIX. Surprisingly, REG is slower than both GLM and GLMSELECT even though it utilized multi-threading technique while GLMSELECT does not:

************ Partial LOG of the last iteration ********
NOTE: PROCEDURE REG used (Total process time):
real time 6.79 seconds
cpu time 9.36 seconds


NOTE: There were 2000 observations read from the data set WORK._TEMP.
NOTE: PROCEDURE GLMSELECT used (Total process time):
real time 3.06 seconds
cpu time 2.96 seconds
********************************************************

The performance gap between REG and GLM/GLMSELECT is getting larger when the number of variables increases to be more than 700.

Both REG and GLMSELECT are developed by the same group of developers in SAS, as far as I know.

********************* PS : ****************************
Rick and Charlie pointed out that real time is a more fair measure, which I agree.

The reading of real computing time has large variance from run to run because the testing enviornment is not very clean and there are many background window programs running. Below is part of the log file of another run with 2000 variables and 4000 records:

NOTE: PROCEDURE REG used (Total process time):
      real time           2.26 seconds
      cpu time            7.76 seconds
     


NOTE: PROCEDURE GLM used (Total process time):
      real time           3.57 seconds
      cpu time            4.58 seconds


NOTE: There were 2000 observations read from the data set WORK._TEMP.
NOTE: PROCEDURE GLMSELECT used (Total process time):
      real time           3.50 seconds
      cpu time            3.44 seconds
     
We see that REG has lower real time comparing to GLM/GLMSELECT, even though cpu time is about twice the average of GLM/GLMSELECT. In a case where BY-processing is used, GLMSELECT will use multi-threading as specified in PERFORMANCE statement, and the gap in real time between REG and GLMSELECT will be eliminated. In a collaborating environment, more CPU time also means competing for more resources. Below we show the real time of the same run as in above CPU Time figure.


Note that CPU Time difference and pattern is pretty consistent.Below is the mean CPU Time and its 90% C.I. of 100 runs using REG /GLM /GLMSELECT on different size of problems.


 Posted by at 3:18 上午