One common violation of the assumptions needed for linear regression is heterscedasticity by group membership. Both SAS and R can easily accommodate this setting.

Our data today comes from a real example of vitamin D supplementation of milk. Four suppliers claimed that their milk provided 100 IU of vitamin D. The null hypothesis is that they all deliver this accurately, but there was some question about whether the variance was the same between the suppliers. Unfortunately, there are only four observations for each supplier.

SAS
In SAS, we'll read the data in with a simple input statement.
`data milk;input value milk_cat;cards;77               185               191               188               193               2101              2126              2103              2103              388               3109              385               395               483               491               486               4;;run;`
To fit the model, we'll use the group option to the repeated statement in proc mixed. This is specifically designed to allow differing values for groups sharing the same covariance structure. In this case, it's a simple structure: no correlation, constant value on the diagonal. The key pieces of output are selected out using ODS. To assess variance terms, it's best to use maximum likelihood, rather than REML fitting.
`ods select covparms lrt;proc mixed data = milk method = ml;class milk_cat;model value = milk_cat/solution;repeated/group=milk_cat type = simple;run;`
`  Covariance Parameter EstimatesCov Parm     Group         EstimateResidual     milk_cat 1     27.1875Residual     milk_cat 2      150.69Residual     milk_cat 3      100.69Residual     milk_cat 4     21.1875  Null Model Likelihood Ratio Test    DF    Chi-Square      Pr > ChiSq     3          5.13          0.1623`
There's not too much evidence to support different variances, but the power is likely quite small, so we'll retain this model when assessing the null hypothesis of equal means. A REML fit ought to be preferable here, but to agree with the R results, we fit again with ML. Note that proc mixed is not smart enough to accurately determine the degrees of freedom remaining (16 observations - 4 variance parameters - 4 mean parameters) so we need to manually specify the denominator degrees of freedom using the ddf option to the model statement.
`ods select solutionf tests3;proc mixed data = milk method = ml;class milk_cat;model value = milk_cat/solution ddf=8;repeated/group=milk_cat type = simple;run;`
`                   Solution for Fixed Effects                               StandardEffect     milk_cat  Estimate     Error    DF  t Value  Pr > |t|Intercept             88.7500    2.3015    12    38.56    <.0001milk_cat   1          -3.5000    3.4776     8    -1.01    0.3437milk_cat   2          17.0000    6.5551     8     2.59    0.0319milk_cat   3           7.5000    5.5199     8     1.36    0.2113milk_cat   4                0         .     .      .       .        Type 3 Tests of Fixed Effects              Num     DenEffect         DF      DF    F Value    Pr > Fmilk_cat        3       8       3.86    0.0563`
So there's some reason to suspect that the suppliers may be different.

R
In R, we'll re-type this simple data set and assign the group labels manually.
`value = c(77,85,91,88,93,101,126,103,103,88,109,85,95,83,91,86)mc = as.factor(rep(1:4, each=4))milk= data.frame(value, mc)`
To fit the model with unequal variances, we'll use the gls() function in the nlme library. (Credit department: Ben Bolker discusses this and more complex models in an Rpubs document.) (Note that we use maximum likelihood, as we did in SAS.)
`library(nlme)mod = gls(value~mc, data=milk, weights = varIdent(form = ~1|mc),    method="ML")`
To assess the hypothesis of equal variances, we'll compare to the homoscedasticity model using the anova() function.
`mod2 = gls(value~mc, data=milk, method="ML")anova(mod,mod2)`

`     Model df      AIC      BIC    logLik   Test  L.Ratio p-valuemod      1  8 125.3396 131.5203 -54.66981                        mod2     2  5 124.4725 128.3355 -57.23625 1 vs 2 5.132877  0.1623`
This result is identical to SAS, although this is a likelihood ratio test and SAS shows a Wald test.

To assess whether the suppliers are different, we need to compare to the model with just an intercept. When using REML for both models, there was a warning message and a different answer than SAS (using REML in SAS as well). So we'll stick with maximum likelihood here.
`mod3  = gls(value~1,data=milk, weights = varIdent(form = ~1|mc), method="ML")anova(mod3,mod)`
With ML in both programs we get the same results.
`     Model df      AIC      BIC    logLik   Test  L.Ratio p-valuemod3     1  5 126.8858 130.7488 -58.44292                        mod      2  8 125.3396 131.5203 -54.66981 1 vs 2 7.546204  0.0564`

``/*********************************************************************FILENAME: MIXVMEAN.SASSUBJECT HEADING: STATINITIALS:  KBWDATE:  7/26/96PROGRAM:  SASVERSION:  6.11PLATFORM:  WINDOWS 3.11 TS040TITLE:  USING PROC MIXED FOR A PAIRED T-TEST COMPARED TO PROC MEANSDESCRIPTION:  THIS PROGRAM DOES A PAIRED T-TEST, FOR THE SAME DATA,              FIRST USING PROC MEANS, AND THEN USING PROC MIXED.              NOTE THAT THE DATA STRUCTURE FOR THESE TWO METHODS              IS DIFFERENT.  THE DATA STEPS AT THE BEGINNING SET              UP THE TWO DIFFERENT DATA STRUCTURES.              NOTE THAT THE T-TEST, THE DEGREES OF FREEDOM AND              THE P-VALUES ARE THE SAME FOR BOTH METHODS.**********************************************************************/data test;  input y1 y2;  diff=y1-y2;  Pair+1;  datalines;  13 15  12 14  17 17.2  14 18  11 12  5  4.1  7 9.3     ;run;proc print data=test;  title 'printout of original data set';run;proc means n mean t prt;  var diff;  title 'paired t-test using proc means';run;data test2(keep=y pair group);  set test;     y=y1;     group=1;     output;     y=y2;     group=0;     output;run;proc print data=test2;  title 'rearranged data for proc mixed';run;proc mixed;    class pair group;    model y=group;    random pair;    lsmeans group / pdiff;    title 'paired t-test using proc mixed';run; ``

Rick Wicklin discussed in his blog the performance in solving a linear system using SOLVE() function and INV() function from IML.

Since regression analysis is an integral part of SAS applications and there are many SAS procedures in SAS/STAT that are capable to conduct various regression analysis, it would be interesting to benchmark their relative performance using OLS regression, the fundamental regression analysis of all.

The analysis will compare REG, GLMSELECT, GENMOD, MIXED, GLIMMIX, GLM, ORTHOREG, HPMIXED and TRANSREG on 10 OLS regressions with 100 to 1000 variables, incremental at 100, and with the number of observations twice the number of variables to avoid possible numerical issues. HPMIXED uses sparse matrix techniques and will be put into great disadvantage in this comparison using large dense matrices. A macro wraps them together:

``````

%macro wrap;
proc printto log='c:\testlog.txt';run;

%let t0=%sysfunc(datetime(), datetime.);
%let procnames=GLM REG GLMSELECT ORTHOREG MIXED GLIMMIX GENMOD ;
%let nproc=%sysfunc(countW(&procnames));
%put Test &nproc PROCEDURES;
%do i=1 %to 10;
%let nobs=%sysevalf(&i*100);
options nonotes;
data _temp;
array x{&nobs};
do i=1 to 2*&nobs;
do j=1 to &nobs;
x[j]=rannor(0);
end;
y=rannor(0);
drop i j;
output;
end;
run;
options notes;
ods select none;

%do j=1 %to &nproc;
%let proc=%scan(&procnames, &j);
%put &proc;
proc &proc data=_temp;
model y = x1-x&nobs;
run;
%end;
%put TRANSREG;
proc transreg data=_temp;
model identity(y) = identity(x1-x&nobs);
run;
sasfile _temp close;
ods select all;
%end;
proc printto; run;
%mend;
%wrap;
``````

After running all iterations, the SAS log is parsed to obtain procedure names and corresponding real time and CPU time. The following SAS code does this job:

``````

data proc_compare;
infile "c:\testlog.txt";
input;
retain procedure ;
retain realtime cputime  realtime2 ;
length procedure \$12.;
length realtime  cputime \$24.;
if _n_=1 then id=0;
x=_infile_;
if index(x, 'PROCEDURE')>0 then do;
procedure=scan(_infile_, 3);
if procedure="REG" then id+1;
end;

if index(x, 'real time')>0 then do;
_t1=index(_infile_, 'real time');
_t2=index(_infile_, 'seconds');
if _t2=0 then _t2=length(_infile_);
realtime=substr(_infile_, _t1+9, _t2-_t1-9);
if index(realtime, ':')>0 then do;
realtime2=scan(realtime, 1, ':')*60;
sec=input(substr(realtime, index(realtime, ':')+1), best.);
realtime2=realtime2+sec;
end;
else realtime2=input(compress(realtime), best.);
end;
if index(x, 'cpu time')>0 then do;
_t1=index(_infile_, 'cpu time');
_t2=index(_infile_, 'seconds');
if _t2=0 then _t2=length(_infile_);
cputime=substr(_infile_, _t1+8, _t2-_t1-8);
if index(cputime, ':')>0 then do;
cputime2=scan(cputime, 1, ':')*60;
sec=input(substr(cputime, index(cputime, ':')+1), best.);
cputime2=cputime2+sec;
end;
else cputime2=input(compress(cputime), best.);
keep id size  procedure cputime2 realtime2 ;
size=id*100;
if compress(procedure)^="PRINTTO" then output;
end;
run;
``````

We then visualize the results using the following code:
``````

title "Benchmark Regression PROCs using OLS";
proc sgpanel data=proc_compare;
panelby procedure /rows=2;
series y=cputime2  x=size/ lineattrs=(thickness=2);
label cputime2="CPU Time (sec)"
size="Problem Size"
;;
colaxis grid;
rowaxis grid;
run;
title;

title "Closer Look on REG vs. GLM vs. GLMSELECT";
proc sgplot data=proc_compare  uniform=group;
where procedure in ("GLMSELECT", "REG", "GLM");
series x=size y=cputime2/group=procedure  curvelabel lineattrs=(thickness=2);
label cputime2="CPU Time (sec)"
size="# of Variables"
;;
yaxis grid ;
xaxis grid ;
run;
title;

``````

It is found that PROC GLM and GLMSELECT beat all other procedures with large margin while HPMIXED is the slowest followed by GLIMMIX. Surprisingly, REG is slower than both GLM and GLMSELECT even though it utilized multi-threading technique while GLMSELECT does not:

************ Partial LOG of the last iteration ********
NOTE: PROCEDURE REG used (Total process time):
real time 6.79 seconds
cpu time 9.36 seconds

NOTE: There were 2000 observations read from the data set WORK._TEMP.
NOTE: PROCEDURE GLMSELECT used (Total process time):
real time 3.06 seconds
cpu time 2.96 seconds
********************************************************

The performance gap between REG and GLM/GLMSELECT is getting larger when the number of variables increases to be more than 700.

Both REG and GLMSELECT are developed by the same group of developers in SAS, as far as I know.

********************* PS : ****************************
Rick and Charlie pointed out that real time is a more fair measure, which I agree.

The reading of real computing time has large variance from run to run because the testing enviornment is not very clean and there are many background window programs running. Below is part of the log file of another run with 2000 variables and 4000 records:

NOTE: PROCEDURE REG used (Total process time):
real time           2.26 seconds
cpu time            7.76 seconds

NOTE: PROCEDURE GLM used (Total process time):
real time           3.57 seconds
cpu time            4.58 seconds

NOTE: There were 2000 observations read from the data set WORK._TEMP.
NOTE: PROCEDURE GLMSELECT used (Total process time):
real time           3.50 seconds
cpu time            3.44 seconds

We see that REG has lower real time comparing to GLM/GLMSELECT, even though cpu time is about twice the average of GLM/GLMSELECT. In a case where BY-processing is used, GLMSELECT will use multi-threading as specified in PERFORMANCE statement, and the gap in real time between REG and GLMSELECT will be eliminated. In a collaborating environment, more CPU time also means competing for more resources. Below we show the real time of the same run as in above CPU Time figure.

Note that CPU Time difference and pattern is pretty consistent.Below is the mean CPU Time and its 90% C.I. of 100 runs using REG /GLM /GLMSELECT on different size of problems.