PROC REG

May 29, 2020
 

While working at the Rutgers Robert Wood Johnson Medical School, I had access to data on over ten million visits to emergency departments in central New Jersey, including ICD-9 (International Classification of Diseases, 9th Revision) codes along with some patient demographic data.

I also had the ozone level from several central New Jersey monitoring stations for every hour of the day for ten years. I used PROC REG (and PROC ARIMA) to assess the association between ozone levels and the number of emergency department admissions diagnosed as asthma. Some of the predictor variables, besides ozone level, were pollen levels and a dichotomous variable indicating whether the date fell on a weekend. (On weekdays, patients were more likely to visit their personal physician than on a weekend.) The study showed a significant association between ozone levels and asthma attacks.
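A minimal sketch of the kind of model described above, using hypothetical data set and variable names (asthma_daily, admissions, ozone, pollen, weekend) rather than the original study code:

proc reg data=asthma_daily;
   /* admissions = daily asthma ED admissions; weekend = 0/1 indicator */
   model admissions = ozone pollen weekend;
run;
quit;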

It would have been nice to have the incredible diagnostics that are now produced when you run PROC REG. Imagine if I had SAS Studio back then!

In the program, I used a really interesting trick. (Thank you Paul Grant for showing me this trick so many years ago at a Boston Area SAS User Group meeting.) Here's the problem: there are many possible codes such as 493, 493.9, 493.100, 493.02, and so on that all relate to asthma. The straightforward way to check an ICD-9 code would be to use the SUBSTR function to pick off the first three digits of the code. But why be straightforward when you can be tricky or clever? (Remember Art Carpenter's advice to write clever code that no one can understand so they can't fire you!)
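For contrast, here is a sketch of the straightforward SUBSTR version of the check (not the program I actually used; it reuses the same test codes as the program below):

*The straightforward approach: compare the first three characters explicitly;
data ICD_9_substr;
  input ICD : $7. @@;
  if substr(ICD, 1, 3) = "493" then output;
datalines;
493 770.6 999 493.9 493.90 493.100
;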

The following program demonstrates the =: operator:

*An interesting trick to read ICD codes;
data ICD_9;
  input ICD : $7. @@;
  if ICD =: "493" then output;
datalines;
493 770.6 999 493.9 493.90 493.100
;
title "Listing of All Asthma Codes";
proc print data=ICD_9 noobs;
run;

 

Normally, when SAS compares two strings of different lengths, it pads the shorter string with blanks to match the length of the longer string before making the comparison. The =: operator instead truncates the longer string to the length of the shorter string before making the comparison.
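Here is a small sketch (not part of the original program) that contrasts the two comparisons on a single value:

*Contrast the padded (=) and truncated (=:) comparisons;
data _null_;
  length code $7;
  code = "493.9";
  if code =  "493" then put "=  matched: " code=;  /* no match: "493" is blank-padded to 7 characters */
  if code =: "493" then put "=: matched: " code=;  /* matches: comparison truncated to 3 characters */
run;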

The usual reason to write a SAS blog is to teach some aspect of SAS programming or to just point out something interesting about SAS. While that is usually my motivation, I have an ulterior motive in writing this blog – I want to plug a new book I have just published on Amazon. It's called 10-8 Awaiting Crew: Memories of a Volunteer EMT. One of the chapters discusses the difficulty of conducting statistical studies in pre-hospital settings. This was my first attempt at a non-technical book. I hope you take a look. (Enter "10-8 awaiting crew" or "Ron Cody" in Amazon search to find the book.) Drop me an email with your thoughts at ron.cody@gmail.com.

Using SAS to estimate the link between ozone and asthma (and a neat trick) was published on SAS Users.

October 7, 2011
 
Recently, I have been working on SAS code for a set of regularized regressions and need to compute the trace of the projection matrix:
$$ S = X(X'X + \lambda I)^{-1}X'. $$

Wikipedia has a well-written introduction to the trace of a matrix.

To obtain the inverse of the matrix (X'X + \lambda I) in SAS/STAT, there are several options:
1. Build the SSCP matrix first, then invert it using one of the following methods:
    1.1. Using an SVD-based method, shown here but not demonstrated in this post;
    1.2. Using PROC REG, shown here and also demonstrated in the code below;
    1.3. Using the SWEEP operator, shown here under Method 0 and Method 1;
2. Since (X'X + \lambda I) is the SSCP matrix of a ridge regression with ridge parameter \lambda, it is handy to use the ODS OUTPUT InvXPX= statement to obtain the inverted matrix directly, after X is appended with a diagonal block \sqrt{\lambda}\,I_{p,p} (see the quick check right after this list); SAS code is demonstrated below.
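A quick check of this augmentation (my note, not from the original post): appending p extra rows of \sqrt{\lambda} I_p below X gives

$$ X_{aug}=\begin{bmatrix} X \\ \sqrt{\lambda}\,I_p \end{bmatrix}, \qquad X_{aug}'X_{aug} = X'X + \lambda I_p, $$

so with \lambda = 1, as used in the code below, appending \lambda I_p directly is equivalent.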

The trace of the projection matrix S is a key concept in modern regression analysis. For example, the effective degrees of freedom of a regularized linear regression is trace(S); see [1] for details. For another example, when approximating leave-one-out cross-validation by GCV, trace(S)/N is a key component (formula 7.52 of [1]).
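For reference, the GCV approximation in formula 7.52 of [1] (restated here) is

$$ \mathrm{GCV}(\hat f) = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_i-\hat f(x_i)}{1-\mathrm{trace}(S)/N}\right]^2 . $$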

Check the reference and the cited works therein for more information.

Ref:
1. Hastie et al., The Elements of Statistical Learning, 2nd Ed.


/* Use the SocEcon example data created in
Example A.1: A TYPE=CORR Data Set Produced by PROC CORR
on page 8153 of SAS/STAT 9.22 User Doc
*/
data SocEcon;
input Pop School Employ Services House;
datalines;
5700 12.8 2500 270 25000
1000 10.9 600 10 10000
3400 8.8 1000 10 9000
3800 13.6 1700 140 25000
4000 12.8 1600 140 25000
8200 8.3 2600 60 12000
1200 11.4 400 10 16000
9100 11.5 3300 60 14000
9900 12.5 3400 180 18000
9600 13.7 3600 390 25000
9600 9.6 3300 80 12000
9400 11.4 4000 100 13000
;
run;

%let depvar=HOUSE;
%let covars=pop school employ services;
%let lambda=1;
/* Below is the way to obtain trace(S), where S is the projection matrix in a (regularized) linear regression.
For further information, check pp. 68 and 153 of The Elements of Statistical Learning, 2nd Ed.
*/

/* For details about TYPE=SSCP special SAS data, consult:
Appendix A: Special SAS Data Sets, pp.8159 of SAS/STAT 9.22 User's Guide
*/
proc corr data=SocEcon sscp out=xtx(where=(_TYPE_='SSCP')) noprint;
var &covars;
run;


data xtx2;
set xtx;
array _n{*} _numeric_;
array _i{*} i1-i5 (5*0);
do j=1 to 5;
if j=_n_ then _i[_n_]=&lambda;
else _i[j]=0;
end;
/* add the ridge constant to the diagonal of the SSCP
   (hard-coded as 1 here, equal to &lambda above) */
_n[_n_]=_n[_n_]+1;
drop j _TYPE_ _NAME_;
run;

/* Obtain the inverse of (XTX+\lambda*I)
Note that we explicitly specified Intercept term in the
covariate list and fit a model without implicit intercept
term in the model.
*/
proc reg data=xtx2
outest=S0(type=SSCP
drop=i1-i5 _MODEL_ _DEPVAR_ _RMSE_)
singular=1E-17;
model i1-i5 = Intercept &covars / noint noprint;
run;quit;
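A brief note on why this works (my sketch, not from the original post): treating the rows of A = X'X + \lambda I as observations and the columns of \lambda I (the i1-i5 variables) as responses, the no-intercept OLS coefficient matrix is

$$ B = (A'A)^{-1}A'(\lambda I) = \lambda A^{-1}, $$

because A is symmetric and nonsingular; with \lambda = 1 this is exactly (X'X + \lambda I)^{-1}.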

data S0;
set S0;
length _NAME_ $8;
_NAME_=cats('X_', _n_);
run;

proc score data=SocEcon score=S0 out=XS0(keep=X_:) type=parms;
var &covars;
run;

/* the data set X is assumed to be SocEcon with an Intercept column of 1s,
   created earlier (not shown in this post) */
data XS0X;
merge XS0 X;
array _x1{*} X_:;
array _x0{*} intercept pop school employ services;
do i=1 to dim(_X1);
_x1[i]=_x1[i]*_x0[i];
end;
rowSum=sum(of _x1[*]);
keep rowSum;
run;
proc means data=XS0X noprint;
var rowSum;
output out=trace sum(rowSum)=Trace;
run;


Verify the result using R:


> socecon<-read.csv('c:/socecon.csv', header=T)
> x<-as.matrix(cbind(1, socecon[,-5]))
> xtx<-t(x)%*%x
> phi<-xtx+diag(rep(1, 5))
>
> # method 1.
> S<-x%*%solve(phi)*x
> sum(S)
[1] 4.077865
> # method 2.
> S<-(x%*%solve(phi))%*%t(x)
> sum(diag(S))
[1] 4.077865
>

Besides method 1.2 shown above, we can also use Method 2 mentioned earlier and obtain the same inverse matrix:


/* SocEcon2 appends p extra rows below the design data (the data set x,
   i.e. the raw covariates plus an Intercept column); in general the
   appended block should be sqrt(lambda)*I so that the augmented SSCP is
   X'X + lambda*I -- with lambda=1, as set above, sqrt(lambda)=lambda */
data SocEcon2 /view=SocEcon2;
set x end=eof;
array x{5} Intercept &covars (5*0);
Intercept=1;
output;
if eof then do;
do j=1 to dim(x);
x[j]=&lambda;
output;
x[j]=0;
end;
end;
drop j;
run;

ods select none;
ods output InvXPX=S1;
proc reg data=SocEcon2 singular=1E-17;
/* the response only needs to exist in the data; &depvar (House) is used here
   since no variable named y is present in SocEcon2 */
model &depvar = Intercept &covars /noint i;
run; quit;
ods select all;


As a side point, it is also of interest to compare the computational performance of the two illustrated methods; the SVD-based and SWEEP-operator-based approaches are omitted.

Using a SAS data set with 1001 covariates (including the Intercept term) and 1E6 observations, 7.6GB in total, the test was run on two dramatically different machines: a Windows XP laptop with a mediocre HDD and a 2-core T7300 CPU, and a high-end server running 64-bit Linux with a disk array and fast CPUs totaling 16 cores.

On the PC:
-> Using method 1.2, the time breakdown is listed below:
PROC CORR: real time: 25:47.72; CPU time: 15:26.15
DATA step on XTX: real time: 2.28 sec; CPU time: 0.07 sec
PROC REG: real time: 15.45 sec; CPU time: 27.31 sec
Total real time: 26:05.45

-> Using method 2, the time used is:
PROC REG: real time: 48:28.44; CPU time: 1:16:46.41

On the server:
-> Using method 1.2, the time breakdown is listed below:
PROC CORR: real time: 5:40.61; CPU time: 5:40.58
DATA step on XTX: real time: 0.05 sec; CPU time: 0.05 sec
PROC REG: real time: 1.71 sec; CPU time: 5.27 sec
Total real time: 5:42.37

-> Using method 2, the time used is:
PROC REG: real time: 6:01.46; CPU time: 19:13.49

To summarize the timings above: method 1.2 is substantially faster than method 2 on both machines, and most of its time is spent in PROC CORR building the SSCP matrix.

August 19, 2011
 
Rick Wicklin discussed in his blog the performance of solving a linear system using the SOLVE() and INV() functions in SAS/IML.

Since regression analysis is an integral part of SAS applications and many SAS/STAT procedures are capable of conducting various regression analyses, it is interesting to benchmark their relative performance on OLS regression, the most fundamental regression analysis of all.

The analysis compares REG, GLMSELECT, GENMOD, MIXED, GLIMMIX, GLM, ORTHOREG, HPMIXED and TRANSREG on 10 OLS regressions with 100 to 1000 variables, in increments of 100, and with the number of observations set to twice the number of variables to avoid possible numerical issues. HPMIXED uses sparse matrix techniques and is therefore at a great disadvantage in this comparison on large dense matrices. A macro wraps them all together:




%macro wrap;
proc printto log='c:\testlog.txt';run;

%let t0=%sysfunc(datetime(), datetime.);
%let procnames=GLM REG GLMSELECT ORTHOREG MIXED GLIMMIX GENMOD ;
%let nproc=%sysfunc(countW(&procnames));
%put Test &nproc PROCEDURES;
%do i=1 %to 10;
%let nobs=%sysevalf(&i*100);
options nonotes;
data _temp;
array x{&nobs};
do i=1 to 2*&nobs;
do j=1 to &nobs;
x[j]=rannor(0);
end;
y=rannor(0);
drop i j;
output;
end;
run;
options notes;
sasfile _temp load;
ods select none;

%do j=1 %to &nproc;
%let proc=%scan(&procnames, &j);
%put &proc;
proc &proc data=_temp;
model y = x1-x&nobs;
run;
%end;
%put TRANSREG;
proc transreg data=_temp;
model identity(y) = identity(x1-x&nobs);
run;
sasfile _temp close;
ods select all;
%end;
proc printto; run;
%mend;
%wrap;

After running all iterations, the SAS log is parsed to obtain procedure names and corresponding real time and CPU time. The following SAS code does this job:




data proc_compare;
infile "c:\testlog.txt";
input;
retain procedure ;
retain realtime cputime realtime2 ;
length procedure $12.;
length realtime cputime $24.;
if _n_=1 then id=0;
x=_infile_;
if index(x, 'PROCEDURE')>0 then do;
procedure=scan(_infile_, 3);
if procedure="REG" then id+1;
end;

if index(x, 'real time')>0 then do;
_t1=index(_infile_, 'real time');
_t2=index(_infile_, 'seconds');
if _t2=0 then _t2=length(_infile_);
realtime=substr(_infile_, _t1+9, _t2-_t1-9);
if index(realtime, ':')>0 then do;
realtime2=scan(realtime, 1, ':')*60;
sec=input(substr(realtime, index(realtime, ':')+1), best.);
realtime2=realtime2+sec;
end;
else realtime2=input(compress(realtime), best.);
end;
if index(x, 'cpu time')>0 then do;
_t1=index(_infile_, 'cpu time');
_t2=index(_infile_, 'seconds');
if _t2=0 then _t2=length(_infile_);
cputime=substr(_infile_, _t1+8, _t2-_t1-8);
if index(cputime, ':')>0 then do;
cputime2=scan(cputime, 1, ':')*60;
sec=input(substr(cputime, index(cputime, ':')+1), best.);
cputime2=cputime2+sec;
end;
else cputime2=input(compress(cputime), best.);
keep id size procedure cputime2 realtime2 ;
size=id*100;
if compress(procedure)^="PRINTTO" then output;
end;
run;

We then visualize the results using the following code:



title "Benchmark Regression PROCs using OLS";
proc sgpanel data=proc_compare;
panelby procedure /rows=2;
series y=cputime2 x=size/ lineattrs=(thickness=2);
label cputime2="CPU Time (sec)"
size="Problem Size"
;;
colaxis grid;
rowaxis grid;
run;
title;

title "Closer Look on REG vs. GLM vs. GLMSELECT";
proc sgplot data=proc_compare uniform=group;
where procedure in ("GLMSELECT", "REG", "GLM");
series x=size y=cputime2/group=procedure curvelabel lineattrs=(thickness=2);
label cputime2="CPU Time (sec)"
size="# of Variables"
;;
yaxis grid ;
xaxis grid ;
run;
title;






It turns out that PROC GLM and GLMSELECT beat all other procedures by a large margin, while HPMIXED is the slowest, followed by GLIMMIX. Surprisingly, REG is slower than both GLM and GLMSELECT even though it uses multi-threading while GLMSELECT does not:

************ Partial LOG of the last iteration ********
NOTE: PROCEDURE REG used (Total process time):
real time 6.79 seconds
cpu time 9.36 seconds


NOTE: There were 2000 observations read from the data set WORK._TEMP.
NOTE: PROCEDURE GLMSELECT used (Total process time):
real time 3.06 seconds
cpu time 2.96 seconds
********************************************************

The performance gap between REG and GLM/GLMSELECT grows larger as the number of variables increases beyond 700.

Both REG and GLMSELECT are developed by the same group of developers in SAS, as far as I know.

********************* PS : ****************************
Rick and Charlie pointed out that real time is a fairer measure, and I agree.

The real-time readings have a large variance from run to run because the testing environment is not very clean and many background Windows programs are running. Below is part of the log file from another run with 2000 variables and 4000 records:

NOTE: PROCEDURE REG used (Total process time):
      real time           2.26 seconds
      cpu time            7.76 seconds
     


NOTE: PROCEDURE GLM used (Total process time):
      real time           3.57 seconds
      cpu time            4.58 seconds


NOTE: There were 2000 observations read from the data set WORK._TEMP.
NOTE: PROCEDURE GLMSELECT used (Total process time):
      real time           3.50 seconds
      cpu time            3.44 seconds
     
We see that REG has lower real time than GLM/GLMSELECT, even though its CPU time is about twice the average of GLM/GLMSELECT. When BY-processing is used, GLMSELECT will use multi-threading as specified in the PERFORMANCE statement, and the gap in real time between REG and GLMSELECT is eliminated. In a shared environment, more CPU time also means competing for more resources. Below we show the real time of the same runs as in the CPU time figure above.


Note that the CPU time difference and pattern are fairly consistent. Below are the mean CPU time and its 90% C.I. over 100 runs using REG/GLM/GLMSELECT on problems of different sizes.


August 10, 2011
 
We demonstrate a comparison among various implementations of rolling regression in SAS and show that the fastest implementation is over 3200X faster than the traditional BY-processing approach.

More often than not, we encounter a problem where an OLS over a rolling time window is required, see [1], [2], [3], [4], [5], [6], [7], for a few examples.

One solution is to resort to the SAS macro language, but it is extremely inefficient and cannot handle large data sets in practice [8]. This method is shown below as Method 4; it could not finish the test in an acceptable time using the sample data below.

The other common solution is to use the BY-processing capability of PROC REG after reshaping the data into the appropriate format; see [9], [10]. This method is demonstrated as Method 3 below. While certainly much better than the one above, it is still not the fastest and it requires more memory.

The third solution comes into play by recognizing that in OLS all you need is the SSCP, and you can easily build up the SSCP over a rolling time window with PROC EXPAND. This is demonstrated as Method 2 below. This approach further improves the speed but still requires a large amount of memory if the data are big and many rolling windows are generated.
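Concretely, for the one-predictor model used in the experiment, the coefficients over a window of width w are just functions of moving sums, which is exactly what Method 2 computes from the PROC EXPAND output:

$$ \hat\beta = \frac{\sum x_i y_i - \frac{1}{w}\sum x_i \sum y_i}{\sum x_i^2 - \frac{1}{w}\left(\sum x_i\right)^2}, \qquad \hat\alpha = \bar y - \hat\beta\,\bar x , $$

with all sums and means taken over the current window.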

Since what we need is to build the SSCP matrix and obtain the coefficient estimates from the information in the SSCP, we can code this directly in a DATA step using the ADJUST operator, which gives a solution that is both fast and light on memory. See [11] for an introduction to the ADJUST operator. To make this even faster, a modification of the ADJUST operator, the SWEEP operator, can be used; for an introduction to the SWEEP operator, see [11], [12]. In the code below, Method 0 implements the ADJUST operator, while Method 1 implements the SWEEP operator.
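As a reminder of the pivot step (a sketch following [11]; this is what the adjust link routine in Method 0 applies to the bordered SSCP matrix): pivoting on the k-th diagonal element a_{kk} updates

$$ a_{kj} \leftarrow \frac{a_{kj}}{a_{kk}}, \qquad a_{ij} \leftarrow a_{ij} - a_{ik}\,\frac{a_{kj}}{a_{kk}} \quad (i \neq k), $$

and after pivoting on all predictor rows, the column bordered by y holds the OLS coefficients.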

The experiment runs 499,980 regressions, each over a window of 20 observations with an intercept and one predictor, and the results are shown below:

Method   |     Real Time      |      CPU Time      |   Memory
==============================================================
Method 0 |   1.01 (seconds)   |   1.01 (seconds)   |     611K
Method 1 |   0.25 (seconds)   |   0.24 (seconds)   |     432K
Method 2 |   1.61 (seconds)   |   0.94 (seconds)   |   50381K
Method 3 |  80.54 (seconds)   |  79.61 (seconds)   |    2322K
Method 4 |       Failed       |       Failed       |   Failed
==============================================================

Reference:
[1]MYSAS.NET, http://www.mysas.net/forum/viewtopic.php?f=4&t=8070
[2]MYSAS.NET, http://www.mysas.net/forum/viewtopic.php?f=4&t=7898
[3]SAS-L, http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0604D&L=sas-l&P=R32485
[4]SAS-L, http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0704C&L=sas-l&P=R3305
[5]SAS-L, http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0802C&L=sas-l&P=R9746
[6]SAS-L, http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0801C&L=sas-l&P=R14671
[7]SAS-L, http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0810A&L=sas-l&P=R19135
[8]SAS-L, http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0802C&L=sas-l&P=R13489
[9]Michael D Boldin, "Programming Rolling Regressions in SAS", Proceedings of NESUG, 2007
[10]SAS-L, http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0604D&L=sas-l&D=0&P=56926
[11]J. H. Goodnight, "The Sweep Operator: Its Importance in Statistical Computing", SAS Tech Report R-106, 1978
[12]Kenneth Lange, "Numerical Analysis for Statisticians", Springer, 1998


proc datasets library=work kill; run;


options fullstimer;
data test;
do seq=1 to 500000;
x1=rannor(9347957);
*x2=rannor(876769)+0.1*x1;
epsilon=rannor(938647)*0.5;
y = 1.5 + 0.5*x1 +epsilon;
output;
end;
run;

/* Method 0.*/
sasfile test load;
data res0;
set test;
array _x{3,3} _temporary_ ;
array _a{3,3} _temporary_ ;
array _tempval{5, 20} _temporary_ ;
m=mod(_n_-1, 20)+1;
_tempval[1, m]=x1;
_tempval[2, m]=y;
_tempval[3, m]=x1**2;
_tempval[4, m]=x1*y;
_tempval[5, m]=y**2;

link filler;
if _n_>=20 then do;
if _n_>20 then do;
m2=mod(_n_-20, 20)+1;
_x[1,2]+(-_tempval[1, m2]);
_x[1,3]+(-_tempval[2, m2]);
_x[2,2]+(-_tempval[3, m2]);
_x[2,3]+(-_tempval[4, m2]);
_x[3,3]+(-_tempval[5, m2]);
end;
do i=1 to dim(_a, 1);
do j=1 to dim(_a, 2);
_a[i, j]=_x[i, j];
end;
end;

do k=1 to dim(_a, 1)-1;
link adjust;
end;
Intercept=_a[1,3]; beta=_a[2,3];
keep seq intercept beta;
output;
end;

return;
filler:
_x[1,1]=20; _x[1,2]+x1; _x[1,3]+y;
_x[2,2]+_tempval[3,m]; _x[2,3]+_tempval[4,m]; _x[3,3]+_tempval[5,m];
_x[2,1]=_x[1,2]; _x[3,1]=_x[1,3]; _x[3,2]=_x[2,3];
return;

adjust:
B=_a[k, k];
do j=1 to dim(_a, 2);
_a[k, j]=_a[k, j]/B;
end;
do i=1 to dim(_a, 1);
if i ^=k then do;
B=_a[i, k];
do j=1 to dim(_a, 2);
_a[i, j]=_a[i, j]-B*_a[k, j];
end;
end;
end;
return;

run;
sasfile test close;



/* Method 1.*/

sasfile test load;
data rest0;
set test;
array _x{4} _temporary_;
array _a{2,20} _temporary_;
m=mod(_n_-1, 20)+1;
_a[1, m]=x1; _a[2,m]=y;
link filler;

m2=mod(_n_-20, 20)+1;
if _n_>=20 then do;
if _n_>20 then do;
link deduct;
end;
beta=(_x[2]-_x[1]*_x[4]/20)/(_x[3]-_x[1]**2/20);
intercept=_x[4]/20 - beta*_x[1]/20;
keep seq intercept beta ;
output;
end;
return;
filler:
_x[1]+x1;
_x[2]+x1*y;
_x[3]+x1**2;
_x[4]+y;
return;
deduct:
_x[1]=_x[1]-_a[1,m2];
_x[2]=_x[2]-_a[1,m2]*_a[2,m2];
_x[3]=_x[3]-_a[1,m2]**2;
_x[4]=_x[4]-_a[2,m2];
return;
run;
sasfile test close;



/* Method 2.*/

%macro wrap;
%let window=20;
%let diff=%eval(&window-0);
data testv/view=testv;
set test;
xy=x1*y;
run;

proc expand data=testv method=none out=summary(keep=seq sumxy sumx1 sumy ussx1 may max);
convert x1=sumx1/transformout=(movsum &diff);
convert xy=sumxy/transformout=(movsum &diff);
convert x1=ussx1/transformout=(movuss &diff);
convert y =sumy /transformout=(movsum &diff);
convert y =may / transformout=(movave &diff);
convert x1 =max / transformout=(movave &diff);
run;

data result1;
set summary(firstobs=&window);
beta = (sumxy - sumx1*sumy/&window)/(ussx1 - sumx1/&window.*sumx1);
alpha= may - beta*max;
keep seq beta alpha;
run;
%mend;

%let t0=%sysfunc(datetime(), datetime24.);
*options nosource nonotes;
%wrap;
options source notes;
%let t1=%sysfunc(datetime(), datetime24.);
%put Start @ &t0;
%put End @ &t1;



/* Method 3.*/
%let t0=%sysfunc(datetime(), datetime.);

data test2v/view=test2v;
set test;
array _x{2, 20} _temporary_ (20*0 20*0);
k=mod(_n_-1, 20)+1;
_x[1, k]=x1; _x[2, k]=y;
if _n_>=20 then do;
do j=1 to dim(_x, 2);
x=_x[1, j]; y=_x[2, j];
output;
keep seq x y;
end;
end;
run;

ods select none;
proc reg data=test2v outest=res2(keep=seq x intercept);
by seq;
model y = x;
run;quit;
ods select all;

%let t1=%sysfunc(datetime(), datetime.);
%put Start @ &t0;
%put End @ &t1;


/* Method 4. */
%macro wrap;
options nonotes;
ods select none;
%do i=20 %to 500000;
%let fo=%eval(&i-19);
proc reg data=test(firstobs=&fo obs=&i) outest=_xres(keep=x1 intercept);
model y =x1;
run;quit;
%if %eval(&i=20) %then %do;
data res3; set _xres; run;
%end;
%else %do;
proc append base=res3 data=_xres; run;
%end;
%end;

ods select all;
data res3;
set res3;
time=19+_n_;
run;
options notes;
%mend;

%let t0=%sysfunc(datetime(), datetime.);
%wrap;
%let t1=%sysfunc(datetime(), datetime.);
%put Start @ &t0;
%put End @ &t1;
January 4, 2011
 


The book "Bayesian Computation with R" by Jim Albert is an easy to read entry level book on applied Bayesian Statistics. While the book was written for R users, it is not difficult to translate the languages between R and SAS and I believe it is a good way to show Bayesian capability of SAS. In the next several months, I am going to translate all R code in the book into SAS. Readers are encouraged to buy this book to understand what's behind the code. The posts here will only spend minimum effects to explain the statistics underlying while most resource will still be denoted to SAS coding.

This post will cover Chapter 1.

We first follow section 1.2.2, using PROC IMPORT to read in a tab-delimited file, which corresponds to read.table(file, sep='\t', header=T) in R. We also designate PDF as the output destination.

/* Chapter 1 of Bayesian Computation with R */

%let folder=C:\Documents\SAS for Bayesian Compu with R;
%let datafolder=&folder\data;

libname data "&datafolder";
options missing='.' formchar = "|----|+|---+=|-/\<>*"  errors=2 fullstimer;

/*@@@@@@ Section 1.2.2 @@@@@*/
/*--> Read Tab Deliminated Data into SAS */
proc import datafile="&datafolder/studentdata.txt"  out=student dbms=dlm   replace;
      datarow=2;       /* redundant with the GETNAMES= statement */
   guessingrows=2;  /* Only use the first 2 rows of data to guess the type of each variable */
      delimiter='09'x;  /* this is redundant if we specify DBMS=TAB in the PROC IMPORT statement */
   getnames=YES;  /* This is the same as 'header=TRUE' in R's read.table function */
run;

ods pdf file="&folder/Chapter1.pdf";
/*-->  To see Variable Names and print first row of this data 
   Use SPLIT="*" to specify the split character being "*", which controls 
   line breaks in column headings
*/
proc print data=student(obs=1)  split="*"; run;

SAS output looks like:
Section 1.2.3 summarizes frequency tables and uses a bar plot to visualize the counts.

PROC FREQ directly corresponds to the table() function in R, but provides much richer functionality. PROC GBARLINE does what barplot() does in R. The LABEL statement in SAS is a very handy annotation tool for printed output and graphs.

PROC MEANS roughly corresponds to the summary() function for obtaining summary statistics of quantitative variables. You can specify MAXDEC= to keep the desired number of decimals in the printout.
Histograms can be obtained in SAS via either PROC UNIVARIATE or the newly introduced PROC SGPLOT. PROC UNIVARIATE provides more control, though: as shown below, you can specify the desired midpoints, which you cannot do in SGPLOT. On the other hand, SGPLOT is very convenient if you want to overlay a density curve, either a normal fit or a kernel estimate.


/*@@@@ section 1.2.3 @@@*/
/*--> To summarize and graph a single Batch */
proc freq data=student;
     table Drink;
run;

/*--> Barplot on single variable */
title "Original idea, summarize and plot";
proc freq data=student  noprint;
      table Drink /out=t;
run;
goptions reset=all noborder;
proc gbarline data=t;
      bar  Drink /discrete space=4  sumvar=count;
run;quit;
title;
/*--> Actually, in SAS, it is better to directly use 
    PROC GBARLINE */
goptions reset=all noborder;
title "In SAS, PROC GBARLINE directly summarize";
proc gbarline data=student;
      bar Drink /discrete space=3 ;
run;quit;
title;

data student;
      set student;
   hours_of_sleep=WakeUp-ToSleep;
   /* Label can be useful for annotation purpose */
   label hours_of_sleep="Hours Of Sleep"; 
run;

proc means data=student 
           /* Request same statistics as in the book */
           min q1 median q3 max  nmiss             
           /* MAXDEC= specifies keeping 2 decimals */
           maxdec=2;                          
    var hours_of_sleep;
run;

/*--> Histogram using PROC UNIVARIATE */
proc univariate data=student  noprint;
      var hours_of_sleep;
   histogram /midpoints=2.5 to 12.5 by 1;
run;

/* new SGPLOT is a handy way to draw histogram and density */
proc sgplot data=student;
      histogram hours_of_sleep/scale=count;     
      density   hours_of_sleep/scale=count  type=normal;
run;

Compare SAS outputs:





Boxplots visualize the distribution of a quantitative variable by some classification variable. Here, the var1*var2 formula style corresponds to the var1~var2 style in R.


/*@@@ section 1.2.4 @@@@*/
/*---> Compare Batches using Boxplot */
proc sort data=student out=studentsorted; by Gender; run;
proc boxplot data=studentsorted;
      plot  hours_of_sleep*Gender;
run;   

proc means data=student  
            min q1 median q3 max nmiss  
            nway  maxdec=2;
     class Gender;
  var  Haircut;
run;

SAS output looks like:




R has a built-in jitter function; in SAS we have to make our own. This version does not apply the fuzz factor yet and is only for demo purposes.


%macro jitter(var, newname, data=last, factor=1, amount=);
/* follow the JITTER function in R, no fuzzy applied yet . 
   if amount is given, then use this value for disturbance
   else if amount is 0, then use the range of &var for disturbance
   else if amount is NULL, then use the smallest difference among
        distinct values of &var for disturbance. if all values are 
        the same, then use the value obtained from range

   if range of &var is 0, then use the lower range for disturbance
   otherwise 
*/
%let blank=%str( );
%if &data=' ' %then %let data=last;
%local fid;

data _null_;
     fid=round(ranuni(0)*10000, 1);
  call symput('fid', compress(fid));
run;
proc means data=&data  noprint;
     var &var;
  output  out=_util&fid  
          range(&var)=range  
          min(&var)=min  
          max(&var)=max;
run;
data _util&fid;
     set _util&fid;
  if range^=0 then z=range; else z=min;
  if z=0 then z=1;
run;

%if %eval(&amount=&blank) %then %do;
    proc sort data=&data.(keep=&var  where=(&var^=.))   
              out=_xuti&fid  nodupkey;
       by &var;
 run;
 data _duti&fid;
      set _xuti&fid  nobs=ntotal  end=eof;      
   array _x{2} _temporary_;
   if ntotal=1 then do;
      amount=&factor/50*abs(&var);
   keep amount;
   output;
   stop;
   end;
   else do;
      if _n_=1 then do; 
         _x[1]=&var; _x[2]=constant('BIG'); 
      end;
   else do;
               _x[2]=min(_x[2], &var - _x[1]);
      _x[1]=&var;
      if eof then do;
         amount=&factor/5*abs(_x[2]);
      keep amount;
      output;
      end;
   end;
   end;
  run;
    
%end;
%else %if %eval(&amount=0) %then %do;
    data _duti&fid;
      set _util&fid;
   amount=&factor*z/50;
   keep amount;
   output;
 run;     
%end;
%else %do;
    data _duti&fid;
      amount=&amount;
   keep amount;
   output;
 run;
%end;

proc sql noprint;
     select name into :keepvars separated by ' '
  from   sashelp.vcolumn
  where  libname='WORK' 
    and  memname=%upcase("&data") 
    and  memtype="DATA"
  ;
quit;
data &data;     
  array _x{1} _temporary_;
     if _n_=1 then do;
     set _duti&fid;
  _x[1]=amount;
  end;
     set &data;
  &newname=&var + ranuni(0)*(2*_x[1])-_x[1];
  label &newname="jitter(&var)";
  keep &keepvars  &newname;
run;
proc datasets library=work nolist;
     delete _duti&fid _xuti&fid _util&fid;
run;quit;
%mend; 


Now we can apply the jitter macro to the two variables studied in the book and check them out. For graphics, we use AXIS statements to define how we want the two axes to appear and a SYMBOL statement to define the appearance of the data symbols. Here we ask for black circles, not connected (interpol=none).


%jitter(hours_of_sleep, xhos, data=student, factor=1, amount=); 
%jitter(ToSleep, xts, data=student, factor=1, amount=);

goptions border  reset=all;
axis1  label=("jitter(ToSleep)")    
       major=(height=2  number=5)   minor=none;
axis2  label=(angle=90  "jitter(Hours of Sleep")     
       major=(height=2  number=5)  minor=none;
symbol1 value=circle  interpol=none  color=black;
proc gplot data=student;
      plot xhos*xts/haxis=axis1  vaxis=axis2;
run;quit;      


To overlay a fitted regression line, there are multiple ways to do it in SAS, and the statistical graphics procedures are particularly handy for generating this type of figure. Compare the code and the corresponding figures (in the order LINEPRINTER, ODS GRAPHICS, manually drawn, SGPLOT). For the manually drawn figure, since we ask for connected dots in order to overlay a line on the circles, we have to sort the data by the X-axis variable to make the line as straight as possible. The manually drawn figure is also the one closest to the R figure in the book.



/* old school LINEPRINTER option*/
proc reg data=student  lineprinter;
      title "Line Printer Style";
      model   Hours_of_Sleep=ToSleep /noprint;
   var     xhos  xts;   
   plot pred.*xts='X'  xhos*xts/overlay  symbol='o';
run;quit;
title;

/* Using ODS GRAPHICS */
ods select none; /* to suppress PROC REG output */
title "ODS Graphics";
ods graphics on;
proc reg data=student ;
      model   Hours_of_Sleep=ToSleep;
   var     xhos  xts;
   output  out=student  pred=p;  
      ods select FitPlot; 
run;quit;
ods graphics off;
title;
ods select all;

/* Following standard R approach, tedious in SAS */
proc sort data=student  out=studentsorted;
      by ToSleep;
run;
goptions border  reset=all;
axis1  label=("jitter(ToSleep)")      order=(-2 to 6 by 2)  major=(height=2  number=5)   minor=none;
axis2  label=(angle=90  "jitter(Hours of Sleep")    offset=(, 2 pct)  major=(height=2  number=5)  minor=none;
symbol1 value=circle  interpol=none  color=black;
symbol2 value=none  interpol=line  color=black;
proc gplot data=studentsorted;
      plot xhos*xts  p*xts/overlay  haxis=axis1  vaxis=axis2;   
run;quit;      

/* Using build-in REGRESSION capability of GPLOT by specifying
   INTERPOL=R<><0><> option in SYMBOL statement. 
      In the first <>, you identify the type of regression: 
         L=linear, Q=quadratic, C=cubic
      In the second <>, you specify if intercept is omitted;
      In the third <>, you specify type of CI:
         CLM=CI of mean predicted value;
         CLI=CI of individual points;
 */
goptions border reset=all;
axis1  label=("jitter(ToSleep)")      order=(-2 to 6 by 2)  major=(height=2  number=5)   minor=none;
axis2  label=(angle=90 "jitter(Hours of Sleep")    offset=(, 2 pct)  major=(height=2  number=5)  minor=none;
symbol1 value=circle  interpol=rlclm95  color=black  ci=blue  co=blue;
proc gplot data=student;
     plot xhos*xts/regeqn /* haxis=axis1  vaxis=axis2*/;
run;quit;

/* New SGPLOT has build in regression line capability: REG, SPLINE, LOESS */
proc sgplot data=student;
      scatter  x=xts  y=xhos;
   reg  x=ToSleep  y=Hours_of_Sleep;
run;




Now we study the robustness of the T-statistic, which relies on several assumptions such as normality and homoskedasticity. First, we need to define a "function" to compute the T-statistic as in the book.


/* section 1.3 Robustness of T-Statistics */
/* section 1.3.2 Write a function to compute T-Statistics */
%macro tstatistics(vars, dsn=last, outdsn=, class=);
%if &class='' %then %let classstmt=;
%else %let classstmt=class &class;

%let nvars=%sysfunc(count(&vars, ' '));

%let var1=%scan(&vars, 1, ' ');
%let var2=%scan(&vars, 2, ' ');
proc means data=&dsn noprint  nway;
     &classstmt;
  var &vars;
  output  out=tdata(where=(_STAT_ in ('STD', 'MEAN', 'N')));
run;
data &outdsn;
     set tdata; by &class _TYPE_;
  retain  m  n  mean1 mean2  std:;
  select(compress(_STAT_));
          when('N') do;
        m=&var1;
     n=&var2;
    end;
    when('MEAN') do;
              mean1=&var1;
     mean2=&var2;
    end;
          when('STD') do;
        std1=&var1;
     std2=&var2;
    end;
    otherwise;
  end;
     if last._TYPE_ then do;
        sp=sqrt(((m-1)*std1**2 + (n-1)*std2**2)/(m+n-2));
     t= (mean1-mean2)/(sp * sqrt(1/m + 1/n));
  keep &class _TYPE_ m  n std:  mean:  sp t;
  output;
  end;
run;
%mend;

Just like the "source(file)" command in R, in SAS, we have a counterpart called %include(or in abv. %inc). Suppose you have a SAS file called "tstatistics.sas" in some folder &folder, the following statement will read in the source code from this file. Since this is a SAS file, suffix ".sas" is not necessary.


filename BCR "&folder";
%inc BCR(tstatistics);

The following pieces of code conduct the four simulation scenarios illustrated in the book and draw the last figure.

data test;
      input x y;
cards;
1  5
4  4
3  7
6  6
5  10
;
run;

%tstatistics(x y, dsn=test, outdsn=tdata2, class=);

/* section 1.3.3 Monte Carlo study on the Robustness of T-Statistics */
%let alpha=0.1;
%let m=10;
%let n=10;
%let Niter=10000;
data test;
      retain seed1 99787  seed2 76568;
   do iter=1 to &Niter;
      do j=1 to max(&n, &m);
            if j<=&m then call rannor(seed1, x); else x=.;
   if j<=&n then call rannor(seed2, y); else y=.;
   keep iter x y;
   output;
   end;
    end;
run;

%tstatistics(x y, dsn=test, outdsn=tdata2, class=iter);

data _null_;
      set tdata2  end=eof;
   retain n_reject 0;
   if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
   if eof then do;
      call symput('n_reject', n_reject/_n_);
   end;
run;
%put &n_reject;


/* section 1.3.4 Study the robustness of T-Statistics */

%let alpha=0.1;
%let m=10;
%let n=10;
%let Niter=10000;
data test;
      retain seed1 987687  seed2 76568;
   array x[4];
   array y[4];
      do iter=1 to &Niter;
         do j=1 to max(&n, &m);
      do k=1 to 4;
      select (k);
         when(1) do;
                      if j<=&m then call rannor(seed1, x[k]); else x[k]=.;      
                      if j<=&n then call rannor(seed2, y[k]); else y[k]=.;
      end;
      when(2) do;
                      if j<=&m then call rannor(seed1, x[k]); else x[k]=.;
                      if j<=&n then call rannor(seed2, y[k]); else y[k]=.;
                      y[k]=y[k]*10;
      end;
      when(3) do;
                      if j<=&m then x[k]=rand('T', 4); else x[k]=.;
                      if j<=&n then y[k]=rand('T', 4); else y[k]=.;  
      end;
      otherwise do;
                      if j<=&m then call rannor(seed1, x[k]); else x[k]=.;
                      if j<=&n then call ranexp(seed2, y[k]); else y[k]=.;
                      x[k]=10+x[k]*2;  y[k]=y[k]/0.1;
      end;
     end;           
   end;
            keep iter x1-x4 y1-y4;
            output;
         end;
      end;
run;

%tstatistics(x1 y1, dsn=test, outdsn=tdata1, class=iter);

%tstatistics(x2 y2, dsn=test, outdsn=tdata2, class=iter);

%tstatistics(x3 y3, dsn=test, outdsn=tdata3, class=iter);

%tstatistics(x4 y4, dsn=test, outdsn=tdata4, class=iter);

data _null_;
      n_reject=0;
      do until (eof1);
         set tdata1  end=eof1  nobs=n1;
         retain n_reject 0;
         if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
         if eof1 then do;
             *call symput('n_reject', n_reject/_n_);
       p_reject=n_reject/n1;
       put "NOTE>> Prob(Rejection)=" p_reject  8.4;
         end;
  end;
  n_reject=0;
  do until (eof2);
         set tdata2  end=eof2  nobs=n2;
         retain n_reject 0;
         if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
         if eof2 then do;
             *call symput('n_reject', n_reject/_n_);
       p_reject=n_reject/n2;
       put "NOTE>> Prob(Rejection)=" p_reject  8.4;
         end;
  end;

  n_reject=0;
  do until (eof3);
         set tdata3  end=eof3 nobs=n3;
         retain n_reject 0;
         if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
         if eof3 then do;
             *call symput('n_reject', n_reject/_n_);
       p_reject=n_reject/n3;
       put "NOTE>> Prob(Rejection)=" p_reject  8.4;
         end;
  end;

  n_reject=0;
  do until (eof4);
         set tdata4  end=eof4  nobs=n4;
         retain n_reject 0;
         if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
         if eof4 then do;
             *call symput('n_reject', n_reject/_n_);
       p_reject=n_reject/n4;
       put "NOTE>> Prob(Rejection)=" p_reject  8.4;
         end;
  end;
run;

/* draw the overlay density curves of empirical T-statistics and T-distribution */
proc sort data=tdata2;
      by t;
run;
data tdata2;
      set tdata2;
   dt=pdf('T', t, 18);
run;

ods select none;
ods output Controls=Controls;
ods output UnivariateStatistics=UniStat;
proc kde data=tdata2  ;
      univar  t /out=density  method=snr  unistats;
run;
ods select all;

data density;
      set density;
   dt=pdf('T', value, 18);
run;

data _null_;
      set UniStat;
   if Descr='Bandwidth' then call symput ('bw', compress(round(t, 0.0001)));
run;

goptions reset=all border;

title "Comparing Empirical and Theoretical Densities";
legend label=none   position=(top right inside)  mode=share;
axis1  label=("N=&Niter Bandwidth=&bw") 
       order=(-4 to 8 by 2) offset=(1,1)  
       major=(height=2) minor=none;
axis2  label=(angle=90  "Density")     
       order=(0 to 0.4 by 0.1)  offset=(0, 0.1) 
       major=(height=2)  minor=none;
symbol1  interpol=join  color=black  value=none  width=3;
symbol2  interpol=join  color=grey   value=none  width=1;
proc gplot data=density;
      plot density*value  dt*value/overlay  legend=legend
                                   haxis=axis1  vaxis=axis2  ;
      label dt='t(18)' density='Exact';   
run;quit;

ods pdf close;

Key Points Summary: 

0. We can use the FORMCHAR option to eliminate weird characters in listing output.

1. We can use PROC IMPORT to import delimited data, and it provides flexible options to control the output.

2. We can use PROC FREQ to summarize categorical data.

3. We can use PROC MEANS and PROC UNIVARIATE to generate descriptive statistics for numerical data.

4. We can study the dispersion of a numerical variable by a categorical variable using PROC BOXPLOT.

5. We provided a JITTER macro for better graphical presentation of heavily overlapping data.

6. We showed that there are multiple ways to visualize scatter plots together with fitted regression lines:

  • 6.1 LINEPRINTER in PROC REG 
  • 6.2 Using PROC REG to obtain regression line and overlay it on the scatter plot 
  • 6.3 Using ODS Graphics in PROC REG 
  • 6.4 Using INTERPOL=R<><><> option in SYMBOL statement 
  • 6.5 Using PROC SGPLOT 
7. We demonstrated how to study the robustness of the T-statistic in SAS following the code in the book.
Bayesian Computation with R (Use R)
1月 042011
 
The book "Bayesian Computation with R" by Jim Albert is an easy to read entry level book on applied Bayesian Statistics. While the book was written for R users, it is not difficult to translate the languages between R and SAS and I believe it is a good way to show Bayesian capability of SAS. In the next several months, I am going to translate all R code in the book into SAS. Readers are encouraged to buy this book to understand what's behind the code. The posts here will only spend minimum effects to explain the statistics underlying while most resource will still be denoted to SAS coding.

This post will cover Chapter 1.

We first follow section 1.2.2, using PROC IMPORT to read in TAB deliminated file, which corresponds to read.table(file, sep='\t', header=T)  in R. We also designated PDF as Output Destination.

/* Chapter 1 of Bayesian Computation with R */

%let folder=C:\Documents\SAS for Bayesian Compu with R;
%let datafolder=&folder\data;

libname data "&datafolder";
options missing='.' formchar = "|----|+|---+=|-/\<>*" errors=2 fullstimer;

/*@@@@@@ Section 1.2.2 @@@@@*/
/*--> Read Tab Deliminated Data into SAS */
proc import datafile="&datafolder/studentdata.txt" out=student dbms=dlm replace;
datarow=2; /* redundent with GETNAMES= statement */
guessingrows=2; /* Only use first 2 rows of data to guess the type of data */
delimiter='09'x; /* this is redudent if we specify DBMS=TAB in PROC IMPORT option */
getnames=YES; /* This is the same as 'header=True' in R's table.read function */
run;

ods pdf file="&folder/Chapter1.pdf";
/*--> To see Variable Names and print first row of this data
Use SPLIT="*" to specify the split character being "*", which controls
line breaks in column headings
*/
proc print data=student(obs=1) split="*"; run;

SAS output looks like:
Section 1.2.3, summarize frequency tables and use Barplot to visualize the counts.

PROC FREQ directly corresponds to table() function in R, but provides much richer functionality. PROC GBARLINE is used to answer what barplot() does in R. The LABEL= statement in SAS is a very handy tool for annotation purpose for PRINT or GRAPH.

PROC MEANS somehow corresponds to summary() function to obtain summary statistics for quantitative variables. You can sepecify MAXDEC= to keep desired number of decimals in print out.
Histogram can be obtained in SAS via either PROC UNIVARIATE or newly introduced PROC SGPLOT. PROC UNIVARIATE provides more control though. As shown below, you can specify desired mid-points while you can't do so in SGPLOT. On the other hand, SGPLOT is very convenient if you want to overlay density curve, either a normal fit or kernel one.


/*@@@@ section 1.2.3 @@@*/
/*--> To summarize and graph a single Batch */
proc freq data=student;
table Drink;
run;

/*--> Barplot on single variable */
title "Original idea, summarize and plot";
proc freq data=student noprint;
table Drink /out=t;
run;
goptions reset=all noborder;
proc gbarline data=t;
bar Drink /discrete space=4 sumvar=count;
run;quit;
title;
/*--> Actually, in SAS, it is better to directly use
PROC GBARLINE */
goptions reset=all noborder;
title "In SAS, PROC GBARLINE directly summarize";
proc gbarline data=student;
bar Drink /discrete space=3 ;
run;quit;
title;

data student;
set student;
hours_of_sleep=WakeUp-ToSleep;
/* Label can be useful for annotation purpose */
label hours_of_sleep="Hours Of Sleep";
run;

proc means data=student
/* Request same statistics as in the book */
min q1 median q3 max nmiss
/* MAXDEC= specifies keeping 2 decimals */
maxdec=2;
var hours_of_sleep;
run;

/*--> Histogram using PROC UNIVARIATE */
proc univariate data=student noprint;
var hours_of_sleep;
histogram /midpoints=2.5 to 12.5 by 1;
run;

/* new SGPLOT is a handy way to draw histogram and density */
proc sgplot data=student;
histogram hours_of_sleep/scale=count;
density hours_of_sleep/scale=count type=normal;
run;

Compare SAS outputs:





Boxplot to visualize distribution of  quantitative variables by some classification variable. Here, formula such as var1*var2 style corresponds to var1~var2 style in R.


/*@@@ section 1.2.4 @@@@*/
/*---> Compare Batches using Boxplot */
proc sort data=student out=studentsorted; by Gender; run;
proc boxplot data=studentsorted;
plot hours_of_sleep*Gender;
run;

proc means data=student
min q1 median q3 max nmiss
nway maxdec=2;
class Gender;
var Haircut;
run;

SAS output looks like:




R has built-in jitter function, however, we have to make our own in SAS. This verision doesn't apply Fuzzy yet, only for demo purpose.


%macro jitter(var, newname, data=last, factor=1, amount=);
/* follow the JITTER function in R, no fuzzy applied yet .
if amount is given, then use this value for disturbance
else if amount is 0, then use the range of &var for disturbance
else if amount is NULL, then use the smallest difference among
distinct values of &var for disturbance. if all values are
the same, then use the value obtained from range

if range of &var is 0, then use the lower range for disturbance
otherwise
*/
%let blank=%str( );
%if &data=' ' %then %let data=last;
%local fid;

data _null_;
fid=round(ranuni(0)*10000, 1);
call symput('fid', compress(fid));
run;
proc means data=&data noprint;
var &var;
output out=_util&fid
range(&var)=range
min(&var)=min
max(&var)=max;
run;
data _util&fid;
set _util&fid;
if range^=0 then z=range; else z=min;
if z=0 then z=1;
run;

%if %eval(&amount=&blank) %then %do;
proc sort data=&data.(keep=&var where=(&var^=.))
out=_xuti&fid nodupkey;
by &var;
run;
data _duti&fid;
set _xuti&fid nobs=ntotal end=eof;
array _x{2} _temporary_;
if ntotal=1 then do;
amount=&factor/50*abs(&var);
keep amount;
output;
stop;
end;
else do;
if _n_=1 then do;
_x[1]=&var; _x[2]=constant('BIG');
end;
else do;
_x[2]=min(_x[2], &var - _x[1]);
_x[1]=&var;
if eof then do;
amount=&factor/5*abs(_x[2]);
keep amount;
output;
end;
end;
end;
run;

%end;
%else %if %eval(&amount=0) %then %do;
data _duti&fid;
set _util&fid;
amount=&factor*z/50;
keep amount;
output;
run;
%end;
%else %do;
data _duti&fid;
amount=&amount;
keep amount;
output;
run;
%end;

proc sql noprint;
select name into :keepvars separated by ' '
from sashelp.vcolumn
where libname='WORK'
and memname=%upcase("&data")
and memtype="DATA"
;
quit;
data &data;
array _x{1} _temporary_;
if _n_=1 then do;
set _duti&fid;
_x[1]=amount;
end;
set &data;
&newname=&var + ranuni(0)*(2*_x[1])-_x[1];
label &newname="jitter(&var)";
keep &keepvars &newname;
run;
proc datasets library=work nolist;
delete _duti&fid _xuti&fid _util&fid;
run;quit;
%mend;


Now we can apply jitter function to the two variables studied in the book and check them out. For graphics, we use axis statement to define how we want the two axes to appear in print out, and use symbol statement to define the appearance of symbols for data. Here we ask for black circles, not connected (interpol=none).


%jitter(hours_of_sleep, xhos, data=student, factor=1, amount=);
%jitter(ToSleep, xts, data=student, factor=1, amount=);

goptions border reset=all;
axis1 label=("jitter(ToSleep)")
major=(height=2 number=5) minor=none;
axis2 label=(angle=90 "jitter(Hours of Sleep")
major=(height=2 number=5) minor=none;
symbol1 value=circle interpol=none color=black;
proc gplot data=student;
plot xhos*xts/haxis=axis1 vaxis=axis2;
run;quit;


In order to overlay a fitted regression line, we have multiple ways to do it in SAS, and the Statistical Plot procedures are particularly handy to generate this type of figures. Compare the code and corresponding figures (in the same order of LINEPRINTER, ODS GRAPHICS, Manually Drawn, SGPLOT). For manually drawn figures, since we ask for connected dots in order to obtain a line overlay on circles, we have to sort the data by X-axis variable to make the line as straight as possible. The manually drawn figure is also the one closest to the R figure in the book.



/* old school LINEPRINTER option*/
proc reg data=student lineprinter;
title "Line Printer Style";
model Hours_of_Sleep=ToSleep /noprint;
var xhos xts;
plot pred.*xts='X' xhos*xts/overlay symbol='o';
run;quit;
title;

/* Using ODS GRAPHICS */
ods select none; /* to suppress PROC REG output */
title "ODS Graphics";
ods graphics on;
proc reg data=student ;
model Hours_of_Sleep=ToSleep;
var xhos xts;
output out=student pred=p;
ods select FitPlot;
run;quit;
ods graphics off;
title;
ods select all;

/* Following standard R approach, tedious in SAS */
proc sort data=student out=studentsorted;
by ToSleep;
run;
goptions border reset=all;
axis1 label=("jitter(ToSleep)") order=(-2 to 6 by 2) major=(height=2 number=5) minor=none;
axis2 label=(angle=90 "jitter(Hours of Sleep") offset=(, 2 pct) major=(height=2 number=5) minor=none;
symbol1 value=circle interpol=none color=black;
symbol2 value=none interpol=line color=black;
proc gplot data=studentsorted;
plot xhos*xts p*xts/overlay haxis=axis1 vaxis=axis2;
run;quit;

/* Using build-in REGRESSION capability of GPLOT by specifying
INTERPOL=R<><0><> option in SYMBOL statement.
In the first <>, you identify the type of regression:
L=linear, Q=quadratic, C=cubic
In the second <>, you specify if intercept is omitted;
In the third <>, you specify type of CI:
CLM=CI of mean predicted value;
CLI=CI of individual points;
*/
goptions border reset=all;
axis1 label=("jitter(ToSleep)") order=(-2 to 6 by 2) major=(height=2 number=5) minor=none;
axis2 label=(angle=90 "jitter(Hours of Sleep") offset=(, 2 pct) major=(height=2 number=5) minor=none;
symbol1 value=circle interpol=rlclm95 color=black ci=blue co=blue;
proc gplot data=student;
plot xhos*xts/regeqn /* haxis=axis1 vaxis=axis2*/;
run;quit;

/* New SGPLOT has build in regression line capability: REG, SPLINE, LOESS */
proc sgplot data=student;
scatter x=xts y=xhos;
reg x=ToSleep y=Hours_of_Sleep;
run;




Now, to study the robustness of T-Statistics which relies on several assumptions, such as normality, homoskedasticity, etc. First, we need to define a "function" to calculate T-statistics like in the book.


/* section 1.3 Robustness of T-Statistics */
/* section 1.3.2 Write a function to compute T-Statistics */
%macro tstatistics(vars, dsn=last, outdsn=, class=);
%if &class='' %then %let classstmt=;
%else %let classstmt=class &class;

%let nvars=%sysfunc(count(&vars, ' '));

%let var1=%scan(&vars, 1, ' ');
%let var2=%scan(&vars, 2, ' ');
proc means data=&dsn noprint nway;
&classstmt;
var &vars;
output out=tdata(where=(_STAT_ in ('STD', 'MEAN', 'N')));
run;
data &outdsn;
set tdata; by &class _TYPE_;
retain m n mean1 mean2 std:;
select(compress(_STAT_));
when('N') do;
m=&var1;
n=&var2;
end;
when('MEAN') do;
mean1=&var1;
mean2=&var2;
end;
when('STD') do;
std1=&var1;
std2=&var2;
end;
otherwise;
end;
if last._TYPE_ then do;
sp=sqrt(((m-1)*std1**2 + (n-1)*std2**2)/(m+n-2));
t= (mean1-mean2)/(sp * sqrt(1/m + 1/n));
keep &class _TYPE_ m n std: mean: sp t;
output;
end;
run;
%mend;

Just like the "source(file)" command in R, in SAS, we have a counterpart called %include(or in abv. %inc). Suppose you have a SAS file called "tstatistics.sas" in some folder &folder, the following statement will read in the source code from this file. Since this is a SAS file, suffix ".sas" is not necessary.


filename BCR "&folder";
%inc BCR(tstatistics);

The following code pieces conduct the four simulate scenarios illustrated in the book and draw the last figure.

data test;
input x y;
cards;
1 5
4 4
3 7
6 6
5 10
;
run;

%tstatistics(x y, dsn=test, outdsn=tdata2, class=);

/* section 1.3.3 Monte Carlo study on the Robustness of T-Statistics */
%let alpha=0.1;
%let m=10;
%let n=10;
%let Niter=10000;
data test;
retain seed1 99787 seed2 76568;
do iter=1 to &Niter;
do j=1 to max(&n, &m);
if j<=&m then call rannor(seed1, x); else x=.;
if j<=&n then call rannor(seed2, y); else y=.;
keep iter x y;
output;
end;
end;
run;

%tstatistics(x y, dsn=test, outdsn=tdata2, class=iter);

data _null_;
set tdata2 end=eof;
retain n_reject 0;
if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
if eof then do;
call symput('n_reject', n_reject/_n_);
end;
run;
%put &n_reject;


/* section 1.3.4 Study the robustness of T-Statistics */

%let alpha=0.1;
%let m=10;
%let n=10;
%let Niter=10000;
data test;
retain seed1 987687 seed2 76568;
array x[4];
array y[4];
do iter=1 to &Niter;
do j=1 to max(&n, &m);
do k=1 to 4;
select (k);
when(1) do;
if j<=&m then call rannor(seed1, x[k]); else x[k]=.;
if j<=&n then call rannor(seed2, y[k]); else y[k]=.;
end;
when(2) do;
if j<=&m then call rannor(seed1, x[k]); else x[k]=.;
if j<=&n then call rannor(seed2, y[k]); else y[k]=.;
y[k]=y[k]*10;
end;
when(3) do;
if j<=&m then x[k]=rand('T', 4); else x[k]=.;
if j<=&n then y[k]=rand('T', 4); else y[k]=.;
end;
otherwise do;
if j<=&m then call rannor(seed1, x[k]); else x[k]=.;
if j<=&n then call ranexp(seed2, y[k]); else y[k]=.;
x[k]=10+x[k]*2; y[k]=y[k]/0.1;
end;
end;
end;
keep iter x1-x4 y1-y4;
output;
end;
end;
run;

%tstatistics(x1 y1, dsn=test, outdsn=tdata1, class=iter);

%tstatistics(x2 y2, dsn=test, outdsn=tdata2, class=iter);

%tstatistics(x3 y3, dsn=test, outdsn=tdata3, class=iter);

%tstatistics(x4 y4, dsn=test, outdsn=tdata4, class=iter);

data _null_;
n_reject=0;
do until (eof1);
set tdata1 end=eof1 nobs=n1;
retain n_reject 0;
if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
if eof1 then do;
*call symput('n_reject', n_reject/_n_);
p_reject=n_reject/n1;
put "NOTE>> Prob(Rejection)=" p_reject 8.4;
end;
end;
n_reject=0;
do until (eof2);
set tdata2 end=eof2 nobs=n2;
retain n_reject 0;
if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
if eof2 then do;
*call symput('n_reject', n_reject/_n_);
p_reject=n_reject/n2;
put "NOTE>> Prob(Rejection)=" p_reject 8.4;
end;
end;

do until (eof3);
set tdata3 end=eof3 nobs=n3;
retain n_reject 0;
if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
if eof3 then do;
*call symput('n_reject', n_reject/_n_);
p_reject=n_reject/n3;
put "NOTE>> Prob(Rejection)=" p_reject 8.4;
end;
end;

do until (eof4);
set tdata4 end=eof4 nobs=n4;
retain n_reject 0;
if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
if eof4 then do;
*call symput('n_reject', n_reject/_n_);
p_reject=n_reject/n4;
put "NOTE>> Prob(Rejection)=" p_reject 8.4;
end;
end;
run;

/* draw the overlay density curves of empirical T-statistics and T-distribution */
proc sort data=tdata2;
by t;
run;
data tdata2;
set tdata2;
dt=pdf('T', t, 18);
run;

ods select none;
ods output Controls=Controls;
ods output UnivariateStatistics=UniStat;
proc kde data=tdata2 ;
univar t /out=density method=snr unistats;
run;
ods select all;

data density;
set density;
dt=pdf('T', value, 18);
run;

data _null_;
set UniStat;
if Descr='Bandwidth' then call symput ('bw', compress(round(t, 0.0001)));
run;

goptions reset=all border;

title "Comparing Empirical and Theoretical Densities";
legend label=none position=(top right inside) mode=share;
axis1 label=("N=&Niter Bandwidth=&bw")
order=(-4 to 8 by 2) offset=(1,1)
major=(height=2) minor=none;
axis2 label=(angle=90 "Density")
order=(0 to 0.4 by 0.1) offset=(0, 0.1)
major=(height=2) minor=none;
symbol1 interpol=join color=black value=none width=3;
symbol2 interpol=join color=grey value=none width=1;
proc gplot data=density;
plot density*value dt*value/overlay legend=legend
haxis=axis1 vaxis=axis2 ;
label dt='t(18)' density='Exact';
run;quit;
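As a modern alternative to the GPLOT/SYMBOL statements above, roughly the same overlay can be produced with PROC SGPLOT. This is only a sketch (the axis ranges are copied from the AXIS statements above), not code from the book:

proc sgplot data=density;
   series x=value y=density / lineattrs=(color=black thickness=3) legendlabel='Exact';
   series x=value y=dt      / lineattrs=(color=gray  thickness=1) legendlabel='t(18)';
   xaxis label="N=&Niter Bandwidth=&bw" values=(-4 to 8 by 2);
   yaxis label="Density" values=(0 to 0.4 by 0.1);
   keylegend / location=inside position=topright;
run;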

ods pdf close;

Key Points Summary: 

0. We can use the FORMCHAR option to eliminate weird table-drawing characters in procedure output

1. We can use PROC IMPORT to import delimited data, and it provides flexible options to control the output

2. We can use PROC FREQ to summarize categorical data

3. We can use PROC MEANS and PROC UNIVARIATE to generate descriptive statistics for numerical data

4. We can study the dispersion of a numerical variable by a categorical variable using BOXPLOT

5. We provided a JITTER function for better graphical presentation of heavily overlapping data

6. We showed that there are multiple ways to visualize scatter plots together with fitted regression lines (a brief sketch of 6.5, together with the FORMCHAR setting from point 0, appears after this list):

  • 6.1 LINEPRINTER plots in PROC REG 
  • 6.2 Using PROC REG to obtain the regression line and overlay it on the scatter plot 
  • 6.3 Using ODS Graphics in PROC REG 
  • 6.4 Using the INTERPOL=R<L|Q|C><0><CLM|CLI> option in the SYMBOL statement 
  • 6.5 Using PROC SGPLOT 

7. We demonstrated how to study the robustness of the T-statistic in SAS, following the code in the book.
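A minimal, self-contained sketch of points 0 and 6.5 (it uses the SASHELP.CLASS sample data shipped with SAS, which is not part of this post, purely for illustration):

* point 0: plain ASCII table-drawing characters for text (LISTING) output;
options formchar='|----|+|---+=|-/\<>*';

* point 6.5: scatter plot with a fitted regression line in PROC SGPLOT;
proc sgplot data=sashelp.class;
   reg x=Height y=Weight;
run;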
Bayesian Computation with R (Use R)

An Economic Approach for a Class of Dimensionality Reduction Techniques

 PROC CANDISC, PROC DISCRIM, PROC GLMMOD, PROC REG
Jul 30, 2010
 


Just back from KDD 2010. Several papers at the conference interested me.

On the computation side, Liang Sun et al.'s paper [1], "A Scalable Two-Stage Approach for a Class of Dimensionality Reduction Techniques", caught my eye. The authors prove that a class of dimension reduction techniques that rely on a generalized eigenvalue decomposition, such as CCA, OPLS, and LDA, can be computed much more cheaply by decomposing the original computation into a least-squares problem and a much smaller eigenvalue decomposition problem. The equivalence of their two-stage approach and the direct eigenvalue decomposition is rigorously proved.

This technique is of particular interest to people like me who have only limited computing resources, and I believe it would be worthwhile to implement their algorithm in SAS. For example, a canonical discriminant analysis using the above idea is demonstrated below. Note also that by specifying the RIDGE= option in PROC REG, a regularized version can be implemented as well (a sketch of that call appears after the demonstration code); besides, PROC REG is multi-threaded in SAS. Of course, the computing advantage is only appreciable when the number of features is very large.

The canonical analysis results from PROC CANDISC on the reduced problem are the same as those from the full version.

In fact, this exercise is the answer to Exercise 4.3 of The Elements of Statistical Learning [2].

[1]. Liang Sun, Betul Ceran, Jieping Ye, "A Scalable Two-Stage Approach for a Class of Dimensionality Reduction Techniques", KDD2010, Washington DC.

[2]. Trevor Hastie, Robert Tibshirani, Jerome Friedman, "The Elements of Statistical Learning", 2nd Edition.





   proc format; 
      value specname 
         1='Setosa    ' 
         2='Versicolor' 
         3='Virginica '; 
   run; 
 
   data iris; 
      title 'Fisher (1936) Iris Data'; 
      input SepalLength SepalWidth PetalLength PetalWidth 
            Species @@; 
      format Species specname.; 
      label SepalLength='Sepal Length in mm.' 
            SepalWidth ='Sepal Width in mm.' 
            PetalLength='Petal Length in mm.' 
            PetalWidth ='Petal Width in mm.'; 
      symbol = put(Species, specname10.); 
      datalines; 
   50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3 
   63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2 
   59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2 
   65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3 
   68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3 
   77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3 
   49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2 
   64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3 
   55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1 
   49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1 
   67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1 
   77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2 
   50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1 
   61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1 
   61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1 
   51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1 
   51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1 
   46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1 
   50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3 
   57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1 
   71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3 
   49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1 
   49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1 
   66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1 
   44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2 
   47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2 
   74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1 
   56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3 
   49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1 
   56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2 
   51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3 
   54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3 
   61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3 
   68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1 
   45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1 
   55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1 
   51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2 
   63 33 60 25 3 53 37 15 02 1 
   ; 
   proc candisc data=iris out=outcan distance anova; 
      class Species; 
      var SepalLength SepalWidth PetalLength PetalWidth; 
   run;
 
  ods select none;
  proc glmmod data=iris  outdesign=H(keep=COL:);
           class  Species;
     model SepalLength=Species/noint;
  run;  

  data H;
          merge H   iris;
  run;

/**************************
for efficiency consideration, a view can also be used:
data H/view=H;
     set iris;
     array _S{*} Col1-Col3 (3*0);     
     do j=1 to dim(_S); _S[j]=0; end;
     _S[Species]=1;
     drop j;
run;
****************************/
  proc reg data=H  outest=beta;
          model Col1-Col3 = SepalLength SepalWidth PetalLength PetalWidth;
    output   out=P  p=yhat1-yhat3;
  run;quit;
  ods select all;


  proc candisc  data=P;
          class Species;
    var   yhat1-yhat3;
  run;
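For the regularized variant mentioned earlier, a minimal sketch follows; the ridge constant 0.1 is an arbitrary value chosen for illustration, not taken from the paper. The ridge coefficients are written to the OUTEST= data set with _TYPE_='RIDGE', so the predicted scores fed to PROC CANDISC would then have to be computed in a DATA step from those coefficients rather than taken from the OUTPUT statement, which reflects the ordinary least-squares fit.

  /* regularized (ridge) variant: a sketch only -- k=0.1 is arbitrary */
  proc reg data=H outest=beta_ridge ridge=0.1;
          model Col1-Col3 = SepalLength SepalWidth PetalLength PetalWidth;
  run;quit;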
