Some discriminat methodes and SAS codes

NOTICE: The following text is one part of my published paper. Do not distribute it! All right reserved.

介绍线性判别方法、针对强共线性数据的两种降维判别方法及SAS实现代码。

1, linear discriminant analysis (LDA)
Because of its simplicity and robustness, LDA has been one of the most frequently used classification techniques since 1936.
LDA:
proc discrim data=ex1 testdata=ex2;
class g;
var x1-x10
run;

2, Combation of PCA and LDA(PCA+LDA), or PLS and PCA (PLS+LDA)
Principal component analysis (PCA) is the fundamental method used in chemometric and is based on vector algebra. The main purpose of this method is to reduce the dimensions of a data set with a large number of intercorrelated variables, whilst retaining as much of the information present in the original data as possible. A new set of orthogonal variables, principal components (PCs), describe the variance in data. Only first few of them can retain most of variation in describing the systematic information of all the original variables. Usually, a subset of limited PCs is used to explore the trends of samples with different treatments. Furthermore, when using these PCs as input variables, linear discriminant analysis (LDA) can greatly reduce multiple co-linearity among the variables of the original data. Therefore, the combination of PCA and LDA (PCA+LDA) was used here for the goal of classification. Principal component regression (PCR) is a multiple linear regression method for relating two sets of variables (PCs and response variables) with predictive purposes. PLS is an extension of PCR, which is applied to relate two sets of variables by a regression model. But in PLS, the principal components are more correlated with the response variables. This results in a more effective prediction of the response variable. In the same way, the PCs of PLS can be used in conjunction with LDA (PLS+LDA) to tackle classification problems.
PCA+DA:
data xxx_pca;
set xxx1;
name= scan(scan(name,1,'d'),2,'g');
run;
data p_ex;
set xxx_pca;
if name in ("20" "40") then class="L";
if name in ("50" "60" "70" ) then class="M";
if name in ("80" "90" "100") then class="H";
run;

proc sort ;
by name;
run;
proc surveyselect data=p_ex rate=0.46 seed=1 outall out=p_ex;
strata name ;
run;
data cal val;
set p_ex;
if selected then output cal; else output val;
keep name class var1-var1000;
run;

proc princomp data = cal cov noprint outstat = xvalbasis out=pcxval;
var var1-var1000; /* change */
quit;

proc sort data = pcxval;
by class;
quit;

proc score data = val score = xvalbasis out=pctest;
var var1-var1000; /* change */
id class;
quit;

proc sort data = pctest;
by class;
quit;

%macro pcada;
%if %sysfunc(exist(work.tolers)) %then
%do;
proc datasets library=work nolist; delete tolers; run; quit;
%end;

%do i= 1 %to 15;
ods listing close;
ods output ErrorTestClass=toler ErrorResub=tester;
proc discrim data=pcxval testdata = pctest method=normal pool=yes crossvalidate outcross = crossvalpreds testout = outtest;
class class;
var prin1-prin&i.;
quit;
ods listing;

data tol;
merge tester(rename=(total=testte) obs=1) toler(obs=1);
run;

proc append base=tolers data=tol ;
run;
%end;
%mend;

%pcada

PLS+DA:
data xxx_pca;
set xxx1;
name= scan(scan(name,1,'d'),2,'g');
run;
data p_ex;
set xxx_pca;
if name in ("20" "40") then do ; low=1; mid=0; high=0; class="L"; end;
if name in ("50" "60" "70" ) then do ; low=0; mid=1; high=0; class="M"; end;
if name in ("80" "90" "100") then do ; low=0; mid=0; high=1; class="H"; end;
run;

proc sort ;
by name;
run;
proc surveyselect data=p_ex rate=0.46 seed=1 outall out=p_ex;
strata name ;
run;
data cal val;
set p_ex;
if selected then output cal; else output val;
keep name class low mid high var1-var1000;
run;

data _val;
set val;
keep name class var1-var1000;
run;
data all;
set cal _val;
run;

/*you can set cv=split for determining  the  appropriate  number  of  principal components without macro*/
proc pls data=all method=pls;

model low mid high= var1-var1000;
output out=res p=low1 mid1 high1 xscore=prin h=x stdy=s ;
run;

data _val _cal;
set res;
if s=. then output _val; else output _cal;
keep class prin1-prin15;
run;

%macro plsda;
%if %sysfunc(exist(work.tolers)) %then
%do;
proc datasets library=work nolist; delete tolers; run; quit;
%end;

%do i= 1 %to 15;
ods listing close;
ods output ErrorTestClass=toler ErrorResub=tester;
proc discrim data=_cal testdata = _val method=normal pool=yes crossvalidate outcross = cal_re testout = val_re;
class class;
var prin1-prin&i.;
quit;
ods listing;

data tol;
merge tester(rename=(total=testte) obs=1) toler(obs=1);
run;

proc append base=tolers data=tol ;
run;
%end;
%mend;

%plsda

3,PLS-DA (usually written as PLSDA) and PCR-DA
PLS-DA and PCR-DA are applications of regression for the optimal separation of classes using a series of dummy variables as Y variables to represent each group. In this study, values 1, 2 and 3 represented groups of low temperature treatment (L), middle temperature treatment (M) and high temperature treatment (H), respectively. Models were constructed based on three category values; in the first group, Y1=1 Y2=0 Y3=0; in the second group, Y1=0 Y2=1 Y3=0 and in the third group, Y1=0 Y2=0 Y3=1. PLS or PCR computes a bilinear model of both the X- and the Y-spaces. The resulting model is a linear model that was proven to be statistically equivalent to the solution obtained using LDA. From a statistical point of view, whenever an unknown sample is analyzed, a predicted vector of responses is computed, and the latter’s components can be interpreted as the posterior probabilities that the sample belongs to each of the classes. The unknown sample is then assigned to the category corresponding to the highest valued component of the response vector.

PCA-DA:
data xxx_pca;
set xxx1;
name= scan(scan(name,1,'d'),2,'g');
run;
data p_ex;
set xxx_pca;
if name in ("20" "40") then do ; low=1; mid=0; high=0; class="L"; end;
if name in ("50" "60" "70" ) then do ; low=0; mid=1; high=0; class="M"; end;
if name in ("80" "90" "100") then do ; low=0; mid=0; high=1; class="H"; end;
run;

proc sort ;
by name;
run;
proc surveyselect data=p_ex rate=0.46 seed=1 outall out=p_ex;
strata name ;
run;
data cal val;
set p_ex;
if selected then output cal; else output val;
keep name class low mid high var1-var1000;
run;

data _val;
set val;
keep name class var1-var1000;
run;
data all;
set cal _val;
run;

%macro pcr;

%if %sysfunc(exist(work.statts)) %then
%do;
proc datasets library=work nolist;
delete statts;
run;
quit;
%end;

%if %sysfunc(exist(work.statrs)) %then
%do;
proc datasets library=work nolist;
delete statrs;
run;
quit;
%end;

%do i=1 %to 15;

/*you can set cv=split for determining  the  appropriate  number  of  principal components without macro*/

proc pls data=all method=PCR NFAC=&i. ;
model low mid high= var1-var1000;
output out=res p=low1 mid1 high1 xscore=lv h=x stdy=s ;
run;

data class pda;
set res;
_low=ifn(low1=max(of low1 mid1 high1),1,0);
_mid=ifn(mid1=max(of low1 mid1 high1),1,0);
_high=ifn(high1=max(of low1 mid1 high1),1,0);
keep class low mid high _low _mid _high ;

if low=. then output pda ; else output class;
run;

proc sql print;
create table statt as
select class, sum(_low) as _low ,sum(_mid) as _mid ,sum(_high) as _high
from class
group by class;
quit;

proc append base=statts data=statt ;
run;

proc sql print;
create table statr as
select class, sum(_low) as _low ,sum(_mid) as _mid ,sum(_high) as _high
from pda
group by class;
quit;

proc append base=statrs data=statr ;
run;

%end;
%mend;
%pcr

PLS DA:

data xxx_pca;
set xxx1;
name= scan(scan(name,1,'d'),2,'g');
run;
data p_ex;
set xxx_pca;
if name in ("20" "40") then do ; low=1; mid=0; high=0; class="L"; end;
if name in ("50" "60" "70" ) then do ; low=0; mid=1; high=0; class="M"; end;
if name in ("80" "90" "100") then do ; low=0; mid=0; high=1; class="H"; end;
run;

proc sort ;
by name;
run;
proc surveyselect data=p_ex rate=0.46 seed=1 outall out=p_ex;
strata name ;
run;
data cal val;
set p_ex;
if selected then output cal; else output val;
keep name class low mid high var1-var1000;
run;

data _val;
set val;
keep name class var1-var1000;
run;
data all;
set cal _val;
run;

/*you can set cv=split for determining  the  appropriate  number  of  principal components without macro*/

proc pls data=all method=pls ;
model low mid high= var1-var1000;
output out=res p=low1 mid1 high1 xscore=prin h=x stdy=s ;
run;

data _val _cal;
set res;
if s=. then output _val; else output _cal;
keep class prin1-prin15;
run;

%macro plsda;
%if %sysfunc(exist(work.tolers)) %then
%do;
proc datasets library=work nolist; delete tolers; run; quit;
%end;

%do i= 1 %to 15;
ods listing close;
ods output ErrorTestClass=toler ErrorResub=tester;
proc discrim data=_cal testdata = _val method=normal pool=yes crossvalidate outcross = cal_re testout = val_re;
class class;
var prin1-prin&i.;
quit;
ods listing;

data tol;
merge tester(rename=(total=testte) obs=1) toler(obs=1);
run;

proc append base=tolers data=tol ;
run;
%end;
%mend;

%plsda

《Some discriminat methodes and SAS codes》上有1条评论

发表评论

电子邮件地址不会被公开。 必填项已用*标注