Mar 29, 2010
 


Boosting algorithms have proven to be very effective data mining tools, whether used standalone or as building blocks to handle nonlinearity. An implementation of boosting in SAS is not easy to find, although it is not difficult to write one yourself. I completely rewrote the %Boost macro from the book "Pharmaceutical Statistics Using SAS: A Practical Guide" [1], which covers the AdaBoost, RealBoost, GentleBoost, and LogitBoost algorithms. The related %Predict macro is straightforward to rewrite, too. Note that the book's author uses the Gini index as the impurity measure, while in most standard implementations a weak classifier directly minimizes the weighted error rate [2].
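For readers who want the flavor of the algorithm outside SAS, here is a minimal, hypothetical sketch in Python of AdaBoost with a decision-stump weak learner that directly minimizes the weighted error rate, as in [2]. The function names and the brute-force threshold search are my own illustration, not the macro's code:

```python
import numpy as np

def adaboost(X, y, rounds=10):
    """AdaBoost sketch: y in {-1, +1}; each weak learner is a decision stump
    (feature, threshold, sign) chosen to minimize the weighted error rate."""
    n, p = X.shape
    w = np.full(n, 1.0 / n)                  # observation weights
    ensemble = []                            # (alpha, feature, threshold, sign)
    for _ in range(rounds):
        best = None
        for j in range(p):
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = np.where(X[:, j] <= t, s, -s)
                    err = w[pred != y].sum()         # weighted error rate
                    if best is None or err < best[0]:
                        best = (err, j, t, s)
        err, j, t, s = best
        err = float(np.clip(err, 1e-10, 1 - 1e-10))
        alpha = 0.5 * np.log((1 - err) / err)        # classifier weight
        pred = np.where(X[:, j] <= t, s, -s)
        w = w * np.exp(-alpha * y * pred)            # upweight mistakes
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def boost_predict(ensemble, X):
    score = sum(a * np.where(X[:, j] <= t, s, -s) for a, j, t, s in ensemble)
    return np.where(score >= 0, 1, -1)
```

On typical data the training error rate is non-increasing over rounds, which is the behavior both the original and the rewritten macro share.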

The original macro (found @ Here) uses SAS/IML, and the way it handles computation makes it impractical for data sets of even moderate size. For example, on a table with 4,000+ observations, the original macro uses more than 850 MB of memory and may crash the program. In many rare-event studies a larger sample is necessary: a fraud detection program with a 0.01% fraud rate may well require more than 100K total records for a reliable analysis.

The new macro, built on the DATA step, is more tolerant of very large data sets and is much faster in this case (when your data contain fewer than 3,000 observations, the original macro is much faster). Both the computing time and the space requirement of the new macro are linear in the number of observations, versus quadratic for the original macro. The new macro relies on the computing modules %gini, %stump_gini, %csss, and %stump_css in this post (@ Here).

Note that if duplicate values exist, the two macros will not produce exactly the same results when different response categories appear among records sharing the same value, because of the way those responses are ordered. Both programs work, however, in the sense that they reduce the error rate as the iterations go on.

Future development includes a weak classifier that directly minimizes the error rate, and a more efficient way to handle small data sets.

References:
[1] Alex Dmitrienko, Christy Chuang-Stein, Ralph B. D'Agostino, Pharmaceutical Statistics Using SAS: A Practical Guide, SAS Institute, 2007.
[2] Yoav Freund and Robert E. Schapire, "A Short Introduction to Boosting", Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999.
Posted at 11:12 AM
Mar 29, 2010
 
*************************************************************;
* A SAS MACRO FOR DECISION STUMP SIMILAR TO %SPLIT() MACRO  *;
* IN "Pharmaceutical Statistics Using SAS" by Dmitrienko,   *;
* Chuang-Stein, AND D'Agostino                              *;
* --------------------------------------------------------- *;
* GENERAL CONCEPT:                                          *;
* 1. DECISION STUMP IS A NAIVELY SIMPLE 1-LEVEL DECISION    *;
*    TREE WITH TWO TERMINAL NODES                           *;
* 2. COMMONLY USED AS COMPONENTS IN BAGGING / BOOSTING      *;
*    MACHINE LEARNING ENSEMBLE ALGORITHMS                   *;
* --------------------------------------------------------- *;
* PRACTICAL USAGES:                                         *;
* 1. VARIABLES PRE-SCREENING BEFORE MODEL DEVELOPMENT       *;
* 2. POINT SEARCHING FOR SCORECARD CUTOFF STRATEGY          *;
* --------------------------------------------------------- *;
* AUTHOR: wensliu@paypal.com                                *;
*************************************************************;

data example1;
  do i = 1 to 1000;
    * x1: key driver with single cutoff = 5 *;
    x1 = 10 * ranuni(1);
    * x2: related with 2 different cutoffs *;
    x2 = 10 * ranuni(2);
    * x3: unrelated *;
    x3 = 10 * ranuni(3);
    if (x1 < 5  and x2 < 1.5)  then y = 0;
    if (x1 < 5  and x2 >= 1.5) then y = 1;
    if (x1 >= 5 and x2 < 7.5)  then y = 0;
    if (x1 >= 5 and x2 >= 7.5) then y = 1;
    w = 1;
    output;
  end;
run;

%macro stump(data = , w = , y = , xlist = );
%local i;
%let i = 1;

proc sql;
create table _out
  (
  variable   char(32),
  gt_value   num,
  gini       num
  );
quit;

%do %while (%scan(&xlist, &i) ne %str());  
  %let x = %scan(&xlist, &i);
  
  data _tmp1(keep = &w &y &x);
    set &data;
    where &y in (0, 1);
  run;

  proc sql;
    create table
      _tmp2 as
    select
      b.&x                                                          as gt_value,
      sum(case when a.&x <= b.&x then &w * &y else 0 end) / 
      sum(case when a.&x <= b.&x then &w else 0 end)                as p1_1,
      sum(case when a.&x >  b.&x then &w * &y else 0 end) / 
      sum(case when a.&x >  b.&x then &w else 0 end)                as p1_2,
      sum(case when a.&x <= b.&x then 1 else 0 end) / count(*)      as ppn1,
      sum(case when a.&x >  b.&x then 1 else 0 end) / count(*)      as ppn2,
      2 * calculated p1_1 * (1 - calculated p1_1) * calculated ppn1 + 
      2 * calculated p1_2 * (1 - calculated p1_2) * calculated ppn2 as gini
    from
      _tmp1 as a,
      (select distinct &x from _tmp1) as b
    group by
      b.&x;

    insert into _out
    select
      "&x",
      gt_value,
      gini
    from
      _tmp2
    having
      gini = min(gini);

    drop table _tmp1;
  quit;

  %let i = %eval(&i + 1);
%end;

proc sort data = _out;
  by gini;
run;

proc report data = _out box spacing = 1 split = "*" nowd;
  column("DECISION STUMP SUMMARY"
         variable gt_value gini);
  define variable / "VARIABLE"                     width = 30 center;
  define gt_value / "CUTOFF VALUE*(GREATER THAN)"  width = 15 center;
  define gini     / "GINI"                         width = 10 center format = 9.4;
run;

%mend stump;

%stump(data = example1, w = w, y = y, xlist = x1 x2 x3);

/*
 -----------------------------------------------------------
 |                 DECISION STUMP SUMMARY                  |
 |                                CUTOFF VALUE             |
 |           VARIABLE            (GREATER THAN)     GINI   |
 |---------------------------------------------------------|
 |              x1              |   4.9742638   |   0.3125 |
 |---------------------------------------------------------|
 |              x2              |   7.4602286   |   0.3534 |
 |---------------------------------------------------------|
 |              x3              |   0.6173268   |   0.4939 |
 -----------------------------------------------------------
*/
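The weighted Gini that the SQL step minimizes can be cross-checked with a small, hypothetical Python sketch (the stump_gini name is mine; as in the macro, the split proportions ppn1/ppn2 are unweighted record shares while the event rates are weighted, and the all-left cutoff is skipped to avoid an empty right node):

```python
import numpy as np

def stump_gini(x, y, w):
    """For each candidate cutoff c, split into (x <= c) and (x > c) and score
    2*p1*(1-p1)*ppn1 + 2*p2*(1-p2)*ppn2, where p is the weighted event rate
    and ppn the unweighted share of records on that side; return the best cut."""
    best_cut, best_gini = None, np.inf
    for c in np.unique(x)[:-1]:      # drop the max so the right node is nonempty
        left = x <= c
        p1 = (w[left] * y[left]).sum() / w[left].sum()
        p2 = (w[~left] * y[~left]).sum() / w[~left].sum()
        gini = 2*p1*(1-p1)*left.mean() + 2*p2*(1-p2)*(~left).mean()
        if gini < best_gini:
            best_cut, best_gini = c, gini
    return best_cut, best_gini
```

For a response driven purely by a single threshold on x, the returned Gini is 0 at the true cutoff; with overlapping classes, as in example1, the minimum is strictly positive.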
Posted at 7:41 AM
Mar 27, 2010
 
******************************************;
* A SAS ROUTINE FOR PREDICTORS RANKING   *;
* BY MAXIMIZED CHI-SQUARE BASED UPON THE *;
* BINARY SPLIT IMPLEMENTED IN DMSPLIT    *;
* PROCEDURE IN ENTERPRISE MINER.         *;
* -------------------------------------- *;
* author: wensliu@paypal.com             *;
******************************************;

libname data 'D:\projects\woe\data';

options mprint mlogic;

%let varlist = x2 x3 x4 x5 x10 x11 x12 x13 x14 x15;

%macro dmsplit(data = , y = , x = &varlist);
    
%local i;
%let i = 1;

data _tmp1(keep = &y &varlist);
  set &data;
  where &y in (0, 1);
run;

proc sql;
create table _out
(
  variable   char(32),
  type       char(1),
  chi_sq     num
);
quit;
  
%do %while (%scan(&varlist, &i) ne %str());  
  %let var = %scan(&varlist, &i);
  
  data _tmp2(keep = &y &var);
    set _tmp1;
    if _n_ = 1 then do;
      call symput('vtype', vtype(&var));
    end;
  run;

  proc dmdb data = _tmp2 out = _db1 dmdbcat = _ct1;
  %if &vtype = C %then %do;
    class &y &var;
  %end;
  %else %if &vtype = N %then %do;
    class &y;
    var &var;
  %end;
  run;

  proc dmsplit data = _db1 dmdbcat = _ct1 outvars = _tmp3 noprint passes = 5;
    var &var;
    target &y;
  run;

  %if %sysfunc(exist(_tmp3)) %then %do;
    proc sql noprint;
      select count(*) into :nobs from _tmp3;
    quit;

    %if &nobs > 0 %then %do;
      proc sql;
        insert into _out
      select
        upcase(_split_), "&vtype", round(_chisqu_, 0.01)
      from
        _tmp3
      where
        _parent_ = 0;
      quit;
    %end;
    %else %do;
      options obs = max nosyntaxcheck;
      proc sql;
        insert into _out
        values("%upcase(&var)", "&vtype", .);
      quit;      
    %end;
  %end;

  proc datasets library = work nolist;
    delete _tmp2 _tmp3 / memtype = data;
  run;
  quit;
  
  %let i = %eval(&i + 1);    
%end;    

proc format;
  picture chi_fmt . = 'N / A';
run;

proc sort data = _out;
  by descending chi_sq;
run;

proc report data = _out box spacing = 1 split = "*";
  column("Predictors Ranking by*Maximized Chi-Square Based on Binary Cut"
         variable type chi_sq);
  define variable / "Predictor" width = 20 center;
  define type     / "Type"      width = 10 center;
  define chi_sq   / "ChiSQ"     width = 15 center format = chi_fmt.;
run;

%mend dmsplit;

%dmsplit(data = data.credit, y = y, x = &varlist);

   +-----------------------------------------------+
   |             Predictors Ranking by             |
   |   Maximized Chi-Square Based on Binary Cut    |
   |     Predictor          Type         ChiSQ     |
   |-----------------------------------------------|
   |        X11         |    C     |     91.51     |
   |--------------------+----------+---------------|
   |         X3         |    N     |     34.26     |
   |--------------------+----------+---------------|
   |        X10         |    C     |     19.99     |
   |--------------------+----------+---------------|
   |         X2         |    N     |     14.36     |
   |--------------------+----------+---------------|
   |         X4         |    N     |      7.39     |
   |--------------------+----------+---------------|
   |        X12         |    C     |      6.17     |
   |--------------------+----------+---------------|
   |         X5         |    N     |      3.95     |
   |--------------------+----------+---------------|
   |        X14         |    C     |      2.06     |
   |--------------------+----------+---------------|
   |        X13         |    C     |      1.79     |
   |--------------------+----------+---------------|
   |        X15         |    C     |     N / A     |
   +-----------------------------------------------+
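For context, the root-node statistic that DMSPLIT maximizes for a numeric predictor can be sketched, outside Enterprise Miner, as a brute-force search over binary cuts scored by the Pearson chi-square of the resulting 2x2 table; max_chisq_split is my own illustrative name, not a DMSPLIT interface:

```python
import numpy as np

def max_chisq_split(x, y):
    """Scan every binary cut of numeric x against binary y and return the cut
    with the largest Pearson chi-square on the resulting 2x2 table."""
    best_cut, best_chi = None, -np.inf
    n = len(x)
    for c in np.unique(x)[:-1]:      # drop the max so both sides are nonempty
        left = x <= c
        obs = np.array([[( left & (y == 0)).sum(), ( left & (y == 1)).sum()],
                        [(~left & (y == 0)).sum(), (~left & (y == 1)).sum()]], float)
        exp = obs.sum(1, keepdims=True) @ obs.sum(0, keepdims=True) / n
        if (exp == 0).any():         # degenerate table, e.g. constant y
            continue
        chi = ((obs - exp) ** 2 / exp).sum()
        if chi > best_chi:
            best_cut, best_chi = c, chi
    return best_cut, best_chi
```

A perfectly separating cut yields chi-square equal to the sample size n, the maximum attainable for a 2x2 table, so the ranking in the report favors predictors whose best cut comes closest to a clean split.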
Posted at 2:53 PM
Mar 27, 2010
 


SVD is at the heart of many modern machine learning algorithms. As the computing vehicle behind PCA, the SVD of a given matrix can be obtained by running PROC PRINCOMP on its covariance matrix without correction for the intercept. With SVD we can carry out many useful tasks that are not readily available in SAS/STAT, such as text mining using LSI (the default algorithm in SAS Text Miner [1]), multivariate time series analysis using MSSA, logistic PLS, etc.

I also highly recommend the book "Principal Component Analysis", 2nd Edition, by I. T. Jolliffe [2]. Prof. Jolliffe gives a thorough review of PCA and its applications in various fields, and provides a road map for further research and reading.


%macro SVD(
           input_dsn,
           output_V,
           output_S,
           output_U,     
           input_vars,
           ID_var,
     nfac=0
           );

%local blank  para  EV  USCORE  n  pos  dsid  nobs  nstmt  x
       outstmt  options  shownote  showsource;

%let shownote=%sysfunc(getoption(NOTES));
%let showsource=%sysfunc(getoption(SOURCE));
options nonotes  nosource;

%let blank=%str( );
%let EV=EIGENVAL;
%let USCORE=USCORE;

%let n=%sysfunc(countW(&input_vars));

%let dsid=%sysfunc(open(&input_dsn));
%let nobs=%sysfunc(attrn(&dsid, NOBS));
%let dsid=%sysfunc(close(&dsid));
%if  &nfac eq 0 %then %do;
     %let nstmt=&blank; %let nfac=&n;
%end;     
%else %do;
     %let x=%sysfunc(notdigit(&nfac, 1)); 
  %if  &x eq 0 %then %do;
          %let nfac=%sysfunc(min(&nfac, &n));
          %let nstmt=%str(n=&nfac);
  %end;
  %else %do;
          %put ERROR: Only accept non-negative integer.;
          %goto exit;
  %end;
%end;

/* calculate U=XV/S */
%if &output_U ne %str() %then %do;
    %let outstmt=  out=&output_U.(keep=&ID_var  Prin:);
%end;
%else %do;
    %let outstmt=&blank;
%end;

%let options=noint cov noprint  &nstmt;

proc princomp data=&input_dsn  
             /* out=&input_dsn._score */
              &outstmt
              outstat=&input_dsn._stat(where=(_type_ in ("&USCORE", "&EV")))  &options;
     var &input_vars;
run;
data &output_S;
     set &input_dsn._stat;
     format Number 7.0;
     format EigenValue Proportion Cumulative 7.4;
     keep Number EigenValue  Proportion Cumulative;
     where _type_="&EV";
     array _X{&n} &input_vars;
     Total=sum(of &input_vars);
     Cumulative=0;
     do Number=1 to dim(_X);
     EigenValue=_X[number];
     Proportion=_X[Number]/Total;
     Cumulative=Cumulative+Proportion;  
     output;
  end;
run;

%if &output_V ne %str() %then %do;
proc transpose data=&input_dsn._stat(where=(_TYPE_="&USCORE")) 
               out=&output_V.(rename=(_NAME_=variable))
               name=_NAME_;
     var &input_vars;
     id _NAME_;
  format &input_vars 8.6;
run;
%end;

/* recompute Proportion */
%if &output_S ne %str() %then %do;
data &output_S;
     set &input_dsn._stat ;
  where _TYPE_="EIGENVAL";
  array _s{*} &input_vars;
  array _x{&nfac, 3} _temporary_; 
  Total=sum(of &input_vars, 0);
  _t=0;
  do _i=1 to &nfac;
     _x[_i, 1]=_s[_i]; _x[_i, 2]=_s[_i]/Total; 
  if _i=1 then _x[_i, 3]=_x[_i, 2]; 
  else _x[_i, 3]=_x[_i-1, 3]+_x[_i, 2];
  _t+sqrt(_x[_i, 2]);
  end;
  do _i=1 to &nfac;
     Number=_i;  
  EigenValue=_x[_i, 1]; Proportion=_x[_i, 2]; Cumulative=_x[_i, 3];
     S=sqrt(_x[_i, 2])/_t;  SinguVal=sqrt(_x[_i, 1] * &nobs);
  keep Number EigenValue  Proportion Cumulative  S SinguVal;
     output;
  end;
run;
%end;

%if &output_U ne %str() %then %do; 
data &output_U;
     array _S{&nfac}  _temporary_;  
     if _n_=1 then do;
        do j=1 to &nfac;
           set  &output_S(keep=SinguVal)  point=j;
           _S[j]=SinguVal; 
           if abs(_S[j]) < CONSTANT('MACEPS') then _S[j]=CONSTANT('BIG');
        end;
    end;
    set &output_U;
    array _A{*}  Prin1-Prin&nfac;
    do _j=1 to dim(_A);
        _A[_j]=_A[_j]/_S[_j];
    end;
    keep &ID_var Prin1-Prin&nfac ;
run;
%end;

%exit: 
options &shownote  &showsource;
%mend;

Try the following sample code to examine the results:


data td;
    input x1 x2;
cards;
2 0
0 -3
;
run;

%let input_dsn=td;
%let id_var= ;
%SVD(&input_dsn,
      output_V,
      output_S,
      output_U,     
      x1  x2,
      &id_var,
      nfac=0
      );
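As a cross-check outside SAS (a sketch using NumPy rather than PROC PRINCOMP), the same 2x2 td matrix shows how the macro recovers singular values, assuming, as SinguVal = sqrt(eigenvalue * nobs) does, that the reported eigenvalues under NOINT and COV are those of X'X/n:

```python
import numpy as np

# the same toy matrix as the td data set above
X = np.array([[2.0, 0.0],
              [0.0, -3.0]])
n = len(X)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # singular values: 3, 2

# eigenvalues of X'X / n, in descending order
eig = np.linalg.eigvalsh(X.T @ X / n)[::-1]

# the macro's SinguVal = sqrt(eigenvalue * nobs) recovers the singular values
recovered = np.sqrt(eig * n)
print(np.allclose(s, recovered))                   # True
```

Here X'X/n = diag(2, 4.5), so sqrt(eig * 2) gives back 2 and 3, matching the SVD.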



References:
[1] Albright, Russ, "Taming Text with the SVD", SAS Institute Inc., Cary, NC, available at http://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf
[2] Jolliffe, I. T., Principal Component Analysis, 2nd Ed., Springer Series in Statistics, 2002.
Posted at 6:00 AM