VARIMAX rotation of PLS loadings

Partial Least Squares (PLS) is one of several supervised dimension-reduction techniques and has attracted growing attention in recent years. On the one hand, PLS generates a series of scores that maximize the linear association between the dependent and independent variables; on the other hand, PLS loadings can be regarded as counterparts of factor-analysis loadings, so we can rotate them and thereby eliminate variables that are non-significant for prediction.


%macro PLSRotate(Loading, TransMat, PatternOut, PatternShort, 
                 method=VARIMAX, threshold=0.25);
/* VARIMAX rotation of PLS loadings. Only variables having 
   large loadings after rotation will enter the final model. 

   Loading dataset contains XLoadings output from PROC PLS 
   and should have variable called NumberOfFactors
   TransMat is the generated Transformation matrix;
   PatternOut is the output Pattern after rotation;
   PatternShort is the output Pattern with selected variables
*/

%local covars;
proc sql noprint;
     select name into :covars separated by ' '
     from   sashelp.vcolumn
     where  libname="WORK" and memname=upcase("&Loading")
        and upcase(name) ne "NUMBEROFFACTORS"
        and type="num"
     ;
quit;
%put &covars;

data &Loading.(type=factor);
     set &Loading;
     /* label each row as a pattern row so PROC FACTOR accepts it */
     _TYPE_='PATTERN';
     _NAME_=cats('factor', _n_);
run;
ods select none;
ods output OrthRotFactPat=&PatternOut;
ods output OrthTrans=&TransMat; 
proc factor  data=&Loading   method=pattern  rotate=&method  simple; 
         var &covars;
run;
ods select all;

data &PatternShort;
     set &PatternOut;
  array _f{*} factor:;
  _cntfac=0;
  do _j=1 to dim(_f);
     /* zero out loadings below the threshold in absolute value */
     _f[_j]=_f[_j]*(abs(_f[_j])>&threshold);
     /* count surviving loadings; test NE 0 so large negative loadings also count */
     _cntfac+(_f[_j] ne 0);
  end;
  if _cntfac>0 then output;
  drop _cntfac _j;
run;
%mend;
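
For completeness, a minimal usage sketch (the data set train, the response y, and the predictors x1-x7 are hypothetical; the DETAILS option is what makes PROC PLS emit the XLoadings ODS table the macro expects):

ods output XLoadings=xloads;
proc pls data=train nfac=3 details;
     model y = x1-x7;
run;

%PLSRotate(xloads, xtrans, xpattern, xpattern_short);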

Here I try to replicate the case study in [1], which elaborates how to apply VARIMAX rotation to PLS loadings and what properties the rotation has. The PROC PLS output, after various tweaks to convergence criteria and singularity conditions, still differs slightly from the result reported in [1] for factors other than the leading one; therefore, I directly use the U=PS matrix on p. 215.



data loading;
input factor1-factor3;
cards;
-0.9280  -0.0481   0.2750
 0.0563  -0.8833   0.5306
-0.9296  -0.0450   0.2720
-0.7534   0.1705  -0.5945
 0.5917  -0.0251  -0.6450
 0.9082   0.3345   0.1118
-0.8086   0.4551  -0.3800
;
run;


proc transpose data=loading  out=loading2;
run;

data loading2(type=factor);
     retain _TYPE_ "PATTERN";
  set loading2;
run;


ods select none;
ods output OrthRotFactPat=OrthRotationOut;
ods output OrthTrans=OrthTrans; 
proc factor  data=Loading2   method=pattern  rotate=varimax  simple; 
         var col1-col7;
run;
ods select all;
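
To inspect the result, print the transformation matrix and the rotated loading pattern captured by the ODS OUTPUT statements above:

proc print data=OrthTrans; run;
proc print data=OrthRotationOut; run;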


Reference:
[1] Huiwen Wang, Qiang Liu and Yongping Tu, "Interpretation of PLS Regression Models with VARIMAX Rotation", Computational Statistics & Data Analysis, Vol. 48 (2005), pp. 207-219.

Credit Scorecard vs. Decision tree vs. Neural network


The traditional credit scoring model is a scorecard. A scorecard is a table that contains a number
of questions (called characteristics) that applicants answer. For each question there is a list of
possible answers (called attributes). For example, one characteristic may be the age of the
applicant. The attributes for this characteristic might be a series of age ranges into which an
applicant could fall. For each answer, the applicant receives a certain number of points — more if
the attribute is low risk, fewer if the risk for that attribute is higher. The resulting total score reflects
the probability that the applicant will default on the credit product.
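
As a sketch, a scorecard amounts to a handful of if/then point assignments. The data set applicants, the characteristics age and home_owner, and all point values below are hypothetical:

data scored;
     set applicants;
     score = 0;
     /* characteristic: age, one hypothetical point value per attribute (range) */
     if      age < 25 then score + 10;
     else if age < 40 then score + 25;
     else                  score + 40;
     /* characteristic: residential status */
     if home_owner then score + 30;
     else               score + 15;
run;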

The scorecard model, apart from being a long-established scoring method in the industry, still has
several advantages when compared with more recent data mining types of models, such as
decision trees or neural networks. First, a scorecard is easy to apply. If needed, the scorecard can
be evaluated on a sheet of paper in the presence of the applicant. The scorecard is also easy to
understand. The number of points for one answer doesn’t depend on any of the other answers,
and across the range of possible answers to any question, the number of points usually increases
in a simple way. Therefore, it is often easy to justify to the applicant a decision that is made on the
basis of a scorecard.


Unlike the scorecard, a decision tree detects and exploits interactions between characteristics. In
a decision tree model, each answer that an applicant gives determines which question is asked
next. Thus, a decision tree model consists of a set of if-then-else rules that segment applicants
based on their sequence of answers. Their greater flexibility makes trees potentially more
predictive than scorecard models. Trees can, however, become quite complex. They are also
unstable when updated with new data in the sense that their structure can change dramatically
with a change to the first question asked.
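
Expressed as code, a decision tree is just nested if/then/else rules; here is a hypothetical two-level tree over the same imaginary applicants data:

data decided;
     set applicants;
     length decision $ 8;
     /* first split on age, then on income within the younger branch */
     if age < 30 then do;
        if income < 20000 then decision = 'decline';
        else                   decision = 'review';
     end;
     else decision = 'approve';
run;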

Neural networks are even more flexible models that account for interactions and combine
characteristics in a variety of ways. They don’t suffer from sharp splits between possibilities as
decision trees and scorecards sometimes do. They also don’t suffer from structural instability in
the same way as decision trees. However, it is virtually impossible to explain or understand the
score that is produced for a particular applicant using a simple method. It can be difficult to justify
a decision that is made on the basis of a neural network model. A neural network of superior
predictive power is therefore best suited for certain behavioral or collection scoring purposes
where the average accuracy of the prediction is more important than gaining insight into the score
for each particular case.

From a SAS Institute whitepaper.


Random notes on SAS books (1): Two statistics books


1. Categorical Data Analysis Using the SAS System (2nd edition, SAS Inc., 2000)

The book has three authors (SDK), all experts in this field: Maura E. Stokes, a director in charge of statistical software R&D at SAS; Charles S. Davis, a professor of biostatistics at the University of Iowa (he later moved to industry); and Gary G. Koch, a professor of biostatistics at the University of North Carolina at Chapel Hill. This is the most-cited book on categorical data analysis with SAS. A Chinese book, 《分类数据的统计分析及SAS编程》 by 刘勤 and 金丕焕 (Fudan University Press, 2002), shows this book's influence throughout (a pity that it never mentions SDK).

I also like this book because, among the SAS BBU (Books by Users) titles I have read, it is one of the best typeset. BBU books are generally published by either SAS Press or Wiley, and my impression is that Wiley's SAS books are better produced than SAS Press's. This one was co-published by both.

When I first learned categorical data analysis, I started straight from logistic regression, which credit scoring and data mining work requires (the second half of this book covers those models). My current work draws more on the first half: the various tests on contingency tables. That is the relatively classical part of statistics, and the part that people like me, who did not study statistics formally, tend to neglect the most.

Moreover, the book uses mathematical formulas just right: concise yet illuminating, letting a SAS programmer with (only) basic mathematical training grasp what a model means without drowning in an ocean of notation.

2. Common Statistical Methods for Clinical Research with SAS Examples (2nd edition, SAS Inc., 2002)

The author, Glenn A. Walker, is one of SAS's longest-serving consultants. The book is now in its third edition, which adds a co-author: Jack Shostak, a SAS programmer at the Duke Clinical Research Institute (the world's largest academic CRO) and the author of SAS Programming in the Pharmaceutical Industry.

I have not compared my copy with the new edition, so I will only discuss the second edition at hand. Like SDK, this book is what I use to shore up the statistical foundations my current work requires; it leaves only two chapters for regression:

Chapter 1 – Introduction & Basics 
Chapter 2 – Topics in Hypothesis Testing 
Chapter 3 – The Data Set TRIAL 
Chapter 4 – The One-Sample t-Test 
Chapter 5 – The Two-Sample t-Test 
Chapter 6 – One-Way ANOVA 
Chapter 7 – Two-Way ANOVA 
Chapter 8 – Repeated Measures Analysis
Chapter 9 – The Crossover Design
Chapter 10 – Linear Regression
Chapter 11 – Analysis of Covariance
Chapter 12 – The Wilcoxon Signed-Rank Test
Chapter 13 – The Wilcoxon Rank-Sum Test 
Chapter 14 – The Kruskal-Wallis Test
Chapter 15 – The Binomial Test
Chapter 16 – The Chi-Square Test
Chapter 17 – Fisher’s Exact Test
Chapter 18 – McNemar’s Test
Chapter 19 – The Cochran-Mantel-Haenszel Test
Chapter 20 – Logistic Regression
Chapter 21 – The Log-Rank Test
Chapter 22 – The Cox Proportional Hazards Model

The chapter layout resembles that of Jerrold Zar's famous Biostatistical Analysis, which makes the book handy for SAS programmers to consult. These statistical methods are the daily staples of a pharma company's clinical department, while the more model-flavored data mining toolkit sees heavier use in discovery and pharmacovigilance (PV) departments.

The writing itself is nothing special; the book's strengths are practicality and completeness, which make it an excellent desk reference.

————————————————————-

As a SAS programmer without a statistics background, I usually wander among data steps; browsing the statistical side of SAS reveals another vast world behind it. Skimming the roster of the SAS/STAT 9.2 development team, I find that listing book titles and names never fails to stir up my fighting spirit.

Support vector machines in SAS


... /* create the DMDB catalog required by PROC SVM */
proc dmdb batch data=mylib.bank8dtr dmdbcat=ctr out=dtr;
   var atmct adbdda_ ddatot_ ddadep_ income invest savbal_ atres;
   class acquire(desc);
run;

/* linear svm; TESTOUT= added so the scored test set can be examined below */
proc svm data=dtr dmdbcat=ctr c=1. kernel=linear
         testdata=mylib.bank8dte testout=temp;
   var atmct adbdda_ ddatot_ ddadep_ income invest savbal_ atres;
   target acquire;
run;

proc contents data=temp; run;
proc print data=temp(obs=10); run;


/* linear svm with a different C value */
proc svm data=dtr dmdbcat=ctr kernel=linear
         testdata=mylib.bank8dte testout=temp;
   var atmct adbdda_ ddatot_ ddadep_ income invest savbal_ atres;
   target acquire;
   c .01;
run;


How to make curves in a plot look smoother, with related issues and applications


Sometimes when we draw a line plot, simply connecting the observed points yields a jagged line like the first figure below. Even though these are the actual data points, such a plot is usually meant to show a trend, and the rough connect-the-dots line looks bad, so we need to smooth the curve.

Here we use an interpolation method, namely spline interpolation. Most software tools offer splines as a way to smooth curves. Splines actually come from practice: long ago in machining and shipbuilding, contact surfaces had to be made very smooth to reduce resistance, so craftsmen used highly elastic wooden strips as drafting guides (as in the figure below), tracing along the bent strip (the spline) to draw a smooth curve.
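
In classic SAS/GRAPH, spline smoothing of a plotted line is a one-line change via the SYMBOL statement. A minimal sketch, assuming a data set points with variables x and y (the data should be sorted by x first):

symbol1 interpol=spline value=dot;
proc gplot data=points;
     plot y*x;
run;
quit;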

August 5, 2010
If repeats are sweet, then the SAS Education e-Learning Technology team is running on a sugar high that would make Willy Wonka (and my young daughters) envious. Not only did it just earn honors from the Society for Technical Communication’s (STC) international competition for the second consecutive year; it took home the highest accolade possible: an Award of Distinguished Technical Communication, for the multimedia e-course SAS Programming 2: Data Manipulation Techniques.

In the e-learning world, that feat is like chasing an ice cream sundae with chocolate cake and a hearty sampling of sugar-coated Saturday-morning breakfast cereal. It doesn’t get much sweeter. The STC, a membership organization that dedicates itself to advancing the arts and sciences of technical communication, is the largest group of its kind in the world.

Last year, the e-course SAS Programming Introduction: Basic Concepts won STC’s Excellence Award, proving, as Vice President for Education Larry Stewart commented, that the e-learning team “continues to set the bar high for this type of technical communication.”

Translating the engaging style of a traditional classroom instructor to electronic form is certainly no easy task, but SAS Instructional Technology Manager Nancy Goodyear’s group designed an effective way to transport color, verve and functionality to even the grayest of cubicles with a dynamic computer course that uses audio, Flash movie and demos to present the training. STC judges were particularly impressed with the course’s “exceptional” organization and navigation and its ability to engage the learner through interactive questions, practices and quizzes.

Do you need more proof that the SAS e-Learning team is the cream of the crop? Nope, probably not. But here’s the cherry on top, all the same: To advance to the international competition, the course first had to distinguish itself from all other entries in the remarkably challenging STC Carolina Chapter Online Competition – which includes many Research Triangle Park peers. It did, winning top honors and Best of Show there.

Goodyear said we can expect more new e-courses in the near future, and that’s (dare we say it?) sweet news for SAS users who love the convenience of e-Learning. Have you taken a SAS e-course recently? Tell us what you liked about it.



Table Look Up in SAS, practical problems



Someone asked in a SAS forum about a typical table lookup problem. He has a data set with two IDs:
id1 id2
a b
a e
b c
b e
c e
d e

and he wants to generate a new data set with the following structure based on the above information:
id a b c d e
a 0 1 0 0 1
b 1 0 1 0 1
c 0 1 0 0 1
d 0 0 0 0 1
e 1 1 1 1 0

The real data is potentially big.
***************************;
At first look, this is a typical table lookup problem that SAS programmers face almost every day: a lookup against a table with duplicate keys. It is a simple one because there is no inherent relationship among the records.


data original;
   input id1 $ id2 $;
datalines;
a b
a e
b c
b e
c e
d e
;
run;

proc datasets library=work nolist;
     modify original;
     /* simple indexes on id1 and id2 enable the KEY= lookups below */
     index create id1 id2;
quit;

proc sql;
     /* collect the distinct IDs from both columns and number them;
        seq becomes each ID's column position in the output matrix */
     create table all_cases as
     select a.*, monotonic() as seq
     from (select distinct id1 as id from original
           union
           select distinct id2 as id from original) as a
     order by a.id
     ;
quit;

proc sql noprint;
     select id into :idnames separated by ' '
     from   all_cases
     ;
quit;

data new;
   /* the hash maps each distinct id to its column position seq */
   if _n_=1 then do;
      declare hash _h(dataset:'all_cases');
      _h.defineKey('id');
      _h.defineData('seq');
      _h.defineDone();
   end;
   set all_cases;

   array _a{*} &idnames;

   /* pass 1: look up the current id in the indexed id1 column;
      flag every partner id2 found among the duplicates */
   id1=id;
   set original key=id1;
   do while (_iorc_=%sysrc(_sok));
      rc=_h.find(key:id2);
      if rc=0 then _a[seq]=1;
      id1=id;
      set original key=id1;
   end;
   _ERROR_=0;   /* clear the error flag set by a failed KEY= search */

   /* pass 2: the same lookup against the indexed id2 column */
   id2=id;
   set original key=id2;
   do while (_iorc_=%sysrc(_sok));
      rc=_h.find(key:id1);
      if rc=0 then _a[seq]=1;
      id2=id;
      set original key=id2;
   end;
   _ERROR_=0;

   /* MAX() ignores missing values, so unmatched cells become 0 */
   do j=1 to dim(_a); _a[j]=max(0, _a[j]); end;
   keep id &idnames;
run;
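
A quick check that the result matches the requested layout:

proc print data=new noobs;
run;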

On the other hand, this problem can be solved in a more SASsy way (note that the query below reuses the all_cases table built above):


data original;
   input id1 $ id2 $;
datalines;
a b
a e
b c
b e
c e
d e
;
run;

proc sql;
     create table newx as
     select a.id1, a.id2,
            (sum(a.id1=c.id1 and a.id2=c.id2) > 0) as count
     from (select a.id as id1, b.id as id2
           from all_cases as a, all_cases as b) as a
          left join original as c
          on a.id1=c.id1 or a.id2=c.id1
     group by a.id1, a.id2
     ;
quit;

proc transpose data=newx  out=_freq_t name=id2;
     by id1;
     var count;
     id id2;
run;

data _freq_t;
     set _freq_t;
     array _n{*} _numeric_;
     do i=1 to dim(_n);
        _n[i]=(_n[i]>0);
     end;
     drop i;
run;

proc transpose data=_freq_t(drop=id2) out=_freq_t2  name=id1;
     id id1;
run;

proc sql noprint;
     select id1, count(distinct id1) into :covars separated by ' ', :count
     from   _freq_t;  
quit;

data new2;
     set _freq_t;
     array _x{*} &covars;
     array _x2{&count} _temporary_;

     /* hold this row, read the matching row of the transposed table,
        then OR the two together to make the matrix symmetric */
     do j=1 to &count; _x2[j]=_x[j]; end;
     set _freq_t2;
     do j=1 to &count; _x[j]=(_x[j]+_x2[j]>0); end;
     drop j id2;
run;
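
Both routes should produce the same 0/1 flags; a quick sanity check on the flag columns:

proc compare base=new compare=new2;
     var &idnames;
run;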
August 4, 2010
It’s no secret that SAS enjoys a high degree of customer loyalty, and it’s also well known that much of that loyalty is driven by how much we pay attention to our customers.

This devotion to our customers is helping to ensure that we deliver on our promise of rock-solid solutions to business problems by giving customers the power to know more about their business issues, as well as how to address them. As per my previous post about branding being as much about operational alignment as it is about clear communications ("Are You All Paying Attention?"), I see a story here about the power of the SAS brand.

The many ways SAS pays attention to its customers begins with the fact that SAS customers receive a full suite of support services at no additional charge, including skilled telephone technical support and unlimited around-the-clock online technical support. And that’s just the tip of the iceberg:

  • The SAS Customer Support Site has more than 250,000 searchable resource pages on documentation, samples and SAS® notes that provide problem solutions as well as usage and installation information.
  • The SAS Technical Support service includes an average hold time of less than 30 seconds.
  • The SAS Technical Support staff has an average of 12 years of experience as statisticians, technical architects, programmers, database administrators and operating system engineers.
  • The SAS Education Division aims to provide customers with information before they need it through nearly 5,000 courses in any given year to over 60,000 students worldwide through a variety of media to accommodate multiple learning styles.
  • So far this year, the SAS Customer Loyalty Team has contacted over 6,500 SAS representatives at SAS customers in the U.S. alone.
  • More than 5,400 emails have been sent about the new SAS® Customer Resource Center, which has a wealth of information organized by the person’s role, including administrators, analysts, decision makers, programmers and statisticians.
  • SAS offers a library of videos to provide an overview of resources that customers have available to them, and a different link if reading about it is preferred.
  • The SAS Publications Division enables SAS customers to help other SAS customers by highlighting their own real-world ways to solve their business problems using SAS software.

All these resources begin and end with the customer in mind, which points to the degree to which SAS culture centers on the customer. [As I write that, I hear my marketing professor in my head voicing his approval.] To me, what’s even more notable is how many SAS Users Groups there are around the world, and every single one of them is run entirely by our customers. Now that is loyalty!

[Special thanks to Kat Hardy for the original internal article on this topic and to Melissa Perez for helping me keep the details in this post accurate.]