Web Data Mining with SAS

August 11, 2010
 

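SAS Text Miner provides powerful capabilities for processing unstructured data. There are many text mining packages, but in my experience the SAS solution is comparatively complete:

1. Multi-language support and text parsing. Chinese word segmentation is the troublesome part; SAS/TM relies on SAP Inxight technology to segment Chinese text.
    Most important is the data-crawling capability, which is provided by %TMFILTER.
2. Stop words, new terms, synonyms, and term and document frequencies.
3. Term weighting (WEIGHTED). Its core purpose is still dimension reduction.
4. SVD, which is a prerequisite for term clustering and document clustering.
5. Large-scale clustering algorithms. Given their time complexity, the existing PROC CLUSTER and PROC FASTCLUS will certainly not cope, so variants are needed.
6. Of course, the license for SAS/TM for Chinese is not cheap. On a tight budget, consider using ICTCLAS for Chinese word segmentation instead; practice has shown this to be entirely feasible. Other SAS procedures can then be used for deeper analysis.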
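
Items 2 and 3 in the list above (term and document frequencies, then weighting) can be illustrated outside SAS. What follows is a minimal Python sketch, not SAS Text Miner's implementation: the stop list, the whitespace tokenizer, the sample documents, and the weighting formula tf * log(N / df) are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical stop list; a real project would use a language-specific one.
STOP_WORDS = {"the", "a", "of", "and", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf(docs):
    """Return (document frequencies, per-document TF-IDF weights)."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(token_lists)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    # Weight = term frequency * log(N / document frequency).
    weights = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        weights.append(
            {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in term_freq.items()}
        )
    return doc_freq, weights


# Made-up sample documents.
docs = [
    "SAS text mining",
    "text mining of web data",
    "SAS clustering of web data",
]
doc_freq, weights = tfidf(docs)
```

Terms that occur in every document get weight zero under this formula, which is one way weighting shrinks the term space before an SVD step.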


 Posted by at 9:58 PM
August 11, 2010
 
Editor’s Note: Meet Rick Cornell, SAS Training Course Development Manager. In a multi-part series, Rick will tell us about how a SAS training course is born. In this first installment, learn where a course comes from, what new courses are in the works and Rick’s desert island list of albums.

1) Describe your job at SAS.

I’m the manager of the group that supports course developers and maintains the course development process. The group consists of project managers, editors, and production specialists. I’m an editor by trade (after an earlier life as a middle school teacher), and I still spend a great deal of my time editing course materials.

2) Where do the ideas come from for creating a new SAS training course? An instructor? A customer? From software development or product management; to support a new software product?

Let’s see: yes, yes, and yes. And several more yeses. Ideas come from all over: product managers in R&D or Sales & Marketing; student comments during classes or customer requests at user group events; instructors in Education; country managers in Europe and beyond; industry experts outside of SAS for the Business Knowledge Series program; and elsewhere. If my friend Tom Baker in the RFC (SAS’ Recreation and Fitness Center for employees) has an idea for a course, we’d love to hear it.

3) Is there one particular course development project that stands out in your mind?

That’s a tough one. It feels like all courses have their own quirks and personalities – and you love some courses because of those things, and some, well, you don’t love so much because of those things. The foundation course revision efforts for SAS 9 and 9.2 both stand out just because there were so many people involved and they were such huge undertakings. The larger the project and the more people who are involved, the more crucial the role of a support team is. I’d have a much better answer if you asked me my all-time top ten favorite albums.


Continue reading "How a SAS Training Course is Made - Part 1"
August 11, 2010
 
I just learned today that you can add comments to your saved copies of the SAS 9.2 documentation. How cool is that! I'm not sure how I missed this news; I don't want you to miss it too.

Starting with SAS 9.2, the PDF versions of SAS documentation allow you to use the Comment & Markup feature of Adobe Reader to make notes in a local copy of the document. This sure beats keeping a printed copy on your desk with sticky notes and pencil scrawls all over it.

For those of you who have never used the comment feature, I have provided short instructions below. I verified these instructions using Adobe Reader 9.

  1. Find the document that you need. (I suggest using the documentation by product listing at support.sas.com/documentation to easily locate available PDF files.)


  2. Select the PDF link for your selected document and save the file to your local file system.


  3. Open the saved PDF document. Find a spot where you want to add a comment or a sticky note, or to highlight a few words.


  4. Select Tools from the Comment & Markup pull-down.


  5. Use one of the markup elements to add something to your document. I chose to add a sticky note that included a URL for more information about the topic. The image below shows the document text with the sticky note indicator where I placed it.





You can locate all of your comments when you return to the document by selecting the Comment element in the toolbar. Adobe Reader provides a complete set of tools for managing and using the elements that you added to the document. The next image shows part of the entry for my sticky note.

When I click the yellow note icon, I am taken to the location in the document that contains my note. How cool is that!

UPDATED October 27, 2010: I just learned about a paper written by Roger Muller and Joshua Horstman and presented at MWSUG 2010. The paper is "Custom Google Searches, PDF Sticky Notes, and Other Tips for Organizing and Accessing SAS Help Resources". A section in the paper titled "Working with the PDF Manuals" addresses this topic in more detail.

VARIMAX rotation of PLS loadings

August 10, 2010
 
Partial Least Squares (PLS) is one of several supervised dimension reduction techniques and has attracted attention in recent years. On the one hand, PLS generates a series of scores that maximize the linear correlation between the dependent and independent variables. On the other hand, the PLS loadings can be regarded as counterparts of the loadings in factor analysis. Hence we can rotate the PLS loadings and thereby eliminate variables that contribute little to prediction.


%macro PLSRotate(Loading, TransMat, PatternOut, PatternShort, 
                 method=VARIMAX, threshold=0.25);
/* VARIMAX rotation of PLS loadings. Only variables having 
   large loadings after rotation will enter the final model. 

   Loading dataset contains XLoadings output from PROC PLS 
   and should have variable called NumberOfFactors
   TransMat is the generated Transformation matrix;
   PatternOut is the output Pattern after rotation;
   PatternShort is the output Pattern with selected variables
*/

%local covars;
proc sql noprint;
     select name into :covars separated by ' '
     from   sashelp.vcolumn
     where  libname="WORK" & memname=upcase("&Loading")
        &   upcase(name) NE "NUMBEROFFACTORS"
        &   type="num"
     ;
quit;
%put &covars;

data &Loading.(type=factor);
         set &Loading;
         _TYPE_='PATTERN';
         _NAME_=compress('factor'||_n_);
run;
ods select none;
ods output OrthRotFactPat=&PatternOut;
ods output OrthTrans=&TransMat; 
proc factor  data=&Loading   method=pattern  rotate=&method  simple; 
         var &covars;
run;
ods select all;

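data &PatternShort;
     set &PatternOut;
     array _f{*} factor:;
     _cntfac=0;
     do _j=1 to dim(_f);
        /* zero out loadings whose magnitude does not exceed the threshold */
        _f[_j]=_f[_j]*(abs(_f[_j])>&threshold);
        /* count surviving loadings of either sign */
        _cntfac+(_f[_j] ne 0);
     end;
     /* keep only variables with at least one surviving loading */
     if _cntfac>0 then output;
     drop _cntfac _j;
run;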
%mend;

Here I try to replicate the case study in [1], which explains how to apply VARIMAX rotation to PLS loadings and describes its properties. Even after various tweaks to the convergence criteria and singularity conditions, the PROC PLS output still differs slightly from the results reported in [1] for factors other than the leading one. I will therefore use the U=PS matrix from p. 215 directly.



data loading;
input factor1-factor3;
cards;
-0.9280  -0.0481  0.2750
0.0563  -0.8833  0.5306
-0.9296  -0.0450  0.2720
-0.7534  0.1705  -0.5945
0.5917  -0.0251  -0.6450
0.9082  0.3345    0.1118
-0.8086  0.4551  -0.3800
;
run;


proc transpose data=loading  out=loading2;
run;

data loading2(type=factor);
     retain _TYPE_ "PATTERN";
  set loading2;
run;


ods select none;
ods output OrthRotFactPat=OrthRotationOut;
ods output OrthTrans=OrthTrans; 
proc factor  data=Loading2   method=pattern  rotate=varimax  simple; 
         var col1-col7;
run;
ods select all;
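
To make the rotation step less of a black box, here is a minimal Python sketch of raw VARIMAX using the classic pairwise planar rotations, followed by the thresholding idea from the macro. It is an illustration under stated assumptions, not a reimplementation of PROC FACTOR: it applies no Kaiser row normalization, so its numbers need not match the SAS output, and the function names are hypothetical. The loading matrix is the one entered in the DATA step above (rows are variables, columns are factors).

```python
import math


def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    n = len(loadings)
    total = 0.0
    for j in range(len(loadings[0])):
        squares = [row[j] ** 2 for row in loadings]
        total += sum(s * s for s in squares) / n - (sum(squares) / n) ** 2
    return total


def varimax(loadings, tol=1e-10, max_sweeps=100):
    """Raw varimax (no Kaiser normalization) by rotating factor pairs."""
    rotated = [list(row) for row in loadings]
    n, k = len(rotated), len(rotated[0])
    for _ in range(max_sweeps):
        largest_angle = 0.0
        for p in range(k - 1):
            for q in range(p + 1, k):
                u = [r[p] ** 2 - r[q] ** 2 for r in rotated]
                v = [2.0 * r[p] * r[q] for r in rotated]
                a, b = sum(u), sum(v)
                c = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
                d = 2.0 * sum(ui * vi for ui, vi in zip(u, v))
                # Angle that maximizes the criterion for this pair of factors.
                angle = 0.25 * math.atan2(d - 2.0 * a * b / n,
                                          c - (a * a - b * b) / n)
                largest_angle = max(largest_angle, abs(angle))
                cos_a, sin_a = math.cos(angle), math.sin(angle)
                for r in rotated:
                    x, y = r[p], r[q]
                    r[p], r[q] = x * cos_a + y * sin_a, -x * sin_a + y * cos_a
        if largest_angle < tol:  # no pair needs further rotation
            break
    return rotated


def select_variables(loadings, threshold=0.25):
    """Indices of variables with at least one |loading| above the threshold."""
    return [i for i, row in enumerate(loadings)
            if any(abs(x) > threshold for x in row)]


# U = PS matrix from the DATA step above: 7 variables x 3 factors.
LOADINGS = [
    [-0.9280, -0.0481, 0.2750],
    [0.0563, -0.8833, 0.5306],
    [-0.9296, -0.0450, 0.2720],
    [-0.7534, 0.1705, -0.5945],
    [0.5917, -0.0251, -0.6450],
    [0.9082, 0.3345, 0.1118],
    [-0.8086, 0.4551, -0.3800],
]
rotated = varimax(LOADINGS)
kept = select_variables(rotated, threshold=0.25)
```

Because each step is an orthogonal rotation, every variable keeps its communality (the row sum of squared loadings); only the way that sum is spread across the factors changes.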


Reference:
[1] Huiwen Wang, Qiang Liu, Yongping Tu, "Interpretation of PLS Regression Models with VARIMAX Rotation", Computational Statistics and Data Analysis, Vol. 48 (2005), pp. 207-219
 Posted by at 4:09 AM

Credit Scorecard vs. Decision tree vs. Neural network

August 9, 2010
 

The traditional credit scoring model is a scorecard. A scorecard is a table that contains a number
of questions (called characteristics) that applicants answer. For each question there is a list of
possible answers (called attributes). For example, one characteristic may be the age of the
applicant. The attributes for this characteristic might be a series of age ranges into which an
applicant could fall. For each answer, the applicant receives a certain number of points — more if
the attribute is low risk, fewer if the risk for that attribute is higher. The resulting total score reflects
the probability that the applicant will default on the credit product.

The scorecard model, apart from being a long-established scoring method in the industry, still has
several advantages when compared with more recent data mining types of models, such as
decision trees or neural networks. First, a scorecard is easy to apply. If needed, the scorecard can
be evaluated on a sheet of paper in the presence of the applicant. The scorecard is also easy to
understand. The number of points for one answer doesn’t depend on any of the other answers,
and across the range of possible answers to any question, the number of points usually increases
in a simple way. Therefore, it is often easy to justify to the applicant a decision that is made on the
basis of a scorecard.
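
As a rough illustration of the additive structure described above, the following Python sketch scores an applicant by looking up one attribute per characteristic and summing the points. The characteristics, attribute ranges, and point values are invented for the example and do not come from the white paper.

```python
# Hypothetical scorecard: characteristic -> list of (attribute test, points).
SCORECARD = {
    "age": [
        (lambda v: v < 25, 10),
        (lambda v: 25 <= v < 40, 20),
        (lambda v: v >= 40, 30),
    ],
    "residence": [
        (lambda v: v == "rent", 5),
        (lambda v: v == "own", 25),
    ],
}


def score(applicant):
    """Sum the points of the matching attribute for each characteristic."""
    total = 0
    for characteristic, attributes in SCORECARD.items():
        value = applicant[characteristic]
        for matches, points in attributes:
            if matches(value):
                total += points
                break  # exactly one attribute applies per characteristic
    return total


total_points = score({"age": 30, "residence": "own"})  # 20 + 25 points
```

The points for one characteristic never depend on the answer to another, which is what makes the score easy to compute by hand and to explain.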


Unlike the scorecard, a decision tree detects and exploits interactions between characteristics. In
a decision tree model, each answer that an applicant gives determines which question is asked
next. Thus, a decision tree model consists of a set of if-then-else rules that segment applicants
based on their sequence of answers. Their greater flexibility makes trees potentially more
predictive than scorecard models. Trees can, however, become quite complex. They are also
unstable when updated with new data in the sense that their structure can change dramatically
with a change to the first question asked.
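
By contrast, a tree can be sketched as nested if-then-else rules in which each answer decides which question comes next. Again, the questions, cut-offs, and outcomes below are hypothetical and serve only to show the structure.

```python
def tree_decision(applicant):
    """Hypothetical decision tree: each answer selects the next question."""
    if applicant["residence"] == "own":
        # Home owners are asked about age next; income is never consulted.
        if applicant["age"] >= 40:
            return "accept"
        return "review"
    # Everyone else is asked about income instead of age.
    if applicant["income"] >= 50000:
        return "review"
    return "decline"


owner_outcome = tree_decision({"residence": "own", "age": 45})
renter_outcome = tree_decision({"residence": "rent", "income": 30000})
```

Here the effect of age depends on the residence answer, an interaction that the additive scorecard cannot express. It also shows why changing the first question can reshape the whole tree.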

Neural networks are even more flexible models that account for interactions and combine
characteristics in a variety of ways. They don’t suffer from sharp splits between possibilities as
decision trees and scorecards sometimes do. They also don’t suffer from structural instability in
the same way as decision trees. However, it is virtually impossible to explain or understand the
score that is produced for a particular applicant using a simple method. It can be difficult to justify
a decision that is made on the basis of a neural network model. A neural network of superior
predictive power is therefore best suited for certain behavioral or collection scoring purposes
where the average accuracy of the prediction is more important than gaining insight into the score
for each particular case.

Source: a SAS Institute Inc. white paper

 Posted by at 10:06 PM

Notes on SAS Books (1): Two Statistics Books

 Biostatistics
August 8, 2010
 

1. Categorical Data Analysis Using the SAS System (2nd edition, SAS Inc., 2000)

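The three authors (SDK) are all experts in this field: Maura E. Stokes, a director responsible for statistical tool R&D at SAS; Charles S. Davis, a professor in the Department of Biostatistics at the University of Iowa (who later moved to industry); and Gary G. Koch, a professor in the Department of Biostatistics at the University of North Carolina at Chapel Hill. For categorical data analysis with SAS, this is the most-cited book. One Chinese book, 《分类数据的统计分析及SAS编程》 (Statistical Analysis of Categorical Data and SAS Programming, Fudan University Press, 2002), edited by Liu Qin and Jin Pihuan, clearly shows its influence (the one flaw is that it never mentions SDK).

I also like this book because it is one of the better-typeset titles I have read in the SAS BBU (Books by Users) series. BBU titles are generally published by SAS Press or Wiley, and my impression is that the SAS books from Wiley are typeset to a higher standard than those from SAS Press. This book was published jointly by the two.

When I first learned categorical data analysis, I started straight away with logistic regression, which credit scoring and data mining require (the second half of the book covers these models). In my current job, what I need more is the first half of the book: the various tests on contingency tables. This is the comparatively traditional part of statistics, and also the part that people like me, who were not trained as statisticians, neglect most when learning the subject.

In addition, the book uses mathematical formulas in just the right measure. They are concise and illuminating, enough to let a SAS programmer with (only) basic mathematical literacy grasp what the models mean without drowning in a sea of mathematical symbols.

2. Common Statistical Methods for Clinical Research with SAS Examples (2nd edition, SAS Inc., 2002)

The author, Glenn A. Walker, is one of the longest-serving SAS consultants. The book is now in its third edition, with a co-author added: Jack Shostak, a SAS programmer at the Duke Clinical Research Institute (the world's largest academic CRO), who is also the author of SAS Programming in the Pharmaceutical Industry.

I have not compared it with the new edition, so I will only discuss the second edition I have at hand. Like SDK, this book is for cramming the statistical foundations I now need, and it leaves only two chapters for regression: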

Chapter 1 – Introduction & Basics 
Chapter 2 – Topics in Hypothesis Testing 
Chapter 3 – The Data Set TRIAL 
Chapter 4 – The One-Sample t-Test 
Chapter 5 – The Two-Sample t-Test 
Chapter 6 – One-Way ANOVA 
Chapter 7 – Two-Way ANOVA 
Chapter 8 – Repeated Measures Analysis
Chapter 9 – The Crossover Design
Chapter 10 – Linear Regression
Chapter 11 – Analysis of Covariance
Chapter 12 – The Wilcoxon Signed-Rank Test
Chapter 13 – The Wilcoxon Rank-Sum Test 
Chapter 14 – The Kruskal-Wallis Test
Chapter 15 – The Binomial Test
Chapter 16 – The Chi-Square Test
Chapter 17 – Fisher’s Exact Test
Chapter 18 – McNemar’s Test
Chapter 19 – The Cochran-Mantel-Haenszel Test
Chapter 20 – Logistic Regression
Chapter 21 – The Log-Rank Test
Chapter 22 – The Cox Proportional Hazards Model

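The chapter layout resembles that of Jerrold Zar's well-known Biostatistical Analysis, which makes the book handy as a reference for SAS programmers. These statistical methods are commonly used in the clinical departments of pharmaceutical companies. The more model-heavy data mining family sees more use in discovery and pharmacovigilance (PV) departments.

The writing itself has nothing distinctive. The book's strengths are that it is practical and comprehensive, which makes it ideal as a reference.

----------

As a SAS programmer without a statistics background, I mostly wander among data steps. Leafing through the statistical side of SAS now, I can sense another vast world behind it. Take a look at the SAS/STAT 9.2 R&D team: listing book titles and people's names never fails to rouse my fighting spirit.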

Support Vector Machines in SAS

August 8, 2010
 

... /* create the DMDB catalog required by PROC SVM */
proc dmdb batch data=mylib.bank8dtr dmdbcat=ctr out=dtr;
   var atmct adbdda_ ddatot_ ddadep_ income invest savbal_ atres;
   class acquire(desc);
run;

/* linear SVM; TESTOUT= saves the scored test data that is inspected below */
proc svm data=dtr dmdbcat=ctr c=1. kernel=linear
         testdata=mylib.bank8dte testout=temp;
   var atmct adbdda_ ddatot_ ddadep_ income invest savbal_ atres;
   target acquire;
run;

proc contents data=temp; run;
proc print data=temp(obs=10); run;


/* linear SVM with a different C value */
proc svm data=dtr dmdbcat=ctr kernel=linear
         testdata=mylib.bank8dte testout=temp;
   var atmct adbdda_ ddatot_ ddadep_ income invest savbal_ atres;
   target acquire;
   c .01;
run;

 Posted by at 2:05 AM

How to Make the Curves in a Plot Look Smoother, with Related Issues and Applications

 B-spline functions, SAS graphics programming, smoothing, data preprocessing, spline
August 7, 2010
 

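Sometimes when we draw a line chart, the connected points look like the first figure below: rough and jagged. These are the actual data points, but a line chart mostly exists to show a trend, so a plot like this looks ugly. We therefore need to smooth the curve.

Here we use an interpolation method, namely spline interpolation. Most software tools offer splines as a way to smooth curves. Splines originate in practice: long ago, in craft manufacturing and shipbuilding, contact surfaces had to be very smooth to reduce contact resistance, so highly elastic wooden strips were used as guides (see the figure below). Such a strip, the spline, served as a reference tool for drawing smooth curves.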
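
To show concretely what spline interpolation computes, here is a minimal Python sketch of a natural cubic spline through a set of points. The choice of the natural spline (zero second derivative at both ends) is an assumption, since the excerpt does not say which spline variant the post goes on to use. The function name and the sample points are hypothetical.

```python
import bisect


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through the points.

    xs must be strictly increasing and contain at least two values.
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)  # second derivatives at the knots; ends stay zero

    if n >= 2:
        # Tridiagonal system for the interior second derivatives m[1..n-1].
        diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
        rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
               for i in range(1, n)]
        # Forward elimination (Thomas algorithm).
        for k in range(1, n - 1):
            factor = h[k] / diag[k - 1]
            diag[k] -= factor * h[k]
            rhs[k] -= factor * rhs[k - 1]
        # Back substitution.
        m[n - 1] = rhs[-1] / diag[-1]
        for k in range(n - 3, -1, -1):
            m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]

    def evaluate(x):
        # Locate the interval [xs[i], xs[i+1]] containing x (clamped at the ends).
        i = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 1)
        a, b = xs[i + 1] - x, x - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return evaluate


# Made-up jagged data; evaluate on a finer grid to get a smooth curve.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 0.0, 1.0, 0.0]
spline = natural_cubic_spline(xs, ys)
smooth_curve = [(x / 10.0, spline(x / 10.0)) for x in range(0, 41)]
```

Plotting `smooth_curve` instead of the raw points gives a curve that still passes through every original point but bends smoothly between them.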

 Posted by at 12:00 AM
August 5, 2010
 
If repeats are sweet, then the SAS Education e-Learning Technology team is running on a sugar high that would make Willy Wonka (and my young daughters) envious. Not only did it just earn honors from the Society for Technical Communication’s (STC) international competition for the second consecutive year; it took home the highest accolade possible: an Award of Distinguished Technical Communication, for the multimedia e-course SAS Programming 2: Data Manipulation Techniques.

In the e-learning world, that feat is like chasing an ice cream sundae with chocolate cake and a hearty sampling of sugar-coated Saturday-morning breakfast cereal. It doesn’t get much sweeter. The STC, a membership organization that dedicates itself to advancing the arts and sciences of technical communication, is the largest group of its kind in the world.

Last year, the e-course SAS Programming Introduction: Basic Concepts won STC’s Excellence Award, proving, as Vice President for Education Larry Stewart commented, that the e-learning team “continues to set the bar high for this type of technical communication.”

Translating the engaging style of a traditional classroom instructor to electronic form is certainly no easy task, but SAS Instructional Technology Manager Nancy Goodyear’s group designed an effective way to transport color, verve and functionality to even the grayest of cubicles with a dynamic computer course that uses audio, Flash movie and demos to present the training. STC judges were particularly impressed with the course’s “exceptional” organization and navigation and its ability to engage the learner through interactive questions, practices and quizzes.

Do you need more proof that the SAS e-Learning team is the cream of the crop? Nope, probably not. But here’s the cherry on top, all the same: To advance to the international competition, the course first had to distinguish itself from all other entries in the remarkably challenging STC Carolina Chapter Online Competition – which includes many Research Triangle Park peers. It did, winning top honors and Best of Show there.

Goodyear said we can expect more new e-courses in the near future, and that’s (dare we say it?) sweet news for SAS users who love the convenience of e-Learning. Have you taken a SAS e-course recently? Tell us what you liked about it.