Clustering In SAS

August 18, 2010
 

TWO-STAGE:

STAGE 1: Variable clustering based on a distance matrix
   1. Calculate the correlation matrix of the variables.
   2. Apply a hierarchical clustering algorithm to the correlation matrix.
   3. Using a predefined cluster number, cluster the variables into homogeneous groups.
      The cluster number is generally no more than the integer value of (nvar/100+2),
      where nvar is the number of variables.
      These clusters are called global clusters.
STAGE 2: Variable clustering based on latent variables
   1. Run PROC VARCLUS on all variables within each global cluster, as you would
      for a single-stage variable clustering task.
   2. For each global cluster, calculate the global cluster component, which is the
      first principal component of the variables in that cluster.
   3. Create a global cluster structure from the global cluster components, using the
      same method as in step 1 of Stage 2.
   4. Form a single tree of variable clusters from steps 1 and 3.
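
Stage 1 above can be illustrated with a small sketch. It is written in plain Python rather than SAS so that it runs without any SAS installation, and it covers only Stage 1 and the (nvar/100+2) rule; Stage 2 depends on PROC VARCLUS and is not reproduced. The distance 1 - |r|, the average-linkage rule, the function names, and the demo variables x1..y3 are all assumptions made for illustration, not part of the original procedure.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def max_global_clusters(nvar):
    """Rule of thumb from the text: integer value of nvar/100 + 2."""
    return int(nvar / 100 + 2)

def global_clusters(columns, n_clusters):
    """Stage 1: agglomerative clustering of variables.

    Assumption: distance = 1 - |correlation|, merged by average linkage.
    """
    names = list(columns)
    dist = {(a, b): 1.0 - abs(pearson(columns[a], columns[b]))
            for a in names for b in names if a != b}
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average distance over all cross-cluster variable pairs.
                total = sum(dist[a, b] for a in clusters[i] for b in clusters[j])
                avg = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical demo data: three variables follow sin(t), three follow cos(t).
t = [2 * math.pi * k / 50 for k in range(50)]
columns = {
    "x1": [math.sin(v) for v in t],
    "x2": [math.sin(v) + 0.1 * math.sin(3 * v) for v in t],
    "x3": [math.sin(v) + 0.1 * math.cos(5 * v) for v in t],
    "y1": [math.cos(v) for v in t],
    "y2": [math.cos(v) + 0.1 * math.sin(2 * v) for v in t],
    "y3": [math.cos(v) + 0.1 * math.cos(4 * v) for v in t],
}
clusters = global_clusters(columns, max_global_clusters(len(columns)))
print(clusters)
```

With six variables the rule allows two global clusters, and the sketch separates the sine-driven variables from the cosine-driven ones. In a real two-stage run, each of these groups would then be passed to PROC VARCLUS separately.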


August 18, 2010
 
Those of you who have anxiously been awaiting the next installments in my "What is there to do in Cary?" series may have noticed a distinct lack of output over the last month or so. This is due to the fact that I have been forced to spend my time actually teaching courses, rather than sampling local restaurants for your benefit!

The schedule of an instructor not only makes it hard to research restaurants and update blogs, it also makes it hard to keep up with the technical stuff. For example, although I don't teach any of the SQL courses, it's useful for me to know a fair amount about SQL so that I can answer SQL-related questions from my students. So, I figured it would be a good idea to take the SQL 1: Essentials class to brush up on things. Unfortunately, I ran into the same dilemma that a lot of you have: (1) the classes in Cary were on dates when I couldn't attend, and (2) the courses on dates that I could attend were in locations I couldn't get to. My next option was Live Web, but that required 4 half-day sessions. Well, the first four available days I had that matched the Live Web offerings were in December! (We instructors keep very busy.)

With the prospect of being inundated with SQL questions from my students that I couldn't answer, I searched for any alternative, and I found the perfect one--e-Learning! If you're not familiar with it, SAS e-Learning is a series of pre-recorded courses that cover the same topics as many instructor-based and Live Web courses. The advantage is that you can view them (and do the exercises) at your own pace. For someone who is in his office an average of 3 days every two weeks, this is a great alternative. So, I checked out the SQL 1: Essentials course at http://support.sas.com/training/elearn/, and became an SQL expert over the course of 2 weeks! If you need to learn some aspect of SAS, but you don't have a block of time available, I highly recommend the e-Learning courses (and I am not being paid to say this; I'm just a satisfied customer).

BTW, another aspect of e-Learning that you may find useful is e-Lectures. This is a series of recorded lectures that run 20-80 minutes and cover topics that are not included in regular courses. Check them out at https://support.sas.com/edu/elearning.html?ctry=us&productType=electure
August 18, 2010
 
Dr. Goutam Chakraborty is a professor of marketing and founder of the SAS and OSU Data Mining Certificate Program at Oklahoma State University. He has been involved with the annual data mining conference for five years and also teaches a Business Knowledge Series course, Getting the Most Out of Testing in Direct/Internet Marketing. Read below to get his unique perspective of M2010 and more information about his course, which he’ll be teaching at the conference.

1. You’ve had numerous roles at the M-series of data mining conferences: co-chair, speaker, sponsor, attendee and instructor. What information can you share about the conference with first-time attendees?

GC: This conference is really focused on data mining and analytics and is not very large. As a result, you will have a chance not just to meet but actually get to know some of the experts in industry and academia who are working on data mining and analytics problems. First-time attendees should feel totally comfortable walking up to any speaker, session chair, or conference chair and getting to know them. That is the culture of this conference.

2. Why do you continue to be involved with M2010?

GC: Because it is the place to be when your interests lie in learning new tools/techniques/applications in data mining and business analytics. Frankly, I get more out of attending this conference than from many other conferences (academic or professional) that I have attended in the last 20+ years.

3. You are a professor in the Department of Marketing at Oklahoma State University and you’ve had several teams of your students compete in the Data Mining Shootout. A few of your teams have placed in the top 3. How does this competition serve the academic community?

GC: The shootout run by SAS and sponsored by Dow and CMRC (Central Michigan University Research Corp) is a vital link in development of students' skills in working with real business problems and answering those problems using data mining. It gives students from any university a chance to compete against their peers from other universities and in the process discover how good they really are. Let’s face it – competition brings out the best among each of the student teams and regardless of their actual standing all teams learn a lot in the process.


Continue reading "The Culture of the M2010 Data Mining Conference"

POPULATION STABILITY INDEX FOR SAS/ENTERPRISE MINER SCORECARD

August 17, 2010
 

As of version 6.1, SAS Enterprise Miner (EM) does not report a Population Stability Index (PSI) in the EM Scorecard node, but the statistic can be computed another way. One solution, using SAS code from rpruitt@premierbankcard.com, is detailed below:

/*************************************************************************/
/* This program calculates PSI (Population Stability Index) Statistic */
/* It was originally sent to PREMIER (Jay Kosters) per his request on */
/* 4/2/2009. */
/* Dan Kelly, SAS Institute, provided an example of code with various */
/* SAS Code options. */
/* Jay asked Rex to translate the SAS Code and refine it for use by */
/* PREMIER */
/* Programming was completed between 4/10 & 4/15, 2009 */
/*************************************************************************/
/* Dan Kelly's ancillary instructions: */
/* So a few obvious questions that come up are "how do you define the */
/* buckets" and "how many buckets do I need"? And "what are sample 1 */
/* and sample 2"? */
/* If sample 1 and sample 2 are different months (as you have) then you */
/* just need the bucket definition. */
/* */
/* Most of the time I think people use this on the scores, not the */
/* individual attributes that comprise the score. There's nothing */
/* to stop you from testing whether x1 drifts from month to month, */
/* or x2, or x3, ... */
/* */
/* For the most part when I see people use this they are just looking at */
/* whether the distribution of the score is fairly stable. */
/* */
/* I used 10 buckets just because I like the word "decile"; */
/* often people use "demidecile" for 20 5% buckets. */
/* */
/* Finally, your cutoffs (.1, .25...) sound like what I usually hear. */
/* [Source: SAS Global Forum 2010 Posters] */
/* This statistic is basically (I think) a divergence type statistic, */
/* like the Information value. So any cutoff that seems reasonable for */
/* those types of stats is probably reasonable here as well. */
/* */
/* You can change the distribution of MODELVAR in one of the data sets */
/* and see what that does to the PSI in the last printout to get a feel */
/* for what kind of differences in the distribution make what kind of */
/* difference in the work. */
/*************************************************************************/
/* Per Jay Kosters' research, a score of <= 0.1 indicates little change, */
/* 0.1 - 0.25 is a small change, too small to be conclusive, and > 0.25 is */
/* a significant shift. */
/*************************************************************************/
/*************************************************************************/
/* These Macro variables must be changed to represent the PSI Variable */
/* (MODELVAR), PSI Output Library (PSILibrary) for storage of the ODS */
/* Output, Source Data representing the original data file name of the */
/* population being measured for stability (SourceData1), and the */
/* current population file name being used to identify possible */
/* divergence (SourceData2). */
/*************************************************************************/
/*insert the model variable (Interval ONLY) on this line*/
%Let MODELVAR=Receivables;
/*insert the PSI Output Data Library on this line*/
%Let PSILibrary=\\pbidelprd042\DM_Inputs\rpruitt\PSIResults;
/*insert the original population File Name on this line*/
%Let SourceData1=EMWS.Ids_DATA;
/*insert the current population File Name on this line*/
%Let SourceData2=EMWS.Ids4_DATA;
/**********************************************************************/
/* BEGIN Steps to get the data samples for the periods being compared */
LIBNAME PSI "&PSILibrary";
DATA PSI.PSISample1;
SET &SourceData1
(Keep=&MODELVAR)
;
Format &MODELVAR 12.2;
/******************************************************************/
/* This is where you can place more SAS statements to modify your */
/* PSI Variable so it accurately represents the format and value */
/* in your model. */
/******************************************************************/
RUN;
DATA PSI.PSISample2;
SET &SourceData2
(Keep=&MODELVAR)
;
Format &MODELVAR 12.2;
/******************************************************************/
/* This is where you can place more SAS statements to modify your */
/* PSI Variable so it accurately represents the format and value */
/* in your model. */
/******************************************************************/
RUN;
/* END Steps to get the data samples for the periods being compared */
/********************************************************************/
/**********************************/
/*BEGIN establish ODS Output File */
ODS Listing Close;
ODS HTML
Style=default
File="&PSILibrary\PSICode&MODELVAR..htm"
;
Title2 "PSI (Population Stability Index) Calculations for &MODELVAR";
/**************************/
/* BEGIN PSI Calculations */
/************************************/
/* BEGIN break Sample1 into bins */
/* BEGIN Sorting & Ranking process */
Proc Means Noprint Data=PSI.PSISample1 ;
Output
Out=PSI.RankedTotal (rename=(_freq_=RankedTotal))
;
run;
Data _Null_;
Set PSI.RankedTotal (Where=(_Type_=0));
Call Symput('RankedTotal',RankedTotal);
run;
Proc Means Noprint Data=PSI.PSISample2;
Output
Out=PSI.RankedTotal2 (rename=(_freq_=RankedTotal2))
;
run;
Data _Null_;
Set PSI.RankedTotal2 (Where=(_Type_=0));
Call Symput('RankedTotal2',RankedTotal2);
run;
Proc Sort
Data=PSI.PSISample1;
By &MODELVAR;
run;
Proc Sort
Data=PSI.PSISample2;
By &MODELVAR;
run;
/*********************************************************************/
/*BEGIN Use the Program Data Vector to override the binning of Zero's*/
Data PSI.PSISample1 (Keep=BinVar);
Set PSI.PSISample1;
BinVar=Sum(&MODELVAR,(_n_/&RankedTotal));
run;
Data PSI.PSISample2 (Keep=BinVar);
Set PSI.PSISample2;
BinVar=Sum(&MODELVAR,(_n_/&RankedTotal2));
run;
/*END Use the Program Data Vector to override the binning of Zero's*/
/*******************************************************************/
Proc Sort
Data=PSI.PSISample1;
By BinVar;
run;
Proc Sort
Data=PSI.PSISample2;
By BinVar;
run;
Proc Format;
Value DecileF
Low-0='00'
0-.1='01'
.1-.2='02'
.2-.3='03'
.3-.4='04'
.4-.5='05'
.5-.6='06'
.6-.7='07'
.7-.8='08'
.8-.9='09'
.9-1='10'
.='11'
;
Value DemiDecileF
Low-0='00'
0-.05='01'
.05-.1='02'
.1-.15='03'
.15-.2='04'
.2-.25='05'
.25-.3='06'
.3-.35='07'
.35-.4='08'
.4-.45='09'
.45-.5='10'
.5-.55='11'
.55-.6='12'
.6-.65='13'
.65-.7='14'
.7-.75='15'
.75-.8='16'
.8-.85='17'
.85-.9='18'
.9-.95='19'
.95-1='20'
.='21'
;
Value ZeroMiss
0='Zero'
11='Missing'
21='Missing'
;
run;

Data PSI.PSISample1;
Length decile 8.;
Set PSI.PSISample1;
Rank=_n_/&RankedTotal;
Decile=Put(Rank,DecileF.);
run;
/* END Sorting & Ranking process */
/* END break Sample1 into 10 bins */
/**********************************/
/*********************************************************************/
/* BEGIN you can see they are 10 equally sized bins with no ties in */
/* the output of this step. */
proc freq data=PSI.PSISample1;
tables decile / out=PSI.out1;
Title3 'Base-Line Sample Frequency By Decile Bin (Data=PSISample1)';
run;
/* END you can see they are 10 equally sized bins with no ties in */
/* the output of this step. */
/*********************************************************************/
/******************************************************/
/* BEGIN Calculate how the deciles are defined on the */
/* Supplied Variable (MODELVAR) scale */
/* so I want MAX(MODELVAR) in each decile */
proc means data=PSI.PSISample1 nway;
class decile;
var BinVar;
output out=PSI.endpoints max=maxVar;
Title3 'Base-Line Sample Mean, Max & Min Values (Data=PSISample1)';
run;
/* END Calculate how the deciles are defined on the */
/* Supplied Variable (MODELVAR) scale */
/* so I want MAX(MODELVAR) in each decile */
/******************************************************/
/*****************************************************************************/
/* BEGIN Data Step to write code that applies the above decile definition to */
/* the data set with MODELVAR on it */
data _NULL_;
set PSI.endpoints end=last;
file "&PSILibrary\decileSample1.sas";
if _N_ = 1 then put " select;";
put " when (BinVar le " maxVar ") decile = " decile ";" ;
if last then do ;
put " otherwise decile = " decile ";" ;
put "end;";
call symput('maxbin',decile);
end;
run;
data PSI.PSISample2;
set PSI.PSISample2;
%inc "&PSILibrary\decileSample1.sas" / source;
If BinVar=. Then decile=&maxbin;
run;
/* END Data Step to write code that applies the above decile definition to */
/* the data set with MODELVAR on it */
/*********************************************************************/
/*********************************************************************/
/* BEGIN Use the same definition for the buckets to establish how */
/* much data falls in each group for the sample 2 */
proc freq data=PSI.PSISample2;
tables decile / out=PSI.out2;
Title3 'Current Sample Frequency By Decile Bin (Data=PSISample2)';
run;
/* END Use the same definition for the buckets to establish how */
/* much data falls in each group for the sample 2 */
/*********************************************************************/
/************************************************************************************/
/* BEGIN put the % fields on the same file and calculate the terms that make up PSI */
data PSI.PSICompare;
merge PSI.out1 PSI.out2(rename=(percent=percent2));
by decile;
psi = log(percent/percent2)*(percent-percent2)/100;
run;
proc print data=PSI.PSICompare noobs;
var dec: per:;
Format decile ZeroMiss.;
sum psi;
Title3 "NOTE: PSI Calc Accommodates the Binning of Zero And Missing";
run;
/* END put the % fields on the same file and calculate the terms that make up PSI */
/**********************************************************************************/
/* END PSI Calculations */
/************************/
ODS _ALL_ Close;
ODS Listing;
/*END establish ODS Output File */
/********************************/

From Rex Pruitt, PREMIER Bankcard LLC, Sioux Falls, SD
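
The arithmetic behind the program can be checked on a toy case. The sketch below is plain Python, not a port of the SAS code. It takes bin cutoffs from the baseline sample, applies them to the current sample, and sums (p1 - p2) * ln(p1 / p2) over the bins, which is the same quantity as the `psi` line in the PSICompare step once percentages are written as proportions. The function names, the error raised for empty bins, and the demo scores are assumptions made for illustration; the zero and missing handling of the SAS program is not reproduced.

```python
import math

def decile_cutoffs(baseline, n_bins=10):
    """Upper bound of each equal-count bin in the sorted baseline sample."""
    ordered = sorted(baseline)
    n = len(ordered)
    return [ordered[(k * n) // n_bins - 1] for k in range(1, n_bins + 1)]

def bin_shares(values, cutoffs):
    """Share of values per bin; values above the last cutoff go to the last bin."""
    counts = [0] * len(cutoffs)
    last = len(cutoffs) - 1
    for v in values:
        for i, upper in enumerate(cutoffs):
            if v <= upper or i == last:
                counts[i] += 1
                break
    return [c / len(values) for c in counts]

def psi(baseline, current, n_bins=10):
    """Population Stability Index of `current` against `baseline`."""
    cutoffs = decile_cutoffs(baseline, n_bins)
    p1 = bin_shares(baseline, cutoffs)
    p2 = bin_shares(current, cutoffs)
    if min(p1) == 0 or min(p2) == 0:
        # Assumption: this sketch refuses empty bins instead of patching them.
        raise ValueError("every bin must be non-empty in both samples")
    return sum((a - b) * math.log(a / b) for a, b in zip(p1, p2))

# Hypothetical scores: the current sample is the baseline shifted up by 5.
baseline_scores = list(range(1, 101))
current_scores = [s + 5 for s in baseline_scores]

psi_same = psi(baseline_scores, baseline_scores)
psi_shifted = psi(baseline_scores, current_scores)
print(psi_same, psi_shifted)
```

In this toy case only the first and last deciles change share (5% and 15% instead of 10% each), so the index works out to 0.05 * ln(3), roughly 0.055, which falls under the 0.1 cutoff quoted in the program comments.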


SAS Tech Report Reader’s Survey Uncovers Interesting Questions

August 16, 2010
 
Although I’ve only been with SAS for a little more than two years, I’ve made many SAS friends and I’m beginning to feel a part of the SAS user family. But as the editor of the SAS Tech Report, I really wanted to validate some of my assumptions about reader preferences: I’d made those assumptions based upon reader metrics and conversations I’ve had while at conferences and in online arenas including Twitter and e-mail. I thought it was time to float a short survey.
August 13, 2010
 
Last year, Richard Foley and Zack Marshall offered up the idea of calling Twitter from SAS right here in this blog. It was a really popular post. When I saw "SAS and Twitter–how to harness SAS to grab data from Twitter in 2 easy steps" by John Munoz, I knew I had to share it with you.

My favorite line in the post is "But Twitter returns a paltry 100 results at a time. You’re a SAS user, you don’t work with 100 record data sets!"

People use SAS to do so many ordinary and extraordinary things. John's post is just one example of that. I'll keep an eye out for others. If you find them before I do, please share them with me and other SAS users.
August 12, 2010
 
The following entry is reposted from the sascom voices blog, and was contributed by David Hughes, SAS VP of Sales for Europe, Middle East, Africa and Asia Pacific. I hope you enjoy reading his words, as there are important lessons for marketers in the examples he uses to illustrate the importance of attitude and focus in driving success.


---------------


Value (n): a fair return or equivalent in goods, services or money for something exchanged; relative worth, utility or importance; a numerical quantity that is assigned or is determined by calculation; something (as a principle or quality) intrinsically valuable or desirable

Competition can be a complicated thing. There was a distinct sense of competition between Michelangelo and one of his contemporaries, Bramante, who was fiercely jealous of Michelangelo’s incredible talent. Once, when Pope Julius II commissioned Michelangelo to build him an extravagant tomb, Michelangelo took to it with his usual fervour and spent eight months in a marble quarry selecting and cutting the most perfect pieces of marble. While Michelangelo was away, Bramante had deviously influenced the Pope to cancel the project. Some years later, when the Pope commissioned a new project, Bramante turned down what he saw as an arduous task with little possibility of recognition. Instead, Bramante proposed that Pope Julius select Michelangelo for this time waster, to keep the master busy and away from the limelight.

Michelangelo was no fool ... he knew of Bramante’s deception and took on the project with vigour. He spent many years working tediously under physically exhausting conditions to complete the project. Bramante had miscalculated. The outcome was so breathtaking that it has become recognised as Michelangelo’s most iconic work. It was the Sistine Chapel.

I recently attended The Premier Business Leadership Series in Berlin, where I was pleased to spend time with a large French bank discussing an opportunity to address an issue that our high-performance risk solution can clearly address. Another vendor had convinced the customer that a solution which requires more and more hardware investment over time is the way to go… The customer was clear with us that the decision was all but made. Like Michelangelo, the SAS team decided to think differently from the competition. Since the bank is very focused at this time on being prudent with its funds, we showed how SAS High-Performance Risk could be more cost effective because of the low hardware cost and the ability to scale easily using the flexible blade architecture. The bank is now considering re-opening the issue.

I met with another bank CIO from Belgium who had a cost-of-hardware issue. When his IT people informed the risk department of the hardware costs (due to the many scenarios they wanted to analyze), the risk team decided to reduce the number of scenarios to save money. "So the cost of hardware is a good reason to put the bank at risk?" I asked. I alluded to the similarity to the Titanic debacle, where the number of lifeboats was reduced to accommodate more passengers. Again, this CIO saw the efficacy of our high-performance risk solution.

After The Series, we held our Olympic-themed AP Sales Forum in Indonesia, in which salespeople from across AP gathered to network and take part in a case study competition. It was a mock sales opportunity with country managers playing the part of customers. Interestingly, culture and language barriers melted away. Reps from China bonded with reps from India, while reps from Thailand and Taiwan joined forces on a single team. Aside from the practical lessons in value-based selling, one of the most meaningful aspects for participants was the ability to trade war stories with peers they might not meet under normal circumstances.

Speaking of war stories, here’s a story I shared recently with my staff that I hope you find inspiring:

Károly Takács was a Hungarian Army officer and, in 1938, the top pistol shooter in the world, roundly expected to win the gold medal in the 1940 Summer Olympics in Tokyo. Mere months before the Olympics, a hand grenade exploded in Takács’ right hand, taking it completely off at the wrist. Takács spent a month in the hospital and, after returning home, immediately taught himself to shoot with his left hand. You see, instead of focusing on what he lost, he shifted his focus to the two things he did have – mental toughness and a healthy left hand. He practiced alone and in secret for months. Then, in the spring of 1939 he attended the Hungarian National Pistol Shooting Championship. Each of the competitors approached him to offer sympathetic words and admiration for his courage in coming to watch them shoot. With resolve, Takács said, “I didn’t come to watch, I came to compete.”

And he won.

In 1940 and 1944, the Olympics were cancelled because of World War II. In 1948 he qualified for the Summer Olympics in London, where he set a new world record and won the gold medal. Incredibly, he won the gold again at the 1952 Summer Olympics in Helsinki.

The truth is, people with a winning spirit recover quickly. When set upon by hardship, true winners force themselves to look at the bright side. Recovering quickly ensures that you don’t lose your momentum. After all, when a boxer gets knocked down, he has ten seconds to get back up. One second more, and he loses the fight.

"The problem is not that there are problems. The problem is expecting otherwise, and thinking that having problems is a problem." - Theodore Rubin

--------

In the case of Michelangelo, his talent shone through even on an obscure project. In the case of Takács, he showed that his brain and eyes trumped the loss of his hand as a sharp-shooter. Both men triumphed. So, how do those examples apply to you? Would you like to share a story of adversity turned into opportunity? Please leave a comment or two. Thanks!

Internet Data Mining with SAS

August 11, 2010
 

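SAS Text Miner provides powerful capabilities for processing unstructured data. There are many text mining packages, but in my view the SAS solution is comparatively complete:

1. Multi-language support and text parsing. Chinese word segmentation is the troublesome part; SAS Text Miner itself relies on SAP Inxight technology to segment Chinese.
    Most important is the data-crawling capability, implemented through %TMFILTER.
2. Stop words, new terms, synonyms, and term and document frequencies.
3. Term weighting. Its core purpose is still dimensionality reduction.
4. SVD (singular value decomposition), which is a prerequisite for term clustering and document clustering.
5. Large-scale clustering algorithms. Given their time complexity, the existing PROC CLUSTER or PROC FASTCLUS will certainly not cope; a variant is needed.
6. Of course, the license for SAS Text Miner for Chinese is not cheap. On a tight budget, consider using ICTCLAS for Chinese word segmentation, which has proven entirely feasible in practice, and then combine it with other SAS procedures for deeper analysis.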
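
Items 2 and 3 of the list (stop words, term and document frequencies, and weighting) can be illustrated with a small sketch. It is plain Python and does not reproduce SAS Text Miner. TF-IDF is used here as a common stand-in weighting and is not claimed to be the scheme the product applies. The stop-word list, the whitespace tokenizer, the function names, and the three sample documents are all assumptions; Chinese text would first need a word segmenter such as the ICTCLAS option mentioned in item 6.

```python
import math
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}

def tokenize(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def document_frequency(docs_tokens):
    """Number of documents in which each term appears at least once."""
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return df

def tfidf(docs):
    """Per-document term weights: term frequency * ln(N / document frequency)."""
    docs_tokens = [tokenize(d) for d in docs]
    df = document_frequency(docs_tokens)
    n_docs = len(docs)
    weighted = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

# Hypothetical sample documents.
docs = [
    "SAS text mining",
    "text mining of web data",
    "SAS and the web",
]
weights = tfidf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", w)
```

A term found in only one document ("data") receives a higher weight than terms shared by two documents, and stop words never enter the vocabulary. Dropping low-weight terms before the SVD step is one way weighting supports the dimensionality reduction mentioned in item 3.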


August 11, 2010
 
Editor’s Note: Meet Rick Cornell, SAS Training Course Development Manager. In a multi-part series, Rick will tell us about how a SAS training course is born. In this first installment, learn where a course comes from, what new courses are in the works and Rick’s desert island list of albums.

1) Describe your job at SAS.

I’m the manager of the group that supports course developers and maintains the course development process. The group consists of project managers, editors, and production specialists. I’m an editor by trade (after an earlier life as a middle school teacher), and I still spend a great deal of my time editing course materials.

2) Where do the ideas come from for creating a new SAS training course? An instructor? A customer? From software development or product management; to support a new software product?

Let’s see: yes, yes, and yes. And several more yeses. Ideas come from all over: product managers in R&D or Sales & Marketing; student comments during classes or customer requests at user group events; instructors in Education; country managers in Europe and beyond; industry experts outside of SAS for the Business Knowledge Series program; and elsewhere. If my friend Tom Baker in the RFC (SAS’ Recreation and Fitness Center for employees) has an idea for a course, we’d love to hear it.

3) Is there one particular course development project that stands out in your mind?

That’s a tough one. It feels like all courses have their own quirks and personalities – and you love some courses because of those things, and some, well, you don’t love so much because of those things. The foundation course revision efforts for SAS 9 and 9.2 both stand out just because there were so many people involved and they were such huge undertakings. The larger the project and the more people who are involved, the more crucial the role of a support team is. I’d have a much better answer if you asked me my all-time top ten favorite albums.


Continue reading "How a SAS Training Course is Made - Part 1"