An Economic Approach for a Class of Dimensionality Reduction Techniques

Jul 30, 2010
 
Just back from KDD 2010. Several papers at the conference interested me.

On the computational side, Liang Sun et al.'s paper [1], "A Scalable Two-Stage Approach for a Class of Dimensionality Reduction Techniques," caught my eye. Sun proves that a class of dimensionality reduction techniques that rely on a generalized eigenvalue decomposition, such as CCA, OPLS, and LDA, can be computed much more cheaply by splitting the original computation into a least squares problem followed by a much smaller eigenvalue decomposition. The equivalence of the two-stage approach and the direct eigenvalue decomposition is rigorously proved.
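
For the canonical discriminant (LDA) case demonstrated below, the two stages can be sketched as follows (my paraphrase, not the paper's notation). With feature matrix $X \in \mathbb{R}^{n \times p}$ and class-indicator matrix $Y \in \{0,1\}^{n \times k}$ for $k$ classes, stage one solves the least squares problem

$$\hat{B} = \arg\min_{B} \lVert Y - XB \rVert_F^2, \qquad \hat{Y} = X\hat{B},$$

and stage two applies the discriminant eigendecomposition to the $k$ columns of $\hat{Y}$ instead of solving the $p$-dimensional generalized eigenproblem $S_b w = \lambda S_w w$ directly. Since the number of classes $k$ is tiny compared with $p$ in the regime of interest, the expensive eigen step shrinks to roughly $k \times k$ scale.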

This technique is of particular interest to people like me who have only limited computing resources, and I believe it is worthwhile to implement their algorithm in SAS. A canonical discriminant analysis using the above idea is demonstrated below. Note also that by specifying the RIDGE= option in PROC REG, the regularized version can be implemented as well (a sketch follows the main example below); in addition, PROC REG is multi-threaded in SAS. Of course, the computational advantage is only appreciable when the number of features is very large.

The canonical analysis result from the reduced version of PROC CANDISC is the same as that from the full version.

In fact, this exercise is the answer to Exercise 4.3 of The Elements of Statistical Learning [2].

[1] Liang Sun, Betul Ceran, and Jieping Ye, "A Scalable Two-Stage Approach for a Class of Dimensionality Reduction Techniques," KDD 2010, Washington, DC.

[2] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer Series in Statistics.





   proc format; 
      value specname 
         1='Setosa    ' 
         2='Versicolor' 
         3='Virginica '; 
   run; 
 
   data iris; 
      title 'Fisher (1936) Iris Data'; 
      input SepalLength SepalWidth PetalLength PetalWidth 
            Species @@; 
      format Species specname.; 
      label SepalLength='Sepal Length in mm.' 
            SepalWidth ='Sepal Width in mm.' 
            PetalLength='Petal Length in mm.' 
            PetalWidth ='Petal Width in mm.'; 
      symbol = put(Species, specname10.); 
      datalines; 
   50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3 
   63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2 
   59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2 
   65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3 
   68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3 
   77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3 
   49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2 
   64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3 
   55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1 
   49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1 
   67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1 
   77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2 
   50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1 
   61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1 
   61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1 
   51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1 
   51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1 
   46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1 
   50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3 
   57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1 
   71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3 
   49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1 
   49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1 
   66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1 
   44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2 
   47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2 
   74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1 
   56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3 
   49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1 
   56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2 
   51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3 
   54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3 
   61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3 
   68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1 
   45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1 
   55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1 
   51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2 
   63 33 60 25 3 53 37 15 02 1 
   ; 
   proc candisc data=iris out=outcan distance anova; 
      class Species; 
      var SepalLength SepalWidth PetalLength PetalWidth; 
   run;
 
  ods select none;
  /* Stage 0: build the class-indicator (dummy) columns Col1-Col3 for Species;
     the response in the MODEL statement is only a placeholder */
  proc glmmod data=iris outdesign=H(keep=COL:);
     class Species;
     model SepalLength = Species / noint;
  run;

  /* attach the indicator columns back to the original variables */
  data H;
     merge H iris;
  run;

/**************************
For efficiency, a DATA step view can be used instead:
data H/view=H;
     set iris;
     array _S{*} Col1-Col3 (3*0);     
     do j=1 to dim(_S); _S[j]=0; end;
     _S[Species]=1;
     drop j;
run;
****************************/
  /* Stage 1: least squares regression of the class indicators
     on the original features; OUTEST=beta holds the coefficients */
  proc reg data=H outest=beta;
     model Col1-Col3 = SepalLength SepalWidth PetalLength PetalWidth;
     output out=P p=yhat1-yhat3;
  run;quit;
  ods select all;

  /* Stage 2: canonical discriminant analysis on the three fitted columns */
  proc candisc data=P;
     class Species;
     var yhat1-yhat3;
  run;
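
For the regularized variant mentioned above, one possible wiring is sketched here. This is an assumption-laden sketch, not the paper's code: the ridge constant 0.1 is arbitrary, the model labels r1-r3 and the data set names rbeta, rbeta2, and Pr are mine, and the PROC SCORE step follows the SAS/STAT documentation pattern for scoring OUTEST= regression coefficients.

  /* Hedged sketch of a ridge-regularized stage 1; tune the ridge constant */
  proc reg data=H outest=rbeta ridge=0.1 noprint;
     /* one labeled model per indicator column, so the scored
        variables come out named r1-r3 */
     r1: model Col1 = SepalLength SepalWidth PetalLength PetalWidth;
     r2: model Col2 = SepalLength SepalWidth PetalLength PetalWidth;
     r3: model Col3 = SepalLength SepalWidth PetalLength PetalWidth;
  run;quit;

  /* keep only the ridge rows of OUTEST= and relabel them so that
     PROC SCORE treats them as parameter estimates */
  data rbeta2;
     set rbeta;
     if _TYPE_ = 'RIDGE';
     _TYPE_ = 'PARMS';
  run;

  /* build the regularized fitted columns, then run stage 2 on them */
  proc score data=H score=rbeta2 type=parms out=Pr;
     var SepalLength SepalWidth PetalLength PetalWidth;
  run;

  proc candisc data=Pr;
     class Species;
     var r1-r3;
  run;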

Jul 29, 2010
 
During the week of August 1 - 7, 2010, SAS Customer Intelligence will be making a big splash in the New York area and there are several ways you can catch up with us in person:

CRM Evolution Conference & Exhibition
This event begins on Monday, August 2 and ends on Wednesday, August 4. Visit us at any time during the show in booth 117, where we will showcase our Customer Intelligence solutions. SAS is proud to be a platinum sponsor of this major annual event, and we've invited two of our customers to share their experiences with driving marketing success using customer analytics:

1-800FLOWERS.com's Nachiket Desai will be in Room A103 at 2:15 p.m. on Monday, Aug. 2.
Wyndham Worldwide's Sean Lowe will be in Room C202 at 11:45 a.m. on Tuesday, Aug. 3.

Registration is being handled by the show organizer and SAS customers can get a 25% discount off admission by using the "VIPSAS" code. REGISTER HERE

The SAS New York Tweetup & Web Analytics Monday Reception
This hospitality reception will be held in the Basement Lounge of the Houndstooth Pub on Monday, August 2 from 6:30 p.m. until 8:30 p.m. This will be a great opportunity to mingle with marketers attending the conference, as well as New York members of the Web Analytics Association, with whom we're working to promote this event. We'll have a drawing for a Flip video recorder, so come join us! REGISTER HERE.

CRM Evolution Executive Breakfast
This exclusive event is happening on Tuesday, August 3 from 7:45 a.m. until 8:45 a.m. at the Marriott Marquis Hotel in Times Square. As the exclusive host, SAS is bringing together executive attendees of CRM Evolution and SAS executive-level customers in New York City for a closed-door session with Forrester analyst Dave Frankland and CRM Media's editorial director David Myron. REGISTER HERE

The Strategic Customer Intelligence Dinner
We're coming across the river on Thursday, August 5 to host an executive dinner at Morton's Steakhouse in Hackensack, NJ, in "The Boardroom," their private dining room. The evening will begin at 6:00 p.m. with refreshments served during a networking reception, followed by remarks from Forrester analyst Dave Frankland. Dinner will follow, featuring Morton's world-famous steak, with chicken, salmon, or a vegetarian option. After dinner, Jeff Hoffman of Chubb & Sons insurance company will highlight his company's experience in driving success with customer analytics. John Bastone of SAS will provide closing comments to wrap up the evening at 9:00 p.m. REGISTER HERE.

This event is also the big unveiling of the new creative theme we plan to use this year centered around apples. Our creative team peeled away at it and got to the core of the true essence of SAS Customer Intelligence. Come by booth 117 to check it out - also, we'll be handing out Green Apple Jolly Rancher candies.

If you can't be with us in the Big Apple, be with us virtually by following our official Twitter account: @SAS_CI, or follow the CRM Evolution list on Twitter: @CRMevolution/crme2010, or the CRM Evolution Group on LinkedIn.
Jul 29, 2010
 
As a marketer I spend all day, every day thinking about expressing how my company's products meet market/customer needs. I work hard on the prose, simplify the diagrams and especially focus on the underlying issues.

That said, I have a sneaking suspicion that I could be doing better, and here's why: is my positioning really doing that good a job of reflecting what customers have told us they need?

The challenge comes from every customer expressing their specific needs in their own specific language, and from our need to translate that into a description of a requirement we can build or provide the capability for, in a standardised, scalable fashion.

Think for a while about buying a computer - relatively few people will express what they need as a list of components or technical features; they tell us what they want to use the computer for, where/how it will be used and expect us to work out the details and propose the best solution that meets their needs. In addition to which, they are bombarded by conflicting advice / recommendations from a host of different sources - sound familiar?

Thankfully, customer analytics can help us understand that myriad of requirements, find the patterns in otherwise unique conversations, and address those requirements in a way that plays to our strengths. It can even help us understand how good a job we are doing at satisfying our customers.

The formula is simple enough: the customer will more fully understand the value when they see that your proposal has accurately addressed and reflected their expressed needs, in language they find comfortable. To paraphrase Clay Shirky at a recent conference: let's not spend our time trying to educate the customer; I would rather educate myself.

Simulating Random Data in SAS to Compute Pi
Jul 28, 2010
 

I just came across a good book: Simulation by Sheldon M. Ross (4th ed., Elsevier, 2006), published in Chinese as 《统计模拟》 ("Statistical Simulation"). As the name suggests, it describes how to simulate data that follow statistical theory, and its uses are broad: in practice, the distribution of any real data can be viewed as following some statistical model, so even before real data are available we can study real-world problems through simulated data. Studying unknown problems through simulation is working at a rather high level; it certainly beats grabbing some real data, drawing a pretty chart, and concluding that you can chart your way through anything.

I was not trained as a statistician, but my statistics teachers patiently taught me to start with the simple things. So I googled a bit and found the Chinese publisher's description of the book on Dangdang:

The book systematically presents practical methods and techniques of statistical simulation. After a brief review of basic probability, it shows how to generate random numbers on a computer and how to use them to generate random variables from arbitrary distributions, as well as random processes. It then discusses methods and techniques for analyzing statistical data, such as the bootstrap and variance reduction, and explains how to use simulation to judge whether a chosen random model fits actual data. Finally, it introduces MCMC and some recently developed simulation topics, such as the evaluation of functions of random sequences and random subsets. Each chapter ends with exercises of varying difficulty. The book can serve as a textbook for mathematics, statistics, scientific computing, insurance, and actuarial science programs, and as a reference for engineers and applied practitioners. Continue reading »
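
Since the title of this post promises an estimate of pi from simulated random data, here is a minimal sketch of the classic Monte Carlo approach in SAS. This is my illustration, not the code behind the "Continue reading" link: draw uniform points in the unit square and count the fraction that lands inside the quarter circle.

data _null_;
   n = 1e6;                          /* number of simulated points */
   call streaminit(2010);            /* reproducible random stream */
   hits = 0;
   do i = 1 to n;
      x = rand('uniform');
      y = rand('uniform');
      hits + (x**2 + y**2 <= 1);     /* point inside the quarter circle? */
   end;
   pi_hat = 4 * hits / n;            /* 4 times the area ratio estimates pi */
   put 'Estimated pi = ' pi_hat 8.5;
run;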

Jul 27, 2010
 
I wanted to blog about our new Extended Learning Pages, but Larry LaRusso beat me to it in the latest issue of SAS Training Report. Larry is a fantastic story-teller, so instead of trying to come up with something original, I decided to share his story with you:

For most school-age children, summer is a carefree time filled with lazy days of fun in the sun. As I was growing up, though, summers were largely a continuation of the school year, not because of summer remediation but because of a driven older sister who knew she was going to be a teacher from the moment she started talking.

For me and my twin brother (and any of our friends Roseann could trick into joining us), hours of baseball and pool time were replaced by workbooks and blackboards. And, Roseann took her job seriously; her "school" came complete with homework, picture day and regularly scheduled parent-teacher conferences. In fact, I still have a report card from those days. Though the grades were good, Roseann pulled no punches, telling our parents: "Larry should spend less time socializing and more time on task if he hopes to reach his full potential." Nice, huh?

Anyway, though I lament the lost play time, I realize it was this extra learning, learning after the "official" learning had concluded, that probably steered my academic career in the right direction.

This past month SAS Education launched its own version of Roseann's Summer School with the release of Extended Learning Pages. Extended Learning Pages provide our students access to all the course content they encountered in class, combined with additional learning tools to create a richer learning experience that extends well beyond the last day of class.

Here's to sisters and aspiring teachers.


While I can’t give away free access to an Extended Learning Page (because it’s for SAS training students only), I will share with you the types of material you can expect to receive on an Extended Learning Page. Each page is specifically tailored to the course you have taken, but in the example below, I’ve provided links to content that (a) I thought all blog readers would find useful and (b) is free to access.

Extended Learning — Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Thank you for taking the Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression course. You are invited to extend your learning experience by using the resources listed below.
Continue reading "Learning Doesn't End When the Class Does"
Jul 27, 2010
 
Our newest book, Using SAS for Data Management, Statistical Analysis and Graphics, will soon be shipping from Amazon, CRC Press, and other fine retailers.



The book complements our SAS and R book, particularly for users less interested in R. It presents an easy way to learn how to perform analytical tasks in SAS, without having to navigate through the extensive, idiosyncratic, and sometimes unwieldy software documentation, and demonstrates useful applications, shortcuts, and tricks. Organized by short, clear descriptive entries, the book covers many common tasks, such as data management, descriptive summaries, inferential procedures, regression analysis, multivariate methods, and the creation of graphics.

Through the extensive indexing, cross-referencing, and worked examples in this text, users can directly find and implement the material they need. The text includes convenient indices organized by topic and SAS syntax, and presents example analyses that employ a single data set from the HELP study to demonstrate the SAS code in action and facilitate exploration. We also provide several case studies of more complex applications. Data sets and code are available for download on the book’s website. Many features of SAS version 9.2 (including new procedures and ODS support) are highlighted.

The book tries to lucidly summarize the aspects of SAS most often used by statistical analysts. We believe that new users of SAS will find the simple approach easy to understand, while more sophisticated users will find it an invaluable source of task-oriented information.

Note as of August 6, 2010: the book is now shipping from Amazon, albeit with no discount.
Jul 24, 2010
 
Link: http://support.sas.com/resources/papers/proceedings10/093-2010.pdf

Few people know that SAS/ETS includes a procedure called PROC EXPAND. It is mainly intended for working with time series data, for example defining your own time intervals and then building frequency summaries over them. David L. Cassell showed how the procedure can also perform a number of everyday data-management tasks, and published a tutorial paper at SAS Global Forum 2010. Let's look at how PROC EXPAND can replace some otherwise complicated data manipulation.


TRICK 1: LAGS AND LEADS
In the DATA step, the LAG() and LAGn() functions return a variable's value from 1 to n records back, but to look 1 to n records ahead there are several workarounds and no single convenient call. PROC EXPAND produces both kinds of series with ease. Example code:
proc expand data=YourData method=none;
by pt;
convert dosedt = lead1_dt / transformout = (lead 1);
convert dosedt = lead2_dt / transformout = (lead 2);
convert dosedt = lag1_dt / transformout = (lag 1);
convert dosedt = lag2_dt / transformout = (lag 2);
run;

Here the variable we intend to transform is dosedt. Inside PROC EXPAND, each CONVERT statement names the new variable (lead1_dt, lead2_dt, lag1_dt, lag2_dt), and the TRANSFORMOUT= option after the slash, with LEAD n or LAG n inside the parentheses, creates the variable shifted n records ahead or back. For comparison, a DATA step sketch of the lead workaround follows.
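
The standard DATA step workaround for a lead is to merge the data set with itself offset by one observation. A minimal sketch, reusing the names from the example above and assuming dosedt is numeric: the lead is blanked on the final record and at the boundaries of the pt groups.

data lead1;
   merge YourData end=last
         YourData(firstobs=2 keep=pt dosedt
                  rename=(pt=_pt dosedt=lead1_dt));
   /* the self-merge starting at observation 2 pairs each record
      with the next one; invalidate the lead where it crosses
      the end of the data or a pt boundary */
   if last or pt ne _pt then lead1_dt = .;
   drop _pt;
run;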
TRICK 2: MAX OR MIN OF THE LAST K RECORDS
To find the maximum or minimum of a variable over the first or last n records, the usual approach is to run PROC MEANS or PROC SUMMARY with a WHERE statement restricting the records the procedure reads. PROC EXPAND can point directly at a window of n records, as shown below:
proc expand data=YourData out=YourMax method=none;
by factory;
convert x = max_x / transformin=(movmax 50);
convert x = min_x / transformin=(movmin 50);
run;

Again the CONVERT statement does the work, here with the TRANSFORMIN= option and "movmax n" or "movmin n" telling SAS to compute the maximum or minimum over a window of n records. SAS still reads from the first record onward, updating the running maximum and minimum with each record; once more than the n records specified by MOVMAX or MOVMIN have been read, each value is the max or min of the most recent n records. All of the results appear in the new data set, and the max_x and min_x values on the last record give the answer for the final window.
TRICK 3: AGGREGATING CHUNKS OF RECORDS
I have often seen people ask online how to compute an average for every few records, or how to pick one value out of every few records. Many answers chain several PROC steps and DATA steps to do these two things; in PROC EXPAND, two statements suffice.

Suppose the data are as follows:
data temp1;
input obs vol price;
datalines;
1 2 11
2 2 11
3 2 12
4 3 13
5 3 11
6 3 12
7 4 14
8 4 12
9 4 12
10 5 11
11 5 16
12 5 14
13 6 10
;
run;

Thirteen records in all, with two variables (vol and price). If we want to total vol over every three records, and to pick out the price value from every third record, the program is:
proc expand data=temp1 out=exp1 factor=(1:3);
convert vol=aggrvol / observed=total;
convert price=new_price / observed=end;
run;


The result:
TIME AGGRVOL NEW_PRICE
0 6 12
3 9 12
6 12 12
9 15 14


The key to computing once per every three records is the FACTOR=(1:3) option on the PROC EXPAND statement; in this example it asks SAS to perform one calculation for every three input records. The calculation itself is defined by the OBSERVED= option on the CONVERT statement: with OBSERVED=TOTAL, each group of three records is summed, and with OBSERVED=END, the last value in each group of three is kept. OBSERVED= accepts several other settings; for example, OBSERVED=AVERAGE computes the group mean. See the SAS online documentation for the full list.

TRICK 4: THE MOVING AVERAGE
The last trick is the moving average, the calculation PROC EXPAND is most often used for with time series data. Suppose we want the average of each record together with the preceding four records; the program is:
proc expand data=temp1 out=YourOut method=none;
convert price=mean_price / transformout=(movave 5);
run;

As described above, simply adding the TRANSFORMOUT= option to the CONVERT statement, with "movave 5" inside the parentheses, computes the moving average painlessly.

CONTACT INFORMATION
David L. Cassell
Design Pathways
3115 NW Norwood Pl.
Corvallis, OR 97330
DavidLCassell@msn.com
541-754-1304
Jul 24, 2010
 
Link: http://support.sas.com/resources/papers/proceedings10/048-2010.pdf

Before analyzing data, some people habitually delete records that contain missing values. Strictly speaking this does not change the analysis results, because most SAS procedures handle missing data by complete case analysis (CCA) anyway; but if you clean out missing data by hand, a data set with a huge number of variables means a lot of time spent typing variable names. Selvaratnam Sridharma of the U.S. Census Bureau presented a macro at SAS Global Forum 2010 that reduces this programming task to a few seconds.


The macro is invoked like this:
%DROPMISS( DSNIN, DSNOUT, NODROP= );
It takes two positional parameters plus one option:

  • DSNIN: name of the input data set
  • DSNOUT: name of the output data set
  • NODROP: variables to exclude from the missing-data processing (optional)

DSNIN and DSNOUT should need no further explanation. If NODROP names no particular variables, the macro processes every variable in the data set given by DSNIN. To keep the macro away from all numeric variables or all character variables, restrict it with _NUMERIC_ or _CHARACTER_, as follows:
%DROPMISS( DSNIN=olddata, DSNOUT=newdata, NODROP=_NUMERIC_ );
%DROPMISS( DSNIN=olddata, DSNOUT=newdata, NODROP=_CHARACTER_ );
The source code of the macro follows:
/* suppress macro debugging output while using the macro */
options nomprint nosymbolgen nomlogic;

%macro DROPMISS( DSNIN    /* name of input SAS dataset                      */
               , DSNOUT   /* name of output SAS dataset                     */
               , NODROP=  /* [optional] variables to be omitted from
                             dropping even if they have only missing values */
               ) ;
   /* PURPOSE: find the character and numeric variables that have only
    *          missing values and drop them if they are not in &NODROP
    *
    * NOTE: if there are no variables in the dataset, produce no
    *       variable-processing code
    *
    * EXAMPLES OF USE:
    *   %DROPMISS( DSNIN, DSNOUT )
    *   %DROPMISS( DSNIN, DSNOUT, NODROP=A B C D--H X1-X100 )
    *   %DROPMISS( DSNIN, DSNOUT, NODROP=_numeric_ )
    *   %DROPMISS( DSNIN, DSNOUT, NODROP=_character_ )
    */
   %local I ;

   %if "&DSNIN" = "&DSNOUT" %then %do ;
      %put /------------------------------------------------\ ;
      %put | ERROR from DROPMISS:                           | ;
      %put | Input Dataset has same name as Output Dataset. | ;
      %put | Execution terminating forthwith.               | ;
      %put \------------------------------------------------/ ;
      %goto L9999 ;
   %end ;

   /*=================================================================*/
   /* Create dataset of variable names; exclude all names in &NODROP. */
   /* The DROP= option is generated only when NODROP was given.       */
   /*=================================================================*/
   proc contents data=&DSNIN
                 %if %length(&NODROP) %then %do ; ( drop=&NODROP ) %end ;
                 memtype=data noprint out=_cntnts_( keep=name type ) ;
   run ;

   %let N_CHAR = 0 ;
   %let N_NUM  = 0 ;

   data _null_ ;
      set _cntnts_ end=lastobs nobs=nobs ;
      if nobs = 0 then stop ;
      n_char + ( type = 2 ) ;
      n_num  + ( type = 1 ) ;
      /* create macro vars containing final # of char, numeric variables */
      if lastobs then do ;
         call symput( 'N_CHAR', left( put( n_char, 5. ))) ;
         call symput( 'N_NUM' , left( put( n_num , 5. ))) ;
      end ;
   run ;

   /*=================================================================*/
   /* if there are no variables in the dataset, stop processing       */
   /*=================================================================*/
   %if %eval( &N_NUM + &N_CHAR ) = 0 %then %do ;
      %put /----------------------------------\ ;
      %put | ERROR from DROPMISS:             | ;
      %put | No variables in dataset.         | ;
      %put | Execution terminating forthwith. | ;
      %put \----------------------------------/ ;
      %goto L9999 ;
   %end ;

   /*=================================================================*/
   /* declare the name-holding macro variables global; NUM0 and CHAR0 */
   /* serve as harmless placeholders when one type is absent          */
   /*=================================================================*/
   %let NUM0  = 0 ;
   %let CHAR0 = 0 ;
   %if &N_NUM > 0 %then %do I = 1 %to &N_NUM ;
      %global NUM&I ;
   %end ;
   %if &N_CHAR > 0 %then %do I = 1 %to &N_CHAR ;
      %global CHAR&I ;
   %end ;

   /*=================================================================*/
   /* create macro vars containing the variable names                 */
   /* (names must be declared global before they are stuffed)         */
   /*=================================================================*/
   proc sql noprint ;
      %if &N_CHAR > 0 %then
         %str( select name into :CHAR1 - :CHAR&N_CHAR from _cntnts_ where type = 2 ; ) ;
      %if &N_NUM > 0 %then
         %str( select name into :NUM1 - :NUM&N_NUM from _cntnts_ where type = 1 ; ) ;
   quit ;

   /*=================================================================*/
   /* find the maximum of each variable; a variable whose maximum is  */
   /* missing (numeric) or blank (character) is entirely missing      */
   /*=================================================================*/
   %if &N_CHAR > 1 %then %let N_CHAR_1 = %eval( &N_CHAR - 1 ) ;
   proc sql ;
      select %do I = 1 %to &N_NUM ; max( &&NUM&I ) , %end ;
             %if &N_CHAR > 1 %then %do I = 1 %to &N_CHAR_1 ; max( &&CHAR&I ) , %end ;
             max( &&CHAR&N_CHAR )
      into   %do I = 1 %to &N_NUM ; :NUMMAX&I , %end ;
             %if &N_CHAR > 1 %then %do I = 1 %to &N_CHAR_1 ; :CHARMAX&I , %end ;
             :CHARMAX&N_CHAR
      from &DSNIN ;
   quit ;

   /*=================================================================*/
   /* build the DROP_NUM and DROP_CHAR lists                          */
   /*=================================================================*/
   %let DROP_NUM  = ;
   %let DROP_CHAR = ;
   %if &N_NUM > 0 %then %do I = 1 %to &N_NUM ;
      %if &&NUMMAX&I = . %then
         %let DROP_NUM = &DROP_NUM %qtrim( &&NUM&I ) ;
   %end ;
   %if &N_CHAR > 0 %then %do I = 1 %to &N_CHAR ;
      %if "%qtrim( &&CHARMAX&I )" eq "" %then
         %let DROP_CHAR = &DROP_CHAR %qtrim( &&CHAR&I ) ;
   %end ;

   /*=================================================================*/
   /* Create output dataset                                           */
   /*=================================================================*/
   data &DSNOUT ;
      /* drop the variables that have only missing values */
      %if %length(&DROP_CHAR) %then %str( drop &DROP_CHAR ; ) ;
      %if %length(&DROP_NUM)  %then %str( drop &DROP_NUM ; ) ;
      set &DSNIN ;
   run ;

   %if %length(&DROP_CHAR) or %length(&DROP_NUM) %then %do ;
      %put /----------------------------------\ ;
      %put | Variables dropped: &DROP_CHAR &DROP_NUM ;
      %put \----------------------------------/ ;
   %end ;
   %else %do ;
      %put /----------------------------------\ ;
      %put | No variables were dropped.       | ;
      %put \----------------------------------/ ;
   %end ;

%L9999:
%mend DROPMISS ;
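
A quick way to exercise the macro (my toy data, assuming the macro above has been compiled): variables b and d below contain only missing values and should be dropped, while a and c survive.

data olddata;
   do a = 1 to 5;
      b = .;             /* numeric, all missing   -> should be dropped */
      c = 2 * a;         /* numeric, never missing -> should survive    */
      length d $ 8;
      d = ' ';           /* character, all missing -> should be dropped */
      output;
   end;
run;

%DROPMISS( olddata, newdata )

proc contents data=newdata short; run;   /* expect only a and c */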

CONTACT INFORMATION

Selvaratnam Sridharma
Economic Planning and Coordination Division
U.S. Bureau of the Census
Washington, DC 20233-6100
301-763-6774
Email: selvaratnam.sridharma@census.gov