Just back from KDD 2010. Several papers at the conference interested me.

On the computation side, the paper by Liang Sun et al. [1], "A Scalable Two-Stage Approach for a Class of Dimensionality Reduction Techniques", caught my eye. The authors prove that a class of dimension reduction techniques that rely on generalized eigenvalue decomposition, such as CCA, OPLS and LDA, can be computed much more cheaply by splitting the original computation into a least squares problem and a much smaller eigenvalue decomposition problem. The equivalence of their two-stage approach and direct eigenvalue decomposition is rigorously proved.

This technique is of particular interest to people like me who have only limited computing resources, and I believe it would be worthwhile to implement their algorithm in SAS. As an example, a Canonical Discriminant Analysis built on this idea is demonstrated below. Note also that by specifying the RIDGE= option in PROC REG, the regularized version can be implemented as well; in addition, PROC REG is multi-threaded in SAS. Of course, the computational advantage is appreciable only when the number of features is very large.

The canonical analysis result from the reduced version of PROC CANDISC is the same as that from the full version.

In fact, this exercise is the answer to Exercise 4.3 of The Elements of Statistical Learning [2].
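For reference, the construction in that exercise can be written down as follows. The notation is the usual textbook one and is an assumption here, not something taken from the post: $X$ is the $N \times p$ matrix of predictors, $Y$ is the $N \times K$ class indicator matrix, and $\hat{B}$ holds the least squares coefficients.

```latex
\hat{Y} = X (X^{\top} X)^{-1} X^{\top} Y = X \hat{B},
\qquad
\hat{y} = \hat{B}^{\top} x \in \mathbb{R}^{K}
```

The exercise asks the reader to show that LDA using $\hat{Y}$ is identical to LDA in the original space. In the SAS code, the regression step is done by PROC REG, which by default also fits an intercept.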

[1]. Liang Sun, Betul Ceran, Jieping Ye, "A Scalable Two-Stage Approach for a Class of Dimensionality Reduction Techniques", KDD2010, Washington DC.

[2]. Trevor Hastie, Robert Tibshirani, Jerome Friedman, "The Elements of Statistical Learning", 2nd Edition.

``````

proc format;
value specname
1='Setosa    '
2='Versicolor'
3='Virginica ';
run;

data iris;
title 'Fisher (1936) Iris Data';
input SepalLength SepalWidth PetalLength PetalWidth
Species @@;
format Species specname.;
label SepalLength='Sepal Length in mm.'
SepalWidth ='Sepal Width in mm.'
PetalLength='Petal Length in mm.'
PetalWidth ='Petal Width in mm.';
symbol = put(Species, specname10.);
datalines;
50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
63 33 60 25 3 53 37 15 02 1
;
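/* Full version: canonical discriminant analysis on the four original features */
proc candisc data=iris out=outcan distance anova;
class Species;
var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Stage 1a: build the class indicator columns Col1-Col3, one per species */
ods select none;
proc glmmod data=iris  outdesign=H(keep=COL:);
class  Species;
model SepalLength=Species/noint;
run;

/* Put the indicator columns next to the original data (one-to-one merge, no BY) */
data H;
merge H   iris;
run;

/**************************
for efficiency consideration, a view can also be used:
data H/view=H;
set iris;
array _S{*} Col1-Col3 (3*0);
do j=1 to dim(_S); _S[j]=0; end;
_S[Species]=1;
drop j;
run;
****************************/

/* Stage 1b: least squares regression of the indicators on the features;
   the fitted values yhat1-yhat3 are saved in data set P */
proc reg data=H  outest=beta;
model Col1-Col3 = SepalLength SepalWidth PetalLength PetalWidth;
output   out=P  p=yhat1-yhat3;
run;quit;
ods select all;

/* Stage 2: canonical discriminant analysis on the fitted values (reduced version) */
proc candisc  data=P;
class Species;
var   yhat1-yhat3;
run;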

``````
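One way to look at the two sets of canonical correlations side by side is to capture them with ODS OUTPUT and print them. This is only a sketch: it assumes the data sets `iris` and `P` created above already exist, that the ODS table name for the canonical correlations in PROC CANDISC is `CanCorr`, and the output data set names `cc_full` and `cc_reduced` are made up for illustration.

```sas
/* Capture the canonical-correlation table from the full analysis.
   CanCorr is assumed to be the ODS table name; check with ODS TRACE ON if unsure. */
ods output CanCorr=cc_full;
proc candisc data=iris;
  class Species;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Same table from the reduced analysis on the fitted values */
ods output CanCorr=cc_reduced;
proc candisc data=P;
  class Species;
  var yhat1-yhat3;
run;

/* Print both tables for a visual comparison */
proc print data=cc_full;
  title 'Canonical correlations: full version';
run;

proc print data=cc_reduced;
  title 'Canonical correlations: reduced version';
run;
title;
```

Test statistics that depend on the number of input variables may be reported differently in the two tables, so the canonical correlations themselves are the columns to compare.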

As a marketer I spend all day, every day thinking about expressing how my company's products meet market/customer needs. I work hard on the prose, simplify the diagrams and especially focus on the underlying issues.

That said, I have a sneaking suspicion that I could be doing better, and here's why: is my positioning really doing a good job of reflecting what customers are telling us they need?

The challenge is that every customer expresses their specific needs in their own language, and we have to translate that into a description of a requirement that we can build or provide the capability for, and do so in a standardised, scalable fashion.

Think for a moment about buying a computer - relatively few people will express what they need as a list of components or technical features; they tell us what they want to use the computer for and where and how it will be used, and they expect us to work out the details and propose the solution that best meets their needs. On top of that, they are bombarded by conflicting advice and recommendations from a host of different sources - sound familiar?

Thankfully, customer analytics can help us understand that myriad of requirements and find the patterns in otherwise unique conversations, so that we can address those requirements in a way that plays to our strengths. It can even help us to understand how good a job we are doing in satisfying our customers.

The formula is simple enough - the customer will more fully understand the value when they know that your proposal has accurately addressed and reflected their expressed needs in language they find comfortable. To paraphrase Clay Shirky at a recent conference - let's not spend our time trying to educate the customer - I would rather educate myself.

I wanted to blog about our new Extended Learning Pages, but Larry LaRusso beat me to it in the latest issue of SAS Training Report. Larry is a fantastic story-teller, so instead of trying to come up with something original, I decided to share his story with you:

For most school age children, summer is a carefree time filled with lazy days of fun in the sun. As I was growing up, though, summers were largely a continuation of the school year, not because of summer remediation but because of a driven older sister who knew she was going to be a teacher from the moment she started talking.

For me and my twin brother (and any of our friends Roseann could trick into joining us), hours of baseball and pool time were replaced by workbooks and blackboards. And, Roseann took her job seriously; her "school" came complete with homework, picture day and regularly scheduled parent-teacher conferences. In fact, I still have a report card from those days. Though the grades were good, Roseann pulled no punches, telling our parents: "Larry should spend less time socializing and more time on task if he hopes to reach his full potential." Nice, huh?

Anyway, though I lament the lost play time, I realize it was this extra learning, learning after the "official" learning had concluded, that probably steered my academic career in the right direction.

This past month SAS Education launched its own version of Roseann's Summer School with the release of Extended Learning Pages. Extended Learning Pages provide our students access to all the course content they encountered in class, combined with additional learning tools to create a richer learning experience that extends well beyond the last day of class.

Here's to sisters and aspiring teachers.

While I can’t give away free access to an Extended Learning Page (because it’s for SAS training students only), I will share with you the types of material you can expect to receive on an Extended Learning Page. Each page is specifically tailored for the course you have taken, but in the example below, I’ve provided links to content that, A) I thought all blog readers would find useful and, B) is free to access.

Extended Learning — Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Thank you for taking the Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression course. You are invited to extend your learning experience by using the resources listed below.

Our newest book, Using SAS for Data Management, Statistical Analysis and Graphics, will soon be shipping from Amazon, CRC Press, and other fine retailers.

The book complements our SAS and R book, particularly for users less interested in R. It presents an easy way to learn how to perform analytical tasks in SAS, without having to navigate through the extensive, idiosyncratic, and sometimes unwieldy software documentation, and demonstrates useful applications, shortcuts, and tricks. Organized by short, clear descriptive entries, the book covers many common tasks, such as data management, descriptive summaries, inferential procedures, regression analysis, multivariate methods, and the creation of graphics.

Through the extensive indexing, cross-referencing, and worked examples in this text, users can directly find and implement the material they need. The text includes convenient indices organized by topic and SAS syntax, and presents example analyses that employ a single data set from the HELP study to demonstrate the SAS code in action and facilitate exploration. We also provide several case studies of more complex applications. Data sets and code are available for download on the book’s website. Many features of SAS version 9.2 (including new procedures and ODS support) are highlighted.

The book tries to lucidly summarize the aspects of SAS most often used by statistical analysts. We believe that new users of SAS will find the simple approach easy to understand, while more sophisticated users will appreciate the invaluable source of task-oriented information.

Note as of August 6, 2010: the book is now shipping from Amazon, albeit with no discount.

```
proc expand data=YourData method=none;
  by pt;
  convert dosedt = lead1_dt / transformout = (lead 1);
  convert dosedt = lead2_dt / transformout = (lead 2);
  convert dosedt = lag1_dt  / transformout = (lag 1);
  convert dosedt = lag2_dt  / transformout = (lag 2);
run;
```

TRICK 2: MAX OR MIN OF THE LAST K RECORDS

```
proc expand data=YourData out=YourMax method=none;
  by factory;
  convert x = max_x / transformin=(movmax 50);
  convert x = min_x / transformin=(movmin 50);
run;
```

TRICK 3: AGGREGATING CHUNKS OF RECORDS

```
data temp1;
  input obs vol price;
  datalines;
1 2 11
2 2 11
3 2 12
4 3 13
5 3 11
6 3 12
7 4 14
8 4 12
9 4 12
10 5 11
11 5 16
12 5 14
13 6 10
;
run;
```

```
proc expand data=temp1 out=exp1 factor=(1:3);
  convert vol=aggrvol     / observed=total;
  convert price=new_price / observed=end;
run;
```

```
TIME  AGGRVOL  NEW_PRICE
   0        6         12
   3        9         12
   6       12         12
   9       15         14
```
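As a cross-check on what FACTOR=(1:3) with OBSERVED=TOTAL and OBSERVED=END does for this toy data, the same four rows can be reproduced with a plain DATA step. This is a sketch only: it assumes the `temp1` data set above and a fixed chunk size of 3, the name `chunk_check` is made up, and it is not meant as a general replacement for PROC EXPAND.

```sas
/* Aggregate temp1 in chunks of 3 records:
   aggrvol   = total of vol within the chunk
   new_price = price of the last record in the chunk */
data chunk_check;
  set temp1;
  aggrvol + vol;                  /* sum statement: running total, retained across rows */
  if mod(_n_, 3) = 0 then do;     /* every third record closes a chunk */
    time = _n_ - 3;               /* 0, 3, 6, 9: index of the chunk's first record */
    new_price = price;
    output;
    aggrvol = 0;                  /* reset the total for the next chunk */
  end;
  keep time aggrvol new_price;
run;

proc print data=chunk_check noobs;
run;
```

The thirteenth record does not complete a chunk, so it produces no output row, which matches the four rows listed above.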

TRICK 4: THE MOVING AVERAGE

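```
/* number1 stands for the numeric variable to be smoothed;
   substitute a variable that exists in your data (temp1 above has vol and price) */
proc expand data=temp1 out=YourOut method=none;
  convert number1=mean1 / transformout=(movave 5);
run;
```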

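``%DROPMISS(DSNIN, DSNOUT, NODROP=);``

• DSNIN: name of the original (input) dataset
• DSNOUT: name of the new (output) dataset
• NODROP: names of variables that should be left alone even if they contain only missing data; this parameter is optional.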

```
%DROPMISS(olddata, newdata, NODROP=_NUMERIC_);
%DROPMISS(olddata, newdata, NODROP=_CHARACTER_);
```

```
/******************/
options nomprint noSYMBOLGEN MLOGIC;
/****************************/

%macro DROPMISS(
    DSNIN    /* name of input SAS dataset */
  , DSNOUT   /* name of output SAS dataset */
  , NODROP=  /* [optional] variables to be omitted from dropping even if
                they have only missing values */
  ) ;

/* PURPOSE: To find both character and numeric variables that have only
 *          missing values and drop them if they are not in &NODROP
 *
 * NOTE:    if there are no variables in the dataset, produce no variables
 *          processing code
 *
 * EXAMPLE OF USE:
 *   %DROPMISS( DSNIN, DSNOUT )
 *   %DROPMISS( DSNIN, DSNOUT, NODROP=A B C D--H X1-X100 )
 *   %DROPMISS( DSNIN, DSNOUT, NODROP=_numeric_ )
 *   %DROPMISS( DSNIN, DSNOUT, NODROP=_character_ )
 */

%local I ;

%if "&DSNIN" = "&DSNOUT" %then %do ;
  %put /------------------------------------------------\ ;
  %put | ERROR from DROPMISS:                           | ;
  %put | Input Dataset has same name as Output Dataset. | ;
  %put | Execution terminating forthwith.               | ;
  %put \------------------------------------------------/ ;
  %goto L9999 ;
%end ;

/*###################################################################*/
/* begin executable code                                             */
/*###################################################################*/

/*===================================================================*/
/* Create dataset of variable names that have only missing values    */
/* exclude from the computation all names in &NODROP                 */
/*===================================================================*/
proc contents data=&DSNIN( drop=&NODROP ) memtype=data noprint
              out=_cntnts_( keep=name type ) ;
run ;

%let N_CHAR = 0 ;
%let N_NUM  = 0 ;

data _null_ ;
  set _cntnts_ end=lastobs nobs=nobs ;
  if nobs = 0 then stop ;
  n_char + ( type = 2 ) ;
  n_num  + ( type = 1 ) ;
  /* create macro vars containing final # of char, numeric variables */
  if lastobs then do ;
    call symput( 'N_CHAR', left( put( n_char, 5. ))) ;
    call symput( 'N_NUM' , left( put( n_num , 5. ))) ;
  end ;
run ;

/*===================================================================*/
/* if there are no variables in dataset, stop further processing     */
/*===================================================================*/
%if %eval( &N_NUM + &N_CHAR ) = 0 %then %do ;
  %put /----------------------------------\ ;
  %put | ERROR from DROPMISS:             | ;
  %put | No variables in dataset.         | ;
  %put | Execution terminating forthwith. | ;
  %put \----------------------------------/ ;
  %goto L9999 ;
%end ;

/*===================================================================*/
/* put global macro names into global symbol table for later retrieval */
/*===================================================================*/
%LET NUM0  = 0 ;
%LET CHAR0 = 0 ;
%IF &N_NUM > 0 %THEN %DO ;
  %do I = 1 %to &N_NUM ;
    %global NUM&I ;
  %end ;
%END ;
%if &N_CHAR > 0 %THEN %DO ;
  %do I = 1 %to &N_CHAR ;
    %global CHAR&I ;
  %end ;
%END ;

/*===================================================================*/
/* create macro vars containing variable names                       */
/* efficiency note: could compute n_char, n_num here, but must       */
/* declare macro names to be global b4 stuffing them                 */
/*===================================================================*/
proc sql noprint ;
  %if &N_CHAR > 0 %then %str( select name into :CHAR1 - :CHAR&N_CHAR from _cntnts_ where type = 2 ; ) ;
  %if &N_NUM > 0 %then %str( select name into :NUM1 - :NUM&N_NUM from _cntnts_ where type = 1 ; ) ;
quit ;

/*===================================================================*/
/* Determine the variables that are missing                          */
/*===================================================================*/
%IF &N_CHAR > 1 %THEN %DO ;
  %let N_CHAR_1 = %EVAL(&N_CHAR - 1) ;
%END ;

Proc sql ;
  select
    %do I = 1 %to &N_NUM ; max (&&NUM&I) , %end ;
    %IF &N_CHAR > 1 %THEN %DO ;
      %do I = 1 %to &N_CHAR_1 ; max(&&CHAR&I), %END ;
    %end ;
    MAX(&&CHAR&N_CHAR)
  into
    %do I = 1 %to &N_NUM ; :NUMMAX&I , %END ;
    %IF &N_CHAR > 1 %THEN %DO ;
      %do I = 1 %to &N_CHAR_1 ; :CHARMAX&I, %END ;
    %END ;
    :CHARMAX&N_CHAR
  from &DSNIN ;
quit ;

/*===================================================================*/
/* initialize DROP_NUM, DROP_CHAR global macro vars                  */
/*===================================================================*/
%let DROP_NUM  = ;
%let DROP_CHAR = ;

%if &N_NUM > 0 %THEN %DO ;
  DATA _NULL_ ;
    %do I = 1 %to &N_NUM ;
      %IF &&NUMMAX&I =. %THEN %DO ;
        %let DROP_NUM = &DROP_NUM %qtrim( &&NUM&I ) ;
      %END ;
    %end ;
  RUN ;
%END ;

%IF &N_CHAR > 0 %THEN %DO ;
  DATA _NULL_ ;
    %do I = 1 %to &N_CHAR ;
      %IF "%qtrim(&&CHARMAX&I)" eq "" %THEN %DO ;
        %let DROP_CHAR = &DROP_CHAR %qtrim( &&CHAR&I ) ;
      %END ;
    %end ;
  RUN ;
%END ;

/*===================================================================*/
/* Create output dataset                                             */
/*===================================================================*/
data &DSNOUT ;
  /* drop char variables that have only missing values */
  %if &DROP_CHAR ^= %then %str(DROP &DROP_CHAR ; ) ;
  /* drop num variables that have only missing values */
  %if &DROP_NUM ^= %then %str(DROP &DROP_NUM ; ) ;
  set &DSNIN ;
  %if &DROP_CHAR ^= or &DROP_NUM ^= %then %do ;
    %put /----------------------------------\ ;
    %put | Variables dropped are &DROP_CHAR &DROP_NUM | ;
    %put \----------------------------------/ ;
  %end ;
  %if &DROP_CHAR = and &DROP_NUM = %then %do ;
    %put /----------------------------------\ ;
    %put | No variables are dropped         | ;
    %put \----------------------------------/ ;
  %end ;
run ;

%L9999:
%mend DROPMISS ;
```
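A small way to try the macro is to run it on a toy dataset in which two variables are missing on every row. This is a sketch, not part of the original macro: the dataset contents and the choice to protect `id` with NODROP= are made up for illustration, and it assumes the DROPMISS macro above has already been submitted in the same session.

```sas
/* Toy input: x (numeric) and y (character) are missing on every row;
   z is missing on one row only, so it should survive */
data olddata;
  input id x y $ z;
  datalines;
1 . . 10
2 . . 20
3 . . .
;
run;

/* The two dataset names are positional; NODROP= is the only keyword parameter.
   id is listed in NODROP=, so it is never considered for dropping. */
%DROPMISS(olddata, newdata, NODROP=id);

/* List the variables that remain; id and z are expected */
proc contents data=newdata short;
run;
```

Because `options MLOGIC` is set at the top of the macro source, the log will be verbose; the boxed message written by `%put` reports which variables were dropped.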

