[BUG FOUND, IN PROCESS OF FIXING IT]

The Gap statistic is a method for estimating the most plausible number of clusters in a partitional clustering, notably k-means clustering. The method was proposed by Trevor Hastie, Robert Tibshirani, and Guenther Walther, all of Stanford University.

The final estimate of the number of clusters is the smallest K such that
Gap(K) >= Gap(K+1) - s(K+1)

The code below implements the uniform sampling approach (method (a) in their paper, samplemethod=1) and the SVD sampling approach (method (b), samplemethod=2).

The simulated data has 5 clusters, and the Gap statistic correctly identifies this number:
K = 4: Gap = 1.37285, Gap(K+1) - s(K+1) = 1.6494
K = 5: Gap = 1.65057, Gap(K+1) - s(K+1) = 1.6485, and Gap(K) >= Gap(K+1) - s(K+1) holds for the first time, so \hat{k} = 5
K = 6: Gap = 1.64965, Gap(K+1) - s(K+1) = 1.6483
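As a quick illustration, the stopping rule can be written in a few lines of Python. This is only a sketch: `pick_k` is a name of my own choosing, the Gap values for K = 4..6 echo the run above, and the values for K = 1..3 and all s values below K = 5 are made-up placeholders consistent with the printed Gap(K+1) - s(K+1) numbers.

```python
def pick_k(gap, s):
    """Smallest K (1-based) with Gap(K) >= Gap(K+1) - s(K+1);
    gap[i] and s[i] hold the values for K = i + 1."""
    for i in range(len(gap) - 1):
        if gap[i] >= gap[i + 1] - s[i + 1]:
            return i + 1
    return len(gap)  # criterion never met: fall back to the largest K tried

# Gap values for K = 1..6; only K = 4..6 come from the run above.
gap = [0.60, 0.85, 1.10, 1.37285, 1.65057, 1.64965]
s   = [0.01, 0.01, 0.01, 0.00120, 0.00117, 0.00115]
```

With these values the criterion first holds at K = 5, matching the run above.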


/* Gap Statistics of Hastie et al. */

data test;
length ID 5;
array _X{50} X1-X50;
do ID=1 to 4000;
beta=mod(ID, 5)+1;
do _j=1 to 50;
_X[_j]=rannor(6898776)+beta*3;
end;
output;
keep ID X:;
end;
run;
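The same simulation can be sketched in numpy for readers who want to reproduce the setup outside SAS. The seed merely echoes the rannor seed above; the two generators do not produce the same draws, so only the shape and cluster structure carry over.

```python
import numpy as np

rng = np.random.default_rng(6898776)   # seed echoes the rannor seed; draws differ
n_obs, n_feat = 4000, 50
ids = np.arange(1, n_obs + 1)
beta = ids % 5 + 1                     # cluster label 1..5, 800 rows each
# each row is N(0, I) shifted by 3 * beta, giving 5 well-separated clusters
X = rng.standard_normal((n_obs, n_feat)) + 3 * beta[:, None]
```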

%macro GAPstat(dsn, ID_var, dsn_GAP, max_K, max_rep, samplemethod=1, seed=147852369);
%local B K dsid nobs ID dsn_out ref_dist features features2 num_features;

%let B=&max_rep;
%let dsid=%sysfunc(open(&dsn));
%let nobs=%sysfunc(attrn(&dsid, nobs));
%let dsid=%sysfunc(close(&dsid));
%put &nobs;

%let ID=&ID_var;
%let dsn_out=_clus_out;
%let ref_dist=_ref_dist;

proc contents data=&dsn out=allvars(keep=name varnum) noprint; run;
proc sort data=allvars; by varnum; run;
proc sql  noprint;
select name into :features separated by ' '
from   allvars
where  name^="&ID"
;
select count(distinct name) into :num_features
from   allvars
where  name^="&ID"
;
quit;

%if &samplemethod eq 1 %then %do;
proc means data=&dsn noprint;
var &features;
output out=_lim(where=(_STAT_ in ('MIN', 'MAX')));
run;
%end;
%else %if &samplemethod eq 2 %then %do;
proc princomp data=&dsn   noint cov noprint
outstat=_V(where=(_TYPE_='USCORE'));
var &features;
run;
data _V_score/view=_V_score;
set _V;
_TYPE_='PARMS';

run;
proc score data=&dsn  score=_V_score type=parms  out=_Z;
var &features;
run;
proc means data=_Z  noprint;
var &features;
output   out=_lim(where=(_STAT_ in ('MAX', 'MIN')));
run;
%end;
%else %do;
%put ERROR: Only samplemethod=1 (uniform sampling) and samplemethod=2 (SVD sampling) are allowed.;
%goto exit;
%end;

data &ref_dist;
array _stat{2, &num_features} _temporary_;
array _F{&num_features}  &features;

do until (eof);
set _lim end=eof;
if _STAT_='MIN' then _row=1;
else _row=2;
do _k=1 to &num_features;
if _row=1 then _stat[_row, _k]=_F[_k];
else _stat[_row, _k]=_F[_k]-_stat[_row-1, _k];
end;
end;

do _B=1 to &B;
do &ID=1 to &nobs;
do _k=1 to &num_features;
_F[_k]=ranuni(&seed)*(_stat[2, _k]) + _stat[1, _k];
end;
keep _B &ID &features;
output;
end;
end;
run;

%if &samplemethod eq 2 %then %do;
proc transpose data=_V  out=_Vt;  run;
proc sql noprint;
select  name into :features2 separated by ' ", "'
from   allvars
where  upcase(name)^=upcase("&ID")
;
quit;
data _Vt_score/view=_Vt_Score;
set _Vt;
set allvars(where=(name in ("&features2")));
retain  _TYPE_ "PARMS";
_LABEL_=_NAME_;
_NAME_=name;
drop name;
run;
proc score data=&ref_dist   score=_Vt_score   type=parms
out=&ref_dist.(keep=_B  &ID &features);
var &features;
run;
%end;

%do K=1 %to &max_K;
proc fastclus data=&dsn  maxclusters=&K
out=&dsn_out.(keep=&ID Distance CLUSTER)
noprint;
var &features;
run;

proc fastclus data=&ref_dist maxclusters=&K
out=&dsn_out._&K.(keep=&ID _B Distance CLUSTER)
noprint;
by _B;
var &features;
run;

proc sort data=&dsn_out; by CLUSTER; run;
proc sort data=&dsn_out._&K; by _B CLUSTER; run;

proc means data=&dsn_out  sum noprint;
/* BY, not CLASS: a CLASS statement also outputs an overall (_TYPE_=0) row,
   which would be double-counted in the weighted sum below; the reference
   branch right after already uses BY */
by CLUSTER;
var Distance;
output out=_dist_  sum(Distance)=Distortion;
run;
proc means data=&dsn_out._&K  sum noprint;
by _B  CLUSTER;
var Distance;
output out=_dist_&K  sum(Distance)=Distortion;
run;
data _dist_;
set _dist_;
_FREQ_=_FREQ_/&nobs;
run;
data _dist_&K;
set _dist_&K;
_FREQ_=_FREQ_/&nobs;
run;
proc means data=_dist_ noprint  sum;
var Distortion;
weight _FREQ_;
output out=_dist_  sum(distortion)=distortion;
run;
proc means data=_dist_&K noprint  sum;
by _B;
var Distortion;
weight _FREQ_;
output out=_dist_&K  sum(distortion)=distortion;
run;

data _dist_&K(drop=mean _B:) _logmean_&K.(keep=mean);
array _D{&B} _B1-_B&B;
set _dist_&K end=eof;
logD=log(Distortion)+log(&nobs);
_D[_n_]=logD;
output _dist_&k;
if eof then do;
mean=mean(of _D[*]);
output  _logmean_&K;
end;
run;

proc means data=_dist_&K  noprint;
var logD;
output out=_s_&K.(where=(_STAT_='STD'));
run;

data Gap_&K;
set _dist_; set _logmean_&K ; set _s_&K(keep=logD);
Wk=log(distortion)+log(&nobs);
GAP=mean-Wk;
s=logD*sqrt(1+1/&B);
keep mean  Wk  GAP s;
run;
%end;

data &dsn_GAP;
set %do K=1 %to &max_K;
Gap_&K
%end;;
run;
%exit:
%mend;

%let dsn=test;
%let ID=ID;
%let dsn_GAP=GAP;
%let max_K=10;
%let max_rep=100;
%GAPStat(&dsn, &ID, &dsn_GAP, &max_K, &max_rep, samplemethod=2);
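For readers outside SAS, the macro's two core steps can be sketched in Python with numpy. This is an illustrative sketch of the method, not a port of the macro: the function names are my own, and the sample standard deviation (matching PROC MEANS STD) is used for s(K).

```python
import math
import numpy as np

def reference_sample(X, rng, svd=False):
    """One uniform reference dataset over X's bounding box.

    svd=False: method (a), a box aligned with the feature axes.
    svd=True:  method (b), a box aligned with the principal axes
    (rotate into them, sample uniformly, rotate back).
    """
    if svd:
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        Z = X @ Vt.T                          # coordinates in the principal axes
    else:
        Vt, Z = None, X
    U = rng.uniform(Z.min(axis=0), Z.max(axis=0), size=Z.shape)
    return U @ Vt if svd else U               # rotate back for method (b)

def gap_and_s(log_w, log_w_ref):
    """Gap(K) and s(K) from log W_K of the data and the B reference
    values log W*_K; s is the sample SD of the latter times sqrt(1 + 1/B)."""
    b = len(log_w_ref)
    mean = sum(log_w_ref) / b
    sd = math.sqrt(sum((v - mean) ** 2 for v in log_w_ref) / (b - 1))
    return mean - log_w, sd * math.sqrt(1 + 1 / b)
```

In the macro, `reference_sample` corresponds to the &ref_dist data step (plus the PROC SCORE back-transform when samplemethod=2), and `gap_and_s` to the Gap_&K data step.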



A reference paper can be found at:
R. Tibshirani, G. Walther, and T. Hastie. 2001.
Estimating the number of clusters in a data set via the gap statistic.
Journal of the Royal Statistical Society, Series B, 63(2): 411--423.

Or the book by T. Hastie, R. Tibshirani, and J. Friedman:
The Elements of Statistical Learning

You have probably heard the news by now. SAS is ranked as the #1 best place to work by Fortune Magazine. You can read more about why SAS is ranked #1.

But here's my story
I have worked at SAS for a long time. Some of my best friends work here. We met in the gym, the break room, or a conference room. Some I even met at local stores and restaurants. Over the years, we have worked together, played together, grieved together, and celebrated together. The people are great. If you ask SAS employees why they like working here, most say the healthcare, the daycare, and the people. I agree with all of those. But let me tell you about yesterday, and you'll understand why I really like my job at SAS.

We're ready to make some updates to support.sas.com/community. We want these updates to be innovative, useful, and fun. So I scheduled a brainstorming session over lunch (yes, over lunch) and invited colleagues from around the company to come and participate. Ok, here's the good part: 17 people gave up their lunch hour to discuss ideas that can enhance how our customers work with our Web site, with us as a company, and with each other. It was exciting.

And that is why I think SAS is a great place to work -- the number of work days that offer something to be excited about outnumber the days that don't.

I hope that you find something at work to be excited about this week. It sure makes working more fun.

I hate shopping. Going to the local mall is a form of torture for me. But send me to a virtual store, and I’ll happily browse online and likely place an order. Now I can do the same thing, of sorts, with SAS Global Forum since the presentation schedules and abstracts are posted online. I can just “shop” for what interests me most.

With over 375 presentations to choose from, planning ahead is the smart way to manage my time at SGF. Want to go shopping too? Just use the Personal Agenda Builder: it's your starting point for exploring all that SAS Global Forum 2010 has to offer. It carries the latest information on the papers, posters, and workshops, and the entries, including abstracts, can be viewed by Date, Company, Industry, Paper Type, Section, Skill Level, or Speaker. An ad-hoc search works too. Happy "shopping," searching, and planning!

http://cos.name/author/hujiangtang/

********************************************************;
* A SAS MACRO IMPORTING SQLITE DATA TABLE WITHOUT ODBC *;
* ---------------------------------------------------- *;
* REQUIREMENT:                                         *;
*   HAVE SQLITE3.EXE DIRECTORY IN THE SYSTEM PATH.     *;
* LIMITATIONS:                                         *;
*   1. UNABLE TO HANDLE THE TABLE WITH BLOB DATA TYPE  *;
*   2. THE DEFAULT LENGTH OF ALL CHAR DATA IS 200      *;
*   3. SYSTEM DEPENDENT (CHANGES NEEDED FOR OTHER OS)  *;
*   4. POTENTIALLY SLOW FOR LARGE TABLES               *;
********************************************************;

%let chlen = 200;

options mlogic symbolgen mprint;

%macro sqlite(path = , dbfile = , table = );

%let fileref = %trim(&path)\%trim(&dbfile);

filename fileref "&fileref";

%if %sysfunc(fexist(fileref)) %then %do;

%let dumpref = %trim(&path)\%trim(dump.txt);

filename dumpref "&dumpref";

%if %sysfunc(fexist(dumpref)) %then %do;
data _null_;
rc = fdelete("dumpref");
run;
%end;

%let batref = %trim(&path)\%trim(dump.bat);

filename batref "&batref";

%if %sysfunc(fexist(batref)) %then %do;
data _null_;
rc = fdelete("batref");
run;
%end;

data _null_;
file batref;
cmd = "sqlite3 &fileref "||'"'||".dump &table"||'"'||" > &dumpref";
put cmd;
run;
/* run the script so the dump file exists before it is read below */
options noxwait;
x "&batref";

%end;
%else %goto exit;

data create;
infile dumpref;
input;
create = upcase(_infile_);
retain _keep 0;
/* start at the CREATE TABLE line: depending on the sqlite3 version, the
   dump may begin with PRAGMA and/or BEGIN TRANSACTION lines */
if index(create, 'CREATE TABLE') then _keep = 1;
if _keep then output;
if _keep and substr(create, 1, 2) = ');' then stop;
drop _keep;
run;

proc sql noprint;
select
tranwrd(create, "TEXT", "CHAR(&chlen.)")
into
:create separated by ' '
from
create;

&create;
quit;

proc datasets library = work nolist;
delete create(mt = data);
run;
quit;

data _null_;
infile dumpref end = eof;
input;
if _n_ = 1 then do;
call execute("PROC SQL;");
end;
if index(upcase(_infile_), 'INSERT INTO') > 0 then do;
pos1 = index(_infile_, '(');
pos2 = index(_infile_, ');');
insert = substr(_infile_, pos1, pos2 - pos1 + 2);
call execute("INSERT INTO &TABLE VALUES"||insert);
end;
if eof then do;
call execute("QUIT;");
end;
run;

data _null_;
rc = fdelete("batref");
run;

data _null_;
rc = fdelete("dumpref");
run;

proc contents data = &table varnum;
run;

%exit:

%mend sqlite;

%sqlite(path = D:\sas, dbfile = test.db, table = tblco);
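For comparison, Python's standard library talks to SQLite directly, with no ODBC driver and no shelling out to sqlite3.exe. A minimal sketch (the function name and the tblco example table are just for illustration):

```python
import sqlite3

def read_table(dbfile, table):
    """Return (DDL, rows) for `table`: the CREATE TABLE statement from
    sqlite_master plus every row, i.e. the two pieces the macro scrapes
    out of the `.dump` text file."""
    con = sqlite3.connect(dbfile)
    try:
        ddl = con.execute(
            "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table,)).fetchone()[0]
        rows = con.execute("SELECT * FROM %s" % table).fetchall()
    finally:
        con.close()
    return ddl, rows
```

Because the driver is built in, none of the macro's limitations (BLOB columns, the fixed CHAR length, OS dependence) apply here.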

################################################
# EXPORT DATA FRAME INTO SQLITE DATABASE       #
################################################

library(sqldf)
library(datasets)
data(CO2)

### SET WORKING DIRECTORY ###
setwd('d:\\r')

### TEST IF THE FILE EXISTS ###
if(file.exists('test.db')) file.remove('test.db')

### CREATE A NEW EMPTY SQLITE DATABASE ###
sqldf("attach 'test.db' as new")

### CREATE A NEW TABLE FROM DATA FRAME ###
sqldf("create table tblco as select * from CO2", dbname = 'test.db')

### SHOW THE DATA IN SQLITE DATABASE ###
sqldf("select * from tblco limit 5", dbname = 'test.db')
#  Plant   Type  Treatment conc uptake
#1   Qn1 Quebec nonchilled   95   16.0
#2   Qn1 Quebec nonchilled  175   30.4
#3   Qn1 Quebec nonchilled  250   34.8
#4   Qn1 Quebec nonchilled  350   37.2
#5   Qn1 Quebec nonchilled  500   35.3

###############################
# CALCULATE SUMMARY BY GROUPS #
###############################

data(iris)

### WITH AGGREGATE() FUNCTION ###
sum1 <- aggregate(iris[-5], iris[5], mean)
sum1
#     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#1     setosa        5.006       3.428        1.462       0.246
#2 versicolor        5.936       2.770        4.260       1.326
#3  virginica        6.588       2.974        5.552       2.026

### WITH BY() FUNCTION ###
tmp2 <- by(iris[-5], iris[5], colMeans)  # colMeans: mean() no longer accepts data frames
sum2 <- cbind(expand.grid(dimnames(tmp2)), do.call(rbind, tmp2))
rownames(sum2) <- NULL
sum2
#     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#1     setosa        5.006       3.428        1.462       0.246
#2 versicolor        5.936       2.770        4.260       1.326
#3  virginica        6.588       2.974        5.552       2.026

### WITH RESHAPE PACKAGE ###
library(reshape)
tmp3 <- melt(iris)
sum3 <- cast(tmp3, Species ~ variable, mean)
sum3
#     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#1     setosa        5.006       3.428        1.462       0.246
#2 versicolor        5.936       2.770        4.260       1.326
#3  virginica        6.588       2.974        5.552       2.026

### WITH SQLDF PACKAGE ###
library(sqldf)
sum4 <- sqldf("select
Species,
avg(Sepal_Length) as Sepal_Length,
avg(Sepal_Width)  as Sepal_Width,
avg(Petal_Length) as Petal_Length,
avg(Petal_Width)  as Petal_Width
from
iris group by Species");
sum4
#     Species Sepal_Length Sepal_Width Petal_Length Petal_Width
#1     setosa        5.006       3.428        1.462       0.246
#2 versicolor        5.936       2.770        4.260       1.326
#3  virginica        6.588       2.974        5.552       2.026

### WITH LAPPLY() FUNCTION ###
tmp5 <- lapply(split(iris[-5], iris[5]), colMeans)  # colMeans: mean() no longer accepts data frames
sum5 <- data.frame(Species = names(tmp5), do.call(rbind, tmp5))
rownames(sum5) <- NULL
sum5
#     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#1     setosa        5.006       3.428        1.462       0.246
#2 versicolor        5.936       2.770        4.260       1.326
#3  virginica        6.588       2.974        5.552       2.026
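The same group-mean logic is compact in Python as well; here is a dependency-free sketch with a tiny three-row stand-in for iris (the function name and sample rows are mine, just to mirror the computation):

```python
from collections import defaultdict

def group_means(rows, key, cols):
    """Mean of each column in `cols` within each level of `key`;
    `rows` is a list of dicts, one per observation."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    return {g: {c: sum(r[c] for r in rs) / len(rs) for c in cols}
            for g, rs in groups.items()}

rows = [  # a three-row stand-in for iris
    {"Species": "setosa",    "Sepal.Length": 5.0},
    {"Species": "setosa",    "Sepal.Length": 5.2},
    {"Species": "virginica", "Sepal.Length": 6.6},
]
```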

This blog, "From A Logical Point of View," is not actually owned by a logician. The book From a Logical Point of View is a collection of logical and philosophical essays by W. V. Quine (1908-2000), an American philosopher and logician.

A story about the book: "From a Logical Point of View" was a calypso song recorded by Robert Mitchum, an American actor, composer, and singer. Quine enjoyed the song and borrowed its title for his new book, which became his best seller. I love the book too, and took my blog's name from it.

If you have a magazine subscription, or you read the magazines at the checkout counter, you know that they are always a month ahead of your calendar. My January 2010 magazines have already arrived, and they are all talking about resolutions. Resolutions are a good thing, but they shouldn't be limited to January 1. So try this year-long resolution -- challenge your comfort zone once each week. That's only 52 new things for 2010.

We all get in ruts and don't always realize it. These ruts can keep us from having innovative ideas, from meeting interesting people, or from finding a new favorite food. Stepping outside of your comfort zone can make you happier, healthier, and maybe wealthier if your innovative ideas are good enough.

I have already started to work on this concept. Here's my small, fun story. I went to a holiday lunch with a couple of friends. At the last minute, we picked the cafe at the landscaping company next door instead of the restaurant we had driven downtown for. With a little encouragement, I bought a new swag to decorate my mantel. It is beautiful, and I can't wait for my family to take Christmas pictures in front of it.

My comfort zone challenge continues into the new year, personally and professionally. Over the next few weeks, I'll be finishing up a paper to be presented at SAS Global Forum 2010. This is a big step for many of us, but when you have information to share, the payoff for you and your audience can be well worth the sweaty palms and rapid heartbeat.

But, you don't have to go that far. I've provided a few simple personal and professional suggestions to help you challenge your comfort zone in 2010.

Start small. Try the restaurant next door to your favorite restaurant. Sit at a different table in your company cafe. Watch a video to learn a new skill instead of using your favorite documentation. Visit your local library and check out an author you have never heard of. Or, try one of these: