January 26, 2010
 
Back in December I wrote about the t-shirt contest on sasCommunity.org. Meanwhile, awards season kicked into high gear with the Golden Globes, Critics' Choice Awards, and Screen Actors Guild Awards - just to name a few. Now it’s time for the first annual Best SAS T-Shirt Award Show, and it’s all online at sasCommunity.org! Not a member of the Hollywood Foreign Press Association? No problem. Anyone can vote. It just helps if you know a thing or two about SAS and can understand a little SAS code. But decide for yourself. There are seven finalists for Best SAS T-Shirt!

The design receiving the most votes between now and 5:00 pm EST on February 5, 2010 will be declared the winner. Will there be an acceptance speech with tears and thanks to Mom and Dad? Probably not, nor will there be a red carpet at SAS Global Forum. However, the winning design will be making its debut at SGF on a limited number of t-shirts. I hope I get one! Just one more reason that you need to be in Seattle April 11-14.
January 25, 2010
 
I saw an interesting question posted to Twitter by @sastips. She asked experienced SAS programmers what advice they would give to new or junior SAS programmers. Advice for new users is always in demand. On support.sas.com, we are frequently asked how to get started, so this question piqued my interest. I may have missed a few responses, but I managed to catch several of them and post them here for you.

@SASJedi: Get certified!

@Jeff_Bailey: learn PROC SQL. If they deal w/ databases this is priceless: OPTIONS SASTRACE=',,,d' SASTRACELOC=saslog NOSTSUFFIX;

@gtcox76: my tip to jr programmer. learn the data. You can write great code but if logically wrong it's bad code.

@SSessoms: My advice to a jr SAS prgmr? Read "The Little SAS Book." It's a wealth of info! http://tinyurl.com/yaat224

@sasbi: My advice to a jr SAS pgmr? ALWAYS check the log first!

@sastip: just because it runs doesn't mean it's right.
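
If you want to see @Jeff_Bailey's tip in action, here is a minimal sketch of how that option is typically used. The MYDB libref and ORDERS table are hypothetical stand-ins; the point is that the trace output lands in the SAS log.

/* Hypothetical setup: MYDB is a SAS/ACCESS libref pointing at a
   relational database. Turn on tracing of the SQL that SAS
   generates and sends to the DBMS. */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

proc sql;
   select count(*)
   from mydb.orders;   /* hypothetical database table */
quit;

/* The log now shows the statements passed to the database, so you
   can see what ran in the DBMS versus what was pulled back into SAS. */

The trace makes it obvious whether a join or summary was pushed down to the database, which is usually the first thing to check when a query against a DBMS runs slowly.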

These tips are interesting and useful. Let's continue this topic and offer more help to those just starting to learn SAS. If you have a tip, post your advice in the comments for this blog post.
What advice would you give?

My advice is more generic than the tips offered on Twitter. I suggest using the free resources available to you.

Two great resources are support.sas.com and sasCommunity.org. Both sites contain a wealth of information. I recommend:
  • starting with a few sample programs from the support.sas.com samples database or the Tip of the Day on sasCommunity.org

  • signing up for training or tutorials from SAS Training

  • leveraging the knowledge of the user community: take advantage of the discussion forums, sasCommunity.org, SAS users on Twitter, and your co-workers.


And one last thing for junior and senior SAS programmers alike: I saw a great thread on SAS-L, Some Tips and Tricks from SAS-L, that is loaded with years of sage advice on all sorts of SAS topics.

January 22, 2010
 
We received several comments noting extremely slow page loads, especially in the search and browse areas. I checked on the pages, and the slowness seemed random and impossible to reproduce. Then I remembered that the world had just found out that SAS is the #1 place to work according to Fortune Magazine. I know our Web sites saw much more traffic than usual yesterday. I jumped to ideas about pipes and traffic and such. Luckily, our Webmasters are always watching, checking, and tweaking, so they were keeping an eye on it. Everything is good.

This gives me a chance to say that if you experience this or any other issues with support.sas.com, post a comment here or use this feedback form to let us know about your experience. We read every comment and do our best to correct and respond.

[I'm putting this post in the Q&A category because I am responding to feedback received through our feedback forms. I hope that Customer Support Site users are starting to look here for answers to the questions and comments that are submitted about support.sas.com.]


If you find yourself thinking "what search and browse pages?", read on.
We have a few sections of the site where the pages are tagged with metadata (words that we think you use to describe the topics within those pages). You can drill through the topic hierarchy to narrow the resulting document list to only those pages tagged with a specific topic value (browse). Searching is still a powerful way to locate content, so you can also use the search box to search this content the same way that you search any other part of support.sas.com.

Topic browse is available for

Implementing Gap statistic for clustering number estimation

Tagged with: Array, Gap Statistic, K-means Clustering, predictive modeling, SVD
January 22, 2010
 


[BUG FOUND, IN PROCESS OF FIXING IT]

The Gap statistic is a method for estimating the most plausible number of clusters in partition clustering, notably k-means clustering. The method was originated by Robert Tibshirani, Guenther Walther, and Trevor Hastie, all of Stanford University.


The final estimate of the number of clusters is the smallest $K$ such that
$$\mathrm{Gap}(K) \ge \mathrm{Gap}(K+1) - s_{K+1}.$$
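
For reference, here are the quantities in that rule as defined in the paper (the macro below computes approximations of them):

$$W_K = \sum_{r=1}^{K} \frac{1}{2 n_r} D_r, \qquad D_r = \sum_{i, i' \in C_r} d_{ii'},$$

where $C_r$ is the $r$-th cluster, $n_r = |C_r|$, and $d_{ii'}$ is the squared Euclidean distance between observations $i$ and $i'$. With $B$ datasets drawn from the reference (null) distribution,

$$\mathrm{Gap}(K) = \frac{1}{B} \sum_{b=1}^{B} \log W^{*}_{Kb} - \log W_K, \qquad s_K = \mathrm{sd}_K \sqrt{1 + 1/B},$$

where $\mathrm{sd}_K$ is the standard deviation of $\log W^{*}_{K1}, \ldots, \log W^{*}_{KB}$.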

The code below implements the uniform sampling approach for the reference distribution (method (a) in their paper, samplemethod=1) and the SVD sampling approach (method (b), samplemethod=2).

The simulated data has 5 clusters, and the Gap statistic correctly identifies that number:

K = 4: Gap = 1.37285, Gap(K+1) - s(K+1) = 1.6494
K = 5: Gap = 1.65057, Gap(K+1) - s(K+1) = 1.6485   <-- smallest K with Gap(K) >= Gap(K+1) - s(K+1)
K = 6: Gap = 1.64965, Gap(K+1) - s(K+1) = 1.6483



/* Gap Statistics of Hastie et al. */

/* Simulate 4,000 observations in 50 dimensions forming 5 clusters:
   cluster means sit at beta*3 for beta = 1..5, with N(0,1) noise. */
data test;
     length ID 5;
     array _X{50} X1-X50;
     do ID=1 to 4000;
        beta=mod(ID, 5)+1;
        do _j=1 to 50;
           _X[_j]=rannor(6898776)+beta*3;
        end;
        output;
     end;
     keep ID X:;
run;


%macro GAPstat(dsn, ID_var, dsn_GAP, max_K, max_rep, samplemethod=1, seed=147852369);
%local B K dsid nobs ID dsn_out ref_dist features features2 num_features;

%let B=&max_rep;
%let dsid=%sysfunc(open(&dsn));
%let nobs=%sysfunc(attrn(&dsid, nobs));
%let dsid=%sysfunc(close(&dsid));
%put &nobs;

%let ID=&ID_var;
%let dsn_out=_clus_out;
%let ref_dist=_ref_dist;


proc contents data=&dsn out=allvars(keep=name varnum) noprint; run;
proc sort data=allvars; by varnum; run;
proc sql  noprint;
     select name into :features separated by ' '
  from   allvars
  where  name^="&ID"
  ;
     select count(distinct name) into :num_features 
  from   allvars
  where  name^="&ID"
  ;
quit;


%if &samplemethod eq 1 %then %do;
/* Method (a): the uniform sampling box is defined by each
   feature's observed minimum and maximum. */
proc means data=&dsn noprint;
     var &features;
     output out=_lim(where=(_STAT_ in ('MIN', 'MAX')));
run;
%end;
%else %if &samplemethod eq 2 %then %do;
/* Method (b): rotate the data into its principal axes (Z = X V),
   sample uniformly in the rotated box, and back-transform later. */
proc princomp data=&dsn noint cov noprint
              outstat=_V(where=(_TYPE_='USCORE'));
    var &features;
run;
data _V_score/view=_V_score;
    set _V;
    _TYPE_='PARMS';
run;
proc score data=&dsn score=_V_score type=parms out=_Z;
    var &features;
run;
proc means data=_Z noprint;
    var &features;
    output out=_lim(where=(_STAT_ in ('MAX', 'MIN')));
run;
%end;
%else %do;
         %put ERROR: Only samplemethod=1 (uniform sampling) or 2 (SVD sampling) is allowed.;
         %goto exit;
%end;

/* Build the B reference datasets. _stat[1,k] holds the minimum of
   feature k and _stat[2,k] holds its range (max - min), so a
   uniform draw over the box is min + U*(max - min). */
data &ref_dist;
     array _stat{2, &num_features} _temporary_;
     array _F{&num_features} &features;

     do until (eof);
        set _lim end=eof;
        if _STAT_='MIN' then _row=1;
        else _row=2;
        do _k=1 to &num_features;
           if _row=1 then _stat[_row, _k]=_F[_k];
           else _stat[_row, _k]=_F[_k]-_stat[_row-1, _k];
        end;
     end;

     do _B=1 to &B;
        do &ID=1 to &nobs;
           do _k=1 to &num_features;
              _F[_k]=ranuni(&seed)*(_stat[2, _k]) + _stat[1, _k];
           end;
           output;
        end;
     end;
     keep _B &ID &features;
run;

%if &samplemethod eq 2 %then %do;
proc transpose data=_V out=_Vt; run;
proc sql noprint;
     select name into :features2 separated by '", "'
     from   allvars
     where  upcase(name)^=upcase("&ID")
     ;
quit;
/* Back-transform the rotated samples (X = Z V`). The view pairs each
   row of the transposed loadings with its original variable name. */
data _Vt_score/view=_Vt_score;
     set _Vt;
     set allvars(where=(name in ("&features2")));
     retain _TYPE_ "PARMS";
     _LABEL_=_NAME_;
     _NAME_=name;
     drop name;
run;
proc score data=&ref_dist   score=_Vt_score   type=parms
                out=&ref_dist.(keep=_B  &ID &features);
    var &features;
run;
%end;

/* For each candidate K, cluster the observed data and each of the
   B reference datasets, then compute log within-cluster dispersion. */
%do K=1 %to &max_K;
    proc fastclus data=&dsn  maxclusters=&K  
              out=&dsn_out.(keep=&ID Distance CLUSTER) 
              noprint;
         var &features;
    run;

    proc fastclus data=&ref_dist maxclusters=&K  
              out=&dsn_out._&K.(keep=&ID _B Distance CLUSTER)  
              noprint;
         by _B;
         var &features;
    run;

 proc sort data=&dsn_out; by CLUSTER; run;
 proc sort data=&dsn_out._&K; by _B CLUSTER; run;

    /* Sum of distances to the cluster centroid, per cluster. NWAY keeps
       only the per-cluster rows so that the _TYPE_=0 overall row is not
       double counted in the weighted sum below. */
    proc means data=&dsn_out  sum noprint nway;
         class CLUSTER;
         var Distance;
         output out=_dist_  sum(Distance)=Distortion;
    run;
    proc means data=&dsn_out._&K  sum noprint;
         by _B  CLUSTER;
         var Distance;
         output out=_dist_&K  sum(Distance)=Distortion;
    run;
    data _dist_;
         set _dist_;
         _FREQ_=_FREQ_/&nobs;
    run;
    data _dist_&K;
         set _dist_&K;
         _FREQ_=_FREQ_/&nobs;
    run;
    /* Frequency-weighted total distortion for the observed data... */
    proc means data=_dist_ noprint sum;
         var Distortion;
         weight _FREQ_;
         output out=_dist_  sum(distortion)=distortion;
    run;
    /* ...and for each of the B reference replicates. */
    proc means data=_dist_&K noprint sum;
         by _B;
         var Distortion;
         weight _FREQ_;
         output out=_dist_&K  sum(distortion)=distortion;
    run;

    data _dist_&K(drop=mean _B:) _logmean_&K.(keep=mean);
         array _D{&B} _B1-_B&B;
         retain _B1-_B&B;       /* hold earlier replicates' values across iterations */
         set _dist_&K end=eof;
         logD=log(Distortion)+log(&nobs);
         _D[_n_]=logD;
         output _dist_&K;
         if eof then do;
            mean=mean(of _D[*]);   /* average of log W* over the B replicates */
            output _logmean_&K;
         end;
    run;

    proc means data=_dist_&K  noprint;
         var logD;
         output out=_s_&K.(where=(_STAT_='STD'));
    run;

    data Gap_&K;
        set _dist_; set _logmean_&K; set _s_&K(keep=logD);
        K=&K;                       /* identify the row in the stacked output */
        Wk=log(distortion)+log(&nobs);
        GAP=mean-Wk;
        s=logD*sqrt(1+1/&B);
        keep K mean Wk GAP s;
    run;
%end;

data &dsn_GAP;
     set %do K=1 %to &max_K;
         Gap_&K
   %end;;
run;
%exit:
%mend;

%let dsn=test;
%let ID=ID;
%let dsn_GAP=GAP;
%let max_K=10;
%let max_rep=100;
%GAPStat(&dsn, &ID, &dsn_GAP, &max_K, &max_rep, samplemethod=2);
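
The stacked GAP dataset ends up with one row per K. Here is a minimal sketch for applying the selection rule to it, assuming the rows are in K order as produced above, using a one-row look-ahead merge:

/* Row _n_ of the second copy of GAP is row _n_+1 of the first, so
   each row sees Gap(K+1) and s(K+1) alongside Gap(K). */
data _null_;
   merge gap
         gap(firstobs=2 keep=GAP s rename=(GAP=GAP_next s=s_next));
   if n(GAP_next, s_next)=2 and GAP >= GAP_next - s_next then do;
      put 'Estimated number of clusters: ' K=;
      stop;
   end;
run;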


A reference paper can be found at:
R. Tibshirani, G. Walther, and T. Hastie (2001).
Estimating the number of clusters in a data set via the gap statistic.
Journal of the Royal Statistical Society, Series B, 63(2), pp. 411-423.

Or the book by T. Hastie, R. Tibshirani, and J. Friedman:
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics)
January 22, 2010
 
You have probably heard the news by now. SAS is ranked as the #1 best place to work by Fortune Magazine. You can read more about why SAS is ranked #1.


But here's my story
I have worked at SAS for a long time. Some of my best friends work here. We met in the gym, the break room, or a conference room. Some I even met at local stores and restaurants. Over the years, we have worked together, played together, grieved together, and celebrated together. The people are great. If you ask a SAS employee to tell you why they like working here, most say healthcare, daycare, and the people. I'll agree with all of those things. But let me tell you about yesterday, and you'll understand why I really like my job at SAS.

We're ready to make some updates to support.sas.com/community. We want these updates to be innovative, useful, and fun. So I scheduled a brainstorming session over lunch (yes, over lunch) and invited colleagues from around the company to come and participate. Ok, here's the good part: 17 people gave up their lunch hour to discuss ideas that can enhance how our customers work with our Web site, with us as a company, and with each other. It was exciting.

And that is why I think SAS is a great place to work -- the days that offer something to be excited about outnumber the days that don't.

I hope that you find something at work to be excited about this week. It sure makes working more fun.
January 16, 2010
 
I hate shopping. Going to the local mall is a form of torture for me. But send me to a virtual store, and I’ll happily browse online and likely place an order. Now I can do the same thing, of sorts, with SAS Global Forum since the presentation schedules and abstracts are posted online. I can just “shop” for what interests me most.

With over 375 presentations to choose from, planning ahead is the smart way for me to manage my time at SGF. Want to go shopping too? Just use the Personal Agenda Builder - it’s your starting point for exploring all that SAS Global Forum 2010 has to offer. It has the latest information on the papers, posters, and workshops. The information, including abstracts, can be viewed by Date, Company, Industry, Paper Type, Section, Skill Level, or Speaker. An ad-hoc search works too. Happy “shopping,” searching, and planning!

The Tech Blog Reopens

Tagged with: People, Tech Blog, Capital of Statistics (统计之都), Statistics Memoranda, Ramblings
January 4, 2010
 

I have imported everything I wrote on my old Spaces blog into this new one.

In this shiny new year, I want to write more about SAS programmers themselves: the profession, and the industry environment it depends on. SAS programming is not yet a thriving career in China.

There will also be writing about SAS itself: the SAS language, the SAS company, and its founders. Lately I have been especially interested in Tony Barr, one of SAS's founders.

As for the technology itself, which is what pays the bills, much of the ink will likely go to CDISC in addition to SAS. Of course there will be other pieces as the mood strikes; it is only the start of the year, and nothing is set. As agreed with the Capital of Statistics (统计之都), I will publish all statistics-related writing there first and then keep a backup on my own blog:

http://cos.name/author/hujiangtang/

January 2, 2010
 
********************************************************;
* A SAS MACRO IMPORTING SQLITE DATA TABLE WITHOUT ODBC *;
* ---------------------------------------------------- *;
* REQUIREMENT:                                         *;
*   HAVE SQLITE3.EXE DIRECTORY IN THE SYSTEM PATH.     *;
* LIMITATIONS:                                         *;
*   1. UNABLE TO HANDLE THE TABLE WITH BLOB DATA TYPE  *;
*   2. THE DEFAULT LENGTH OF ALL CHAR DATA IS 200      *;
*   3. SYSTEM DEPENDENT (CHANGES NEEDED FOR OTHER OS)  *;
*   4. POTENTIALLY SLOW FOR LARGE TABLES               *;
********************************************************;

%let chlen = 200;

options mlogic symbolgen mprint;

%macro sqlite(path = , dbfile = , table = );

%let fileref = %trim(&path)\%trim(&dbfile);

filename fileref "&fileref";

%if %sysfunc(fexist(fileref)) %then %do;

  %let dumpref = %trim(&path)\%trim(dump.txt);

  filename dumpref "&dumpref";

  %if %sysfunc(fexist(dumpref)) %then %do;
    data _null_;
      rc = fdelete("dumpref");
    run;
  %end;

  %let batref = %trim(&path)\%trim(dump.bat);

  filename batref "&batref";
  

  %if %sysfunc(fexist(batref)) %then %do;
    data _null_;
      rc = fdelete("batref");
    run;
  %end;

  /* Write a one-line batch file that calls sqlite3 to dump the
     table as SQL text. */
  data _null_;
    file batref;
    cmd = "sqlite3 &fileref "||'"'||".dump &table"||'"'||" > &dumpref";
    put cmd;
  run;
  
  /* Run the batch file; SYSTASK requires the XCMD system option. */
  systask command "&batref" wait;
      
%end;
%else %goto exit;
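
/* For reference, a SQLite .dump of a table looks roughly like this
   (a hypothetical sketch, not actual output from test.db):

     BEGIN TRANSACTION;
     CREATE TABLE tblco (Plant TEXT, Type TEXT, conc REAL, uptake REAL);
     INSERT INTO "tblco" VALUES('Qn1','Quebec',95.0,16.0);
     ...
     COMMIT;

   The DATA step below keeps the CREATE TABLE lines (skipping the
   BEGIN TRANSACTION line) and stops at the closing ');'; a later
   step replays the INSERT lines through PROC SQL. */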

data create;
  infile dumpref;
  input;
  create = upcase(_infile_);
  if _n_ > 1 then output;                   /* skip BEGIN TRANSACTION; */
  if substr(create, 1, 2) = ');' then stop; /* end of CREATE TABLE */
run;

/* Rebuild the CREATE TABLE statement, mapping TEXT columns to
   CHAR(&chlen.), and execute it through PROC SQL. */
proc sql noprint;
select
  tranwrd(create, "TEXT", "CHAR(&chlen.)")
into
  :create separated by ' '
from
  create;

&create;
quit;

proc datasets library = work nolist;
  delete create(mt = data);
run;
quit;

/* Replay each INSERT statement from the dump through PROC SQL
   via CALL EXECUTE. */
data _null_;
  infile dumpref end = eof;
  input;
  if _n_ = 1 then do;
    call execute("PROC SQL;");
  end;
  if index(upcase(_infile_), 'INSERT INTO') > 0 then do;
    pos1 = index(_infile_, '(');
    pos2 = index(_infile_, ');');
    insert = substr(_infile_, pos1, pos2 - pos1 + 2);
    call execute("INSERT INTO &TABLE VALUES"||insert);
  end;
  if eof then do;
    call execute("QUIT;");
  end;
run;

data _null_;
  rc = fdelete("batref");
run;

data _null_;
  rc = fdelete("dumpref");
run;

proc contents data = &table varnum;
run;

%exit:
    
%mend sqlite;    

%sqlite(path = D:\sas, dbfile = test.db, table = tblco);
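
A quick way to eyeball the result once the macro finishes (assuming the import above succeeded and created WORK.TBLCO):

proc print data=tblco(obs=5);
run;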
January 1, 2010
 
################################################
# EXPORT DATA FRAME INTO SQLITE DATABASE       #
# USING SQLDF PACKAGE(code.google.com/p/sqldf) #
################################################

library(sqldf)
library(datasets)
data(CO2)

### SET WORKING DIRECTORY ###
setwd('d:\\r')

### TEST IF THE FILE EXISTS ###
if(file.exists('test.db')) file.remove('test.db')

### CREATE A NEW EMPTY SQLITE DATABASE ###
sqldf("attach 'test.db' as new")

### CREATE A NEW TABLE FROM DATA FRAME ###
sqldf("create table tblco as select * from CO2", dbname = 'test.db')

### SHOW THE DATA IN SQLITE DATABASE ###
sqldf("select * from tblco limit 5", dbname = 'test.db')
#  Plant   Type  Treatment conc uptake
#1   Qn1 Quebec nonchilled   95   16.0
#2   Qn1 Quebec nonchilled  175   30.4
#3   Qn1 Quebec nonchilled  250   34.8
#4   Qn1 Quebec nonchilled  350   37.2
#5   Qn1 Quebec nonchilled  500   35.3
December 26, 2009
 
###############################
# CALCULATE SUMMARY BY GROUPS #
###############################

data(iris)

### WITH AGGREGATE() FUNCTION ###
sum1 <- aggregate(iris[-5], iris[5], mean)
sum1
#     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#1     setosa        5.006       3.428        1.462       0.246
#2 versicolor        5.936       2.770        4.260       1.326
#3  virginica        6.588       2.974        5.552       2.026

### WITH BY() FUNCTION ###
tmp2 <- by(iris[-5], iris[5], mean)
sum2 <- cbind(expand.grid(dimnames(tmp2)), do.call(rbind, tmp2))
rownames(sum2) <- NULL
sum2              
#     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#1     setosa        5.006       3.428        1.462       0.246
#2 versicolor        5.936       2.770        4.260       1.326
#3  virginica        6.588       2.974        5.552       2.026

### WITH RESHAPE PACKAGE ###
library(reshape)
tmp3 <- melt(iris)
sum3 <- cast(tmp3, Species ~ variable, mean)
sum3
#     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#1     setosa        5.006       3.428        1.462       0.246
#2 versicolor        5.936       2.770        4.260       1.326
#3  virginica        6.588       2.974        5.552       2.026

### WITH SQLDF PACKAGE ###
library(sqldf)
sum4 <- sqldf("select
                 Species,
                 avg(Sepal_Length) as Sepal_Length,
                 avg(Sepal_Width)  as Sepal_Width,
                 avg(Petal_Length) as Petal_Length,
                 avg(Petal_Width)  as Petal_Width
               from
                 iris group by Species");
sum4
#     Species Sepal_Length Sepal_Width Petal_Length Petal_Width
#1     setosa        5.006       3.428        1.462       0.246
#2 versicolor        5.936       2.770        4.260       1.326
#3  virginica        6.588       2.974        5.552       2.026

### WITH LAPPLY() FUNCTION ###
tmp5 <- lapply(split(iris[-5], iris[5]), mean)
sum5 <- data.frame(Species = names(tmp5), do.call(rbind, tmp5))
rownames(sum5) <- NULL
sum5 
#     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#1     setosa        5.006       3.428        1.462       0.246
#2 versicolor        5.936       2.770        4.260       1.326
#3  virginica        6.588       2.974        5.552       2.026