March 4, 2010
 
As Dr. Goodnight explains, SAS Global Forum (formerly known as SUGI) was started by SAS users, who held their first conference in 1976, several months before SAS was even incorporated as a company. The successful tradition of a user-run conference, guided by an Executive Board of SAS users, continues today, 35 years later.



If you haven’t registered already, be sure to do so soon. Early registration ends March 8! Register now, and you can save up to $250 off your registration fee. There are additional ways to save as well. Fees will increase on March 9 for all Registration Packages.

This year will be my 14th conference, and I totally agree with how Dr. Goodnight sums things up in his invite video. “[SAS Global Forum is a...] huge collaborative forum where problems are solved, techniques are shared, and life-long friendships are made.”
March 4, 2010
 
While at SAS, I hope to get the opportunity to talk with all of the SAS icons. I’ve had the privilege to meet many. Before the SAS winter break, I had the privilege to talk with another: Michael Raithel.

At SAS® Global Forum 2010, Raithel will be presenting the Tuesday lunch feature presentation, “It’s not easy being a SAS programmer.” So, I thought you might like to know a little more about who he is and what he does.

Who he is
Raithel is a SAS programmer, former NESUG conference chair, former SUGI section chair, three-time SAS author and a Senior Systems Analyst at Westat. Senior Systems Analyst … how many Senior Systems Analysts have you met? When I was at NESUG (my first regional), I met many. Raithel’s answer to my quip said it all, though. “If they called me ‘Senior Dog Catcher,’ it wouldn’t matter. I have a wonderful, fulfilling job here. This is an excellent place to work. There is relatively low turnover and very high job satisfaction; a very professional and serious, yet respectful, work environment.”

So far, he seems pretty much rank and file with the other SAS icons, doesn’t he? Raithel also has at least one other cool point: His first book, Tuning SAS Applications in the MVS Environment, is now part of the Smithsonian Institution’s Museum of American History Information Technology Collection. Well now, that’s the stuff of legends. Wouldn’t you agree?

SAS crunchers and munchers
Westat is an employee-owned corporation that provides contract research services to businesses, foundations, US government agencies, and state and local governments. A typical example of the work Westat does is hosting, managing and cleaning data for a state cancer registry. “SAS is an integral part of the software we use to manage the data,” said Raithel. “Most of the other hundreds of research projects conducted for clients each year by Westat use SAS products.”

“We’re basically a meat and potatoes kind of SAS shop,” said Raithel. “By that, I mean we have 18 SAS products that are the foundation of much of our work. We have deployed four SAS/ACCESS® interfaces, SAS/CONNECT®, SAS/GRAPH®, SAS/Genetics™, SAS/IML®, SAS/IntrNet® and SAS/STAT®, to name a few. We are also looking to expand our use of SAS to include SAS® Data Quality.”

Westat may consider itself a “meat and potatoes SAS shop,” but the company is proving that treating people well leads to success. Success is promoted and encouraged by a support infrastructure that includes an in-house technical support unit and Westat-written SAS resources.

“On our intranet, we have SAS Resources Web pages that give our users one-stop shopping to learn what SAS is all about at Westat, including links to contact information for the people in my group, Westat’s SAS technical support,” said Raithel. “The pages include instructions for loading SAS on a desktop, SAS resources and Westat conference papers, documentation for the SAS products we use, the methods we use for validating hot fixes and even the date and location of the next Westat SAS Users Group meeting.”

According to Raithel, Westat’s SAS technical support department, which he heads, helps to ensure that SAS Institute's Technical Support doesn’t receive a flood of calls each day from Westat. Raithel and his two staff members “answer everything from, ‘I can’t open SAS today,’ to ‘What is happening with SAS 9.2?’ and ‘This report looks kind of funny.’”

When Raithel and his staff have questions they can’t answer, they usher them through the official SAS Technical Support process. “When there is a satisfactory result, and we believe that question is going to affect more users, we post it to the in-house listserv – SAS Outlook Information Forum,” he said. “SAS users at Westat also post questions and answers on the listserv.”

Success isn’t guaranteed, it’s mastered
Each year, Westat purchases SAS training EPTO units. “We buy enough EPTO units per year to allow us to have custom-tailored classes taught in our facilities,” said Raithel. “SAS Education has been very cooperative during the past three years in listening to our training needs and modifying some of the existing SAS classes to suit those needs.”

Westat also relies on a cadre of in-house SAS experts to teach beginner and intermediate SAS classes. “Once a year we have a SAS Institute speaker present at our Westat SAS user group meeting, and we invite speakers from outside organizations, such as the Federal Reserve Board and the Bureau of Labor Statistics,” said Raithel.

The education that Westat SAS users receive is not all piped in. Many Westat SAS users are experts in their own right. One former SUGI conference chair, two NESUG conference chairs and one SESUG conference chair came from Westat’s staff. At Westat, SAS conference participation is indispensable. “We send a lot of people to SAS user group meetings,” said Raithel. “During the past 13 years, Westat staff has published 200 SAS conference papers.

“We feel that conference experience is invaluable,” he said. “When our staff attends a SAS conference, they pick up tips from the presentations and a lot of good information comes back to Westat.

“After every conference, those who attend are on the hook for a 5- to 10-minute presentation of the best papers and new ideas. We know that when we send people they’re going to enrich themselves by gaining new SAS knowledge, and that knowledge is going to come back and be available to other Westat staff.”

You can find some of Michael Raithel’s books, interviews, tips and past presentations on support.sas.com. You can meet Raithel at SAS Global Forum 2010 at the authors’ reception in the SAS Publishing demo area on Monday evening from 6 p.m. until 7:30 p.m.

Add “It’s not easy being a SAS programmer” to your SAS Global Forum agenda. This SAS Global Forum Tuesday lunchtime presentation by Raithel will take a lighthearted look at some of the societal, workplace and industry issues that SAS programmers routinely face. *An extra fee event.
March 3, 2010
 
Q: How do I refer to SAS in a scientific paper?
A: Always write the name "SAS" in uppercase letters with no periods.

As you might guess, there are more guidelines and specifics about how to refer to SAS and its products and services. The complete list of guidelines is available on our corporate Web site under the Press Center in Editorial Guidelines.

I have selected a few items from the guidelines that may more accurately address the question that was intended.

  • Do not use a trademark symbol after the word SAS when referring to SAS Institute Inc.


  • Include the SAS trademark notice shown below in a footnote at the bottom of the page or the end of the article.

    SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.


  • Scientific journals often require enough information to replicate your data findings. In these cases, the proper citation would be as follows (brackets "[]" indicate data that should be supplied by you):

    The [output/code/data analysis] for this paper was generated using [SAS/STAT] software, Version [8] of the SAS System for [Unix]. Copyright © [year of copyright] SAS Institute Inc. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc., Cary, NC, USA.


Remember that I have provided only a partial list. Visit the Editorial Guidelines for full details.

February 23, 2010
 
Contributed by Scott Vodicka, a member of the SAS Global Consulting Business Intelligence Practice

An interesting topic came up the other day in one of my email conversations. What was the topic of the email? Well, I am glad you asked. It was: How do we send text messages from SAS to someone's cell phone?

It is so easy that you will not even believe it! As a lot of us know, SAS supports sending emails from a DATA step program, and the publish and subscribe model supports delivering information via email. The quick answer is to use the email address of the person's phone in the TO: field for the DATA step, or as the email address used to subscribe to a channel.

So, now that the easy part is done, let me give you some information that will help you solve the problem I know you are now facing: How do I determine the user's email address for their phone? There are a couple of tricks that make this really easy.

Below is a short list of popular carriers with their associated domains:
  AT&T:          @txt.att.net
  Alltel:           @message.alltel.com
  Sprint:           @messaging.sprintpcs.com
  T-Mobile:         @tmomail.net
  Verizon:          @vtext.com
  Virgin Mobile:     @vmobl.com


Check the carriers' web sites for updated information, especially since this sector is constantly in flux.
If your phone number is (919) 555-1234 (not a valid number) and your carrier is AT&T, then the email address for your phone is 9195551234@txt.att.net.
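
As a small sketch of the rule just described, the DATA step below builds the address from a formatted phone number and a carrier domain. The variable names are chosen here for illustration only; the phone number and domain are the ones from the example above.

```sas
/* Build a phone's email address from a formatted number and a carrier domain. */
/* Variable names (phone, domain, address) are illustrative assumptions.       */
data _null_;
   length phone $ 20 domain $ 40 address $ 64;
   phone   = '(919) 555-1234';      /* not a valid number              */
   domain  = 'txt.att.net';         /* AT&T domain from the list above */
   /* COMPRESS with the 'kd' modifiers keeps only the digits */
   address = cats(compress(phone, , 'kd'), '@', domain);
   put address=;                    /* writes the address to the SAS log */
run;
```

The same address string can then be used wherever a regular email address is accepted.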

Another way to find out your phone's address is to send a text message from your phone to your SAS email account. The email address for your phone appears in the From field (the message might go to your junk folder).

Here is a sample DATA step program that sends an alert as a text message to a user's cell phone. Before you submit the program below, be sure to update your SAS configuration file for the following options:
EMAILSYS  EMAILHOST  EMAILID  

You may have to supply the EMAILPW value depending on how your SMTP server is configured. You may specify the EMAILID and EMAILPW values in your SAS program via an options statement. See SAS documentation for details.
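
As a hedged illustration of that last point, the credentials can be supplied at the top of the program. The host name, account name and password shown here are placeholders, not values from the original post. EMAILSYS and EMAILHOST are typically set in the configuration file or at SAS invocation, so they appear only as a comment.

```sas
/* In the SAS configuration file (hypothetical host name): */
/*   -emailsys SMTP                                        */
/*   -emailhost smtp.example.com                           */

/* Placeholder credentials for the SMTP server; replace with your own */
options emailid="sas.alerts@example.com" emailpw="replace-me";
```

Whether EMAILPW is needed at all depends on how the SMTP server authenticates senders, so check with your mail administrator first.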

%let email_alert="9195551234@txt.att.net";
%let email_from="SAS Monitoring System ";

%let errormsg=Server failed to start;
%let host=metadata_server;

filename em_out email to=(&email_alert) from=(&email_from)
    subject="Alert: SAS 9.2 Critical Status Error";
data _null_;
  file em_out;
  format timechar $21.;
  timechar=put(datetime(),datetime21.2);
  put timechar "ERROR : &sysuserid.@&host - &errormsg..";
run;


Important Note: Many people are charged per text message. Be sure to get permission before sending a text message to an individual.
February 22, 2010
 
Last month I pointed you towards the conference t-shirt contest on sasCommunity.org. The designs and coding techniques on display were quite creative, and the contest triggered some fun conversation among community members. Now the votes have been cast.

And the winner is … self-proclaimed “new kid” Lynne Krajevski. I loved what she had to say about her unique design: “SAS is enormous power at my fingertips, always within my reach - this is what I was trying to convey with my design.” Way to go, Lynne!



How can you get one of these limited-quantity t-shirts? Come to SAS Global Forum 2010 in Seattle to find out!
February 18, 2010
 


In this post, I present an improved SAS macro for the single partition split algorithm in Chapter 2 of "Pharmaceutical Statistics Using SAS: A Practical Guide" by Alex Dmitrienko, Christy Chuang-Stein and Ralph B. D'Agostino.

The single partition split algorithm is a simplified version of a stump. It is a weak classifier, usually used as the base learner for boosting algorithms. For each independent variable, this classifier seeks a cut point that separates the space into two subspaces, each with higher purity of the response classes, say 1 and 0. In the examples, the Gini index is used to measure purity/impurity.
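
As a sketch of that purity measure, the Gini index of a candidate cut point c on one variable can be written as follows. The notation is introduced here for illustration and is not taken from the book; the expression mirrors the unweighted computation in the %Dsplit macro in this post.

```latex
G(c) = 2\left[\frac{n_L}{n}\,p_L\,(1-p_L) + \frac{n_R}{n}\,p_R\,(1-p_R)\right]
```

Here n_L is the number of observations with x <= c, n_R = n - n_L, and p_L and p_R are the proportions of responses equal to 1 on the left and right sides. The selected split is the cut point with the smallest G(c), and a perfectly pure split gives G(c) = 0.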

The SAS example macro %SPLIT in the book (found here) is for illustration purposes, and is too inefficient to be used at industrial scale.

I modified this macro to make it usable in real business applications, where millions of observations and hundreds of variables are common.

Improved Code:



%macro Dsplit(dsn, p);
/************************************************/
/* dsn: Name of input SAS data sets. All        */
/*        independent variables should be named */
/*        as X1, X2,....,Xp and be continuous   */
/*        numeric variables                     */
/*   p: Number of independent variables         */
/************************************************/
options nonotes;
%do i=1 %to &p;
proc sort data=&dsn.(keep=y w x&i)  out=work.sort; by x&i;
run;

proc means data=work.sort noprint;
     var y;
     weight w;
     output out=_ysum  n(y)=ntotal  sum(y)=ysum;
run;
data _null_;
     set _ysum;
     call symput('ntotal', ntotal);
     call symput('ysum', ysum); 
run;
data y_pred;
     set work.sort  end=eof;
     array _p[&ntotal, 2] _temporary_;
     array _g[&ntotal, 2] _temporary_;
     array _x[&ntotal]    _temporary_;
     retain _y1  _oldx 0;
     if _n_=1 then _oldx=x&i;
     _x[_n_]=x&i;
     if ^eof then do;
        _y1+y; 
       _p[_n_, 1]=_y1/_n_; _p[_n_, 2]=(&ysum-_y1)/(&ntotal-_n_);
       ppn1=_n_/&ntotal;  ppn2=1-ppn1;
       _g[_n_, 1]=2*(ppn1*(1-_p[_n_, 1])*_p[_n_, 1]+ppn2*(1-_p[_n_, 2])*_p[_n_, 2]);
       if _n_>1 then _g[_n_-1, 2]=(_oldx+x&i)/2;
       _oldx=x&i;
     end;
     else do;
       _g[_n_-1, 2]=(_oldx+x&i)/2;
       ginimin=2;
       /* scan all candidate cut points and keep the one with the lowest Gini index */
       do i=1 to &ntotal-1;
          gini=_g[i, 1]; x&i=_g[i, 2];
          keep gini x&i;
          output;
          if gini lt ginimin then do;
             ginimin=gini; xmin=x&i;
             p1_LH=_P[i, 1]; p0_LH=1-p1_LH;
             p1_RH=_P[i, 2]; p0_RH=1-p1_RH;
             c_L=(p1_LH>0.5); c_R=(p1_RH>0.5);
          end;
       end;
       /* assign a predicted class to each stored x value using the selected cut point */
       do i=1 to &ntotal;
          if _x[i]<=xmin then y_pred=c_L;
          if _x[i]>xmin  then y_pred=c_R;
          keep y_pred y w; output y_pred;
       end;
       call symput('ginimin', ginimin);
       call symput('xmin', xmin);
     end;
run;
data _giniout&i;
     length varname $ 8;
     varname="x&i";
     cutoff=&xmin;
     gini=&ginimin;
run;
%end;
data outsplit;
     set %do i=1 %to &p;
            _giniout&i
         %end;;
run;
proc datasets library=work nolist;
     delete _giniout:;
quit;
option notes;    
%mend;

/* weak classifiers for Boost Algorithm */
%macro gini(_y0wsum, _y1wsum, i, nobs);
data _giniout&i.(keep=varname  mingini cut_val   p0_LH  p1_LH  c_L  p0_RH  p1_RH  c_R);
     length varname $ 8;
     set sorted  end=eof;
     retain  _y0w  _y1w  _w  _ginik  0;
     retain  p0_LH  p1_LH  p0_RH  p1_RH  c_L  c_R  0;
     /* _mingini: [1] lowest Gini so far, [2] its row number, */
     /*           [3] x at that row, [4] x at the next row    */
     array _mingini{4}  _temporary_;
     if _n_=1 then do;
        _y0w = (y^=1)*w;  _y1w = (y=1)*w;  _w = w;
        _mingini[1] = 2;
        _mingini[2] = 1;
        _mingini[3] = x&i;
        _mingini[4] = x&i;
     end;
     else do;
        _y0w + (y^=1)*w;  _y1w + (y=1)*w;  _w + w;
     end;

     /* weighted Gini index of the split placed after the current row */
     if ^eof then do;
        p0_L = _y0w/_w;  p0_R = (&_y0wsum - _y0w)/(1-_w);
        p1_L = _y1w/_w;  p1_R = (&_y1wsum - _y1w)/(1-_w);
        _ginik= p1_L*p0_L*_w + p1_R*p0_R*(1-_w);
     end;

     if _ginik<_mingini[1] then do;
        _mingini[1]=_ginik;  _mingini[2]=_n_;  _mingini[3]=x&i;
        p0_LH=p0_L;  p1_LH=p1_L;  p0_RH=p0_R;  p1_RH=p1_R;
        c_L = (p1_LH > 0.5);  c_R = (p1_RH > 0.5);
     end;
     if _n_=(_mingini[2]+1) then _mingini[4]=x&i;

     /* the cut point is the midpoint between the two rows around the best split */
     if eof then do;
        cut_val=(_mingini[3]+_mingini[4])/2;
        mingini=_mingini[1];
        varname="x&i";
        output;
     end;
run;
%mend;

%macro stump_gini(dsn, p, outdsn);
/***************************************************/
/*    dsn: Name of input SAS data sets. All        */
/*          independent variables should be named  */
/*          as X1, X2,....,Xp and be continuous    */
/*          numeric variables                      */
/*      p: Number of independent variables         */
/* outdsn: Name of output SAS data sets. Used for  */
/*          Subsequent scoring. Not to be named as */
/*          _giniout.....                          */
/***************************************************/
%local i  p  ;

%do i=1 %to &p;
    proc sort data=&dsn.(keep=x&i  y  w)  out=sorted  sortsize=max;
      by x&i;
 run;
 data sortedv/view=sortedv;
      set sorted;
   y1=(y=1); y0=(y^=1);
 run;
 proc means data=sortedv(keep=y0 y1 w)   noprint;
      var y0  y1;
   weight w;
   output out=_ywsum(keep=_y0wsum  _y1wsum  _FREQ_)  
                sum(y0)=_y0wsum  sum(y1)=_y1wsum;
 run;
    data _null_;
      set _ywsum;
   call execute('%gini('|| compress(_y0wsum) || ','
                        || compress(_y1wsum) || ','
                              || compress(&i)      || ','
                              || compress(_FREQ_)  || ')'
                      );
 run;
%end;
data &outdsn;
     set %do i=1 %to &p;
         _giniout&i
   %end;;
run;
proc sort data=&outdsn; by mingini; run;
proc datasets library=work nolist;
     delete %do i=1 %to &p;
            _giniout&i
   %end;;
run;quit;
%mend;


%macro css(_ywsum, i, nobs);
data _regout&i.(keep=varname  mincss cut_val  ypred_L  ypred_R);
     length varname $ 8;
     set sorted  end=eof;
     retain _yw  _w  0;
     retain  ypred_L  ypred_R 0;
     array _mincss{4}  _temporary_;
     if _n_=1 then do;
        _yw = y*w;  _w = w;
        _mincss[1] = constant('BIG');
        _mincss[2] = 1;
        _mincss[3] = x&i;
        _mincss[4] = x&i;
        ypred_L = _yw/_w;  ypred_R = (&_ywsum-_yw)/(1-_w);
     end;
     else do;
        _yw + y*w;  _w + w;
     end;
     if ^eof then do;
        cssk = 1 - _yw/_w*_yw - (&_ywsum-_yw)/(1-_w)*(&_ywsum-_yw);
     end;
     else do;
        cssk = 1 - _yw**2;
     end;
     if cssk<_mincss[1] then do;
        _mincss[1]=cssk;  _mincss[2]=_n_;  _mincss[3]=x&i;
        ypred_L=_yw/_w;  ypred_R=(&_ywsum-_yw)/(1-_w);
     end;
     if _n_=(_mincss[2]+1) then _mincss[4]=x&i;

     if eof then do;
        cut_val=(_mincss[3]+_mincss[4])/2;
        mincss=_mincss[1];
        varname="x&i";
        output;
     end;
run;
%mend;

%macro stump_css(dsn, p, outdsn);
/***************************************************/
/*    dsn: Name of input SAS data sets. All        */
/*          independent variables should be named  */
/*          as X1, X2,....,Xp and be continuous    */
/*          numeric variables                      */
/*      p: Number of independent variables         */
/* outdsn: Name of output SAS data sets. Used for  */
/*          Subsequent scoring. Not to be named as */
/*          _regout.....                           */
/***************************************************/
%local i  p  ;
options nosource;
%do i=1 %to &p;
    proc sort data=&dsn.(keep=x&i  y  w)  out=sorted  sortsize=max;
      by x&i;
 run;
 proc means data=sorted(keep=y w)   noprint;
      var y;
   weight w;
   output out=_ywsum(keep=_ywsum  _FREQ_)  sum(y)=_ywsum;
 run;
    data _null_;
      set _ywsum;
   call execute('%css(' || compress(_ywsum) || ','
                              || compress(&i)     || ','
                              || compress(_FREQ_) || ')'
                      );
 run;
%end;
data &outdsn;
     set %do i=1 %to &p;
         _regout&i
   %end;;
run;
proc sort data=&outdsn; by mincss; run;
options source;
%mend;


Comparing the time used and results.

The example data is the AUC Small training data (200 sets of predictors for 15,000 ratings) from the AusDM2009 competition, and can be found here.

Using the example code from the book, it takes 1563 seconds on a regular Windows desktop (Core2Duo E6750 2.67GHz, 4GB memory, 7,200 rpm HDD with 8MB cache) to process 5 continuous numeric variables, whereas the improved macro takes only 1 second to process the same amount of data. This improvement is critical because weak classifiers like this one won't be used alone, but as the base for more time-consuming boosting algorithms. The original macro is practically unusable for any boosting algorithm on data sets with hundreds of observations or more.

Comparing the results from the original macro and the improved macro, both select X3 with the same partition cutoff point and the same Gini index.
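
To try the improved macro and time it on your own machine, a minimal sketch follows. It assumes the macros in this post have already been compiled. The data set TRAIN, its simulated values and the output name STUMP_OUT are hypothetical and are not the AusDM2009 data. The weights are set to sum to 1 because %gini divides by (1-_w).

```sas
/* Hypothetical training data: two predictors, a binary response, */
/* and equal weights that sum to 1                                */
data train;
   call streaminit(2010);
   do k = 1 to 1000;
      x1 = rand('uniform');
      x2 = rand('uniform');
      y  = (x1 > 0.6);
      w  = 1/1000;
      output;
   end;
   drop k;
run;

/* Record the start time, run the stump search, and report elapsed seconds */
%let t0 = %sysfunc(datetime());
%stump_gini(train, 2, stump_out);
%put NOTE: elapsed seconds = %sysevalf(%sysfunc(datetime()) - &t0);

/* One row per variable, sorted by minimum Gini index */
proc print data=stump_out;
run;
```

Timings will vary with hardware and data size, so treat any single run as a rough indication only.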

Reference:
Pharmaceutical Statistics Using SAS: A Practical Guide by Alex Dmitrienko, Christy Chuang-Stein, Ralph B. D'Agostino, SAS Publishing 2007

Posted at 3:50 AM

An efficient macro for Stump – two terminal nodes tree

Tags: Array, Boost Algorithms, Data Mining, Gini Index, predictive modeling
February 17, 2010
 
Today our official press release hit the business wire! In this announcement we launched four new product pages featuring our offerings that sprouted from the power and expertise of the Teragram technologies acquired in 2008. I encourage you to take some time to click through the links in the press release, check out the fact sheets, and explore our web sites.

Fiona McNeill, our Global Text Analytics Product Marketing Manager, expresses pride in our SAS users as she says, “Those involved in enterprise content, Web content, document, records and knowledge management, enterprise search, article research and online marketing measurement can easily access and reuse information with SAS Text Analytics.”

Two new customer case studies are posted on our success story site showing innovative applications of text analytics by organizations who are reaping great value from their unstructured data.

• An Italian company takes advantage of Web 2.0 interactivity and social networking tools to implement an online lending model that pairs borrowers and investors without intervention from traditional institutions. This new credit scoring process goes beyond the quantitative variables collected from past history and standardized risk categories. It incorporates qualitative evaluations gathered from written descriptions of the projects and business plans to make better decisions about credit risk or creditworthiness.

• A Hong Kong government office, faced with the challenge of processing large volumes of structured and unstructured text in traditional Chinese, simplified Chinese and English, not only found a way to meet the challenge but is now doing so quickly and accurately. With SAS software decoding the ‘messages’ and supporting statistical and root-cause analyses of data collected in its call centers, the government now better understands the voice of the people as it develops strategies for improved service, boosting public satisfaction with the government.

The program for the April 2010 SAS user conference is now online - and I see several interesting text analytics projects listed there. The buzz is growing and I thank you, our readers, for finding new ways to apply the technology!
February 16, 2010
 
Q: A customer wrote, "What, no stored process samples?"
A: We do have samples and notes about writing and using stored processes. We do not have a browse topic for stored processes. Therefore, to find samples about creating and maintaining stored processes, you will have to use the search feature.

  1. Go to support.sas.com

  2. Select Samples & SAS Notes from the left navigation. The sub-items for this section will be displayed.

  3. Select Search SAS Samples from the left navigation.

  4. Type stored processes in the search box next to contenttype:"Sample".
    The result is 98 items.


As you know, there is always more than one way to accomplish a task. A more efficient way to get the same results is:

  1. Go to support.sas.com

  2. Type stored processes in the search box and select Samples & SAS Notes from the drop list.

  3. Select Search.

  4. Select Samples from the Type drop list found in the Filter Results By: area of the page.


Hint: If you know exactly what you hope to find, you may want to give Advanced Search a try.

If you don't find the examples that you were hoping for, visit the SAS Stored Processes forum and ask for references or specific help.