April 27, 2010
 
On a recent trip to SAS, Paul Greenberg shared his thoughts and ideas about marketing and social media. He pointed out that the rise of social media is not simply a change in how we do business; it signals a deeper transformation in how humans communicate online.

Paul cited examples, some unfolding in real time, of companies that have suffered the negative impacts of this communication transformation. In each example, something was going wrong. And in each case, the “going wrong” quickly escalated into a proverbial “PR nightmare” when the company reacted either too slowly or in a way that was out of alignment with expectations, as might happen if a company that has built its brand as a quality leader reacts obstinately to a quality problem.

This is a lesson so basic it's way before marketing 101: pay attention, or else. Communicating a brand promise is meaningless if you fail to deliver on that promise. Yet the people communicating and the people delivering live in different worlds, or at least different organizations. Effective brand strategists know that aligning what you say with what you do is Job 1. Marketing, with all its knowledge about customer needs and wants, has to be in step with operations, with all its knowledge of how and when to get things done.

Social media has a way of seeking out and magnifying any saying/doing contradictions, so internal alignment is more important than ever. If you say you’re a quality leader but act like a value leader, social media will find you. Or, more precisely, it will enable crowds of current and potential customers to highlight the contradiction quickly and very publicly. Of course this is the beginning of a downward spiral leading to apologies, recalls, or worse. So, what to do?

Focus outside and inside. Listen to everyone, on every (relevant) social media network, every day. Then share the meaningful bits of what you hear across the company, with everyone who touches a customer and everyone who could benefit from a better understanding of customer sentiment. Then act: immediately, as best you can, in accordance with your values as a company.

Additional thoughts about branding can be found in a great Editor’s Letter in the May 2010 issue of Vanity Fair. The letter, written by VF’s editor Graydon Carter, is titled “And the Brands Played On.” In it, Carter examines how both corporate and personal brands have suffered of late. He doesn’t explicitly mention social media, but consider the points he raises in light of Paul Greenberg’s observation that social media signals a change in how we communicate. We’ve all come to expect the ability to interact, and increasingly we’re coming to expect transparency, accountability, and relevance from the brands we relate to.

And you – are you paying attention? Do you agree? What do you think?
April 27, 2010
 


SAS allows character strings up to 32,767 characters long, but when you assign a quoted string longer than 262 characters to a variable, SAS writes this message to the log: ‘WARNING: The quoted string currently being processed has become more than 262 characters long. You may have unbalanced quotation marks.’ It is tedious to comb back through the SAS code searching for unbalanced quotes that may not exist. To make this clearer, here is an example: I assign a 263-character string to a variable in a simple DATA step, and the WARNING appears in the log.

data TEST;
x="(SEE DOCTOR'S LETTER)3RD ADMINISTRATION OF MTX WAS DELAYED BY 14 DAYS AND WAS REDUCED TO 1G/M2 INSTEAD OF 5G/M2, PROBLEMS, E.COLI SEPSIS WITH HEART INSUFFICIENCY WITH SINUS TACHYCARDY, PARALYTIC ILEUS, TACHYPNEA , PATIENT DIED ON 21.04.98 FROM MULTIORGAN FAILURE.";
y=length(x);
put x;
run;

LOG FILE: There is a SAS option (NOQUOTELENMAX)...
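The summary is truncated at the mention of NOQUOTELENMAX, but since the post points at that system option, here is a minimal sketch of how it is typically used to suppress the warning (the long literal is abbreviated here; substitute the 263-character string from the example above):

options noquotelenmax;  /* suppress the 262-character quoted-string warning */

data TEST;
x="...the same 263-character string as above...";
y=length(x);
put y=;
run;

options quotelenmax;    /* restore the default checking afterwards */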

 Posted at 8:25 AM
April 26, 2010
 
1) How can you use SAS to merge a base table with a look-up table, and what are the pros and cons of each method?
1. array; 2. sort-sort-merge; 3. PROC SQL; 4. PROC FORMAT; 5. hash object. (A minimal hash-object sketch follows the rankings below.)

Coding efficiency: 3 > 2 > 4 > 1 > 5
I/O resource usage: 5 > 4 > 1 > 3 > 2
Flexibility: 3 >> the others
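To make method 5 concrete, here is a minimal, self-contained sketch of a hash-object look-up merge; the data set and variable names (base, lookup, id, descr) are hypothetical:

data base;
    input id x;
cards;
1 10
2 20
3 30
;
run;

data lookup;
    input id descr $;
cards;
1 apple
3 pear
;
run;

data merged;
    length descr $8;
    /* load the look-up table into a hash object once, on the first iteration */
    if _n_ = 1 then do;
        declare hash lk(dataset: "lookup");
        lk.defineKey("id");
        lk.defineData("descr");
        lk.defineDone();
    end;
    call missing(descr);
    set base;
    rc = lk.find();   /* rc = 0 when the key is found; descr is filled in */
    drop rc;
run;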


2) What are the common methods for cleaning data in a large-scale database?

Use PROC SQL to reach the database through the SAS/ACCESS engine for your DBMS. Then use it to check for outlier, missing, invalid, and duplicate values, apply hard-coded corrections, and maintain integrity constraints. PROC SORT, MEANS, FREQ, UNIVARIATE, DATASETS, COMPARE, and RANK are also useful, together with SAS functions (dates, currency, regular expressions, common math and probability). Write macros to automate data imputation. If you prefer point-and-click, choose SAS DataFlux. A few representative checks are sketched below.
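For illustration, here is a minimal sketch of some of those checks on a hypothetical data set raw, with key subjid and numeric variable weight:

/* isolate duplicate keys into a separate data set for review */
proc sort data=raw out=clean nodupkey dupout=dups;
    by subjid;
run;

/* frequency tables expose invalid and missing category codes */
proc freq data=clean;
    tables sex / missing;
run;

/* distribution summaries flag likely outliers */
proc univariate data=clean;
    var weight;
run;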
April 25, 2010
 
"http://www.mitbbs.com/article_t/Statistics/31224273.html"
Derive the new variable in a DATA step --> reshape the data with PROC TRANSPOSE --> use PROC SQL to generate the report

data one;
input good P1 P2 P3 P4 P5 P6 P7 P8 P9 P10;
if sum(of good P1-P10)=11 then yes=1; /* all eleven flags are 1 */
else yes=0;
cards;
1 1 0 0 1 0 1 1 0 1 0
0 0 0 1 0 1 1 0 1 0 1
1 1 1 1 0 1 1 0 1 0 1
1 1 1 1 1 1 1 1 1 1 1
;
run;

proc sort data=one; by good yes; run;
Proc transpose data=one out=two name=P;
var P1-P10;
by good yes;
run;

Proc Sql ;
create table three as
select good, p, yes,
sum( col1 , col2) as count
from two
order by p, yes
;quit;

SAS vs. R

 Computers and Internet, COS, R, 统计之都 · Comments Off on SAS vs. R
April 22, 2010
 

I published a piece titled "Think SAS" on 统计之都 (COS, the "Capital of Statistics"), prompted by R's popularity on campus, especially in statistics departments. It was written mainly for students interested in industry, to encourage them, while learning R, to also consider SAS. Unexpectedly, it set off a lengthy R-versus-SAS debate. As for the capabilities of the software itself, whether R or SAS, there is really no issue: any language that supports conditional statements and loops, plus a handful of other minimal features, can accomplish almost any task (language completeness is actually a very low bar). So debating whether R or SAS is more powerful is not very meaningful; it is largely a matter of personal preference. The comments under the COS post are mostly excellent, some even better than the article itself, and readers are encouraged to go take a look. This will become a series, published on COS gradually as I write it.

Also, on R versus SAS, Peter Flom is writing his own series of comparison articles; see:

1. SAS v. R: Ease of learning

0. SAS vs. R: Introduction and request


Improve the Boost Macro from Prof. W. Rayens and Dr. K. Johnson

 Boost Algorithms, predictive modeling · Comments Off on Improve the Boost Macro from Prof. W. Rayens and Dr. K. Johnson
April 22, 2010
 


In Chapter 2 of the book "Pharmaceutical Statistics Using SAS: A Practical Guide" (SAS Press), Prof. W. Rayens and Dr. K. Johnson present their SAS implementation of boosting algorithms, including AdaBoost, RealBoost, GentleBoost, and LogitBoost. The original SAS macro can be found here.

Their macro uses PROC IML, and my colleagues prefer it because, unlike my DATA step implementation, it is not I/O-bound. Its one shortcoming is inefficiency, which is understandable because I believe it was written for demonstration rather than for serious applications. As written, the program performs many redundant calculations and inefficient matrix operations. As a result, the macro cannot handle a data set with more than 5,000 observations; on my PC [Intel E6750 2.66GHz, 7.4Gflops/core, 4GB memory], it took 1m41s to run 10 iterations on a data set with 2,000 observations and 10 numerical variables, consuming 158MB of memory. Note that the resource consumption increases quadratically with the number of observations: in another experiment with 4,000 observations, the macro consumed 752MB of memory and took 6m53s for 10 iterations on 10 variables. My colleagues asked whether I could improve it enough to be usable in industrial applications, where data sets with more than 10K observations and hundreds of features are commonplace.

In their implementation, an upper-triangular matrix and a lower-triangular matrix are used to obtain cumulative weighted sums from both directions of the sorted data, which drives memory consumption and computation time up at a quadratic rate. But PROC IML has a built-in function, CUSUM, that computes cumulative sums lightning fast. To replace the cumbersome matrix operations with CUSUM, we also need to pay attention to the change in dimensions. Since no huge triangular matrices are involved, RAM consumption also drops significantly, which in turn means we can process bigger data [see the experiments below]. I tested speed and memory consumption using AdaBoost on 2,000 observations and 10 variables, with 10 iterations, before and after the improvement; the benchmark figures follow the sketch below.
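To illustrate the contrast, here is a minimal sketch of my own (not the authors' code) that computes the same cumulative sums both ways: once with an explicitly built lower-triangular matrix of ones, which costs O(n^2) time and memory, and once with the built-in CUSUM function, which is O(n):

proc iml;
    w = {3, 1, 4, 1, 5};
    n = nrow(w);

    /* quadratic approach: multiply by a lower-triangular matrix of ones */
    L = j(n, n, 0);
    do i = 1 to n;
        L[i, 1:i] = 1;
    end;
    cs_matrix = L * w;

    /* linear approach: the built-in CUSUM function */
    cs_cusum = cusum(w);

    print cs_matrix cs_cusum;   /* identical columns: 3 4 8 9 14 */
quit;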

[Figures: log excerpts comparing the original and improved macros for the 2000-Obs and 4000-Obs runs.]
Comparing the results shown above: with 2,000 observations, the new macro used less than 0.3s and merely 1.1MB of memory, while the original used 102.17s and 158.3MB. With 4,000 observations, the original macro used 752MB and 6m53s; in comparison, the new macro used 0.5s and 1.9MB. Memory and time consumption are no longer O(n^2) but O(n), where n is the number of observations. Both macros produced almost identical results in the two experiments.


With 500K observations, the original macro was not able to proceed at all, but the improved one finished successfully in about 70s, taking up only 194MB of memory. [Figure: log excerpt for the 500K-observation run.]
With the much reduced processing time, on top of the boost algorithms, analysts are now able to apply more computationally intensive algorithms, such as Bagging.

I also improved the REGSPLIT_IML subroutine. My first pass cut time usage by 69% [1m21s vs. 25s on 2K records] compared with the original macro, but it still grew quadratically with the number of observations; memory consumption was much reduced and grew linearly with the size of the data. After carefully studying the calculations and formulas involved, I further decomposed the matrix operations into more efficient arithmetic, and time usage is now only 0.9% of the original macro's [329.84s vs. 2.59s on 4K records], growing only linearly with the number of records. We are therefore practically able to use GentleBoost and LogitBoost [preferred over AdaBoost/RealBoost in many cases] for predictive modeling projects.

[Figures: benchmark results after the first improvement and after the second improvement.]

Using the most recent version of the macro, on a PC with an E6320 1.86GHz (5.5Gflops/core) and 4GB of memory, we observe another 15% improvement over what is shown above. [Figure: benchmark of the most recent version.]
Replace the SPLIT_IML subroutine with the following code.


/*************************************************************************
This is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 or 3 of the License
(at your option).

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.
**************************************************************************/
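    /* split_iml: exhaustive search for the single best (variable, cut-point)
       split of x under a weighted Gini-style impurity.
       Inputs:  x (n x p numeric), y (n x 1 binary 0/1), w (n x 1 weights),
                out_type (1 = hard class labels, 2 = class probabilities).
       Outputs: g_info = variable || impurity || cut || left/right class
                probabilities and labels; y_pred = predictions on x. */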
    start split_iml(x,y,w,g_info,out_type, y_pred);
        n = nrow(x);
        p = ncol(x);
        gini_min = 2;
        gini_var = 0;
        gini_cut = 0;
        y_pred = repeat(0,n,1);
        wsum = sum(w);
        ywsum = sum(y#w);  
        ywsum1 = wsum - ywsum; 
        do j=1 to p;
            x_curr = x[,j]||y||w;
            b=x_curr;  
            x_curr[rank(x[,j]),]=b;  free b;
            x_sort = x_curr[,1]; 
            y_sort = x_curr[,2]; 
            w_sort = x_curr[,3];
            yw_sort=(y_sort#w_sort);
            yw_sort1=(w_sort - yw_sort);        
            yw_cusum=cusum(yw_sort[1:(n-1)]);    
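            /* lpwt below holds the cumulative weight to the left of each of
               the n-1 candidate cut points; it is clamped away from zero so
               the division that forms p1_L cannot blow up */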

            lpwt = cusum(w_sort[1:(n-1)]);
            lpwt = lpwt#(lpwt >= 2*CONSTANT('SMALL')) + 
                   (lpwt < 2*CONSTANT('SMALL'))*2*CONSTANT('SMALL');
    
            p1_L = yw_cusum # (1/lpwt);
            gini = yw_cusum # (1-p1_L);
                       
            rpwt = wsum - lpwt; 
            rpwt = rpwt#(rpwt >= 2*CONSTANT('SMALL')) + 
                   (rpwt < 2*CONSTANT('SMALL'))*2*CONSTANT('SMALL');
    
            yw_cusum = ywsum - yw_cusum;
            p1_R = yw_cusum # (1/rpwt);
            
            gini = gini + yw_cusum # (1-p1_R);

            free lpwt  rpwt  yw_cusum  yw_sort1;

            g_min=gini[><];  g_loc=gini[>:<];

            if g_min < gini_min then do;
                gini_min=g_min;
                gini_var = j;
                gini_cut = (x_sort[g_loc] + x_sort[g_loc+1]) / 2;
                p1_RH = p1_R[g_loc];
                p0_RH = 1-p1_R[g_loc];
                p1_LH = p1_L[g_loc];
                p0_LH = 1-p1_L[g_loc];

                c_R = 0;
                if p1_RH > 0.5 then c_R = 1;
                c_L = 0;
                if p1_LH > 0.5 then c_L = 1;
            end;
        end;
        g_info = gini_var||gini_min||gini_cut||p0_LH||p1_LH||c_L||p0_RH||p1_RH||c_R;
        if out_type = 1 then 
           y_pred = (x[, gini_var] <=gini_cut)*c_L + 
                    (x[, gini_var] > gini_cut) *c_R
        ;
       
        if out_type=2 then
           y_pred[, 1] =( x[, gini_var]<=gini_cut) * ( (c_L=0)*(1-p0_LH) + (c_L=1)*p1_LH) +
                        ( x[, gini_var] >  gini_cut) * ( (c_R=0)*(1-p0_RH) + (c_R=1)*p1_RH)
        ;
 
 
    finish split_iml;
Replace the REGSPLIT_IML subroutine with the following code.

/*************************************************************************
This is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 or 3 of the License
(at your option).

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.
**************************************************************************/
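/* regsplit_iml: exhaustive search for the single best (variable, cut-point)
   regression split of x, minimizing the weighted sum of squared errors (css).
   Outputs: j_info = variable || minimum SSE || cut || left/right predicted
   means; y_pred = fitted piecewise-constant predictions on x. */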
start regsplit_iml(x,y,w,j_info,y_pred);
        n = nrow(x);
        p = ncol(x);
        min_css = 10000000000000;
        y_pred = repeat(0,n,1);
  wy2sum = sum( w#y#y );
        wsum = sum(w);
        ywsum = sum(y#w);
  ywsum1 = wsum - ywsum;
        do j=1 to p;
            x_curr = x[,j]||y||w;
            b=x_curr;
            x_curr[rank(x[,j]),]=b;   free b;
            x_sort = x_curr[,1];
            y_sort = x_curr[,2];
            w_sort = x_curr[,3];

   yw_sort=(y_sort#w_sort);
   yw_sort1=((1-y_sort)#w_sort);
   w_sort = (w_sort);

   yw_cusum = cusum(yw_sort[1:(n-1)]);

   lpwt = cusum(w_sort[1:(n-1)]);
   lpwt = lpwt# (lpwt>constant('SMALL')) + 
                         constant('SMALL')#(lpwt<=constant('SMALL'));
   p1_L = (yw_cusum # (1/lpwt));

   rpwt = wsum - lpwt;
   rpwt = rpwt#(rpwt>constant('MACEPS')) + 
                         constant('MACEPS')#(rpwt<=constant('MACEPS'));
   p1_R = ((ywsum - yw_cusum) # (1/rpwt)); 

   lpwt = cusum(w_sort); rpwt = cusum(yw_sort);

   css = wy2sum + p1_L##2#lpwt[1:(n-1)] + p1_R##2#(wsum - lpwt[1:(n-1)]) -
           2*(p1_L#rpwt[1:(n-1)] + p1_R#(ywsum - rpwt[1:(n-1)]));

   free  lpwt  rpwt  yw_cusum  yw_sort1;
   css_min=css[><];  css_loc=css[>:<];

            if css_min < min_css then do;
                min_css = css_min;
                cut_val = (x_sort[css_loc] + x_sort[css_loc+1]) / 2;
                reg_var = j;
                ypred_L = (sum(yw_sort[1:css_loc]))/sum(w_sort[1:css_loc]);
                ypred_R = (sum(yw_sort[css_loc+1:n]))/
                        sum(w_sort[css_loc+1:n]);
                y_pred = ypred_L*(x[,j] < cut_val) + ypred_R*(x[,j] >= cut_val);
                j_info = reg_var||min_css||cut_val||ypred_L||ypred_R;
            end;
        end;
    finish regsplit_iml;
Reference:
Dmitrienko, Alex, Christy Chuang-Stein, and Ralph D’Agostino, eds. Pharmaceutical Statistics Using SAS®: A Practical Guide. Cary, NC: SAS Institute Inc., 2007.

Pharmaceutical Statistics Using SAS: A Practical Guide (SAS Press)
 Posted at 4:52 AM
April 21, 2010
 

No more teachers, no more books!! Now in its second year, SAS' "Applying Business Analytics" Webinar series has proven to be a powerful resource for many people. In 2009, more than 4,500 customers and prospects, representing more than 3,200 organizations around the world, participated in the nine live and on-demand webinars on topics including analytics, data management, and reporting.

So, why should you pay attention to the series this year? Beginning April 21 with Text Analytics 101 and running through November, the Applying Business Analytics Webinar series will let you learn from the best business, product marketing, and technology experts as they highlight the value of a complete business analytics framework.

You are welcome to join the sessions live or view them on demand at your leisure. Each Webinar follows a "101" format, setting a foundation around its topic. Hope you can join us, and I'll see you online soon!

@kristinevick
April 21, 2010
 
##################################################
# A DEMO HOW TO COPY A SQLITE DB TABLE FROM DISK #
# INTO MEMORY                                    #
##################################################

import sqlite3

# CONNECT TO THE IN-MEMORY DATABASE
con = sqlite3.connect(":memory:")
cur = con.cursor()

# ATTACH SQLITE DB IN THE DISK
cur.execute("attach 'd:\mydb' as filedb")

# COPY THE TABLE INTO IN-MEMORY DB
cur.execute("create table memory_tbl as select * from filedb.mytab")

# RELEASE THE DB IN THE DISK
cur.execute("detach filedb")

# FETCH ROWS FROM IN-MEMORY DB
cur.execute("select Sepal_Length, Species from memory_tbl limit 3")

for row in cur.fetchall():
    print(row)

# OUTPUT:
# (5.0999999999999996, u'setosa')
# (4.9000000000000004, u'setosa')
# (4.7000000000000002, u'setosa')
 Posted at 1:01 PM