Dec 04, 2009
 

(Note: everything below was tested on Windows XP.)

0. Install Oracle Database 10g Express Edition

Fast (and free) to download, easy to deploy, and simple to administer: for learning and testing purposes, Oracle Database 10g Express Edition (Oracle Database XE, a mini version of Oracle Database 10g Release 2) is strongly recommended:

0.1 Download it from its homepage (206MB);

0.2 Install it with the default settings;

0.3 Unlock the HR and SCOTT accounts. At a Windows command prompt, use the following commands:

sqlplus sys/YourSysPassword as sysdba
alter user HR account unlock;
alter user HR identified by YourHRPassword;
alter user SCOTT account unlock;
alter user SCOTT identified by TIGER;
Or you can accomplish these tasks within the Oracle web application:
http://localhost:8081/apex

1. Connect to Oracle using the SAS LIBNAME engine

libname SCOTT oracle user="SCOTT" password="TIGER" path='xe';

Note: xe is the default path for Oracle Database XE.

2. Connect to Oracle using the SQL Procedure Pass-Through Facility

proc sql;
connect to oracle as orcl
(user="SCOTT" password="TIGER" path='xe');

select *
from connection to orcl
(
SELECT …
);

disconnect from orcl;

quit;

Note: 1) In this approach Oracle, rather than SAS, processes the SQL statement; that is, you query with the more powerful and flexible Oracle SQL syntax instead of the SAS PROC SQL syntax. For more, see the SAS documentation.

2) Your Oracle SQL code goes inside the parenthesized block and must end without a semicolon (;):

select *
from emp

3) Only basic Oracle SQL statements (select, create table, ...) can pass through this facility.

3. Load SAS data sets into the Oracle database

proc copy in=sashelp
out=scott;
select class;
run;

4. Misc: Some differences between Oracle SQL and SAS Proc SQL

4.1 Table aliases

Oracle: from employees a;

SAS: from employees a; or from employees as a;

4.2 Column aliases

Oracle: single quotation marks ('') are not allowed around a column alias; of the forms below, job_id as 'job' is invalid:

select job_id as job, job_id job, job_id as "job", job_id as 'job'

SAS: all of these forms work:

select job_id as job, job_id "job", job_id 'job', job_id job

4.3 Literal Character Strings

Oracle: date and character literal values must be enclosed in single quotation marks ('').

SAS: either single or double quotation marks may be used.

4.4 Ordering of null values

Oracle: Null values are displayed last for ascending sequences and first for descending sequences.

SAS: Null values are displayed first for ascending sequences and last for descending sequences.

4.5 Nesting Group Functions

select max(avg(salary))
from employees
group by department_id

SAS log:

ERROR: Summary functions nested in this way are not supported.

An alternative approach in SAS:

select max(avg) as max
from (
select avg(salary) as avg
from employees
group by department_id )
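The two-step workaround above, averaging within groups and then taking the maximum of those averages, can be written out in plain Python (an illustration of the logic only; the department data below are made up):

```python
from statistics import mean

# Hypothetical (department_id, salary) rows standing in for the employees table
employees = [("d10", 100), ("d10", 200), ("d20", 400), ("d20", 500)]

# Step 1: average salary per department (the inner query's avg(salary) ... group by)
by_dept = {}
for dept, salary in employees:
    by_dept.setdefault(dept, []).append(salary)
avgs = {dept: mean(salaries) for dept, salaries in by_dept.items()}

# Step 2: maximum of the departmental averages (the outer query's max(avg))
max_avg = max(avgs.values())
print(max_avg)  # d10 -> 150, d20 -> 450, so the max is 450
```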

4.6 Row number (_n_)

Oracle:

select rownum
from employees

SAS:

select monotonic()
from employees

Note: monotonic() is an undocumented SAS function; see http://www2.sas.com/proceedings/sugi29/040-29.pdf

Applying Delta Method in SAS

Dec 02, 2009
 


The delta method is a statistical technique for deriving an approximate probability distribution for a function of an asymptotically normal random variable, given the limiting variance of that variable. See http://en.wikipedia.org/wiki/Delta_method.

Applying the delta method requires the first-order derivatives of the function with respect to the random variables, which calls for either symbolic computation or a finite-difference approximation. SAS has no explicit symbolic computing engine, nor does it output the intermediate results of finite-difference computations, so we turn to PROC MODEL in SAS/ETS (or PROC NLIN in SAS/STAT) to obtain the symbolic derivatives of a given function F(.). PROC MODEL outputs three pieces of related information:

1. Internal machine code for computing each part of the derivatives, say g(b); this is the table called CodeList.
2. The summary symbolic result for the first-order derivatives of f(b), i.e. the complete symbolic expression of g(b); this is the table called ProgList.
3. An indicator of the partial derivatives; this is the table called Derivatives.

The CodeList is the human-readable code for the internal computation, and I use it to guide further computation of the derivatives. I also use the ProgList from the first step to obtain the expression for the first-order derivatives of f(b), i.e. the expression of g(b).

Suppose you have a translog model F(p1, p2, p3; b1, b2, b3), where p1, p2, p3 are input prices and b1, b2, b3 are parameter vectors (a real translog model has more parameters). The elasticity of substitution between input 1 and input 2 is defined as [@F()/@p1]/[@F()/@p2] = g(b1, b2). (Note: @ denotes the derivative operator.)

In the first step, I use PROC MODEL on F(p1, p2, p3; b1, b2, b3) to estimate the parameter vector {b1, b2, b3}; this step produces the tables CodeList_0 and ProgList_0.

In the second step, I tweak PROC MODEL to obtain the partial derivatives @F()/@p1 and @F()/@p2, then read in the ProgList table to form the explicit AUES expression g(p1, p2 | b1, b2, b3) = [@F()/@p1]/[@F()/@p2]. This step generates the tables CodeList and ProgList.


Since the delta-method estimate of V(g) relies on the first-order derivatives of g() with respect to the parameter vector (b1, b2, b3), we also need the partial derivatives of g() with respect to each parameter, i.e. @g()/@b1, @g()/@b2, @g()/@b3. Therefore I call PROC MODEL once more to obtain these formulas; the tables CodeList_g and ProgList_g are generated in this step.

Later, I parse the instructions in CodeList_g to evaluate @g()/@b1, @g()/@b2, @g()/@b3 at every data point in the sample. With these values, the delta method gives the asymptotic mean and variance of g():

V[g()]=[@g()/@b1, @g()/@b2, @g()/@b3]’ %*% V(b1, b2, b3) %*% [@g()/@b1, @g()/@b2, @g()/@b3]
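As a minimal numerical sketch of this variance formula (not the PROC MODEL machinery described above; the function and covariance below are toy assumptions), the gradient can be approximated by central finite differences and plugged into the sandwich product:

```python
import numpy as np

def delta_method_var(g, b_hat, cov_b, eps=1e-6):
    """Approximate Var[g(b)] = grad(g)' V(b) grad(g) with a finite-difference gradient."""
    b_hat = np.asarray(b_hat, dtype=float)
    grad = np.empty_like(b_hat)
    for i in range(b_hat.size):
        step = np.zeros_like(b_hat)
        step[i] = eps
        grad[i] = (g(b_hat + step) - g(b_hat - step)) / (2 * eps)  # central difference
    return grad @ np.asarray(cov_b) @ grad

# Toy example: g(b) = b1 / b2, a ratio like the elasticity g() above
b = [2.0, 4.0]             # hypothetical parameter estimates
V = np.diag([0.01, 0.04])  # hypothetical covariance of the estimates
var_g = delta_method_var(lambda b: b[0] / b[1], b, V)
# Analytic check: grad = [1/b2, -b1/b2**2] = [0.25, -0.125], so the variance is 0.00125
```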


In summary, my program only asks the user to supply the original model for estimation and the two input prices for the cross-elasticity calculation; users do not need to specify the derivative functions themselves. With IML, we would have to write out the explicit formula for the vector [@g()/@b1, @g()/@b2, @g()/@b3], which can be very tedious.

Below is sample code using data from Econometric Analysis by W. Greene, 6th ed.


data Green;
input Year   Cost    K        L      E       M        Pk      Pl      Pe     Pm;
datalines;
1947  182.373 0.05107 0.24727 0.04253 0.65913 1.00000 1.00000 1.00000 1.00000
1948  183.161 0.05817 0.27716 0.05127 0.61340 1.00270 1.15457 1.30258 1.05525
1949  186.533 0.04602 0.25911 0.05075 0.64411 0.74371 1.15584 1.19663 1.06625
1950  221.710 0.04991 0.24794 0.04606 0.65609 0.92497 1.23535 1.12442 1.12430
1951  255.945 0.05039 0.25487 0.04482 0.64992 1.04877 1.33784 1.25179 1.21694
1952  264.699 0.04916 0.26655 0.04460 0.63969 0.99744 1.37949 1.27919 1.19961
1953  291.160 0.04728 0.26832 0.04369 0.64071 1.00653 1.43458 1.27505 1.19044
1954  274.457 0.05635 0.27167 0.04787 0.62411 1.08757 1.45362 1.30356 1.20612
1955  308.908 0.05258 0.26465 0.04517 0.63760 1.10315 1.51120 1.34277 1.23835
1956  328.286 0.04604 0.26880 0.04576 0.63940 0.99606 1.58186 1.37154 1.29336
1957  338.633 0.05033 0.27184 0.04820 0.62962 1.06321 1.64641 1.38010 1.30703
1958  323.318 0.06015 0.27283 0.04836 0.61886 1.15619 1.67389 1.39338 1.32699
1959  358.435 0.06185 0.27303 0.04563 0.61948 1.30758 1.73430 1.36756 1.30774
1960  366.251 0.05788 0.27738 0.04585 0.61889 1.25413 1.78280 1.38025 1.33946
1961  366.162 0.05903 0.27839 0.04640 0.61617 1.26328 1.81977 1.37630 1.34319
1962  390.668 0.05578 0.28280 0.04530 0.61613 1.26525 1.88531 1.37689 1.34745
1963  412.188 0.05601 0.27968 0.04470 0.61962 1.32294 1.93379 1.34737 1.33143
1964  433.768 0.05452 0.28343 0.04392 0.61814 1.32798 2.00998 1.38969 1.35197
1965  474.969 0.05467 0.27996 0.04114 0.62423 1.40659 2.05539 1.38635 1.37542
1966  521.291 0.05460 0.28363 0.04014 0.62163 1.45100 2.13441 1.40102 1.41878
1967  540.941 0.05443 0.28646 0.04074 0.61837 1.38617 2.20616 1.39197 1.42428
1968  585.447 0.05758 0.28883 0.03971 0.61388 1.49901 2.33869 1.43388 1.43481
1969  630.450 0.05410 0.29031 0.03963 0.61597 1.44957 2.46412 1.46481 1.53356
1970  623.466 0.05255 0.29755 0.04348 0.60642 1.32464 2.60532 1.45907 1.54758
1971  658.235 0.04675 0.28905 0.04479 0.61940 1.20177 2.76025 1.64689 1.54978
;
run;

data translog;
     set green;
  l_C=log(cost);
     lpk = log(pk); 
     lpl = log(pl); 
     lpe = log(pe); 
  lpm = log(pm);
run;

/* Specify the prices used in elasticity measurements */
%let covar1=lpk;
%let covar2=lpl;
/* Specify the translog cost function. It can be a nonlinear function, too */
%let model = %bquote(l_c=b0 + bk*lpk + bl*lpl + be*lpe + bm*lpm +
            bkk*0.5*lpk*lpk + bll*0.5*lpl*lpl +
      bee*0.5*lpe*lpe + bmm*0.5*lpm*lpm +
      bkl*lpk*lpl + bke*lpk*lpe + bkm*lpk*lpm +
      ble*lpl*lpe + blm*lpl*lpm + bem*lpe*lpm + exp(lpk););

/*@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@*/
/*                          do not change code below                    */
/*@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@*/
/* Estimate original model */
ods select none ;
ods output FirstDerivatives=Derivatives_0;
ods output CodeList=CodeList_0;
ods output ProgList=ProgList_0;
ods output ParameterEstimates=Est;
ods output CovB=CovB;
proc model data=Translog ListCode ListDer  covb ;
     *endogenous K L E; 
     *parms  b0 bk bl be bm
               bkk bll bee bmm
               bkl bke bkm
               ble blm
               bem;     
      &model;
      *K = bk + bkk*lpk + bkl*lpl + bke*lpe + bkm*lpm; 
      *L = bl + bll*lpk + bkl*lpk + ble*lpe + blm*lpm;
   *E = be + bee*lpe + bke*lpk + ble*lpl + bem*lpm;
      fit   l_c / outsused = smatrix  ols  outcov  outest=beta ; 
   *estimate '&PRED.K/@bkk' exp(lpk*bkk);  
   run;
quit;     
ods select all;


proc sql noprint;
     select wrt into :para_names separated by ' '
  from   derivatives_0
  ;
quit;
%put &para_names;

data covb;
        length _TYPE_ $ 5;
        set covb;  
  _TYPE_='PARMS';
  rename Parameter=_NAME_;
run;
 

proc transpose data=Est out=Est_t name=Parameter;
     var Estimate;
  id Parameter;
run;

%let elastic_covars=lpk lpl;
data _new;
     set Translog(drop=&elastic_covars); set Est_t;  
run;


/* Obtain elasticity measurements */
ods select none ;
options error=0;
ods output FirstDerivatives=Derivatives;
ods output CodeList=CodeList;
ods output ProgList=ProgList;
proc model data=_new ListCode ListDer  maxiter=0;
      *endogenous Sk Sl Se; 
      &model;
      fit  l_c /  ols;    
   run;
quit;     
ods select all;

/* obtain analytical derivatives w.r.t input prices */

data _null_;
     length _temp_ $ 1024;
     set ProgList;
  if indexc(source, '@')>0 then do;
     _temp_='';
  x1=indexc(source, '/')+2; x2=indexc(source, '=')-1;
  name=substr(source, x1, x2-x1); put name=;
     _temp_=cats(_temp_, compress(substr(source, find(source, '=')+1), '; '));  
  j=_n_+1;  
     do while( indexc(source, ';')=0);
           set ProgList point=j ;
           _temp_=cats(_temp_, compress(substr(source, find(source, '=')+1), '; '));  
     j+1;
  end;
  call symput(compress('foc_'||name), _temp_);
      end;
   put _temp_=;
run;

%put %bquote((&&foc_&covar1)/(&&foc_&covar2));


/* Obtain derivatives of elasticity measurements w.r.t parameters */
data _new;
     set Translog;
     if _n_>1 then stop; 
run;
ods select none;
ods output FirstDerivatives=Derivatives_g;
ods output CodeList=CodeList_g;
ods output ProgList=ProgList_g;
proc model data=_new ListCode ListDer  maxiter=0;     
      *parms  b0 bk bl be bm
                bkk bll bee bmm
                bkl bke bkm
                ble blm
                bem;     
   nom=&&foc_&covar1;
   denom=&&foc_&covar2;
      l_c= nom / denom;
      fit  l_c / ols;    
   run;
quit;     
ods select all;
/*
proc sql noprint;
     select name into :para_names separated by ' '
  from   sashelp.vcolumn
  where  varnum>=2 
         &  libname='WORK' 
         &  memname='EST_T'
  ;
quit;
%put &para_names;
*/

data _null_;
     set CodeList_g end=eof;
  if _n_=1 then do;
     call execute('data compute;');
  call execute('if _n_=1 then set Est_T;');
  call execute('set translog;');
  call execute('retain &para_names ;');
  /*call execute('lpk = log(pk); lpl = log(pl); lpe = log(pe);' );*/

  declare hash h();
  length word $ 32 wordtrans $ 32;
  h.defineKey('word');
  h.defineData('word', 'wordtrans');
  h.defineDone();
  call missing(word, wordtrans);

  declare hiter hi('h');
  end;
  /* check if it is a Operation item */
  where index(col1, 'Oper')>0;
  if index(col1, 'eeocf')>0 then col1=tranwrd(col1, '=', 'eeocf');
     j0=index(col3, '@'); j1=indexc(col3, ':')+1; j2=indexc(col3, '<')+2;     
  s1=trim(substr(col3, 1, j1-2));
  s2=compress(substr(col3, j1, j2-j1-2));
  s3=trim(substr(col3, j2));

  s2B=translate(s2, '___', '/@.');
  *put s2=  s2B=;
  if indexc(s2, '@./')>0 then do;
     rc=h.find(key:s2);
     if rc=0 then h.replace(key:s2, data:s2, data:s2B);
     else h.add(key:s2, data:s2, data:s2B);
  end;

  j=1;
  do until (scan(trim(col3), j, ' ')='');
     w=subpad(scan(col3, j, ' '), 1, 32);  
        if h.find(key:w)=0 then do;
     word2=wordtrans;
     s3=tranwrd(s3, compress(w), compress(wordtrans));
     *put s3=  wordtrans=;
     *put '^^^^^^^^^';
  end;

  j+1;
  end;

  if s1 in ('*', '-', '+', '/', '**') then  s3=translate(trim(left(s3)), s1, ' '); 
  else if s1 not in ('eeocf', '=') then s3=compress(cats(s1, '(', s3, ')'));
  rc=h.find(key: subpad(s2, 1, 32));
  if rc=0 then  s3=cats(compress(wordtrans), '=', s3);   
  else s3=cats(s2, '=', s3);

  s3=tranwrd(s3, 'ACTUAL.', ' ');
  *put s3;  
  *put '________________';
  if index(s3, 'ERROR')=0 | index(s3, 'RESID')=0 then
     call execute(s3||';');
  if eof then do;
     call execute ('drop &para_names;');
        call execute ('run;');
  h.output(dataset:'_test');
  end;
run;



data  Mean(drop=_PRED:)   FirstDerivatives(drop=PRED:);
        set compute(keep=Year PRED: _PRED:);  
run;

proc sql noprint;
        select cats(a.name, ' = ', substr(a.name, index(a.name, '__')+2)) into:renamestmt separated by ' '
  from  sashelp.vcolumn as a
  join  _test as b
    on  compress(a.name)=compress(b.wordtrans)
  where  a.libname='WORK' and a.memname='FIRSTDERIVATIVES'
  ;
quit;
%put &renamestmt;

proc sort data=derivatives_0; by wrt; run;
proc sort data=derivatives_g; by wrt; run;
data _dev;
     merge derivatives_0(keep=wrt in=_0) 
        derivatives_g(keep=wrt in=_g)  end=eof
  ;
  by wrt;
  length _temp_ $ 1024;
  retain _temp_ ;
  diff=(_0-_g); 
  if _n_=1 then _temp_=';';
  if diff=1 then _temp_=cats(_temp_, ';', wrt, '=0');
  put _temp_=;
  if eof then call symput ('impute', _temp_);
run;
%put %bquote(&impute);


data FirstDerivatives;
        set FirstDerivatives;
  rename &renamestmt;
  &impute;
  ;
run;

proc score data=firstderivatives  score=covb  type=parms  out=_temp1;        
        var &para_names;
run;

data _temp1;
     set _temp1(drop=&para_names);
run;
proc sql noprint;
     select cats(name, '=', translate(name, '', '2')) into :renamestmt separated by ' '
  from   sashelp.vcolumn
  where  libname='WORK' 
      &  memname='_TEMP1'
   &  varnum>1
   ;
quit;
%put &renamestmt;

data _temp1_score;
     set _temp1;
  rename &renamestmt;
  _TYPE_='parms';
  _NAME_=compress('v'||year);
run;

proc score data=firstderivatives  score=_temp1_score  type=parms  out=v_G(keep=Year v:);        
     by year;
     var &para_names;
run;

data v_G;
     set v_G;
  _var_=sum(of v:, 0);
  drop v:;
run;

proc datasets library=work nolist;
     delete _temp1 _temp1_score  _new _test;
quit;
 Posted at 6:55 AM
Dec 01, 2009
 
Email: love it or hate it. Either way, and in whatever form (please read on), you can't avoid it. SAS CMO Jim Davis blogged last week about getting 300-400 emails a day! And while I agree that you can't avoid email, email as it is traditionally known is about to get a complete makeover.

Microsoft Outlook is just one form of email that I work with. I also get direct messages via social media sites like Twitter and Facebook. And, if I want, I can manage those messages and have them all appear in my In Box -- which then just adds to the volume of email for me to digest. Alternatively I could have all of my social media messages appear in Facebook. Or I could leave them all as is.

Which then means I have to deal with this rather unwieldy email list: Outlook, Facebook, MySpace, LinkedIn, Plaxo, Twitter, and much more!

If you thought email meant one source, forget it! It's email, Jim, but not as we know it. Soon enough, traditional email will be an ancient relic. Generation Y does not reply to messages via email. Social media site of preference, please! A succinct 140-character tweet probably sounds quite appealing to anyone who glosses over long emails.

But for an organization analyzing the masses of tweets that convey customer feedback, succinct tweets with limited grammatical accuracy will be quite a different can of worms. Which brings me right back to Text Data Quality.

Some 60-80% of any data mining effort involves data cleansing/data quality, imputing missing values, filtering outliers, etc. Text mining requires that same upfront effort, especially if you are performing unsupervised classification (grouping documents when no other target/ranking/outcome information is available). Most of the social media text-based information that we will want to analyze will require a rather significant data quality effort, the topic of a future blog.
Nov 30, 2009
 
> ##################################################
> #        EXTRACT ROWS FROM A DATA FRAME          #
> ##################################################
>
> data <- data.frame(scan(what = list(x = 0, y = 0)))
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 2 3
7:
Read 6 records
>
> # EXTRACT ROWS WITH LOGICAL SUBSCRIPTS
> rows1 <- data[data$x == 2 & data$y >= 2,]
> rows1
  x y
5 2 2
6 2 3
>
> # EXTRACT ROWS WITH ROW INDEX
> rows2 <- data[which(data$x == 2 & data$y >= 2),]
> rows2
  x y
5 2 2
6 2 3
>
> # EXTRACT ROWS WITH SUBSET() FUNCTION
> rows3 <- subset(data, x == 2 & y >= 2)
> rows3
  x y
5 2 2
6 2 3
>
> # EXTRACT ROWS WITH SQLDF() FUNCTION
> require(sqldf)
> rows4 <- sqldf("select * from data where x = 2 and y >= 2", row.names = TRUE)
> rows4
  x y
5 2 2
6 2 3
 Posted at 9:59 AM
Nov 29, 2009
 
The Thanksgiving holiday weekend is screaming to a close. I can see the faint twinkle of Christmas lights on the horizon. But it is never too late for a thankful post. I read somewhere (Reader's Digest, I think) that optimism can bring happiness, health, and friends. The article also said that one great way to be more optimistic is to be thankful. Timely, don't you think?

I am always thankful for my family and friends, for their support and love, and for time spent with them in good health. But this is a support.sas.com blog, so let me move on to topics relevant to those of you reading. Thanks for reading, by the way!

I am thankful for a job that enables me to interact with so many different and interesting people. Do you remember back in college when you were picking a career and you would say "I want to make a difference"? Somehow I thought that was limited to doctors and firemen and the Peace Corps. I'll chalk that up to the short-sightedness of youth. I am none of those things, but I do work on support.sas.com, which, in turn, helps SAS users worldwide be more efficient at their jobs.

We see the Customer Support Web site as a 24-hour self-serve helpdesk. The goal of my team is to make support.sas.com useful for you so that you can do the things that you do. Things that include: research cures for disease, seek out financial fraud, help protect endangered animals, and allow us to send flowers to the family and friends that we miss during the year. Yep, you all help make the world a better place, and that is really cool. I'm thankful for all that you do.

Finally, I'm thankful for each of you who have taken the time to send a comment about how the Web site is or is not working for you. All of those comments and conversations add up to change in the way support.sas.com works.

Thank you for reading this blog, for your comments, and for using SAS!
Nov 26, 2009
 


Seth Grimes posted more information about Text Data Quality today. I really think that text data quality issues will only become more pronounced as Gen Y users move more heavily into the social media space where spelling and grammar accuracy are not a great concern to these digital producers.

The content being actively used for text analysis is, more often than not, in need of cleansing for misspellings, synonyms, acronyms, profanity and more, as mentioned in my last blog. I'd hazard a guess that over 85% of text content needs cleansing prior to analysis. Even published articles contain the occasional spellling error!

I see huge potential for using "clipped/shortened text" language dictionaries as part of our underlying NLP capabilities. I look forward to reading about Seth's continued research in this area!
Nov 25, 2009
 
I haven't reminded you about YouTube in a while. After reading a recent post in the Open Mic blog, I thought it might be time to nudge you toward SAS resources on YouTube. It is a great way to take a break from your normal workday while still learning about SAS and analytics.
There are some great videos out there. Here are a couple favorites from my last YouTube visit:

Visit the SAS channel on YouTube. If you have a favorite SAS-related YouTube video, share it with others. (OK now, remember that I asked for a video related to SAS!)
Nov 25, 2009
 
Original paper: User-Written DATA Step Functions

This is a feature newly released in SAS 9.2. Like a macro, it lets you define functions that SAS does not provide and then use them freely in every DATA step without repeated declarations. This user-defined function capability is delivered by PROC FCMP.

Let's look at an example:
proc fcmp outlib=sasuser.funcs.trial;
function study_day(intervention_date, event_date);
if event_date < intervention_date then
return(event_date - intervention_date);
else
return(event_date - intervention_date + 1);
endsub;
run;

options cmplib=sasuser.funcs;
data _null_;
start = '15Feb2006'd;
today = '27Mar2006'd;
sd = study_day(start, today);
put sd=;
run;

This example has two parts. The first part uses PROC FCMP to create a function named study_day that takes two arguments, one called intervention_date and the other called event_date. The body contains an if-else condition and, depending on which branch is taken, returns either event_date minus intervention_date, or event_date minus intervention_date plus one, in days. The endsub statement closes the function body, and the result is stored as the trial package under funcs in the sasuser library.

When it's time to use the function, bring in sasuser.funcs with the cmplib system option; after that, study_day can be used an unlimited number of times in any DATA step. Even after restarting SAS, there is no need to rerun the PROC FCMP step: as soon as you execute the line options cmplib=sasuser.funcs, the study_day function is directly available again.

To create a CALL routine instead, use the SUBROUTINE statement inside PROC FCMP. An example:
proc fcmp outlib=sasuser.funcs.math;
subroutine subA();
x = 5;
call subB();
put 'In subA:' x=;
endsub;
subroutine subB();
x = 'subB';
put 'In subB:' x=;
endsub;
run;

options cmplib=sasuser.funcs;
data _null_;
x = 99;
call subA();
put 'In DATA step: ' x=;
run;

In this example, PROC FCMP first uses SUBROUTINE to create a call routine named subA(), which sets x = 5, calls another call routine named subB(), and finally prints 'In subA: x=' to the log window. Since subB() is also user-written, another SUBROUTINE statement is needed to define its behavior. When both are done, each subroutine is wrapped up with endsub; with subA() and subB() packaged this way, the user-defined subA() can be used in the DATA step in the second part.

The original paper contains some more advanced usage, but I think the two examples above are enough for typical users, so I won't spend space on an advanced tutorial; anyone interested can download the original paper and study it.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author:
Jason Secosky
SAS Institute Inc.
SAS Campus Drive
Cary, NC 27513
919-677-8000
Jason.Secosky@sas.com
Nov 25, 2009
 
Original article: Stopping Stepwise: Why Stepwise Selection Methods are Bad and What you Should use Instead

After years of statistical consulting work, I often run into this question: why do the results of a stepwise linear regression differ so much from expectations and resist meaningful interpretation? The question reflects the gap between theoretical and applied statistics. When selecting variables, SAS can only plug the numbers it reads into formulas; until SAS evolves into Skynet, it has no way to guess what each variable actually means. The usual remedy is to ask users to apply their own experience, deleting computer-selected variables that the study will not discuss, or forcing in deleted variables that it will. These rather subjective methods, however, lose some scientific rigor. This tutorial answers the doubts shared by many who use linear regression in their research, and uses PROC GLMSELECT to address the problem.

As early as 2001, Dr. Frank Harrell pointed out the blind spots of the traditional linear-regression selection methods (backward, forward and stepwise), among them:
- R-squared is biased (too high)
- Standard errors of the parameter estimates are too small
- The F and chi-square statistics do not actually follow F or chi-square distributions
- Confidence intervals for the parameter estimates are too narrow
- p-values are biased (too small)
- Parameter estimates are too large in absolute value
- Collinearity

One might suspect that small samples cause insufficient power, but that has little to do with model selection. The original author even used simulated data to compare 100 subjects against 1000 subjects under the same model, and found the 1000-subject results were not notably better.

To overcome the model selection problem, the original author proposes several alternatives:

(1) A FULL(ER) MODEL
As the name suggests: skip model selection entirely and fit the full model. Naturally, this somewhat escapist approach runs into serious collinearity problems. In my own experience, dodging model selection only to let collinearity contaminate the analysis is no smarter. My advice is to always run a collinearity analysis and treat it as the replacement for model selection: if a variable is not central to the study, has no statistical significance, and is highly correlated with other independent variables, it can simply be cut.

(2) EXPERT KNOWLEDGE
This is the approach mentioned in the opening paragraph: use your knowledge of the research topic and the background of the data to decide, subjectively, which variables stay and which are deleted. It is not a very scientific method, but I have always felt that statistical analysis must step somewhat beyond the textbook; handing the whole process over to the computer is not a fully guaranteed practice, and the real decisions still depend on a human brain.

(3) MODEL AVERAGING
The idea here is to dilute the problems caused by using the full model: build several candidate models, use AIC to compute a weight for each variable in each model, and average each variable's estimates across the models by those weights. For details, see:

Burnham, K. P. & Anderson, D. R. (2002), Model selection and multimodel inference, Springer, New York.

(4) PARTIAL LEAST SQUARES
This method addresses the interpretability problem of principal component analysis (PCA): building a PCA regression does reduce the number of variables, but the resulting PCA variables are hard to interpret.
 
(5) LASSO
This method borrows the parameter-estimation approach of ridge regression to re-estimate the parameters; for the details, see:

Hastie, T., Tibshirani, R. & Friedman, J. (2001), The Elements of Statistical Learning, Springer-Verlag, New York.

(6) LAR
Short for Least Angle Regression. The idea is to center all the predictors so that every coefficient starts from zero, and then grow the coefficients gradually using the correlations with the residuals. For details, see:

Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), 'Least angle regression', Annals of Statistics 32, 407-499.

(7) CROSS VALIDATION
A resampling method similar to the bootstrap and the jackknife: split the data into K parts; within each round, (K-1)/K of the data are used to fit the model and the remaining 1/K are held out for validation.
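The K-fold scheme just described can be sketched in a few lines (index bookkeeping only, with hypothetical fold counts; this is not what PROC GLMSELECT does internally):

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists: each fold serves once as the 1/k validation set."""
    folds = [list(range(i, n, k)) for i in range(k)]  # deal indices round-robin into k folds
    for i in range(k):
        test = folds[i]
        train = [idx for f in folds[:i] + folds[i + 1:] for idx in f]
        yield train, test

splits = list(kfold_indices(n=6, k=3))
# Each of the 3 splits trains on 4 observations and validates on the remaining 2
```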

Among these methods, LASSO and LAR are already available through PROC GLMSELECT. The basic syntax of the procedure is:
PROC GLMSELECT (options);
CLASS variable;
MODEL variable = (effects) / (options);
SCORE (data=dataset) (out=dataset);

The key to model selection here is the selection= option on the MODEL statement, which specifies the selection method and the criterion for admitting variables (AIC, AICC, BIC, ..., etc.). To compute LASSO or LAR, simply write selection=LASSO or selection=LAR. The traditional backward, forward and stepwise methods can still be used in PROC GLMSELECT, with additional parameters the user can set, for example:

selection=forward(select=SL stop=validation)

selection=backward(select=SL)

selection=stepwise(select=SL SLE=0.1 SLS=0.08 choose=AIC)

These options are in fact the same model-selection options used in PROC REG, so I will not elaborate on them here (actually the original article does not cover them either; I added them myself from the PROC GLMSELECT documentation).

LASSO and LAR offer alternatives when traditional model selection results go wrong, but the two methods are not a cure-all either, since each comes with its own assumptions that need to be checked. The original article does not say which assumptions to test, so please consult the original works. Still, the original author notes that both methods are much better than traditional stepwise. If I later come across SAS technical papers devoted to these two methods, I will cover them in a dedicated post.

CONTACT INFORMATION
Peter L. Flom
515 West End Ave
Apt 8C
New York, NY 10024
peterflomconsulting@mindspring.com
(917) 488 7176

David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
 Posted at 8:22 AM