Array

4月 182012
 
In the following data, group 1 is control grp and grp 2 to 6 are test grps. Each group we recorded the visits.First we want to calculate the ratio of each group v.s. group 1. After we get the ratio, we compare the ratio of each day with the ratio of the sample to check whether the ratio is higher or lower than the sample ratio.
To calculate the ratio, by intuition we will use array. Before using array, we need to transpose the data set.

proc transpose data=report_visits_h out=report_visits_v prefix=group;
run;

proc print data=report_visits_v ;
title "-----title print of report_visits_v ";
run;
title "";


The output is like:
From the output, we should drop the first row:

proc sql;
select count(1) into :m_n from report_visits_v;
quit;

%let n_m_n=%eval(&m_n-1);
%let m_pct=%eval(&n_m_n-1);

data report_visits_v;
set report_visits_v;
if _n_>=2;
drop _name_;
run;

proc print data=report_visits_v;
title "------title trans to horizontal";
run;
title "";


Then we can calculate the ratio, like (here use group: to list all vars whose name begin with group):

data report_visits_v;
set report_visits_v;
array a1{*}
group: ; ** array with dynamic # of dim;
array a2{*} ratio1-ratio6;
do i=1 to dim(a1);
a2{i}=a1{i}/a1{1};
end;
format ratio2-ratio6 6.5;
drop i;
run;

proc print data=report_visits_v width=min;
run;

The output is:
Then we transfer it back to horizontal data, that is, each row is for one group.

proc transpose data=report_visits_v out=report_visits_h2;
run;

data orig_records;
set report_visits_h2;
if _n_<=6 then output orig_records;
rename _name_=group;
run;

data ratio;
set report_visits_h2;
if _n_>6 then output ratio;
run;

data ratio(drop=_name_);
retain grp_over_grp1;
set ratio;
grp_over_grp1=substr(_name_,6,1);
run;

proc print data=ratio;
title "------title print of ratio";
run;
title "";

Now the print of data set ratio is like:
Next we calculate the percentage of ratio changed for each day compared with the sample ratio.(if macro variable n_m_n without %eval, it will not work here since its resolution will be a character and therefore the resolution could not be used as the number of dim or to indicate how many col are there):

data pct;
set ratio;
pct_over_sample = '';
array a3{*} col1-col
&n_m_n; ** compare with condition without %eval for n_m_n;
array pct{
&n_m_n} ; ** array with dim number being a macro variable;
do i=1 to dim(a3);
pct{i}=(a3{i}-a3{1})/a3{1};
end;
format pct2-pct
&n_m_n percentn12.6;
keep pct: ;
run;

proc print data=pct;
title "-----title print of percentage change";
run;


The output is:

That is what we want.
10月 042011
 

In last post, I mentioned Hadoop, the open source implementation of Google’s MapReduce for parallelized processing of big data. In this long National Holiday, I read the original Google paper, MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat and got that the terminologies of “map” and “reduce” were basically borrowed from Lisp, an old functional language that I even didn’t play “hello world” with. For Python users, the idea of Map and Reduce is also very straightforward because the workhorse data structure in Python is just the list, a sequence of values that you can just imagine that they are the nodes(clusters, chunk servers, …) in a distributed system.

MapReduce is a programming framework and really language independent, so SAS users can also get the basic idea from their daily programming practices and here is just a simple illustration using data step array (not array in Proc FCMP or matrix in IML). Data step array in SAS is fundamentally not a data structure but a convenient way of processing group of variables, but it can also be used to play some list operations like in Python and other rich data structure supporting languages(an editable version can be founded in here):

MapReduce

Follow code above, the programming task is to capitalize a string “Hadoop” (Line 2) and the “master” method is just to capitalize the string in buddle(Line 8): just use a master machine to processing the data.

Then we introduce the idea of “big data” that the string is too huge to one master machine, so “master method” failed. Now we distribute the task to thousands of low cost machines (workers, slaves, chunk servers,. . . in this case, the one dimensional array with size of 6, see Line 11), each machine produces parts of the job (each array element only capitalizes a single letter in sequence, see Line 12-14). Such distributing operation is called “map”. In a MapReduce system, a master machine is also needed to assign the maps and reduce.

How about “reduce”?  A “reduce” operation is also called “fold”—for example, in Line 17, the operation to combine all the separately values into a single value: combine results from multiple worker machines.

3月 192011
 

Array is probably the only number-indexed data type in SAS. Interestingly, SAS and R, the major statistical softwares, use 1 instead of 0 to specify the first element. SAS programmers adopt array mostly for multiple-variable batch-processing. For example, longitudinal summation can be achieved by specifying a one-dimensional array and then adding all array elements together. On the contrary, general-purpose programming languages, like Java, Python, C++,  widely use array or list to store and index data. The programmers who newly switch into SAS often feel confused about the limitation of SAS’ array, which is transient and need output to the parental data step to save result. Since data step does not memorize the array’s original structure, multiple dimensional arrays are almost useless in creating more complex data structures. In SAS, those multiple dimensional arrays are occasionally implemented to reshape data [Ref. 1]. If a matrix, such as a big one with random numbers of 5000 rows by 5000 columns [Example 1], is needed, Other programming languages would use int[][] to declare a matrix. In SAS, 2D array, such as a [5000, 5000] array, may be intuitively chosen to perform this task. However, the result is not desired, and the right way is still to apply a one-dimensional array. 

Array in Proc Fcmp is an upgraded array data type other than data step array. This new version of array allows not only creating matrices but also indexing matrix. In addition, the communication between data step and Proc Fcmp is fast and convenient: this procedure supplies the READ_ARRAY() and WRITE_ARRAY() functions, which transform a dataset to an array and vice versa. For example, to test an interviewee’s knowledge on SAS’s data step array, a typical interview question is to ask her/him to write a 9X9 multiplication table [Example 2]. The expected solution is to place the OUTPUT statement between the inner Do-loop and outer Do-loop. Placing OUTPUT into the inner layer of Do-loops builds an 81*9 monster, while ignoring OUPTUT would only generate the last row of multiplication table. This task is much simpler in Proc Fcmp: just produce a matrix and write it to a dataset. Proc Fcmp is a nice alternative to SAS/IML, a specialized matrices language module in SAS. For instance, a typical SAS/IML kind of job, such as filling missing cells in a matrix (or a dataset) with its diagonal elements, can be fulfilled by arrays in Proc Fcmp [Example 3]. Proc Fcmp is shipped in SAS/BASE, which means that no extra out-of-pocket money is needed for another license. Another concern is the efficiency of SAS/IML module. It is reported that frequently calling of SAS/IML's functions would decrease system speed dramatically[Ref. 3].

The matrix feature of Proc Fcmp’s array benefits other SAS programming areas. For example, Proc Transpose and data step array usually reshape data structure from either long to wide or wide to long. However, in many cases that positions between observation and variable in a dataset have to be exchanged, the two methods may require several try-and-error steps. The TRANSPOSE() subroutine in Proc Fcmp solves such a problem easily [Example 4]. Currently there are 13 functions or subroutines available in Proc Fcmp for matrices operation[Ref. 2]. Some may complain they are still not enough for their particular need. Don’t forget: Proc Fcmp makes user-defined functions and subroutines! For example, to simulate set() function in Python, a deldup_array() function, based on encapsulating Proc SQL in a macro by RUN_MACRO(), can delete duplicate elements in array[Example 5]. Therefore, users of Proc Fcmp’s array can always construct and accumulate their tools to suit their purpose.

References: 1. UCLA Statistics Course. http://www.ats.ucla.edu/stat/sas/library/multidimensional_arrays.htm
2. SAS 9.2 Online Help. http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a003193719.htm
3. Wang, Songfeng; Zhang, Jiajia. Developing User-Defined Functions in SAS®: A Summary and Comparison. SAS Global 2011.

******(1) EXAMPLE 1: CREATE A 5000X5000 RANDOM MATRIX****;
****(1.1) WRONG 2D ARRAY SOLUTION******;
data matrix1;
   array v[5000, 5000];
   do i = 1 to 5000;
     do j = 1 to 5000;
       v[i, j] = ranuni(0);
     end;
     output;
   end;
run;  

****(1.2) CORRECT 1D ARRAY SOLUTION*****;
data matrix2;
   array x[5000];
   do i = 1 to 5000;
     do j = 1 to 5000;
       x[j] = ranuni(0);
     end;
     output;
   end;
run;

******(2) EXAMPLE 2: CREATE A MULTIPLICATION TABLE******;
****(2.1) DATA STEP ARRAY SOLUTION ********;
data mt1;
   array a[9] a1-a9;   
   do row = 1 to 9;           
      do col= 1 to 9;    
         if row ge col then a[col]=row*col;  
      end;
     output;        
   end;
   drop row col;
run;

****(2.2) PROC FCMP ARRAY SOLUTION*****;
proc fcmp;
   array a[9, 9] / nosymbols;
   do row =1 to 9;
      do col = 1 to 9;
         if row ge col  then a[row,  col] = row*col;
      end;
   end;
   rc1 = write_array('mt2', a);
quit;

****(3) EXAMPLE 3: FILL MISSING CELL WITH DIAGONAL ELEMENT******;
****(3.0) INPUT RAW DATA****;
data have;
   input x1-x4;
   datalines;
   1 . . .
   2 1 . . 
   3 4 1 .
   7 6 5 1
;
run;

****(3.1) PROC FCMP TRANSPOSITION******;
proc fcmp;
   array a[4, 4] / nosymbols;
   rc1 = read_array('have', a);
   do i = 1 to 4;
      do j = 1 to 4;
         if missing(a[i, j]) = 1 then a[i, j] = a[j, i];
      end;
   end;
   rc2 = write_array('want', a);
quit;

*****(4) EXAMPLE 4: RESHAPE SQUARE-SHAPE DATA********;
****(4.0) INPUT RAW DATA********;
data have1;
   input x1-x5;
   cards;
   1 . 0 1 1
   0 1 . 0 0
   .  . 1 1 1
   0 0 0 1 .
   . 0 0 1 1
;
run;

****(4.1) PROC TRANSPOSE SOLUTION******;
data trps1;
   set have1;
   obs = _n_;
run;

proc transpose data = trps1 out = trps2;
   by obs;
   var x1-x5;
run;

proc sort data = trps2 out = trps3;
   by _name_;
run;

proc transpose data = trps3 out = want1_1;
   by _name_;
   var col1;
run;

****(4.2) PROC FCMP SOLUTION********;
proc fcmp;
   array a[5, 5] / nosymbols;
   rc1 = read_array('have1', a);
   array b[5, 5] ;
   call transpose(a, b);
   rc2 = write_array('want1_2', b);
quit;

******(5) EXAMPLE 5: FCMP DEDUPLICATION FUNCTION FOR FCMP ARRAY*******;
****(5.1) ENCAPSULATE DEDUPLICATIONA UTILITY OF PROC SQL IN A MACRO ******;
%macro deldup_array;
   %let dsname = %sysfunc(dequote(&dsname));
   %let arrayname = %sysfunc(dequote(&arrayname));
   /*(5.1.1) GENERATE UNIQUE DATASET AND ITS OBSERVATION NUMBER*/
   proc sql noprint;
      select count(unique(&arrayname.1)) into: obs_num
      from &dsname;
      create table _temp as
      select distinct *
      from &dsname;
   quit;
   /*(5.1.2) USE TEMP DATASET TO REPLACE RAW DATASET*/
   data &dsname;
      set _temp;
   run;
   /*(5.1.3) DELETE TEMP DATASET*/ 
   proc datasets;
      delete _temp;
   run;
%mend deldup_array; 

****(5.2) ENCAPSULATE MACRO ABOVE IN A FUNCTION*****;
proc fcmp outlib=work.func.practice;
   function deldup_array(dsname $, arrayname $);
      rc = run_macro('deldup_array', dsname, arrayname, obs_num);
      if rc eq 0 then return(obs_num);
      else return(.);
   endsub;
run;

****(5.3) USE THIS FUNCION TO DELETE DUPLICATES IN ARRAY*****;
option cmplib = (work.func) mlogic mprint symbolgen; 
proc fcmp;
   array a[1000] /nosymbols;
   do j = 1 to 1000;
      a[j] = ceil((ranuni(12345)*100) + rannor(12345));
   end;
   
   dsname = 'numbers'; 
   rc1 = write_array(dsname, a);
   
   n = deldup_array(dsname,  %sysfunc(quote(a)));
   
   call dynamic_array(a, n);
   rc2 = read_array(dsname, a);
   put a = ;
quit;

*****************END OF ALL CODING****************************;


1月 132011
 

The original data is multi_wide as below. There are 10 variables each with three observation. What we want to do is to reshape the data from wide to long. Three ways are given to fulfill this job. The first one is using data step, which is available for a small number of variables; The second is ARRAY method, which is easy to write and understand; The last one is MACRO method.

data multi_wide;
  input famid faminc96 faminc97 faminc98 spend96 spend97 spend98
        debt96 $ debt97 $ debt98 $ ;
cards;
1 40000 40500 41000 38000 39000 40000 yes yes no
2 45000 45400 45800 42000 43000 44000 yes no  no
3 75000 76000 77000 70000 71000 72000 no  no  no
;
run;

First method, using DATA STEP:
data multi_long (keep=famid year faminc spend debt);
   set multi_wide;
   year=96;
   faminc=faminc96;
   spend=spend96;
   debt=debt96;
   output;
   year=97;
   faminc=faminc97;
   spend=spend97;
   debt=debt97;
   output;
   year=98;
   faminc=faminc98;
   spend=spend98;
   debt=debt98;
   output;
run;

Second, ARRAY method:
data multi_long (keep=famid year faminc spend debt);
   set multi_wide;
   array afaminc(96:98) faminc96-faminc98;
   array aspend(96:98) spend96-spend98;
   array adebt(96:98) debt96-debt98;
   do year = 96 to 98;
      faminc=afaminc[year];
         spend=aspend[year];
         debt=adebt[year];
         output;
       end;
run;

Third, MACRO:
%macro multilong(olddata, year1,year2);
   data multi_long (keep=famid year faminc spend debt);
      set &olddata;
         %do year=&year1 %to &year2;
            year=&year;
            faminc=faminc&year;
               spend=spend&year;
               debt=debt&year;
               output;
               %end;
%mend;

%multilong(multi_wide,96,98)


proc print data=&syslast;
run;
12月 282010
 
/* Here is an example of using SAS DICTIONARY. If using original data set wide and variables summer1-summer4 which is a string with suffixes having the same construction, it's easy to get the sum of each column in proc sql using macros. However, if we change the name to the different one with no similar structure, then we need to use DICTIONARY to help us with it.  */


data wide;
   input summer1-summer4;
   cards;
   1 2 3 4
   1 2 3 4
   ;
run;

proc datasets;
   modify wide;
   rename summer1 = salary
          summer2 = net
                summer3 = gross
                summer4=adjusted
                ;
run;
quit;

proc sql;
  select name, type
  from dictionary.columns
  where libname='WORK' and    /* here WORK is case sensitive */
        memname='WIDE'        /* here WIDE is case sensitive */
       ;
quit;

proc sql;
   select 'sum(' || trim(name) || ') as sum_' || name
   into :selections separated by ' , '
   from dictionary.columns
   where libname='WORK' and   /* here WORK is case sensitive */
         memname='WIDE' and   /* here WIDE is case sensitive */
               type   ='num'        /* here num is case sensitive */
       ;
quit;

%put &selections;

proc sql;
   select &selections
   from wide
   ;
quit;
12月 262010
 
options nodate nocenter  formdlim='8';
/* 这儿是原始数据,现在需要的是对summer1summer4求和,也就是说要得到36912  */
data wide;
   input name $ summer1-summer4;
   cards;
   a 1 2 3 4
   b 1 2 3 4
   c 1 2 3 4
;

/* 第一种办法,用矩阵来做,因为涉及到对同一个变量反复做相同的操作 */
data sum_array;
   set wide end=last;
   array summer{4} summer1-summer4;
   array asum{4};
   keep asum1-asum4;
   do i=1 to 4;
      asum{i}+summer{i};
   end;
   if last then output;
run;

proc print data=sum_array noobs;
run;

/* 第二个办法,用macrosql混合使用 */
%MACRO sumver(maxindex=);
%DO n = 1 %TO &maxindex;
SUM(summer&n) as Sum&n
%IF &n NE &maxindex %THEN ,
;
%END;
%MEND sumver;

PROC SQL;
SELECT %sumver(maxindex=4)
FROM wide
;
QUIT;

/* 第三个办法,用macro来代替反复执行 */
%macro Test(Countt);
data sum_data;
   set wide end=lastobs;
%do i =1 %to &Countt;
   sum&i+summer&i;
%end;
   keep sum1-sum&Countt;
   if lastobs then output;
run;
%mend;
%Test(4);

proc print data=sum_data;
run;


/* 第四个办法,用一段macro来代替执行所有的相加项 */
%macro sumv(maxobs=);
   %do i=1 %to &maxobs;
     sum&i+summer&i ;
      %end;
%mend sumv;

option mlogic mprint symbolGEN;

data sum_data;
   set wide end=lastobs;
   %sumv(maxobs=4) ;
   keep sum1-sum4;
   if lastobs then output;
run;

12月 052010
 
Problems: Quote for six-month American style euro currency options on plain vanilla, Max[S-K,0]and 〖Max[S-K,0]〗^0.5. Exchange rate S_0=$1.3721 /euro
Six-month continuously compounded inter-bank rates: r=0.4472%,r_f=1.2840%.
Assumptions:The exchange rate for euro follows an iid log normal price changes and volatility is constant.
Methodology:Binomial Model is used to price American currency options on euros.
We calculate the payoffs at time T and discount payoffs to the prior time step. Under the risk neutral probability measure,
c_(t-1)=(q×c_u+(1-q)×c_d)/R
Since these two options are American styles, we need to check for optimal early exercise at each node of the binomial tree. For these two currency options, we check Max[S-K,c_(t-1),0] and Max[〖Max[S-K,0]〗^0.5,c_(t-1) ] . Matlab is the software used to implement the binomial model.
--Parameters:1. Time steps n
As the number of time steps n increases, we can apply CRR model and the binomial model approaches real world price changes. We choose n=80,h=T/n=0.5/80=0.00625.
2. u and dWe choose CRR model to define u and d for binomial model.
u=e^(σ√h) ; d=1/u=e^(-σ√h)
Where h is the length of the binomial times step and σ is the annualized log price change volatility.
3. Annualized log price change volatility σThe daily log changes and daily squared log changes for two year exchange rates from 11/5/2008 – 11/5/2010 are as follows. We consider the volatility to be constant since 05/01/2009. Thus, we choose the historical prices from 05/01/2009-11/5/2010 and apply the volatility of the daily log changes as an estimate. Then the annualized log price change volatility equals the square root of the trading days in one year (252 days) times the daily log price change volatility.
σ=√252×σ_daily
4. Risk Neutral Measure QOptions on currencies can be regarded as an asset providing a yield at the foreign risk-free rate of interest. Thus, the risk neutral probability measure Q :
q=(e^((r-r_f ) )-d)/(u-d) ;
1-q= 〖u- e〗^((r-r_f ) )/(u-d)
5. Strike Price KSet strike price K from $1.3000/euro to $1.5000/euro with $0.005 per euro increments.
Reference: 1. John C. Hull.Options, Futures and Other Derivatives, 7th edition. Prentice Hall. 2008.
2. Base SAS 9.2 Procedures Guide. SAS Publishing. 2009

**********************AUTHOR(DAPANGMAO)----HCHAO8@GMAIL.COM***********************************;
****************(1) CONSTRUCT € TO $ EXCHANGE RATIO VECTOR******************;
data vector;
attrib value informat=dollar10.4 format=dollar10.4 ;
StrikeS=1.3000;
StrikeE=1.5000;
Increment=0.005;
do value=StrikeS to StrikeE by Increment;
drop StrikeS StrikeE Increment;
output;
end;
run;

**************(2) BUILD THE FUNCTION TO EVALUATE OPTION PRICES********;
proc fcmp outlib = myfunc.finance.subrout;
subroutine mysubr(T, n, r, rf, s0, sigma, c1[*], c2[*]);
/*Two vectors are output arguments*/
outargs c1, c2;
/*Inside calculation from input arguments*/
dt=T/n;
length=dim(c1);
u=exp(sigma*sqrt(dt));
d=1/u;
q=(exp((r-rf)*dt)-d)/(u-d);
/*Announce 4 arrays -- 1 vector and 3 matrixes */
array k[1]/ nosymbols;
array s[1] /nosymbols;
array x1[1] /nosymbols;
array x2[1] /nosymbols;
n=n+1;
/*The sizes of the arrays are specified*/
call dynamic_array(s, n, n);
call dynamic_array(x1, n, n);
call dynamic_array(x2, n, n);
call dynamic_array(k, length);
/*Read the exchange ratio into function*/
rc=read_array('vector', k);
/*Assign values to S matrix */
call zeromatrix(s);
S[1,1]=S0;
do _col=2 to n ;
do _row=1 to n;
S[_row,_col]=S0*u**(_col-_row)*d**(_row-1);
if _row gt _col then S[_row, _col]=0;
end;
end;
/*Generate final option vectors */
do i=1 to length;
x=k[i];
call zeromatrix (x1);
call zeromatrix (x2);
do j=1 to n;
x1[j,n]=max(S[j,n]-x,0);
x2[j,n]=max(S[j,n]-x,0)**0.5;
end;
do _col=(n-1) to 1 by -1;
do _row=1 to (n-1) by 1;
h=exp(-r*dt)*(q*x1[_row,_col+1]+(1-q)*x1[_row+1,_col+1]);
h2=exp(-r*dt)*(q*x2[_row,_col+1]+(1-q)*x2[_row+1,_col+1]);
x1[_row,_col]=max(S[_row,_col]-X,h,0);
x2[_row,_col]=max( max(S[_row,_col]-x ,0) **0.5, h2);
end;
c1[i]=x1[1,1];
c2[i]=x2[1,1];
end;
end;
endsub;
run;
quit;

***********SOME INITIAL VALUES*****************;
/*T=1/2;*/
/*n=80;*/
/*r=0.004472; */
/*rf=0.01284;*/
/*S0=1.3721;*/
/*Sigma=0.1074*/

**************(3) MEASURE THE LENGTH OF PRICING VECTOR**************;
proc sql;
select count(*) into: vecnum from vector;
quit;

**************(4) USE THE SUBROUTINE TO GENERATE TWO VECTORS********************;
options cmplib = (myfunc.finance);
data final;
array c1[&vecnum] _temporary_;
array c2[&vecnum] _temporary_;
call mysubr(0.5, 80, 0.004472, 0.01284, 1.3721, 0.1074, c1, c2); /*subroutine mysubr(T, n, r, rf, s0, sigma, c1[*], c2[*]);*/
do i=1 to dim(c1);
c1value= c1[i];
c2value=c2[i];
output;
end;
run;
****************TEST PASSED ----------- END****************************;


SAS 练手: 根据多个变量的值删除特定观测

 Array, Expected dataset, Question, Solution, Statement, 练手  SAS 练手: 根据多个变量的值删除特定观测已关闭评论
10月 252010
 

Question from: 上海-奋斗

Dataset: have

data have;
	input code $ month;
cards;
A     201002
A     201041
A     201003
A     201010
A     201009
B     201004
B     201009
B     201003
B     201011
C     201004
C     201007
D     201234
D     230103
E     201003
E     202011
;
run;

Expected dataset: want

如果变量 month 的值为 201003,在本例中,有两个观测的 month 值为 201003,其所对应的变量 code 值分别为 “A”, “B”, “E”,则在数据集 want 中,删除所有 code 值为 “A”, “B” 或 “E” 的观测,期望结果如下:

C     201004
C     201007
D     201234
D     230103

My Solution 1: 利用 data 步中的 merge 语句

data want;
	merge have(in=aa) have(where=(month=201003) in=bb);
	by code;
	if aa and not bb;
run;

My Solution 2: 与上面思路类似,但是用 sql 过程步来实现

proc sql;
	create table want as
	select * from have where code not in (select distinct code from have where month=201003);
quit;

My Solution 3: 利用 data 步 by 语句 (by 老马

data want(drop=tmp);
	do until(last.code);
		set have;
		by code;
		if month=201003 then tmp=code;
	end;
	do until(last.code);
		set have;
		by code;
		if code^=tmp then output;
	end;
run;



用bShare分享或收藏本文


您可能也喜欢:

SAS 练手: 根据要求查找缺失值

Create new variables based on many original variables

Retrieve physical path of the current SAS program

我表示 灰常无奈

通过 SAS 读取网页内容
来自无觅网络的相关文章:

美圆交友网 交友网站源码 整站粉色风格asp+sql (@piaoxian)

动网8.3SQL存储过程商业版 ASP论坛源码 (@piaoxian)

○ 删除23张 (@cy8robin)

告诉大家一个无敌删除命令,任意无法删除的文件都能删除[删除顽固文件|病毒] (@lffly)

花旗宝支付平台(完整商业SQL版) 花旗游戏支付平台源码 (@piaoxian)
无觅
Related Posts

Get specific variable name according to observation in SAS

 Array, Dataset, Expected dataset, Function, Macro, Question, Solution  Get specific variable name according to observation in SAS已关闭评论
9月 042010
 

Question from: 峥岩

Dataset: have

data have;
	input x1-x8;
cards;
1 3 2 6 3 . 4 5
2 6 1 4 6 3 6 2
2 .z . 3 . 5 3 7
1 . 3 . 1 1 5 5
8 8 8 8 8 8 8 8
7 8 6 8 8 9 1 2
;

Expected dataset: want

  • Create a character variable named MaxVariable.  The value of MaxVariable should be the name of variable that has the maximum value among x1-x8.
  • Create a numeric variable named Num; its value should be the number of variables with missing values.
  • The expected dataset want:
x1	x2	x3	x4	x5	x6	x7	x8	MaxVariable	Num
1	3	2	6	3	.	4	5	x4	1
2	6	1	4	6	3	6	2	x2	0
2	Z	.	3	.	5	3	7	x8	3
1	.	3	.	1	1	5	5	x7	2
8	8	8	8	8	8	8	8	x1	0
7	8	6	8	8	9	1	2	x6	0

Solution 1:

data want(drop=i);
	set have;
	length MaxVariable $ 32;
	array tmp[*] x1-x8;
	Num=nmiss(of tmp[*]);
	do i=1 to dim(tmp);
		if tmp[i]=max(of tmp[*]) then do;
			MaxVariable=vname(tmp[i]);
			return;
		end;
	end;
run;

There is a problem in Solution 1.  If multiple maximum values exist in the the same observation, the three maximum values (6) in the second observation for example, this solution can only get the name of the first variable with the maximum value, allthough this code can be easily modified (just replace do i=1 to dim(tmp) with do i=dim(tmp) to 1 by -1) to obtain the name of the last variable.  So, for Solution 1, the value of MaxVariable depends on the variable order in the dataset have. In Solution 2, this problem will be considered.

Solution 2 (improved by 老马, elek.me):

%let n=3; /* choose the n(th) variable among those with maximum values */
%let first=1; /* when &n is larger than the number of maximum values:
				 assign 1 to obtain the first variable name, and 0 the last */
data want(drop=i nMax MaxVariableTmp);
	set have;
	length MaxVariable $ 32;
	length MaxVariableTmp $ 32767;
	array tmp[*] x1-x8;
	nMax=0;
	do i=1 to dim(tmp);
		if tmp[i]=max(of tmp[*]) then do;
			nMax+1;
			MaxVariableTmp=catx(",",MaxVariableTmp,vname(tmp[i]));
		end;
	end;
	MaxVariable=scan(MaxVariableTmp,ifn(nMax>=&n.,&n.,ifn(&first.=1,1,nMax)),",");
	Num=nmiss(of tmp[*]);
run;

In Solution 2, you can specify the Nth variable name among those variables with the same maximum values, by defining the value of the macro variable &n.  For each observation, if &n is larger than the number of maximum values, the macro variable &first is provided an option to choose, 1 for the first and 0 for the last (by default).


Related Posts

An efficient macro for Stump – two terminal nodes tree

 Array, Boost Algorithms, Data Mining, Gini Index, predictive modeling  An efficient macro for Stump – two terminal nodes tree已关闭评论
2月 182010
 


In this post, I post an improved SAS macro of the single partition split algorithm in Chapter 2 of "Pharmaceutical Statistics Using SAS: A Practical Guide" by Alex Dmitrienko, Christy Chuang-Stein, Ralph B. D'Agostino.

The single partition split algorithm is a simplified version of Stumps, and is a weak classifier, usually used to form the base weak learner for boosting algorithm. This specific classifier seeks to separate the space into 2 subspaces for independent variables where each subspace has increasingly higher purity of the response classes, say 1 and 0. In the examples, Gini Index is used to measure purity/impurity.

The SAS example macro %SPLIT in the book (found @ Here) is for illustration purpose, and is so inefficient that it practically can't be used by industrial standard.

I modified this macro and made it usable in real business applications where millions of observations and hundreds of variables are more than common.

Improved Code:



%macro Dsplit(dsn, p);
/************************************************/
/* dsn: Name of input SAS data sets. All        */
/*        independent variables should be named */
/*        as X1, X2,....,Xp and be continous    */
/*        numeric variables                     */
/*   p: Number of independent variables         */
/************************************************/
options nonotes;
%do i=1 %to &p;
proc sort data=&dsn.(keep=y w x&i)  out=work.sort; by x&i;
run;

proc means data=work.sort noprint;
     var y;
     weight w;
     output out=_ysum  n(y)=ntotal  sum(y)=ysum;
run;
data _null_;
     set _ysum;
     call symput('ntotal', ntotal);
     call symput('ysum', ysum); 
run;
data y_pred;
     set work.sort  end=eof;
     array _p[&ntotal, 2] _temporary_;
     array _g[&ntotal, 2] _temporary_;
     array _x[&ntotal]    _temporary_;
     retain _y1  _oldx 0;
     if _n_=1 then _oldx=x&i;
     _x[_n_]=x&i;
     if ^eof then do;
        _y1+y; 
       _p[_n_, 1]=_y1/_n_; _p[_n_, 2]=(&ysum-_y1)/(&ntotal-_n_);
       ppn1=_n_/&ntotal;  ppn2=1-ppn1;
       _g[_n_, 1]=2*(ppn1*(1-_p[_n_, 1])*_p[_n_, 1]+ppn2*(1-_p[_n_, 2])*_p[_n_, 2]);
       if _n_>1 then _g[_n_-1, 2]=(_oldx+x&i)/2;
       _oldx=x&i;
     end;
     else do;
       _g[_n_-1, 2]=(_oldx+x&i)/2;
       ginimin=2;
       do i=1 to &ntotal-1;
          gini=_g[i, 1]; x&i=_g[i, 2];
          keep gini x&i;
          output;
     if gini lt ginimin then do;
        ginimin=gini; xmin=x&i;
            p1_LH=_P[i, 1]; p0_LH=1-p1_LH;
              p1_RH=_P[i, 2]; p0_RH=1-p1_RH;
        c_L=(p1_LH>0.5); c_R=(p1_RH>0.5);
    end;
       end;
     end;  
     do i=1 to &ntotal;
        if _x[i]<=xmin then y_pred=c_L;
        if _x[i]>xmin  then y_pred=c_R;
        keep y_pred y w; output y_pred;
     end;
     call symput('ginimin', ginimin);
     call symput('xmin', xmin);
     end;
run;
data _giniout&i;
     length varname $ 8;
     varname="x&i";
     cutoff=&xmin;
     gini=&ginimin;
run;
%end;
data outsplit;
     set %do i=1 %to &p;
            _giniout&i
         %end;;
run;
proc datasets library=work nolist;
     delete _giniout:;
quit;
option notes;    
%mend;

/* weak classifiers for Boost Algorithm */
%macro gini(_y0wsum, _y1wsum, i, nobs);
data _giniout&i.(keep=varname  mingini cut_val   p0_LH  p1_LH  c_L  p0_RH  p1_RH  c_R);     
     length varname $ 8;
     set sorted  end=eof;
  retain  _y0w  _y1w  _w  _ginik  0;
  retain  p0_LH  p1_LH  p0_RH  p1_RH  c_L  c_R  0; 
  array _mingini{4}  _temporary_; 
  if _n_=1 then do;
     _y0w = (y^=1)*w;  _y1w = (y=1)*w;   _w = w;    
  _mingini[1] = 2;
        _mingini[2] = 1; 
        _mingini[3] = x&i; 
        _mingini[4] = x&i;        
  end;
  else do;
     _y0w + (y^=1)*w; _y1w + (y=1)*w; _w + w;
  end;

  if ^eof then do;       
        p0_L = _y0w/_w;  p0_R = (&_y0wsum - _y0w)/(1-_w);
        p1_L = _y1w/_w;  p1_R = (&_y1wsum - _y1w)/(1-_w);
        _ginik= p1_L*p0_L*_w + p1_R*p0_R*(1-_w);
  end;

  if _ginik<_mingini[1] then do;     
  _mingini[1]=_ginik;   _mingini[2]=_n_; _mingini[3]=x&i;
  p0_LH=p0_L;  p1_LH=p1_L;  p0_RH=p0_R;  p1_RH=p1_R;
  c_L = (p1_LH > 0.5); c_R = (p1_RH > 0.5);  
  end; 
  if _n_=(_mingini[2]+1) then _mingini[4]=x&i;

  if eof then do;   
     cut_val=(_mingini[3]+_mingini[4])/2;
  mingini=_mingini[ 1]; 
        varname="x&i";   
  output  ; 
  end;   

run;
%mend;

%macro stump_gini(dsn, p, outdsn);
/***************************************************/
/*    dsn: Name of input SAS data sets. All        */
/*          independent variables should be named  */
/*          as X1, X2,....,Xp and be continous     */
/*          numeric variables                      */
/*      p: Number of independent variables         */
/* outdsn: Name of output SAS data sets. Used for  */
/*          Subsequent scoring. Not to named as    */
/*          _giniout.....                          */
/***************************************************/
%local i  p  ;

%do i=1 %to &p;
    proc sort data=&dsn.(keep=x&i  y  w)  out=sorted  sortsize=max;
      by x&i;
 run;
 data sortedv/view=sortedv;
      set sorted;
   y1=(y=1); y0=(y^=1);
 run;
 proc means data=sortedv(keep=y0 y1 w)   noprint;
      var y0  y1;
   weight w;
   output out=_ywsum(keep=_y0wsum  _y1wsum  _FREQ_)  
                sum(y0)=_y0wsum  sum(y1)=_y1wsum;
 run;
    data _null_;
      set _ywsum;
   call execute('%gini('|| compress(_y0wsum) || ','
                        || compress(_y1wsum) || ','
                              || compress(&i)      || ','
                              || compress(_FREQ_)  || ')'
                      );
 run;
%end;
data &outdsn;
     set %do i=1 %to &p;
         _giniout&i
   %end;;
run;
proc sort data=&outdsn; by mingini; run;
proc datasets library=work nolist;
     delete %do i=1 %to &p;
            _giniout&i
   %end;;
run;quit;
%mend;


%macro css(_ywsum, i, nobs);
data _regout&i.(keep=varname  mincss cut_val  ypred_L  ypred_R);     
     length varname $ 8;
     set sorted  end=eof;
  retain _yw  _w  0;
  retain  ypred_L  ypred_R 0;
  array _mincss{4}  _temporary_; 
  if _n_=1 then do;
     _yw = y*w;  _w = w;  
  _mincss[1] = constant('BIG'); 
        _mincss[2] = 1; 
        _mincss[3] = x&i; 
        _mincss[4] = x&i;
        ypred_L = _yw/_w;  ypred_R = (&_ywsum-_yw)/(1-_w);  
  end;
  else do;
     _yw + y*w; _w + w;
  end;
  if ^eof then do;     
  cssk = 1 - _yw/_w*_yw - (&_ywsum-_yw)/(1-_w)*(&_ywsum-_yw);   
  end;
  else do;
     cssk = 1 -_yw**2;
  end;
  if cssk<_mincss[1] then do;     
  _mincss[1]=cssk;   _mincss[2]=_n_; _mincss[3]=x&i;
  ypred_L=_yw/_w;  ypred_R=(&_ywsum-_yw)/(1-_w);
  end; 
  if _n_=(_mincss[2]+1) then _mincss[4]=x&i;

  if eof then do;   
     cut_val=(_mincss[3]+_mincss[4])/2;
  mincss=_mincss[ 1]; 
        varname="x&i";   
  output  ;
  end;   
     
run;
%mend;

%macro stump_css(dsn, p, outdsn);
/***************************************************/
/*    dsn: Name of input SAS data sets. All        */
/*          independent variables should be named  */
/*          as X1, X2,....,Xp and be continous     */
/*          numeric variables                      */
/*      p: Number of independent variables         */
/* outdsn: Name of output SAS data sets. Used for  */
/*          Subsequent scoring. Not to named as    */
/*          _giniout.....                          */
/***************************************************/
%local i  p  ;
options nosource;
%do i=1 %to &p;
    proc sort data=&dsn.(keep=x&i  y  w)  out=sorted  sortsize=max;
      by x&i;
 run;
 proc means data=sorted(keep=y w)   noprint;
      var y;
   weight w;
   output out=_ywsum(keep=_ywsum  _FREQ_)  sum(y)=_ywsum;
 run;
    data _null_;
      set _ywsum;
   call execute('%css(' || compress(_ywsum) || ','
                              || compress(&i)     || ','
                              || compress(_FREQ_) || ')'
                      );
 run;
%end;
data &outdsn;
     set %do i=1 %to &p;
         _regout&i
   %end;;
run;
proc sort data=&outdsn; by mincss; run;
options source;
%mend;


Comparing the time used and results.

The example data used is the AUC Small training data, 200 sets of predictors for 15,000 ratings, from AusDM2009 competition, and can be found @ Here .

Using example code from the book, it takes 1563 seconds on a regular Windows desktop (Core2Duo E6750 2.67GHz, 4GB Memory, 7K2 rpm HDD with 8MB cache) to process 5 numerical continous variables, whereas with improved macro, it only takes 1 second to process the same amount of data. This improvement is criticle since weak classifiers like this one won't be used alone, but as the base for more time-consuming Boosting algorithms. With original macro, it is practically not usable for any boosting algorithms on data sets with hundreds of or more observations.

Original Macro:

Improved Macro:
Comparing the results from original macro and the improved macro, they both select X3 with the same partition cut off point and the same Gini Index.

Reference:
Pharmaceutical Statistics Using SAS: A Practical Guide by Alex Dmitrienko, Christy Chuang-Stein, Ralph B. D'Agostino, SAS Publishing 2007

Pharmaceutical Statistics Using SAS: A Practical Guide (SAS Press)
 Posted by at 3:50 上午