data preparation

February 25, 2011
 

R surpassed Matlab and SAS in popularity on a recent TIOBE Programming Community Index. I agree with this ranking[1]: R can largely replace Matlab and SAS for MB-size data, and it is getting hot. The fast-growing sectors, quantitative finance, bioinformatics and web analytics, are all embracing R. In those competitive fields, survival demands the skills of a developer rather than a programmer: something that brings more power, or a higher AUC. The cross-platform, open-source, C-based, function-friendly R is right on top of the tide. SAS programmers have to wait for SAS Institute to build tools for them, while an R developer can forge his own gloves and axes. SAS's licensing policy is also problematic: SAS is divided into modules such as BASE, STAT, IML, GRAPH and ETS, and to use even a single procedure you must buy its whole module and renew the license each year. Norman Nie[2] said that 15 core statistical procedures, such as cluster analysis, factor analysis, PCA and Cox regression, would satisfy 85% of users' needs; SAS programmers may thus pay for hundreds of procedures in order to use those 15. By contrast, you can download exactly the R packages you need, or build brand-new functions and packages yourself.

However, with larger data, R and Matlab users must painfully jump down to low-level languages. For the convenience of matrix computation, R and Matlab routinely keep copies of data in memory and do not let the garbage collector frenetically dump them. With GB-size data, people soon hear the outcry of memory shortage on their workstation and consider resorting to C or C++. That is possibly why Matlab and R talents are so popular in academia yet less welcome in the job market. Amazingly, SAS is pretty robust with data at this scale; it has rarely failed me, even though I sometimes complain about its speed. Of course, if memory is not a hurdle, R is another good option. Inspired by this idea, Norman Nie's startup Revolution Analytics set out to reinvigorate community R. It uses an XDF data file on hard disk to hold data before it is loaded onto the memory stack, and its unique RevoScaleR package performs data analysis directly against those files. The strategy is quite similar to the in-database technology in SAS's high-end enterprise products. Besides, XDF forcibly partitions data to allow multi-core processing. On a common PC with 3 GB of memory, I compared Revolution R and SAS on a 1.1 GB flat file: SAS integrated it in 20 seconds, while Rev R took 5 minutes. Afterward SAS completed some data transformation steps and summarization procedures without incident; on those occasions the R core of Rev R usually crashed, leaving Rev R's IDE as a zombie. Presumably compatibility between the R core and the XDF file system is a concern. In addition, at this stage the RevoScaleR package offers only a limited set of functions and procedures that exploit the XDF system, such as summary statistics, linear regression, logistic regression and cross-tabs. Another worry is that the R core contributors, led by Ross Ihaka and Robert Gentleman, may be reluctant to further assist Norman Nie's effort to develop his for-profit product.

For TB-size data, R rolls back in. Just as Newton's theories expire in a physical world of extraordinary speed or scale, complex, high-volume, high-dimensional data stuns any extraction, transformation, loading or analysis operation based on the RDBMS concept. MySQL, Oracle and SAS all fade away; distributed systems and functional languages come to the rescue, and open-source software like Python and R rules this domain. A data center can aggregate the total memory of thousands of servers, and with those gigantic memories pooled together, data analysis in R can be performed by scheduling a Map/Reduce task. However, the well-known IT giants that own such data centers are not likely to buy many licenses for a commercial package. Community R, not Revolution R, will enjoy a significant share of the rapid growth in this area.
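The Map/Reduce scheduling mentioned above can be sketched in miniature. This is a hypothetical single-process illustration, not code from any particular framework: each "worker" maps its own data partition to a partial summary, and a reduce step combines the summaries into a grand result, so no node ever needs the whole dataset in memory.

```python
from functools import reduce

# Hypothetical partitions of one dataset, as if spread across servers.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

# Map: each worker summarizes its own partition as (sum, count).
mapped = [(sum(p), len(p)) for p in partitions]

# Reduce: combine the partial summaries pairwise into one total.
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), mapped)

# The grand mean equals the mean of the pooled data, computed distributedly.
print(total / count)
```

The same map-then-combine pattern underlies distributed versions of regression and summary statistics: any statistic expressible through partition-wise sufficient statistics can be computed this way.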

Overall, SAS still handles GB-size data much better than Revolution R today, and Revolution R may need more time to mature for production use. Revolution R has launched a SAS-to-R challenge and deliberately created a function to transform SAS datasets into its XDF format. I like to see newcomers in the arena, and SAS may benefit from the competition.

References:
1. 'TIOBE Programming Community Index for February 2011'. http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
2. 'Another Open Source Swipe at IBM and SAS'. http://blogs.forbes.com/quentinhardy/2011/02/01/another-open-source-swipe-at-ibm-and-sas/


************(1) DOWNLOAD AND UNZIP THE DATA ***********************;
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/mort2006us.zip

***********(2)SAS'S CODE TO INTEGRATE DATA*****************;
data mort06;
 infile 'C:\mort06.dat'
    lrecl=490 missover; /*lrecl must reach the last field read (Hispanic_origin ends at column 488)*/
 input    @20 Resident_status     $1. 
    @83 Plac_of_death_decedent_status   $1.  
    @85 Day_of_week_of_death   $1. 
    @94 Date_receipt    $8.  
    @102 Data_Year     $4. 
    @65 Month_of_Death     $2.
    @60 Sex      $1.  
    @445 Race      $6.
    @70 Age      $13.  
    @84 Marital      $1.  
    @484 Hispanic_origin     $5.  
    @61 Education             $4.  
    @146 ICD10 $4.  
    @150 code358    $3.   
    @154 code113 $3.     
    @157 code130 $3.    
    @160 cause_recode   $2.   
    @145 Place_injury   $1.   
    @106 Injury_work     $1.  
    @108 Method_disposition    $1. 
    @109 Autopsy      $1. 
    @144 Activity    $1. 
;
run;

proc contents data=mort06;
run;

############(3) REV R'S CODE TO INTEGRATE DATA#####################
colList <- list(  
    "Resident_status"        =list(type="factor",  start=20, width=1 ),
     "Plac_of_death_decedent_status"    =list(type="factor", start=83, width=1 ),
     "Day_of_week_of_death"      =list(type="factor", start=85, width=1 ),     
     "Date_receipt"         =list(type="factor", start=94, width=8 ),  # width matches the $8. informat in the SAS step
     "Data_Year"         =list(type="factor",  start=102, width=4  ),
     "Month_of_Death"        =list(type="factor", start=65, width=2  ),
     "Sex"           =list(type="factor", start=60, width=1  ),
     "Race"           =list(type="factor", start=445, width=6 ),
     "Age"           =list(type="factor", start=70, width=13  ),
     "Marital"          =list(type="factor", start=84, width=1 ),
     "Hispanic_origin"        =list(type="factor", start=484, width=5  ),
     "Education"         =list(type="factor", start=61, width=4 ),
     "ICD10"          =list(type="factor", start=146, width=4  ),     
     "code358"          =list(type="factor", start=150, width=3  ), 
     "code113"          =list(type="factor", start=154, width=3  ),
     "code130"          =list(type="factor", start=157, width=3  ),
     "cause_recode"         =list(type="factor", start=160, width=2  ),
     "Place_injury"         =list(type="factor", start=145, width=1  ),
     "Injury_work"          =list(type="factor", start=106, width=1  ),
     "Method_disposition"       =list(type="factor", start=108, width=1  ),
     "Autopsy"          =list(type="factor", start=109, width=1  ),
     "Activity"          =list(type="factor", start=144, width=1  )  
    )
mortFile <- file.path("C:", "MORT06.DAT")
sourceData <- RxTextData(mortFile, colInfo=colList )                 
outputData <- RxXdfData("MORT")
rxImportToXdf(sourceData, outputData,  overwrite = TRUE) 
rxGetInfoXdf("MORT.xdf", getVarInfo=TRUE) 

#################END OF CODING#######################################
December 13, 2010
 


Three dimensions are usually regarded as the maximum for data presentation. With the ODS statistical graphics opened up in SAS 9.2 and its Graph Template Language, 3D graphing is no longer a perplexing problem for SAS programmers. However, today's magnificent amounts of multi-dimensional data call for a more vivid and simpler way to be displayed.

The emergence of Google Motion Chart now provides a sound solution for visualizing data in more than three dimensions. This web-based analytical technology originated with Dr. Hans Rosling's innovation: Dr. Rosling and his Gapminder Foundation invented a technique that demonstrates the relationships among multiple dimensions with animated bubbles. They developed many bubble plots on Gapminder's website to discover knowledge from a wealth of public information, especially for regional and national comparisons. It soon attracted Google's attention, and in 2008, after an agreement between Dr. Rosling and Google's two founders, Google launched its Motion Chart gadget. People can now create motion charts with Google Docs, an online alternative to Microsoft Office.

The combination of SAS and Google Motion Chart offers a handy and cheap way to visualize up-to-five-dimensional data. A motion chart supports five variables in a single plot; commonly the data structure requires time (animation), var1 (X axis), var2 (Y axis), var3 (color) and var4 (bubble size). Correlation from var1 through var4 is expected: bubbles with changing color and size tend to move along the diagonal. Overall, 5-D visualization can be rendered within that single plot. In this example the SAS help dataset SASHELP.SHOES is used. The dataset has several regions to compare against each other: log-transformed returns serve as the Y axis and log-transformed sales as the X axis, a series of virtual dates is assigned to each region, inventory drives the bubble size, and store count drives the color. With SAS, the data structure for Motion Chart can be prepared quickly; once the CSV file is uploaded to Google Docs, a motion chart is ready to be published on any webpage. OK, it's time to sit back and discover some interesting tendencies...

References:
1. 'Show me--New ways of visualising data'. The Economist. Feb 25th 2010.
2. 'Making data dance'. The Economist. Dec 11th 2010.
3. Google Docs online help center. 2010.

*********(1) Extract data from SASHELP.SHOES***********;
proc sql;
create table test as
select region, Sales, Inventory, Returns, Stores
from sashelp.shoes
order by region , sales desc
;quit;
********(2) Create a random variable for time************;
data test1;
do i=1 by 1 until (last.region);
set test;
by region;
time=today()-i+1;
mytime=put(time, mmddyy8.);
drop i;
output;
end;
run;
********(3) Transform some variables with log**********;
proc sql;
create table test2 as
select region, mytime, log(sales) as logsales, log(returns) as logreturn, Stores as storenum, Inventory
from test1
order by region, mytime
;quit;
********(4) Export data as CSV***************;
proc export data=test2 outfile='C:\Users\Yanyi\Desktop\test.csv' replace;
run;
*******(5) Upload CSV to Google Docs************;
******(6)  Create Google Motion Chart manually**********;

**********END*********TEST PASSED 12DEC2010****************************;


September 23, 2010
 


Situation:
The purpose of this research was to (1) to explore a recent multi-study approach (Arends, et al. 2008) in combining observational survival data instead of traditional meta-analysis, and (2) to develop multivariate random-effects models with or without covariates to aggregate three studies on Bovine Respiratory Disease (BRD). Models were constructed, assessed and presented by programming in SAS®.
Task:The multivariate random-effects models built in this report demonstrated improved efficiency, and generalizability and precision.
Action:First the modeling is simple and easy to explain. Second the aggregation of the three studies was accomplished and survival proportion with CIs were updated. Second the estimated survival proportions by such models, showing reduced standard error, are more precise than Kaplan-Meier estimated survival proportion or Cox proportion hazards regression estimated survival proportion.
Result:Transformed data with LOCF by LAST. and FIRST. variables;Proposed a multi-survival-data model by PROC PHREG/LIFETEST/GLIMMIX and increased >70% precision; Rendered graphs by PROC SGPLOT/SGPANEL and SAS Graph Template language; Generated table and listing by PROC REPORT/SQL/TABULATE
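The LOCF (last observation carried forward) step mentioned in the Result line was done in SAS with FIRST./LAST. processing; the code below is only a minimal Python sketch of the same idea, on hypothetical (subject, value) records where None marks a missing measurement.

```python
def locf(records):
    """Carry the last non-missing value forward within each subject.

    records: list of (subject, value) tuples, sorted by subject and time;
    value is None when the measurement is missing.
    """
    filled, last_seen = [], {}
    for subject, value in records:
        if value is None:
            value = last_seen.get(subject)  # reuse the last observation, if any
        else:
            last_seen[subject] = value      # remember the newest observation
        filled.append((subject, value))
    return filled

# Hypothetical input: subject B's first record has no prior value to carry.
data = [("A", 1.0), ("A", None), ("A", 2.0), ("B", None), ("B", 5.0)]
print(locf(data))
```

A leading missing value within a subject stays missing, which matches how a FIRST.-initialized retained variable behaves in a SAS DATA step.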

Reference: Arends LR, Hunink MG, Stijnen T. Meta-analysis of summary survival curve data. Stat Med. 2008 Sep 30;27(22):4381-96.
Code I

/******************************************************************/
PROGRAM: NO-COVARIATES MODEL
PURPOSE: CREATE ARENDS' MODEL FOR DATA SETS 1 & 2 & 3 BY NO-COVARIATE METHOD
AUTHOR: ---------------
LAST REVISE DATE: JUNE 18, 2010
LAST TEST DATE: JUNE 18, 2010
*****************************************************************/
libname source 'd:\animal_data'; /*set up the library for the input*/
/*the WORK library is used for the output*/

********************BEGINNING OF THE PROGRAM********************;

**********#1 USE PROC LIFETEST**********;
ods graphics on;
proc lifetest data = source.data_1_2_3 OUTSURV =pred_bylifetest plots=(ls,lls) alpha=0.05; /*generate the log and loglog transformation plots*/
title 'Figure 2A & 2B ';
time DOF_FIRST_TX * status(0);
strata source;
run;
ods graphics off;
**********END**********;

**********#2 GENERATE SURVIVAL CURVE BY WEEKS**********;
proc sgplot data=pred_bylifetest;
title 'Figure 1';
step x= DOF_FIRST_TX y=survival /group=source;
yaxis min=0;
xaxis grid values=(0, 7, 14, 21, 28, 35, 42, 49, 56);
run;
**********END**********;

**********#3 MANIPULATION OF THE PREDICTED DATA SET BY PROC LIFETEST**********;
data pred_bylifetest;
set pred_bylifetest;
*****CLEAN THE DATA SET;
if survival=. then delete; /*delete invalid information for survival proportion*/
*****ADD TWO VARIABLES;
if survival ne 1 then lls=log(-log(survival)); /*add the log(-log()) transformed survival proportion*/
if DOF_FIRST_TX gt 0 then ln_day=log(DOF_FIRST_TX); /*add the log transformed time*/
*****ADD LABELS;
label DOF_FIRST_TX='Days to the first treatment'
ln_day='log(Days to the first treatment)'
lls='log(-log(Survival probability function))';
run;
**********END**********;

**********#4 FIT THE SCATTER DOTS WITH LEAST-SQUARES LINES**********;
*****FIT THE SCATTER DOTS WITH SEPARATE STRAIGHT LINES;
proc sgplot data=pred_bylifetest;
title 'Figure 2C';
scatter x=ln_day y=lls/group=source;
reg x=ln_day y=lls/group=source degree=1; /*fit the scatter dots with the straight lines separately*/
yaxis label='log(-log(Survival probability))';
xaxis label='log(Days to the first treatment)';
run;
*****FIT THE SCATTER DOTS WITH SEPARATE QUADRATIC CURVES;
proc sgplot data=pred_bylifetest;
title 'Figure 2D';
scatter x=ln_day y=lls/group=source;
reg x=ln_day y=lls/group=source degree=2; /*fit the scatter dots with the quadratic curves separately*/
yaxis label='log(-log(Survival probability))';
xaxis label='log(Days to the first treatment)';
run;
**********END**********;

**********#5 USE PROC MIXED TO GENERATE COEFFICIENTS**********;
*****GENERATE THE COEFFICIENTS WITH THE LINEAR MODEL;
proc mixed data= pred_bylifetest;
title 'Table 1 & 2';
class source;
model lls=ln_day/s outp=pred_bymixed_linear;
random source/s;
run;
*****GENERATE THE COEFFICIENTS WITH THE QUADRATIC MODEL;
proc mixed data= pred_bylifetest;
title 'Table 1 & 2';
class source;
model lls=ln_day ln_day*ln_day/s outp=pred_bymixed_quadratic;
random source/s;
run;
**********END**********;

**********#6 GENERATE THE COMBINED MODEL BY THE LINEAR EQUATION**********;
*****GENERATE THE MODEL;
data combined_bymixed_linear;
do DOF_FIRST_TX=1 to 56 by 1;
Esurvival=exp(-exp(-1.9249+0.6421*log(DOF_FIRST_TX)));
source='Combine';
output;
end;
run;
*****COMBINE THE MODEL WITH RAW SURVIVAL PROPORTION;
data combined_bymixed_linear_plus_raw;
set combined_bymixed_linear(rename=(Esurvival=SURVIVAL)) pred_bylifetest(keep=source DOF_FIRST_TX SURVIVAL);
run;
*****COMPARE THE COMBINED WITH RAW SURVIVAL CURVES;
proc sgplot data=combined_bymixed_linear_plus_raw;
title 'Figure 3A';
step x=DOF_FIRST_TX y=survival/group=source;
yaxis grid values=(0 to 1 by 0.1);
run;
**********END**********;

**********#7 GENERATE THE COMBINED MODEL BY THE QUADRATIC EQUATION**********;
*****GENERATE THE MODEL;
data combined_bymixed_quadratic;
do DOF_FIRST_TX=1 to 56 by 1;
Esurvival=exp(-exp(-2.4587+1.3285*log(DOF_FIRST_TX)-0.1690*log(DOF_FIRST_TX)*log(DOF_FIRST_TX))); /*coefficients were given by proc mixed above*/
source='Combine';
output;
end;
run;
*****COMBINE THE MODEL WITH RAW SURVIVAL PROPORTION;
data combined_bymixed_quad_plus_raw;
set combined_bymixed_quadratic(rename=(Esurvival=SURVIVAL)) pred_bylifetest(keep=source DOF_FIRST_TX SURVIVAL);
run;
*****COMPARE THE COMBINED WITH RAW SURVIVAL CURVES;
proc sgplot data=combined_bymixed_quad_plus_raw;
title 'Figure 3B';
step x=DOF_FIRST_TX y=survival/group=source;
yaxis grid values=(0 to 1 by 0.1);
run;
**********END**********;

**********#8 RECOVER THE MODELS FOR THE SPECIFIC-STUDY CURVES**********;
*****RECOVER THE LINEAR MODEL;
data recover_bymixed_linear;
set pred_bymixed_linear;
Esurvival=exp(-(exp(pred)));
Eucl=exp(-(exp(lower)));
Elcl=exp(-(exp(upper)));
run;
*****RECOVER THE QUADRATIC MODEL;
data recover_bymixed_quadratic;
set pred_bymixed_quadratic;
Esurvival=exp(-(exp(pred)));
Eucl=exp(-(exp(lower)));
Elcl=exp(-(exp(upper)));
run;
**********END**********;

**********#9 GENERATE STUDY-SPECIFIC SURVIVAL PROPORTIONS AND CONFIDENCE INTERVALS*********;
*****DRAW THE RAW SURVIVAL CURVES AND THE CORRESPONDING CONFIDENCE INTERVALS;
proc sgplot data=pred_bylifetest;
title 'Figure 4A';
series x=DOF_FIRST_TX y=survival/group=source;
band lower=SDF_LCL upper=SDF_UCL x=DOF_FIRST_TX /group=source transparency=0.5;
yaxis grid values=(0 to 1 by 0.1);
run;
*****DRAW THE SURVIVAL CURVES FROM THE LINEAR MODEL AND THE CORRESPONDING CONFIDENCE INTERVALS;
proc sgplot data=recover_bymixed_linear;
title 'Figure 4B';
series x=DOF_FIRST_TX y=Esurvival/group=source;
band lower=Elcl upper=Eucl x=DOF_FIRST_TX /group=source transparency=0.5;
yaxis grid values=(0 to 1 by 0.1);
run;
*****DRAW THE SURVIVAL CURVES FROM THE QUADRATIC MODEL AND THE CORRESPONDING CONFIDENCE INTERVALS;
proc sgplot data=recover_bymixed_quadratic;
title 'Figure 4C';
series x=DOF_FIRST_TX y=Esurvival/group=source;
band lower=Elcl upper=Eucl x=DOF_FIRST_TX /group=source transparency=0.5;
yaxis grid values=(0 to 1 by 0.1);
run;
**********END**********;

********************END OF THE PROGRAM********************;

Code II

/*******************************************************************
PROGRAM: COVARIATE MODEL
PURPOSE: CREATE ARENDS' MODEL FOR DATA SETS 1 & 2 BY COVARIATE METHOD
AUTHOR: -----------------
LAST REVISE DATE: JUNE 18, 2010
LAST TEST DATE: JUNE 18, 2010
********************************************************************/

libname source 'd:\animal_data'; /*set up the library for the input*/;
libname mylib 'd:\animal_data_output'; /*set up the library for the output*/

********************BEGINNING OF THE PROGRAM********************;

**********#1 USE PROC PHREG WITH TEMPERATURE COVARIATE**********;
**** INPUT VALUES FOR TEMPERATURE;
data mylib.vector;
do rect=38.1 to 41.7 by 0.1; /*generate a vector as covariate for proc phreg*/
id=cats('The temperature=', rect); /*add a label for each temperature level*/
output;
end;
run;
*****OUTPUT PREDICTIONS ;
proc phreg data=source.data_1_2;
model DOF_FIRST_TX * status(0)=RECT;
strata source;
baseline covariates=mylib.vector out=mylib.pred_byphreg logsurv=ls loglogs=lls survival=_all_ /alpha=0.05; /*generate the estimated survival proportions and the corresponding 95% CI*/
run;
***********END ***********;

**********#2 MANIPULATION OF PREDICTION DATA SET BY PHREG*********;
data mylib.pred_byphreg ;
set mylib.pred_byphreg ;
if DOF_FIRST_TX gt 0 then logday=log(DOF_FIRST_TX) ; /*add a log transformation for the time variable*/
nls=-ls; /*add a negative log transformation for the survival proportions*/
label DOF_FIRST_TX='Days to the first treatment' /*add the labels to the variables*/
RECT='Rectal temperature'
logday='log(Days to the first treatment)'
nls='-log(Survival probability function)';
rect=trim(left(rect)); /*align all temperatures to the same format*/
run;
***********END***********;

**********#3 USE GRAPH TEMPLATE LANGUAGE TO GENERATE THE RESPONSE SURFACE FOR THE RAW SURVIVAL PROPORTIONS**********;
***** MAKE A 3D MODEL FOR SURVIVAL PROPORTION BY PROC TEMPLATE;
proc template;
define statgraph survival_surface_3d; /*the template will be saved under sasuser.templat*/
dynamic var1 var2 var3 ; /*make the three dynamic variables for proc sgrender*/
begingraph;
layout overlay3d / cube=false;
surfaceplotparm x=var1 y=var2 z=var3 /
surfacetype=fill
surfacecolorgradient=var3
colormodel=twocolorramp
reversecolormodel=true; /*the gradient color of the surface follows the survival proportion*/
endlayout;
endgraph;
end;
run;
*****RENDER THE SURVIVAL PROPORTION FOR DATA SET 1 BY PROC TEMPLATE;
proc sgrender data=mylib.pred_byphreg template=survival_surface_3d;
title 'Figure 5A';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="survival" ;
where source='A';
run;
*****RENDER THE SURVIVAL PROPORTION FOR DATA SET 2 BY PROC TEMPLATE;
proc sgrender data=mylib.pred_byphreg template=survival_surface_3d;
title 'Figure 5C';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="survival";
where source='B';
run;
***** MAKE A 3D MODEL FOR CONFIDENCE INTERVAL BY PROC TEMPLATE;
proc template;
define statgraph ci_surface_3d; /*the template will be saved under sasuser.templat*/
begingraph;
dynamic var1 var2 var3 var4 ; /*make the four dynamic variables for proc sgrender*/
layout overlay3d / cube=false rotate=60;
surfaceplotparm x=var1 y=var2 z=var3; /*make the response surface for the upper confidence interval*/
surfaceplotparm x=var1 y=var2 z=var4; /*make the response surface for the lower confidence interval*/
endlayout;
endgraph;
end;
run;
***** RENDER THE CONFIDENCE INTERVAL FOR DATA SET 1 BY PROC TEMPLATE;
proc sgrender data=mylib.pred_byphreg template=ci_surface_3d;
title 'Figure 5B';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="UpperSurvival" var4="LowerSurvival";
where source='A';
run;
***** RENDER THE CONFIDENCE INTERVAL FOR DATA SET 2 BY PROC TEMPLATE;
proc sgrender data=mylib.pred_byphreg template=ci_surface_3d;
title 'Figure 5D';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="UpperSurvival" var4="LowerSurvival";
where source='B';
run;
***** MAKE THE CROSS SECTIONS OF SURVIVAL CURVES BY PROC SGPANEL;
proc sgpanel data=mylib.pred_byphreg(where=(rect in (38.5, 39,39.5, 40, 40.5,41, 41.5 ))); /*choose 7 temperatures to output*/
title 'Figure 5E';
panelby rect source/ layout=lattice novarname columns=7; /*decide the 2X7 configuration of the panel*/
band x=DOF_FIRST_TX lower=LowerSurvival upper=UpperSurvival; /*draw the confidence intervals*/
series x=DOF_FIRST_TX y=survival; /*draw the survival curves*/
run;
**********END**********;

**********#4 MODEL SELECTION BY PROC GLMSELECT**********;
ods graphics on;
proc glmselect data=mylib.pred_byphreg plots=criterionPanel;
title 'Figure 6 ';
class source;
model lls=logday rect logday*logday rect*rect rect*logday/selection = stepwise(select=SL) stats=all;
run;
ods graphics off;
***********END**********;

**********#5 MODEL FITTING BY PROC MIXED**********;
proc mixed data=mylib.pred_byphreg;
title 'Table 3';
class source;
model lls=logday rect logday*logday/solution outp=mylib.pred_bymixed; /*use the selection result by proc glmselect*/
random source/s;
run;
**********END**********;

**********#6 GENERATE THE RESPONSE SURFACE FOR THE COMBINED SURVIVAL PROPORTIONS**********;
*****CONSTRUCT THE COMBINED EQUATION;
data mylib.combined_bymixed;
do DOF_FIRST_TX=1 to 42 by 1;
do rect=38.1 to 41.7 by 0.1;
esuv=exp(-exp(-32.7610+1.5148*log(DOF_FIRST_TX)-0.2024*log(DOF_FIRST_TX)*log(DOF_FIRST_TX)+0.7610*rect)); /*use the coeffiicients of fixed effects result by proc mixed*/
rect=trim(left(rect));
output;
end;
end;
label esuv='Combined survival proportion';
run;
***** RENDER THE COMBINED SURVIVAL PROPORTION ;
proc sgrender data=mylib.combined_bymixed template=survival_surface_3d; /*use the previous 3D model*/
title 'Figure 7A';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="Esuv" ;
run;
***** MAKE THE CROSS SECTIONS OF SURVIVAL CURVES BY PROC SGPANEL;
proc sgpanel data=mylib.combined_bymixed (where=(rect in (38.5, 39,39.5, 40, 40.5,41, 41.5 )));
title 'Figure 7B';
panelby rect / novarname columns=7;
series x=DOF_FIRST_TX y=esuv;
run;
**********END **********;

***********#7 GENERATE THE RESPONSE SURFACE FOR THE STUDY-SPECIFIC SURVIVAL PROPORTIONS**********;
data mylib.specific_bymixed(keep=DOF_FIRST_TX rect Esurvival source Eucl Elcl);
set mylib.pred_bymixed;;
Esurvival=exp(-(exp(pred)));
Eucl=exp(-(exp(lower)));
Elcl=exp(-(exp(upper)));
rect=trim(left(rect));
run;
*****RENDER THE SURVIVAL PROPORTION FOR DATA SET 1 ;
proc sgrender data=mylib.specific_bymixed template=survival_surface_3d;
title 'Figure 8A';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="esurvival" ;
where source='A';
run;
*****RENDER THE SURVIVAL PROPORTION FOR DATA SET 2;
proc sgrender data=mylib.specific_bymixed template=survival_surface_3d;
title 'Figure 8C';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="esurvival" ;
where source='B';
run;
*****RENDER THE CONFIDENCE INTERVAL FOR DATA SET 1 ;
proc sgrender data=mylib.specific_bymixed template=ci_surface_3d;
title 'Figure 8B';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="Eucl" var4="Elcl";
where source='A';
run;
*****RENDER THE CONFIDENCE INTERVAL FOR DATA SET 2 ;
proc sgrender data=mylib.specific_bymixed template=ci_surface_3d;
title 'Figure 8D';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="Eucl" var4="Elcl";
where source='B';
run;
***** MAKE THE CROSS SECTIONS OF SURVIVAL CURVES BY PROC SGPANEL;
proc sgpanel data=mylib.specific_bymixed(where=(rect in (38.5, 39,39.5, 40, 40.5,41, 41.5 ))); /*choose 7 temperatures to output*/
title 'Figure 8E';
panelby rect source/ layout=lattice novarname columns=7; /*decide the 2X7 configuration of the panel*/
band x=DOF_FIRST_TX lower=Elcl upper=Eucl; /*draw the confidence intervals*/
series x=DOF_FIRST_TX y=Esurvival; /*draw the survival curves*/
run;
**********END*********;

********************END OF THE PROGRAM********************;
September 22, 2010
 


Business understanding: A healthcare company would like to know how a patient's health status and personal information influence his or her total health care expense.
Data understanding: The target variable is total health cost (interval). Besides it, there are 40 input variables, such as sex, age, income, marital status and disease indicators. A multiple linear regression is needed to predict total health cost from a patient's overall situation.
Data preparation: The data has already been cleaned. Since the target variable is highly skewed, a log transformation is applied. In each variable, any missing or inapplicable value is coded as '-1'. Some variables, such as education level, are aggregated into ordinal variables.
Modeling: PROC GLMSELECT is used to construct the linear models. The raw data is divided into two parts, and a stepwise method selects significant parameters.
Model evaluation: Fit statistics on the validation dataset are used to evaluate the model's performance; the final model is determined by the validation average squared error (ASE).
Deployment: The GLM formula can be output and implemented by PROC SCORE in the future.
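The validation criterion above, ASE, is just the mean of the squared prediction errors on the held-out half. A minimal sketch with hypothetical validation values (log total cost) and model predictions:

```python
def ase(actual, predicted):
    """Average squared error: mean of (actual - predicted)^2 over the holdout."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical validation observations and their model predictions.
y     = [8.0, 9.5, 7.2, 10.1]
y_hat = [8.2, 9.0, 7.0, 10.5]
print(ase(y, y_hat))
```

Among the candidate models produced at each stepwise step, the one with the smallest validation ASE is selected, which guards against overfitting the training half.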

Acknowledgment: Dr. Goutam Chakraborty. Department of Marketing. Oklahoma State University.

ods html;
ods graphics on;
proc select data=USHETH seed=93093 plots(stepaxis=number)=all;
partition fraction(validate=0.5);
class sex census_region marital_status years_educ highest_degree served_armed_forces foodstamps_purchase more_than_one_job wears_eyeglasses person_blind wear_hearing_aid is_deaf person_weight numb_visits dental_checkup cholest_lst_chck last_checkup last_flushot lost_all_teeth last_psa last_pap_smear last_breast_exam last_mammogram bld_stool_tst sigmoidoscopy_colonoscopy wear_seat_belt high_blood_pressure_diag heart_disease_diag angina_diagnosis heart_attack other_heart_disease stroke_diagnosis emphysema_diagnosis joint_pain currently_smoke asthma_diagnosis diabetes_diag_binary
;
model log_totalexp=sex census_region age marital_status years_educ highest_degree served_armed_forces foodstamps_purchase total_income more_than_one_job wears_eyeglasses person_blind wear_hearing_aid is_deaf person_weight numb_visits dental_checkup cholest_lst_chck last_checkup last_flushot lost_all_teeth last_psa last_pap_smear last_breast_exam last_mammogram bld_stool_tst sigmoidoscopy_colonoscopy wear_seat_belt high_blood_pressure_diag heart_disease_diag angina_diagnosis heart_attack other_heart_disease stroke_diagnosis emphysema_diagnosis joint_pain currently_smoke asthma_diagnosis child_bmi adult_bmi diabetes_diag_binary
/selecion=stepwise(stop=validate select=sl)
;
output out=outdata;
run;
ods graphics off;
ods html close;
August 23, 2010
 

Situation: For a telecommunications company, there is a training dataset of 18,000 customers and a scoring dataset of 2,000 customers.
Task: Find potential 3G users among the existing 2G users to increase ARPU and MARPU.
Action: Trained models by decision tree, neural network and logistic regression in SAS EM 5.2.
Result: Proposed tailored device and service, promotion channel, and branding image strategies for the segments; formed an ensemble model with a misclassification rate < .04 and implemented the model.
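The ensemble in the Result line was built in SAS EM; as a hedged illustration only, the sketch below shows one common way to ensemble classifiers, majority voting, and how a misclassification rate is computed. The three prediction vectors and the truth labels are entirely hypothetical.

```python
from collections import Counter

def majority_vote(*model_preds):
    """Combine per-model class predictions by simple majority vote."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*model_preds)]

def misclassification_rate(actual, predicted):
    """Fraction of observations where the prediction disagrees with the label."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical 0/1 predictions (1 = likely 3G adopter) for six customers.
tree     = [1, 0, 1, 1, 0, 0]
neural   = [1, 0, 0, 1, 0, 1]
logistic = [1, 1, 1, 1, 0, 0]
truth    = [1, 0, 1, 1, 0, 0]

ensemble = majority_vote(tree, neural, logistic)
print(misclassification_rate(truth, ensemble))
```

Voting lets individual models' errors cancel when they are not all wrong on the same customer, which is why an ensemble can beat each component model.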

/*VERY BEGINNING: DATA TRANSFER FOR MODELING BY SAS ENTERPRISE MINER*/
options noxwait noxsync;
dm 'x "cd D:\";';
dm 'x " md mylib" ';
dm 'x "xcopy d:\matchresult\*.* /D /E /S /R /Y /A d:\mylib " ';