data preparation

9月 232010
 


Situation:
The purpose of this research was to (1) to explore a recent multi-study approach (Arends, et al. 2008) in combining observational survival data instead of traditional meta-analysis, and (2) to develop multivariate random-effects models with or without covariates to aggregate three studies on Bovine Respiratory Disease (BRD). Models were constructed, assessed and presented by programming in SAS®.
Task:The multivariate random-effects models built in this report demonstrated improved efficiency, and generalizability and precision.
Action:First the modeling is simple and easy to explain. Second the aggregation of the three studies was accomplished and survival proportion with CIs were updated. Second the estimated survival proportions by such models, showing reduced standard error, are more precise than Kaplan-Meier estimated survival proportion or Cox proportion hazards regression estimated survival proportion.
Result:Transformed data with LOCF by LAST. and FIRST. variables;Proposed a multi-survival-data model by PROC PHREG/LIFETEST/GLIMMIX and increased >70% precision; Rendered graphs by PROC SGPLOT/SGPANEL and SAS Graph Template language; Generated table and listing by PROC REPORT/SQL/TABULATE

Reference: Arends LR, Hunink MG, Stijnen T. Meta-analysis of summary survival curve data. Stat Med. 2008 Sep 30;27(22):4381-96.
Code I

/******************************************************************/
PROGRAM: NO-COVARIATES MODEL
PURPOSE: CREATE ARENDS' MODEL FOR DATA SETS 1 & 2 $ 3 BY NO-COVARIATE METHOD
AUTHOR: ---------------
LAST REVISE DATE: JUNE 18, 2010
LAST TEST DATE: JUNE 18, 2010
*****************************************************************/
libname source 'd:\animal_data'; /*set up the library for the input*/;
/*set up the work library for output*/;

********************BEGINNING OF THE PROGRAM********************;

**********#1 USE PROC LIFETEST**********;
ods graphics on;
proc lifetest data = source.data_1_2_3 OUTSURV =pred_bylifetest plots=(ls,lls) alpha=0.05; /*generate the log and loglog transformation plots*/
title 'Figure 2A & 2B ';
time DOF_FIRST_TX * status(0);
strata source;
run;
ods graphics off;
**********END**********;

**********#2 GENERATE SURVIVAL CURVE BY WEEKS**********;
proc sgplot data=pred_bylifetest;
title 'Figure 1';
step x= DOF_FIRST_TX y=survival /group=source;
yaxis min=0;
xaxis grid values=(0, 7, 14, 21, 28, 35, 42, 49, 56);
run;
**********END**********;

**********#3 MANIPULATION OF THE PREDICTED DATA SET BY PROC LIFETEST**********;
data pred_bylifetest;
set pred_bylifetest;
*****CLEAN THE DATA SET;
if survival=. then delete; /*delete invalid information for survival proportion*/
*****ADD TWO VARIABLES;
if survival ne 1 then lls=log(-log(survival)); /*add the log(-log()) transformed survival proportion*/
if DOF_FIRST_TX gt 0 then ln_day=log(DOF_FIRST_TX); /*add the log transformed time*/
*****ADD LABELS;
label DOF_FIRST_TX='Days to the first treatment'
ln_day='log(Days to the first treatment)'
lls='log(-log(Survival probability function))';
run;
**********END**********;

**********#4 FIT THE SCATTER DOTS WITH LEAST-SQUAR LINES**********;
*****FIT THE SCATTER DOTS WITH SEPARATE STRAIGHT LINES;
proc sgplot data=pred_bylifetest;
title 'Figure 2C';
scatter x=ln_day y=lls/group=source;
reg x=ln_day y=lls/group=source degree=1; /*fit the scatter dots with the straight lines separately*/
yaxis label='log(-log(Survival probability))';
xaxis label='log(Days to the first treatment)';
run;
*****FIT THE SCATTER DOTS WITH SEPARATE QUARRATIC CURVES;
proc sgplot data=pred_bylifetest;
title 'Figure 2D';
scatter x=ln_day y=lls/group=source;
reg x=ln_day y=lls/group=source degree=2; /*fit the scatter dots with the quadratic curves separately*/
yaxis label='log(-log(Survival probability))';
xaxis label='log(Days to the first treatment)';
run;
**********END**********;

**********#5 USE PROC MIXED TO GENERATE COEFFICIENTS**********;
*****GENERATE THE COEFFICIENTS WITH THE LINEAR MODEL;
proc mixed data= pred_bylifetest;
title 'Table 1 & 2';
class source;
model lls=ln_day/s outp=pred_bymixed_linear;
random source/s;
run;
*****GENERATE THE COEFFICIENTS WITH THE LINEAR MODEL;
proc mixed data= pred_bylifetest;
title 'Table & 2';
class source;
model lls=ln_day ln_day*ln_day/s outp=pred_bymixed_quadratic;
random source/s;
run;
**********END**********;

**********#6 GENERATE THE COMBINED MODEL BY THE LINEAR EQUATION**********;
*****GENERATE THE MODEL;
data combined_bymixed_linear;
do DOF_FIRST_TX=1 to 56 by 1;
Esurvival=exp(-exp(-1.9249+0.6421*log(DOF_FIRST_TX)));
source='Combine';
output;
end;
run;
*****COMBINE THE MODEL WITH RAW SURVIVAL PROPORTION;
data combined_bymixed_linear_plus_raw;
set combined_bymixed_linear(rename=(Esurvival=SURVIVAL)) pred_bylifetest(keep=source DOF_FIRST_TX SURVIVAL);
run;
*****COMPARE THE COMBINED WITH RAW SURVIVAL CURVES;
proc sgplot data=combined_bymixed_linear_plus_raw;
title 'Figure 3A';
step x=DOF_FIRST_TX y=survival/group=source;
yaxis grid values=(0 to 1 by 0.1);
run;
**********END**********;

**********#7 GENERATE THE COMBINED MODEL BY THE QUADRATIC EQUATION**********;
*****GENERATE THE MODEL;
data combined_bymixed_quadratic;
do DOF_FIRST_TX=1 to 56 by 1;
Esurvival=exp(-exp(-2.4587+1.3285*log(DOF_FIRST_TX)-0.1690*log(DOF_FIRST_TX)*log(DOF_FIRST_TX))); /*coefficients were given by proc mixed above*/
source='Combine';
output;
end;
run;
*****COMBINE THE MODEL WITH RAW SURVIVAL PROPORTION;
data combined_bymixed_quad_plus_raw;
set combined_bymixed_quadratic(rename=(Esurvival=SURVIVAL)) pred_bylifetest(keep=source DOF_FIRST_TX SURVIVAL);
run;
*****COMPARE THE COMBINED WITH RAW SURVIVAL CURVES;
proc sgplot data=combined_bymixed_quad_plus_raw;
title 'Figure 3B';
step x=DOF_FIRST_TX y=survival/group=source;
yaxis grid values=(0 to 1 by 0.1);
run;
**********END**********;

**********#8 RECOVER THE MODELS FOR THE SPECIFIC-STUDY CURVES**********;
*****RECOVER THE LINEAR MODEL;
data recover_bymixed_linear;
set pred_bymixed_linear;
Esurvival=exp(-(exp(pred)));
Eucl=exp(-(exp(lower)));
Elcl=exp(-(exp(upper)));
run;
*****RECOVER THE QUADRATIC MODEL;
data recover_bymixed_quadratic;
set pred_bymixed_quadratic;
Esurvival=exp(-(exp(pred)));
Eucl=exp(-(exp(lower)));
Elcl=exp(-(exp(upper)));
run;
**********END**********;

**********#9 GENERATE STUDY-SPECIFIC SURVIVAL PROPORTIONS AND CONFIDENCE INTERVALS*********;
*****DRAW THE RAW SURVIVAL CURVES AND THE CORRESPONDING CONFIDENCE INTERVALS;
proc sgplot data=pred_bylifetest;
title 'Figure 4A';
series x=DOF_FIRST_TX y=survival/group=source;
band lower=SDF_LCL upper=SDF_UCL x=DOF_FIRST_TX /group=source transparency=0.5;
yaxis grid values=(0 to 1 by 0.1);
run;
*****DRAW THE SURVIVAL CURVES FROM THE LINEAR MODEL AND THE CORRESPONDING CONFIDENCE INTERVALS;
proc sgplot data=recover_bymixed_linear;
title 'Figure 4B';
series x=DOF_FIRST_TX y=Esurvival/group=source;
band lower=Elcl upper=Eucl x=DOF_FIRST_TX /group=source transparency=0.5;
yaxis grid values=(0 to 1 by 0.1);
run;
*****DRAW THE SURVIVAL CURVES FROM THE QUADRATIC MODEL AND THE CORRESPONDING CONFIDENCE INTERVALS;
proc sgplot data=recover_bymixed_quadratic;
title 'Figure 4C';
series x=DOF_FIRST_TX y=Esurvival/group=source;
band lower=Elcl upper=Eucl x=DOF_FIRST_TX /group=source transparency=0.5;
yaxis grid values=(0 to 1 by 0.1);
run;
**********END**********;

********************END OF THE PROGRAM********************;

Code II

/*******************************************************************
PROGRAM: COVARIATE MODEL
PURPOSE: CREATE ARENDS' MODEL FOR DATA SETS 1 & 2 BY COVARIATE METHOD
AUTHOR: -----------------
LAST REVISE DATE: JUNE 18, 2010
LAST TEST DATE: JUNE 18, 2010
********************************************************************/

libname source 'd:\animal_data'; /*set up the library for the input*/;
libname mylib 'd:\animal_data_output'; *set up the library for the output*/;

********************BEGINNING OF THE PROGRAM********************;

**********#1 USE PROC PHREG WITH TEMPERATURE COVARIATE**********;
**** INPUT VALUES FOR TEMPERATURE;
data mylib.vector;
do rect=38.1 to 41.7 by 0.1; /*generate a vector as covariate for proc phreg*/
id=cats('The temperature=', rect); /*add a label for each temperature level*/
output;
end;
run;
*****OUTPUT PREDICTIONS ;
proc phreg data=source.data_1_2;
model DOF_FIRST_TX * status(0)=RECT;
strata source;
baseline covariates=mylib.vector out=mylib.pred_byphreg logsurv=ls loglogs=lls survival=_all_ /alpha=0.05; /*generate the estimated survival proportions and the corresponding 95% CI*/
run;
***********END ***********;

**********#2 MANIPULATION OF PREDICTION DATA SET BY PHREG*********;
data mylib.pred_byphreg ;
set mylib.pred_byphreg ;
if DOF_FIRST_TX gt 0 then logday=log(DOF_FIRST_TX) ; /*add a log transformation for the time variable*/
nls=-ls; /*add a negative log transformation for the survival proportions*/
label DOF_FIRST_TX='Days to the first treatment' /*add the labels to the variables*/
RECT='Rectal temperature'
logday='log(Days to the first treatment)'
nls='-log(Survival probability function)';
rect=trim(left(rect)); /*align all temperatures to the same format*/
run;
***********END***********;

**********#3 USE GRAPH TEMPLATE LANGUAGE TO GENERATE THE RESPONSE SURFACE FOR THE RAW SURVIVAL PROPORTIONS**********;
***** MAKE A 3D MODEL FOR SURVIVAL PROPORTION BY PROC TEMPLATE;
proc template;
define statgraph survival_surface_3d; /*the template will be saved under sasuser.templat*/
dynamic var1 var2 var3 ; /*make the three dynamic variables for proc sgrender*/
begingraph;
layout overlay3d / cube=false;
surfaceplotparm x=var1 y=var2 z=var3 /
surfacetype=fill
surfacecolorgradient=var3
colormodel=twocolorramp
reversecolormodel=true ; /*the gradient color of surface will be by the survival proportion*/;
endlayout;
endgraph;
end;
run;
*****RENDER THE SURVIVAL PROPORTION FOR DATA SET 1 BY PROC TEMPLATE;
proc sgrender data=mylib.pred_byphreg template=survival_surface_3d;
title 'Figure 5A';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="survival" ;
where source='A';
run;
*****RENDER THE SURVIVAL PROPORTION FOR DATA SET 2 BY PROC TEMPLATE;
proc sgrender data=mylib.pred_byphreg template=surv3d;
title 'Figure 5C';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="survival";
where source='B';
run;
***** MAKE A 3D MODEL FOR CONFIDENCE INTERVAL BY PROC TEMPLATE;
proc template;
define statgraph ci_surface_3d; /*the template will be saved under sasuser.templat*/
begingraph;
dynamic var1 var2 var3 var4 ; /*make the four dynamic variables for proc sgrender*/
layout overlay3d / cube=false rotate=60;
surfaceplotparm x=var1 y=var2 z=var3; /*make the reponse surface for the upper confidence interval*/
surfaceplotparm x=var1 y=var2 z=var4; /*make the reponse surface for the lower confidence interval*/
endlayout;
endgraph;
end;
run;
***** RENDER THE CONFIDENCE INTERVAL FOR DATA SET 1 BY PROC TEMPLATE;
proc sgrender data=mylib.pred_byphreg template=ci_surface_3d;
title 'Figure 5B';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="UpperSurvival" var4="LowerSurvival";
where source='A';
run;
***** RENDER THE CONFIDENCE INTERVAL FOR DATA SET 2 BY PROC TEMPLATE;
proc sgrender data=mylib.pred_byphreg template=ci_surface_3d;
title 'Figure 5D';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="UpperSurvival" var4="LowerSurvival";
where source='B';
run;
***** MAKE THE CROSS SECTIONS OF SURVIVAL CURVES BY PROC SGPANEL;
proc sgpanel data=mylib.pred_byphreg(where=(rect in (38.5, 39,39.5, 40, 40.5,41, 41.5 ))); /*choose 7 temperatures to output*/
title 'Figure 5E';
panelby rect source/ layout=lattice novarname columns=7; /*decide the 2X7 configuration of the panel*/
band x=DOF_FIRST_TX lower=LowerSurvival upper=UpperSurvival; /*draw the confidence intervals*/
series x=DOF_FIRST_TX y=survival; /*draw the survival curves*/
run;
**********END**********;

**********#4 MODEL SELECTION BY PROC GLMSELECT**********;
ods graphics on;
proc glmselect data=mylib.pred_byphreg plot=CriterionPanel;
title 'Figure 6 ';
class source;
model lls=logday rect logday*logday rect*rect rect*logday/selection = stepwise(select=SL) stats=all;
run;
ods graphics off;
***********END**********;

**********#5 MODEL FITTING BY PROC MIXED**********;
proc mixed data=mylib.pred_byphreg;
title 'Table 3';
class source;
model lls=logday rect logday*logday/solution outp=mylib.pred_bymixed; /*use the selection result by proc glmselect*/
random source/s;
run;
**********END**********;

**********#6 GENERATE THE RESPONSE SURFACE FOR THE COMBINED SURVIVAL PROPORTIONS**********;
*****CONSTRUCT THE COMBINED EQUATION;
data mylib.combined_bymixed;
do DOF_FIRST_TX=1 to 42 by 1;
do rect=38.1 to 41.7 by 0.1;
esuv=exp(-exp(-32.7610+1.5148*log(DOF_FIRST_TX)-0.2024*log(DOF_FIRST_TX)*log(DOF_FIRST_TX)+0.7610*rect)); /*use the coeffiicients of fixed effects result by proc mixed*/
rect=trim(left(rect));
output;
end;
end;
label esuv='Combined survival proportion';
run;
***** RENDER THE COMBINED SURVIVAL PROPORTION ;
proc sgrender data=mylib.combined_bymixed template=survival_surface_3d; /*use the previous 3D model*/
title 'Figure 7A';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="Esuv" ;
run;
***** MAKE THE CROSS SECTIONS OF SURVIVAL CURVES BY PROC SGPANEL;
proc sgpanel data=mylib.combined_bymixed (where=(rect in (38.5, 39,39.5, 40, 40.5,41, 41.5 )));
title 'Figure 7B';
panelby rect / novarname columns=7;
series x=DOF_FIRST_TX y=esuv;
run;
**********END **********;

***********#7 GENERATE THE RESPONSE SURFACE FOR THE STUDY-SPECIFIC SURVIVAL PROPORTIONS**********;
data mylib.specific_bymixed(keep=DOF_FIRST_TX rect Esurvival source Eucl Elcl);
set mylib.pred_bymixed;;
Esurvival=exp(-(exp(pred)));
Eucl=exp(-(exp(lower)));
Elcl=exp(-(exp(upper)));
rect=trim(left(rect));
run;
*****RENDER THE SURVIVAL PROPORTION FOR DATA SET 1 ;
proc sgrender data=mylib.specific_bymixed template=survival_surface_3d;
title 'Figure 8A';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="esurvival" ;
where source='A';
run;
*****RENDER THE SURVIVAL PROPORTION FOR DATA SET 2;
proc sgrender data=mylib.specific_bymixed template=survival_surface_3d;
title 'Figure 8C';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="esurvival" ;
where source='B';
run;
*****RENDER THE CONFIDENCE INTERVAL FOR DATA SET 1 ;
proc sgrender data=mylib.specific_bymixed template=ci_surface_3d;
title 'Figure 8B';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="Eucl" var4="Elcl";
where source='A';
run;
*****RENDER THE CONFIDENCE INTERVAL FOR DATA SET 2 ;
proc sgrender data=mylib.specific_bymixed template=ci_surface_3d;
title 'Figure 8D';
dynamic var1='rect' var2="DOF_FIRST_TX" var3="Eucl" var4="Elcl";
where source='B';
run;
***** MAKE THE CROSS SECTIONS OF SURVIVAL CURVES BY PROC SGPANEL;
proc sgpanel data=mylib.specific_bymixed(where=(rect in (38.5, 39,39.5, 40, 40.5,41, 41.5 ))); /*choose 7 temperatures to output*/
title 'Figure 8E';
panelby rect source/ layout=lattice novarname columns=7; /*decide the 2X7 configuration of the panel*/
band x=DOF_FIRST_TX lower=Elcl upper=Eucl; /*draw the confidence intervals*/
series x=DOF_FIRST_TX y=Esurvival; /*draw the survival curves*/
run;
**********END*********;

********************END OF THE PROGRAM********************;
9月 222010
 


Business understanding: A Healthcare company would like to know how the health status and personal information of a patient would influence his/her total health care expense.
Data understanding: The target variable is the total health cost (interval). Besides it, there are 40 input variables, such as sex, age, income, marital status, disease indicator, etc. A multiple linear regression is needed to predict the total health cost according the patients’ overall situation.
Data preparation: The data has already been cleaned. Since the target variable is highly skewed, a log transformation is applied. In each variable, any missing or inapplicable value is indicated by ‘-1’. Some variables, such as education level, are aggregated into ordinal variable.
Modeling: Proc Glmselect is used to construct the linear models. The raw data is divided into two parts. A stepwise method is used to select significant parameters.
Model evaluation: The fit statistics on the validation dataset is used to evaluate the performance of the model. The final model is eventually determined by validation average standard error (ASE).
Deployment: The GLM formula can be outputted and implemented by Proc Score in the future.

Acknowledgment: Dr. Goutam Chakraborty. Department of Marketing. Oklahoma State University.

ods html;
ods graphics on;
proc select data=USHETH seed=93093 plots(stepaxis=number)=all;
partition fraction(validate=0.5);
class sex census_region marital_status years_educ highest_degree served_armed_forces foodstamps_purchase more_than_one_job wears_eyeglasses person_blind wear_hearing_aid is_deaf person_weight numb_visits dental_checkup cholest_lst_chck last_checkup last_flushot lost_all_teeth last_psa last_pap_smear last_breast_exam last_mammogram bld_stool_tst sigmoidoscopy_colonoscopy wear_seat_belt high_blood_pressure_diag heart_disease_diag angina_diagnosis heart_attack other_heart_disease stroke_diagnosis emphysema_diagnosis joint_pain currently_smoke asthma_diagnosis diabetes_diag_binary
;
model log_totalexp=sex census_region age marital_status years_educ highest_degree served_armed_forces foodstamps_purchase total_income more_than_one_job wears_eyeglasses person_blind wear_hearing_aid is_deaf person_weight numb_visits dental_checkup cholest_lst_chck last_checkup last_flushot lost_all_teeth last_psa last_pap_smear last_breast_exam last_mammogram bld_stool_tst sigmoidoscopy_colonoscopy wear_seat_belt high_blood_pressure_diag heart_disease_diag angina_diagnosis heart_attack other_heart_disease stroke_diagnosis emphysema_diagnosis joint_pain currently_smoke asthma_diagnosis child_bmi adult_bmi diabetes_diag_binary
/selecion=stepwise(stop=validate select=sl)
;
output out=outdata;
run;
ods graphics off;
ods html close;
8月 232010
 

Situation: For a telecommunication company, there are a training dataset of 18,000 customers and a scoring dataset of 2,000 customers.
Task:Find potential 3G users from the existent 2G users to increase ARPU and MARPU
Action: Trained models by decision tree, neural network and logistic regression on SAS EM 5.2.
Result: Proposed tailed device and service, promotion channel, and branding image strategy for segments; Formed an ensemble model with misclassification rate <.04 and Impremented the model.

/*VERY BEGINNING: DATA TRANSFER FOR MODELING BY SAS ENTERPRISE MINER*/
options noxwait noxsync;
dm 'x "cd D:\";';
dm 'x " md mylib" ';
dm 'x "xcopy d:\matchresult\*.*/D/E/S/R/Y/A d:\mylib " ';