March 26, 2010
 
Original paper: http://analytics.ncsu.edu/sesug/2009/SC013.Bardenheier.pdf

When choosing the better of two nested mixed models, the standard recommendation is to use the Likelihood Ratio Test (LRT). PROC MIXED has no syntax for running this test directly, but every quantity the test needs can be found in the PROC MIXED output. A user who knows how it works can pull the relevant numbers from the output, do the arithmetic, and check a table to get the answer. I once wrote a simple macro to run the LRT while preparing for my PhD qualifying exam, and it turns out someone has actually published the idea. I skimmed the macro in this technical paper; it is much the same as what I wrote back then, but the author formats the output report more nicely, so here is an introduction to this handy LRT macro.


Call the two nested mixed models the reduced model and the full model; the reduced model is usually placed under the null hypothesis H0 and the full model under the alternative H1. The LRT takes two numbers from each model: its degrees of freedom and its -2 log likelihood value. Both can be found in the PROC MIXED output. As an illustration, take an example from the PROC MIXED online manual; see this page:

http://www.tau.ac.il/cc/pages/docs/sas8/stat/chap41/sect30.htm

Both numbers can be found in that output (the original post marks them with red arrows in a screenshot):



As for the formula, quoting Wikipedia:

\begin{align}
D & = -2(\ln(\text{likelihood for null model}) - \ln(\text{likelihood for alternative model})) \\
  & = -2\ln\left( \frac{\text{likelihood for null model}}{\text{likelihood for alternative model}} \right).
\end{align}

In plain terms, subtract the full model's -2 log likelihood from the reduced model's -2 log likelihood to get D above. Then refer D to a chi-square distribution whose degrees of freedom equal the difference in the two models' numbers of parameters, and you get the p-value. If the p-value < 0.05, the H0 represented by the reduced model is rejected, i.e., the full model is better than the reduced model; otherwise there is no evidence that the full model improves on the reduced model.
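For instance, here is a minimal DATA _NULL_ sketch of the hand calculation (the -2 log likelihood values and the one-parameter difference below are made up purely for illustration):

data _null_;
m2ll_reduced = 150.2;  * hypothetical -2 log likelihood of the reduced model;
m2ll_full    = 144.8;  * hypothetical -2 log likelihood of the full model;
df           = 1;      * hypothetical difference in number of parameters;
D      = m2ll_reduced - m2ll_full;
pvalue = 1 - probchi(D, df);   * upper tail of the chi-square distribution;
put D= pvalue=;
run;

PROBCHI returns the chi-square CDF, so 1 - PROBCHI(D, df) gives the upper-tail p-value.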

This calculation hardly needs a computer at all, but lazy people can always find an excuse, so the author still spent nearly two pages on the following macro to serve the public.

%macro LRMixtureTest(fullmodel=,redmodel=,DFfull=,DFred=); 
/* 
This macro calculates the Likelihood Ratio Test for fixed effects in a 
mixed model 
See Singer Applied Longitudinal Data Analysis pp 116-120. 
and also calculates the mixture method pvalue for random effects in a 
mixed model -see KKNM chapter 26. 
The macro argument &fullmodel refers to the output dataset produced by 
running the full model in proc mixed and using the statement: 
    ods output FitStatistics=fm; 
The macro argument &DFfull refers to the output dataset produced by 
running the full model in proc mixed and adding to the statement: 
    ods output FitStatistics=fm SolutionF=SFfm; The number of 
parameters being tested (ie, fixed effects) determine the degrees of 
freedom in the LRT and come from this dataset. 
The argument &redmodel refers to the output dataset produced by running 
the reduced model in proc mixed using the statement: 
   ods output FitStatistics=rm  SolutionF=SFrm ; 
Maximum Likelihood should be used when comparing fixed effects because 
it maximizes the likelihood of the sample data. Using the Restricted 
Maximum likelihood should only be used when assessing a random term 
(ie, the fixed effects should be the same in both models)  
because in RML we maximize the likelihood of the residuals. The degrees 
of freedom would not be correct if the number of fixed effects were 
different in each model when using RML -which is the default in PROC 
MIXED. 
*/ 
data &fullmodel; set &fullmodel; 
 if descr='-2 Res Log Likelihood' then chisqfullRML=value;  *RML; 
 else if descr='-2 Log Likelihood' then chisqfullFML=value; *FML; 
run; 
data &redmodel; set &redmodel; 
  if descr='-2 Res Log Likelihood' then chisqredRML=value; *RML; 
  else  if descr='-2 Log Likelihood' then chisqredFML=value; *FML; 
run; 
proc sql; 
CREATE TABLE flDF AS SELECT  effect, count(*) AS fullDF from 
work.&DFfull; quit; *Count number of effects to get degrees of freedom; 

proc sql; 
create table rdDF as select  effect, count(*) as redDF from 
work.&DFred; quit; 

data degfree (drop=effect); 
  merge  work.flDF work.rdDF; 
if _n_=1 then  LRTDF= fullDF  - redDF; run; 

data likelihood; 
 merge &fullmodel &redmodel degfree; 
 testintRML=abs((chisqredRML)-(chisqfullRML)); **Models can yield 
negative LLs, those that are smaller in absolute value -ie, closer to 0 
fit better (pg 116-117, Singer);  


testintFML=abs((chisqredFML)-(chisqfullFML)); **Models can yield 
negative LLs, those that are smaller in absolute value -ie, closer to 0 
fit better (pg 116-117, Singer); 
 pvaluemixture=(( .5*(1-probchi(testintRML,2)) + .5*(1-
probchi(testintRML,1)) ));    *for random terms; 
  pvalueLRT=(1-probchi(testintFML,LRTDF));      *For fixed terms; 
  
proc print data=likelihood split='*' noobs; 
   var  testintFML  pvalueLRT; 
 format  pvalueLRT 6.4; 
 where testintFML ne .; 
   label testintFML='likelihood ratio*test statistic* 
         comparing*reduced model to full model' 
                   pvalueLRT='p-value LRT'; 
      title "Likelihood Ratio Test for fixed effects"; run; 
  
proc print data=likelihood split='*' noobs; 
   var  testintRML  pvaluemixture; 
 format pvaluemixture  6.4; 
 where  testintRML ne . ; 
   label testintRML='likelihood ratio*test statistic* 
         comparing*reduced model to full model' 
                   pvaluemixture='mixture method p-value'; 
      title "Mixture method Test for random effects"; 
 run;  
%mend LRMixtureTest; 


In this macro, the comment block is entirely the author's notes and documentation; only the remaining statements are code that SAS actually executes. There are just four parameters to explain:

  • fullmodel: the FitStatistics dataset output by ODS from the full model (it contains the full model's -2 log likelihood)
  • redmodel: the FitStatistics dataset output by ODS from the reduced model
  • DFfull: the SolutionF dataset from the full model, used to count its fixed effects (its degrees of freedom)
  • DFred: the SolutionF dataset from the reduced model

The four parameters are set exactly as described in the previous section. All of them are required; none can be omitted. So just call %LRMixtureTest with the four values and the computer will do the LRT for you. Let's look at an example from the paper.

Example model (the model equation appears as a figure in the original paper):

This model uses a single covariate, PCTBLK, which enters as both a fixed effect and a random effect. Take this as the full model; the reduced model is then defined by dropping the PCTBLK fixed effect, i.e., removing PCTBLK from the MODEL statement (the part underlined in red in the paper's figure).

So first run the two models to produce the four datasets the macro needs:

%include '\\cdc\private\mixture method pvalue macro1.sas';
proc mixed data=bhb7.allUS2levge10blk method=ml covtest empirical
noclprint ;
class  state_id;
model diffcov=pctblk/ ddfm=contain s;
random intercept pctblk /subject=state_id type=ar(1) s ;
ods output FitStatistics=fm  SolutionF=SFfm ;
run;

proc mixed data=bhb7.allUS2levge10blk method=ml covtest empirical
noclprint;
class  state_id;
model diffcov=/ddfm=contain   s;
random intercept  pctblk /subject=state_id type=ar(1) s;
ods output FitStatistics=rm SolutionF=SFrm ;
run;


After they run, each model's -2 log likelihood and fixed-effect solutions are written by ODS OUTPUT into separate datasets; the macro then pulls the numbers out of these four datasets and does the computation:

%LRMixtureTest(fullmodel=fm,redmodel=rm,DFfull=SFfm,DFred=SFrm);

Finally, the macro prints the LRT report (reproduced as a screenshot in the original post):


The result shows p-value < 0.05, so the full model is better than the reduced model.

One point from this example deserves emphasis: when the two nested mixed models differ only in their fixed effects, the METHOD= option in both PROC MIXED runs must be set to ML. If it is not set explicitly, PROC MIXED defaults to REML, which is only appropriate when the two models differ in their random effects or covariance structure. The first example in the paper does estimate with REML, because its two models differ only by a single random intercept.
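For that random-effects case, here is a minimal sketch under the default REML, reusing the dataset and variable names from the example above purely for illustration; the MODEL statements (the fixed effects) are identical and only the RANDOM statement changes:

proc mixed data=bhb7.allUS2levge10blk covtest empirical noclprint; /* METHOD=REML by default */
class state_id;
model diffcov=pctblk/ ddfm=contain s;
random intercept pctblk /subject=state_id type=ar(1) s;
ods output FitStatistics=fm SolutionF=SFfm;
run;

proc mixed data=bhb7.allUS2levge10blk covtest empirical noclprint;
class state_id;
model diffcov=pctblk/ ddfm=contain s;              /* same fixed effects as the full model */
random intercept /subject=state_id type=ar(1) s;   /* random PCTBLK slope dropped */
ods output FitStatistics=rm SolutionF=SFrm;
run;

%LRMixtureTest(fullmodel=fm,redmodel=rm,DFfull=SFfm,DFred=SFrm);

In this situation the macro's second report, the mixture-method p-value, is the one to read, since testing a variance component sits on the boundary of the parameter space.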


Contact Information: 
Barbara Bardenheier, MPH, MA
Centers for Disease Control and Prevention
1600 Clifton Rd, MS E-52
Atlanta, GA 30333
Tel : (404) 639-8789
Fax : (404) 639-8614
Email : bfb7@cdc.gov
March 26, 2010
 
Hello! And welcome to the grand opening of Get, Grow, Keep, our spanking new group blog focused on the remarkably old business problems every marketer faces: namely, how to get, grow, and keep your best customers.

The Timeline of innovation on Wikipedia’s History of marketing entry makes it clear – the accelerating pace of technological innovation is a double-edged sword. Increasingly, we have new ways to glean insights about our customers, new ways to interact with our customers, and new ways to measure and improve our performance. But all these “new ways” also mean we marketers have other “new’s” not explicitly listed on the timeline: new tools to use, new strategies to devise, and perhaps most importantly, new (customer) expectations to meet.

And here we are. As a marketing organization, we struggle with the same challenges our customers and prospects face.

We are starting to ask new questions like:

  • How can we evolve as an organization and develop better relationships with today’s highly empowered customers?

And we are continuing to ask fundamental questions like:

  • What is the best way to create a single view of our customers?
  • How can we uncover insights about our customers to drive our marketing decisions?
  • What are the best solutions and strategies for executing integrated, cross-channel communications?
  • How can we best develop and view metrics to measure and improve our marketing efforts?

Answer these questions correctly and you can acquire new customers, increase the value of current customers, and retain profitable customers longer—leading to a growing and profitable customer base. Answer incorrectly and you risk high attrition rates and low acquisition rates, and more to the point, you may end up acquiring and keeping unprofitable customers too.

We offer this new blog as a platform to talk with you, not only about the challenges of getting, growing, and keeping customers, but also about current customer intelligence/CRM industry news, events, ideas, and trends. From time to time we’ll invite industry leaders, analysts, professors and customers to provide new perspectives through guest blog posts. Not surprisingly, we are determined to have a lot of fun along the way as well.

What does success look like for this blog? Simply put, we’ll be measuring ourselves against our ability to encourage conversations, lower the usual customer/corporate barriers, and give you access to our key contributors.

Fortunately we have a fantastic group of core contributors, consisting of experienced and new bloggers, covering a range of marketing roles. Our team includes:

  • Justin Huntsman (that’s me), and John Balla. We’re both field marketers focused on creating opportunities for our prospects and customers to learn more about our company and solutions.
  • John Bastone is our global SAS Customer Intelligence product marketing manager. He's a direct marketing enthusiast with a current focus on “all things web and social.”
  • Mark Chaves is the director of media intelligence solutions for our Customer Intelligence business unit. His current focus is to guide the solution strategy for our marketing mix analysis and social marketing solutions.
  • David B. Thomas is our social media manager. He blogs regularly at Conversations & Connections, the SAS social media blog.
  • Matt Fulk manages our database marketing team and is leading a project focusing on buy-cycle marketing and list development optimization using SAS Marketing Automation.
  • Alison Bolen is the editor of blogs and social content at SAS. You’ll find her regularly at the sascom voices blog.
  • Deb Orton is our director of field marketing, responsible for leading a group of talented marketers "across the chasm" from traditional marketing approaches to social, digital methods.

Our names above are linked to our LinkedIn profiles. We’d each be delighted to connect with you and to hear from you via comments to this blog, or our new group Twitter account @SAS_CI.

This blog is going to be great fun. Welcome!
March 25, 2010
 



FILENAME test URL "http://www.indeed.com/jobs?q=sas+programmer&limit=100&fromage=0&start=0" DEBUG LRECL=700;
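* Purpose: read one page of Indeed search results for the sas programmer query, ;
* keep only the job-posting lines, parse out source, company, location, and     ;
* title fields, and print a sorted report to RTF.                               ;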

DATA test;
infile test length=len;
input record $varying700. len;
*****DELETE MOST LINES W/O JOB;
if index(record, 'cmpid')=0 then delete;
*****DELETE HEADER ADVERTISEMENT;
if index(record, 'jobmap')=0 then delete;

run;
data test2 ;
set test;
format zip z5.;
length state $2.;
*****SET UP ROAD SIGNS;
id1=index(record, 'srcname');
id2=index(record, 'cmpesc');
id3=index(record, 'cmplnk');
id4=index(record, 'loc');
id5=index(record, 'lat');
id6=index(record, 'lon');
id7=index(record, 'country');
id8=index(record, 'zip');
id9=index(record, 'state');
id10=index(record, 'city');
id11=index(record, 'title');
id12=index(record, 'locid');
*****OUTPUT VARIABLES;
source=substr(record, id1+9, id2-id1-11);
company=substr(record, id2+8, id3-id2-10); if company= 'na' or substr(company,1,1)="'" then company='N/A';
loc=substr(record, id4+5, id5-id4-7);
country=substr(record, id7+9, id8-id7-11);
if id8+5=id9-id8-7 then zip=.; else zip=substr(record, id8+5, id9-id8-7);
state=substr(record, id9+7, id10-id9-9); if state= 'na' or substr(state,1,1)="'" then state=' ';
city=substr(record, id10+6, id11-id10-8); if city= 'na' or substr(city,1,1)="'" then city=' ';
title=substr(record, id11+7, id12-id11-9);

drop record id1-id12;
index=_n_;
run;

ods rtf file='d:\sas.rtf';
proc sort data=test2 out=test3;
by state city;
run;
proc print data=test3;
var index title company state city source;
title "SAS programmer opens on &sysday, &sysdate ";
title2 "Collected by &SYSTIME ";
footnote "Created by Charlie Huang";
run;
ods rtf close;

March 25, 2010
 


Sometimes it is very handy to have a macro variable containing the variable names of a dataset. Here are two different ways you can create a macro variable with the list of variable names.

*Method1: Using Proc Contents and Proc SQL;
proc contents data=sashelp.class out=class;
run;
proc sql noprint;
select distinct(name) into :vars separated by " " from class;
quit;
%put &vars;

*Method2: Using SASHELP tables and Proc SQL;
data class;
set sashelp.vcolumn(where=(libname="SASHELP" and memname="CLASS"));
keep name;
run;
proc sql noprint;
select distinct(name) into :vars separated by " " from class;
quit;
%put &vars;
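One way the resulting list might then be used (a quick sketch):

proc print data=sashelp.class;
var &vars;   /* &vars expands to the space-separated list of variable names */
run;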

March 23, 2010
 
Every person in my department at SAS (User and Customer Marketing) is involved in supporting SAS Global Forum in one way or another. One of the major roles is to coordinate the SAS Demo Area. I recently had a chat with my co-worker Katie Strange, who is responsible for teaming with many others at SAS to make the Demo Area a reality.

1. When did you start planning the SAS Demo Area?
From mid-October to April is when the planning is most intense. However, like many events, demo area planning is an ongoing process and overlaps with both the current and future shows.

2. How long does it take to build the SAS Demo Area, once you get on-site?
It takes an abundance of people with an incredible amount of talent to take the demo area from concept to reality. Our core team is a group of exceptional folks in Art and Scenic Operations, IT, Materials Management, Corporate Creative, Marketing and R&D. The demo area is a magnificent example of how SAS employees work in a cross-functional environment for the betterment of SAS and our customers.

But to answer your question - we start with a blank canvas at 6 am on Thursday morning, April 8, and are slated for the demo area to be set by 1 pm on Sunday, April 11. We open for attendees at 10 am on Monday, April 12.

3. What have you learned over the past few years that might surprise people?
I have learned that our show floor is a living, breathing and ever-evolving creature. Although the demo area takes many shapes and forms over the years, the one constant is the fact that this is an area that brings together a vast array of people because of their passion for knowledge and SAS. To me, that’s a beautiful thing.

4. Who can we find in the SAS Demo Area?
You will always be able to find SAS experts, such as SAS developers who are available and eager to help in any way they can. And at the Monday Night SAS Mixer you can find a few surprises, such as a lipsologist, a masseuse and much more! Don’t miss it!

5. What’s always been popular, and what’s new this year?
Perennial favorites are the Demo Alley, where customers can have one-on-one time with SAS developers, the bookstore in the Publications area, the SAS Education Center and the SGF postcards booth.

There are also several exciting, new things this year such as the Innovations Wall and the Twitter Wall. The Innovations Wall is a place where attendees can view some of SAS’ innovations and contribute their own stories about how they have used SAS in innovative ways. The Twitter Wall will be displayed above the entrance to the Demo Area and will stream live tweets about the conference. There will also be a social networking booth near the Twitter Wall that is a resource for individuals who want to make the transition into the social networking realm but haven’t quite taken that leap yet.

March 22, 2010
 
Contributed by Memsy Price, Manager, Online Customer Support, SAS

Spring has sprung in North Carolina and here on the support.sas.com team our thoughts have turned towards sunny afternoons, daffodils, and — oh my goodness! — all we have to do before SAS Global Forum.

The 2010 SAS Global Forum, in Seattle, will be only my second, which makes me whatever ranks below novice in SAS years. What I lack in years of attendance, though, I’m making up for in enthusiasm. My first SAS Global Forum was a terrific experience — I got to meet and talk with so many users, and to hear directly from you what you like and, almost more importantly, what you don’t like about support.sas.com.

I hope this year will bring more of the same. Three of us from the support site team — Renee Harper, Paul Tidball, and myself — will be in Seattle working a station in the demo room. We’ll be showing the new SASware Ballot application, the recently unveiled Advanced Search feature, updates to the product pages, some tips and tricks for navigating and searching the site, and some pointers about how to get the most out of the discussion forums.

Mostly, though, we hope you stop by to share your ideas with us. We talk to you every day in the forums, on e-mail, and through this blog, but we welcome the chance to talk face-to-face.

In addition to our spot in the demo room, the support.sas.com extended team has a couple of other activities planned. We hope that you can find time to add these to your conference agenda.
March 22, 2010
 
Original paper: http://analytics.ncsu.edu/sesug/2009/PO008.Welch.pdf

When running a chi-square analysis on a 2×2 table, we usually check whether the expected count in each cell is greater than five; if too many cells fall short of this, the result of Fisher's exact test should be reported in place of the chi-square result. This rule of thumb, however, does not carry over to 2×2×H tables. Mantel and Fleiss proposed a criterion in 1980 specifically for this situation, but it is not built into any existing SAS procedure, so Brandon Welch, Jane Eslinger, and Rob Woolson published a paper at SESUG 2009 providing a handy macro for it.
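As a reminder of the 2×2 case, here is a minimal PROC FREQ sketch (the dataset and variable names are placeholders) that prints the expected counts and Fisher's exact test alongside the chi-square:

proc freq data=mytrial;                /* placeholder one-row-per-cell dataset */
tables treat*resp / chisq expected;    /* chi-square tests plus expected cell counts */
exact fisher;                          /* Fisher's exact test */
weight counts;                         /* cell counts */
run;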

For the theory behind the Mantel-Fleiss criterion, see the original paper. Here we go straight to how the macro is used:
%MantelFleiss(In,Strat,Var1,Var2,Count)

The MantelFleiss macro has only five parameters, defined as follows:
  • In: the input dataset name
  • Strat: the stratification variable, which defines the H strata (i.e., the 2×2 partial tables)
  • Var1: the row variable
  • Var2: the column variable
  • Count: an optional weight variable (cell counts). If omitted, each observation is counted once.
The source code is as follows:
%macro MantelFleiss(In,Strat,Var1,Var2,Count);

%*-----------------------------------------------------------;
%* &In - input data set ;
%* &Strat - stratification variable (e.g., site, race, etc.);
%* &Var1 - row variable ;
%* &Var2 - column variable ;
%* &Count - optional weight variable ;
%*-----------------------------------------------------------;

%*reset output data sets;
PROC DATASETS nodetails nolist;
delete _counts _expect _mantel:;
RUN;
QUIT;

%*determine number of strata levels;
PROC SQL noprint;
select count(distinct &strat) into: stratflag from &In;
QUIT;
%put **Number of strata levels &stratflag;

%*get expected counts and totals in each partial table;
PROC FREQ data = &In;
tables &Strat*&Var1*&Var2 / expected outexpect out = _expect;
%if %nrbquote(&Count.) ne %then weight &Count.;;
ODS output CrossTabFreqs = _counts;
RUN;

%*get sum of expected cell counts in cell n11 in each partial table
(assign to macro var SUMEXP);
DATA _null_;
set _expect end = eof;
by &Strat;
retain sumexp 0;
if first.&Strat then do;
sumexp = sumexp + expected;
end;
if eof then call symput('sumexp',put(sumexp,8.4));
RUN;
%put Sum of expected values in cells nh11: sum(mh11) = %cmpres(&sumexp);

PROC SORT data = _counts;
by &Strat &Var1 &Var2;
RUN;


%*Subset to the needed totals output by PROC FREQ
number the &Strat, &Var1, and &Var2 for use below
0 - totals
1 - first &Strat/&Var1/&Var2
2 - second &Strat/&Var1/&Var2
...;

DATA _mantelf1;
%*only include marginal totals from ODS;
set _counts(where = (_type_ not in ('100' '111')));
by &Strat &Var1 &Var2;
retain &Strat._ 0 &Var1._ &Var2._;

%*reset for change in strata;
if first.&Strat then do;
&Var1._ = 0; &Var2._ = 0;
end;

%*get numeric version of strata variable;
if first.&Strat then &Strat._ = &Strat._ + 1;

%*define zero as total for any &Var1/&Var2;
if missing(&Var1) then &Var1._ = 0;
else &Var1._ + 1;

if missing(&Var2) then &Var2._ = 0;
else &Var2._ + 1;

keep &Strat._ &Var1._ &Var2._ &Strat. &Var1. &Var2. frequency;
RUN;

DATA _mantelf2;
set _mantelf1;
by &Strat._;

retain n11_ n1_1 n1_2;

if first.&Strat._ then do;
n11_ = 0;
n1_1 = 0;
n1_2 = 0;
end;

%*get each total;
if &Var1._ = 1 then n11_ = frequency; %*for row one in the hth strata;
if &Var2._ = 1 then n1_1 = frequency; %*for column one in the hth strata;
if &Var2._ = 2 then n1_2 = frequency; %*for column two in the hth strata;


%*define (nh11)L and (nh11)U for final computation;
max_ = max(0,n11_-n1_2);
min_ = min(n1_1,n11_);

if last.&Strat._;

RUN;

%*sum across min and max to get lower/upper bounds of criterion;
DATA _mantelf3(keep = mf_suml mf_sumu mflendpt mfrendpt mf_r);
set _mantelf2 end = eof;

retain mf_suml mf_sumu;

if _n_ = 1 then do;
mf_suml = 0; mf_sumu = 0;

end;

mf_suml = mf_suml + max_;
mf_sumu = mf_sumu + min_;

if eof then do;
mflendpt = &sumexp-mf_suml;
mfrendpt = mf_sumu-&sumexp;
mf_r = min(mflendpt,mfrendpt);

put "Sum of lower bound (sum[(nh11)L]) = " mf_suml;
put "Sum of upper bound (sum[(nh11)U]) = " mf_sumu;
put "Left end-point: sum[mh11] - sum[(nh11)L] = " mflendpt;
put "Right endpoint: sum[(nh11)U] - sum[mh11] = " mfrendpt;
put "Mantel-Fleiss R = min[" mflendpt "," mfrendpt "] = " mf_r;

if mf_r>5 then put "Mantel-Fleiss criterion satisfied (since R > 5)";
else put "Mantel-Fleiss criterion NOT satisfied (since R <= 5)";
output;
end;

label
mf_suml = "Summation of Maximum (sum[(nh11)L])"
mf_sumu = "Summation of Minimum (sum[(nh11)U])"
mflendpt = "Left End-Point: sum[mh11] - sum[(nh11)L]"
mfrendpt = "Right End-Point: sum[(nh11)U] - sum[mh11]"
mf_r = "R Value: min[sum[mh11] - sum[(nh11)L],sum[(nh11)U] - sum[mh11]]";
RUN;
%mend;


Example 1:
The data:
PROC FORMAT;
value trtf
1 = 'TreatmentA'
2 = 'TreatmentB'
;

value sitef
1 = 'Arkansas'
2 = 'Indiana'
3 = 'Illinois'
;

value respf
1 = 'Present'
2 = 'Absent'
;
RUN;

DATA test;
do site = 1 to 3;
do treat = 1 to 2;
do resp = 1 to 2;
input counts @@;
output;
end;
end;
end;
label site = "Site" treat = "Treatment" resp = "Response";
format treat trtf. site sitef. resp respf.;
cards;
12 5 7 3
1 2 5 2
0 9 5 6
;
RUN;


The data above form the following 2×2 tables (shown as a figure in the original paper):


Next, run the macro:
%MantelFleiss(test,site,treat,resp,counts) ;

One thing to note: the macro's results do not appear in the Output window but are written directly to the log, so after submitting this line you can read the results in the Log window, as shown below:
Sum of expected values in cells nh11: sum(mh11) = 20.5130
Sum of lower bound (sum[(nh11)L]) = 9
Sum of upper bound (sum[(nh11)U]) = 25
Left end-point: sum[mh11] - sum[(nh11)L] = 11.513
Right endpoint: sum[(nh11)U] - sum[mh11] = 4.487
Mantel-Fleiss R = min[11.513 ,4.487 ] = 4.487
Mantel-Fleiss criterion NOT satisfied (since R <= 5)


The authors kindly print a verdict sentence saying whether the data satisfy the Mantel-Fleiss criterion, so you do not really need to interpret the numbers above it; when writing a paper you may still want to report some of those numbers, even though they carry little interpretive meaning on their own.

This example does not satisfy the Mantel-Fleiss criterion, which means we should use an exact test (Fisher's) in place of the usual chi-square test.

Now let's look at another example that does satisfy the Mantel-Fleiss criterion.

Example 2:

Let's look directly at the tabular structure of the data (shown as a figure in the original paper).

All settings are the same as in Example 1; only the counts change. Running the macro gives:
Sum of expected values in cells nh11: sum(mh11) = 19.3630
Sum of lower bound (sum[(nh11)L]) = 9
Sum of upper bound (sum[(nh11)U]) = 30
Left end-point: sum[mh11] - sum[(nh11)L] = 10.363
Right endpoint: sum[(nh11)U] - sum[mh11] = 10.637
Mantel-Fleiss R = min[10.363 ,10.637 ] = 10.363
Mantel-Fleiss criterion satisfied (since R > 5)


The last line tells us these data do satisfy the Mantel-Fleiss criterion, so we can go ahead and use the chi-square test result.
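In that case the stratified (Mantel-Haenszel) chi-square itself can be requested directly, for example with the CMH option on the same example data (a minimal sketch):

proc freq data=test;
tables site*treat*resp / cmh;   /* Cochran-Mantel-Haenszel statistics across the strata */
weight counts;
run;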

CONTACT INFORMATION
Brandon Welch
Rho®, Inc.
Statistical Programming
6330 Quadrangle Dr., Ste. 500
Chapel Hill, NC 27517
Email: Brandon_Welch@rhoworld.com
Phone: 919-595-6339