June 19, 2009
 
Original paper: http://support.sas.com/resources/papers/proceedings09/191-2009.pdf

This paper uses real data on HIV-positive women to show, in a simple way, how to test for a mediator (mediation) and a moderator in SAS. For the definitions of mediator and moderator, see:

Mediator:http://davidakenny.net/cm/mediate.htm
Moderator:http://davidakenny.net/cm/moderation.htm

The data are the first-interview (baseline) records drawn from a longitudinal study, giving a cross-sectional sample of 280 women infected with HIV.

Baron & Kenny (1986) first published a method for testing a mediator. The procedure fits three regression models (Eq1, Eq2, Eq3) and checks four criteria (C1, C2, C3, C4). The three models are:
(Eq1) IV -> DV
(Eq2) IV -> M (Mediator)
(Eq3) IV + M -> DV

The first two criteria, C1 and C2, require that Eq1 and Eq2 both yield significant results; if so, a mediator may exist. Two further criteria apply to Eq3:
(C3) M must be significant in Eq3.
(C4) The estimated coefficient of the IV must drop to 0 in Eq3.

If all four criteria are met, M is a full mediator. If C4 is not met, M is a partial mediator.

Finally, a Sobel test is used to check whether the mediator significantly carries the IV -> DV effect, that is, whether the indirect effect through M is significant.
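The Sobel statistic is z = a*b / sqrt(b^2*sa^2 + a^2*sb^2), where a and b are the coefficients from Eq2 and Eq3 and sa, sb are their standard errors. A minimal SAS sketch follows; note that the paper reports a = 0.143 and b = -0.44 but not their standard errors, so the sa and sb values below are hypothetical placeholders:

```sas
/* Sobel test for the indirect effect a*b.
   a and b are taken from Eq2 and Eq3 in the text;
   sa and sb are HYPOTHETICAL standard errors. */
data sobel;
  a = 0.143;  sa = 0.048;  /* sa is an assumed value */
  b = -0.44;  sb = 0.19;   /* sb is an assumed value */
  z = (a*b) / sqrt(b**2 * sa**2 + a**2 * sb**2);
  p = 2 * (1 - probnorm(abs(z)));
run;

proc print data=sobel noobs;
  var z p;
run;
```

With real standard errors from the PROC REG output, |z| > 1.96 would indicate a significant indirect effect at the 0.05 level.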

So, in this HIV Women dataset, Available Social Support (tssqav) is defined as the IV, Reason for Missing Medication (treas) as the DV, and Spiritual Activity (tcopesa) as the mediator.

We can fit Eq1 through Eq3 with three consecutive PROC REG steps:
ods rtf;
ods listing close;
proc reg data=two;
model treas = tssqav / stb pcorr2 scorr2;
title ' Regression model / step1 y=x' ;
run;
proc reg data=two;
model tcopesa = tssqav / stb pcorr2 scorr2;
title ' Regression model / step2 m=x' ;
run;
proc reg data=two;
model treas = tssqav tcopesa / stb pcorr2 scorr2;
title ' Regression model / step3 y=m x' ;
run;
ods rtf close;
ods listing;
quit;
The first PROC REG gives β = -0.98 (p-value = 0.02), so C1 is met. The second PROC REG gives β = 0.143 (p-value = 0.003), so C2 is also met. Fitting Eq3 gives (β1, β2) = (-0.79, -0.44) with p-values of 0.055 and 0.02, respectively: M remains significant in Eq3, but the IV becomes non-significant. C3 is therefore met, but because β1 does not drop close to 0, C4 is not met, so Spiritual Activity can only be called a partial mediator, not a full mediator.

The paper does not go on to run a Sobel test, but Dr. Preacher (formerly of the UNC psychology department) and Dr. Hayes (Ohio State) published a paper on the Sobel test in 2004 that includes a complete Sobel test SAS macro.

Paper: http://www.comm.ohio-state.edu/ahayes/BRMIC2004.pdf
Macro: here
Syntax: %sobel(data=file, y=dv, x=iv, m=med, boot=z);

Here data names the dataset to use, y takes the DV variable name, x the IV variable name, and m the mediator variable name. boot specifies the number of bootstrap resamples, any number from 1000 to 1000000. If you do not want bootstrapping, set it to 0 and %sobel turns bootstrap resampling off.
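Applied to this paper's variables, the call might look like this (the dataset name two matches the PROC REG steps above; 5000 resamples is an arbitrary choice within the allowed range):

```sas
/* Hypothetical call of the Preacher & Hayes macro
   using this paper's dataset and variables */
%sobel(data=two, y=treas, x=tssqav, m=tcopesa, boot=5000);
```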

There are also plenty of Sobel test calculators online; a Google search will turn them up.

To test for a moderator, fit the regression model IV + M + IV*M -> DV. If the estimated coefficient of IV*M is significant, M is a moderator of the IV-DV relationship. The code:

ods rtf;
ods listing close;
proc reg data=two;
model treas = tssqav tcopesa sscopesa/ stb pcorr2 scorr2;
title ' Regression model / testing moderator effect' ;
run;
ods rtf close;
ods listing;
quit;
Here sscopesa is the interaction of tssqav and tcopesa. PROC REG does not accept syntax like tssqav*tcopesa to represent an interaction, so before running this step you must create the interaction term as a separate variable in a DATA step.
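A minimal DATA step to build that interaction term before the PROC REG step above (the dataset name two matches the rest of the post):

```sas
/* Create the interaction term PROC REG cannot form itself */
data two;
  set two;
  sscopesa = tssqav * tcopesa;
run;
```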

The estimated coefficient of IV*M comes out to β = 0.00175 with p-value = 0.5172, which is not significant, so Spiritual Activity is not a moderator.
 Posted at 4:57 AM
June 17, 2009
 
SAS Publishing wants to get closer to the people who read, write, or dream of writing a SAS Press book. You can find SAS Publishing products and contact information on support.sas.com in the Bookstore. SAS Publishing is reaching out to you in other locations so that we can get to know each other better. Join the conversation.

  • Become a fan of SAS Publishing on Facebook.
  • Follow @SASPublishing on Twitter.
  • Join Fans of SAS Books on LinkedIn.

Watch this space for more ways to connect with SAS and SAS Publishing.

Text Speak

June 17, 2009
 
I just posted a tweet to my @ManyaMayes Twitter account. In order to get my message across, in 140 characters or less, I had to shorten my text. This is a very common practise for mobile phone users who send text messages that look a lot like a foreign language. My Mum writes messages that are so clipped that I have trouble deciphering them! As a BlackBerry user, I send email messages but I rarely send SMS messages. I've spent many years making sure I write messages that are easy for audiences to understand. It's going to take me a while to get used to writing clipped text (writing in text speak) as part of my job. It goes against much of my professional training to write like this: u no wot u no & u don't no wot u don't

How does text mining handle this? One approach would be to specify synonyms for these clipped terms:

u = you
no = know
wot = what

But "no" and "know" are both valid dictionary entries, so this will immediately cause a follow on problem since surely not all occurrences of "no" should be replaced with "know". Deciding which occurrences of "no" should be replaced with "know" is aided by using additional context of the document. Boolean and linguistic rules can help with this.

It can be difficult to solve data quality problems like this and typically solutions are specific to both the data and the application. For example, the way you would replace R&R would depend on whether the data came from a forum for military personnel talking about upcoming "rest and relaxation" or whether it was a warranty report describing "repair and replace" for a defective part or other...
June 12, 2009
 
Original paper: http://support.sas.com/resources/papers/proceedings09/158-2009.pdf

I don't know how many people have SAS V9.2 yet, but since I do, I will start introducing some of the new version's features.

First, a demonstration of V9.2's newest graphics procedure, PROC SGPLOT. Older versions of SAS did provide plotting procedures, but they were scattered across different procedures, which was inconvenient, and the old complaint remained: the graphs they produced were of poor quality. ODS later improved this somewhat, and V9.2 now consolidates this plotting functionality into PROC SGPLOT, where SG stands for Statistical Graphics. Let's look at what the new procedure can do.

。HISTOGRAMS
In older versions, a histogram required PROC GCHART or the HISTOGRAM statement in PROC UNIVARIATE, plus a fair amount of extra syntax. In PROC SGPLOT, a HISTOGRAM statement followed by a variable name is all it takes to produce a polished histogram.

Example:

PROC SGPLOT DATA = Freestyle;
HISTOGRAM Time;
TITLE "Olympic Men's Swimming Freestyle 100";
RUN;
Result:

[Image: image006.jpg]

To overlay a probability density curve, just add a DENSITY statement naming the variable.

Example:

PROC SGPLOT DATA = Freestyle;
HISTOGRAM Time;
DENSITY Time;
TITLE "Olympic Men's Swimming Freestyle 100";
RUN;
Result:

[Image: image008.jpg]

。BAR CHARTS
For bar charts, the VBAR and HBAR statements from PROC GCHART carry over intact.

Example:

PROC SGPLOT DATA = Countries;
VBAR Region;
TITLE 'Olympic Countries by Region';
RUN;
Result:

[Image: image010.jpg]

To show the share of each group within a bar, just add the GROUP= option.

Example:

PROC SGPLOT DATA = Countries;
VBAR Region / GROUP = PopGroup;
TITLE 'Olympic Countries by Region and Population Group';
RUN;
Result:

[Image: image012.jpg]

If the quantity to accumulate is another variable rather than a frequency count, use the RESPONSE= option.

Example:

PROC SGPLOT DATA = Countries;
VBAR Region / RESPONSE = NumParticipants;
TITLE 'Olympic Participants by Region';
RUN;
Result:
[Image: image014.jpg]

。SERIES PLOTS
X-Y line plots are handled by the SERIES statement. Calling SERIES repeatedly overlays the curves automatically, so there is no longer any need for an OVERLAY option.

Example:

PROC SGPLOT DATA = Weather;
SERIES X = Month Y = BRain;
SERIES X = Month Y = VRain;
SERIES X = Month Y = LRain;
TITLE 'Average Monthly Rainfall in Olympic Cities';
RUN;
Result:

[Image: image016.jpg]

Next, let's look at fine-tuning a graph.

。XAXIS AND YAXIS STATEMENTS

These two statements modify the X-axis and Y-axis settings, much like the old AXISn statements. In the graph above, the X-axis represents months, yet the tick marks read 2.5, 5.0, 7.5, 10.0, 12.5, which makes no sense; adding TYPE=DISCRETE makes the ticks follow the actual data values. GRID draws a light gray reference line at each tick, and specifying GRID on both XAXIS and YAXIS produces a full grid background. LABEL renames the axis, and VALUES= lets you define the tick start point and spacing yourself.

Example:

PROC SGPLOT DATA = Weather;
SERIES X = Month Y = BRain;
SERIES X = Month Y = VRain;
SERIES X = Month Y = LRain;
XAXIS TYPE = DISCRETE GRID;
YAXIS LABEL = 'Rain in Inches' GRID VALUES = (0 TO 10 BY 1);
TITLE 'Average Monthly Rainfall in Olympic Cities';
RUN;
Result:

[Image: image018.jpg]

。PLOT STATEMENT OPTIONS
To adjust the plotted lines or the legend, add options to the SERIES statement. LEGENDLABEL= changes the label shown in the legend, MARKERS works like the old SYMBOL statement and marks each data point with a symbol, and LINEATTRS= controls line pattern and thickness.

Example:

PROC SGPLOT DATA = Weather;
SERIES X = Month Y = BRain / LEGENDLABEL = 'Beijing' MARKERS LINEATTRS = (THICKNESS = 2);
SERIES X = Month Y = VRain / LEGENDLABEL = 'Vancouver' MARKERS LINEATTRS = (THICKNESS = 2);
SERIES X = Month Y = LRain / LEGENDLABEL = 'London' MARKERS LINEATTRS = (THICKNESS = 2);
XAXIS TYPE = DISCRETE;
TITLE 'Average Monthly Rainfall in Olympic Cities';
RUN;
Result:

[Image: image020.jpg]

。REFLINE STATEMENT
To add reference lines to a graph, use the REFLINE statement. Reference lines can be fine-tuned as well: transparency is set with the TRANSPARENCY= option, and each line can be labeled with the LABEL= option.

Example:

PROC SGPLOT DATA = Weather;
SERIES X = Month Y = BRain;
SERIES X = Month Y = VRain;
SERIES X = Month Y = LRain;
XAXIS TYPE = DISCRETE;
REFLINE 2.03 4.78 1.94 / TRANSPARENCY = 0.5 LABEL = ('Beijing(Mean)' 'Vancouver(Mean)' 'London(Mean)');
TITLE 'Average Monthly Rainfall in Olympic Cities';
RUN;
Result:

[Image: image022.jpg]

。INSET STATEMENT
The INSET statement writes annotation text inside the graph; its position is set with the POSITION= option. To put a box around the text, simply add the BORDER option.

Example:

PROC SGPLOT DATA = Weather;
SERIES X = Month Y = BRain;
SERIES X = Month Y = VRain;
SERIES X = Month Y = LRain;
XAXIS TYPE = DISCRETE;
INSET 'Source Lonely Planet Guide'/ POSITION = TOPRIGHT BORDER;
TITLE 'Average Monthly Rainfall in Olympic Cities';
RUN;
Result:

[Image: image024.jpg]

。THE SGPANEL PROCEDURE
Drawing separate graphs for each group or subject in a dataset and arranging them in a single display used to be quite a chore. PROC SGPANEL now makes this easy. Suppose we first want a regression plot for the whole dataset; the REG statement of PROC SGPLOT handles that. The code:


PROC SGPLOT DATA=sg.countries;
REG X=NumParticipants Y=TotalMedals;
TITLE 'Number of Participants by Total Medals Won for Each Country';
RUN;
The graph:

[Image: image062.jpg]

If the dataset contains six regions, how do we draw a separate regression plot for each region and arrange them in a 2x3 panel? The code:

PROC SGPANEL DATA=sg.countries;
PANELBY Region;
REG X=NumParticipants Y=TotalMedals;
TITLE 'Number of Participants by Total Medals Won for Each Country';
RUN;
First call PROC SGPANEL, then use the PANELBY statement to declare Region as the classification variable, much like a CLASS statement. The rest of the code is the same as the PROC SGPLOT step above. The result:

[Image: image064.jpg]

The paper includes several tables showing which PROC SGPLOT statement draws each kind of graph, along with a selection of the more important options. [Tables omitted here; see the original PDF.]

The descriptions in them are simple, so you can download the original file and read them yourself. These statements are only a tiny fraction of PROC SGPLOT, though mastering them should cover most needs. For the complete syntax, see the SAS website.

URL: http://support.sas.com/documentation/cdl/en/grstatproc/61948/HTML/default/sgplot-stmt.htm

ABOUT THE AUTHORS
Lora Delwiche and Susan Slaughter are the authors of The Little SAS Book: A Primer, and The Little SAS Book for
Enterprise Guide which are published by SAS Institute. The authors may be contacted at:
Lora D. Delwiche
(530) 752-9321
llddelwiche@ucdavis.edu

Susan J. Slaughter
(530)756-8434
susan@avocetsolutions.com
 Posted at 3:23 AM
June 12, 2009
 
I saw the following comment on Twitter yesterday about sentiment analysis limitations and decided it would make a good topic for a blog update:

@concannon: Can anybody explain to me why automated sentiment analysis is anything more than flaky, snake-oil BS? The technology just isn't ready yet.

I'm going to make a bold statement here: automated sentiment analysis, using the right methodology, is actually superior to human sentiment analysis. Bear with me and read through.

The available approaches to analyzing sentiment/satisfaction vary based on the data provided. I would categorize the approaches based on the availability of three types of data:
1. Customer feedback (free-form text) with customer ranked satisfaction (discrete value), like Amazon product reviews.
2. Customer feedback (free-form text) with manually ranked satisfaction (discrete value), where human readers subjectively score the content.
3. Customer feedback only, no ranked satisfaction, as with blog posts and comments

For the first data type, machine learning algorithms do a good job of measuring overall sentiment (say, +ve/neutral/-ve). Examples of data suitable for this approach are: survey data and product review forums. The problem is that not a lot of text is gathered this way (with a purpose in mind). Even if it is, the machine learning algorithms struggle with distinguishing positive elements from negative. It's one thing to know if a customer is dissatisfied, it is another to know about what!

Given no customer ranked satisfaction, it is possible to build a statistical model using a sample of manually ranked documents, then automatically score the remaining unranked documents. Not many companies are willing to do this. It also doesn't truly represent the customer’s opinion - just the reader’s interpretation of what the customer thinks.

For the third option, customer opinion with no ranking, you can derive sentiment from the context of the text using natural language processing (NLP). This data is the most common, and hence so are the approaches to analyzing it. It's not easy, but it's the sweet spot for gaining value from the massive volumes of consumer-generated text.

One widely available, cheap technology assigns an overall positive or negative sentiment based on assigning positive or negative values to individual words then summing them to get an overall sentiment rating. This approach fails in situations like the following:
"It's not bad" (two negatives that actually suggest a positive)
"I'm not going to say this sucks" (sarcasm or humor)
“The keyboard is impossibly small but the display is the best I’ve seen.” (combination)
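A minimal sketch of the word-sum approach described above, with an assumed toy lexicon (this data step is illustrative, not any particular vendor's method), shows exactly how it fails on the first example:

```sas
/* Naive word-sum sentiment scoring; the lexicon is an assumed toy */
data naive;
  length word $20;
  text  = "it's not bad";
  score = 0;
  do i = 1 to countw(text, ' ');
    word = lowcase(scan(text, i, ' '));
    if word in ('good', 'great', 'best')     then score = score + 1;
    else if word in ('bad', 'sucks', 'not')  then score = score - 1;
  end;
  /* "it's not bad" scores -2, though the phrase is mildly positive */
run;
```

Because "not" and "bad" each contribute -1, summing word scores labels a positive remark as strongly negative, which is why context-aware rules are needed.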

The most recent advances in sentiment analysis technology use a combination of techniques:
(1) statistics
(2) rule-based definitions and
(3) human intervention, e.g. a final review of the machine scoring.

The results are less expensive than human-only sentiment analysis, but more consistent. Why? Because the automation adds consistency, while the human verifies the result. Put into the right workflow, it clearly increases scalability by a substantial factor.

Teragram, a division of SAS, announced the Teragram Sentiment Analysis Manager at the Text Analytics Summit in early June. More to come on that!
June 11, 2009
 
I mentioned the buzz around Social Media Analysis (SMA) at the Text Analytics Summit. If we took all the speakers' content and produced a tag cloud, Twitter would have the biggest 'floor space'. I don't think there was a single presentation that did NOT mention Twitter.

While doing some background research for SMA, I ran across an article entitled State of the Twittersphere, which HubSpot blogged about just this week (that's @HubSpot for the 55.5% of Twitter users who don't follow anyone). There are a lot of really great Twitter usage statistics in this report. It's amazing how many people sign up with Twitter but are very inactive (I have multiple Twitter accounts and one is definitely contributing to inactivity). I'm more interested in those users that are very active. It would be good to connect with other users who post materials similar to my own (like a document recommendation system), and Text Mining can definitely help with this. I'd also like to see something like "users who posted materials like this also connected with these users", like the recommendations you get from Amazon. Ranking the tweets of users you follow based on content would also be fabulous. Some users post about both personal and business-related materials. I personally prefer not to read the personal posts (sorry y'all). Having personal tweets, or topics less interesting to me, appear further down the list (if at all) would be another desirable feature...

I have a bunch of other recommendations for Twitter product management - as do many other Twitter users. How about using Text Analytics/Text Mining for managing product requirements...
June 4, 2009
 
I am back in my office after a thoroughly enjoyable time at the annual Text Analytics Summit in Boston. I have to admit I was in my element rubbing shoulders with thought leaders, end users, analysts and press.

Jim Cox and I arrived Sunday afternoon to attend two preconference presentations: "Text Analytics for Dummies" by Conference Chair, Seth Grimes of Alta Plana, and a vendor comparison presentation by Nick Patience of technology industry analyst company, 451group.

The themes dominating the conference were: sentiment analysis, social media analysis, social network analysis, voice of the customer, eDiscovery, Web search, visualization, SaaS and Cloud.

We heard keynote presentations:
“Discover and Drive Brand Activity in Social Networks” by Emmanuel Roche, Teragram and Jim Cox, SAS

“A Tale of Two Search Engines – The Evolution of Search Technology and the Role of Social Networking in Marketing” – Usama Fayyad, Open Insights

“Sentiment Analysis” – Bing Liu, University of Illinois

We also saw end user case studies, analyst and end user panels, a Text Analytics Market Report by IDC, vendor presentations and a group of very active roundtable discussions.

Sentiment analysis was a dominant theme among vendors. Key capabilities focused on product- and feature-level sentiment extraction. Sentiment is also considered a key component of Social Media Analysis. While many vendors play in the social media analysis space, not many provide all the necessary capabilities on their own. Tracking social networks, reach, promoters, detractors, key influencers/key opinion leaders (KOL), and key themes/trends were put forth as valuable.

Voice of the customer / customer feedback continued to play a key role as text input to text analytics models that look for key issues being reported by customers.

eDiscovery was probably the top text analytics application area at this year's summit. Several law firms were represented, and the ability to mine legal documents was deemed crucial.

Web search in relation to advertising was shown to be very powerful due to the user indication of intent. Advertising based on Web search and user behavior improves click-through rate (CTR) by an average of 652%! Also mentioned was the mammoth effort required to tag massive volumes of rapidly changing Web content; numerous Web sites employ their user bases to do this for them. The new look of Web search goes far beyond providing lists of documents. Document facets, snippets, images, sentiment and more can be derived from search results.

Sue Feldman of IDC indicated the Text Analytics and Search market is moving in direct opposition to the current economic market. The analysts represented at the summit all agreed that visualization of huge volumes of text should be an area that all vendors pay more attention to. Other sentiments echoed by the analysts included the desirability of Software as a Service (SaaS) applications, and the overwhelming need (and analyst amazement) that Text Analytics vendors had not provided Cloud Computing yet.

On the whole, conference goers imparted a great amount of valuable information. I will wrap up my commentary with these overheard statements:
“Search doesn’t help you discover things you are unaware of.”
“TA technology can solve problems we don’t even know about yet.”
“Text analytics puts humanity into statistics.” (Thanks to Chris Bowman for that one!)
“The most common search on Monster is: Find me a job!” (followed by another that Blog Administrator refuses to post)
"Missing a piece of a puzzle is frustrating, can anyone spot the missing piece to my wardrobe?" [shoes]

Additional conference commentary can be found on twitter.com under #textsummit. My colleague Anne Milley also wrote about Day 1 and Day 2 on our sascom voices blog.
Curt Monash, we missed you this year!
SAS and Teragram would like to thank conference goers. It was a pleasure seeing you all!
June 2, 2009
 
The last time I visited blogs.sas.com, there were a handful of interesting blogs listed down the right side of the page. Over the last few weeks, I have seen and heard about new blogs coming online, but it didn't really sink in. Then today, I visited blogs.sas.com to find a long and growing list of bloggers. The following highlights some of the new bloggers:

  • In Other Words (by Senior Vice President and Chief Marketing Officer, Jim Davis)
    Davis' initial posts offer insight into how to retain and motivate employees and how those efforts create an environment for innovation and loyalty. Read Davis' most recent post Loyalty insurance.

  • The Business Forecasting Deal
    Michael Gilliland is a product marketing manager at SAS who focuses on business forecasting. Gilliland's blog offers practical solutions for common mistakes and bad practices. He is currently blogging about interesting activity at F2009. Visit Gilliland's forecasting blog.

  • In the Final Analysis (by Executive Vice President, EMEA and Asia Pacific, SAS, Mikeal Hagstroem)
    Hagstroem is interested in optimizing business performance and uses his blog to discuss issues facing organizations today. Visit the blog for a global view into business.


You can access all blogs by SAS employees by visiting blogs.sas.com. And don't forget the blogs written by SAS users. I'm sure that we don't have a comprehensive list, but Alison Bolen does her best to keep a current list of blogs by SAS users.