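June 12, 2009

Source paper: http://support.sas.com/resources/papers/proceedings09/158-2009.pdf

I don't know how many people have SAS V9.2 yet, but since I now have it, I will start introducing some of the new version's features.

First up is PROC SGPLOT, a graphing procedure new in V9.2. Earlier versions of SAS did offer graphing procedures, but they were scattered across different procedures, which was inconvenient for users. Their old weakness also persisted: the graphs they produced were of poor quality. ODS later improved this somewhat, and V9.2 now gathers these older plot types into PROC SGPLOT. The SG in SGPLOT stands for Statistical Graphics. Let's look at what this new procedure can do.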

。HISTOGRAMS
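In earlier versions, a histogram had to be drawn with PROC GCHART or with the HISTOGRAM statement in PROC UNIVARIATE, and it took a fair amount of extra syntax. In PROC SGPLOT, a HISTOGRAM statement followed by the variable name is enough to produce a polished histogram.

Example: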

PROC SGPLOT DATA = Freestyle;
HISTOGRAM Time;
TITLE "Olympic Men's Swimming Freestyle 100";
RUN;
Result:

[Image: image006.jpg]

To add a probability density curve (a normal curve by default), just add a DENSITY statement and name the variable.

Example:

PROC SGPLOT DATA = Freestyle;
HISTOGRAM Time;
DENSITY Time;
TITLE "Olympic Men's Swimming Freestyle 100";
RUN;
Result:

[Image: image008.jpg]

。BAR CHARTS
For bar charts, the VBAR and HBAR statements from PROC GCHART carry over directly.

Example:

PROC SGPLOT DATA = Countries;
VBAR Region;
TITLE 'Olympic Countries by Region';
RUN;
Result:

[Image: image010.jpg]

To show how different groups contribute to each bar, just add the GROUP option.

Example:

PROC SGPLOT DATA = Countries;
VBAR Region / GROUP = PopGroup;
TITLE 'Olympic Countries by Region and Population Group';
RUN;
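Result:

[Image: image012.jpg]

If the bar height should reflect another variable rather than a frequency count, use the RESPONSE option to sum that variable instead.

Example: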

PROC SGPLOT DATA = Countries;
VBAR Region / RESPONSE = NumParticipants;
TITLE 'Olympic Participants by Region';
RUN;
Result:

[Image: image014.jpg]

。SERIES PLOTS
X-Y line plots are handled by the SERIES statement. Repeating the SERIES statement overlays the plots automatically, so there is no need to add an OVERLAY option as before.

Example:

PROC SGPLOT DATA = Weather;
SERIES X = Month Y = BRain;
SERIES X = Month Y = VRain;
SERIES X = Month Y = LRain;
TITLE 'Average Monthly Rainfall in Olympic Cities';
RUN;
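Result:

[Image: image016.jpg]

Next, let's see how to fine-tune a graph.

。XAXIS AND YAXIS STATEMENTS

These two statements change the settings of the X and Y axes, much like the older AXISn statements. In the graph above, the X axis represents months, yet the tick marks are 2.5, 5.0, 7.5, 10.0, and 12.5, which clearly makes no sense. Adding TYPE=DISCRETE places tick marks at the values that actually appear in the data. GRID draws a light gray line at each tick mark; adding GRID to both XAXIS and YAXIS produces a full grid background. LABEL renames the axis, and VALUES lets you define where the tick marks start and end and how far apart they are.

Example: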

PROC SGPLOT DATA = Weather;
SERIES X = Month Y = BRain;
SERIES X = Month Y = VRain;
SERIES X = Month Y = LRain;
XAXIS TYPE = DISCRETE GRID;
YAXIS LABEL = 'Rain in Inches' GRID VALUES = (0 TO 10 BY 1);
TITLE 'Average Monthly Rainfall in Olympic Cities';
RUN;
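Result:

[Image: image018.jpg]

。PLOT STATEMENT OPTIONS
To adjust the lines inside the plot area or the legend, add options to the SERIES statement. LEGENDLABEL changes the label shown in the legend. MARKERS, like the older SYMBOL statement, puts a symbol on each data point. LINEATTRS controls the line pattern and thickness.

Example: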

PROC SGPLOT DATA = Weather;
SERIES X = Month Y = BRain / LEGENDLABEL = 'Beijing' MARKERS LINEATTRS = (THICKNESS = 2);
SERIES X = Month Y = VRain / LEGENDLABEL = 'Vancouver' MARKERS LINEATTRS = (THICKNESS = 2);
SERIES X = Month Y = LRain / LEGENDLABEL = 'London' MARKERS LINEATTRS = (THICKNESS = 2);
XAXIS TYPE = DISCRETE;
TITLE 'Average Monthly Rainfall in Olympic Cities';
RUN;
Result:

[Image: image020.jpg]

。REFLINE STATEMENT
To add reference lines to a plot, use the REFLINE statement. Reference lines can be fine-tuned too: the TRANSPARENCY option sets their transparency, and the LABEL option attaches a label to each line.

Example:

PROC SGPLOT DATA = Weather;
SERIES X = Month Y = BRain;
SERIES X = Month Y = VRain;
SERIES X = Month Y = LRain;
XAXIS TYPE = DISCRETE;
REFLINE 2.03 4.78 1.94 / TRANSPARENCY = 0.5 LABEL = ('Beijing(Mean)' 'Vancouver(Mean)' 'London(Mean)');
TITLE 'Average Monthly Rainfall in Olympic Cities';
RUN;
Result:

[Image: image022.jpg]

。INSET STATEMENT
The INSET statement writes annotation text inside the plot area, and the POSITION option sets where it goes. To put a box around the text, simply add the BORDER option.

Example:

PROC SGPLOT DATA = Weather;
SERIES X = Month Y = BRain;
SERIES X = Month Y = VRain;
SERIES X = Month Y = LRain;
XAXIS TYPE = DISCRETE;
INSET 'Source Lonely Planet Guide'/ POSITION = TOPRIGHT BORDER;
TITLE 'Average Monthly Rainfall in Olympic Cities';
RUN;
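Result:

[Image: image024.jpg]

。THE SGPANEL PROCEDURE
In the past, drawing a separate plot for each group or subject in a data set and placing them all in one figure took a lot of effort. PROC SGPANEL now makes this chore easy. Suppose we want to draw a regression line for a data set. The REG statement in PROC SGPLOT does it, as follows: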


PROC SGPLOT DATA=sg.countries;
REG X=NumParticipants Y=TotalMedals;
TITLE 'Number of Participants by Total Medals Won for Each Country';
RUN;
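The graph looks like this:

[Image: image062.jpg]

If the data set contains six regions, how do we draw a regression plot for each one and arrange them 2×3 in a single figure? The program is as follows: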

PROC SGPANEL DATA=sg.countries;
PANELBY Region;
REG X=NumParticipants Y=TotalMedals;
TITLE 'Number of Participants by Total Medals Won for Each Country';
RUN;
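First call PROC SGPANEL, then use the PANELBY statement to declare Region as the classification variable, much like a CLASS statement. The rest of the program is the same as the PROC SGPLOT version. The result:

[Image: image064.jpg]

The paper includes several tables that list which statement to use in PROC SGPLOT for each type of graph, along with a selection of the more important options.

Since the descriptions there are simple, you can download the original paper and read them yourself. These statements are only a tiny fraction of what PROC SGPLOT offers, though learning them all should be enough for most purposes. For the complete syntax, see the SAS website.

URL: http://support.sas.com/documentation/cdl/en/grstatproc/61948/HTML/default/sgplot-stmt.htm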

ABOUT THE AUTHORS
Lora Delwiche and Susan Slaughter are the authors of The Little SAS Book: A Primer, and The Little SAS Book for
Enterprise Guide which are published by SAS Institute. The authors may be contacted at:
Lora D. Delwiche
(530) 752-9321
llddelwiche@ucdavis.edu

Susan J. Slaughter
(530)756-8434
susan@avocetsolutions.com
 Posted at 3:23 AM
June 12, 2009
 
I saw the following comment on Twitter yesterday about sentiment analysis limitations and decided it would make a good topic for a blog update:

@concannon: Can anybody explain to me why automated sentiment analysis is anything more than flaky, snake-oil BS? The technology just isn't ready yet.

I'm going to make a bold statement here: automated sentiment analysis, using the right methodology, is actually superior to human sentiment analysis. Bear with me and read through.

The available approaches to analyzing sentiment/satisfaction vary based on the data provided. I would categorize the approaches based on the availability of three types of data:
1. Customer feedback (free-form text) with customer ranked satisfaction (discrete value), like Amazon product reviews.
2. Customer feedback (free-form text) with manually ranked satisfaction (discrete value), where human readers subjectively score the content.
3. Customer feedback only, no ranked satisfaction, as with blog posts and comments.

For the first data type, machine learning algorithms do a good job of measuring overall sentiment (say, positive, neutral, or negative). Survey data and product review forums are examples of data suited to this approach. The problem is that not much text is gathered this way, with a purpose in mind. Even when it is, machine learning algorithms struggle to distinguish the positive elements from the negative ones. It's one thing to know that a customer is dissatisfied; it is another to know what they are dissatisfied about!

For the second data type, where there is no customer-ranked satisfaction, it is possible to build a statistical model from a sample of manually ranked documents, then automatically score the remaining unranked documents. Not many companies are willing to do this. It also doesn't truly represent the customer's opinion, just the reader's interpretation of what the customer thinks.
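As a rough illustration of "model a manually ranked sample, then score the rest", here is a minimal Python sketch. It assumes a tiny naive Bayes word-count classifier, which is just one possible statistical model and not necessarily what any vendor uses. The sample texts, labels, and function names are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split into simple word tokens.
    return re.findall(r"[a-z']+", text.lower())

def train(labeled_docs):
    """labeled_docs: (text, label) pairs scored by human readers."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of documents
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def score(model, text):
    """Return the most probable label for an unranked document."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_logp = None, -math.inf
    for label in sorted(doc_counts):
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        logp = math.log(doc_counts[label] / total_docs)
        for w in tokenize(text):
            if w in vocab:  # ignore words never seen in the sample
                logp += math.log((counts[w] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Hypothetical manually ranked sample (made up for illustration).
sample = [
    ("great product, works well", "positive"),
    ("love the display", "positive"),
    ("terrible battery, broke quickly", "negative"),
    ("poor support and terrible service", "negative"),
]
model = train(sample)
print(score(model, "terrible service"))
print(score(model, "great display"))
```

Note that the model can only reproduce the human readers' labels, which is exactly the limitation described above: it captures the reader's interpretation, not the customer's own rating.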

For the third option, customer opinion with no ranking, you can derive sentiment from the context of the text using natural language processing (NLP). This type of data is the most common, and so are the approaches to analyzing it. It's not easy, but it's the sweet spot for gaining value from the massive volumes of consumer-generated text.

One widely available, cheap technology assigns an overall positive or negative sentiment based on assigning positive or negative values to individual words then summing them to get an overall sentiment rating. This approach fails in situations like the following:
"It's not bad" (two negatives that actually suggest a positive)
"I'm not going to say this sucks" (sarcasm or humor)
“The keyboard is impossibly small but the display is the best I’ve seen.” (combination)
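The failure cases above are easy to reproduce. The following Python sketch implements the word-summing idea under stated assumptions: the two word lists are tiny and made up for illustration (real lexicons are far larger), and the function names are hypothetical.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"good", "great", "best"}
NEGATIVE = {"not", "bad", "sucks", "small"}

def word_sum_sentiment(text):
    """Sum +1 for each positive word and -1 for each negative word."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

def label(text):
    """Map the summed score to an overall sentiment label."""
    score = word_sum_sentiment(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

examples = [
    "It's not bad",
    "I'm not going to say this sucks",
    "The keyboard is impossibly small but the display is the best I've seen.",
]
for text in examples:
    print(label(text), "<-", text)
```

With these word lists, the first two sentences come out negative because "not" and "bad" (or "sucks") are simply added together, and the third comes out neutral because one negative word cancels one positive word, hiding the fact that the customer dislikes the keyboard and likes the display.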

The most recent advances in sentiment analysis technology use a combination of techniques:
(1) statistics
(2) rule-based definitions and
(3) human intervention, e.g. a final review of the machine scoring.

The results are less expensive than human-only sentiment analysis and more consistent. Why? Because the automation adds consistency, while the human verifies the result. Put in the right workflow, this combination clearly increases scalability by a substantial factor.

Teragram, a division of SAS, announced the Teragram Sentiment Analysis Manager at the Text Analytics Summit in early June. More to come on that!
June 11, 2009

I mentioned the buzz around Social Media Analysis (SMA) at the Text Analytics Summit. If we took all the speakers' content and produced a tag cloud, Twitter would have the biggest 'floor space'. I don't think there was a single presentation that did NOT mention Twitter.

While doing some background research for SMA, I ran across an article entitled State of the Twittersphere, which HubSpot blogged about just this week (that's @HubSpot for the 55.5% of Twitter users who don't follow anyone). There are a lot of really great Twitter usage statistics in this report. It's amazing how many people sign up with Twitter but are very inactive (I have multiple Twitter accounts and one is definitely contributing to inactivity). I'm more interested in the users who are very active. It would be good to connect with other users who post materials similar to my own (like a document recommendation system), and Text Mining can definitely help with this. I'd also like to see something like "users who posted materials like this also connected with these users", like the recommendations you get from Amazon. Ranking the tweets of users you follow based on content would also be fabulous. Some users post both personal and business-related materials. I personally prefer not to read the personal posts (sorry y'all). Having personal tweets, or topics less interesting to me, appear further down the list (if at all) would be another desirable feature...

I have a bunch of other recommendations for Twitter product management - as do many other Twitter users. How about using Text Analytics/Text Mining for managing product requirements...
June 4, 2009
 
I am back in my office after a thoroughly enjoyable time at the annual Text Analytics Summit in Boston. I have to admit I was in my element rubbing shoulders with thought leaders, end users, analysts and press.

Jim Cox and I arrived Sunday afternoon to attend two preconference presentations: "Text Analytics for Dummies" by Conference Chair, Seth Grimes of Alta Plana, and a vendor comparison presentation by Nick Patience of technology industry analyst company, 451group.

The themes dominating the conference were: sentiment analysis, social media analysis, social network analysis, voice of the customer, eDiscovery, Web search, visualization, SaaS and Cloud.

We heard keynote presentations:
“Discover and Drive Brand Activity in Social Networks” by Emmanuel Roche, Teragram and Jim Cox, SAS

“A Tale of Two Search Engines – The Evolution of Search Technology and the Role of Social Networking in Marketing” – Usama Fayyad, Open Insights

“Sentiment Analysis” – Bing Liu, University of Illinois

We also saw end user case studies, analyst and end user panels, a Text Analytics Market Report by IDC, vendor presentations and a group of very active roundtable discussions.

Sentiment analysis: Key capabilities focused on product- and feature-level sentiment extraction. Sentiment is also considered a key component of Social Media Analysis. While many vendors play in the social media analysis space, not many provide all the necessary capabilities on their own. Tracking social networks, reach, promoters, detractors, key influencers/key opinion leaders (KOL) and key themes/trends were put forth as valuable.

Voice of the customer / customer feedback continued to play a key role as text input to text analytics models that look for key issues being reported by customers.

eDiscovery was probably the top text analytics application area at this year's summit. Several law firms were represented, and the ability to mine legal documents is crucial.

Web search in relation to advertising was shown to be very powerful because the user indicates intent. Advertising based on Web search and user behavior improves click-through ratio (CTR) by an average of 652%! Also mentioned was the mammoth effort required to tag massive volumes of rapidly changing Web content. There are numerous Web sites that employ their user bases to do this for them. The new look of Web search goes far beyond providing lists of documents. Document facets, snippets, images, sentiment and more can be derived from search results.

Sue Feldman of IDC indicated the Text Analytics and Search market is moving in direct opposition to the current economic market. The analysts represented at the summit all agreed that visualization of huge volumes of text should be an area that all vendors pay more attention to. Other sentiments echoed by the analysts included the desirability of Software as a Service (SaaS) applications and the overwhelming need for Cloud Computing offerings (along with the analysts' amazement that Text Analytics vendors had not provided them yet).

On the whole, conference goers imparted a great amount of valuable information. I will wrap up my commentary with these overheard statements:
“Search doesn’t help you discover things you are unaware of.”
“TA technology can solve problems we don’t even know about yet.”
“Text analytics puts humanity into statistics.” (Thanks to Chris Bowman for that one!)
“The most common search on Monster is: Find me a job!” (followed by another that Blog Administrator refuses to post)
"Missing a piece of a puzzle is frustrating, can anyone spot the missing piece to my wardrobe?" [shoes]

Additional conference commentary can be found on twitter.com under #textsummit. My colleague Anne Milley also summarized Day 1 and Day 2 on our sascom voices blog.
Curt Monash, we missed you this year!
SAS and Teragram would like to thank conference goers. It was a pleasure seeing you all!
June 2, 2009
 
The last time I visited blogs.sas.com, there were a handful of interesting blogs listed down the right side of the page. Over the last few weeks, I have seen and heard about new blogs coming online but it didn't really sink in. Then today, I visited blogs.sas to find a long and growing list of bloggers. The following highlights some of the new bloggers:

  • In Other Words (by Senior Vice President and Chief Marketing Officer, Jim Davis)
    Davis' initial posts offer insight into how to retain and motivate employees and how those efforts create an environment for innovation and loyalty. Read Davis' most recent post Loyalty insurance.

  • The Business Forecasting Deal
    Michael Gilliland is a product marketing manager at SAS who focuses on business forecasting. Gilliland's blog offers practical solutions for common mistakes and bad practices. He is currently blogging about interesting activity at F2009. Visit Gilliland's forecasting blog.

  • In the Final Analysis (by Executive Vice President, EMEA and Asia Pacific, SAS, Mikeal Hagstroem)
    Hagstroem is interested in optimizing business performance and uses his blog to discuss issues facing organizations today. Visit the blog for a global view into business.


You can access all blogs by SAS employees by visiting blogs.sas.com. And don't forget the blogs written by SAS users. I'm sure that we don't have a comprehensive list, but Alison Bolen does her best to keep a current list of blogs by SAS users.
May 23, 2009
 
The SAS campus has a great art installation titled Frightened Deer by Richard Rothschild. (That's it in the picture below.) As you can see, this is a large art installation. I run or drive by these deer almost every day. Most days I am oblivious to their existence, but some mornings, I find myself pulling up short and sucking in my breath, startled by what is about to attack me. How can I forget that these deer are there? It's easy really. They are a part of the background of my life.




Sometimes the things we are most blind to are the things that we look at every day. I haven't found a trick that helps me open my eyes and mind and take in my everyday surroundings. But I'm trying.

What fades into the background on support.sas.com


My team monitors the comments that come from visitors to the Website and actively solicits input from you. From your comments, I have created a list of support.sas.com elements that have faded into the background for many site visitors. The remainder of this post will introduce you to a few of the features that you might be missing.
Continue reading "What you don't see is right in front of you"
May 21, 2009
 
We're on a roll with discussion forums. We launched yet another customer-requested forum this week; it focuses on data mining and text mining. Mining is all about digging through vast amounts of data to find trends that enable creating predictive and descriptive models. The SAS Data Mining and Text Mining forum will be most helpful to those who have SAS Enterprise Miner, SAS Text Miner, and SAS Credit Scoring. However, you don't have to have these products to explore large data sets. Note: If you are unfamiliar with the data and text mining offerings from SAS, review the material provided in the Products & Solutions section of the SAS Web site.

As always, we hope that you will use this forum to share experiences, post questions and suggestions, offer solutions, and interact with other SAS data miners. Remember that you can follow the conversation in e-mail by setting a watch or in an RSS feed by subscribing to items that interest you. Instructions for both of these tasks are provided in Watching a forum.
May 19, 2009

I wrote a piece for eWeek about the Voice of the Customer. In it, I talk about how conversational data collected in call centers is growing faster than our ability to deal with it. Those who don't want to miss insights buried in their data can now turn to predictive modeling (data mining and text mining) to help them perform voice analytics. Armed with these emerging technologies, you can decipher key messages from all the noise and really listen to what customers are saying. Those who learn quickly can respond first (before competitors do) and can deliver better service and better products, resulting in happier customers!

Component No. 3: Voice mining your own business

The best way to overcome this obstacle is to secure access to analytic experts at the same time you address any voice mining software purchases. A trained analytical expert will ensure you not only "see" insights, but actually move on them and get the value out of predictive analytic workbenches.


Where have you seen these technologies implemented? Do share!
May 10, 2009

If you are applying these technologies today, or are considering implementing Text Analytics in your organization in the near future, we invite you to take a few moments to complete a survey here.

As Manya and others have stated, interest in this field is indeed growing; however, many challenges remain unanswered for our R&D groups to pursue. With your input, you can help craft the direction of the next enhancements and guide future application direction. This is an opportunity for all of you out there to share your Perceptions & Plans for text analytics.

Seth Grimes' text-analytics survey will close tomorrow, May 10. He'll write up his findings on how organizations are dealing with unstructured sources, and on the role text mining/analytics plays, as a free report available in early June.

The survey will take you 5-10 minutes. Thanks for responding!

PS: New members are welcome in the Yahoo group on text analytics.

Read about it and join us here: http://tech.groups.yahoo.com/group/TextAnalytics/