December 12, 2008
During the many years I've been looking at customer data with SAS Text Miner, I've run into a few situations where I've wondered whether I need danger money. Here's why:

I had an interesting experience analyzing Web search results for some consumer security software, results of the 'less than savory' kind. Colorful language is common on Web sites for the younger generation (18 yrs +/- ~6 yrs), and colorful Web sites are just plain common. An innocent search can lead you to places you never intended to go. You are likely to come across this kind of HTML data in text mining/text analytics at some point (like I did), so I am always pleased to see that Text Miner creates its own segment for this data. I can treat it as noise and continue my analysis, focusing on more useful trends.
December 12, 2008
I got an email yesterday from a government customer that asked a very good question:
It seems like Wordnet might be used to construct synonym lists [for SAS Text Miner] that could map terms "up" to more general synonyms possibly reducing noise and enhancing concept extraction. Has anyone in TM R+D ever considered using Wordnet?
Wordnet is a public-domain thesaurus/lexical database for English. It contains synsets, or synonym rings, that can show all the words related to a given word. Since we allow the user to create synonym lists for SAS Text Miner, it seems reasonable to assume that a generic, free source of a huge list of synonyms might be beneficial. We have indeed looked at Wordnet before, but found that the reality does not live up to the expectation. In fact, generic synonym substitution usually generates worse results than doing nothing at all.

To see why, we need to look at when synonym substitution is helpful.

Continue reading "When are synonyms useful?"

Haiku from SAS R&D staff

December 10, 2008

First prompts are silent.
Subsequent prompts loud and clear.
Now all prompts are heard.

Poems from R&D staff?
Yes. Rhyming sonnets would be as complex as Shakespeare's,
so they wrote Japanese haiku, as shown above.

SAS R&D staff have to complete some paperwork in the defects system before changing code. In the early stages they use informal descriptive language (HAIKUUU!). Chris Hemedinger, a senior software engineer at SAS, collected some of these haiku on his blog to show the humorous side of SAS R&D staff. For comparison, here is one of the most famous haiku by Matsuo Bashō:

Old pond
a frog jumps
the sound of water

December 10, 2008

Happy hearts and happy faces,
Happy play in grassy places–
-Good and Bad Children by Robert Louis Stevenson

I read this verse in W. Bennett’s popular book, The Book of Virtues, on the bus to work this morning. I enjoyed reading Stevenson’s Treasure Island, in a Chinese edition of course, when I was young.

Yes, it sounds “uncool”: I went to work with technical documents in my bag and read a children's book. A grown-up with childlike innocence? I dare not say. I just read the book to refresh my mind and my English.

It is snowing a little in Beijing.

December 10, 2008

I just wanted to quickly introduce myself as the SAS R&D manager for SAS Text Miner. With my research-oriented background, I will be posting distinctly different types of blog entries than you will see from Manya, Barry and Mary.

I will be looking at detailed technical approaches and algorithms being researched for handling text data, i.e. the grungy details. So if you are more interested in a bird's eye view, you may want to skim over my postings. On the other hand, if you want to understand how things work, why we've decided to take the approach that we do, and what we are considering doing for the future, then tune right in. And I encourage you to make comments and suggestions. I am not tied to particular approaches, and I would love to find out "better" ways to do things that we may not even have considered.

Particular areas that I will be blogging about in the coming months include:
Continue reading "My wisdom (and lack thereof)"
December 6, 2008

Do you remember the old Faberge shampoo marketing campaign? "I told two friends about Faberge shampoo and they told two friends and so on and so on and so on ... " The one with the great visual to highlight viral marketing?

The onslaught of blogs and social media sites has initiated a huge power shift INTO the hands of customers. This can be good (if your customer is singing your praises), bad (if they are not), or more likely both.

The reality is that this represents a huge opportunity for businesses to use all available data in decision making to help you understand not only what your customers look like, but what they think. During a downturn in any economy, the customer is the last bastion, THE touch point to help you better understand what they think about your products and services.

Text mining is the technology that integrates structured and unstructured data to help you better understand your customers, enabling you to surpass the competition, save time and save money. While blogs and social media sites put power into customers’ hands, they can also empower businesses. Consider the JetBlue fiasco, which generated outrage across the Internet. The JetBlue CEO publicly apologized via YouTube! Since then, some 338,000 YouTube users have viewed the apology. They gave David Neeleman four stars for his performance. What also came out of this was the opportunity to mine additional information: customer comments in response to the YouTube apology about the flight cancellations, and peer ratings of those comments. All the makings of a “goldmine” for text and data mining for decision making.

Current manual processes are inconsistent, costly and time-consuming, with information typically organized by functional area, not across the enterprise. Decisions get made in isolation. It's clear that companies must have automated processes to mine data to consistently identify and quantify customer/product issues. Text mining is that technology. Businesses are rapidly embracing it. Are you one of them?
December 4, 2008
To take yesterday’s quote from a social media friend: “we live in a world of unlimited ideas”. When it comes to analyzing text, this quote would probably have to be my mantra. Analyzing text itself isn’t exactly a new idea. Government agencies have been doing this behind closed doors for a long time. What’s new is the ability to understand textual information while NOT being behind closed doors. Text mining/text analytics technology is available for commercial businesses to understand data about their customers, their competitors and much more. We use text for analysis, and combine it with related numeric fields. But even numbers can be saved as text strings, and voice signals can be translated to text and used for better understanding of information. Imagine a bullet pushing the sound barrier. That’s what I picture when I think of The Text Frontier. We like to push boundaries, hard. And we’d like to share our experiences with you. We encourage you to join us in pushing boundaries while sharing your experiences, or just watch us and comment. Whether you watch, wait or dive in with your thoughts, here’s something for you to think about:
2 + 2 = 4, but
two + 2 > 4
November 18, 2008
Treasures and memories and trash are what I found in my closet during a much-needed cleaning. This was one of those deep cleans that only happen once every few years. I looked in every box, bag, and dark corner. You wouldn't believe the things I found -- treasures, memories, and trash. During one archeological dig into a plastic storage bag, I found a purse that had been long forgotten. As I was preparing it for the charity pile, I noticed a brilliant blue corner of cloth peeking out from inside the purse. I found this:

[Image: "I do windows" T-shirt]
Do you recognize it? It is a SAS T-shirt from the early 90s. I got this shirt shortly after coming to work for SAS. (I'm guessing that it was about the time we released SAS 6.08.) Running SAS on Windows was new and exciting in the early 90s and this was a hot shirt. Finding this pristine, never-worn T-shirt started me thinking. I can't be the only person with old SAS memorabilia stashed in a closet or drawer.

This post from Tom Hide on SAS-L assures me that I am not the only person keeping stuff. Tom has a copy of Guide to Using SAS 76. I've only seen pictures of manuals this old. See the pictures at the end of this post for a glimpse of old manuals as well as some other items from the past.

Do you have the oldest item or the most unique?

Let's have a little fun with our old stuff. What do you have? Is it old, funny, or cherished? Is it from a user group conference or from SAS? Tell us about it by writing a comment on this post. Show it to us by posting a link to a picture in your comment or by uploading pictures of your items to the sasmemories group on Flickr. If you just want to see the pictures from others, join the Flickr group at www.flickr.com/groups/sasmemories.
Continue reading "Treasures and Memories and Trash"
November 14, 2008
As you prepare your 2009 travel and education budgets, keep SAS Global Forum 2009 in mind. This conference is a great way to share and expand your existing SAS knowledge. The Gaylord National Resort in National Harbor will be overrun with SAS professionals from around the world. You can

  • meet the authors of your favorite conference paper
  • discover a new favorite paper
  • chat with SAS Technical Support staff, SAS R&D staff, and other SAS users.

Wait, there's more. The conference offers

If you are as excited about SAS Global Forum 2009 as I am, here's some information that you need to know.

SAS Global Forum registration and housing are now open. Register online for prompt confirmation and payment processing. Visit the home page at www.sasglobalforum.com and click REGISTER NOW.

The online registration system has a number of features that we hope you will find helpful. Those features include:

  • The ability to connect directly to hotel reservations at the end of your registration
  • The ability to log back into the system to review or modify your registration and hotel
  • Separate meal selections for each guest

If you have questions while registering or you are just wondering what else SAS Global Forum has to offer, look for the Chat Now option on the right side of most of our conference pages.

SAS Global Forum has reserved rooms at discounted group rates for conference attendees at hotels in the National Harbor, Maryland, and Alexandria, Virginia, areas. Reservations must be made by booking online.
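November 10, 2008
Original paper: http://www.nesug.info/Proceedings/nesug06/dm/da30.pdf

This is a SAS technical paper on how to combine and split data with the data step and proc sql, presented by Emmy Pahmer at NESUG 2006. The example data used in the paper is shown here: [screenshot: 2008-11-03_1353]. The data set has only five observations, each with three variables: age, name (the variable who), and sex. The sex value of the fifth observation is mistyped as G.

Splitting data

To split this data by sex into two separate data sets, there are several options.

1. Use two data steps: this is the clumsiest but also the most direct way. The code is: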
data males;
    set everyone;
    if sex = 'M';
run;

data females;
    set everyone;
    if sex = 'F';
run;
2. Use two data steps with where: this method merely reduces the number of lines. The code is:
data female;
    set everyone (where=(sex='F'));
run;

data male;
    set everyone (where=(sex='M'));
run;
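As the program shows, this simply turns the IF statement into a WHERE= option placed after the data set name on SET. The line count does go down, but there is more to type, because the WHERE condition has to be wrapped in parentheses.

3. Use two data steps with where on the DATA statement: this method merely moves WHERE= from SET to the DATA statement; there is nothing special about it. The code is: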
data female (where=(sex = 'F'));
    set everyone;
run;

data male (where=(sex = 'M'));
    set everyone;
run;
4. Use one data step with IF...ELSE... and OUTPUT: this is a more advanced approach that can shorten the code considerably:
data males females;
    set everyone;
    if sex = 'F' then output females;
    else if sex = 'M' then output males;
run;
* Same split, plus a log message for unexpected values ;
data males females;
    set everyone;
    if sex = 'F' then output females;
    else if sex = 'M' then output males;
    else put "Neither F nor M - check " _all_; *or output to another dataset ;
run;
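The final else line writes every observation whose sex is neither F nor M to the log, together with all of its variables (_all_ here is a variable list for PUT, not a data set). Alternatively, define another data set and output those observations to it. If that data set turns out to be empty when you open it, you can be sure there were no typos.

5. Use one data step with WHERE: this is a refinement of method 3. The code is: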
data female (where=(sex='F')) male (where=(sex='M'));
    set everyone;
run;
This code is even more compact than method 4. To get the typo check from the second version of method 4, it can be extended as follows:
data female (where=(sex='F'))
     male (where=(sex='M'))
     checkothers (where=(sex not in ('M','F')));
    set everyone;
run;
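The idea is the same as in method 4: everything whose sex is neither M nor F goes to the checkothers data set, and you then check whether that data set is empty.

6. Use proc sql: proc sql may look fancier, but the line count does not go down.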
proc sql;
    create table males as
    select *
    from everyone (where=(sex='M'));
    create table females as
    select *
    from everyone (where=(sex='F'));
quit;
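The split above can also be written with an ordinary SQL WHERE clause instead of the where= data set option. The following is only a sketch: the everyone data step is a hypothetical stand-in (the names come from the post, the ages and sex values are made up, with the fifth row mistyped as G), and the checkothers table mirrors the check from method 5.

```sas
/* Hypothetical stand-in for the everyone data set; ages and sex
   values are made up for illustration. */
data everyone;
    input who $ age sex $;
    datalines;
Annie 25 F
Bill 31 M
Chandra 28 F
David 40 M
Eleanor 35 G
;
run;

proc sql;
    /* Ordinary SQL WHERE clause instead of the where= data set option */
    create table males as
        select *
        from everyone
        where sex = 'M';

    create table females as
        select *
        from everyone
        where sex = 'F';

    /* Rows whose sex is neither M nor F, for a typo check */
    create table checkothers as
        select *
        from everyone
        where sex not in ('M', 'F');
quit;
```

With this made-up data, checkothers should end up holding only the mistyped fifth row.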
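7. Use one data step to put different variables into different data sets: to put who and age into a new data set age, and who and sex into a new data set sex, split the data in the same style as method 5. The program is: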
data age (keep = who age) sex (keep = who sex);
    set everyone;
run;
8. Do method 7 with proc sql. The program is:
proc sql;
    create table age as
    select who, age
    from everyone;
    create table sex as
    select who, sex
    from everyone;
quit;
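The line count does not drop much; the syntax just reads closer to spoken language.

Combining data

This time two data sets are used, shown here: [screenshot: 2008-11-03_1421]. EVERYONE is the same as before. The thing to note about the new ACTIVITY data is that it has not been sorted. This matters because a SAS merge of this kind needs a BY variable, so that SAS can match observations from the two data sets on it, and every data set must already be sorted by that variable. Otherwise the merge goes wrong. Sometimes you still get output, but the result is incorrect, and if you overlook the message in the log window you are in trouble. So to merge these two data sets by the variable who, first sort by it with PROC SORT: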
proc sort data=activity;
    by who;
run;
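It is a good habit to sort every data set that takes part in the merge, so that none slips through. The PROC SORT code is simple, and a few extra lines buy peace of mind: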
proc sort data=everyone;
    by who;
run;
Then combine them with merge and by:
data combined_11;
    merge everyone activity;
    by who;
run;
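The result: [screenshot: 2008-11-03_1427]. As the table shows, Annie, Bill, Chandra, Igor, Jose and Karen have rows in ACTIVITY, so after the merge their values appear under the activity variable. David and Eleanor do not appear in ACTIVITY, so their activity is missing after the merge. The same reasoning applies to age for anyone who is missing from EVERYONE. To keep only the observations that appear in both data sets, use the in= data set option. The code is: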
data combined_12;
    merge everyone (in = in_a) activity (in = in_b);
    by who;
    if in_a and in_b;
run;
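The in_a and in_b variables in the step above exist only while the step runs and are not written to combined_12. To look at them, one option is to copy them into ordinary variables. The sketch below assumes small made-up stand-ins for the two data sets, already in who order; the names flags, from_everyone and from_activity are hypothetical.

```sas
/* Hypothetical stand-ins, already sorted by who; values are made up. */
data everyone;
    input who $ age;
    datalines;
Annie 25
Bill 31
David 40
;
run;

data activity;
    input who $ activity $;
    datalines;
Annie swim
Bill golf
Igor chess
;
run;

data flags;
    merge everyone (in = in_a) activity (in = in_b);
    by who;
    /* IN= variables are temporary, so copy them to keep them */
    from_everyone = in_a;
    from_activity = in_b;
run;

proc print data=flags;
run;
```

With this data, David should come out with from_everyone=1 and from_activity=0, and Igor with the reverse.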
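With the in= option, SAS creates a temporary variable for each data set during the merge, here named in_a and in_b. Each one is 1 when its data set contributes to the current observation and 0 otherwise. The statement "if in_a and in_b;" then keeps only the observations for which both are 1 (writing "if in_a=1 and in_b=1;" works too); observations found in only one of the two data sets are dropped. The result: [screenshot: 2008-11-03_1433]. To find out which observations were dropped, use the following code: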
data combined_14;
    merge everyone (in = in_a) activity (in = in_b);
    by who;
    if in_a and in_b then output;
    else if in_a then put "In A only: " Who=; *or output to another dataset ;
    else if in_b then put "In B only: " Who=;
run;
The last two else if statements make the program print the following in the log window: [screenshot: 2008-11-03_1437]. Likewise, proc sql can achieve the same result:
proc sql;
    create table combined_15a as
    select a.*, b.activity, b.sex
    from everyone as a, activity as b
    where a.who = b.who
    order by activity
    ;
quit;
proc sql;
    create table combine_15b as
    select a.*, b.activity, b.sex
    from everyone as a inner join activity as b
    on a.who = b.who
    ;
quit;
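The two queries above keep only rows found in both tables, as combined_12 does. A rough SQL counterpart of the plain merge in combined_11, which keeps everyone from either side, would be a full join. This is a sketch under assumptions: the data values are made up and the table name combined_full is hypothetical.

```sas
/* Hypothetical stand-ins; values are made up. SQL joins do not
   need the inputs to be sorted, so activity is left unsorted. */
data everyone;
    input who $ age;
    datalines;
Annie 25
Bill 31
David 40
;
run;

data activity;
    input who $ activity $;
    datalines;
Igor chess
Annie swim
Bill golf
;
run;

proc sql;
    create table combined_full as
        /* coalesce takes who from whichever side has the row */
        select coalesce(a.who, b.who) as who,
               a.age,
               b.activity
        from everyone as a full join activity as b
            on a.who = b.who
        order by who;
quit;
```

Rows present on only one side get missing values for the other side's columns, as in the merge result.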
However, because proc sql syntax is more cumbersome, the data step is still the recommended way to do this.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Emmy Pahmer
MDS Pharma Services
St. Laurent, Québec
Work Phone: (514) 333-0042 ext. 4222
E-mail: emmy.pahmer@mdsinc.com