text analytics

December 16, 2010
After recently listening to another text analytics vendor talk about how they describe what they do, I couldn’t hold off on saying this any longer: Words, especially in the field of text, matter. SAS has been offering Text Mining, one aspect of SAS Text Analytics, since 2002. However, with [...]
October 22, 2010
Facebook now uses technology to help detect instances of cyberbullying before it gets out of hand. I heard this report on CNN last week, and the Facebook spokesman described their detection techniques as "background technology that I can't really talk about." We don't know for certain what techniques Facebook employs for this problem. But Richard Foley does a great job of describing what could be done using text analytics. Richard's explanation is a nice complement to the series of posts from Jim Cox about how text mining determines topics.

Text Analytics at the M2010 Data Mining Conference

 analytics, data mining conference, text analytics, text mining
August 31, 2010
Are you attending the Data Mining Conference this year? If your plans bring you to Las Vegas this October, you'll want to know what Terry Woodfield is up to! Terry is teaching his new course, Text Analytics with SAS Text Miner, at the upcoming M2010 Data Mining Conference. Terry has been a SAS instructor for more than 10 years and has attended several Data Mining Conferences. He took some time out of his busy schedule to answer a few questions from Michele Reister on The SAS Training Post blog. To learn more about this new course at the Data Mining Conference and some of the things he is looking forward to at M2010, you can read the original blog post.
August 27, 2010
Like a parent seeing my child on stage at an awards ceremony, I smile when I see SAS software front and center. Four times this year, SAS Text Analytics has received awards.

First, based on its text analytics launch, KMWorld named SAS in the “100 Companies that Matter in Knowledge Management.” ChinaHR.com was highlighted for using SAS Text Analytics to extract data from résumés for its online recruitment services.

Also this year, SAS Sentiment Analysis earned a Communications Solutions Product of the Year Award from TMCnet. SAS Sentiment Analysis uses both statistical techniques and linguistic rules to extract the sentiment expressed in text collections.

Then, in conjunction with SAS Text Analytics customer Sub-Zero Inc. winning the 2010 Progressive Manufacturing 100 award, SAS was inducted into the Technology Partner Hall of Fame by Managing Automation magazine. Sub-Zero reduced costs by $4 million using SAS® software and services to improve customer service.

And, most recently, KMWorld magazine named SAS Text Analytics a Trend-Setting Product for 2010, citing SAS’ deep analytics expertise in processing unstructured documents, which helps organizations reduce manual labor, improve workflow and reduce operating costs.

With SAS Text Analytics at the ready, the days of manually reading through stored documents to find information are gone. What’s more, by examining entire collections of materials, people are discovering patterns that would never emerge from reading each document in isolation.

Few things of value come from working alone. The entire SAS Analytics team shares in this parenting joy. We’d love to hear your text analytics questions and suggestions. Please let us know your SAS successes so we can nominate you for an award.
July 20, 2010
Back in April we hosted Text Analytics 101, the first in this year’s Applying Business Analytics Webinar Series. As many of you know, organizations today are faced with a flood of text-based content, a shortage of domain and subject-matter experts and an inability to analyze data in an automated, consistent manner. So we created this 101 session to provide practical, accessible advice about the methods and technologies that will enable you to improve efficiencies, ease the strain on staff resources and seamlessly incorporate text-based insights for better decisions. SAS’ Kathy Lange and Fiona McNeill walked 350+ attendees through a Text Analytics overview.

We had registrations from the US and 40 other countries, and attendees joined us from a vast array of industries, including communications, education, federal government, financial services, health and life sciences, retail and manufacturing, and state and local government. Of the 350+ attendees, about 40% responded to the post-webinar survey and shared some interesting feedback about the stage of text analytics adoption they are in. From the results, it looks to me like the majority are in initial investigation, and some more 101 sessions might be just what folks need to understand the landscape.

Other attendees were assessing vendors, enhancing existing methods or reviewing technologies. Several were already implementing, and a couple were already in a testing phase. How about your organization? What phase are you in?
July 10, 2010
You would think that in a discipline focused on information retrieval, where understanding the meaning of words and phrases is critical, everyone would know the difference between a taxonomy, an ontology, a thesaurus and semantics. Oddly, this is not always the case.

It's not that the terms are poorly defined; the terms are extremely well defined - with fun words like entity, hyponym, or zeugma. Now you have some new words for Scrabble. I think, could be wrong, the confusion comes from the fact that information retrieval relies on each term in the acronym TOTS.

Some asides:
  • The phrase "I think, could be wrong," is a zeugma: the pronoun "I" links "could be wrong" with "I think"; the "I" is implied in "could be wrong."

  • The game Scrabble is an example of an entity, a separate and distinct object or concept.

  • Information Retrieval is a hyponym of the more general term Search.
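
The hyponym aside can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the BROADER table and the is_hyponym helper are hypothetical names made up for this example, and the tater tots entries are invented to match the theme of the post. Each term points to its more general term, and a term counts as a hyponym of anything reachable by following those links upward.

```python
# Hypothetical toy taxonomy: each term maps to its broader (more general) term.
# Only the first entry comes from the aside above; the rest are invented.
BROADER = {
    "information retrieval": "search",
    "tater tots": "potato dishes",
    "potato dishes": "food",
}


def is_hyponym(term, general):
    """Return True if `term` is a narrower kind of `general` in BROADER."""
    seen = set()
    current = term
    # Walk up the chain of broader terms, guarding against accidental cycles.
    while current in BROADER and current not in seen:
        seen.add(current)
        current = BROADER[current]
        if current == general:
            return True
    return False


print(is_hyponym("information retrieval", "search"))
print(is_hyponym("search", "information retrieval"))
```

The relationship runs one way only: Information Retrieval is a kind of Search, but Search is not a kind of Information Retrieval.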

Now, I will use one of my favorite guilty pleasures, tater tots, to help explain the differences between a taxonomy, an ontology, a thesaurus and semantics, aka TOTS.

I recently read an article about new, trendy gourmet tots: everything from blue cheese tots to truffle tots. Another trendy thing is foraging for wild food. Depending on where you live this might include: clams, wild mushrooms, herbs, thistle, blueberries and much more. In order to understand the definitions contained in the acronym TOTS, I will try to discover a recipe for the ultimate trendy food "Gourmet forage tater tots," and define what I'm doing as I go.

To start, how do search engines and text analytics products know what I am talking about when I type the word "tots"? Am I talking about Tater Tots? Or an 8th grade class? What are the semantics of what I am talking about in the document?
Continue reading "TOTS: Taxonomy, Ontology, Thesauri and Semantics"

Corpus Callosum .. Where Right and Left Brain Meet

 Barry DeVille, linguistics, text, text analytics
June 4, 2010
The Corpus Callosum is a huge switching station in the middle of our brains that connects the right and left hemispheres. Without it we would not be able to reason about what we are looking at (reasoning is a left brain function while vision is in the right brain).

Similarly, in Text Analytics, the "Corpus" is the "huge switching station" that tells us the meaning of words and how to associate different forms of words to the items of interest that we are trying to extract from text.

The Wall Street Journal’s “Numbers Guy” -- Carl Bialik -- quoted Mike Calcagno, general manager of the Microsoft group that manages Word. Calcagno says "Text corpora is the lifeblood of most of our development and testing processes."

"Microsoft has licensed over one trillion words of English text in each of the past two years, and bolsters its collection with emails exchanged on its Hotmail program, with identifying details removed," according to a Microsoft spokeswoman.

SAS's own Enterprise Content Categorization maintains huge corpora in various languages. As Bialik notes in the Wall Street Journal article ("Making Every Word Count," Sept. 12, 2008): "Without enough spoken-language data, subtleties may not emerge."

"The word 'rife' only occurs in negative contexts," says Anne O'Keeffe, a linguist at Mary Immaculate College, the University of Limerick, Ireland. "We are never rife with money," despite that affliction's appeal.
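
To make the "rife" observation concrete, here is a minimal sketch of the kind of check a corpus allows: count which words follow a target word. Everything here is an assumption for illustration. The three sentences are made up (they are not drawn from the BNC or any real corpus), and following_words is a hypothetical helper, not part of any SAS product.

```python
import re
from collections import Counter

# Made-up sentences for illustration only; not taken from any real corpus.
TOY_CORPUS = [
    "The city was rife with corruption.",
    "The report was rife with errors.",
    "Rumors were rife with speculation and fear.",
]


def following_words(corpus, target, skip=("with",)):
    """Count the first word after `target`, ignoring any words in `skip`."""
    counts = Counter()
    for sentence in corpus:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, token in enumerate(tokens):
            if token != target:
                continue
            j = i + 1
            # Step over filler words such as "with" to reach the content word.
            while j < len(tokens) and tokens[j] in skip:
                j += 1
            if j < len(tokens):
                counts[tokens[j]] += 1
    return counts


for word, count in following_words(TOY_CORPUS, "rife").most_common():
    print(word, count)
```

With three sentences this proves nothing, of course. The point of a large corpus is that the same count run over millions of sentences would show whether the pattern really holds.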

In spite of their utility, publicly available corpora are hard to come by and even harder to update. The largest public collection may be the British National Corpus, which was assembled in the early 1990s. The BNC included the recorded conversations of 200 Britons. The intended American counterpart to the BNC -- the American National Corpus -- is a collection of text that includes the 9/11 Commission Report and Berlitz travel guides. With only 22 million words, the ANC is small when compared to the BNC, which runs to roughly 100 million words.

Corpora and associated taxonomies are extremely valuable components of a robust text mining/text analytics solution. We are fortunate to have these assets available to us in support of our text mining/analytics tasks.