
September 28, 2016
 

With the first debate between the two candidates behind us and the culmination of the US presidential election drawing near, who wouldn’t love to predict the winner? I don't have a crystal ball, but I do have the power of unstructured text analytics at my fingertips. With the help of […]

Presidential election quiz was published on SAS Voices.

June 25, 2014
 

Did you check your email and/or favourite social media site before leaving for work this morning? Did you do it before getting out of bed? Don’t worry, you’re not alone. I do it every morning as a little treat while I “wake up” and whether I realize it or not, sometimes it sets the tone for the rest of the day.

The other day I was looking at my Facebook news feed and a couple of things drew my attention. One of them was an abridged transcript of a Conan interview with George R. R. Martin, the writer of the A Song of Ice and Fire books on which the TV series Game of Thrones is based. Because I don’t have much time in the mornings, if the article had been much longer I would probably have stopped partway through. If I had, I would have missed this quote right at the end: “I hate some of these modern systems where you type a lowercase letter and it becomes a capital. I don’t want a capital. If I’d wanted a capital, I would’ve typed a capital. I know how to work the shift key.” This put a smile on my face and put me in a good mood for the rest of the day.

This made me think.

What if I had woken up just five minutes too late to read the whole thing, or to see the article at all? How many other interesting things have I missed because I didn’t have time or had a “squirrel” moment? And as everybody with a social media account knows, there is far too much out there to delve into everything.

With all that data, how can I be sure that I’m exposing myself to the most interesting information? Maybe it’s just a matter of looking at how many people have viewed something. But then I like to think I’m unique, so that doesn’t really work. Maybe I should only look at things my closest friends recommend. But that’s more about being a good friend than being interested. Sorry friends.

Let’s take this situation to your organisation. How do you know what information is relevant to your business? There are a myriad of techniques to analyse the structured data that the data warehouse folks have invested a large amount of time designing and the business folks have spent vetting. But how about the unstructured data – the text-based data like survey responses, call centre feedback and social media posts? This data accounts for up to 90% of all information available for analysis.

How can we make use of this text-based data? Should you have people read through all the documents and give their opinions on themes and trends? That introduces inherent bias and unreliability, because people have different backgrounds and perspectives, and it’s unlikely that one person could read everything in a timely manner. On top of all this, what we need is more than just a word search. It’s more than word counts or word clouds. It’s more than just discovering topics. What we really need is to attach a measure of prediction to words, phrases and topics. For example:

  • Escalate an incoming customer service call to the client relations team because the caller has used three key “future churn” phrases in the right order.
  • Redesign a product because very negative feedback consistently contains words classified under “aesthetic features”.
  • Discover the true customer service differentiators that drive a positive Net Promoter Score (NPS).
  • Identify the areas where law enforcement should increase its presence to protect the public from activities being promoted on social media that are likely to have dangerous outcomes.
  • In a B2B scenario, understand and communicate the gaps in knowledge of the client organisation based on the volume, topics and severity of support calls they put through to the service organisation.
  • Determine the root cause of employee concerns and the best methods to manage them.

We need Text Analytics to structure the unstructured text-based data in an objective way.

There are two sides to Text Analytics:

  • A statistical approach where text data is converted into numerical information for analysis, and words and phrases are grouped by their common pattern across documents. The converted data and groupings can then be used alone or combined with structured data in a statistical model to predict outcomes.
  • A linguistic approach or Natural Language Processing (NLP) where logical text-based rules are created to classify and measure the polarity (e.g. of sentiment) of documents.

Both sides are equally important because, however far computing algorithms have advanced, there is still a lot of nuance in the way people speak, such as sarcasm and colloquialism. By using techniques from both sides in an integrated environment, we can create a whole-brained analysis which includes clustering of speech behavior, prediction of speech and other behavior against topics, and prediction of the severity of sentiment towards a product, person or organisation.
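To make the two sides concrete, here is a minimal sketch in base R on a tiny made-up corpus. The three documents and the negative cue words are invented purely for illustration; a tool such as SAS Text Miner works at far larger scale, with stemming, synonym handling and term weighting.

docs = c("great service, very helpful staff",
         "terrible wait times and unhelpful staff",
         "helpful and quick, great experience")

# Statistical side: convert text to numbers via a document-term matrix.
tokens = strsplit(tolower(gsub("[[:punct:]]", "", docs)), "\\s+")
vocab = sort(unique(unlist(tokens)))
dtm = t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
rownames(dtm) = paste0("doc", seq_along(docs))
dtm        # counts that can be combined with structured data in a model

# Linguistic side: a crude hand-written rule that flags negative documents.
neg_cues = c("terrible", "unhelpful")
flagged = sapply(tokens, function(tok) any(tok %in% neg_cues))
flagged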

One organisation which has been using Text Analytics from SAS for a number of years to provide proactive services to its clients is the Hong Kong Efficiency Unit. This organisation is the central point of contact for handling public inquiries and complaints on behalf of many government departments. With this comes the responsibility of managing 2.65 million calls and 98,000 e-mails a year.

"Having received so many calls and e-mails, we gather substantial volumes of data. The next step is to make sense of the data. Now, with SAS®, we can obtain deep insights through uncovering the hidden relationship between words and sentences of complaints information, spot emerging trends and public concerns, and produce high-quality, visual interactive intelligence about complaints for the departments we serve." Efficiency Unit's Assistant Director, W. F. Yuk.

Whatever size your organisation is, and whatever purpose your organisation has, there are many sources of text-based data that are readily available and may already sit alongside the structured data in your data warehouse, in Hadoop, or on a network drive. By using this data to supplement the structured data many people are already analyzing, we can better pinpoint not only what is driving behavior but how we can better serve our customers and our employees. Wouldn’t it be great to know what is relevant to people in and out of your organisation without having to manually read thousands of documents?

Applying Text Analytics to your documents is a way of treating yourself with Text, because amongst the masses of words you will find nuggets that will brighten up your (and your company’s) day.

If you’re interested in treating yourself with Text and are in the Sydney region this July, sign up for the SAS Business Knowledge Series course Text Analytics and Sentiment Mining Using SAS taught by renowned expert in the field Dr Goutam Chakraborty from Oklahoma State University. Dr Chakraborty will also be speaking of his experiences in the field at the next Institute of Analytics Professionals of Australia (IAPA) NSW Chapter meeting.

Happy Texting!

April 16, 2012
 

To celebrate the beginning of the professional baseball season here in the US and Canada, we revisit a famous example of using baseball data to demonstrate statistical properties.

In 1977, Bradley Efron and Carl Morris published a paper about the James-Stein estimator -- the shrinkage estimator that has better mean squared error than the simple average. Their prime example was the batting averages of 18 players in the 1970 season: they considered trying to estimate the players' averages over the remainder of the season, based on their first 45 at-bats. The paper is a pleasure to read, and can be downloaded here. The data are available here, on the pages of statistician Phil Everson of Swarthmore College.

Today we'll review plotting the data, and intend to look at some other shrinkage estimators in a later entry.

SAS
We begin by reading in the data from Everson's page. (Note that the long address would need to be on one line, or you could use a URL shortener like TinyURL.com.) To read the data, we use the infile statement to indicate a tab-delimited file and to say that the data begin in row 2. The informat statement helps read in the variable-length name fields.


filename bb url "http://www.swarthmore.edu/NatSci/peverso1/Sports%20Data/
JamesSteinData/Efron-Morris%20Baseball/EfronMorrisBB.txt";

data bball;
infile bb delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=2 ;
informat firstname $7. lastname $10.;
input FirstName $ LastName $ AtBats Hits BattingAverage RemainingAtBats
RemainingAverage SeasonAtBats SeasonHits SeasonAverage;
run;

data bballjs;
set bball;
js = .212 * battingaverage + .788 * .265;

avg = battingaverage; time = 1;
if lastname not in("Scott","Williams", "Rodriguez", "Unser","Swaboda","Spencer")
then name = lastname; else name = '';
output;
avg = seasonaverage; name = ''; time = 2; output;
avg = js; time = 3; name = ''; output;
run;

In the second data step, we calculate the James-Stein estimator according to the values reported in the paper. Then, to facilitate plotting, we convert the data to the "long" format, with three rows for each player, using the explicit output statement. The average in the first 45 at-bats, the average in the remainder of the season, and the James-Stein estimator are recorded in the same variable in each of the three rows, respectively. To distinguish between the rows, we assign a different value of time: this will be used to order the values on the graphic. We also record the last name of (most of) the players in a new variable, but only in one of the rows. This will be plotted in the graphic-- some players' names can't be shown without plotting over the data or other players' names.

Now we can generate the plot. Many features shown here have been demonstrated in several entries. We call out 1) the h option, which increases the text size in the titles and labels, 2) the offset option, which moves the data away from the edge of the plot frame, 3) the value option in the axis statement, which replaces the values of "time" with descriptive labels, and 4) the handy a*b=c syntax which replicates the plot for each player.

title h=3 "Efron and Morris example of James-Stein estimation";
title2 h=2 "Baseball players' 1970 performance estimated from first 45 at-bats";
axis1 offset = (4cm,1cm) minor=none label=none
value = (h = 2 "Avg. of first 45" "Avg. of remainder" "J-S Estimator");
axis2 order = (.150 to .400 by .050) minor=none offset=(0.5cm,1.5cm)
label = (h =2 r=90 a = 270 "Batting Average");
symbol1 i = j v = none l = 1 c = black r = 20 w=3
pointlabel = (h=2 j=l position = middle "#name");

proc gplot data = bballjs;
plot avg * time = lastname / haxis = axis1 vaxis = axis2 nolegend;
run; quit;

To read the plot (shown at the top), consider approaching the nominal true probability of a hit, as represented by the average over the remainder of the season, in the center. If you begin on the left, you see the difference associated with using the simple average of the first 45 at-bats as the estimator. Coming from the right, you see the difference associated with using the James-Stein shrinkage estimator. The improvement associated with the James-Stein estimator is reflected in the generally shallower slopes coming from the right. With the exception of Pirates great Roberto Clemente and declining third-baseman Max Alvis, almost every line has a shallower slope from the right; James' and Stein's theoretical work proved that, in aggregate, the lines must be shallower from the right.

R
A similar process is undertaken within R. Once the data are loaded, and a subset of the names are blanked out (to improve the readability of the figure), the matplot() and matlines() functions are used to create the lines.

bball = read.table("http://www.swarthmore.edu/NatSci/peverso1/Sports%20Data/JamesSteinData/Efron-Morris%20Baseball/EfronMorrisBB.txt",
header=TRUE, stringsAsFactors=FALSE)
bball$js = bball$BattingAverage * .212 + .788 * (0.265)
bball$LastName[!is.na(match(bball$LastName,
c("Scott","Williams", "Rodriguez", "Unser","Swaboda","Spencer")))] = ""

a = matrix(rep(1:3, nrow(bball)), 3, nrow(bball))
b = matrix(c(bball$BattingAverage, bball$SeasonAverage, bball$js),
3, nrow(bball), byrow=TRUE)
matplot(a, b, pch=" ", ylab="predicted average", xaxt="n", xlim=c(0.5, 3.1), ylim=c(0.13, 0.42))
matlines(a, b)
text(rep(0.7, nrow(bball)), bball$BattingAverage, bball$LastName, cex=0.6)
text(1, 0.14, "First 45\nat bats", cex=0.5)
text(2, 0.14, "Average\nof remainder", cex=0.5)
text(3, 0.14, "J-S\nestimator", cex=0.5)
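As an aside, the 0.212 shrinkage factor used in both programs is taken from the paper, but a rough version can be recomputed from the data already read in. The sketch below is the common textbook approximation applied to the raw proportions; Efron and Morris actually worked on an arcsine-transformed scale, so treat this as an illustration rather than a reproduction of their calculation.

p = bball$BattingAverage               # averages over the first 45 at-bats
k = length(p)                          # 18 players
pbar = mean(p)                         # grand mean, about .265
s2 = pbar * (1 - pbar) / 45            # binomial variance of a single average
shrink = 1 - (k - 3) * s2 / sum((p - pbar)^2)   # comes out near .212
jscheck = pbar + shrink * (p - pbar)   # compare with bball$js computed earlier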

July 29, 2011
 
When working to discover and extract new knowledge from text, many of us fall in love with language itself. If you really look, you can sometimes see the common roots of many words in disparate languages. Even languages which are acknowledged to spring from different linguistic roots share common terms. One such common term is “ma” (for mother) which turns up in languages the world over.

Scholarship has recently emerged with the claim that languages, like humans themselves, can be traced back to a common root. Quentin Atkinson, an evolutionary psychologist at the University of Auckland in New Zealand, goes so far as to describe a primordial language – called “IXU” – which is thought to have originated in the south of Africa between 50,000 and 70,000 years ago.

Atkinson analyzed more than 500 of the currently spoken world languages (about 6,000 and diminishing rapidly) to search for and document the use of “phonemes”, the elemental units of speech. He found that dialects with the most phonemes were spoken in Africa – especially southern Africa – and dialects with the fewest phonemes were spoken in South America and on remote South Pacific islands.

Atkinson appealed to the so-called “founder effect” to explain phoneme loss with distance travelled. The “founder effect” posits that as creations move further and further from their original source they suffer from a kind of entropy which diminishes their original complexity over time and distance.

If we overlay what we know about written languages, we see that it took approximately 30,000 years for spoken language to evolve to the point that symbolic markings were placed on the walls of caves. Apparently, all the hominids of this time possessed this ability. During the next great leap – from cave art to pictographic languages about 15,000 years ago – our language skills became even more refined.

As we move closer to the current era the time available for evolutionary change and adaptation shrinks radically: only 2,500 years separate the Bronze-era languages from the Iron-era languages. It is notable that some of the world’s greatest literature emerged on the heels of the early and later Iron Age. In the West, this includes the Biblical writings which inspire all contemporary Abrahamic religions. Further east, this includes the Vedic texts of India and such classics as the Tao Te Ching in China.

Overall, it seems that from a language point of view – including a common mother tongue, a historical single source for significant world literature and scripture, and a remarkably similar gene pool – we should not be surprised to observe as much commonality in the human species around the world as we do. And now, as we enter the Age of the Internet, where we have 24/7, instantaneous sharing of information virtually everywhere we should expect even more commonality as we move through this period.

It is true that our vocabulary is expanding – perhaps exponentially – as words like “OMG” merit entries in the Oxford English Dictionary. “Pidgin” English and simplified Mandarin Chinese are likely contenders for the status of “World Language”. Let me know your thoughts too.
April 5, 2011
 



It's often necessary to combine data from two data sets for further analysis. Such merging can be one-to-one, one-to-many (or many-to-one), or many-to-many. The most common form is the one-to-one match, which we cover in section 1.5.7. Today we look at a one-to-many merge.

Since the Major League baseball season started last Thursday, we'll use baseball as an example. Sean Lahman has compiled a large set of baseball data. In fact, it's large enough that it's only hosted in zipped form. The zipped comma-delimited files, with data through the 2010 season, can be downloaded here.

One file you get is the batting.csv file. This contains yearly batting data for all players. However, if you want to identify players by name, you have to use the playerid variable to link to the master.csv data set, which has names and other information. We'll add the names to each row of the batting data set. Then we can pull up players' batting data directly by name instead of with the playerid.

SAS

We'll start by reading the data from the csv files using proc import (section 1.1.4).

proc import datafile="c:\ken\baseball\master.csv"
out=bbmaster dbms=dlm;
delimiter=',';
getnames=yes;
run;

proc import datafile="c:\ken\baseball\batting.csv"
out=bbbatting dbms=dlm;
delimiter=',';
getnames=yes;
run;

Then we sort on playerid in both data sets and use the merge and by statements to link them (section 1.5.7). SAS replicates each row of the master data set for every row of the batting data set with the matching by values.

proc sort data=bbmaster; by playerid; run;
proc sort data=bbbatting; by playerid; run;

data bbboth; merge bbmaster bbbatting;
by playerid;
run;

Then we can use the data! Here, we plot the annual (regular season) RBIs of Derek Jeter and David Ortiz. Note the use of the v= option to the symbol statement (section 5.2.2) to display the players' names at the plot locations, and the offset option in the axis definition to make extra space for those names to plot.

symbol1 i=none font=swissb v="JETER" h=2 c=red;
symbol2 i=none font=swissb v="ORTIZ" h=2 c=blue;
axis1 offset = (1.5cm,1.5cm);
axis2 offset = (.5cm,);
proc gplot data=bbboth;
where (namelast="Jeter" and namefirst="Derek") or
(namelast="Ortiz" and namefirst="David");
plot rbi * yearid = namelast / haxis=axis1 vaxis=axis2 nolegend;
run;

The result is shown above.

R
We start here by reading the data sets. Then we use the merge() function to generate the desired dataset. As with SAS, the default behavior of the merging facility does what we need in this case.

master = read.csv("bball/Master.csv")
batting = read.csv("bball/Batting.csv")
mergebb = merge(batting,master)
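By default, merge() joins on every column name the two data frames share. If you prefer to be explicit (or the files happen to share other incidental column names), you can name the key directly; this assumes the shared identifier column is called playerID, as it is in the Lahman files.

mergebb = merge(batting, master, by = "playerID")   # explicit join key
# adding all.x = TRUE would also keep batting rows with no match in master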

To make the plot, we first make a new data set with just the information about Jeter and Ortiz. This isn't really necessary, but it does make typing slightly less awkward in later steps. Then we make an empty plot. In order to make room for the names (which are much bigger than usual plot symbols) we have to set the x and y limits manually (section 5.3.7).

jo = mergebb[
(mergebb$nameLast == "Jeter" & mergebb$nameFirst == "Derek") |
(mergebb$nameLast == "Ortiz" & mergebb$nameFirst == "David"),]
plot(jo$RBI~jo$yearID, type = "n",xlim = c(1993, 2012),
ylim = c(-10,160), xlab = "Year", ylab = "RBI")

Then we can add the text values, using the text() function (section 5.2.11). We do this by separately pulling rows from the new jo dataset. In the reduced data set, we can specify rows using last names only.

text(jo$yearID[jo$nameLast == "Jeter"], jo$RBI[jo$nameLast == "Jeter"],
"JETER", col="red")
text(jo$yearID[jo$nameLast == "Ortiz"], jo$RBI[jo$nameLast == "Ortiz"],
"ORTIZ", col = "blue")

The result is seen below. David Ortiz has driven in more runs than Derek Jeter since about 2002 (Go Sox!).

Note that if space requirements prevent making a single massive dataset with many replicated rows, you can generate a lookup vector using the match() function, and use this to make the short data set. This version won't have the names in it, though.

matchlist = match(batting$playerID, master$playerID)
lastname.batting = master$nameLast[matchlist]
firstname.batting = master$nameFirst[matchlist]

jo2 = batting[
(lastname.batting == "Jeter" & firstname.batting == "Derek") |
(lastname.batting == "Ortiz" & firstname.batting == "David"),]


Corpus Callosum .. Where Right and Left Brain Meet

June 4, 2010
 
The Corpus Callosum is a huge switching station in the middle of our brains that connects the right and left hemispheres. Without it we would not be able to reason about what we are looking at (reasoning is a left-brain function, while vision is handled in the right brain).

Similarly, in Text Analytics, the "corpus" is the "huge switching station" that tells us the meaning of words and how to associate different forms of words with the items of interest that we are trying to extract from text.

The Wall Street Journal’s “Numbers Guy” -- Carl Bialik -- quoted Mike Calcagno, general manager of the Microsoft group that manages Word. Calcagno says, "Text corpora is the lifeblood of most of our development and testing processes."

"Microsoft has licensed over one trillion words of English text in each of the past two years, and bolsters its collection with emails exchanged on its Hotmail program, with identifying details removed", according to a Microsoft spokeswoman.

SAS's own Enterprise Content Categorization maintains huge corpora in various languages. As Bialik notes in the Wall Street Journal article ("Making Every Word Count", Sept. 12, 2008): "Without enough spoken-language data, subtleties may not emerge."

"The word 'rife' only occurs in negative contexts," says Anne O'Keeffe, a linguist at Mary Immaculate College, the University of Limerick, Ireland. "We are never rife with money," despite that affliction's appeal.

In spite of their utility, publicly available corpora are hard to come by and even harder to update. The largest public collection may be the British National Corpus (BNC), which was assembled in the early 1990s and included the recorded conversations of 200 Britons. The intended American counterpart to the BNC -- the American National Corpus -- is a collection of text that includes the 9/11 Commission Report and Berlitz travel guides. With only 22 million words, the ANC is small when compared to the BNC.

Corpora and associated taxonomies are extremely valuable components of a robust text mining/text analytics solution. We are fortunate to have these assets available to us in support of our text mining/analytics tasks.
December 10, 2008
 

I just wanted to quickly introduce myself as the SAS R&D manager for SAS Text Miner. With my research-oriented background, I will be posting distinctly different types of blog entries than you will see from Manya, Barry and Mary.

I will be looking at detailed technical approaches and algorithms being researched for handling text data, i.e. the grungy details. So if you are more interested in a bird's eye view, you may want to skim over my postings. On the other hand, if you want to understand how things work, why we've decided to take the approach that we do, and what we are considering doing for the future, then tune right in. And I encourage you to make comments and suggestions. I am not tied to particular approaches, and I would love to find out "better" ways to do things that we may not even have considered.

Particular areas that I will be blogging about in the coming months include:
Continue reading "My wisdom (and lack thereof)"