text analytics

July 12, 2018
 

Word Mover's Distance (WMD) is a distance metric used to measure the dissimilarity between two documents, and its application in text analytics was introduced by a research group from Washington University in 2015. The group's paper, From Word Embeddings To Document Distances, was published at the 32nd International Conference on Machine Learning (ICML). In this paper, they demonstrated that the WMD metric leads to unprecedentedly low k-nearest-neighbor document classification error rates on eight real-world document classification data sets.

They leveraged word embeddings and WMD to classify documents, and the biggest advantage of this method over traditional approaches is its ability to incorporate the semantic similarity between individual word pairs (e.g., President and Obama) into the document distance metric. Traditionally, one way to handle semantically similar words is to provide a synonym table so that the algorithm can merge words with the same meaning into a representative word before measuring document distance; otherwise you cannot get an accurate dissimilarity result. However, maintaining synonym tables requires continuous effort from human experts and is therefore time consuming and very expensive. Additionally, the semantic meaning of words depends on the domain, so a general synonym table does not work well across varied domains.

Definition of Word Mover's Distance

WMD defines the distance between two documents as the minimum (weighted) cumulative cost required to move all words from one document to the other. The distance is calculated by solving the following linear program:

minimize    sum over i,j of T(i,j) * c(i,j)
subject to  sum over j of T(i,j) = d(i)    for every word i in document d
            sum over i of T(i,j) = d'(j)   for every word j in document d'
            T(i,j) >= 0

Where

  • T(i,j) denotes how much of word i in document d travels to word j in document d';
  • c(i,j) denotes the cost of "traveling" from word i in document d to word j in document d'; here the cost is the two words' Euclidean distance in the word2vec embedding space;
  • if word i appears c(i) times in document d, its weight is the normalized bag-of-words frequency d(i) = c(i) / (sum over k of c(k)).

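To make the word weights concrete, here is a minimal Python sketch (the token list is hypothetical) that computes each word's normalized bag-of-words weight, i.e. its count divided by the document's total word count:

```python
from collections import Counter

# Hypothetical token list for a document after stop-word removal.
tokens = ["obama", "speaks", "media", "illinois", "obama"]

counts = Counter(tokens)                             # raw counts c(i)
total = sum(counts.values())                         # total words in the document
weights = {w: c / total for w, c in counts.items()}  # normalized weights d(i)

print(weights["obama"])  # 0.4
```
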
WMD is a special case of the earth mover's distance metric (EMD), a well-known transportation problem.

How to Calculate Earth Mover's Distance with SAS?

SAS/OR is the SAS tool for solving transportation problems. Figure-1 shows a transportation example with four nodes and the distances between them, which I copied from this Earth Mover's Distance document. The objective is to find the minimum-cost flow from {x1, x2} to {y1, y2}. Now let's see how to solve this transportation problem using SAS/OR.

The weights of nodes and distances between nodes are given below.

Figure-1 A Transportation Problem

data x_set;
input _node_ $ _sd_;
datalines;
x1 0.74
x2 0.26
;
 
data y_set;
input _node_ $ _sd_;
datalines;
y1 0.23
y2 0.51
;
 
data arcdata;
input _tail_ $ _head_ $ _cost_;
datalines;
x1 y1 155.7
x1 y2 252.3
x2 y1 292.9
x2 y2 198.2
;
 
proc optmodel;
set xNODES;
num w {xNODES};
 
set yNODES;
num u {yNODES};
 
set <str,str> ARCS;
num arcCost {ARCS};
 
read data x_set into xNODES=[_node_] w=_sd_;
read data y_set into yNODES=[_node_] u=_sd_;
read data arcdata into ARCS=[_tail_ _head_] arcCost=_cost_;
 
var flow {<i,j> in ARCS} >= 0;
impvar sumY = sum{j in yNODES} u[j];
min obj = (sum {<i,j> in ARCS} arcCost[i,j] * flow[i,j])/sumY;
 
con con_y {j in yNODES}: sum {<i,(j)> in ARCS} flow[i,j] = u[j];
con con_x {i in xNODES}: sum {<(i),j> in ARCS} flow[i,j] <= w[i];
 
solve with lp / algorithm=ns scale=none logfreq=1;
print flow;
quit;

The SAS/OR solution is shown in Table-1, and the EMD is the objective value: 203.26756757.

Table-1 EMD Calculated with SAS/OR

The flow data obtained with SAS/OR is shown in Table-2; it matches the flow diagram posted in the aforementioned Earth Mover's Distance document.

Table-2 Flow data of SAS/OR

Figure-2 Flow Diagram of Transportation Problem
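As a cross-check outside SAS, the same transportation LP can be set up with SciPy's linprog (assuming SciPy is installed). This is only a sketch, not part of the original workflow; it reproduces the objective value above from the Figure-1 data:

```python
from scipy.optimize import linprog

# Node weights and pairwise ground distances from Figure-1.
w = {"x1": 0.74, "x2": 0.26}          # supplies at the x nodes
u = {"y1": 0.23, "y2": 0.51}          # demands at the y nodes
cost = [155.7, 252.3, 292.9, 198.2]   # arcs: x1->y1, x1->y2, x2->y1, x2->y2

# Equality: each y node receives exactly u[j]; inequality: each x node
# ships at most w[i] (the same constraints as the OPTMODEL program).
A_eq = [[1, 0, 1, 0],   # flow into y1
        [0, 1, 0, 1]]   # flow into y2
b_eq = [u["y1"], u["y2"]]
A_ub = [[1, 1, 0, 0],   # flow out of x1
        [0, 0, 1, 1]]   # flow out of x2
b_ub = [w["x1"], w["x2"]]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
emd = res.fun / sum(u.values())       # normalize by total demand, as in the SAS model
print(round(emd, 8))                  # 203.26756757
```
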

 

How to Calculate Word Mover's Distance with SAS

The paper, From Word Embeddings To Document Distances, also proposed a cheaper metric called Relaxed Word Mover's Distance (RWMD), obtained by removing the second constraint of WMD in order to reduce computation. Since we need to read word embedding data, I will show you how to calculate the RWMD of two documents with SAS Viya.
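Before the SAS implementation, the idea behind RWMD can be sketched in a few lines of Python. With one constraint dropped, each target word's weight simply travels entirely to its nearest source word. The 2-D vectors below are hypothetical stand-ins for real GloVe embeddings:

```python
import numpy as np

# Hypothetical 2-D "embeddings" for the content words of the two sentences
# (real use would look these up in a GloVe/word2vec table).
doc1 = {"obama": [1.0, 0.2], "speaks": [0.1, 1.0], "media": [0.9, 0.9], "illinois": [0.0, 0.5]}
doc2 = {"president": [0.9, 0.3], "greets": [0.2, 0.9], "press": [0.8, 1.0], "chicago": [0.1, 0.6]}

def rwmd(src, dst, dst_counts=None):
    """Relaxed WMD: each target word's weight travels to its nearest
    source word (the supply-side constraint is dropped)."""
    if dst_counts is None:
        dst_counts = {w: 1 for w in dst}       # uniform counts by default
    total = sum(dst_counts.values())
    cost = 0.0
    for w, vec in dst.items():
        # cost of the cheapest source word for this target word
        d = min(np.linalg.norm(np.array(vec) - np.array(v)) for v in src.values())
        cost += dst_counts[w] * d
    return cost / total                        # normalize by total target weight

print(round(rwmd(doc1, doc2), 4))
```
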

/* start CAS server */
cas casauto host="host.example.com" port=5570;
libname sascas1 cas;
 
/* load documents into CAS */
data sascas1.documents;
infile datalines delimiter='|' missover;
length text varchar(300);
input text$ did;
datalines;
Obama speaks to the media in Illinois.|1
The President greets the press in Chicago.|2
;
run;
 
/* create stop list*/
data sascas1.stopList;
infile datalines missover;
length term $20;
input term$;
datalines;
the
to
in
;
run;
 
/* load word embedding model */
proc cas;
loadtable path='datasources/glove_100d_tab_clean.txt'
caslib="CASTestTmp"
importOptions={
fileType="delimited",
delimiter='\t',
getNames=True,
guessRows=2.0,
varChars=True
}
casOut={name='glove' replace=True};
run;
quit;
 
%macro calculateRWMD
(
textDS=documents,
documentID=did,
text=text,
language=English,
stopList=stopList,
word2VectDS=glove,
doc1_id=1,
doc2_id=2
);
/* text parsing and aggregation */
proc cas;
textParse.tpParse/
table = {name="&textDS",where="&documentID=&doc1_id or &documentID=&doc2_id"}
docId="&documentID",
language="&language",
stemming=False,
nounGroups=False,
tagging=False,
offset={name="outpos", replace=1},
text="&text";
run;
 
textparse.tpaccumulate/
parent={name="outparent1", replace=1}
language="&language",
offset='outpos',
stopList={name="&stoplist"},
terms={name="outterms1", replace=1},
child={name="outchild1", replace=1},
reduce=1,
cellweight='none',
termWeight='none';
run;
quit;
 
/* terms of the two test documents */
proc cas;
loadactionset "fedsql";
execdirect casout={name="doc_terms",replace=true}
query="
select outparent1.*,_term_
from outparent1
left join outterms1
on outparent1._termnum_ = outterms1._termnum_
where _Document_=&doc1_id or _Document_=&doc2_id;
"
;
run;
quit;
 
/* term vectors and counts of the two test documents */
proc cas;
loadactionset "fedsql";
execdirect casout={name="doc1_termvects",replace=true}
query="
select word2vect.*
from &word2VectDS word2vect, doc_terms
where _Document_=&doc2_id and lowcase(term) = _term_;
"
;
run;
 
execdirect casout={name="doc1_terms",replace=true}
query="
select doc_terms.*
from &word2VectDS, doc_terms
where _Document_=&doc2_id and lowcase(term) = _term_;
"
;
run;
 
simple.groupBy / table={name="doc1_terms"}
inputs={"_Term_", "_Count_"}
aggregator="n"
casout={name="doc1_termcount", replace=true};
run;
quit;
 
proc cas;
loadactionset "fedsql";
execdirect casout={name="doc2_termvects",replace=true}
query="
select word2vect.*
from &word2VectDS word2vect, doc_terms
where _Document_=&doc1_id and lowcase(term) = _term_;
"
;
run;
 
execdirect casout={name="doc2_terms",replace=true}
query="
select doc_terms.*
from &word2VectDS, doc_terms
where _Document_=&doc1_id and lowcase(term) = _term_;
"
;
run;
 
simple.groupBy / table={name="doc2_terms"}
inputs={"_Term_", "_Count_"}
aggregator="n"
casout={name="doc2_termcount", replace=true};
run;
quit;
 
/* calculate Euclidean distance between words */
data doc1_termvects;
set sascas1.doc1_termvects;
run;
 
data doc2_termvects;
set sascas1.doc2_termvects;
run;
 
proc iml;
use doc1_termvects;
read all var _char_ into lterm;
read all var _num_ into x;
close doc1_termvects;
 
use doc2_termvects;
read all var _char_ into rterm;
read all var _num_ into y;
close doc2_termvects;
 
d = distance(x,y);
 
lobs=nrow(lterm);
robs=nrow(rterm);
d_out=j(lobs*robs, 3, ' ');
do i=1 to lobs;
do j=1 to robs;
d_out[(i-1)*robs+j,1]=lterm[i];
d_out[(i-1)*robs+j,2]=rterm[j];
d_out[(i-1)*robs+j,3]=cats(d[i,j]);
end;
end;
 
create distance from d_out;
append from d_out;
close distance;
run;quit;
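For readers who want to verify the distance step outside SAS/IML, the pairwise Euclidean distance matrix can be sketched with NumPy (the term vectors here are hypothetical):

```python
import numpy as np

# Hypothetical term vectors standing in for the GloVe rows read by PROC IML.
x = np.array([[0.0, 0.0],
              [3.0, 4.0]])   # terms of document 1
y = np.array([[0.0, 3.0],
              [3.0, 0.0]])   # terms of document 2

# Pairwise Euclidean distances; row i, column j holds the distance from
# x[i] to y[j], matching IML's distance(x, y).
d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
print(d)
```
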
 
/* calculate RWMD between documents */
data x_set;
set sascas1.doc1_termcount;
rename _term_=_node_;
_weight_=_count_;
run;
 
data y_set;
set sascas1.doc2_termcount;
rename _term_=_node_;
_weight_=_count_;
run;
 
data arcdata;
set distance;
rename col1=_tail_;
rename col2=_head_;
length _cost_ 8;
_cost_= col3;
run;
 
proc optmodel;
set xNODES;
num w {xNODES};
 
set yNODES;
num u {yNODES};
 
set <str,str> ARCS;
num arcCost {ARCS};
 
read data x_set into xNODES=[_node_] w=_weight_;
read data y_set into yNODES=[_node_] u=_weight_;
read data arcdata into ARCS=[_tail_ _head_]
arcCost=_cost_;
 
var flow {<i,j> in ARCS} >= 0;
impvar sumY = sum{j in yNODES} u[j];
min obj = (sum {<i,j> in ARCS} arcCost[i,j] * flow[i,j])/sumY;
 
con con_y {j in yNODES}: sum {<i,(j)> in ARCS} flow[i,j] = u[j];
/* con con_x {i in xNODES}: sum {<(i),j> in ARCS} flow[i,j] <= w[i];*/
 
solve with lp / algorithm=ns scale=none logfreq=1;
 
call symput('obj', strip(put(obj,best.)));
create data flowData
from [i j]={<i, j> in ARCS: flow[i,j].sol > 0}
col("cost")=arcCost[i,j]
col("flowweight")=flow[i,j].sol;
run;
quit;
 
%put RWMD=&obj;
%mend calculateRWMD;
 
%calculateRWMD
(
textDS=documents,
documentID=did,
text=text,
language=English,
stopList=stopList,
word2VectDS=glove,
doc1_id=1,
doc2_id=2
);

The RWMD value is 5.0041121662.

Now let's have a look at the flow values.

proc print data=flowdata;
run;quit;

The WMD method not only measures document similarity, but also explains why the two documents are similar, by visualizing the flow data.

Besides document similarity, WMD is also useful for measuring sentence similarity. Check out this article about sentence similarity and try it with SAS.

How to calculate Word Mover's Distance with SAS was published on SAS Users.

June 12, 2018
 

My local middle school publishes a weekly paper. Very recently, I noted an article in that paper regarding an exposé on human trafficking overseas, "World Slavery: The Terrors Our World Tries to Forget." The eloquent article in part highlighted how children have been exploited in the fishing industry in Ghana [...]

Shining a spotlight on human trafficking was published on SAS Voices by Tom Sabo

May 30, 2018
 

Maybe you’ve heard of text analytics (or natural language processing) as a way to analyze consumer sentiment. Businesses often use these techniques to analyze customer complaints or comments on social media, to identify when a response is needed. But text analytics has far more to offer than examining posts on [...]

5 remarkable uses for text analytics was published on SAS Voices by Tom Sabo

February 1, 2018
 

Groups and organizations lauding artificially intelligent solutions are popping up everywhere with promises to create the next battlefield advantage using next generation weapons, gear, or satellites.  The term artificial intelligence (AI) splashes the headlines with promises that we're moments away from revolutionizing the battlefield. As a former intelligence analyst, I [...]

13 ideas for AI in military intelligence was published on SAS Voices by Mary Beth Ainsworth

September 13, 2017
 

The government has an unfathomable amount of data -- and it grows more and more each year. This puts agencies in a unique and important position to use that data for the public good. Whether it be improving government operations, solving some of the nation’s biggest challenges or empowering citizens [...]

Using government data for good: Hear leaders share analytics stories was published on SAS Voices by Trent Smith

June 24, 2017
 

To demonstrate the power of text mining and the insights it can uncover, I used SAS Text Mining technologies to extract the underlying key topics of the children's classic Alice in Wonderland. I want to show you what Alice in Wonderland can tell us about both human intelligence and artificial [...]

AI in wonderland was published on SAS Voices by Frederic Thys

April 27, 2017
 

Recently I backed into a hotel parking spot after returning from a customer dinner. It was dark and rainy, and I was tired from traveling. My mind wandered until I heard a shrill “BEEP BEEP BEEP” coming from my rental car. I looked down at the dashboard’s rear-view camera, and [...]

Tripping over text icebergs was published on SAS Voices by Lonnie Miller

March 4, 2017
 

The Obama administration made great strides in improving the government’s use of information technology over the past eight years, and now it's up to the Trump administration to expand upon it. Let’s look at five possible Trump administration initiatives that can take government’s use of information technology to the next [...]

5 Tech initiatives for the Trump administration was published on SAS Voices by Steve Bennett

February 1, 2017
 

Each day, the SAS Customer Contact Center participates in hundreds of interactions with customers, prospective customers, educators, students and the media. While the team responds to inbound calls, web forms, social media requests and emails, the live-chat sessions that occur on the corporate website make up the majority of these interactions.

The information contained in these chat transcripts can be a useful way to get feedback from customers and prospects. As a result, the contact center is frequently asked by departments across the company what customers are saying about the company and its products – and what types of questions are asked.

The challenge

Chat transcripts are a source for measuring the relative happiness of those engaged with SAS. Using sentiment analysis, this information can help paint a more accurate picture of the health of customer relationships.

The live-chat feature includes an exit survey that provides some data, including the visitor's overall satisfaction with the chat agent and with SAS. While 13 percent of chat visitors complete the exit survey (which is above the industry average), that still leaves thousands of chat sessions with only the transcript as a record of participant sentiment.

Analyzing chat transcripts previously required the contact center to pore through the text by hand to identify trends. With other, more pressing priorities, this manual review yielded only anecdotal information.

The approach

Performing more formal analytics using text information gets tricky due to the nature of text data. Text, unlike tabular data in databases or spreadsheets, is unstructured. There are no columns that dictate what bits of data go where. And, words can be assembled in nearly infinite combinations.

For the SAS team, however, the information contained within these transcripts was a valuable asset. Using text analytics, the team could start to uncover and understand trends and connections across thousands of chat sessions.

SAS turned to SAS Text Miner to conduct a more thorough analysis of the chat transcripts. The contact center worked with subject-matter experts across SAS to feed this text information into the analytics engine. The team used a variety of dimensions in the analysis:

  • Volume of the chat transcripts across different topics.
  • Web pages where the chat session originated.
  • Location of the customer.
  • Contact center agent who responded.
  • Duration of the chat session.
  • Products or initiatives mentioned within the text.

In addition, North Carolina State University’s Institute for Advanced Analytics began to use the chat data for a text analytics project focused on sentiment analysis. This partnership between the university and SAS helped students learn how to uncover trends in positive and negative sentiment across topics.

The results

After applying SAS text analytics to the chat data, the SAS contact center better understood the volume and type of inquiries and how they were being addressed. Often, the analysis could point to areas of the corporate website that needed updates or improvements by tracking the URLs of web pages that were the launch point for a chat.

Information from chat sessions also helped tune SAS’ strategy. After the announcement of Windows 10, the contact center received customer questions about the operating system, including some negative sentiment about a perceived lack of support. Based on this feedback, SAS released a statement to customers assuring them that Windows 10 was an integral part of the product roadmap.

The project with NC State University has also provided an opportunity for SAS and soon-to-be analytics professionals to continue and expand on the analysis of chat transcripts. They continue to look at the sentiment data and how it changes across different categories (products in use, duration of chat) to see if there are any trends to explore further.

Today, sentiment analysis feeds the training process for new chat agents and enables managers to highlight examples where an agent was able to turn a negative chat session into a positive resolution.

SAS Sentiment Analysis and SAS Text Analytics, combined with SAS Customer Intelligence solutions such as SAS Marketing Automation and SAS Real Time Decision Manager, allow marketing organizations like SAS to understand sentiment or emotion within text strings (chat, email, social, even voice to text) and use that information to inform sales, service, support and marketing efforts.

If you’d like to learn more about how to use SAS Sentiment Analysis to explore sentiment in electronic chat text, register for our SAS Sentiment Analysis course. And, the book, Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS, offers insights into SAS Text Miner capabilities and more.

==

Editor’s note: This post is part of a series excerpted from Adele Sweetwood’s book, The Analytical Marketer: How to Transform Your Marketing Organization. Each post is a real-world case study of how to improve your customers’ experience and optimize your marketing campaigns.

tags: Adele Sweetwood, contact center, live chat, SAS Text Miner, sentiment analysis, text analytics, The Analytical Marketer

Using chat transcripts to understand customer sentiment was published on Customer Intelligence Blog.