May 18, 2017
 

Are you caught up in the machine learning forecasting frenzy? Is it reality or more hype? There's been a lot of hype about using machine learning for forecasting. And rightfully so, given the advancements in data collection, storage, and processing along with technology improvements, such as supercomputers and more powerful [...]

Straight talk about forecasting and machine learning was published on SAS Voices by Charlie Chase

May 17, 2017
 

Deep learning made the headlines when the UK’s AlphaGo team beat Lee Sedol, holder of 18 international titles, in the Go board game. Go is more complex than other games, such as Chess, where machines have previously crushed famous players. The number of potential moves explodes exponentially so it wasn’t [...]

Deep learning: What’s changed? was published on SAS Voices by Colin Gray

May 1, 2017
 

Machine learning seems to be the new hot topic these days. Everybody's talking about how machines are beating human players in chess, Jeopardy, and now even Go. In the future, artificial intelligence will drive our cars and our jobs will be taken over by robots. There’s a lot of hype, [...]

Autotuning: How machine learning helps optimize itself was published on SAS Voices by Sascha Schubert

March 1, 2017
 

In 2011, Loughran and McDonald applied a general sentiment word list to accounting and finance topics, and this led to a high rate of misclassification. They found that about three-fourths of the negative words in the Harvard IV TagNeg dictionary of negative words are typically not negative in a financial context. For example, words like “mine”, “cancer”, “tire” or “capital” are often used to refer to a specific industry segment. These words are not predictive of the tone of documents or of financial news and simply add noise to the measurement of sentiment and attenuate its predictive value. So, it is not recommended to use any general sentiment dictionary as is.

Extracting domain-specific sentiment lexicons in the traditional way is time-consuming and often requires domain expertise. Today, I will show you how to extract a domain-specific sentiment lexicon from movie reviews through a machine learning method and construct SAS sentiment rules with the extracted lexicon to improve sentiment classification performance. I did the experiment with the help from my colleagues Meilan Ji and Teresa Jade, and our experiment with the Stanford Large Movie Review Dataset showed around 8% increase in the overall accuracy with the extracted lexicon. Our experiment also showed that the lexicon coverage and accuracy could be improved a lot with more training data.

SAS Sentiment Analysis Studio released domain-independent Taxonomy Rules for 12 languages and domain-specific Taxonomy Rules for a few languages. For English, SAS has covered 12 domains, including Automotive, Banking, Health and Life Sciences, Hospitalities, Insurance, Telecommunications, Public Policy Countries and others. If the domain of your corpus is not covered by these industry rules, your first choice is to use the general rules, which sometimes lead to poor classification performance, as Loughran and McDonald found. Researchers have studied how to build domain-specific sentiment lexicons and have proposed three methods. The first is to have linguistic experts or domain experts create a domain-specific word list, which may be expensive or time-consuming. The second is to derive non-English lexicons from English lexicons and other linguistic resources such as WordNet. The last is to use machine learning to learn lexicons from a domain-specific corpus. This article shows you the third method.

Because of the emergence of social media, researchers can now get sentiment data from the internet relatively easily for their experiments. Dr. Saif Mohammad, a researcher in computational linguistics at the National Research Council Canada, proposed a method to automatically extract sentiment lexicons from tweets. His method provided the best results in SemEval-2013 by leveraging emoticons in a large collection of tweets, using the PMI (pointwise mutual information) between words and tweet sentiment to define the sentiment attributes of words. It is a simple method, but quite powerful. A paper at the ACL 2016 conference introduced a way to use neural networks to learn sentiment scores, and in that paper I found the following simplified formula for calculating a sentiment score.

Given a set of tweets with their labels, the sentiment score (SS) for a word w was computed as:
SS(w) = PMI(w, pos) − PMI(w, neg), (1)

where pos represents the positive label and neg represents the negative label. PMI stands for pointwise mutual information, which is
PMI(w, pos) = log2((freq(w, pos) * N) / (freq(w) * freq(pos))), (2)

Here freq(w, pos) is the number of times the word w occurs in positive tweets, freq(w) is the total frequency of word w in the corpus, freq(pos) is the total number of words in positive tweets, and N is the total number of words in the corpus. PMI(w, neg) is calculated in a similar way. Thus, Equation 1 is equal to:
SS(w) = log2((freq(w, pos) * freq(neg)) / (freq(w, neg) * freq(pos))), (3)
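As a rough illustration of Equation 3, here is a minimal Python sketch (not the SAS program used in the experiment). It assumes the reviews are already tokenized and labeled "pos" or "neg"; the function name `sentiment_scores` is hypothetical. Because Equation 3 is undefined when a word never occurs in one of the two classes, this sketch simply skips such words, which is an assumption rather than part of the published method.

```python
import math
from collections import Counter

def sentiment_scores(docs):
    """Compute SS(w) per Equation 3 for every word seen in both classes.

    docs: iterable of (tokens, label) pairs, with label "pos" or "neg".
    Returns a dict {word: sentiment score}.
    """
    word_freq = {"pos": Counter(), "neg": Counter()}
    for tokens, label in docs:
        word_freq[label].update(tokens)

    # freq(pos) and freq(neg): total number of word tokens in each class
    total = {label: sum(counts.values()) for label, counts in word_freq.items()}

    scores = {}
    for word, f_pos in word_freq["pos"].items():
        f_neg = word_freq["neg"].get(word, 0)
        if f_neg == 0:
            continue  # assumption: skip words missing from one class
        scores[word] = math.log2((f_pos * total["neg"]) / (f_neg * total["pos"]))
    return scores

# Toy example with made-up reviews
toy_docs = [
    (["great", "fun", "plot"], "pos"),
    (["great", "great", "plot"], "pos"),
    (["boring", "plot", "great"], "neg"),
    (["boring", "boring", "fun"], "neg"),
]
toy_scores = sentiment_scores(toy_docs)
```

In the toy data, "great" occurs three times in positive reviews and once in negative reviews, and both classes contain six tokens, so its score is log2(3), a positive value.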

The movie review data I used was downloaded from Stanford; it is a collection of 50,000 reviews from IMDB. I used 25,000 reviews each for the training and test datasets. The constructed dataset contains equal numbers of positive and negative reviews. I used SAS Text Mining to parse the reviews into tokens and wrote a SAS program to calculate sentiment scores.

In my experiment, I used the training dataset to extract sentiment lexicons and the test dataset to evaluate sentiment classification performance at each sentiment score cutoff value from 0 to 2, in increments of 0.25. Data-driven learning methods frequently have an overfitting problem, so I used the test data to filter out all weakly predictive words, that is, words whose sentiment scores have an absolute value below 0.75. In Figure-1, the accuracy line for the test data drops noticeably when the cutoff value is less than 0.75.


Figure-1 Sentiment Classification Accuracy by Sentiment Score Cutoff
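The cutoff sweep can be sketched in Python as follows. This is an illustrative stand-in, not the evaluation used in the experiment: the classifier here simply sums the scores of the lexicon words in a document and takes the sign, and ties are arbitrarily labeled "pos". Both choices, and the function names `classify` and `accuracy_by_cutoff`, are assumptions made for the example.

```python
def classify(tokens, lexicon):
    """Label a document by the sign of its summed lexicon scores.

    Assumption: a sum of exactly zero (including no lexicon hits) is
    labeled "pos"; this tie-break is arbitrary.
    """
    total = sum(lexicon.get(tok, 0.0) for tok in tokens)
    return "pos" if total >= 0 else "neg"

def accuracy_by_cutoff(scores, test_docs, cutoffs):
    """Return {cutoff: accuracy} keeping only words with |score| >= cutoff."""
    results = {}
    for cutoff in cutoffs:
        lexicon = {w: s for w, s in scores.items() if abs(s) >= cutoff}
        correct = sum(
            classify(tokens, lexicon) == label for tokens, label in test_docs
        )
        results[cutoff] = correct / len(test_docs)
    return results

# Cutoff values from 0 to 2 in increments of 0.25
cutoffs = [i * 0.25 for i in range(9)]

# Made-up scores and test documents, for illustration only
toy_scores = {"great": 1.6, "awful": -1.8, "plot": 0.3, "long": -0.5}
toy_test = [
    (["great", "long"], "pos"),
    (["awful", "plot"], "neg"),
    (["plot", "long", "long"], "pos"),
]
toy_results = accuracy_by_cutoff(toy_scores, toy_test, cutoffs)
```

Plotting the resulting accuracy against the cutoff values gives a curve of the same kind as Figure-1, from which a cutoff can be chosen.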

Finally, I got a large list of 14,397 affective words from the movie reviews: 7,850 positive words and 6,547 negative words. Figure-2 shows the top 50 lexical items from each sentiment category.

Figure-2 Sentiment Score of Top 50 Lexical Items

Now I have automatically derived the sentiment lexicon, but how accurate is it, and how can that accuracy be evaluated? I googled movie vocabulary and got two lists, Useful Adjectives for Describing Movies and Words for Movies & TV, with 329 adjectives categorized as positive or negative. Of these, 279 adjectives have vector data in the GloVe word embedding model downloaded from http://nlp.stanford.edu/projects/glove/, and Figure-3 shows their t-SNE plot. GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus. t-SNE is a machine learning algorithm for dimensionality reduction developed by Geoffrey Hinton and Laurens van der Maaten. It is a nonlinear dimensionality reduction technique that is particularly well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot. So two words are co-located or located close together in the scatter plot if their meanings are close or if they frequently co-occur in the same contexts. Besides semantic closeness, I also showed sentiment polarity with different colors: red stands for negative and blue stands for positive.

Figure-3 T-SNE Plot of Movie Vocabularies

From Figure-3, I find the positive vocabulary and the negative vocabulary are clearly separated into two big clusters, with very little overlap in the plot.

Now, let me check the sentiment scores of the terms. 175 of them were included in my result, and Figure-4 displays the top 50 terms of each category. I compared the sentiment polarity in my result with the list: 168 of 175 terms are correctly labelled as negative or positive, for an overall accuracy of 96%.

Figure-4 Sentiment Score of Top 50 Movie Terms

There are 7 polarity differences between my predictions and the list, as Table-1 shows.

Table-1 Sentiment Polarity Difference between Predictions and Actual Labels
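The comparison against a reference word list can be sketched in a few lines of Python. The function name `polarity_agreement` and the toy data are hypothetical; the sketch assumes the reference list maps each word to "pos" or "neg" and that the polarity of an extracted word is the sign of its sentiment score.

```python
def polarity_agreement(scores, reference):
    """Compare the sign of each extracted score with a reference polarity.

    scores: {word: sentiment score} from the extracted lexicon.
    reference: {word: "pos" or "neg"} from an external word list.
    Returns (number of covered words, number that agree, list of mismatches).
    """
    covered = {w: label for w, label in reference.items() if w in scores}
    mismatches = []
    for word, label in covered.items():
        # A score of exactly 0 would count as "neg" here; the filtered
        # lexicon should not contain such words.
        predicted = "pos" if scores[word] > 0 else "neg"
        if predicted != label:
            mismatches.append((word, label, scores[word]))
    return len(covered), len(covered) - len(mismatches), mismatches

# Made-up example data
toy_scores = {"brilliant": 1.2, "dull": -1.5, "coherent": -0.9}
toy_reference = {
    "brilliant": "pos",
    "dull": "neg",
    "coherent": "pos",
    "riveting": "pos",  # not in the extracted lexicon, so not covered
}
n_covered, n_agree, diffs = polarity_agreement(toy_scores, toy_reference)
```

With the figures reported above, the same ratio is 168 / 175 = 0.96, and the mismatch list corresponds to the 7 rows of Table-1.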

One obvious prediction mistake is “coherent”. I checked the raw movie reviews that contain “coherent”, and only 25 of 103 reviews are positive. This is why its sentiment score is negative rather than positive. I went through these reviews and found that most of them had a sentiment polarity reversal, such as “The Plot - Even for a Seagal film, the plot is just stupid. I mean it’s not just bad, it’s barely coherent. …” A possible way to make the sentiment scores more accurate is to use more data or to add special handling for polarity reversals. I tried the first method, and it did improve the accuracy significantly.

So far, I have evaluated the accuracy of the sentiment scores with public linguistic resources; next, I will test their predictive effect with SAS Sentiment Analysis Studio. I ran sentiment analysis against the test data with the domain-independent sentiment rules developed by SAS and with the domain-specific sentiment rules constructed by machine learning, and compared the performance of the two methods. The results showed an 8% increase in the overall accuracy. Table-2 and Table-3 show the detailed information.

Test data (25,000 docs)

Table-2 Performance Comparison with Test Data

Table-3 Overall Performance Comparison with Test Data

After you get domain-specific sentiment lexicons from your corpora, only a few steps are required in SAS Sentiment Analysis Studio to construct the domain-specific sentiment rules. So, next time you are processing domain-specific text for sentiment, you may want to try this method to get a list of terms with positive or negative polarity to augment your SAS domain-independent model.

The detailed steps to construct domain-specific sentiment rules are as follows.

Step 1. Create a new Sentiment Analysis project.
Step 2. Create intermediate entities named “Positive” and “Negative”, then put the learned positive and negative lexicons into the two entities respectively.

Step 3. Besides the learned lexicons, you may add an entity named “Negation” to handle negated expressions. You can list some negations you are familiar with, such as “not”, “don’t”, “can’t”, etc.

Step 4. Create positive and negative rules in the Tonal Keyword. Add the rule “CONCEPT: _def{Positive}” to the Positive tab, and the rules “CONCEPT: _def{Negative}” and “CONCEPT: _def{Negation} _def{Positive}” to the Negative tab.

Step 5. Build the rule-based model. You can now use this model to predict the sentiment of documents.
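To make the logic of the three rules in Step 4 concrete, here is a simplified Python analogue. It is not how SAS Sentiment Analysis Studio evaluates rules internally; it assumes lowercased tokens, treats a positive word directly preceded by a negation word as a negative hit only, and labels a document by comparing hit counts. The function name `rule_based_sentiment` and the word sets are hypothetical.

```python
def rule_based_sentiment(tokens, positive, negative, negation):
    """Count rule hits over a token list and return the majority polarity.

    Mirrors the three rules from Step 4 under simplifying assumptions:
      _def{Positive}                  -> positive hit
      _def{Negative}                  -> negative hit
      _def{Negation} _def{Positive}   -> negative hit
    """
    pos_hits = 0
    neg_hits = 0
    for i, tok in enumerate(tokens):
        negated = i > 0 and tokens[i - 1] in negation
        if tok in negative:
            neg_hits += 1
        elif tok in positive:
            if negated:
                neg_hits += 1  # negated positive counts as negative
            else:
                pos_hits += 1
    if pos_hits > neg_hits:
        return "positive"
    if neg_hits > pos_hits:
        return "negative"
    return "neutral"

# Tiny made-up lexicons standing in for the learned entities
positive_words = {"good", "great"}
negative_words = {"bad", "stupid"}
negation_words = {"not", "don't", "can't"}

label = rule_based_sentiment(
    "the plot is not good".split(), positive_words, negative_words, negation_words
)
```

Here "not good" matches the negation rule, so the example document is labeled negative even though "good" is in the positive lexicon.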

How to extract domain-specific sentiment lexicons was published on SAS Users.

February 25, 2017
 

Machine learning is a type of artificial intelligence that uses algorithms to iteratively learn from data and finds hidden insights in data without being explicitly programmed where to look or how to find the answer. Here at SAS, we hear questions every day about machine learning: what it is, how it compares to [...]

12 machine learning articles to catch you up on the latest trend was published on SAS Voices by Alison Bolen

February 11, 2017
 

People come from all over the world to attend this highlight of the season. It’s been a tradition for decades. Hotels book months in advance. Traffic is horrendous in the city center. The coveted tickets can cost thousands of dollars, but tens of thousands of people are lucky enough to score them. In […]

It's February. Game On! was published on SAS Voices.

January 28, 2017
 

Digital intelligence is a trending term in the space of digital marketing analytics that needs to be demystified. Let's begin by defining what a digital marketing analytics platform is:

Digital marketing analytics platforms are technology applications used by customer intelligence ninjas to understand and improve consumer experiences. Prospecting, acquiring, and holding on to digital-savvy customers depends on understanding their multidevice behavior, and derived insight fuels marketing optimization strategies. These platforms come in different flavors, from stand-alone niche offerings, to comprehensive end-to-end vehicles performing functions from data collection through analysis and visualization.

However, not every platform is built equally from an analytical perspective. According to Brian Hopkins, a Forrester analyst, firms that excel at using data and analytics to optimize their digital businesses will together generate $1.2 trillion per annum in revenue by 2020. And digital intelligence — the practice of continuously optimizing customer experiences with online and offline data, advanced analytics and prescriptive insights — supports every insights-driven business. Digital intelligence is the antidote to the weaknesses of analytically immature platforms, leaving the world of siloed reporting behind and maturing towards actionable, predictive marketing. Here are a couple of items to consider:

  • Today's device-crazed consumers flirt with brands across a variety of interactions during a customer life cycle. However, most organizations seem to focus on website activity in one bucket, mobile in another, and social in . . . you see where I'm going. Strategic plans often fall short in applying digital intelligence across all channels — including offline interactions like customer support or product development.
  • Powerful digital intelligence uses timely delivery of prescriptive insights to positively influence customer experiences. This requires integration of data, analytics and the systems that interact with the consumer. Yet many teams manually apply analytics and deliver analysis via endless reports and dashboards that look retroactively at past behavior — begging business leaders to question the true value and potential impact of digital analysis.

As consumer behavioral needs and preferences shift over time, the proportion of digital to non-digital interactions is growing. With the recent release of Customer Intelligence 360, SAS has carefully considered feedback from our customers (and industry analysts) to create technology that supports a modern digital intelligence strategy in guiding an organization to:

  • Enrich your first-party customer data with user level data from web and mobile channels. It's time to graduate from aggregating data for reporting purposes to the collection and retention of granular, customer-level data. It is individual-level data that drives advanced segmentation and continuous optimization of customer interactions through personalization, targeting and recommendations.
  • Keep up with customers through machine learning, data science and advanced analytics. The increasing pace of digital customer interactions requires analytical maturity to optimize marketing and experiences. By enriching first-party customer data with infusions of web and mobile behavior and, more importantly, delivering it in an analysis-ready format for sophisticated analytics, 360 Discover invites analysts to use their favorite analytic tool and tear down the limitations of traditional web analytics.
  • Automate targeting, channel orchestration and personalization. Brands struggle with too few resources to support the manual design and data-driven design of customer experiences. Connecting first-party data that encompasses both offline and online attributes with actionable propensity scores and algorithmically-defined segments through digital channel interactions is the agenda. If that sounds mythical, check out a video example of how SAS brings this to life.

The question now is: are you ready? Learn more here about why we are so excited about enabling digital intelligence for our customers, and how this benefits testing, targeting, and optimization of customer experiences.

 


Digital intelligence for optimizing customer engagement was published on Customer Intelligence.

January 9, 2017
 

I've long been fascinated by both science and the natural world around us, inspired by the amazing Sir David Attenborough with his ever-engaging documentaries and boundless enthusiasm for nature, and also by the late, great Carl Sagan and his ground-breaking documentary series, COSMOS. The relationships between the creatures, plants and […]

Intelligent ecosystems and the intelligence of things was published on SAS Voices.