In 2011, Loughran and McDonald applied a general sentiment word list to accounting and finance topics, and this led to a high rate of misclassification. They found that about three-fourths of the negative words in the Harvard IV TagNeg dictionary of negative words are typically not negative in a financial context. For example, words like “mine”, “cancer”, “tire” or “capital” are often used to refer to a specific industry segment. These words are not predictive of the tone of documents or of financial news and simply add noise to the measurement of sentiment and attenuate its predictive value. So, it is not recommended to use any general sentiment dictionary as is.

Extracting domain-specific sentiment lexicons in the traditional way is time-consuming and often requires domain expertise. Today, I will show you how to extract a domain-specific sentiment lexicon from movie reviews through a machine learning method and construct SAS sentiment rules with the extracted lexicon to improve sentiment classification performance. I did the experiment with the help from my colleagues Meilan Ji and Teresa Jade, and our experiment with the Stanford Large Movie Review Dataset showed around 8% increase in the overall accuracy with the extracted lexicon. Our experiment also showed that the lexicon coverage and accuracy could be improved a lot with more training data.

SAS Sentiment Analysis Studio released domain-independent Taxonomy Rules for 12 languages and domain-specific Taxonomy Rules for a few languages. For English, SAS has covered 12 domains, including Automotive, Banking, Health and Life Sciences, Hospitalities, Insurance, Telecommunications, Public Policy Countries and others. If the domain of your corpus is not covered by these industry rules, your first choice is to use general rules, which sometimes lead to poor classification performance, as Loughran and McDonald found. Automatically extracting domain-specific sentiment lexicons has been studied by researchers and three methods were proposed. The first method is to create a domain-specific word list by linguistic experts or domain experts, which may be expensive or time-consuming. The second method is to derive non-English lexicons based on English lexicons and other linguistic resources such as WordNet. The last method is to leverage machine learning to learn lexicons from a domain-specific corpus. This article will show you the third method.

Because of the emergence of social media, researchers are able to relatively easily get sentiment data from the internet to do experiments. Dr. Saif Mohammad, a researcher in Computational Linguistics, National Research Council Canada, proposed a method to automatically extract sentiment lexicons from tweets. His method provided the best results in SemEval13 by leveraging emoticons in large tweets, using the PMI (pointwise mutual information) between words and tweet sentiment to define the sentiment attributes of words. It is a simple method, but quite powerful. At the ACL 2016 conference, one paper introduced how to use neural networks to learn sentiment scores, and in this paper I found the following simplified formula to calculate a sentiment score.

Given a set of tweets with their labels, the sentiment score (SS) for a word w was computed as:

SS(w) = PMI(w, pos) − PMI(w, neg), (1)

where pos represents the positive label and neg represents the negative label. PMI stands for pointwise mutual information, which is

PMI(w, pos) = log2((freq(w, pos) * N) / (freq(w) * freq(pos))), (2)

Here freq(w, pos) is the number of times the word w occurs in positive tweets, freq(w) is the total frequency of word w in the corpus, freq(pos) is the total number of words in positive tweets, and N is the total number of words in the corpus. PMI(w, neg) is calculated in a similar way. Thus, Equation 1 is equal to:

SS(w) = log2((freq(w, pos) * freq(neg)) / (freq(w, neg) * freq(pos))), (3)

The movie review data I used was downloaded from Stanford; it is a collection of 50,000 reviews from IMDB. I used 25,000 reviews in train and test datasets respectively. The constructed dataset contains an even number of positive and negative reviews. I used SAS Text Mining to parse the reviews into tokens and wrote a SAS program to calculate sentiment scores.

In my experiment, I used the train dataset to extract sentiment lexicons and the test dataset to evaluate sentiment classification performance with each sentiment score cutoff value from 0 to 2 with increment of 0.25. Data-driven learning methods frequently have an overfitting problem, and I used test data to filter out all weak-predictive words whose absolute value of sentiment scores are less than 0.75. In Figure-1, there is an obvious drop in the accuracy line plot of test data when the cutoff value is less than 0.75.

Finally, I got a huge list of 14,397 affective words; 7,850 positive words and 6,547 negative words from movie reviews. The top 50 lexical items from each sentiment category as Figure-2 shows.

Now I have automatically derived the sentiment lexicon, but how accurate is this lexicon and how to evaluate the accuracy? I googled movie vocabulary and got two lists from Useful Adjectives for Describing Movies and Words for Movies & TV with 329 adjectives categorized into positive and negative. 279 adjectives have vector data in the GloVe word embedding model downloaded from http://nlp.stanford.edu/projects/glove/ and the T-SNE plot as Figure-3 shows. GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus. T-SNE is a machine learning algorithm for dimensionality reduction developed by Geoffrey Hinton and Laurens van der Maaten.[1] It is a nonlinear dimensionality reduction technique that is particularly well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot. So two words are co-located or located closely in the scatter plot, if their semantic meanings are close or their co-occurrence in same contexts is high. Besides semantic closeness, I also showed sentiment polarity via different colors. Red stands for negative and blue stands for positive.

From Figure-3, I find the positive vocabulary and the negative vocabulary are clearly separated into two big clusters, with very little overlap in the plot.

Now, let me check the sentiment scores of the terms. 175 of them were included in my result and Figure-4 displays the top 50 terms of each category. I compared sentiment polarity of my result with the list, 168 of 175 terms are correctly labelled as negative or positive and the overall accuracy is 96%.

There are 7 polarity differences between my prediction and the list as the Table-1 shows.

One obvious prediction mistake is coherent. I checked the raw movie reviews that contain “coherent”, and only 25 of 103 reviews are positive. This is why its sentiment score is negative rather than positive. I went through these reviews and found most of them had a sentiment polarity reversal, such as “The Plot - Even for a Seagal film, the plot is just stupid. I mean it’s not just bad, it’s barely coherent. …” A possible solution to make the sentiment scores more accurate is to use more data or add a special manipulation for polarity reversals. I tried first method, and it did improve the accuracy significantly.

So far, I have evaluated the sentiment scores’ accuracy with public linguistic resources and next I will test the prediction effect with SAS Sentiment Analysis Studio. I ran sentiment analysis against the test data with the domain-independent sentiment rules developed by SAS and the domain-specific sentiment rules constructed by machine learning, and compared the performance of two methods. The results showed an 8% increase in the overall accuracy. Table-2 and Table-3 show the detailed information.

Test data (25,000 docs)

After you get domain-specific sentiment lexicons from your corpora, only a few steps are required in SAS Sentiment Analysis Studio to construct the domain-specific sentiment rules. So, next time you are processing domain-specific text for sentiment, you may want to try this method to get a listing of terms that are positive or negative polarity to augment your SAS domain-independent model.

Detailed steps to construct domain-specific sentiment rules as follows.

Step 1. Create a new Sentiment Analysis project.

Step 2. Create intermediate entities named “Positive” and “Negative”, then put the learned lexicons to the two entities respectively.

Step 3. Besides the learned lexicons, you may add an entity named “Negation” to handle the negated expressions. You can list some negations you are familiar with, such as “not, don’t, can’t” etc.

Step 4. Create positive and negative rules in the Tonal Keyword. Add the rule “CONCEPT: _def{Positive}” to Positive tab, and the rule “CONCEPT: _def{Negative}” and “CONCEPT: _def{Negation} _def{Positive}” to Negative tab.

Step 5. Build rule-based model, and now, you can use this model to predict the sentiment of documents.

How to extract domain-specific sentiment lexicons was published on SAS Users.

.@philsimon raises some chilling questions about the IoT and current threats.

The post IOT's twin challenges: Compliance and privacy appeared first on The Data Roundtable.

Monte Carlo techniques have many applications, but a primary application is to approximate the probability that some event occurs. The idea is to simulate data from the population and count the proportion of times that the event occurs in the simulated data.

For continuous univariate distributions, the probability of an event is the area under a density curve. The integral of the density from negative infinity to a particular value is the definition of the cumulative distribution function (CDF) for a distribution. Instead of performing numerical integration, you can use Monte Carlo simulation to approximate the probability.

### One-dimensional CDFs

In SAS software, you can use the CDF function to compute the CDF of many standard univariate distributions. For example, the statement `prob = cdf("Normal", -1)` computes the probability that a standard normal random variable takes on a value less than -1.

The CDF function is faster and more accurate than a Monte Carlo approximation, but let's see how the two methods compare. You can estimate the probability P(X < -1) by generating many random values from the N(0,1) distribution and computing the proportion that is less than -1, as shown in the following SAS DATA step

data _NULL_; call streaminit(123456); N = 10000; /* sample size */ do i = 1 to N; x = rand("Normal"); /* X ~ N(0,1) */ cnt + (x < -1); /* sum of counts for value less than -1 */ end; Prob = cdf("Normal", -1); /* P(x< -1) */ MCEst = cnt / N; /* Monte Carlo approximation */ put Prob=; put MCEst=; run; |

Prob =0.1586552539 MCEst=0.1551 |

The Monte Carlo estimate is correct to two decimal places. The accuracy of this Monte Carlo computation is proportional to 1/sqrt(*N*), where *N* is the size of the Monte Carlo sample. Thus if you want to double the accuracy you need to quadruple the sample size.

### Two-dimensional CDFs

SAS provides the PROBBNRM function for computing the CDF of a bivariate normal distribution, but does not provide a built-in function that computes the CDF for other multivariate probability distributions. However, you can use Monte Carlo techniques to approximate multivariate CDFs for any multivariate probability distribution for which you can generate random variates.

I have previously blogged about how to use the PROBBNRM function to compute the bivariate normal CDF.
The following SAS/IML statements demonstrate how to use a Monte Carlo computation to approximate the bivariate normal CDF. The example uses a bivariate normal random variable Z ~ MVN(**0**, **Σ**), where **Σ** is the correlation matrix with **Σ**_{12} = 0.6.

The example computes the probability that a bivariate normal random variable is in the region G = {(x,y) | x<x_{0} and y<y_{0}}.
The program first calls the built-in PROBBNRM function to compute the probability. Then the program calls the RANDNORMAL function to generate 100,000 random values from the bivariate normal distribution. A binary vector (`group`) indicates whether each observation is in G. The MEAN function computes the proportion of observations that are in the region.

proc iml; x0 = 0.3; y0 = 0.4; rho = 0.6; Prob = probbnrm(x0, y0, rho); /* P(x<x0 and y<y0) */ call randseed(123456); N = 1e5; /* sample size */ Sigma = { 1 &rho, /* correlation matrix */ &rho 1 }; mean = {0 0}; Z = randnormal(N, mean, Sigma); /* sample from MVN(0, Sigma) */ group = (Z[,1] < x0 & Z[,2] < y0); /* binary vector */ MCEst = mean(group); /* = sum(group=1) / N */ print Prob MCEst; |

You can use a scatter plot to visualize the Monte Carlo technique. The following statements create a scatter plot and use the DROPLINE statement in PROC SGPLOT to indicate the region G. Of the 100000 random observations, 49750 of them were in the region G. These observations are drawn in red. The observations that are outside the region are drawn in blue.

ods graphics / width=400px height=400px; title "Estimate of P(x < x0 and y < y0) is 0.4978"; title2 "x0 = 0.3; y0 = 0.4; rho = 0.6"; call scatter(Z[,1], Z[,2]) group=group grid={x y} procopt="noautolegend aspect=1" option="transparency=0.9 markerattrs=(symbol=CircleFilled)" other="dropline x=0.3 y=0.4 / dropto=both;"; |

### Higher dimensions

The Monte Carlo technique works well in low dimensions. As the dimensions get larger, you need to generate a lot of random variates in order to obtain an accurate estimate. For example, the following statements generate 10 million random values from the five-dimensional distribution of uncorrelated normal variates and estimate the probability of all components being less than 0:

d = 5; /* dimension */ N = 1e7; /* sample size */ mean = j(1, d, 0); /* {0,0,...,0} */ Z = randnormal(N, mean, I(d)); /* Z ~ MVN (0, I) */ v0 = {0 0 0 0 0}; /* cutoff values in each component */ ComponentsInRegion = (Z < v0)[,+]; /* number of components in region */ group = (ComponentsInRegion=d); /* binary indicator vector */ MCEst = mean(group); /* proportion of obs in region */ print (1/2**d)[label="Prob"] MCEst; |

Because the normal components are independent, the joint probability is the product of the probabilities for each component: 1/2^{5} = 0.03125. The Monte Carlo estimate is accurate to three decimal places.

The Monte Carlo technique can also handle non-rectangular regions. For example, you can compute the probability that a random variable is in a spherical region.

The Monte Carlo method is computationally expensive for high-dimensional distributions. In high dimensions (say, d > 10), you might need billions of random variates to obtain a reasonable approximation to the true probability. This is another example of the curse of dimensionality.

The post Monte Carlo estimates of joint probabilities appeared first on The DO Loop.

DevOps is enjoying a lot of attention at the moment, mostly because of its promise to increase software quality at lower cost. DevOps is a hybrid term linking development and operations. It's a philosophy that focuses on improving efficiencies in how we build and deliver software, solutions and the supporting environments. [...]

Should data scientists buy into the DevOps philosophy? was published on SAS Voices by Greg Willis

Whether you are just getting started with SAS or an experienced user there are a slew of free resources you should know about. First, you can download and learn SAS for free. SAS University Edition is available for anyone who wants to learn SAS. Additionally, SAS offers many free online [...]

The post Free SAS resources that every SAS user should know about appeared first on SAS Learning Post.

At its core, data compliance is built on simple foundations. Dylan Jones closes this series by explaining the remaining components of the "4F framework."

The post A simple framework for getting started with data compliance (Part 2 ) appeared first on The Data Roundtable.

Do you remember the 90s? It seemed like every company and organization had some sort of strategic plan that had “2000” in its title. And they were all going to achieve and exceed these Year 2000 goals … if their systems didn’t crash at 12:00:01 on January 1, 2000! So [...]

2020 vision for utilities: What’s the prescription? was published on SAS Voices by Mike F. Smith

Does your data just look like a bunch of dots? Would you like to bring those dots to life so they can tell their story? This example shows how to make some voter-registration dots a bit more lifelike, so you can see what they're saying. You can apply the same [...]

The post Analytics - helping your 'dots' tell their story appeared first on SAS Learning Post.