SAS R&D

April 18, 2017
 

In addition to his day job as Chief Technology Officer at SAS, Oliver Schabenberger is a committed lifelong learner. During his opening remarks for the SAS Technology Connection at SAS Global Forum 2017, Schabenberger confessed to a persistent, nervous curiosity and insisted that he’s “learning every day.” He encouraged attendees to do the same, officially proclaiming lifelong learning a primary theme of the conference and announcing a social media campaign to explore the theme with attendees.

This theme of lifelong learning served as a backdrop – figuratively at first, literally once the conference began! – when Schabenberger, R&D Vice President Oita Coleman and Senior R&D Project Manager Lisa Morton sat down earlier this year to determine the focus for the Catalyst Café at SAS Global Forum 2017.

A centerpiece of SAS Global Forum’s Quad area, the Catalyst Café is an interactive space for attendees to try out new SAS technology and provide SAS R&D with insight to help guide future software development. At its core, the Catalyst Café is an incubator for innovation, making it the perfect place to highlight the power of learning.

After consulting with SAS Social Media Manager Kirsten Hamstra and her team, Schabenberger, Coleman and Morton decided to explore the theme by asking three questions related to lifelong learning, one on each day of the conference. Attendees, and others following the conference via social media channels, would respond using the hashtag #lifelearner. Morton then visualized the responses on a 13-foot-long by 8-foot-high wall, appropriately titled the Social Listening Mural, for all to enjoy during the event.

Questions for a #lifelearner

The opening day of the conference brought this question:

Day two featured this question:

Finally, day three, and this question:

"Committed to lifelong learning"

Hamstra said the response from the SAS community was overwhelming, with hundreds of individuals contributing.

Morton working on the Social Listening Mural at the SAS Global Forum Catalyst Café

“It was so interesting to see what people shared as their first jobs,” said Morton. “One started out as a bus boy and ended up a CEO, another went from stocking shelves to analytical consulting, and a couple said they immediately started their analytical careers by becoming data analysts right out of school.”

The “what do you want to learn next?” question brought some interesting responses as well. While many respondents cited topics you’d expect from a technically inclined crowd – things like SAS Viya, the Go programming language and SASPy – others said they wanted to learn Italian, design websites, or teach kids to play soccer.

Morton said the connections made during the process were fascinating, and they made the creation of the mural simple and inspiring. “The project showed me how incredibly diverse our SAS users are and what a wide variety of backgrounds and interests they have.”

In the end, Morton said she learned one thing for sure about SAS users: “It’s clear our users are just as committed to lifelong learning as we are here at SAS!”

My guess is that wherever you’ll find Schabenberger at this moment – writing code in his office, behind a book at the campus library, or discussing AI with Dr. Goodnight – he’s nodding in agreement.

The final product

Nurturing the #lifelearner in all of us was published on SAS Users.

March 17, 2017
 

Community detection has been used in many fields, such as social networks, biological networks, telecommunication networks, and fraud/terrorist detection. Traditionally, it is performed on an entity link graph in which the vertices represent entities and the edges indicate links between pairs of entities. The goal is to partition the graph into communities such that the links within the community subgraphs are more densely connected than the links between communities. Finding communities within an arbitrary network can be a computationally difficult task. The number of communities, if any, within the network is typically unknown, and the communities are often of unequal size and/or density. Despite these difficulties, several methods for community finding have been developed and employed with varying levels of success.[1] SAS implements the most popular graph and network analysis algorithms, along with SAS-proprietary ones, and integrates them with other powerful analytics in SAS Social Network Analysis. The SAS graph algorithms are packaged in PROC OPTGRAPH, with which you can detect communities in a network graph.

In text analytics, researchers have explored applying community detection to textual interaction data – such as co-authorship networks, textual-interaction networks, and social-tag networks – and have shown its effectiveness.[2] In this post, I would like to show you how to cluster papers based on the keyword link graph using community detection.

Following the steps I introduced in my previous blog, you can extract paper keywords with SAS Text Analytics. Assuming you already have the paper keywords, you then need to go through the following three steps.

Step 1: Build the network.

The network structure depends on your analysis purpose and your data. Take the paper-keyword relationships, for example: there are two ways to construct the network. The first method uses papers as vertices, and a link occurs only when two papers share the same keyword or keyword token. The second method treats papers, keywords and keyword tokens as vertices, and links exist only in paper-keyword pairs or paper-keyword_token pairs; there is no direct link between papers.

I compared the community detection results of the two methods and chose the second one because its results were more reasonable. So in this article, I will focus on the second method only.

In addition, SAS supports weighted graph analysis; in my experiment I used term frequency as the link weight. For example, the keywords of paper 114 are “boosting; neural network; word embedding”. After parsing the paper keywords with SAS Text Analytics, we get 6 terms: “boost”, “neural network”, “neural”, “network”, “word”, and “embedding”. Here I turned on the stemming and noun-group options and used the SAS stop list for English. Table-1 shows the network data for this paper.
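As a minimal, hypothetical sketch of what that link data looks like (the weight values are illustrative term frequencies, not copied from Table-1):

* Hypothetical sketch of the paper-keyword link data for paper 114;
* from = paper id, to = term, weight = term frequency;
data paper114_links;
   infile datalines dlm=',';
   length from $8 to $16;
   input from $ to $ weight;
   datalines;
114,boost,1
114,neural network,1
114,neural,1
114,network,1
114,word,1
114,embedding,1
;
run;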

In the text parsing step, I set both the term weight and the cell weight to NONE, because community detection depends on link density, and raw term frequencies are more effective than weighted values on this tiny dataset. As Table-1 shows, the term frequencies are small too, so there is no need to apply a log transform to the cell weight.

Step 2: Run community detection to cluster papers and output the detection results.

There are two types of network graphs: directed and undirected. In paper-keyword relationships, the direction from paper to keyword (or vice versa) makes no difference, so I chose an undirected graph. PROC OPTGRAPH implements three heuristic algorithms for finding communities: the Louvain algorithm proposed in Blondel et al. (2008), the label propagation algorithm proposed in Raghavan, Albert, and Kumara (2007), and the parallel label propagation algorithm developed by SAS (patent pending). The Louvain algorithm aims to optimize modularity, which is one of the most popular merit functions for community detection. Modularity is a measure of the quality of a division of a graph into communities. The modularity of a division is defined as the fraction of the links that fall within the communities minus the expected fraction if the links were distributed at random, assuming that you do not change the degree of each node.[3] In my experiment, I used Louvain.
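Stated symbolically (a standard formulation following Blondel et al. 2008, paraphrased here rather than quoted from the OPTGRAPH documentation):

Q = (1/2m) * SUM_{i,j} [ A_ij - (k_i * k_j)/(2m) ] * delta(c_i, c_j)

where A_ij is the weight of the link between nodes i and j, k_i = SUM_j A_ij is the weighted degree of node i, m is the total link weight, and delta(c_i, c_j) equals 1 when nodes i and j fall in the same community and 0 otherwise. The Louvain algorithm greedily moves nodes between communities to increase Q.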

Besides the algorithm, you also need to set the resolution value. A larger resolution value produces more communities, each containing a smaller number of nodes. I tried three resolution values (0.5, 1, and 1.5) and settled on 1, because the topic of each community was most reasonable. With these settings, I got 18 communities.

Step 3: Explore communities visually and get insights.

Once you have the community detection results, you can use the Network Diagram in SAS Visual Analytics to visually explore the communities and understand their topics or characteristics.

Take the largest community as an example: there are 14 papers in this community. Nodes annotated with numbers are papers; the others are keyword tokens. Node size is determined by the sum of the link weights (term frequencies), and node color is determined by the community value. From Figure-1, you can easily spot its topic: sentiment, the largest of all the keyword nodes. After going through the conference program, I found that these are all papers from the IALP 2016 shared task, whose goal is to predict the valence-arousal ratings of Chinese affective words.

Figure-1 Network Diagram of Papers in Community 0

 

Another example is community 8, and its topic terms are annotation and information.

Figure-2 Network Diagram of Papers in Community 8

The keywords were clustered at the same time, and the keyword communities may be used in your search engine to improve keyword-based recommendation, or to improve search performance by retrieving more relevant documents. I extracted the keywords (noun groups only) of the top 5 communities and displayed them with SAS Visual Analytics. The top 3 keywords of community 0 are sentiment analysis, affective computing, and affective lexicon, which are semantically very close. If you have more data, you may get better results than mine.

Figure-3 Keyword frequency chart of the top 5 communities

If you are interested in this analysis, why not try it with your data? The SAS scripts for clustering papers are below.

* Step 1: Build the paper-keyword network;
proc hptmine data=outlib.paper_keywords;
   doc_id documentName; 
   var keywords;
   parse notagging termwgt=none cellwgt=none
         stop=sashelp.engstop
         outterms=terms
         outparent=parent
         reducef=1;
run;quit;
 
proc sql;
   create table outlib.paper_keyword_network as
   select _document_ as from, term as to, _count_ as weight
   from parent
   left join terms
   on parent._termnum_ = terms.key
   where parent eq .;
quit;
 
* Step 2: Run community detection to cluster papers;
* NOTE: huge network = low resolution level;
proc optgraph
   loglevel = 1
   direction=undirected
   graph_internal_format = thin
   data_links = outlib.paper_keyword_network
   out_nodes  = nodes
   ;
 
   community
      loglevel = 1
      maxiter = 20
      link_removal_ratio = 0
      resolution_list    = 1
      ;
run;
quit;
 
proc sql;
   create table paper_community as
   select distinct Paper_keywords.documentName, keywords, community_1 as community
   from outlib.Paper_keywords
   left join nodes
   on nodes.node = Paper_keywords.documentName
   order by community_1, documentName;
quit;
 
* Step 3: Merge network and community data for VA exploration;
proc sql;
   create table outlib.paper_community_va as
   select paper_keyword_network.*, community
   from outlib.paper_keyword_network
   left join paper_community
   on paper_keyword_network.from = paper_community.documentName;
quit;
 
* Step 4: Keyword communities;
proc sql;
   create table keyword_community as
   select *
   from nodes
   where node not in (select documentName from outlib.paper_keywords)
   order by community_1, node;
quit;
 
proc sql;
   create table outlib.keyword_community_va as
   select keyword_community.*, freq
   from keyword_community
   left join terms
   on keyword_community.node = terms.term
   where parent eq . and role eq 'NOUN_GROUP'
   order by community_1, freq desc;
quit;

 

References

[1]. Communities in Networks
[2]. Automatic Clustering of Social Tag using Community Detection
[3]. SAS(R) OPTGRAPH Procedure 14.1: Graph Algorithms and Network Analysis

Clustering of papers using Community Detection was published on SAS Users.


March 1, 2017
 

In 2011, Loughran and McDonald applied a general sentiment word list to accounting and finance topics, and this led to a high rate of misclassification. They found that about three-fourths of the negative words in the Harvard IV TagNeg dictionary of negative words are typically not negative in a financial context. For example, words like “mine”, “cancer”, “tire” or “capital” are often used to refer to a specific industry segment. These words are not predictive of the tone of documents or of financial news and simply add noise to the measurement of sentiment and attenuate its predictive value. So, it is not recommended to use any general sentiment dictionary as is.

Extracting domain-specific sentiment lexicons in the traditional way is time-consuming and often requires domain expertise. Today, I will show you how to extract a domain-specific sentiment lexicon from movie reviews through a machine learning method and construct SAS sentiment rules with the extracted lexicon to improve sentiment classification performance. I did this experiment with help from my colleagues Meilan Ji and Teresa Jade, and our experiment with the Stanford Large Movie Review Dataset showed around an 8% increase in overall accuracy with the extracted lexicon. It also showed that lexicon coverage and accuracy could be improved substantially with more training data.

SAS Sentiment Analysis Studio has released domain-independent Taxonomy Rules for 12 languages and domain-specific Taxonomy Rules for a few of them. For English, SAS covers 12 domains, including Automotive, Banking, Health and Life Sciences, Hospitality, Insurance, Telecommunications, Public Policy, Countries, and others. If the domain of your corpus is not covered by these industry rules, your first choice is to use the general rules, which, as Loughran and McDonald found, sometimes leads to poor classification performance. Automatically extracting domain-specific sentiment lexicons has been studied by researchers, and three methods have been proposed. The first is to have linguistic or domain experts create a domain-specific word list, which may be expensive or time-consuming. The second is to derive non-English lexicons from English lexicons and other linguistic resources such as WordNet. The last is to leverage machine learning to learn lexicons from a domain-specific corpus. This article demonstrates the third method.

Because of the emergence of social media, researchers can relatively easily get sentiment data from the internet for experiments. Dr. Saif Mohammad, a researcher in computational linguistics at the National Research Council Canada, proposed a method to automatically extract sentiment lexicons from tweets. His method produced the best results in SemEval-2013 by leveraging emoticons in a large collection of tweets, using the PMI (pointwise mutual information) between words and tweet sentiment to define the sentiment attributes of words. It is a simple method, but quite powerful. At the ACL 2016 conference, one paper introduced how to use neural networks to learn sentiment scores, and in that paper I found the following simplified formula to calculate a sentiment score.

Given a set of tweets with their labels, the sentiment score (SS) for a word w was computed as:
SS(w) = PMI(w, pos) − PMI(w, neg), (1)

where pos represents the positive label and neg represents the negative label. PMI stands for pointwise mutual information, which is
PMI(w, pos) = log2((freq(w, pos) * N) / (freq(w) * freq(pos))), (2)

Here freq(w, pos) is the number of times the word w occurs in positive tweets, freq(w) is the total frequency of word w in the corpus, freq(pos) is the total number of words in positive tweets, and N is the total number of words in the corpus. PMI(w, neg) is calculated in a similar way. Thus, Equation 1 is equal to:
SS(w) = log2((freq(w, pos) * freq(neg)) / (freq(w, neg) * freq(pos))), (3)

The movie review data I used was downloaded from Stanford; it is a collection of 50,000 reviews from IMDB. I used 25,000 reviews each for the train and test datasets. The constructed dataset contains an equal number of positive and negative reviews. I used SAS Text Mining to parse the reviews into tokens and wrote a SAS program to calculate the sentiment scores.
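As a minimal sketch of that calculation, assuming a hypothetical table term_counts of per-label term frequencies produced by the parsing step (the toy values and the 0.5 smoothing are my assumptions, not part of the original program):

* Hypothetical toy input: term, label ('pos'/'neg'), freq;
data term_counts;
   length term $20 label $3;
   input term $ label $ freq;
   datalines;
great pos 90
great neg 10
awful pos 5
awful neg 45
;
run;

* Totals for freq(pos) and freq(neg);
proc sql noprint;
   select sum(freq) into :freq_pos from term_counts where label='pos';
   select sum(freq) into :freq_neg from term_counts where label='neg';
quit;

* SS(w) = log2( (freq(w,pos)*freq(neg)) / (freq(w,neg)*freq(pos)) ) as in Equation 3;
* the 0.5 smoothing avoids division by zero for terms seen in only one class;
proc sql;
   create table sentiment_scores as
   select coalesce(p.term, n.term) as term,
          log2( (coalesce(p.freq, 0.5) * &freq_neg) /
                (coalesce(n.freq, 0.5) * &freq_pos) ) as ss
   from (select term, freq from term_counts where label='pos') as p
        full join
        (select term, freq from term_counts where label='neg') as n
        on p.term = n.term;
quit;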

In my experiment, I used the train dataset to extract the sentiment lexicon and the test dataset to evaluate sentiment classification performance at each sentiment score cutoff value from 0 to 2, in increments of 0.25. Data-driven learning methods frequently suffer from overfitting, so I used the test data to filter out all weakly predictive words whose sentiment scores have an absolute value less than 0.75. In Figure-1, there is an obvious drop in the accuracy line plot of the test data when the cutoff value is less than 0.75.
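Continuing the sketch above, the cutoff filter itself is a one-liner:

* Keep only strongly predictive words, abs(SS) >= 0.75;
data lexicon;
   set sentiment_scores;
   if abs(ss) >= 0.75;
run;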


Figure-1 Sentiment Classification Accuracy by Sentiment Score Cutoff

Finally, I got a huge list of 14,397 affective words: 7,850 positive and 6,547 negative words from the movie reviews. Figure-2 shows the top 50 lexical items from each sentiment category.

Figure-2 Sentiment Score of Top 50 Lexical Items

Now I have automatically derived the sentiment lexicon, but how accurate is it, and how do I evaluate its accuracy? I googled movie vocabulary and got two lists, Useful Adjectives for Describing Movies and Words for Movies & TV, with 329 adjectives categorized as positive or negative. 279 of these adjectives have vector data in the GloVe word embedding model downloaded from http://nlp.stanford.edu/projects/glove/, and Figure-3 shows the t-SNE plot. GloVe is an unsupervised learning algorithm for obtaining vector representations of words; training is performed on aggregated global word-word co-occurrence statistics from a corpus. t-SNE is a machine learning algorithm for dimensionality reduction developed by Geoffrey Hinton and Laurens van der Maaten.[1] It is a nonlinear dimensionality reduction technique particularly well suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot. Two words are located close together in the scatter plot if their semantic meanings are close or if they frequently co-occur in the same contexts. Besides semantic closeness, I also show sentiment polarity via color: red stands for negative and blue stands for positive.

Figure-3 T-SNE Plot of Movie Vocabularies

From Figure-3, I find the positive vocabulary and the negative vocabulary are clearly separated into two big clusters, with very little overlap in the plot.

Now, let me check the sentiment scores of these terms. 175 of them were included in my result, and Figure-4 displays the top 50 terms of each category. I compared the sentiment polarity of my result with the lists: 168 of the 175 terms were correctly labeled as negative or positive, for an overall accuracy of 96%.

Figure-4 Sentiment Score of Top 50 Movie Terms

Table-1 shows the 7 polarity differences between my predictions and the actual labels.

Table-1 Sentiment Polarity Difference between Predictions and Actual Labels

One obvious prediction mistake is “coherent.” I checked the raw movie reviews that contain “coherent,” and only 25 of 103 are positive, which is why its sentiment score is negative rather than positive. I went through these reviews and found most of them contain a sentiment polarity reversal, such as “The Plot - Even for a Seagal film, the plot is just stupid. I mean it’s not just bad, it’s barely coherent. …” A possible way to make the sentiment scores more accurate is to use more data or to add special handling for polarity reversals. I tried the first approach, and it did improve the accuracy significantly.

So far, I have evaluated the accuracy of the sentiment scores against public linguistic resources; next I will test their predictive performance with SAS Sentiment Analysis Studio. I ran sentiment analysis against the test data with both the domain-independent sentiment rules developed by SAS and the domain-specific sentiment rules constructed by machine learning, and compared the performance of the two methods. The results showed an 8% increase in overall accuracy. Table-2 and Table-3 show the detailed information.

Test data (25,000 docs)

Table-2 Performance Comparison with Test Data

Table-3 Overall Performance Comparison with Test Data

After you get domain-specific sentiment lexicons from your corpora, only a few steps are required in SAS Sentiment Analysis Studio to construct the domain-specific sentiment rules. So, next time you are processing domain-specific text for sentiment, you may want to try this method to get a list of positive- and negative-polarity terms to augment your SAS domain-independent model.

The detailed steps to construct domain-specific sentiment rules are as follows.

Step 1. Create a new Sentiment Analysis project.
Step 2. Create intermediate entities named “Positive” and “Negative”, then add the learned lexicons to the two entities, respectively.

Step 3. Besides the learned lexicons, you may add an entity named “Negation” to handle the negated expressions. You can list some negations you are familiar with, such as “not, don’t, can’t” etc.

Step 4. Create positive and negative rules in the Tonal Keyword. Add the rule “CONCEPT: _def{Positive}” to the Positive tab, and the rules “CONCEPT: _def{Negative}” and “CONCEPT: _def{Negation} _def{Positive}” to the Negative tab.

Step 5. Build the rule-based model. Now you can use this model to predict the sentiment of documents.

How to extract domain-specific sentiment lexicons was published on SAS Users.

February 25, 2017
 

As a practitioner of visual analytics, I read the featured blog post ‘Visualizations: Comparing Tableau, SPSS, R, Excel, Matlab, JS, Python, SAS’ last year with great interest. In the post, the blogger, Tim Matteson, asked readers to guess which software was used to create each of his 18 graphs. My buddy, Emily Gao, suggested that I see how SAS VA does at recreating these visualizations. I agreed.

SAS Visual Analytics (VA) is best known for its interactive visual analysis, but it can also create polished visualizations. Users can easily create professional charts and visualizations without SAS coding. So what I am trying to do in this post is load the corresponding data into the SAS VA environment and use VA Explorer and Designer to mimic Matteson’s visualizations.

I want to specially thank Robert Allison for his valuable advice during the process of writing this post. Robert Allison is a SAS graph expert, and I have learned a lot from his posts. I read his blog on creating the 18 amazing graphs using pure SAS code, and I copied most of the data from his blog when creating these visualizations, which saved me a lot of time preparing data.

So, here’s my attempt at recreating Matteson’s 18 visualizations using SAS Visual Analytics.

Chart 1

This visualization is created by using two customized bar charts in VA, placed together using the precision layout so they look like one chart. The bar charts can be customized using the ‘Custom Graph Builder’ in SAS VA: set the reverse order for the X axis, set the axes direction to horizontal, hide the axis labels for the X and Y axes, uncheck ‘show tick marks,’ and so on. Compared with Matteson’s visualization, my version displays the tick values on the X axis as non-negative numbers, since people generally expect positive values for a frequency.

Another thing: I used a custom sort on the category to define the order of the items in the bar chart. This can be done by right-clicking on the category and selecting ‘Edit Custom Sort…’ to get the desired order. You may also have noticed that the legend looks a bit strange for the Neutral response, since it is split into Neutral_1stHalf and Neutral_2ndHalf, which I needed in order to show the data symmetrically in VA.

Chart 2

VA can easily create a grouped bar chart with the desired sort order for the countries and the questions. However, VA cannot place the question text horizontally atop each group of bars; it uses vertical section bars instead, with tooltips that show the whole question text when the mouse hovers over them. We can also see the value of each bar section interactively when hovering.

Chart 3

Matteson’s chart looks a bit scattered to me, while Robert’s chart has great label text and markers for the scatterplot matrix. Here I used VA Explorer to create the scatterplot matrix for the data, which omits the diagonal cells and the symmetric half of the matrix for easier analysis. It can then be exported to a report, where the color of the data points can be changed.

Chart 4

I used the ‘Numeric Series Plot’ to draw this chart of job losses in the recession. It was straightforward: I just adjusted a few settings, such as checking ‘Show markers’ in the Properties tab and unchecking ‘Show label’ for the X axis and ‘Use filled markers.’ To refine the X axis labels with different fonts, I had to use the ‘Precision’ layout instead of the default ‘Tile’ layout and drag in a ‘Text’ object to hold the desired X axis label.

Chart 5

VA can draw grouped bar charts automatically. Disable the X axis label, and set a grey color for the ‘Header background.’ What we need to do here is add some display rules for the color-value mapping. For the formatted text at the bottom, use a ‘Text’ object. (Note: VA puts the Age_range values at the bottom of the chart.)

Chart 6

SAS VA does not support 3D charts, so I could not make a chart similar to the one Robert created with SAS code. What I did for this visualization was create a network diagram using the karate club dataset. The detected communities (0, 1, 2, 3) are shown in different colors. The diagram can be exported as an image in VA Explorer.

I used the following code to generate the necessary data for the visualization:

/* Dataset of Zachary’s Karate Club data is from: http://support.sas.com/documentation/cdl/en/procgralg/68145/HTML/default/viewer.htm#procgralg_optgraph_examples07.htm  
This dataset describes social network friendships in karate club at a U.S. university.
*/
data LinkSetIn;
   input from to weight @@;
   datalines;
 0  9  1  0 10  1  0 14  1  0 15  1  0 16  1  0 19  1  0 20  1  0 21  1
 0 23  1  0 24  1  0 27  1  0 28  1  0 29  1  0 30  1  0 31  1  0 32  1
 0 33  1  2  1  1  3  1  1  3  2  1  4  1  1  4  2  1  4  3  1  5  1  1
 6  1  1  7  1  1  7  5  1  7  6  1  8  1  1  8  2  1  8  3  1  8  4  1
 9  1  1  9  3  1 10  3  1 11  1  1 11  5  1 11  6  1 12  1  1 13  1  1
13  4  1 14  1  1 14  2  1 14  3  1 14  4  1 17  6  1 17  7  1 18  1  1
18  2  1 20  1  1 20  2  1 22  1  1 22  2  1 26 24  1 26 25  1 28  3  1
28 24  1 28 25  1 29  3  1 30 24  1 30 27  1 31  2  1 31  9  1 32  1  1
32 25  1 32 26  1 32 29  1 33  3  1 33  9  1 33 15  1 33 16  1 33 19  1
33 21  1 33 23  1 33 24  1 33 30  1 33 31  1 33 32  1
;
run;
/* Perform the community detection using resolution levels (1, 0.5) on the Karate Club data. */
proc optgraph
   data_links            = LinkSetIn
   out_nodes             = NodeSetOut
   graph_internal_format = thin;
   community
      resolution_list    = 1.0 0.5
      out_level          = CommLevelOut
      out_community      = CommOut
      out_overlap        = CommOverlapOut
      out_comm_links     = CommLinksOut;
run;
 
/* Create the dataset of detected community (0, 1, 2, 3) for resolution level equals 1.0 */ 
proc sql;
	create table mylib.newlink as 
	select a.from, a.to, b.community_1, c.nodes from LinkSetIn a, NodeSetOut b, CommOut c
 	where a.from=b.node and b.community_1=c.community and c.resolution=1 ;
quit;

Chart 7

I created this map using the ‘Geo Coordinate Map’ in VA. I needed to create a geography variable by right-clicking ‘World-cities,’ selecting Geography->Custom…, and setting Latitude to ‘Unprojected degrees latitude’ and Longitude to ‘Unprojected degrees longitude.’ To get the black continents in the map, go to VA preferences and check ‘Invert application colors’ under Theme. Remember to set the ‘Marker size’ to 1, and change the first marker color to black so that it shows as white when the application colors are inverted.

Chart 8

This is a very simple scatter chart in VA. I only set the transparency in order to show the overlapping values. The blue text in the upper-left corner uses a text object.

Chart 9

To get this black-background graph, set the ‘Wall background’ color to black, then change the ‘Line/Marker’ color in the data colors section accordingly. I also checked the ‘Show markers’ option and increased the marker size to 6.

Chart 10

There is nothing special about creating this scatter plot in VA. I simply created several reference lines and unchecked ‘Use filled markers’ with a smaller marker size. The transparency of the markers is set to 30%.

Chart 11

In VA’s current release, if we use a category variable for color, the markers automatically change shape for different colors. So I created a customized scatterplot using the VA Custom Graph Builder to define the marker as always round. Nothing else: just set the transparency to clearly show the overlapping values. As always, we can add an image object in VA with the precision layout.

Chart 12

I used the Geo Bubble Map to create this visualization. I needed to create a custom geography variable from the trap variable, using ‘lat_deg’ and ‘lon_deg’ as latitude and longitude respectively. I then renamed the NumMosquitos measure to ‘Total Mosquitos’ and used it for the bubble size. To show the presence of West Nile virus, I used display rules in VA. I also created an image to show the meaning of the colored icons for the display rules. The precision layout is enabled so that text and images can be added to this visualization.

Chart 13

This visualization was also created with the Geo Bubble Map in VA. First I did some data manipulation to square the magnitude, purely for the sake of bubble size resolution, so the bubbles show more contrast in size. Then I created some display rules to show the significance of the earthquakes with different colors, and set the transparency of the bubbles to 30% for clarity. I also created an image to show the meaning of the colored icons.

Be aware that some manipulation of the original longitude data is needed. Since the geographic coordinates use the prime meridian as the reference, if we want to show the data for the Americas on the right side of the map, we need to add 360 to the negative longitude values.
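A minimal sketch of that adjustment (the dataset and variable names are hypothetical):

* Shift negative longitudes so the Americas plot on the right side of the map;
data quakes_shifted;
   set quakes;
   if lon_deg < 0 then lon_deg = lon_deg + 360;
run;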

Chart 14

In my understanding, one of the key points of Matteson’s visualization is to show the control/interaction feature, and VA has various control objects for interactive analysis. For the upper part of this visualization, I simply used a list table object. The trick here is using a display rule to mimic the style. Before assigning any data to the list table in VA, I created a display rule with an expression; at that moment you can specify the column with any measure value in the expression (otherwise, you would need to define a display rule for each column with its own expression). Just define ‘Any measure value’ as missing, or greater than a value, with the proper fill color for the cell. (VA doesn’t support filling a cell with a pattern, as Robert did for missing values, so I used grey for missing values to differentiate them from 0, which has a light color.)

For the lower part, I created a new dataset to hold the intervention items and put it in a list control and a list table. The horizontal bar chart on the right is a target bar chart with the expected duration as the target value; the label on each bar shows the actual duration.

Chart 15

VA does not have the solid-modeling animation Matteson used in his original chart, but it does support animation for bubble plots in an interactive mode. So I made this visualization using Robert’s animation dataset, as an imitation of the famous animation by the late Hans Rosling and a memorial to him. I set the dates for the animation by creating a date variable set to the first day of each year (just for simplicity). One customization: I used the Custom Graph Builder to add a new role so the bubble plot can display data labels, and set the country name as the bubble label in VA Designer. Of course, we can always filter to the countries of interest in VA for further analysis.

VA can’t show only some of the bubble labels, as Robert did with SAS code. So, to clearly show the labels of the countries of interest, I ranked the top 20 countries by average population and set a filter to show data from 1950 to 2011. I used a screen capture tool to save the animation as a .gif file. Be sure to click the chart to see the animation.

Chart 16

I think the point of Matteson’s original chart is to show the overview axis in the line chart, since I don’t see anything else special about it. So I drew this time series plot with the overview axis enabled in VA, using the SASHELP.STOCK dataset. It shows the date on the X axis with tick marks split into months, which can be zoomed in to the day level interactively in VA. The overview axis supports zooming in and out, as well as moving the focused period.

Chart 17

For this visualization, I used a customized bubble plot (in the Custom Graph Builder, add a data label role to the bubble plot) so that bubble labels are displayed. I used one reference line labeled Gross Avg., plus two reference lines for the X and Y axes, which visually creates four quadrants. As usual, I added 4 text objects to hold the labels at each corner in the precision layout.

Chart 18

I think Matteson made an impressive 3D chart, and Robert recreated a very beautiful 3D chart with pure SAS code. But VA does not have any 3D charts. So for this visualization, I simply loaded the data into VA and dragged the variables to get a visualization in VA Explorer. I then chose the best fit from the fit line list and exported the visualization to a report, where I added display rules according to the value of Yield. Since VA shows the display rules in the information panel, I created an image with colored markers to serve as a legend in the visualization and placed it in the precision layout.

There you have it. Matteson’s 18 visualizations recreated in VA.

How did I do?

18 Visualizations created by SAS Visual Analytics was published on SAS Users.

January 26, 2017
 

Recently a colleague told me Google had published new, interesting datasets on BigQuery. I found a lot of Reddit data there as well, so I quickly tried running BigQuery on these text data to see what I could produce. After getting some pretty interesting results, I wanted to see whether I could implement the same analysis with SAS, and whether SAS Text Mining would yield deeper insights than simple queries. So, I tried SAS on the Reddit comments data, and I’d like to share my analyses and findings with you.

Analysis 1: Significant Words

To get started with BigQuery, I googled what others were sharing about BigQuery and Reddit, and I found USING BIGQUERY WITH REDDIT DATA. In this article the author posted a query statement for extracting significant words from the Politics subreddit. I then wrote a SAS program to mimic this query, and I got the following data from the July Reddit comments. The result is not exactly the same as the one from BigQuery, since I downloaded the Reddit data from another website and used the SAS text parsing action to parse the comments into tokens, rather than just splitting tokens on white space.
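To give a flavor of the idea, here is a minimal sketch that scores a term by how much more frequent it is in the Politics subreddit than in the corpus overall; the term-frequency tables politics_terms and all_terms (columns term, freq) are hypothetical outputs of the parsing step, and the actual BigQuery scoring may differ:

proc sql noprint;
   select sum(freq) into :politics_total from politics_terms;
   select sum(freq) into :all_total from all_terms;
quit;

* Relative-frequency ratio as a simple significance score;
proc sql outobs=20;
   select p.term,
          (p.freq / &politics_total) / (a.freq / &all_total) as significance
   from politics_terms as p
        inner join all_terms as a
        on p.term = a.term
   order by significance desc;
quit;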

Analysis 2: Daily Submissions

The words Trump and Hillary in the list raised my interest and begged for further analysis. So, I did a daily analysis to understand how hot Trump and Hillary were during this month. I filtered all comments mentioning Trump or Hillary in the Politics subreddit and counted the total submissions per day. The resulting time series plot is shown below.
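The counting step might look like this minimal sketch (the comments table and its columns are hypothetical):

* Count daily submissions mentioning Trump or Hillary in the Politics subreddit;
proc sql;
   create table daily_mentions as
   select created_date as day,
          count(*) as submissions
   from reddit.comments
   where subreddit = 'politics'
     and (find(body, 'trump', 'i') > 0
          or find(body, 'hillary', 'i') > 0)
   group by created_date
   order by day;
quit;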

I found several spikes in the plot, which happened on 2016/7/5, 2016/7/12, 2016/7/21, and 2016/7/26.

Analysis 3: Topics Time Line

I wondered what Reddit users were concerned about on these specific days, so I extracted the top 10 topics from all comments submitted within the Politics subreddit in July 2016 and got the following data. These topics clearly focus on several aspects of the election, such as the vote, the presidential candidates, the parties, and hot news such as Hillary’s email probe.

The topics show what people were concerned about over the whole month, but I needed further investigation to explain which topic contributed most to the four spikes. The topics’ time series plots helped me find the answer.

Some topics’ time series trends are very close, making it hard to determine which topic contributed most, so I identified the top contributing topic based on daily percentage growth. The top growth topic on July 05 was “emails, dnc, +server, hillary, +classify,” with a 256.23-fold growth.
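A minimal sketch of that growth calculation, assuming a hypothetical table topic_daily with one row per topic per day and a freq column:

proc sort data=topic_daily;
   by topic day;
run;

* Daily percentage growth per topic;
data topic_growth;
   set topic_daily;
   by topic;
   prev = lag(freq);
   if first.topic then prev = .;   /* do not compare across topics */
   if prev > 0 then growth = (freq - prev) / prev;
run;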

Its time series plot also shows a high spike on July 05. I then googled “July 5, 2016 emails dnc server hillary classify” and found the following news.

There is no doubt the spike on July 05 is related to the FBI’s decision about Clinton’s email probe. To confirm this, I extracted the top 20 Reddit comments submitted on July 05 according to their Reddit scores. I quote part of the top comment below; the link in the comment was included in the Google search results.

"...under normal circumstances, security clearances would be revoked. " This is your FBI. EDIT: I took paraphrased quote, this is the actual quote as per https://www.fbi.gov/news/pressrel/press-releases/statement-by-fbi-director-james-b.-comey-on-the-investigation-of-secretary-hillary-clintons-use-of-a-personal-e-mail-system - "

I did a similar analysis on the other three days; the hot topics are as follows.

Interestingly, someone did a sentiment analysis with Twitter data, and the tweet submission trend for July looks the same as Reddit’s.

And in this blog, he listed several important events that happened in July.

  • July 5th: the FBI says it’s not going to end Clinton’s email probe and will not recommend prosecution.
  • July 12th: Bernie Sanders endorses Hillary Clinton for president.
  • July 21st: Donald Trump accepts the Republican nomination.
  • July 25-28: Clinton accepts nomination in the DNC.

This showcases that different social media platforms have similar response trends to the same events.

Now I know why these spikes happened. However, more questions came to my mind.

  • Who started posting this news?
  • Were there cyber armies?
  • Who were opinion leaders in the politics community?

I believe all these questions can be answered by analyzing the data with SAS.

tags: SAS R&D, SAS Text Analytics

Analyzing Trump v. Clinton text data at Reddit was published on SAS Users.

December 14, 2016
 

Last week, I attended the IALP 2016 conference (20th International Conference on Asian Language Processing) in Taiwan. After the conference, each presenter received a USB drive with all accepted papers in PDF format. So when I got back to Beijing, I began going through the papers to extend my learning. Usually, when I return from a conference, I go through all the paper titles and my conference notes, choose the most interesting articles, and dive into them for details. I then summarize the important research discoveries into one document. This always takes me several days or more to complete.

This time, I decided to try SAS Text Analytics to help me read papers efficiently. Here’s how I did it.

My first experiment was to generate a word cloud of all papers. I used these three steps.

Step 1: Convert PDF collections into text files.

With the SAS procedure TGFilter and SAS Document Conversion Server, you can convert PDF collections into a SAS dataset. If you don’t have SAS Document Conversion Server, you can download pdftotext for free. Pdftotext converts PDF files into text only, so you need to write SAS code to import all the text files into a dataset. Moreover, if you use pdftotext, you need to check whether each PDF file was converted correctly; it’s annoying to check the texts one by one, so I hope you look for smart ways to do this check. The SAS TGFilter procedure has language detection functionality, and the language of any garbage document after conversion is empty rather than ‘English,’ so I recommend TGFilter; you can then easily filter garbage documents out with a WHERE statement of language not equal to ‘English.’

Step 2: Parse documents into words and get word frequencies.

Run the SAS procedure HPTMINE or TGPARSE against the document dataset, with the stemming option turned on and the English stop-word list released by SAS, and you get the frequencies of all stems.

Step 3: Generate word cloud plot.

Once you have the term frequencies, you can use either SAS Visual Analytics or R to generate the word cloud plot. I like programming, so I used the SAS procedure IML to submit an R script from SAS.

These steps generated a word cloud with the top 500 words of 66 papers. There were 87 papers in total, and 21 of them could not be converted correctly by SAS Document Conversion Server (19 could not be converted correctly by pdftotext).


Figure-1 Word Cloud of Top 500 Words of 66 Papers

From Figure-1, it is easy to see that the top 10 words are: word, language, chinese, model, system, sentence, corpus, information, feature and method. IALP is an international conference and does not focus only on Chinese. However, there was a shared task at this year’s conference whose purpose was to predict the valence-arousal ratings of traditional Chinese affective words, and each team that participated in the shared task was asked to submit a paper introducing their work, so “Chinese” contributed more than other languages to the word cloud.

You might think there’s a lot of noise if we build the word cloud from the paper bodies, so my second experiment generates a word cloud from the paper abstracts through similar processing. You can view the results in Figure-2.


Figure-2 Word Cloud of Top 500 Words of 66 Paper Abstracts

The top 10 words from the paper abstracts are: word, chinese, system, language, method, paper, propose, sentiment, base and corpus. These are quite different from the top 10 words extracted from the paper bodies. Characters, words and sentences are fundamental components of natural languages. For machine learning in NLP (natural language processing), an annotated corpus is the key: without a corpus, you cannot build any model. However, annotated corpora are rare even in the big data era, and that is why so many researchers put effort into annotation.

Can we do more analyses with SAS? Of course. We may analyze keywords, references, paper influence, paper categorization, etc. I hope I have time to try these interesting analyses and share my work with you in a future blog.

The SAS scripts for the paper word cloud are below.

* Step 1: Convert PDF collections into sas dataset;
* NOTE: You should have SAS Document Conversion Server purchased;
proc tgfilter out=paper(where=(text ne '' and language eq 'English'))
   tgservice=(hostname="yourhost" port=yourport)  
   srcdir="D:\temp\pdf_collections"
   language="english"
   numchars=32000
   extlist="pdf"
   force;
run;quit;
 
 
* Step 2: Parse documents into words and get word frequencies;
* Add document id for each document;
data paper;
   set paper;
   document_id = _n_;
run;
 
proc hptmine data=paper;
   doc_id document_id;
   var text;
   parse notagging nonoungroups termwgt=none cellwgt=none
         stop=sashelp.engstop
         outparent=parent
         outterms=key
         reducef=1;
run;quit;
 
* Get stem freq data;
proc sql noprint;
   create table paper_stem_freq as
   select term as word, freq as freq
   from key
   where missing(parent) eq 1
   order by freq descending;
quit;
 
 
* Step 3: Generate word cloud plot with R;
data topstems;
   set paper_stem_freq(obs=500);
run;
 
proc iml;
call ExportDataSetToR("topstems", "d" );
 
submit / R;
library(wordcloud)
 
# sort by frequency
d <- d[order(d$freq, decreasing = TRUE), ]
# draw the word cloud of the top stems
wordcloud(d$word, d$freq, random.order = FALSE)
endsubmit;
quit;
tags: SAS R&D, SAS Text Analytics

Fun with SAS Text Analytics: A qualitative analysis of IALP papers was published on SAS Users.

May 12, 2016
 

SAS Visual Analytics users view and interact with reports on their desktop computers or laptops. Many, however, have never heard of the SAS Mobile BI app or how it extends the viewing and interactive capabilities of their reports to mobile devices. The SAS Mobile BI app is simple to download to a mobile device, and you can immediately view some sample SAS Visual Analytics reports from within the app.

If your organization has deployed SAS Visual Analytics but is not taking advantage of the ability to view and interact with reports via mobile devices, I urge you to consider it. Almost every interaction you can have with a SAS Visual Analytics report on your desktop is also available for reports viewed in SAS Mobile BI!

It’s worth noting that having SAS Visual Analytics in your organization is not a requirement for downloading this nifty, free app onto your Android or iOS mobile devices. Once you download the app, you can view and interact with a wide spectrum of sample SAS Visual Analytics reports for different industries.

Like what you see?

If so, talk with the SAS Visual Analytics administrator in your organization and ask them to enable support for viewing your SAS Visual Analytics reports in the SAS Mobile BI app.

To give you a little more guidance, here are some of the most frequently asked questions about SAS Mobile BI.

Do we need a special license to use SAS Mobile BI?

Absolutely not! It is a free app from SAS that is available to anyone who wants to download it on to an Apple or Android device and view reports and dashboards.

Do we need to have SAS Visual Analytics in our organization to use the app?

No, you don’t.

When the app is downloaded to your device, sample charts and reports are made available to you by SAS via a server connection to the SAS Demo Server.

How do we get this free app?

You can download the free app to your Apple or Android device from:

Apple iTunes Store
Google Play

After downloading the app, what are some things we can do with it?

Well, as soon as you download the app and open it, you have a connection established to the SAS Demo Server which hosts a variety of sample reports with data associated with different industries. These reports give you an idea of the various features that are supported by the app. Here is a snapshot of the different folders that contain sample reports for you to browse and view.

 


Here is the Portfolio where subscribed reports are available in the form of thumbnails that can be tapped to open them.


If we have reports in our organization that were created with SAS Visual Analytics Designer, can we view those reports in this app?

Absolutely, yes!

The same reports that you view in your web browser on a desktop can be viewed in SAS Mobile BI by using your Apple or Android devices (both tablets and smartphones).  Simply define a connection to your company server and browse for reports. Access to reports from SAS Mobile BI is granted by your SAS Administrator. Report access can be based upon various Capabilities and Permissions.

Live data access requires either a cellular data or a Wi-Fi connection, and your company may require VPN access or other company-specific security measures. Contact your SAS Visual Analytics administrator, and be sure to mention that there is a comprehensive Help menu within the app that provides answers to questions they might have.

What type of features are available in reports that are viewed from SAS Mobile BI?

Typically, almost all of the features available in the desktop-based reports are supported in the app. Drilling, filtering, brushing, display rules, linking and info windows are just some of the key features available when you interact with reports via this app.

I hope you’ve found this blog and the accompanying FAQ helpful. In my next blog I’ll cover some basic security features we’ve built into the app to protect user data.

tags: SAS Mobile BI, SAS R&D, SAS Visual Analytics

SAS Mobile BI FAQ was published on SAS Users.

March 25, 2014
 
“My goal is to constantly improve the quality and stability of our software while at the same time innovating,” said Vice President of SAS Research and Development Armistead Sapp yesterday at the SAS Global Forum Technology Connection. Hosted by Product Management Director Michele Eggers, the Technology Connection focused not only on […]
February 15, 2014
 
Bridging the Rift between Dev and Ops As a member of the Product Marketing team at SAS, I spend a good part of my time researching – analyst reports, industry journals, blogs, social channels – and listening to what our customers are saying. Early last spring I began noticing the term [...]