January 26, 2017

Recently a colleague told me that Google had published new, interesting data sets on BigQuery. I found a lot of Reddit data there as well, so I quickly tried running BigQuery queries against this text data to see what I could produce. After getting some pretty interesting results, I wanted to see whether I could implement the same analysis with SAS, and whether SAS Text Mining would give deeper insights than simple queries. So I tried SAS on the Reddit comments data, and I’d like to share my analyses and findings with you.

Analysis 1: Significant Words

To get started with BigQuery, I googled what others were sharing about BigQuery and Reddit, and I found the article “Using BigQuery with Reddit Data.” In this article the author posted a query for extracting significant words from the Politics subreddit. I then wrote a SAS program to mimic this query and got the following data from the July 2016 Reddit comments. The result is not exactly the same as the one from BigQuery, because I downloaded the Reddit data from another web site and used the SAS Text Parsing action to parse the comments into tokens rather than just splitting them on white space.
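The effect of tokenization on word counts can be seen in a small sketch. It is written in Python rather than SAS, the sample comments are made up, and the regular expression is only a crude stand-in for a real text parser such as the SAS Text Parsing action.

```python
import re
from collections import Counter

# Hypothetical sample comments, not taken from the Reddit data.
comments = [
    "Trump said it. Trump, again!",
    "Hillary's emails? Emails, emails...",
]

def whitespace_tokens(text):
    # Split on white space only, so punctuation stays attached to words.
    return text.lower().split()

def parsed_tokens(text):
    # Crude stand-in for a text parser: keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

ws_counts = Counter(t for c in comments for t in whitespace_tokens(c))
parsed_counts = Counter(t for c in comments for t in parsed_tokens(c))

# Compare how often each approach sees the same word.
print("trump:", ws_counts["trump"], "vs", parsed_counts["trump"])
print("emails:", ws_counts["emails"], "vs", parsed_counts["emails"])
```

With white-space splitting, tokens such as "trump," and "emails?" are counted as different words, which is one reason the two result lists can differ.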

Analysis 2: Daily Submissions

The words Trump and Hillary in the list caught my interest and begged for further analysis. So I did a daily analysis to understand how hot Trump and Hillary were during this month. I filtered all comments mentioning Trump or Hillary in the Politics subreddit and counted the total submissions per day. The resulting time series plot is shown below.
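The filter-and-count step can be sketched as follows. This is Python rather than the SAS program used for the plot, and the records and the field names `created` and `body` are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical comment records; the field names are assumptions.
comments = [
    {"created": "2016-07-05T09:15:00", "body": "Trump reacts to the FBI news"},
    {"created": "2016-07-05T10:02:00", "body": "What will hillary do now?"},
    {"created": "2016-07-05T11:40:00", "body": "Nothing to see here"},
    {"created": "2016-07-12T08:30:00", "body": "Sanders endorses Hillary"},
]

# Match either name as a whole word, ignoring case.
pattern = re.compile(r"\b(trump|hillary)\b", re.IGNORECASE)

# Count matching comments per calendar day (first 10 characters of the timestamp).
daily_counts = Counter(
    c["created"][:10] for c in comments if pattern.search(c["body"])
)

for day, n in sorted(daily_counts.items()):
    print(day, n)
```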

I found several spikes in the plot, which happened on 2016/7/5, 2016/7/12, 2016/7/21, and 2016/7/26.

Analysis 3: Topics Time Line

I wondered what Reddit users were concerned about on these specific days, so I extracted the top 10 topics from all comments submitted in July 2016 within the Politics subreddit and got the following data. These topics clearly focused on several aspects, such as voting, presidential candidates, parties, and hot news such as Hillary’s email probe.

The topics showed what people were concerned about over the whole month, but I needed further investigation to explain which topic contributed most to each of the four spikes. The topics’ time series plot helped me find the answer.

Some topics’ time series trends are very close, and it is hard to tell which topic contributed most, so I picked the top contributing topic based on daily percentage growth. The top growth topic on July 5 is “emails, dnc, +server, hillary, +classify”, with growth of 256.23 times.
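Picking the topic with the largest day-over-day growth can be sketched as below. The post does not give its exact growth formula, so the relative-change formula here is an assumption, and the counts are invented for illustration (Python, not SAS).

```python
# Hypothetical daily comment counts per topic; the numbers are made up.
topic_counts = {
    "emails, dnc, +server": {"2016-07-04": 10, "2016-07-05": 400},
    "vote, +party": {"2016-07-04": 200, "2016-07-05": 300},
}

def growth(prev, curr):
    # Relative growth from one day to the next; assumes prev > 0.
    # This formula is an assumption, not necessarily the one used in the post.
    return (curr - prev) / prev

def top_growth_topic(counts, prev_day, day):
    # Return the topic whose count grew the most between the two days.
    return max(counts, key=lambda t: growth(counts[t][prev_day], counts[t][day]))

best = top_growth_topic(topic_counts, "2016-07-04", "2016-07-05")
print(best)
```

Ranking by growth rather than by raw volume is what separates a topic that suddenly spikes from one that is merely large every day.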

Its time series plot also shows a high spike on July 5. Then I googled “July 5, 2016 emails dnc server hillary classify” and got the following news.

There is no doubt that the spike on July 5 is related to the FBI’s decision about Clinton’s email probe. To confirm this, I extracted the top 20 Reddit comments submitted on July 5, ranked by Reddit score. I quote part of the top comment below, and I found that the link in the comment was also included in Google’s search results.

"...under normal circumstances, security clearances would be revoked. " This is your FBI. EDIT: I took paraphrased quote, this is the actual quote as per - "

Similar analysis was done for the other three days, and the hot topics are as follows.

Interestingly, one person did a sentiment analysis with Twitter data, and the tweet submission trend for July looks the same as Reddit’s.

In his blog post, he listed several important events that happened in July.

  • July 5th: the FBI says it’s going to end Clinton’s email probe and will not recommend prosecution.
  • July 12th: Bernie Sanders endorses Hillary Clinton for president.
  • July 21st: Donald Trump accepts the Republican nomination.
  • July 25-28: Clinton accepts the nomination at the DNC.

This shows that different social media platforms respond to the same events with similar trends.

Now I know why these spikes happened. However, more questions came to my mind.

  • Who started posting this news?
  • Were there cyber armies?
  • Who were opinion leaders in the politics community?

I believe all these questions can be answered by analyzing the data with SAS.

tags: SAS R&D, SAS Text Analytics

Analyzing Trump v. Clinton text data at Reddit was published on SAS Users.

December 14, 2016

Last week, I attended the IALP 2016 conference (20th International Conference on Asian Language Processing) in Taiwan. After the conference, each presenter received a USB drive with all accepted papers in PDF format. So when I got back to Beijing, I began going through the papers to extend my learning. Usually, when I return from a conference, I go through all the paper titles and my conference notes, then choose the most interesting articles and dive into them for details. I then summarize the important research discoveries in one document. This always takes me several days or more to complete.

This time, I decided to try SAS Text Analytics to help me read papers efficiently. Here’s how I did it.

My first experiment was to generate a word cloud of all papers. I used these three steps.

Step 1: Convert PDF collections into text files.

With the SAS procedure TGFilter and SAS Document Conversion Server, you can convert PDF collections into a SAS dataset. If you don’t have SAS Document Conversion Server, you can download pdftotext for free. Pdftotext only converts PDF files into text files, so you need to write SAS code to import all the text files into a dataset. Moreover, if you use pdftotext, you need to check whether each PDF file was converted correctly. It’s tedious to check the texts one by one, so look for a smarter way to do this check. The TGFilter procedure has language detection, and the detected language of a garbage document is empty rather than English. So I recommend TGFilter: you can then filter out garbage documents easily with a WHERE clause that keeps only documents whose language equals ‘English’.
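For readers who go the pdftotext route, one possible way to automate the check is a simple heuristic that asks whether the converted text contains enough common English words. The sketch below is Python, not SAS, and it is only a rough stand-in for real language detection; the word list, the threshold and the sample strings are all assumptions.

```python
import re

# Small illustrative list of very common English words (an assumption, not a SAS list).
COMMON_ENGLISH = {"the", "of", "and", "to", "in", "a", "is", "that", "for", "we"}

def looks_english(text, min_share=0.1):
    # Treat a document as English if enough of its tokens are common English words.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in COMMON_ENGLISH)
    return hits / len(tokens) >= min_share

# Hypothetical conversion results: one clean, one garbled.
good = "In this paper we propose a method for the segmentation of Chinese text."
garbage = "(cid:12)(cid:45)(cid:3) (cid:88)(cid:17)"

print(looks_english(good), looks_english(garbage))
```

A heuristic like this will misjudge very short documents, so it is best used to flag candidates for a manual look rather than to discard files automatically.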

Step 2: Parse documents into words and get word frequencies.

Run the SAS procedure HPTMINE or TGPARSE against the document dataset with stemming turned on and the English stop-word list released by SAS, and you get the frequencies of all stems.
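The idea behind this step, counting terms after dropping stop words, can be sketched in a few lines. This is Python rather than HPTMINE or TGPARSE, the stop list is a tiny made-up sample, and stemming is left out to keep the sketch short.

```python
import re
from collections import Counter

# Tiny illustrative stop list; the list released by SAS is much larger.
STOP_WORDS = {"the", "a", "of", "and", "is", "for", "in", "to"}

def term_frequencies(documents):
    # Count every lower-cased word that is not a stop word (no stemming here).
    counts = Counter()
    for doc in documents:
        for token in re.findall(r"[a-z]+", doc.lower()):
            if token not in STOP_WORDS:
                counts[token] += 1
    return counts

# Hypothetical documents standing in for the converted papers.
docs = [
    "The model of the corpus",
    "A corpus for the language model and the system",
]
freqs = term_frequencies(docs)
print(freqs.most_common(3))
```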

Step 3: Generate word cloud plot.

Once you have term frequencies, you can use either SAS Visual Analytics or R to generate the word cloud plot. I like programming, so I used the SAS procedure IML to submit R scripts from SAS.

These steps generated a word cloud with the top 500 words of 66 papers. There were 87 papers in total; 21 of them could not be converted correctly by SAS Document Conversion Server, and 19 could not be converted correctly by pdftotext.


Figure-1 Word Cloud of Top 500 Words of 66 Papers

From Figure-1, it is easy to see that the top 10 words were: word, language, chinese, model, system, sentence, corpus, information, feature and method. IALP is an international conference and does not focus on Chinese only. However, there was a shared task at this year’s conference, whose purpose was to predict the valence and arousal ratings of traditional Chinese affective words. Moreover, each team that took part in the shared task was asked to submit a paper introducing its work, so “Chinese” contributed more than other languages to the word cloud.

You might think there’s a lot of noise when paper bodies are used for word cloud analysis, so my second experiment was to generate a word cloud of the paper abstracts through similar processing. You can view the results in Figure-2.


Figure-2 Word Cloud of Top 500 Words of 66 Paper Abstracts

The top 10 words from the paper abstracts were: word, chinese, system, language, method, paper, propose, sentiment, base and corpus. These are quite a bit different from the top 10 words extracted from the paper bodies. Characters, words and sentences are fundamental components of natural languages. For machine learning in NLP (natural language processing), an annotated corpus is the key. Without a corpus, you cannot build any model. However, annotated corpora are very rare even in the big data era, and that is why so many researchers put effort into annotation.
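To make the difference between the two top-10 lists concrete, they can be compared directly. The sketch below is Python, not SAS, and uses only the two word lists reported above.

```python
# Top 10 words reported for the paper bodies (Figure-1).
body_top10 = ["word", "language", "chinese", "model", "system",
              "sentence", "corpus", "information", "feature", "method"]

# Top 10 words reported for the paper abstracts (Figure-2).
abstract_top10 = ["word", "chinese", "system", "language", "method",
                  "paper", "propose", "sentiment", "base", "corpus"]

# Words in both lists, and words unique to each list, keeping the original order.
shared = [w for w in body_top10 if w in abstract_top10]
only_bodies = [w for w in body_top10 if w not in abstract_top10]
only_abstracts = [w for w in abstract_top10 if w not in body_top10]

print("shared:", shared)
print("only in bodies:", only_bodies)
print("only in abstracts:", only_abstracts)
```

Six of the ten words appear in both lists; model, sentence, information and feature appear only for the bodies, while paper, propose, sentiment and base appear only for the abstracts.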

Can we do more analyses with SAS? Of course. We may analyze keywords, references, paper influence, paper categorization, etc. I hope I have time to try these interesting analyses and share my work with you in a future blog.

The SAS script for the paper word cloud is below.

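* Step 1: Convert PDF collections into a SAS dataset;
* NOTE: This step requires a licensed SAS Document Conversion Server;
* NOTE: The listing is cut off after TGSERVICE=, so any remaining TGFILTER options (for example the source folder) are not shown;
proc tgfilter out=paper(where=(text ne '' and language eq 'English'))
   tgservice=(hostname="yourhost" port=yourport);
run;

* Step 2: Parse documents into words and get word frequencies;
* Add document id for each document;
data paper;
   set paper;
   document_id = _n_;
run;

* STOP= and OUTTERMS= are assumed completions: the text mentions the SAS English stop list, and the query below reads a table named key;
proc hptmine data=paper;
   doc_id document_id;
   var text;
   parse notagging nonoungroups termwgt=none cellwgt=none
         stop=sashelp.engstop outterms=key;
run;

* Get stem freq data (terms with no parent);
proc sql noprint;
   create table paper_stem_freq as
   select term as word, freq as freq
   from key
   where missing(parent) eq 1
   order by freq descending;
quit;

* Step 3: Generate word cloud plot with R;
data topstems;
   set paper_stem_freq(obs=500);
run;

proc iml;
call ExportDataSetToR("topstems", "d");
submit / R;
# sort by frequency
d <- d[order(d$freq, decreasing = TRUE), ]
# The listing is cut off here. The plotting call below is an assumed completion using the R wordcloud package
library(wordcloud)
wordcloud(d$word, d$freq)
endsubmit;
quit;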
tags: SAS R&D, SAS Text Analytics

Fun with SAS Text Analytics: A qualitative analysis of IALP papers was published on SAS Users.

May 12, 2016

SAS Visual Analytics users view and interact with reports on their desktop computers or laptops. Many, however, have never heard of the SAS Mobile BI app or how it extends the viewing and interactive capabilities of their reports to mobile devices. The SAS Mobile BI app is simple to download to a mobile device, and you can immediately view some sample SAS Visual Analytics reports from within the app.

If your organization has deployed SAS Visual Analytics but is not taking advantage of the ability to view and interact with reports via mobile devices, I urge you to consider it. Almost every type of interaction that you have with a SAS Visual Analytics report on your desktop is also available for reports viewed in SAS Mobile BI!

It’s worth noting that having SAS Visual Analytics in your organization is not a requirement for downloading this nifty, free app onto your Android or iOS mobile devices. Once you download the app, you can view and interact with a wide spectrum of sample SAS Visual Analytics reports for different industries.

Like what you see?

If so, talk with the SAS Visual Analytics administrator in your organization and ask them to enable support for viewing your SAS Visual Analytics reports in the SAS Mobile BI app.

To give you a little more guidance, here are some of the most frequently asked questions about SAS Mobile BI.

Do we need a special license to use SAS Mobile BI?

Absolutely not! It is a free app from SAS that is available to anyone who wants to download it onto an Apple or Android device and view reports and dashboards.

Do we need to have SAS Visual Analytics in our organization to use the app?

No, you don’t.

When the app is downloaded to your device, sample charts and reports are made available to you by SAS via a server connection to the SAS Demo Server.

How do we get this free app?

You can download the free app to your Apple or Android device from:

Apple iTunes Store
Google Play

After downloading the app, what are some things we can do with it?

Well, as soon as you download the app and open it, you have a connection established to the SAS Demo Server which hosts a variety of sample reports with data associated with different industries. These reports give you an idea of the various features that are supported by the app. Here is a snapshot of the different folders that contain sample reports for you to browse and view.



Here is the Portfolio where subscribed reports are available in the form of thumbnails that can be tapped to open them.


If we have reports in our organization that were created with SAS Visual Analytics Designer, can we view those reports in this app?

Absolutely, yes!

The same reports that you view in your web browser on a desktop can be viewed in SAS Mobile BI by using your Apple or Android devices (both tablets and smartphones).  Simply define a connection to your company server and browse for reports. Access to reports from SAS Mobile BI is granted by your SAS Administrator. Report access can be based upon various Capabilities and Permissions.

Live data access requires either a cellular data or a Wi-Fi connection, and your company may require VPN access or other company-specific security measures. Contact your SAS Visual Analytics Administrator. Also, be sure to mention that there is a comprehensive Help menu within the app that provides answers to queries they might have.

What types of features are available in reports that are viewed from SAS Mobile BI?

Typically, almost all of the features that are available in the desktop-based reports are supported in the app. Drilling, filtering, brushing, display rules, linking and info windows are just some of the key features available to you when you interact with reports via this app.

I hope you’ve found this blog and the accompanying FAQ helpful. In my next blog I’ll cover some basic security features we’ve built into the app to protect user data.

tags: SAS Mobile BI, SAS R&D, SAS Visual Analytics

SAS Mobile BI FAQ was published on SAS Users.

March 25, 2014
“My goal is to constantly improve the quality and stability of our software while at the same time innovating,” said Vice President of SAS Research and Development Armistead Sapp yesterday at the SAS Global Forum Technology Connection. Hosted by Product Management Director Michele Eggers, the Technology Connection focused not only on […]
February 15, 2014
Bridging the Rift between Dev and Ops As a member of the Product Marketing team at SAS, I spend a good part of my time researching – analyst reports, industry journals, blogs, social channels – and listening to what our customers are saying. Early last spring I began noticing the term [...]
October 14, 2013
Look anywhere within the software industry, and you’ll see agile under construction. You’ve read about how it’s helping teams streamline large projects into smaller ones that are easier to manage and deliver. Being swift and agile has long been one of our company values here at SAS, and executing with [...]
May 3, 2013
Enthusiastic SAS Global Forum users flocked to the SAS Business Intelligence Development Roundtable for an interactive panel discussion on the current BI and reporting solutions portfolio including SAS Enterprise BI Server, SAS Enterprise Guide, SAS Mobile BI and SAS Visual Analytics and the future focus of the products. Panelists included SAS representatives from Product [...]
February 7, 2012
In less than two weeks, I’ll be in sunny Florida attending a brand new conference. According to their Web site, Statistical Practice 2012 aims to bring together hundreds of statistical practitioners, including data analysts, researchers and scientists who engage in the application of statistics to solve real-world problems on a daily basis. [...]