SAS Event Stream Processing (ESP) can not only process structured streaming events (a collection of fields) in real time, but also offers advanced features for collecting and analyzing unstructured events. Twitter is one of the best-known social network applications and probably the first that comes to mind when thinking about streaming data sources. SAS, for its part, offers powerful solutions for analyzing unstructured data with SAS Text Analytics. This post brings the two needs together: collecting unstructured data from Twitter and running text analytics on tweets (contextual extraction, content categorization, and sentiment analysis).
Before moving forward, a word on architecture: SAS ESP is based on a publish and subscribe model. Events are injected into an ESP model using an “adapter” or a “connector,” or using Python and the publisher API. Target applications consume the enriched events output by ESP using the same technology, “adapters” and “connectors.” SAS ESP provides many of them in order to integrate with static and dynamic applications.
An ESP model flow is composed of “windows,” each of which represents a type of transformation to perform on streaming events. A window can do basic data management (join, compute, filter, aggregate, etc.) as well as advanced processing (data quality, pattern detection, streaming analytics, etc.).
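To make the idea of chained windows concrete, here is a minimal sketch in plain Python. It is not the ESP modeling API: the function names `filter_window`, `compute_window`, and `aggregate_window`, the `lang` and `text_length` fields, and the sample events are all hypothetical and chosen only for illustration. It also processes a finite list, whereas a real ESP model updates its results continuously as events arrive.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

Event = Dict[str, object]  # an event is a collection of named fields


def filter_window(events: Iterable[Event], keyword: str) -> Iterator[Event]:
    """Keep only events whose text mentions the keyword (case-insensitive)."""
    for event in events:
        if keyword.lower() in str(event["tw_Text"]).lower():
            yield event


def compute_window(events: Iterable[Event]) -> Iterator[Event]:
    """Derive a new field from existing ones, here the length of the text."""
    for event in events:
        yield {**event, "text_length": len(str(event["tw_Text"]))}


def aggregate_window(events: Iterable[Event]) -> Counter:
    """Count events per language code (hypothetical 'lang' field)."""
    return Counter(str(event["lang"]) for event in events)


# Hypothetical input events standing in for a source window.
stream = [
    {"tw_Text": "New iPhone looks great", "lang": "en"},
    {"tw_Text": "Rainy day in Paris", "lang": "en"},
    {"tw_Text": "Mon iphone est cassé", "lang": "fr"},
]

# Windows are chained: the output of one feeds the next.
counts = aggregate_window(compute_window(filter_window(stream, "iphone")))
print(dict(counts))  # {'en': 1, 'fr': 1}
```

Generators are used here so that each stage handles one event at a time, which is loosely analogous to how events flow from window to window.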
SAS ESP Twitter Adapters background
SAS ESP 4.2 provides two adapters to connect to Twitter as a data source and to publish events from Twitter (one event per tweet) to a running ESP model. There are no equivalent connectors for Twitter.
Both adapters are publisher-only:
- Twitter Publisher Adapter
- Twitter Gnip Publisher Adapter
The second one is more advanced: it uses a different API (Gnip, acquired by Twitter) and provides additional capabilities (access to historical tweets) and better performance. The adapter builds event blocks from a Twitter Gnip firehose stream and publishes them to a source window. Access to this Twitter stream is restricted to Twitter-approved parties and requires a signed agreement.
In this article, we will focus on the first adapter. It consumes Twitter streams and injects event blocks into source windows of an ESP engine. It can be used free of charge: the default access level of a Twitter account allows us to use the following methods:
- Sample: Starts listening on a random sample of all public statuses.
- Filter: Starts consuming public statuses that match one or more filter predicates.
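The difference between the two methods can be sketched in a few lines of plain Python. This is only an illustration of the semantics described above, not the adapter's implementation: `sample_stream`, `filter_stream`, the `rate` parameter, and the sample tweets are hypothetical, and only the field name `tw_Text` comes from the project discussed in this post.

```python
import random
from typing import Callable, Dict, Iterable, Iterator, List

Tweet = Dict[str, str]
Predicate = Callable[[Tweet], bool]


def sample_stream(tweets: Iterable[Tweet], rate: float, seed: int = 0) -> Iterator[Tweet]:
    """Mimic 'Sample': pass through a random fraction of all statuses."""
    rng = random.Random(seed)
    for tweet in tweets:
        if rng.random() < rate:
            yield tweet


def filter_stream(tweets: Iterable[Tweet], predicates: List[Predicate]) -> Iterator[Tweet]:
    """Mimic 'Filter': keep statuses that match one or more predicates."""
    for tweet in tweets:
        if any(predicate(tweet) for predicate in predicates):
            yield tweet


def mentions_iphone(tweet: Tweet) -> bool:
    return "iphone" in tweet["tw_Text"].lower()


def mentions_game(tweet: Tweet) -> bool:
    return "game" in tweet["tw_Text"].lower()


# Hypothetical statuses used only to exercise the two functions.
tweets = [
    {"tw_Text": "My iPhone battery is dying"},
    {"tw_Text": "Coffee time"},
    {"tw_Text": "Watching the game tonight"},
]

matched = list(filter_stream(tweets, [mentions_iphone, mentions_game]))
print([t["tw_Text"] for t in matched])
```

With Sample, the content of a tweet plays no role in whether it is delivered; with Filter, a tweet is delivered only if at least one predicate accepts it.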
SAS ESP Text Analytics background
SAS ESP 4.1/4.2 provides three window types (event transformation nodes) to perform Text Analytics in real time on incoming events.
The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation.
Here are the SAS ESP Text Analytics features:
- “Text Category” window:
  - Content categorization or document classification into topics
  - Automatically identify or extract content that matches predefined criteria to more easily search by, report on, and model/segment by important themes
  - Relies on “.mco” binary files coming from SAS Contextual Analysis solution
- “Text Context” window:
  - Contextual extraction of named entities (people, titles, locations, dates, companies, etc.) or facts of interest
  - Relies on “.li” binary files coming from SAS Contextual Analysis solution
- “Text Sentiment” window:
  - Sentiment analysis of text coming from documents, social networks, emails, etc.
  - Classify documents and specific attributes/features as having positive, negative, or neutral/mixed tone
  - Relies on “.sam” binary files coming from SAS Sentiment Analysis solution
Binary files (“.mco”, “.li”, “.sam”) cannot be reverse engineered. The original projects in their corresponding solutions (SAS Contextual Analysis or SAS Sentiment Analysis) should be used to perform modifications on those binaries.
The ESP project
The following ESP project aims to:
- Wait for events coming from Twitter in the sourceTwitter window (this is a source window, the only entry point for streaming events)
- Perform basic event processing and counting
- Perform text analytics on tweets (in the input stream, the tweet text is injected as a single field)
Let’s have a look at potential text analytics results.
Here is a sample of the Twitter stream that SAS ESP is able to catch (the tweet text is collected in a field called tw_Text):
The “Text Category” window, with an associated “.mco” file, is able to classify tweets into topics/categories with a related score:
The “Text Context” window, with an associated “.li” file, is able to extract terms and their corresponding entity (person, location, currency, etc.) from a tweet:
The “Text Sentiment” window, with an associated “.sam” file, is able to determine a sentiment with a probability from a tweet:
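To give a feel for the shape of such a sentiment result (a label plus a number attached to it), here is a deliberately naive sketch. It is a toy word-list approach written for this illustration, not the model stored in a “.sam” file and not how the “Text Sentiment” window works internally. The word lists, the function name `toy_sentiment`, and the `confidence` value (a crude ratio, not a calibrated probability) are all assumptions.

```python
import re
from typing import Dict

# Tiny hypothetical lexicons; a real sentiment model is far richer.
POSITIVE = {"love", "great", "good", "amazing", "happy"}
NEGATIVE = {"hate", "bad", "broken", "terrible", "slow"}


def toy_sentiment(text: str) -> Dict[str, object]:
    """Return a sentiment label and a crude confidence for one tweet text."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    if pos == neg:
        # No net evidence either way (including no matching words at all).
        return {"sentiment": "neutral", "confidence": 0.0}
    label = "positive" if pos > neg else "negative"
    # Share of matched words that support the chosen label.
    return {"sentiment": label, "confidence": max(pos, neg) / (pos + neg)}


print(toy_sentiment("I love my new iPhone, it is great"))
print(toy_sentiment("Great camera but terrible battery, so slow"))
print(toy_sentiment("Just landed in Paris"))
```

In the ESP project, the equivalent output is produced per event by the window itself, driven by the binary model file rather than by a hand-written word list.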
Run the Twitter adapter
To inject events into the running ESP model, start the Twitter adapter; it then publishes live tweets into the sourceTwitter window of our model.
Here we search for tweets containing “iphone”, but you can change to any keyword you want to track (assuming people are tweeting on that keyword…).
Many additional options are available: -f follows specific user IDs, -p specifies locations of interest, etc.
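The sketch below illustrates how keyword, user ID, and location criteria could select tweets. It does not reproduce the adapter's command line or code. It assumes the criteria are combined with OR, which is how Twitter's streaming filter endpoint has been documented, and that a location is a bounding box given as (west longitude, south latitude, east longitude, north latitude). The names `Tweet`, `FilterSpec`, `track`, `follow`, `locations`, and `matches` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# Assumed layout: (west_lon, south_lat, east_lon, north_lat)
BoundingBox = Tuple[float, float, float, float]


@dataclass
class Tweet:
    text: str
    user_id: int
    coordinates: Optional[Tuple[float, float]] = None  # (lon, lat), if geotagged


@dataclass
class FilterSpec:
    track: List[str] = field(default_factory=list)              # keywords
    follow: Set[int] = field(default_factory=set)               # user IDs
    locations: List[BoundingBox] = field(default_factory=list)  # areas of interest


def in_box(point: Tuple[float, float], box: BoundingBox) -> bool:
    """True if the (lon, lat) point lies inside the bounding box."""
    lon, lat = point
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


def matches(tweet: Tweet, spec: FilterSpec) -> bool:
    """A tweet is kept if ANY criterion accepts it (assumed OR semantics)."""
    text = tweet.text.lower()
    if any(keyword.lower() in text for keyword in spec.track):
        return True
    if tweet.user_id in spec.follow:
        return True
    if tweet.coordinates is not None:
        return any(in_box(tweet.coordinates, box) for box in spec.locations)
    return False


# Hypothetical criteria: a keyword, one user ID, and a box roughly around Paris.
spec = FilterSpec(track=["iphone"], follow={42}, locations=[(2.2, 48.8, 2.5, 48.9)])
print(matches(Tweet("Love my iPhone", user_id=1), spec))
print(matches(Tweet("hello", user_id=1, coordinates=(0.0, 0.0)), spec))
```

Because the criteria are OR-combined in this sketch, adding a followed user or a location widens the set of tweets received rather than narrowing it; whether the adapter behaves the same way should be checked against its documentation.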
Consume enriched events with SAS ESP Streamviewer
SAS ESP Streamviewer provides a way to render events graphically in real time. Here is an example of real-time events consumed in a dashboard.
With SAS ESP, you can bring the power of SAS Analytics into the real-time world. Performing text analytics (content categorization, sentiment analysis, reputation management, etc.) on the fly on text coming from tweets, documents, emails, etc., and then triggering relevant actions, has never been so simple or so fast.