SAS Visual Analytics

October 9, 2019
 

What can data tell us about the easiest hole at your favorite golf course? Or which hole contributes the most to mastering the course? A golf instructor once told me golf is not my sport, and that my swing is hopeless, but that didn’t stop me from analyzing golf data. [...]

How to win the SAS Championship was published on SAS Voices by Frank Silva

September 26, 2019
 

Mirror, mirror on the wall, whose conference presentations are the best of all?

OK, well it doesn't quite go that way in the fairy tale, but remakes and reimaginings of classic tales have been plentiful in books (see The Shadow Queen), on the big screen (see Maleficent, which is about to get a sequel), on the little screen (see the seven seasons of Once Upon a Time) and even on stage and screen (see Into the Woods). So, why not take some liberties in the service of analytics?

For this blog, I have turned our analytics mirror inward and gazed at the social media messages from four SAS conferences: SAS Global Forum 2018 in Denver, Analytics Experience 2018 in San Diego, Analytics Experience 2018 in Milan, and the 2019 Analyst Conference in Naples. While simply counting retweets could provide insight into what was popular, I wanted to look deeper to answer the question: Which SAS conference presenters were most praised in social media, and how? Information extraction, specifically fact extraction, could help answer those questions.

Data preparation

Once upon a time, in a land far far away, there was a collection of social media messages, mostly Tweets, that the SAS social media department was kind enough to provide. I didn’t do much in terms of data preparation. I was only interested in unique messages, so I used Excel to remove duplicates based on the “Message” column.

Additionally, I kept only messages for which the language was listed as English, using the “language” column that was already provided in the data. SAS Text Analytics products support 33 languages, but for the purposes of this investigation I chose to focus on English only because the presentations were in English. Then, I imported this data, which was about 4,400 messages, into SAS Visual Text Analytics to explore it and create an information extraction model.

While exploring the data, I noticed that most of the tweets were in fact positive. Additionally, negation, such as “not great” for example, was generally absent. I took this finding into consideration while building my information extraction model: the rules did not have to account for negation, which made for a simpler model. No conniving sorcerer to battle in this tale!

Information extraction model

The magic wand here was SAS Visual Text Analytics. I created a rather simple concepts model with a top-level concept named posPerson, which extracted pairs of presenter mentions and positive words occurring within two sentences of those mentions. The model included several supporting concepts, as shown in this screenshot from the SAS Visual Text Analytics Concepts node.

Before I explain a little bit about each of the concepts, it is useful to understand how they relate to one another in the hierarchy represented in the following diagram. The lower-level concepts in the diagram are referenced in the rules of the higher-level ones.

Extending predefined concepts

The magic wand already came with predefined concepts such as nlpPerson and nlpOrganization (thanks, fairy godmother, ahem, SAS linguists). These concepts are included with Visual Text Analytics out of the box and allow users to tap into the knowledge of the SAS linguists for identifying person and organization names. Because Twitter handles, such as @oschabenberger and @randyguard, are not included in these predefined concepts, I expanded the predefined concepts with custom ones. The custom concepts for persons and organizations, customPerson and customOrg, referenced matches from the predefined concepts in addition to rules for combining the symbol @ from the atSymbol concept and various Twitter handles known to belong to persons and organizations, respectively. Here is the simple rule in the atSymbol concept that helps to accomplish this task:

CLASSIFIER:@ 

The screenshot below shows how the atSymbol concept and the personHandle concept are referenced together in the customPerson concept rule and produce matches, such as @RobertoVerganti and @mabel_pooe. Note also how the nlpPerson concept is referenced to produce matches, such as Oliver Schabenberger and Mary Beth Moore, in the same customPerson concept.
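The full customPerson rules appeared only in that screenshot, but they amount to something like the following hypothetical sketch (the concept names atSymbol, personHandle and nlpPerson come from the model described above; the exact rule bodies are illustrative, not the originals):

# reuse the predefined person-name concept
CONCEPT:nlpPerson
# "@" immediately followed by a known person handle
CONCEPT:atSymbol personHandle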

If you are interested in learning more about information extraction rules like the ones used in this blog, check out the book SAS Text Analytics for Business Applications: Concept Rules for Information Extraction Models, which my colleagues Teresa Jade and Michael Wallis co-authored with me. It’s a helpful guide for using your own magic wand for information extraction!

Exploratory use of the Sandbox

Visual Text Analytics also comes with its own crystal ball: the Sandbox feature. In the Sandbox, I refined the concept rules iteratively and was able to run the rules for each concept faster than running the entire model. Gazing into this crystal ball, I could quickly see how rule changes for one concept impacted matches.

In an exploratory step, I made the rules in the personHandle concept as general as possible, using part-of-speech tags such as :N (noun) and :PN (proper noun) in the definitions. As I explored the matches to those rules, I identified matches that were actually organization handles, which I then added as CLASSIFIER rules to the orgHandle concept by double-clicking a word or phrase and right-clicking to add that string to a rule.
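The exploratory rules themselves are not shown in the post; a hypothetical sketch of this kind of deliberately over-general rule, using the part-of-speech tags just mentioned, might be:

# treat any noun or proper noun following "@" as a candidate handle
C_CONCEPT:atSymbol _c{:N}
C_CONCEPT:atSymbol _c{:PN}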

I noticed that some handles were very similar to each other, and that REGEX rules captured the possible combinations more efficiently. Consult the book referenced above if you’re interested in understanding more about different rule types and how to use them effectively. After moving the rules to the Edit Concept tab, the rules for orgHandle included some of the ones in the following screenshot.
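That screenshot is not reproduced here, but the mix of rule types looks roughly like this hypothetical sketch (the handles and the REGEX pattern are illustrative, not the actual rules):

CLASSIFIER:SASsoftware
CLASSIFIER:SASGlobalForum
# one REGEX rule covering a family of similar handles
REGEX:SAS[A-Za-z0-9_]+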

Automatic concept rule generation

Turning now to the second part of the original question (what words and phrases people used to praise the presenters), the answers came from two custom concepts: posAdj and posPhrase. The posAdj concept had rules that captured adjectives with positive sentiment, such as the following:
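(The rules appeared in a screenshot in the original post; the lines below are a hypothetical reconstruction. The @ expansion symbol stands in for the stemmed forms the product generates, and the terms beyond "great" are purely illustrative.)

CONCEPT:great@
CONCEPT:amazing@
CLASSIFIER:inspiring
CLASSIFIER:excellent
CLASSIFIER:fantastic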

Most of these were captured from the text of the messages in the same manner as the person and organization Twitter handles.

But the first two were created automatically by way of black magic! When I selected a term from the Textual Elements, as you can see below for the term “great”, the system automatically created the first rule in the concept above, including the comparative form, “greater,” and the superlative, “greatest.” This is black magic harnessing the power of stemming or lemmatization.

The concept posPhrase built onto the posAdj concept by capturing the nouns that typically follow the adjectives in the first concept as well as a few other strings that have a positive connotation.
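A hypothetical sketch of what such rules could look like (posAdj is the concept defined above; the literal strings are illustrative):

# an adjective from posAdj followed by a noun, e.g. "great event" or "good talk"
CONCEPT:posAdj :N
# other strings with a positive connotation
CLASSIFIER:well done
CLASSIFIER:congratulations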

Filtering with global rules

Because the rules created overlapping matches, I took advantage of a globalRule concept, which allowed me to distinguish between the poisoned apples and the edible ones. Global rules served the following purposes:

  1. to remove matches from the more generally defined customPerson concept that were also matched for the customOrg concept
  2. to remove matches from the posAdj concept (such as “good”) that were also matched in the posPhrase concept (such as “good talk”)
  3. to remove false positives
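For the first two purposes, the global rules take roughly the following form, a hypothetical sketch that follows the same REMOVE_ITEM/ALIGNED pattern as the false-positive rule shown next:

# 1. drop person matches that are really organizations
REMOVE_ITEM:(ALIGNED, "_c{customPerson}", "customOrg")
# 2. drop a bare adjective when it is part of a longer positive phrase
REMOVE_ITEM:(ALIGNED, "_c{posAdj} :N", "posPhrase")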

As an example of a false positive, consider the following rule:

REMOVE_ITEM:(ALIGNED, "Data for _c{posAdj}", "Data for Good") 

Because the phrase “Data for Good” is a name of a program, the word “good” should not be taken into consideration in evaluating the positive mention. Therefore, the REMOVE_ITEM rule stated that when the posAdj concept match “good” is part of the phrase “Data for Good,” it should be removed from the posAdj concept matches.

Automatic fact rule generation

The top-most concept in the model, posPerson, took advantage of a magic potion called automatic fact rule building, which is another new feature added to the Visual Text Analytics product in the spring of 2019. This feature was used to put together matches from the posAdj and posPhrase concepts with matches from the customPerson concept without constructing the rule myself. It is a very useful feature for newer rule writers who want to explore the use of fact rules.

As input into the cauldron to make this magic potion, I selected the posAdj and customPerson concepts. These are the concepts I wanted the system to relate as facts.

I ran the node and inspected the autogenerated magic potion, i.e. the fact rule.

Then I did the same thing with the posPhrase and customPerson concepts. Each of the two rules that were created by Visual Text Analytics contained the SENT operator.

But I wanted to expand the context of the related concepts, so I tweaked the recipe a bit by replacing SENT with SENT_2 in order to look for matches within two sentences instead of one. I also replaced the names of the arguments, which the rule generation algorithm called concept1 and concept2, with ones that were more relevant to the task at hand: person and pos. Thus, the following rules were created:

PREDICATE_RULE:(person, pos):(SENT_2, "_person{customPerson}", "_pos{posAdj}")
PREDICATE_RULE:(person, pos):(SENT_2, "_person{customPerson}", "_pos{posPhrase}")

Results

So, what did the magic mirror show? Out of the 4,400 messages, I detected a reference to a person in about 1,650 (37%). In nearly 600 of the messages (14%) I extracted a positive phrase and in over 300 (7%) at least one positive adjective. Finally, only 7% (321) of the messages contained both a reference to a person and a positive comment within two sentences of each other.

I changed all but the posPerson and globalRule concepts to “supporting” so that they would not produce results and I could focus only on the relevant output. This step was akin to adjusting the mirror to focus only on the most important things and tuning out the background. You can learn more about this and other SAS Visual Text Analytics features in the User Guide.

Switching from the interactive view to the results view of the concepts node, I viewed the transactional output table.

With one click, I exported and opened this table in Visual Analytics in order to answer the questions of which presenters were mentioned most often and in the context of which positive words or phrases.

Visualization

With all of the magic items and preparation out of the way, I was ready to build a sparkly palace for my findings; that is, a report in Visual Analytics. On the left, I added a treemap of the most common matches for the person argument. On the right, I added a word cloud with the most common matches for the pos argument and connected it with the treemap on the left. In both cases I excluded missing values in order to focus on the extracted information. With my trees and clouds in place, I turned to the bottom of the report. There I added and connected a list table with the message (the entire input text) and the keywords (the span of text from the match for the first argument to the match for the last argument) for easy reference to the context behind the visualizations above.

Based on the visualization on the left, the person with the most positive social media messages was SAS Chief Operating Officer (COO) Dr. Oliver Schabenberger, who accounted for 12% of the messages that contained both a person and a positive comment. He was followed by the featured presenters at the Milan conference, Roberto Verganti, Anders Indset and Giles Hutchins. Next most represented were the featured presenters at the San Diego conference, Robyn Benincasa and Dr. Tricia Wang.

Looking at the visualization on the right, some of the most common phrases expressing praise for all the presenters were “important,” “well done,” “great event,” and “exciting.” Quite a few phrases also contain the term “inspiring,” such as “inspiring videos,” “inspiring keynote,” “inspiring talk,” “inspiring speech,” etc.

Because of the connections that I set up in Visual Analytics between these visualizations, if I want to look at what positive phrases were most commonly associated with each presenter, I can click on their name in the treemap on the left; as a result, the word cloud on the right as well as the list table on the bottom will filter out data from other presenters. For example, the view for Oliver Schabenberger shows that the most common positive phrase associated with tweets about him was “great discussion.”

Conclusions

It is not surprising that the highest accolades in this experiment went to SAS’ COO since he participated in all four conferences and therefore had four times the opportunity to garner positive messages. Similarly, the featured presenters probably had larger audiences than breakout-session presenters did, giving them more opportunities to be mentioned in social media messages. In this case, the reflection in the mirror is not too surprising. And they all lived happily ever after.

What tale does your social media data tell?

Learn more

A data fairy tale: Which speakers are the best received at conferences? was published on SAS Users.

September 13, 2019
 

Time-series decomposition is an important technique for time series analysis, especially for seasonal adjustment and trend strength measurement. Decomposition deconstructs a time series into several components, with each representing a certain pattern or characteristic. This post shows you how to use SAS® Visual Analytics to visually show the decomposition of a time series so that you can understand more about its underlying patterns.

Characteristics of time series decomposition

Time series decomposition generally splits a time series into three components, combined in an additive or multiplicative fashion: 1) a trend-cycle component, which can be further decomposed into trend and cycle components; 2) a seasonal component; and 3) a residual (irregular) component.

In additive decomposition, the cyclical, seasonal and residual components are absolute deviations from the trend component, and they do not depend on the trend level. In multiplicative decomposition, the cyclical, seasonal and residual components are relative deviations from the trend. Thus, the seasonal, cyclical and residual components often have magnitudes quite different from that of the trend component, while the trend component keeps the same scale as the original series.
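Written out with the component names used later in this post, the two forms are:

Additive: Original Series = Trend-Cycle Component + Seasonal Component + Irregular Component

Multiplicative: Original Series = Trend-Cycle Component * Seasonal Component * Irregular Component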

How to begin a time series decomposition

SAS provides several procedures for time series decomposition; I use PROC TIMESERIES in this post. The first step is to decide whether to use additive or multiplicative decomposition. PROC TIMESERIES provides multiplicative (MODE=MULT), additive (MODE=ADD), pseudo-additive (MODE=PSEUDOADD) and log-additive (MODE=LOGADD) decomposition. You can also use the default MODE option of MULTORADD to let SAS make the decision based on the features of your data. Conveniently, you can always use a log transformation whenever there is a need to change a multiplicative relationship into an additive one. The PLOTS option in PROC TIMESERIES can produce graphs of the generated trend-cycle component, seasonal component and residual component. In this post, I output the OUTDECOMP dataset from PROC TIMESERIES, load the data and visualize the decomposed time series with SAS Visual Analytics to understand more about their patterns.

See how it's done

I decompose the time series in the SASHELP.AIR dataset as an example. The series involves data about international air travel with monthly data points from Jan 1949 to Jan 1961, as pictured below:

We see an obvious upward trend and significant seasonality in the original series, with increasingly intense fluctuation around the trend. This indicates that a multiplicative decomposition of the trend and seasonality components is more appropriate. I get the decomposed components using the SAS code sketched below. Here I do not explicitly specify the MODE option, letting SAS use the default MODE=MULTORADD. Since the values in this time series are strictly positive, SAS ends up applying MODE=MULT to generate the decomposed series in the OUTDECOMP dataset (see details in the documentation).
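The original post links to the code rather than listing it; a minimal sketch of the call, assuming the SASHELP.AIR variable names DATE and AIR and leaving MODE= at its default, looks like this:

/* decompose the monthly airline series; the component series land in WORK.DECOMP */
proc timeseries data=sashelp.air outdecomp=work.decomp plots=(series decomp);
   id date interval=month;
   var air;
   decomp tcc sc tcs ic sa tc cc;   /* MODE= omitted, so the default MULTORADD applies */
run;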

When you load the data set into SAS Visual Analytics and make visualizations, it’s very straightforward to draw time-series plots showing each of the decomposed series.

Note that the magnitudes of the Trend-Cycle-Seasonal and Trend-Cycle components are much larger than those of the Seasonal, Irregular and Cycle components. The upward trend and increasing volatility of the Trend-Cycle-Seasonal component reveal an obvious multiplicative composition of Trend-Cycle and Seasonal components. The formula should be: Trend-Cycle-Seasonal Component = Trend-Cycle Component * Seasonal Component.

Can you visually show the multiplicative relationship in the series?

I can easily apply a log transformation to the decomposed series using calculated items in SAS Visual Analytics, and then show the additive relationship of the transformed series. The visualization below shows the additive relationship of the log transformation of the Trend-Cycle-Seasonal component with the log transformations of the Trend-Cycle component and Seasonal component, which is equivalent to the pre-transformed multiplicative relationship.

In the visualization below, the moss-green line series at the bottom of the chart shows the Log Seasonal component, with each vertical black line representing its value. The lines at the top show that adding the value of the orange line series (the Log Trend-Cycle component) to the value of the mint-green vertical lines (the Log Seasonal component) yields the pine-green line series (the Log Trend-Cycle-Seasonal component).

In the list table, note that the value of the calculated item 'Trend-Cycle Component * Seasonal Component' is equal to the 'Trend-Cycle-Seasonal Component' value highlighted in blue, which indicates the multiplicative composition of 'Trend-Cycle Component' and 'Seasonal Component' into the 'Trend-Cycle-Seasonal Component'. Also, the sum of the calculated item 'Log Trend-Cycle Component' and the 'Log Seasonal Component' is equal to the value of 'Log Trend-Cycle-Seasonal Component' highlighted in light green. These verify the multiplicative and additive relationships, respectively.

More ways to expose and view patterns

Besides the above multiplicative decomposition, we can dig for more multiplicative or additive relationships from the original series and the decomposed series. Here are the formulas:

Original Series = Trend-Cycle-Seasonal Component * Irregular Component

Seasonal-Irregular Component = Seasonal Component * Irregular Component

Original Series = Seasonal Adjusted Series * Seasonal Component

Trend-Cycle Component = Trend Component + Cycle Component [1]

[1] Note: Despite setting the MODE=MULT option, PROC TIMESERIES uses the Hodrick-Prescott filter, which always decomposes the trend-cycle component into the trend component and cycle component in an additive fashion.

Because the decomposed dataset from any time series has the fixed structure shown below, we can easily apply the same SAS Visual Analytics visualizations to the decomposed series from different time series. Just swap in the new dataset: all the calculated items are inherited accordingly, and the new data is applied to the visualizations automatically. That is what I like most about visualizing time series decomposition in SAS Visual Analytics.

A final decomposition comparison

Let’s compare the multiplicative decomposition and the additive decomposition of the same series. Note that the Trend-Cycle component (as well as the Trend and Cycle components) is the same for the multiplicative and additive decompositions; the seasonal component, however, is decomposed differently in the two approaches.

In the screenshot below, we see that the two seasonal components have a similar seasonal fluctuation style, but their values are largely different between the multiplicative and additive decompositions. The different decomposition methods also lead to different Trend-Cycle-Seasonal, Irregular and Seasonal-Irregular components. In addition, we still see some patterns in the Irregular component from the additive decomposition.

In the multiplicative decomposition, however, the Irregular component appears more random. Thus, multiplicative decomposition is a better choice than additive decomposition for the SASHELP.AIR time series.

PROC TIMESERIES provides classical decomposition of time series, and SAS has other procedures that can perform more complex decompositions. If you want to visualize time series decomposition in a way you like, give SAS Visual Analytics a try!

SAS® Visual Analytics on SAS® Viya®: Try it for free!

How to Visualize Time Series Decomposition using SAS Visual Analytics was published on SAS Users.

August 9, 2019
 

Opening Plenary session, Esri UC 2019

Several of my colleagues and I attended the annual Esri User Conference last month in San Diego - along with 18,000 other Geo professionals.  It was a busy week of meetings, seminars and talks about the latest in GIS and Spatial technologies.  The days were long and exhausting, but it was also exciting and a ton of fun.  As we continue to process, plan and prepare to integrate some of these technologies into SAS Visual Analytics, I thought it would be beneficial to highlight the Esri features available in VA today.

One topic that received a lot of questions during this year’s SAS Global Forum in Dallas was geocoding.  Geocoding is the process of transforming text address data into numeric latitude and longitude values.  Once the latitude and longitude are known, they can be mapped and analyzed spatially.  SAS has offered geocoding capabilities for quite some time as part of SAS/GRAPH.  Beginning with SAS 9.4M5, PROC GEOCODE moved into Base SAS.  See my colleague’s blog posts here and here for more information on geocoding from Base SAS.
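For reference, a minimal Base SAS geocoding call looks roughly like this (a sketch assuming an input table WORK.ADDRESSES that contains a ZIP column):

/* ZIP-centroid geocoding: adds X (longitude) and Y (latitude) to the output table */
proc geocode method=zip data=work.addresses out=work.addresses_geo
   lookup=sashelp.zipcode;
run;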

But geocoding is no longer limited to Base SAS.  You can also geocode from within Visual Analytics, thanks to integration with the Esri geocoding API.  This feature is part of the Esri premium agreement and became available in VA 8.3.  Esri premium features require an existing relationship and credentials with Esri.  This post assumes that relationship exists and your credentials have been validated.  I will discuss the details of the Esri premium features in a future post, but for today the focus is how to use the Esri geocoding feature from VA with a real-world data set.

1. Getting the data into Visual Analytics

We will be using point data from the City of Dallas for the Public Library branch locations.  You can download the .csv file from the Dallas Open Data portal.  After downloading, it must be imported into VA for geocoding.

  • From the Data tab in VA, select Import > Local File
  • Navigate to the location of the Dallas library .csv file and select it
  • Adjust the default settings, if desired, and click the ‘Import Item’ button
  • Once you see the green success message, the data has been imported into VA and is ready to be geocoded. Click the ‘Cancel’ button

Message indicating successful data import

2. Selecting the data columns to geocode on

Accessing the Geocoding feature in VA follows a similar process to the steps we just performed to import the .csv file.

  • From the Data tab in VA, select Import > Esri > Geocode. Here, you must select the location of the newly imported library data set.  This path will vary depending upon the configuration of your VA instance.  For my installation, it is located at cas-shared-default > Public folder > CITY_OF_DALLAS_LIBRARY_LOCATIONS.  Once located, click the 'Select' button
  • The Geocoding Import window will open. This window should look familiar.  The top half is the same as the Import Data window we just used to get the .csv file into VA.  Essentially, the geocoding process is a new data import.  It will send selected columns to Esri via a REST API call.  The response will contain the corresponding latitude and longitude values we desire.  They will be added to our existing data set and imported into VA as a new geocoded data set.  The name of the new data set will have _GEO_CODE appended to the end of the original data set name.  This name can be modified as desired.

Geocoding selection dialog window

  • At the bottom of the Geocoding Import window are two list boxes, Available items and Selected items. The Available items box on the left contains all columns in the data set.  Select the column(s) containing the address information you wish to geocode.  Double click or click the right arrow to move them to the Selected items window on the right.  In this example, we are using the Address column.
  • VA concatenates the selected column(s) to generate a sample address for geocoding. Clicking the ‘Test’ button returns coordinates for the sample address and a score representing the confidence level of the results.  In the screenshot above, our score is 71/100 for the test address.  Not bad, but it could be better.  More on this a bit later.
  • To finish the geocoding process, click the ‘Import Item’ button at the top of the page, as we did with the original .csv file import. This time, you will be presented with a new dialog window.  Geocoding, like other Esri premium features, requires the use of credits.  This dialog indicates how many Esri credits the geocoding process will use; credits will also be discussed in detail in a future post.

Esri credit usage alert dialog

For now, select 'Yes' to continue.  When you see the green success message, the operation is complete.  We are now ready to map our Dallas Library locations.  Click 'Ok' to open the new geocoded data set.

3. Create the geography variable and display the map

Next, we need to create our geography variable from the new geocoded data set.  As part of the geocoding process, four new columns have been added to the new data set: esri_latitude, esri_longitude, esri_score, esri_address.  We only need the esri_latitude and esri_longitude columns for our map.

  • Select the Branch Name category variable and change its Classification to Geography
  • For Geography data type, select Custom Coordinates
  • Select esri_latitude for Latitude
  • Select esri_longitude for Longitude
  • Click 'OK'
  • Drag the Branch Name geography variable to the canvas to create the map

Map of non-unique geocoded addresses

What happened??  Our data set contains Dallas Public library locations, so why are the data points spread across the world?  It’s all in the data.  If you look at the original data a bit deeper, you will notice the Address field we selected for the geocoding only contains the street number and street name of the library location.  It does not contain enough information to make it unique.  Therefore, during the geocoding process, the first instance of that address will be considered a match, regardless of where it is actually located.

Detailed view of incorrect geocoded address

In the image above for the Preston Royal branch, its street number and name were a perfect match to a location in Eugene, Oregon.  Not quite what we were looking for.  So, how do we fix this?  Making our addresses unique requires a simple addition to the source data .csv file.

Column selection to ensure unique addresses for geocoding

We need to add a ‘City’ and ‘State’ column to the original .csv file with the values of ‘Dallas’ and ‘Texas’ assigned to all entries.  This will ensure each address is unique and within our area of interest.  Re-import the new .csv file and geocode it using the Address, City and State columns.  The result?  A confidence score of a perfect 100.  Much better than our first attempt!  This will now give us the map we desire for the Dallas Public Library locations.
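For reference, the equivalent change expressed as a SAS DATA step would look like this (a sketch; in this post the columns were added directly to the .csv before re-importing, and the data set names below are placeholders):

/* add constant City and State columns so every address is unique to Dallas, Texas */
data work.dallas_libraries_updated;
   set work.city_of_dallas_library_locations;
   length City State $ 20;
   City  = 'Dallas';
   State = 'Texas';
run;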

 

Final geocoded map of Dallas Library branches

In this post, I used real-world data to illustrate two things: the importance of knowing your data set, and how to geocode address information in SAS Visual Analytics.  Public data sets are a great resource but need to be used with a critical eye.  They may still need additional cleansing in order to work for your situation.

The geocoding feature is one example of the premium Esri features currently available in VA.  In future posts, I will go into more detail on other Esri features available, what makes these features ‘premium’ and examples of how to use them.  Stay tuned!

Esri integration with SAS Visual Analytics: Geocoding was published on SAS Users.

July 19, 2019
 

When a new Moon passes between the Earth and the Sun, the Moon can cast a shadow on certain regions of the Earth. This natural phenomenon creates a solar eclipse, meaning the Moon covers, or eclipses, your view of the Sun if you're in that region. No surprise that in [...]

Ring of fire: Visualizing 5,000 years of solar eclipses was published on SAS Voices by Falko Schulz

June 20, 2019
 

Move over video games and sports. Make room for escape rooms. This burgeoning form of entertainment found its roots in the video gaming movement. Escape rooms tap into a player's drive to reach the next level, solve a puzzle and win. Escape rooms present a physical game that traps you until your brain (and teamwork) help you escape. Does that sound exhilarating or terrifying? Maybe a bit of both.

SAS adapted this concept and built its own Data Science Escape Rooms at SAS Global Forum 2019. More than 800 customers used SAS software to solve a series of problems in one of six rooms, spanning three different themes: soccer, wildlife and cyberattack.

I had the opportunity to support these rooms by assisting with registration and check-in. I also "covered" the experience through videos that shared a behind-the-scenes look, player reactions and backstories on how the rooms were developed.

I even got to chat with one team that completed all three rooms. The Advanced Claims Analysis Team with the United States Office of Personnel Management, aka Team "Auditgators," included Kevin Sikora, Julie Zoeller, Richard Allen and Lauren Goob. They spoke about how cool it was to play with software they're not used to in pseudo real-world scenarios.

What the escape rooms were really like

Let me give you a lay of the land. The rooms included four stations each with a computer screen and a mouse (no keyboards). At each station, players used products like SAS Visual Analytics and SAS Visual Investigator to solve a challenge, working against the clock to escape the room within 20 minutes. Talk about pressure.

A team studies a problem at one of four stations in the Wildlife-themed escape room.

"You definitely felt a sense of urgency in there," recalls Sikora. "We really got into the wildlife room – it's where we received the highest score. We had the chance to do some great team-building too. We're more cohesive having shared this experience."

The Auditgators worked separately to solve problems and then collaborated to piece together the bigger puzzle. When they got stuck, they asked for help from the "gamemasters," who were literally behind the walls of the rooms. Gamemasters doled out clues to nudge teams along.

This may seem like a fun diversion, but how in the world can you apply this experience to your daily work? "Well we certainly realized the value in using more graphs, to click and filter to find answers we need," said Zoeller. "It was enlightening to see how we could show the data, and it gave us a better awareness of the end goal, the big picture."

Allen and Goob have been to SAS Global Forums in the past and said that the Data Science Escape Rooms gave them a better opportunity to interact with one another than anything else at the event. "It breaks up sitting in rooms and listening, breaks up the monotony. It gave us a chance to do something together as a team."

More reactions and backstories

Want a taste of the action at the event? Here, SAS' Lisa Dodson, Data Science Escape Room Gamemaster, couldn't hide her excitement to give clues and cheer on teams.

 
 
Michael Gibbs with the University of Arkansas described the intensity and pressure to get the answer and move on while escaping.

 
 
SAS partnered with SciSports to build the soccer-themed escape room. SciSports Founder and CIO Giels Brouwer and Data Scientist Mick Bosma explained how the idea came about and plans to take the room on the road.

 
 
While most participants enjoyed the challenge, not everyone "escaped" successfully. The most successful teams had solid teamwork and good communication. Sound like a lesson for "real life"? SAS' Alfredo Iglesias Rey reports that 18 percent of the teams completed all four challenges correctly. He also marveled at how players were able to apply advanced machine learning techniques with just a mouse and no keyboard.

 
 
Dying to check out a Data Science Escape Room yourself? You'll find them at several local SAS Forums this year, and plans are underway to offer them at other venues. But you don't have to wait for an escape room to get a feel for SAS software. Try SAS online right now, for free.

A playful way to get your hands on SAS: the Data Science Escape Rooms was published on SAS Users.

May 22, 2019
 

This blog shows how the automatically generated concepts and categories in Visual Text Analytics (VTA) can be refined using LITI and Boolean rules. Because of these capabilities, highly customized models can be developed in VTA. The rules used in this blog are basic. Developing linguistic rules and accurately categorizing documents requires subject matter expertise and an understanding of the grammatical structure of the language(s) used.

I will use a data set that contains information on 1,527 randomly selected movies: their titles, reviews, MPAA Ratings, Main Genre classifications and Viewer Ratings. Two customized categories will be developed: one for children's movies and the other for sports movies. Because we are familiar with movie classification and MPAA ratings, it will be relatively easy to understand the rules used in this blog. The blog's overall objective is to show how to formulate basic rules, so that their use can be extended to other fields.

SAS Visual Text Analytics (VTA) is the SAS offering designed to effectively extract insights from unstructured data at large scale. Offered on the SAS Viya architecture, VTA combines the power of Natural Language Processing (NLP), Machine Learning (ML) and linguistic rules. Currently, VTA supports 30 languages, and it has an open architecture supporting third-party programming interfaces.

As in all analytical projects, the discovery process in Text Analytics projects requires several iterations where the insights found in one iteration are used in the next iterations. In relationship to the linguistic rules, one must determine if the new rules are an improvement over the ones used in previous iterations, and find how many true positives and false positives are matched by the new rules. This process should be repeated until one obtains the precision required.

Initial Text Analysis Using Visual Analytics

Because Visual Analytics (VA) and VTA are highly integrated, the initial Text Analysis can be done in VA.

Every Text Analytics dataset must have a unique identifier associated with each document. In my blog, Discover Main Topics on #MLKDayofService Tweets Using SAS Visual Text Analytics, I showed how to set a “Unique Row Identifier” and how to work with the nodes in the pipeline.

In Visual Analytics (VA) one can do the initial analysis of text data, see the Word Cloud, and a list of topics. In the Options menu, I indicated I wanted a Maximum of Topics to Generate=7.

The image above shows that there are 364 movies with the term “kid” in the Topic “+show,+kid, +rate,+movie”. We could build a category that groups movies appropriate for kids.

There are 203 documents with a topic related to science fiction. Therefore, if I wanted to have a category for sports movies, I would have to build it myself, because sports terms appear in fewer documents.

Create a Visual Text Analytics Project

In VTA, a pipeline is a process flow diagram whose nodes represent tasks in the text analysis process. I described in detail how to work with the nodes in my #MLKDayofService blog mentioned above. Briefly, from the SAS Home menu select the action Build Models, which takes you to SAS Model Studio, where you select and create a New Project.

The photo above shows the data role assignments done in the Data tab.

Notice that there is a Unique Row Identifier for each document, the Text Variable to analyze is Review, and two variables are used as Category: MPAARating and mainGenre. Later on, VTA will use these two variables to automatically create categories and their Boolean rules. Title doesn't have a role, but I want it displayed to facilitate the analysis.

Movies are already classified according to their main category (mainGenre). I want to see the Boolean rules that VTA automatically generates for each category, and whether I can create new concepts and categories that improve on the initial categorization. For example, I would like:

  1. to find children's movies that don't contain violence,
  2. to find movies that are related to sports,
  3. to read the reviews of my favorite old movies, and
  4. to find movies whose reviews mention some of my favorite movie directors.

Method

I ran two pipelines. The first one had the default pipeline settings and also the option Include predefined concepts enabled for the Concepts node. The objective was to see the rules associated with the genres “Sports”, “Animated” and “Family”, the movies matched by these rules, as well as the ones that shouldn't have been matched. In the second pipeline, I developed LITI and Boolean rules with the objective of improving the default categorizations automatically produced in the first pipeline.

In the next sections I will describe how the new categories were built. In real business situations, sometimes we will have pre-defined categories available to us, and other times we will come up with categories that satisfy the business objectives after analyzing many documents.

Customized Concepts built in the Concept Node

In the second pipeline, I developed three customized concepts; I will use one of them, MySports, to build a new category later on.

Basic Boolean operators are used to define new concepts and categories. AND/NOT operators are applied to the whole document. There are other operators that search within the same sentence (SENT), the same paragraph (PARA) or within a given number of terms (DIST).

# Any line that starts with “#” is a comment
# Use CLASSIFIER to match a literal sequence
# Use CONCEPT_RULE to use Boolean and proximity operators. The term extracted should use _c{ }

MySports Concept

I wrote this rule, which matches 98 documents, most of them related to sports movies, with few false positives. The rule matches a document if any of the terms sport, baseball, tennis, football, basketball or racetrack appears anywhere in the document (movie review) and none of the terms gambling, buddy or sporting appears anywhere in the document.

I will use this MySports CONCEPT_RULE to build the new Sports category:

CONCEPT_RULE:(AND, (OR, "_c{sport@}","_c{baseball}","_c{tennis}","_c{football}","_c{basketball}","_c{racetrack}"),(NOT,"sporting"),(NOT,"gambling"),(NOT,"buddy"))

filmmakersInReview Concept

I built this concept just to illustrate how to use a pre-defined concept, in this case nlpPerson:

CONCEPT_RULE:(DIST_10,(OR,"filmmaker","director","film producer","producer","movie maker"),"_c{nlpPerson}")

favoriteMovies Concept

I built this concept to match my favorite old movies and one of my favorite directors. The first CONCEPT_RULE will match documents that contain in the same sentence the terms Stanley Kubrick and 2001. The second CONCEPT_RULE will match documents that contain the two terms anywhere in the document. Both CONCEPT_RULEs will only extract the first term "Stanley Kubrick":

CLASSIFIER:A Space Odyssey
CLASSIFIER:The Sound of Music
CLASSIFIER:Il Postino
CONCEPT_RULE:(SENT,"_c{Stanley Kubrick}","2001")
CONCEPT_RULE:(AND,"_c{Stanley Kubrick}","A Space Odyssey")

New Concepts in Parsing Text Node

The customized concepts developed in the Concepts node are passed to the Text Parsing node. Notice the terms football, sport, sports, baseball and the new Role MySports in the Kept Terms window, as well as the documents matched to the term "football":

Customized Categories in the Categories Node

In the second pipeline I developed new categories using as a starting point the rules associated with the genres “Sports”, “Animated” and “Family”.

Sports Category

The input data has only 3 movies in the Sports category; it is difficult to generate a meaningful rule with such a small dataset. Once the first pipeline is run, there are a total of 8 matched movies, which include the 3 original ones and 5 that are not related to sports. The automatically generated rule is:

(OR,(AND,"crowd-pleaser"),(AND,"conor"),(AND,"_x000d_stupid"))

For the second pipeline, I developed the MySports rule in the Concepts node as mentioned above, and wrote this Boolean rule in the Categories node:

(OR,(AND,"crowd-pleaser"),(AND,"conor"),(AND,"_x000d_stupid"),"[MySports]")

The new rule matches 90 movies, most of them related to sports. For the next iteration, one would need to look at the matched movies that don't relate to sports and the sports movies that were not matched, and improve on the rule above.

ChildrenMovies Category

In the second pipeline, I combined the rules for the Family and Animation categories which were automatically produced in the first pipeline.

For the Family category, there were 6 movies matched by this rule:

(OR,(AND,(OR,"adults","adult"),"oz"))

It matched "People vs Larry Flynt," which prompted me to use the terms "murder" and "obscenity" in the new rule below.

The Animation category had 66 matched movies, and the automatically generated rule was:

(OR,(AND,"pixar"),(AND,(OR,"animator","animators")),(AND,(OR,"voiced","voices","voicing","voice"),(OR,"cartoon","cartoons")),(AND,(OR,"cartoon characters","cartoon character")),(AND,(OR,"lesson","lessons"),"animated"),(AND,"live action"),(AND,"jeffrey",(OR,"features","feature")),(AND,"3-d"))
I decided to modify these two rules. In the second pipeline, I used this new rule:

(OR,(AND,(NOT,(OR,"adults","adult","suitable for children","rated R","strip@","suck@","crude humor","gore","horror","murder","obscenity","drug use@")),"Wizard of Oz"),(AND,"pixar"),(AND,(OR,"animator","animators")),(AND,(OR,"voiced","voices","voicing","voice"),(OR,"cartoon","cartoons")),(AND,(OR,"cartoon characters","cartoon character")),(AND,(OR,"lesson","lessons"),"animated"),(AND,"live action"),(AND,"jeffrey",(OR,"features","feature")),(AND,"3-d"))

This produced 73 matched movies, only two of which were rated R. Therefore, I removed both the Animation and the Family categories and created the new category childrenMovies.

Again, to determine if the new rules are an improvement over the previous ones, we must find out how many true positives and false positives are matched by the new rules, and repeat the process until we obtain the precision required.

Conclusion

Because the automatically generated concepts and categories in Visual Text Analytics (VTA) can be refined using LITI and Boolean rules, highly customized models can be developed in VTA.

As noted at the start, the discovery process requires several iterations: determine whether the new rules improve on the previous ones by checking how many true positives and false positives they match, and repeat until you obtain the precision required.

Many thanks to Teresa Jade and Biljana Belamaric Wilsey for reviewing the linguistic rules used in this blog. For more information about Visual Text Analytics, please check out:

Analysis of Movie Reviews using Visual Text Analytics was published on SAS Users.

May 21, 2019
 

If you spend any time working with maps and spatial data, having a fundamental understanding of coordinate systems and map projections becomes necessary.  It’s the foundation of how spatial data and maps work.  These areas invariably evoke trepidation and some angst, even in the most seasoned map professional.  And rightfully so; it can get complicated quickly. Fortunately, most of those worries can be set aside when creating maps with SAS Visual Analytics, without requiring a degree in geodesy.

Visual Analytics includes several different coordinate system definitions configured out-of-the-box.  Like the Predefined geography types (see Fundamental of SAS Visual Analytics geo maps), they are selected from a drop-down list during the geography variable setup.  With the details handled by VA, all you need to know is what coordinate space your data uses and select the appropriate one.

The four Coordinate spaces included with VA are:

  1. World Geodetic System (WGS84)
    Area of coverage: World.  Used by GPS navigation systems and NATO military geodetic surveying.  This is the VA default and should work in most situations.
  2. Web Mercator
    Area of coverage: World.  Format used by Google maps, OpenStreetMap, Bing maps and other web map providers.
  3. British National Grid (OSGB36)
    Area of coverage: United Kingdom – Great Britain, Isle of Man
  4. Singapore Transverse Mercator (SVY21)
    Area of coverage: Singapore onshore/offshore

But what if your data does not use one of these?  For those situations, VA also supports custom coordinate spaces.  With this option, you can specify the definition of your desired coordinate space using industry standard formats for EPSG codes or Proj4 strings.  Before we get into the details of how to use custom coordinate spaces in VA, let’s take a step back and review the basics of coordinate spaces and projections.

Background

A coordinate space is simply a grid designed to cover a specific area of the Earth.  Some have global coverage (WGS84, the default in VA) and others cover relatively small areas (SVY21/Singapore Transverse Mercator).  Each coordinate space is defined by several parameters, including but not limited to:

  • Center coordinates (origin)
  • Coverage area (‘bounds’ or ‘extent’)
  • Unit of measurement (feet or meters)

Comparison of coordinate space definitions included in Visual Analytics -- Source: http://epsg.io

The image above compares the four coordinate space definitions included with VA.  The two on the right, BNG and Singapore Transverse Mercator, have a limited extent.  A red rectangle outlines the area of coverage for each region.  The two on the left, WGS84 and Mercator, are both world maps.  At first glance, they may appear to have the same coverage area, but they are not interchangeable.  The origin for both is located at the intersection of the Equator and the Prime Meridian.  However, the similarities end there.  Notice the extent for WGS84 covers the entire latitude range, from -90 to +90.  Mercator, on the other hand, covers from -85 to +85 latitude, so the first 5 degrees from each Pole are not included.  Another difference is the unit of measurement.  WGS84 is measured in un-projected degrees, which is indicative of a spherical Geographic Coordinate System (GCS).  Mercator uses meters, which implies a Projected Coordinate System (PCS) used for a flat surface, i.e., a screen or paper.

The projection itself is a complex mathematical operation that transforms the spherical surface of the GCS into the flat surface of the PCS.  This transformation introduces distortion in one or more qualities of the map: shape, area, direction, or distance.  The process of map projection compares to peeling an orange. Removing the peel and placing it on a flat surface will cause parts of it to stretch, tear or separate as it flattens. The same thing happens to a map projection.

A flat map will always have some degree of distortion.  The amount of distortion depends on the projection used.  Select a projection that minimizes the distortion in the areas most important to the map.  For example, are you creating a navigation map where direction is critical?  How about a World map to compare land mass of various countries?  Or maybe a local map of Municipality services where all factors are equally important?  These decisions are important if you are collecting and creating your data set from the field.  But, if you are using existing data sets, chances are that decision has already been made for you.  It then becomes a task of understanding what coordinate system was selected and how to use it within VA.

Using a Custom Coordinate Space in VA

When using VA’s custom coordinate space option, it is critical the geography variable and the dataset use the same coordinate space.  This tells VA how to align the grid used by the data with the grid used by the underlying map.  If they align, the data will be placed at the expected location.  If they don’t align, the data will appear in the wrong location or may not be displayed at all.

Illustration of aligning the map and data grids

To illustrate the process of using a custom coordinate space in VA, we will be creating a custom region map of the Oklahoma City School Districts.  The data can be found on the Oklahoma City Open Data Portal.  We will use the Esri shapefile format.  As you may recall from a previous blog post, Creating custom region maps with SAS Visual Analytics, the first step is to import the Esri shapefile data into a SAS dataset.
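One way to do that import is with PROC MAPIMPORT, roughly like this (a sketch; the file path and output data set name are placeholders, and the earlier post walks through the full process):

/* read the unzipped Esri shapefile into a SAS data set */
proc mapimport datafile="C:\data\okc\School_Districts.shp"
   out=work.okc_school_districts;
run;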

Once the shapefile has been successfully imported into SAS, we then must determine the coordinate system of the data.  While WGS84 is common and will work in many situations, it should not be assumed.  The first place to look is at the source, the data provider.  Many Open Data portals will have the coordinate system listed along with the metadata and description of the dataset.  But when using an Esri shapefile, there is an easier way to find what we need.

Locate the directory where you unzipped the original shapefile.  Inside of that directory is a file with a .prj extension.  This file defines the projection and coordinate system used by the shapefile.  Below are the contents of our .prj file with the first parameter highlighted.  We are only interested in this value.  Here, you can see the data has been defined in the Oklahoma State Plane coordinate system -- not in VA’s default WGS84.  So, we must use a custom coordinate system when defining the geography variable.

PROJCS["NAD_1983_StatePlane_Oklahoma_North_FIPS_3501_Feet",GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",SPHEROID["GRS_1980",6378137,298.257222101004]],PRIMEM["Greenwich",0],UNIT["Degree",0.0174532925199433]], PROJECTION["Lambert_Conformal_Conic"],PARAMETER["False_Easting",1968500],PARAMETER["False_Northing",0],PARAMETER["Central_Meridian",-98],PARAMETER["Standard_Parallel_1",35.5666666666667],PARAMETER["Standard_Parallel_2",36.7666666666667],PARAMETER["Scale_Factor",1],PARAMETER["Latitude_Of_Origin",35],UNIT["Foot_US",0.304800609601219]]

Next, we need to look up the Oklahoma State Plane coordinate system to find a definition VA understands.  From the main page of the SpatialReference.org website, type ‘Oklahoma State Plane’ into the search box. Four results are returned.  Compare the results with the string highlighted above.  You can see the third option is what we are looking for: NAD 1983 StatePlane Oklahoma North FIPS 3501 Feet.

Selecting the appropriate definition based on the .prj file contents

To get the definitions we need for VA, click the third link for the option NAD 1983 StatePlane Oklahoma North FIPS 3501 Feet.  Here you will see a grey box with a bulleted list of links.  Each of these links represent a definition for the Oklahoma StatePlane coordinate space.

Visual Analytics supports two of the listed formats, EPSG and Proj4.  EPSG stands for European Petroleum Survey Group, an organization that publishes a database of coordinate system and projection information.  The syntax of this format is epsg:<number> or esri:<number>, where <number> is a 4- to 6-digit code for the desired coordinate system.  In our case, the format we need is the title of the page:

ESRI:102724

The second format supported by VA is Proj4, the third link in the image above.  This format consists of a string of space-delimited name value pairs.  The Oklahoma StatePlane proj4 definition we are interested in is:

+proj=lcc +lat_1=35.56666666666667 +lat_2=36.76666666666667 +lat_0=35 +lon_0=-98 +x_0=600000.0000000001 +y_0=0 +ellps=GRS80 +datum=NAD83 +to_meter=0.3048006096012192 +no_defs

Now that we have identified the coordinate system used by our data set and looked up its definition, we are ready to configure VA to use it.

Using a Projected Coordinate System definition in VA

The following section assumes you are familiar with custom region maps and setting up a polygon provider.  If not, see my previous post on that process, Creating custom region maps with SAS Visual Analytics.  The first step in setting up a geography variable for a custom region map is to start with the polygon provider.  At the bottom of the ‘Edit Polygon Provider’ window, there is an ‘Advanced’ section that is collapsed by default.  Expand it to see the Coordinate Space option.  By default, it is populated with the value EPSG:4326, which is the EPSG code for WGS84.  Since our Oklahoma City School District data does not use WGS84, we need to replace this value with the EPSG code that we looked up from SpatialReference.org (ESRI:102724).

Using the same Custom Coordinate definition for Polygon provider and geography variable

Next, we must make sure to configure the geography variable itself with the same coordinate space as the polygon provider.  On the ‘Edit Geography Item’ window, the Coordinate Space option is the last item.  Again, we must change this from the default WGS84 to ESRI:102724.  From the dropdown list, select the option ‘Custom’.  A new entry box appears where we can enter the custom coordinate space definition.  If configured correctly, you should see your map in the preview thumbnail and a 100% mapped indicator.

Congratulations!  The setup was successful.  Now, simply click OK and drag the geography variable to the canvas.  VA’s auto-map feature will recognize it and display the custom region map.

In this post, I showed how to identify the coordinate system of your Esri shapefile data, look up its EPSG and Proj4 definitions, and configure VA to use it via the Custom Coordinate space option.  While the focus was on a custom region map, the technique also applies to Custom Coordinate maps, minus the polygon provider setup.  The support of custom coordinate spaces in VA allows the mapping of practically any spatial dataset, giving you a new level of power and flexibility in your mapping efforts.

Essentials of Map Coordinate Systems and Projections in Visual Analytics was published on SAS Users.

May 6, 2019
 

App security is at the top of mind for just about everybody – users, IT folks, business executives. Rightfully so. Mobile apps and the devices on which they reside tend to travel around, without any physical boundaries that encompass the traditional desktop computers.

In chatting with folks who are evaluating the SAS Visual Analytics app for their mobile devices, the conversation eventually winds up with a focus on security and the big question comes up:

How is this app secure?

Great question! Here’s a whirlwind tour of the security features that have been built into the SAS Visual Analytics app for Windows 10, Android, and iOS devices. The app is now a young kid and not a toddler anymore; it has been around for about six years. And during its growth journey, the app has been beefed up with rock-solid features to address security for Visual Analytics reports viewed from mobile devices.

Before we take a look at the security features in the app, here are a few things you should know:

    • The app is free.
    • No license is needed to use the app.
    • You can download it anytime from the app store, and try out the sample reports in the app.
    • If you already have SAS Visual Analytics deployed in your organization, you can connect to your server, add reports to the app, and start interacting with your reports from your smartphone or tablet. The Help available in the app walks you through these steps.

Now, let’s get back to security for Visual Analytics reports on mobile devices. Here are five features that make the Visual Analytics app robust and secure.

    1. Device Whitelisting: If you want to connect to your SAS Visual Analytics server from the app, your administrator will “whitelist” your mobile device. Your device is first registered as a valid device that can connect to the Visual Analytics server. The whitelist affects devices, not users. If you happen to lose your mobile device, your administrator can remove the device from the whitelist and prevent access to the reports and data. The option to “blacklist” devices is also available.
    2. Cached Reports: After you add Visual Analytics reports to your app, if you don’t want the report data to remain with the report in the app, your administrator can enable the cached report feature. Data is downloaded only when you open and view the report on your mobile device. When you close the report, that data is removed from the device. For enhanced security, thumbnail images for report tiles in your app will not display for cached reports.
    3. Passcode: To prevent anyone other than yourself from opening the Visual Analytics app, you can set a 4-digit passcode for the app. There are two kinds of passcodes: required and optional. A required passcode is mandated by the server – when you connect to the server, you will create a passcode. Then, whenever you open the app or view a report from that server, you must enter the passcode. An optional passcode, on the other hand, is a passcode that you choose to use to lock up the app – it is not required to access the server, it is needed only to open the app. In addition, there are several features for passcode use that solidify security and access to the app: time-out, lock-out and so forth. I’ll go over these features in an upcoming blog.
    4. SSL/HTTPS: If the Visual Analytics server is set up with SSL/HTTPS, the data viewed in the reports on your mobile device is encrypted.
    5. Offline: If you have been offline for a specified number of days, you must sign in to the server again. Until you do, the app does not download reports, update reports, or open reports for viewing.

Cached Reports

One of the security features we just talked about was the cached report feature. Here’s how cached report thumbnails are displayed in the Visual Analytics app on Windows 10, without any images.

When you tap the thumbnail for the cached report, data is immediately downloaded and the report opens in the app for viewing and interaction:

When you close this cached report in the app, the data is removed from the device and the cached report thumbnail displays in the app without any images.

Thanks for joining me on this whirlwind security tour of the SAS Visual Analytics app. Now you know the many different security mechanisms that are in place to protect your organization’s data and reports accessed from the mobile app.

Five key security features in the SAS Visual Analytics app was published on SAS Users.

April 22, 2019
 

Imagine a world where satisfying human-computer dialogues exist. With the resurgence of interest in natural language processing (NLP) and natural language understanding (NLU), that day may not be far off.

To provide more satisfying interactions with machines, researchers are designing smart systems that use artificial intelligence (AI) to develop a better understanding of human requests and intent.

Last year, OpenAI used a machine learning technique called reinforcement learning to teach agents to design their own language. The AI agents were given a simple set of words and the ability to communicate with each other. They were then given a set of goals that were best achieved by cooperating (communicating) with other agents. The agents independently developed a simple ‘grounded’ language.

Grounded vs. inferred language


Human language is said to be grounded in experience. People grasp the meaning of many basic words by interaction – not by learning dictionary definitions by rote. They develop understanding in terms of sensory experience – for example, words like red, heavy, above.

Abstract word meanings are built in relation to more concretely grounded terms. Grounding allows humans to acquire and understand words and sentences in context.

The opposite of a grounded language is an inferred language. Inferred languages derive meaning from the words themselves, not from what they represent. An AI trained only on textual data, with no real-world representations, lacks a true understanding of what the words mean.

What if the AI agent develops its own language we can’t understand?

It happens. Even if the researchers give the agents simple English words, the agents inevitably diverge to their own, unintelligible language. Recently, researchers at Facebook, Google and OpenAI all experienced this phenomenon!

Agents are reward-driven. If there is no reward for using English (or any human language), then the agents will develop a more efficient shorthand for themselves.
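A toy sketch makes the incentive problem concrete. The Python snippet below is purely illustrative, not the OpenAI or Facebook code: the first reward cares only about task success, so any private code the agents invent works just as well as English; the second adds a small penalty for tokens outside a human vocabulary, which is one simple way to anchor agents to a language we can read.

def task_reward(listener_guess, target):
    # Shared reward: 1.0 if the listener identified the object the speaker described.
    # Nothing here favors English, so an arbitrary shorthand maximizes it just as well.
    return 1.0 if listener_guess == target else 0.0

def anchored_reward(listener_guess, target, message, human_vocab, penalty=0.1):
    # Same task reward, minus a penalty for each token outside a human vocabulary.
    # Constraints like this (or a fixed supervised language model) keep the agents'
    # messages intelligible, at some cost in communication efficiency.
    off_vocab = sum(1 for token in message if token not in human_vocab)
    return task_reward(listener_guess, target) - penalty * off_vocab

With only task_reward in play, agents that converge on a private code lose nothing; with anchored_reward, staying close to human_vocab becomes the path of least resistance.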

That’s cool – why is that a problem?

When researchers at the Facebook Artificial Intelligence Research lab designed chatbots to negotiate with one another using machine learning, they had to tweak one of their models because otherwise the bot-to-bot conversation “led to divergence from human language as the agents developed their own language for negotiating.” They had to use what’s called a fixed supervised model instead.

The problem there is transparency. Machine learning techniques such as deep learning are black-box technologies. A lot of data is fed into the AI, in this case a neural network, which trains on it and develops its own rules. The model is then fed new data, which it uses to produce answers or information. The black box analogy is used because it is very hard, if not impossible in complex models, to know exactly how the AI derives its output. If AI develops its own languages when talking to other AI, the transparency problem compounds. How can we fully trust an AI when we can’t follow how it is making its decisions and what it is telling other AI?

But it does demonstrate how machines are redefining people’s understanding of so many realms once believed to be exclusively human, like language. The Facebook researchers concluded that it offered a fascinating insight into human and machine language. The bots also proved to be very good negotiators, developing intelligent negotiating strategies.

These new insights, in turn, lead to smarter chatbots that have a greater understanding of the real world and the context of human dialog.

At SAS, we’re developing different ways to incorporate chatbots into business dashboards or analytics platforms. These capabilities have the potential to expand the audience for analytics results and attract new and less technical users.

“Chatbots are a key technology that could allow people to consume analytics without realizing that’s what they’re doing,” says Oliver Schabenberger, SAS Executive Vice President, Chief Operating Officer and Chief Technology Officer in a recent SAS Insights article. “Chatbots create a humanlike interaction that makes results accessible to all.” The evolution of NLP toward NLU has a lot of important implications for businesses and consumers alike.

Satisfying human-computer dialogues will soon exist, and they will have applications in medicine, law, and the classroom, to name but a few. As the volume of unstructured information continues to grow exponentially, we will benefit from AI’s tireless ability to help us make sense of it all.

Further Resources:
Natural Language Processing: What it is and why it matters
White paper: Text Analytics for Executives: What Can Text Analytics Do for Your Organization?
SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models, by Teresa Jade, Biljana Belamaric Wilsey, and Michael Wallis
Unstructured Data Analysis: Entity Resolution and Regular Expressions in SAS®, by Matthew Windham
SAS: What are chatbots?
Blog: Let’s chat about chatbots, by Wayne Thompson

Moving from natural language processing to natural language understanding was published on SAS Users.