natural language processing

May 29, 2019

Interestingly enough, paperclips have their own day of honor: on May 29th, we celebrate #NationalPaperclipDay! That well-known piece of curved wire deserves attention for keeping our papers together and helping us stay organized. Do you remember who else once asked for the same attention? Clippit, the infamous Microsoft Office assistant, popularly known as 'Clippy'.

We saw the last of Clippy in 2004, and it was removed completely from Office 2007 after persistent criticism. By then, most users considered it useless and had turned it off, even though it was supposed to help them perform certain tasks faster. Clippy was a conversational agent, like a chatbot, launched a decade before Apple's Siri. The cartoonish paperclip-with-eyes resting on a yellow loose-leaf paper bed would pop up to offer assistance every time you opened a Microsoft Office program. Today, people seem to love AI. So why, a decade ago, did everyone hate Clippy?

Why was Clippy such a failure?

One of Clippy's problems was a terrible user experience. Clippy would interrupt users to ask whether they needed help, suspending whatever they were doing. This was fine for first-timers getting to know Clippy, goofing around and training themselves to use the agent, but the mandatory conversations soon became frustrating. Second, Clippy was designed as a male agent. Clippy was born in a meeting room full of male Microsoft employees, and its facial features resemble those of a male cartoon, or at least that is what the women in the testing focus group observed. In a farewell note for Clippy, the company addressed Clippy as 'he', affirming his gender. Some women even said that Clippy's gaze made them uncomfortable while they tried to do their work.

Rather than using Natural Language Processing (NLP), Clippy's actions were triggered by Bayesian algorithms that estimated the probability of a user wanting help based on specific actions. Clippy's answers were rule-based and templated from Microsoft's knowledge base. Clippy seemed like the agent you would consult when stuck on a problem, but the fact that it kept appearing on screen in random situations turned users against it. In total, Clippy violated design principles, social norms and process workflows.
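
To make that mechanism concrete, here is a minimal Python sketch of a Bayesian trigger of the kind described above. The action names, likelihoods, prior and threshold are all hypothetical, chosen only for illustration; this is not Microsoft's actual model.

# A minimal sketch of a Bayesian trigger: estimate
# P(user wants help | observed actions) and pop up once it passes a threshold.
# All event names, probabilities and the threshold are hypothetical.

PRIOR_NEEDS_HELP = 0.05  # assumed prior that the user wants help right now

# Hypothetical likelihoods: P(action | needs help), P(action | doesn't)
LIKELIHOODS = {
    "opened_help_menu": (0.60, 0.05),
    "typed_dear":       (0.40, 0.02),  # the infamous letter-writing trigger
    "undo_repeatedly":  (0.30, 0.10),
}

def p_needs_help(observed_actions):
    """Update the prior with each action via Bayes' rule, assuming the
    actions are conditionally independent (a naive Bayes simplification)."""
    p_help = PRIOR_NEEDS_HELP
    p_not = 1.0 - PRIOR_NEEDS_HELP
    for action in observed_actions:
        l_help, l_not = LIKELIHOODS[action]
        p_help *= l_help
        p_not *= l_not
    return p_help / (p_help + p_not)

p = p_needs_help(["typed_dear", "undo_repeatedly"])
print(f"P(needs help) = {p:.2f}")  # ~0.76 with these made-up numbers
if p > 0.5:
    print("It looks like you're writing a letter. Would you like help?")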

What lessons did Clippy teach us?

We can learn from Clippy. The renewed interest in conversational agents in the tech industry often overlooks the aspects that make communication efficient, resulting in failures. It is important to learn from the past and design interfaces and algorithms that address human needs rather than hype the capabilities of artificial intelligence.

At SAS, we are working to deliver a natural language interaction (NLI) service that converts keyed or spoken natural language text into application-specific, executable code, and we are using apps like Q, the genderless AI voice for virtual assistants. We're developing different ways to incorporate chatbots into business dashboards and analytics platforms. These capabilities have the potential to expand the audience for analytics results and attract new, less technical users.
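
As a concrete illustration of the NLI idea, here is a minimal Python sketch, not the SAS service itself: a hypothetical keyword table stands in for real language understanding, mapping a typed request onto executable code.

# A minimal sketch of NLI: map typed natural language onto
# application-specific, executable code. The keyword table and the
# generated SQL are hypothetical stand-ins for a real service.

TEMPLATES = {
    ("average", "sales", "region"): "SELECT region, AVG(sales) FROM orders GROUP BY region;",
    ("total", "sales", "month"): "SELECT month, SUM(sales) FROM orders GROUP BY month;",
}

def to_code(utterance):
    words = set(utterance.lower().split())
    for keywords, code in TEMPLATES.items():
        if set(keywords) <= words:  # all keywords present in the request
            return code
    return None

print(to_code("Show me the average sales by region"))
# SELECT region, AVG(sales) FROM orders GROUP BY region;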

“Chatbots are a key technology that could allow people to consume analytics without realizing that’s what they’re doing,” says Oliver Schabenberger, SAS Executive Vice President, Chief Operating Officer and Chief Technology Officer. “Chatbots create a humanlike interaction that makes results accessible to all.”

The use of chatbots is growing exponentially, and all kinds of organizations are starting to see the exciting possibilities of combining chatbots with AI and analytics. Clippy was a trailblazer of sorts, but it has come to represent what to avoid when designing AI. At SAS, we are focused on augmenting the human experience and on keeping the customer, not the technology, at the center.

Interested in seeing what SAS is doing with Natural Language Processing? Check out SAS Visual Text Analytics and try it for free. We also have a brand-new SAS book, hot off the press, focusing on information extraction models for unstructured text and language data: SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models. We even have a free chapter for you to take a peek!


What can we learn from Clippy about AI? was published on SAS Users.

May 22, 2019

This blog shows how the automatically generated concepts and categories in Visual Text Analytics (VTA) can be refined using LITI and Boolean rules. Because of these capabilities, highly customized models can be developed in VTA. The rules used in this blog are basic; developing linguistic rules and accurately categorizing documents requires subject matter expertise and an understanding of the grammatical structure of the language(s) used.

I will use a data set that contains information on 1,527 randomly selected movies: their titles, reviews, MPAA ratings, main genre classifications and viewer ratings. Two customized categories will be developed: one for Children's movies and the other for Sports movies. Because we are familiar with movie classification and MPAA ratings, it will be relatively easy to understand the rules used in this blog. The blog's overall objective is to show how to formulate basic rules, so that their use can be extended to other fields.

SAS Visual Text Analytics (VTA) is the SAS offering designed to effectively extract insights from unstructured data at large scale. Offered on the SAS Viya architecture, VTA combines the power of Natural Language Processing (NLP), Machine Learning (ML) and linguistic rules. Currently, VTA supports 30 languages, and it has an open architecture supporting third-party programming interfaces.

As in all analytical projects, the discovery process in text analytics requires several iterations, in which the insights found in one iteration feed the next. With respect to the linguistic rules, one must determine whether the new rules improve on those used in previous iterations by finding how many true positives and false positives the new rules match. This process should be repeated until the required precision is reached.

Initial Text Analysis Using Visual Analytics

Because Visual Analytics (VA) and VTA are highly integrated, the initial text analysis can be done in VA.

Every text analytics data set must have a unique identifier associated with each document. In my blog, Discover Main Topics on #MLKDayofService Tweets Using SAS Visual Text Analytics, I showed how to set a "Unique Row Identifier" and how to work with the nodes in the pipeline.

In Visual Analytics (VA), one can do an initial analysis of the text data, see the word cloud and view a list of topics. In the Options menu, I set the maximum number of topics to generate to 7.

The screenshot above shows that there are 364 movies with the term "kid" in the topic "+show,+kid, +rate,+movie". We could build a category that groups movies appropriate for kids.

There are 203 documents associated with the science-fiction topic. If I wanted a category for Sports movies, however, I would have to build it myself, because sports terms appear in far fewer documents.

Create a Visual Text Analytics Project

In VTA, a pipeline is a process flow diagram whose nodes represent tasks in the text analysis process. I described in detail how to work with the nodes in my #MLKDayofService blog mentioned above. Briefly, from the SAS Home menu, select the action Build Models; this takes you to SAS Model Studio, where you create a New Project.

The screenshot above shows the data role assignments done in the Data tab.

Notice that there is a Unique Row Identifier for each document, that the text variable to analyze is Review, and that two variables are assigned the Category role: MPAARating and mainGenre. Later on, VTA will use these two variables to automatically create categories and their Boolean rules. Title doesn't have a role, but I want it displayed to facilitate the analysis.

The movies are already classified according to their main genre (mainGenre). I want to see the Boolean rules that VTA automatically generates for each category, and whether I can create new concepts and categories that improve on the initial categorization. For example, I would like:

  1. to find children's movies that don't contain violence,
  2. to find movies related to Sports,
  3. to read the reviews of my favorite old movies, and
  4. to find movies whose reviews mention some of my favorite directors.

Method

I ran two pipelines. The first used the default pipeline settings, with the option Include predefined concepts enabled in the Concepts node. The objective was to see the rules associated with the genres "Sports", "Animated" and "Family", the movies matched by these rules, as well as the ones that shouldn't have been matched. In the second pipeline, I developed LITI and Boolean rules with the objective of improving on the default categorizations produced in the first pipeline.

In the next sections, I will describe how the new categories were built. In real business situations, we will sometimes have pre-defined categories available to us; other times we will arrive at categories that satisfy the business objectives only after analyzing many documents.

Customized Concepts Built in the Concepts Node

In the second pipeline, I developed three customized concepts. I will use one of them, "MySports", to build a new category later on.

Basic Boolean operators are used to define new concepts and categories. The AND and NOT operators apply to the whole document. There are other operators that search within the same sentence (SENT), within the same paragraph (PARA) or within a given number of terms (DIST).

# Any line that starts with "#" is a comment
# Use CLASSIFIER to match a literal sequence
# Use CONCEPT_RULE to use Boolean and proximity operators. The term to extract should be marked with _c{ }

MySports Concept

I wrote this rule, which matches 98 documents, most of them related to Sports movies, with few false positives. The rule matches a document if any of the terms sport, baseball, tennis, football, basketball or racetrack appears anywhere in the document (movie review), and none of the terms gambling, buddy or sporting appears anywhere in the document.

I will use this MySports CONCEPT_RULE to build the new Sports category:

CONCEPT_RULE:(AND,(OR,"_c{sport@}","_c{baseball}","_c{tennis}","_c{football}","_c{basketball}","_c{racetrack}"),(NOT,"sporting"),(NOT,"gambling"),(NOT,"buddy"))

filmmakersInReview Concept

I built this concept just to illustrate how to use a pre-defined concept, in this case nlpPerson:

CONCEPT_RULE:(DIST_10,(OR,"filmmaker","director","film producer","producer","movie maker"),"_c{nlpPerson}")

favoriteMovies Concept

I built this concept to match my favorite old movies and one of my favorite directors. The first CONCEPT_RULE matches documents that contain the terms Stanley Kubrick and 2001 in the same sentence. The second CONCEPT_RULE matches documents that contain the two terms anywhere in the document. Both CONCEPT_RULEs extract only the first term, "Stanley Kubrick":

CLASSIFIER:A Space Odyssey
CLASSIFIER:The Sound of Music
CLASSIFIER:Il Postino
CONCEPT_RULE:(SENT,"_c{Stanley Kubrick}","2001")
CONCEPT_RULE:(AND,"_c{Stanley Kubrick}","A Space Odyssey")

New Concepts in the Text Parsing Node

The customized concepts developed in the Concepts node are passed to the Text Parsing node. Notice the terms football, sport, sports and baseball with the new role MySports in the Kept Terms window, as well as the documents matched by the term "football":

Customized Categories in the Categories Node

In the second pipeline, I developed new categories, using as a starting point the rules associated with the genres "Sports", "Animated" and "Family".

Sports Category

The input data has only 3 movies in the Sports category; it is difficult to generate a meaningful rule from such a small data set. Once the first pipeline is run, a total of 8 movies are matched: the 3 original ones and 5 that are not related to sports. The automatically generated rule is:

(OR,(AND,"crowd-pleaser"),(AND,"conor"),(AND,"_x000d_stupid"))

For the second pipeline, I developed the MySports rule in the Concepts node as described above, and wrote this Boolean rule in the Categories node:

(OR,(AND,"crowd-pleaser"),(AND,"conor"),(AND,"_x000d_stupid"),"[MySports]")

The new rule matches 90 movies, most of them related to Sports. For the next iteration, one would look at the matched movies that don't relate to Sports and the Sports movies that were not matched, and improve the rule accordingly.

ChildrenMovies Category

In the second pipeline, I combined the rules for the Family and Animation categories that were automatically produced in the first pipeline.

For the Family category, 6 movies were matched by this rule:

(OR,(AND,(OR,"adults","adult"),"oz"))

It matched "The People vs. Larry Flynt", which prompted me to include the terms "murder" and "obscenity" in the new rule.

The Animation category had 66 matched movies, and the automatically generated rule was:

(OR,(AND,"pixar"),(AND,(OR,"animator","animators")),(AND,(OR,"voiced","voices","voicing","voice"),(OR,"cartoon","cartoons")),(AND,(OR,"cartoon characters","cartoon character")),(AND,(OR,"lesson","lessons"),"animated"),(AND,"live action"),(AND,"jeffrey",(OR,"features","feature")),(AND,"3-d"))

I decided to modify these two rules. In the second pipeline, I used this new rule:

(OR,(AND,(NOT,(OR,"adults","adult","suitable for children","rated R","strip@","suck@","crude humor","gore","horror","murder","obscenity","drug use@")),"Wizard of Oz"),(AND,"pixar"),(AND,(OR,"animator","animators")),(AND,(OR,"voiced","voices","voicing","voice"),(OR,"cartoon","cartoons")),(AND,(OR,"cartoon characters","cartoon character")),(AND,(OR,"lesson","lessons"),"animated"),(AND,"live action"),(AND,"jeffrey",(OR,"features","feature")),(AND,"3-d"))

This new rule matched 73 movies, only two of which are rated R. Therefore, I removed both the Animation and Family categories and created the new category childrenMovies.

Again, to determine whether the new rules improve on the previous ones, we must find out how many true positives and false positives the new rules match, and repeat the process until we obtain the required precision.
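
As a sketch of that bookkeeping, precision and recall for one rule iteration can be computed in Python from the set of documents the rule matched and a hand-labeled set of truly relevant documents. The document IDs below are made up for illustration.

# A minimal sketch of scoring one rule iteration. The document IDs are
# made up; in practice they would come from the pipeline's match results.

matched = {"m012", "m087", "m101", "m340", "m455"}  # documents the new rule hit
relevant = {"m012", "m087", "m101", "m222"}         # documents that truly belong

true_pos = matched & relevant
precision = len(true_pos) / len(matched)   # share of matches that are correct
recall = len(true_pos) / len(relevant)     # share of relevant docs we found

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# precision = 0.60, recall = 0.75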

Conclusion

Because the automatically generated concepts and categories in Visual Text Analytics (VTA) can be refined using LITI and Boolean rules, highly customized models can be developed in VTA.

As in all analytical projects, the discovery process in text analytics requires several iterations, in which the insights found in one iteration feed the next. With respect to the linguistic rules, one must determine whether the new rules improve on those used in previous iterations by finding how many true positives and false positives the new rules match. This process should be repeated until the required precision is reached.

Many thanks to Teresa Jade and Biljana Belamaric Wilsey for reviewing the linguistic rules used in this blog.

Analysis of Movie Reviews using Visual Text Analytics was published on SAS Users.

May 8, 2019

I just spent four inspiring days talking to customers about the many ways they are putting analytics into action in their organizations.  From computer vision models that interpret medical images to natural language processing models that analyze supply chain records, SAS users are doing ground-breaking work with analytics and AI. [...]

6 must reads following our biggest event of the year was published on SAS Voices by Oliver Schabenberger

April 22, 2019

Imagine a world where satisfying human-computer dialogues exist. With the resurgence of interest in natural language processing (NLP) and natural language understanding (NLU), that day may not be far off.

In order to provide more satisfying interactions with machines, researchers are designing smart systems that use artificial intelligence (AI) to develop a better understanding of human requests and intent.

Last year, OpenAI used a machine learning technique called reinforcement learning to teach agents to design their own language. The AI agents were given a simple set of words and the ability to communicate with each other. They were then given a set of goals that were best achieved by cooperating (communicating) with other agents. The agents independently developed a simple ‘grounded’ language.

Grounded vs. inferred language


Human language is said to be grounded in experience. People grasp the meaning of many basic words by interaction, not by learning dictionary definitions by rote. They develop understanding in terms of sensory experience: for example, words like red, heavy and above.

Abstract word meanings are built in relation to more concretely grounded terms. Grounding allows humans to acquire and understand words and sentences in context.

The opposite of a grounded language is an inferred language. Inferred languages derive meaning from the words themselves rather than from what the words represent. AI systems trained only on textual data, with no real-world representations, lack a true understanding of what the words mean.

What if the AI agent develops its own language we can’t understand?

It happens. Even if the researcher gives the agents simple English words, the agents inevitably diverge to their own, unintelligible language. Researchers at Facebook, Google and OpenAI have all recently experienced this phenomenon!

Agents are reward-driven. If there is no reward for using English (or any human language), the agents will develop a more efficient shorthand of their own.
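
The effect is easy to reproduce in miniature. Below is a Python sketch of a Lewis-style signaling game with simple reward-driven learning. It illustrates the dynamic only; it is not the Facebook, Google or OpenAI setup, and every name and parameter in it is made up. Because only successful cooperation is rewarded, the agents settle on whatever arbitrary code works.

import random

# Two reward-driven agents invent a private code for two states.
# Nothing rewards human-readable words, so any convention that works wins.

STATES = ["red", "blue"]
SIGNALS = ["00", "01"]
speaker = {s: {sig: 0.0 for sig in SIGNALS} for s in STATES}    # Q-values
listener = {sig: {s: 0.0 for s in STATES} for sig in SIGNALS}

def choose(q, eps=0.1):
    # epsilon-greedy: mostly pick the best-scoring option, sometimes explore
    if random.random() < eps:
        return random.choice(list(q))
    return max(q, key=q.get)

for _ in range(5000):
    state = random.choice(STATES)
    signal = choose(speaker[state])     # speaker encodes the state
    guess = choose(listener[signal])    # listener decodes the signal
    reward = 1.0 if guess == state else 0.0
    speaker[state][signal] += 0.1 * (reward - speaker[state][signal])
    listener[signal][guess] += 0.1 * (reward - listener[signal][guess])

# The emergent "language": which signal each state ended up mapped to
print({state: max(speaker[state], key=speaker[state].get) for state in STATES})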

That’s cool – why is that a problem?

When researchers at the Facebook Artificial Intelligence Research lab designed chatbots to negotiate with one another using machine learning, they had to tweak one of their models because otherwise the bot-to-bot conversation “led to divergence from human language as the agents developed their own language for negotiating.” They had to use what’s called a fixed supervised model instead.

The problem there is transparency. Machine learning techniques such as deep learning are black-box technologies. A lot of data is fed into the AI, in this case a neural network, which trains on it and develops its own rules. The model is then fed new data and spits out answers or information. The black-box analogy is used because it is very hard, if not impossible in complex models, to know exactly how the AI derives its output. If AI develops its own languages when talking to other AI, the transparency problem compounds. How can we fully trust an AI when we can't follow how it makes its decisions and what it is telling other AI?

But it does demonstrate how machines are redefining our understanding of so many realms once believed to be exclusively human, like language. The Facebook researchers concluded that it offered a fascinating insight into human and machine language. The bots also proved to be very good negotiators, developing intelligent negotiating strategies.

These new insights, in turn, lead to smarter chatbots that have a greater understanding of the real world and the context of human dialog.

At SAS, we're developing different ways to incorporate chatbots into business dashboards and analytics platforms. These capabilities have the potential to expand the audience for analytics results and attract new, less technical users.

“Chatbots are a key technology that could allow people to consume analytics without realizing that’s what they’re doing,” says Oliver Schabenberger, SAS Executive Vice President, Chief Operating Officer and Chief Technology Officer, in a recent SAS Insights article. “Chatbots create a humanlike interaction that makes results accessible to all.” The evolution of NLP toward NLU has important implications for businesses and consumers alike.

Satisfying human-computer dialogues will soon exist, with applications in medicine, law and the classroom, to name but a few. As the volume of unstructured information continues to grow exponentially, we will benefit from AI's tireless ability to help us make sense of it all.

Further Resources:
Natural Language Processing: What it is and why it matters
White paper: Text Analytics for Executives: What Can Text Analytics Do for Your Organization?
SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models, by Teresa Jade, Biljana Belamaric Wilsey, and Michael Wallis
Unstructured Data Analysis: Entity Resolution and Regular Expressions in SAS®, by Matthew Windham
SAS: What are chatbots?
Blog: Let’s chat about chatbots, by Wayne Thompson

Moving from natural language processing to natural language understanding was published on SAS Users.

April 9, 2019

Natural language understanding (NLU) is a subfield of natural language processing (NLP) that enables machine reading comprehension. While both deal with human language, NLU goes beyond the structural understanding of language to interpret intent, resolve context and word ambiguity, and even generate human language on its own. NLU is designed for communicating with non-programmers: understanding their intent and acting on it. NLU algorithms tackle the extremely complex problem of semantic interpretation, that is, understanding the intended meaning of spoken or written language, with all the subtleties of human error, such as mispronunciations or fragmented sentences.

How does it work?

After your data has been analyzed by NLP to identify parts of speech and other features, NLU uses context to discern the meaning of fragmented and run-on sentences and to execute the intent. For example, imagine a voice command to Siri or Alexa:

Siri / Alexa play me a …. um song by ... um …. oh I don’t know …. that band I like …. the one you played yesterday …. The Beach Boys … no the bass player … Dick something …

What are the chances of Siri / Alexa playing a song by Dick Dale? That’s where NLU comes in.

NLU reduces the human speech (or text) to a structured ontology: a data model comprising formal, explicit definitions of the semantics (meaning) and pragmatics (purpose or goal). The algorithms pull out such things as intent, timing, location and sentiment.

The above example might break down into:

Play song [intent] / yesterday [timing] / Beach Boys [artist] / bass player [artist] / Dick [artist]

By piecing together this information you might just get the song you want!
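
A toy Python sketch shows the target structure. Hand-written patterns stand in for trained models here, so this is only an illustration of reducing an utterance to an intent-plus-slots frame, not how Siri or Alexa actually work.

import re

# Reduce a rambling request to a structured frame: intent plus slots.
# The patterns are hand-written stand-ins for a trained NLU model.

UTTERANCE = ("play me a um song by um oh I don't know that band I like "
             "the one you played yesterday The Beach Boys no the bass player "
             "Dick something")

def parse(text):
    frame = {"intent": None, "timing": None, "artist": []}
    if re.search(r"\bplay\b", text, re.IGNORECASE):
        frame["intent"] = "play_song"
    if re.search(r"\byesterday\b", text, re.IGNORECASE):
        frame["timing"] = "yesterday"
    # crude entity spotting: a known band name, a role phrase, a first name
    frame["artist"] = re.findall(r"The Beach Boys|bass player|\bDick\b", text)
    return frame

print(parse(UTTERANCE))
# {'intent': 'play_song', 'timing': 'yesterday',
#  'artist': ['The Beach Boys', 'bass player', 'Dick']}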

NLU has many important implications for businesses and consumers alike. Here are some common applications:

    Conversational interfaces – bots that can enhance the customer experience and deliver efficiency.
    Virtual assistants – natural language powered, allowing for easy engagement using natural dialogue.
    Call steering – allowing customers to explain, in their own words, why they are calling rather than going through predefined menus.
    Smart listener – allowing users to optimize speech output applications.
    Information summarization – algorithms that can ‘read’ long documents and summarize the meaning and/or sentiment.
    Pre-processing for machine learning (ML) – the information extracted can then be fed into a machine learning recommendation engine or predictive model. For example, NLU and ML are used to sift through novels to predict which would make hit movies at the box office!

Imagine the power of an algorithm that can understand the meaning and nuance of human language in many contexts, from medicine to law to the classroom. As the volumes of unstructured information continue to grow exponentially, we will benefit from computers’ tireless ability to help us make sense of it all.

Further Resources:
Natural Language Processing: What it is and why it matters

White paper: Text Analytics for Executives: What Can Text Analytics Do for Your Organization?

SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models, by Teresa Jade, Biljana Belamaric Wilsey, and Michael Wallis

Unstructured Data Analysis: Entity Resolution and Regular Expressions in SAS®, by Matthew Windham

So, you’ve figured out NLP but what’s NLU? was published on SAS Users.

March 18, 2019

Artificial intelligence (AI) is a natural evolution of analytics. Over the years, we have seen AI add learning and automation capabilities to the predictive and prescriptive jobs of analytics. We have been building AI systems for decades, but a few things have changed to make today’s AI systems more powerful. [...]

Advancing AI with deep learning and GPUs was published on SAS Voices by Oliver Schabenberger

March 12, 2019

The Special Olympics is part of the inclusion movement for people with intellectual disabilities. The organisation provides year-round sports training and competitions for adults and children with intellectual disabilities. In March 2019 the Special Olympics World Games will be held in Abu Dhabi, United Arab Emirates. There are a number [...]

Normal and exceptional: analytics in use at the Special Olympics was published on SAS Voices by Yigit Karabag

February 13, 2019

Every day, military intelligence analysts sit behind computers reading a never-ending stream of reports, updating presentation templates and writing assessments. But intelligence is more than documenting events and sharing breaking news. It involves understanding and predicting complexities in human behavior across various organizational constructs and using facets of information to [...]

NLP for military intelligence was published on SAS Voices by Mary Beth Moore

Posted at 11:25 PM