January 16, 2019
 

If you've ever wanted to apply modern machine learning techniques for text analysis, but didn't have enough labeled training data, you're not alone. This is a common scenario in domains that use specialized terminology, or for use cases where customized entities of interest won't be well detected by standard, off-the-shelf entity models.

For example, manufacturers often analyze engineer, technician, or consumer comments to identify the name of specific components which have failed, along with the associated cause of failure or symptoms exhibited. These specialized terms and contextual phrases are highly unlikely to be tagged in a useful way by a pre-trained, all-purpose entity model. The same is true for any types of texts which contain diverse mentions of chemical compounds, medical conditions, regulatory statutes, lab results, suspicious groups, legal jargon…the list goes on.

For many real-world applications, users find themselves at an impasse: it is incredibly impractical for experts to manually label hundreds of thousands of documents. This post discusses an analytical approach to Named Entity Recognition (NER) that uses rules-based text models to efficiently generate large amounts of training data suitable for supervised learning methods.

Putting NER to work

In this example, we used documents produced by the United States Department of State (DOS) on the subject of assessing and preventing human trafficking. Each year, the DOS releases publicly-facing Trafficking in Persons (TIP) reports for more than 200 countries, each containing a wealth of information expressed through freeform text. The simple question we pursued for this project was: who are the vulnerable groups most likely to be victimized by trafficking?

Sample answers include "Argentine women and girls," "Ghanaian children," "Dominican citizens," "Afghan and Pakistani men," "Chinese migrant workers," and so forth. Although these entities follow a predictable pattern (nationality + group), note that the context must also be that of a victimized population. For example, “French citizens” in a sentence such as "French citizens are working to combat the threats of human trafficking" are not a valid match to our "Targeted Groups" entity.

For more contextually-complex entities, or fluid entities such as People or Organizations where every possible instance is unknown, the value that machine learning provides is that the algorithm can learn the pattern of a valid match without the programmer having to anticipate and explicitly state every possible variation. In short, we expect the machine to increase our recall, while maintaining a reasonable level of precision.

For this case study, here is the method we used:

1. Using SAS Visual Text Analytics, create a rules-based, contextual extraction model on a sample of data to detect and extract the "Targeted Groups" custom entity. Next, apply this rules-based model to a much larger number of observations, which will form our training corpus for a machine learning algorithm. In this case, we used Conditional Random Fields (CRF), a sequence modeling algorithm also included with SAS Visual Text Analytics.
 
2. Re-format the training data to reflect the JSON input structure needed for CRF, where each token in the sentence is assigned a corresponding target label and part of speech.
 
3. Train the CRF model to detect our custom entity and predict the correct boundaries for each match.
 
4. Manually annotate a set of documents to use as a holdout sample for validation purposes. For each document, our manual label captures the matched text of the Targeted Groups entity as well as the start and end offsets where that string occurs within the larger body of text.
 
5. Score the validation “gold” dataset, assess recall and precision metrics, and inspect differences between the results of the linguistic vs machine learning model.

Let's explore each of these steps in more detail.

1. Create a rules-based, contextual extraction model

In SAS Visual Text Analytics, we created a simple model consisting of a few intermediate, "helper" concepts and the main Targeted Groups concept, which combines these entities to generate our final output.

The Nationalities List and Affected Parties concepts are simple CLASSIFIER lists of nationalities and vulnerable groups that are known a priori. The Targeted Group concept is a predicate rule that returns a match only if those two entities are found in that order, separated by no more than 7 tokens, AND if no verb intervenes between the two entities (the verb "trafficking" being the only exception). This verb-exclusion clause was added to prevent false matches such as "Turkish Cypriots lacked shelters for victims" and "Bahraini government officials stated that they encouraged victims to participate in the investigation and prosecution of traffickers." We then applied this linguistic model to all the TIP reports leading up to 2017, which would form the basis for our CRF training data.

The model's four concepts are the Nationalities List, Affected Parties, and Verb Exclusions helper concepts, plus the final Targeted Group fact rule.
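As a hedged sketch only -- the CLASSIFIER entries are excerpts, the concept names are compressed to remove spaces, and the ORDDIST operator and NOT-based verb exclusion are assumptions about how such rules are commonly written in LITI, not the exact rules from this project -- the model might look something like this:

/* NationalitiesList helper concept: CLASSIFIER list of known nationalities (excerpt) */
CLASSIFIER:Afghan
CLASSIFIER:Argentine
CLASSIFIER:Chinese
CLASSIFIER:Ghanaian

/* AffectedParties helper concept: vulnerable groups known a priori (excerpt) */
CLASSIFIER:women
CLASSIFIER:girls
CLASSIFIER:children
CLASSIFIER:migrant workers

/* VerbExclusions helper concept: intervening verbs that invalidate a match;
   a real model might reference part-of-speech tags rather than a fixed list */
CLASSIFIER:lacked
CLASSIFIER:stated
CLASSIFIER:worked

/* TargetedGroups fact rule: a nationality followed by an affected party within
   7 tokens, with no excluded verb (operator and reference syntax approximate) */
PREDICATE_RULE:(nat,grp):(AND,(ORDDIST_7,"_nat{NationalitiesList}","_grp{AffectedParties}"),(NOT,"{VerbExclusions}"))

Here the predicate rule returns a fact match only when a NationalitiesList match (_nat) is followed within 7 tokens by an AffectedParties match (_grp) and no VerbExclusions match occurs in scope.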

2. Re-format the training data

The SAS Visual Text Analytics score code produces a transactional-style output for predicate rules, where each fact argument and the full match are captured in separate rows. Note that a single document may have more than one match; matches are distinguished by _result_id_.

Using code, we joined these results back to the original table and the underlying parsing tables to transform that native output into the JSON format required to train a CRF model.

Notice that every token in each sentence is broken out separately and has both a corresponding label and part of speech. For all tokens which are not part of our Targeted Groups entity of interest, the label is simply "O", for "Other". But for matches such as "Afghan women and girls," the first token in the match has the label "B-vic" ("Beginning of the Victim entity") and subsequent tokens in that match are labeled "I-vic" ("Inside the Victim entity").

Note that part of speech tags are not required for CRF, but we have found that including them as an input improves the accuracy of this model type. These three fields are all we will use to train our CRF model.
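For instance, the labeled structure for a sentence containing the match "Afghan women and girls" might look like this (a hypothetical illustration: the field names and part-of-speech tags are invented for readability, not the exact schema the CRF training step expects):

{"tokens": [
  {"token": "Afghan", "label": "B-vic", "pos": "A"},
  {"token": "women",  "label": "I-vic", "pos": "N"},
  {"token": "and",    "label": "I-vic", "pos": "CC"},
  {"token": "girls",  "label": "I-vic", "pos": "N"},
  {"token": "were",   "label": "O",     "pos": "V"},
  {"token": "forced", "label": "O",     "pos": "V"},
  {"token": "into",   "label": "O",     "pos": "PREP"},
  {"token": "labor",  "label": "O",     "pos": "N"}
]}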

3. Train the CRF model

Because the Conditional Random Fields algorithm predicts a label for every single token, it is often used for base-level Natural Language Processing tasks such as Part of Speech detection. However, we already have part of speech tags, so the task we are giving it in this case is signal detection. Most of the words are "Other," meaning not of interest, and therefore noise. Can the CRF model detect our Targeted Groups entity and assign the correct boundaries for the match using the B-vic and I-vic labels?
 
After loading the training data to CAS using SAS Studio, we applied the crfTrain action to this table.

After it runs successfully, we have a number of underlying tables which will be used in the scoring step.

4. Manually annotate a set of documents

For ease of annotation and interpretability, we tokenized the original data by sentence and saved it. Using a purpose-built web application that enables a user to highlight entities and save the relevant text string and its offsets to a file, we then hand-scored approximately 2,200 sentences from 2017 TIP documents. Remember, these documents have not yet been "seen" by either the linguistic model or the CRF model. This hand-scored data will serve as our validation dataset.
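A single hand-scored record might look something like this (a hypothetical layout; the identifier, field names, and offset values are invented for illustration):

{"sentence_id": "TIP2017-00182",
 "entity": "Targeted Groups",
 "match_text": "Afghan women and girls",
 "start_offset": 47,
 "end_offset": 68}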

5. Score the validation “gold” dataset by both models and assess results

Finally, we scored the validation set in SAS Studio with the CRF model, so we could compare human versus machine outcomes.

In a perfect world, we would hope that all the matches found by humans are also found by the model and, moreover, that the model detects valid matches beyond those the humans' rules captured. For example, perhaps we did not include "Rohingyan" or "Tajik" (versus Tajikistani) as nationalities in the CLASSIFIER list of our rules-based model, but the machine learning model detected victims from these groups as a valid pattern nonetheless. This would be a big success, and one of the compelling reasons to use machine learning for NER use cases.
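As a refresher, precision = TP/(TP + FP) and recall = TP/(TP + FN). A minimal sketch of the span-level comparison in SAS follows; the table names (pred, gold) and column names are assumptions, and a true positive here requires an exact match on document and offsets:

proc sql;
   /* True positives: predicted spans that exactly match a gold span */
   create table tp as
   select p.doc_id, p.start_offset, p.end_offset
   from pred as p
        inner join gold as g
        on  p.doc_id       = g.doc_id
        and p.start_offset = g.start_offset
        and p.end_offset   = g.end_offset;
quit;

data _null_;
   /* NOBS= retrieves each table's row count at compile time */
   if 0 then set tp   nobs=ntp;
   if 0 then set pred nobs=npred;
   if 0 then set gold nobs=ngold;
   precision = ntp / npred;   /* TP / (TP + FP): share of predictions that are correct */
   recall    = ntp / ngold;   /* TP / (TP + FN): share of gold matches that were found */
   put precision= recall=;
   stop;
run;

Exact-boundary matching is the most conservative choice; looser criteria, such as giving credit for partial overlap, are also possible.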

In a future blog, I'll detail the outcomes, including modeling considerations such as:
 
  • The format of the CRF training template
  • The relative impact of including inputs such as part-of-speech tags
  • Precision and recall metrics
  • Performance and training times by volume of training documents

Machine markup provides scale and agility

In summary, although human experts might produce the highest-quality annotations for NER, machine markup can be produced far more cheaply and efficiently -- and, even more importantly, it scales to far greater data volumes in a fraction of the time. Building a rules-based model to generate large amounts of "good enough" labeled data is an excellent way to take advantage of these economies of scale, reduce the cost barrier to exploring new use cases, and improve your ability to adapt quickly to evolving business objectives.

Reduce the cost-barrier of generating labeled text data for machine learning algorithms was published on SAS Users.

January 16, 2019
 

A quantile-quantile plot (Q-Q plot) is a graphical tool that compares a data distribution and a specified probability distribution. If the points in a Q-Q plot appear to fall on a straight line, that is evidence that the data can be approximately modeled by the target distribution. Although it is not necessary, some data analysts like to overlay a reference line to help "guide their eyes" as to whether the values in the plot fall on a straight line. This article describes three ways to overlay a reference line on a Q-Q plot. The first two lines are useful during the exploratory phase of data analysis; the third line visually represents the estimates of the location and scale parameters in the fitted model distribution. The three lines are:

  • A line that connects the 25th and 75th percentiles of the data and reference distributions
  • A least squares regression line
  • A line whose intercept and slope are determined by maximum likelihood estimates of the location and scale parameters of the target distribution

If you need to review Q-Q plots, see my previous article that describes what a Q-Q plot is, how to construct a Q-Q plot in SAS, and how to interpret a Q-Q plot.

Create a basic Q-Q plot in SAS

Let me be clear: It is not necessary to overlay a line on a Q-Q plot. You can display only the points on a Q-Q plot and, in fact, that is the default behavior in SAS when you create a Q-Q plot by using the QQPLOT statement in PROC UNIVARIATE.

The following DATA step generates 97 random values from an exponential distribution with scale parameter σ = 2, plus three artificial "outliers." The call to PROC UNIVARIATE then creates the Q-Q plot:

data Q(keep=y);
call streaminit(321);
do i = 1 to 97;
   y = round( rand("Expon", 2), 0.001);  /* Y ~ Exp(2), rounded to nearest 0.001 */
   output;
end;
do y = 10,11,15; output; end;   /* add outliers */
run;
 
proc univariate data=Q;
   qqplot y / exp grid;         /* plot data quantiles against Exp(1) */
   ods select QQPlot;
   ods output QQPlot=QQPlot;    /* for later use: save quantiles to a data set */
run;
Q-Q plot in SAS without a reference line

The vertical axis of the Q-Q plot displays the sorted values of the data; the horizontal axis displays evenly spaced quantiles of the standardized target distribution, which in this case is the exponential distribution with scale parameter σ = 1. Most of the points appear to fall on a straight line, which indicates that these (simulated) data might be reasonably modeled by using an exponential distribution. The slope of the line appears to be approximately 2, which is a crude estimate of the scale parameter (σ). The Y-intercept of the line appears to be approximately 0, which is a crude estimate of the location parameter (the threshold parameter, θ).
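Said another way, if the data really do follow an exponential distribution with threshold θ and scale σ, the plotted points should fall near the line y = θ + σ·q, where q is a quantile of the standardized (σ = 1) exponential distribution. The intercept and slope of whatever line you overlay therefore serve as rough estimates of θ and σ, respectively.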

Although the basic Q-Q plot provides all the information you need to decide that these data can be modeled by an exponential distribution, some data sets are less clear. The Q-Q plot might show a slight bend or wiggle, and you might want to overlay a reference line to assess how severely the pattern deviates from a straight line. The problem is, what line should you use?

A reference line for the Q-Q plot

Cleveland (Visualizing Data, 1993, p. 31) recommends overlaying a line that connects the first and third quartiles. That is, let p25 and p75 be the 25th and 75th percentiles of the target distribution, respectively, and let y25 and y75 be the 25th and 75th percentiles of the ordered data values. Then Cleveland recommends plotting the line through the ordered pairs (p25, y25) and (p75, y75).

In SAS, you can use PROC MEANS to compute the 25th and 75th percentiles for the X and Y variables in the Q-Q plot. You can then use the DATA step or PROC SQL to compute the slope of the line that passes between the percentiles. The following statements analyze the Q-Q plot data that was created by using the ODS OUTPUT statement in the previous section:

proc means data=QQPlot P25 P75;
   var Quantile Data;        /* ODS OUTPUT created the variables Quantile (X) and Data (Y) */
   output out=Pctl P25= P75= / autoname;
run;
 
data _null_;
set Pctl;
slope = (Data_P75 - Data_P25) / (Quantile_P75 - Quantile_P25); /* dy / dx */
/* if desired, put point-slope values into macro variables to help plot the line */
call symputx("x1", Quantile_P25);
call symputx("y1", Data_P25);
call symput("Slope", putn(slope,"BEST5."));
run;
 
title "Q-Q Plot with Reference Line";
title2 "Reference Line through First and Third Quartiles";
title3 "Slope = &slope";
proc sgplot data=QQPlot;
   scatter x=Quantile y=Data;
   lineparm x=&x1 y=&y1 slope=&slope / lineattrs=(color=Green) legendlabel="Percentile Estimate";
   xaxis grid label="Exponential Quantiles"; yaxis grid;
run;
Q-Q plot in SAS with reference line through first and third quartiles

Because the line passes through the first and third quartiles, the slope of the line is robust to outliers in the tails of the data. The line often provides a simple visual guide to help you determine whether the central portion of the data matches the quantiles of the specified probability distribution.

Keep in mind that this is a visual guide. The slope and intercept for this line should not be used as parameter estimates for the location and scale parameters of the probability distribution, although they could be used as an initial guess for an optimization that estimates the location and scale parameters for the model distribution.

Regression lines as visual guides for a Q-Q plot

Let's be honest, when a statistician sees a scatter plot for which the points appear to be linearly related, there is a Pavlovian reflex to fit a regression line to the values in the plot. However, I can think of several reasons to avoid adding a regression line to a Q-Q plot:

  • The values in the Q-Q plot do not satisfy the assumptions of ordinary least squares (OLS) regression. For example, the points are not a random sample and there is no reason to assume that the errors in the Y direction are normally distributed.
  • In practice, the tails of the probability distribution rarely match the tails of the data distribution. In fact, the points to the extreme left and right of a Q-Q plot often exhibit a systematic bend away from a straight line. In an OLS regression, these extreme points will be high-leverage points that will unduly affect the OLS fit.

If you choose to ignore these problems, you can use the REG statement in PROC SGPLOT to add a reference line. Alternatively, you can use PROC REG in SAS (perhaps with the NOINT option if the location parameter is zero) to obtain an estimate of the slope:

proc reg data=QQPlot plots=NONE;
   model Data = Quantile / NOINT;  /* use NOINT when location parameter is 0 */
   ods select ParameterEstimates;
quit;
 
title2 "Least Squares Reference Line";
proc sgplot data=QQPlot;
   scatter x=Quantile y=Data;
   lineparm x=0 y=0 slope=2.36558 / lineattrs=(color=Red) legendlabel="OLS Estimate";
   xaxis grid label="Exponential Quantiles"; yaxis grid;
run;
Q-Q plot in SAS with regression line (not recommended)

For these data, I used the NOINT option to set the threshold parameter to 0. The zero-intercept line with slope 2.36558 is overlaid on the Q-Q plot. As expected, the outliers in the upper-right corner of the Q-Q plot have pulled the regression line upward, so the regression line has a steeper slope than the reference line based on the first and third quartiles. Because the tails of an empirical distribution often differ from the tails of the target distribution, the regression-based reference line can be misleading. I do not recommend its use.

Maximum likelihood estimates

The previous sections describe two ways to overlay a reference line during the exploratory phase of the data analysis. The purpose of the reference line is to guide your eye and help you determine whether the points in the Q-Q plot appear to fall on a straight line. If so, you can move to the modeling phase.

In the modeling phase, you use a parameter estimation method to fit the parameters in the target distribution. Maximum likelihood estimation (MLE) is often the method of choice for estimating parameters from data. You can use the HISTOGRAM statement in PROC UNIVARIATE to obtain a maximum likelihood estimate of the scale parameter for the exponential distribution, which turns out to be 2.21387. If you specify the location and scale parameters in the QQPLOT statement, PROC UNIVARIATE will automatically overlay a line that represents the fitted values:

proc univariate data=Q;
   histogram y / exp;
   qqplot y / exp(threshold=0 scale=est) odstitle="Q-Q Plot with MLE Estimate" grid;
   ods select ParameterEstimates GoodnessOfFit QQPlot;
run;
Parameter estimates and goodness-of-fit test for a maximum likelihood estimate of parameters in an exponential distribution

The ParameterEstimates table shows the maximum likelihood estimate. The GoodnessOfFit table shows that there is no evidence to reject the hypothesis that these data came from an Exp(σ=2.21) distribution.

Q-Q plot in SAS with line formed by using maximum likelihood estimates

Notice the distinction between this line and the previous lines. This line is the result of fitting the target distribution to the data (MLE) whereas the previous lines were visual guides. When you display a Q-Q plot that has a diagonal line, you should state how the line was computed.

In conclusion, you can display a Q-Q plot without adding any reference line. If you choose to overlay a line, there are three common methods. During the exploratory phase of analysis, you can display a line that connects the 25th and 75th percentiles of the data and target distributions. (Some practitioners use an OLS regression line, but I do not recommend it.) During the modeling phase, you can use maximum likelihood estimation or some other fitting method to estimate the location and scale of the target distribution. Those estimates can be used as the intercept and slope, respectively, of a line on the Q-Q plot. PROC UNIVARIATE in SAS displays this line automatically when you fit a distribution.

The post Three ways to add a line to a Q-Q plot appeared first on The DO Loop.

January 16, 2019
 

The potential for artificial intelligence (AI) and the Internet of Things (IoT) to transform the way health care and therapies are delivered is tremendous. It’s not surprising that the health care and life sciences industries are being flooded with information about how these new technologies will change everything. While it’s [...]

How are AI and advanced analytics transforming health and life sciences? was published on SAS Voices by Cameron McLauchlin.

January 15, 2019
 

When I fill up my daily-driver Prius, the price of gasoline isn't that important. But when I occasionally take a trip in my V8 Suburban, I pay a lot more attention! Therefore I was pleasantly surprised when I noticed that gasoline prices have been falling. How much have they fallen, [...]

The post Let's track the falling gasoline prices! appeared first on SAS Learning Post.

January 15, 2019
 

This is the latest installment in my series of posts dedicated to describing IoT applications in retail. Why do I want retailers to move more quickly in their widescale adoption of IoT in the coming year? Because I'm confident they'll see results, and these use cases help explain how. As [...]

IoT applications in retail: the supply chain was published on SAS Voices by Greg Heidrick.

January 14, 2019
 

In the second of three posts on using automated analysis with SAS Visual Analytics, we used the automated analysis object to get a better understanding of our variable of interest, X-Sell and Up-sell Flag, and how it is influenced by other variables in our dataset.

In this third and final post, you'll see how to filter the data even more to set up your customer care workers for success.

Remember how on the left-hand side of the analysis we had a list of subgroups with their probabilities? We can use those to filter our data or to create additional subsets of data. Let's create a calculated category from one of the subgroups and then use it to filter a list table of customers. If we right-click on the 87% subgroup and select Derive subgroup item, a new calculated category appears in the Data pane.

Here is the new data item located in our data pane:

To see the filter for this data object, we can right-click on it and select Edit.

We can now use this category as a filter. Here we have a basic customer table that does not have a filter applied:

If we apply the filter for customers who fall in the 87% subgroup, plus a filter for those who have not yet upgraded, we get a list of customers who are highly likely to upgrade.

We could give this list to our customer care centers and have them call these customers to see if they want to upgrade. Alternatively, the customer care center could use this filter to target customers for upgrades when they call in. So, if a customer calls into the center, the employee could see whether that customer meets the criteria set out in the filter. If they do, they are highly likely to upgrade, and the employee should make them an offer.

How to match callers with sales channels

Let’s go back to our automated analysis and perform one more action. We’ll create a new object from the subgroup and assess the group by acquisition channel. This will help us determine which acquisition channel(s) the customers who are in our 87% subgroup purchased their plans from. Then we’ll know which sales teams we need to communicate to about our sales strategy.

To do this, we'll select our 87% group, right-click and select New object from subgroup on new page, then Acquisition Channel.

Here we see the customers who are in or out of our subgroup by acquisition channel.

Because it is difficult to see the "in" group, we'll remove the customers who are out of our subgroup: select "out" in the legend, right-click, and choose New filter from selection, then Exclude selection.

Now we can see which acquisition channel the 87% subgroup purchased their current plan from and how many have already upgraded.

In less than a minute, using SAS Visual Analytics' automated analysis, we've gained business insights based on machine learning that would have taken hours to produce manually. Not only that, we've got easy-to-understand results built with natural language processing. We can now analyze all variables and remove bias, ensuring we don't miss key findings. Business users gain access to analytics without needing the expert skills to build models and interpret results. Automated analysis is a start, and SAS is committed to investing time and resources into this new wave of BI. Look for more enhancements in future releases!

Miss the previous posts?

This is the third of a three-part series demonstrating automated analysis using SAS Visual Analytics on Viya. Part 1 describes a common visualization approach to handling customer data that leaves room for error and missed opportunities. Part 2 shows improvements through automated analysis.

Want to see automated analysis in action? Watch this video!

How SAS Visual Analytics' automated analysis takes customer care to the next level - Part 3 was published on SAS Users.