June 11, 2020
 

Rapid demand response forecasting techniques are forecasting processes that can incorporate key information quickly enough for agile supply chains to act on it in real time. Retailers and consumer goods suppliers are urgently trying to determine how changes in consumer behavior will affect their regions, channels, categories, brands and products during [...]

Rapid demand response forecasting helps retailers adapt during COVID-19 was published on SAS Voices by Charlie Chase

June 1, 2020
 

In a previous article, I discussed the definition of the Kullback-Leibler (K-L) divergence between two discrete probability distributions. For completeness, this article shows how to compute the Kullback-Leibler divergence between two continuous distributions. When f and g are discrete distributions, the K-L divergence is the sum of f(x)*log(f(x)/g(x)) over all x values for which f(x) > 0. When f and g are continuous distributions, the sum becomes an integral:
KL(f,g) = ∫ f(x)*log( f(x)/g(x) ) dx
which is equivalent to
KL(f,g) = ∫ f(x)*( log(f(x)) – log(g(x)) ) dx
The integral is over the support of f, and clearly g must be positive inside the support of f for the integral to be defined.

The Kullback-Leibler divergence between normal distributions

I like to perform numerical integration in SAS by using the QUAD subroutine in the SAS/IML language. You specify the function that you want to integrate (the integrand) and the domain of integration and get back the integral on the domain. Use the special missing value .M to indicate "minus infinity" and the missing value .P to indicate "positive infinity."

As a first example, suppose that f(x) is the pdf of the normal distribution N(μ1, σ1) and g(x) is the pdf of the normal distribution N(μ2, σ2). From the definition, you can calculate the Kullback-Leibler divergence between normal distributions exactly:
KL between Normals: log(σ2/σ1) + (σ1² – σ2² + (μ1 – μ2)²) / (2*σ2²)

You can define the integrand by using the PDF and LOGPDF functions in SAS. The following computation integrates the integrand on the infinite interval (-∞, ∞), although a "six sigma" interval about μ1, such as [-6, 6], would give the same result:

proc iml;
/* Example 1: K-L divergence between two normal distributions */
start KLDivNormal(x) global(mu1, mu2, sigma1, sigma2);
   return pdf("Normal", x, mu1, sigma1) #
          (logpdf("Normal", x, mu1, sigma1) - logpdf("Normal", x, mu2, sigma2));
finish;
 
/* f is PDF for N(0,1) and g is PDF for N(2,3) */ 
mu1 = 0; sigma1 = 1;
mu2 = 2; sigma2 = 3;
 
call quad(KL, "KLDivNormal", {.M .P});  /* integral on (-infinity, infinity) */
KLExact = log(sigma2/sigma1) + 
          (sigma1##2 - sigma2##2 + (mu1-mu2)##2) / (2*sigma2##2);
print KL KLExact (KL-KLExact)[L="Difference"];

The output indicates that the numerical integration agrees with the exact result to many decimal places.

The Kullback-Leibler divergence between exponential and gamma distributions

A blog post by John D. Cook computes the K-L divergence between an exponential and a gamma(a=2) distribution. This section performs the same computation in SAS.

This is a good time to acknowledge that numerical integration can be challenging. Numerical integration routines have default settings that enable you to integrate a wide variety of functions, but sometimes you need to modify the default parameters to get a satisfactory result—especially for integration on infinite intervals. I have previously written about how to use the PEAK= option for the QUAD routine to achieve convergence in certain cases by providing additional guidance to the QUAD routine about the location and scale of the integrand. Following the advice I gave in the previous article, a value of PEAK=0.1 seems to be a good choice for the K-L divergence between the exponential and gamma(a=2) distributions: it is inside the domain of integration and is close to the mode of the integrand, which is at x=0.

/* Example 2: K-L divergence between exponential and gamma(2). Example from
   https://www.johndcook.com/blog/2017/11/08/why-is-kullback-leibler-divergence-not-a-distance/
*/
start KLDivExpGamma(x) global(a);
   return pdf("Exponential", x) #
          (logpdf("Exponential", x) - logpdf("Gamma", x, a));
finish;
 
a = 2;
/* Use PEAK= option: https://blogs.sas.com/content/iml/2014/08/13/peak-option-in-quad.html */
call quad(KL2, "KLDivExpGamma", {0 .P}) peak=0.1;
print KL2;

The value of the K-L divergence agrees with the answer in John D. Cook's blog post.

Approximating the Exponential Distribution by the Gamma Distribution

Recall that the K-L divergence is a measure of the dissimilarity between two distributions. For example, the previous example indicates how well the gamma(a=2) distribution approximates the exponential distribution. As I showed for discrete distributions, you can use the K-L divergence to select the parameters for a distribution that make it the most similar to another distribution. You can do this by using optimization methods, but I will merely plot the K-L divergence as a function of the shape parameter (a) to indicate how the K-L divergence depends on the parameter.

You might recall that the gamma distribution includes the exponential distribution as a special case. When the shape parameter a=1, the gamma(1) distribution is an exponential distribution. Therefore, the K-L divergence between the exponential and the gamma(1) distribution should be zero.
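
As a quick sanity check, you can evaluate the divergence at a=1 and verify that it is numerically zero. The following statements are a minimal sketch that reuses the KLDivExpGamma module and the PEAK=0.1 setting from the previous section:

/* Sanity check: gamma(1) is the exponential distribution, so the K-L divergence should be 0 */
a = 1;
call quad(KL1, "KLDivExpGamma", {0 .P}) peak=0.1;
print KL1;   /* expect a value that is zero to within numerical error */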

The following program computes the K-L divergence for values of a in the range [0.5, 2] and plots the K-L divergence versus the parameter:

/* Plot KL(Expo, Gamma(a)) for various values of a.
   Note that gamma(a=1) is the exponential distribution. */
aa = do(0.5, 2, 0.01);
KL = j(1, ncol(aa), .);
do i = 1 to ncol(aa);
   a = aa[i];
   call quad(KL_i, "KLDivExpGamma", {0 .P}) peak=0.1;
   KL[i] = KL_i;
end;
title "Kullback-Leibler Divergence Between Exponential and Gamma{a)";
call series(aa, KL) grid={x y} label={'Gamma Shape Parameter (a)' 'K-L Divergence'};

As expected, the graph of the K-L divergence reaches its minimum value at a=1, which is the value for which the gamma(a) distribution best approximates the exponential distribution. Note that the K-L divergence equals zero at a=1, which indicates that the two distributions are identical there.

Summary

The Kullback-Leibler divergence between two continuous probability distributions is an integral. This article shows how to use the QUAD subroutine in SAS/IML to compute the K-L divergence between two probability distributions.

The post The Kullback–Leibler divergence between continuous probability distributions appeared first on The DO Loop.

May 28, 2020
 

The Kullback–Leibler divergence is a measure of dissimilarity between two probability distributions. An application in machine learning is to measure how distributions in a parametric family differ from a data distribution. This article shows that if you minimize the Kullback–Leibler divergence over a set of parameters, you can find a distribution that is similar to the data distribution. This article focuses on discrete distributions.

The Kullback–Leibler divergence between two discrete distributions

As explained in a previous article, the Kullback–Leibler (K-L) divergence between two discrete probability distributions is the sum
KL(f, g) = Σx f(x) log( f(x)/g(x) )
where the sum is over the set of x values for which f(x) > 0. (The set {x | f(x) > 0} is called the support of f.) For this sum to be well defined, the distribution g must be strictly positive on the support of f.

One application of the K-L divergence is to measure the similarity between a hypothetical model distribution defined by g and an empirical distribution defined by f.

Example data for the Kullback–Leibler divergence

As an example, suppose a call center averages about 10 calls per hour. An analyst wants to investigate whether the number of calls per hour can be modeled by using a Poisson(λ=10) distribution. To test the hypothesis, the analyst records the number of calls for each hour during 100 hours of operation. The following SAS DATA step reads the data. The call to PROC SGPLOT creates a histogram that shows the distribution of the 100 counts:

data Calls;
input N @@;
label N = "Calls per hour";
datalines;
11 19 11 13 13 8 11 9 9 14 
10 13 8 15 7 9 6 12 7 13 
12 19 6 12 11 12 11 9 15 4 
7 12 12 10 10 16 18 13 13 8 
13 10 9 9 12 13 12 8 13 9 
7 9 10 9 4 10 12 5 4 12 
8 12 14 16 11 7 18 8 10 13 
12 5 11 12 16 9 11 8 11 7 
11 15 8 7 12 16 9 18 9 8 
10 7 11 12 13 15 6 10 10 7 
;
 
title "Number of Calls per Hour";
title2 "Data for 100 Hours";
proc sgplot data=Calls;
   histogram N / scale=proportion binwidth=1;
   xaxis values=(4 to 19) valueshint;
run;

The graph should really be a bar chart, but I used a histogram with BINWIDTH=1 so that the graph reveals that the value 17 does not appear in the data. Furthermore, the values 0, 1, 2, and 3 do not appear in the data. I used the SCALE=PROPORTION option to plot the data distribution on the density scale.

The call center wants to model these data by using a Poisson distribution. The traditional statistical approach is to use maximum likelihood estimation (MLE) to find the parameter, λ, in the Poisson family so that the Poisson(λ) distribution is the best fit to the data. However, let's see how using the Kullback–Leibler divergence leads to a similar result.
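
For reference, the maximum likelihood estimate of λ for a Poisson model is simply the sample mean of the counts. A minimal way to preview that target value (assuming the Calls data set created by the earlier DATA step) is to run PROC MEANS:

proc means data=Calls mean maxdec=2;
   var N;
run;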

The Kullback–Leibler divergence between data and a Poisson distribution

Let's compute the K-L divergence between the empirical frequency distribution and a Poisson(10) distribution. The empirical distribution is the reference distribution; the Poisson(10) distribution is the model. The Poisson distribution has a nonzero probability for all x ≥ 0, but recall that the K-L divergence is computed by summing over the observed values of the empirical distribution, which is the set {4, 5, ..., 19}, excluding the value 17.

proc iml;
/* read the data, which is the reference distribution, f */
use Calls;  read all var "N" into Obs;  close;
call Tabulate(Levels, Freq, Obs);   /* find unique values and frequencies */
Proportion = Freq / nrow(Obs);      /* empirical density of frequency of calls (f) */
 
/* create the model distribution: Poisson(10) */
lambda = 10;   
poisPDF = pdf("Poisson", Levels, lambda); /* Poisson model on support(f) */
 
/* load K-L divergence module or include the definition from: 
 https://blogs.sas.com/content/iml/2020/05/26/kullback-leibler-divergence-discrete.html
*/
load module=KLDiv;
KL = KLDiv(Proportion, poisPDF); 
print KL[format=best5.];

Notice that although the Poisson distribution has infinite support, you only need to evaluate the Poisson density on the (finite) support of the empirical density.

Minimize the Kullback–Leibler divergence between data and a Poisson distribution

The previous section shows how to compute the Kullback–Leibler divergence between an empirical density and a Poisson(10) distribution. You can repeat that computation for a whole range of λ values and plot the divergence versus the Poisson parameter. The following statements compute the K-L divergence for λ on [4, 16] and plot the result. The minimum value of the K-L divergence is achieved near λ = 10.7. At that value of λ, the K-L divergence between the data and the Poisson(10.7) distribution is 0.105.

/* Plot the K-L div versus lambda for a sequence of Poisson(lambda) models */
lambda = do(4, 16, 0.1);
KL = j(1, ncol(lambda), .);
do i = 1 to ncol(lambda);
   poisPDF = pdf("Poisson", Levels, lambda[i]);
   KL[i] = KLDiv(Proportion, poisPDF); 
end;
 
title "K-L Divergence from Poisson(lambda)";
call series(lambda, KL) grid={x y} xvalues=4:16 label={'x' 'K-L Divergence'};

The graph shows the K-L divergence for a sequence of Poisson(λ) models. The Poisson(10.7) model has the smallest divergence from the data distribution; therefore, it is the most similar to the data among the Poisson(λ) distributions that were considered. You can use a numerical optimization technique in SAS/IML if you want to find a more accurate value that minimizes the K-L divergence.
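
The following statements are a sketch of that approach. They assume that the KLDiv module and the Levels and Proportion vectors from the earlier PROC IML session are still available; the module name KLObjFun is introduced here only for illustration. The sketch wraps the K-L divergence in an objective function of λ and passes it to the NLPQN subroutine, which performs quasi-Newton optimization:

/* Sketch: minimize KL(Proportion, Poisson(lambda)) over lambda */
start KLObjFun(lambda) global(Levels, Proportion);
   poisPDF = pdf("Poisson", Levels, lambda);
   return KLDiv(Proportion, poisPDF);
finish;
 
x0   = 10;          /* initial guess for lambda */
optn = {0 0};       /* optn[1]=0 ==> minimization; optn[2]=0 ==> no iteration history */
con  = {0.1, .};    /* lower bound lambda >= 0.1; no upper bound */
call nlpqn(rc, lambdaOpt, "KLObjFun", x0, optn, con);
print lambdaOpt;    /* should be near 10.7 */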

The following graph overlays the PMF for the Poisson(10.7) distribution on the empirical distribution for the number of calls.

The minimal Kullback–Leibler divergence and the maximum likelihood estimate

You might wonder how minimizing the K-L divergence relates to the traditional MLE method for fitting a Poisson model to the data. The following call to PROC GENMOD shows that the maximum likelihood estimate is λ = 10.71:

proc genmod data=Calls;
   model N = / dist=poisson;
   output out=PoissonFit p=lambda;
run;
 
proc print data=PoissonFit(obs=1) noobs;
   var lambda;
run;

Is this a coincidence? No. It turns out that there is a connection between the K-L divergence and the negative log-likelihood. Minimizing the K-L divergence is equivalent to minimizing the negative log-likelihood, which is equivalent to maximizing the likelihood of the data under the Poisson model.
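
To see why, write the K-L divergence as KL(f, g) = Σx f(x) log( f(x) ) – Σx f(x) log( g(x) ). The first sum depends only on the data, not on the model parameter. When f is the empirical distribution of n observations, the second sum equals (1/n) times the log-likelihood of the model g. Consequently, minimizing KL(f, g) over λ is the same as maximizing the log-likelihood, which is why the two estimates agree.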

Summary

This article shows how to compute the Kullback–Leibler divergence between an empirical distribution and a Poisson distribution. The empirical distribution was the observed number of calls per hour for 100 hours in a call center. You can compute the K-L divergence for many parameter values (or use numerical optimization) to find the parameter that minimizes the K-L divergence. This parameter value corresponds to the Poisson distribution that is most similar to the data. It turns out that minimizing the K-L divergence is equivalent to maximizing the likelihood function. Although the parameter estimates are the same, the traditional MLE estimate comes with additional tools for statistical inference, such as estimates for confidence intervals and standard errors.

The post Minimizing the Kullback–Leibler divergence appeared first on The DO Loop.

May 26, 2020
 

If you have been learning about machine learning or mathematical statistics, you might have heard about the Kullback–Leibler divergence. The Kullback–Leibler divergence is a measure of dissimilarity between two probability distributions. It measures how much one distribution differs from a reference distribution. This article explains the Kullback–Leibler divergence and shows how to compute it for discrete probability distributions.

Recall that there are many statistical methods that indicate how much two distributions differ. For example, maximum likelihood estimation finds the parameters of a model distribution that make the model most similar to the data. Statistics such as the Kolmogorov-Smirnov statistic are used in goodness-of-fit tests to compare a data distribution to a reference distribution.

What is the Kullback–Leibler divergence?

Let f and g be probability mass functions that have the same domain. The Kullback–Leibler (K-L) divergence is the sum
KL(f, g) = Σx f(x) log( f(x)/g(x) )
where the sum is over the set of x values for which f(x) > 0. (The set {x | f(x) > 0} is called the support of f.) The K-L divergence measures how much the distribution defined by g differs from the reference distribution defined by f.

For this sum to be well defined, the distribution g must be strictly positive on the support of f. That is, the Kullback–Leibler divergence is defined only when g(x) > 0 for all x in the support of f.

Some researchers prefer the argument to the log function to have f(x) in the denominator. Flipping the ratio introduces a negative sign, so an equivalent formula is
KL(f, g) = –Σx f(x) log( g(x)/f(x) )

Notice that if the two density functions (f and g) are the same, then the logarithm of the ratio is 0. Therefore, the K-L divergence is zero when the two distributions are equal. The K-L divergence is positive if the distributions are different.

The Kullback–Leibler divergence was developed as a tool for information theory, but it is frequently used in machine learning. The divergence has several interpretations. In information theory, it measures the information loss when f is approximated by g. In statistics and machine learning, f is often the observed distribution and g is a model.

An example of Kullback–Leibler divergence

As an example, suppose you roll a six-sided die 100 times and record the proportion of 1s, 2s, 3s, etc. You might want to compare this empirical distribution to the uniform distribution, which is the distribution of a fair die for which the probability of each face appearing is 1/6. The following SAS/IML statements compute the Kullback–Leibler (K-L) divergence between the empirical density and the uniform density:

proc iml;
/* K-L divergence is defined for positive discrete densities */
/* Face: 1   2   3   4   5   6 */
f    = {20, 14, 19, 19, 12, 16} / 100;  /* empirical density; 100 rolls of die */
g    = { 1,  1,  1,  1,  1,  1} / 6;    /* uniform density */
KL_fg = -sum( f#log(g/f) );             /* K-L divergence using natural log */
print KL_fg;

The K-L divergence is very small, which indicates that the two distributions are similar.

Although this example compares an empirical distribution to a theoretical distribution, you need to be aware of the limitations of the K-L divergence. The K-L divergence compares two distributions and assumes that the density functions are exact. The K-L divergence does not account for the size of the sample in the previous example. The computation is the same regardless of whether the first density is based on 100 rolls or a million rolls. Thus, the K-L divergence is not a replacement for traditional statistical goodness-of-fit tests.

The Kullback–Leibler divergence is not symmetric

Let's compare a different distribution to the uniform distribution. Let h(x)=9/30 if x=1,2,3 and let h(x)=1/30 if x=4,5,6. The following statements compute the K-L divergence between h and g and between g and h. In the first computation, the step distribution (h) is the reference distribution. In the second computation, the uniform distribution is the reference distribution.

h = { 9,  9,  9,  1,  1,  1} /30;    /* step density */
g = { 1,  1,  1,  1,  1,  1} / 6;    /* uniform density */
KL_hg = -sum( h#log(g/h) );          /* h is reference distribution */
print KL_hg;
 
/* show that K-L div is not symmetric */
KL_gh = -sum( g#log(h/g) );          /* g is reference distribution */
print KL_gh;

First, notice that the numbers are larger than for the example in the previous section. The f density function is approximately constant, whereas h is not. So the distribution for f is more similar to a uniform distribution than the step distribution is. Second, notice that the K-L divergence is not symmetric. In the first computation (KL_hg), the reference distribution is h, which means that the log terms are weighted by the values of h. The weights from h give a lot of weight to the first three categories (1,2,3) and very little weight to the last three categories (4,5,6). In contrast, g is the reference distribution for the second computation (KL_gh). Because g is the uniform density, the log terms are weighted equally in the second computation.

The fact that the summation is over the support of f means that you can compute the K-L divergence between an empirical distribution (which always has finite support) and a model that has infinite support. However, you cannot use just any distribution for g. Mathematically, f must be absolutely continuous with respect to g. (Another expression is that f is dominated by g.) This means that for every value of x such that f(x)>0, it is also true that g(x)>0.

A SAS function for the Kullback–Leibler divergence

It is convenient to write a function, KLDiv, that computes the Kullback–Leibler divergence for vectors that give the densities of two discrete distributions. The call KLDiv(f, g) computes the probability-weighted sum –Σ f(x) log( g(x)/f(x) ), where x ranges over elements of the support of f. By default, the function verifies that g > 0 on the support of f and returns a missing value if it is not. The following SAS/IML function implements the Kullback–Leibler divergence.

/* The Kullback–Leibler divergence between two discrete densities f and g.
   The f distribution is the reference distribution, which means that 
   the sum is probability-weighted by f. 
   If f(x0)>0 at some x0, the model must allow it. You cannot have g(x0)=0.
*/
start KLDiv(f, g, validate=1);
   if validate then do;        /* if g might be zero, validate */
      idx = loc(g<=0);         /* find locations where g <= 0 */
      if ncol(idx) > 0 then do; 
         if any(f[idx]>0) then do; /* g is not a good model for f */
            *print "ERROR: g(x)=0 when f(x)>0";
            return( . );
         end;
      end;
   end;
   if any(f<=0) then do;        /* f = 0 somewhere */
      idx = loc(f>0);           /* support of f */
      fp = f[idx];              /* restrict f and g to support of f */
      return -sum( fp#log(g[idx]/fp) );  /* sum over support of f */
   end;
   /* else, f > 0 everywhere */
   return -sum( f#log(g/f) );  /* K-L divergence using natural logarithm */
finish;
 
/* test the KLDiv function */
f = {0, 0.10, 0.40, 0.40, 0.1};
g = {0, 0.60, 0.25, 0.15, 0};
KL_pq = KLDiv(f, g);  /* g is not a valid model for f; K-L div not defined */
KL_qp = KLDiv(g, f);  /* f is valid model for g. Sum is over support of g */
print KL_pq KL_qp;

The first call returns a missing value because the sum over the support of f encounters the invalid expression log(0) as the fifth term of the sum. The density g cannot be a model for f because g(5)=0 (no 5s are permitted) whereas f(5)>0 (5s were observed). The second call returns a positive value because the sum over the support of g is valid. In this case, f says that 5s are permitted, but g says that no 5s were observed.

Summary

This article explains the Kullback–Leibler divergence for discrete distributions. A simple example shows that the K-L divergence is not symmetric. This is explained by understanding that the K-L divergence involves a probability-weighted sum where the weights come from the first argument (the reference distribution). Lastly, the article gives an example of implementing the Kullback–Leibler divergence in a matrix-vector language such as SAS/IML. This article focused on discrete distributions. The next article shows how the K-L divergence changes as a function of the parameters in a model.


The post The Kullback–Leibler divergence between discrete probability distributions appeared first on The DO Loop.

April 16, 2020
 

After careful consideration of the evolving COVID-19 situation, SAS made the decision in March to cancel the in-person SAS Global Forum 2020 conference in Washington, DC. The health and well-being of SAS customers and employees was the company's top priority in making that decision, and while it's unfortunate that we [...]

SAS Global Forum 2020: Hybrid marketing with SAS Customer Intelligence 360 was published on Customer Intelligence Blog.

April 2, 2020
 

In February, SAS was recognized as a Leader in the 2020 Gartner Magic Quadrant for Data Science & Machine Learning Platforms report. SAS is the only vendor to be a leader in this report for all seven years of its existence. According to us, the topic of the research is [...]

SAS Customer Intelligence 360: Analyst viewpoints was published on Customer Intelligence Blog.

March 12, 2020
 

In parts one and two of this blog series, we introduced hybrid marketing as a method that combines both direct and digital marketing capabilities while absorbing insights from machine learning. According to Daniel Newman (Futurum Research) and Wilson Raj (SAS) in the October 2019 research study Experience 2030: “Brands must [...]

SAS Customer Intelligence 360: Hybrid marketing and analytics' last mile [Part 3] was published on Customer Intelligence Blog.

March 5, 2020
 

Have you heard that SAS offers a collection of new, high-performance CAS procedures that are compatible with a multi-threaded approach? The free e-book Exploring SAS® Viya®: Data Mining and Machine Learning is a great resource to learn more about these procedures and the features of SAS® Visual Data Mining and Machine Learning. Download it today and keep reading for an excerpt from this free e-book!

In SAS Studio, you can access tasks that help automate your programming so that you do not have to manually write your code. However, there are three options for manually writing your programs in SAS® Viya®:

  1. SAS Studio provides a SAS programming environment for developing and submitting programs to the server.
  2. Batch submission is also still an option.
  3. Open-source languages such as Python, Lua, and Java can submit code to the CAS server.

In this blog post, you will learn the syntax for two of the new, advanced data mining and machine learning procedures: PROC TEXTMINE and PROC TMSCORE.

Overview

The TEXTMINE and TMSCORE procedures integrate the functionalities from both natural language processing and statistical analysis to provide essential functionalities for text mining. The procedures support essential natural language processing (NLP) features such as tokenizing, stemming, part-of-speech tagging, entity recognition, customized stop list, and so on. They also support dimensionality reduction and topic discovery through Singular Value Decomposition.

In this example, you will learn about some of the essential functionalities of PROC TEXTMINE and PROC TMSCORE by using a text data set containing 1,830 Amazon reviews of electronic gaming systems. The data set is named Amazon. You can find similar data sets of Amazon reviews at http://jmcauley.ucsd.edu/data/amazon/.

PROC TEXTMINE

The Amazon data set has already been loaded into CAS. The review content is stored in the variable ReviewBody, and we generate a unique review ID for each review. In the proc call shown in Program 1, we ask PROC TEXTMINE to do three tasks:

  1. parse the documents in table reviews and generate the term by document matrix
  2. perform dimensionality reduction via Singular Value Decomposition
  3. perform topic discovery based on Singular Value Decomposition results

Program 1: PROC TEXTMINE

data mycaslib.amazon;
    set mylib.amazon;
run;

data mycaslib.engstop;
    set mylib.engstop;
run;

proc textmine data=mycaslib.amazon;
    doc_id id;
    var reviewbody;

 /*(1)*/  parse reducef=2 entities=std stoplist=mycaslib.engstop 
          outterms=mycaslib.terms outparent=mycaslib.parent
          outconfig=mycaslib.config;

 /*(2)*/  svd k=10 svdu=mycaslib.svdu outdocpro=mycaslib.docpro
          outtopics=mycaslib.topics;

run;

(1) The first task (parsing) is specified in the PARSE statement. Parameter “reducef” specifies the minimum number of times a term needs to appear in the text to be included in the analysis. Parameter “stoplist” specifies a list of terms to be excluded from the analysis, such as “the”, “this”, and “that”. Outparent is the output table that stores the term by document matrix, and Outterms is the output table that stores the information about the terms that are included in the term by document matrix. Outconfig is the output table that stores configuration information for future scoring.

(2) Tasks 2 and 3 (dimensionality reduction and topic discovery) are specified in the SVD statement. Parameter K specifies the desired number of dimensions and number of topics. Parameter SVDU is the output table that stores the U matrix from SVD calculations, which is needed in future scoring. Parameter OutDocPro is the output table that stores the new matrix with reduced dimensions. Parameter OutTopics specifies the output table that stores the topics discovered.

Click the Run shortcut button or press F3 to run Program 1. The terms table shown in Output 1 stores the tagging, stemming, and entity recognition results. It also stores the number of times each term appears in the text data.

Output 1: Results from Program 1

PROC TMSCORE

PROC TEXTMINE is used with large training data sets. When you have new documents coming in, you do not need to re-run all the parsing and SVD computations with PROC TEXTMINE. Instead, you can use PROC TMSCORE to score new text data. The scoring procedure parses the new document(s) and projects the text data into the same dimensions using the SVD weights derived from the original training data.

In order to use PROC TMSCORE to generate results consistent with PROC TEXTMINE, you need to provide the following tables generated by PROC TEXTMINE:

  • SVDU table – provides the required information for projection into the same dimensions.
  • Config table – provides parameter values for parsing.
  • Terms table – provides the terms that should be included in the analysis.

Program 2 shows an example of PROC TMSCORE. It uses the same input data layout that was used for the PROC TEXTMINE code, so it generates docpro and parent output tables with the same structure, as shown in Output 2.

Program 2: PROC TMSCORE

Proc tmscore data=mycaslib.amazon svdu=mycaslib.svdu
        config=mycaslib.config terms=mycaslib.terms
        svddocpro=mycaslib.score_docpro outparent=mycaslib.score_parent;
    var reviewbody;
    doc_id id;
run;

 

Output 2: Results from Program 2

To learn more about advanced data mining and machine learning procedures available in SAS Viya, including PROC FACTMAC, PROC TEXTMINE, and PROC NETWORK, you can download the free e-book, Exploring SAS® Viya®: Data Mining and Machine Learning. Exploring SAS® Viya® is a series of e-books that are based on content from SAS® Viya® Enablement, a free course available from SAS Education. You can follow along with examples in real time by watching the videos.

 

Learn about new data mining and machine learning procedures in SAS Viya was published on SAS Users.

March 4, 2020
 

Your brand is customer journey obsessed, and every interaction with your company provides a potential opportunity to make an intelligent decision, deepen engagement and meet conversion goals. The hype around martech innovation in 2020 continues to grow, and every technology vendor is claiming the following: "Bolster the customer [...]

SAS Customer Intelligence 360: Marketing AI vision was published on Customer Intelligence Blog.

February 26, 2020
 

Is it getting harder and harder to find empty Excel spreadsheet cells, as you run out of columns and rows? Do your spreadsheet cell labels have more letters than the license plate on your car? Do you find yourself waking up in the middle of the night in cold [...]

Are you suffering from demand planner spreadsheet overload? was published on SAS Voices by Charlie Chase