February 19, 2019
 

When it comes to forecasting new product launches, executives say that it's a frustrating, almost futile, effort. The reason? Minimal data, limited analytic capabilities and a general uncertainty surrounding a new product launch. Not to mention the ever-changing marketplace. Nevertheless, companies cannot disregard the need for a new product forecast [...]

Practical approaches to new product forecasting using structured and unstructured data was published on SAS Voices by Charlie Chase

February 7, 2019
 

Each day, more than 130 Americans die from opioid overdoses. Combating the opioid epidemic begins with understanding it, and that begins with data. SAS recently partnered with graduate students from Carnegie Mellon University's (CMU) Heinz College of Information Systems and Public Policy to understand how data mining and machine [...]

An unexpected weapon in the fight against the opioid epidemic: Graduate students was published on SAS Voices by Manuel Figallo

February 6, 2019
 

Feature generation (also known as feature creation) is the process of creating new features to use for training machine learning models. This article focuses on regression models. The new features (which statisticians call variables) are typically nonlinear transformations of existing variables or combinations of two or more existing variables. This article argues that a naive approach to feature generation can lead to many correlated features (variables) that increase the cost of fitting a model without adding much to the model's predictive power.

Feature generation in traditional statistics

Feature generation is not new. Classical regression often uses transformations of the original variables. In the past, I have written about applying a logarithmic transformation when a variable's values span several orders of magnitude. Statisticians generate spline effects from explanatory variables to handle general nonlinear relationships. Polynomial effects can model quadratic dependence and interactions between variables. Other classical transformations in statistics include the square-root and inverse transformations.

SAS/STAT procedures provide several ways to generate new features from existing variables, including the EFFECT statement and the "stars and bars" notation. However, an undisciplined approach to feature generation can lead to a geometric explosion of features. For example, if you generate all pairwise quadratic interactions of N continuous variables, you obtain "N choose 2" or N*(N-1)/2 new features. For N=100 variables, this leads to 4950 pairwise quadratic effects!
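
To make the combinatorial growth concrete, here is a small sketch (not from the original article) of the two SAS mechanisms mentioned above. The data set MyData and the variables y and x1-x5 are hypothetical placeholders; the EFFECT statement builds a polynomial effect that contains main effects, squares, and pairwise products, while the "bar and at" operators in PROC GLM request all two-way interactions.

/* Sketch: two ways to mass-produce features (MyData, y, and x1-x5 are hypothetical) */
proc glmselect data=MyData;
   effect poly = polynomial(x1-x5 / degree=2);  /* main effects, squares, and pairwise products */
   model y = poly / selection=none;
run;

proc glm data=MyData;
   model y = x1|x2|x3|x4|x5 @2;   /* bar operator with @2: main effects plus all two-way interactions */
run; quit;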

Generated features might be highly correlated

In addition to the sheer number of features that you can generate, another problem with generating features willy-nilly is that some of the generated effects might be highly correlated with each other. This can lead to difficulties if you use automated model-building methods to select the "important" features from among thousands of candidates.

I was reminded of this fact recently when I wrote an article about model building with PROC GLMSELECT in SAS. The data were simulated: X from a uniform distribution on [-3, 3] and Y from a cubic function of X (plus random noise). I generated the polynomial effects x, x^2, ..., x^7, and the procedure had to build a regression model from these candidates. The stepwise selection method added the x^7 effect first (after the intercept). Later it added the x^5 effect. Of course, the polynomials x^7 and x^5 have a similar shape to x^3, but I was surprised that those effects entered the model before x^3 because the data were simulated from a cubic formula.

After thinking about it, I realized that the odd-degree polynomial effects are highly correlated with each other and have high correlations with the target (response) variable. The same is true for the even-degree polynomial effects. Here is the DATA step that generates 1000 observations from a cubic regression model, along with the correlations between the effects (x1-x7) and the target variable (y):

%let d = 7;
%let xMean = 0;
data Poly;
call streaminit(54321);
array x[&d];
do i = 1 to 1000;
   x[1] = rand("Normal", &xMean);  /* x1 ~ N(&xMean, 1) */
   do j = 2 to &d;
      x[j] = x[j-1] * x[1];        /* x[i] = x1**i, i = 2..7 */
   end;
   y = 2 - 1.105*x1 - 0.2*x2 + 0.5*x3 + rand("Normal");  /* response is cubic function of x1 */
   output;
end;
drop i j;
run;
 
proc corr data=Poly nosimple noprob;
   var y;
   with x:;
run;
Correlations between the target variable and polynomial effects

You can see from the output that the x^5 and x^7 effects have the highest correlations with the response variable. Because the squared correlations are the R-square values for the regression of Y onto each effect, it makes intuitive sense that these effects are added to the model early in the process. Towards the end of the model-building process, the x^3 effect enters the model and the x^5 and x^7 effects are removed. To the model-building algorithm, these effects have similar predictive power because they are highly correlated with each other, as shown in the following correlation matrix:

proc corr data=Poly nosimple noprob plots(MAXPOINTS=NONE)=matrix(NVAR=ALL);
   var x:;
run;
Correlations among polynomial effects

I've highlighted certain cells in the lower triangular correlation matrix to emphasize the large correlations. Notice that the correlations between the even- and odd-degree effects are close to zero and are not highlighted. The table is a little hard to read, but you can use PROC IML to generate a heat map for which cells are shaded according to the pairwise correlations:

Heat map of correlations among polynomial effects
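
The article does not show the heat map code, but a minimal PROC IML sketch along these lines could look like the following (it assumes the Poly data set created above; the color ramp and title are arbitrary choices):

proc iml;
varNames = "x1":"x7";                         /* names of the polynomial effects */
use Poly;
read all var varNames into X;
close;
R = corr(X);                                  /* 7 x 7 matrix of pairwise correlations */
call heatmapcont(R) xvalues=varNames yvalues=varNames
     range={-1 1} colorramp="TwoColor"
     title="Correlations Among Polynomial Effects";
quit;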

This checkerboard pattern shows the large correlations that occur between the polynomial effects in this problem. This image shows that many of the generated features do not add much new information to the set of explanatory variables. This can also happen with other transformations, so think carefully before you generate thousands of new features. Are you producing new effects or redundant ones?

Generated features are not independent

Notice that the generated effects are not statistically independent. They are all generated from the same X variable, so they are functionally dependent. In fact, a scatter plot matrix of the data will show that the pairwise relationships are restricted to parametric curves in the plane. The graph of (x, x^2) is quadratic, the graph of (x, x^3) is cubic, the graph of (x^2, x^3) is an algebraic cusp, and so forth. The PROC CORR statement in the previous section created a scatter plot matrix, which is shown below. Again, I have highlighted the cells that contain highly correlated variables.

Scatter plot matrix that shows the statistical dependencies between polynomial effects

I like this scatter plot matrix for two reasons. First, it is soooo pretty! Second, it visually demonstrates that two variables that have low correlation are not necessarily independent.

Summary

Feature generation is important in many areas of machine learning. It is easy to create thousands of effects by applying transformations and generating interactions between all variables. However, the example in this article demonstrates that you should be a little cautious of using a brute-force approach. When possible, you should use domain knowledge to guide your feature generation. Every feature you generate adds complexity during model fitting, so try to avoid adding a bunch of redundant highly-correlated features.

For more on this topic, see the article "Automate your feature engineering" by my colleague, Funda Gunes. She writes: "The process [of generating new features] involves thinking about structures in the data, the underlying form of the problem, and how best to expose these features to predictive modeling algorithms. The success of this tedious human-driven process depends heavily on the domain and statistical expertise of the data scientist." I agree completely. Funda goes on to discuss SAS tools that can help data scientists with this process, such as Model Studio in SAS Visual Data Mining and Machine Learning. She shows an example of feature generation in her article and in a companion article about feature generation. These tools can help you generate features in a thoughtful, principled, and problem-driven manner, rather than relying on a brute-force approach.

The post Feature generation and correlations among features in machine learning appeared first on The DO Loop.

February 4, 2019
 
Animation of models built by using PROC GLMSELECT in SAS

I previously discussed how you can use validation data to choose between a set of competing regression models. In that article, I manually evaluated seven models on the training data and manually chose the model that gave the best predictions for the validation data. Fortunately, SAS software provides ways to automate this process! This article describes how PROC GLMSELECT builds models on training data and uses validation data to choose a final model. The animated GIF to the right visualizes the sequence of models that are built.

You can download the complete SAS program that creates the results in this blog post.

How to run PROC GLMSELECT

The GLMSELECT procedure in SAS/STAT is a workhorse procedure that implements many variable-selection methods, including least angle regression (LAR), LASSO, and elastic nets. Even though PROC GLMSELECT was introduced in SAS 9.1 (Cohen, 2006), many of its options remain relatively unknown to SAS data analysts.

Some statisticians argue against automated model building and selection. Both Cohen (2006) and the PROC GLMSELECT documentation contain a discussion of the dangers and controversies regarding automated model building. Nevertheless, machine learning techniques routinely use this sort of process to build accurate predictive models from among hundreds of variables (or features, as they are known in the ML community).

The following four statements create the analyses and graphs in this article. The (x, y) data are simulated from the cubic polynomial y = 2 - 1.105*x - 0.2*x^2 + 0.5*x^3 + N(0,1).

title "Select Model from 8 Effects";
proc glmselect data=Have seed=1 plots(StartStep=1)=(ASEPlot Coefficients);
   effect poly = polynomial(x / degree=7);    /* generate monomial effects: x, x^2, ..., x^7 */
   partition fraction(validate=0.4);          /* use 40% of data for validation */
   model y = poly / selection=stepwise(select=SBC choose=validate)
                details=steps(Candidates ParameterEstimates); /* OPTIONAL */
run;

The analysis uses the following options:

  • The PLOTS= option requests two plots. The ASEPlot is a plot of the average square error (ASE) of each model on the validation data. The STARTSTEP= option displays the ASE beginning with the first step. (The zeroth step is usually the intercept-only model, which typically has a very large ASE.) The Coefficients plot shows how various effects enter or leave the model at each step of the model-building process. By default, a panel is created, but the lower part of the panel duplicates part of the ASE plot so I've unpacked the panel into separate plots.
  • The EFFECT statement generates the monomial effects {x, x^2, ..., x^7} from the variable x.
  • The PARTITION statement randomly divides the input data into two subsets. The validation set contains 40% of the data and the training set contains the other 60%. The SEED= option on the PROC GLMSELECT statement specifies the seed value for the random split.
  • The SELECTION= option specifies the algorithm that builds a model from the effects. This example uses the stepwise selection algorithm because it is easy to understand. At each step in the model-building process, the stepwise algorithm builds a new model by modifying the model from the previous step. The new model will either contain a new effect that was not in the previous model or will remove an effect from the previous model.
  • The SELECT=SBC option specifies that the procedure will use the SBC criterion to assess the candidate effects and determine which effect should be added (or removed) from the previous model.
  • The CHOOSE=VALIDATE option specifies that the models are scored on the validation data. The ASE values for the models are used to choose the most predictive model from among the models that were built.
  • The DETAILS= option displays information about the models that are built at each step. If you only care about the final model, you do not need this option.

The chosen model is a cubic polynomial. The parameter estimates (below) are very close to the parameter values that are used to simulate the cubic data, so the procedure "discovered" the correct model.

Parameter estimates for a model chosen among models built by PROC GLMSELECT in SAS

How PROC GLMSELECT constructs the models

Average square error (ASE) plot for models built by PROC GLMSELECT in SAS

How did PROC GLMSELECT obtain that model? The output of the DETAILS=STEPS option shows that the GLMSELECT procedure built eight models. It then used the validation data to decide that the four-parameter cubic model was the model that best balances parsimony (simple versus complex models) and prediction accuracy (underfitting versus overfitting). I won't display all the output, but you can summarize the details by using the ASE plot and the Coefficients plot.

The ASE plot (shown to the right) visualizes the prediction accuracy of the models. The initial model (zeroth step) is the intercept-only model. The horizontal axis of the ASE plot shows how the models are formed from the previous model. The label for the first tick mark is "1+x^5", which means "the model at Step=1 adds the x^5 term to the previous model." The label for the second tick mark is "2+x^2", which means "the model at Step=2 adds the x^2 term to the previous model." A minus sign means that an effect is removed. For example, the label "6-x^7" means "the model at Step=6 removes the x^7 effect from the previous model." The vertical axis tracks the change of the ASE for each successive model. The model-building process stops when it can no longer decrease the ASE on the validation data. For this example, that happens at Step=7.

If you prefer a table, the SelectionSummary table summarizes the models that are built. The columns labeled ASE and Validation ASE contain the precise values in the ASE plot.

SBC and validation ASE for models built by PROC GLMSELECT in SAS

How PROC GLMSELECT defines the models

Coefficient progression plot for models built by PROC GLMSELECT in SAS

The models are least squares estimates for the included effects on the training data. The DETAILS=STEPS option displays the parameter estimates for each model. For these models, which are all polynomial effects for a single continuous variable, you can graph the eight models and overlay the fitted curves on the data. You can do this in eight separate plots or you can use PROC SGPLOT in SAS to create an animated gif that animates the sequence of models. The animation is shown at the top of this article.
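
The article links to the full program, but as a rough sketch, SAS can write an animated GIF by turning on the animation system options and sending a BY-group sequence of PROC SGPLOT graphs to the GIF printer. The data set StepFits below (with one BY group per selection step and variables Step, x, y, and Fit) is a hypothetical placeholder for data that contains the observed points and each step's fitted curve:

/* Sketch: one animation frame per model-building step (StepFits is hypothetical) */
options nodate nonumber printerpath=gif animation=start
        animduration=0.8 animloop=yes noanimoverlay;
ods printer file='ModelSequence.gif';

proc sgplot data=StepFits noautolegend;
   by Step;                                    /* one frame per step */
   scatter x=x y=y;
   series  x=x y=Fit / lineattrs=(thickness=2);
run;

options printerpath=gif animation=stop;
ods printer close;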

In general, you can't visualize models with many effects, but the Coefficients plot displays the values of the estimates for each model. The plot (shown to the right; click to enlarge) labels only the effects in the final model, but I have manually added labels for the x^5 and x^7 effects to help explain the plot.

Because the magnitudes of the parameter estimates can vary greatly, the vertical axis of the plot shows standardized estimates. As before, the horizontal axis represents the various models. Each line visualizes the evolution of values for a particular effect. For example, the brown line dominates the upper left portion of the graph. This line shows the progression of the standardized coefficient of the x^5 term. In models 1–4, the coefficient of the x^5 effect is large and positive. In models 5 and 6, the standardized coefficient of x^5 is small and negative. In the last model, the coefficient of x^5 is set to zero because the effect is removed. Similarly, the magenta line displays the standardized coefficients for x^7. The coefficient is negative for Steps 3 and 4, positive for Step 5, and is zero for the last two models, which do not include the x^7 effect. By using this plot, you can visually discern which effects are included in each model and study the relative change of the coefficients between models.

Summary

This article shows how to use PROC GLMSELECT in SAS to build a sequence of models on training data. From among the models, a final model is chosen that best predicts a validation data set. The example uses the stepwise selection technique because it is easy to understand, but the GLMSELECT procedure supports other model selection algorithms.
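
For example, a sketch of the same analysis with LASSO selection (not the method used in this article) only requires changing the SELECTION= option; the validation data can still be used to choose the final model:

proc glmselect data=Have seed=1;
   effect poly = polynomial(x / degree=7);
   partition fraction(validate=0.4);
   model y = poly / selection=lasso(choose=validate);  /* LASSO instead of stepwise */
run;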

You can visualize the model selection process by using the ASE plot. You can visualize the progression of candidate models by using the coefficient plot. For this univariate regression model, you can also visualize each candidate model by overlaying curves or by using an animation.

For more information about the model selection procedures in SAS, see the SAS/STAT documentation or the following articles:

The post Model selection with PROC GLMSELECT appeared first on The DO Loop.

January 30, 2019
 

Machine learning differs from classical statistics in the way it assesses and compares competing models. In classical statistics, you use all the data to fit each model. You choose between models by using a statistic (such as AIC, AICC, SBC, ...) that measures both the goodness of fit and the complexity of the model. In machine learning, which was developed for Big Data, you separate the data into a "training" set on which you fit models and a "validation" set on which you score the models. You choose between models by using a statistic such as the average squared error (ASE) of the predicted values on the validation data. This article shows an example that illustrates how you can use validation data to assess the fit of multiple models and choose the best model.
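
Written out, the validation ASE is simply ASE = (1/m) * Σ (y_i – ŷ_i)^2, where the sum runs over the m observations in the validation set and ŷ_i is the predicted value from the model that was fit to the training data.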

The idea for this example is from the excellent book, A First Course in Machine Learning by Simon Rogers and Mark Girolami (Second Edition, 2016, p. 31-36 and 86), although I have seen similar examples in many places. The data (training and validation) are pairs (x, y): The independent variable X is randomly sampled in [-3,3] and the response Y is a cubic polynomial in X to which normally distributed noise is added. The training set contains 50 observations; the validation set contains 200 observations. The goal of the example is to demonstrate how machine learning uses validation data to discover that a cubic model fits the data better than polynomials of other degrees.

This article is part of a series that introduces concepts in machine learning to SAS statistical programmers. Machine learning is accompanied by a lot of hype, but at its core it combines many concepts that are already familiar to statisticians and data analysts, including modeling, optimization, linear algebra, probability, and statistics. If you are familiar with those topics, you can master machine learning.

Simulate data from a cubic regression model

The first step of the example is to simulate data from a cubic regression model. The following SAS DATA step simulates the data. PROC SGPLOT visualizes the training and validation data.

data Have;
length Type $10.;
call streaminit(54321);
do i = 1 to 250;
   if i <= 50 then Type = "Train";       /* 50 training observations */
   else            Type = "Validate";    /* 200 validation observations */
   x = rand("uniform", -3, 3);
   /* 2 - 1.105 x - 0.2 x^2 + 0.5 x^3 */
   y = 2 + 0.5*x*(x+1.3)*(x-1.7) + rand("Normal");  /* Y = b0 + b1*X + b2*X**2 + b3*X**3 + N(0,1) */
   output;
end;
run;
 
title "Training and Validation Data";
title2 "True Model Is Cubic Polynomial";
proc sgplot data=Have;
   scatter x=x y=y / group=Type grouporder=data;
   xaxis grid;  yaxis grid;
run;
Data simulated from a cubic model. Use machine learning and training/validation data to select a model that fits the data

Use validation data to assess the fit

In a classical regression analysis, you would fit a series of polynomial models to the 250 observations and use a fit statistic to assess the goodness of fit. Recall that as you increase the degree of a polynomial model, you have more parameters (more degrees of freedom) to fit the data. A high-degree polynomial can produce a small sum of squared errors (SSE) but probably overfits the data and is unlikely to predict future data well. To try to prevent overfitting, statistics such as the AIC and SBC include two terms: one that rewards low values of SSE and another that penalizes models that have a large number of parameters. A comparison based on these classical statistics appears at the end of this article.
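
In the least squares setting, these criteria have the familiar forms AIC ≈ n*log(SSE/n) + 2k and SBC ≈ n*log(SSE/n) + k*log(n) (up to additive constants; SAS procedures use slightly different parameterizations), where k is the number of parameters. The first term decreases as the fit improves, whereas the second term grows with the complexity of the model.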

The machine learning approach is different: use some data to train the model (fit the parameters) and use different data to validate the model. A popular validation statistic is the average square error (ASE), which is formed by scoring the model on the validation data and then computing the average of the squared residuals. The GLMSELECT procedure supports the PARTITION statement, which enables you to fit the model on training data and assess the fit on validation data. The GLMSELECT procedure also supports the EFFECT statement, which enables you to form a POLYNOMIAL effect to model high-order polynomials. For example, the following statements create a third-degree polynomial model, fit the model parameters on the training data, and evaluate the ASE on the validation data:

%let Degree = 3;   
proc glmselect data=Have;
   effect poly = polynomial(x / degree=&Degree);              /* model is polynomial of specified degree */
   partition rolevar=Type(train="Train" validate="Validate"); /* specify training/validation observations */
   model y = poly / selection=NONE;                           /* fit model on training data */
   ods select FitStatistics ParameterEstimates;
run;
Classical fit statistics and average square error (ASE) on validation data

The first table includes the classical statistics for the model, evaluated only on the training data. At the bottom of the table is the "ASE (Validate)" statistic, which is the value of the ASE for the validation data. The second table shows that the parameter estimates are very close to the parameters in the simulation.

Use validation data to select a model

You can repeat the regression analysis for various other polynomial models. It is straightforward to write a macro loop that repeats the analysis as the degree of the polynomial ranges from 1 to 7; a sketch of such a loop follows. For each model, the training and validation data are the same. If you plot the ASE value for each model (on both the training and validation data) against the degree of each polynomial, you obtain the graph below. The vertical axis is plotted on a logarithmic scale because the ASE ranges over an order of magnitude.
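
One minimal sketch of such a macro loop (the macro name and the output data set names are hypothetical; it simply wraps the PROC GLMSELECT call from the previous section):

%macro FitPolynomials(maxDegree=7);
   %do deg = 1 %to &maxDegree;
      title "Polynomial Model of Degree &deg";
      proc glmselect data=Have;
         effect poly = polynomial(x / degree=&deg);
         partition rolevar=Type(train="Train" validate="Validate");
         model y = poly / selection=NONE;
         ods output FitStatistics=Fit&deg;   /* save the fit statistics (including the ASEs) for each degree */
      run;
   %end;
%mend FitPolynomials;

%FitPolynomials(maxDegree=7);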

The graph shows that when you score the linear and quadratic models on the validation data, the fit is not very good, as indicated by the relatively large values of the average square error for Degree = 1 and 2. For the third-degree model, the ASE drops dramatically. Higher-degree polynomials do not substantially change the ASE. As you would expect, the ASE on the training data continues to decrease as the polynomial degree increases. However, the high-degree models, which overfit the training data, are less effective at predicting the validation data. Consequently, the ASE on the validation data actually increases when the degree is greater than 3.

Notice that the minimum value of the ASE on the validation data corresponds to the correct model. By choosing the model that minimizes the validation ASE, we have "discovered" the correct model! Of course, real life is not so simple. In real life, the data almost certainly does not follow a simple algebraic equation. Nevertheless, the graph summarizes the essence of the machine learning approach to model selection: train each model on a subset of the data and select the one that gives the smallest prediction error for the validation sample.

Classical statistics are pretty good, too!

Classical fit statistics as a way to select from competing models

Validation methods are quite effective, but classical statistics are powerful, too. The graph to the right shows three popular fit statistics evaluated on the training data. In each case, the fit statistic reaches a minimum value for a cubic polynomial. Consequently, the classical fit statistics would also select the cubic polynomial as being the best model for these data.

However, it's clear that the validation ASE is a simple technique that is applicable to any model. Whether you use a tree model, a nonparametric model, or even an ensemble of models, the ASE on the validation data is easy to compute and easy to understand. All you need is a way to score the model on the validation sample. In contrast, it can be difficult to construct the classical statistics because you must estimate the number of parameters (degrees of freedom) in the model, which can be a challenge for nonparametric models.
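
As a sketch of that claim, here is one way to compute the validation ASE "by hand" for the cubic model: score all observations, then average the squared residuals over the validation subset. (The data set and variable names are the ones used earlier in this article.)

proc glmselect data=Have;
   effect poly = polynomial(x / degree=3);
   partition rolevar=Type(train="Train" validate="Validate");
   model y = poly / selection=NONE;
   output out=Scored predicted=Pred;          /* predicted values for every observation */
run;

data Scored;
   set Scored;
   SqErr = (y - Pred)**2;                     /* squared residual */
run;

proc means data=Scored mean;
   where Type = "Validate";                   /* average only over the validation observations */
   var SqErr;                                 /* the mean of SqErr is the validation ASE */
run;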

Summary

In summary, a fundamental difference between classical statistics and machine learning is how each discipline assesses and compares models. Machine learning tends to fit models on a subset of the data and assess the goodness of fit by using a second subset. The average square error (ASE) on the validation data is easy to compute and to explain. When you fit and compare several models, you can use the ASE to determine which model is best at predicting new data.

The simple example in this article also illustrates some of the fundamental principles that are used in a model selection procedure such as PROC GLMSELECT in SAS. It turns out that PROC GLMSELECT can automate many of the steps in this article. Using PROC GLMSELECT for model selection will be the topic of a future article.

You can download the SAS program for this example.

The post Model assessment and selection in machine learning appeared first on The DO Loop.

January 21, 2019
 

In machine learning and other model building techniques, it is common to partition a large data set into three segments: training, validation, and testing.

  • Training data is used to fit each model.
  • Validation data is a random sample that is used for model selection. These data are used to choose a model from among candidates by balancing the tradeoff between model complexity (complex models fit the training data well) and generality (overly complex models might not fit the validation data well). These data are potentially used several times during the model-building process.
  • Test data is a hold-out sample that is used to assess the final model and estimate its prediction error. It is only used at the end of the model-building process.

I've seen many questions about how to use SAS to split data into training, validation, and testing data. (A common variation uses only training and validation.) There are basically two approaches to partitioning data:

  • Specify the proportion of observations that you want in each role. For each observation, randomly assign it to one of the three roles. The number of observations assigned to each role will be a multinomial random variable with expected value N*p_k, where N is the number of observations and p_k (k = 1, 2, 3) is the probability of assigning an observation to the kth role. For this method, if you change the random number seed, you will usually get a different number of observations in each role because the group sizes are random variables.
  • Specify the number of observations that you want in each role and randomly allocate that many observations.

This article uses the SAS DATA step to accomplish the first task and uses PROC SURVEYSELECT to accomplish the second. I also discuss how to split data into only two roles: training and validation.

It is worth mentioning that many model-selection routines in SAS enable you to split data by using the PARTITION statement. Examples include the "SELECT" procedures (GLMSELECT, QUANTSELECT, HPGENSELECT...) and the ADAPTIVEREG procedure. However, be aware that the procedures might ignore observations that have missing values for the variables in the model.
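
For example, a procedure-based split might look like the following sketch, in which the procedure itself holds out validation and test observations (the data set MyData and the variables y and x1-x10 are hypothetical placeholders):

proc glmselect data=MyData seed=123;
   partition fraction(validate=0.3 test=0.1);  /* the remaining 60% of observations are used for training */
   model y = x1-x10 / selection=stepwise;
run;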

Random partition into training, validation, and testing data

When you partition data into various roles, you can choose to add an indicator variable, or you can physically create three separate data sets. The following DATA step creates an indicator variable with values "Train", "Validate", and "Test". The specified proportions are 60% training, 30% validation, and 10% testing. You can change the values of the SAS macro variables to use your own proportions. The RAND("Table") function is an efficient way to generate the indicator variable.

data Have;             /* the data to partition  */
   set Sashelp.Heart;  /* for example, use Heart data */
run;
 
/* If propTrain + propValid = 1, then no observation is assigned to testing */
%let propTrain = 0.6;         /* proportion of training data */
%let propValid = 0.3;         /* proportion of validation data */
%let propTest = %sysevalf(1 - &propTrain - &propValid); /* remaining are used for testing */
 
/* Randomly assign each observation to a role; _ROLE_ is indicator variable */
data RandOut;
   array p[2] _temporary_ (&propTrain, &propValid);
   array labels[3] $ _temporary_ ("Train", "Validate", "Test");
   set Have;
   call streaminit(123);         /* set random number seed */
   /* RAND("table") returns 1, 2, or 3 with specified probabilities */
   _k = rand("Table", of p[*]); 
   _ROLE_ = labels[_k];          /* use _ROLE_ = _k if you prefer numerical categories */
   drop _k;
run;
 
proc freq data=RandOut order=freq;
   tables _ROLE_ / nocum;
run;

As shown by the output of PROC FREQ, the proportion of observations in each role is approximately the same as the specified proportions. For this random number seed, the proportions are 59.1%, 30.4%, and 10.6%. If you change the random number seed, you will get a different assignment of observations to roles and also a different proportion of data in each role.

The observant reader will notice that there are only two elements in the array of probabilities (p) that is used in the RAND("Table") call. This is intentional. The documentation for the RAND("Table") function states that if the sum of the specified probabilities is less than 1, then the quantity 1 – Σ p_i is used as the probability of the last event. By specifying only two values in the p array, the same program works for partitioning the data into two pieces (training and validation) or three pieces (and testing).

Create random training, validation, and testing data sets

Some practitioners choose to create three separate data sets instead of adding an indicator variable to the existing data. The computation is exactly the same, but you can use the OUTPUT statement to direct each observation to one of three output data sets, as follows:

/* create a separate data set for each role */
data Train Validate Test;
array p[2] _temporary_ (&propTrain, &propValid);
set Have;
call streaminit(123);         /* set random number seed */
/* RAND("table") returns 1, 2, or 3 with specified probabilities */
_k = rand("Table", of p[*]);
if      _k = 1 then output Train;
else if _k = 2 then output Validate;
else                output Test;
drop _k;
run;
NOTE: The data set WORK.TRAIN has 3078 observations and 17 variables.
NOTE: The data set WORK.VALIDATE has 1581 observations and 17 variables.
NOTE: The data set WORK.TEST has 550 observations and 17 variables.

This example uses the same random number seed as the previous example. Consequently, the three output data sets have the same observations as are indicated by the partition variable (_ROLE_) in the previous example.

Specify the number of observations in each role

Instead of specifying a proportion, you might want to specify the exact number of observations that are randomly assigned to each role. The advantage of this technique is that changing the random number seed does not change the number of observations in each role (although it does change which observations are assigned to each role). The SURVEYSELECT procedure supports the GROUPS= option, which you can use to specify the number of observations.

The GROUPS= option requires that you specify integer values. For this example, the original data contains 5209 observations but 60% of 5209 is not an integer. Therefore, the following DATA step computes the number of observations as ROUND(N p) for the training and validation sets. These integer values are put into macro variables and used in the GROUPS= option on the PROC SURVEYSELECT statement. You can, of course, skip the DATA step and specify your own values such as groups=(3200, 1600, 409).

/* Specify the sizes of the train/validation/test data from proportions */
data _null_;
   if 0 then set sashelp.heart nobs=N;  /* N = total number of obs */
   nTrain = round(N * &propTrain);      /* size of training data */
   nValid = round(N * &propValid);      /* size of validation data */
   call symputx("nTrain", nTrain);      /* put integer into macro variable */
   call symputx("nValid", nValid);
   call symputx("nTest", N - nTrain - nValid);
run;
 
/* randomly assign observations to three groups */
proc surveyselect data=Have seed=12345 out=SSOut
     groups=(&nTrain, &nValid, &nTest); /* if no Test data, use  GROUPS=(&nTrain, &nValid) */
run;
 
proc freq data=SSOut order=freq;
   tables GroupID / nocum;           /* GroupID is name of indicator variable */
run;

The training, validation, and testing groups contain 3125, 1563, and 521 observations, respectively. These numbers are the closest integer approximations to 60%, 30% and 10% of the 5209 observations. Notice that the output from the SURVEYSELECT procedure uses the values 1, 2, and 3 for the GroupID indicator variable. You can use PROC FORMAT to associate those numbers with labels such as "Train", "Validate", and "Test".
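
For example, a small sketch that defines such a format and applies it when displaying the groups:

proc format;
   value RoleFmt 1 = "Train"  2 = "Validate"  3 = "Test";
run;

proc freq data=SSOut order=freq;
   format GroupID RoleFmt.;
   tables GroupID / nocum;
run;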

In summary, there are two basic programming techniques for randomly partitioning data into training, validation, and testing roles. One way uses the SAS DATA step to randomly assign each observation to a role according to proportions that you specify. If you use this technique, the size of each group is random. The other way is to use PROC SURVEYSELECT to randomly assign observations to roles. If you use this technique, you must specify the number of observations in each group.

The post Create training, validation, and test data sets in SAS appeared first on The DO Loop.

January 14, 2019
 

In the second of three posts on using automated analysis with SAS Visual Analytics, we used the automated analysis object to get a better understanding of our variable of interest, X-Sell and Up-sell Flag, and how it is influenced by other variables in our dataset.

In this third and final post, you'll see how to filter the data even more to set up your customer care workers for success.

Remember how on the left-hand side of the analysis we had a list of subgroups with their probabilities? We can use those to filter our data or create additional subsets of data. Let’s create a calculated category from one of the subgroups and then use that to filter a list table of customers. If I right click on the 87% subgroup and select Derive subgroup item, a new calculated category will appear in my Data pane.

Here is the new data item located in our data pane:

To see the filter for this data object, we can right click on it and select Edit.

We can now use this category as a filter. Here we have a basic customer table that does not have a filter applied:

If we apply the filter for customers who fall in the 87% subgroup and a filter for those customers who have not yet upgraded, we have a list of customers that are highly likely to upgrade.

We could give this list to our customer care centers and have them call these customers to see if they want to upgrade. Alternatively, the customer care center could use this filter to target customers for upgrades when they call in. So, if a customer calls into the center, the employee could see if that customer meets the criteria set out in the filter. If they do, they are highly likely to upgrade, and the employee should provide an offer to them.

How to match callers with sales channels

Let’s go back to our automated analysis and perform one more action. We’ll create a new object from the subgroup and assess the group by acquisition channel. This will help us determine which acquisition channel(s) the customers who are in our 87% subgroup purchased their plans from. Then we’ll know which sales teams we need to communicate to about our sales strategy.

To do this we’ll select our 87% group, right click and select New object from subgroup on new page, then Acquisition Channel.

Here we see the customers who are in or out of our subgroup by acquisition channel.

Because it is difficult to see the "in" group, we’ll remove the customers who are out of our subgroup by selecting "out" from the legend, then right clicking and selecting New filter from selection, then Exclude selection.

Now we can see which acquisition channel the 87% subgroup purchased their current plan from and how many have already upgraded.

In less than a minute using SAS Visual Analytics' automated analysis we’ve gained business insights based on machine learning that would have taken hours to produce manually. Not only that, we’ve got easy-to-understand results that are built with natural language processing. We can now analyze all variables and remove any bias, ensuring we don’t miss key findings. Business users gain access to analytics without needing the expert skills to build models and interpret results. Automated analysis is a start, and SAS is committed to investing time and resources in this new wave of BI. Look for more enhancements in future releases!

Miss the previous posts?

This is the third of a three-part series demonstrating automated analysis using SAS Visual Analytics on Viya. Part 1 describes a common visualization approach to handling customer data that leaves room for error and missed opportunities. Part 2 shows improvements through automated analysis.

Want to see automated analysis in action? Watch this video!

How SAS Visual Analytics' automated analysis takes customer care to the next level - Part 3 was published on SAS Users.

January 7, 2019
 

In the first of three posts on using automated analysis with SAS Visual Analytics, we explored a typical visualization designed to give telco customer care workers guidance on the customers most receptive to upgrading their plans. While the analysis provided some insight, it lacked analytical depth -- and that increases the risk of wasting time, energy and money on a strategy that may not succeed.

Let’s now look at the same data, but this time deepen the analytical view by putting SAS Visual Analytics' automated analysis into play. We’ll use automated analysis to determine significant variables that impact our key business measure, X-sell and Up-sell Flag.

Less time spent on data discovery, quicker response time

The automated analysis object determines the most important underlying factors for a specific response variable, in our case the X-Sell and Up-Sell flag. After you specify a response variable, most of the remaining data items are added as underlying factors. Variables that are identical to the response variable, variables that have excessive missing values, or variables that have high cardinality are not added as underlying factors. For category responses, you can select the event level (category value) that interests you.

To run automated analysis, I will use the data pane and right click on Xsell and Upsell Flag Category, select Analyze, then Analyze on new page.

 

Here we see the results for Not Yet Upgraded.

Seeing how we really want to understand what made our customers upgrade so we can learn from it, let’s change the results to see upgraded accounts. To do this I will use the drop-down menu to change the category value to Upgraded.

 

Now we see the details for Upgraded. Let’s look at each piece of information within this chart.

 

The top section tells us that the probability that a customer upgraded is 12.13%. It also tells us which other variables in our dataset influence that probability. The strongest influencers are Total Days over plan, Days Suspended Last 6M (months), Total Times Over Plan and Delinquent Indicator. Remember from our previous analysis, the correlation matrix determined that Total Days over plan, Delinquent Indicator and Days suspended last 6M were correlated with our X-sell and Up-sell Flag. So this part of the analysis is pretty similar. However, the rest of the automated analysis provides much more information than what we got from our previous analysis, and it was produced in under a minute.

The next section gives us a visual of how strong each influencer is on our variable of interest, Xsell and Upsell Flag Category. Total Days Over Plan is the strongest, followed by Days Suspended Last 6M, then Total Times Over Plan, and so on. If we mouse over each of the boxes, we’ll see their relative importance.

After SAS Visual Analytics adds the underlying factors, it creates a relative importance score for each underlying factor. The most important underlying factor is assigned a score of 1, and all other scores are proportional to that value.

If I mouse over Total Days Over Plan I’ll see the relative importance score for that variable.

Here we see that the relative importance of Total Days Over Plan in influencing a customer to change plans is 1. That means it was the most important factor in predicting our variable of interest, cross-sell and up-sell flag. If I mouse over Days Suspended Last 6M, I can see that the relative importance for that variable is 0.6664.

The percentages along the left-hand side give us the probability (or chance) that the subgroups of customers are likely to upgrade. SAS Visual Analytics shows the top groups and the bottom groups based on probability. The first group of customers is 100% likely to upgrade. These customers have Total Days Over Plan greater than or equal to 33, Days Suspended Last 6M greater than or equal to 6, 6M Avg Minutes on Network Normally Distributed less than -6.9, and Delinquent Indicator of 1, 2, 3 or 4. This means that, going forward, if we have customers who meet these criteria we should target them for an upgrade because they are 100% likely to upgrade. We can also use the next three customer groups to target as well.

For measure responses, the results display the four groups that result in the greatest values of the response. The results also display the two groups that result in the smallest values of the response. For category responses, the results display the four groups that contain the greatest percentages of the response. The results also display the two groups that contain the least percentages of the response.

The bottom right chart shows how a variable relates to our variable of interest. Below the chart is a description outlining key findings.

An explanatory plot is included for each underlying factor. The contents of this plot depend on the variable type of both the response variable and the underlying factor.

If I click on Days Suspended Last 6M from the colored button bar, the informative text will be highlighted, and the plot chart will be updated to reflect my selection.

But what if you want to see all the variables analyzed and discover what actions were taken on them? If we maximize the automated analysis object we’d see a table at the bottom. This table outlines actions taken on the predictors.

Here we see that Census Area Total Males was rejected because it is too strongly correlated with another measure. This reason would be easy for someone to miss and would affect the results of an analysis or model if that predictor was not removed. Automated analysis really does do the thinking for us and makes models more accurate!

In the third and final post of this series, we’ll see how we can turn the results from this automated analysis into actionable items.

SAS® Visual Analytics on SAS® Viya® Try it for free!

How SAS Visual Analytics' automated analysis takes customer care to the next level - Part 2 was published on SAS Users.