September 22, 2020
 

Everyone knows that SAS has been helping programmers and coders build complex machine learning models and solve complex business problems for many years, but did you know that you can now also build machine learning models without a single line of code using SAS Viya?

SAS has been helping programmers and coders build complex machine learning models and solve complex business problems over many years.

Building on the vision and commitment to democratize analytics, SAS Viya offers multiple ways to support non-programmers and empowers people with no programming skills to get up and running quickly and build machine learning models. I touched on some of the ways this can be done via SAS Visual Analytics in my previous post on analytics for everyone with SAS Viya. In addition, SAS Viya also supports more advanced pipeline-based visual modeling via SAS Visual Data Mining and Machine Learning. Together, these tools support a low-code/no-code approach to modeling and make SAS Viya an incredibly flexible and powerful analytics platform that can help drive analytics usage and adoption throughout an organization.

As analytics and machine learning become more pervasive, an analytics platform that supports a low-code/no-code approach can get more people involved, drive ongoing innovations, and ultimately accelerate digital transformation throughout an organization.

Speed

I have met my fair share of coding ninjas who blew me away with the speed at which they could build models at the keyboard. But when it comes to quickly turning an idea into a model and generating all the assessment statistics and charts, there is nothing quite like a visual approach to building machine learning models.

In SAS Viya, you can build a decision tree model literally just by dragging and dropping the relevant variables onto the canvas as shown in the animated screen flow below.

Building a machine learning model via drag and drop

In this case, we were able to quickly build a decision tree model that predicts child mortality rates around the world. Not only do we get the decision tree in all its graphical glory (on the left-hand side of the image), we also get the overall model fit measure (Average Standard Error in this case), a variable importance chart, and a lift chart, all without entering a single line of code and in under five seconds!

You also get a rich set of detailed statistical outputs, including a detailed node statistics table, without having to do anything extra. This is useful when you need to review the distribution and characteristics of specific nodes in the decision tree.

Detailed node statistics table

 

What’s more, you can leverage the same drag-and-drop paradigm to quickly tune the model. You can make simple modifications, such as adding a new variable by dragging a new data item onto the canvas, or apply more complex techniques, such as manually splitting or pruning a node just by clicking and selecting it on the canvas. The whole model and visualization refresh instantly as you make changes, giving you immediate feedback on the outputs of your tuning actions, which helps drive rapid iteration and idea testing.

Governance and collaboration

A graphical, component-based approach to modeling also has the added benefits of providing a stronger level of governance and fostering collaboration. Building machine learning models is often a team sport, and the ability to share and reuse models easily can dramatically reduce the cost and effort involved in building and maintaining them.

SAS Visual Data Mining and Machine Learning enables users to build complex, enterprise-grade pipeline models that support sophisticated variable selection and feature engineering techniques, as well as model comparison processes, all within a single, easy-to-understand, pipeline-based design framework.

Pipeline modeling using SAS VDMML

The graphical, pipeline-based modeling framework within SAS Visual Data Mining and Machine Learning leverages common components, supports self-documentation, and lets users take a template-based approach to building and sharing machine learning models quickly.

More importantly, for a new user or team member who needs to review, tune, or reuse someone else’s model, it is much easier and quicker to understand the design and intent of the various components of a pipeline model and make the needed changes.

It is much easier and quicker to understand the design and intent of the various components of a pipeline model.

Communication and storytelling

Finally, and perhaps most importantly, a graphical, low-code/no-code approach to building machine learning models makes it much easier to communicate both the intent and potential impact of the model. Figures and numbers represent facts, but narratives and stories convey emotion and build connections. The visual modeling approaches supported by SAS Viya enable you to tell compelling stories, share powerful ideas, and inspire valuable actions.

SAS Viya enables you to make changes and apply filters on the fly within its various visual modeling environments. With the model training process and model outputs all represented visually, it becomes extremely easy to discuss business scenarios, test hypotheses, and compare modeling strategies and approaches, even with people without a deep machine learning background.

There is no question that a programmatic approach to building machine learning models offers the ultimate power and flexibility and enables data scientists to build the most complex and advanced machine learning models. But when it comes to speed, governance, and communication, a graphical, low-code/no-code approach to building machine learning models definitely has a lot to offer.

To learn more about a low-code/no-code approach to building machine learning models using SAS Viya, check out my book Smart Data Discovery Using SAS® Viya®.

The value of a low-code/no-code approach to building machine learning models was published on SAS Users.

August 27, 2020
 

Decision trees are a fundamental machine learning technique that every data scientist should know. Luckily, constructing and implementing decision trees in SAS is straightforward and easy.

There are three sections to review in the development of a decision tree:

  1. Data
  2. Tree development
  3. Model evaluation

Data

The data that we will use for this example comes from the fantastic UCI Machine Learning Repository. The data set is titled “Bank Marketing Dataset,” and it is available at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#

This data set represents a direct marketing campaign (phone calls) conducted by a Portuguese banking institution. The goal of the direct marketing campaign was to have customers subscribe to a term deposit product. The data set consists of 15 independent variables that represent customer attributes (age, job, marital status, education, etc.) and marketing campaign attributes (month, day of week, number of marketing campaigns, etc.).

The target variable in the data set is represented as “y.” This variable is a binary indicator of whether the phone solicitation resulted in a sale of a term deposit product (“yes”) or did not result in a sale (“no”). For our purposes, we will recode this variable and label it as “TARGET,” and the binary outcomes will be 1 for “yes” and 0 for “no.”
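A minimal DATA step for this recode might look like the following sketch; the input data set name, mydata.bank_raw, is hypothetical and assumes the UCI file has already been imported:

DATA mydata.bank_full;
    SET mydata.bank_raw;    /* hypothetical name for the imported UCI data */
    /* recode the character target "y" into a numeric 1/0 flag */
    IF y = 'yes' THEN TARGET = 1;
    ELSE TARGET = 0;
    DROP y;
RUN;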

The data set is randomly split into two data sets at a 70/30 ratio. The larger data set will be labeled “bank_train” and the smaller data set will be labeled “bank_test”. The decision tree will be developed on the bank_train data set. Once the decision tree has been developed, we will apply the model to the holdout bank_test data set.
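One way to produce such a split is PROC SURVEYSELECT with the OUTALL option, which flags each record instead of subsetting the data. The sketch below makes the same assumptions as above (mydata.bank_full is the hypothetical recoded data set, and the SEED value is arbitrary):

PROC SURVEYSELECT DATA=mydata.bank_full RATE=0.7 SEED=42
                  OUT=bank_split OUTALL METHOD=SRS;
RUN;

DATA mydata.bank_train mydata.bank_test;
    SET bank_split;
    /* Selected=1 marks the 70% sample chosen by PROC SURVEYSELECT */
    IF Selected = 1 THEN OUTPUT mydata.bank_train;
    ELSE OUTPUT mydata.bank_test;
    DROP Selected;
RUN;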

Tree development

The code below specifies how to build a decision tree in SAS. The data set mydata.bank_train is used to develop the decision tree. The output code file will enable us to apply the model to our unseen bank_test data set.

ODS GRAPHICS ON;

PROC HPSPLIT DATA=mydata.bank_train;
    CLASS TARGET _CHARACTER_;
    MODEL TARGET(EVENT='1') = _NUMERIC_ _CHARACTER_;
    PRUNE costcomplexity;
    PARTITION FRACTION(VALIDATE=0.3 SEED=42);
    CODE FILE='C:/Users/James Gearheart/Desktop/SAS Book Stuff/Data/bank_tree.sas';
    OUTPUT OUT = SCORED;
run;

The output of the decision tree algorithm includes a new column labeled “P_TARGET1”. This column shows the probability of a positive outcome for each observation. The output also contains the standard tree diagram that shows the model split points.
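As a quick sanity check, you can print a few rows of the scored training data to inspect the new column; this peek is illustrative only:

PROC PRINT DATA=SCORED (OBS=5);
    VAR TARGET P_TARGET1;
RUN;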

Model evaluation

Once you have developed your model, you will need to evaluate it to see whether it meets the needs of the project. In this example, we want to make sure that the model adequately predicts which observations will lead to a sale.

The first step is to apply the model to the holdout bank_test data set.

DATA test_scored;
    SET MYDATA.bank_test;
    %INCLUDE 'C:/Users/James Gearheart/Desktop/SAS Book Stuff/Data/bank_tree.sas';
RUN;

The %INCLUDE statement applies the decision tree model to the bank_test data set and creates the P_TARGET1 column for that data set.

Now that the model has been applied to the bank_test data set, we need to evaluate its performance by creating a lift table. A lift table expands on the information that is summarized in an ROC chart. Remember that every point along the ROC chart is a probability threshold. The lift table provides detailed information for every point along the ROC curve.

The model evaluation macro that we will use was developed by Wensui Liu. This easy-to-use macro is labeled “separation” and can be applied to any binary classification model output to evaluate the model results.

You can find this macro in my GitHub repository for my new book, End-to-End Data Science with SAS®. This GitHub repository contains all of the code demonstrated in the book along with all of the macros that were used in the book.

This macro is saved on my C drive, and we call it with a %INCLUDE statement.

%INCLUDE 'C:/Users/James Gearheart/Desktop/SAS Book Stuff/Projects/separation.sas';

%separation(data = test_scored, score = P_TARGET1, y = target);

The score script that was generated by the CODE FILE statement in PROC HPSPLIT is applied to the holdout bank_test data set through the use of the %INCLUDE statement.

The table below is generated from the lift table macro.

This table shows that the model adequately separated the positive and negative observations. If we examine the top two rows of the table, we can see that the cumulative bad percent for the top 20% of observations is 47.03%. In other words, we can identify 47.03% of positive cases by selecting the top 20% of the population, that is, observations with a P_TARGET1 score greater than or equal to 0.8276, as defined by the MAX SCORE column.
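As a sketch of how you might act on this result, the following DATA step keeps only the top 20% of scored holdout observations by using the MAX SCORE cutoff from the lift table (the 0.8276 threshold is specific to this example):

DATA top20_prospects;
    SET test_scored;
    /* keep observations at or above the lift-table cutoff */
    IF P_TARGET1 >= 0.8276;
RUN;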

Additional information about decision trees along with several other model designs are reviewed in detail in my new book End-to-End Data Science with SAS® available at Amazon and SAS.com.

Build a decision tree in SAS was published on SAS Users.

August 25, 2020
 

Analytics is playing an increasingly strategic role in the ongoing digital transformation of organizations today. However, to succeed and scale your digital transformation efforts, it is critical to enable analytics skills at all tiers of your organization. In a recent blog post covering 4 principles of analytics you cannot ignore, SAS COO Oliver Schabenberger articulated the importance of democratizing analytics. By scaling your analytics efforts beyond traditional data science teams and involving more people with strong business domain knowledge, you can gain more valuable insights and make more significant impacts.

SAS Viya was built from the ground up to fulfill this vision of democratizing analytics. At SAS, we believe analytics should be accessible to everyone. While SAS Viya offers tremendous support and will continue to be the tool of choice for many advanced users and programmers, it is also highly accessible to business analysts and insights teams who prefer a more visual approach to analytics and insights discovery.

Self-service data management

First of all, SAS Viya makes it easy for anyone to ingest and prepare data without a single line of code. The integrated data preparation components within SAS Viya support ad hoc, agile data management tasks where you can profile, cleanse, and join data easily and rapidly.

Automatically Generated Data Profiling Report

You can execute complex joins, create custom columns, and cleanse your data via a completely drag-and-drop interface. The automation built into SAS Viya eases the often tedious task of data profiling and data cleansing via automated data type identification and transform suggestions. In an area that can be both complex and intimidating, SAS Viya makes data management tasks easy and approachable, helping you to analyze more data and uncover more insights.

Data Join Using a Visual Interface

A visual approach supporting low-code and no-code programming

Speaking of no-code, SAS Viya’s visual approach and support extend deep into data exploration and advanced modeling. Not only can you quickly build charts such as histograms and box plots using a drag-and-drop interface, but you can also build complex machine learning models using algorithms such as decision trees and logistic regression on the same visual canvas.

Building a Decision Tree Model Using SAS Viya

By putting appropriate guardrails in place and providing relevant, context-rich help for the user, SAS Viya empowers users to undertake data analysis using other advanced analytics techniques such as forecasting and correlation analysis. These techniques enable users to ask more complex questions and can potentially help uncover more actionable and valuable insights.

Correlation Analysis Using the Correlation Matrix within SAS Viya

Augmented analytics

Augmented analytics is an emerging area of analytics that leverages machine learning to streamline and automate the process of doing analytics and building machine learning models. SAS Viya leverages augmented analytics throughout the platform to automate various tasks. My favorite use of augmented analytics in SAS Viya, though, is the hyperparameter autotuning feature.

In machine learning, hyperparameters are parameters that you need to set before the learning process can begin. They are used only during training, but they contribute significantly to how well the resulting model performs. It can often be challenging to find the optimal hyperparameter settings, especially if you are not an experienced modeler. This is where SAS Viya can help by making machine learning models easier to build for everyone, one hyperparameter at a time.

Here is an example of using the SAS Viya autotuning feature to improve my decision tree model. Using the autotuning window, all I needed to do was tell SAS Viya how long I wanted the autotuning process to run. It then works its magic and determines the best hyperparameters to use, which, in this case, include the Maximum tree level and the number of Predictor bins. In most cases, you get a better model after coming back from getting a glass of water!

Hyperparameters Autotuning in SAS Viya

Under the hood, SAS Viya uses complex optimization techniques to find the best hyperparameter combinations, all without you having to understand how it manages this impressive feat. I should add that hyperparameter autotuning is supported for many other algorithms in SAS Viya, and you have even more autotuning options when using it via the programmatic interface!
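As a rough sketch of what the programmatic interface looks like, the following uses the AUTOTUNE statement of PROC GRADBOOST in SAS Visual Data Mining and Machine Learning. It assumes an active CAS session, and the CAS table mycas.train and the variable names are hypothetical:

proc gradboost data=mycas.train;
   target y / level=nominal;        /* hypothetical binary target */
   input x1-x5 / level=interval;    /* hypothetical interval inputs */
   autotune maxtime=300;            /* let the tuner search for up to 5 minutes */
run;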

By leveraging a visually oriented framework and augmented analytics capabilities, SAS Viya is making analytics easier and machine learning models more accessible for everyone within an organization. For more on how SAS Viya enables everyone to ask more complex questions and uncover more valuable insights, check out my book Smart Data Discovery Using SAS® Viya®.

Analytics for everyone with SAS Viya was published on SAS Users.

August 18, 2020
 

As the operations director of any organization’s contact center, you’ll want to ensure that your employees have the right tools in their hands to deliver superior customer service, ultimately leading to happy customers and improved revenue for your company. In a previous series on How SAS Visual Analytics’ automated analysis takes customer care to the next level, we saw how automated analysis used machine learning and natural language generation (NLG) to determine which factors were most important in predicting if telecommunications customers would be likely to upgrade to a different mobile plan. We then used this information to create a list of customers our customer care workers could contact to promote our new products. But what about making on-the-fly decisions about which offer(s) best support a given customer’s needs? To provide support for this type of analysis, SAS recently introduced the automated prediction feature within SAS Visual Analytics on SAS Viya.

What is automated prediction?

Automated prediction, in less than a minute, runs several analytic models (such as decision trees, gradient boosting, and logistic and linear regression) on a specific variable of your choice. Most of the remaining variables in your dataset are automatically analyzed as factors that might influence your specified variable. They are called underlying factors. SAS then chooses the one model (champion model) that most accurately predicts your target variable. The model prediction and the underlying factors are then displayed. You can adjust the values of the underlying factors to determine how the model prediction changes with each adjustment.

Let’s look at how automated prediction works.

Here we have the same customer table we were working with in our previous blog posts. This table contains 121 columns (i.e., variables) containing usage and demographic information from a subset of customers who have contacted our customer care centers. One of these columns is a flag that indicates whether that customer upgraded their plan or not. We’ll use this as our target variable.

We’ll right-click on the Upgrade Flag variable and choose Predict on a new page in our report (Figure 1 below).

Figure 1: Selecting automated prediction.

With the automated prediction feature, SAS Visual Analytics used advanced analytic models and machine learning in the background to create, in just a few minutes, this easy-to-use form that can help predict whether a customer is likely to upgrade (Figure 2 below). SAS analyzed all the other variables in the dataset to determine which ones were most likely to influence our Upgrade Flag. We can modify the values in the form and see how each change affects the outcome. The factors are listed in order of their relative importance.

Figure 2: Automated Prediction results.

Under the prediction, SAS provides further details using natural language generation (NLG) (Figure 3 below). For example, here we see that the most prevalent value in our data was Did not upgrade, with 87.87% of the records having that value. We can also see what type of model was chosen as the champion model: the model that best predicts whether a customer will upgrade. In our current example, the Gradient Boosting model provides the most accurate results.

Figure 3: Natural Language Generation (NLG) provides details about the prediction.

Now, let’s say that a customer calls in. We can fill in the value of each listed factor for that specific customer. The analysis is automatically updated using the previously generated model and provides an updated prediction.

Here we see that this customer is likely to upgrade (Figure 4 below). Now we can discuss other mobile plans with this customer.

Figure 4: Updated prediction based on values entered for a specific customer.

What if we want to learn more about the model behind the prediction and what variables were deemed the most important? We can find that information when we maximize the object (Figure 5 below).

Figure 5: Maximize to see model details.

At the bottom, we can now see the steps that were taken to create the champion model (step 2) (Figure 6 below). SAS ran a decision tree, a logistic regression, and a gradient boosting model and found that the gradient boosting model provided the greatest accuracy (91.67% accurate in predicting the upgrade flag).

Figure 6: Prediction model description.

If we select the Relative Importance tab, we can see the relative importance of each of the underlying factors (Figure 7 below).

Figure 7: Relative importance of underlying factors.

Total Days Over Plan has the greatest influence on our Upgrade Flag variable. Days Suspended Last 6M is the next most important variable; its impact is 43.97% of that of Total Days Over Plan (Figure 7 above).

Automated Prediction is a fast and easy way to gain an understanding of how variables influence a target and to ask “what if” by seeing how modifying those underlying factors affects an outcome. Multiple models were run on the data, and a champion model was chosen for us using machine learning. We were able to see the accuracy of the champion model and the relative importance of each influencer. All this within a few minutes! This provides a great foundation for moving forward to more advanced modeling techniques. For those who wish to have more control over the models, SAS Visual Analytics also provides capabilities to build and modify other advanced analytical models such as gradient boosting, linear and logistic regression, and decision trees. Furthermore, as your company’s analytic maturity increases, additional products can easily be added on to provide even more model choices (forest, neural network, support vector machine, etc.) and capabilities. SAS’ platform and products support the whole analytical life cycle, from data preparation all the way through model deployment, model performance management, and decision intelligence.

Take the next step in learning more about SAS Visual Analytics on SAS Viya by signing up for a free trial!

How SAS Visual Analytics' automated prediction takes customer care to the next level was published on SAS Users.

July 23, 2020
 

Last month a SAS programmer asked how to fit a multivariate Gaussian mixture model in SAS. For univariate data, you can use the FMM procedure, which fits a large variety of finite mixture models. If your company is using SAS Viya, you can use the MBC or GMM procedures, which perform model-based clustering (PROC MBC) or cluster analysis by using the Gaussian mixture model (PROC GMM). The MBC procedure is part of SAS Visual Statistics; the GMM procedure is part of SAS Visual Data Mining and Machine Learning (VDMML).

Unfortunately, the programmer did not yet have access to SAS Viya. He asked whether you can fit a Gaussian mixture model in PROC IML. Yes, and there are several methods and models that you can fit. Most methods use the expectation-maximization (EM) algorithm. In my opinion, the Wikipedia entry for the EM algorithm (which includes a Gaussian mixture example) is rather dense. To keep this article as simple as possible, I chose to fit a Gaussian mixture model by using one particular model (full-matrix covariance) and by using a technique called "hard clustering." This article is inspired by a presentation and paper on PROC MBC by Dave Kessler at the 2019 SAS Global Forum. Dave also kindly provided some sample code for me to look at when I was learning about the EM algorithm.

The problem: Fitting a Gaussian mixture model

A Gaussian mixture model assumes that each cluster is multivariate normal but allows different clusters to have different within-cluster covariance structures. As in k-means clustering, it is assumed that you know the number of clusters, G. The clustering problem is an "unsupervised" machine learning problem, which means that the observations do not initially have labels that tell you which observation belongs to which cluster. The goal of the analysis is to assign each observation to a cluster ("hard clustering") or a probability of belonging to each cluster ("fuzzy clustering"). I will use the terms "cluster" and "group" interchangeably. In this article, I will only discuss the hard-clustering problem, which is conceptually easier to understand and implement.

The density of a Gaussian mixture with G components is \(\sum_{i=1}^G \tau_i f(\mathbf{x}; {\boldsymbol\mu}_i, {\boldsymbol\Sigma}_i)\), where f(x; μi, Σi) is the multivariate normal density of the ith component, which has mean vector μi and covariance matrix Σi. The values τi are the mixing probabilities, so Σ τi = 1. If x is a d-dimensional vector, you need to estimate τi, μi, and Σi for each of the G groups, for a total of G*(1 + d + d*(d+1)/2) – 1 parameter estimates. The d*(d+1)/2 expression is the number of free parameters in the symmetric covariance matrix, and the -1 term reflects the constraint Σ τi = 1. Fortunately, the assumption that the groups are multivariate normal enables you to compute the maximum likelihood estimates directly without doing any numerical optimization.

Fitting a Gaussian mixture model is a "chicken-and-egg" problem because it consists of two subproblems, each of which is easy to solve if you know the answer to the other problem:

  • If you know to which group each observation belongs, you can compute maximum likelihood estimates for the mean (center) and covariance of each group. You can also use the number of observations in each group to estimate the mixing probabilities for the finite mixture distribution.
  • If you know the center, covariance matrix, and mixing probabilities for each of the G groups, you can use the density function for each group (weighted by the mixing probability) to determine how likely each observation is to belong to a cluster. For "hard clustering," you assign the observation to the cluster that gives the highest likelihood.

The expectation-maximization (EM) algorithm is an iterative method that enables you to solve interconnected problems like this. The steps of the EM algorithm are given in the documentation for the MBC procedure, as follows:

  1. Use some method (such as k-means clustering) to assign each observation to a cluster. The assignment does not have to be precise because it will be refined. Some data scientists use random assignment as a quick-and-dirty way to initially assign points to clusters, but for hard clustering this can lead to less than optimal solutions.
  2. The M Step: Assume that the current assignment to clusters is correct. Compute the maximum likelihood estimates of the within-cluster count, mean, and covariance. From the counts, estimate the mixing probabilities. I have previously shown how to compute the within-group parameter estimates.
  3. The E Step: Assume the within-cluster statistics are correct. I have previously shown how to evaluate the likelihood that each observation belongs to each cluster. Use the likelihoods to update the assignment to clusters. For hard clustering, this means choosing the cluster for which the density (weighted by the mixing probabilities) is greatest.
  4. Evaluate the mixture log likelihood, which is an overall measure of the goodness of fit. If the log likelihood has barely changed from the previous iteration, assume that the EM algorithm has converged. Otherwise, go to the M step and repeat until convergence. The documentation for PROC MBC provides the log-likelihood function for both hard and fuzzy clustering.

This implementation of the EM algorithm performs the M-step before the E-step, but it is still the "EM" algorithm. (Why? Because there is no "ME" in statistics!)

Use k-means clustering to assign the initial group membership

The first step is to assign each observation to a cluster. Let's use the Fisher Iris data, which we know has three clusters. We will not use the Species variable in any way, but merely treat the data as four-dimensional unlabeled data.

The following steps are adapted from the Getting Started example in PROC FASTCLUS. The call to PROC STDIZE standardizes the data and the call to PROC FASTCLUS performs k-means clustering and saves the clusters to an output data set.

/* standardize and use k-means clustering (k=3) for initial guess */
proc stdize data=Sashelp.Iris out=StdIris method=std;
   var Sepal: Petal:;
run;
 
proc fastclus data=StdIris out=Clust maxclusters=3 maxiter=100 random=123;
   var Sepal: Petal:;
run;
 
data Iris;
merge Sashelp.Iris Clust(keep=Cluster);
/* for consistency with the Species order, remap the cluster numbers  */
if Cluster=1 then Cluster=2; else if Cluster=2 then Cluster=1;
run;
 
title "k-Means Clustering of Iris Data";
proc sgplot data=Iris;
   ellipse x=PetalWidth y=SepalWidth / group=Cluster;
   scatter x=PetalWidth y=SepalWidth / group=Cluster transparency=0.2
                       markerattrs=(symbol=CircleFilled size=10) jitter;
   xaxis grid; yaxis grid;
run;

The graph shows the cluster assignments from PROC FASTCLUS. They are similar but not identical to the actual groups of the Species variable. In the next section, these cluster assignments are used to initialize the EM algorithm.

The EM algorithm for hard clustering

You can write a SAS/IML program that implements the EM algorithm for hard clustering. The M step uses the MLEstMVN function (described in a previous article) to compute the maximum likelihood estimates within clusters. The E step uses the LogPdfMVN function (described in a previous article) to compute the log-PDF for each cluster (assuming MVN).

proc iml;
load module=(LogPdfMVN MLEstMVN);  /* you need to STORE these modules */
 
/* 1. Read data. Initialize 'Cluster' assignments from PROC FASTCLUS */
use Iris;
varNames = {'SepalLength' 'SepalWidth' 'PetalLength' 'PetalWidth'};
read all var varNames into X;
read all var {'Cluster'} into Group;  /* from PROC FASTCLUS */
close;
 
nobs = nrow(X); d = ncol(X); G = ncol(unique(Group));
prevCDLL = -constant('BIG');    /* set to large negative number */
converged = 0;                  /* iterate until converged=1 */
eps = 1e-5;                     /* convergence criterion */
iterHist = j(100, 3, .);        /* monitor EM iteration */
LL = J(nobs, G, 0);             /* store the LL for each group */
 
/* EM algorithm: Solve the M and E subproblems until convergence */
do nIter = 1 to 100 while(^converged);
   /* 2. M Step: Given groups, find MLE for n, mean, and cov within each group */
   L = MLEstMVN(X, Group);      /* function returns a list */
   ns = L$'n';       tau = ns / sum(ns);
   means = L$'mean'; covs = L$'cov';  
 
   /* 3. E Step: Given MLE estimates, compute the LL for membership 
         in each group. Use LL to update group membership. */
   do k = 1 to G;
      LL[ ,k] = log(tau[k]) + LogPDFMVN(X, means[k,], shape(covs[k,],d));
   end;
   Group = LL[ ,<:>];               /* predicted group has maximum LL */
 
   /* 4. The complete data LL is the sum of log(tau[k]*PDF[,k]).
         For "hard" clustering, Z matrix is 0/1 indicator matrix.
         DESIGN function: https://blogs.sas.com/content/iml/2016/03/02/dummy-variables-sasiml.html
   */
   Z = design(Group);               /* get dummy variables for Group */
   CDLL = sum(Z # LL);              /* sum of LL weighted by group membership */
   /* compute the relative change in CD LL. Has algorithm converged? */
   relDiff = abs( (prevCDLL-CDLL) / prevCDLL );
   converged = ( relDiff < eps );               
 
   /* monitor convergence; if no convergence, iterate */
   prevCDLL = CDLL;
   iterHist[nIter,] = nIter || CDLL || relDiff;
end;
 
/* remove unused rows and print EM iteration history */
iterHist = iterHist[ loc(iterHist[,2]^=.), ];  
print iterHist[c={'Iter' 'CD LL' 'relDiff'}];

The output from the iteration history shows that the EM algorithm converged in five iterations. At each step of the iteration, the log likelihood increased, which shows that the fit of the Gaussian mixture model improved at each iteration. This is one of the features of the EM algorithm: the likelihood always increases on successive steps.

The results of the EM algorithm for fitting a Gaussian mixture model

This problem uses G=3 clusters and d=4 dimensions, so there are 3*(1 + 4 + 4*5/2) – 1 = 44 parameter estimates! Most of those parameters are the elements of the three symmetric 4 x 4 covariance matrices. The following statements print the estimates of the mixing probabilities, the mean vector, and the covariance matrix for each cluster. To save space, each covariance matrix is flattened into a 16-element row vector.

/* print final parameter estimates for Gaussian mixture */
GroupNames = strip(char(1:G));
rows = repeat(T(1:d), 1, d);  cols = repeat(1:d, d, 1);
SNames = compress('S[' + char(rows) + ',' + char(cols) + ']');
print tau[r=GroupNames F=6.2],
      means[r=GroupNames c=varNames F=6.2],
      covs[r=GroupNames c=(rowvec(SNames)) F=6.2];

You can merge the final Group assignment with the data and create scatter plots that show the group assignment. One of the scatter plots is shown below:
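A minimal sketch of that merge, continuing the PROC IML session above, might look like this (the data set and variable names EMOut, EMGroup, and IrisEM are illustrative):

/* write the final cluster assignments to a data set */
create EMOut from Group[colname='EMGroup'];
append from Group;
close EMOut;
quit;

data IrisEM;
   merge Iris EMOut;
run;

title "EM Clustering of Iris Data";
proc sgplot data=IrisEM;
   scatter x=PetalWidth y=SepalWidth / group=EMGroup transparency=0.2
           markerattrs=(symbol=CircleFilled size=10) jitter;
   xaxis grid; yaxis grid;
run;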

The EM algorithm changed the group assignments for 10 observations. The most obvious change is the outlying marker in the lower-left corner, which changed from Group=2 to Group=1. The other markers that changed are near the overlapping region between Group=2 and Group=3.

Summary

This article shows an implementation of the EM algorithm for fitting a Gaussian mixture model. For simplicity, I implemented an algorithm that uses hard clustering (the complete data likelihood model). This algorithm might not perform well with a random initial assignment of clusters, so I used the results of k-means clustering (PROC FASTCLUS) to initialize the algorithm. Hopefully, some of the tricks and techniques in this implementation of the EM algorithm will be useful to SAS programmers who want to write more sophisticated EM programs.

The main purpose of this article is to demonstrate how to implement the EM algorithm in SAS/IML. As a reminder, if you want to fit a Gaussian mixture model in SAS, you can use PROC MBC or PROC GMM in SAS Viya. The MBC procedure is especially powerful: in Visual Statistics 8.5 you can choose from 17 different ways to model the covariance structure in the clusters. The MBC procedure also supports several ways to initialize the group assignments and supports both hard and fuzzy clustering. For more about the MBC procedure, see Kessler (2019), "Introducing the MBC Procedure for Model-Based Clustering."

I learned a lot by writing this program, so I hope it is helpful. You can download the complete SAS program that implements the EM algorithm for fitting a Gaussian mixture model. Leave a comment if you find it useful.

The post Fit a multivariate Gaussian mixture model by using the expectation-maximization (EM) algorithm appeared first on The DO Loop.

July 20, 2020
 

I think we can all agree that lifelong learning is the future, for all of us. We know that we need to learn and develop all the time, simply to stay abreast. The world is changing fast, and we must change with it. Investing in analytics talent is an investment in your future.

Create a corporate learning culture

Organizations may embrace the concept of data in their work cultures, yet they fail to commit to developing the analytics skills their teams need to harness the data. Organizations often don’t know what skills their teams have, which skills they’re lacking, or even where they need their employees to go. Corporate training brings new avenues for career development and professional growth, and that can have a huge impact on retention.

Companies that understand the importance of continued skill development for their employees will set out a vision and develop a corporate learning culture. Leaders are needed who can drive a cultural change towards continued learning in the organization. Hiring curious people and rewarding that curiosity develops the right mindset. Revealing the knowledge gap that employees have is a good trigger to get the learning started.

When an organization wants to close the talent gap for analytics and AI, SAS can offer support along the whole journey.

Company objectives and goals will determine how the skills development program is shaped. Once the roles and desired skill levels are determined, SAS will provide a Learning Needs Assessment to uncover the skills gaps.

Based on the findings from the assessment, a program for learning and development will be proposed for the different target audiences and to work towards company goals.

With readily available digital assets, the first steps to a learning platform can be realized quickly. By dynamically developing more assets and creating domain and company specific materials, continuous learning is encouraged and facilitated.

Power of online learning

Today most of us are forced to work from home due to COVID-19. We know the future of work will look different; we are getting used to digital channels for communicating and interacting. Though the digitization of learning has been going on for many years, learning content is now moving to the cloud, becoming accessible across multiple devices and teaching environments, and often being generated, shared, and continually updated.

Millennials feel most comfortable with this digitization, but the wider workforce will ultimately benefit from access to digital learning assets. Integrated cloud-based platforms enable more than just new computer programs or smartphone apps. Smart organizations are now expanding their use of cloud-based learning to run personalized online courses, small tailored live web sessions, instructional videos, e-coaching, communities, virtual classrooms, and simulation games. SAS is responding to this need by offering several free options for learning including e-Learning, online tutorials, lab time, SAS Academy for Data Science, and a SAS Learning Subscription that offers more than 100 e-Learning courses. There is something for everyone!

Finding analytics talent within your workforce can be challenging. It can be hard to assess the skills and identify the talent you need for specific solutions. With an unparalleled depth of understanding in the industry, SAS can help identify, cultivate, and grow analytics talent in your organization. In addition, we’ll help you to build a talent pipeline in partnership with local academic institutions, if needed.

Do you want to start learning essential skills today? Take a look at how SAS can help your organization get ahead through learning solutions customized to help you train your team. Or transform it.

Shaping the future of lifelong learning was published on SAS Users.

7月 152020
 

The multivariate normal distribution is used frequently in multivariate statistics and machine learning. In many applications, you need to evaluate the log-likelihood function in order to compare how well different models fit the data. The log-likelihood for a vector x is the natural logarithm of the multivariate normal (MVN) density function evaluated at x. A probability density function is usually abbreviated as PDF, so the log-density function is also called a log-PDF. This article discusses how to efficiently evaluate the log-likelihood function and the log-PDF. Examples are provided by using the SAS/IML matrix language.

The multivariate normal PDF

A previous article provides examples of using the LOGPDF function in SAS for univariate distributions. Multivariate distributions are more complicated and are usually written by using matrix-vector notation. The multivariate normal distribution in dimension d has two parameters: a d-dimensional mean vector μ and a d x d covariance matrix Σ. The MVN PDF evaluated at a d-dimensional vector x is
\(f(\mathbf{x})= \frac{1}{\sqrt { (2\pi)^d|\boldsymbol \Sigma| } } \exp\left(-\frac{1}{2} (\mathbf{x}-\boldsymbol\mu)^{\rm T} \boldsymbol\Sigma^{-1} ({\mathbf x}-\boldsymbol\mu)\right) \)
where |Σ| is the determinant of Σ. I have previously shown how to evaluate the MVN density in the SAS/IML language, and I noted that the argument to the EXP function involves the expression \({\rm MD}(\mathbf{x}; \boldsymbol\mu, \boldsymbol\Sigma)^2 = (\mathbf{x}-\boldsymbol\mu)^{\rm T} \boldsymbol\Sigma^{-1} (\mathbf{x}-\boldsymbol\mu)\), where MD is the Mahalanobis distance between the point x and the mean vector μ.

Evaluate the MVN log-likelihood function

When you take the natural logarithm of the MVN PDF, the EXP function goes away and the expression becomes the sum of three terms:
\(\log(f(\mathbf{x})) = -\frac{1}{2} [ d \log(2\pi) + \log(|\boldsymbol \Sigma|) + {\rm MD}(\mathbf{x}; \boldsymbol\mu, \boldsymbol\Sigma)^2 ]\)
The first term in the brackets is easy to evaluate, but the second and third terms appear more daunting. Fortunately, the SAS/IML language provides two functions that simplify the evaluation:

  • The LOGABSDET function computes the logarithm of the absolute value of the determinant of a matrix. For a full-rank covariance matrix the determinant is always positive, so the SAS/IML function LogAbsDet(C)[1] returns the log-determinant of a covariance matrix, C.
  • The MAHALANOBIS function in SAS/IML evaluates the Mahalanobis distance. The function is vectorized, which means that you can pass in a matrix that has d columns, and the MAHALANOBIS function will return the distance for each row of the matrix.
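
As a quick sanity check, the following statements (a minimal sketch with made-up numbers; the names C, x, and MD are arbitrary) compute the log-determinant of a 2 x 2 covariance matrix and the Mahalanobis distance from the origin for two points:

proc iml;
/* sketch: a made-up covariance matrix and points, not from the article */
C = {4 1,
     1 9};                      /* det(C) = 4*9 - 1*1 = 35 */
logdet = LogAbsDet(C)[1];       /* log(35), about 3.56 */
x = {1  2,
     3 -1};                     /* two points, one per row */
MD = Mahalanobis(x, {0 0}, C);  /* distance from the origin for each row */
print logdet, MD;
quit;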

Some researchers use -2*log(f(x)) instead of log(f(x)) as a measure of likelihood. You can see why: The -2 cancels with the -1/2 in the formula and makes the values positive instead of negative.

Log likelihood versus log-PDF

I use the terms log-likelihood function and log-PDF function interchangeably, but there is a subtle distinction. The log-PDF is a function of x when the parameters are specified (fixed). The log-likelihood is a function of the parameters when the data are specified. The SAS/IML function in the next section can be used for either purpose. (Note: Some references use the term "log likelihood" to refer only to the sum of the log-PDF scores evaluated at each observation in the sample.)

Example: Compare the log likelihood values for different parameter values

The log-likelihood function has many applications, but one is to determine whether one model fits the data better than another model. The log likelihood depends on the mean vector, μ, and the covariance matrix, Σ, which are the parameters for the MVN distribution.

Suppose you have some data that you think are approximately multivariate normal. You can use the log-likelihood function to evaluate whether the model MVN(μ1, Σ1) fits the data better than an alternative model MVN(μ2, Σ2). For example, the Fisher Iris data for the SepalLength and SepalWidth variables appear to be approximately bivariate normal and positively correlated, as shown in the following graph:

title "Iris Data and 95% Prediction Ellipse";
title2 "Assuming Multivariate Normality";
proc sgplot data=Sashelp.Iris noautolegend;
   where species="Setosa";
   scatter x=SepalLength y=SepalWidth / jitter;
   ellipse x=SepalLength y=SepalWidth;
run;

The following SAS/IML function defines a function (LogPdfMVN) that evaluates the log-PDF at every observation of a data matrix, given the MVN parameters (or estimates for the parameters). To test the function, the program creates a data matrix from the SepalLength and SepalWidth variables for the observations for which Species="Setosa". The program uses the MEAN and COV functions to compute the maximum likelihood estimates for the data, then calls the LogPdfMVN function to evaluate the log-PDF at each observation:

proc iml;
/* This function returns the log-PDF for a MVN(mu, Sigma) density at each row of X.
   The output is a vector with the same number of rows as X. */
start LogPdfMVN(X, mu, Sigma);
   d = ncol(X);
   log2pi = log( 2*constant('pi') );
   logdet = logabsdet(Sigma)[1];             /* sign of det(Sigma) is '+' */
   MDsq = mahalanobis(X, mu, Sigma)##2;      /* (x-mu)`*inv(Sigma)*(x-mu) */
   Y = -0.5 *( MDsq + d*log2pi + logdet );   /* log-PDF for each obs. Sum it for LL */
   return( Y );
finish;
 
/* read the iris data for the Setosa species */
use Sashelp.Iris where(species='Setosa');
read all var {'SepalLength' 'SepalWidth'} into X;
close;
 
n = nrow(X);           /* assume no missing values */
m = mean(X);           /* maximum likelihood estimate of mu */
S = (n-1)/n * cov(X);  /* maximum likelihood estimate of Sigma */
/* evaluate the log likelihood for each observation */
LL = LogPdfMVN(X, m, S);

Notice that you can find the maximum likelihood estimates (m and S) by using a direct computation. For MVN models, you do not need to run a numerical optimization, which is one reason why MVN models are so popular.
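
For the record, the MLEs have closed forms: the estimate of μ is the sample mean, and the estimate of Σ is the covariance matrix that uses 1/n (rather than 1/(n-1)) in the denominator:
\(\hat{\boldsymbol\mu} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i \quad\text{and}\quad \hat{\boldsymbol\Sigma} = \frac{1}{n}\sum_{i=1}^{n} (\mathbf{x}_i - \hat{\boldsymbol\mu})(\mathbf{x}_i - \hat{\boldsymbol\mu})^{\rm T}\)
That is why the program multiplies the result of the COV function by the factor (n-1)/n.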

The LogPdfMVN function returns a vector that has the same number of rows as the data matrix. The value of the i_th element is the log-PDF of the i_th observation, given the parameters. Because the parameters for the LogPdfMVN function are the maximum likelihood estimates, the total log likelihood (the sum of the LL vector) should be as large as possible for this data. In other words, if we choose different values for μ and Σ, the total log likelihood will be less. Let's see if that is true for this example. Let's change the mean vector and use a covariance matrix that incorrectly postulates that the SepalLength and SepalWidth variables are negatively correlated. The following statements compute the log likelihood for the alternative model:

/* What if we use "wrong" parameters? */
m2 = {45 30};
S2 = {12 -10,  -10 14};          /* this covariance matrix indicates negative correlation */
LL_Wrong = LogPdfMVN(X, m2, S2); /* LL for each obs of the alternative model */
 
/* The total log likelihood is sum(LL) over all obs */
TotalLL = sum(LL);
TotalLL_Wrong = sum(LL_Wrong);
print TotalLL TotalLL_Wrong;

As expected, the total log likelihood is larger for the first model than for the second model. The interpretation is that the first model fits the data better than the second.

Although the total log likelihood (the sum) is often used to choose the better model, the log-PDF of the individual observations are also important. The individual log-PDF values identify which observations are unlikely to come from a distribution with the given parameters.
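
For example, the following statements (a sketch that continues the IML session above; the names r and idx are arbitrary) list the five observations that are least likely under the second model:

/* sketch: list the 5 least likely observations under the misspecified model */
r = rank(LL_Wrong);     /* rank 1 = most negative log-PDF */
idx = loc(r <= 5);      /* row indices of the 5 least likely observations */
print (X[idx,])[colname={'SepalLength' 'SepalWidth'} label="Least Likely Obs"],
      (LL_Wrong[idx])[label="LL_Wrong"];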

Visualize the log-likelihood for each model

The easiest way to demonstrate the difference between the "good" and "bad" model parameters is to draw the bivariate scatter plot of the data and color each observation by the log-PDF at that position.
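
If you still have the X matrix and the LL vector in the PROC IML session, one way to build such a plot is sketched below (the data set name LLOut is arbitrary): write the coordinates and log-PDF values to a data set, then use the COLORRESPONSE= option in PROC SGPLOT. A SUBMIT/ENDSUBMIT block calls the procedure without ending the IML session:

/* sketch: write the data and log-PDF values to a data set, then plot */
Result = X || LL;
create LLOut from Result[colname={'SepalLength' 'SepalWidth' 'LL'}];
append from Result;
close LLOut;

submit;
   title "Observations Colored by Log-PDF (First Model)";
   proc sgplot data=LLOut;
      scatter x=SepalLength y=SepalWidth / colorresponse=LL
              markerattrs=(symbol=CircleFilled size=11);
   run;
endsubmit;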

The plot for the first model (which fits the data well) is shown below. The observations are colored by the log-PDF value (the LL vector) for each observation. Most observations are blue or blue-green because those colors indicate high values of the log-PDF.

The plot for the second model (which intentionally misspecifies the parameters) is shown below. The observations near (45, 30) are blue or blue-green because that is the location of the specified mean parameter. A prediction ellipse for the specified model has a semimajor axis that slopes from the upper left to the lower right. Therefore, the points in the upper right corner of the plot have a large Mahalanobis distance and a very negative log-PDF. These points are colored yellow, orange, or red. They are "outliers" in the sense that they are unlikely to be observed in a random sample from an MVN distribution that has the second set of parameters.

What is a "large" log-PDF value?

For this example, the log-PDF is negative for each observation, so "large" and "small" can be confusing terms. I want to emphasize two points:

  1. When I say a log-PDF value is "large" or "high," I mean "close to the maximum value of the log-PDF function." For example, -4.1 is a large log-PDF value for these data. Observations that are far from the mean vector have very negative log-PDF values; for example, -40 is a "very negative" value.
  2. The maximum value of the log-PDF occurs when an observation exactly equals the mean vector. Thus the log-PDF will never be larger than -0.5*( d*log(2π) + log(det(Σ)) ). For these data, the maximum value of the log-PDF is -4.01 when you use the maximum likelihood estimates as the MVN parameters, as the snippet below verifies.
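
Both facts are easy to verify. The following statements (a sketch that uses the X matrix and the MLE matrix S from the earlier program) compute the upper bound:

/* sketch: the upper bound of the log-PDF, attained when x equals the mean */
d = ncol(X);
maxLogPDF = -0.5*( d*log(2*constant('pi')) + LogAbsDet(S)[1] );
print maxLogPDF;   /* about -4.01 for these data */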

Summary

In summary, this article shows how to evaluate the log-PDF of the multivariate normal distribution. The log-PDF values indicate how likely each observation would be in a random sample, given parameters for an MVN model. If you sum the log-PDF values over all observations, you get a statistic (the total log likelihood) that summarizes how well a model fits the data. If you are comparing two models, the one with the larger log likelihood is the model that fits better.

The post How to evaluate the multivariate normal log likelihood appeared first on The DO Loop.