Larry Orimoloye

1月 062018
 

Deep learning is not synonymous with artificial intelligence (AI) or even machine learning. Artificial Intelligence is a broad field which aims to "automate cognitive processes." Machine learning is a subfield of AI that aims to automatically develop programs (called models) purely from exposure to training data.

Deep Learning and AI

Deep learning is one of many branches of machine learning, where the models are long chains of geometric functions, applied one after the other to form stacks of layers. It is one among many approaches to machine learning but not on equal footing with the others.

What makes deep learning exceptional

Why is deep learning unequaled among machine learning techniques? Well, deep learning has achieved tremendous success in a wide range of tasks that have historically been extremely difficult for computers, especially in the areas of machine perception. This includes extracting useful information from images, videos, sound, and others.

Given sufficient training data (in particular, training data appropriately labelled by humans), it’s possible to extract from perceptual data almost anything that a human could extract. Large corporations and businesses are deriving value from deep learning by enabling human-level speech recognition, smart assistants, human-level image classification, vastly improved machine translation, and more. Google Now, Amazon Alexa, ad targeting used by Google, Baidu and Bing are all powered by deep learning. Think of superhuman Go playing and near-human-level autonomous driving.

In the summer of 2016, an experimental short movie, Sunspring, was directed using a script written by a long short-term memory (LSTM) algorithm a type of deep learning algorithm.

How to build deep learning models

Given all this success recorded using deep learning, it's important to stress that building deep learning models is more of an art than science. To build a deep learning or any machine learning model for that matter one need to consider the following steps:

  • Define the problem: What data does the organisation have? What are we trying to predict? Do we need to collect more data? How can we manually label the data? Make sure to work with domain expert because you can’t interpret what you don’t know!
  • What metrics can we use to reliably measure the success of our goals.
  • Prepare validation process that will be used to evaluate the model.
  • Data exploration and pre-processing: This is where most time will be spent such as normalization, manipulation, joining of multiple data sources and so on.
  • Develop an initial model that does better than a baseline model. This gives some indication of whether machine learning is ideal for the problem.
  • Refine model architecture by tuning hyperparameters and adding regularization. Make changes based on validation data.
  • Avoid overfitting.
  • Once happy with the model, deploy it into production environment. This may be difficult to achieve for many organisations giving that a deep learning score code is large. This is where SAS can help. SAS has developed a scoring mechanism called "astore" which allows deep learning method to be pushed into production with just a click.

Is the deep learning hype justified?

We're still in the middle of deep learning revolution trying to understand the limitations of this algorithm. Due to its unprecedented successes, there has been a lot of hype in the field of deep learning and AI. It’s important for managers, professionals, researchers and industrial decision makers to be able to distill this hype from reality created by the media.

Despite the progress on machine perception, we are still far from human level AI. Our models can only perform local generalization, adapting to new situations that must be similar to past data, whereas human cognition is capable of extreme generalization, quickly adapting to radically novel situations and planning for long-term future situations. To make this concrete, imagine you’ve developed a deep network controlling a human body, and you wanted it to learn to safely navigate a city without getting hit by cars, the net would have to die many thousands of times in various situations until it could infer that cars are dangerous, and develop appropriate avoidance behaviors. Dropped into a new city, the net would have to relearn most of what it knows. On the other hand, humans are able to learn safe behaviors without having to die even once—again, thanks to our power of abstract modeling of hypothetical situations.

Lastly, remember deep learning is a long chain of geometrical functions. To learn its parameters via gradient descent one key technical requirements is that it must be differentiable and continuous which is a significant constraint.

Looking beyond the AI and deep learning hype was published on SAS Users.

11月 042017
 

Internet of Things for dementiaDementia describes different brain disorders that trigger a loss of brain function. These conditions are all usually progressive and eventually severe. Alzheimer's disease is the most common type of dementia, affecting 62 percent of those diagnosed. Other types of dementia include; vascular dementia affecting 17 percent of those diagnosed, mixed dementia affecting 10 percent of those diagnosed.

Dementia Statistics

There are 850,000 people with dementia in the UK, with numbers set to rise to over 1 million by 2025. This will soar to 2 million by 2051. 225,000 will develop dementia this year, that’s one every three minutes. 1 in 6 people over the age of 80 have dementia. 70 percent of people in care homes have dementia or severe memory problems. There are over 40,000 people under 65 with dementia in the UK. More than 25,000 people from black, Asian and minority ethnic groups in the UK are affected.

Cost of treating dementia

Two-thirds of the cost of dementia is paid by people with dementia and their families. Unpaid careers supporting someone with dementia save the economy £11 billion a year. Dementia is one of the main causes of disability later in life, ahead of cancer, cardiovascular disease and stroke (Statistic can be obtained from here - Alzheimer’s society webpage). To tackle dementia requires a lot of resources and support from the UK government which is battling to find funding to support NHS (National Health Service). During 2010–11, the NHS will need to contribute £2.3bn ($3.8bn) of the £5bn of public sector efficiency savings, where the highest savings are expected primarily from PCTs (Primary care trusts). In anticipation for tough times ahead, it is in the interest of PCTs to obtain evidence-based knowledge of the use of their services (e.g. accident & emergency, inpatients, outpatients, etc.) based on regions and patient groups in order to reduce the inequalities in health outcomes, improve matching of supply and demand, and most importantly reduce costs generated by its various services.

Currently in the UK, general practice (GP) doctors deliver primary care services by providing treatment and drug prescriptions and where necessary patients are referred to specialists, such as for outpatient care, which is provided by local hospitals (or specialised clinics). However, general practitioners (GPs) are limited in terms of size, resources, and the availability of the complete spectrum of care within the local community.

Solution in sight for dementia patients

There is the need to prevent or avoid delay for costly long-term care of dementia patients in nursing homes. Using wearables, monitors, sensors and other devices, NHS based in Surrey is collaborating with research centres to generate ‘Internet of Things’ data to monitor the health of dementia patients at the comfort of staying at home. The information from these devices will help people take more control over their own health and wellbeing, with the insights and alerts enabling health and social care staff to deliver more responsive and effective services. (More project details can be obtained here.)

Particle filtering

One method that could be used to analyse the IoT generated data is particle filtering methods. IoT dataset naturally fails within Bayesian framework. This method is very robust and account for the combination of historical and real-time data in order to make better decision.

In Bayesian statistics, we often have a prior knowledge or information of the phenomenon/application been modelled. This allows us to formulate a Bayesian model, that is, prior distribution, for the unknown quantities and likelihood functions relating these quantities to the observations (Doucet et al., 2008). As new evidence becomes available, we are often interested in updating our current knowledge or posterior distribution. Using State Space Model (SSM), we are able to apply Bayesian methods to time series dataset. The strength of these methods however lies in the ability to properly define our SSM parameters appropriately; otherwise our model will perform poorly. Many SSMs suffers from non-linearity and non-Gaussian assumptions making the maximum likelihood difficult to obtain when using standard methods. The classical inference methods for nonlinear dynamic systems are the extended Kalman filter (EKF) which is based on linearization of a state and trajectories (e.g. Johnson et al., 2008 ). The EKF have been successfully applied to many non-linear filtering problems. However, the EKF is known to fail if the system exhibits substantial nonlinearity and/or if the state and the measurement noise are significantly non-Gaussian.

An alternative method which gives a good approximation even when the posterior distribution is non-Gaussian is a simulation based method called Monte Carlo. This method is based upon drawing observations from the distribution of the variable of interest and simply calculating the empirical estimate of the expectation.

To apply these methods to time series data where observation arrives in sequential order, performing inference on-line becomes imperative hence the general term sequential Monte Carlo (SMC). A SMC method encompasses range of algorithms which are used for approximate filtering and smoothing. Among this method is particle filtering. In most literature, it has become a general tradition to present particle filtering as SMC, however, it is very important to note this distinction. Particle filtering is simply a simulation based algorithm used to approximate complicated posterior distributions. It combines sequential importance sampling (SIS) with an addition resampling step. SMC methods are very flexible, easy to implement, parallelizable and applicable in very general settings. The advent of cheap and formidable computational power in conjunction with some recent developments in applied statistics, engineering and probability, have stimulated many advancements in this field (Cappe et al. 2007). Computational simplicity in the form of not having to store all the data is also an additional advantage of SMC over MCMC (Markov Chain Monte Carlo).

References

[1] National Health Service England, http://www.nhs.uk/NHSEngland/aboutnhs/Pages/Authoritiesandtrusts.aspx (accessed 18 August 2009).

[2] Pincus SM. Approximate entropy as a measure of system complexity. Proc Natl Acad Sci USA 88: 2297–2301, 1991

[3] Pincus SM and Goldberger AL. Physiological time-series analysis: what does regularity quantify?  Am J Physiol Heart Circ Physiol 266: H1643–H1656, 1994

[4] Cappe, O.,S. Godsill, E. Moulines(2007).An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE. Volume 95, No 5, pp 899-924.

[5] Doucet, A., A. M. Johansen(2008). A Tutorial on Particle Filtering and Smoothing: Fifteen years Later.

[6] Johansen, A. M. (2009).SMCTC: Sequential Monte Carlo in C++. Journal of Statistical Software. Volume 30, issue 6.

[7] Rasmussen and Z.Ghahramani (2003). Bayesian Monte Carlo. In S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15.

[8] Osborne A. M.,Duvenaud D., GarnettR.,Rasmussen, C.E., Roberts, C.E.,Ghahramani, Z. Active Learning of Model Evidence Using Bayesian Quadrature.

Scaling Internet of Things for dementia using Particle filters was published on SAS Users.

11月 042017
 

Internet of Things for dementiaDementia describes different brain disorders that trigger a loss of brain function. These conditions are all usually progressive and eventually severe. Alzheimer's disease is the most common type of dementia, affecting 62 percent of those diagnosed. Other types of dementia include; vascular dementia affecting 17 percent of those diagnosed, mixed dementia affecting 10 percent of those diagnosed.

Dementia Statistics

There are 850,000 people with dementia in the UK, with numbers set to rise to over 1 million by 2025. This will soar to 2 million by 2051. 225,000 will develop dementia this year, that’s one every three minutes. 1 in 6 people over the age of 80 have dementia. 70 percent of people in care homes have dementia or severe memory problems. There are over 40,000 people under 65 with dementia in the UK. More than 25,000 people from black, Asian and minority ethnic groups in the UK are affected.

Cost of treating dementia

Two-thirds of the cost of dementia is paid by people with dementia and their families. Unpaid careers supporting someone with dementia save the economy £11 billion a year. Dementia is one of the main causes of disability later in life, ahead of cancer, cardiovascular disease and stroke (Statistic can be obtained from here - Alzheimer’s society webpage). To tackle dementia requires a lot of resources and support from the UK government which is battling to find funding to support NHS (National Health Service). During 2010–11, the NHS will need to contribute £2.3bn ($3.8bn) of the £5bn of public sector efficiency savings, where the highest savings are expected primarily from PCTs (Primary care trusts). In anticipation for tough times ahead, it is in the interest of PCTs to obtain evidence-based knowledge of the use of their services (e.g. accident & emergency, inpatients, outpatients, etc.) based on regions and patient groups in order to reduce the inequalities in health outcomes, improve matching of supply and demand, and most importantly reduce costs generated by its various services.

Currently in the UK, general practice (GP) doctors deliver primary care services by providing treatment and drug prescriptions and where necessary patients are referred to specialists, such as for outpatient care, which is provided by local hospitals (or specialised clinics). However, general practitioners (GPs) are limited in terms of size, resources, and the availability of the complete spectrum of care within the local community.

Solution in sight for dementia patients

There is the need to prevent or avoid delay for costly long-term care of dementia patients in nursing homes. Using wearables, monitors, sensors and other devices, NHS based in Surrey is collaborating with research centres to generate ‘Internet of Things’ data to monitor the health of dementia patients at the comfort of staying at home. The information from these devices will help people take more control over their own health and wellbeing, with the insights and alerts enabling health and social care staff to deliver more responsive and effective services. (More project details can be obtained here.)

Particle filtering

One method that could be used to analyse the IoT generated data is particle filtering methods. IoT dataset naturally fails within Bayesian framework. This method is very robust and account for the combination of historical and real-time data in order to make better decision.

In Bayesian statistics, we often have a prior knowledge or information of the phenomenon/application been modelled. This allows us to formulate a Bayesian model, that is, prior distribution, for the unknown quantities and likelihood functions relating these quantities to the observations (Doucet et al., 2008). As new evidence becomes available, we are often interested in updating our current knowledge or posterior distribution. Using State Space Model (SSM), we are able to apply Bayesian methods to time series dataset. The strength of these methods however lies in the ability to properly define our SSM parameters appropriately; otherwise our model will perform poorly. Many SSMs suffers from non-linearity and non-Gaussian assumptions making the maximum likelihood difficult to obtain when using standard methods. The classical inference methods for nonlinear dynamic systems are the extended Kalman filter (EKF) which is based on linearization of a state and trajectories (e.g. Johnson et al., 2008 ). The EKF have been successfully applied to many non-linear filtering problems. However, the EKF is known to fail if the system exhibits substantial nonlinearity and/or if the state and the measurement noise are significantly non-Gaussian.

An alternative method which gives a good approximation even when the posterior distribution is non-Gaussian is a simulation based method called Monte Carlo. This method is based upon drawing observations from the distribution of the variable of interest and simply calculating the empirical estimate of the expectation.

To apply these methods to time series data where observation arrives in sequential order, performing inference on-line becomes imperative hence the general term sequential Monte Carlo (SMC). A SMC method encompasses range of algorithms which are used for approximate filtering and smoothing. Among this method is particle filtering. In most literature, it has become a general tradition to present particle filtering as SMC, however, it is very important to note this distinction. Particle filtering is simply a simulation based algorithm used to approximate complicated posterior distributions. It combines sequential importance sampling (SIS) with an addition resampling step. SMC methods are very flexible, easy to implement, parallelizable and applicable in very general settings. The advent of cheap and formidable computational power in conjunction with some recent developments in applied statistics, engineering and probability, have stimulated many advancements in this field (Cappe et al. 2007). Computational simplicity in the form of not having to store all the data is also an additional advantage of SMC over MCMC (Markov Chain Monte Carlo).

References

[1] National Health Service England, http://www.nhs.uk/NHSEngland/aboutnhs/Pages/Authoritiesandtrusts.aspx (accessed 18 August 2009).

[2] Pincus SM. Approximate entropy as a measure of system complexity. Proc Natl Acad Sci USA 88: 2297–2301, 1991

[3] Pincus SM and Goldberger AL. Physiological time-series analysis: what does regularity quantify?  Am J Physiol Heart Circ Physiol 266: H1643–H1656, 1994

[4] Cappe, O.,S. Godsill, E. Moulines(2007).An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE. Volume 95, No 5, pp 899-924.

[5] Doucet, A., A. M. Johansen(2008). A Tutorial on Particle Filtering and Smoothing: Fifteen years Later.

[6] Johansen, A. M. (2009).SMCTC: Sequential Monte Carlo in C++. Journal of Statistical Software. Volume 30, issue 6.

[7] Rasmussen and Z.Ghahramani (2003). Bayesian Monte Carlo. In S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15.

[8] Osborne A. M.,Duvenaud D., GarnettR.,Rasmussen, C.E., Roberts, C.E.,Ghahramani, Z. Active Learning of Model Evidence Using Bayesian Quadrature.

Scaling Internet of Things for dementia using Particle filters was published on SAS Users.

7月 182017
 

In the digital world where billions of customers are making trillions of visits on a multi-channel marketing environment, big data has drawn researchers’ attention all over the world. Customers leave behind a huge trail of data volumes in digital channels. It is becoming an extremely difficult task finding the right data, given exploding data volumes, that can potentially help make the right decision.

This can be a big issue for brands. Traditional databases have not been efficient enough to capture the sheer amount of information or complexity of datasets we accumulate on the web, on social media and other places.

A leading consulting firm, for example, boasts of one of its client having 35 million customers and 9million unique visitors daily on their website, leaving a huge amount of shopper information data every second. Segmenting this large amount of data with the right tools to help target marketing activities is not readily available. To make matters more complicated, this data can be structured and unstructured, making traditional methods of analysing data not suitable.

Tackling market segmentation

Market segmentation is a process by which market researchers identify key attributes about customers and potential customers which can be used to create distinct target market groups. Without a market segmentation base, Advertising and Sales can lose large amount of money targeting the wrong set of customers.

Some known methods of segmenting consumers market include geographical segmentation, demographical segmentation, behavioural segmentation, multi-variable account segmentation and others. Common approaches using statistical methods to segment various markets include:

  • Clustering algorithms such as K-Means clustering
  • Statistical mixture models such as Latent Class Analysis
  • Ensemble approaches such as Random Forests

Most of these methods assume the number of clusters to be known, which in reality is never the case. There are several approaches to estimate the number of clusters. However, strong evidence about the quality of this clusters does not exist.

To add to the above issues, clusters could be domain specific, which means they are built to solve certain domain problems such as:

  • Delimitation of species of plants or animals in biology.
  • Medical classification of diseases.
  • Discovery and segmentation of settlements and periods in archaeology.
  • Image segmentation and object recognition.
  • Social stratification.
  • Market segmentation.
  • Efficient organization of data bases for search queries.

There are also quite general tasks for which clustering is applied in many subject areas:

  • Exploratory data analysis looking for “interesting patterns” without prescribing any specific interpretation, potentially creating new research questions and hypotheses.
  • Information reduction and structuring of sets of entities from any subject area for simplification, effective communication, or effective access/action such as complexity reduction for further data analysis, or classification systems.
  • Investigating the correspondence of a clustering in specific data with other groupings or characteristics, either hypothesized or derived from other data.

Depending on the application, it may differ a lot what is meant by a “cluster,” and cluster definition and methodology have to be adapted to the specific aim of clustering in the application of interest.

Van Mechelen et al. (1993) set out an objective characteristics of what a “true clusters” should possess which includes the following:

  • Within-cluster dissimilarities should be small.
  • Between-cluster dissimilarities should be large.
  • Clusters should be fitted well by certain homogeneous probability models such as the Gaussian or a uniform distribution on a convex set, or by linear, time series or spatial process models.
  • Members of a cluster should be well represented by its centroid.
  • The dissimilarity matrix of the data should be well represented by the clustering (i.e., by the ultrametric induced by a dendrogram, or by defining a binary metric “in same cluster/in different clusters”).
  • Clusters should be stable.
  • Clusters should correspond to connected areas in data space with high density.
  • The areas in data space corresponding to clusters should have certain characteristics (such as being convex or linear).
  • It should be possible to characterize the clusters using a small number of variables.
  • Clusters should correspond well to an externally given partition or values of one or more variables that were not used for computing the clustering.
  • Features should be approximately independent within clusters.
  • The number of clusters should be low.

Reference

  1. Van Mechelen, J. Hampton, R.S. Michalski, P. Theuns (1993). Categories and Concepts—Theoretical Views and Inductive Data Analysis, Academic Press, London

What are the characteristics of “true clusters?” was published on SAS Users.

3月 112017
 

Ensemble models have been used extensively in credit scoring applications and other areas because they are considered to be more stable and, more importantly, predict better than single classifiers (see Lessmann et al., 2015). They are also known to reduce model bias and variance (Myoung - Jong et al., 2006; Tsai C-F et. al., 2011). The objective of this article is to compare the predictive accuracy of four distinct datasets using two ensemble classifiers (Gradient boosting(GB)/Random Forest(RF)) and two single classifiers (Logistic regression(LR)/Neural Network(NN)) to determine if, in fact, ensemble models are always better. My analysis did not look into optimizing any of these algorithms or feature engineering, which are the building blocks of arriving at a good predictive model. I also decided to base my analysis on these four algorithms because they are the most widely used methods.

What is the difference between a single and an ensemble classifier?

Single classifier

Individual classifiers pursue different objectives to develop a (single) classification model. Statistical methods either estimate (+|) directly (e.g., logistic regression), or estimate class-conditional probabilities (|), which they then convert into posterior probabilities using Bayes rule (e.g., discriminant analysis). Semi-parametric methods, such as NN or SVM, operate in a similar manner, but support different functional forms and require the modeller to select one specification a priori. The parameters of the resulting model are estimated using nonlinear optimization. Tree-based methods recursively partition a data set so as to separate good and bad loans through a sequence of tests (e.g., is loan amount > threshold). This produces a set of rules that facilitate assessing new loan applications. The specific covariates and threshold values to branch a node follow from minimizing indicators of node impurity such as the Gini coefficient or information gain (Baesens, et al., 2003).

Ensemble classifier

Ensemble classifiers pool the predictions of multiple base models. Much empirical and theoretical evidence has shown that model combination increases predictive accuracy (Finlay, 2011; Paleologo, et al., 2010). Ensemble learners create the base models in an independent or dependent manner. For example, the bagging algorithm derives independent base models from bootstrap samples of the original data (Breiman, 1996). Boosting algorithms, on the other hand, grow an ensemble in a dependent fashion. They iteratively add base models that are trained to avoid the errors of the current ensemble (Freund & Schapire, 1996). Several extensions of bagging and boosting have been proposed in the literature (Breiman, 2001; Friedman, 2002; Rodriguez, et al., 2006). The common denominator of homogeneous ensembles is that they develop the base models using the same classification algorithm (Lessmann et al., 2015).

ensemble modifers

Figure 1: Workflow of single v. ensemble classifiers: derived from the work of Utami, et al., 2014

Experiment set-up

Datasets

Before modelling, I partitioned the dataset into 70% training and 30% validation dataset.

Table 1: Summary of dataset used for model comparisons

I used SAS Enterprise Miner as a modelling tool.

Figure 2: Model flow using Enterprise Miner

Results

Table 2: Results showing misclassification rates of all dataset

Conclusion

Using misclassification rate as model performance, RF was the best model using Cardata, Organics_Data and HMEQ followed closely by NN. NN was the best model using Time_series_data and performed better than GB ensemble model using Organics_Data and Cardata.

My findings partly supports the hypothesis that ensemble models naturally do better in comparison to single classifiers, but not in all cases. NN, which is a single classifier, can be very powerful unlike most classifiers (single or ensemble) which are kernel machines and data-driven. NN can generalize from unseen data and act as universal functional approximators (Zhang, et al., 1998).

According to Kaggle CEO and Founder, Anthony Goldbloom:

“In the history of Kaggle competitions, there are only two Machine Learning approaches that win competitions: Handcrafted & Neural Networks”.

What are your thoughts?

Are ensemble classifiers always better than single classifiers? was published on SAS Users.

9月 282016
 

open_source_models_using_sasWith my first open source software (OSS) experience over a decade ago, I was ecstatic. It was amazing to learn how easy it was to download the latest version on my personal computer, with no initial license fee. I was quickly able to analyse datasets using various statistical methods.

Organisations might feel similar excitement when they first employ people with predominantly open source programming skills. . However, it becomes tricky to organize an enterprise-wide approach based solely on open source software. . However, it becomes tricky to organize an enterprise-wide approach based solely on open source software. Decision makers within many organisations are now coming to realize the value of investing in both OSS and vendor provided, proprietary software. Very often, open source has been utilized widely to prototype models, whilst proprietary software, such as SAS, provides a stable platform to deploy models in real time or for batch processing, monitor  changes and update - directly in any database or on a Hadoop platform.

Industries such as pharma and finance have realised the advantages of complementing open source software usage with enterprise solutions such as SAS.

A classic example is when pharmaceutical companies conduct clinical trials, which must follow international good clinical practice (GCP) guidelines. Some pharma organisations use SAS for operational analytics, taking advantage of standardized macros and automated statistical reporting, whilst R is used for the  planning phase (i.e. simulations), for the peer-validation of the results (i.e. double programming) and for certain specific analyses.

In finance, transparency is required by ever demanding regulators, intensified after the recent financial crisis. Changing regulations, security and compliance are mitigating factors to using open source technology exclusively. Basel’s metrics such as PD, LGD and EADs computation must be properly performed. A very well-known bank in the Nordics, for example, uses open source technology to build all type of models including ensemble models, but relies on SAS’ ability to co-exist and extend open source on its platform to deploy and operationalise open source models.

Open source software and SAS working together – An example

The appetite of deriving actionable insight from data is very crucial. It is often believed that when data is thoroughly tortured, the required insight will become obvious to drive business growth. SAS and open source technology is used by various organisations to achieve maximum business opportunities and ROI on all analytics investment made.

Using the flexibility of prototyping predictive model in R and the power and stable platform of SAS to handle massive dataset, parallelize analytic workload processing, a well-known financial institution is combining both to deliver instant results from analytics and take quick actions.

How does this work?

SAS embraces and extends open source in different ways, following the complete analytics lifecycle of Data, Discovery and Deployment.

open-source-models-using-sas

An ensemble model, built in R is used within SAS for objective comparison within SAS Enterprise Miner (Enterprise Miner is a drag and drop, workflow modelling application which is easy to use without the need to code) – including an R model within the ‘open source integration node.’

open-source-models-using-sas1

Once this model has been compared and the best model identified from automatically generated fit statistics, the model can be registered into the metadata repository making it available for usage on all SAS platform.

We used SAS Model Manager to monitor Probability of Default(PD) and Loss Given Default(LGD) model. All models are also visible to everyone within the organization depending on system rights and privileges and can be used to score and retrain new dataset when necessary. Alerts can also be set to monitor model degradation and automated message sent for real time intervention.

open-source-models-using-sas2

Once champion model was set and published, it was used in Real Time Decision Manager(RTDM) flow to score new customers coming in for loan. RTDM is a web application which allows instant assessment of new applications without the need to score the entire database.

As a result of this flexibility the bank was able to manage their workload and modernize their platform in order to make better hedging decisions and cost saving investments. Complex algorithms can now be integrated into SAS to make better predictions and manage exploding data volumes.

open-source-models-using-sas3

tags: open source, SAS Enterprise Miner

Operationalising Open Source Models Using SAS was published on SAS Users.