In a recent presentation, Jill Dyche, VP of SAS Best Practices, gave two great quotes: "Map strategy to data" and "strategy drives analytics drives data." In other words, don't wait for your data to be perfect before you invest in analytics. Don't get me wrong -- I fully understand and [...]
In just a few short months the European General Data Protection Regulation becomes enforceable. This regulation enshrines in law the rights of EU citizens to have their personal data treated in accordance with their wishes. The regulation applies to any organisation which is processing EU citizens’ data, and the UK [...]
GDPR – sounding the death knell for self-learning algorithms? was published on SAS Voices by Dave Smith
In my previous blog we identified quality as the critical success factor that will deliver UK manufacturers through the potentially choppy fallout from Brexit and other global political and economic upheaval.
Quality in every thought, every action, every outcome
You might think differentiating on the quality of your products is [...]
When we send spacecraft from Earth to Mars, do the Martians consider them to be UFOs? I might not be able to answer that question definitively ... but I do have some really cool graphs showing the data for all those missions to Mars! You might remember a previous blog [...]
My high school basketball coach started preparing us for the tournaments in the season’s first practice. He talked about the “long haul” of tournament basketball, and geared our strategies toward a successful run at the end of the season.
I thought about the “long haul” when considering my brackets for this year’s NCAA Tournament, and came to this conclusion: instead of seeking to predict who might win a particular game, I wanted to use analytics to identify which teams were most likely to win multiple games. The question that I sought to answer was simply, “Could I look at regular season data and recognize the characteristics inherent in teams that win multiple games in the NCAA Tournament?”
I prepared and extracted features from data representing the last 5 regular seasons. I took tournament data from the same period and counted numbers of wins per team (per season). This number would be my target value (0, 1, 2, 3, 4, 5, or 6 wins). Only teams that participated in the tournaments made the analysis.
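A minimal sketch of how such a target could be built, assuming the tournament results are available as one record per game. The data layout, the team names, and the helper `tournament_win_targets` are hypothetical and only illustrate the counting step, not the actual data preparation used here.

```python
from collections import Counter

# Hypothetical tournament results: one (season, winning_team) record per game.
tournament_games = [
    (2016, "Team A"), (2016, "Team A"), (2016, "Team B"),
    (2017, "Team A"), (2017, "Team C"), (2017, "Team C"),
]

# Hypothetical (season, team) pairs for every team that made the tournament.
participants = [
    (2016, "Team A"), (2016, "Team B"), (2016, "Team D"),
    (2017, "Team A"), (2017, "Team C"),
]

def tournament_win_targets(games, entrants):
    """Return {(season, team): wins}, with 0 for entrants that never won."""
    wins = Counter(games)
    # Only tournament participants are kept, mirroring the filter described above.
    return {key: wins.get(key, 0) for key in entrants}

targets = tournament_win_targets(tournament_games, participants)
```

Each (season, team) pair then carries a target between 0 and 6 that can be joined to that team's regular season features.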
I used SAS Enterprise Miner’s High Performance Random Forest Node to build 10,000 trees (in less than 14 seconds), and I determined my “top 10 stats” by simply observing which factors were split on the most.
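The idea of ranking variables by how often they are split on can be sketched in a few lines of Python. This is not the High Performance Random Forest Node: it is a simplified stand-in that assumes a binary target, one-split trees (stumps), bootstrap samples, and a random feature subset per tree. The data and every name below are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_stump(rows, labels, features):
    """Return (score, feature, threshold) with the lowest weighted Gini impurity."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def split_counts(rows, labels, n_trees=200, seed=0):
    """Count how often each feature is chosen as the split variable."""
    rng = random.Random(seed)
    names = list(rows[0])
    counts = Counter()
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        feats = rng.sample(names, k=2)                  # random feature subset
        stump = best_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        if stump is not None:
            counts[stump[1]] += 1
    return counts

# Hypothetical data: win_pct separates the classes, the other two columns are noise.
rows = [{"win_pct": 0.50 + 0.03 * i, "noise_a": i % 3, "noise_b": (i * 5) % 7}
        for i in range(12)]
labels = [int(i >= 6) for i in range(12)]  # 1 = hypothetical "deep run" team
counts = split_counts(rows, labels)
```

Sorting `counts` from most to least frequent gives a ranking analogous to the "top 10 stats" list, under the simplifications stated above.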
Here are the results, my “top 10 statistics to consider.” Remember that the statistics shown are from the regular season, not the tournament.
1 --- Winning Percentage. Winners win, right? It is evident this is true the further a team moves into the tournament.
- Teams that win a single game have an average winning percentage of .729
- Teams that win 6 games have an average winning percentage of .858
- No team that has won a Final Four game over the last 5 years has a winning percentage less than .706
- Teams that won 6 games have a minimum winning percentage of .765.
2 --- Winning games by wide margins. Teams that advance in the tournament have beaten teams by wide margins during the regular season – this means that in some game over the course of the year, a team let go and won big! From a former player’s perspective, it doesn’t matter “who” you beat by a wide margin, but rather do you have the drive to crush the opponent?
- Teams that won 6 games have beaten some team by 49 points, differentiating themselves from even the 5-win teams by 9 points!
3 --- The ratio of assists to turnovers (ATO). Teams that take care of and distribute the ball tend to be making assists instead of turnovers. From my perspective, the ATO indicates whether or not a team dictates the action.
- Over the last 5 years, no team that won 6 games had an ATO less than 1.19!
- Teams that have won at least 5 had an average ATO of 1.04.
- Teams that won fewer than 5 had average ATOs of less than 1.
4 --- Winning percentage on the road. We’ve already noted that overall winning percentage is important, but it’s also important to win on the road since the tournament games are rarely played on a team’s home floor!
- Teams that don’t win any tournament games win 52% of their road games
- Teams that win 1-2 games win 57.8%
- Teams that win 3-5 win 63%
- Teams that win 6 win 78% of their road games, and average only 2.4 (road) losses per year
- No team that has won at least 5 games has lost more than 5 on the road (in the last 5 years)!
5 --- The ratio of a team’s field goal percentage to the opposition’s field goal percentage. Winning games on the big stage requires both scoring and defense! A ratio above 1 indicates that you score the ball better than you allow your opposition to score.
- Teams that win 2 or fewer games have a ratio of 1.12
- Teams that win 3-5 games have a ratio of 1.18
- Teams that win 6 games have a ratio of 1.23 – no team that has won 6 games had a ratio of less than 1.19!
6 --- The ratio of turnovers to the turnovers created (TOR). I recall coaches telling me that a turnover committed by our team was essentially a 4-point play: 2 that we didn’t get, and 2 they did.
- Teams that win the most tournament games have an average TOR of 0.89. This means they turn the ball over at a minimal rate when compared to the turnovers they create.
- Over the past 5 years, teams that won 6 games have an average TOR .11 better than the rest of the pack, which can be interpreted this way: at a TOR of about 0.89, they force the opposition into turnovers roughly 12% more often than they commit turnovers themselves (1 / 0.89 ≈ 1.12).
7 --- Close games are just as important as beating teams by wide margins! Close games build character and provide preparation for the tournament.
- Teams that win 6 games play more close games than any other group. The average minimum differential for this group is 1.6 points.
- Teams winning fewer games average a differential of 1.8 points.
8 --- Defending the 3. Teams that win more games in the tournament defend the 3 point shot only slightly better than the other teams, but they are twice as consistent in doing it! So, regardless of who’s coming to play, look for some sticky D beyond the arc!
- On average, teams allow a 3-point field goal percentage of .328.
- Teams winning the most tournament games defend only slightly better at .324; however, the standard deviation is the more interesting statistic, indicating that their consistency in defending the 3-point shot is almost twice as good as the other teams’!
9 --- Teams that win are good at the stripe! Free throws close games. Make them and get away with the win!
- Teams that win the most games shoot for an average of .730 while the rest of the pack sits at .700
10 --- Teams that win the most games block shots! They play defense, period.
- Teams that win the most tournament games average over 5 blocks per game.
- Teams winning 6 games have blocked at least 3.4 shots per game (over the last 5 years)
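The three ratio statistics in the list above (ATO, field goal ratio, and TOR) are simple quotients of season figures. A small sketch, assuming the season totals are held in a dictionary; the field names and the numbers are hypothetical and chosen only to show the arithmetic.

```python
def ratio_features(season):
    """Compute ATO, field goal ratio, and TOR from one team's season figures."""
    return {
        # Assists per turnover committed: above 1 means more assists than turnovers.
        "ato": season["assists"] / season["turnovers"],
        # Own field goal percentage relative to the opposition's.
        "fg_ratio": season["fg_pct"] / season["opp_fg_pct"],
        # Turnovers committed per turnover forced: below 1 is better.
        "tor": season["turnovers"] / season["opp_turnovers"],
    }

# Hypothetical season figures, not taken from the analysis.
example = {"assists": 480, "turnovers": 400, "opp_turnovers": 450,
           "fg_pct": 0.48, "opp_fg_pct": 0.40}
features = ratio_features(example)
```

With these made-up numbers the team has an ATO of 1.2, a field goal ratio of 1.2, and a TOR of about 0.89.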
Next steps? Take what’s been learned and apply it to this year’s tournament teams, and then as Larry Bird used to do, ask the question, “who’s playing for second?”
Ensemble models have been used extensively in credit scoring applications and other areas because they are considered to be more stable and, more importantly, predict better than single classifiers (see Lessmann et al., 2015). They are also known to reduce model bias and variance (Myoung-Jong et al., 2006; Tsai C-F et al., 2011). The objective of this article is to compare the predictive accuracy of two ensemble classifiers (Gradient Boosting (GB) and Random Forest (RF)) and two single classifiers (Logistic Regression (LR) and Neural Network (NN)) on four distinct datasets to determine if, in fact, ensemble models are always better. My analysis did not look into optimizing any of these algorithms or into feature engineering, which are the building blocks of arriving at a good predictive model. I also decided to base my analysis on these four algorithms because they are the most widely used methods.
What is the difference between a single and an ensemble classifier?
Individual classifiers pursue different objectives to develop a (single) classification model. Statistical methods either estimate p(+|x) directly (e.g., logistic regression), or estimate class-conditional probabilities p(x|y), which they then convert into posterior probabilities using Bayes rule (e.g., discriminant analysis). Semi-parametric methods, such as NN or SVM, operate in a similar manner, but support different functional forms and require the modeller to select one specification a priori. The parameters of the resulting model are estimated using nonlinear optimization. Tree-based methods recursively partition a data set so as to separate good and bad loans through a sequence of tests (e.g., is loan amount > threshold). This produces a set of rules that facilitate assessing new loan applications. The specific covariates and threshold values to branch a node follow from minimizing indicators of node impurity such as the Gini coefficient or information gain (Baesens, et al., 2003).
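For reference, the Bayes rule conversion mentioned above can be written out for the two-class case. The notation is an assumption made for illustration: + and − denote the two classes (for example, bad and good loans), x the covariates of an application, and p(+) and p(−) the prior class probabilities.

```latex
p(+ \mid x) = \frac{p(x \mid +)\, p(+)}{p(x \mid +)\, p(+) + p(x \mid -)\, p(-)}
```

Methods such as discriminant analysis estimate the class-conditional terms on the right-hand side, whereas logistic regression models the left-hand side directly.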
Ensemble classifiers pool the predictions of multiple base models. Much empirical and theoretical evidence has shown that model combination increases predictive accuracy (Finlay, 2011; Paleologo, et al., 2010). Ensemble learners create the base models in an independent or dependent manner. For example, the bagging algorithm derives independent base models from bootstrap samples of the original data (Breiman, 1996). Boosting algorithms, on the other hand, grow an ensemble in a dependent fashion. They iteratively add base models that are trained to avoid the errors of the current ensemble (Freund & Schapire, 1996). Several extensions of bagging and boosting have been proposed in the literature (Breiman, 2001; Friedman, 2002; Rodriguez, et al., 2006). The common denominator of homogeneous ensembles is that they develop the base models using the same classification algorithm (Lessmann et al., 2015).
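To make the independent case concrete, here is a minimal bagging sketch in plain Python. It assumes a single numeric input and a deliberately simple base model (a threshold halfway between the two class means). The data and all names are hypothetical, and boosting, where each new model depends on the current ensemble's errors, is not shown.

```python
import random
from collections import Counter

def fit_base(xs, ys):
    """Base model: threshold halfway between the two class means (1-D input)."""
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    if not zeros or not ones:  # degenerate bootstrap sample with a single class
        constant = 1 if ones else 0
        return lambda x: constant
    mean0, mean1 = sum(zeros) / len(zeros), sum(ones) / len(ones)
    cut = (mean0 + mean1) / 2
    ones_high = mean1 >= mean0
    # Predict class 1 on whichever side of the cut the class-1 mean lies.
    return lambda x: int((x > cut) == ones_high)

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train independent base models on bootstrap samples of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]  # sample with replacement
        models.append(fit_base([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Pool the base model predictions by majority vote."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical one-dimensional data with two well-separated classes.
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
models = bagging_fit(xs, ys)
```

Calling `bagging_predict(models, 1.5)` or `bagging_predict(models, 8.5)` returns the pooled vote for a new observation; an odd number of models avoids tied votes.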
Before modelling, I partitioned each dataset into a 70% training partition and a 30% validation partition.
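A 70/30 partition of this kind can be sketched as a simple random split. This is only an illustration in Python, not the partitioning used in SAS Enterprise Miner, which may also stratify on the target. The function name and the seed are assumptions.

```python
import random

def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and split them into training and validation lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical dataset of 100 row identifiers.
train, valid = partition(range(100))
```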
I used SAS Enterprise Miner as a modelling tool.
Using misclassification rate as the measure of model performance, RF was the best model on Cardata, Organics_Data and HMEQ, followed closely by NN. NN was the best model on Time_series_data and performed better than the GB ensemble model on Organics_Data and Cardata.
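For readers unfamiliar with the measure, misclassification rate is the share of validation observations whose predicted class differs from the actual class. A short sketch with made-up labels:

```python
def misclassification_rate(actual, predicted):
    """Fraction of observations where the predicted class is wrong."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Hypothetical labels: two of the five predictions are wrong.
rate = misclassification_rate([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

Lower is better, so the "best model" on a dataset is the one with the smallest rate on the validation partition.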
My findings partly support the hypothesis that ensemble models naturally do better than single classifiers, but not in all cases. NN, which is a single classifier, can be very powerful, unlike most classifiers (single or ensemble), which are kernel machines and data-driven. NN can generalize to unseen data and act as universal function approximators (Zhang, et al., 1998).
According to Kaggle CEO and Founder, Anthony Goldbloom:
“In the history of Kaggle competitions, there are only two Machine Learning approaches that win competitions: Handcrafted & Neural Networks”.
What are your thoughts?
Are ensemble classifiers always better than single classifiers? was published on SAS Users.
Unlike some other UK Government departments, the Home Office has done well out of the recent spending review. Overall police spending has been protected – following the debacle of the earlier calculation errors – to protect against emerging crime threats and to train more firearms officers. Counter-terrorism has received a [...]
To get a high-performing analytics team producing insights that matter, you need great people, powerful software and a culture of experimentation and innovation. Three simple ingredients, but getting there is far from easy. In this post, I’d like to get you thinking about how to organize for success by building [...]
The Obama administration made great strides in improving the government’s use of information technology over the past eight years, and now it's up to the Trump administration to expand upon it. Let’s look at five possible Trump administration initiatives that can take government’s use of information technology to the next [...]
There aren’t many things that keep me awake at night but let me share a recent example with you. I’ve been grappling with how to help a local SAS team respond to a customer’s request for a “generic enterprise analytics architecture.” As background, this customer organization had recently embarked on [...]
How to apply design thinking to your analytics architecture was published on SAS Voices by Paul Gittins