Sam Edgemon

March 13, 2017

My high school basketball coach started preparing us for the tournaments in the season’s first practice. He talked about the “long haul” of tournament basketball, and geared our strategies toward a successful run at the end of the season.

I thought about the “long haul” when considering my brackets for this year’s NCAA Tournament, and came to this conclusion: instead of seeking to predict who might win a particular game, I wanted to use analytics to identify which teams were most likely to win multiple games. The question that I sought to answer was simply, “Could I look at regular season data and recognize the characteristics inherent in teams that win multiple games in the NCAA Tournament?”

I prepared and extracted features from data representing the last 5 regular seasons. I took tournament data from the same period and counted the number of wins per team (per season). This number would be my target value (0, 1, 2, 3, 4, 5, or 6 wins). Only teams that participated in the tournaments were included in the analysis.

I used SAS Enterprise Miner’s High-Performance Random Forest node to build 10,000 trees (in less than 14 seconds), and I determined my “top 10 stats” by simply observing which variables the trees split on most often.
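The HP Forest node is a point-and-click wrapper; for readers who prefer code, roughly the same run can be sketched with PROC HPFOREST. The data set and variable names below are hypothetical stand-ins for the regular-season features discussed in this post:

```sas
/* Sketch only: season_stats and its variables are hypothetical names
   for the prepared regular-season features. */
proc hpforest data=season_stats maxtrees=10000;
   /* Target: number of tournament wins, 0 through 6 */
   target tourney_wins / level=nominal;
   /* Candidate regular-season features */
   input win_pct road_win_pct ato tor max_margin min_margin
         fg_ratio opp_3pt_pct ft_pct blocks_per_game / level=interval;
run;
```

The procedure’s variable importance output reports, among other measures, how many splitting rules use each variable, which is one way to see which factors are split on the most.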

Here are the results, my “top 10 statistics to consider” (remember that the statistics represented are from the regular season, not the tournament).

1 --- Winning percentage. Winners win, right? This becomes increasingly evident the further a team advances in the tournament.

  • Teams that win a single game have an average winning percentage of .729
  • Teams that win 6 games have an average winning percentage of .858
  • No team that has won a Final Four game over the last 5 years has a winning percentage less than .706
  • Teams that won 6 games have a minimum winning percentage of .765.

2 --- Winning games by wide margins. Teams that advance in the tournament have beaten teams by wide margins during the regular season – this means that in some game over the course of the year, the team cut loose and won big! From a former player’s perspective, it doesn’t matter “who” you beat by a wide margin; what matters is whether you have the drive to crush an opponent.

  • Teams that won 6 games have beaten some team by 49 points, differentiating themselves from even the 5-win teams by 9 points!

3 --- The ratio of assists to turnovers (ATO). Teams that take care of the ball and distribute it well record assists instead of turnovers. From my perspective, the ATO indicates whether or not a team dictates the action.

  • Over the last 5 years, no team that won 6 games had an ATO less than 1.19!
  • Teams that won at least 5 games had an average ATO of 1.04.
  • Teams that won fewer than 5 had average ATOs of less than 1.

4 --- Winning percentage on the road. We’ve already noted that overall winning percentage is important, but it’s also important to win on the road, since tournament games are rarely played on a team’s home floor!

  • Teams that don’t win any tournament games win 52% of their road games
  • Teams that win 1-2 games win 57.8%
  • Teams that win 3-5 win 63%
  • Teams that win 6 games win 78% of their road games, and average only 2.4 (road) losses per year
  • No team that has won at least 5 games has lost more than 5 on the road (in the last 5 years)!

5 --- The ratio of a team’s field goal percentage to the opposition’s field goal percentage. Winning games on the big stage requires both scoring and defense! A ratio above 1 indicates that you score the ball better than you allow your opposition to score.

  • Teams that win 2 or fewer games have a ratio of 1.12
  • Teams that win 3-5 games have a ratio of 1.18
  • Teams that win 6 games have a ratio of 1.23 – no team that has won 6 games had a ratio of less than 1.19!

6 --- The ratio of turnovers committed to turnovers created (TOR). I recall coaches telling me that a turnover committed by our team was essentially a 4-point play: 2 points we didn’t get, and 2 they did.

  • Teams that win the most tournament games have an average TOR of 0.89. This means they turn the ball over at a minimal rate when compared to the turnovers they create.
  • Over the past 5 years, teams that won 6 games have an average TOR .11 better than the rest of the pack, which can be interpreted this way: they commit only about 9 turnovers for every 10 they force.

7 --- Just as important as beating teams by wide margins are the close games! Close games build character, and provide preparation for the tournament.

  • Teams that win 6 games play more close games than any other group: their closest game of the season is decided by an average of just 1.6 points
  • Teams winning fewer games see their closest game decided by an average of 1.8 points.

8 --- Defending the 3. Teams that win more games in the tournament defend the 3 point shot only slightly better than the other teams, but they are twice as consistent in doing it! So, regardless of who’s coming to play, look for some sticky D beyond the arc!

  • On average, teams allow a 3-point field goal percentage of .328
  • Teams winning the most tournament games defend only slightly better at .324; however, the standard deviation is the more interesting statistic: their consistency in defending the 3-point shot is almost twice that of the other teams!

9 --- Teams that win are good at the stripe! Free throws close games. Make them and get away with the win!

  • Teams that win the most games shoot an average of .730 from the line, while the rest of the pack sits at .700

10 --- Teams that win the most games block shots! They play defense, period.

  • Teams that win the most tournament games average over 5 blocks per game.
  • Teams winning 6 games have blocked at least 3.4 shots per game (over the last 5 years)

Next steps? Take what’s been learned and apply it to this year’s tournament teams, and then as Larry Bird used to do, ask the question, “who’s playing for second?”

In addition to SAS Enterprise Miner, I used SAS Enterprise Guide to prepare the data for analysis, and I used JMP’s Graph Builder to create the graphics. The data was provided by Kaggle.

The top 10 statistics to consider when filling out your NCAA brackets was published on SAS Users.

June 29, 2016

I recently read an article in which the winner of a Kaggle Competition was not shy about sharing his technique for winning not one, but several of the analytical competitions.

“I always use Gradient Boosting,” he said. And then added, “but the key is Feature Engineering.”

A couple of days later, a friend who read the same article called and asked, “What is this Feature Engineering that he’s talking about?”

It was a timely question, as I was in the process of developing a risk model for a client, and specifically, I was working through the stage of Feature Engineering.

The act, or I should say the “art,” of Feature Engineering takes time – in fact, it takes much more time than building the actual model. Because many clients desire a quick return on analytics, I was giving some thought to how I would explain the value of the effort being put into engineering features from their data.

But first, I needed to respond to my friend’s question, and I began by describing the environments in which we typically work: “The companies we work with are not devoid of analytical talent; in fact, some institutions have dozens of highly qualified statisticians with many years of experience in their respective fields. Still, they may not be getting the results from their models they are expecting.”

I was quickly asked, “So, how do you get better results? Neural Nets? Random Forests?”

“It’s not the method that sets my results apart from theirs,” I answered. “In fact, the success of my modeling projects is generally solidified before I begin considering ‘which’ method.” I paused and pondered how to continue, “Before I consider methods, I simply derive more information from their data than they do, and that is Feature Engineering.”

Feature Engineering

Feature Engineering is a vital component in the process of attaining extraordinary Machine Learning results, but it doesn’t merit as many intriguing conversations, papers, books or talks at conferences. Regardless, I know that the success of my projects, typically attributed to the application of Machine Learning techniques, is actually due to effective Feature Engineering, which not only improves the model but also creates information that the clients can more readily understand.

How did I justify the time spent “Feature Engineering” to the client mentioned above? First, I developed a model with PROC LOGISTIC using only the primary data given to me and called it “Model 1.”

Model 1 was then compared to “Model 2,” a model built using the primary data plus the newly developed features. Examples of those new features that proved significant were those constructed as follows:

  • Ratios of various events relative to time.
  • Continuous variables dimensionally reduced into categories.
  • Simple counts of events.

And then I proactively promoted the findings!

I often use the OUTROC option to output the necessary components such that I can create a chart of ROC curves using PROC SGPLOT. An example of this process can be viewed in Graph 1 where the value of Feature Engineering is clearly evident in the chart detailing the curves from both models: the c-statistic improved from 0.70 to 0.90.
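As a sketch of that process, assuming a binary target named default and hypothetical predictor and data set names, the two-model comparison might look like this (the OUTROC= data sets contain the _1MSPEC_ and _SENSIT_ columns that PROC SGPLOT plots):

```sas
/* Model 1: primary data only (variable names are hypothetical) */
proc logistic data=accounts;
   model default(event='1') = x1 x2 x3 / outroc=roc1;
run;

/* Model 2: primary data plus engineered features */
proc logistic data=accounts_fe;
   model default(event='1') = x1 x2 x3 ratio1 cat1 count1 / outroc=roc2;
run;

/* Stack the two OUTROC data sets and label each curve */
data roc_both;
   set roc1(in=a) roc2;
   length model $8;
   model = ifc(a, 'Model 1', 'Model 2');
run;

/* Overlay the ROC curves with a reference diagonal */
proc sgplot data=roc_both;
   series x=_1mspec_ y=_sensit_ / group=model;
   lineparm x=0 y=0 slope=1 / lineattrs=(pattern=dash);
   xaxis label='1 - Specificity';
   yaxis label='Sensitivity';
run;
```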

Graph 1: ROC curves for Model 1 vs. Model 2.

However, there is still a story to be shared, as the improvement is not only represented as a stronger c-statistic, but rather in the actual classification of default. For example:

  • Model 1 correctly predicts default in 20% of accounts categorized as “most likely to default,” but it misses on 80%!
  • Model 2 correctly predicts default in 86% of those accounts it categorizes as most likely to default, but misses on only 14%.

How did this happen?

While we could drill down into a discussion of significant factors to seek an in-depth explanation of the results, a more general observation is simply this: Model 2 accomplishes its mission by correctly reclassifying a substantial portion of the population as “low risk” and leaves those with default concerns to be assessed with greater consideration.

Essentially, the default model is vastly improved by better discerning those customers who we need not worry about, leaving an uncluttered pool of customers that require attention (see Graph 2).


Instead of assessing the default potential for 900,000 accounts, the model instead deals with 100,000 “risky” accounts. It essentially improves its chances of success by creating an easier-to-hit target, and the result (see Graph 3) is that when it categorizes an account as “high risk,” it means it!


Thanks to Feature Engineering, the business now possesses very actionable information.

How Do I Engineer the Features Necessary to Attain Such Results?

Working with the SAS DATA step and select SAS procedures, I work through a process in which I attempt to create useful features with the following techniques and methods:

  • Calculate statistics like the minimums, maximums, averages, medians and ranges thinking that extremes (or the lack) might help define interesting behaviors.
  • Count occurrences of events considering that I might highlight statistically interesting habitual behaviors.
  • Develop ratios seeking to add predictive value to already analytically powerful variables, as well as to variables that might have previously lacked statistical power.
  • Develop quintiles across variables of interest seeking to create expressive segments of the population while also dealing with extreme values.
  • Apply dimensionality reduction techniques, ranks, clustering etc. expecting that grouping those with similar behaviors will be statistically beneficial.
  • Consider the element of time as an important interaction with any feature that has been developed.
  • Use regression to identify trends in continuous variables thinking that moves up or down (whether fast or slow) will be interesting.
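A few of the techniques above can be sketched in a short DATA step plus PROC RANK. The account-level data set and variable names here are hypothetical, chosen only to illustrate the ideas:

```sas
/* Hypothetical account-level data: ratios, counts, and time-relative features */
data features;
   set accounts;
   /* Ratio of an event count relative to time */
   late_per_month = n_late_payments / months_on_book;
   /* Ratio of two already-powerful variables */
   util_ratio = balance / credit_limit;
   /* Simple count of events across 12 monthly flags */
   n_overdrafts = sum(of od_m1-od_m12);
run;

/* Quintiles: GROUPS=5 bins a continuous variable into five segments,
   creating expressive groups while also taming extreme values. */
proc rank data=features out=features_q groups=5;
   var util_ratio;
   ranks util_quintile;
run;
```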

As with the guy who won the Kaggle competitions, I’ve now revealed some of my secret weapons regarding the building of successful models.

What’s your secret weapon?



tags: machine learning

Great machine learning starts with resourceful feature engineering was published on SAS Users.

June 8, 2016

In a previous post, I wrote how pedigree might be used to help predict outcomes of horse races. In particular, I discussed a metric called the Dosage Index (DI), which appeared to be a leading indicator of success (at least historically). In this post, I want to introduce the Center […]

The post What does a winning thoroughbred horse look like? appeared first on JMP Blog.

May 18, 2016

You may have heard that a horse named Nyquist won the Kentucky Derby recently. Nyquist was the favorite going into the race, though he was not without his doubters. Many expert race prognosticators questioned his stamina, and I was curious about the basis for those comments. My due diligence revealed […]

The post Does the pedigree of a thoroughbred racehorse still matter? appeared first on JMP Blog.

May 3, 2016

The Los Angeles Times recently produced a graphic illustrating the 30,699 shots that the recently retired Kobe Bryant took over the span of his 20-year career. It became such a topic of conversation that the Times later offered the graphic for $69.95 (plus shipping). The paper also published a follow-up […]

The post Kobe Bryant took 30,699 shots, and I've plotted them all using JMP appeared first on JMP Blog.

April 28, 2016

In my previous post, I showed how we explored the eras of baseball using a simple scatterplot that helped us generate questions and analytical direction. The next phase was figuring out how I might use analytics to aid the “subject matter knowledge” that had been applied to the data. Could […]

The post Using analytics to explore the eras of baseball appeared first on JMP Blog.

April 21, 2016

In consulting with companies about building models with their data, I always talk to them about how their data may differentiate itself over time. For instance, are there seasons in which you might expect a rise in flu cases per day, or is there an economic environment in which you […]

The post Using data visualization to explore the eras of baseball appeared first on JMP Blog.