March Madness 2017 and Predictive Analytics

From March Apathy to March Madness

– Jared Endicott, Launch BI Analyst and Business Athlete 

When it comes to college basketball, I’m about as ignorant as they come. I have a vague awareness of March Madness and the Final Four, mostly through tangential knowledge like President Obama filling out a tournament bracket each year and Warren Buffett offering a billion-dollar prize for a perfect bracket prediction in 2014. I have never filled out a bracket before. I have never even watched an NCAA game before. I couldn’t tell you about the teams or the players. That all changes this year.

I have been challenged by the idea of using predictive analytics to fill out the 2017 March Madness brackets. To train for this effort I have watched about a dozen lectures on the topic…well…I have watched the same lecture about a dozen times. The talk in question is from a Great Courses lecture series called Big Data: How Data Analytics Is Transforming the World and it is appropriately titled “Bracketology-The Math of March Madness.” In the lecture, Professor Tim Chartier explains some basics about the tournament and then proceeds to lay out a particular data-centered approach to forecasting which teams will win which playoff games.

Professor Chartier demonstrates the Massey Method, an algorithm created by mathematics professor Kenneth Massey for ranking NCAA college football teams for the Bowl Championship Series. The method can be applied to college basketball as well, and it uses game data from the regular season to rank all of the teams. In any given matchup, the algorithm picks the higher-ranked team to win. The algorithm takes into account the point spread (the margin of victory) in each regular season game: the winning team gets a positive score differential and the losing team gets a negative one, so each team is weighted by its net score differential over the entire season. Point spreads can be capped so that blowouts, and teams that have easier schedules, don’t overweight the rankings. The method also accommodates various approaches for weighting the games themselves, such as weighting away and neutral territory games higher than home games and/or weighting more recent games higher than games earlier in the season. All in all, this algorithm looks like it could be a scoring leader.
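To make the capping and weighting ideas concrete, here is a minimal sketch in Python (an illustration only, not the R code used for this analysis). The function names, the cap, and the two bonus parameters are hypothetical, and the location codes 'H', 'A', and 'N' (home, away, neutral, from the winner's point of view) are an assumed encoding.

```python
def capped_margin(winner_points, loser_points, cap=None):
    """Margin of victory, optionally capped so blowouts count for less."""
    margin = winner_points - loser_points
    return margin if cap is None else min(margin, cap)


def game_weight(location, day, last_day, away_bonus=1.0, recency_bonus=1.0):
    """Hypothetical game weight (assumed scheme, for illustration only).

    location: 'H', 'A' or 'N' from the winner's point of view.
    day / last_day: how far into the season the game was played.
    """
    weight = 1.0
    # Wins away from home can be made to count for more than home wins.
    if location in ("A", "N"):
        weight *= away_bonus
    # Scale linearly from 1.0 on day 0 up to recency_bonus on the last day.
    weight *= 1.0 + (recency_bonus - 1.0) * (day / last_day)
    return weight
```

With the default parameters every game has weight 1.0 and no cap is applied, which corresponds to the plain, unweighted version of the method. Changing the cap and the bonuses is one way to produce different ranking models from the same season of games.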

The data for this analysis was made available by Kaggle in the form of CSV files. Kaggle is a website that hosts data science competitions, including an annual March Machine Learning Mania competition for the NCAA March Madness Tournament. As cited by Kaggle, the data is ultimately provided by Kenneth Massey himself.

Calculate NCAA Rankings

I used R, a preferred language for data scientists, to write a function that calculates NCAA Division I Men’s Basketball team rankings using the Massey Method as explained by Professor Chartier. The function follows this general recipe for calculating the rankings using a linear system:

  1. Calculate each team’s net point differential from the regular season games and save these in a vector (V).
  2. Build a team-by-team matrix (M) whose diagonal entries hold the number of games each team has played and whose off-diagonal entries hold -1 times the number of games played between the two given teams.
  3. Set up the equation M * Rankings = V, where the teams in V are aligned with the teams in the rows of M.
  4. Calculate the inverse of M. As built, each row of M sums to zero, so M has no inverse; the standard Massey Method first replaces one row of M with all ones and the matching entry of V with zero.
  5. Multiply the inverse of M by V to obtain the Rankings.
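
The recipe above can be sketched in a few lines of Python. This is an illustrative, unweighted version and not the R function used for this analysis; the names massey_ratings and solve_linear_system are hypothetical. It makes two assumptions worth flagging: it replaces the last row of M with ones and the last entry of V with zero (the standard adjustment that makes the system solvable), and it solves the system by Gaussian elimination instead of forming the inverse explicitly, which gives the same rankings. It also assumes every team is connected to every other through some chain of games.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column into place.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def massey_ratings(games, teams):
    """games: list of (winner, loser, margin) tuples; returns {team: rating}."""
    n = len(teams)
    index = {team: i for i, team in enumerate(teams)}
    M = [[0.0] * n for _ in range(n)]
    V = [0.0] * n
    for winner, loser, margin in games:
        w, l = index[winner], index[loser]
        M[w][w] += 1.0  # diagonal: games played by each team
        M[l][l] += 1.0
        M[w][l] -= 1.0  # off-diagonal: -1 per game between the two teams
        M[l][w] -= 1.0
        V[w] += margin  # net point differential for the season
        V[l] -= margin
    # Assumed adjustment: M is singular as built, so force ratings to sum to 0.
    M[-1] = [1.0] * n
    V[-1] = 0.0
    return dict(zip(teams, solve_linear_system(M, V)))
```

A team's rank is then its position when the ratings are sorted from highest to lowest. Weighted variants would add a game weight in place of the 1.0 increments and multiply each margin by that weight.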

For a more in-depth understanding of the mathematical ideas behind this algorithm, I suggest acquiring Professor Chartier’s lectures on Big Data from The Great Courses.

Simulations for 2017

For 2017, I will submit my 25 allowable brackets to the ESPN Tournament Challenge. The first 12 of these brackets will be simulations using the ranking model with different parameters. The next 12 brackets will be coin flip simulations, where each matchup is given a 50/50 random chance. The last remaining bracket will be chosen by my wife Mary based on her own preferences. The first 12 are my model tests, while the next 12 are essentially controls. Presumably, the model simulations will perform better than the coin flip simulations.
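
The difference between the two kinds of simulation can be sketched as follows, again in Python and only as an illustration of the idea, not the script used to fill out these brackets. The function names are hypothetical, the field is assumed to be listed in bracket order with a power-of-two number of teams, and the team names and ratings in the usage lines are made-up placeholders.

```python
import random


def simulate_bracket(field, pick_winner):
    """Play a single-elimination bracket and return the survivors of each round."""
    rounds = []
    current = list(field)
    while len(current) > 1:
        # Adjacent teams in the list face each other in this round.
        current = [pick_winner(current[i], current[i + 1])
                   for i in range(0, len(current), 2)]
        rounds.append(current)
    return rounds


def ranking_picker(ratings):
    """Model bracket: always advance the team with the higher rating."""
    return lambda a, b: a if ratings[a] >= ratings[b] else b


def coin_flip_picker(rng):
    """Control bracket: advance either team with a 50/50 chance."""
    return lambda a, b: a if rng.random() < 0.5 else b


# Usage with placeholder teams and made-up ratings (not real data).
field = ["T1", "T2", "T3", "T4"]
made_up_ratings = {"T1": 3.0, "T2": 1.0, "T3": 0.0, "T4": 2.0}
model_rounds = simulate_bracket(field, ranking_picker(made_up_ratings))
control_rounds = simulate_bracket(field, coin_flip_picker(random.Random(2017)))
```

Because the ranking picker is deterministic, different model brackets come from changing the parameters that produce the ratings, while different control brackets come from different random draws.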

My favorite bracket out of the 12 ranking model simulations, the one I would choose if I had to pick just one, is ranking model simulation 5. This one has Gonzaga winning against Kentucky in the final, after they beat Villanova and Louisville, respectively, in the Final Four. It will be really interesting to see how well the methods of predictive analytics stack up against random chance, as well as against other, more knowledgeable bracketologists.

Below is the bracket for ranking model simulation 5. To see my other bracket simulations, along with the R scripts I used to perform the ranking algorithm and tournament simulation, please check out my full article, March Madness 2017 and Predictive Analytics.