A Regression Based Approach for Prediction of Major League Baseball Game Outcomes

Authors

Hughes, Gerry

Issue Date

2022

Degree

MSc in Data Analytics

Publisher

Dublin Business School

Rights

Items in eSource are protected by copyright. Previously published items are made available in accordance with the copyright policy of the publisher/copyright holder.

Abstract

Data analytics and statistics have seen increasing use in professional sports in recent years, with many professional organisations expanding their use of data analytics to improve team performance. Organisations adjacent to professional sports, such as gambling companies and sportsbooks, have traditionally been among the largest users of statistical analysis for sports outcome prediction. With the wealth of data and techniques now available, the question naturally becomes how accurately the outcome of a sporting event can be predicted. This project aims to build, and quantify the performance of, a regression-based machine learning model for predicting the outcome of a baseball game, first by determining the statistical value of individual players, and then by determining the relative statistical value of the teams they play on. Baseball has traditionally been the most statistically driven professional sport, with over 100 years of complex recorded statistics. However, despite the sheer amount of data available, the simple classification question of which of two teams will win a game between them remains impossible to answer with certainty. Existing models, from simple Naïve classifiers to complex artificial neural networks, have largely been limited to classification accuracies below 60%. The many unquantifiable factors in any sport likely make it impossible ever to solve these problems definitively. The aim of this project is to build an adaptive regression-based model that can be used and further tuned to improve predictive capability as both a classifier and a probabilistic predictor. The current version combines a multilinear regression-based model and a logistic regression-based model to produce a predictive model with a classification accuracy of 56.6%, an AUC of 0.549, and a Brier score of 0.244.
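As an illustration of the three evaluation measures quoted in the abstract, the following is a minimal sketch of how classification accuracy, AUC, and Brier score can be computed from predicted win probabilities. It is not the author's implementation: the function names, the 0.5 decision threshold, and the six-game sample are hypothetical and chosen only for demonstration. For context, a predictor that always outputs a probability of 0.5 has a Brier score of exactly 0.25, so lower values indicate better-calibrated probabilities.

```python
# Illustrative sketch only: the data below is made up, not taken from the thesis.

def accuracy(y_true, p_win, threshold=0.5):
    """Fraction of games where the thresholded prediction matches the result."""
    correct = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_win))
    return correct / len(y_true)

def brier_score(y_true, p_win):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_win)) / len(y_true)

def auc(y_true, p_win):
    """Probability that a randomly chosen win is ranked above a randomly chosen
    loss, counting ties as half (pairwise definition of ROC AUC)."""
    wins = [p for y, p in zip(y_true, p_win) if y == 1]
    losses = [p for y, p in zip(y_true, p_win) if y == 0]
    score = sum(
        1.0 if pw > pl else 0.5 if pw == pl else 0.0
        for pw in wins
        for pl in losses
    )
    return score / (len(wins) * len(losses))

# Hypothetical outcomes (1 = home team won) and predicted home-win probabilities.
y_true = [1, 0, 1, 1, 0, 0]
p_win = [0.62, 0.41, 0.48, 0.70, 0.55, 0.45]

print("accuracy:", accuracy(y_true, p_win))
print("AUC:", auc(y_true, p_win))
print("Brier score:", brier_score(y_true, p_win))
print("Brier score of constant 0.5:", brier_score(y_true, [0.5] * len(y_true)))
```

Accuracy depends only on which side of the threshold each prediction falls, whereas AUC and the Brier score use the probabilities themselves, which is why a model can be assessed both as a classifier and as a probabilistic predictor.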