COVID-19 countermeasures, Major League Baseball, and the home field advantage: Simulating the 2020 season using logit

Background: In the wake of COVID-19, almost all major league sports have been either cancelled or postponed. The sports industry suffered a major blow with the uncertainty of sporting events being held in the near future. Various scenarios of how and when sports might recommence have been discussed. This paper examines various scenarios of how Major League Baseball team performance is going to be impacted by the presence of fans, or the lack thereof, in the context of physical distancing and other COVID-19 countermeasures.
Methods: The paper simulates, using a neural network and a logit regression model, the win-loss probabilities for various scenarios under consideration and also estimates the home effect for each team using data for the 2017-2019 seasons.
Results: The model demonstrates that individual team home effect is symmetric between home and away, and teams will not necessarily win or lose any additional games in neutral stadiums, as teams with a high home field effect will lose more neutral games that would have been at home but will win more neutral games that would have been away. However, the result of individual games will be different since home effect is asymmetric between teams. Our simulation demonstrates that these individual game differences may lead to a slight difference in Play-Off Berths between a full season, a half season, or a full season without fans.
Conclusions: Without fans, any advantage (or disadvantage) from home field advantage is removed. Our models and simulation demonstrate that this will reduce the variance. This stabilizes the outcome based upon true team talent, which we estimate will cause a larger divide between the best and worst teams. This estimation helps decision makers understand how individual team performance will be impacted as they prepare for the 2020 season under the new circumstances.


Introduction
The 2019-2020 pandemic caused by the novel coronavirus has brought unprecedented countermeasures to every sector of the economy, including individuals, groups, institutions, and industries. The sports industry took one of the biggest hits, with all major leagues in the U.S. cancelling or halting their events. While these actions were necessary to address the public health concern, each segment is now floating various proposals to resume operations and give some relief to the significant portion of the economy that the sports industry comprises.
Major League Baseball (MLB) is likely to be the first American professional sporting league to resume, probably in May or June (Passan, 2020). Players are willing (although not all players agree on the method), the League is willing, the Arizona government is willing, and health professionals have approved a plan to move forward, known as the Arizona Plan. This plan calls for players, coaches, and staff to be quarantined in hotels around the Phoenix area, and to play in empty ballparks that include the ten Cactus League Spring Training parks, Chase Field, and other Phoenix ballparks. One interesting aspect of this arrangement is that stadium size will not matter because there will not be any fans. This will be an opportunity for MLB to get back into the spotlight and accumulate massive television viewership that it has not seen in decades. The experience will be completely optimized for TV viewing, so the league will finally be able to experiment with proposed rule changes, including removing mound visits to speed up the game, adding a Robo Umpire, which was successfully tested last season via a partnership with the independent Atlantic League (Bogage, 2019), and expanding rosters to give players more rest in the extremely hot temperatures of Phoenix. While all of this will alter predictions of who is going to the playoffs, probably the biggest impact that this plan will have on the games is the loss of the home field advantage (HFA): the advantage that the home team has over the visiting team due to the home team having fans, the home team's familiarity with its own ballpark, and the away team having to travel.
Baseball has been shown in previous studies to be less susceptible to the HFA effect than other professional sports (Edwards & Archambault, 1979; Gómez et al., 2011; Pollard et al., 2017). Despite this, there is a measurable home field advantage in baseball, as shown by Jones (2015, 2018). Building on this, we extend the analysis to MLB under uncertainty about which scenario the League will follow for the 2020 season. In particular, we simulate the win-loss probabilities for three different scenarios and estimate the home advantage for each team using the past three seasons' data. This estimation helps us understand how individual team performance is going to be impacted as teams prepare for the 2020 season under the new circumstances.

Data sources
We use the MLB 2017-2019 season data for the 30 teams represented in the league. The data were obtained from MLB Advanced Media's Baseball Savant website using the Python package PyBaseball 1.0.4 (LeDoux, 2017/2020). The data show that out of the 7,290 home games played during the 2017-2019 seasons, 3,881 (53.237%) resulted in wins and the remaining 3,409 (46.763%) resulted in losses. Next, we seek to quantify the HFA's role in this difference.

Calculating home advantage
There are various techniques to calculate the home advantage depending on the sport, gender, league, and the nature of scoring (Jones, 2015). Pollard et al. (2017) use a general linear model to fit the home advantage. However, because our outcome is a binary win-or-loss variable, we need to follow a non-linear approach. To test the hypothesis that teams have home-field advantage, we apply a logit regression model to predict the probability of winning as a function of a home game dummy, team fixed effects, opponent fixed effects, and the win-loss records. We estimate the regression model in Equation 1, where Win_i is a dummy variable that takes the value 1 if the recorded game resulted in a win for the team-opponent pair, and zero otherwise; Home_i is a dummy variable for a home game; Team_i controls for the individual team fixed effects; Opp_i controls for the opponent fixed effects; Z_i represents the win-loss percentage for the team as well as the opponent; and ε_i stands for the error term. The HFA is calculated accounting for the team fixed effects as well as the opponent fixed effects by interacting Home with Team and Opp separately. We run the logit model on all the data, with Home = 1 for the home team and Home = 0 for the travelling team. Doing so separates the team fixed effects from the home field advantage. The model in Equation 1 is used to estimate both the win probability and the HFA per team. The HFA is obtained by calculating the marginal effect (ME) of Home on the win probability for each team separately.
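A specification consistent with the variable definitions above can be written in latent-variable form. This is a sketch under stated assumptions, not necessarily the exact form of Equation 1: the coefficient symbols, the vector notation for the fixed effects, and the logistic distribution of the error term are assumed here.

```latex
\begin{aligned}
Win_i^{*} &= \beta_0 + \beta_1 Home_i + \beta_2' Team_i + \beta_3' Opp_i
            + \beta_4' (Home_i \times Team_i) + \beta_5' (Home_i \times Opp_i)
            + \gamma' Z_i + \varepsilon_i, \\
Win_i &= \mathbf{1}\left[ Win_i^{*} > 0 \right], \qquad
\Pr(Win_i = 1) = \frac{1}{1 + e^{-(Win_i^{*} - \varepsilon_i)}} .
\end{aligned}
```

Under this form, the interaction coefficients on Home × Team and Home × Opp are what allow the home effect to differ by team and by opponent.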
Development of the neural network model
A neural network model was also created to act as a robustness check for the logit win prediction model. The software to train the model is hosted on GitHub (Ehrlich, 2020a). We used the R package nnet 7.3-14 (Ripley & Venables, 2020) for the neural network platform, and trained and tuned the model with the R package caret 6.0-86 (Kuhn et al., 2020).
We developed a simulator to estimate what might happen if: 1) the full 2020 season continued on in a parallel universe devoid of COVID-19; 2) MLB waits and is able to return and play half a season to packed stadiums around the All Star Break, which assumes an extremely optimistic timeline for a return to normal life; 3) a full season is played without fans, which is likely the only way they will be able to play this season (i.e., the Arizona Plan). The simulation was executed 100 times, and the logit win prediction model was used as the basis for predicting each win. A random number between 0 and 1 was generated and checked against the win probability provided by the model. If the random number was below the probability, then the team won; otherwise the team lost.
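The random-draw rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' simulator (which is hosted on GitHub): the function name `simulate_seasons`, the `win_prob` callable standing in for the fitted model, the team labels, and the flat 54% probability are all hypothetical.

```python
import random


def simulate_seasons(schedule, win_prob, n_sims=100, seed=2020):
    """Replay a schedule n_sims times and return total wins per team.

    schedule: list of (home_team, away_team) pairs, one per game.
    win_prob: hypothetical callable giving P(home_team wins) for a pairing;
              it stands in for the fitted win prediction model.
    """
    rng = random.Random(seed)
    wins = {}
    for home, away in schedule:
        wins.setdefault(home, 0)
        wins.setdefault(away, 0)
    for _ in range(n_sims):
        for home, away in schedule:
            # A uniform draw below the model probability counts as a home win.
            if rng.random() < win_prob(home, away):
                wins[home] += 1
            else:
                wins[away] += 1
    return wins


# Toy usage with made-up team labels and a flat 54% home win probability.
toy_schedule = [("AAA", "BBB"), ("BBB", "AAA")]
toy_wins = simulate_seasons(toy_schedule, lambda home, away: 0.54)
```

A no-fan scenario could be approximated in this sketch by passing a `win_prob` that ignores which team is nominally at home.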

Results
The summary statistics of the training data are contained in Table 1. The logit results from Equation 1, without the fixed effects, are reported in Table 2. Both log-odds ratios and the MEs are reported in this table. The results show that the individual regressors included in the model have plausible impacts. Looking at the log-odds ratios, playing at home and a higher previous win-loss percentage (WL%) for the team both make a win more likely, whereas a higher opponent WL% makes a win less likely. These results support the presence of the HFA. The right half of the table shows the MEs for each variable. We are mainly interested in the ME for the Home variable, which is 0.064. This means that the probability of winning a game at home rather than away goes up by 6.4 percentage points. This is the average HFA for all of the teams as a whole. The HFA for each team is presented in Figure 1. In our sample, PHI seems to have the highest home advantage and HOU seems to have the lowest (negative, in fact) home advantage.
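To make the ME of Home concrete, the sketch below computes an average discrete change in predicted win probability when Home switches from 0 to 1 in a logit model. It is an illustration under assumptions: the text does not state whether the reported ME is a discrete change or a derivative, and the coefficients and rows used here are made up, not the values in Table 2.

```python
import math


def logistic(x):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def home_marginal_effect(rows, beta, home_idx):
    """Average change in P(win) when the Home dummy switches 0 -> 1.

    rows: list of regressor vectors (one per game).
    beta: logit coefficients aligned with the regressors.
    home_idx: position of the Home dummy in each row.
    """
    diffs = []
    for row in rows:
        at_home = list(row)
        at_home[home_idx] = 1.0
        away = list(row)
        away[home_idx] = 0.0
        p_home = logistic(sum(b * x for b, x in zip(beta, at_home)))
        p_away = logistic(sum(b * x for b, x in zip(beta, away)))
        diffs.append(p_home - p_away)
    return sum(diffs) / len(diffs)


# Hypothetical columns: intercept, Home, team WL%, opponent WL%.
example_rows = [[1.0, 1.0, 0.55, 0.48], [1.0, 0.0, 0.48, 0.55]]
example_beta = [0.0, 0.26, 3.0, -3.0]  # made-up coefficients
example_me = home_marginal_effect(example_rows, example_beta, home_idx=1)
```

A per-team HFA along the lines of Figure 1 would restrict `rows` to one team's games and include that team's interaction terms.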
The model was trained using the 2017-2019 MLB regular season games. The schedule for the 2020 season was estimated using the schedule from the 2019 season. While the dates will be off slightly, the team pairings will be nearly the same. The wins and losses of the 100 simulations were added to form the result of the 2020 season. The overall results are visualized in Figure 2, while the divisional results are shown in Table 3. Table 4 provides statistics calculated during each season and averaged. These include the correlation between the full season and the half and no-fan seasons using both the overall rankings and the win-loss percentage (WL%). The full seasons' rank correlations are higher with the no-fan seasons (0.825) than with the half seasons (0.735). The correlations using WL% are similar. The standard deviation of the predicted win probabilities is lower for the no-fan seasons (0.073) than for the full (0.085) and half seasons (0.085). The home effect was correlated with the win probabilities' standard deviations, and the correlation is negative for the no-fan seasons (-0.221). In other words, the higher the home effect, the lower the variance.
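The comparison statistics above can be illustrated with a small standard-library sketch. The text does not name the rank correlation coefficient, so Spearman's is assumed here, and the WL% values are invented for the example rather than taken from Table 4.

```python
from statistics import mean, stdev


def ranks(values):
    """Rank 1 = highest value; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up WL% for four teams under two scenarios.
full_wl = [0.62, 0.55, 0.48, 0.40]
no_fan_wl = [0.63, 0.50, 0.52, 0.39]
rank_corr = spearman(full_wl, no_fan_wl)
wl_corr = pearson(full_wl, no_fan_wl)
spread_full = stdev(full_wl)
```

In this toy data the second and third teams swap places between scenarios, so the rank correlation falls below 1 even though the WL% values stay close.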
The neural network was also used as the win predictor in 100 simulations, and the results are very similar to those of the logit win prediction model, which shows robustness in the simulation results. Table 4 shows the statistical results of both models. The correlation and standard deviation differences are approximately the same between the two models.
The results of the simulations are available as Extended data (Ehrlich, 2020b), and the code necessary for replicating the results, including training the models, is hosted on GitHub (Ehrlich, 2020a).

Discussion
Based on the above results, since the team-home effect is symmetric between home and away, teams will not necessarily win or lose any additional games in neutral stadiums, as teams with a high home field effect will lose more neutral games that would have been at home but will win more neutral games that would have been away. The greater the home-team ME, the less variance there will be in the predicted win probabilities.
To verify this assumption, we calculated the correlation between HomeEffects and the standard deviation (SD) of win probabilities for a full season (0.361) and a no-fan season (-0.221). Since the home effect is symmetric for each team (the away field disadvantage = -the home field advantage), decreasing the variance does not affect the overall expected WL% for each team. However, the result of individual games will be different since home effect is asymmetric between teams. For example, if the Cubs (highest home effect in the NL Central) play the Cardinals (lowest home effect in the NL Central), the Cubs will have a larger advantage playing at home than the Cardinals will have playing at home (apart from team fixed effects). These differences are removed in the No-Fan scenario, and the outcome will be based solely upon the talent of the teams. However, on average there is only a slight change in overall WL% (or playoff berths); what changes is the SD of the results (see Table 3). Without fans, any advantage (or disadvantage) from home field advantage, which causes higher levels of variance, is removed. This stabilizes the outcome based upon true team talent. As fewer games are played, the half season will have more upsets, but the SD is close to that of the full season.
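The symmetry argument can be written out as a short worked equation. As a simplifying assumption (not taken from the model itself), let team t win with probability p_t + h_t at home and p_t - h_t away against an average opponent, with n/2 home games and n/2 away games, where p_t and h_t are illustrative symbols for neutral-site strength and home effect.

```latex
\begin{aligned}
E\left[W_t^{\text{fans}}\right] &= \tfrac{n}{2}\,(p_t + h_t) + \tfrac{n}{2}\,(p_t - h_t) = n\,p_t, \\
E\left[W_t^{\text{neutral}}\right] &= n\,p_t .
\end{aligned}
```

Expected wins are the same in both cases, but the game-level win probabilities range over p_t ± h_t with fans and collapse to p_t in neutral parks, which is why the spread shrinks while the expected WL% does not move. In the logit model the symmetry holds on the log-odds scale, so this equality is only approximate on the probability scale.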

Conclusion
This paper analyzes MLB data from previous seasons to estimate the win-loss probabilities for the 2020 season for each of the 30 teams in the League using logit regressions and a neural network. The Arizona Plan's neutralization of HFA would not significantly affect the overall outcome of the season.
In fact, our model predicts that the Arizona Plan season will produce season results that are based more on the true talent of the teams. Further, our simulation demonstrates that there will be less variance in the win probability between any two teams, which we estimate will cause a larger divide between the best and worst teams. In conclusion, we believe that the results of the Arizona Plan will be similar to those of a regular season with fans, and that the teams' standings at the end of the regular season will be more predictable than in a normal season.
This project contains the following source data files:
• data/2008_2019Games.csv. (Input data scraped using the scrapingMLB.ipynb.)
• data/divisions.csv. (Input team division data for grouping by division.)
Mixed data and code hosted on GitHub and Zenodo are available under the terms of the GNU General Public License v3.0.
Data hosted on Harvard Dataverse are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).