I’ve been on hiatus from blogging for a long time. Chalk it up to completing the dissertation and then taking a post-dissertation writing break. This post won’t have much to do with news or my main academic research. Instead, I am going to apply my research skills to one of my favorite hobbies: baseball.

Fair warning: this post is ~2800 words. I made my own metric for evaluating baseball players, then applied it to home runs hit per start across more than 100,000 starts since 2011.

Those of you who know me well know that I am a die-hard New York Mets fan. I still carry the ticket from the last game I saw in Shea Stadium. Over the last two years I have also become an active fantasy baseball player. I had played fantasy baseball for eight or nine years and established a consistent pattern: draft reasonably well, get bored or frustrated by injuries, stop playing. Last year I managed my teams for the entire season and *finally* won a random public league. (I also lost a playoff round by .0005 in batting average after winning the regular season by 18 games.)

This year I was leading four of my five leagues in late July when my friend Gil started talking to me about daily fantasy baseball. Gil got me into fantasy baseball in the first place, and taught me everything I know, but he’d gotten bored and moved on to try to make some money. We started talking about how these leagues work. You get a budget to spend on players, and you get points based on their performance. Depending on the daily fantasy game, either the top 44% or so of players double their money (the house takes the rest) or the bulk of the prize goes to the top handful of finishers.

I thought my research skills could give me a major edge on other daily fantasy players. I had just finished a dissertation where most of the statistical models were specially designed to look at rare events. You know what else is a rare event? Hitting home runs. Even the best professional baseball player is more likely to have 0 home runs on a given day than to hit one out of the park. So I set out to try and create a statistical model for predicting how many home runs a player will hit on a particular day.

Before I get into the details of how I did the study and what I’ve found so far, I want to say that I am **not** writing this post now because I cashed in on daily fantasy baseball. After a few weeks of tinkering around with the data, my models suggested a good deal of randomness. Yes, people with a track record for hitting home runs are more likely to hit one out of the park on any particular day, but daily performances for hitters are highly volatile. I quickly concluded that success in daily fantasy sports is more about game theory. Predicting who other players will buy for their team was more valuable than predicting which hitter would be most likely to succeed. Since any insider information would threaten the integrity of daily fantasy, I decided not to play and shelved the research project for a few months. Given recent scandals over using inside information, I’m pretty happy that I stuck to traditional year-long fantasy baseball instead of investing my hard-earned cash in a daily fantasy site.

That being said, there could be a lot of valuable non-gambling reasons to do research on how many home runs someone will hit in a particular game. Today marks the first of the two wild card games. The Yankees and Astros both have a reputation for being feast or famine offenses, dependent on the long ball to win games. Anything can happen in a one game playoff, but home runs may play a big role in determining tonight’s outcome. (It could also help people win more traditional fantasy leagues.)

**The Study**

My goal in this post is to try to explain what will lead a particular batter to hit a particular number of home runs on a particular day. Baseball is notorious for streaky home run hitters. Lucas Duda hit three home runs in his first 41 games, then hit six in his next 7 games, then went another 35 games with only one home run. Duda eventually hit 27 home runs. Would a player like Lucas Duda who hit his home runs in bunches be more or less likely to hit a home run in the Mets’ first playoff game than Curtis Granderson?

To start, I downloaded the last five years of play-by-play information from *Retrosheet*. Retrosheet maintains the copyright but makes data freely available for any use, making it the best source of information for this type of research.

The first thing I needed to do with this data was create some measure of how good a particular player was at hitting home runs over a particular period of time. Simply counting home runs over a season isn’t an optimal measure of skill. Counting may underestimate the skill of a player like Nelson Cruz or Troy Tulowitzki, because injuries have limited their playing time. As we know, counting is also a limited measure because playing in particular ballparks can have a major impact on home run totals.

To solve these problems, I created a measure for “home run skill” that compares the number of home runs a player hit over a certain amount of time to the number of home runs an average hitter would obtain by playing in the same ballparks. A player who does “as expected” would get a 0. Players who do better than average have positive values for skill, while below average players get negative values. The concept is similar to how someone might measure OPS+, but constructing my own measure allows skill to carry over from one season to the next. Other implications are somewhat novel, so I should add a few notes before going on:

- The comparison is to the average hitter (arithmetic mean), not a replacement level hitter.
- Pitchers hitting is excluded. Apologies to Bartolo Colon.
- Skill is measured over some specified amount of time. I have tried plate appearances, game appearances and games started. I haven’t tried team games yet because I want to assume hitters are actually in the game, hitting.
- The expected performance part of the skill equation tracks the specific road ballparks a player is at, along with the home park.
- Unlike OPS+, I do not separate AL and NL teams.
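
As a rough illustration of the idea, here is a minimal Python sketch (not the Stata code used for this post). It assumes skill is simply actual home runs minus the home runs an average non-pitcher would hit in the same parks, which matches the "0 means as expected" description above. The park codes, the per-start park rates and the function name `home_run_skill` are all made up for the example.

```python
# Hypothetical league-average home runs per start for a non-pitcher in each park.
# These rates are invented for illustration; they are not Retrosheet numbers.
PARK_HR_PER_START = {"NYN": 0.095, "COL": 0.130, "SFN": 0.080}


def home_run_skill(starts, window):
    """Actual minus expected home runs over the most recent `window` starts.

    `starts` is a chronological list of (park_code, home_runs_hit) tuples.
    Returns None when the player has fewer than `window` starts on record.
    """
    if len(starts) < window:
        return None
    recent = starts[-window:]
    actual = sum(home_runs for _, home_runs in recent)
    expected = sum(PARK_HR_PER_START[park] for park, _ in recent)
    return actual - expected


# A made-up five-start stretch: three homers against 0.53 expected.
example_starts = [("NYN", 0), ("NYN", 1), ("COL", 0), ("SFN", 2), ("COL", 0)]
print(home_run_skill(example_starts, window=5))
print(home_run_skill(example_starts, window=10))  # not enough starts yet
```

Tracking the park for every start, rather than only the home park, is what lets the expected total reflect the specific road parks a player visited.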

One problem with measuring home run skill this way is the range of values is based on the timeframe selected. If we look at a player’s last 162 starts, home run skill ranges from 23 homers below expected to 36.775 homers above expected. If we look at a player’s last fifteen starts, home run skill ranges from -2.589 to 8.268, because fifteen games isn’t as much time for Mike Trout to differentiate himself from Ben Revere (career 4 homers in 2660 plate appearances).

To put different time frames in similar units, I standardize home run skill for any particular time frame, to measure players in standard deviations away from the mean. (This has one important side effect: really hot players will be further away from the mean over short time periods like 15 games, but over 162 games we will see some regression to the mean.) I created a wide range of other skills for hitters using the same method, but they will be saved for a future post.
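
The standardization step can be sketched in a few lines of Python. This is only an illustration under assumptions: the raw skill values are made up, and I use the population standard deviation here, which may not match what Stata does by default.

```python
from statistics import mean, pstdev


def standardize(raw_skill):
    """Convert raw skill values to standard deviations away from the mean."""
    mu = mean(raw_skill)
    sigma = pstdev(raw_skill)  # population SD; an assumption for this sketch
    return [(value - mu) / sigma for value in raw_skill]


# Made-up raw skill values for the same five hitters over two window lengths.
raw_15_starts = [-1.0, -0.5, 0.0, 0.5, 1.0]
raw_162_starts = [-10.0, -5.0, 0.0, 5.0, 10.0]

z_15 = standardize(raw_15_starts)
z_162 = standardize(raw_162_starts)
print(z_15)
print(z_162)
```

The two raw scales differ by a factor of ten, but after standardizing both windows are expressed in the same units.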

**Regression Models**

Choosing the best form of regression for looking at home runs is more difficult than it would appear. [The following paragraph will refer to some regression models that were not taught in the required statistics classes in my PhD program. Feel free to skip ahead if you’d prefer.]

Basic ordinary least squares regression could work, but it is unlikely to be the best option. The number of home runs someone hits in a game does not follow a normal, bell curve distribution. The most common outcome is someone hitting zero home runs. Negative binomial regressions are ideally suited to this kind of rare outcome. Unfortunately, none of the negative binomial regressions I have tried so far have converged successfully, for any hitting outcome. (nbreg can be a bit testy that way.) Zero-inflated negative binomial regression, using park effects in the inflation model, also failed to converge.

As a result, the best model that actually converges on some kind of prediction equation is a Poisson model. Poisson models are designed for counts of discrete events that don’t follow a bell curve distribution: most values are small, and a few people have unusually large counts. Home runs hit in a season is a perfect example. The number of languages someone speaks is another good example. Poisson would make a lot of sense for hits per game, but wasn’t my first choice for looking at home runs. If anything is off, this could be one of the main reasons why.
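
To see why a count model fits this kind of outcome, here is a small Python sketch of the Poisson probability formula. The rate of 0.10 home runs per start is a placeholder I picked for illustration, not an estimate from the data.

```python
from math import exp, factorial


def poisson_pmf(count, rate):
    """Probability of observing exactly `count` events at the given mean rate."""
    return exp(-rate) * rate ** count / factorial(count)


HYPOTHETICAL_RATE = 0.10  # placeholder home runs per start, for illustration only

for home_runs in range(4):
    probability = poisson_pmf(home_runs, HYPOTHETICAL_RATE)
    print(f"P({home_runs} HR) = {probability:.4f}")
```

At a rate like this, zero is by far the most likely outcome and multi-homer games are vanishingly rare, which is the shape described above.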

To start off, let’s see if a player’s long term track record predicts whether they are more likely to hit a home run today. If players who hit more home runs over a 162 game season are not more likely to hit more home runs today, then hitting a home run would be almost completely random. The outcome is the number of home runs a player hits in a particular game. Since players who are not in the starting lineup are much less likely to hit home runs, I am limiting this to players who start. In this regression model I only use one independent variable: home run skill based on a player’s last 162 starts. If a player had yet to start 162 games before the game in question, they are excluded from the analysis. Among other things, this means the entire 2010 season is used as seed data to establish players’ track records to predict performances starting in 2011.^{1} I am left with 105,484 starts from players with an established track record.

(Apologies in advance for the ugly copy/paste from Stata. I rename a few variables in the output for easier reading, since I’m lame and the actual variable names are > 8 characters.)

. poisson hr_pg std_hrskill_162gs if under162gs==0

Iteration 0: log likelihood = -37885.596
Iteration 1: log likelihood = -37885.596

Poisson regression

| Number of obs | LR chi2(1) | Prob > chi2 | Log likelihood | Pseudo R2 |
| --- | --- | --- | --- | --- |
| 105484 | 1893.60 | 0.0000 | -37885.596 | 0.0244 |

| hr_pg | Coef. | Std. Err. | z | P>\|z\| | 95% CI lower | 95% CI upper |
| --- | --- | --- | --- | --- | --- | --- |
| hr sk:162 GS | .3843963 | .0087115 | 44.12 | 0.000 | .367322 | .4014706 |
| _cons | -2.238728 | .0097242 | -230.22 | 0.000 | -2.257787 | -2.219669 |

These results fit our expectations. A player who is one standard deviation better at hitting home runs over his prior 162 starts will hit *e*^0.3843963 ≈ 1.4687 times as many home runs in the current game. Since Poisson is an exponential regression model, the effect compounds: a hitter two standard deviations above average would be expected to hit more than twice as many home runs per game as an average hitter (*e*^(2 × 0.3843963) ≈ 2.16). Don’t get too excited though. The very large and negative constant (_cons) tells us that hitting home runs is still rare. One way to examine this is with Stata’s “margins” command, which allows us to predict the number of home runs someone hits per game at different levels of skill. For illustration I chose -2 standard deviations of skill, -1, 0, +1, +2 and +3.

. margins, at(std_hrskill_162gs=(-2(1)3)) vsquish

Adjusted predictions. Number of obs = 105484. Model VCE: OIM.

Expression: Predicted number of events, predict(), at std_hrskill_162gs = -2, -1, 0, 1, 2, 3.

| _at (std_hrskill_162gs) | Margin | Delta-method Std. Err. | z | P>\|z\| | 95% CI lower | 95% CI upper |
| --- | --- | --- | --- | --- | --- | --- |
| -2 SD | .049414 | .0011259 | 43.89 | 0.000 | .0472074 | .0516207 |
| -1 SD | .0725758 | .0011029 | 65.80 | 0.000 | .070414 | .0747375 |
| 0 SD | .106594 | .0010365 | 102.84 | 0.000 | .1045624 | .1086256 |
| +1 SD | .1565576 | .0016416 | 95.37 | 0.000 | .15334 | .1597751 |
| +2 SD | .2299404 | .0038278 | 60.07 | 0.000 | .222438 | .2374427 |
| +3 SD | .3377197 | .0082453 | 40.96 | 0.000 | .3215592 | .3538803 |

Based on this regression model, a player with average skill at hitting home runs would hit .1066 homers per game they start. If we do a little back of the envelope math and assume four plate appearances per start, that translates to roughly one home run every 37.5 plate appearances. Home runs are pretty rare, even for hitters who do an average job at hitting home runs. A player with +2 standard deviations of home run hitting skill, based on their track record over 162 games, would hit 0.2299 home runs per game they start. That is more than double what a hitter with average home run skill would accomplish. However, Mets fans shouldn’t expect a Yoenis Cespedes home run every playoff game just because he hit 35 in the regular season.
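
For readers who want to check the arithmetic, the predicted rates can be recomputed from the two coefficients in the first regression, since a Poisson model with one predictor predicts exp(constant + coefficient × skill). The Python sketch below is mine, not part of the Stata workflow. The four plate appearances per start is the same back-of-the-envelope assumption as above, and the chance of at least one homer assumes the Poisson model holds. Because it uses the unrounded coefficients, hand-rounded figures may differ in the last digit or so.

```python
from math import exp

# Coefficients copied from the 162-start Poisson regression output.
CONS = -2.238728
COEF_162GS = 0.3843963
PLATE_APPEARANCES_PER_START = 4  # rough assumption, as in the text


def predicted_hr_per_start(skill_sd):
    """Expected home runs per start for a hitter `skill_sd` SDs from average."""
    return exp(CONS + COEF_162GS * skill_sd)


for skill_sd in range(-2, 4):
    rate = predicted_hr_per_start(skill_sd)
    pa_per_hr = PLATE_APPEARANCES_PER_START / rate
    chance_of_homer = 1 - exp(-rate)  # P(at least one HR) under Poisson
    print(f"{skill_sd:+d} SD: {rate:.4f} HR/start, "
          f"one HR per {pa_per_hr:.1f} PA, "
          f"P(at least 1 HR) = {chance_of_homer:.3f}")
```

The rates printed by this loop should line up with the Margin column from the margins command.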

Cespedes is an interesting case because he had one of the most memorable hot streaks upon being traded to the Mets. How well does a hot streak predict how someone will do today? Let’s define a hot streak as skill for hitting home runs over the last 15 starts, as opposed to 162. (Veteran fantasy baseball players may recognize the number 15.)

. poisson hr_pg std_hrskill_15gs if under162gs==0

Iteration 0: log likelihood = -38522.133
Iteration 1: log likelihood = -38522.131

Poisson regression

| Number of obs | LR chi2(1) | Prob > chi2 | Log likelihood | Pseudo R2 |
| --- | --- | --- | --- | --- |
| 105484 | 620.53 | 0.0000 | -38522.131 | 0.0080 |

| hr_pg | Coef. | Std. Err. | z | P>\|z\| | 95% CI lower | 95% CI upper |
| --- | --- | --- | --- | --- | --- | --- |
| HR sk: 15gs | .2084912 | .0080806 | 25.80 | 0.000 | .1926536 | .2243289 |
| _cons | -2.202273 | .0094182 | -233.83 | 0.000 | -2.220732 | -2.183814 |

We see there is some effect of a hot streak, if the only independent variable we are using is recent performance. However, this simple regression should already send up a few red flags. The size of the coefficient, 0.208, is roughly half the size of the coefficient for the 162 games started model. Since both skill variables are standardized, their coefficients can be compared directly. A one standard deviation increase in recent performance only has around half the effect of a one standard deviation increase in long term performance. If we put both variables in the same model, the effect of a short term hot streak may go away completely.

. poisson hr_pg std_hrskill_15gs std_hrskill_16_162gs if under162gs==0

Iteration 0: log likelihood = -37884.932
Iteration 1: log likelihood = -37884.932

Poisson regression

| Number of obs | LR chi2(2) | Prob > chi2 | Log likelihood | Pseudo R2 |
| --- | --- | --- | --- | --- |
| 105484 | 1894.93 | 0.0000 | -37884.932 | 0.0244 |

| hr_pg | Coef. | Std. Err. | z | P>\|z\| | 95% CI lower | 95% CI upper |
| --- | --- | --- | --- | --- | --- | --- |
| HR sk: 15gs | .0716725 | .0090748 | 7.90 | 0.000 | .0538863 | .0894587 |
| HR sk: 16 to 162 gs | .3473053 | .0095419 | 36.40 | 0.000 | .3286035 | .3660071 |
| _cons | -2.244009 | .0097878 | -229.27 | 0.000 | -2.263193 | -2.224825 |

This regression contains two independent variables. The first, home run skill over the past 15 starts, is the same variable used in the previous model. The second variable is skill based on the 16th through 162nd most recent starts. Overlapping effects are a major concern here. Players who hit more home runs in the long term should hit more home runs in the short term. Both should affect how many home runs a player hits today. Using non-overlapping periods cuts down on the overlap and leads to better predictions. (Better but not 100% ideal.)

Notice how a player’s short term hot streak is much worse at predicting how many home runs someone will hit today, once we control for a player’s longer term track record. The predictive power of a player’s longer term track record for hitting home runs is relatively unaltered. A margins command to look at the effect of short-term home run streaks illustrates this more clearly.

. margins, at(std_hrskill_15gs=(-2(1)3)) vsquish

Predictive margins. Number of obs = 105484. Model VCE: OIM.

Expression: Predicted number of events, predict(), at std_hrskill_15gs = -2, -1, 0, 1, 2, 3.

| _at (std_hrskill_15gs) | Margin | Delta-method Std. Err. | z | P>\|z\| | 95% CI lower | 95% CI upper |
| --- | --- | --- | --- | --- | --- | --- |
| -2 SD | .0976826 | .0022189 | 44.02 | 0.000 | .0933337 | .1020315 |
| -1 SD | .1049408 | .0015582 | 67.35 | 0.000 | .1018867 | .1079949 |
| 0 SD | .1127382 | .0010678 | 105.58 | 0.000 | .1106454 | .114831 |
| +1 SD | .1211151 | .0013466 | 89.94 | 0.000 | .1184757 | .1237544 |
| +2 SD | .1301143 | .0023357 | 55.71 | 0.000 | .1255364 | .1346923 |
| +3 SD | .1397823 | .00366 | 38.19 | 0.000 | .1326089 | .1469557 |
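
One way to put the two effects side by side is to turn each coefficient from the two-variable model into a multiplier on the expected home run rate. This is a small Python sketch of that arithmetic, not output from Stata; the function name `rate_multiplier` is just for illustration.

```python
from math import exp

# Coefficients copied from the two-variable Poisson regression output.
COEF_LAST_15 = 0.0716725      # skill over the last 15 starts
COEF_16_TO_162 = 0.3473053    # skill over starts 16 through 162


def rate_multiplier(coefficient, sd_change):
    """Factor by which the expected HR rate changes for a given SD change."""
    return exp(coefficient * sd_change)


for sd_change in (1, 2):
    hot_streak = rate_multiplier(COEF_LAST_15, sd_change)
    track_record = rate_multiplier(COEF_16_TO_162, sd_change)
    print(f"+{sd_change} SD: hot streak x{hot_streak:.3f}, "
          f"track record x{track_record:.3f}")
```

A one standard deviation hot streak raises the expected rate by about 7%, while one standard deviation of longer term track record raises it by about 42%. The 7% step is also the ratio between consecutive rows of the margins output for the 15-start variable.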

Readers with some statistics training may point out that short term performance still has an association with the number of home runs hit today. It is significant at the p < .001 level. However, I wouldn’t read too much into statistical significance as the only way to evaluate results with a sample size of over 100,000 player starts. With such a large sample, even minute differences can be statistically significant.
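
To make that concrete, here is a rough Python sketch using the hot streak coefficient and standard error from the two-variable model. It asks how small the coefficient could be and still clear p < .001, holding the standard error fixed at its reported value. That is a simplification on my part, since the standard error would shift a little if the estimate changed.

```python
from math import exp
from statistics import NormalDist

# Coefficient and standard error for the 15-start skill variable.
COEF_15GS = 0.0716725
SE_15GS = 0.0090748

z_stat = COEF_15GS / SE_15GS

# Two-sided critical value for p < .001 under the normal approximation.
z_crit = NormalDist().inv_cdf(1 - 0.001 / 2)

# Smallest coefficient that would still be "significant" at this standard error.
smallest_coef = z_crit * SE_15GS
smallest_multiplier = exp(smallest_coef)

print(f"z statistic: {z_stat:.2f}")
print(f"critical z for p < .001: {z_crit:.4f}")
print(f"smallest significant multiplier per SD: {smallest_multiplier:.3f}")
```

Under these assumptions, an effect of only about 3% more home runs per standard deviation would still count as significant, which is why the size of the effect matters more than the p-value here.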

For fantasy baseball players, these results have a greater implication for how to play the game. The reason I chose 15 starts as a cutoff is because the default in an ESPN.com fantasy league is to show player performance over the last 15 days. Fantasy baseball players often overreact to hitters who have sudden hot or cold streaks, getting rid of established hitters going through a rough patch and adding the hot bat to their lineup. My results suggest that if you needed to add one last hitter to win your fantasy league title last week, you would be better off adding a hitter who had been relatively good at hitting home runs all season (and was starting) over the hitter who unexpectedly hit a bunch of home runs last week. And if you want to figure out who to watch closely tonight, watch the players with the most home runs over the season. Pay less attention to announcers fixated on how hot a hitter is, at least for home runs.

**Up next: Pitchers!** Will the importance of long term track record over short term trends also apply to how many home runs a pitcher gives up in a game? (I may also write up other things like a batter’s walks, lineup order, and park effects if I have time. I’ve already looked at some models, but this post is __much__ longer than expected already.)

1: As we might expect, players who stay in the bigs long enough to get 162 starts are better than the average big leaguer, which includes players not talented enough to get 162 starts.
