For this week’s #MakeoverMonday, we’re looking into cost effectiveness in Major League Baseball. More specifically, how does a player’s or team’s salary translate into productivity on the field across a variety of statistical categories? For instance, if Player X made $20 million in 2015 and hit 20 home runs that year, you could say the team is paying $1 million per home run hit by Player X. Alright, let’s get started.
Step 1. Understanding the Data
Since I’m a lifelong baseball fan, this is a data scenario that is familiar to me. However, Andy and Eva did a great job of including links on the data sets page for those who may be less familiar with the sport. If this were, say, rugby data, I would absolutely be diving into those resources, so if you ever feel uncomfortable with a data set, be sure to do a little bit of research. One thing I will mention is that the data set appears to focus on hitting stats only and does not include pitching stats. However, pitchers are included in the data set, because they do compile hitting stats in certain situations. If you’re unfamiliar with the rules and why this could matter for this data set, here are a few notes:
- Pitchers ONLY hit in games that are played in National League ballparks
- Starting pitchers ONLY start every 4th or 5th game
- MOST (not all) pitchers are not very good at hitting the ball
- MOST (not all) good pitchers have high salaries
So, if the average National League pitcher starts every 5th game (32 starts) and gets three plate appearances per game, that comes to 96 plate appearances for the season. So why does any of this matter? As mentioned, pitchers typically aren’t great at hitting the ball, so their hitting stats could look very poor when compared to the average position player (all players on the field other than the pitcher are referred to as position players). So, if we’re analyzing the cost a team pays players per home run, for instance, let’s look at an example of what it could look like when comparing a pitcher vs. a good position player.
- Pitcher X
  - Makes $25 million and hits 1 home run in 96 plate appearances
  - This would suggest we pay Pitcher X $25 million per home run hit
- Position Player Y
  - Makes $25 million and hits 25 home runs in 432 plate appearances
  - We pay Position Player Y $1 million per home run hit
This scenario would lead you to believe that Position Player Y is a much more cost-effective player, when the reality is simply that he is paid to hit the ball, while Pitcher X is paid primarily to pitch the ball. And since the data set does not include a field for each player’s “position,” we’re unable to simply filter pitchers out of the data set. Therefore, it may make sense to filter on plate appearances, with a minimum of 200 per season. This would filter the pitchers out since, in my opinion, it does not make sense to include them in this analysis. I apologize for the long-winded explanation, but in my first glance at the data, I saw this as potentially tripping up some participants who may not be familiar with the game. Ok, what about the original viz?
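For anyone who wants to try the same idea outside Tableau, here is a minimal Python sketch of the plate-appearance filter and the cost-per-home-run arithmetic. The field names and the two sample rows are assumptions based on the hypothetical Pitcher X and Position Player Y example, not values from the actual data set.

```python
# Minimal sketch: drop low-plate-appearance players, then compute cost per home run.
# The dictionary keys and sample rows are illustrative assumptions only.

MIN_PLATE_APPEARANCES = 200  # threshold suggested in the text

players = [
    {"name": "Pitcher X", "salary": 25_000_000, "home_runs": 1, "plate_appearances": 96},
    {"name": "Position Player Y", "salary": 25_000_000, "home_runs": 25, "plate_appearances": 432},
]


def cost_per_home_run(salary, home_runs):
    """Return salary divided by home runs, or None when no home runs were hit."""
    if home_runs == 0:
        return None
    return salary / home_runs


def qualified(rows, min_pa=MIN_PLATE_APPEARANCES):
    """Keep only rows that reach the minimum number of plate appearances."""
    return [row for row in rows if row["plate_appearances"] >= min_pa]


for row in qualified(players):
    print(row["name"], cost_per_home_run(row["salary"], row["home_runs"]))
```

With the threshold at 200, only Position Player Y survives the filter. Keep in mind that a cutoff like this is a blunt tool, as it would also drop any position player with limited playing time.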
Step 2. The Original Viz
The scatter plots on the original viz are easy enough to understand, but to be honest, the way they were labeled made them difficult for me to follow, especially the bottom team section. Also, I didn’t find the team section all that interesting, because it was basically just showing us which teams have the lowest payrolls (Houston Astros) and which teams have the highest payrolls (New York Yankees and Los Angeles Dodgers). I guess the most interesting part of the team section was that it tells us just how unbelievably bad the Miami Marlins were, offensively, in 2013. Wow!! Last in the league in all five categories.
Step 3. Try New Things
A while back, I saw this really good video by Andy on how to build a no-whisker box plot, and I have been waiting for the right opportunity to try creating something similar. I was hopeful this data set would provide that opportunity, but after working through a few different scenarios, I was unhappy with the results. So, we’ll continue to file that chart type away for a different day and move on to something else. Another recent viz I really liked was this beautiful viz by Lindsey Poulter, which used a stepped line and dot combo chart to capture the magical 2018 season of Kansas City Chiefs QB Patrick Mahomes. In his #WorkoutWednesday challenge for Week 4 of 2019, Curtis Harris built a similar chart that tracked headcount. I really loved not only the look of these vizzes, but also how easy they are to understand. So, I decided to go with this chart type, but the question was, what data would it work well with? It’s probably worth mentioning that in a business setting, choosing your chart type first is probably not going to be the best approach. However, one of the great things about Tableau Public and community projects like #MakeoverMonday is that they offer us great opportunities to try new things and approach data visualization in different ways, in a safe environment.
Step 4. Finding the Story
The next step was to begin playing around with the data to find a story that fit the vision I had in my head. Early on, I had ruled out looking at team data, as I wanted to focus on players instead. Looking at hits, runs, RBI and home runs, I worked through some different ideas before landing on a viz that would feature the most recent members of the 500 home run club. Leveraging the stepped line and dot combo chart, I felt it would be fun to visualize each player’s home runs by season, along with their team’s cost per home run (or the player’s salary per home run, whichever way you prefer looking at it). What I expected to see was that as a player’s salary increased throughout their career, their salary per home run would increase fairly closely along with it. While this was true in a majority of cases, it certainly was not the case for all players on the list, and in other cases the increase in cost per home run was not as steep as I had guessed. Now that I had found a story, the next step was communicating it with a clean, engaging design.
Step 5. Simplicity in Design
I used just two colors in the viz, with a third for non-data-related text. My colors were shades of red and blue, the colors of the MLB logo. For the chart, I set home runs as the stepped line chart and salary per home run as the sized dots. Here’s what it looked like.
The chart looked nice, but something very important was missing: salary by season. I immediately thought back to a fantastic blog post by Ryan Sleeper, in which he shares creative ways to use transparent sheets to add context to your vizzes. This was exactly what I needed: context. The moment I saw it, I fell in love with Ryan’s bar chart trend pushed to the background. So, I implemented this strategy, with player salary set as the trend in the background at an opacity of 20%. This way it would be there for context, but not draw attention away from the other chart. With home runs set to a running total for the player’s career, this worked out well, because as home runs increased, a player’s salary typically increased as well. Here’s what it looked like after floating the stepped line on top of the bar chart.
To add more context, I included text for each player’s career salary, home runs and salary per home run, as well as the years they played. Lastly, I wanted the reader to be able to see the differences in all three measures, so I fixed the y-axis for the home runs stepped chart from 0 to 800, fixed the salary bar chart from 0 to $40 million and fixed the salary per home run dot size from $0 to $5 million. Below is a view after adding the text. Since I didn’t show any axes, I included an explanation through the use of an info button.
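The measures behind this chart are simple to reproduce. Below is a rough Python sketch of the running home run total, the per-season salary per home run, and the career summary figures. It assumes one row per season, and the numbers belong to a made-up player rather than anyone in the data set.

```python
from itertools import accumulate

# Hypothetical seasons for one made-up player: (year, home_runs, salary in dollars).
seasons = [
    (2001, 20, 2_000_000),
    (2002, 30, 6_000_000),
    (2003, 25, 10_000_000),
]

home_runs = [hr for _, hr, _ in seasons]
salaries = [salary for _, _, salary in seasons]

# Running total of career home runs (the stepped line).
running_home_runs = list(accumulate(home_runs))

# Salary per home run for each season (the sized dots); None when no home runs were hit.
salary_per_hr = [salary / hr if hr else None for hr, salary in zip(home_runs, salaries)]

# Career summary figures (the text shown alongside each chart).
career_salary = sum(salaries)
career_home_runs = sum(home_runs)
career_salary_per_hr = career_salary / career_home_runs

print(running_home_runs, salary_per_hr, career_salary_per_hr)
```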
Step 6. Sense Checking the Data
It was not until after building the entire viz that I really took the time to look closely at the numbers to make sure everything made sense. Guess what? It didn’t!! For a while, I couldn’t figure out why a few players had extreme spikes in their salaries. Take a look at the before and after comparison below. Look at those spikes!! Why on earth would Gary Sheffield randomly make nearly $30 million in one season and then go back to making $9-10 million for the next several years? Answer? He wouldn’t, so there was clearly something wrong with either the data or one of my calculations.
incorrect salary figures
corrected salary figures
After digging around, here’s what I found. My ‘FIXED Player Salary’ calculation had originally been set up as SUM([Salary]), as I had not taken into account the fact that if a player had played for more than one team during the same season, they would have more than one row of data for that season. Here’s what the incorrect calculation and the result looked like.
I was certain $29.9 million was incorrect, but I also wanted to be sure that the $14.9 million figure was correct, so I checked the trusty old baseball-reference.com and saw that the numbers matched and Sheffield did indeed make that amount in 1998. So, I needed to change my calculation to use MIN([Salary]) instead of SUM([Salary]), so that a salary repeated across multiple rows for the same season is only counted once.
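To make the double counting concrete, here is a small Python sketch of the same problem. It only mirrors the logic of the calculation and is not Tableau syntax. The rows, names, and salaries are hypothetical, and it assumes the full-season salary is repeated on every row for a player who changed teams mid-season.

```python
from collections import defaultdict

# Hypothetical rows: one row per player, season, and team.
# Assumption: the full-season salary is repeated on each row for a traded player.
rows = [
    {"player": "Player A", "year": 1998, "team": "Team 1", "salary": 10_000_000},
    {"player": "Player A", "year": 1998, "team": "Team 2", "salary": 10_000_000},
    {"player": "Player A", "year": 1999, "team": "Team 2", "salary": 9_000_000},
]


def salary_by_season(rows, aggregate):
    """Group salaries by (player, year) and apply the given aggregate function."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["player"], row["year"])].append(row["salary"])
    return {key: aggregate(values) for key, values in grouped.items()}


summed = salary_by_season(rows, sum)   # double counts the two-team season
minimum = salary_by_season(rows, min)  # counts the repeated salary once

print(summed[("Player A", 1998)], minimum[("Player A", 1998)])
```

Under that assumption, the maximum or the average would give the same result as the minimum. The sum is the only one of these that inflates the two-team season.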
Overall, I enjoyed working with the data set this week and wound up spending a lot more time on this viz than on a typical #MakeoverMonday, mostly because I spent so long playing around with and exploring the data. Below is a look at my final viz. The interactive version can be found here. Thanks for reading, I hope you enjoyed!!