Bump

sports
Author

Derek Sollberger

Published

August 12, 2022

Today I want to try making a bump plot while practicing with sports data. The Lahman data set contains a lot of historical data about Major League Baseball. Data scientists have been using bump plots for a few years now, and for this post I wish to credit code by Albert Rapp.

library("ggbump")
library("Lahman")
library("tidyverse")

For today’s easy foray, let us seek out the wins and losses of teams in the Teams data frame (I tend to call my data frames df for typing ease).

df <- Teams

There are about 3000 observations and 48 variables. I will need some of the column names.

colnames(df)
 [1] "yearID"         "lgID"           "teamID"         "franchID"      
 [5] "divID"          "Rank"           "G"              "Ghome"         
 [9] "W"              "L"              "DivWin"         "WCWin"         
[13] "LgWin"          "WSWin"          "R"              "AB"            
[17] "H"              "X2B"            "X3B"            "HR"            
[21] "BB"             "SO"             "SB"             "CS"            
[25] "HBP"            "SF"             "RA"             "ER"            
[29] "ERA"            "CG"             "SHO"            "SV"            
[33] "IPouts"         "HA"             "HRA"            "BBA"           
[37] "SOA"            "E"              "DP"             "FP"            
[41] "name"           "park"           "attendance"     "BPF"           
[45] "PPF"            "teamIDBR"       "teamIDlahman45" "teamIDretro"   
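Before filtering, a quick sanity check can confirm the size of the data frame and that the columns used later are present. This is only a sketch: the `needed` vector is a name introduced here for illustration, and the exact row count depends on the installed version of the Lahman package.

```r
library("Lahman")

df <- Teams

# number of rows (team-seasons), then number of columns
dim(df)

# columns used later in this post; `needed` is a hypothetical helper name
needed <- c("yearID", "lgID", "franchID", "divID", "Rank")
all(needed %in% colnames(df))

# first and last seasons available in this version of the package
range(df$yearID)
```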

For a quick exploration, let us filter for the past 10 seasons of baseball (2012 to 2021) and select the columns I will use later.

df <- Teams |>
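  # keep 2012 to 2021 so the data match the ten seasons named in the text,
  # even if a newer Lahman release includes later seasons
  filter(yearID >= 2012, yearID <= 2021) |>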
  select(yearID, lgID, franchID, divID, Rank)
head(df)
  yearID lgID franchID divID Rank
1   2012   NL      ARI     W    3
2   2012   NL      ATL     E    2
3   2012   AL      BAL     E    2
4   2012   AL      BOS     E    5
5   2012   AL      CHW     C    2
6   2012   NL      CHC     C    5

To be honest, I thought I was going to have to code up some function to rank team wins within the MLB divisions, but the Lahman database already provides that in the Rank column!
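For comparison, here is a rough sketch of what ranking by wins within each division could look like by hand. The column name `rank_by_wins` is made up for this example, and I am assuming that wins alone are a fair proxy; the official `Rank` may differ where teams tie or played unequal numbers of games, so the agreement rate is printed rather than assumed.

```r
library("Lahman")
library("dplyr")

my_ranks <- Teams |>
  filter(yearID >= 2012, yearID <= 2021) |>
  group_by(yearID, lgID, divID) |>
  # most wins gets rank 1; tied teams share the smaller rank
  mutate(rank_by_wins = min_rank(desc(W))) |>
  ungroup() |>
  select(yearID, lgID, divID, franchID, W, Rank, rank_by_wins)

# share of team-seasons where the hand-made rank matches the Rank column
mean(my_ranks$rank_by_wins == my_ranks$Rank)
```

Using the built-in `Rank` column is still the simpler choice for the plot, since it reflects the recorded standings.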

# first and last seasons, used to label each line at both ends of the plot
df_left <- df |> filter(yearID == 2012 & lgID == "NL")
df_right <- df |> filter(yearID == 2021 & lgID == "NL")
df |>
  filter(lgID == "NL") |>
  ggplot(aes(x = yearID, y = -Rank, color = franchID)) +
  geom_bump(size = 2) +
  geom_point(size = 5) +
  # x, y, and color are inherited from ggplot(); only the label is new here
  geom_label(aes(label = franchID), data = df_left) +
  geom_label(aes(label = franchID), data = df_right) +
  facet_wrap(. ~ divID, ncol = 1) +
  labs(title = "National League Standings",
       subtitle = "early draft of bump plot",
       caption = "Derek Sollberger") +
  theme(legend.position = "none",
        panel.background = element_blank())
Warning in f(...): 'StatBump' needs at least two observations per group
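The warning suggests that at least one franchise has only a single season inside one of the division panels, so no line can be drawn for it. A likely candidate is the Houston Astros, who played 2012 in the NL Central before moving to the American League in 2013. The sketch below checks this rather than assuming it; `nl`, `singletons`, and `nl_lines` are names introduced here for illustration.

```r
library("Lahman")
library("dplyr")

nl <- Teams |>
  filter(yearID >= 2012, yearID <= 2021, lgID == "NL")

# franchises with fewer than two seasons in a division panel
singletons <- nl |>
  count(divID, franchID) |>
  filter(n < 2)
singletons

# keep only franchises that can form a line (two or more seasons per panel)
nl_lines <- nl |>
  group_by(divID, franchID) |>
  filter(n() >= 2) |>
  ungroup()
```

One option would be to pass `nl_lines` as the `data` argument of `geom_bump()` while leaving the point layer on the full data, so a one-season team still shows up as a dot.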

bump plot