Introduction

The following work identifies the baseball teams that have achieved the largest number of wins on the smallest budgets since 1985 per the Lahman data package.


Preprocessing

This analysis links two Lahman package tables, viz. Salaries and Teams, to determine total wins per season and the salary cost of each win earned. See the following preview of the first observations in Salaries:


yearID teamID lgID playerID salary
1985 ATL NL barkele01 870000
1985 ATL NL bedrost01 550000
1985 ATL NL benedbr01 545000
1985 ATL NL campri01 633333
1985 ATL NL ceronri01 625000
1985 ATL NL chambch01 800000


See the following preview of select variables and first observations in Teams:


yearID teamID lgID G W L
1871 BS1 NA 31 20 10
1871 CH1 NA 28 19 9
1871 CL1 NA 29 10 19
1871 FW1 NA 19 7 12
1871 NY2 NA 33 16 17
1871 PH1 NA 28 21 7


Lastly, Salaries data are adjusted for inflation at a constant annual increase rate of 3%. Observe:


Salaries <- Salaries %>% 
  mutate(salary.adj = salary * (1.03)^(max(yearID) - yearID))

head(Salaries) %>%
  pander()
yearID teamID lgID playerID salary salary.adj
1985 ATL NL barkele01 870000 2175070
1985 ATL NL bedrost01 550000 1375044
1985 ATL NL benedbr01 545000 1362544
1985 ATL NL campri01 633333 1583383
1985 ATL NL ceronri01 625000 1562550
1985 ATL NL chambch01 800000 2000064


In order to calculate the cost per win, adjusted individual player salaries from the Salaries table must be aggregated by team and year in the Teams table.



Step 1: Aggregating Salaries

Using the Salaries table and grouping on variables yearID and teamID via package dplyr function group_by(), the following uses function summarize() to calculate and name the following summary variables for each year and team:


  • Total adjusted team budget using salary.adj and sum()

  • Total players using n()


# Your code here
  

  Salaries_summary <- Salaries %>%
  group_by(yearID, teamID) %>%
  summarise(team.budget = sum(salary.adj), count = n())


Salaries_summary %>% head() %>% pander()
yearID teamID team.budget count
1985 ATL 37018690 22
1985 BAL 28902709 22
1985 BOS 27244776 25
1985 CAL 36070894 28
1985 CHA 24616236 21
1985 CHN 31758313 22



Step 2: Join Salaries & Teams

The following merges the Salaries summary data and full Teams tables using either base R function merge() or package dplyr functions left_join().


  • Variables teamID and yearID are unique merging keys

  • Example: dataset_1 %>% left_join(dataset_2)


# Your code here

Teams_selected <- Teams %>%
  select(yearID, teamID, lgID, G, W, L)

merged_data <- left_join(Teams_selected, Salaries_summary, 
                         by = c("yearID", "teamID"))

merged_data %>% head() %>% pander()
yearID teamID lgID G W L team.budget count
1871 BS1 NA 31 20 10 NA NA
1871 CH1 NA 28 19 9 NA NA
1871 CL1 NA 29 10 19 NA NA
1871 FW1 NA 19 7 12 NA NA
1871 NY2 NA 33 16 17 NA NA
1871 PH1 NA 28 21 7 NA NA



Step 3: Calculate Cost per Win

The following uses the merged Salaries summary data and Teams data to create a new variable with function mutate() that calculates the total team salary cost per win, divided by $100,000.


  • Divide (/) total salaries from Step 1 by total wins (W)

  • Divide (/) that value by 100000


# Your code here

merged_data <- merged_data %>%
  mutate(cost.per.win = (team.budget / W) / 100000)

merged_data %>% head() %>% pander()
yearID teamID lgID G W L team.budget count cost.per.win
1871 BS1 NA 31 20 10 NA NA NA
1871 CH1 NA 28 19 9 NA NA NA
1871 CL1 NA 29 10 19 NA NA NA
1871 FW1 NA 19 7 12 NA NA NA
1871 NY2 NA 33 16 17 NA NA NA
1871 PH1 NA 28 21 7 NA NA NA



Step 4: Select, Filter, & Arrange

The following filters out teams in merged data with less than 25 players on their annual roster. It then arranges teams in ascending order by cost per win. Lastly, only variables are preserved to create the full version of the table preview.


  • Use filter() to remove observations with less than 25 players

  • Use arrange() to order observations by total cost per win

  • Use select() to preserve only essential variables


Objective: The following contains variables and initial values of the full table. Variable names may differ.


yearID teamID lgID Rank G W n team.budget cost.per.win
1988 PIT NL 2 160 85 27 13724134 1.615
1986 TEX AL 2 162 87 26 16367320 1.881
1987 ML4 AL 3 162 91 28 17186960 1.889


Solution: Create the full version of the table below.

# Your code here

final_table <- merged_data %>%
  filter(count >= 25) %>%
  arrange(cost.per.win) %>%
  select(yearID, teamID, lgID, G, W, count, team.budget, cost.per.win)

final_table %>% head() %>% pander()
yearID teamID lgID G W count team.budget cost.per.win
1988 PIT NL 160 85 27 13724134 1.615
1986 TEX AL 162 87 26 16367320 1.881
1987 ML4 AL 162 91 28 17186960 1.889
1989 BAL AL 162 87 26 18381537 2.113
1986 SEA AL 162 67 26 14462380 2.159
1990 CHA AL 162 94 31 20469286 2.178