Introduction

The following work identifies the baseball teams that have achieved the largest number of wins on the smallest budgets since 1985 per the Lahman data package.

Preprocessing

This analysis links two Lahman package tables, viz. Salaries and Teams, to determine total wins per season and the salary cost of each win earned. See the following preview of the first observations in Salaries:

yearID	teamID	lgID	playerID	salary
1985	ATL	NL	barkele01	870000
1985	ATL	NL	bedrost01	550000
1985	ATL	NL	benedbr01	545000
1985	ATL	NL	campri01	633333
1985	ATL	NL	ceronri01	625000
1985	ATL	NL	chambch01	800000

See the following preview of select variables and first observations in Teams:

yearID	teamID	lgID	G	W	L
1871	BS1	NA	31	20	10
1871	CH1	NA	28	19	9
1871	CL1	NA	29	10	19
1871	FW1	NA	19	7	12
1871	NY2	NA	33	16	17
1871	PH1	NA	28	21	7

Lastly, Salaries data are adjusted for inflation at a constant annual increase rate of 3%. Observe:

Salaries <- Salaries %>% 
  mutate(salary.adj = salary * (1.03)^(max(yearID) - yearID))

head(Salaries) %>%
  pander()

yearID	teamID	lgID	playerID	salary	salary.adj
1985	ATL	NL	barkele01	870000	2175070
1985	ATL	NL	bedrost01	550000	1375044
1985	ATL	NL	benedbr01	545000	1362544
1985	ATL	NL	campri01	633333	1583383
1985	ATL	NL	ceronri01	625000	1562550
1985	ATL	NL	chambch01	800000	2000064

In order to calculate the cost per win, adjusted individual player salaries from the Salaries table must be aggregated by team and year in the Teams table.

Step 1: Aggregating Salaries

Using the Salaries table and grouping on variables yearID and teamID via package dplyr function group_by(), the following uses function summarize() to calculate and name the following summary variables for each year and team:

Total adjusted team budget using salary.adj and sum()
Total players using n()

# Your code here
  

  Salaries_summary <- Salaries %>%
  group_by(yearID, teamID) %>%
  summarise(team.budget = sum(salary.adj), count = n())


Salaries_summary %>% head() %>% pander()

yearID	teamID	team.budget	count
1985	ATL	37018690	22
1985	BAL	28902709	22
1985	BOS	27244776	25
1985	CAL	36070894	28
1985	CHA	24616236	21
1985	CHN	31758313	22

Step 2: Join Salaries & Teams

The following merges the Salaries summary data and full Teams tables using either base R function merge() or package dplyr functions left_join().

Variables teamID and yearID are unique merging keys
Example: dataset_1 %>% left_join(dataset_2)

# Your code here

Teams_selected <- Teams %>%
  select(yearID, teamID, lgID, G, W, L)

merged_data <- left_join(Teams_selected, Salaries_summary, 
                         by = c("yearID", "teamID"))

merged_data %>% head() %>% pander()

yearID	teamID	lgID	G	W	L	team.budget	count
1871	BS1	NA	31	20	10	NA	NA
1871	CH1	NA	28	19	9	NA	NA
1871	CL1	NA	29	10	19	NA	NA
1871	FW1	NA	19	7	12	NA	NA
1871	NY2	NA	33	16	17	NA	NA
1871	PH1	NA	28	21	7	NA	NA

Step 3: Calculate Cost per Win

The following uses the merged Salaries summary data and Teams data to create a new variable with function mutate() that calculates the total team salary cost per win, divided by $100,000.

Divide (/) total salaries from Step 1 by total wins (W)
Divide (/) that value by 100000

# Your code here

merged_data <- merged_data %>%
  mutate(cost.per.win = (team.budget / W) / 100000)

merged_data %>% head() %>% pander()

yearID	teamID	lgID	G	W	L	team.budget	count	cost.per.win
1871	BS1	NA	31	20	10	NA	NA	NA
1871	CH1	NA	28	19	9	NA	NA	NA
1871	CL1	NA	29	10	19	NA	NA	NA
1871	FW1	NA	19	7	12	NA	NA	NA
1871	NY2	NA	33	16	17	NA	NA	NA
1871	PH1	NA	28	21	7	NA	NA	NA

Step 4: Select, Filter, & Arrange

The following filters out teams in merged data with less than 25 players on their annual roster. It then arranges teams in ascending order by cost per win. Lastly, only variables are preserved to create the full version of the table preview.

Use filter() to remove observations with less than 25 players
Use arrange() to order observations by total cost per win
Use select() to preserve only essential variables

Objective: The following contains variables and initial values of the full table. Variable names may differ.

yearID	teamID	lgID	Rank	G	W	n	team.budget	cost.per.win
1988	PIT	NL	2	160	85	27	13724134	1.615
1986	TEX	AL	2	162	87	26	16367320	1.881
1987	ML4	AL	3	162	91	28	17186960	1.889

Solution: Create the full version of the table below.

# Your code here

final_table <- merged_data %>%
  filter(count >= 25) %>%
  arrange(cost.per.win) %>%
  select(yearID, teamID, lgID, G, W, count, team.budget, cost.per.win)

final_table %>% head() %>% pander()

yearID	teamID	lgID	G	W	count	team.budget	cost.per.win
1988	PIT	NL	160	85	27	13724134	1.615
1986	TEX	AL	162	87	26	16367320	1.881
1987	ML4	AL	162	91	28	17186960	1.889
1989	BAL	AL	162	87	26	18381537	2.113
1986	SEA	AL	162	67	26	14462380	2.159
1990	CHA	AL	162	94	31	20469286	2.178

Lab 06 - Dataset Joins

Kenaniah Dodson

October 02, 2024