The following work identifies the baseball teams that have achieved
the largest number of wins on the smallest budgets since 1985 per the
Lahman data package.
This analysis links two Lahman package tables,
viz. Salaries and Teams, to determine total
wins per season and the salary cost of each win earned. See the
following preview of the first observations in
Salaries:
| yearID | teamID | lgID | playerID | salary |
|---|---|---|---|---|
| 1985 | ATL | NL | barkele01 | 870000 |
| 1985 | ATL | NL | bedrost01 | 550000 |
| 1985 | ATL | NL | benedbr01 | 545000 |
| 1985 | ATL | NL | campri01 | 633333 |
| 1985 | ATL | NL | ceronri01 | 625000 |
| 1985 | ATL | NL | chambch01 | 800000 |
See the following preview of select variables and first observations
in Teams:
| yearID | teamID | lgID | G | W | L |
|---|---|---|---|---|---|
| 1871 | BS1 | NA | 31 | 20 | 10 |
| 1871 | CH1 | NA | 28 | 19 | 9 |
| 1871 | CL1 | NA | 29 | 10 | 19 |
| 1871 | FW1 | NA | 19 | 7 | 12 |
| 1871 | NY2 | NA | 33 | 16 | 17 |
| 1871 | PH1 | NA | 28 | 21 | 7 |
Lastly, Salaries data are adjusted for inflation at a
constant annual increase rate of 3%. Observe:
Salaries <- Salaries %>%
mutate(salary.adj = salary * (1.03)^(max(yearID) - yearID))
head(Salaries) %>%
pander()| yearID | teamID | lgID | playerID | salary | salary.adj |
|---|---|---|---|---|---|
| 1985 | ATL | NL | barkele01 | 870000 | 2175070 |
| 1985 | ATL | NL | bedrost01 | 550000 | 1375044 |
| 1985 | ATL | NL | benedbr01 | 545000 | 1362544 |
| 1985 | ATL | NL | campri01 | 633333 | 1583383 |
| 1985 | ATL | NL | ceronri01 | 625000 | 1562550 |
| 1985 | ATL | NL | chambch01 | 800000 | 2000064 |
In order to calculate the cost per win, adjusted individual player
salaries from the Salaries table must be aggregated by team
and year in the Teams table.
Using the Salaries table and grouping on variables
yearID and teamID via package
dplyr function group_by(), the following uses
function summarize() to calculate and name the
following summary variables for each year and team:
Total adjusted team budget using salary.adj and
sum()
Total players using n()
# Your code here
Salaries_summary <- Salaries %>%
group_by(yearID, teamID) %>%
summarise(team.budget = sum(salary.adj), count = n())
Salaries_summary %>% head() %>% pander()| yearID | teamID | team.budget | count |
|---|---|---|---|
| 1985 | ATL | 37018690 | 22 |
| 1985 | BAL | 28902709 | 22 |
| 1985 | BOS | 27244776 | 25 |
| 1985 | CAL | 36070894 | 28 |
| 1985 | CHA | 24616236 | 21 |
| 1985 | CHN | 31758313 | 22 |
The following merges the Salaries summary data and full
Teams tables using either base R function
merge() or package dplyr functions
left_join().
Variables teamID and yearID are unique
merging keys
Example:
dataset_1 %>% left_join(dataset_2)
# Your code here
Teams_selected <- Teams %>%
select(yearID, teamID, lgID, G, W, L)
merged_data <- left_join(Teams_selected, Salaries_summary,
by = c("yearID", "teamID"))
merged_data %>% head() %>% pander()| yearID | teamID | lgID | G | W | L | team.budget | count |
|---|---|---|---|---|---|---|---|
| 1871 | BS1 | NA | 31 | 20 | 10 | NA | NA |
| 1871 | CH1 | NA | 28 | 19 | 9 | NA | NA |
| 1871 | CL1 | NA | 29 | 10 | 19 | NA | NA |
| 1871 | FW1 | NA | 19 | 7 | 12 | NA | NA |
| 1871 | NY2 | NA | 33 | 16 | 17 | NA | NA |
| 1871 | PH1 | NA | 28 | 21 | 7 | NA | NA |
The following uses the merged Salaries summary data and
Teams data to create a new variable with function
mutate() that calculates the total team salary cost per
win, divided by $100,000.
Divide (/) total salaries from Step
1 by total wins (W)
Divide (/) that value by
100000
# Your code here
merged_data <- merged_data %>%
mutate(cost.per.win = (team.budget / W) / 100000)
merged_data %>% head() %>% pander()| yearID | teamID | lgID | G | W | L | team.budget | count | cost.per.win |
|---|---|---|---|---|---|---|---|---|
| 1871 | BS1 | NA | 31 | 20 | 10 | NA | NA | NA |
| 1871 | CH1 | NA | 28 | 19 | 9 | NA | NA | NA |
| 1871 | CL1 | NA | 29 | 10 | 19 | NA | NA | NA |
| 1871 | FW1 | NA | 19 | 7 | 12 | NA | NA | NA |
| 1871 | NY2 | NA | 33 | 16 | 17 | NA | NA | NA |
| 1871 | PH1 | NA | 28 | 21 | 7 | NA | NA | NA |
The following filters out teams in merged data with less than 25 players on their annual roster. It then arranges teams in ascending order by cost per win. Lastly, only variables are preserved to create the full version of the table preview.
Use filter() to remove observations with less than
25 players
Use arrange() to order observations by total cost
per win
Use select() to preserve only essential
variables
Objective: The following contains variables and initial values of the full table. Variable names may differ.
| yearID | teamID | lgID | Rank | G | W | n | team.budget | cost.per.win |
|---|---|---|---|---|---|---|---|---|
| 1988 | PIT | NL | 2 | 160 | 85 | 27 | 13724134 | 1.615 |
| 1986 | TEX | AL | 2 | 162 | 87 | 26 | 16367320 | 1.881 |
| 1987 | ML4 | AL | 3 | 162 | 91 | 28 | 17186960 | 1.889 |
Solution: Create the full version of the table below.
# Your code here
final_table <- merged_data %>%
filter(count >= 25) %>%
arrange(cost.per.win) %>%
select(yearID, teamID, lgID, G, W, count, team.budget, cost.per.win)
final_table %>% head() %>% pander()| yearID | teamID | lgID | G | W | count | team.budget | cost.per.win |
|---|---|---|---|---|---|---|---|
| 1988 | PIT | NL | 160 | 85 | 27 | 13724134 | 1.615 |
| 1986 | TEX | AL | 162 | 87 | 26 | 16367320 | 1.881 |
| 1987 | ML4 | AL | 162 | 91 | 28 | 17186960 | 1.889 |
| 1989 | BAL | AL | 162 | 87 | 26 | 18381537 | 2.113 |
| 1986 | SEA | AL | 162 | 67 | 26 | 14462380 | 2.159 |
| 1990 | CHA | AL | 162 | 94 | 31 | 20469286 | 2.178 |