I want your feedback to make the book better for you and other readers. If you find typos, errors, or places where the text may be improved, please let me know.
The best ways to provide feedback are by GitHub or hypothes.is annotations.
4 Download and Process NFL Football Data
4.1 Load Packages
library("ffanalytics") # to install: install.packages("remotes"); remotes::install_github("FantasyFootballAnalytics/ffanalytics")
library("petersenlab") # to install: install.packages("remotes"); remotes::install_github("DevPsyLab/petersenlab")
library("nflreadr")
library("nflfastR")
library("nfl4th")
library("nflplotR")
library("progressr")
library("lubridate")
library("tidyverse")
4.2 Data Dictionaries of NFL Data
Data Dictionaries are metadata that describe the meaning of the variables in a dataset. You can find Data Dictionaries for the various National Football League (NFL) datasets at the following link: https://nflreadr.nflverse.com/articles/index.html.
4.3 Types of NFL Data
Below, we provide examples of how to download various types of NFL data using the nflreadr package. For additional resources, Congelio (2023) provides a helpful introductory text for working with NFL data in R. We save each data file after downloading it so that we can use the data in subsequent chapters. If you have difficulty downloading the data files using the nflreadr package, the saved data files are also publicly available on the Open Science Framework: https://osf.io/z6pg4.
This chapter extensively uses merging to process the data for later use. See Section 3.22 for a reminder of how to perform merging, the types of merges, and what you can expect when you merge data objects with different formats. Guidance for how to merge the various NFL-related data files is provided at the following link: https://github.com/nflverse/nfldata/blob/master/DATASETS.md.
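As a brief reminder of what a merge by a shared key looks like, here is a minimal sketch using toy data. The objects players_toy and weekly_toy and their values are hypothetical and are not part of the NFL data files; only the gsis_id key name mirrors the datasets used in this chapter.

```r
library("dplyr")

# Toy table in player form: one row per gsis_id (hypothetical values)
players_toy <- data.frame(
  gsis_id = c("00-001", "00-002", "00-003"),
  display_name = c("Player A", "Player B", "Player C")
)

# Toy table in player-season-week form (hypothetical values)
weekly_toy <- data.frame(
  gsis_id = c("00-001", "00-001", "00-002", "00-004"),
  season = c(2023, 2023, 2023, 2023),
  week = c(1, 2, 1, 1),
  fantasy_points = c(12.4, 8.1, 20.3, 5.0)
)

# Left join: keep every weekly row and attach player info where gsis_id matches
weekly_named <- left_join(weekly_toy, players_toy, by = "gsis_id")

# Anti join: weekly rows whose gsis_id has no match in the player table
unmatched <- anti_join(weekly_toy, players_toy, by = "gsis_id")

nrow(weekly_named) # 4 rows; display_name is NA for "00-004"
nrow(unmatched)    # 1 row
```

Checking the unmatched rows after a merge is a quick way to see which records would end up with missing values from the other table.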
# Convert missing values to NA
nfl_players[nfl_players == ""] <- NA

# Drop players with missing values for gsis_id
nfl_players <- nfl_players %>%
  filter(!is.na(gsis_id))
The nfl_rosters object is in player-season-team form. That is, each row should be uniquely identified by the combination of gsis_id, season, and team. Let’s rearrange the data accordingly:
# Drop players with missing values for gsis_id
nfl_rosters <- nfl_rosters %>%
  filter(!is.na(gsis_id))

# Fill in missing values for a player in their duplicate instances, and then keep only the first of the duplicate instances
nfl_rosters <- nfl_rosters %>%
  group_by(gsis_id, season, team) %>%
  fill(names(.), .direction = "downup") %>%
  slice_head(n = 1) %>%
  ungroup()
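To see what the fill-then-keep-first pattern does, here is a minimal sketch on a toy roster. The rosters_toy object, its jersey_number and college columns, and all values are hypothetical. Unlike the code above, the sketch names the columns to fill explicitly rather than using names(.).

```r
library("dplyr")
library("tidyr")

# Toy roster with a duplicate player-season-team instance (hypothetical values);
# each duplicate row holds a different piece of the player's information
rosters_toy <- data.frame(
  gsis_id = c("00-001", "00-001", "00-002"),
  season = c(2023, 2023, 2023),
  team = c("AAA", "AAA", "BBB"),
  jersey_number = c(NA, 15, 17),
  college = c("College X", NA, "College Y")
)

rosters_dedup <- rosters_toy %>%
  group_by(gsis_id, season, team) %>%
  # Within each group, fill missing values downward and then upward
  fill(jersey_number, college, .direction = "downup") %>%
  # Keep only the first row of each group
  slice_head(n = 1) %>%
  ungroup()

rosters_dedup
```

Because the missing values are filled before the duplicates are dropped, the retained row for player "00-001" carries both the jersey number and the college, even though neither original row had both.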
Let’s check again for duplicate player-season-team instances:
The nfl_rosters_weekly object is in player-season-week form. That is, each row should be uniquely identified by the combination of gsis_id, season, and week. Let’s rearrange the data accordingly:
# Drop players with missing values for gsis_id
nfl_rosters_weekly <- nfl_rosters_weekly %>%
  filter(!is.na(gsis_id))

# Fill in missing values for a player in their duplicate instances, and then keep only the first of the duplicate instances
nfl_rosters_weekly <- nfl_rosters_weekly %>%
  group_by(gsis_id, season, week) %>%
  fill(names(.), .direction = "downup") %>%
  slice_head(n = 1) %>%
  ungroup()
Let’s check again for duplicate player-season-week instances:
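A duplicate check of this kind can be sketched as follows. This is an illustration on toy data, not the chapter's original output: the rosters_weekly_toy object and its values are hypothetical, and the pattern (group by the identifying variables, then keep groups with more than one row) is the same one used later in this chapter for the advanced statistics.

```r
library("dplyr")

# Toy player-season-week data with one duplicated instance (hypothetical values)
rosters_weekly_toy <- data.frame(
  gsis_id = c("00-001", "00-001", "00-001", "00-002"),
  season = c(2023, 2023, 2023, 2023),
  week = c(1, 1, 2, 1),
  status = c("ACT", "ACT", "ACT", "INA")
)

# Rows that share the same gsis_id, season, and week with another row
duplicates_before <- rosters_weekly_toy %>%
  group_by(gsis_id, season, week) %>%
  filter(n() > 1) %>%
  ungroup()

# Keep the first row of each player-season-week instance, then check again
rosters_weekly_dedup <- rosters_weekly_toy %>%
  distinct(gsis_id, season, week, .keep_all = TRUE)

duplicates_after <- rosters_weekly_dedup %>%
  group_by(gsis_id, season, week) %>%
  filter(n() > 1) %>%
  ungroup()

nrow(duplicates_before) # 2
nrow(duplicates_after)  # 0
```

An empty result from the second check indicates that each row is uniquely identified by the chosen combination of variables.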
The nfl_schedules object is in game form and in season-week (and -game type) form. That is, each row should be uniquely identified by game_id. Each row should also be uniquely identified by the combination of season and week (and game type).
The nfl_combine object is in player form. That is, each row should be uniquely identified by the player’s id. However, there is no gsis_id variable to merge it easily with other datasets. Some of the players have other id variables, including pfr_id and cfb_id. Let’s rearrange the data accordingly:
# Convert missing values to NA
nfl_draftPicks[nfl_draftPicks == ""] <- NA

# Drop players with missing values for gsis_id
nfl_draftPicks <- nfl_draftPicks %>%
  filter(!is.na(gsis_id))
The nfl_depthCharts object is in player-season-week-position form. That is, each row should be uniquely identified by the combination of gsis_id, season, week, and depth_position. Let’s rearrange the data accordingly:
# Drop players with missing values for gsis_id
nfl_depthCharts <- nfl_depthCharts %>%
  filter(!is.na(gsis_id))

# Fill in missing values for a player in their duplicate instances, and then keep only the first of the duplicate instances
nfl_depthCharts <- nfl_depthCharts %>%
  group_by(gsis_id, season, week, depth_position) %>%
  fill(names(.), .direction = "downup") %>%
  slice_head(n = 1) %>%
  ungroup()
Let’s check again for duplicate player-season-week-position instances:
To download play-by-play data from prior weeks and seasons, we can use the load_pbp() function of the nflreadr package. We add a progress bar using the with_progress() function from the progressr package because it takes a while to run. A Data Dictionary for the play-by-play data is located at the following link: https://nflreadr.nflverse.com/articles/dictionary_pbp.html
The nfl_pbp object is in game-drive-play form. That is, each row should be uniquely identified by the combination of game_id, fixed_drive, and play_id. Let’s rearrange the data accordingly:
The nfl_4thdown object is in game-drive-play form. That is, each row should be uniquely identified by the combination of game_id, drive, and play_id. Let’s rearrange the data accordingly:
The nfl_participation object is in game-drive-play form. That is, each row should be uniquely identified by the combination of nflverse_game_id, drive, and play_id. Let’s rearrange the data accordingly:
The nfl_actualStats_weekly objects are in player-season-week form. That is, each row should be uniquely identified by the combination of player_id, season, and week. Let’s rearrange the data accordingly:
The nfl_injuries object is in player-season-week form. That is, each row should be uniquely identified by the combination of gsis_id, season, and week. Let’s rearrange the data accordingly:
The nfl_snapCounts object is in game-player form. That is, each row should be uniquely identified by the combination of game_id and pfr_player_id. Let’s rearrange the data accordingly:
The nfl_espnQBR_seasonal object is in player-season-season type form, where season type refers to regular season versus postseason. That is, each row should be uniquely identified by the combination of player_id, season, and season_type. Let’s rearrange the data accordingly:
The nfl_espnQBR_weekly object is in both game-player form and player-season-season type-week form, where season type refers to regular season versus postseason. That is, each row should be uniquely identified by the combination of gsis_id, season, and week or by the combination of player_id, season, season_type, and week_num. Let’s rearrange the data accordingly:
The nfl_nextGenStats_weekly object is in player-season-season type-week form, where season type refers to regular season versus postseason. That is, each row should be uniquely identified by the combination of player_gsis_id, season, season_type, and week. Let’s rearrange the data accordingly:
The nfl_advancedStatsPFR_seasonalByTeam object is in player-season-team form. That is, each row should be uniquely identified by the combination of pfr_id, season, and team. Let’s rearrange the data accordingly:
Aggregate variables within each pass/rush/rec/def object by team for seasonal data (so seasonal data are in player-season form, not player-season-team form). Depending on the variable, aggregation was performed using a sum, weighted mean (weighted by the number of games played for each team), or a recomputed percentage.
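The three kinds of aggregation can be sketched on toy data as follows. This is an illustration under stated assumptions, not the chapter's actual processing code: the passing_by_team object, the column names g, pass_attempts, drops, and pocket_time, and all values are hypothetical.

```r
library("dplyr")

# Toy player-season-team data: player "PlayA00" played for two teams in one season
passing_by_team <- data.frame(
  pfr_id = c("PlayA00", "PlayA00", "PlayB00"),
  season = c(2023, 2023, 2023),
  team = c("AAA", "BBB", "CCC"),
  g = c(10, 6, 16),                # games played for each team
  pass_attempts = c(300, 200, 500),
  drops = c(15, 6, 20),
  pocket_time = c(2.4, 2.8, 2.5)   # a per-team average
)

passing_seasonal <- passing_by_team %>%
  group_by(pfr_id, season) %>%
  summarise(
    # Weighted mean (weighted by games played); computed before g is overwritten
    pocket_time = weighted.mean(pocket_time, w = g),
    # Recomputed percentage: re-derive from summed counts, not an average of percentages
    drop_pct = 100 * sum(drops) / sum(pass_attempts),
    # Sums for count variables
    g = sum(g),
    pass_attempts = sum(pass_attempts),
    drops = sum(drops),
    .groups = "drop"
  )

passing_seasonal
```

Within summarise(), later expressions see columns created by earlier ones, so the weighted mean and the percentage are computed before the count columns are replaced by their sums.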
The nfl_advancedStatsPFR_weekly object is in both game-player form and player-season-week form. That is, each row should be uniquely identified by the combination of pfr_player_id, season, and week or by the combination of pfr_player_id, season, game_type, and week. Let’s rearrange the data accordingly:
# Merge seasonal data with the player IDs
nfl_advancedStatsPFR_seasonal <- left_join(
  nfl_advancedStatsPFR_seasonal,
  nfl_playerIDs %>%
    filter(!is.na(pfr_id)) %>%
    filter(gsis_id != "00-0039137") %>% # drop DL Byron Young, keep OLB Byron Young
    select(pfr_id, gsis_id) %>%
    unique(),
  by = "pfr_id")

# Merge weekly data with the player IDs
nfl_advancedStatsPFR_weekly <- left_join(
  nfl_advancedStatsPFR_weekly,
  nfl_playerIDs %>%
    filter(!is.na(pfr_id)) %>%
    filter(gsis_id != "00-0039137") %>% # drop DL Byron Young, keep OLB Byron Young
    select(pfr_id, gsis_id) %>%
    unique(),
  by = "pfr_id")

# Remove distinct players who were given the same `pfr_id` (to allow merging)
nfl_advancedStatsPFR_seasonal$gsis_id[which(nfl_advancedStatsPFR_seasonal$gsis_id == "00-0035665" & nfl_advancedStatsPFR_seasonal$pos %in% c("LB","LILB","RILB"))] <- NA # drop LB David Young, keep DB David Young
#nfl_advancedStatsPFR_weekly$gsis_id[which(nfl_advancedStatsPFR_weekly$gsis_id == "00-0035665" & nfl_advancedStatsPFR_weekly$team %in% c("TEN","MIA"))] <- NA # drop LB David Young, keep DB David Young

nfl_advancedStatsPFR_seasonal$gsis_id[which(nfl_advancedStatsPFR_seasonal$gsis_id == "00-0035292" & nfl_advancedStatsPFR_seasonal$pos %in% c("LB","LILB","RILB"))] <- NA # drop LB David Young, keep DB David Young
nfl_advancedStatsPFR_weekly$gsis_id[which(nfl_advancedStatsPFR_weekly$gsis_id == "00-0035292" & nfl_advancedStatsPFR_weekly$team %in% c("TEN","MIA"))] <- NA # drop LB David Young, keep DB David Young

nfl_advancedStatsPFR_seasonal$gsis_id[which(nfl_advancedStatsPFR_seasonal$gsis_id == "00-0033894" & nfl_advancedStatsPFR_seasonal$pos == "DB")] <- NA # drop S Marcus Williams
#nfl_advancedStatsPFR_weekly$gsis_id[which(nfl_advancedStatsPFR_weekly$gsis_id == "00-0033894" & nfl_advancedStatsPFR_weekly$pos == "DB")] <- NA # drop S Marcus Williams

nfl_advancedStatsPFR_seasonal$gsis_id[which(nfl_advancedStatsPFR_seasonal$gsis_id == "00-0038407" & nfl_advancedStatsPFR_seasonal$pos == "DB")] <- NA # drop DB Jaylon Jones, keep CB Jaylon Jones
#nfl_advancedStatsPFR_weekly$gsis_id[which(nfl_advancedStatsPFR_weekly$gsis_id == "00-0038407" & nfl_advancedStatsPFR_weekly$pos == "DB")] <- NA # drop DB Jaylon Jones, keep CB Jaylon Jones

nfl_advancedStatsPFR_seasonal$gsis_id[which(nfl_advancedStatsPFR_seasonal$gsis_id == "00-0037106" & nfl_advancedStatsPFR_seasonal$pos == "DB")] <- NA # drop DB Jaylon Jones, keep CB Jaylon Jones
#nfl_advancedStatsPFR_weekly$gsis_id[which(nfl_advancedStatsPFR_weekly$gsis_id == "00-0037106" & nfl_advancedStatsPFR_weekly$pos == "DB")] <- NA # drop DB Jaylon Jones, keep CB Jaylon Jones

nfl_advancedStatsPFR_seasonal$gsis_id[which(nfl_advancedStatsPFR_seasonal$gsis_id == "00-0038549" & nfl_advancedStatsPFR_seasonal$pos == "WR")] <- NA # drop WR DJ Turner, keep CB DJ Turner
#nfl_advancedStatsPFR_weekly$gsis_id[which(nfl_advancedStatsPFR_weekly$gsis_id == "00-0038549" & nfl_advancedStatsPFR_weekly$pos == "WR")] <- NA # drop WR DJ Turner, keep CB DJ Turner
Now, each row of the nfl_advancedStatsPFR_seasonal object should be uniquely identified by the combination of gsis_id (or pfr_id) and season. Each row of the nfl_advancedStatsPFR_weekly object should be uniquely identified by the combination of gsis_id (or pfr_id), season, and week or by the combination of gsis_id (or pfr_id), season, game_type, and week.
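To illustrate why distinct players who share a pfr_id have to be handled before or after merging, here is a minimal sketch on toy data. The stats_toy and ids_toy objects and all of their IDs are hypothetical; only the pfr_id and gsis_id column names mirror the data used above.

```r
library("dplyr")

# Toy seasonal stats: one row per pfr_id and season (hypothetical values)
stats_toy <- data.frame(
  pfr_id = c("PlayA00", "PlayB00"),
  season = c(2023, 2023),
  sacks = c(2, 5)
)

# Toy ID crosswalk in which two distinct players were given the same pfr_id
ids_toy <- data.frame(
  pfr_id = c("PlayA00", "PlayA00", "PlayB00"),
  gsis_id = c("00-111", "00-222", "00-333")
)

# Merging against the raw crosswalk duplicates the stats row for "PlayA00"
merged_raw <- left_join(stats_toy, ids_toy, by = "pfr_id")

# Dropping the unwanted player from the crosswalk restores one row per player-season
ids_clean <- ids_toy %>%
  filter(gsis_id != "00-222")

merged_clean <- left_join(stats_toy, ids_clean, by = "pfr_id")

nrow(merged_raw)   # 3
nrow(merged_clean) # 2
```

The extra row in merged_raw would double-count that player's statistics in any later aggregation, which is why the ambiguous IDs are resolved by hand.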
Let’s check again for duplicate game-player or player-season-week instances:
# Based on gsis_id
nfl_advancedStatsPFR_seasonal %>%
  select(gsis_id, everything()) %>%
  filter(!is.na(gsis_id)) %>%
  group_by(gsis_id, season) %>%
  filter(n() > 1) %>%
  head()
The nfl_playerContracts object is in player-year-team-value form. That is, each row should be uniquely identified by the combination of otc_id, year_signed, team, and value. Let’s rearrange the data accordingly:
The nfl_ftnCharting object is in game-play form. That is, each row should be uniquely identified by the combination of nflverse_game_id and play_id. Let’s rearrange the data accordingly:
The nfl_rankings_draft object is in player-page_type form. That is, each row should be uniquely identified by the player’s id. Let’s rearrange the data accordingly:
The nfl_rankings_weekly object is in player-page form. That is, each row should be uniquely identified by fantasypros_id and page. Let’s rearrange the data accordingly:
The nfl_expectedFantasyPoints_weekly object is in game-player form and in player-season-week form. That is, each row should be uniquely identified by the combination of game_id and player_id. Each row should also be uniquely identified by the combination of player_id, season, and week. Let’s rearrange the data accordingly:
The nfl_expectedFantasyPoints_pbp object is in game-drive-play form. That is, each row should be uniquely identified by the combination of game_id, fixed_drive, and play_id. Let’s rearrange the data accordingly:
The players_projections_seasonal_raw and players_projections_weekly_raw objects are in player-position-projection source form. That is, each row should be uniquely identified by the combination of id, pos, and data_src. Each row should also be uniquely identified by the combination of player, pos, and data_src.
Let’s check for duplicate player-position-projection source instances:
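When a dataset has more than one candidate set of identifying variables, a small helper can test each set in turn. The sketch below is an illustration on toy data: the is_unique_key() helper, the projections_toy object, the pass_yds column, and all values are hypothetical and are not part of the ffanalytics package.

```r
# Hypothetical helper: TRUE if no two rows share the same values on `keys`
is_unique_key <- function(data, keys) {
  anyDuplicated(data[, keys, drop = FALSE]) == 0
}

# Toy projections in player-position-projection source form (hypothetical values)
projections_toy <- data.frame(
  id = c("101", "101", "102", "102"),
  player = c("Player A", "Player A", "Player B", "Player B"),
  pos = c("QB", "QB", "RB", "RB"),
  data_src = c("SourceX", "SourceY", "SourceX", "SourceY"),
  pass_yds = c(4100, 3950, 0, 0)
)

is_unique_key(projections_toy, c("id", "pos", "data_src"))     # TRUE
is_unique_key(projections_toy, c("player", "pos", "data_src")) # TRUE
is_unique_key(projections_toy, c("id", "pos"))                 # FALSE
```

The last call shows that dropping the projection source from the key leaves duplicate rows, which confirms that the source is needed to identify a row.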
The players_projectedPoints_seasonal, players_projectedStatsAverage_seasonal, players_projectedPointsAverage_seasonal, players_projectedPoints_weekly, players_projectedStatsAverage_weekly, and players_projectedPointsAverage_weekly objects are in player-average type-position form. That is, each row should be uniquely identified by the combination of id, avg_type, and pos (or position).
Let’s check for duplicate player-average type-position instances:
In addition to week-by-week actual player statistics, we can also compute historical actual player statistics as a function of different timeframes, including season-by-season and career statistics.
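Conceptually, moving from weekly to seasonal to career statistics is a grouped aggregation over progressively coarser identifying variables. The sketch below illustrates the idea on toy data; it is not the nflfastR implementation, and the weekly_toy object, its passing_yards column, and all values are hypothetical.

```r
library("dplyr")

# Toy weekly statistics in player-season-week form (hypothetical values)
weekly_toy <- data.frame(
  player_id = c("00-001", "00-001", "00-001", "00-002", "00-002"),
  season = c(2022, 2022, 2023, 2023, 2023),
  week = c(1, 2, 1, 1, 2),
  passing_yards = c(250, 310, 275, 180, 220)
)

# Season-by-season statistics: player-season form
seasonal_toy <- weekly_toy %>%
  group_by(player_id, season) %>%
  summarise(
    games = n(),
    passing_yards = sum(passing_yards),
    .groups = "drop"
  )

# Career statistics: player form
career_toy <- seasonal_toy %>%
  group_by(player_id) %>%
  summarise(
    seasons = n(),
    games = sum(games),
    passing_yards = sum(passing_yards),
    .groups = "drop"
  )

career_toy
```

Summing works for count statistics such as yards; rate statistics would instead need to be recomputed from their summed components.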
4.4.1.1 Career Statistics
First, we can compute the players’ career statistics using the calculate_player_stats(), calculate_player_stats_def(), and calculate_player_stats_kicking() functions from the nflfastR package for offensive players, defensive players, and kickers, respectively.
The nfl_actualStats_career objects are in player form. That is, each row should be uniquely identified by player_id. Let’s rearrange the data accordingly:
The nfl_actualStats_seasonal objects are in player-season form. That is, each row should be uniquely identified by the combination of player_id and season. Let’s rearrange the data accordingly:
We already loaded players’ week-by-week statistics above. Nevertheless, we could compute players’ weekly statistics from the play-by-play data using the following syntax:
The player_stats_weekly objects are in player-season-week form. That is, each row should be uniquely identified by the combination of player_id, season, and week. Let’s rearrange the data accordingly:
The player_stats_seasonal objects are in player-season form. That is, each row should be uniquely identified by the combination of player_id and season. Let’s rearrange the data accordingly: