4 Download and Process NFL Football Data
4.1 Load Packages
library("ffanalytics") # to install: install.packages("remotes"); remotes::install_github("FantasyFootballAnalytics/ffanalytics")
library("petersenlab") # to install: install.packages("remotes"); remotes::install_github("DevPsyLab/petersenlab")
library("nflreadr")
library("nflfastR")
library("nfl4th")
library("nflplotR")
library("progressr")
library("lubridate")
library("tidyverse")
4.2 Data Dictionaries of NFL Data
Data Dictionaries are metadata that describe the meaning of the variables in a dataset. Ho & Carl (2025a) provide Data Dictionaries for the various National Football League (NFL) datasets at the following link: https://nflreadr.nflverse.com/articles/index.html.
4.3 Types of NFL Data
Below, we provide examples for how to download various types of NFL data using the nflreadr (Ho & Carl, 2024) and nflfastR (Carl & Baldwin, 2024) packages. For additional resources, Congelio (2023) provides a helpful introductory text for working with NFL data in R. We save each data file after downloading it, so we can use the data in subsequent chapters. If you have difficulty downloading the data files using the nflreadr (Ho & Carl, 2024) or nflfastR (Carl & Baldwin, 2024) packages, we also saved the data files so they are publicly available on the Open Science Framework: https://osf.io/z6pg4.
This chapter extensively uses merging to process the data for later use. See Section 3.22 for a reminder of how to perform merging, the types of merges, and what you can expect when you merge data objects with different formats. Guidance for how to merge the various NFL-related data files is provided by Sharpe (2020a): https://github.com/nflverse/nfldata/blob/master/DATASETS.md.
In this case, I arrange the data by player using display_name. You could arrange by gsis_id (or any other variable, for that matter), but gsis_id is fairly unintelligible by itself. For instance, here are the first six players by ID according to gsis_id:
Note how it is unclear who these players are until you combine this column with other, more relevant information. Thus, I prefer to sort by a variable that is more interpretable, such as a player’s name. Here are the first six players by name according to display_name:
Note how it is clearer, at a glance and without additional information, who these players are.
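As a minimal sketch of the two sort orders, the toy example below uses a small made-up table (the IDs and names are hypothetical, not real players) and the arrange() function from dplyr:

```r
library(dplyr)

# Toy player table; the IDs and names are made up for illustration
toy_players <- tibble(
  gsis_id = c("00-0000003", "00-0000001", "00-0000002"),
  display_name = c("Alex Example", "Chris Sample", "Blake Placeholder")
)

# Sort by the opaque player ID
by_id <- toy_players %>%
  arrange(gsis_id)

# Sort by the human-readable player name
by_name <- toy_players %>%
  arrange(display_name)

by_id
by_name
```

Both versions contain the same rows; only the order, and thus how easy the table is to scan, differs.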
In data analysis, the variable(s) used to sort the dataframe are chosen primarily for aesthetic or usability purposes. Many data analysis approaches do not depend on the order of the rows in the data. However, the order of the rows may influence some data wrangling operations. Thus, before performing such operations, it is prudent to re-sort the data into the desired order.
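One example of an order-dependent operation is computing a lagged value. The sketch below uses hypothetical toy data (not one of the NFL data objects) to show how lag() from dplyr gives different answers depending on whether the rows were sorted by week first:

```r
library(dplyr)

# Toy weekly data for one hypothetical player, deliberately out of order
toy_weekly <- tibble(
  player_id = c("A", "A", "A"),
  week = c(3, 1, 2),
  points = c(30, 10, 20)
)

# Without sorting, lag() takes the previous ROW, not the previous week
unsorted <- toy_weekly %>%
  group_by(player_id) %>%
  mutate(points_prev_week = lag(points)) %>%
  ungroup()

# After sorting by player and week, the previous row is the previous week
sorted <- toy_weekly %>%
  arrange(player_id, week) %>%
  group_by(player_id) %>%
  mutate(points_prev_week = lag(points)) %>%
  ungroup()

unsorted
sorted
```

In the unsorted version, week 1 is paired with the points from week 3, which is presumably not what was intended.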
Regardless of which variable(s) are used to sort the dataframe, it is important to know which variable(s) uniquely identify the rows because that will influence approaches to merge the data with other dataframes. For instance, it would be preferable to merge two data objects based on the player ID (gsis_id) rather than the player name (display_name), because multiple players share the same name. When two players have the same name (despite having different player IDs), the merge operation cannot tell which player from the first dataframe goes with which player from the second dataframe.
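The toy example below sketches this problem with two hypothetical players who share a name (the IDs, names, and values are made up). Merging by name matches every "Sam Example" in the first table with every "Sam Example" in the second, whereas merging by ID keeps one row per player. Recent versions of dplyr warn about the many-to-many match, so the name-based join is wrapped in suppressWarnings() here only to keep the demonstration quiet:

```r
library(dplyr)

# Two different (hypothetical) players who share the same name
toy_info <- tibble(
  gsis_id = c("00-0000001", "00-0000002"),
  display_name = c("Sam Example", "Sam Example"),
  position = c("WR", "LB")
)

toy_stats <- tibble(
  gsis_id = c("00-0000001", "00-0000002"),
  display_name = c("Sam Example", "Sam Example"),
  games = c(16, 12)
)

# Merge by name: each player is matched to BOTH stat rows (2 x 2 = 4 rows)
merged_by_name <- suppressWarnings(
  left_join(toy_info, toy_stats, by = "display_name")
)

# Merge by ID: each player is matched to exactly one stat row (2 rows)
merged_by_id <- left_join(
  toy_info,
  toy_stats %>% select(-display_name),
  by = "gsis_id"
)

nrow(merged_by_name)
nrow(merged_by_id)
```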
# Convert missing values to NA
nfl_players[nfl_players == ""] <- NA

# Drop players with missing values for gsis_id
nfl_players <- nfl_players %>%
  filter(!is.na(gsis_id))
The nfl_rosters object is in player-season-team form. That is, each row should be uniquely identified by the combination of gsis_id, season, and team. Let’s rearrange the data accordingly:
# Drop players with missing values for gsis_id
nfl_rosters <- nfl_rosters %>%
  filter(!is.na(gsis_id))

# Fill in missing values for a player in their duplicate instances, and then keep only the first of the duplicate instances
nfl_rosters <- nfl_rosters %>%
  group_by(gsis_id, season, team) %>%
  fill(names(.), .direction = "downup") %>%
  slice_head(n = 1) %>%
  ungroup()
Let’s check again for duplicate player-season-team instances:
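A check of this kind can be sketched on toy data. The example below is not the nfl_rosters object; the rows and the jersey_number and college columns are made up for illustration. It applies the same fill-then-keep-first strategy and then counts how many rows still share a player-season-team key:

```r
library(dplyr)
library(tidyr)

# Toy roster with a duplicated player-season-team row; values are hypothetical
toy_rosters <- tibble(
  gsis_id = c("00-0000001", "00-0000001", "00-0000002"),
  season = c(2023, 2023, 2023),
  team = c("KC", "KC", "BUF"),
  jersey_number = c(NA, 15, 17),
  college = c("Example State", NA, "Sample Tech")
)

# Fill missing values within each key, then keep the first row per key
toy_dedup <- toy_rosters %>%
  group_by(gsis_id, season, team) %>%
  fill(jersey_number, college, .direction = "downup") %>%
  slice_head(n = 1) %>%
  ungroup()

# Any rows returned here would still be duplicated on the key
toy_duplicates <- toy_dedup %>%
  group_by(gsis_id, season, team) %>%
  filter(n() > 1) %>%
  ungroup()

toy_dedup
nrow(toy_duplicates)
```

Filling before slicing means the retained row combines the non-missing values from all of the duplicate rows, rather than whatever happened to be in the first one.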
The nfl_rosters_weekly object is in player-season-week form. That is, each row should be uniquely identified by the combination of gsis_id, season, and week. Let’s rearrange the data accordingly:
# Drop players with missing values for gsis_id
nfl_rosters_weekly <- nfl_rosters_weekly %>%
  filter(!is.na(gsis_id))

# Fill in missing values for a player in their duplicate instances, and then keep only the first of the duplicate instances
nfl_rosters_weekly <- nfl_rosters_weekly %>%
  group_by(gsis_id, season, week) %>%
  fill(names(.), .direction = "downup") %>%
  slice_head(n = 1) %>%
  ungroup()
Let’s check again for duplicate player-season-week instances:
The nfl_schedules object is in game form and in season-week (and -game type) form. That is, each row should be uniquely identified by game_id. Each row should also be uniquely identified by the combination of season and week (and game type).
The nfl_combine object is in player form. That is, each row should be uniquely identified by the player’s id. However, there is no gsis_id variable to merge it easily with other datasets. Some of the players have other id variables, including pfr_id and cfb_id. Let’s rearrange the data accordingly:
# Convert missing values to NA
nfl_draftPicks[nfl_draftPicks == ""] <- NA

# Drop players with missing values for gsis_id
nfl_draftPicks <- nfl_draftPicks %>%
  filter(!is.na(gsis_id))
The nfl_depthCharts object is in player-season-week-position form. That is, each row should be uniquely identified by the combination of gsis_id, season, week, and depth_position. Let’s rearrange the data accordingly:
# Drop players with missing values for gsis_id
nfl_depthCharts <- nfl_depthCharts %>%
  filter(!is.na(gsis_id))

# Fill in missing values for a player in their duplicate instances, and then keep only the first of the duplicate instances
nfl_depthCharts <- nfl_depthCharts %>%
  group_by(gsis_id, season, week, depth_position) %>%
  fill(names(.), .direction = "downup") %>%
  slice_head(n = 1) %>%
  ungroup()
Let’s check again for duplicate player-season-week-position instances:
To download play-by-play data from prior weeks and seasons, we can use the load_pbp() function of the nflreadr package (Ho & Carl, 2024). We add a progress bar using the with_progress() function from the progressr package (Bengtsson, 2024) because it takes a while to run. Ho & Carl (2025n) provide a Data Dictionary for the play-by-play data at the following link: https://nflreadr.nflverse.com/articles/dictionary_pbp.html
The nfl_pbp object is in game-drive-play form. That is, each row should be uniquely identified by the combination of game_id, fixed_drive, and play_id. Let’s rearrange the data accordingly:
The nfl_4thdown object is in game-drive-play form. That is, each row should be uniquely identified by the combination of game_id, drive, and play_id. Let’s rearrange the data accordingly:
nfl_participation_raw <- progressr::with_progress(
  nflreadr::load_participation(
    seasons = 2016:2023, # participation data are no longer available after 2023
    include_pbp = TRUE))
The nfl_participation object is in game-drive-play form. That is, each row should be uniquely identified by the combination of nflverse_game_id, drive, and play_id. Let’s rearrange the data accordingly:
The nfl_actualStats_weekly objects are in player-season-week form. That is, each row should be uniquely identified by the combination of player_id, season, and week. Let’s rearrange the data accordingly:
The nfl_injuries object is in player-season-week form. That is, each row should be uniquely identified by the combination of gsis_id, season, and week. Let’s rearrange the data accordingly:
The nfl_snapCounts object is in game-player form. That is, each row should be uniquely identified by the combination of game_id and pfr_player_id. Let’s rearrange the data accordingly:
The nfl_espnQBR_seasonal object is in player-season-season type form, where season type refers to regular season versus postseason. That is, each row should be uniquely identified by the combination of player_id, season, and season_type. Let’s rearrange the data accordingly:
The nfl_espnQBR_weekly object is in both game-player form and player-season-season type-week form, where season type refers to regular season versus postseason. That is, each row should be uniquely identified by the combination of gsis_id, season, and week or by the combination of player_id, season, season_type, and week_num. Let’s rearrange the data accordingly:
The nfl_nextGenStats_weekly object is in player-season-season type-week form, where season type refers to regular season versus postseason. That is, each row should be uniquely identified by the combination of player_gsis_id, season, season_type, and week. Let’s rearrange the data accordingly:
The nfl_advancedStatsPFR_seasonalByTeam object is in player-season-team form. That is, each row should be uniquely identified by the combination of pfr_id, season, and team. Let’s rearrange the data accordingly:
We aggregate the variables within each pass/rush/rec/def object across teams for the seasonal data (so the seasonal data are in player-season form, not player-season-team form). Depending on the variable, we aggregate using a sum, a weighted mean (weighted by the number of games played for each team), or a recomputed percentage.
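The three aggregation strategies can be sketched with a toy example of a player who was traded mid-season. The column names and values below (games, targets, receptions, adot) are hypothetical and chosen only for illustration; they are not necessarily the variable names in the Pro Football Reference data:

```r
library(dplyr)

# Toy player-season-team data for one hypothetical player on two teams
toy_pfr <- tibble(
  pfr_id = c("ExamPl00", "ExamPl00"),
  season = c(2023, 2023),
  team = c("AAA", "BBB"),
  games = c(10, 6),
  targets = c(60, 40),
  receptions = c(45, 20),
  adot = c(8, 12) # a per-team average (hypothetical)
)

toy_pfr_season <- toy_pfr %>%
  group_by(pfr_id, season) %>%
  summarise(
    # Averages: weighted mean, weighted by games played for each team
    # (computed before `games` is overwritten by its sum below)
    adot = weighted.mean(adot, w = games),
    # Counts: sum across teams
    targets = sum(targets),
    receptions = sum(receptions),
    games = sum(games),
    .groups = "drop"
  ) %>%
  # Percentages: recompute from the summed counts rather than averaging
  mutate(catch_pct = 100 * receptions / targets)

toy_pfr_season
```

Recomputing the percentage from the summed counts avoids the distortion that would come from averaging two percentages based on different numbers of targets.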
The nfl_advancedStatsPFR_weekly object is in both game-player form and player-season-week form. That is, each row should be uniquely identified by the combination of pfr_player_id, season, and week or by the combination of pfr_player_id, season, game_type, and week. Let’s rearrange the data accordingly:
# Merge seasonal data with the player IDs
nfl_advancedStatsPFR_seasonal <- left_join(
  nfl_advancedStatsPFR_seasonal,
  nfl_playerIDs %>%
    filter(!is.na(pfr_id)) %>%
    filter(gsis_id != "00-0039137") %>% # drop DL Byron Young, keep OLB Byron Young
    select(pfr_id, gsis_id) %>%
    unique(),
  by = "pfr_id")

# Merge weekly data with the player IDs
nfl_advancedStatsPFR_weekly <- left_join(
  nfl_advancedStatsPFR_weekly,
  nfl_playerIDs %>%
    filter(!is.na(pfr_id)) %>%
    filter(gsis_id != "00-0039137") %>% # drop DL Byron Young, keep OLB Byron Young
    select(pfr_id, gsis_id) %>%
    unique(),
  by = "pfr_id")

# Remove distinct players who were given the same `pfr_id` (to allow merging)
nfl_advancedStatsPFR_seasonal$gsis_id[which(nfl_advancedStatsPFR_seasonal$gsis_id == "00-0035665" & nfl_advancedStatsPFR_seasonal$pos %in% c("LB","LILB","RILB"))] <- NA # drop LB David Young, keep DB David Young
#nfl_advancedStatsPFR_weekly$gsis_id[which(nfl_advancedStatsPFR_weekly$gsis_id == "00-0035665" & nfl_advancedStatsPFR_weekly$team %in% c("TEN","MIA"))] <- NA # drop LB David Young, keep DB David Young

nfl_advancedStatsPFR_seasonal$gsis_id[which(nfl_advancedStatsPFR_seasonal$gsis_id == "00-0035292" & nfl_advancedStatsPFR_seasonal$pos %in% c("LB","LILB","RILB"))] <- NA # drop LB David Young, keep DB David Young
nfl_advancedStatsPFR_weekly$gsis_id[which(nfl_advancedStatsPFR_weekly$gsis_id == "00-0035292" & nfl_advancedStatsPFR_weekly$team %in% c("TEN","MIA"))] <- NA # drop LB David Young, keep DB David Young

nfl_advancedStatsPFR_seasonal$gsis_id[which(nfl_advancedStatsPFR_seasonal$gsis_id == "00-0033894" & nfl_advancedStatsPFR_seasonal$pos == "DB")] <- NA # drop S Marcus Williams, keep DB David Young
#nfl_advancedStatsPFR_weekly$gsis_id[which(nfl_advancedStatsPFR_weekly$gsis_id == "00-0033894" & nfl_advancedStatsPFR_weekly$pos == "DB")] <- NA # drop S Marcus Williams

nfl_advancedStatsPFR_seasonal$gsis_id[which(nfl_advancedStatsPFR_seasonal$gsis_id == "00-0038407" & nfl_advancedStatsPFR_seasonal$pos == "DB")] <- NA # drop DB Jaylon Jones, keep CB Jaylon Jones
#nfl_advancedStatsPFR_weekly$gsis_id[which(nfl_advancedStatsPFR_weekly$gsis_id == "00-0038407" & nfl_advancedStatsPFR_weekly$pos == "DB")] <- NA # drop DB Jaylon Jones, keep CB Jaylon Jones

nfl_advancedStatsPFR_seasonal$gsis_id[which(nfl_advancedStatsPFR_seasonal$gsis_id == "00-0037106" & nfl_advancedStatsPFR_seasonal$pos == "DB")] <- NA # drop DB Jaylon Jones, keep CB Jaylon Jones
#nfl_advancedStatsPFR_weekly$gsis_id[which(nfl_advancedStatsPFR_weekly$gsis_id == "00-0037106" & nfl_advancedStatsPFR_weekly$pos == "DB")] <- NA # drop DB Jaylon Jones, keep CB Jaylon Jones

nfl_advancedStatsPFR_seasonal$gsis_id[which(nfl_advancedStatsPFR_seasonal$gsis_id == "00-0038549" & nfl_advancedStatsPFR_seasonal$pos == "WR")] <- NA # drop WR DJ Turner, keep CB DJ Turner
#nfl_advancedStatsPFR_weekly$gsis_id[which(nfl_advancedStatsPFR_weekly$gsis_id == "00-0038549" & nfl_advancedStatsPFR_weekly$pos == "WR")] <- NA # drop WR DJ Turner, keep CB DJ Turner
Now, each row of the nfl_advancedStatsPFR_seasonal object should be uniquely identified by the combination of gsis_id (or pfr_id) and season. Each row of the nfl_advancedStatsPFR_weekly object should be uniquely identified by the combination of gsis_id (or pfr_id), season, and week or by the combination of gsis_id (or pfr_id), season, game_type, and week.
Let’s check again for duplicate game-player or player-season-week instances:
# Based on gsis_id
nfl_advancedStatsPFR_seasonal %>%
  select(gsis_id, everything()) %>%
  filter(!is.na(gsis_id)) %>%
  group_by(gsis_id, season) %>%
  filter(n() > 1) %>%
  head()
The nfl_playerContracts object is in player-year-team-value form. That is, each row should be uniquely identified by the combination of otc_id, year_signed, team, and value. Let’s rearrange the data accordingly:
The nfl_ftnCharting object is in game-play form. That is, each row should be uniquely identified by the combination of nflverse_game_id and play_id. Let’s rearrange the data accordingly:
The nfl_rankings_draft object is in player-page_type form. That is, each row should be uniquely identified by the player’s id. Let’s rearrange the data accordingly:
The nfl_rankings_weekly object is in player-page form. That is, each row should be uniquely identified by fantasypros_id and page. Let’s rearrange the data accordingly:
The nfl_expectedFantasyPoints_weekly object is in game-player form and in player-season-week form. That is, each row should be uniquely identified by the combination of game_id and player_id. Each row should also be uniquely identified by the combination of player_id, season, and week. Let’s rearrange the data accordingly:
The nfl_expectedFantasyPoints_pbp object is in game-drive-play form. That is, each row should be uniquely identified by the combination of game_id, fixed_drive, and play_id. Let’s rearrange the data accordingly:
The players_projections_seasonal_raw and players_projections_weekly_raw objects are in player-position-projection source form. That is, each row should be uniquely identified by the combination of id, pos, and data_src. Each row should also be uniquely identified by the combination of player, pos, and data_src.
Let’s check for duplicate player-position-projection source instances:
Now, modify the scoring settings to match your league settings. Below, we use the scoring settings for fantasy leagues on NFL.com, which happen to be point-per-reception leagues (i.e., PPR leagues):
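To illustrate how scoring settings translate statistics into fantasy points, here is a minimal base R sketch. The weights and the toy statistics below are assumptions chosen for illustration; they are not the complete NFL.com settings and do not use the scoring object from the ffanalytics package:

```r
# Illustrative scoring weights (assumptions for this sketch)
toy_scoring <- c(
  rec = 1,        # 1 point per reception (PPR)
  rec_yds = 0.1,  # 1 point per 10 receiving yards
  rec_tds = 6,
  rush_yds = 0.1, # 1 point per 10 rushing yards
  rush_tds = 6
)

# Toy stat lines for two hypothetical players
toy_stats <- data.frame(
  player = c("Player A", "Player B"),
  rec = c(6, 2),
  rec_yds = c(80, 15),
  rec_tds = c(1, 0),
  rush_yds = c(0, 100),
  rush_tds = c(0, 1)
)

# Multiply each stat column by its weight and sum across columns
toy_stats$fantasy_points <- as.numeric(
  as.matrix(toy_stats[, names(toy_scoring)]) %*% toy_scoring
)

toy_stats[, c("player", "fantasy_points")]
```

Changing a single weight (for example, setting rec to 0.5 for a half-PPR league) changes every player's points accordingly, which is why the scoring settings should match your league.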
The players_projectedPoints_seasonal, players_projectedStatsAverage_seasonal, players_projectedPointsAverage_seasonal, players_projectedPoints_weekly, players_projectedStatsAverage_weekly, and players_projectedPointsAverage_weekly objects are in player-average type-position form. That is, each row should be uniquely identified by the combination of id, avg_type and pos (or position).
Let’s check for duplicate player-average type-position instances:
In addition to week-by-week actual player statistics, we can also compute historical actual player statistics as a function of different timeframes, including season-by-season and career statistics.
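As a rough sketch of what changing the timeframe means, the toy example below (hypothetical players and values, with a single made-up statistic) rolls weekly rows up to season-by-season totals and then to career totals using dplyr. It only illustrates the idea of aggregation and is not how the nflfastR package computes its statistics:

```r
library(dplyr)

# Toy weekly data for two hypothetical players
toy_weekly <- tibble(
  player_id = c("A", "A", "A", "B", "B"),
  season = c(2022, 2022, 2023, 2023, 2023),
  week = c(1, 2, 1, 1, 2),
  receiving_yards = c(50, 70, 90, 20, 40)
)

# Season-by-season: one row per player-season
toy_seasonal <- toy_weekly %>%
  group_by(player_id, season) %>%
  summarise(
    games = n(),
    receiving_yards = sum(receiving_yards),
    .groups = "drop"
  )

# Career: one row per player
toy_career <- toy_seasonal %>%
  group_by(player_id) %>%
  summarise(
    seasons = n(),
    games = sum(games),
    receiving_yards = sum(receiving_yards),
    .groups = "drop"
  )

toy_seasonal
toy_career
```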
4.4.1.1 Season-by-Season Statistics
First, we can compute the players’ season-by-season statistics using the nflfastR (Carl & Baldwin, 2024) package.
TODO: Save/update data file in repo with the data generated from this code, and follow the data file through the rest of the book, updating code as necessary.
A Data Dictionary for the variables is available in the nfl_stats_variables object that is returned when running the calculate_stats() function:
nfl_stats_variables
The nfl_actualStats_seasonal_player_raw object is in player-season form. That is, each row should be uniquely identified by the combination of player_id and season. The nfl_actualStats_seasonal_team_raw object is in team-season form. That is, each row should be uniquely identified by the combination of team and season. Let’s rearrange the data accordingly:
We already loaded players’ week-by-week statistics above. Nevertheless, we could compute players’ weekly statistics from the play-by-play data using the following syntax:
Save/update data file in repo with the data generated from the relevant code, and follow the data file through the rest of the book, updating code as necessary.
The nfl_actualStats_weekly objects are in player-season-week form. That is, each row should be uniquely identified by the combination of player_id, season, and week. Let’s rearrange the data accordingly:
The nfl_actualFantasyPoints_weekly objects are in player-season-week (or team-season-week) form. That is, each row should be uniquely identified by the combination of player_id, season, and week (or, for the team-level object, the combination of team, season, and week). Let’s rearrange the data accordingly:
The nfl_actualFantasyPoints_seasonal objects are in player-season (or team-season) form. That is, each row should be uniquely identified by the combination of player_id and season (or, for the team-level object, the combination of team and season). Let’s rearrange the data accordingly:
We calculate the player’s age based on the difference between dates using the lubridate package (Spinu et al., 2024):
# Reshape from wide to long format
nfl_actualFantasyPoints_player_weekly_long <- nfl_actualFantasyPoints_player_weekly %>%
  tidyr::pivot_longer(
    cols = c(team, opponent_team),
    names_to = "role",
    values_to = "team")

# Perform separate inner join operations for the home_team and away_team
nfl_actualFantasyPoints_player_weekly_home <- dplyr::inner_join(
  nfl_actualFantasyPoints_player_weekly_long,
  nfl_schedules,
  by = c("season","week","team" = "home_team")) %>%
  mutate(home_away = "home_team")

nfl_actualFantasyPoints_player_weekly_away <- dplyr::inner_join(
  nfl_actualFantasyPoints_player_weekly_long,
  nfl_schedules,
  by = c("season","week","team" = "away_team")) %>%
  mutate(home_away = "away_team")

# Combine the results of the join operations
nfl_actualFantasyPoints_player_weekly_schedules_long <- dplyr::bind_rows(
  nfl_actualFantasyPoints_player_weekly_home,
  nfl_actualFantasyPoints_player_weekly_away)

# Reshape from long to wide
player_game_gameday <- nfl_actualFantasyPoints_player_weekly_schedules_long %>%
  dplyr::distinct(player_id, season, week, game_id, home_away, team, gameday) %>% #, .keep_all = TRUE
  tidyr::pivot_wider(
    names_from = home_away,
    values_from = team)

# Merge player birthdate and the game date
player_game_birthdate_gameday <- dplyr::left_join(
  player_game_gameday,
  unique(nfl_players[,c("gsis_id","birth_date")]),
  by = c("player_id" = "gsis_id"))

player_game_birthdate_gameday$birth_date <- lubridate::ymd(player_game_birthdate_gameday$birth_date)
player_game_birthdate_gameday$gameday <- lubridate::ymd(player_game_birthdate_gameday$gameday)

# Calculate player's age for a given week as the difference between their birthdate and the game date
player_game_birthdate_gameday$age <- lubridate::interval(
  start = player_game_birthdate_gameday$birth_date,
  end = player_game_birthdate_gameday$gameday) %>%
  lubridate::time_length(unit = "years")

# Merge with Pro Football Reference Data on Player Age by Season
player_game_birthdate_gameday <- player_game_birthdate_gameday %>%
  dplyr::left_join(
    nfl_advancedStatsPFR_seasonal %>%
      filter(!is.na(gsis_id), !is.na(season), !is.na(age)) %>%
      select(gsis_id, season, age) %>%
      unique(),
    by = c("player_id" = "gsis_id", "season")
  )

# Set age as first non-missing value from calculation above or from PFR
player_game_birthdate_gameday <- player_game_birthdate_gameday %>%
  mutate(age = coalesce(age.x, age.y)) %>%
  select(-age.x, -age.y)

# Calculate ageCentered and ageCenteredQuadratic
player_game_birthdate_gameday$ageCentered20 <- player_game_birthdate_gameday$age - 20
player_game_birthdate_gameday$ageCentered20Quadratic <- player_game_birthdate_gameday$ageCentered20 ^ 2

# Merge with player info
player_age <- dplyr::left_join(
  player_game_birthdate_gameday,
  nfl_players %>% select(-birth_date, -team_abbr, -team_seq),
  by = c("player_id" = "gsis_id"))

# Add game_id to weekly stats to facilitate merging
nfl_actualFantasyPoints_player_weekly <- nfl_actualFantasyPoints_player_weekly %>%
  dplyr::left_join(
    player_age[,c("season","week","player_id","game_id")],
    by = c("season","week","player_id"))

# Merge with player weekly stats
player_stats_weekly <- dplyr::full_join(
  player_age %>% select(-position, -position_group),
  nfl_actualFantasyPoints_player_weekly,
  by = c("season","week","player_id","game_id"))

player_stats_weekly$total_years_of_experience <- as.integer(player_stats_weekly$years_of_experience)
player_stats_weekly$years_of_experience <- NULL

distinct_seasons <- player_stats_weekly %>%
  dplyr::select(player_id, season) %>%
  dplyr::distinct() %>%
  dplyr::left_join(
    nfl_players[,c("gsis_id","years_of_experience")],
    by = c("player_id" = "gsis_id")
  ) %>%
  dplyr::mutate(total_years_of_experience = as.integer(years_of_experience)) %>%
  dplyr::select(-years_of_experience)

years_of_experience <- distinct_seasons %>%
  dplyr::arrange(player_id, -season) %>%
  dplyr::group_by(player_id) %>%
  dplyr::mutate(years_of_experience = first(total_years_of_experience) - (row_number() - 1)) %>%
  dplyr::ungroup()

years_of_experience$years_of_experience[which(years_of_experience$years_of_experience < 0)] <- 0

player_stats_weekly <- player_stats_weekly %>%
  dplyr::left_join(
    years_of_experience[,c("player_id","season","years_of_experience")],
    by = c("player_id","season")
  )
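Two steps in the code above may be easier to follow on toy data: the age calculation and the back-calculation of years of experience. The sketch below uses a hypothetical birthdate, game date, and player (none are real data). As we read the code above, the experience step assigns the player's total years of experience to their most recent season in the data, subtracts one for each earlier season, and sets any negative values to zero:

```r
library(dplyr)
library(lubridate)

# Age: difference between a (hypothetical) birthdate and game date, in years
birth_date <- ymd("1995-09-17")
gameday <- ymd("2023-09-17")

age <- interval(start = birth_date, end = gameday) %>%
  time_length(unit = "years")

# Center age at 20 and compute the quadratic term
ageCentered20 <- age - 20
ageCentered20Quadratic <- ageCentered20^2

# Years of experience: count backward from the most recent season
toy_seasons <- tibble(
  player_id = "A",
  season = c(2021, 2022, 2023),
  total_years_of_experience = 2L # hypothetical value as of the latest season
)

toy_experience <- toy_seasons %>%
  arrange(player_id, desc(season)) %>%
  group_by(player_id) %>%
  mutate(
    years_of_experience = pmax(
      first(total_years_of_experience) - (row_number() - 1L),
      0L # do not allow negative experience
    )
  ) %>%
  ungroup()

age
toy_experience
```

Here, pmax() plays the same role as the separate step above that replaces negative values with zero.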
The player_stats_weekly object is in player-season-week form. That is, each row should be uniquely identified by the combination of player_id, season, and week. Let’s rearrange the data accordingly:
# Save data
save(
  player_stats_weekly,
  file = "./data/player_stats_weekly.RData")
4.4.3.2 Seasonal
# Merge player info with seasonal stats
player_stats_seasonal <- dplyr::full_join(
  nfl_actualFantasyPoints_player_seasonal,
  nfl_players %>% select(-position, -position_group, -team_abbr, -team_seq),
  by = c("player_id" = "gsis_id"))

# Calculate age
season_startdate <- nfl_schedules %>%
  dplyr::group_by(season) %>%
  dplyr::summarise(startdate = min(gameday, na.rm = TRUE))

player_stats_seasonal <- player_stats_seasonal %>%
  dplyr::left_join(
    season_startdate,
    by = "season"
  )

player_stats_seasonal$age <- lubridate::interval(
  start = player_stats_seasonal$birth_date,
  end = player_stats_seasonal$startdate) %>%
  lubridate::time_length(unit = "years")

# Merge with Pro Football Reference Data on Player Age by Season
player_stats_seasonal <- player_stats_seasonal %>%
  dplyr::left_join(
    nfl_advancedStatsPFR_seasonal %>%
      filter(!is.na(gsis_id), !is.na(season), !is.na(age)) %>%
      select(gsis_id, season, age) %>%
      unique(),
    by = c("player_id" = "gsis_id", "season")
  )

# Set age as first non-missing value from calculation above or from PFR
player_stats_seasonal <- player_stats_seasonal %>%
  mutate(age = coalesce(age.x, age.y)) %>%
  select(-age.x, -age.y)

# Calculate ageCentered and ageCenteredQuadratic
player_stats_seasonal$ageCentered20 <- player_stats_seasonal$age - 20
player_stats_seasonal$ageCentered20Quadratic <- player_stats_seasonal$ageCentered20 ^ 2

# Years of experience
player_stats_seasonal$years_of_experience <- NULL

player_stats_seasonal <- player_stats_seasonal %>%
  dplyr::left_join(
    years_of_experience[,c("player_id","season","years_of_experience")],
    by = c("player_id","season")
  )
The player_stats_seasonal object is in player-season form. That is, each row should be uniquely identified by the combination of player_id and season. Let’s rearrange the data accordingly: