I need your help!

I want your feedback to make the book better for you and other readers. If you find typos, errors, or places where the text may be improved, please let me know. The best ways to provide feedback are by GitHub or hypothes.is annotations.

You can leave a comment at the bottom of the page/chapter, or open an issue or submit a pull request on GitHub: https://github.com/isaactpetersen/Fantasy-Football-Analytics-Textbook

Alternatively, you can leave an annotation using hypothes.is. To add an annotation, select some text and then click the annotate symbol on the pop-up menu. To see the annotations of others, click the annotations symbol in the upper right-hand corner of the page.

19  Machine Learning

This chapter provides an overview of machine learning.

19.1 Getting Started

19.1.1 Load Packages

Code
library("petersenlab")
library("future")
library("missRanger")
library("powerjoin")
library("tidymodels")
library("LongituRF")
library("gpboost")
library("effectsize")
library("tidyverse")
library("knitr")

19.1.2 Load Data

Code
# Downloaded Data - Processed
load(file = "./data/nfl_players.RData")
load(file = "./data/nfl_teams.RData")
load(file = "./data/nfl_rosters.RData")
load(file = "./data/nfl_rosters_weekly.RData")
load(file = "./data/nfl_schedules.RData")
load(file = "./data/nfl_combine.RData")
load(file = "./data/nfl_draftPicks.RData")
load(file = "./data/nfl_depthCharts.RData")
#load(file = "./data/nfl_pbp.RData")
#load(file = "./data/nfl_4thdown.RData")
#load(file = "./data/nfl_participation.RData")
#load(file = "./data/nfl_actualFantasyPoints_weekly.RData")
load(file = "./data/nfl_injuries.RData")
load(file = "./data/nfl_snapCounts.RData")
load(file = "./data/nfl_espnQBR_seasonal.RData")
load(file = "./data/nfl_espnQBR_weekly.RData")
load(file = "./data/nfl_nextGenStats_weekly.RData")
load(file = "./data/nfl_advancedStatsPFR_seasonal.RData")
load(file = "./data/nfl_advancedStatsPFR_weekly.RData")
load(file = "./data/nfl_playerContracts.RData")
load(file = "./data/nfl_ftnCharting.RData")
load(file = "./data/nfl_playerIDs.RData")
load(file = "./data/nfl_rankings_draft.RData")
load(file = "./data/nfl_rankings_weekly.RData")
load(file = "./data/nfl_expectedFantasyPoints_weekly.RData")
#load(file = "./data/nfl_expectedFantasyPoints_pbp.RData")

# Calculated Data - Processed
load(file = "./data/nfl_actualStats_player_career.RData")
load(file = "./data/nfl_actualStats_seasonal.RData")
load(file = "./data/player_stats_weekly.RData")
load(file = "./data/player_stats_seasonal.RData")

We created the player_stats_weekly.RData and player_stats_seasonal.RData objects in Section 4.4.3.

19.1.3 Specify Options

Code
options(scipen = 999) # prevent scientific notation

19.2 Overview of Machine Learning

Machine learning is a class of algorithmic approaches that are used to identify patterns in data. Machine learning takes us away from focusing on causal inference. Machine learning does not care about which processes are causal—i.e., which processes influence the outcome. Instead, machine learning cares about prediction—it cares about a predictor variable to the extent that it increases predictive accuracy, regardless of whether it is causally related to the outcome. Nevertheless, association is necessary (despite being insufficient) for causality, as described in Section 13.4. Thus, achieving strong prediction is important (even if insufficient) for the model to be useful. If a model explains only a small portion of variance, it is difficult for it to be useful.

Machine learning can be useful for leveraging big data and many predictor variables to develop predictive models with greater accuracy. However, many machine learning techniques are black boxes—it is often unclear how or why certain predictions are made, which can make it difficult to interpret the model’s decisions and understand the underlying relationships between variables. Machine learning tends to be a data-driven, atheoretical technique. This can result in overfitting. Thus, when estimating machine learning models, it is common to keep a hold-out sample for use in cross-validation to evaluate the extent of shrinkage of model coefficients. The data that the model is trained on is known as the “training data”. The data that the model was not trained on but on which it is then independently tested—i.e., the hold-out sample—is the “test data”. Shrinkage occurs when predictor variables explain some random error variance in the original model. When the model is applied to an independent sample (i.e., the test data), the predictive model will likely not perform quite as well, and the regression coefficients will tend to get smaller (i.e., shrink).

If the test data were collected as part of the same processes as the original data and were merely held out for purposes of analysis, this is called internal cross-validation. If the test data were collected separately from the original data used to train the model, this is called external cross-validation.

Although machine learning tends to be data-driven in its execution, theory should still inform which variables are included in the model.

Most machine learning methods were developed with cross-sectional data in mind. That is, they assume that each person has only one observation on the outcome variable. However, with longitudinal data, each person has multiple observations on the outcome variable.

When performing machine learning with longitudinal data, various approaches may help address this:

  • transform data from long to wide form, so that each person has only one row
  • when designing the training and test sets, keep all measurements from the same person in the same data object (either the training or test set); do not have some measurements from a given person in the training set and other measurements from the same person in the test set
  • use a machine learning approach that accounts for the clustered/nested nature of the data
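The second point—keeping all of a player’s rows in the same set—can be sketched with the rsample package (part of tidymodels, loaded above). The toy data, seed, and split proportion below are illustrative:

```r
library(rsample) # loaded with tidymodels

# Toy data: two seasons per player (invented values)
playerSeasons <- data.frame(
  gsis_id = rep(c("00-0001","00-0002","00-0003","00-0004"), each = 2),
  season = rep(c(2022, 2023), times = 4),
  fantasyPoints = c(150, 162, 90, 84, 210, 198, 60, 75)
)

# Split by player so that no player appears in both sets
set.seed(52242)
playerSplit <- group_initial_split(playerSeasons, group = gsis_id, prop = 3/4)
trainData <- training(playerSplit)
testData <- testing(playerSplit)

intersect(trainData$gsis_id, testData$gsis_id) # no overlapping players
```

Because the split is made at the level of the player (via the `group` argument) rather than the row, all of a given player’s seasons land in either the training set or the test set, never both.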

19.3 Types of Machine Learning

There are many approaches to machine learning. This chapter discusses several key ones:

19.3.1 Supervised Learning

Supervised learning involves learning from data where the correct classification or outcome is known (and the classification is thus part of the data). For instance, predicting how many points a player will score is a supervised learning task, because there is a ground truth—the actual number of points scored—that can be used to train and evaluate the model. If the outcome variable is categorical, the approach involves classification. If the outcome variable is continuous, the approach involves regression.

Unlike linear and logistic regression, various machine learning techniques can handle multicollinearity, including LASSO regression, ridge regression, and elastic net regression via regularization. Regularization involves penalizing model complexity to avoid overfitting (Ramasubramanian & Singh, 2016). Least absolute shrinkage and selection operator (LASSO) regression performs selection of which predictor variables to keep in the model by shrinking some coefficients to zero, effectively removing them from the model. Ridge regression shrinks the coefficients of predictor variables toward zero, but not to zero, so it does not perform selection of which predictor variables to retain; this allows it to yield stable estimates for multiple correlated predictor variables in the context of multicollinearity. Elastic net combines LASSO and ridge regression: it shrinks the coefficients of some predictor variables to zero (like LASSO, for variable selection) and shrinks the coefficients of other predictor variables toward zero (like ridge, for handling multicollinearity among correlated predictors).
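In the tidymodels framework loaded above, all three penalized approaches can be specified with the same linear_reg() function; a minimal sketch, in which the penalty (regularization strength) values are placeholders that would ordinarily be tuned:

```r
library(tidymodels)

# mixture = 1 yields LASSO; mixture = 0 yields ridge;
# values in between yield elastic net
lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) %>% 
  set_engine("glmnet")

ridge_spec <- linear_reg(penalty = 0.1, mixture = 0) %>% 
  set_engine("glmnet")

elasticNet_spec <- linear_reg(penalty = 0.1, mixture = 0.5) %>% 
  set_engine("glmnet")
```

In practice, penalty and mixture would typically be set to tune() and selected via cross-validation rather than fixed at the placeholder values shown here.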

Unless interactions or nonlinear terms are specified, linear, logistic, LASSO, ridge, and elastic net regression assume additive and linear associations between the predictors and outcome. That is, they do not automatically account for interactions among the predictor variables or for nonlinear associations between the predictor variables and the outcome variable (unless interaction terms or nonlinear transformations are explicitly included). By contrast, random forests and tree boosting methods automatically account for interactions and nonlinear associations between predictors and the outcome variable. These models recursively partition the data in ways that capture complex patterns without the need to manually specify interaction or polynomial terms.
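Tree-based models of this kind can be specified analogously in tidymodels; a sketch, assuming the ranger and xgboost engines are available (the tree counts and learning rate are placeholders, not tuned values):

```r
library(tidymodels)

# Random forest: interactions and nonlinearities are captured
# automatically through recursive partitioning
randomForest_spec <- rand_forest(trees = 500) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

# Gradient-boosted trees: trees are grown sequentially, each
# correcting the errors of the previous ones
boostedTrees_spec <- boost_tree(trees = 500, learn_rate = 0.05) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")
```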

19.3.2 Unsupervised Learning

Unsupervised learning involves learning from data without known classifications. Unsupervised learning is used to discover hidden patterns, groupings, or structures in the data. For instance, if we want to identify different subtypes of Wide Receivers based on their playing style or performance metrics, or uncover underlying dimensions in a large dataset, we would use an unsupervised learning approach.

We describe cluster analysis in Chapter 21. We describe factor analysis in Chapter 22. We describe principal component analysis in Chapter 23.

19.3.3 Semi-supervised Learning

Semi-supervised learning combines supervised learning and unsupervised learning by training the model on some data for which the classification is known and some data for which the classification is not known.

19.3.4 Reinforcement Learning

Reinforcement learning involves an agent learning to make decisions by interacting with the environment. Through trial and error, the agent receives feedback in the form of rewards or penalties and learns a strategy that maximizes the cumulative reward over time.

19.3.5 Ensemble Learning

Ensemble machine learning methods combine multiple machine learning approaches, on the premise that a combination of approaches may yield more accurate predictions than any one method could achieve on its own.
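As a minimal illustration of the idea, consider averaging the predictions of two models (all numbers below are made up for demonstration; an ensemble is not guaranteed to outperform its components):

```r
# Hypothetical predictions from two models for five players
pred_model1 <- c(120, 95, 210, 60, 180)
pred_model2 <- c(130, 90, 190, 70, 175)
actual <- c(128, 88, 200, 65, 170)

# Simple (unweighted) ensemble: average the two sets of predictions
pred_ensemble <- (pred_model1 + pred_model2) / 2

# Compare root mean squared error
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
rmse(pred_model1, actual)   # ~8.22
rmse(pred_model2, actual)   # ~5.62
rmse(pred_ensemble, actual) # ~4.14 (lower than either model alone)
```

In this constructed example, the two models err in different directions for several players, so their errors partially cancel when averaged—the intuition behind why ensembles can help.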

19.4 Data Processing

Several data processing steps are necessary to get the data in the form necessary for machine learning.

19.4.1 Prepare Data for Merging

First, we subset the data to the positions and variables of interest. We also rename columns and change variable types to ensure that column names and types match across objects, which will be important later when we merge the data.

Code
# Prepare data for merging

#nfl_actualFantasyPoints_player_weekly <- nfl_actualFantasyPoints_player_weekly %>% 
#  rename(gsis_id = player_id)
#
#nfl_actualFantasyPoints_player_seasonal <- nfl_actualFantasyPoints_player_seasonal %>% 
#  rename(gsis_id = player_id)

player_stats_seasonal_offense <- player_stats_seasonal %>% 
  filter(position_group %in% c("QB","RB","WR","TE")) %>% 
  rename(gsis_id = player_id)

player_stats_weekly_offense <- player_stats_weekly %>% 
  filter(position_group %in% c("QB","RB","WR","TE")) %>% 
  rename(gsis_id = player_id)

## Rename other variables to ensure common names

## Ensure variables with the same name have the same type
nfl_players <- nfl_players %>% 
  mutate(
    birth_date = as.Date(birth_date),
    jersey_number = as.character(jersey_number),
    nfl_id = as.character(nfl_id),
    years_of_experience = as.integer(years_of_experience))

player_stats_seasonal_offense <- player_stats_seasonal_offense %>% 
  mutate(
    birth_date = as.Date(birth_date),
    jersey_number = as.character(jersey_number))

nfl_rosters <- nfl_rosters %>% 
  mutate(
    draft_number = as.integer(draft_number))

nfl_rosters_weekly <- nfl_rosters_weekly %>% 
  mutate(
    draft_number = as.integer(draft_number))

nfl_depthCharts <- nfl_depthCharts %>% 
  mutate(
    season = as.integer(season))

nfl_expectedFantasyPoints_weekly <- nfl_expectedFantasyPoints_weekly %>% 
  rename(gsis_id = player_id) %>% 
  mutate(
    season = as.integer(season),
    receptions = as.integer(receptions)) %>% 
  distinct(gsis_id, season, week, .keep_all = TRUE) # drop duplicated rows

## Rename variables
nfl_draftPicks <- nfl_draftPicks %>%
  rename(
    games_career = games,
    pass_completions_career = pass_completions,
    pass_attempts_career = pass_attempts,
    pass_yards_career = pass_yards,
    pass_tds_career = pass_tds,
    pass_ints_career = pass_ints,
    rush_atts_career = rush_atts,
    rush_yards_career = rush_yards,
    rush_tds_career = rush_tds,
    receptions_career = receptions,
    rec_yards_career = rec_yards,
    rec_tds_career = rec_tds,
    def_solo_tackles_career = def_solo_tackles,
    def_ints_career = def_ints,
    def_sacks_career = def_sacks
  )

## Subset variables
nfl_expectedFantasyPoints_weekly <- nfl_expectedFantasyPoints_weekly %>% 
  select(gsis_id:position, contains("_exp"), contains("_diff"), contains("_team")) #drop "raw stats" variables (e.g., rec_yards_gained) so they don't get coalesced with actual stats

# Check duplicate ids
player_stats_seasonal_offense %>% 
  group_by(gsis_id, season) %>% 
  filter(n() > 1) %>% 
  head()
Code
nfl_advancedStatsPFR_seasonal %>% 
  group_by(gsis_id, season) %>% 
  filter(n() > 1, !is.na(gsis_id)) %>% 
  select(gsis_id, pfr_id, season, team, everything()) %>% 
  head()

Below, we identify shared variable names across objects to be merged to make sure we account for them in merging:

Code
dplyr::intersect(
  names(nfl_players),
  names(nfl_draftPicks))
[1] "gsis_id"  "position"
Code
length(na.omit(nfl_players$position)) # use by default (more cases)
[1] 24375
Code
length(na.omit(nfl_draftPicks$position))
[1] 10824
Code
dplyr::intersect(
  names(player_stats_seasonal_offense),
  names(nfl_advancedStatsPFR_seasonal))
[1] "gsis_id" "season"  "team"    "pfr_id"  "age"    
Code
length(na.omit(player_stats_seasonal_offense$season)) # use by default (more cases)
[1] 15507
Code
length(na.omit(nfl_advancedStatsPFR_seasonal$season))
[1] 12035
Code
length(na.omit(player_stats_seasonal_offense$team)) # use by default (more cases)
[1] 15506
Code
length(na.omit(nfl_advancedStatsPFR_seasonal$team))
[1] 12035
Code
length(na.omit(player_stats_seasonal_offense$age)) # use by default (more cases)
[1] 15507
Code
length(na.omit(nfl_advancedStatsPFR_seasonal$age))
[1] 11961
Code
dplyr::intersect(
  names(nfl_rosters_weekly),
  names(nfl_expectedFantasyPoints_weekly))
[1] "gsis_id"   "season"    "week"      "position"  "full_name"
Code
length(na.omit(nfl_rosters_weekly$season)) # use by default (more cases)
[1] 888773
Code
length(na.omit(nfl_expectedFantasyPoints_weekly$season))
[1] 105903
Code
length(na.omit(nfl_rosters_weekly$week)) # use by default (more cases)
[1] 888773
Code
length(na.omit(nfl_expectedFantasyPoints_weekly$week))
[1] 105903
Code
length(na.omit(nfl_rosters_weekly$position)) # use by default (more cases)
[1] 888740
Code
length(na.omit(nfl_expectedFantasyPoints_weekly$position))
[1] 103446
Code
length(na.omit(nfl_rosters_weekly$full_name)) # use by default (more cases)
[1] 888757
Code
length(na.omit(nfl_expectedFantasyPoints_weekly$full_name))
[1] 103446

19.4.2 Merge Data

To perform machine learning, we need all of the predictor variables and the outcome variable in the same data file. Thus, we must merge data files. To merge data, we use the powerjoin package (Fabri, 2022), which allows coalescing variables with the same name from two different objects. We specify coalesce_xy, which means that—for variables that have the same name across both objects—it keeps the value from object 1 (if present); if not, it keeps the value from object 2. We first merge variables from objects that have the same structure—player data (i.e., id form), seasonal data (i.e., id-season form), or weekly data (i.e., id-season-week form).
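A toy illustration of the coalesce_xy behavior described above (the two tibbles here are invented for demonstration):

```r
library(powerjoin)
library(tibble)

x <- tibble(gsis_id = c("A","B"), position = c("QB", NA))
y <- tibble(gsis_id = c("A","B"), position = c("WR","RB"))

powerjoin::power_full_join(
  x, y,
  by = "gsis_id",
  conflict = powerjoin::coalesce_xy)
# player A keeps "QB" (from object 1); player B's NA is filled with "RB" (from object 2)
```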

Code
# Create lists of objects to merge, depending on data structure: id; or id-season; or id-season-week
playerListToMerge <- list(
  nfl_players %>% filter(!is.na(gsis_id)),
  nfl_draftPicks %>% filter(!is.na(gsis_id)) %>% select(-season)
)

playerSeasonListToMerge <- list(
  player_stats_seasonal_offense %>% filter(!is.na(gsis_id), !is.na(season)),
  nfl_advancedStatsPFR_seasonal %>% filter(!is.na(gsis_id), !is.na(season))
)

playerSeasonWeekListToMerge <- list(
  nfl_rosters_weekly %>% filter(!is.na(gsis_id), !is.na(season), !is.na(week)),
  #nfl_actualStats_offense_weekly,
  nfl_expectedFantasyPoints_weekly %>% filter(!is.na(gsis_id), !is.na(season), !is.na(week))
  #nfl_advancedStatsPFR_weekly,
)

playerSeasonWeekPositionListToMerge <- list(
  nfl_depthCharts %>% filter(!is.na(gsis_id), !is.na(season), !is.na(week))
)

# Merge data
playerMerged <- playerListToMerge %>% 
  reduce(
    powerjoin::power_full_join,
    by = c("gsis_id"),
    conflict = powerjoin::coalesce_xy) # where the objects have the same variable name (e.g., position), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2

playerSeasonMerged <- playerSeasonListToMerge %>% 
  reduce(
    powerjoin::power_full_join,
    by = c("gsis_id","season"),
    conflict = powerjoin::coalesce_xy) # where the objects have the same variable name (e.g., team), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2

playerSeasonWeekMerged <- playerSeasonWeekListToMerge %>% 
  reduce(
    powerjoin::power_full_join,
    by = c("gsis_id","season","week"),
    conflict = powerjoin::coalesce_xy) # where the objects have the same variable name (e.g., position), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2

To prepare for merging player data with seasonal data, we identify shared variable names across the objects:

Code
dplyr::intersect(
  names(playerSeasonMerged),
  names(playerMerged))
 [1] "gsis_id"                      "position"                    
 [3] "position_group"               "display_name"                
 [5] "common_first_name"            "first_name"                  
 [7] "last_name"                    "short_name"                  
 [9] "football_name"                "suffix"                      
[11] "esb_id"                       "nfl_id"                      
[13] "pff_id"                       "otc_id"                      
[15] "espn_id"                      "smart_id"                    
[17] "birth_date"                   "ngs_position_group"          
[19] "ngs_position"                 "height"                      
[21] "weight"                       "headshot"                    
[23] "college_name"                 "college_conference"          
[25] "jersey_number"                "rookie_season"               
[27] "last_season"                  "status"                      
[29] "ngs_status"                   "ngs_status_short_description"
[31] "pff_position"                 "pff_status"                  
[33] "draft_year"                   "draft_round"                 
[35] "draft_pick"                   "draft_team"                  
[37] "years_of_experience"          "pfr_player_name"             
[39] "team"                         "pfr_id"                      
[41] "age"                         

Then we merge the player data with the seasonal data:

Code
seasonalData <- powerjoin::power_full_join(
  playerSeasonMerged,
  playerMerged %>% select(-age, -years_of_experience, -team, -latest_team, -last_season, -pff_status), # drop variables from id objects that change from year to year (and thus are not necessarily accurate for a given season)
  by = "gsis_id",
  conflict = powerjoin::coalesce_xy # where the objects have the same variable name (e.g., position), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2
) %>% 
  filter(!is.na(season)) %>% 
  select(gsis_id, season, player_display_name, position, team, games, everything())

To prepare for merging player and seasonal data with weekly data, we identify shared variable names across the objects:

Code
dplyr::intersect(
  names(playerSeasonWeekMerged),
  names(seasonalData))
 [1] "gsis_id"       "season"        "week"          "team"         
 [5] "jersey_number" "status"        "first_name"    "last_name"    
 [9] "birth_date"    "height"        "weight"        "college"      
[13] "espn_id"       "pff_id"        "pfr_id"        "headshot_url" 
[17] "ngs_position"  "football_name" "esb_id"        "smart_id"     
[21] "position"     

Then we merge the player and seasonal data with the weekly data:

Code
seasonalAndWeeklyData <- powerjoin::power_full_join(
  playerSeasonWeekMerged,
  seasonalData,
  by = c("gsis_id","season"),
  conflict = powerjoin::coalesce_xy # where the objects have the same variable name (e.g., position), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2
) %>% 
  filter(!is.na(week)) %>% 
  select(gsis_id, season, week, full_name, position, team, everything())
Code
# Duplicate cases
seasonalData %>% 
  group_by(gsis_id, season) %>% 
  filter(n() > 1) %>% 
  head()
Code
seasonalAndWeeklyData %>% 
  group_by(gsis_id, season, week) %>% 
  filter(n() > 1) %>% 
  head()

19.4.3 Additional Processing

For purposes of machine learning, we set all character and logical columns to factors.

Code
# Convert character and logical variables to factors
seasonalData <- seasonalData %>% 
  mutate(
    across(
      where(is.character),
      as.factor
    ),
    across(
      where(is.logical),
      as.factor
    )
  )

19.4.4 Fill in Missing Data for Static Variables

For variables that are not expected to change, such as a player’s name and position, we fill in missing values by using a player’s value on those variables from other rows in the data.

Code
seasonalData <- seasonalData %>% 
  arrange(gsis_id, season) %>% 
  group_by(gsis_id) %>% 
  fill(
    player_name, player_display_name, pos, position, position_group,
    .direction = "downup") %>% 
  ungroup()

19.4.5 Create New Data Object for Merging with Later Predictions

We create a new data object that contains the latest seasonal data, for merging with later predictions.

Code
newData_seasonal <- seasonalData %>% 
  filter(season == max(season, na.rm = TRUE))

19.4.6 Lag Fantasy Points

To develop a machine learning model that uses a player’s performance metrics in a given season to predict the player’s fantasy points in the subsequent season, we need to include the player’s fantasy points from the subsequent season in the same row as the prior season’s performance metrics. Thus, we need to create a lagged variable for fantasy points. That way, 2024 fantasy points are in the same row as 2023 performance metrics, 2023 fantasy points are in the same row as 2022 performance metrics, and so on. We call this the lagged fantasy points variable (fantasyPoints_lag). We also retain the original same-year fantasy points variable (fantasyPoints) so it can be used as a predictor of subsequent-year fantasy points.

Code
seasonalData_lag <- seasonalData %>% 
  arrange(gsis_id, season) %>% 
  group_by(gsis_id) %>% 
  mutate(
    fantasyPoints_lag = lead(fantasyPoints)
  ) %>% 
  ungroup()

seasonalData_lag %>% 
  select(gsis_id, player_display_name, season, fantasyPoints, fantasyPoints_lag) # verify that lagging worked as expected

19.4.7 Subset to Predictor Variables and Outcome Variable

Then, we drop variables that we do not want to include in the model, so that all of the variables remaining in the object serve as our predictor variables and outcome variable. To identify candidates for dropping or conversion, we inspect the date, character, and factor variables:

Code
seasonalData_lag %>% select(where(~ inherits(.x, "Date")))
Code
seasonalData_lag %>% select(where(is.character))
Code
seasonalData_lag %>% select(where(is.factor))