
19  Machine Learning

This chapter provides an overview of machine learning.

19.1 Getting Started

19.1.1 Load Packages

Code
library("petersenlab")
library("future")
library("missRanger")
library("powerjoin")
library("tidymodels")
library("LongituRF")
library("gpboost")
library("effectsize")
library("tidyverse")
library("knitr")

19.1.2 Load Data

Code
# Downloaded Data - Processed
load(file = "./data/nfl_players.RData")
load(file = "./data/nfl_teams.RData")
load(file = "./data/nfl_rosters.RData")
load(file = "./data/nfl_rosters_weekly.RData")
load(file = "./data/nfl_schedules.RData")
load(file = "./data/nfl_combine.RData")
load(file = "./data/nfl_draftPicks.RData")
load(file = "./data/nfl_depthCharts.RData")
#load(file = "./data/nfl_pbp.RData")
#load(file = "./data/nfl_4thdown.RData")
#load(file = "./data/nfl_participation.RData")
#load(file = "./data/nfl_actualFantasyPoints_weekly.RData")
load(file = "./data/nfl_injuries.RData")
load(file = "./data/nfl_snapCounts.RData")
load(file = "./data/nfl_espnQBR_seasonal.RData")
load(file = "./data/nfl_espnQBR_weekly.RData")
load(file = "./data/nfl_nextGenStats_weekly.RData")
load(file = "./data/nfl_advancedStatsPFR_seasonal.RData")
load(file = "./data/nfl_advancedStatsPFR_weekly.RData")
load(file = "./data/nfl_playerContracts.RData")
load(file = "./data/nfl_ftnCharting.RData")
load(file = "./data/nfl_playerIDs.RData")
load(file = "./data/nfl_rankings_draft.RData")
load(file = "./data/nfl_rankings_weekly.RData")
load(file = "./data/nfl_expectedFantasyPoints_weekly.RData")
#load(file = "./data/nfl_expectedFantasyPoints_pbp.RData")

# Calculated Data - Processed
load(file = "./data/nfl_actualStats_career.RData")
load(file = "./data/nfl_actualStats_seasonal.RData")
load(file = "./data/player_stats_weekly.RData")
load(file = "./data/player_stats_seasonal.RData")

19.1.3 Specify Options

Code
options(scipen = 999) # prevent scientific notation

19.2 Overview of Machine Learning

Machine learning takes us away from focusing on causal inference. Machine learning does not care about which processes are causal—i.e., which processes influence the outcome. Instead, machine learning cares about prediction—it cares about a predictor variable to the extent that it increases predictive accuracy regardless of whether it is causally related to the outcome.

Machine learning can be useful for leveraging big data and many predictor variables to develop predictive models with greater accuracy. However, many machine learning techniques are black boxes—it is often unclear how or why certain predictions are made, which can make it difficult to interpret the model’s decisions and understand the underlying relationships between variables. Machine learning tends to be a data-driven, atheoretical technique. This can result in overfitting. Thus, when estimating machine learning models, it is common to keep a hold-out sample for use in cross-validation to evaluate the extent of shrinkage of model coefficients. The data that the model is trained on is known as the “training data”. The data that the model was not trained on but is then independently tested on—i.e., the hold-out sample—is the “test data”. Shrinkage occurs because predictor variables explain some random error variance in the original model. When the model is applied to an independent sample (i.e., the test data), the predictive model will likely not perform quite as well, and the regression coefficients will tend to get smaller (i.e., shrink).

If the test data were collected as part of the same processes as the original data and were merely held out for purposes of analysis, this is called internal cross-validation. If the test data were collected separately from the original data used to train the model, this is called external cross-validation.
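As a toy illustration of shrinkage under internal cross-validation, here is a minimal base-R sketch using simulated data (the data frame df and variables x1–x3 and y are hypothetical, not part of the chapter's data): a model fit to training data typically predicts less accurately in held-out test data.

Code
# Simulate data in which only x1 is truly related to the outcome
set.seed(52242)
n <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 0.5 * df$x1 + rnorm(n)

# Hold out 20% of observations as test data
trainRows <- sample(seq_len(n), size = 0.8 * n)
train <- df[trainRows, ]
test <- df[-trainRows, ]

fit <- lm(y ~ x1 + x2 + x3, data = train)

# R-squared in the training data versus the test data; the drop reflects shrinkage
cor(predict(fit, newdata = train), train$y)^2
cor(predict(fit, newdata = test), test$y)^2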

Most machine learning methods were developed with cross-sectional data in mind. That is, they assume that each person has only one observation on the outcome variable. However, with longitudinal data, each person has multiple observations on the outcome variable.

When performing machine learning, various approaches may help address this:

  • transform data from long to wide form, so that each person has only one row
  • when designing the training and test sets, keep all measurements from the same person in the same data object (either the training or test set); do not have some measurements from a given person in the training set and other measurements from the same person in the test set (see the sketch after this list)
  • use a machine learning approach that accounts for the clustered/nested nature of the data
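For example, a grouped train/test split that keeps all of a player's rows together might look like the following minimal sketch (assuming a hypothetical long-form data frame, weeklyData, with one row per player-week and a gsis_id column):

Code
set.seed(52242)

playerIDs <- unique(weeklyData$gsis_id)
holdoutIDs <- sample(playerIDs, size = ceiling(0.2 * length(playerIDs))) # hold out 20% of players

# All rows for a given player land in either the training set or the test set, never both
trainData <- weeklyData[!(weeklyData$gsis_id %in% holdoutIDs), ]
testData <- weeklyData[weeklyData$gsis_id %in% holdoutIDs, ]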

19.3 Types of Machine Learning

There are many approaches to machine learning. This chapter discusses several key ones: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

Ensemble machine learning methods combine multiple machine learning approaches with the goal that combining multiple approaches might lead to more accurate predictions than any one method might be able to achieve on its own.

19.3.1 Supervised Learning

Supervised learning involves learning from data where the correct classification or outcome is known. For instance, predicting how many points a player will score is a supervised learning task, because there is a ground truth—the actual number of points scored—that can be used to train and evaluate the model.

Unlike linear and logistic regression, various machine learning techniques can handle multicollinearity, including LASSO regression, ridge regression, and elastic net regression. Least absolute shrinkage and selection operator (LASSO) regression performs selection of which predictor variables to keep in the model by shrinking some coefficients to zero, effectively removing them from the model. Ridge regression shrinks the coefficients of predictor variables toward zero, but not to zero, so it does not perform selection of which predictor variables to retain; this allows it to yield stable estimates for multiple correlated predictor variables in the context of multicollinearity. Elastic net combines LASSO and ridge regression: it shrinks the coefficients of some predictor variables to zero (like LASSO, for variable selection) and shrinks the coefficients of other predictor variables toward zero (like ridge, for handling multicollinearity among correlated predictors).
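For instance, using the parsnip/glmnet interface that appears later in this chapter, the three penalized models differ only in the mixture argument. The sketch below is illustrative rather than one of the chapter's fitted models; the penalty value shown is an arbitrary placeholder that would normally be tuned.

Code
library("tidymodels")

lasso_spec <- parsnip::linear_reg(penalty = 0.1, mixture = 1) %>% # LASSO: can shrink coefficients exactly to zero
  parsnip::set_engine("glmnet")

ridge_spec <- parsnip::linear_reg(penalty = 0.1, mixture = 0) %>% # ridge: shrinks coefficients toward (but not to) zero
  parsnip::set_engine("glmnet")

elasticNet_spec <- parsnip::linear_reg(penalty = 0.1, mixture = 0.5) %>% # elastic net: a blend of LASSO and ridge
  parsnip::set_engine("glmnet")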

Unless interaction or nonlinear terms are explicitly specified, linear, logistic, LASSO, ridge, and elastic net regression assume additive and linear associations between the predictors and the outcome. That is, they do not automatically account for interactions among the predictor variables or for nonlinear associations between the predictor variables and the outcome variable. By contrast, random forests and tree-boosting methods automatically account for interactions and nonlinear associations between predictors and the outcome variable. These models recursively partition the data in ways that capture complex patterns without the need to manually specify interaction or polynomial terms.
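As a point of contrast with the penalized regression specifications above, a tree-based specification requires no hand-coded interaction or polynomial terms. The brief sketch below (a random forest specification via parsnip, assuming the ranger engine is installed) is illustrative rather than one of the chapter's fitted models.

Code
library("tidymodels")

rf_spec <- parsnip::rand_forest(trees = 500) %>% # splits within trees capture interactions and nonlinearities automatically
  parsnip::set_engine("ranger") %>%
  parsnip::set_mode("regression")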

19.3.2 Unsupervised Learning

Unsupervised learning involves learning from data without known classifications. Unsupervised learning is used to discover hidden patterns, groupings, or structures in the data. For instance, if we want to identify different subtypes of Wide Receivers based on their playing style or performance metrics, or uncover underlying dimensions in a large dataset, we would use an unsupervised learning approach.

We describe cluster analysis in Chapter 21. We describe principal component analysis in Chapter 23.
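As a toy example of unsupervised learning, the following sketch applies k-means clustering to simulated receiver metrics; the column names, values, and choice of three clusters are hypothetical.

Code
set.seed(52242)

wrMetrics <- data.frame(
  targets_per_game = rnorm(100, mean = 6, sd = 2),
  yards_per_reception = rnorm(100, mean = 12, sd = 3))

kmeansFit <- kmeans(scale(wrMetrics), centers = 3) # cluster players on standardized metrics
table(kmeansFit$cluster) # number of receivers assigned to each cluster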

19.3.3 Semi-supervised Learning

Semi-supervised learning combines supervised learning and unsupervised learning by training the model on some data for which the classification is known and some data for which the classification is not known.

19.3.4 Reinforcement Learning

Reinforcement learning involves an agent learning to make decisions by interacting with the environment. Through trial and error, the agent receives feedback in the form of rewards or penalties and learns a strategy that maximizes the cumulative reward over time.

19.4 Data Processing

Several data processing steps are necessary to get the data in the form necessary for machine learning.

19.4.1 Prepare Data for Merging

First, we apply several steps. We subset to the positions and variables of interest. We also rename columns and change variable types to make sure they match the column names and types across objects, which will be important later when we merge the data.

Code
# Prepare data for merging

#nfl_actualFantasyPoints_player_weekly <- nfl_actualFantasyPoints_player_weekly %>% 
#  rename(gsis_id = player_id)
#
#nfl_actualFantasyPoints_player_seasonal <- nfl_actualFantasyPoints_player_seasonal %>% 
#  rename(gsis_id = player_id)

player_stats_seasonal_offense <- player_stats_seasonal %>% 
  filter(position_group %in% c("QB","RB","WR","TE")) %>% 
  rename(gsis_id = player_id)

player_stats_weekly_offense <- player_stats_weekly %>% 
  filter(position_group %in% c("QB","RB","WR","TE")) %>% 
  rename(gsis_id = player_id)

nfl_expectedFantasyPoints_weekly <- nfl_expectedFantasyPoints_weekly %>% 
  rename(gsis_id = player_id)

## Rename other variables to ensure common names

## Ensure variables with the same name have the same type
nfl_players <- nfl_players %>% 
  mutate(
    birth_date = as.Date(birth_date),
    jersey_number = as.character(jersey_number),
    gsis_it_id = as.character(gsis_it_id),
    years_of_experience = as.integer(years_of_experience))

player_stats_seasonal_offense <- player_stats_seasonal_offense %>% 
  mutate(
    birth_date = as.Date(birth_date),
    jersey_number = as.character(jersey_number),
    gsis_it_id = as.character(gsis_it_id))

nfl_rosters <- nfl_rosters %>% 
  mutate(
    draft_number = as.integer(draft_number))

nfl_rosters_weekly <- nfl_rosters_weekly %>% 
  mutate(
    draft_number = as.integer(draft_number))

nfl_depthCharts <- nfl_depthCharts %>% 
  mutate(
    season = as.integer(season))

nfl_expectedFantasyPoints_weekly <- nfl_expectedFantasyPoints_weekly %>% 
  mutate(
    season = as.integer(season),
    receptions = as.integer(receptions)) %>% 
  distinct(gsis_id, season, week, .keep_all = TRUE) # drop duplicated rows

## Rename variables
nfl_draftPicks <- nfl_draftPicks %>%
  rename(
    games_career = games,
    pass_completions_career = pass_completions,
    pass_attempts_career = pass_attempts,
    pass_yards_career = pass_yards,
    pass_tds_career = pass_tds,
    pass_ints_career = pass_ints,
    rush_atts_career = rush_atts,
    rush_yards_career = rush_yards,
    rush_tds_career = rush_tds,
    receptions_career = receptions,
    rec_yards_career = rec_yards,
    rec_tds_career = rec_tds,
    def_solo_tackles_career = def_solo_tackles,
    def_ints_career = def_ints,
    def_sacks_career = def_sacks
  )

## Subset variables
nfl_expectedFantasyPoints_weekly <- nfl_expectedFantasyPoints_weekly %>% 
  select(gsis_id:position, contains("_exp"), contains("_diff"), contains("_team")) #drop "raw stats" variables (e.g., rec_yards_gained) so they don't get coalesced with actual stats

# Check duplicate ids
player_stats_seasonal_offense %>% 
  group_by(gsis_id, season) %>% 
  filter(n() > 1) %>% 
  head()
Code
nfl_advancedStatsPFR_seasonal %>% 
  group_by(gsis_id, season) %>% 
  filter(n() > 1, !is.na(gsis_id)) %>% 
  select(gsis_id, pfr_id, season, team, everything()) %>% 
  head()

Below, we identify shared variable names across objects to be merged to make sure we account for them in merging:

Code
dplyr::intersect(
  names(nfl_players),
  names(nfl_draftPicks))
[1] "gsis_id"  "position"
Code
length(na.omit(nfl_players$position)) # use by default (more cases)
[1] 21360
Code
length(na.omit(nfl_draftPicks$position))
[1] 2855
Code
dplyr::intersect(
  names(player_stats_seasonal_offense),
  names(nfl_advancedStatsPFR_seasonal))
[1] "gsis_id" "season"  "team"    "age"    
Code
length(na.omit(player_stats_seasonal_offense$season)) # use by default (more cases)
[1] 14859
Code
length(na.omit(nfl_advancedStatsPFR_seasonal$season))
[1] 10395
Code
length(na.omit(player_stats_seasonal_offense$team)) # use by default (more cases)
[1] 14858
Code
length(na.omit(nfl_advancedStatsPFR_seasonal$team))
[1] 10395
Code
length(na.omit(player_stats_seasonal_offense$age)) # use by default (more cases)
[1] 14859
Code
length(na.omit(nfl_advancedStatsPFR_seasonal$age))
[1] 10325
Code
dplyr::intersect(
  names(nfl_rosters_weekly),
  names(nfl_expectedFantasyPoints_weekly))
[1] "gsis_id"   "season"    "week"      "position"  "full_name"
Code
length(na.omit(nfl_rosters_weekly$season)) # use by default (more cases)
[1] 845134
Code
length(na.omit(nfl_expectedFantasyPoints_weekly$season))
[1] 100272
Code
length(na.omit(nfl_rosters_weekly$week)) # use by default (more cases)
[1] 841942
Code
length(na.omit(nfl_expectedFantasyPoints_weekly$week))
[1] 100272
Code
length(na.omit(nfl_rosters_weekly$position)) # use by default (more cases)
[1] 845101
Code
length(na.omit(nfl_expectedFantasyPoints_weekly$position))
[1] 97815
Code
length(na.omit(nfl_rosters_weekly$full_name)) # use by default (more cases)
[1] 845118
Code
length(na.omit(nfl_expectedFantasyPoints_weekly$full_name))
[1] 97815

19.4.2 Merge Data

To perform machine learning, we need all of the predictor variables and the outcome variable in the same data file. Thus, we must merge data files. To merge data, we use the powerjoin package (Fabri, 2022), which allows coalescing variables with the same name from two different objects. We specify coalesce_xy, which means that—for variables that have the same name across both objects—it keeps the value from object 1 (if present); if not, it keeps the value from object 2. We first merge variables from objects that have the same structure—player data (i.e., id form), seasonal data (i.e., id-season form), or weekly data (i.e., id-season-week form).
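To illustrate how coalesce_xy resolves conflicting columns, here is a toy example with two small hypothetical data frames (not part of the chapter's data):

Code
library("powerjoin")

x <- data.frame(gsis_id = c("A", "B"), position = c("QB", NA))
y <- data.frame(gsis_id = c("A", "B"), position = c("WR", "RB"))

powerjoin::power_full_join(x, y, by = "gsis_id", conflict = coalesce_xy)
# position is "QB" for player A (kept from x) and "RB" for player B (filled in from y)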

Code
# Create lists of objects to merge, depending on data structure: id; or id-season; or id-season-week
playerListToMerge <- list(
  nfl_players %>% filter(!is.na(gsis_id)),
  nfl_draftPicks %>% filter(!is.na(gsis_id)) %>% select(-season)
)

playerSeasonListToMerge <- list(
  player_stats_seasonal_offense %>% filter(!is.na(gsis_id), !is.na(season)),
  nfl_advancedStatsPFR_seasonal %>% filter(!is.na(gsis_id), !is.na(season))
)

playerSeasonWeekListToMerge <- list(
  nfl_rosters_weekly %>% filter(!is.na(gsis_id), !is.na(season), !is.na(week)),
  #nfl_actualStats_offense_weekly,
  nfl_expectedFantasyPoints_weekly %>% filter(!is.na(gsis_id), !is.na(season), !is.na(week))
  #nfl_advancedStatsPFR_weekly,
)

playerSeasonWeekPositionListToMerge <- list(
  nfl_depthCharts %>% filter(!is.na(gsis_id), !is.na(season), !is.na(week))
)

# Merge data
playerMerged <- playerListToMerge %>% 
  reduce(
    powerjoin::power_full_join,
    by = c("gsis_id"),
    conflict = coalesce_xy) # where the objects have the same variable name (e.g., position), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2

playerSeasonMerged <- playerSeasonListToMerge %>% 
  reduce(
    powerjoin::power_full_join,
    by = c("gsis_id","season"),
    conflict = coalesce_xy) # where the objects have the same variable name (e.g., team), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2

playerSeasonWeekMerged <- playerSeasonWeekListToMerge %>% 
  reduce(
    powerjoin::power_full_join,
    by = c("gsis_id","season","week"),
    conflict = coalesce_xy) # where the objects have the same variable name (e.g., position), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2

To prepare for merging player data with seasonal data, we identify shared variable names across the objects:

Code
dplyr::intersect(
  names(playerSeasonMerged),
  names(playerMerged))
 [1] "gsis_id"                  "position"                
 [3] "position_group"           "first_name"              
 [5] "last_name"                "esb_id"                  
 [7] "display_name"             "rookie_year"             
 [9] "college_conference"       "current_team_id"         
[11] "draft_club"               "draft_number"            
[13] "draftround"               "entry_year"              
[15] "football_name"            "gsis_it_id"              
[17] "headshot"                 "jersey_number"           
[19] "short_name"               "smart_id"                
[21] "status"                   "status_description_abbr" 
[23] "status_short_description" "uniform_number"          
[25] "height"                   "weight"                  
[27] "college_name"             "birth_date"              
[29] "suffix"                   "years_of_experience"     
[31] "pfr_player_name"          "team"                    
[33] "age"                     

Then we merge the player data with the seasonal data:

Code
seasonalData <- powerjoin::power_full_join(
  playerSeasonMerged,
  playerMerged %>% select(-age, -years_of_experience, -team, -team_abbr, -team_seq, -current_team_id), # drop variables from id objects that change from year to year (and thus are not necessarily accurate for a given season)
  by = "gsis_id",
  conflict = coalesce_xy # where the objects have the same variable name (e.g., position), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2
) %>% 
  filter(!is.na(season)) %>% 
  select(gsis_id, season, player_display_name, position, team, games, everything())

To prepare for merging player and seasonal data with weekly data, we identify shared variable names across the objects:

Code
dplyr::intersect(
  names(playerSeasonWeekMerged),
  names(seasonalData))
 [1] "gsis_id"                 "season"                 
 [3] "week"                    "team"                   
 [5] "jersey_number"           "status"                 
 [7] "first_name"              "last_name"              
 [9] "birth_date"              "height"                 
[11] "weight"                  "college"                
[13] "pfr_id"                  "headshot_url"           
[15] "status_description_abbr" "football_name"          
[17] "esb_id"                  "gsis_it_id"             
[19] "smart_id"                "entry_year"             
[21] "rookie_year"             "draft_club"             
[23] "draft_number"            "position"               

Then we merge the player and seasonal data with the weekly data:

Code
seasonalAndWeeklyData <- powerjoin::power_full_join(
  playerSeasonWeekMerged,
  seasonalData,
  by = c("gsis_id","season"),
  conflict = coalesce_xy # where the objects have the same variable name (e.g., position), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2
) %>% 
  filter(!is.na(week)) %>% 
  select(gsis_id, season, week, full_name, position, team, everything())
Code
# Duplicate cases
seasonalData %>% 
  group_by(gsis_id, season) %>% 
  filter(n() > 1) %>% 
  head()
Code
seasonalAndWeeklyData %>% 
  group_by(gsis_id, season, week) %>% 
  filter(n() > 1) %>% 
  head()

19.4.3 Additional Processing

For purposes of machine learning, we set all character and logical columns to factors.

Code
# Convert character and logical variables to factors
seasonalData <- seasonalData %>% 
  mutate(
    across(
      where(is.character),
      as.factor
    ),
    across(
      where(is.logical),
      as.factor
    )
  )

19.4.4 Fill in Missing Data for Static Variables

For variables that are not expected to change, such as a player’s name and position, we fill in missing values by using a player’s value on those variables from other rows in the data.

Code
seasonalData <- seasonalData %>% 
  arrange(gsis_id, season) %>% 
  group_by(gsis_id) %>% 
  fill(
    player_name, player_display_name, pos, position, position_group,
    .direction = "downup") %>% 
  ungroup()

19.4.5 Create New Data Object for Merging with Later Predictions

We create a new data object that contains the latest seasonal data, for merging with later predictions.

Code
newData_seasonal <- seasonalData %>% 
  filter(season == max(season, na.rm = TRUE))

19.4.6 Lag Fantasy Points

To develop a machine learning model that uses a player’s performance metrics in a given season to predict the player’s fantasy points in the subsequent season, we need to include the player’s fantasy points from the subsequent season in the same row as the previous season’s performance metrics. Thus, we need to create a lagged variable for fantasy points. That way, 2024 fantasy points are in the same row as 2023 performance metrics, 2023 fantasy points are in the same row as 2022 performance metrics, and so on. We call this the lagged fantasy points variable (fantasyPoints_lag). We also retain the original same-year fantasy points variable (fantasyPoints) so it can be used as a predictor of subsequent-year fantasy points.

Code
seasonalData_lag <- seasonalData %>% 
  arrange(gsis_id, season) %>% 
  group_by(gsis_id) %>% 
  mutate(
    fantasyPoints_lag = lead(fantasyPoints)
  ) %>% 
  ungroup()

seasonalData_lag %>% 
  select(gsis_id, player_display_name, season, fantasyPoints, fantasyPoints_lag) # verify that lagging worked as expected

19.4.7 Subset to Predictor Variables and Outcome Variable

Then, we drop variables that we do not want to include in the model as predictor or outcome variables. Thus, all of the variables remaining in the object are either predictor variables or the outcome variable.

Code
seasonalData_lag %>% select_if(~class(.) == "Date")
Code
seasonalData_lag %>% select_if(is.character)
Code
seasonalData_lag %>% select_if(is.factor)
Code
seasonalData_lag %>% select_if(is.logical)
Code
dropVars <- c(
  "birth_date", "loaded", "full_name", "player_name", "player_display_name", "display_name", "suffix", "headshot_url", "player", "pos",
  "espn_id", "sportradar_id", "yahoo_id", "rotowire_id", "pff_id", "fantasy_data_id", "sleeper_id", "pfr_id",
  "pfr_player_id", "cfb_player_id", "pfr_player_name", "esb_id", "gsis_it_id", "smart_id",
  "college", "college_name", "team_abbr", "current_team_id", "college_conference", "draft_club", "status_description_abbr",
  "status_short_description", "short_name", "headshot", "uniform_number", "jersey_number", "first_name", "last_name",
  "football_name", "team")

seasonalData_lag_subset <- seasonalData_lag %>% 
  dplyr::select(-any_of(dropVars))

19.4.8 Separate by Position

Then, we separate the objects by position, so we can develop different machine learning models for each position.

Code
seasonalData_lag_subsetQB <- seasonalData_lag_subset %>% 
  filter(position == "QB") %>% 
  select(
    gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic,
    height, weight, rookie_year, draft_number,
    fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag,
    completions:rushing_2pt_conversions, special_teams_tds, contains(".pass"), contains(".rush"))

seasonalData_lag_subsetRB <- seasonalData_lag_subset %>% 
  filter(position == "RB") %>% 
  select(
    gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic,
    height, weight, rookie_year, draft_number,
    fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag,
    carries:special_teams_tds, contains(".rush"), contains(".rec"))

seasonalData_lag_subsetWR <- seasonalData_lag_subset %>% 
  filter(position == "WR") %>% 
  select(
    gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic,
    height, weight, rookie_year, draft_number,
    fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag,
    carries:special_teams_tds, contains(".rush"), contains(".rec"))

seasonalData_lag_subsetTE <- seasonalData_lag_subset %>% 
  filter(position == "TE") %>% 
  select(
    gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic,
    height, weight, rookie_year, draft_number,
    fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag,
    carries:special_teams_tds, contains(".rush"), contains(".rec"))

19.4.9 Split into Test and Training Data

Because machine learning can leverage many predictors, it is at high risk of overfitting—explaining error variance that would not generalize to new data, such as data for new players or future seasons. Thus, it is important to develop and tune the machine learning model so as not to overfit the model. In machine learning, it is common to use cross-validation where we train the model on a subset of the observations, and we evaluate how well the model generalizes to unseen (e.g., “hold-out”) observations. Then, we select the model parameters by how well the model generalizes to the hold-out data, so we are selecting a model that maximizes accuracy and generalizability (i.e., parsimony).

For internal cross-validation, it is common to divide the data into three subsets:

  1. training data
  2. validation data
  3. test data

The training set is used to fit the model. It is usually the largest portion of the data. We fit various models to the training set based on which parameters we want to evaluate (e.g., how many trees to use in a tree-boosting model).

The models fit with the training set are then evaluated using the unseen observations in the validation set. The validation set is used to tune the model parameters and prevent overfitting. We select the model parameters that yield the greatest accuracy in the validation set. In k-fold cross-validation, the validation set rotates across folds, thus replacing the need for a separate validation set.

The test set is used after model training and tuning to evaluate the model’s generalizability to unseen data.

Below, we split the data into test and training data. Our ultimate goal is to predict next year’s fantasy points. However, to do that effectively, we must first develop a model whose accuracy we can evaluate against historical fantasy points (because we do not yet know how many points players will score in the future). We want to include all current/active players in our training data, so that our predictions of their future performance can be informed by their prior data in the model. Thus, we use retired players as our hold-out (test) data. We split our data into 80% training data and 20% testing data. The 20% testing data thus consists entirely of retired players, but not all retired players are in the testing data.

Then, for the analysis, we can either a) use rotating folds (as is the case for k-fold and leave-one-out [LOO] cross-validation), for which a separate validation set (drawn from the training set) is not needed, as we do in Section 19.6, or b) subdivide the training set into an inner training set and a validation set, as we do in Section 19.8.5.4.

Code
seasonalData_lag_qb_all <- seasonalData_lag_subsetQB
seasonalData_lag_rb_all <- seasonalData_lag_subsetRB
seasonalData_lag_wr_all <- seasonalData_lag_subsetWR
seasonalData_lag_te_all <- seasonalData_lag_subsetTE

set.seed(52242) # for reproducibility (to keep the same train/holdout players)

activeQBs <- unique(seasonalData_lag_qb_all$gsis_id[which(seasonalData_lag_qb_all$season == max(seasonalData_lag_qb_all$season, na.rm = TRUE))])
retiredQBs <- unique(seasonalData_lag_qb_all$gsis_id[which(seasonalData_lag_qb_all$gsis_id %ni% activeQBs)])
numQBs <- length(unique(seasonalData_lag_qb_all$gsis_id))
qbHoldoutIDs <- sample(retiredQBs, size = ceiling(.2 * numQBs)) # holdout 20% of players

activeRBs <- unique(seasonalData_lag_rb_all$gsis_id[which(seasonalData_lag_rb_all$season == max(seasonalData_lag_rb_all$season, na.rm = TRUE))])
retiredRBs <- unique(seasonalData_lag_rb_all$gsis_id[which(seasonalData_lag_rb_all$gsis_id %ni% activeRBs)])
numRBs <- length(unique(seasonalData_lag_rb_all$gsis_id))
rbHoldoutIDs <- sample(retiredRBs, size = ceiling(.2 * numRBs)) # holdout 20% of players

set.seed(52242) # for reproducibility (to keep the same train/holdout players); added here to prevent a downstream error with predict.missRanger() due to missingness; this suggests that an error can arise from including a player in the holdout sample who has missingness in particular variables; would be good to identify which player(s) in the holdout sample evoke that error to identify the kinds of missingness that yield the error

activeWRs <- unique(seasonalData_lag_wr_all$gsis_id[which(seasonalData_lag_wr_all$season == max(seasonalData_lag_wr_all$season, na.rm = TRUE))])
retiredWRs <- unique(seasonalData_lag_wr_all$gsis_id[which(seasonalData_lag_wr_all$gsis_id %ni% activeWRs)])
numWRs <- length(unique(seasonalData_lag_wr_all$gsis_id))
wrHoldoutIDs <- sample(retiredWRs, size = ceiling(.2 * numWRs)) # holdout 20% of players

activeTEs <- unique(seasonalData_lag_te_all$gsis_id[which(seasonalData_lag_te_all$season == max(seasonalData_lag_te_all$season, na.rm = TRUE))])
retiredTEs <- unique(seasonalData_lag_te_all$gsis_id[which(seasonalData_lag_te_all$gsis_id %ni% activeTEs)])
numTEs <- length(unique(seasonalData_lag_te_all$gsis_id))
teHoldoutIDs <- sample(retiredTEs, size = ceiling(.2 * numTEs)) # holdout 20% of players
  
seasonalData_lag_qb_train <- seasonalData_lag_qb_all %>% 
  filter(gsis_id %ni% qbHoldoutIDs)
seasonalData_lag_qb_test <- seasonalData_lag_qb_all %>% 
  filter(gsis_id %in% qbHoldoutIDs)

seasonalData_lag_rb_train <- seasonalData_lag_rb_all %>% 
  filter(gsis_id %ni% rbHoldoutIDs)
seasonalData_lag_rb_test <- seasonalData_lag_rb_all %>% 
  filter(gsis_id %in% rbHoldoutIDs)

seasonalData_lag_wr_train <- seasonalData_lag_wr_all %>% 
  filter(gsis_id %ni% wrHoldoutIDs)
seasonalData_lag_wr_test <- seasonalData_lag_wr_all %>% 
  filter(gsis_id %in% wrHoldoutIDs)

seasonalData_lag_te_train <- seasonalData_lag_te_all %>% 
  filter(gsis_id %ni% teHoldoutIDs)
seasonalData_lag_te_test <- seasonalData_lag_te_all %>% 
  filter(gsis_id %in% teHoldoutIDs)

19.4.10 Impute the Missing Data

Many of the machine learning approaches described in this chapter require no missing observations in order for a case to be included in the analysis. In this section, we demonstrate one approach to imputing missing data. Here is a vignette demonstrating how to impute missing data using missForest(): https://rpubs.com/lmorgan95/MissForest (archived at: https://perma.cc/6GB4-2E22). Below, we impute the training data (and all data) separately by position. We then use the imputed training data to make out-of-sample predictions to fill in the missing data for the testing data. We do not want to impute the training and testing data together so that we can keep them separate for the purposes of cross-validation. However, we impute all data (training and test data together) for purposes of making out-of-sample predictions from the machine learning models to predict players’ performance next season (when actuals are not yet available for evaluating their accuracy). To impute data, we use the missRanger package (Mayer, 2024).

Note 19.1: Impute missing data for machine learning

Note: the following code takes a while to run.

Code
# QBs
seasonalData_lag_qb_all_imp <- missRanger::missRanger(
  seasonalData_lag_qb_all,
  pmm.k = 5,
  verbose = 2,
  seed = 52242,
  keep_forests = TRUE)

Variables to impute:        fantasy_points, fantasy_points_ppr, special_teams_tds, passing_epa, pacr, rushing_epa, fantasyPoints_lag, passing_cpoe, rookie_year, draft_number, gs, pass_attempts.pass, throwaways.pass, spikes.pass, drops.pass, bad_throws.pass, times_blitzed.pass, times_hurried.pass, times_hit.pass, times_pressured.pass, batted_balls.pass, on_tgt_throws.pass, rpo_plays.pass, rpo_yards.pass, rpo_pass_att.pass, rpo_pass_yards.pass, rpo_rush_att.pass, rpo_rush_yards.pass, pa_pass_att.pass, pa_pass_yards.pass, att.rush, yds.rush, td.rush, x1d.rush, ybc.rush, yac.rush, brk_tkl.rush, att_br.rush, drop_pct.pass, bad_throw_pct.pass, on_tgt_pct.pass, pressure_pct.pass, ybc_att.rush, yac_att.rush, pocket_time.pass
Variables used to impute:   gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic, height, weight, rookie_year, draft_number, fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag, completions, attempts, passing_yards, passing_tds, passing_interceptions, sacks_suffered, sack_yards_lost, sack_fumbles, sack_fumbles_lost, passing_air_yards, passing_yards_after_catch, passing_first_downs, passing_epa, passing_cpoe, passing_2pt_conversions, pacr, carries, rushing_yards, rushing_tds, rushing_fumbles, rushing_fumbles_lost, rushing_first_downs, rushing_epa, rushing_2pt_conversions, special_teams_tds, pocket_time.pass, pass_attempts.pass, throwaways.pass, spikes.pass, drops.pass, bad_throws.pass, times_blitzed.pass, times_hurried.pass, times_hit.pass, times_pressured.pass, batted_balls.pass, on_tgt_throws.pass, rpo_plays.pass, rpo_yards.pass, rpo_pass_att.pass, rpo_pass_yards.pass, rpo_rush_att.pass, rpo_rush_yards.pass, pa_pass_att.pass, pa_pass_yards.pass, drop_pct.pass, bad_throw_pct.pass, on_tgt_pct.pass, pressure_pct.pass, ybc_att.rush, yac_att.rush, att.rush, yds.rush, td.rush, x1d.rush, ybc.rush, yac.rush, brk_tkl.rush, att_br.rush

    fntsy_  fnts__  spcl__  pssng_p pacr    rshng_  fntsP_  pssng_c rok_yr  drft_n  gs  pss_t.  thrww.  spks.p  drps.p  bd_th.  tms_b.  tms_hr. tms_ht. tms_p.  bttd_.  on_tgt_t.   rp_pl.  rp_yr.  rp_pss_t.   rp_pss_y.   rp_rsh_t.   rp_rsh_y.   p_pss_t.    p_pss_y.    att.rs  yds.rs  td.rsh  x1d.rs  ybc.rs  yc.rsh  brk_t.  att_b.  drp_p.  bd_t_.  on_tgt_p.   prss_.  ybc_t.  yc_tt.  pckt_.
iter 1: 0.0054  0.0024  0.7924  0.1919  0.7612  0.3628  0.4789  0.4133  0.0224  0.5216  0.0271  0.0134  0.3024  0.7659  0.1304  0.0541  0.0758  0.1759  0.1820  0.0370  0.3238  0.0291  0.2952  0.1812  0.0885  0.0867  0.2627  0.2563  0.1093  0.0902  0.0580  0.0645  0.1732  0.0524  0.0578  0.1795  0.3524  0.3428  0.7447  0.5158  0.0824  0.6803  0.3529  0.5758  0.8111  
iter 2: 0.0044  0.0048  0.8304  0.2002  0.7926  0.3736  0.4801  0.4289  0.0488  0.6139  0.0188  0.0090  0.2883  0.7481  0.0764  0.0385  0.0718  0.1231  0.1329  0.0337  0.2760  0.0113  0.0548  0.0814  0.0765  0.0990  0.1989  0.2841  0.0707  0.0952  0.0396  0.0386  0.1606  0.0492  0.0525  0.1220  0.2541  0.3556  0.7468  0.4937  0.0827  0.6610  0.3465  0.5796  0.8134  
iter 3: 0.0049  0.0046  0.8690  0.1986  0.7810  0.3641  0.4774  0.4360  0.0528  0.6123  0.0188  0.0088  0.2867  0.7538  0.0767  0.0393  0.0734  0.1261  0.1374  0.0343  0.2741  0.0119  0.0524  0.0816  0.0748  0.1008  0.2184  0.2811  0.0691  0.0926  0.0389  0.0413  0.1640  0.0511  0.0585  0.1255  0.2510  0.3609  0.7477  0.5108  0.0858  0.6426  0.3588  0.5734  0.8300  
Code
seasonalData_lag_qb_all_imp
missRanger object. Extract imputed data via $data
- best iteration: 2 
- best average OOB imputation error: 0.2524825 
Code
data_all_qb <- seasonalData_lag_qb_all_imp$data
data_all_qb$fantasyPointsMC_lag <- scale(data_all_qb$fantasyPoints_lag, scale = FALSE) # mean-centered
data_all_qb_matrix <- data_all_qb %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()
newData_qb <- data_all_qb %>% 
  filter(season == max(season, na.rm = TRUE)) %>% 
  select(-fantasyPoints_lag, -fantasyPointsMC_lag)
newData_qb_matrix <- data_all_qb_matrix[
  data_all_qb_matrix[, "season"] == max(data_all_qb_matrix[, "season"], na.rm = TRUE), # keep only rows with the most recent season
  , # all columns
  drop = FALSE]

dropCol_qb <- which(colnames(newData_qb_matrix) %in% c("fantasyPoints_lag","fantasyPointsMC_lag"))
newData_qb_matrix <- newData_qb_matrix[, -dropCol_qb, drop = FALSE]

seasonalData_lag_qb_train_imp <- missRanger::missRanger(
  seasonalData_lag_qb_train,
  pmm.k = 5,
  verbose = 2,
  seed = 52242,
  keep_forests = TRUE)

Variables to impute:        fantasy_points, fantasy_points_ppr, special_teams_tds, passing_epa, pacr, rushing_epa, fantasyPoints_lag, passing_cpoe, rookie_year, draft_number, gs, pass_attempts.pass, throwaways.pass, spikes.pass, drops.pass, bad_throws.pass, times_blitzed.pass, times_hurried.pass, times_hit.pass, times_pressured.pass, batted_balls.pass, on_tgt_throws.pass, rpo_plays.pass, rpo_yards.pass, rpo_pass_att.pass, rpo_pass_yards.pass, rpo_rush_att.pass, rpo_rush_yards.pass, pa_pass_att.pass, pa_pass_yards.pass, att.rush, yds.rush, td.rush, x1d.rush, ybc.rush, yac.rush, brk_tkl.rush, att_br.rush, drop_pct.pass, bad_throw_pct.pass, on_tgt_pct.pass, pressure_pct.pass, ybc_att.rush, yac_att.rush, pocket_time.pass
Variables used to impute:   gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic, height, weight, rookie_year, draft_number, fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag, completions, attempts, passing_yards, passing_tds, passing_interceptions, sacks_suffered, sack_yards_lost, sack_fumbles, sack_fumbles_lost, passing_air_yards, passing_yards_after_catch, passing_first_downs, passing_epa, passing_cpoe, passing_2pt_conversions, pacr, carries, rushing_yards, rushing_tds, rushing_fumbles, rushing_fumbles_lost, rushing_first_downs, rushing_epa, rushing_2pt_conversions, special_teams_tds, pocket_time.pass, pass_attempts.pass, throwaways.pass, spikes.pass, drops.pass, bad_throws.pass, times_blitzed.pass, times_hurried.pass, times_hit.pass, times_pressured.pass, batted_balls.pass, on_tgt_throws.pass, rpo_plays.pass, rpo_yards.pass, rpo_pass_att.pass, rpo_pass_yards.pass, rpo_rush_att.pass, rpo_rush_yards.pass, pa_pass_att.pass, pa_pass_yards.pass, drop_pct.pass, bad_throw_pct.pass, on_tgt_pct.pass, pressure_pct.pass, ybc_att.rush, yac_att.rush, att.rush, yds.rush, td.rush, x1d.rush, ybc.rush, yac.rush, brk_tkl.rush, att_br.rush

    fntsy_  fnts__  spcl__  pssng_p pacr    rshng_  fntsP_  pssng_c rok_yr  drft_n  gs  pss_t.  thrww.  spks.p  drps.p  bd_th.  tms_b.  tms_hr. tms_ht. tms_p.  bttd_.  on_tgt_t.   rp_pl.  rp_yr.  rp_pss_t.   rp_pss_y.   rp_rsh_t.   rp_rsh_y.   p_pss_t.    p_pss_y.    att.rs  yds.rs  td.rsh  x1d.rs  ybc.rs  yc.rsh  brk_t.  att_b.  drp_p.  bd_t_.  on_tgt_p.   prss_.  ybc_t.  yc_tt.  pckt_.
iter 1: 0.0061  0.0028  0.8162  0.1897  0.5083  0.3633  0.4726  0.4456  0.0242  0.4723  0.0283  0.0141  0.2939  0.7728  0.1343  0.0558  0.0744  0.1757  0.1818  0.0381  0.3288  0.0351  0.2921  0.1846  0.0860  0.0894  0.2737  0.2661  0.1127  0.0900  0.0586  0.0644  0.1800  0.0574  0.0639  0.1792  0.3570  0.3486  0.7646  0.5313  0.0868  0.7084  0.3533  0.5933  0.8466  
iter 2: 0.0052  0.0052  0.8304  0.1937  0.5621  0.3715  0.4614  0.4586  0.0505  0.5647  0.0192  0.0092  0.2953  0.7530  0.0800  0.0393  0.0725  0.1170  0.1355  0.0343  0.2771  0.0121  0.0555  0.0731  0.0713  0.0979  0.2073  0.2943  0.0698  0.0911  0.0416  0.0399  0.1683  0.0527  0.0577  0.1262  0.2474  0.3582  0.7719  0.5165  0.0900  0.6862  0.3642  0.5926  0.8400  
iter 3: 0.0053  0.0051  0.8261  0.2008  0.5551  0.3571  0.4727  0.4410  0.0551  0.5658  0.0188  0.0092  0.2859  0.7460  0.0807  0.0402  0.0739  0.1202  0.1393  0.0351  0.2808  0.0114  0.0595  0.0705  0.0775  0.1051  0.2163  0.2935  0.0718  0.0921  0.0426  0.0400  0.1719  0.0535  0.0534  0.1225  0.2498  0.3484  0.7502  0.5100  0.0884  0.6609  0.3672  0.5852  0.8440  
iter 4: 0.0054  0.0051  0.6928  0.1979  0.5598  0.3732  0.4771  0.4349  0.0506  0.5691  0.0189  0.0085  0.2891  0.7456  0.0785  0.0395  0.0737  0.1210  0.1353  0.0335  0.2836  0.0117  0.0566  0.0778  0.0743  0.1055  0.2131  0.2964  0.0697  0.0912  0.0396  0.0395  0.1611  0.0531  0.0597  0.1258  0.2600  0.3560  0.8062  0.5032  0.0973  0.6739  0.3698  0.5875  0.8485  
iter 5: 0.0052  0.0055  0.8355  0.1965  0.5664  0.3710  0.4743  0.4604  0.0520  0.5598  0.0193  0.0091  0.2852  0.7474  0.0800  0.0405  0.0722  0.1213  0.1366  0.0344  0.2788  0.0118  0.0555  0.0756  0.0746  0.0986  0.2190  0.2765  0.0695  0.0932  0.0390  0.0425  0.1650  0.0509  0.0576  0.1305  0.2556  0.3509  0.7738  0.5051  0.0969  0.6902  0.3640  0.6007  0.8326  
Code
seasonalData_lag_qb_train_imp
missRanger object. Extract imputed data via $data
- best iteration: 4 
- best average OOB imputation error: 0.2482278 
Code
data_train_qb <- seasonalData_lag_qb_train_imp$data
data_train_qb$fantasyPointsMC_lag <- scale(data_train_qb$fantasyPoints_lag, scale = FALSE) # mean-centered
data_train_qb_matrix <- data_train_qb %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()

seasonalData_lag_qb_test_imp <- predict(
  object = seasonalData_lag_qb_train_imp,
  newdata = seasonalData_lag_qb_test,
  seed = 52242)

data_test_qb <- seasonalData_lag_qb_test_imp
data_test_qb_matrix <- data_test_qb %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()
Code
# RBs
seasonalData_lag_rb_all_imp <- missRanger::missRanger(
  seasonalData_lag_rb_all,
  pmm.k = 5,
  verbose = 2,
  seed = 52242,
  keep_forests = TRUE)

seasonalData_lag_rb_all_imp

data_all_rb <- seasonalData_lag_rb_all_imp$data
data_all_rb$fantasyPointsMC_lag <- scale(data_all_rb$fantasyPoints_lag, scale = FALSE) # mean-centered
data_all_rb_matrix <- data_all_rb %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()
newData_rb <- data_all_rb %>% 
  filter(season == max(season, na.rm = TRUE)) %>% 
  select(-fantasyPoints_lag, -fantasyPointsMC_lag)
newData_rb_matrix <- data_all_rb_matrix[
  data_all_rb_matrix[, "season"] == max(data_all_rb_matrix[, "season"], na.rm = TRUE), # keep only rows with the most recent season
  , # all columns
  drop = FALSE]

dropCol_rb <- which(colnames(newData_rb_matrix) %in% c("fantasyPoints_lag","fantasyPointsMC_lag"))
newData_rb_matrix <- newData_rb_matrix[, -dropCol_rb, drop = FALSE]

seasonalData_lag_rb_train_imp <- missRanger::missRanger(
  seasonalData_lag_rb_train,
  pmm.k = 5,
  verbose = 2,
  seed = 52242,
  keep_forests = TRUE)

seasonalData_lag_rb_train_imp

data_train_rb <- seasonalData_lag_rb_train_imp$data
data_train_rb$fantasyPointsMC_lag <- scale(data_train_rb$fantasyPoints_lag, scale = FALSE) # mean-centered
data_train_rb_matrix <- data_train_rb %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()

seasonalData_lag_rb_test_imp <- predict(
  object = seasonalData_lag_rb_train_imp,
  newdata = seasonalData_lag_rb_test,
  seed = 52242)

data_test_rb <- seasonalData_lag_rb_test_imp
data_test_rb_matrix <- data_test_rb %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()
Code
# WRs
seasonalData_lag_wr_all_imp <- missRanger::missRanger(
  seasonalData_lag_wr_all,
  pmm.k = 5,
  verbose = 2,
  seed = 52242,
  keep_forests = TRUE)

seasonalData_lag_wr_all_imp

data_all_wr <- seasonalData_lag_wr_all_imp$data
data_all_wr$fantasyPointsMC_lag <- scale(data_all_wr$fantasyPoints_lag, scale = FALSE) # mean-centered
data_all_wr_matrix <- data_all_wr %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()
newData_wr <- data_all_wr %>% 
  filter(season == max(season, na.rm = TRUE)) %>% 
  select(-fantasyPoints_lag, -fantasyPointsMC_lag)
newData_wr_matrix <- data_all_wr_matrix[
  data_all_wr_matrix[, "season"] == max(data_all_wr_matrix[, "season"], na.rm = TRUE), # keep only rows with the most recent season
  , # all columns
  drop = FALSE]

dropCol_wr <- which(colnames(newData_wr_matrix) %in% c("fantasyPoints_lag","fantasyPointsMC_lag"))
newData_wr_matrix <- newData_wr_matrix[, -dropCol_wr, drop = FALSE]

seasonalData_lag_wr_train_imp <- missRanger::missRanger(
  seasonalData_lag_wr_train,
  pmm.k = 5,
  verbose = 2,
  seed = 52242,
  keep_forests = TRUE)

seasonalData_lag_wr_train_imp

data_train_wr <- seasonalData_lag_wr_train_imp$data
data_train_wr$fantasyPointsMC_lag <- scale(data_train_wr$fantasyPoints_lag, scale = FALSE) # mean-centered
data_train_wr_matrix <- data_train_wr %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()

seasonalData_lag_wr_test_imp <- predict(
  object = seasonalData_lag_wr_train_imp,
  newdata = seasonalData_lag_wr_test,
  seed = 52242)

data_test_wr <- seasonalData_lag_wr_test_imp
data_test_wr_matrix <- data_test_wr %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()
Code
# TEs
seasonalData_lag_te_all_imp <- missRanger::missRanger(
  seasonalData_lag_te_all,
  pmm.k = 5,
  verbose = 2,
  seed = 52242,
  keep_forests = TRUE)

seasonalData_lag_te_all_imp

data_all_te <- seasonalData_lag_te_all_imp$data
data_all_te$fantasyPointsMC_lag <- scale(data_all_te$fantasyPoints_lag, scale = FALSE) # mean-centered
data_all_te_matrix <- data_all_te %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()
newData_te <- data_all_te %>% 
  filter(season == max(season, na.rm = TRUE)) %>% 
  select(-fantasyPoints_lag, -fantasyPointsMC_lag)
newData_te_matrix <- data_all_te_matrix[
  data_all_te_matrix[, "season"] == max(data_all_te_matrix[, "season"], na.rm = TRUE), # keep only rows with the most recent season
  , # all columns
  drop = FALSE]

dropCol_te <- which(colnames(newData_te_matrix) %in% c("fantasyPoints_lag","fantasyPointsMC_lag"))
newData_te_matrix <- newData_te_matrix[, -dropCol_te, drop = FALSE]

seasonalData_lag_te_train_imp <- missRanger::missRanger(
  seasonalData_lag_te_train,
  pmm.k = 5,
  verbose = 2,
  seed = 52242,
  keep_forests = TRUE)

seasonalData_lag_te_train_imp

data_train_te <- seasonalData_lag_te_train_imp$data
data_train_te$fantasyPointsMC_lag <- scale(data_train_te$fantasyPoints_lag, scale = FALSE) # mean-centered
data_train_te_matrix <- data_train_te %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()

seasonalData_lag_te_test_imp <- predict(
  object = seasonalData_lag_te_train_imp,
  newdata = seasonalData_lag_te_test,
  seed = 52242)

data_test_te <- seasonalData_lag_te_test_imp
data_test_te_matrix <- data_test_te %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()

19.5 Identify Cores for Parallel Processing

Code
num_cores <- parallelly::availableCores() - 1
num_true_cores <- parallelly::availableCores(logical = FALSE) - 1
Code
num_cores
system 
     4 

We use the future package (Bengtsson, 2025) for parallel (faster) processing.

Code
future::plan(future::multisession, workers = num_cores)

19.6 Set up the Cross-Validation Folds

In the examples below, we predict the future fantasy points of Quarterbacks. However, the examples could be applied to any of the positions. There are various approaches to cross-validation. In the examples below, we use k-fold cross-validation. However, we also provide the code to apply leave-one-out (LOO) cross-validation.

19.6.1 k-Fold Cross-Validation

k-fold cross-validation partitions the data into k folds (subsets). In each of the k iterations, the model is trained on \(k - 1\) folds and is evaluated on the remaining fold. For example, in a 10-fold cross-validation (i.e., \(k = 10\)), as used below, the model is trained 10 times, each time leaving out a different 10% of the data for validation. k-fold cross-validation is widely used because it tends to yield stable estimates of model performance, by balancing bias and variance. It is also computationally efficient, requiring only k model fits to evaluate model performance.

We set up the k folds using the group_vfold_cv() function of the rsample package (Frick, Chow, et al., 2025).

Code
set.seed(52242) # for reproducibility

folds_kFold <- rsample::group_vfold_cv(
  data_train_qb,
  group = gsis_id, # ensures all rows for a player are in the training set or all in the validation set for each fold
  v = 10) # 10-fold cross-validation

19.6.2 Leave-One-Out (LOO) Cross-Validation

Leave-one-out (LOO) cross-validation partitions the data into n folds, where n is the sample size. In each of the n iterations, the model is trained on \(n - 1\) observations and is evaluated on the one left out. For example, in a LOO cross-validation with 100 players, the model is trained 100 times, each time leaving out a different player for validation. LOO cross-validation is especially useful when the dataset is small—too small to form reliable training sets in k-fold cross-validation (e.g., with \(k = 5\) or \(k = 10\), which divide the sample into 5 or 10 folds, respectively). However, LOO tends to be less computationally efficient because it requires more model fits than k-fold cross-validation. LOO tends to have low bias, producing performance estimates closer to those obtained when fitting the model to the full dataset, because each model is trained on nearly all the data. However, LOO also tends to have high variance in its error estimates, because each validation fold contains only a single observation, making those estimates more sensitive to individual data points.

We set up the LOO folds using the loo_cv() function of the rsample package (Frick, Chow, et al., 2025).

Code
set.seed(52242) # for reproducibility

folds_loo <- rsample::loo_cv(data_train_qb)

19.7 Fitting the Traditional Linear Regression Models

We describe linear regression in Chapter 11.

19.7.1 Regression with One Predictor

Below, we fit a linear regression model with one predictor and evaluate it with cross-validation. We also evaluate its accuracy on the hold-out (test) data. We fit and evaluate each of the models using the tidymodels ecosystem of packages (Kuhn & Wickham, 2020, 2025). We specify our model formula using the recipe() function of the recipes package (Kuhn, Wickham, et al., 2025). We define the model using the linear_reg(), set_engine(), and set_mode() functions of the parsnip package (Kuhn & Vaughan, 2025). We specify the workflow using the workflow(), add_recipe(), and add_model() functions of the workflows package (Vaughan & Couch, 2025). We fit the cross-validation model using the fit_resamples() function of the tune package (Kuhn, 2025). We specify the accuracy metrics to evaluate using the metric_set() function of the yardstick package (Kuhn, Vaughan, et al., 2025). We fit the final model using the fit() function of the workflows package (Vaughan & Couch, 2025). We evaluate the accuracy of the model’s predictions on the test data using the accuracyOverall() function of the petersenlab package (Petersen, 2025a).

Code
# Set seed for reproducibility
set.seed(52242)

# Set up Cross-Validation
folds <- folds_kFold

# Define Recipe (Formula)
rec <- recipes::recipe(
  fantasyPoints_lag ~ fantasyPoints,
  data = data_train_qb)

# Define Model
lm_spec <- parsnip::linear_reg() %>%
  parsnip::set_engine("lm") %>%
  parsnip::set_mode("regression")

# Workflow
lm_wf <- workflows::workflow() %>%
  workflows::add_recipe(rec) %>%
  workflows::add_model(lm_spec)

# Fit Model with Cross-Validation
cv_results <- tune::fit_resamples(
  lm_wf,
  resamples = folds,
  metrics = yardstick::metric_set(rmse, mae, rsq),
  control = tune::control_resamples(save_pred = TRUE)
)

# View Cross-Validation metrics
tune::collect_metrics(cv_results)
Code
# Fit Final Model on Training Data
final_model <- workflows::fit(
  lm_wf,
  data = data_train_qb)

# View Coefficients
final_model %>% 
  workflows::extract_fit_parsnip() %>% 
  broom::tidy()
Code
final_model %>%
  workflows::extract_fit_parsnip() %>% 
  effectsize::standardize_parameters()
Code
# Predict on Test Data
df <- data_test_qb %>%
  mutate(pred = predict(final_model, new_data = data_test_qb)$.pred)

# Evaluate Accuracy of Predictions
petersenlab::accuracyOverall(
  predicted = df$pred,
  actual = df$fantasyPoints_lag,
  dropUndefined = TRUE
)

Figure 19.1 depicts the predicted versus actual fantasy points for the model on the test data.

Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)

ggplot(
  df,
  aes(
    x = pred,
    y = fantasyPoints_lag)) +
  geom_point(
    size = 2,
    alpha = 0.6) +
  geom_abline(
    slope = 1,
    intercept = 0,
    color = "blue",
    linetype = "dashed") +
  coord_equal(
    xlim = axis_limits,
    ylim = axis_limits) +
  labs(
    title = "Predicted vs Actual Fantasy Points (Test Data)",
    x = "Predicted Fantasy Points",
    y = "Actual Fantasy Points"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Figure 19.1: Predicted Versus Actual Fantasy Points for Regression Model with One Predictor (Prior Season Fantasy Points).

Below are the model predictions for next year’s fantasy points:

Code
newData_qb %>%
  mutate(fantasyPoints_lag = predict(final_model, new_data = newData_qb)$.pred) %>% 
  left_join(
    .,
    nfl_playerIDs %>% select(gsis_id, name),
    by = "gsis_id"
  ) %>% 
  select(name, fantasyPoints_lag) %>% 
  arrange(-fantasyPoints_lag)

19.7.2 Regression with Multiple Predictors

Below, we fit a linear regression model with multiple predictors and evaluate it with cross-validation. We also evaluate its accuracy on the hold-out (test) data.

Code
# Set seed for reproducibility
set.seed(52242)

# Set up Cross-Validation
folds <- folds_kFold

# Define Recipe (Formula)
rec <- recipes::recipe(
  fantasyPoints_lag ~ ., # use all predictors
  data = data_train_qb %>% select(-gsis_id, -fantasyPointsMC_lag))

# Define Model
lm_spec <- parsnip::linear_reg() %>%
  parsnip::set_engine("lm") %>%
  parsnip::set_mode("regression")

# Workflow
lm_wf <- workflows::workflow() %>%
  workflows::add_recipe(rec) %>%
  workflows::add_model(lm_spec)

# Fit Model with Cross-Validation
cv_results <- tune::fit_resamples(
  lm_wf,
  resamples = folds,
  metrics = yardstick::metric_set(rmse, mae, rsq),
  control = tune::control_resamples(save_pred = TRUE)
)

# View Cross-Validation metrics
tune::collect_metrics(cv_results)
Code
# Fit Final Model on Training Data
final_model <- workflows::fit(
  lm_wf,
  data = data_train_qb)

# View Coefficients
final_model %>% 
  workflows::extract_fit_parsnip() %>% 
  broom::tidy()
Code
final_model %>%
  workflows::extract_fit_parsnip() %>% 
  effectsize::standardize_parameters()
Code
# Predict on Test Data
df <- data_test_qb %>%
  mutate(pred = predict(final_model, new_data = data_test_qb)$.pred)

# Evaluate Accuracy of Predictions
petersenlab::accuracyOverall(
  predicted = df$pred,
  actual = df$fantasyPoints_lag,
  dropUndefined = TRUE
)

Figure 19.2 depicts the predicted versus actual fantasy points for the model on the test data.

Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)

ggplot(
  df,
  aes(
    x = pred,
    y = fantasyPoints_lag)) +
  geom_point(
    size = 2,
    alpha = 0.6) +
  geom_abline(
    slope = 1,
    intercept = 0,
    color = "blue",
    linetype = "dashed") +
  coord_equal(
    xlim = axis_limits,
    ylim = axis_limits) +
  labs(
    title = "Predicted vs Actual Fantasy Points (Test Data)",
    x = "Predicted Fantasy Points",
    y = "Actual Fantasy Points"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Figure 19.2: Predicted Versus Actual Fantasy Points for Regression Model with Multiple Predictors.

Below are the model predictions for next year’s fantasy points:

Code
newData_qb %>%
  mutate(fantasyPoints_lag = predict(final_model, new_data = newData_qb)$.pred) %>% 
  left_join(
    .,
    nfl_playerIDs %>% select(gsis_id, name),
    by = "gsis_id"
  ) %>% 
  select(name, fantasyPoints_lag) %>% 
  arrange(-fantasyPoints_lag)

19.8 Fitting the Machine Learning Models

19.8.1 Least Absolute Shrinkage and Selection Operator (LASSO)

Below, we fit a LASSO model. We evaluate it and tune its penalty parameter with cross-validation. The penalty parameter in a LASSO model controls the strength of regularization applied to the model coefficients. Smaller penalty values result in less regularization, allowing the model to retain more predictors with larger (nonzero) coefficients. This typically reduces bias but increases variance of the model’s predictions, as the model may overfit to the training data by including irrelevant or weak predictors. Larger penalty values apply stronger regularization, shrinking more coefficients exactly to zero. This encourages a sparser model that may increase bias (by excluding useful predictors) but reduces variance and improves generalizability by simplifying the model and reducing overfitting.

After tuning the model, we evaluate its accuracy on the hold-out (test) data.

The LASSO models were fit using the glmnet package (Friedman et al., 2010, 2025; Tay et al., 2023).

For the machine learning models, we perform the parameter tuning using the tune(), tune_grid(), select_best(), and finalize_workflow() functions of the tune package (Kuhn, 2025). We specify the grid of possible parameter values using the grid_regular() function of the dials package (Kuhn & Frick, 2025).

Code
# Set seed for reproducibility
set.seed(52242)

# Set up Cross-Validation
folds <- folds_kFold

# Define Recipe (Formula)
rec <- recipes::recipe(
  fantasyPoints_lag ~ ., # use all predictors
  data = data_train_qb %>% select(-gsis_id, -fantasyPointsMC_lag))

# Define Model
lasso_spec <- 
  parsnip::linear_reg(
    penalty = tune::tune(),
    mixture = 1) %>%
  parsnip::set_engine("glmnet")

# Workflow
lasso_wf <- workflows::workflow() %>%
  workflows::add_recipe(rec) %>%
  workflows::add_model(lasso_spec)

# Define grid of penalties to try (log scale is typical)
penalty_grid <- dials::grid_regular(
  dials::penalty(range = c(-4, -1)),
  levels = 20)

# Tune the Penalty Parameter
cv_results <- tune::tune_grid(
  lasso_wf,
  resamples = folds,
  grid = penalty_grid,
  metrics = yardstick::metric_set(rmse, mae, rsq),
  control = tune::control_grid(save_pred = TRUE)
)

# View Cross-Validation metrics
tune::collect_metrics(cv_results)
Code
# Identify best penalty
tune::select_best(cv_results, metric = "rmse")
Code
tune::select_best(cv_results, metric = "mae")
Code
tune::select_best(cv_results, metric = "rsq")
Code
best_penalty <- tune::select_best(cv_results, metric = "mae")

# Finalize Workflow with Best Penalty
final_wf <- tune::finalize_workflow(
  lasso_wf,
  best_penalty)

# Fit Final Model on Training Data
final_model <- workflows::fit(
  final_wf,
  data = data_train_qb)

# View Coefficients
final_model %>% 
  workflows::extract_fit_parsnip() %>% 
  broom::tidy()
Code
# Predict on Test Data
df <- data_test_qb %>%
  mutate(pred = predict(final_model, new_data = data_test_qb)$.pred)

# Evaluate Accuracy of Predictions
petersenlab::accuracyOverall(
  predicted = df$pred,
  actual = df$fantasyPoints_lag,
  dropUndefined = TRUE
)

Figure 19.3 depicts the predicted versus actual fantasy points for the model on the test data.

Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)

ggplot(
  df,
  aes(
    x = pred,
    y = fantasyPoints_lag)) +
  geom_point(
    size = 2,
    alpha = 0.6) +
  geom_abline(
    slope = 1,
    intercept = 0,
    color = "blue",
    linetype = "dashed") +
  coord_equal(
    xlim = axis_limits,
    ylim = axis_limits) +
  labs(
    title = "Predicted vs Actual Fantasy Points (Test Data)",
    x = "Predicted Fantasy Points",
    y = "Actual Fantasy Points"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Figure 19.3: Predicted Versus Actual Fantasy Points for Least Absolute Shrinkage and Selection Operator (LASSO) Model.

Below are the model predictions for next year’s fantasy points:

Code
newData_qb %>%
  mutate(fantasyPoints_lag = predict(final_model, new_data = newData_qb)$.pred) %>% 
  left_join(
    .,
    nfl_playerIDs %>% select(gsis_id, name),
    by = "gsis_id"
  ) %>% 
  select(name, fantasyPoints_lag) %>% 
  arrange(-fantasyPoints_lag)

19.8.2 Ridge Regression

Below, we fit a ridge regression model. We evaluate it and tune its penalty parameter with cross-validation. The penalty parameter in a ridge regression model controls the amount of regularization applied to the model’s coefficients. Smaller penalty values allow coefficients to remain large, resulting in a model that closely fits the training data. This may reduce bias but increases the risk of overfitting, especially in the presence of multicollinearity or many weak predictors. Larger penalty values shrink coefficients toward zero (though not exactly to zero), which reduces model complexity. This typically increases bias slightly but reduces variance of the model’s predictions, making the model more stable and better suited for generalization to new data.
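
For reference, ridge regression minimizes the residual sum of squares plus an \(L_2\) (squared) penalty on the coefficients, which shrinks them toward zero without setting them exactly to zero:

\[
\hat{\beta}^{\text{ridge}} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}
\]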

After tuning the model, we also evaluate its accuracy on the hold-out (test) data.

The ridge regression models were fit using the glmnet package (Friedman et al., 2010, 2025; Tay et al., 2023).

Code
# Set seed for reproducibility
set.seed(52242)

# Set up Cross-Validation
folds <- folds_kFold

# Define Recipe (Formula)
rec <- recipes::recipe(
  fantasyPoints_lag ~ ., # use all predictors
  data = data_train_qb %>% select(-gsis_id, -fantasyPointsMC_lag))

# Define Model
ridge_spec <- 
  parsnip::linear_reg(
    penalty = tune::tune(),
    mixture = 0) %>%
  parsnip::set_engine("glmnet")

# Workflow
ridge_wf <- workflows::workflow() %>%
  workflows::add_recipe(rec) %>%
  workflows::add_model(ridge_spec)

# Define grid of penalties to try (log scale is typical)
penalty_grid <- dials::grid_regular(
  dials::penalty(range = c(-4, -1)),
  levels = 20)

# Tune the Penalty Parameter
cv_results <- tune::tune_grid(
  ridge_wf,
  resamples = folds,
  grid = penalty_grid,
  metrics = yardstick::metric_set(rmse, mae, rsq),
  control = tune::control_grid(save_pred = TRUE)
)

# View Cross-Validation metrics
tune::collect_metrics(cv_results)
Code
# Identify best penalty
tune::select_best(cv_results, metric = "rmse")
Code
tune::select_best(cv_results, metric = "mae")
Code
tune::select_best(cv_results, metric = "rsq")
Code
best_penalty <- tune::select_best(cv_results, metric = "mae")

# Finalize Workflow with Best Penalty
final_wf <- tune::finalize_workflow(
  ridge_wf,
  best_penalty)

# Fit Final Model on Training Data
final_model <- workflows::fit(
  final_wf,
  data = data_train_qb)

# View Coefficients
final_model %>% 
  workflows::extract_fit_parsnip() %>% 
  broom::tidy()
Code
# Predict on Test Data
df <- data_test_qb %>%
  mutate(pred = predict(final_model, new_data = data_test_qb)$.pred)

# Evaluate Accuracy of Predictions
petersenlab::accuracyOverall(
  predicted = df$pred,
  actual = df$fantasyPoints_lag,
  dropUndefined = TRUE
)

Figure 19.4 depicts the predicted versus actual fantasy points for the model on the test data.

Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)

ggplot(
  df,
  aes(
    x = pred,
    y = fantasyPoints_lag)) +
  geom_point(
    size = 2,
    alpha = 0.6) +
  geom_abline(
    slope = 1,
    intercept = 0,
    color = "blue",
    linetype = "dashed") +
  coord_equal(
    xlim = axis_limits,
    ylim = axis_limits) +
  labs(
    title = "Predicted vs Actual Fantasy Points (Test Data)",
    x = "Predicted Fantasy Points",
    y = "Actual Fantasy Points"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Figure 19.4: Predicted Versus Actual Fantasy Points for Ridge Regression Model.

Below are the model predictions for next year’s fantasy points:

Code
newData_qb %>%
  mutate(fantasyPoints_lag = predict(final_model, new_data = newData_qb)$.pred) %>% 
  left_join(
    .,
    nfl_playerIDs %>% select(gsis_id, name),
    by = "gsis_id"
  ) %>% 
  select(name, fantasyPoints_lag) %>% 
  arrange(-fantasyPoints_lag)

19.8.3 Elastic Net

Below, we fit an elastic net model. We evaluate it and tune its penalty and mixture parameters with cross-validation.

The penalty parameter controls the overall strength of regularization applied to the model’s coefficients. Smaller penalty values allow coefficients to remain large, which can reduce bias but increase variance of the model’s predictions and can increase the risk of overfitting. Larger penalty values shrink coefficients more aggressively, leading to simpler models with potentially higher bias but lower variance of predictions. This regularization helps prevent overfitting, especially when the model includes many predictors or multicollinearity.

The mixture parameter controls the balance between LASSO and ridge regularization. It ranges from 0 to 1: A value of 0 applies pure ridge regression, which shrinks all coefficients but keeps them in the model. A value of 1 applies pure LASSO, which can shrink some coefficients exactly to zero, effectively performing variable selection. Values between 0 and 1 apply a combination: ridge-like smoothing and LASSO-like sparsity. Smaller mixture values favor shrinkage without variable elimination, whereas larger values favor sparser solutions by excluding weak predictors.
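
For reference, in glmnet's parameterization, the elastic net adds the following combined penalty term to the residual sum of squares, where \(\lambda\) is the penalty parameter and \(\alpha\) is the mixture parameter:

\[
\lambda \left[ \frac{1 - \alpha}{2} \sum_{j=1}^{p} \beta_j^2 + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert \right]
\]

Setting \(\alpha = 0\) yields ridge regression, and setting \(\alpha = 1\) yields the LASSO.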

After tuning the model, we also evaluate its accuracy on the hold-out (test) data.

The elastic net models were fit using the glmnet package (Friedman et al., 2010, 2025; Tay et al., 2023).

Code
# Set seed for reproducibility
set.seed(52242)

# Set up Cross-Validation
folds <- folds_kFold

# Define Recipe (Formula)
rec <- recipes::recipe(
  fantasyPoints_lag ~ ., # use all predictors
  data = data_train_qb %>% select(-gsis_id, -fantasyPointsMC_lag))

# Define Model
enet_spec <- 
  parsnip::linear_reg(
    penalty = tune::tune(),
    mixture = tune::tune()) %>%
  parsnip::set_engine("glmnet")

# Workflow
enet_wf <- workflows::workflow() %>%
  workflows::add_recipe(rec) %>%
  workflows::add_model(enet_spec)

# Define a regular grid for both penalty and mixture
grid_enet <- dials::grid_regular(
  dials::penalty(range = c(-4, -1)),
  dials::mixture(range = c(0, 1)),
  levels = c(20, 5) # 20 penalty values × 5 mixture values
)

# Tune the Grid
cv_results <- tune::tune_grid(
  enet_wf,
  resamples = folds,
  grid = grid_enet,
  metrics = yardstick::metric_set(rmse, mae, rsq),
  control = tune::control_grid(save_pred = TRUE)
)

# View Cross-Validation metrics
tune::collect_metrics(cv_results)
Code
# Identify best penalty and mixture
tune::select_best(cv_results, metric = "rmse")
Code
tune::select_best(cv_results, metric = "mae")
Code
tune::select_best(cv_results, metric = "rsq")
Code
best_penalty <- tune::select_best(cv_results, metric = "mae")

# Finalize Workflow with Best Penalty and Mixture
final_wf <- tune::finalize_workflow(
  enet_wf,
  best_penalty)

# Fit Final Model on Training Data
final_model <- workflows::fit(
  final_wf,
  data = data_train_qb)

# View Coefficients
final_model %>% 
  workflows::extract_fit_parsnip() %>% 
  broom::tidy()
Code
# Predict on Test Data
df <- data_test_qb %>%
  mutate(pred = predict(final_model, new_data = data_test_qb)$.pred)

# Evaluate Accuracy of Predictions
petersenlab::accuracyOverall(
  predicted = df$pred,
  actual = df$fantasyPoints_lag,
  dropUndefined = TRUE
)

Figure 19.5 depicts the predicted versus actual fantasy points for the model on the test data.

Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)

ggplot(
  df,
  aes(
    x = pred,
    y = fantasyPoints_lag)) +
  geom_point(
    size = 2,
    alpha = 0.6) +
  geom_abline(
    slope = 1,
    intercept = 0,
    color = "blue",
    linetype = "dashed") +
  coord_equal(
    xlim = axis_limits,
    ylim = axis_limits) +
  labs(
    title = "Predicted vs Actual Fantasy Points (Test Data)",
    x = "Predicted Fantasy Points",
    y = "Actual Fantasy Points"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Figure 19.5: Predicted Versus Actual Fantasy Points for Elastic Net Model.

Below are the model predictions for next year’s fantasy points:

Code
newData_qb %>%
  mutate(fantasyPoints_lag = predict(final_model, new_data = newData_qb)$.pred) %>% 
  left_join(
    .,
    nfl_playerIDs %>% select(gsis_id, name),
    by = "gsis_id"
  ) %>% 
  select(name, fantasyPoints_lag) %>% 
  arrange(-fantasyPoints_lag)

19.8.4 Random Forest Machine Learning

19.8.4.1 Assuming Independent Observations

Below, we fit a random forest model. We evaluate it and tune two parameters with cross-validation. The first parameter is mtry, which controls the number of predictors randomly sampled at each split in a decision tree. Smaller mtry values increase tree diversity by forcing trees to consider different subsets of predictors. This typically reduces the variance of the overall model’s predictions (because the final prediction is averaged over more diverse trees) but may increase bias if strong predictors are frequently excluded. Larger mtry values allow more predictors to be considered at each split, making trees more similar to each other. This can reduce bias but often increases variance of the model’s predictions, because the trees are more correlated and less effective at error cancellation when averaged.

The second parameter is min_n, which controls the minimum number of observations that must be present in a node for a split to be attempted. Smaller min_n values allow trees to grow deeper and capture more fine-grained patterns in the training data. This typically reduces bias but increases variance of the overall model’s predictions, because deeper trees are more likely to overfit to noise in the training data. Larger min_n values restrict the depth of the trees by requiring more data to justify a split. This can reduce variance by producing simpler, more generalizable trees—but may increase bias if the trees are unable to capture important structure in the data.
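
To build intuition for how these two parameters behave before tuning them formally, below is a minimal, illustrative sketch (not part of the tuning workflow; the object names are illustrative, and it assumes data_train_qb has no missing values and that all predictors are numeric or factors) that fits two ranger forests directly and compares their out-of-bag (OOB) error:

Code
# Illustrative sketch: compare out-of-bag (OOB) error for two contrasting
# settings of mtry and min.node.size (ranger's analogue of min_n)
rf_diverseTrees <- ranger::ranger(
  fantasyPoints_lag ~ .,
  data = data_train_qb %>% dplyr::select(-gsis_id, -fantasyPointsMC_lag),
  num.trees = 500,
  mtry = 2, # few candidate predictors per split: more diverse, less correlated trees
  min.node.size = 10, # larger terminal nodes: shallower, more regularized trees
  seed = 52242)

rf_greedyTrees <- ranger::ranger(
  fantasyPoints_lag ~ .,
  data = data_train_qb %>% dplyr::select(-gsis_id, -fantasyPointsMC_lag),
  num.trees = 500,
  mtry = 40, # many candidate predictors per split: more similar, correlated trees
  min.node.size = 2, # small terminal nodes: deeper trees that fit the training data closely
  seed = 52242)

rf_diverseTrees$prediction.error # OOB mean squared error
rf_greedyTrees$prediction.error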

After tuning the model, we evaluate its accuracy on the hold-out (test) data.

The random forest models were fit using the ranger package (Wright, 2024; Wright & Ziegler, 2017). We specify the grid of possible parameter values using the grid_random(), update.parameters(), mtry(), and min_n() functions of the dials package (Kuhn & Frick, 2025) and the extract_parameter_set_dials() function of the hardhat package (Frick, Vaughan, et al., 2025).

Code
# Set seed for reproducibility
set.seed(52242)

# Set up Cross-Validation
folds <- folds_kFold

# Define Recipe (Formula)
rec <- recipes::recipe(
  fantasyPoints_lag ~ ., # use all predictors
  data = data_train_qb %>% select(-gsis_id, -fantasyPointsMC_lag))

# Define Model
rf_spec <- 
  parsnip::rand_forest(
    mtry = tune::tune(),
    min_n = tune::tune(),
    trees = 500) %>%
  parsnip::set_mode("regression") %>%
  parsnip::set_engine(
    "ranger",
    importance = "impurity")

# Workflow
rf_wf <- workflows::workflow() %>%
  workflows::add_recipe(rec) %>%
  workflows::add_model(rf_spec)

# Create Grid
n_predictors <- recipes::prep(rec) %>%
  recipes::juice() %>%
  dplyr::select(-fantasyPoints_lag) %>%
  ncol()

# Dynamically define ranges based on data
rf_params <- hardhat::extract_parameter_set_dials(rf_spec) %>%
  dials:::update.parameters(
    mtry = dials::mtry(range = c(1L, n_predictors)),
    min_n = dials::min_n(range = c(2L, 10L))
  )

rf_grid <- dials::grid_random(rf_params, size = 15) #dials::grid_regular(rf_params, levels = 5)

# Tune the Grid
cv_results <- tune::tune_grid(
  rf_wf,
  resamples = folds,
  grid = rf_grid,
  metrics = yardstick::metric_set(rmse, mae, rsq),
  control = tune::control_grid(save_pred = TRUE)
)

# View Cross-Validation metrics
tune::collect_metrics(cv_results)
Code
# Identify best tuning parameters
tune::select_best(cv_results, metric = "rmse")
Code
tune::select_best(cv_results, metric = "mae")
Code
tune::select_best(cv_results, metric = "rsq")
Code
best_params <- tune::select_best(cv_results, metric = "mae")

# Finalize Workflow with Best Tuning Parameters
final_wf <- tune::finalize_workflow(
  rf_wf,
  best_params)

# Fit Final Model on Training Data
final_model <- workflows::fit(
  final_wf,
  data = data_train_qb)

# View Feature Importance
rf_fit <- final_model %>% 
  workflows::extract_fit_parsnip()

rf_fit
parsnip model object

Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~8L,      x), num.trees = ~500, min.node.size = min_rows(~3L, x), importance = ~"impurity",      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      1582 
Number of independent variables:  73 
Mtry:                             8 
Target node size:                 3 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       6292.873 
R squared (OOB):                  0.5165346 
Code
ranger_obj <- rf_fit$fit

ranger_obj$variable.importance
                   season                     games                        gs 
              158486.2281               311601.5550               508011.8724 
      years_of_experience                       age             ageCentered20 
              153382.2301               267708.3635               279843.6026 
   ageCentered20Quadratic                    height                    weight 
              275022.6554                94161.2553               191252.7328 
              rookie_year              draft_number            fantasy_points 
              140253.5439               272494.7684              1419670.4892 
       fantasy_points_ppr             fantasyPoints               completions 
             1293483.9743              1312301.3737               827216.4193 
                 attempts             passing_yards               passing_tds 
              646841.5869               917999.3053               857088.9137 
    passing_interceptions            sacks_suffered           sack_yards_lost 
              130485.7228               161120.2954               220249.8113 
             sack_fumbles         sack_fumbles_lost         passing_air_yards 
               76489.5596                52392.6083               299007.8777 
passing_yards_after_catch       passing_first_downs               passing_epa 
              329812.9334               939795.2571               431328.9812 
             passing_cpoe   passing_2pt_conversions                      pacr 
              177761.1008                35523.1896               168910.0901 
                  carries             rushing_yards               rushing_tds 
              240997.7562               181847.7875                57066.7128 
          rushing_fumbles      rushing_fumbles_lost       rushing_first_downs 
               54868.5816                33249.7074               118475.4108 
              rushing_epa   rushing_2pt_conversions         special_teams_tds 
              212956.2609                26187.3427                  585.1349 
         pocket_time.pass        pass_attempts.pass           throwaways.pass 
              117380.4026               542011.1159               181719.6877 
              spikes.pass                drops.pass           bad_throws.pass 
               56148.3695               317853.6846               331561.0155 
       times_blitzed.pass        times_hurried.pass            times_hit.pass 
              347944.2411               194260.1925               192719.6026 
     times_pressured.pass         batted_balls.pass        on_tgt_throws.pass 
              441632.1558                92205.6330               362699.7319 
           rpo_plays.pass            rpo_yards.pass         rpo_pass_att.pass 
              161827.5832               166390.5923               141437.2823 
      rpo_pass_yards.pass         rpo_rush_att.pass       rpo_rush_yards.pass 
              134159.9270                64700.4095               108135.7323 
         pa_pass_att.pass        pa_pass_yards.pass             drop_pct.pass 
              274342.4437               247080.0647               160183.4104 
       bad_throw_pct.pass           on_tgt_pct.pass         pressure_pct.pass 
              195925.8780               132741.7748               167263.9786 
             ybc_att.rush              yac_att.rush                  att.rush 
              169961.6625               155900.6714               229358.3533 
                 yds.rush                   td.rush                  x1d.rush 
              158863.0638                65818.0750               155380.6876 
                 ybc.rush                  yac.rush              brk_tkl.rush 
              156091.4085               153192.0666                49373.0228 
              att_br.rush 
               60286.9067 
Code
# Predict on Test Data
df <- data_test_qb %>%
  mutate(pred = predict(final_model, new_data = data_test_qb)$.pred)

# Evaluate Accuracy of Predictions
petersenlab::accuracyOverall(
  predicted = df$pred,
  actual = df$fantasyPoints_lag,
  dropUndefined = TRUE
)

Figure 19.6 depicts the predicted versus actual fantasy points for the model on the test data.

Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)

ggplot(
  df,
  aes(
    x = pred,
    y = fantasyPoints_lag)) +
  geom_point(
    size = 2,
    alpha = 0.6) +
  geom_abline(
    slope = 1,
    intercept = 0,
    color = "blue",
    linetype = "dashed") +
  coord_equal(
    xlim = axis_limits,
    ylim = axis_limits) +
  labs(
    title = "Predicted vs Actual Fantasy Points (Test Data)",
    x = "Predicted Fantasy Points",
    y = "Actual Fantasy Points"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Figure 19.6: Predicted Versus Actual Fantasy Points for Random Forest Model.

Below are the model predictions for next year’s fantasy points:

Code
newData_qb %>%
  mutate(fantasyPoints_lag = predict(final_model, new_data = newData_qb)$.pred) %>% 
  left_join(
    .,
    nfl_playerIDs %>% select(gsis_id, name),
    by = "gsis_id"
  ) %>% 
  select(name, fantasyPoints_lag) %>% 
  arrange(-fantasyPoints_lag)

Now we can stop the parallel backend:

Code
future::plan(future::sequential)

19.8.4.2 Accounting for Longitudinal Data

The above approaches to machine learning assume that the observations are independent across rows. However, in our case, this assumption does not hold because the data are longitudinal—each player has multiple seasons, and each row corresponds to a unique player–season combination. In the approaches below, using a random forest model (this section) and a tree-boosting model (in Section 19.8.5), we address this by explicitly accounting for the nesting of longitudinal observations within players.

Approaches to estimating random forest models with longitudinal data are described by Hu & Szymczak (2023). Below, we fit longitudinal random forest models using the MERF() function of the LongituRF package (Capitaine, 2020).
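
Roughly speaking, the model estimated below is a semiparametric stochastic mixed-effects model of the form

\[
Y_{ij} = f(X_{ij}) + Z_{ij} b_i + \omega_i(t_{ij}) + \varepsilon_{ij}, \quad b_i \sim N(0, B), \quad \varepsilon_{ij} \sim N(0, \sigma^2)
\]

where \(f(\cdot)\) is estimated with a random forest, \(b_i\) are the player-specific random effects for the variables in Z, \(\omega_i(t)\) is a player-specific stochastic process over time (a Brownian motion, given sto = "BM"), and \(\varepsilon_{ij}\) is residual error.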

Code
smerf <- LongituRF::MERF(
  X = data_train_qb %>% dplyr::select(season:att_br.rush) %>% as.matrix(), # predictors of the fixed effects
  Y = data_train_qb[,c("fantasyPoints_lag")] %>% as.matrix(), # outcome variable
  Z = data_train_qb %>% dplyr::mutate(constant = 1) %>% dplyr::select(constant, passing_yards, passing_tds, passing_interceptions, passing_epa, pacr) %>% as.matrix(), # predictors of the random effects
  id = data_train_qb[,c("gsis_id")] %>% as.matrix(), # player ID (for nesting)
  time = data_train_qb[,c("ageCentered20")] %>% as.matrix(), # time variable
  ntree = 500,
  sto = "BM")
[1] "stopped after 11 iterations."
Code
smerf$forest # the fitted random forest (obtained at the last iteration)

Call:
 randomForest(x = X, y = ystar, ntree = ntree, mtry = mtry, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 25

          Mean of squared residuals: 165.2049
                    % Var explained: 98.49
Code
smerf$random_effects %>% data.frame() # the predicted random effects for each player
Code
smerf$omega %>% data.frame() # the predicted stochastic processes
Code
smerf$OOB # OOB error at each iteration
 [1] 154.9823 110.8246 105.2359 108.3674 127.9415 145.0449 145.5596 158.7728
 [9] 160.9756 151.7372 165.2049
Code
plot(smerf$Vraisemblance)
Figure 19.7: Evolution of the Log-Likelihood.

The following code generates an error for reasons that are not yet clear. The issue has been posted on the GitHub repository: https://github.com/sistm/LongituRF/issues/5. Hopefully the package maintainer will help address the issue.

Code
# Predict on Test Data
predict(
  smerf,
  X = data_test_qb %>% dplyr::select(season:att_br.rush) %>% as.matrix(),
  Z = data_train_qb %>% dplyr::mutate(constant = 1) %>% dplyr::select(constant, passing_yards, passing_tds, passing_interceptions, passing_epa, pacr) %>% as.matrix(),
  id = data_test_qb[,c("gsis_id")] %>% as.matrix(),
  time = data_test_qb[,c("ageCentered20")] %>% as.matrix())
Error in Z[w, , drop = FALSE] %*% object$random_effects[k, ]: non-conformable arguments

19.8.5 Combining Tree-Boosting with Mixed Models

To combine tree-boosting with mixed models, we use the gpboost package (Sigrist et al., 2025).

Adapted from here: https://towardsdatascience.com/mixed-effects-machine-learning-for-longitudinal-panel-data-with-gpboost-part-iii-523bb38effc
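
Conceptually (a simplified sketch, not a formal statement of the gpboost estimator), the model replaces the linear fixed-effects part of a mixed model with a tree ensemble \(F(X)\) and retains grouped (player-level) random effects \(Zb\); with a gamma likelihood and log link, the conditional mean of the outcome is

\[
E\!\left[Y \mid X, b\right] = \exp\!\left(F(X) + Zb\right), \qquad b \sim N(0, \Sigma)
\]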

19.8.5.1 Process Data

The gamma distribution requires strictly positive values, so we replace nonpositive values of the outcome with a small positive constant:

Code
data_train_qb_matrix[,"fantasyPoints_lag"][data_train_qb_matrix[,"fantasyPoints_lag"] <= 0] <- 0.01

19.8.5.2 Specify Predictor Variables

Code
pred_vars_qb <- data_train_qb_matrix %>% 
  as_tibble() %>% 
  select(-fantasyPoints_lag, -fantasyPointsMC_lag, -ageCentered20, -ageCentered20Quadratic) %>% # -gsis_id
  names()

pred_vars_qb_categorical <- "gsis_id" # to specify categorical predictors

19.8.5.3 Specify General Model Options

Code
model_likelihood <- "gamma" # gaussian
nrounds <- 2000 # maximum number of boosting iterations (i.e., number of trees built sequentially); more rounds = potentially better learning, but also greater risk of overfitting

19.8.5.4 Identify Optimal Tuning Parameters

For identifying the optimal tuning parameters for boosting, we partition the training data into inner training data and validation data. We randomly split the training data into 80% inner training data and 20% held-out validation data. We then use the mean absolute error as our index of prediction accuracy on the held-out validation data.
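
For reference, the mean absolute error over the \(n\) held-out validation observations is:

\[
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\]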

Code
# Partition training data into inner training data and validation data
ntrain_qb <- dim(data_train_qb_matrix)[1]

set.seed(52242)
valid_tune_idx_qb <- sample.int(ntrain_qb, as.integer(0.2*ntrain_qb)) # validation set

folds_qb <- list(valid_tune_idx_qb)

# Specify parameter grid, gp_model, and gpb.Dataset
param_grid_qb <- list(
  "learning_rate" = c(0.2, 0.1, 0.05, 0.01), # the step size used when updating predictions after each boosting round (high values make big updates, which can speed up learning but risk overshooting; low values are usually more accurate but require more rounds)
  "max_depth" = c(3, 5, 7), # maximum depth (levels) of each decision tree; deeper trees capture more complex patterns and interactions but risk overfitting; shallower trees tend to generalize better
  "min_data_in_leaf" = c(10, 50, 100), # minimum number of training examples in a leaf node; higher values = more regularization (simpler trees)
  "lambda_l2" = c(0, 1, 5)) # L2 regularization penalty for large weights in tree splits; adds a "cost" for complexity; helps prevent overfitting by shrinking the contribution of each tree

other_params_qb <- list(
  num_leaves = 2^6) # maximum number of leaves per tree; controls the maximum complexity of each tree (along with max_depth); more leaves = more expressive models, but can overfit if min_data_in_leaf is too small; num_leaves must be consistent with max_depth, because deeper trees naturally support more leaves; max is: 2^n, where n is the largest max_depth

gp_model_qb <- gpboost::GPModel(
  group_data = data_train_qb_matrix[,"gsis_id"],
  likelihood = model_likelihood,
  group_rand_coef_data = cbind(
    data_train_qb_matrix[,"ageCentered20"],
    data_train_qb_matrix[,"ageCentered20Quadratic"]),
  ind_effect_group_rand_coef = c(1,1))

gp_data_qb <- gpboost::gpb.Dataset(
  data = data_train_qb_matrix[,pred_vars_qb],
  categorical_feature = pred_vars_qb_categorical,
  label = data_train_qb_matrix[,"fantasyPoints_lag"]) # could instead use mean-centered variable (fantasyPointsMC_lag) and add mean back afterward

# Find optimal tuning parameters
opt_params_qb <- gpboost::gpb.grid.search.tune.parameters(
  param_grid = param_grid_qb,
  params = other_params_qb,
  num_try_random = NULL,
  folds = folds_qb,
  data = gp_data_qb,
  gp_model = gp_model_qb,
  nrounds = nrounds,
  early_stopping_rounds = 50, # stops training early if the model hasn’t improved on the validation set in 50 rounds; prevents overfitting and saves time
  verbose_eval = 1,
  metric = "mae")
Error in fd$booster$update(fobj = fobj): [GPBoost] [Fatal] NaN occured in gradient wrt covariance / auxiliary parameter number 1 (counting starts at 1, total nb. par. = 4) 
Code
opt_params_qb
Error: object 'opt_params_qb' not found

We use the optimal parameters identified during tuning. We use a low learning rate (0.1) to avoid overfitting. We set some light regularization (lambda_l2) for better generalization. We also set the maximum tree depth (max_depth) at 5 to capture complex (up to 5-way) interactions, and set the maximum number of terminal nodes (num_leaves) per tree at \(2^6\) (64)—though the effective number of leaves cannot exceed \(2^{\text{max depth}} = 2^5 = 32\). We set the minimum number of samples in any leaf (min_data_in_leaf) to 100.

19.8.5.5 Specify Model and Tuning Parameters

Code
gp_model_qb <- gpboost::GPModel(
  group_data = data_train_qb_matrix[,"gsis_id"],
  likelihood = model_likelihood,
  group_rand_coef_data = cbind(
    data_train_qb_matrix[,"ageCentered20"],
    data_train_qb_matrix[,"ageCentered20Quadratic"]),
  ind_effect_group_rand_coef = c(1,1))

gp_data_qb <- gpboost::gpb.Dataset(
  data = data_train_qb_matrix[,pred_vars_qb],
  categorical_feature = pred_vars_qb_categorical,
  label = data_train_qb_matrix[,"fantasyPoints_lag"])

params_qb <- list(
  learning_rate = 0.1,
  max_depth = 5,
  min_data_in_leaf = 100,
  lambda_l2 = 5,
  num_leaves = 2^6,
  num_threads = num_cores)

nrounds_qb <- 84 # identify optimal number of trees through iteration and cross-validation

#gp_model_qb$set_optim_params(params = list(optimizer_cov = "nelder_mead")) # to speed up model estimation

19.8.5.6 Fit Model

Code
gp_model_fit_qb <- gpboost::gpb.train(
  data = gp_data_qb,
  gp_model = gp_model_qb,
  nrounds = nrounds_qb,
  params = params_qb) # verbose = 0
[GPBoost] [Info] Total Bins 8709
[GPBoost] [Info] Number of data points in the train set: 1582, number of used features: 73
[GPBoost] [Info] [GPBoost with gamma likelihood]: initscore=4.800459
[GPBoost] [Info] Start training from score 4.800459

19.8.5.7 Model Results

Code
summary(gp_model_qb) # estimated random effects model
=====================================================
Model summary:
Nb. observations: 1582
Nb. groups: 316 (Group_1)
Covariance parameters (random effects):
                       Param.
Group_1                     0
Group_1_rand_coef_nb_1      0
Group_1_rand_coef_nb_2      0
-----------------------------------------------------
Additional parameters:
      Param.
shape 0.6343
=====================================================
Code
gp_model_qb_importance <- gpboost::gpb.importance(gp_model_fit_qb)
gp_model_qb_importance
Code
gpboost::gpb.plot.importance(gp_model_qb_importance)
Figure 19.8: Importance of Features (Predictors) in Tree Boosting Machine Learning Model.

19.8.5.8 Evaluate Accuracy of Model on Test Data

Code
# Test Model on Test Data
pred_test_qb <- predict(
  gp_model_fit_qb,
  data = data_test_qb_matrix[,pred_vars_qb],
  group_data_pred = data_test_qb_matrix[,"gsis_id"],
  group_rand_coef_data_pred = cbind(
    data_test_qb_matrix[,"ageCentered20"],
    data_test_qb_matrix[,"ageCentered20Quadratic"]),
  predict_var = FALSE,
  pred_latent = FALSE)

y_pred_test_qb <- pred_test_qb[["response_mean"]] # if outcome is mean-centered, add mean(data_train_qb_matrix[,"fantasyPoints_lag"])

predictedVsActual <- data.frame(
  predictedPoints = y_pred_test_qb,
  actualPoints = data_test_qb_matrix[,"fantasyPoints_lag"]
)

predictedVsActual
Code
petersenlab::accuracyOverall(
  predicted = predictedVsActual$predictedPoints,
  actual = predictedVsActual$actualPoints,
  dropUndefined = TRUE
)

Figure 19.9 depicts the predicted versus actual fantasy points for the model on the test data.

Code
# Calculate combined range for axes
axis_limits <- range(c(predictedVsActual$predictedPoints, predictedVsActual$actualPoints), na.rm = TRUE)

ggplot(
  predictedVsActual,
  aes(
    x = predictedPoints,
    y = actualPoints)) +
  geom_point(
    size = 2,
    alpha = 0.6) +
  geom_abline(
    slope = 1,
    intercept = 0,
    color = "blue",
    linetype = "dashed") +
  coord_equal(
    xlim = axis_limits,
    ylim = axis_limits) +
  labs(
    title = "Predicted vs Actual Fantasy Points (Test Data)",
    x = "Predicted Fantasy Points",
    y = "Actual Fantasy Points"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Figure 19.9: Predicted Versus Actual Fantasy Points for Tree-Boosting Model.

19.8.5.9 Generate Predictions for Next Season

Code
# Generate model predictions for next season
pred_nextYear_qb <- predict(
  gp_model_fit_qb,
  data = newData_qb_matrix[,pred_vars_qb],
  group_data_pred = newData_qb_matrix[,"gsis_id"],
  group_rand_coef_data_pred = cbind(
    newData_qb_matrix[,"ageCentered20"],
    newData_qb_matrix[,"ageCentered20Quadratic"]),
  predict_var = FALSE,
  pred_latent = FALSE)

newData_qb$fantasyPoints_lag <- pred_nextYear_qb$response_mean

# Merge with player names
newData_qb <- left_join(
  newData_qb,
  nfl_playerIDs %>% select(gsis_id, name),
  by = "gsis_id"
)

newData_qb %>% 
  arrange(-fantasyPoints_lag) %>% 
  select(name, fantasyPoints_lag, fantasyPoints)

19.9 Summarizing the Machine Learning Models

We examined multiple machine learning models that included many predictor variables to identify the model that generates the most accurate prediction of Quarterbacks’ future fantasy points. We also compared them to simpler models, including a regression model with one predictor and a regression model with multiple predictors, to determine the incremental validity of the machine learning models above and beyond the simpler models. A summary of the accuracy of the machine learning models is in Table 19.1.

Table 19.1: Summary of Machine Learning Models.
Model \(R^2\) Mean Error Mean Absolute Error
Regression with One Predictor 0.44 7.91 57.08
Regression with Multiple Predictors 0.38 9.49 61.38
LASSO 0.40 9.52 60.52
Ridge Regression 0.42 8.71 59.65
Elastic Net 0.42 8.71 59.65
Random Forest 0.43 10.87 58.79
Tree-Boosting 0.44 -10.05 56.41

The most accurate models in terms of discrimination (\(R^2\)) had an \(R^2\) of 0.44, which was achieved using one of two approaches—a tree-boosting model with many predictors or a regression model with one predictor: the player’s fantasy points in the previous season. The regression model with one predictor was more accurate (in terms of \(R^2\)) than a regression model with multiple predictors and the other machine learning models that leveraged multiple predictors. This suggests that the other predictors examined did not have incremental validity—they did not show utility above and beyond a player’s fantasy points in predicting their future fantasy points. If anything, considering the additional predictor variables showed decremental validity—the inclusion of the additional variables worsened prediction, possibly by weakening the predictive association of the previous season’s fantasy points and misattributing its predictive effect to other, related predictors in the model. This is consistent with the common finding that simple models (and parsimony) are often preferable. As an example, Youngstrom et al. (2018) found that simple models performed as well as, and in some cases better than, LASSO models in classifying bipolar disorder.

The most accurate model in terms of calibration (mean absolute error) was the tree-boosting model, which had a mean absolute error of 56.41 and was just slightly more accurate than the regression model with one predictor. However, even in the most accurate model, the predicted values were, on average, over 56 points away from the actual values, suggesting that the model-predicted values were not particularly accurate. In sum, the regression model with one predictor was the most accurate in terms of differentiating between players (\(R^2\)), whereas the tree-boosting model was the most accurate in terms of how close the predicted values were to the actual values.

All models explained less than half of the variance in players’ future fantasy points. Thus, there remains considerable variance to be explained. The considerable unexplained variance suggests that we are missing key variables that are important predictors of future fantasy performance. It is also consistent with the conclusion from many other domains in psychology that human behavior is challenging to predict (Petersen, 2025b). We could likely improve our predictive accuracy by including projections, but it is unclear what information projections are or are not based on. For instance, if projections already account for the prior season’s statistics and fantasy points, then including both may not buy you much in terms of increased predictive accuracy.

19.10 Conclusion

Machine learning attempts to maximize prediction accuracy. There are various types of machine learning models. Supervised learning involves learning from data where the correct classification or outcome is known. Unsupervised learning involves learning from data without known classifications. Machine learning can incorporate many predictor variables, so it is important to perform cross-validation to avoid overfitting. We examined multiple machine learning models to identify the model that generates the most accurate prediction of Quarterbacks’ future fantasy points. All models explained less than half of the variance in players’ fantasy points. The most accurate model, in terms of discrimination (\(R^2\)), was achieved using regression with one predictor: the player’s fantasy points in the previous season. The most accurate model in terms of calibration (mean absolute error) was the tree-boosting model, which accounted for the longitudinal nature of the data. The considerable unexplained variance suggests that we are missing key variables that are important predictors of future fantasy performance.

19.11 Session Info

Code
sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] glmnet_4.1-10      Matrix_1.7-3       knitr_1.50         lubridate_1.9.4   
 [5] forcats_1.0.0      stringr_1.5.1      readr_2.1.5        tidyverse_2.0.0   
 [9] effectsize_1.0.1   gpboost_1.6.1      R6_2.6.1           LongituRF_0.9     
[13] yardstick_1.3.2    workflowsets_1.1.1 workflows_1.2.0    tune_1.3.0        
[17] tidyr_1.3.1        tibble_3.3.0       rsample_1.3.0      recipes_1.3.1     
[21] purrr_1.1.0        parsnip_1.3.2      modeldata_1.4.0    infer_1.0.9       
[25] ggplot2_3.5.2      dplyr_1.1.4        dials_1.4.0        scales_1.4.0      
[29] broom_1.0.8        tidymodels_1.3.0   powerjoin_0.1.0    missRanger_2.6.1  
[33] future_1.58.0      petersenlab_1.1.7 

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3   shape_1.4.6.1        rstudioapi_0.17.1   
  [4] jsonlite_2.0.0       datawizard_1.2.0     magrittr_2.0.3      
  [7] TH.data_1.1-3        estimability_1.5.1   farver_2.1.2        
 [10] nloptr_2.2.1         rmarkdown_2.29       vctrs_0.6.5         
 [13] minqa_1.2.8          base64enc_0.1-3      sparsevctrs_0.3.4   
 [16] htmltools_0.5.8.1    Formula_1.2-5        parallelly_1.45.0   
 [19] htmlwidgets_1.6.4    sandwich_3.1-1       plyr_1.8.9          
 [22] zoo_1.8-14           emmeans_1.11.2       lifecycle_1.0.4     
 [25] iterators_1.0.14     pkgconfig_2.0.3      fastmap_1.2.0       
 [28] rbibutils_2.3        digest_0.6.37        colorspace_2.1-1    
 [31] furrr_0.3.1          Hmisc_5.2-3          labeling_0.4.3      
 [34] latex2exp_0.9.6      randomForest_4.7-1.2 RJSONIO_2.0.0       
 [37] timechange_0.3.0     compiler_4.5.1       withr_3.0.2         
 [40] htmlTable_2.4.3      backports_1.5.0      DBI_1.2.3           
 [43] psych_2.5.6          MASS_7.3-65          lava_1.8.1          
 [46] tools_4.5.1          pbivnorm_0.6.0       ranger_0.17.0       
 [49] foreign_0.8-90       future.apply_1.20.0  nnet_7.3-20         
 [52] doFuture_1.1.2       glue_1.8.0           quadprog_1.5-8      
 [55] nlme_3.1-168         grid_4.5.1           checkmate_2.3.2     
 [58] cluster_2.1.8.1      reshape2_1.4.4       generics_0.1.4      
 [61] gtable_0.3.6         tzdb_0.5.0           class_7.3-23        
 [64] hms_1.1.3            data.table_1.17.8    foreach_1.5.2       
 [67] pillar_1.11.0        mitools_2.4          splines_4.5.1       
 [70] lhs_1.2.0            lattice_0.22-7       FNN_1.1.4.1         
 [73] survival_3.8-3       tidyselect_1.2.1     mix_1.0-13          
 [76] reformulas_0.4.1     gridExtra_2.3        stats4_4.5.1        
 [79] xfun_0.52            hardhat_1.4.1        timeDate_4041.110   
 [82] stringi_1.8.7        DiceDesign_1.10      yaml_2.3.10         
 [85] boot_1.3-31          evaluate_1.0.4       codetools_0.2-20    
 [88] cli_3.6.5            rpart_4.1.24         xtable_1.8-4        
 [91] parameters_0.27.0    Rdpack_2.6.4         lavaan_0.6-19       
 [94] Rcpp_1.1.0           globals_0.18.0       coda_0.19-4.1       
 [97] parallel_4.5.1       gower_1.0.2          bayestestR_0.16.1   
[100] GPfit_1.0-9          lme4_1.1-37          listenv_0.9.1       
[103] viridisLite_0.4.2    mvtnorm_1.3-3        ipred_0.9-15        
[106] prodlim_2025.04.28   insight_1.3.1        rlang_1.1.6         
[109] multcomp_1.4-28      mnormt_2.1.1        
