19 Machine Learning
This chapter provides an overview of machine learning.
19.1 Getting Started
19.1.1 Load Packages
19.1.2 Load Data
Code
# Downloaded Data - Processed
load(file = "./data/nfl_players.RData")
load(file = "./data/nfl_teams.RData")
load(file = "./data/nfl_rosters.RData")
load(file = "./data/nfl_rosters_weekly.RData")
load(file = "./data/nfl_schedules.RData")
load(file = "./data/nfl_combine.RData")
load(file = "./data/nfl_draftPicks.RData")
load(file = "./data/nfl_depthCharts.RData")
#load(file = "./data/nfl_pbp.RData")
#load(file = "./data/nfl_4thdown.RData")
#load(file = "./data/nfl_participation.RData")
#load(file = "./data/nfl_actualFantasyPoints_weekly.RData")
load(file = "./data/nfl_injuries.RData")
load(file = "./data/nfl_snapCounts.RData")
load(file = "./data/nfl_espnQBR_seasonal.RData")
load(file = "./data/nfl_espnQBR_weekly.RData")
load(file = "./data/nfl_nextGenStats_weekly.RData")
load(file = "./data/nfl_advancedStatsPFR_seasonal.RData")
load(file = "./data/nfl_advancedStatsPFR_weekly.RData")
load(file = "./data/nfl_playerContracts.RData")
load(file = "./data/nfl_ftnCharting.RData")
load(file = "./data/nfl_playerIDs.RData")
load(file = "./data/nfl_rankings_draft.RData")
load(file = "./data/nfl_rankings_weekly.RData")
load(file = "./data/nfl_expectedFantasyPoints_weekly.RData")
#load(file = "./data/nfl_expectedFantasyPoints_pbp.RData")
# Calculated Data - Processed
load(file = "./data/nfl_actualStats_career.RData")
load(file = "./data/nfl_actualStats_seasonal.RData")
load(file = "./data/player_stats_weekly.RData")
load(file = "./data/player_stats_seasonal.RData")
19.1.3 Specify Options
19.2 Overview of Machine Learning
Machine learning takes us away from focusing on causal inference. Machine learning does not care about which processes are causal—i.e., which processes influence the outcome. Instead, machine learning cares about prediction—it cares about a predictor variable to the extent that it increases predictive accuracy regardless of whether it is causally related to the outcome.
Machine learning can be useful for leveraging big data and many predictor variables to develop predictive models with greater accuracy. However, many machine learning techniques are black boxes—it is often unclear how or why certain predictions are made, which can make it difficult to interpret the model’s decisions and understand the underlying relationships between variables. Machine learning tends to be a data-driven, atheoretical technique. This can result in overfitting. Thus, when estimating machine learning models, it is common to keep a hold-out sample for use in cross-validation to evaluate the extent of shrinkage of model coefficients. The data that the model is trained on is known as the “training data”. The data that the model was not trained on but is then independently tested on—i.e., the hold-out sample—is the “test data”. Shrinkage occurs when predictor variables explain some random error variance in the original model. When the model is applied to an independent sample (i.e., the test data), the predictive model will likely not perform quite as well, and the regression coefficients will tend to get smaller (i.e., shrink).
If the test data were collected as part of the same processes as the original data and were merely held out for purposes of analysis, this is called internal cross-validation. If the test data were collected separately from the original data used to train the model, this is called external cross-validation.
Most machine learning methods were developed with cross-sectional data in mind. That is, they assume that each person has only one observation on the outcome variable. However, with longitudinal data, each person has multiple observations on the outcome variable.
When performing machine learning, various approaches may help address this:
- transform data from long to wide form, so that each person has only one row
- when designing the training and test sets, keep all measurements from the same person in the same data object (either the training or test set); do not have some measurements from a given person in the training set and other measurements from the same person in the test set (see the sketch after this list)
- use a machine learning approach that accounts for the clustered/nested nature of the data
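As an illustration of the second point, here is a minimal sketch of a player-level train/test split using the group_initial_split() function of the rsample package; the object names (playerSplit, trainData, testData), the grouping variable, and the split proportion are illustrative assumptions, not the book's own code.
Code
# Split at the player level so that all rows (seasons/weeks) from a given
# player end up in either the training set or the test set, never both
set.seed(52242)
playerSplit <- rsample::group_initial_split(
  player_stats_seasonal, # one row per player-season
  group = player_id,     # keep all of a player's rows together
  prop = 0.8)            # roughly 80% of the data in the training set
trainData <- rsample::training(playerSplit)
testData <- rsample::testing(playerSplit)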
19.3 Types of Machine Learning
There are many approaches to machine learning. This chapter discusses several key ones:
- supervised learning
- continuous outcome (i.e., regression)
- categorical outcome (i.e., classification)
- logistic regression
- support vector machine
- random forest
- boosting
- unsupervised learning
- semi-supervised learning
- reinforcement learning
- deep learning
- ensemble
Ensemble machine learning methods combine multiple machine learning approaches with the goal that combining multiple approaches might lead to more accurate predictions than any one method might be able to achieve on its own.
19.3.1 Supervised Learning
Supervised learning involves learning from data where the correct classification or outcome is known. For instance, predicting how many points a player will score is a supervised learning task, because there is a ground truth—the actual number of points scored—that can be used to train and evaluate the model.
Unlike linear and logistic regression, various machine learning techniques can handle multicollinearity, including LASSO regression, ridge regression, and elastic net regression. Least absolute shrinkage and selection operator (LASSO) regression performs selection of which predictor variables to keep in the model by shrinking some coefficients to zero, effectively removing them from the model. Ridge regression shrinks the coefficients of predictor variables toward zero, but not to zero, so it does not perform selection of which predictor variables to retain; this allows it to yield stable estimates for multiple correlated predictor variables in the context of multicollinearity. Elastic net involves a combination of LASSO and ridge regression; it performs selection of which predictor variables to keep by shrinking the coefficients of some predictor variables to zero (like LASSO, for variable selection), and it shrinks the coefficients of some predictor variables toward zero (like ridge, for handling multicollinearity among correlated predictors).
Unless interactions or nonlinear terms are specified, linear, logistic, LASSO, ridge, and elastic net regression assume additive and linear associations between the predictors and outcome. That is, they do not automatically account for interactions among the predictor variables or for nonlinear associations between the predictor variables and the outcome variable (unless interaction terms or nonlinear transformations are explicitly included). By contrast, random forests and tree boosting methods automatically account for interactions and nonlinear associations between predictors and the outcome variable. These models recursively partition the data in ways that capture complex patterns without the need to manually specify interaction or polynomial terms.
19.3.2 Unsupervised Learning
Unsupervised learning involves learning from data without known classifications. Unsupervised learning is used to discover hidden patterns, groupings, or structures in the data. For instance, if we want to identify different subtypes of Wide Receivers based on their playing style or performance metrics, or uncover underlying dimensions in a large dataset, we would use an unsupervised learning approach.
We describe cluster analysis in Chapter 21. We describe principal component analysis in Chapter 23.
19.3.3 Semi-supervised Learning
Semi-supervised learning combines supervised learning and unsupervised learning by training the model on some data for which the classification is known and some data for which the classification is not known.
19.3.4 Reinforcement Learning
Reinforcement learning involves an agent learning to make decisions by interacting with the environment. Through trial and error, the agent receives feedback in the form of rewards or penalties and learns a strategy that maximizes the cumulative reward over time.
19.4 Data Processing
Several data processing steps are necessary to get the data in the form necessary for machine learning.
19.4.1 Prepare Data for Merging
First, we apply several processing steps. We subset to the positions and variables of interest. We also rename columns and change variable types so that column names and types match across objects, which will be important later when we merge the data.
Code
# Prepare data for merging
#nfl_actualFantasyPoints_player_weekly <- nfl_actualFantasyPoints_player_weekly %>%
# rename(gsis_id = player_id)
#
#nfl_actualFantasyPoints_player_seasonal <- nfl_actualFantasyPoints_player_seasonal %>%
# rename(gsis_id = player_id)
player_stats_seasonal_offense <- player_stats_seasonal %>%
filter(position_group %in% c("QB","RB","WR","TE")) %>%
rename(gsis_id = player_id)
player_stats_weekly_offense <- player_stats_weekly %>%
filter(position_group %in% c("QB","RB","WR","TE")) %>%
rename(gsis_id = player_id)
nfl_expectedFantasyPoints_weekly <- nfl_expectedFantasyPoints_weekly %>%
rename(gsis_id = player_id)
## Rename other variables to ensure common names
## Ensure variables with the same name have the same type
nfl_players <- nfl_players %>%
mutate(
birth_date = as.Date(birth_date),
jersey_number = as.character(jersey_number),
gsis_it_id = as.character(gsis_it_id),
years_of_experience = as.integer(years_of_experience))
player_stats_seasonal_offense <- player_stats_seasonal_offense %>%
mutate(
birth_date = as.Date(birth_date),
jersey_number = as.character(jersey_number),
gsis_it_id = as.character(gsis_it_id))
nfl_rosters <- nfl_rosters %>%
mutate(
draft_number = as.integer(draft_number))
nfl_rosters_weekly <- nfl_rosters_weekly %>%
mutate(
draft_number = as.integer(draft_number))
nfl_depthCharts <- nfl_depthCharts %>%
mutate(
season = as.integer(season))
nfl_expectedFantasyPoints_weekly <- nfl_expectedFantasyPoints_weekly %>%
mutate(
season = as.integer(season),
receptions = as.integer(receptions)) %>%
distinct(gsis_id, season, week, .keep_all = TRUE) # drop duplicated rows
## Rename variables
nfl_draftPicks <- nfl_draftPicks %>%
rename(
games_career = games,
pass_completions_career = pass_completions,
pass_attempts_career = pass_attempts,
pass_yards_career = pass_yards,
pass_tds_career = pass_tds,
pass_ints_career = pass_ints,
rush_atts_career = rush_atts,
rush_yards_career = rush_yards,
rush_tds_career = rush_tds,
receptions_career = receptions,
rec_yards_career = rec_yards,
rec_tds_career = rec_tds,
def_solo_tackles_career = def_solo_tackles,
def_ints_career = def_ints,
def_sacks_career = def_sacks
)
## Subset variables
nfl_expectedFantasyPoints_weekly <- nfl_expectedFantasyPoints_weekly %>%
select(gsis_id:position, contains("_exp"), contains("_diff"), contains("_team")) #drop "raw stats" variables (e.g., rec_yards_gained) so they don't get coalesced with actual stats
# Check duplicate ids
player_stats_seasonal_offense %>%
group_by(gsis_id, season) %>%
filter(n() > 1) %>%
head()
Code
Below, we identify shared variable names across objects to be merged to make sure we account for them in merging:
[1] "gsis_id" "position"
[1] 21360
[1] 2855
[1] "gsis_id" "season" "team" "age"
[1] 14859
[1] 10395
[1] 14858
[1] 10395
[1] 14859
[1] 10325
[1] "gsis_id" "season" "week" "position" "full_name"
[1] 845134
[1] 100272
[1] 841942
[1] 100272
[1] 845101
[1] 97815
[1] 845118
[1] 97815
19.4.2 Merge Data
To perform machine learning, we need all of the predictor variables and the outcome variable in the same data file. Thus, we must merge data files. To merge data, we use the powerjoin package (Fabri, 2022), which allows coalescing variables with the same name from two different objects. We specify coalesce_xy, which means that—for variables that have the same name across both objects—it keeps the value from object 1 (if present); if not, it keeps the value from object 2. We first merge variables from objects that have the same structure—player data (i.e., id form), seasonal data (i.e., id-season form), or weekly data (i.e., id-season-week form).
Code
# Create lists of objects to merge, depending on data structure: id; or id-season; or id-season-week
playerListToMerge <- list(
nfl_players %>% filter(!is.na(gsis_id)),
nfl_draftPicks %>% filter(!is.na(gsis_id)) %>% select(-season)
)
playerSeasonListToMerge <- list(
player_stats_seasonal_offense %>% filter(!is.na(gsis_id), !is.na(season)),
nfl_advancedStatsPFR_seasonal %>% filter(!is.na(gsis_id), !is.na(season))
)
playerSeasonWeekListToMerge <- list(
nfl_rosters_weekly %>% filter(!is.na(gsis_id), !is.na(season), !is.na(week)),
#nfl_actualStats_offense_weekly,
nfl_expectedFantasyPoints_weekly %>% filter(!is.na(gsis_id), !is.na(season), !is.na(week))
#nfl_advancedStatsPFR_weekly,
)
playerSeasonWeekPositionListToMerge <- list(
nfl_depthCharts %>% filter(!is.na(gsis_id), !is.na(season), !is.na(week))
)
# Merge data
playerMerged <- playerListToMerge %>%
reduce(
powerjoin::power_full_join,
by = c("gsis_id"),
conflict = coalesce_xy) # where the objects have the same variable name (e.g., position), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2
playerSeasonMerged <- playerSeasonListToMerge %>%
reduce(
powerjoin::power_full_join,
by = c("gsis_id","season"),
conflict = coalesce_xy) # where the objects have the same variable name (e.g., team), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2
playerSeasonWeekMerged <- playerSeasonWeekListToMerge %>%
reduce(
powerjoin::power_full_join,
by = c("gsis_id","season","week"),
conflict = coalesce_xy) # where the objects have the same variable name (e.g., position), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2
To prepare for merging player data with seasonal data, we identify shared variable names across the objects:
[1] "gsis_id" "position"
[3] "position_group" "first_name"
[5] "last_name" "esb_id"
[7] "display_name" "rookie_year"
[9] "college_conference" "current_team_id"
[11] "draft_club" "draft_number"
[13] "draftround" "entry_year"
[15] "football_name" "gsis_it_id"
[17] "headshot" "jersey_number"
[19] "short_name" "smart_id"
[21] "status" "status_description_abbr"
[23] "status_short_description" "uniform_number"
[25] "height" "weight"
[27] "college_name" "birth_date"
[29] "suffix" "years_of_experience"
[31] "pfr_player_name" "team"
[33] "age"
Then we merge the player data with the seasonal data:
Code
seasonalData <- powerjoin::power_full_join(
playerSeasonMerged,
playerMerged %>% select(-age, -years_of_experience, -team, -team_abbr, -team_seq, -current_team_id), # drop variables from id objects that change from year to year (and thus are not necessarily accurate for a given season)
by = "gsis_id",
conflict = coalesce_xy # where the objects have the same variable name (e.g., position), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2
) %>%
filter(!is.na(season)) %>%
select(gsis_id, season, player_display_name, position, team, games, everything())
To prepare for merging player and seasonal data with weekly data, we identify shared variable names across the objects:
[1] "gsis_id" "season"
[3] "week" "team"
[5] "jersey_number" "status"
[7] "first_name" "last_name"
[9] "birth_date" "height"
[11] "weight" "college"
[13] "pfr_id" "headshot_url"
[15] "status_description_abbr" "football_name"
[17] "esb_id" "gsis_it_id"
[19] "smart_id" "entry_year"
[21] "rookie_year" "draft_club"
[23] "draft_number" "position"
Then we merge the player and seasonal data with the weekly data:
Code
seasonalAndWeeklyData <- powerjoin::power_full_join(
playerSeasonWeekMerged,
seasonalData,
by = c("gsis_id","season"),
conflict = coalesce_xy # where the objects have the same variable name (e.g., position), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2
) %>%
filter(!is.na(week)) %>%
select(gsis_id, season, week, full_name, position, team, everything())
19.4.3 Additional Processing
For purposes of machine learning, we set all character and logical columns to factors.
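The code for this step is collapsed above; a minimal sketch, assuming the conversion is applied to the merged seasonalData object, might look like the following.
Code
# Convert character and logical columns to factors for machine learning
seasonalData <- seasonalData %>%
  mutate(across(where(is.character) | where(is.logical), as.factor))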
19.4.4 Fill in Missing Data for Static Variables
For variables that are not expected to change, such as a player’s name and position, we fill in missing values by using a player’s value on those variables from other rows in the data.
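The code for this step is collapsed above; a minimal sketch using the fill() function of the tidyr package (the specific variables filled here are illustrative) might look like the following.
Code
# For static variables, fill in missing values from a player's other rows;
# grouping by player prevents values from being filled across players
seasonalData <- seasonalData %>%
  group_by(gsis_id) %>%
  arrange(season, .by_group = TRUE) %>%
  tidyr::fill(position, height, weight, .direction = "downup") %>%
  ungroup()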
19.4.5 Create New Data Object for Merging with Later Predictions
We create a new data object that contains the latest seasonal data, for merging with later predictions.
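The code for this step is collapsed above; a minimal sketch, in which the object name seasonalData_latestSeason is hypothetical, might look like the following.
Code
# Keep only the most recent season of seasonal data, for merging with the
# model predictions generated later in the chapter
seasonalData_latestSeason <- seasonalData %>%
  filter(season == max(season, na.rm = TRUE))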
19.4.6 Lag Fantasy Points
To develop a machine learning model that uses a player’s performance metrics in a given season to predict the player’s fantasy points in the subsequent season, we need to include the player’s fantasy points from the subsequent season in the same row as the previous season’s performance metrics. Thus, we need to create a lagged variable for fantasy points. That way, 2024 fantasy points are in the same row as 2023 performance metrics, 2023 fantasy points are in the same row as 2022 performance metrics, and so on. We call this the lagged fantasy points variable (fantasyPoints_lag). We also retain the original same-year fantasy points variable (fantasyPoints) so it can be used as a predictor of a player’s subsequent-year fantasy points.
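The code for this step is collapsed above; a minimal sketch, assuming the seasonal data are arranged by season within player and using dplyr's lead() function to pull the subsequent season's points into the current row, might look like the following.
Code
# For each player, place next season's fantasy points (the outcome) in the
# same row as the current season's performance metrics (the predictors)
seasonalData_lag <- seasonalData %>%
  arrange(gsis_id, season) %>%
  group_by(gsis_id) %>%
  mutate(fantasyPoints_lag = lead(fantasyPoints, n = 1)) %>% # simple sketch; assumes consecutive seasons per player
  ungroup()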
19.4.7 Subset to Predictor Variables and Outcome Variable
Then, we drop variables that we do not want to include in the model as predictors or as the outcome variable. Thus, the variables remaining in the object are our predictor and outcome variables.
Code
dropVars <- c(
"birth_date", "loaded", "full_name", "player_name", "player_display_name", "display_name", "suffix", "headshot_url", "player", "pos",
"espn_id", "sportradar_id", "yahoo_id", "rotowire_id", "pff_id", "fantasy_data_id", "sleeper_id", "pfr_id",
"pfr_player_id", "cfb_player_id", "pfr_player_name", "esb_id", "gsis_it_id", "smart_id",
"college", "college_name", "team_abbr", "current_team_id", "college_conference", "draft_club", "status_description_abbr",
"status_short_description", "short_name", "headshot", "uniform_number", "jersey_number", "first_name", "last_name",
"football_name", "team")
seasonalData_lag_subset <- seasonalData_lag %>%
dplyr::select(-any_of(dropVars))
19.4.8 Separate by Position
Then, we separate the objects by position, so we can develop different machine learning models for each position.
Code
seasonalData_lag_subsetQB <- seasonalData_lag_subset %>%
filter(position == "QB") %>%
select(
gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic,
height, weight, rookie_year, draft_number,
fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag,
completions:rushing_2pt_conversions, special_teams_tds, contains(".pass"), contains(".rush"))
seasonalData_lag_subsetRB <- seasonalData_lag_subset %>%
filter(position == "RB") %>%
select(
gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic,
height, weight, rookie_year, draft_number,
fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag,
carries:special_teams_tds, contains(".rush"), contains(".rec"))
seasonalData_lag_subsetWR <- seasonalData_lag_subset %>%
filter(position == "WR") %>%
select(
gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic,
height, weight, rookie_year, draft_number,
fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag,
carries:special_teams_tds, contains(".rush"), contains(".rec"))
seasonalData_lag_subsetTE <- seasonalData_lag_subset %>%
filter(position == "TE") %>%
select(
gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic,
height, weight, rookie_year, draft_number,
fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag,
carries:special_teams_tds, contains(".rush"), contains(".rec"))
19.4.9 Split into Test and Training Data
Because machine learning can leverage many predictors, it is at high risk of overfitting—explaining error variance that would not generalize to new data, such as data for new players or future seasons. Thus, it is important to develop and tune the machine learning model so as not to overfit the model. In machine learning, it is common to use cross-validation where we train the model on a subset of the observations, and we evaluate how well the model generalizes to unseen (e.g., “hold-out”) observations. Then, we select the model parameters by how well the model generalizes to the hold-out data, so we are selecting a model that maximizes accuracy and generalizability (i.e., parsimony).
For internal cross-validation, it is common to divide the data into three subsets:
- training data
- validation data
- test data
The training set is used to fit the model. It is usually the largest portion of the data. We fit various models to the training set based on which parameters we want to evaluate (e.g., how many trees to use in a tree-boosting model).
The models fit with the training set are then evaluated using the unseen observations in the validation set. The validation set is used to tune the model parameters and prevent overfitting. We select the model parameters that yield the greatest accuracy in the validation set. In k-fold cross-validation, the validation set rotates across folds, thus replacing the need for a separate validation set.
The test set is used after model training and tuning to evaluate the model’s generalizability to unseen data.
Below, we split the data into test and training data. Our ultimate goal is to predict next year’s fantasy points. However, to do that effectively, we must first develop a model whose accuracy we can evaluate against historical fantasy points (because we do not yet know how many points players will score in the future). We want to include all current/active players in our training data, so that the model can draw on their prior data when we later predict their future performance. Thus, we use retired players as our hold-out (test) data. We split our data into 80% training data and 20% testing data. The 20% testing data thus includes only retired players, though not all retired players are in the testing data.
Then, for the analysis, we can either a) use rotating folds (as is the case for k-fold and leave-one-out [LOO] cross-validation), for which a separate validation set (apart from the training set) is not needed, as we do in Section 19.6, or b) subdivide the training set into an inner training set and a validation set, as we do in Section 19.8.5.4.
Code
seasonalData_lag_qb_all <- seasonalData_lag_subsetQB
seasonalData_lag_rb_all <- seasonalData_lag_subsetRB
seasonalData_lag_wr_all <- seasonalData_lag_subsetWR
seasonalData_lag_te_all <- seasonalData_lag_subsetTE
set.seed(52242) # for reproducibility (to keep the same train/holdout players)
activeQBs <- unique(seasonalData_lag_qb_all$gsis_id[which(seasonalData_lag_qb_all$season == max(seasonalData_lag_qb_all$season, na.rm = TRUE))])
retiredQBs <- unique(seasonalData_lag_qb_all$gsis_id[which(seasonalData_lag_qb_all$gsis_id %ni% activeQBs)])
numQBs <- length(unique(seasonalData_lag_qb_all$gsis_id))
qbHoldoutIDs <- sample(retiredQBs, size = ceiling(.2 * numQBs)) # holdout 20% of players
activeRBs <- unique(seasonalData_lag_rb_all$gsis_id[which(seasonalData_lag_rb_all$season == max(seasonalData_lag_rb_all$season, na.rm = TRUE))])
retiredRBs <- unique(seasonalData_lag_rb_all$gsis_id[which(seasonalData_lag_rb_all$gsis_id %ni% activeRBs)])
numRBs <- length(unique(seasonalData_lag_rb_all$gsis_id))
rbHoldoutIDs <- sample(retiredRBs, size = ceiling(.2 * numRBs)) # holdout 20% of players
set.seed(52242) # for reproducibility (to keep the same train/holdout players); added here to prevent a downstream error with predict.missRanger() due to missingness; this suggests that an error can arise from including a player in the holdout sample who has missingness in particular variables; would be good to identify which player(s) in the holdout sample evoke that error to identify the kinds of missingness that yield the error
activeWRs <- unique(seasonalData_lag_wr_all$gsis_id[which(seasonalData_lag_wr_all$season == max(seasonalData_lag_wr_all$season, na.rm = TRUE))])
retiredWRs <- unique(seasonalData_lag_wr_all$gsis_id[which(seasonalData_lag_wr_all$gsis_id %ni% activeWRs)])
numWRs <- length(unique(seasonalData_lag_wr_all$gsis_id))
wrHoldoutIDs <- sample(retiredWRs, size = ceiling(.2 * numWRs)) # holdout 20% of players
activeTEs <- unique(seasonalData_lag_te_all$gsis_id[which(seasonalData_lag_te_all$season == max(seasonalData_lag_te_all$season, na.rm = TRUE))])
retiredTEs <- unique(seasonalData_lag_te_all$gsis_id[which(seasonalData_lag_te_all$gsis_id %ni% activeTEs)])
numTEs <- length(unique(seasonalData_lag_te_all$gsis_id))
teHoldoutIDs <- sample(retiredTEs, size = ceiling(.2 * numTEs)) # holdout 20% of players
seasonalData_lag_qb_train <- seasonalData_lag_qb_all %>%
filter(gsis_id %ni% qbHoldoutIDs)
seasonalData_lag_qb_test <- seasonalData_lag_qb_all %>%
filter(gsis_id %in% qbHoldoutIDs)
seasonalData_lag_rb_train <- seasonalData_lag_rb_all %>%
filter(gsis_id %ni% rbHoldoutIDs)
seasonalData_lag_rb_test <- seasonalData_lag_rb_all %>%
filter(gsis_id %in% rbHoldoutIDs)
seasonalData_lag_wr_train <- seasonalData_lag_wr_all %>%
filter(gsis_id %ni% wrHoldoutIDs)
seasonalData_lag_wr_test <- seasonalData_lag_wr_all %>%
filter(gsis_id %in% wrHoldoutIDs)
seasonalData_lag_te_train <- seasonalData_lag_te_all %>%
filter(gsis_id %ni% teHoldoutIDs)
seasonalData_lag_te_test <- seasonalData_lag_te_all %>%
filter(gsis_id %in% teHoldoutIDs)
19.4.10 Impute the Missing Data
Many of the machine learning approaches described in this chapter require no missing observations in order for a case to be included in the analysis. In this section, we demonstrate one approach to imputing missing data. Here is a vignette demonstrating how to impute missing data using missForest(): https://rpubs.com/lmorgan95/MissForest (archived at: https://perma.cc/6GB4-2E22). Below, we impute the training data (and all data) separately by position. We then use the imputed training data to make out-of-sample predictions to fill in the missing data for the testing data. We do not want to impute the training and testing data together, so that we can keep them separate for the purposes of cross-validation. However, we impute all data (training and test data together) for purposes of making out-of-sample predictions from the machine learning models to predict players’ performance next season (when actuals are not yet available for evaluating their accuracy). To impute data, we use the missRanger package (Mayer, 2024).
Note: the following code takes a while to run.
Code
Variables to impute: fantasy_points, fantasy_points_ppr, special_teams_tds, passing_epa, pacr, rushing_epa, fantasyPoints_lag, passing_cpoe, rookie_year, draft_number, gs, pass_attempts.pass, throwaways.pass, spikes.pass, drops.pass, bad_throws.pass, times_blitzed.pass, times_hurried.pass, times_hit.pass, times_pressured.pass, batted_balls.pass, on_tgt_throws.pass, rpo_plays.pass, rpo_yards.pass, rpo_pass_att.pass, rpo_pass_yards.pass, rpo_rush_att.pass, rpo_rush_yards.pass, pa_pass_att.pass, pa_pass_yards.pass, att.rush, yds.rush, td.rush, x1d.rush, ybc.rush, yac.rush, brk_tkl.rush, att_br.rush, drop_pct.pass, bad_throw_pct.pass, on_tgt_pct.pass, pressure_pct.pass, ybc_att.rush, yac_att.rush, pocket_time.pass
Variables used to impute: gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic, height, weight, rookie_year, draft_number, fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag, completions, attempts, passing_yards, passing_tds, passing_interceptions, sacks_suffered, sack_yards_lost, sack_fumbles, sack_fumbles_lost, passing_air_yards, passing_yards_after_catch, passing_first_downs, passing_epa, passing_cpoe, passing_2pt_conversions, pacr, carries, rushing_yards, rushing_tds, rushing_fumbles, rushing_fumbles_lost, rushing_first_downs, rushing_epa, rushing_2pt_conversions, special_teams_tds, pocket_time.pass, pass_attempts.pass, throwaways.pass, spikes.pass, drops.pass, bad_throws.pass, times_blitzed.pass, times_hurried.pass, times_hit.pass, times_pressured.pass, batted_balls.pass, on_tgt_throws.pass, rpo_plays.pass, rpo_yards.pass, rpo_pass_att.pass, rpo_pass_yards.pass, rpo_rush_att.pass, rpo_rush_yards.pass, pa_pass_att.pass, pa_pass_yards.pass, drop_pct.pass, bad_throw_pct.pass, on_tgt_pct.pass, pressure_pct.pass, ybc_att.rush, yac_att.rush, att.rush, yds.rush, td.rush, x1d.rush, ybc.rush, yac.rush, brk_tkl.rush, att_br.rush
(Per-variable out-of-bag imputation error estimates for iterations 1–3 not shown.)
missRanger object. Extract imputed data via $data
- best iteration: 2
- best average OOB imputation error: 0.2524825
Code
data_all_qb <- seasonalData_lag_qb_all_imp$data
data_all_qb$fantasyPointsMC_lag <- scale(data_all_qb$fantasyPoints_lag, scale = FALSE) # mean-centered
data_all_qb_matrix <- data_all_qb %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
newData_qb <- data_all_qb %>%
filter(season == max(season, na.rm = TRUE)) %>%
select(-fantasyPoints_lag, -fantasyPointsMC_lag)
newData_qb_matrix <- data_all_qb_matrix[
data_all_qb_matrix[, "season"] == max(data_all_qb_matrix[, "season"], na.rm = TRUE), # keep only rows with the most recent season
, # all columns
drop = FALSE]
dropCol_qb <- which(colnames(newData_qb_matrix) %in% c("fantasyPoints_lag","fantasyPointsMC_lag"))
newData_qb_matrix <- newData_qb_matrix[, -dropCol_qb, drop = FALSE]
seasonalData_lag_qb_train_imp <- missRanger::missRanger(
seasonalData_lag_qb_train,
pmm.k = 5,
verbose = 2,
seed = 52242,
keep_forests = TRUE)
Variables to impute: fantasy_points, fantasy_points_ppr, special_teams_tds, passing_epa, pacr, rushing_epa, fantasyPoints_lag, passing_cpoe, rookie_year, draft_number, gs, pass_attempts.pass, throwaways.pass, spikes.pass, drops.pass, bad_throws.pass, times_blitzed.pass, times_hurried.pass, times_hit.pass, times_pressured.pass, batted_balls.pass, on_tgt_throws.pass, rpo_plays.pass, rpo_yards.pass, rpo_pass_att.pass, rpo_pass_yards.pass, rpo_rush_att.pass, rpo_rush_yards.pass, pa_pass_att.pass, pa_pass_yards.pass, att.rush, yds.rush, td.rush, x1d.rush, ybc.rush, yac.rush, brk_tkl.rush, att_br.rush, drop_pct.pass, bad_throw_pct.pass, on_tgt_pct.pass, pressure_pct.pass, ybc_att.rush, yac_att.rush, pocket_time.pass
Variables used to impute: gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic, height, weight, rookie_year, draft_number, fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag, completions, attempts, passing_yards, passing_tds, passing_interceptions, sacks_suffered, sack_yards_lost, sack_fumbles, sack_fumbles_lost, passing_air_yards, passing_yards_after_catch, passing_first_downs, passing_epa, passing_cpoe, passing_2pt_conversions, pacr, carries, rushing_yards, rushing_tds, rushing_fumbles, rushing_fumbles_lost, rushing_first_downs, rushing_epa, rushing_2pt_conversions, special_teams_tds, pocket_time.pass, pass_attempts.pass, throwaways.pass, spikes.pass, drops.pass, bad_throws.pass, times_blitzed.pass, times_hurried.pass, times_hit.pass, times_pressured.pass, batted_balls.pass, on_tgt_throws.pass, rpo_plays.pass, rpo_yards.pass, rpo_pass_att.pass, rpo_pass_yards.pass, rpo_rush_att.pass, rpo_rush_yards.pass, pa_pass_att.pass, pa_pass_yards.pass, drop_pct.pass, bad_throw_pct.pass, on_tgt_pct.pass, pressure_pct.pass, ybc_att.rush, yac_att.rush, att.rush, yds.rush, td.rush, x1d.rush, ybc.rush, yac.rush, brk_tkl.rush, att_br.rush
(Per-variable out-of-bag imputation error estimates for iterations 1–5 not shown.)
missRanger object. Extract imputed data via $data
- best iteration: 4
- best average OOB imputation error: 0.2482278
Code
data_train_qb <- seasonalData_lag_qb_train_imp$data
data_train_qb$fantasyPointsMC_lag <- scale(data_train_qb$fantasyPoints_lag, scale = FALSE) # mean-centered
data_train_qb_matrix <- data_train_qb %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
seasonalData_lag_qb_test_imp <- predict(
object = seasonalData_lag_qb_train_imp,
newdata = seasonalData_lag_qb_test,
seed = 52242)
data_test_qb <- seasonalData_lag_qb_test_imp
data_test_qb_matrix <- data_test_qb %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
Code
# RBs
seasonalData_lag_rb_all_imp <- missRanger::missRanger(
seasonalData_lag_rb_all,
pmm.k = 5,
verbose = 2,
seed = 52242,
keep_forests = TRUE)
seasonalData_lag_rb_all_imp
data_all_rb <- seasonalData_lag_rb_all_imp$data
data_all_rb$fantasyPointsMC_lag <- scale(data_all_rb$fantasyPoints_lag, scale = FALSE) # mean-centered
data_all_rb_matrix <- data_all_rb %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
newData_rb <- data_all_rb %>%
filter(season == max(season, na.rm = TRUE)) %>%
select(-fantasyPoints_lag, -fantasyPointsMC_lag)
newData_rb_matrix <- data_all_rb_matrix[
data_all_rb_matrix[, "season"] == max(data_all_rb_matrix[, "season"], na.rm = TRUE), # keep only rows with the most recent season
, # all columns
drop = FALSE]
dropCol_rb <- which(colnames(newData_rb_matrix) %in% c("fantasyPoints_lag","fantasyPointsMC_lag"))
newData_rb_matrix <- newData_rb_matrix[, -dropCol_rb, drop = FALSE]
seasonalData_lag_rb_train_imp <- missRanger::missRanger(
seasonalData_lag_rb_train,
pmm.k = 5,
verbose = 2,
seed = 52242,
keep_forests = TRUE)
seasonalData_lag_rb_train_imp
data_train_rb <- seasonalData_lag_rb_train_imp$data
data_train_rb$fantasyPointsMC_lag <- scale(data_train_rb$fantasyPoints_lag, scale = FALSE) # mean-centered
data_train_rb_matrix <- data_train_rb %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
seasonalData_lag_rb_test_imp <- predict(
object = seasonalData_lag_rb_train_imp,
newdata = seasonalData_lag_rb_test,
seed = 52242)
data_test_rb <- seasonalData_lag_rb_test_imp
data_test_rb_matrix <- data_test_rb %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
Code
# WRs
seasonalData_lag_wr_all_imp <- missRanger::missRanger(
seasonalData_lag_wr_all,
pmm.k = 5,
verbose = 2,
seed = 52242,
keep_forests = TRUE)
seasonalData_lag_wr_all_imp
data_all_wr <- seasonalData_lag_wr_all_imp$data
data_all_wr$fantasyPointsMC_lag <- scale(data_all_wr$fantasyPoints_lag, scale = FALSE) # mean-centered
data_all_wr_matrix <- data_all_wr %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
newData_wr <- data_all_wr %>%
filter(season == max(season, na.rm = TRUE)) %>%
select(-fantasyPoints_lag, -fantasyPointsMC_lag)
newData_wr_matrix <- data_all_wr_matrix[
data_all_wr_matrix[, "season"] == max(data_all_wr_matrix[, "season"], na.rm = TRUE), # keep only rows with the most recent season
, # all columns
drop = FALSE]
dropCol_wr <- which(colnames(newData_wr_matrix) %in% c("fantasyPoints_lag","fantasyPointsMC_lag"))
newData_wr_matrix <- newData_wr_matrix[, -dropCol_wr, drop = FALSE]
seasonalData_lag_wr_train_imp <- missRanger::missRanger(
seasonalData_lag_wr_train,
pmm.k = 5,
verbose = 2,
seed = 52242,
keep_forests = TRUE)
seasonalData_lag_wr_train_imp
data_train_wr <- seasonalData_lag_wr_train_imp$data
data_train_wr$fantasyPointsMC_lag <- scale(data_train_wr$fantasyPoints_lag, scale = FALSE) # mean-centered
data_train_wr_matrix <- data_train_wr %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
seasonalData_lag_wr_test_imp <- predict(
object = seasonalData_lag_wr_train_imp,
newdata = seasonalData_lag_wr_test,
seed = 52242)
data_test_wr <- seasonalData_lag_wr_test_imp
data_test_wr_matrix <- data_test_wr %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
Code
# TEs
seasonalData_lag_te_all_imp <- missRanger::missRanger(
seasonalData_lag_te_all,
pmm.k = 5,
verbose = 2,
seed = 52242,
keep_forests = TRUE)
seasonalData_lag_te_all_imp
data_all_te <- seasonalData_lag_te_all_imp$data
data_all_te$fantasyPointsMC_lag <- scale(data_all_te$fantasyPoints_lag, scale = FALSE) # mean-centered
data_all_te_matrix <- data_all_te %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
newData_te <- data_all_te %>%
filter(season == max(season, na.rm = TRUE)) %>%
select(-fantasyPoints_lag, -fantasyPointsMC_lag)
newData_te_matrix <- data_all_te_matrix[
data_all_te_matrix[, "season"] == max(data_all_te_matrix[, "season"], na.rm = TRUE), # keep only rows with the most recent season
, # all columns
drop = FALSE]
dropCol_te <- which(colnames(newData_te_matrix) %in% c("fantasyPoints_lag","fantasyPointsMC_lag"))
newData_te_matrix <- newData_te_matrix[, -dropCol_te, drop = FALSE]
seasonalData_lag_te_train_imp <- missRanger::missRanger(
seasonalData_lag_te_train,
pmm.k = 5,
verbose = 2,
seed = 52242,
keep_forests = TRUE)
seasonalData_lag_te_train_imp
data_train_te <- seasonalData_lag_te_train_imp$data
data_train_te$fantasyPointsMC_lag <- scale(data_train_te$fantasyPoints_lag, scale = FALSE) # mean-centered
data_train_te_matrix <- data_train_te %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
seasonalData_lag_te_test_imp <- predict(
object = seasonalData_lag_te_train_imp,
newdata = seasonalData_lag_te_test,
seed = 52242)
data_test_te <- seasonalData_lag_te_test_imp
data_test_te_matrix <- data_test_te %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
19.5 Identify Cores for Parallel Processing
We use the future package (Bengtsson, 2025) for parallel (faster) processing.
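The code for this step is collapsed above; a minimal sketch of identifying the available cores and setting a parallel plan (the object name numCores is hypothetical) might look like the following.
Code
library(future)
# Identify the number of available cores, leaving one free for other tasks
numCores <- max(availableCores() - 1, 1)
# Run subsequent model fitting in parallel background R sessions
plan(multisession, workers = numCores)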
19.6 Set up the Cross-Validation Folds
In the examples below, we predict the future fantasy points of Quarterbacks. However, the examples could be applied to any of the positions. There are various approaches to cross-validation. In the examples below, we use k-fold cross-validation. However, we also provide the code to apply leave-one-out (LOO) cross-validation.
19.6.1 k-Fold Cross-Validation
k-fold cross-validation partitions the data into k folds (subsets). In each of the k iterations, the model is trained on \(k - 1\) folds and is evaluated on the remaining fold. For example, in a 10-fold cross-validation (i.e., \(k = 10\)), as used below, the model is trained 10 times, each time leaving out a different 10% of the data for validation. k-fold cross-validation is widely used because it tends to yield stable estimates of model performance, by balancing bias and variance. It is also computationally efficient, requiring only k model fits to evaluate model performance.
We set up the k folds using the group_vfold_cv() function of the rsample package (Frick, Chow, et al., 2025).
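A minimal sketch of creating the folds object used below (folds_kFold) follows; the use of 10 folds and the grouping by player (so that all of a player's seasons stay in the same fold) reflect the text above but are assumptions about the collapsed code.
Code
# 10-fold cross-validation, keeping all rows from a given player in the same fold
set.seed(52242)
folds_kFold <- rsample::group_vfold_cv(
  data_train_qb,
  group = gsis_id,
  v = 10)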
19.6.2 Leave-One-Out (LOO) Cross-Validation
Leave-one-out (LOO) cross-validation partitions the data into n folds, where n is the sample size. In each of the n iterations, the model is trained on \(n - 1\) observations and is evaluated on the one left out. For example, in a LOO cross-validation with 100 players, the model is trained 100 times, each time leaving out a different player for validation. LOO cross-validation is especially useful when the dataset is small—too small to form reliable training sets in k-fold cross-validation (e.g., with \(k = 5\) or \(k = 10\), which divide the sample into 5 or 10 folds, respectively). However, LOO tends to be less computationally efficient because it requires more model fits than k-fold cross-validation. LOO tends to have low bias, producing performance estimates closer to those obtained when fitting the model to the full dataset, because each model is trained on nearly all the data. However, LOO also tends to have high variance in its error estimates, because each validation fold contains only a single observation, making those estimates more sensitive to individual data points.
We set up the LOO folds using the loo_cv() function of the rsample package (Frick, Chow, et al., 2025).
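A minimal sketch follows, in which the object name folds_loo is hypothetical; note that loo_cv() leaves out one row (here, one player-season) at a time.
Code
# Leave-one-out cross-validation: one validation fold per observation
folds_loo <- rsample::loo_cv(data_train_qb)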
19.7 Fitting the Traditional Linear Regression Models
We describe linear regression in Chapter 11.
19.7.1 Regression with One Predictor
Below, we fit a linear regression model with one predictor and evaluate it with cross-validation. We also evaluate its accuracy on the hold-out (test) data. For each of the models, we fit and evaluate the models using the tidymodels ecosystem of packages (Kuhn & Wickham, 2020, 2025). We specify our model formula using the recipe() function of the recipes package (Kuhn, Wickham, et al., 2025). We define the model using the linear_reg(), set_engine(), and set_mode() functions of the parsnip package (Kuhn & Vaughan, 2025). We specify the workflow using the workflow(), add_recipe(), and add_model() functions of the workflows package (Vaughan & Couch, 2025). We fit the cross-validation model using the fit_resamples() function of the tune package (Kuhn, 2025). We specify the accuracy metrics to evaluate using the metric_set() function of the yardstick package (Kuhn, Vaughan, et al., 2025). We fit the final model using the fit() function of the workflows package (Vaughan & Couch, 2025). We evaluate the accuracy of the model’s predictions on the test data using the accuracyOverall() function of the petersenlab package (Petersen, 2025a).
Code
# Set seed for reproducibility
set.seed(52242)
# Set up Cross-Validation
folds <- folds_kFold
# Define Recipe (Formula)
rec <- recipes::recipe(
fantasyPoints_lag ~ fantasyPoints,
data = data_train_qb)
# Define Model
lm_spec <- parsnip::linear_reg() %>%
parsnip::set_engine("lm") %>%
parsnip::set_mode("regression")
# Workflow
lm_wf <- workflows::workflow() %>%
workflows::add_recipe(rec) %>%
workflows::add_model(lm_spec)
# Fit Model with Cross-Validation
cv_results <- tune::fit_resamples(
lm_wf,
resamples = folds,
metrics = yardstick::metric_set(rmse, mae, rsq),
control = tune::control_resamples(save_pred = TRUE)
)
# View Cross-Validation metrics
tune::collect_metrics(cv_results)
Code
Code
Figure 19.1 depicts the predicted versus actual fantasy points for the model on the test data.
Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)
ggplot(
df,
aes(
x = pred,
y = fantasyPoints_lag)) +
geom_point(
size = 2,
alpha = 0.6) +
geom_abline(
slope = 1,
intercept = 0,
color = "blue",
linetype = "dashed") +
coord_equal(
xlim = axis_limits,
ylim = axis_limits) +
labs(
title = "Predicted vs Actual Fantasy Points (Test Data)",
x = "Predicted Fantasy Points",
y = "Actual Fantasy Points"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Below are the model predictions for next year’s fantasy points:
19.7.2 Regression with Multiple Predictors
Below, we fit a linear regression model with multiple predictors and evaluate it with cross-validation. We also evaluate its accuracy on the hold-out (test) data.
Code
# Set seed for reproducibility
set.seed(52242)
# Set up Cross-Validation
folds <- folds_kFold
# Define Recipe (Formula)
rec <- recipes::recipe(
fantasyPoints_lag ~ ., # use all predictors
data = data_train_qb %>% select(-gsis_id, -fantasyPointsMC_lag))
# Define Model
lm_spec <- parsnip::linear_reg() %>%
parsnip::set_engine("lm") %>%
parsnip::set_mode("regression")
# Workflow
lm_wf <- workflows::workflow() %>%
workflows::add_recipe(rec) %>%
workflows::add_model(lm_spec)
# Fit Model with Cross-Validation
cv_results <- tune::fit_resamples(
lm_wf,
resamples = folds,
metrics = yardstick::metric_set(rmse, mae, rsq),
control = tune::control_resamples(save_pred = TRUE)
)
# View Cross-Validation metrics
tune::collect_metrics(cv_results)
Code
Code
Figure 19.2 depicts the predicted versus actual fantasy points for the model on the test data.
Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)
ggplot(
df,
aes(
x = pred,
y = fantasyPoints_lag)) +
geom_point(
size = 2,
alpha = 0.6) +
geom_abline(
slope = 1,
intercept = 0,
color = "blue",
linetype = "dashed") +
coord_equal(
xlim = axis_limits,
ylim = axis_limits) +
labs(
title = "Predicted vs Actual Fantasy Points (Test Data)",
x = "Predicted Fantasy Points",
y = "Actual Fantasy Points"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Below are the model predictions for next year’s fantasy points:
19.8 Fitting the Machine Learning Models
19.8.1 Least Absolute Shrinkage and Selection Operator (LASSO)
Below, we fit a LASSO model. We evaluate it and tune its penalty parameter with cross-validation. The penalty parameter in a LASSO model controls the strength of regularization applied to the model coefficients. Smaller penalty values result in less regularization, allowing the model to retain more predictors with larger (nonzero) coefficients. This typically reduces bias but increases variance of the model’s predictions, as the model may overfit to the training data by including irrelevant or weak predictors. Larger penalty values apply stronger regularization, shrinking more coefficients exactly to zero. This encourages a sparser model that may increase bias (by excluding useful predictors) but reduces variance and improves generalizability by simplifying the model and reducing overfitting.
After tuning the model, we evaluate its accuracy on the hold-out (test) data.
The LASSO models were fit using the glmnet package (Friedman et al., 2010, 2025; Tay et al., 2023).
For the machine learning models, we perform the parameter tuning using the tune(), tune_grid(), select_best(), and finalize_workflow() functions of the tune package (Kuhn, 2025). We specify the grid of possible parameter values using the grid_regular() function of the dials package (Kuhn & Frick, 2025).
Code
# Set seed for reproducibility
set.seed(52242)
# Set up Cross-Validation
folds <- folds_kFold
# Define Recipe (Formula)
rec <- recipes::recipe(
fantasyPoints_lag ~ ., # use all predictors
data = data_train_qb %>% select(-gsis_id, -fantasyPointsMC_lag))
# Define Model
lasso_spec <-
parsnip::linear_reg(
penalty = tune::tune(),
mixture = 1) %>%
parsnip::set_engine("glmnet")
# Workflow
lasso_wf <- workflows::workflow() %>%
workflows::add_recipe(rec) %>%
workflows::add_model(lasso_spec)
# Define grid of penalties to try (log scale is typical)
penalty_grid <- dials::grid_regular(
dials::penalty(range = c(-4, -1)),
levels = 20)
# Tune the Penalty Parameter
cv_results <- tune::tune_grid(
lasso_wf,
resamples = folds,
grid = penalty_grid,
metrics = yardstick::metric_set(rmse, mae, rsq),
control = tune::control_grid(save_pred = TRUE)
)
# View Cross-Validation metrics
tune::collect_metrics(cv_results)
Code
best_penalty <- tune::select_best(cv_results, metric = "mae")
# Finalize Workflow with Best Penalty
final_wf <- tune::finalize_workflow(
lasso_wf,
best_penalty)
# Fit Final Model on Training Data
final_model <- workflows::fit(
final_wf,
data = data_train_qb)
# View Coefficients
final_model %>%
workflows::extract_fit_parsnip() %>%
broom::tidy()
Code
Figure 19.3 depicts the predicted versus actual fantasy points for the model on the test data.
Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)
ggplot(
df,
aes(
x = pred,
y = fantasyPoints_lag)) +
geom_point(
size = 2,
alpha = 0.6) +
geom_abline(
slope = 1,
intercept = 0,
color = "blue",
linetype = "dashed") +
coord_equal(
xlim = axis_limits,
ylim = axis_limits) +
labs(
title = "Predicted vs Actual Fantasy Points (Test Data)",
x = "Predicted Fantasy Points",
y = "Actual Fantasy Points"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Below are the model predictions for next year’s fantasy points:
19.8.2 Ridge Regression
Below, we fit a ridge regression model. We evaluate it and tune its penalty parameter with cross-validation. The penalty parameter in a ridge regression model controls the amount of regularization applied to the model’s coefficients. Smaller penalty values allow coefficients to remain large, resulting in a model that closely fits the training data. This may reduce bias but increases the risk of overfitting, especially in the presence of multicollinearity or many weak predictors. Larger penalty values shrink coefficients toward zero (though not exactly to zero), which reduces model complexity. This typically increases bias slightly but reduces variance of the model’s predictions, making the model more stable and better suited for generalization to new data.
After tuning the model, we also evaluate its accuracy on the hold-out (test) data.
The ridge regression models were fit using the glmnet package (Friedman et al., 2010, 2025; Tay et al., 2023).
Code
# Set seed for reproducibility
set.seed(52242)
# Set up Cross-Validation
folds <- folds_kFold
# Define Recipe (Formula)
rec <- recipes::recipe(
fantasyPoints_lag ~ ., # use all predictors
data = data_train_qb %>% select(-gsis_id, -fantasyPointsMC_lag))
# Define Model
ridge_spec <-
parsnip::linear_reg(
penalty = tune::tune(),
mixture = 0) %>%
parsnip::set_engine("glmnet")
# Workflow
ridge_wf <- workflows::workflow() %>%
workflows::add_recipe(rec) %>%
workflows::add_model(ridge_spec)
# Define grid of penalties to try (log scale is typical)
penalty_grid <- dials::grid_regular(
dials::penalty(range = c(-4, -1)),
levels = 20)
# Tune the Penalty Parameter
cv_results <- tune::tune_grid(
ridge_wf,
resamples = folds,
grid = penalty_grid,
metrics = yardstick::metric_set(rmse, mae, rsq),
control = tune::control_grid(save_pred = TRUE)
)
# View Cross-Validation metrics
tune::collect_metrics(cv_results)
Code
best_penalty <- tune::select_best(cv_results, metric = "mae")
# Finalize Workflow with Best Penalty
final_wf <- tune::finalize_workflow(
ridge_wf,
best_penalty)
# Fit Final Model on Training Data
final_model <- workflows::fit(
final_wf,
data = data_train_qb)
# View Coefficients
final_model %>%
workflows::extract_fit_parsnip() %>%
broom::tidy()
Code
Figure 19.4 depicts the predicted versus actual fantasy points for the model on the test data.
Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)
ggplot(
df,
aes(
x = pred,
y = fantasyPoints_lag)) +
geom_point(
size = 2,
alpha = 0.6) +
geom_abline(
slope = 1,
intercept = 0,
color = "blue",
linetype = "dashed") +
coord_equal(
xlim = axis_limits,
ylim = axis_limits) +
labs(
title = "Predicted vs Actual Fantasy Points (Test Data)",
x = "Predicted Fantasy Points",
y = "Actual Fantasy Points"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Below are the model predictions for next year’s fantasy points:
19.8.3 Elastic Net
Below, we fit an elastic net model. We evaluate it and tune its penalty and mixture parameters with cross-validation.
The penalty parameter controls the overall strength of regularization applied to the model’s coefficients. Smaller penalty values allow coefficients to remain large, which can reduce bias but increase variance of the model’s predictions and can increase the risk of overfitting. Larger penalty values shrink coefficients more aggressively, leading to simpler models with potentially higher bias but lower variance of predictions. This regularization helps prevent overfitting, especially when the model includes many predictors or multicollinearity.
The mixture parameter controls the balance between LASSO and ridge regularization. It ranges from 0 to 1: A value of 0 applies pure ridge regression, which shrinks all coefficients but keeps them in the model. A value of 1 applies pure LASSO, which can shrink some coefficients exactly to zero, effectively performing variable selection. Values between 0 and 1 apply a combination: ridge-like smoothing and LASSO-like sparsity. Smaller mixture values favor shrinkage without variable elimination, whereas larger values favor sparser solutions by excluding weak predictors.
After tuning the model, we also evaluate its accuracy on the hold-out (test) data.
The elastic net models were fit using the glmnet package (Friedman et al., 2010, 2025; Tay et al., 2023).
Code
# Set seed for reproducibility
set.seed(52242)
# Set up Cross-Validation
folds <- folds_kFold
# Define Recipe (Formula)
rec <- recipes::recipe(
fantasyPoints_lag ~ ., # use all predictors
data = data_train_qb %>% select(-gsis_id, -fantasyPointsMC_lag))
# Define Model
enet_spec <-
parsnip::linear_reg(
penalty = tune::tune(),
mixture = tune::tune()) %>%
parsnip::set_engine("glmnet")
# Workflow
enet_wf <- workflows::workflow() %>%
workflows::add_recipe(rec) %>%
workflows::add_model(enet_spec)
# Define a regular grid for both penalty and mixture
grid_enet <- dials::grid_regular(
dials::penalty(range = c(-4, -1)),
dials::mixture(range = c(0, 1)),
levels = c(20, 5) # 20 penalty values × 5 mixture values
)
# Tune the Grid
cv_results <- tune::tune_grid(
enet_wf,
resamples = folds,
grid = grid_enet,
metrics = yardstick::metric_set(rmse, mae, rsq),
control = tune::control_grid(save_pred = TRUE)
)
# View Cross-Validation metrics
tune::collect_metrics(cv_results)
Code
best_params <- tune::select_best(cv_results, metric = "mae")
# Finalize Workflow with Best Penalty and Mixture
final_wf <- tune::finalize_workflow(
enet_wf,
best_params)
# Fit Final Model on Training Data
final_model <- workflows::fit(
final_wf,
data = data_train_qb)
# View Coefficients
final_model %>%
workflows::extract_fit_parsnip() %>%
broom::tidy()
Code
Figure 19.5 depicts the predicted versus actual fantasy points for the model on the test data.
Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)
ggplot(
df,
aes(
x = pred,
y = fantasyPoints_lag)) +
geom_point(
size = 2,
alpha = 0.6) +
geom_abline(
slope = 1,
intercept = 0,
color = "blue",
linetype = "dashed") +
coord_equal(
xlim = axis_limits,
ylim = axis_limits) +
labs(
title = "Predicted vs Actual Fantasy Points (Test Data)",
x = "Predicted Fantasy Points",
y = "Actual Fantasy Points"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Below are the model predictions for next year’s fantasy points:
19.8.4 Random Forest Machine Learning
19.8.4.1 Assuming Independent Observations
Below, we fit a random forest model. We evaluate it and tune two parameters with cross-validation. The first parameter is mtry, which controls the number of predictors randomly sampled at each split in a decision tree. Smaller mtry values increase tree diversity by forcing trees to consider different subsets of predictors. This typically reduces the variance of the overall model’s predictions (because the final prediction is averaged over more diverse trees) but may increase bias if strong predictors are frequently excluded. Larger mtry values allow more predictors to be considered at each split, making trees more similar to each other. This can reduce bias but often increases the variance of the model’s predictions, because the trees are more correlated and less effective at error cancellation when averaged.
The second parameter is min_n, which controls the minimum number of observations that must be present in a node for a split to be attempted. Smaller min_n values allow trees to grow deeper and capture more fine-grained patterns in the training data. This typically reduces bias but increases the variance of the overall model’s predictions, because deeper trees are more likely to overfit to noise in the training data. Larger min_n values restrict the depth of the trees by requiring more data to justify a split. This can reduce variance by producing simpler, more generalizable trees, but may increase bias if the trees are unable to capture important structure in the data.
After tuning the model, we evaluate its accuracy on the hold-out (test) data.
The random forest models were fit using the ranger package (Wright, 2024; Wright & Ziegler, 2017). We specify the grid of possible parameter values using the grid_random(), update.parameters(), mtry(), and min_n() functions of the dials package (Kuhn & Frick, 2025) and the extract_parameter_set_dials() function of the hardhat package (Frick, Vaughan, et al., 2025).
Code
# Set seed for reproducibility
set.seed(52242)
# Set up Cross-Validation
folds <- folds_kFold
# Define Recipe (Formula)
rec <- recipes::recipe(
fantasyPoints_lag ~ ., # use all predictors
data = data_train_qb %>% select(-gsis_id, -fantasyPointsMC_lag))
# Define Model
rf_spec <-
parsnip::rand_forest(
mtry = tune::tune(),
min_n = tune::tune(),
trees = 500) %>%
parsnip::set_mode("regression") %>%
parsnip::set_engine(
"ranger",
importance = "impurity")
# Workflow
rf_wf <- workflows::workflow() %>%
workflows::add_recipe(rec) %>%
workflows::add_model(rf_spec)
# Create Grid
n_predictors <- recipes::prep(rec) %>%
recipes::juice() %>%
dplyr::select(-fantasyPoints_lag) %>%
ncol()
# Dynamically define ranges based on data
rf_params <- hardhat::extract_parameter_set_dials(rf_spec) %>%
dials:::update.parameters(
mtry = dials::mtry(range = c(1L, n_predictors)),
min_n = dials::min_n(range = c(2L, 10L))
)
rf_grid <- dials::grid_random(rf_params, size = 15) #dials::grid_regular(rf_params, levels = 5)
# Tune the Grid
cv_results <- tune::tune_grid(
rf_wf,
resamples = folds,
grid = rf_grid,
metrics = yardstick::metric_set(rmse, mae, rsq),
control = tune::control_grid(save_pred = TRUE)
)
# View Cross-Validation metrics
tune::collect_metrics(cv_results)
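In addition to collecting all of the cross-validation metrics, we can inspect which combinations of mtry and min_n performed best. A brief supplementary example (not part of the original workflow) using the show_best() function of the tune package:
# Show the top 5 mtry/min_n combinations by mean absolute error
tune::show_best(
  cv_results,
  metric = "mae",
  n = 5)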
Code
best_params <- tune::select_best(cv_results, metric = "mae")
# Finalize Workflow with Best mtry and min_n
final_wf <- tune::finalize_workflow(
rf_wf,
best_params)
# Fit Final Model on Training Data
final_model <- workflows::fit(
final_wf,
data = data_train_qb)
# View Feature Importance
rf_fit <- final_model %>%
workflows::extract_fit_parsnip()
rf_fit
parsnip model object
Ranger result
Call:
ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~8L, x), num.trees = ~500, min.node.size = min_rows(~3L, x), importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
Type: Regression
Number of trees: 500
Sample size: 1582
Number of independent variables: 73
Mtry: 8
Target node size: 3
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 6292.873
R squared (OOB): 0.5165346
season games gs
158486.2281 311601.5550 508011.8724
years_of_experience age ageCentered20
153382.2301 267708.3635 279843.6026
ageCentered20Quadratic height weight
275022.6554 94161.2553 191252.7328
rookie_year draft_number fantasy_points
140253.5439 272494.7684 1419670.4892
fantasy_points_ppr fantasyPoints completions
1293483.9743 1312301.3737 827216.4193
attempts passing_yards passing_tds
646841.5869 917999.3053 857088.9137
passing_interceptions sacks_suffered sack_yards_lost
130485.7228 161120.2954 220249.8113
sack_fumbles sack_fumbles_lost passing_air_yards
76489.5596 52392.6083 299007.8777
passing_yards_after_catch passing_first_downs passing_epa
329812.9334 939795.2571 431328.9812
passing_cpoe passing_2pt_conversions pacr
177761.1008 35523.1896 168910.0901
carries rushing_yards rushing_tds
240997.7562 181847.7875 57066.7128
rushing_fumbles rushing_fumbles_lost rushing_first_downs
54868.5816 33249.7074 118475.4108
rushing_epa rushing_2pt_conversions special_teams_tds
212956.2609 26187.3427 585.1349
pocket_time.pass pass_attempts.pass throwaways.pass
117380.4026 542011.1159 181719.6877
spikes.pass drops.pass bad_throws.pass
56148.3695 317853.6846 331561.0155
times_blitzed.pass times_hurried.pass times_hit.pass
347944.2411 194260.1925 192719.6026
times_pressured.pass batted_balls.pass on_tgt_throws.pass
441632.1558 92205.6330 362699.7319
rpo_plays.pass rpo_yards.pass rpo_pass_att.pass
161827.5832 166390.5923 141437.2823
rpo_pass_yards.pass rpo_rush_att.pass rpo_rush_yards.pass
134159.9270 64700.4095 108135.7323
pa_pass_att.pass pa_pass_yards.pass drop_pct.pass
274342.4437 247080.0647 160183.4104
bad_throw_pct.pass on_tgt_pct.pass pressure_pct.pass
195925.8780 132741.7748 167263.9786
ybc_att.rush yac_att.rush att.rush
169961.6625 155900.6714 229358.3533
yds.rush td.rush x1d.rush
158863.0638 65818.0750 155380.6876
ybc.rush yac.rush brk_tkl.rush
156091.4085 153192.0666 49373.0228
att_br.rush
60286.9067
Code
Figure 19.6 depicts the predicted versus actual fantasy points for the model on the test data.
Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)
ggplot(
df,
aes(
x = pred,
y = fantasyPoints_lag)) +
geom_point(
size = 2,
alpha = 0.6) +
geom_abline(
slope = 1,
intercept = 0,
color = "blue",
linetype = "dashed") +
coord_equal(
xlim = axis_limits,
ylim = axis_limits) +
labs(
title = "Predicted vs Actual Fantasy Points (Test Data)",
x = "Predicted Fantasy Points",
y = "Actual Fantasy Points"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Below are the model predictions for next year’s fantasy points:
Code
Now we can stop the parallel backend:
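The code for stopping the backend is folded. A minimal sketch, assuming a future-based backend was registered earlier (e.g., via future::plan() or doFuture::registerDoFuture()):
# Return to sequential (single-process) execution
future::plan(future::sequential)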
19.8.4.2 Accounting for Longitudinal Data
The above approaches to machine learning assume that the observations are independent across rows. However, in our case, this assumption does not hold because the data are longitudinal—each player has multiple seasons, and each row corresponds to a unique player–season combination. In the approaches below, using a random forest model (this section) and a tree-boosting model (in Section 19.8.5), we address this by explicitly accounting for the nesting of longitudinal observations within players.
Approaches to estimating random forest models with longitudinal data are described by Hu & Szymczak (2023). Below, we fit longitudinal random forest models using the MERF() function of the LongituRF package (Capitaine, 2020).
Code
smerf <- LongituRF::MERF(
X = data_train_qb %>% dplyr::select(season:att_br.rush) %>% as.matrix(), # predictors of the fixed effects
Y = data_train_qb[,c("fantasyPoints_lag")] %>% as.matrix(), # outcome variable
Z = data_train_qb %>% dplyr::mutate(constant = 1) %>% dplyr::select(constant, passing_yards, passing_tds, passing_interceptions, passing_epa, pacr) %>% as.matrix(), # predictors of the random effects
id = data_train_qb[,c("gsis_id")] %>% as.matrix(), # player ID (for nesting)
time = data_train_qb[,c("ageCentered20")] %>% as.matrix(), # time variable
ntree = 500,
sto = "BM")
[1] "stopped after 11 iterations."
Call:
randomForest(x = X, y = ystar, ntree = ntree, mtry = mtry, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 25
Mean of squared residuals: 165.2049
% Var explained: 98.49
[1] 154.9823 110.8246 105.2359 108.3674 127.9415 145.0449 145.5596 158.7728
[9] 160.9756 151.7372 165.2049
The following code generates an error (shown below); the cause is unclear. The issue has been posted on the package’s GitHub repository—https://github.com/sistm/LongituRF/issues/5—in the hope that the package maintainer can help address it.
Code
# Predict on Test Data
predict(
smerf,
X = data_test_qb %>% dplyr::select(season:att_br.rush) %>% as.matrix(),
# note: Z is built from data_train_qb while X, id, and time use data_test_qb; the resulting row mismatch may be related to the non-conformable arguments error below
Z = data_train_qb %>% dplyr::mutate(constant = 1) %>% dplyr::select(constant, passing_yards, passing_tds, passing_interceptions, passing_epa, pacr) %>% as.matrix(),
id = data_test_qb[,c("gsis_id")] %>% as.matrix(),
time = data_test_qb[,c("ageCentered20")] %>% as.matrix())
Error in Z[w, , drop = FALSE] %*% object$random_effects[k, ]: non-conformable arguments
19.8.5 Combining Tree-Boosting with Mixed Models
To combine tree-boosting with mixed models, we use the gpboost package (Sigrist et al., 2025). This section is adapted from the following tutorial: https://towardsdatascience.com/mixed-effects-machine-learning-for-longitudinal-panel-data-with-gpboost-part-iii-523bb38effc
19.8.5.1 Process Data
If using a gamma distribution for the outcome, the outcome values must be strictly positive:
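The processing code is folded in the rendered chapter. Below is a purely illustrative sketch of one way to satisfy this requirement by flooring the outcome at a small positive value; the chapter's actual transformation is not shown, so treat this as an assumption:
# Illustrative only: a gamma likelihood requires y > 0, so floor the lagged
# fantasy-point outcome at a small positive value in the training and test data
data_train_qb <- data_train_qb %>%
  dplyr::mutate(fantasyPoints_lag = pmax(fantasyPoints_lag, 0.01))
data_test_qb <- data_test_qb %>%
  dplyr::mutate(fantasyPoints_lag = pmax(fantasyPoints_lag, 0.01))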
19.8.5.2 Specify Predictor Variables
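The folded code defines pred_vars_qb and pred_vars_qb_categorical, which later chunks reference. A hypothetical sketch of what such definitions might look like, assuming the same predictor range used for the MERF model above (the actual lists are in the folded code):
# Hypothetical sketch: names of the predictor columns passed to gpboost
pred_vars_qb <- data_train_qb %>%
  dplyr::select(season:att_br.rush) %>%
  names()

# Hypothetical placeholder: which predictors are treated as categorical
pred_vars_qb_categorical <- c("season")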
19.8.5.3 Specify General Model Options
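The folded code also sets several general options used below: model_likelihood, nrounds, and num_cores. A hypothetical sketch, with the likelihood inferred from the gamma likelihood shown in the training log later and the other values chosen arbitrarily:
# Hypothetical sketch: general model options referenced by later chunks
model_likelihood <- "gamma"  # matches the "[GPBoost with gamma likelihood]" log output below
nrounds <- 1000              # maximum number of boosting rounds considered during tuning (arbitrary value)
num_cores <- max(1, parallel::detectCores() - 1)  # number of threads for gpboost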
19.8.5.4 Identify Optimal Tuning Parameters
For identifying the optimal tuning parameters for boosting, we partition the training data into inner training data and validation data. We randomly split the training data into 80% inner training data and 20% held-out validation data. We then use the mean absolute error as our index of prediction accuracy on the held-out validation data.
Code
# Partition training data into inner training data and validation data
ntrain_qb <- dim(data_train_qb_matrix)[1]
set.seed(52242)
valid_tune_idx_qb <- sample.int(ntrain_qb, as.integer(0.2*ntrain_qb)) # validation set
folds_qb <- list(valid_tune_idx_qb)
# Specify parameter grid, gp_model, and gpb.Dataset
param_grid_qb <- list(
"learning_rate" = c(0.2, 0.1, 0.05, 0.01), # the step size used when updating predictions after each boosting round (high values make big updates, which can speed up learning but risk overshooting; low values are usually more accurate but require more rounds)
"max_depth" = c(3, 5, 7), # maximum depth (levels) of each decision tree; deeper trees capture more complex patterns and interactions but risk overfitting; shallower trees tend to generalize better
"min_data_in_leaf" = c(10, 50, 100), # minimum number of training examples in a leaf node; higher values = more regularization (simpler trees)
"lambda_l2" = c(0, 1, 5)) # L2 regularization penalty for large weights in tree splits; adds a "cost" for complexity; helps prevent overfitting by shrinking the contribution of each tree
other_params_qb <- list(
num_leaves = 2^6) # maximum number of leaves per tree; controls the maximum complexity of each tree (along with max_depth); more leaves = more expressive models, but can overfit if min_data_in_leaf is too small; num_leaves must be consistent with max_depth, because deeper trees naturally support more leaves; max is: 2^n, where n is the largest max_depth
gp_model_qb <- gpboost::GPModel(
group_data = data_train_qb_matrix[,"gsis_id"],
likelihood = model_likelihood,
group_rand_coef_data = cbind(
data_train_qb_matrix[,"ageCentered20"],
data_train_qb_matrix[,"ageCentered20Quadratic"]),
ind_effect_group_rand_coef = c(1,1))
gp_data_qb <- gpboost::gpb.Dataset(
data = data_train_qb_matrix[,pred_vars_qb],
categorical_feature = pred_vars_qb_categorical,
label = data_train_qb_matrix[,"fantasyPoints_lag"]) # could instead use mean-centered variable (fantasyPointsMC_lag) and add mean back afterward
# Find optimal tuning parameters
opt_params_qb <- gpboost::gpb.grid.search.tune.parameters(
param_grid = param_grid_qb,
params = other_params_qb,
num_try_random = NULL,
folds = folds_qb,
data = gp_data_qb,
gp_model = gp_model_qb,
nrounds = nrounds,
early_stopping_rounds = 50, # stops training early if the model hasn’t improved on the validation set in 50 rounds; prevents overfitting and saves time
verbose_eval = 1,
metric = "mae")
Error in fd$booster$update(fobj = fobj): [GPBoost] [Fatal] NaN occured in gradient wrt covariance / auxiliary parameter number 1 (counting starts at 1, total nb. par. = 4)
Error: object 'opt_params_qb' not found
We use the optimal parameters identified during tuning. We use a low learning rate (0.1) to avoid overfitting and set some light regularization (lambda_l2) for better generalization. We also set the maximum tree depth (max_depth) at 5 to capture complex (up to 5-way) interactions, and set the maximum number of terminal nodes (num_leaves) per tree at \(2^6\) (64), though the effective number of leaves cannot exceed \(2^{\text{max depth}} = 2^5 = 32\). We set the minimum number of samples in any leaf (min_data_in_leaf) to 100.
19.8.5.5 Specify Model and Tuning Parameters
Code
gp_model_qb <- gpboost::GPModel(
group_data = data_train_qb_matrix[,"gsis_id"],
likelihood = model_likelihood,
group_rand_coef_data = cbind(
data_train_qb_matrix[,"ageCentered20"],
data_train_qb_matrix[,"ageCentered20Quadratic"]),
ind_effect_group_rand_coef = c(1,1))
gp_data_qb <- gpboost::gpb.Dataset(
data = data_train_qb_matrix[,pred_vars_qb],
categorical_feature = pred_vars_qb_categorical,
label = data_train_qb_matrix[,"fantasyPoints_lag"])
params_qb <- list(
learning_rate = 0.1,
max_depth = 5,
min_data_in_leaf = 100,
lambda_l2 = 5,
num_leaves = 2^6,
num_threads = num_cores)
nrounds_qb <- 84 # identify optimal number of trees through iteration and cross-validation
#gp_model_qb$set_optim_params(params = list(optimizer_cov = "nelder_mead")) # to speed up model estimation
19.8.5.6 Fit Model
Code
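The fitting call itself is folded. A sketch of what it might look like with gpboost::gpb.train(), using the objects defined above and the gp_model_fit_qb name that the prediction code references later; this is an assumed reconstruction, not the chapter's exact call:
# Assumed sketch: fit the boosted trees together with the grouped random
# effects (random intercept and random age slopes) defined in gp_model_qb
gp_model_fit_qb <- gpboost::gpb.train(
  params = params_qb,
  data = gp_data_qb,
  gp_model = gp_model_qb,
  nrounds = nrounds_qb,
  verbose = 1)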
[GPBoost] [Info] Total Bins 8709
[GPBoost] [Info] Number of data points in the train set: 1582, number of used features: 73
[GPBoost] [Info] [GPBoost with gamma likelihood]: initscore=4.800459
[GPBoost] [Info] Start training from score 4.800459
19.8.5.7 Model Results
=====================================================
Model summary:
Nb. observations: 1582
Nb. groups: 316 (Group_1)
Covariance parameters (random effects):
Param.
Group_1 0
Group_1_rand_coef_nb_1 0
Group_1_rand_coef_nb_2 0
-----------------------------------------------------
Additional parameters:
Param.
shape 0.6343
=====================================================
19.8.5.8 Evaluate Accuracy of Model on Test Data
Code
# Test Model on Test Data
pred_test_qb <- predict(
gp_model_fit_qb,
data = data_test_qb_matrix[,pred_vars_qb],
group_data_pred = data_test_qb_matrix[,"gsis_id"],
group_rand_coef_data_pred = cbind(
data_test_qb_matrix[,"ageCentered20"],
data_test_qb_matrix[,"ageCentered20Quadratic"]),
predict_var = FALSE,
pred_latent = FALSE)
y_pred_test_qb <- pred_test_qb[["response_mean"]] # if outcome is mean-centered, add mean(data_train_qb_matrix[,"fantasyPoints_lag"])
predictedVsActual <- data.frame(
predictedPoints = y_pred_test_qb,
actualPoints = data_test_qb_matrix[,"fantasyPoints_lag"]
)
predictedVsActual
Code
Figure 19.9 depicts the predicted versus actual fantasy points for the model on the test data.
Code
# Calculate combined range for axes
axis_limits <- range(c(predictedVsActual$predictedPoints, predictedVsActual$actualPoints), na.rm = TRUE)
ggplot(
predictedVsActual,
aes(
x = predictedPoints,
y = actualPoints)) +
geom_point(
size = 2,
alpha = 0.6) +
geom_abline(
slope = 1,
intercept = 0,
color = "blue",
linetype = "dashed") +
coord_equal(
xlim = axis_limits,
ylim = axis_limits) +
labs(
title = "Predicted vs Actual Fantasy Points (Test Data)",
x = "Predicted Fantasy Points",
y = "Actual Fantasy Points"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
19.8.5.9 Generate Predictions for Next Season
Code
# Generate model predictions for next season
pred_nextYear_qb <- predict(
gp_model_fit_qb,
data = newData_qb_matrix[,pred_vars_qb],
group_data_pred = newData_qb_matrix[,"gsis_id"],
group_rand_coef_data_pred = cbind(
newData_qb_matrix[,"ageCentered20"],
newData_qb_matrix[,"ageCentered20Quadratic"]),
predict_var = FALSE,
pred_latent = FALSE)
newData_qb$fantasyPoints_lag <- pred_nextYear_qb$response_mean
# Merge with player names
newData_qb <- left_join(
newData_qb,
nfl_playerIDs %>% select(gsis_id, name),
by = "gsis_id"
)
newData_qb %>%
arrange(-fantasyPoints_lag) %>%
select(name, fantasyPoints_lag, fantasyPoints)
19.9 Summarizing the Machine Learning Models
We examined multiple machine learning models that included many predictor variables to identify the model that generates the most accurate prediction of Quarterbacks’ future fantasy points. We also compared them to simpler models, including a regression model with one predictor and a regression model with multiple predictors, to determine the incremental validity of the machine learning models above and beyond the simpler models. A summary of the accuracy of the machine learning models is in Table 19.1.
Table 19.1: Accuracy of the models in predicting Quarterbacks’ future fantasy points (test data).
| Model | \(R^2\) | Mean Error | Mean Absolute Error |
|---|---|---|---|
| Regression with One Predictor | 0.44 | 7.91 | 57.08 |
| Regression with Multiple Predictors | 0.38 | 9.49 | 61.38 |
| LASSO | 0.40 | 9.52 | 60.52 |
| Ridge Regression | 0.42 | 8.71 | 59.65 |
| Elastic Net | 0.42 | 8.71 | 59.65 |
| Random Forest | 0.43 | 10.87 | 58.79 |
| Tree-Boosting | 0.44 | -10.05 | 56.41 |
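The indices in Table 19.1 can be computed from each model’s test-set predictions. As a brief sketch, assuming a data frame df with the predicted (pred) and actual (fantasyPoints_lag) values, as in the scatterplots above; the sign convention for mean error (actual minus predicted, here) is an assumption:
# Compute discrimination (R-squared), bias (mean error), and calibration (mean absolute error)
df %>%
  dplyr::summarise(
    r_squared = yardstick::rsq_vec(truth = fantasyPoints_lag, estimate = pred),
    mean_error = mean(fantasyPoints_lag - pred, na.rm = TRUE), # assumed sign convention: actual minus predicted
    mean_absolute_error = yardstick::mae_vec(truth = fantasyPoints_lag, estimate = pred))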
The most accurate models in terms of discrimination (\(R^2\)) achieved an \(R^2\) of 0.44, which was reached by two approaches: a tree-boosting model with many predictors and a regression model with one predictor, the player’s fantasy points in the previous season. The regression model with one predictor was more accurate (in terms of \(R^2\)) than both a regression model with multiple predictors and the other machine learning models that leveraged multiple predictors. This suggests that the other predictors examined did not have incremental validity; they did not show utility above and beyond a player’s prior-season fantasy points in predicting their future fantasy points. If anything, considering the additional predictor variables showed decremental validity: including the additional variables worsened prediction, possibly by weakening the predictive association of the previous season’s fantasy points and misattributing its predictive effect to other, related predictors in the model. This is consistent with the common finding that simple, parsimonious models are often preferable. For example, Youngstrom et al. (2018) found that simple models performed just as well as, and in some cases better than, LASSO models in classifying bipolar disorder.
The most accurate model in terms of calibration (mean absolute error) was the tree-boosting model, which had a mean absolute error of 56.41 and was only slightly more accurate than the regression model with one predictor. However, even for the most accurate model, the predicted values were, on average, more than 56 points away from the actual values, suggesting that the model-predicted values were not particularly accurate. In sum, the regression model with one predictor was the most accurate at differentiating between players (\(R^2\)), whereas the tree-boosting model was the most accurate in terms of how close the predicted values were to the actual values.
All models explained less than half of the variance in players’ future fantasy points. Thus, there remains considerable variance to be explained. The considerable unexplained variance suggests that we are missing key variables that are important predictors of future fantasy performance. It is also consistent with the conclusion from many other domains in psychology that human behavior is challenging to predict (Petersen, 2025b). We could likely improve our predictive accuracy by including projections, but it is unclear what information projections are (or are not) based on. For instance, if projections already account for the prior season’s statistics and fantasy points, then including both may add little incremental predictive accuracy.
19.10 Conclusion
Machine learning attempts to maximize prediction accuracy. There are various types of machine learning models. Supervised learning involves learning from data where the correct classification or outcome is known. Unsupervised learning involves learning from data without known classifications. Machine learning can incorporate many predictor variables, so it is important to perform cross-validation to avoid overfitting. We examined multiple machine learning models to identify the model that generates the most accurate prediction of Quarterbacks’ future fantasy points. All models explained less than half of the variance in players’ fantasy points. The most accurate model, in terms of discrimination (\(R^2\)), was achieved using regression with one predictor: the player’s fantasy points in the previous season. The most accurate model in terms of calibration (mean absolute error) was the tree-boosting model, which accounted for the longitudinal nature of the data. The considerable unexplained variance suggests that we are missing key variables that are important predictors of future fantasy performance.
19.11 Session Info
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] glmnet_4.1-10 Matrix_1.7-3 knitr_1.50 lubridate_1.9.4
[5] forcats_1.0.0 stringr_1.5.1 readr_2.1.5 tidyverse_2.0.0
[9] effectsize_1.0.1 gpboost_1.6.1 R6_2.6.1 LongituRF_0.9
[13] yardstick_1.3.2 workflowsets_1.1.1 workflows_1.2.0 tune_1.3.0
[17] tidyr_1.3.1 tibble_3.3.0 rsample_1.3.0 recipes_1.3.1
[21] purrr_1.1.0 parsnip_1.3.2 modeldata_1.4.0 infer_1.0.9
[25] ggplot2_3.5.2 dplyr_1.1.4 dials_1.4.0 scales_1.4.0
[29] broom_1.0.8 tidymodels_1.3.0 powerjoin_0.1.0 missRanger_2.6.1
[33] future_1.58.0 petersenlab_1.1.7
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 shape_1.4.6.1 rstudioapi_0.17.1
[4] jsonlite_2.0.0 datawizard_1.2.0 magrittr_2.0.3
[7] TH.data_1.1-3 estimability_1.5.1 farver_2.1.2
[10] nloptr_2.2.1 rmarkdown_2.29 vctrs_0.6.5
[13] minqa_1.2.8 base64enc_0.1-3 sparsevctrs_0.3.4
[16] htmltools_0.5.8.1 Formula_1.2-5 parallelly_1.45.0
[19] htmlwidgets_1.6.4 sandwich_3.1-1 plyr_1.8.9
[22] zoo_1.8-14 emmeans_1.11.2 lifecycle_1.0.4
[25] iterators_1.0.14 pkgconfig_2.0.3 fastmap_1.2.0
[28] rbibutils_2.3 digest_0.6.37 colorspace_2.1-1
[31] furrr_0.3.1 Hmisc_5.2-3 labeling_0.4.3
[34] latex2exp_0.9.6 randomForest_4.7-1.2 RJSONIO_2.0.0
[37] timechange_0.3.0 compiler_4.5.1 withr_3.0.2
[40] htmlTable_2.4.3 backports_1.5.0 DBI_1.2.3
[43] psych_2.5.6 MASS_7.3-65 lava_1.8.1
[46] tools_4.5.1 pbivnorm_0.6.0 ranger_0.17.0
[49] foreign_0.8-90 future.apply_1.20.0 nnet_7.3-20
[52] doFuture_1.1.2 glue_1.8.0 quadprog_1.5-8
[55] nlme_3.1-168 grid_4.5.1 checkmate_2.3.2
[58] cluster_2.1.8.1 reshape2_1.4.4 generics_0.1.4
[61] gtable_0.3.6 tzdb_0.5.0 class_7.3-23
[64] hms_1.1.3 data.table_1.17.8 foreach_1.5.2
[67] pillar_1.11.0 mitools_2.4 splines_4.5.1
[70] lhs_1.2.0 lattice_0.22-7 FNN_1.1.4.1
[73] survival_3.8-3 tidyselect_1.2.1 mix_1.0-13
[76] reformulas_0.4.1 gridExtra_2.3 stats4_4.5.1
[79] xfun_0.52 hardhat_1.4.1 timeDate_4041.110
[82] stringi_1.8.7 DiceDesign_1.10 yaml_2.3.10
[85] boot_1.3-31 evaluate_1.0.4 codetools_0.2-20
[88] cli_3.6.5 rpart_4.1.24 xtable_1.8-4
[91] parameters_0.27.0 Rdpack_2.6.4 lavaan_0.6-19
[94] Rcpp_1.1.0 globals_0.18.0 coda_0.19-4.1
[97] parallel_4.5.1 gower_1.0.2 bayestestR_0.16.1
[100] GPfit_1.0-9 lme4_1.1-37 listenv_0.9.1
[103] viridisLite_0.4.2 mvtnorm_1.3-3 ipred_0.9-15
[106] prodlim_2025.04.28 insight_1.3.1 rlang_1.1.6
[109] multcomp_1.4-28 mnormt_2.1.1