19 Machine Learning
This chapter provides an overview of machine learning.
19.1 Getting Started
19.1.1 Load Packages
19.1.2 Load Data
Code
# Downloaded Data - Processed
load(file = "./data/nfl_players.RData")
load(file = "./data/nfl_teams.RData")
load(file = "./data/nfl_rosters.RData")
load(file = "./data/nfl_rosters_weekly.RData")
load(file = "./data/nfl_schedules.RData")
load(file = "./data/nfl_combine.RData")
load(file = "./data/nfl_draftPicks.RData")
load(file = "./data/nfl_depthCharts.RData")
#load(file = "./data/nfl_pbp.RData")
#load(file = "./data/nfl_4thdown.RData")
#load(file = "./data/nfl_participation.RData")
#load(file = "./data/nfl_actualFantasyPoints_weekly.RData")
load(file = "./data/nfl_injuries.RData")
load(file = "./data/nfl_snapCounts.RData")
load(file = "./data/nfl_espnQBR_seasonal.RData")
load(file = "./data/nfl_espnQBR_weekly.RData")
load(file = "./data/nfl_nextGenStats_weekly.RData")
load(file = "./data/nfl_advancedStatsPFR_seasonal.RData")
load(file = "./data/nfl_advancedStatsPFR_weekly.RData")
load(file = "./data/nfl_playerContracts.RData")
load(file = "./data/nfl_ftnCharting.RData")
load(file = "./data/nfl_playerIDs.RData")
load(file = "./data/nfl_rankings_draft.RData")
load(file = "./data/nfl_rankings_weekly.RData")
load(file = "./data/nfl_expectedFantasyPoints_weekly.RData")
#load(file = "./data/nfl_expectedFantasyPoints_pbp.RData")
# Calculated Data - Processed
load(file = "./data/nfl_actualStats_career.RData")
load(file = "./data/nfl_actualStats_seasonal.RData")
load(file = "./data/player_stats_weekly.RData")
load(file = "./data/player_stats_seasonal.RData")
19.1.3 Specify Options
19.2 Overview of Machine Learning
Machine learning takes us away from focusing on causal inference. Machine learning does not care about which processes are causal—i.e., which processes influence the outcome. Instead, machine learning cares about prediction—it cares about a predictor variable to the extent that it increases predictive accuracy regardless of whether it is causally related to the outcome.
Machine learning can be useful for leveraging big data and many predictor variables to develop predictive models with greater accuracy. However, many machine learning techniques are black boxes—it is often unclear how or why certain predictions are made, which can make it difficult to interpret the model’s decisions and understand the underlying relationships between variables. Machine learning tends to be a data-driven, atheoretical technique. This can result in overfitting. Thus, when estimating machine learning models, it is common to keep a hold-out sample for use in cross-validation to evaluate the extent of shrinkage of model coefficients. The data that the model is trained on is known as the “training data”. The data that the model was not trained on but is then independently tested on—i.e., the hold-out sample—is the “test data”. Shrinkage occurs when predictor variables explain some random error variance in the original model. When the model is applied to an independent sample (i.e., the test data), the predictive model will likely not perform quite as well, and the regression coefficients will tend to get smaller (i.e., shrink).
If the test data were collected as part of the same processes as the original data and were merely held out for purposes of analysis, this is called internal cross-validation. If the test data were collected separately from the original data used to train the model, this is called external cross-validation.
Most machine learning methods were developed with cross-sectional data in mind. That is, they assume that each person has only one observation on the outcome variable. However, with longitudinal data, each person has multiple observations on the outcome variable.
When performing machine learning, various approaches may help address this:
- transform data from long to wide form, so that each person has only one row
- when designing the training and test sets, keep all measurements from the same person in the same data object (either the training or test set); do not have some measurements from a given person in the training set and other measurements from the same person in the test set (see the sketch after this list)
- use a machine learning approach that accounts for the clustered/nested nature of the data
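As an illustration of the second point, here is a minimal sketch of a player-level train/test split using the group_initial_split() function of the rsample package; the object names (playerSplit, trainData, testData), the grouping variable, and the split proportion are illustrative assumptions, not the book's own code.
Code
# Split at the player level so that all rows (seasons/weeks) from a given
# player end up in either the training set or the test set, never both
set.seed(52242)
playerSplit <- rsample::group_initial_split(
  player_stats_seasonal, # one row per player-season
  group = player_id,     # keep all of a player's rows together
  prop = 0.8)            # roughly 80% of the data in the training set
trainData <- rsample::training(playerSplit)
testData <- rsample::testing(playerSplit)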
19.3 Types of Machine Learning
There are many approaches to machine learning. This chapter discusses several key ones:
- supervised learning
- continuous outcome (i.e., regression)
- categorical outcome (i.e., classification)
- logistic regression
- support vector machine
- random forest
- boosting
- unsupervised learning
- semi-supervised learning
- reinforcement learning
- deep learning
- ensemble
Ensemble machine learning methods combine multiple machine learning approaches with the goal that combining multiple approaches might lead to more accurate predictions than any one method might be able to achieve on its own.
19.3.1 Supervised Learning
Supervised learning involves learning from data where the correct classification or outcome is known. For instance, predicting how many points a player will score is a supervised learning task, because there is a ground truth—the actual number of points scored—that can be used to train and evaluate the model.
Unlike linear and logistic regression, various machine learning techniques can handle multicollinearity, including LASSO regression, ridge regression, and elastic net regression. Least absolute shrinkage and selection operator (LASSO) regression performs selection of which predictor variables to keep in the model by shrinking some coefficients to zero, effectively removing them from the model. Ridge regression shrinks the coefficients of predictor variables toward zero, but not to zero, so it does not perform selection of which predictor variables to retain; this allows it to yield stable estimates for multiple correlated predictor variables in the context of multicollinearity. Elastic net involves a combination of LASSO and ridge regression; it performs selection of which predictor variables to keep by shrinking the coefficients of some predictor variables to zero (like LASSO, for variable selection), and it shrinks the coefficients of some predictor variables toward zero (like ridge, for handling multicollinearity among correlated predictors).
Unless interactions or nonlinear terms are specified, linear, logistic, LASSO, ridge, and elastic net regression assume additive and linear associations between the predictors and outcome. That is, they do not automatically account for interactions among the predictor variables or for nonlinear associations between the predictor variables and the outcome variable (unless interaction terms or nonlinear transformations are explicitly included). By contrast, random forests and tree boosting methods automatically account for interactions and nonlinear associations between predictors and the outcome variable. These models recursively partition the data in ways that capture complex patterns without the need to manually specify interaction or polynomial terms.
19.3.2 Unsupervised Learning
Unsupervised learning involves learning from data without known classifications. Unsupervised learning is used to discover hidden patterns, groupings, or structures in the data. For instance, if we want to identify different subtypes of Wide Receivers based on their playing style or performance metrics, or uncover underlying dimensions in a large dataset, we would use an unsupervised learning approach.
We describe cluster analysis in Chapter 21. We describe principal component analysis in Chapter 23.
19.3.3 Semi-supervised Learning
Semi-supervised learning combines supervised learning and unsupervised learning by training the model on some data for which the classification is known and some data for which the classification is not known.
19.3.4 Reinforcement Learning
Reinforcement learning involves an agent learning to make decisions by interacting with the environment. Through trial and error, the agent receives feedback in the form of rewards or penalties and learns a strategy that maximizes the cumulative reward over time.
19.4 Data Processing
Several data processing steps are necessary to get the data in the form necessary for machine learning.
19.4.1 Prepare Data for Merging
First, we apply several processing steps. We subset to the positions and variables of interest. We also rename columns and change variable types so that column names and types match across objects, which will be important later when we merge the data.
Code
# Prepare data for merging
#nfl_actualFantasyPoints_player_weekly <- nfl_actualFantasyPoints_player_weekly %>%
# rename(gsis_id = player_id)
#
#nfl_actualFantasyPoints_player_seasonal <- nfl_actualFantasyPoints_player_seasonal %>%
# rename(gsis_id = player_id)
player_stats_seasonal_offense <- player_stats_seasonal %>%
filter(position_group %in% c("QB","RB","WR","TE")) %>%
rename(gsis_id = player_id)
player_stats_weekly_offense <- player_stats_weekly %>%
filter(position_group %in% c("QB","RB","WR","TE")) %>%
rename(gsis_id = player_id)
nfl_expectedFantasyPoints_weekly <- nfl_expectedFantasyPoints_weekly %>%
rename(gsis_id = player_id)
## Rename other variables to ensure common names
## Ensure variables with the same name have the same type
nfl_players <- nfl_players %>%
mutate(
birth_date = as.Date(birth_date),
jersey_number = as.character(jersey_number),
gsis_it_id = as.character(gsis_it_id),
years_of_experience = as.integer(years_of_experience))
player_stats_seasonal_offense <- player_stats_seasonal_offense %>%
mutate(
birth_date = as.Date(birth_date),
jersey_number = as.character(jersey_number),
gsis_it_id = as.character(gsis_it_id))
nfl_rosters <- nfl_rosters %>%
mutate(
draft_number = as.integer(draft_number))
nfl_rosters_weekly <- nfl_rosters_weekly %>%
mutate(
draft_number = as.integer(draft_number))
nfl_depthCharts <- nfl_depthCharts %>%
mutate(
season = as.integer(season))
nfl_expectedFantasyPoints_weekly <- nfl_expectedFantasyPoints_weekly %>%
mutate(
season = as.integer(season),
receptions = as.integer(receptions)) %>%
distinct(gsis_id, season, week, .keep_all = TRUE) # drop duplicated rows
## Rename variables
nfl_draftPicks <- nfl_draftPicks %>%
rename(
games_career = games,
pass_completions_career = pass_completions,
pass_attempts_career = pass_attempts,
pass_yards_career = pass_yards,
pass_tds_career = pass_tds,
pass_ints_career = pass_ints,
rush_atts_career = rush_atts,
rush_yards_career = rush_yards,
rush_tds_career = rush_tds,
receptions_career = receptions,
rec_yards_career = rec_yards,
rec_tds_career = rec_tds,
def_solo_tackles_career = def_solo_tackles,
def_ints_career = def_ints,
def_sacks_career = def_sacks
)
## Subset variables
nfl_expectedFantasyPoints_weekly <- nfl_expectedFantasyPoints_weekly %>%
select(gsis_id:position, contains("_exp"), contains("_diff"), contains("_team")) #drop "raw stats" variables (e.g., rec_yards_gained) so they don't get coalesced with actual stats
# Check duplicate ids
player_stats_seasonal_offense %>%
group_by(gsis_id, season) %>%
filter(n() > 1) %>%
head()
Code
Below, we identify shared variable names across objects to be merged to make sure we account for them in merging:
[1] "gsis_id" "position"
[1] 21360
[1] 2855
[1] "gsis_id" "season" "team" "age"
[1] 14859
[1] 10395
[1] 14858
[1] 10395
[1] 14859
[1] 10325
[1] "gsis_id" "season" "week" "position" "full_name"
[1] 845134
[1] 100272
[1] 841942
[1] 100272
[1] 845101
[1] 97815
[1] 845118
[1] 97815
19.4.2 Merge Data
To perform machine learning, we need all of the predictor variables and the outcome variable in the same data file. Thus, we must merge data files. To merge data, we use the powerjoin package (Fabri, 2022), which allows coalescing variables with the same name from two different objects. We specify coalesce_xy, which means that—for variables that have the same name across both objects—it keeps the value from object 1 (if present); if not, it keeps the value from object 2. We first merge variables from objects that have the same structure—player data (i.e., id form), seasonal data (i.e., id-season form), or weekly data (i.e., id-season-week form).
Code
# Create lists of objects to merge, depending on data structure: id; or id-season; or id-season-week
playerListToMerge <- list(
nfl_players %>% filter(!is.na(gsis_id)),
nfl_draftPicks %>% filter(!is.na(gsis_id)) %>% select(-season)
)
playerSeasonListToMerge <- list(
player_stats_seasonal_offense %>% filter(!is.na(gsis_id), !is.na(season)),
nfl_advancedStatsPFR_seasonal %>% filter(!is.na(gsis_id), !is.na(season))
)
playerSeasonWeekListToMerge <- list(
nfl_rosters_weekly %>% filter(!is.na(gsis_id), !is.na(season), !is.na(week)),
#nfl_actualStats_offense_weekly,
nfl_expectedFantasyPoints_weekly %>% filter(!is.na(gsis_id), !is.na(season), !is.na(week))
#nfl_advancedStatsPFR_weekly,
)
playerSeasonWeekPositionListToMerge <- list(
nfl_depthCharts %>% filter(!is.na(gsis_id), !is.na(season), !is.na(week))
)
# Merge data
playerMerged <- playerListToMerge %>%
reduce(
powerjoin::power_full_join,
by = c("gsis_id"),
conflict = coalesce_xy) # where the objects have the same variable name (e.g., position), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2
playerSeasonMerged <- playerSeasonListToMerge %>%
reduce(
powerjoin::power_full_join,
by = c("gsis_id","season"),
conflict = coalesce_xy) # where the objects have the same variable name (e.g., team), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2
playerSeasonWeekMerged <- playerSeasonWeekListToMerge %>%
reduce(
powerjoin::power_full_join,
by = c("gsis_id","season","week"),
conflict = coalesce_xy) # where the objects have the same variable name (e.g., position), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2
To prepare for merging player data with seasonal data, we identify shared variable names across the objects:
[1] "gsis_id" "position"
[3] "position_group" "first_name"
[5] "last_name" "esb_id"
[7] "display_name" "rookie_year"
[9] "college_conference" "current_team_id"
[11] "draft_club" "draft_number"
[13] "draftround" "entry_year"
[15] "football_name" "gsis_it_id"
[17] "headshot" "jersey_number"
[19] "short_name" "smart_id"
[21] "status" "status_description_abbr"
[23] "status_short_description" "uniform_number"
[25] "height" "weight"
[27] "college_name" "birth_date"
[29] "suffix" "years_of_experience"
[31] "pfr_player_name" "team"
[33] "age"
Then we merge the player data with the seasonal data:
Code
seasonalData <- powerjoin::power_full_join(
playerSeasonMerged,
playerMerged %>% select(-age, -years_of_experience, -team, -team_abbr, -team_seq, -current_team_id), # drop variables from id objects that change from year to year (and thus are not necessarily accurate for a given season)
by = "gsis_id",
conflict = coalesce_xy # where the objects have the same variable name (e.g., position), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2
) %>%
filter(!is.na(season)) %>%
select(gsis_id, season, player_display_name, position, team, games, everything())
To prepare for merging player and seasonal data with weekly data, we identify shared variable names across the objects:
[1] "gsis_id" "season"
[3] "week" "team"
[5] "jersey_number" "status"
[7] "first_name" "last_name"
[9] "birth_date" "height"
[11] "weight" "college"
[13] "pfr_id" "headshot_url"
[15] "status_description_abbr" "football_name"
[17] "esb_id" "gsis_it_id"
[19] "smart_id" "entry_year"
[21] "rookie_year" "draft_club"
[23] "draft_number" "position"
Then we merge the player and seasonal data with the weekly data:
Code
seasonalAndWeeklyData <- powerjoin::power_full_join(
playerSeasonWeekMerged,
seasonalData,
by = c("gsis_id","season"),
conflict = coalesce_xy # where the objects have the same variable name (e.g., position), keep the values from object 1, unless it's NA, in which case use the relevant value from object 2
) %>%
filter(!is.na(week)) %>%
select(gsis_id, season, week, full_name, position, team, everything())
19.4.3 Additional Processing
For purposes of machine learning, we set all character and logical columns to factors.
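The code for this step is collapsed above; a minimal sketch, assuming the conversion is applied to the merged seasonalData object, might look like the following.
Code
# Convert character and logical columns to factors for machine learning
seasonalData <- seasonalData %>%
  mutate(across(where(is.character) | where(is.logical), as.factor))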
19.4.4 Fill in Missing Data for Static Variables
For variables that are not expected to change, such as a player’s name and position, we fill in missing values by using a player’s value on those variables from other rows in the data.
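The code for this step is collapsed above; a minimal sketch using the fill() function of the tidyr package (the specific variables filled here are illustrative) might look like the following.
Code
# For static variables, fill in missing values from a player's other rows;
# grouping by player prevents values from being filled across players
seasonalData <- seasonalData %>%
  group_by(gsis_id) %>%
  arrange(season, .by_group = TRUE) %>%
  tidyr::fill(position, height, weight, .direction = "downup") %>%
  ungroup()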
19.4.5 Create New Data Object for Merging with Later Predictions
We create a new data object that contains the latest seasonal data, for merging with later predictions.
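The code for this step is collapsed above; a minimal sketch, in which the object name seasonalData_latestSeason is hypothetical, might look like the following.
Code
# Keep only the most recent season of seasonal data, for merging with the
# model predictions generated later in the chapter
seasonalData_latestSeason <- seasonalData %>%
  filter(season == max(season, na.rm = TRUE))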
19.4.6 Lag Fantasy Points
To develop a machine learning model that uses a player’s performance metrics in a given season to predict the player’s fantasy points in the subsequent season, we need to include the player’s fantasy points from the subsequent season in the same row as the previous season’s performance metrics. Thus, we need to create a lagged variable for fantasy points. That way, 2024 fantasy points are in the same row as 2023 performance metrics, 2023 fantasy points are in the same row as 2022 performance metrics, and so on. We call this the lagged fantasy points variable (fantasyPoints_lag). We also retain the original same-year fantasy points variable (fantasyPoints) so it can be used as a predictor of a player’s subsequent-year fantasy points.
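The code for this step is collapsed above; a minimal sketch, assuming the seasonal data are arranged by season within player and using dplyr's lead() function to pull the subsequent season's points into the current row, might look like the following.
Code
# For each player, place next season's fantasy points (the outcome) in the
# same row as the current season's performance metrics (the predictors)
seasonalData_lag <- seasonalData %>%
  arrange(gsis_id, season) %>%
  group_by(gsis_id) %>%
  mutate(fantasyPoints_lag = lead(fantasyPoints, n = 1)) %>% # simple sketch; assumes consecutive seasons per player
  ungroup()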
19.4.7 Subset to Predictor Variables and Outcome Variable
Then, we drop variables that we do not want to include in the model as predictors or as the outcome variable. Thus, the variables remaining in the object are our predictor and outcome variables.
Code
dropVars <- c(
"birth_date", "loaded", "full_name", "player_name", "player_display_name", "display_name", "suffix", "headshot_url", "player", "pos",
"espn_id", "sportradar_id", "yahoo_id", "rotowire_id", "pff_id", "fantasy_data_id", "sleeper_id", "pfr_id",
"pfr_player_id", "cfb_player_id", "pfr_player_name", "esb_id", "gsis_it_id", "smart_id",
"college", "college_name", "team_abbr", "current_team_id", "college_conference", "draft_club", "status_description_abbr",
"status_short_description", "short_name", "headshot", "uniform_number", "jersey_number", "first_name", "last_name",
"football_name", "team")
seasonalData_lag_subset <- seasonalData_lag %>%
dplyr::select(-any_of(dropVars))
19.4.8 Separate by Position
Then, we separate the objects by position, so we can develop different machine learning models for each position.
Code
seasonalData_lag_subsetQB <- seasonalData_lag_subset %>%
filter(position == "QB") %>%
select(
gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic,
height, weight, rookie_year, draft_number,
fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag,
completions:rushing_2pt_conversions, special_teams_tds, contains(".pass"), contains(".rush"))
seasonalData_lag_subsetRB <- seasonalData_lag_subset %>%
filter(position == "RB") %>%
select(
gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic,
height, weight, rookie_year, draft_number,
fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag,
carries:special_teams_tds, contains(".rush"), contains(".rec"))
seasonalData_lag_subsetWR <- seasonalData_lag_subset %>%
filter(position == "WR") %>%
select(
gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic,
height, weight, rookie_year, draft_number,
fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag,
carries:special_teams_tds, contains(".rush"), contains(".rec"))
seasonalData_lag_subsetTE <- seasonalData_lag_subset %>%
filter(position == "TE") %>%
select(
gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic,
height, weight, rookie_year, draft_number,
fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag,
carries:special_teams_tds, contains(".rush"), contains(".rec"))
19.4.9 Split into Test and Training Data
Because machine learning can leverage many predictors, it is at high risk of overfitting—explaining error variance that would not generalize to new data, such as data for new players or future seasons. Thus, it is important to develop and tune the machine learning model so as not to overfit the model. In machine learning, it is common to use cross-validation where we train the model on a subset of the observations, and we evaluate how well the model generalizes to unseen (e.g., “hold-out”) observations. Then, we select the model parameters by how well the model generalizes to the hold-out data, so we are selecting a model that maximizes accuracy and generalizability (i.e., parsimony).
For internal cross-validation, it is common to divide the data into three subsets:
- training data
- validation data
- test data
The training set is used to fit the model. It is usually the largest portion of the data. We fit various models to the training set based on which parameters we want to evaluate (e.g., how many trees to use in a tree-boosting model).
The models fit with the training set are then evaluated using the unseen observations in the validation set. The validation set is used to tune the model parameters and prevent overfitting. We select the model parameters that yield the greatest accuracy in the validation set. In k-fold cross-validation, the validation set rotates across folds, thus replacing the need for a separate validation set.
The test set is used after model training and tuning to evaluate the model’s generalizability to unseen data.
Below, we split the data into test and training data. Our ultimate goal is to predict next year’s fantasy points. However, to do that effectively, we must first develop a model whose accuracy we can evaluate against historical fantasy points (because we do not yet know how many points players will score in the future). We want to include all current/active players in our training data, so that the model can draw on their prior data when we later predict their future performance. Thus, we use retired players as our hold-out (test) data. We split our data into 80% training data and 20% testing data. The 20% testing data thus includes only retired players, though not all retired players are in the testing data.
Then, for the analysis, we can either a) use rotating folds (as is the case for k-fold and leave-one-out [LOO] cross-validation), for which a separate validation set (apart from the training set) is not needed, as we do in Section 19.6, or b) subdivide the training set into an inner training set and a validation set, as we do in Section 19.8.5.4.
Code
seasonalData_lag_qb_all <- seasonalData_lag_subsetQB
seasonalData_lag_rb_all <- seasonalData_lag_subsetRB
seasonalData_lag_wr_all <- seasonalData_lag_subsetWR
seasonalData_lag_te_all <- seasonalData_lag_subsetTE
set.seed(52242) # for reproducibility (to keep the same train/holdout players)
activeQBs <- unique(seasonalData_lag_qb_all$gsis_id[which(seasonalData_lag_qb_all$season == max(seasonalData_lag_qb_all$season, na.rm = TRUE))])
retiredQBs <- unique(seasonalData_lag_qb_all$gsis_id[which(seasonalData_lag_qb_all$gsis_id %ni% activeQBs)])
numQBs <- length(unique(seasonalData_lag_qb_all$gsis_id))
qbHoldoutIDs <- sample(retiredQBs, size = ceiling(.2 * numQBs)) # holdout 20% of players
activeRBs <- unique(seasonalData_lag_rb_all$gsis_id[which(seasonalData_lag_rb_all$season == max(seasonalData_lag_rb_all$season, na.rm = TRUE))])
retiredRBs <- unique(seasonalData_lag_rb_all$gsis_id[which(seasonalData_lag_rb_all$gsis_id %ni% activeRBs)])
numRBs <- length(unique(seasonalData_lag_rb_all$gsis_id))
rbHoldoutIDs <- sample(retiredRBs, size = ceiling(.2 * numRBs)) # holdout 20% of players
set.seed(52242) # for reproducibility (to keep the same train/holdout players); added here to prevent a downstream error with predict.missRanger() due to missingness; this suggests that an error can arise from including a player in the holdout sample who has missingness in particular variables; would be good to identify which player(s) in the holdout sample evoke that error to identify the kinds of missingness that yield the error
activeWRs <- unique(seasonalData_lag_wr_all$gsis_id[which(seasonalData_lag_wr_all$season == max(seasonalData_lag_wr_all$season, na.rm = TRUE))])
retiredWRs <- unique(seasonalData_lag_wr_all$gsis_id[which(seasonalData_lag_wr_all$gsis_id %ni% activeWRs)])
numWRs <- length(unique(seasonalData_lag_wr_all$gsis_id))
wrHoldoutIDs <- sample(retiredWRs, size = ceiling(.2 * numWRs)) # holdout 20% of players
activeTEs <- unique(seasonalData_lag_te_all$gsis_id[which(seasonalData_lag_te_all$season == max(seasonalData_lag_te_all$season, na.rm = TRUE))])
retiredTEs <- unique(seasonalData_lag_te_all$gsis_id[which(seasonalData_lag_te_all$gsis_id %ni% activeTEs)])
numTEs <- length(unique(seasonalData_lag_te_all$gsis_id))
teHoldoutIDs <- sample(retiredTEs, size = ceiling(.2 * numTEs)) # holdout 20% of players
seasonalData_lag_qb_train <- seasonalData_lag_qb_all %>%
filter(gsis_id %ni% qbHoldoutIDs)
seasonalData_lag_qb_test <- seasonalData_lag_qb_all %>%
filter(gsis_id %in% qbHoldoutIDs)
seasonalData_lag_rb_train <- seasonalData_lag_rb_all %>%
filter(gsis_id %ni% rbHoldoutIDs)
seasonalData_lag_rb_test <- seasonalData_lag_rb_all %>%
filter(gsis_id %in% rbHoldoutIDs)
seasonalData_lag_wr_train <- seasonalData_lag_wr_all %>%
filter(gsis_id %ni% wrHoldoutIDs)
seasonalData_lag_wr_test <- seasonalData_lag_wr_all %>%
filter(gsis_id %in% wrHoldoutIDs)
seasonalData_lag_te_train <- seasonalData_lag_te_all %>%
filter(gsis_id %ni% teHoldoutIDs)
seasonalData_lag_te_test <- seasonalData_lag_te_all %>%
filter(gsis_id %in% teHoldoutIDs)
19.4.10 Impute the Missing Data
Many of the machine learning approaches described in this chapter require no missing observations in order for a case to be included in the analysis. In this section, we demonstrate one approach to imputing missing data. Here is a vignette demonstrating how to impute missing data using missForest(): https://rpubs.com/lmorgan95/MissForest (archived at: https://perma.cc/6GB4-2E22). Below, we impute the training data (and all data) separately by position. We then use the imputed training data to make out-of-sample predictions to fill in the missing data for the testing data. We do not want to impute the training and testing data together, so that we can keep them separate for the purposes of cross-validation. However, we impute all data (training and test data together) for purposes of making out-of-sample predictions from the machine learning models to predict players’ performance next season (when actuals are not yet available for evaluating their accuracy). To impute data, we use the missRanger package (Mayer, 2024).
Note: the following code takes a while to run.
Code
Variables to impute: fantasy_points, fantasy_points_ppr, special_teams_tds, passing_epa, pacr, rushing_epa, fantasyPoints_lag, passing_cpoe, rookie_year, draft_number, gs, pass_attempts.pass, throwaways.pass, spikes.pass, drops.pass, bad_throws.pass, times_blitzed.pass, times_hurried.pass, times_hit.pass, times_pressured.pass, batted_balls.pass, on_tgt_throws.pass, rpo_plays.pass, rpo_yards.pass, rpo_pass_att.pass, rpo_pass_yards.pass, rpo_rush_att.pass, rpo_rush_yards.pass, pa_pass_att.pass, pa_pass_yards.pass, att.rush, yds.rush, td.rush, x1d.rush, ybc.rush, yac.rush, brk_tkl.rush, att_br.rush, drop_pct.pass, bad_throw_pct.pass, on_tgt_pct.pass, pressure_pct.pass, ybc_att.rush, yac_att.rush, pocket_time.pass
Variables used to impute: gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic, height, weight, rookie_year, draft_number, fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag, completions, attempts, passing_yards, passing_tds, passing_interceptions, sacks_suffered, sack_yards_lost, sack_fumbles, sack_fumbles_lost, passing_air_yards, passing_yards_after_catch, passing_first_downs, passing_epa, passing_cpoe, passing_2pt_conversions, pacr, carries, rushing_yards, rushing_tds, rushing_fumbles, rushing_fumbles_lost, rushing_first_downs, rushing_epa, rushing_2pt_conversions, special_teams_tds, pocket_time.pass, pass_attempts.pass, throwaways.pass, spikes.pass, drops.pass, bad_throws.pass, times_blitzed.pass, times_hurried.pass, times_hit.pass, times_pressured.pass, batted_balls.pass, on_tgt_throws.pass, rpo_plays.pass, rpo_yards.pass, rpo_pass_att.pass, rpo_pass_yards.pass, rpo_rush_att.pass, rpo_rush_yards.pass, pa_pass_att.pass, pa_pass_yards.pass, drop_pct.pass, bad_throw_pct.pass, on_tgt_pct.pass, pressure_pct.pass, ybc_att.rush, yac_att.rush, att.rush, yds.rush, td.rush, x1d.rush, ybc.rush, yac.rush, brk_tkl.rush, att_br.rush
(Per-variable out-of-bag imputation error estimates for iterations 1–3 not shown.)
missRanger object. Extract imputed data via $data
- best iteration: 2
- best average OOB imputation error: 0.2524825
Code
data_all_qb <- seasonalData_lag_qb_all_imp$data
data_all_qb$fantasyPointsMC_lag <- scale(data_all_qb$fantasyPoints_lag, scale = FALSE) # mean-centered
data_all_qb_matrix <- data_all_qb %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
newData_qb <- data_all_qb %>%
filter(season == max(season, na.rm = TRUE)) %>%
select(-fantasyPoints_lag, -fantasyPointsMC_lag)
newData_qb_matrix <- data_all_qb_matrix[
data_all_qb_matrix[, "season"] == max(data_all_qb_matrix[, "season"], na.rm = TRUE), # keep only rows with the most recent season
, # all columns
drop = FALSE]
dropCol_qb <- which(colnames(newData_qb_matrix) %in% c("fantasyPoints_lag","fantasyPointsMC_lag"))
newData_qb_matrix <- newData_qb_matrix[, -dropCol_qb, drop = FALSE]
seasonalData_lag_qb_train_imp <- missRanger::missRanger(
seasonalData_lag_qb_train,
pmm.k = 5,
verbose = 2,
seed = 52242,
keep_forests = TRUE)
Variables to impute: fantasy_points, fantasy_points_ppr, special_teams_tds, passing_epa, pacr, rushing_epa, fantasyPoints_lag, passing_cpoe, rookie_year, draft_number, gs, pass_attempts.pass, throwaways.pass, spikes.pass, drops.pass, bad_throws.pass, times_blitzed.pass, times_hurried.pass, times_hit.pass, times_pressured.pass, batted_balls.pass, on_tgt_throws.pass, rpo_plays.pass, rpo_yards.pass, rpo_pass_att.pass, rpo_pass_yards.pass, rpo_rush_att.pass, rpo_rush_yards.pass, pa_pass_att.pass, pa_pass_yards.pass, att.rush, yds.rush, td.rush, x1d.rush, ybc.rush, yac.rush, brk_tkl.rush, att_br.rush, drop_pct.pass, bad_throw_pct.pass, on_tgt_pct.pass, pressure_pct.pass, ybc_att.rush, yac_att.rush, pocket_time.pass
Variables used to impute: gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic, height, weight, rookie_year, draft_number, fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag, completions, attempts, passing_yards, passing_tds, passing_interceptions, sacks_suffered, sack_yards_lost, sack_fumbles, sack_fumbles_lost, passing_air_yards, passing_yards_after_catch, passing_first_downs, passing_epa, passing_cpoe, passing_2pt_conversions, pacr, carries, rushing_yards, rushing_tds, rushing_fumbles, rushing_fumbles_lost, rushing_first_downs, rushing_epa, rushing_2pt_conversions, special_teams_tds, pocket_time.pass, pass_attempts.pass, throwaways.pass, spikes.pass, drops.pass, bad_throws.pass, times_blitzed.pass, times_hurried.pass, times_hit.pass, times_pressured.pass, batted_balls.pass, on_tgt_throws.pass, rpo_plays.pass, rpo_yards.pass, rpo_pass_att.pass, rpo_pass_yards.pass, rpo_rush_att.pass, rpo_rush_yards.pass, pa_pass_att.pass, pa_pass_yards.pass, drop_pct.pass, bad_throw_pct.pass, on_tgt_pct.pass, pressure_pct.pass, ybc_att.rush, yac_att.rush, att.rush, yds.rush, td.rush, x1d.rush, ybc.rush, yac.rush, brk_tkl.rush, att_br.rush
(Per-variable out-of-bag imputation error estimates for iterations 1–5 not shown.)
missRanger object. Extract imputed data via $data
- best iteration: 4
- best average OOB imputation error: 0.2482278
Code
data_train_qb <- seasonalData_lag_qb_train_imp$data
data_train_qb$fantasyPointsMC_lag <- scale(data_train_qb$fantasyPoints_lag, scale = FALSE) # mean-centered
data_train_qb_matrix <- data_train_qb %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
seasonalData_lag_qb_test_imp <- predict(
object = seasonalData_lag_qb_train_imp,
newdata = seasonalData_lag_qb_test,
seed = 52242)
data_test_qb <- seasonalData_lag_qb_test_imp
data_test_qb_matrix <- data_test_qb %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
Code
# RBs
seasonalData_lag_rb_all_imp <- missRanger::missRanger(
seasonalData_lag_rb_all,
pmm.k = 5,
verbose = 2,
seed = 52242,
keep_forests = TRUE)
seasonalData_lag_rb_all_imp
data_all_rb <- seasonalData_lag_rb_all_imp$data
data_all_rb$fantasyPointsMC_lag <- scale(data_all_rb$fantasyPoints_lag, scale = FALSE) # mean-centered
data_all_rb_matrix <- data_all_rb %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
newData_rb <- data_all_rb %>%
filter(season == max(season, na.rm = TRUE)) %>%
select(-fantasyPoints_lag, -fantasyPointsMC_lag)
newData_rb_matrix <- data_all_rb_matrix[
data_all_rb_matrix[, "season"] == max(data_all_rb_matrix[, "season"], na.rm = TRUE), # keep only rows with the most recent season
, # all columns
drop = FALSE]
dropCol_rb <- which(colnames(newData_rb_matrix) %in% c("fantasyPoints_lag","fantasyPointsMC_lag"))
newData_rb_matrix <- newData_rb_matrix[, -dropCol_rb, drop = FALSE]
seasonalData_lag_rb_train_imp <- missRanger::missRanger(
seasonalData_lag_rb_train,
pmm.k = 5,
verbose = 2,
seed = 52242,
keep_forests = TRUE)
seasonalData_lag_rb_train_imp
data_train_rb <- seasonalData_lag_rb_train_imp$data
data_train_rb$fantasyPointsMC_lag <- scale(data_train_rb$fantasyPoints_lag, scale = FALSE) # mean-centered
data_train_rb_matrix <- data_train_rb %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
seasonalData_lag_rb_test_imp <- predict(
object = seasonalData_lag_rb_train_imp,
newdata = seasonalData_lag_rb_test,
seed = 52242)
data_test_rb <- seasonalData_lag_rb_test_imp
data_test_rb_matrix <- data_test_rb %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
Code
# WRs
seasonalData_lag_wr_all_imp <- missRanger::missRanger(
seasonalData_lag_wr_all,
pmm.k = 5,
verbose = 2,
seed = 52242,
keep_forests = TRUE)
seasonalData_lag_wr_all_imp
data_all_wr <- seasonalData_lag_wr_all_imp$data
data_all_wr$fantasyPointsMC_lag <- scale(data_all_wr$fantasyPoints_lag, scale = FALSE) # mean-centered
data_all_wr_matrix <- data_all_wr %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
newData_wr <- data_all_wr %>%
filter(season == max(season, na.rm = TRUE)) %>%
select(-fantasyPoints_lag, -fantasyPointsMC_lag)
newData_wr_matrix <- data_all_wr_matrix[
data_all_wr_matrix[, "season"] == max(data_all_wr_matrix[, "season"], na.rm = TRUE), # keep only rows with the most recent season
, # all columns
drop = FALSE]
dropCol_wr <- which(colnames(newData_wr_matrix) %in% c("fantasyPoints_lag","fantasyPointsMC_lag"))
newData_wr_matrix <- newData_wr_matrix[, -dropCol_wr, drop = FALSE]
seasonalData_lag_wr_train_imp <- missRanger::missRanger(
seasonalData_lag_wr_train,
pmm.k = 5,
verbose = 2,
seed = 52242,
keep_forests = TRUE)
seasonalData_lag_wr_train_imp
data_train_wr <- seasonalData_lag_wr_train_imp$data
data_train_wr$fantasyPointsMC_lag <- scale(data_train_wr$fantasyPoints_lag, scale = FALSE) # mean-centered
data_train_wr_matrix <- data_train_wr %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
seasonalData_lag_wr_test_imp <- predict(
object = seasonalData_lag_wr_train_imp,
newdata = seasonalData_lag_wr_test,
seed = 52242)
data_test_wr <- seasonalData_lag_wr_test_imp
data_test_wr_matrix <- data_test_wr %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
Code
# TEs
seasonalData_lag_te_all_imp <- missRanger::missRanger(
seasonalData_lag_te_all,
pmm.k = 5,
verbose = 2,
seed = 52242,
keep_forests = TRUE)
seasonalData_lag_te_all_imp
data_all_te <- seasonalData_lag_te_all_imp$data
data_all_te$fantasyPointsMC_lag <- scale(data_all_te$fantasyPoints_lag, scale = FALSE) # mean-centered
data_all_te_matrix <- data_all_te %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
newData_te <- data_all_te %>%
filter(season == max(season, na.rm = TRUE)) %>%
select(-fantasyPoints_lag, -fantasyPointsMC_lag)
newData_te_matrix <- data_all_te_matrix[
data_all_te_matrix[, "season"] == max(data_all_te_matrix[, "season"], na.rm = TRUE), # keep only rows with the most recent season
, # all columns
drop = FALSE]
dropCol_te <- which(colnames(newData_te_matrix) %in% c("fantasyPoints_lag","fantasyPointsMC_lag"))
newData_te_matrix <- newData_te_matrix[, -dropCol_te, drop = FALSE]
seasonalData_lag_te_train_imp <- missRanger::missRanger(
seasonalData_lag_te_train,
pmm.k = 5,
verbose = 2,
seed = 52242,
keep_forests = TRUE)
seasonalData_lag_te_train_imp
data_train_te <- seasonalData_lag_te_train_imp$data
data_train_te$fantasyPointsMC_lag <- scale(data_train_te$fantasyPoints_lag, scale = FALSE) # mean-centered
data_train_te_matrix <- data_train_te %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
seasonalData_lag_te_test_imp <- predict(
object = seasonalData_lag_te_train_imp,
newdata = seasonalData_lag_te_test,
seed = 52242)
data_test_te <- seasonalData_lag_te_test_imp
data_test_te_matrix <- data_test_te %>%
mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>%
as.matrix()
19.5 Identify Cores for Parallel Processing
We use the future package (Bengtsson, 2025) for parallel (faster) processing.
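The code for this step is collapsed above; a minimal sketch of identifying the available cores and setting a parallel plan (the object name numCores is hypothetical) might look like the following.
Code
library(future)
# Identify the number of available cores, leaving one free for other tasks
numCores <- max(availableCores() - 1, 1)
# Run subsequent model fitting in parallel background R sessions
plan(multisession, workers = numCores)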
19.6 Set up the Cross-Validation Folds
In the examples below, we predict the future fantasy points of Quarterbacks. However, the examples could be applied to any of the positions. There are various approaches to cross-validation. In the examples below, we use k-fold cross-validation. However, we also provide the code to apply leave-one-out (LOO) cross-validation.
19.6.1 k-Fold Cross-Validation
k-fold cross-validation partitions the data into k folds (subsets). In each of the k iterations, the model is trained on \(k - 1\) folds and is evaluated on the remaining fold. For example, in a 10-fold cross-validation (i.e., \(k = 10\)), as used below, the model is trained 10 times, each time leaving out a different 10% of the data for validation. k-fold cross-validation is widely used because it tends to yield stable estimates of model performance, by balancing bias and variance. It is also computationally efficient, requiring only k model fits to evaluate model performance.
We set up the k folds using the group_vfold_cv() function of the rsample package (Frick, Chow, et al., 2025).
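A minimal sketch of creating the folds object used below (folds_kFold) follows; the use of 10 folds and the grouping by player (so that all of a player's seasons stay in the same fold) reflect the text above but are assumptions about the collapsed code.
Code
# 10-fold cross-validation, keeping all rows from a given player in the same fold
set.seed(52242)
folds_kFold <- rsample::group_vfold_cv(
  data_train_qb,
  group = gsis_id,
  v = 10)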
19.6.2 Leave-One-Out (LOO) Cross-Validation
Leave-one-out (LOO) cross-validation partitions the data into n folds, where n is the sample size. In each of the n iterations, the model is trained on \(n - 1\) observations and is evaluated on the one left out. For example, in a LOO cross-validation with 100 players, the model is trained 100 times, each time leaving out a different player for validation. LOO cross-validation is especially useful when the dataset is small—too small to form reliable training sets in k-fold cross-validation (e.g., with \(k = 5\) or \(k = 10\), which divide the sample into 5 or 10 folds, respectively). However, LOO tends to be less computationally efficient because it requires more model fits than k-fold cross-validation. LOO tends to have low bias, producing performance estimates closer to those obtained when fitting the model to the full dataset, because each model is trained on nearly all the data. However, LOO also tends to have high variance in its error estimates, because each validation fold contains only a single observation, making those estimates more sensitive to individual data points.
We set up the LOO folds using the loo_cv() function of the rsample package (Frick, Chow, et al., 2025).
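A minimal sketch follows, in which the object name folds_loo is hypothetical; note that loo_cv() leaves out one row (here, one player-season) at a time.
Code
# Leave-one-out cross-validation: one validation fold per observation
folds_loo <- rsample::loo_cv(data_train_qb)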
19.7 Fitting the Traditional Linear Regression Models
We describe linear regression in Chapter 11.
19.7.1 Regression with One Predictor
Below, we fit a linear regression model with one predictor and evaluate it with cross-validation. We also evaluate its accuracy on the hold-out (test) data. For each of the models, we fit and evaluate the models using the tidymodels ecosystem of packages (Kuhn & Wickham, 2020, 2025). We specify our model formula using the recipe() function of the recipes package (Kuhn, Wickham, et al., 2025). We define the model using the linear_reg(), set_engine(), and set_mode() functions of the parsnip package (Kuhn & Vaughan, 2025). We specify the workflow using the workflow(), add_recipe(), and add_model() functions of the workflows package (Vaughan & Couch, 2025). We fit the cross-validation model using the fit_resamples() function of the tune package (Kuhn, 2025). We specify the accuracy metrics to evaluate using the metric_set() function of the yardstick package (Kuhn, Vaughan, et al., 2025). We fit the final model using the fit() function of the workflows package (Vaughan & Couch, 2025). We evaluate the accuracy of the model’s predictions on the test data using the accuracyOverall() function of the petersenlab package (Petersen, 2025a).
Code
# Set seed for reproducibility
set.seed(52242)
# Set up Cross-Validation
folds <- folds_kFold
# Define Recipe (Formula)
rec <- recipes::recipe(
fantasyPoints_lag ~ fantasyPoints,
data = data_train_qb)
# Define Model
lm_spec <- parsnip::linear_reg() %>%
parsnip::set_engine("lm") %>%
parsnip::set_mode("regression")
# Workflow
lm_wf <- workflows::workflow() %>%
workflows::add_recipe(rec) %>%
workflows::add_model(lm_spec)
# Fit Model with Cross-Validation
cv_results <- tune::fit_resamples(
lm_wf,
resamples = folds,
metrics = yardstick::metric_set(rmse, mae, rsq),
control = tune::control_resamples(save_pred = TRUE)
)
# View Cross-Validation metrics
tune::collect_metrics(cv_results)
Code
Code
Figure 19.1 depicts the predicted versus actual fantasy points for the model on the test data.
Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)
ggplot(
df,
aes(
x = pred,
y = fantasyPoints_lag)) +
geom_point(
size = 2,
alpha = 0.6) +
geom_abline(
slope = 1,
intercept = 0,
color = "blue",
linetype = "dashed") +
coord_equal(
xlim = axis_limits,
ylim = axis_limits) +
labs(
title = "Predicted vs Actual Fantasy Points (Test Data)",
x = "Predicted Fantasy Points",
y = "Actual Fantasy Points"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Below are the model predictions for next year’s fantasy points:
19.7.2 Regression with Multiple Predictors
Below, we fit a linear regression model with multiple predictors and evaluate it with cross-validation. We also evaluate its accuracy on the hold-out (test) data.
Code
# Set seed for reproducibility
set.seed(52242)
# Set up Cross-Validation
folds <- folds_kFold
# Define Recipe (Formula)
rec <- recipes::recipe(
fantasyPoints_lag ~ ., # use all predictors
data = data_train_qb %>% select(-gsis_id, -fantasyPointsMC_lag))
# Define Model
lm_spec <- parsnip::linear_reg() %>%
parsnip::set_engine("lm") %>%
parsnip::set_mode("regression")
# Workflow
lm_wf <- workflows::workflow() %>%
workflows::add_recipe(rec) %>%
workflows::add_model(lm_spec)
# Fit Model with Cross-Validation
cv_results <- tune::fit_resamples(
lm_wf,
resamples = folds,
metrics = yardstick::metric_set(rmse, mae, rsq),
control = tune::control_resamples(save_pred = TRUE)
)
# View Cross-Validation metrics
tune::collect_metrics(cv_results)
Code
Code
Figure 19.2 depicts the predicted versus actual fantasy points for the model on the test data.
Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)
ggplot(
df,
aes(
x = pred,
y = fantasyPoints_lag)) +
geom_point(
size = 2,
alpha = 0.6) +
geom_abline(
slope = 1,
intercept = 0,
color = "blue",
linetype = "dashed") +
coord_equal(
xlim = axis_limits,
ylim = axis_limits) +
labs(
title = "Predicted vs Actual Fantasy Points (Test Data)",
x = "Predicted Fantasy Points",
y = "Actual Fantasy Points"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Below are the model predictions for next year’s fantasy points:
19.8 Fitting the Machine Learning Models
19.8.1 Least Absolute Shrinkage and Selection Operator (LASSO)
Below, we fit a LASSO model. We evaluate it and tune its penalty parameter with cross-validation. The penalty parameter in a LASSO model controls the strength of regularization applied to the model coefficients. Smaller penalty values result in less regularization, allowing the model to retain more predictors with larger (nonzero) coefficients. This typically reduces bias but increases variance of the model’s predictions, as the model may overfit to the training data by including irrelevant or weak predictors. Larger penalty values apply stronger regularization, shrinking more coefficients exactly to zero. This encourages a sparser model that may increase bias (by excluding useful predictors) but reduces variance and improves generalizability by simplifying the model and reducing overfitting.
After tuning the model, we evaluate its accuracy on the hold-out (test) data.
The LASSO models were fit using the glmnet package (Friedman et al., 2010, 2025; Tay et al., 2023).
For the machine learning models, we perform the parameter tuning using the tune(), tune_grid(), select_best(), and finalize_workflow() functions of the tune package (Kuhn, 2025). We specify the grid of possible parameter values using the grid_regular() function of the dials package (Kuhn & Frick, 2025).
Code
# Set seed for reproducibility
set.seed(52242)
# Set up Cross-Validation
folds <- folds_kFold
# Define Recipe (Formula)
rec <- recipes::recipe(
fantasyPoints_lag ~ ., # use all predictors
data = data_train_qb %>% select(-gsis_id, -fantasyPointsMC_lag))
# Define Model
lasso_spec <-
parsnip::linear_reg(
penalty = tune::tune(),
mixture = 1) %>%
parsnip::set_engine("glmnet")
# Workflow
lasso_wf <- workflows::workflow() %>%
workflows::add_recipe(rec) %>%
workflows::add_model(lasso_spec)
# Define grid of penalties to try (log scale is typical)
penalty_grid <- dials::grid_regular(
dials::penalty(range = c(-4, -1)),
levels = 20)
# Tune the Penalty Parameter
cv_results <- tune::tune_grid(
lasso_wf,
resamples = folds,
grid = penalty_grid,
metrics = yardstick::metric_set(rmse, mae, rsq),
control = tune::control_grid(save_pred = TRUE)
)
# View Cross-Validation metrics
tune::collect_metrics(cv_results)
Code
best_penalty <- tune::select_best(cv_results, metric = "mae")
# Finalize Workflow with Best Penalty
final_wf <- tune::finalize_workflow(
lasso_wf,
best_penalty)
# Fit Final Model on Training Data
final_model <- workflows::fit(
final_wf,
data = data_train_qb)
# View Coefficients
final_model %>%
workflows::extract_fit_parsnip() %>%
broom::tidy()
Code
Figure 19.3 depicts the predicted versus actual fantasy points for the model on the test data.
Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)
ggplot(
df,
aes(
x = pred,
y = fantasyPoints_lag)) +
geom_point(
size = 2,
alpha = 0.6) +
geom_abline(
slope = 1,
intercept = 0,
color = "blue",
linetype = "dashed") +
coord_equal(
xlim = axis_limits,
ylim = axis_limits) +
labs(
title = "Predicted vs Actual Fantasy Points (Test Data)",
x = "Predicted Fantasy Points",
y = "Actual Fantasy Points"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Below are the model predictions for next year’s fantasy points:
19.8.2 Ridge Regression
Below, we fit a ridge regression model. We evaluate it and tune its penalty parameter with cross-validation. The penalty parameter in a ridge regression model controls the amount of regularization applied to the model’s coefficients. Smaller penalty values allow coefficients to remain large, resulting in a model that closely fits the training data. This may reduce bias but increases the risk of overfitting, especially in the presence of multicollinearity or many weak predictors. Larger penalty values shrink coefficients toward zero (though not exactly to zero), which reduces model complexity. This typically increases bias slightly but reduces variance of the model’s predictions, making the model more stable and better suited for generalization to new data.
After tuning the model, we also evaluate its accuracy on the hold-out (test) data.
The ridge regression models were fit using the glmnet package (Friedman et al., 2010, 2025; Tay et al., 2023).
Code
# Set seed for reproducibility
set.seed(52242)
# Set up Cross-Validation
folds <- folds_kFold
# Define Recipe (Formula)
rec <- recipes::recipe(
fantasyPoints_lag ~ ., # use all predictors
data = data_train_qb %>% select(-gsis_id, -fantasyPointsMC_lag))
# Define Model
ridge_spec <-
parsnip::linear_reg(
penalty = tune::tune(),
mixture = 0) %>%
parsnip::set_engine("glmnet")
# Workflow
ridge_wf <- workflows::workflow() %>%
workflows::add_recipe(rec) %>%
workflows::add_model(ridge_spec)
# Define grid of penalties to try (log scale is typical)
penalty_grid <- dials::grid_regular(
dials::penalty(range = c(-4, -1)),
levels = 20)
# Tune the Penalty Parameter
cv_results <- tune::tune_grid(
ridge_wf,
resamples = folds,
grid = penalty_grid,
metrics = yardstick::metric_set(rmse, mae, rsq),
control = tune::control_grid(save_pred = TRUE)
)
# View Cross-Validation metrics
tune::collect_metrics(cv_results)
Code
best_penalty <- tune::select_best(cv_results, metric = "mae")
# Finalize Workflow with Best Penalty
final_wf <- tune::finalize_workflow(
ridge_wf,
best_penalty)
# Fit Final Model on Training Data
final_model <- workflows::fit(
final_wf,
data = data_train_qb)
# View Coefficients
final_model %>%
workflows::extract_fit_parsnip() %>%
broom::tidy()
Code
Figure 19.4 depicts the predicted versus actual fantasy points for the model on the test data.
Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)
ggplot(
df,
aes(
x = pred,
y = fantasyPoints_lag)) +
geom_point(
size = 2,
alpha = 0.6) +
geom_abline(
slope = 1,
intercept = 0,
color = "blue",
linetype = "dashed") +
coord_equal(
xlim = axis_limits,
ylim = axis_limits) +
labs(
title = "Predicted vs Actual Fantasy Points (Test Data)",
x = "Predicted Fantasy Points",
y = "Actual Fantasy Points"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Below are the model predictions for next year’s fantasy points:
19.8.3 Elastic Net
Below, we fit an elastic net model. We evaluate it and tune its penalty and mixture parameters with cross-validation.
The penalty parameter controls the overall strength of regularization applied to the model’s coefficients. Smaller penalty values allow coefficients to remain large, which can reduce bias but increase variance of the model’s predictions and can increase the risk of overfitting. Larger penalty values shrink coefficients more aggressively, leading to simpler models with potentially higher bias but lower variance of predictions. This regularization helps prevent overfitting, especially when the model includes many predictors or multicollinearity.
The mixture parameter controls the balance between LASSO and ridge regularization. It ranges from 0 to 1: A value of 0 applies pure ridge regression, which shrinks all coefficients but keeps them in the model. A value of 1 applies pure LASSO, which can shrink some coefficients exactly to zero, effectively performing variable selection. Values between 0 and 1 apply a combination: ridge-like smoothing and LASSO-like sparsity. Smaller mixture values favor shrinkage without variable elimination, whereas larger values favor sparser solutions by excluding weak predictors.
After tuning the model, we also evaluate its accuracy on the hold-out (test) data.
The elastic net models were fit using the glmnet package (Friedman et al., 2010, 2025; Tay et al., 2023).
Code
# Set seed for reproducibility
set.seed(52242)
# Set up Cross-Validation
folds <- folds_kFold
# Define Recipe (Formula)
rec <- recipes::recipe(
fantasyPoints_lag ~ ., # use all predictors
data = data_train_qb %>% select(-gsis_id, -fantasyPointsMC_lag))
# Define Model
enet_spec <-
parsnip::linear_reg(
penalty = tune::tune(),
mixture = tune::tune()) %>%
parsnip::set_engine("glmnet")
# Workflow
enet_wf <- workflows::workflow() %>%
workflows::add_recipe(rec) %>%
workflows::add_model(enet_spec)
# Define a regular grid for both penalty and mixture
grid_enet <- dials::grid_regular(
dials::penalty(range = c(-4, -1)),
dials::mixture(range = c(0, 1)),
levels = c(20, 5) # 20 penalty values × 5 mixture values
)
# Tune the Grid
cv_results <- tune::tune_grid(
enet_wf,
resamples = folds,
grid = grid_enet,
metrics = yardstick::metric_set(rmse, mae, rsq),
control = tune::control_grid(save_pred = TRUE)
)
# View Cross-Validation metrics
tune::collect_metrics(cv_results)
Code
best_params <- tune::select_best(cv_results, metric = "mae")
# Finalize Workflow with Best Penalty and Mixture
final_wf <- tune::finalize_workflow(
enet_wf,
best_params)
# Fit Final Model on Training Data
final_model <- workflows::fit(
final_wf,
data = data_train_qb)
# View Coefficients
final_model %>%
workflows::extract_fit_parsnip() %>%
broom::tidy()
Code
Figure 19.5 depicts the predicted versus actual fantasy points for the model on the test data.
Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)
ggplot(
df,
aes(
x = pred,
y = fantasyPoints_lag)) +
geom_point(
size = 2,
alpha = 0.6) +
geom_abline(
slope = 1,
intercept = 0,
color = "blue",
linetype = "dashed") +
coord_equal(
xlim = axis_limits,
ylim = axis_limits) +
labs(
title = "Predicted vs Actual Fantasy Points (Test Data)",
x = "Predicted Fantasy Points",
y = "Actual Fantasy Points"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Below are the model predictions for next year’s fantasy points:
19.8.4 Random Forest Machine Learning
19.8.4.1 Assuming Independent Observations
Below, we fit a random forest model. We evaluate it and tune two parameters with cross-validation. The first parameter is mtry, which controls the number of predictors randomly sampled at each split in a decision tree. Smaller mtry values increase tree diversity by forcing trees to consider different subsets of predictors. This typically reduces the variance of the overall model’s predictions (because the final prediction is averaged over more diverse trees) but may increase bias if strong predictors are frequently excluded. Larger mtry values allow more predictors to be considered at each split, making trees more similar to each other. This can reduce bias but often increases the variance of the model’s predictions, because the trees are more correlated and less effective at error cancellation when averaged.
The second parameter is min_n, which controls the minimum number of observations that must be present in a node for a split to be attempted. Smaller min_n values allow trees to grow deeper and capture more fine-grained patterns in the training data. This typically reduces bias but increases the variance of the overall model’s predictions, because deeper trees are more likely to overfit to noise in the training data. Larger min_n values restrict the depth of the trees by requiring more data to justify a split. This can reduce variance by producing simpler, more generalizable trees, but may increase bias if the trees are unable to capture important structure in the data.
After tuning the model, we evaluate its accuracy on the hold-out (test) data.
The random forest models were fit using the ranger package (Wright, 2024; Wright & Ziegler, 2017). We specify the grid of possible parameter values using the grid_random(), update.parameters(), mtry(), and min_n() functions of the dials package (Kuhn & Frick, 2025) and the extract_parameter_set_dials() function of the hardhat package (Frick, Vaughan, et al., 2025).
Code
# Set seed for reproducibility
set.seed(52242)
# Set up Cross-Validation
folds <- folds_kFold
# Define Recipe (Formula)
rec <- recipes::recipe(
fantasyPoints_lag ~ ., # use all predictors
data = data_train_qb %>% select(-gsis_id, -fantasyPointsMC_lag))
# Define Model
rf_spec <-
parsnip::rand_forest(
mtry = tune::tune(),
min_n = tune::tune(),
trees = 500) %>%
parsnip::set_mode("regression") %>%
parsnip::set_engine(
"ranger",
importance = "impurity")
# Workflow
rf_wf <- workflows::workflow() %>%
workflows::add_recipe(rec) %>%
workflows::add_model(rf_spec)
# Create Grid
n_predictors <- recipes::prep(rec) %>%
recipes::juice() %>%
dplyr::select(-fantasyPoints_lag) %>%
ncol()
# Dynamically define ranges based on data
rf_params <- hardhat::extract_parameter_set_dials(rf_spec) %>%
dials:::update.parameters(
mtry = dials::mtry(range = c(1L, n_predictors)),
min_n = dials::min_n(range = c(2L, 10L))
)
rf_grid <- dials::grid_random(rf_params, size = 15) #dials::grid_regular(rf_params, levels = 5)
# Tune the Grid
cv_results <- tune::tune_grid(
rf_wf,
resamples = folds,
grid = rf_grid,
metrics = yardstick::metric_set(rmse, mae, rsq),
control = tune::control_grid(save_pred = TRUE)
)
# View Cross-Validation metrics
tune::collect_metrics(cv_results)
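In addition to collecting all of the cross-validation metrics, we can inspect which combinations of mtry and min_n performed best. A brief supplementary example (not part of the original workflow) using the show_best() function of the tune package:
# Show the top 5 mtry/min_n combinations by mean absolute error
tune::show_best(
  cv_results,
  metric = "mae",
  n = 5)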
Code
best_params <- tune::select_best(cv_results, metric = "mae")
# Finalize Workflow with Best mtry and min_n
final_wf <- tune::finalize_workflow(
rf_wf,
best_params)
# Fit Final Model on Training Data
final_model <- workflows::fit(
final_wf,
data = data_train_qb)
# View Feature Importance
rf_fit <- final_model %>%
workflows::extract_fit_parsnip()
rf_fit
parsnip model object
Ranger result
Call:
ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~8L, x), num.trees = ~500, min.node.size = min_rows(~3L, x), importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
Type: Regression
Number of trees: 500
Sample size: 1582
Number of independent variables: 73
Mtry: 8
Target node size: 3
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 6292.873
R squared (OOB): 0.5165346
season games gs
158486.2281 311601.5550 508011.8724
years_of_experience age ageCentered20
153382.2301 267708.3635 279843.6026
ageCentered20Quadratic height weight
275022.6554 94161.2553 191252.7328
rookie_year draft_number fantasy_points
140253.5439 272494.7684 1419670.4892
fantasy_points_ppr fantasyPoints completions
1293483.9743 1312301.3737 827216.4193
attempts passing_yards passing_tds
646841.5869 917999.3053 857088.9137
passing_interceptions sacks_suffered sack_yards_lost
130485.7228 161120.2954 220249.8113
sack_fumbles sack_fumbles_lost passing_air_yards
76489.5596 52392.6083 299007.8777
passing_yards_after_catch passing_first_downs passing_epa
329812.9334 939795.2571 431328.9812
passing_cpoe passing_2pt_conversions pacr
177761.1008 35523.1896 168910.0901
carries rushing_yards rushing_tds
240997.7562 181847.7875 57066.7128
rushing_fumbles rushing_fumbles_lost rushing_first_downs
54868.5816 33249.7074 118475.4108
rushing_epa rushing_2pt_conversions special_teams_tds
212956.2609 26187.3427 585.1349
pocket_time.pass pass_attempts.pass throwaways.pass
117380.4026 542011.1159 181719.6877
spikes.pass drops.pass bad_throws.pass
56148.3695 317853.6846 331561.0155
times_blitzed.pass times_hurried.pass times_hit.pass
347944.2411 194260.1925 192719.6026
times_pressured.pass batted_balls.pass on_tgt_throws.pass
441632.1558 92205.6330 362699.7319
rpo_plays.pass rpo_yards.pass rpo_pass_att.pass
161827.5832 166390.5923 141437.2823
rpo_pass_yards.pass rpo_rush_att.pass rpo_rush_yards.pass
134159.9270 64700.4095 108135.7323
pa_pass_att.pass pa_pass_yards.pass drop_pct.pass
274342.4437 247080.0647 160183.4104
bad_throw_pct.pass on_tgt_pct.pass pressure_pct.pass
195925.8780 132741.7748 167263.9786
ybc_att.rush yac_att.rush att.rush
169961.6625 155900.6714 229358.3533
yds.rush td.rush x1d.rush
158863.0638 65818.0750 155380.6876
ybc.rush yac.rush brk_tkl.rush
156091.4085 153192.0666 49373.0228
att_br.rush
60286.9067
Code
Figure 19.6 depicts the predicted versus actual fantasy points for the model on the test data.
Code
# Calculate combined range for axes
axis_limits <- range(c(df$pred, df$fantasyPoints_lag), na.rm = TRUE)
ggplot(
df,
aes(
x = pred,
y = fantasyPoints_lag)) +
geom_point(
size = 2,
alpha = 0.6) +
geom_abline(
slope = 1,
intercept = 0,
color = "blue",
linetype = "dashed") +
coord_equal(
xlim = axis_limits,
ylim = axis_limits) +
labs(
title = "Predicted vs Actual Fantasy Points (Test Data)",
x = "Predicted Fantasy Points",
y = "Actual Fantasy Points"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Below are the model predictions for next year’s fantasy points:
Code
Now we can stop the parallel backend:
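The code for stopping the backend is folded. A minimal sketch, assuming a future-based backend was registered earlier (e.g., via future::plan() or doFuture::registerDoFuture()):
# Return to sequential (single-process) execution
future::plan(future::sequential)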
19.8.4.2 Accounting for Longitudinal Data
The above approaches to machine learning assume that the observations are independent across rows. However, in our case, this assumption does not hold because the data are longitudinal—each player has multiple seasons, and each row corresponds to a unique player–season combination. In the approaches below, using a random forest model (this section) and a tree-boosting model (in Section 19.8.5), we address this by explicitly accounting for the nesting of longitudinal observations within players.
Approaches to estimating random forest models with longitudinal data are described by Hu & Szymczak (2023). Below, we fit longitudinal random forest models using the MERF() function of the LongituRF package (Capitaine, 2020).
Code
smerf <- LongituRF::MERF(
X = data_train_qb %>% dplyr::select(season:att_br.rush) %>% as.matrix(), # predictors of the fixed effects
Y = data_train_qb[,c("fantasyPoints_lag")] %>% as.matrix(), # outcome variable
Z = data_train_qb %>% dplyr::mutate(constant = 1) %>% dplyr::select(constant, passing_yards, passing_tds, passing_interceptions, passing_epa, pacr) %>% as.matrix(), # predictors of the random effects
id = data_train_qb[,c("gsis_id")] %>% as.matrix(), # player ID (for nesting)
time = data_train_qb[,c("ageCentered20")] %>% as.matrix(), # time variable
ntree = 500,
sto = "BM")
[1] "stopped after 11 iterations."
Call:
randomForest(x = X, y = ystar, ntree = ntree, mtry = mtry, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 25
Mean of squared residuals: 165.2049
% Var explained: 98.49
[1] 154.9823 110.8246 105.2359 108.3674 127.9415 145.0449 145.5596 158.7728
[9] 160.9756 151.7372 165.2049
The following code generates an error (shown below); the cause is unclear. The issue has been posted on the package’s GitHub repository—https://github.com/sistm/LongituRF/issues/5—in the hope that the package maintainer can help address it.
Code
# Predict on Test Data
predict(
smerf,
X = data_test_qb %>% dplyr::select(season:att_br.rush) %>% as.matrix(),
# note: Z is built from data_train_qb while X, id, and time use data_test_qb; the resulting row mismatch may be related to the non-conformable arguments error below
Z = data_train_qb %>% dplyr::mutate(constant = 1) %>% dplyr::select(constant, passing_yards, passing_tds, passing_interceptions, passing_epa, pacr) %>% as.matrix(),
id = data_test_qb[,c("gsis_id")] %>% as.matrix(),
time = data_test_qb[,c("ageCentered20")] %>% as.matrix())
Error in Z[w, , drop = FALSE] %*% object$random_effects[k, ]: non-conformable arguments
19.8.5 Combining Tree-Boosting with Mixed Models
To combine tree-boosting with mixed models, we use the gpboost package (Sigrist et al., 2025). This section is adapted from the following tutorial: https://towardsdatascience.com/mixed-effects-machine-learning-for-longitudinal-panel-data-with-gpboost-part-iii-523bb38effc
19.8.5.1 Process Data
If using a gamma distribution for the outcome, the outcome values must be strictly positive:
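The processing code is folded in the rendered chapter. Below is a purely illustrative sketch of one way to satisfy this requirement by flooring the outcome at a small positive value; the chapter's actual transformation is not shown, so treat this as an assumption:
# Illustrative only: a gamma likelihood requires y > 0, so floor the lagged
# fantasy-point outcome at a small positive value in the training and test data
data_train_qb <- data_train_qb %>%
  dplyr::mutate(fantasyPoints_lag = pmax(fantasyPoints_lag, 0.01))
data_test_qb <- data_test_qb %>%
  dplyr::mutate(fantasyPoints_lag = pmax(fantasyPoints_lag, 0.01))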
19.8.5.2 Specify Predictor Variables
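The folded code defines pred_vars_qb and pred_vars_qb_categorical, which later chunks reference. A hypothetical sketch of what such definitions might look like, assuming the same predictor range used for the MERF model above (the actual lists are in the folded code):
# Hypothetical sketch: names of the predictor columns passed to gpboost
pred_vars_qb <- data_train_qb %>%
  dplyr::select(season:att_br.rush) %>%
  names()

# Hypothetical placeholder: which predictors are treated as categorical
pred_vars_qb_categorical <- c("season")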
19.8.5.3 Specify General Model Options
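The folded code also sets several general options used below: model_likelihood, nrounds, and num_cores. A hypothetical sketch, with the likelihood inferred from the gamma likelihood shown in the training log later and the other values chosen arbitrarily:
# Hypothetical sketch: general model options referenced by later chunks
model_likelihood <- "gamma"  # matches the "[GPBoost with gamma likelihood]" log output below
nrounds <- 1000              # maximum number of boosting rounds considered during tuning (arbitrary value)
num_cores <- max(1, parallel::detectCores() - 1)  # number of threads for gpboost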
19.8.5.4 Identify Optimal Tuning Parameters
For identifying the optimal tuning parameters for boosting, we partition the training data into inner training data and validation data. We randomly split the training data into 80% inner training data and 20% held-out validation data. We then use the mean absolute error as our index of prediction accuracy on the held-out validation data.
Code
# Partition training data into inner training data and validation data
ntrain_qb <- dim(data_train_qb_matrix)[1]
set.seed(52242)
valid_tune_idx_qb <- sample.int(ntrain_qb, as.integer(0.2*ntrain_qb)) # validation set
folds_qb <- list(valid_tune_idx_qb)
# Specify parameter grid, gp_model, and gpb.Dataset
param_grid_qb <- list(
"learning_rate" = c(0.2, 0.1, 0.05, 0.01), # the step size used when updating predictions after each boosting round (high values make big updates, which can speed up learning but risk overshooting; low values are usually more accurate but require more rounds)
"max_depth" = c(3, 5, 7), # maximum depth (levels) of each decision tree; deeper trees capture more complex patterns and interactions but risk overfitting; shallower trees tend to generalize better
"min_data_in_leaf" = c(10, 50, 100), # minimum number of training examples in a leaf node; higher values = more regularization (simpler trees)
"lambda_l2" = c(0, 1, 5)) # L2 regularization penalty for large weights in tree splits; adds a "cost" for complexity; helps prevent overfitting by shrinking the contribution of each tree
other_params_qb <- list(
num_leaves = 2^6) # maximum number of leaves per tree; controls the maximum complexity of each tree (along with max_depth); more leaves = more expressive models, but can overfit if min_data_in_leaf is too small; num_leaves must be consistent with max_depth, because deeper trees naturally support more leaves; max is: 2^n, where n is the largest max_depth
gp_model_qb <- gpboost::GPModel(
group_data = data_train_qb_matrix[,"gsis_id"],
likelihood = model_likelihood,
group_rand_coef_data = cbind(
data_train_qb_matrix[,"ageCentered20"],
data_train_qb_matrix[,"ageCentered20Quadratic"]),
ind_effect_group_rand_coef = c(1,1))
gp_data_qb <- gpboost::gpb.Dataset(
data = data_train_qb_matrix[,pred_vars_qb],
categorical_feature = pred_vars_qb_categorical,
label = data_train_qb_matrix[,"fantasyPoints_lag"]) # could instead use mean-centered variable (fantasyPointsMC_lag) and add mean back afterward
# Find optimal tuning parameters
opt_params_qb <- gpboost::gpb.grid.search.tune.parameters(
param_grid = param_grid_qb,
params = other_params_qb,
num_try_random = NULL,
folds = folds_qb,
data = gp_data_qb,
gp_model = gp_model_qb,
nrounds = nrounds,
early_stopping_rounds = 50, # stops training early if the model hasn’t improved on the validation set in 50 rounds; prevents overfitting and saves time
verbose_eval = 1,
metric = "mae")
Error in fd$booster$update(fobj = fobj): [GPBoost] [Fatal] NaN occured in gradient wrt covariance / auxiliary parameter number 1 (counting starts at 1, total nb. par. = 4)
Error: object 'opt_params_qb' not found
We use the optimal parameters identified during tuning. We use a low learning rate (0.1) to avoid overfitting and set some light regularization (lambda_l2) for better generalization. We also set the maximum tree depth (max_depth) at 5 to capture complex (up to 5-way) interactions, and set the maximum number of terminal nodes (num_leaves) per tree at \(2^6\) (64), though the effective number of leaves cannot exceed \(2^{\text{max depth}} = 2^5 = 32\). We set the minimum number of samples in any leaf (min_data_in_leaf) to 100.
19.8.5.5 Specify Model and Tuning Parameters
Code
gp_model_qb <- gpboost::GPModel(
group_data = data_train_qb_matrix[,"gsis_id"],
likelihood = model_likelihood,
group_rand_coef_data = cbind(
data_train_qb_matrix[,"ageCentered20"],
data_train_qb_matrix[,"ageCentered20Quadratic"]),
ind_effect_group_rand_coef = c(1,1))
gp_data_qb <- gpboost::gpb.Dataset(
data = data_train_qb_matrix[,pred_vars_qb],
categorical_feature = pred_vars_qb_categorical,
label = data_train_qb_matrix[,"fantasyPoints_lag"])
params_qb <- list(
learning_rate = 0.1,
max_depth = 5,
min_data_in_leaf = 100,
lambda_l2 = 5,
num_leaves = 2^6,
num_threads = num_cores)
nrounds_qb <- 84 # identify optimal number of trees through iteration and cross-validation
#gp_model_qb$set_optim_params(params = list(optimizer_cov = "nelder_mead")) # to speed up model estimation
19.8.5.6 Fit Model
Code
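The fitting call itself is folded. A sketch of what it might look like with gpboost::gpb.train(), using the objects defined above and the gp_model_fit_qb name that the prediction code references later; this is an assumed reconstruction, not the chapter's exact call:
# Assumed sketch: fit the boosted trees together with the grouped random
# effects (random intercept and random age slopes) defined in gp_model_qb
gp_model_fit_qb <- gpboost::gpb.train(
  params = params_qb,
  data = gp_data_qb,
  gp_model = gp_model_qb,
  nrounds = nrounds_qb,
  verbose = 1)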
[GPBoost] [Info] Total Bins 8709
[GPBoost] [Info] Number of data points in the train set: 1582, number of used features: 73
[GPBoost] [Info] [GPBoost with gamma likelihood]: initscore=4.800459
[GPBoost] [Info] Start training from score 4.800459
19.8.5.7 Model Results
=====================================================
Model summary:
Nb. observations: 1582
Nb. groups: 316 (Group_1)
Covariance parameters (random effects):
Param.
Group_1 0
Group_1_rand_coef_nb_1 0
Group_1_rand_coef_nb_2 0
-----------------------------------------------------
Additional parameters:
Param.
shape 0.6343
=====================================================
19.8.5.8 Evaluate Accuracy of Model on Test Data
Code
# Test Model on Test Data
pred_test_qb <- predict(
gp_model_fit_qb,
data = data_test_qb_matrix[,pred_vars_qb],
group_data_pred = data_test_qb_matrix[,"gsis_id"],
group_rand_coef_data_pred = cbind(
data_test_qb_matrix[,"ageCentered20"],
data_test_qb_matrix[,"ageCentered20Quadratic"]),
predict_var = FALSE,
pred_latent = FALSE)
y_pred_test_qb <- pred_test_qb[["response_mean"]] # if outcome is mean-centered, add mean(data_train_qb_matrix[,"fantasyPoints_lag"])
predictedVsActual <- data.frame(
predictedPoints = y_pred_test_qb,
actualPoints = data_test_qb_matrix[,"fantasyPoints_lag"]
)
predictedVsActual
Code
Figure 19.9 depicts the predicted versus actual fantasy points for the model on the test data.
Code
# Calculate combined range for axes
axis_limits <- range(c(predictedVsActual$predictedPoints, predictedVsActual$actualPoints), na.rm = TRUE)
ggplot(
predictedVsActual,
aes(
x = predictedPoints,
y = actualPoints)) +
geom_point(
size = 2,
alpha = 0.6) +
geom_abline(
slope = 1,
intercept = 0,
color = "blue",
linetype = "dashed") +
coord_equal(
xlim = axis_limits,
ylim = axis_limits) +
labs(
title = "Predicted vs Actual Fantasy Points (Test Data)",
x = "Predicted Fantasy Points",
y = "Actual Fantasy Points"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
19.8.5.9 Generate Predictions for Next Season
Code
# Generate model predictions for next season
pred_nextYear_qb <- predict(
gp_model_fit_qb,
data = newData_qb_matrix[,pred_vars_qb],
group_data_pred = newData_qb_matrix[,"gsis_id"],
group_rand_coef_data_pred = cbind(
newData_qb_matrix[,"ageCentered20"],
newData_qb_matrix[,"ageCentered20Quadratic"]),
predict_var = FALSE,
pred_latent = FALSE)
newData_qb$fantasyPoints_lag <- pred_nextYear_qb$response_mean
# Merge with player names
newData_qb <- left_join(
newData_qb,
nfl_playerIDs %>% select(gsis_id, name),
by = "gsis_id"
)
newData_qb %>%
arrange(-fantasyPoints_lag) %>%
select(name, fantasyPoints_lag, fantasyPoints)
19.9 Summarizing the Machine Learning Models
We examined multiple machine learning models that included many predictor variables to identify the model that generates the most accurate prediction of Quarterbacks’ future fantasy points. We also compared them to simpler models, including a regression model with one predictor and a regression model with multiple predictors, to determine the incremental validity of the machine learning models above and beyond the simpler models. A summary of the accuracy of the machine learning models is in Table 19.1.
Table 19.1: Accuracy of the models in predicting Quarterbacks’ future fantasy points (test data).
| Model | \(R^2\) | Mean Error | Mean Absolute Error |
|---|---|---|---|
| Regression with One Predictor | 0.44 | 7.91 | 57.08 |
| Regression with Multiple Predictors | 0.38 | 9.49 | 61.38 |
| LASSO | 0.40 | 9.52 | 60.52 |
| Ridge Regression | 0.42 | 8.71 | 59.65 |
| Elastic Net | 0.42 | 8.71 | 59.65 |
| Random Forest | 0.43 | 10.87 | 58.79 |
| Tree-Boosting | 0.44 | -10.05 | 56.41 |
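The indices in Table 19.1 can be computed from each model’s test-set predictions. As a brief sketch, assuming a data frame df with the predicted (pred) and actual (fantasyPoints_lag) values, as in the scatterplots above; the sign convention for mean error (actual minus predicted, here) is an assumption:
# Compute discrimination (R-squared), bias (mean error), and calibration (mean absolute error)
df %>%
  dplyr::summarise(
    r_squared = yardstick::rsq_vec(truth = fantasyPoints_lag, estimate = pred),
    mean_error = mean(fantasyPoints_lag - pred, na.rm = TRUE), # assumed sign convention: actual minus predicted
    mean_absolute_error = yardstick::mae_vec(truth = fantasyPoints_lag, estimate = pred))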
The most accurate models in terms of discrimination (\(R^2\)) achieved an \(R^2\) of 0.44, which was reached by two approaches: a tree-boosting model with many predictors and a regression model with one predictor, the player’s fantasy points in the previous season. The regression model with one predictor was more accurate (in terms of \(R^2\)) than both a regression model with multiple predictors and the other machine learning models that leveraged multiple predictors. This suggests that the other predictors examined did not have incremental validity; they did not show utility above and beyond a player’s prior-season fantasy points in predicting their future fantasy points. If anything, considering the additional predictor variables showed decremental validity: including the additional variables worsened prediction, possibly by weakening the predictive association of the previous season’s fantasy points and misattributing its predictive effect to other, related predictors in the model. This is consistent with the common finding that simple, parsimonious models are often preferable. For example, Youngstrom et al. (2018) found that simple models performed just as well as, and in some cases better than, LASSO models in classifying bipolar disorder.
The most accurate model in terms of calibration (mean absolute error) was the tree-boosting model, which had a mean absolute error of 56.41 and was only slightly more accurate than the regression model with one predictor. However, even for the most accurate model, the predicted values were, on average, more than 56 points away from the actual values, suggesting that the model-predicted values were not particularly accurate. In sum, the regression model with one predictor was the most accurate at differentiating between players (\(R^2\)), whereas the tree-boosting model was the most accurate in terms of how close the predicted values were to the actual values.
All models explained less than half of the variance in players’ future fantasy points. Thus, there remains considerable variance to be explained. The considerable unexplained variance suggests that we are missing key variables that are important predictors of future fantasy performance. It is also consistent with the conclusion from many other domains in psychology that human behavior is challenging to predict (Petersen, 2025b). We could likely improve our predictive accuracy by including projections, but it is unclear what information projections are (or are not) based on. For instance, if projections already account for the prior season’s statistics and fantasy points, then including both may add little incremental predictive accuracy.
19.10 Conclusion
Machine learning attempts to maximize prediction accuracy. There are various types of machine learning models. Supervised learning involves learning from data where the correct classification or outcome is known. Unsupervised learning involves learning from data without known classifications. Machine learning can incorporate many predictor variables, so it is important to perform cross-validation to avoid overfitting. We examined multiple machine learning models to identify the model that generates the most accurate prediction of Quarterbacks’ future fantasy points. All models explained less than half of the variance in players’ fantasy points. The most accurate model, in terms of discrimination (\(R^2\)), was achieved using regression with one predictor: the player’s fantasy points in the previous season. The most accurate model in terms of calibration (mean absolute error) was the tree-boosting model, which accounted for the longitudinal nature of the data. The considerable unexplained variance suggests that we are missing key variables that are important predictors of future fantasy performance.
19.11 Session Info
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] glmnet_4.1-10 Matrix_1.7-3 knitr_1.50 lubridate_1.9.4
[5] forcats_1.0.0 stringr_1.5.1 readr_2.1.5 tidyverse_2.0.0
[9] effectsize_1.0.1 gpboost_1.6.1 R6_2.6.1 LongituRF_0.9
[13] yardstick_1.3.2 workflowsets_1.1.1 workflows_1.2.0 tune_1.3.0
[17] tidyr_1.3.1 tibble_3.3.0 rsample_1.3.0 recipes_1.3.1
[21] purrr_1.1.0 parsnip_1.3.2 modeldata_1.4.0 infer_1.0.9
[25] ggplot2_3.5.2 dplyr_1.1.4 dials_1.4.0 scales_1.4.0
[29] broom_1.0.8 tidymodels_1.3.0 powerjoin_0.1.0 missRanger_2.6.1
[33] future_1.58.0 petersenlab_1.1.7
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 shape_1.4.6.1 rstudioapi_0.17.1
[4] jsonlite_2.0.0 datawizard_1.2.0 magrittr_2.0.3
[7] TH.data_1.1-3 estimability_1.5.1 farver_2.1.2
[10] nloptr_2.2.1 rmarkdown_2.29 vctrs_0.6.5
[13] minqa_1.2.8 base64enc_0.1-3 sparsevctrs_0.3.4
[16] htmltools_0.5.8.1 Formula_1.2-5 parallelly_1.45.0
[19] htmlwidgets_1.6.4 sandwich_3.1-1 plyr_1.8.9
[22] zoo_1.8-14 emmeans_1.11.2 lifecycle_1.0.4
[25] iterators_1.0.14 pkgconfig_2.0.3 fastmap_1.2.0
[28] rbibutils_2.3 digest_0.6.37 colorspace_2.1-1
[31] furrr_0.3.1 Hmisc_5.2-3 labeling_0.4.3
[34] latex2exp_0.9.6 randomForest_4.7-1.2 RJSONIO_2.0.0
[37] timechange_0.3.0 compiler_4.5.1 withr_3.0.2
[40] htmlTable_2.4.3 backports_1.5.0 DBI_1.2.3
[43] psych_2.5.6 MASS_7.3-65 lava_1.8.1
[46] tools_4.5.1 pbivnorm_0.6.0 ranger_0.17.0
[49] foreign_0.8-90 future.apply_1.20.0 nnet_7.3-20
[52] doFuture_1.1.2 glue_1.8.0 quadprog_1.5-8
[55] nlme_3.1-168 grid_4.5.1 checkmate_2.3.2
[58] cluster_2.1.8.1 reshape2_1.4.4 generics_0.1.4
[61] gtable_0.3.6 tzdb_0.5.0 class_7.3-23
[64] hms_1.1.3 data.table_1.17.8 foreach_1.5.2
[67] pillar_1.11.0 mitools_2.4 splines_4.5.1
[70] lhs_1.2.0 lattice_0.22-7 FNN_1.1.4.1
[73] survival_3.8-3 tidyselect_1.2.1 mix_1.0-13
[76] reformulas_0.4.1 gridExtra_2.3 stats4_4.5.1
[79] xfun_0.52 hardhat_1.4.1 timeDate_4041.110
[82] stringi_1.8.7 DiceDesign_1.10 yaml_2.3.10
[85] boot_1.3-31 evaluate_1.0.4 codetools_0.2-20
[88] cli_3.6.5 rpart_4.1.24 xtable_1.8-4
[91] parameters_0.27.0 Rdpack_2.6.4 lavaan_0.6-19
[94] Rcpp_1.1.0 globals_0.18.0 coda_0.19-4.1
[97] parallel_4.5.1 gower_1.0.2 bayestestR_0.16.1
[100] GPfit_1.0-9 lme4_1.1-37 listenv_0.9.1
[103] viridisLite_0.4.2 mvtnorm_1.3-3 ipred_0.9-15
[106] prodlim_2025.04.28 insight_1.3.1 rlang_1.1.6
[109] multcomp_1.4-28 mnormt_2.1.1