
11  Multiple Regression

This chapter provides an overview of multiple regression.

11.1 Getting Started

11.1.1 Load Packages

Code
library("petersenlab")
library("rms")
library("car")
library("bestNormalize")
library("lme4")
library("performance")
library("lavaan")
library("mice")
library("miceadds")
library("interactions")
library("brms")
library("parallelly")
library("robustbase")
library("ordinal")
library("MASS")
library("broom")
library("rockchalk")
library("effectsize")
library("yhat")
library("tidymodels")
library("tidyverse")
library("knitr")

11.1.2 Load Data

Code
load(file = "./data/player_stats_weekly.RData")
load(file = "./data/player_stats_seasonal.RData")

We created the player_stats_weekly.RData and player_stats_seasonal.RData objects in Section 4.4.3.

11.2 Overview of Multiple Regression

Multiple regression is an extension of correlation. Correlation examines the association between one predictor variable and one outcome variable. Multiple regression examines the association between multiple predictor variables and one outcome variable. It allows us to obtain a more accurate estimate of the unique contribution of a given predictor variable by controlling for other variables (covariates). By including multiple predictor variables in the prediction of the outcome variable, it also allows for improved prediction accuracy.

In multiple regression, we estimate models. A model is a simplification of reality. All statistical analyses follow the same basic structure, as in Equation 11.1:

\[ \text{DATA} = \text{MODEL} + \text{ERROR} \tag{11.1}\]

Regression with one predictor variable takes the form of Equation 11.2:

\[ y = \beta_0 + \beta_1x_1 + \epsilon \tag{11.2}\]

where \(y\) is the outcome variable, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, \(x_1\) is the predictor variable, and \(\epsilon\) is the error term.

A regression line is depicted in Figure 11.1.

Figure 11.1: A Regression Best-Fit Line.

Regression with multiple predictors—i.e., multiple regression—takes the form of Equation 11.3:

\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p + \epsilon \tag{11.3}\]

where \(p\) is the number of predictor variables. Multiple regression is essentially a weighted sum of the predictor variables, plus an intercept. Under the hood, multiple regression seeks to identify the best weight for each predictor. That is, multiple regression identifies the optimal linear combination of the predictor variables that explains the most variance in the outcome variable. The intercept is the expected value of the outcome variable when all of the predictor variables have a value of zero.
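
To make this concrete, here is a minimal sketch (using simulated data rather than the chapter's data) showing that the fitted values from stats::lm() are exactly the intercept plus the weighted sum of the predictors:

Code
# illustrate that multiple regression predictions are a weighted sum
# of the predictors plus an intercept (simulated data)
set.seed(52242)

x1 <- rnorm(50)
x2 <- rnorm(50)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(50)

fit <- lm(y ~ x1 + x2)

b <- coef(fit) # intercept and estimated weights

manualFitted <- b["(Intercept)"] + b["x1"] * x1 + b["x2"] * x2

all.equal(unname(manualFitted), unname(fitted(fit))) # TRUE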

11.3 Components

  • \(B\) = unstandardized coefficient: direction and magnitude of the estimate (original scale)
  • \(\beta\) (beta) = standardized coefficient: direction and magnitude of the estimate (standard deviation scale)
  • \(SE\) = standard error: uncertainty of unstandardized estimate

11.3.1 Unstandardized Regression Coefficient

In a model with one predictor variable, the unstandardized regression coefficient is computed as in Equation 11.4:

\[ \text{unstandardized regression coefficient} = \frac{\text{covariance between } X \text{ and } Y}{\text{variance of } X} \tag{11.4}\]

In a model with multiple predictor variables, the unstandardized regression coefficient is computed as in Equation 11.5:

\[ \text{unstandardized regression coefficient} = \frac{\text{unique covariance of } X \text{ with } Y}{\text{unique variance of } X} \tag{11.5}\]

The unstandardized regression coefficient (\(B\)) is interpreted such that, for every unit change in the predictor variable, there is a __ unit change in the outcome variable (holding all other predictor variables constant). For instance, when examining the association between age and fantasy points, if the unstandardized regression coefficient is 2.3, players score on average 2.3 more points for each additional year of age. (In reality, we might expect a nonlinear, inverted-U-shaped association between age and fantasy points such that players tend to reach their peak in the middle of their careers.) Unstandardized regression coefficients are tied to the metric of the raw data. Thus, a large unstandardized regression coefficient for two variables may mean completely different things. Holding the strength of the association constant, you tend to see larger unstandardized regression coefficients for variables with smaller units and smaller unstandardized regression coefficients for variables with larger units.
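
As a quick numerical check of Equation 11.4 (using simulated data rather than the chapter's data), the slope from stats::lm() matches the covariance of \(X\) and \(Y\) divided by the variance of \(X\):

Code
# verify Equation 11.4: with one predictor, the unstandardized slope
# equals the covariance of X and Y divided by the variance of X
set.seed(52242)

x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

cov(x, y) / var(x)     # Equation 11.4
coef(lm(y ~ x))[["x"]] # slope from lm(); matches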

11.3.2 Standardized Regression Coefficient

The standardized regression coefficient is computed as in Equation 11.6:

\[ \small \text{standardized regression coefficient} = \text{unstandardized regression coefficient} \cdot \frac{\text{standard deviation of }X}{\text{standard deviation of }Y} \tag{11.6}\]

Standardized regression coefficients can be obtained by standardizing the variables to z scores so they all have a mean of zero and a standard deviation of one. The standardized regression coefficient (\(\beta\)) is interpreted such that, for every standard deviation change in the predictor variable, there is a __ standard deviation change in the outcome variable (holding all other predictor variables constant). For instance, when examining the association between age and fantasy points, if the standardized regression coefficient is 0.1, players score on average 0.1 standard deviation more points for each additional standard deviation of age. Standardized regression coefficients tend to fall within [−1, 1], though not in all instances. Thus, standardized regression coefficients tend to be more comparable across variables and models than unstandardized regression coefficients. In this way, standardized regression coefficients provide a meaningful index of effect size and can be used to identify the predictors with the strongest predictive validity.
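
As a quick numerical check of Equation 11.6 (again with simulated data), rescaling the unstandardized coefficient by the ratio of standard deviations matches the coefficient from refitting the model with z-scored variables:

Code
# verify Equation 11.6: the standardized coefficient equals the
# unstandardized coefficient multiplied by SD(X) / SD(Y)
set.seed(52242)

x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

B <- coef(lm(y ~ x))[["x"]] # unstandardized coefficient

B * sd(x) / sd(y)                  # Equation 11.6
coef(lm(scale(y) ~ scale(x)))[[2]] # refit with z-scored variables; matches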

11.3.3 Standard Error

The standard error of a regression coefficient represents the imprecision or uncertainty of the parameter. If we have less uncertainty (i.e., more confidence) about the parameter, the standard error will be small, reflecting greater precision of the regression coefficient. If we have more uncertainty (i.e., less confidence) about the parameter, the standard error will be large, reflecting less precision of the regression coefficient. If we used the same sampling procedure repeatedly and calculated the regression coefficient each time, the true parameter in the population would fall 68% of the time within the interval of: \([\text{model parameter estimate for the regression coefficient} \pm 1 \text{ standard error}]\). The standard error is related to the sample size—the larger the sample size, the smaller the standard error (the greater the precision of our estimate of the regression coefficient). In other words, having more data gives more precise estimates and thus increases statistical power.

11.3.4 Confidence Interval

A confidence interval represents a range of plausible values such that, with repeated sampling, the true value falls within a given interval with some confidence. Our parameter estimate for the regression coefficient, plus or minus 1 standard error, reflects the 68% confidence interval for the coefficient. The 95% confidence interval is computed as the parameter estimate plus or minus 1.96 standard errors (because in a standard normal distribution, the middle 95% of the distribution lies between −1.96 and +1.96). For instance, if the parameter estimate for the regression coefficient is 0.50, and the standard error is 0.10, the 95% confidence interval is [0.30, 0.70]: \(0.5 - (1.96 \times 0.10) = 0.3\); \(0.5 + (1.96 \times 0.10) = 0.7\). That is, if we used the same sampling procedure repeatedly, the true value of the regression coefficient would be expected to fall 95% of the time somewhere between 0.30 and 0.70.
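
Here is a minimal sketch (simulated data) of computing the 95% confidence interval by hand from the estimate and standard error; note that stats::confint() uses quantiles of the t distribution rather than 1.96, so its values differ slightly in small samples:

Code
# compute a 95% confidence interval for a regression coefficient
set.seed(52242)

x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

fit <- lm(y ~ x)

est <- coef(summary(fit))["x", "Estimate"]  # parameter estimate
se <- coef(summary(fit))["x", "Std. Error"] # standard error

est + c(-1.96, 1.96) * se # normal-approximation 95% CI
confint(fit, "x")         # t-based 95% CI; slightly wider in small samples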

11.4 Types of Regression

When the outcome variable is continuous or semicontinuous, linear regression is common. However, there are other types of regression depending on the type and distribution of the outcome variable. For instance, if the outcome variable is binary, logistic regression would be used. If the outcome variable is an ordered categorical variable, ordinal regression would be used. If the outcome variable is a count, Poisson or negative binomial regression would be preferable. If the outcome variable is a proportion, beta regression would be used.

Here are examples of each:

11.4.1 Linear Regression

We fit a linear regression model using the stats::lm() function.

Code
linearRegression <- lm(
  fantasyPoints ~ age + height + weight + target_share,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

summary(linearRegression)

Call:
lm(formula = fantasyPoints ~ age + height + weight + target_share, 
    data = player_stats_seasonal %>% filter(position %in% c("WR")), 
    na.action = "na.exclude")

Residuals:
    Min      1Q  Median      3Q     Max 
-746.79  -18.14  -10.12   10.34  293.12 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -27.36726   22.98098  -1.191  0.23377    
age            0.63790    0.21398   2.981  0.00289 ** 
height        -0.13754    0.41579  -0.331  0.74081    
weight         0.19270    0.06573   2.932  0.00339 ** 
target_share 750.28584    7.09218 105.791  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 44.83 on 4692 degrees of freedom
  (953 observations deleted due to missingness)
Multiple R-squared:  0.7139,    Adjusted R-squared:  0.7136 
F-statistic:  2927 on 4 and 4692 DF,  p-value: < 2.2e-16
Code
print(effectsize::standardize_parameters(linearRegression, method = "refit"), digits = 2)
# Standardization method: refit

Parameter    | Std. Coef. |        95% CI
-----------------------------------------
(Intercept)  |  -7.02e-17 | [-0.02, 0.02]
age          |       0.02 | [ 0.01, 0.04]
height       |  -3.87e-03 | [-0.03, 0.02]
weight       |       0.03 | [ 0.01, 0.06]
target share |       0.84 | [ 0.82, 0.85]

11.4.2 Logistic Regression

We fit a logistic regression model using the stats::glm() function and specifying family = binomial(). To calculate the model \(R^2\), we use the performance::r2() function of the performance package (Lüdecke et al., 2021; Lüdecke, Makowski, Ben-Shachar, Patil, Waggoner, et al., 2025).

Code
newdata <- player_stats_weekly %>%
  mutate( # create a binary variable to be used as the outcome variable
    receiving_td = case_when(
      is.na(receiving_tds) ~ NA_real_, # keep NA
      receiving_tds >= 1   ~ 1,
      receiving_tds == 0   ~ 0
    )
  )

logisticRegression <- glm(
  receiving_td ~ age + height + weight + target_share,
  data = newdata %>% filter(position %in% c("WR")),
  family = binomial(),
  na.action = "na.exclude"
)

summary(logisticRegression)

Call:
glm(formula = receiving_td ~ age + height + weight + target_share, 
    family = binomial(), data = newdata %>% filter(position %in% 
        c("WR")), na.action = "na.exclude")

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -6.399387   0.468032 -13.673  < 2e-16 ***
age          -0.007115   0.004042  -1.760   0.0783 .  
height        0.045882   0.008491   5.404 6.52e-08 ***
weight        0.002150   0.001298   1.657   0.0975 .  
target_share  8.076738   0.120654  66.941  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 44783  on 47078  degrees of freedom
Residual deviance: 39235  on 47074  degrees of freedom
  (12124 observations deleted due to missingness)
AIC: 39245

Number of Fisher Scoring iterations: 5
Code
performance::r2(logisticRegression)
# R2 for Logistic Regression
  Tjur's R2: 0.121
Code
print(effectsize::standardize_parameters(logisticRegression, method = "refit"), digits = 2)
# Standardization method: refit

Parameter    | Std. Coef. |         95% CI
------------------------------------------
(Intercept)  |      -1.73 | [-1.76, -1.70]
age          |      -0.02 | [-0.05,  0.00]
height       |       0.11 | [ 0.07,  0.15]
weight       |       0.03 | [-0.01,  0.07]
target share |       0.88 | [ 0.85,  0.90]

- Response is unstandardized.
Code
broom::tidy(
  logisticRegression,
  exponentiate = TRUE,
  conf.int = TRUE)

11.4.3 Ordinal Regression

We fit an ordinal regression model using the ordinal::clm() function of the ordinal package (Christensen, 2024).

Code
newdata <- player_stats_weekly
newdata$receiving_tdsFactor <- factor(newdata$receiving_tds, ordered = TRUE)
table(newdata$receiving_tdsFactor)

     0      1      2      3      4 
437886  15692   1891    190     10 
Code
ordinalRegression <- ordinal::clm(
  receiving_tdsFactor ~ age + height + target_share,
  data = newdata %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

summary(ordinalRegression)
formula: receiving_tdsFactor ~ age + height + target_share
data:    newdata %>% filter(position %in% c("WR"))

 link  threshold nobs  logLik    AIC      niter max.grad cond.H 
 logit flexible  47079 -23447.24 46908.48 7(0)  8.83e-07 3.0e+07

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
age          -0.007254   0.004003  -1.812     0.07 .  
height        0.058542   0.005584  10.484   <2e-16 ***
target_share  8.191548   0.118166  69.322   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
    Estimate Std. Error z value
0|1   6.8965     0.4185   16.48
1|2   9.1862     0.4199   21.88
2|3  11.6131     0.4293   27.05
3|4  14.9889     0.6525   22.97
(12124 observations deleted due to missingness)
Code
performance::r2(ordinalRegression)
  Nagelkerke's R2: 0.171
Code
print(effectsize::standardize_parameters(ordinalRegression, method = "refit"), digits = 2)
# Standardization method: refit

Component |    Parameter | Std. Coef. |         95% CI
------------------------------------------------------
intercept |          0|1 |       1.73 | [ 1.70,  1.76]
intercept |          1|2 |       4.02 | [ 3.96,  4.08]
intercept |          2|3 |       6.45 | [ 6.26,  6.63]
intercept |          3|4 |       9.82 | [ 8.84, 10.80]
location  |          age |      -0.02 | [-0.05,  0.00]
location  |       height |       0.14 | [ 0.11,  0.16]
location  | target share |       0.89 | [ 0.86,  0.91]

- Response is unstandardized.
Code
broom::tidy(
  ordinalRegression,
  exponentiate = TRUE,
  conf.int = TRUE)

11.4.4 Poisson Regression

We fit a Poisson regression model using the stats::glm() function and specifying family = poisson().

Code
poissonRegression <- glm(
  receiving_tds ~ age + height + weight + target_share,
  data = newdata %>% filter(position %in% c("WR")),
  family = poisson(),
  na.action = na.exclude
)

summary(poissonRegression)

Call:
glm(formula = receiving_tds ~ age + height + weight + target_share, 
    family = poisson(), data = newdata %>% filter(position %in% 
        c("WR")), na.action = na.exclude)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -5.964539   0.368830 -16.172  < 2e-16 ***
age          -0.002243   0.003163  -0.709   0.4783    
height        0.043557   0.006705   6.497 8.21e-11 ***
weight        0.002015   0.001024   1.967   0.0492 *  
target_share  5.157739   0.062052  83.120  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 34811  on 47078  degrees of freedom
Residual deviance: 29529  on 47074  degrees of freedom
  (12124 observations deleted due to missingness)
AIC: 47552

Number of Fisher Scoring iterations: 6
Code
performance::r2(poissonRegression)
# R2 for Generalized Linear Regression
  Nagelkerke's R2: 0.203
Code
print(effectsize::standardize_parameters(poissonRegression, method = "refit"), digits = 2)
# Standardization method: refit

Parameter    | Std. Coef. |         95% CI
------------------------------------------
(Intercept)  |      -1.76 | [-1.78, -1.74]
age          |  -7.01e-03 | [-0.03,  0.01]
height       |       0.10 | [ 0.07,  0.13]
weight       |       0.03 | [ 0.00,  0.06]
target share |       0.56 | [ 0.55,  0.57]

- Response is unstandardized.
Code
broom::tidy(
  poissonRegression,
  exponentiate = TRUE,
  conf.int = TRUE)

11.4.5 Negative Binomial Regression

We fit a negative binomial regression model using the MASS::glm.nb() function of the MASS package (Ripley & Venables, 2025).

Code
negativeBinomialRegression <- MASS::glm.nb(
  receiving_tds ~ age + height + weight + target_share,
  data = newdata %>% filter(position %in% c("WR")),
  na.action = na.exclude,
  control = glm.control(maxit = 10000)
)

summary(negativeBinomialRegression)

Call:
MASS::glm.nb(formula = receiving_tds ~ age + height + weight + 
    target_share, data = newdata %>% filter(position %in% c("WR")), 
    na.action = na.exclude, control = glm.control(maxit = 10000), 
    init.theta = 4.91223613, link = log)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -5.747307   0.383654 -14.980  < 2e-16 ***
age          -0.004274   0.003271  -1.307   0.1913    
height        0.039669   0.006968   5.693 1.25e-08 ***
weight        0.001795   0.001058   1.697   0.0896 .  
target_share  6.012864   0.077205  77.882  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(4.9122) family taken to be 1)

    Null deviance: 32837  on 47078  degrees of freedom
Residual deviance: 27480  on 47074  degrees of freedom
  (12124 observations deleted due to missingness)
AIC: 47362

Number of Fisher Scoring iterations: 1

              Theta:  4.912 
          Std. Err.:  0.438 

 2 x log-likelihood:  -47350.188 
Code
performance::r2(negativeBinomialRegression)
# R2 for Generalized Linear Regression
  Nagelkerke's R2: 0.214
Code
print(effectsize::standardize_parameters(negativeBinomialRegression, method = "refit"), digits = 2)
# Standardization method: refit

Parameter    | Std. Coef. |         95% CI
------------------------------------------
(Intercept)  |      -1.81 | [-1.83, -1.78]
age          |      -0.01 | [-0.03,  0.01]
height       |       0.09 | [ 0.06,  0.13]
weight       |       0.03 | [ 0.00,  0.06]
target share |       0.65 | [ 0.63,  0.67]

- Response is unstandardized.
Code
broom::tidy(
  negativeBinomialRegression,
  exponentiate = TRUE,
  conf.int = TRUE)

11.4.6 Beta Regression

We fit a (Bayesian) zero-one-inflated beta regression model (to allow for zeros and ones) using the brms::brm() function of the brms package (Bürkner, 2024) and specifying family = zero_one_inflated_beta().

Note 11.1: Bayesian beta regression

Note: the following code that fits the model takes a while to run. If you want to save time, you can instead load the model object (which has already been fit) using this code:

Code
load(url("https://osf.io/download/fe37j/"))
Code
betaRegression <- brms::brm(
  formula = target_share ~ age + height + weight,
  data = newdata %>% filter(position %in% c("WR")),
  family = zero_one_inflated_beta(),
  cores = 4,
  threads = threading(parallelly::availableCores()),
  seed = 52242,
  silent = 0
)
Code
summary(betaRegression)
 Family: zero_one_inflated_beta 
  Links: mu = logit 
Formula: target_share ~ age + height + weight 
   Data: newdata %>% filter(position %in% c("WR")) (Number of observations: 47079) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -2.75      0.12    -2.99    -2.52 1.00     4979     2954
age           0.02      0.00     0.02     0.02 1.00     5557     2732
height       -0.01      0.00    -0.01    -0.00 1.00     5448     2503
weight        0.01      0.00     0.00     0.01 1.00     4841     3534

Further Distributional Parameters:
    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi    12.22      0.09    12.05    12.39 1.00     2351     2190
zoi     0.14      0.00     0.13     0.14 1.00     2134     2415
coi     0.00      0.00     0.00     0.00 1.00     2375     2340

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Code
performance::r2(betaRegression)
# Bayesian R2 with Compatibility Interval

  Conditional R2: 0.010 (95% CI [0.009, 0.011])

11.5 Assumptions of Multiple Regression

Linear regression models make the following assumptions:

  • there is a linear association between the predictor variable and the outcome variable
  • there is homoscedasticity of the residuals; the residuals do not differ as a function of the predictor variable or as a function of the outcome variable
  • the residuals are independent; they are uncorrelated with each other
  • the residuals are normally distributed

Homoscedasticity of the residuals means that the variance of the residuals does not differ as a function of the outcome variable or as a function of the predictor variable (i.e., the residuals show constant variance as a function of outcome/predictors). If the residuals differ as a function of the outcome or predictor variable, this is called heteroscedasticity.

Those are some of the key assumptions of multiple regression. However, there are additional assumptions of multiple regression, including ones discussed in the chapter on Causal Inference. For instance, the variables included should reflect a causal process such that the predictor variables influence the outcome variable, and not the other way around. That is, the outcome variable should not influence the predictor variables (i.e., there should be no reverse causation). In addition, it is important to control for any confound(s). If a confound is not controlled for, this is called omitted-variable bias, and it leads the researcher to incorrectly attribute the effects of the omitted variable to the included variables.

11.5.1 Evaluating and Addressing Assumptions of Multiple Regression

11.5.1.1 Linear Association

To evaluate the shape of the association between the predictor variables and the outcome variable, we can examine scatterplots (Figure 11.2), residual plots (Figure 11.21), marginal model plots (Figure 11.12), added-variable plots (Figure 11.13), and component-plus-residual plots (Figure 11.14). Residual plots depict the residuals (errors) on the y-axis as a function of the fitted values or a specific predictor on the x-axis. Marginal model plots are basically glorified scatterplots that depict the outcome variable (both in terms of observed values and model-fitted values) on the y-axis and the predictor variables on the x-axis. Added-variable plots depict the unique association of each predictor variable with the outcome variable when controlling for all the other predictor variables in the model. Component-plus-residual plots depict partial residuals on the y-axis as a function of each predictor variable on the x-axis, where a partial residual for a given predictor is the effect of that predictor (thus controlling for all the other predictor variables in the model) plus the residual from the full model.
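
As a sketch of how these plots can be generated (assuming the linearRegression model object fit in Section 11.4.1 is in the workspace), the car package provides a function for each:

Code
# diagnostic plots for the model fit in Section 11.4.1
car::marginalModelPlots(linearRegression) # marginal model plots
car::avPlots(linearRegression)            # added-variable plots
car::crPlots(linearRegression)            # component-plus-residual plots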

Examples of linear and nonlinear associations are depicted with scatterplots in Figure 11.2.

Code
set.seed(52242)

sampleSize <- 1000
quadraticX <- runif(
  sampleSize,
  min = -4,
  max = 4)
linearY <- quadraticX + rnorm(sampleSize, mean = 0, sd = 0.5)
quadraticY <- quadraticX ^ 2 + rnorm(sampleSize)
quadraticData <- cbind(quadraticX, quadraticY) %>%
  data.frame %>%
  arrange(quadraticX)

quadraticModel <- lm(
  quadraticY ~ quadraticX + I(quadraticX ^ 2),
  data = quadraticData)

quadraticNewData <- data.frame(
  quadraticX = seq(
    from = min(quadraticData$quadraticX),
    to = max(quadraticData$quadraticX),
    length.out = sampleSize))

quadraticNewData$quadraticY <- predict(
  quadraticModel,
  newdata = quadraticNewData)

plot(
  x = quadraticX,
  y = linearY,
  xlab = "",
  ylab = "",
  main = "Linear Association")

abline(
  lm(linearY ~ quadraticX),
  lwd = 2,
  col = "blue")

plot(
  x = quadraticData$quadraticX,
  y = quadraticData$quadraticY,
  xlab = "",
  ylab = "",
  main = "Nonlinear Association")

lines(
  quadraticNewData$quadraticY ~ quadraticNewData$quadraticX,
  lwd = 2,
  col = "blue")
Figure 11.2: Example Associations Depicted With Scatterplots. (a) Example of a Linear Association. (b) Example of a Nonlinear Association.

If the shape of the association is nonlinear (as indicated by any of these plots), various approaches may be necessary such as including nonlinear terms (e.g., polynomial terms such as quadratic, cubic, quartic, or higher-degree terms), transforming the predictors (e.g., log, square root, inverse, exponential, Box-Cox, Yeo-Johnson transform), use of splines/piecewise regression, and generalized additive models.
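
As a brief sketch of two of these approaches (with simulated data), we can add a quadratic polynomial term or fit a natural cubic spline, and then compare the models:

Code
# modeling a nonlinear association: polynomial term vs. spline (simulated data)
set.seed(52242)

x <- runif(500, min = -4, max = 4)
y <- x ^ 2 + rnorm(500)

quadraticFit <- lm(y ~ x + I(x ^ 2))        # quadratic polynomial term
splineFit <- lm(y ~ splines::ns(x, df = 4)) # natural cubic spline

AIC(quadraticFit, splineFit) # compare fit while penalizing complexity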

11.5.1.2 Homoscedasticity

To evaluate homoscedasticity, we can evaluate a residual plot (Figure 11.21) and spread-level plot (Figure 11.22). A residual plot depicts the residuals on the y-axis as a function of the model’s fitted values on the x-axis. Homoscedasticity in a residual plot is identified as a constant spread of residuals versus fitted values—the residuals do not show a fan, cone, or bow-tie shape; a fan, cone, or bow-tie shape indicates heteroscedasticity. In a residual plot, a fan or cone shape indicates increasing or decreasing variance in the residuals as a function of the fitted values; a bow-tie shape indicates that the residuals are smallest in the middle of the fitted values and greatest on the extremes of the fitted values. A spread-level plot depicts the log of the absolute value of studentized residuals on the y-axis as a function of the log of the model’s fitted values on the x-axis. Homoscedasticity in a spread-level plot is identified as a flat slope; a slope that differs from zero indicates heteroscedasticity.

Code
set.seed(52242)
sampleSize <- 1000

# 1. Homoscedasticity
x1 <- runif(sampleSize, 0, 10)
y1 <- 3 + 2 * x1 + rnorm(sampleSize, mean = 0, sd = 2)
fit1 <- lm(y1 ~ x1)
res1 <- resid(fit1)
fitted1 <- fitted(fit1)

# 2. Fan-shaped heteroscedasticity
x2 <- runif(sampleSize, 0, 10)
y2 <- 3 + 2 * x2 + rnorm(sampleSize, mean = 0, sd = 0.5 * x2) # increasing variance
fit2 <- lm(y2 ~ x2)
res2 <- resid(fit2)
fitted2 <- fitted(fit2)

# 3. Bow-tie-shaped heteroscedasticity
x3 <- runif(sampleSize, 0, 10)
sd3 <- abs(x3 - 5) + 0.5  # variance smallest in middle, higher on edges
y3 <- 3 + 2 * x3 + rnorm(sampleSize, mean = 0, sd = sd3)
fit3 <- lm(y3 ~ x3)
res3 <- resid(fit3)
fitted3 <- fitted(fit3)

# 4. Diamond-shaped heteroscedasticity
x4 <- runif(sampleSize, 0, 10)
sd4 <- -abs(x4 - 5) + 5.5  # inverted bow-tie
y4 <- 3 + 2 * x4 + rnorm(sampleSize, mean = 0, sd = sd4)
fit4 <- lm(y4 ~ x4)
res4 <- resid(fit4)
fitted4 <- fitted(fit4)

#5. Triangle-shaped heteroscedasticity
height <- 10

sample_triangle <- function(n, v1, v2, v3) {
  u <- runif(n)
  v <- runif(n)
  is_flip <- u + v > 1
  u[is_flip] <- 1 - u[is_flip]
  v[is_flip] <- 1 - v[is_flip]
  
  x <- (1 - u - v) * v1[1] + u * v2[1] + v * v3[1]
  y <- (1 - u - v) * v1[2] + u * v2[2] + v * v3[2]
  
  cbind(x, y)
}

# Triangle vertices
v1 <- c(0, height)  # top left
v2 <- c(10, height) # top right
v3 <- c(5, 0)       # bottom center

# Sample points inside triangle
tri_points <- sample_triangle(
  sampleSize,
  v1,
  v2,
  v3)

# Extract x and residuals
x5 <- tri_points[, 1]
residuals5 <- tri_points[, 2]

# Generate outcome
y5 <- 3 + 2 * x5 + residuals5

# Fit model
fit5 <- lm(y5 ~ x5)
res5 <- resid(fit5)
fitted5 <- fitted(fit5)

# Plots
plot(
  fitted1,
  res1,
  main = "Homoscedasticity",
  xlab = "Fitted value",
  ylab = "Residual")

abline(
  h = 0,
  lwd = 2,
  col = "blue")

plot(
  fitted2,
  res2,
  main = "Fan-Shaped Heteroscedasticity",
  xlab = "Fitted value",
  ylab = "Residual")

abline(
  h = 0,
  lwd = 2,
  col = "blue")

plot(
  fitted3,
  res3,
  main = "Bow-Tie-Shaped Heteroscedasticity",
  xlab = "Fitted value",
  ylab = "Residual")

abline(
  h = 0,
  lwd = 2,
  col = "blue")

plot(
  fitted4,
  res4,
  main = "Diamond-Shaped Heteroscedasticity",
  xlab = "Fitted value",
  ylab = "Residual")

abline(
  h = 0,
  lwd = 2,
  col = "blue")

plot(
  fitted5,
  res5,
  main = "Triangle-Shaped Heteroscedasticity",
  xlab = "Fitted value",
  ylab = "Residual")

abline(
  h = 0,
  lwd = 2,
  col = "blue")
Figure 11.3: Example of Homoscedasticity and Heteroscedasticity in Residual Plots. (a) Homoscedasticity. (b) Fan-Shaped Heteroscedasticity. (c) Bow-Tie-Shaped Heteroscedasticity. (d) Diamond-Shaped Heteroscedasticity. (e) Triangle-Shaped Heteroscedasticity.

If there is heteroscedasticity, it may be necessary to transform the outcome variable to be more normally distributed. The spread-level plot provides a suggested power transformation to transform the outcome variable so that the spread of residuals becomes more uniform across the fitted values.
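
For example, applying car::spreadLevelPlot() to a model with fan-shaped heteroscedasticity (re-created here with the same generating process as the simulation above, so the snippet stands alone) prints a suggested power transformation of the outcome variable:

Code
# spread-level plot for a model with fan-shaped heteroscedasticity
set.seed(52242)

x2 <- runif(1000, 0, 10)
y2 <- 3 + 2 * x2 + rnorm(1000, mean = 0, sd = 0.5 * x2) # increasing variance
fit2 <- lm(y2 ~ x2)

car::spreadLevelPlot(fit2) # prints a suggested power transformation of y2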

11.5.1.3 Uncorrelated Residuals

To determine if residuals are correlated by a grouping level, we can examine the proportion of variance that is attributable to the grouping level using the intraclass correlation coefficient (ICC) from a mixed model. The greater the ICC value, the more variance is accounted for by the grouping level, and the more the residuals are intercorrelated. If the residuals are intercorrelated, it may be necessary to account for the grouping structure of the data using a mixed model.
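
A minimal sketch of this approach is below, assuming (hypothetically) that a player_id column identifies the grouping level in our seasonal data; the grouping variable may be named differently in practice. The ICC from performance::icc() reflects the proportion of variance attributable to players:

Code
# random-intercept mixed model to estimate the ICC by player
# (player_id is assumed here; the grouping variable may differ in practice)
mixedModel <- lme4::lmer(
  fantasyPoints ~ age + height + weight + target_share + (1 | player_id),
  data = player_stats_seasonal %>% filter(position %in% c("WR"))
)

performance::icc(mixedModel) # proportion of variance attributable to player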

11.5.1.4 Normally Distributed Residuals

To examine whether residuals are normally distributed, we can examine quantile–quantile (QQ) plots and probability–probability (PP) plots. QQ plots depict quantiles of a sample distribution (y-axis) compared to the quantiles of a theoretical (in this case, normal) distribution (x-axis). PP plots depict cumulative probabilities of a sample distribution (y-axis) compared to those of a theoretical (in this case, normal) distribution (x-axis). QQ plots are particularly useful for identifying deviations from normality in the tails of the distribution; PP plots are particularly useful for identifying deviations from normality in the center of the distribution. Researchers tend to be more concerned about the tails of the distribution, because extreme values tend to have a greater impact on inferences, so researchers tend to use QQ plots more often than PP plots. Various examples of QQ plots and deviations from normality are depicted in Figure 11.4. If the residuals are normally distributed, they will stay close to the diagonal reference line of the QQ and PP plots.

Code
set.seed(52242)

sampleSize <- 10000

distribution_normal <- rnorm(
  sampleSize,
  mean = 0,
  sd = 1)

distribution_bimodal <- c(
  rnorm(
    sampleSize/2,
    mean = -2,
    sd = 1),
  rnorm(
    sampleSize/2,
    mean = 2,
    sd = 1))

distribution_negativeSkew <- -rlnorm(
  sampleSize,
  meanlog = 0,
  sdlog = 1)

distribution_positiveSkew <- rlnorm(
  sampleSize,
  meanlog = 0,
  sdlog = 1)

distribution_lightTailed <- runif(
  sampleSize,
  min = -2,
  max = 2)

distribution_heavyTailed <- rt(
  sampleSize,
  df = 2)

hist(
  distribution_normal,
  main = "Normal Distribution",
  col = "#0099F8")

car::qqPlot(
  distribution_normal,
  main = "QQ Plot",
  id = FALSE)

hist(
  distribution_bimodal,
  main = "Bimodal Distribution",
  col = "#0099F8")

car::qqPlot(
  distribution_bimodal,
  main = "QQ Plot",
  id = FALSE)

hist(
  distribution_negativeSkew,
  main = "Negatively Skewed Distribution",
  col = "#0099F8")

car::qqPlot(
  distribution_negativeSkew,
  main = "QQ Plot",
  id = FALSE)

hist(
  distribution_positiveSkew,
  main = "Positively Skewed Distribution",
  col = "#0099F8")

car::qqPlot(
  distribution_positiveSkew,
  main = "QQ Plot",
  id = FALSE)

hist(
  distribution_lightTailed,
  main = "Platykurtic (Light Tailed) Distribution",
  col = "#0099F8")

car::qqPlot(
  distribution_lightTailed,
  main = "QQ Plot",
  id = FALSE)

hist(
  distribution_heavyTailed,
  main = "Leptokurtic (Heavy Tailed) Distribution",
  col = "#0099F8")

car::qqPlot(
  distribution_heavyTailed,
  main = "QQ Plot",
  id = FALSE)
Figure 11.4: Quantile–Quantile (QQ) Plots of Various Distributions (Right Side) with the Histogram of the Associated Distribution (Left Side). (a, b) Normal Distribution. (c, d) Bimodal Distribution. (e, f) Negatively Skewed Distribution. (g, h) Positively Skewed Distribution. (i, j) Platykurtic (Light-Tailed) Distribution. (k, l) Leptokurtic (Heavy-Tailed) Distribution.

If the residuals are not normally distributed (i.e., they do not stay close to the diagonal reference line of the QQ and PP plots), it may be necessary to transform the outcome variable to be more normally distributed or to use a generalized linear model (GLM) that more closely matches the distribution of the outcome variable (e.g., Poisson, binomial, gamma).
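
One option for transforming a non-normal outcome is the bestNormalize package (loaded above), which compares several candidate transformations and selects the best-performing one. A minimal sketch with a simulated, positively skewed outcome:

Code
# let bestNormalize choose a normalizing transformation (simulated outcome)
set.seed(52242)

skewedY <- rlnorm(1000) # positively skewed outcome

bn <- bestNormalize::bestNormalize(skewedY)

bn$chosen_transform         # the selected transformation
transformedY <- predict(bn) # transformed (more nearly normal) values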

11.6 How Much Variance the Model Explains

When estimating a multiple regression model, it can be useful to evaluate how much variance in the outcome variable the predictor variable(s) explain (i.e., account for). If the variables collectively explain only a small amount of variance, it suggests that the predictor variables have a small effect size and that other predictor variables will be necessary to account for the majority of variability in the outcome variable. There are two primary indices of how much variance in the outcome variable is explained by the predictor variable(s): the coefficient of determination (\(R^2\)) and adjusted \(R^2\) (\(R^2_{adj}\)), described below.

11.6.1 Coefficient of Determination (\(R^2\))

As noted in Section 9.6.6, the coefficient of determination (\(R^2\)) reflects the proportion of variance in the outcome (dependent) variable that is explained by the model predictions (i.e., by the optimal linear combination of the predictor variable(s)), as in Equation 9.31: \(R^2 = \frac{\text{variance explained in }Y}{\text{total variance in }Y}\). Various formulas for \(R^2\) are in Equation 9.19. Larger \(R^2\) values indicate greater accuracy. Multiple regression can be conceptualized with overlapping circles (similar to a Venn diagram), where the non-overlapping portions of the circles reflect nonshared variance and the overlapping portions of the circles reflect shared variance, as in Figure 11.5.

Figure 11.5: Conceptual Depiction of Proportion of Variance Explained (\(R^2\)) in an Outcome Variable (\(Y\)) by Multiple Predictors (\(X1\) and \(X2\)) in Multiple Regression. The size of each circle represents the variable’s variance. The proportion of variance in \(Y\) that is explained by the predictors is depicted by the areas in orange. The dark orange space (\(G\)) is where multiple predictors explain overlapping variance in the outcome. Overlapping variance that is explained in the outcome (\(G\)) will not be recovered in the regression coefficients when both predictors are included in the regression model. From Petersen (2024) and Petersen (2025).

One issue with \(R^2\) is that it increases as the number of predictors increases, which can lead to overfitting if using \(R^2\) as an index to compare models for purposes of selecting the “best-fitting” model. Consider the following example (adapted from Petersen (2025)) in which you have one predictor variable and one outcome variable, as shown in Table 11.1.

Table 11.1: Example Data of Predictor (x1) and Outcome (y) Used for Regression Model.
y x1
7 1
13 2
29 7
10 2

Using the data, the best-fitting regression model is: \(y = 3.98 + 3.59 \cdot x_1\). In this example, the \(R^2\) is 0.98. The equation is not a perfect prediction, but with a single predictor variable, it captures the majority of the variance in the outcome.

Now consider the following example where you add a second predictor variable to the data above, as shown in Table 11.2.

Table 11.2: Example Data of Predictors (x1 and x2) and Outcome (y) Used for Regression Model.
y x1 x2
7 1 3
13 2 5
29 7 1
10 2 2

With the second predictor variable, the best-fitting regression model is: \(y = 0.00 + 4.00 \cdot x_1 + 1.00 \cdot x_2\). In this example, the \(R^2\) is 1.00. The equation with the second predictor variable provides a perfect prediction of the outcome.

Providing perfect prediction with the right set of predictor variables is the dream of multiple regression. So, using multiple regression, we often add predictor variables to incrementally improve prediction. However, the amount of variance we would expect to account for by random chance alone follows Equation 11.7:

\[ E(R^2) = \frac{K}{n-1} \tag{11.7}\]

where \(E(R^2)\) is the expected value of \(R^2\) (the proportion of variance explained), \(K\) is the number of predictor variables, and \(n\) is the sample size. The formula demonstrates that the more predictor variables in the regression model, the more variance will be accounted for by chance. With many predictor variables and a small sample, you can account for a large share of the variance merely by chance.

As an example, consider that we have 13 predictor variables to predict fantasy performance for 43 players. Assume that, with 13 predictor variables, we explain 38% of the variance (\(R^2 = .38; r = .62\)). We explained a lot of the variance in the outcome, but it is important to consider how much variance could have been explained by random chance: \(E(R^2) = \frac{K}{n-1} = \frac{13}{43 - 1} = .31\). We expect to explain 31% of the variance, by chance, in the outcome. So, 82% of the variance explained was likely spurious (i.e., \(\frac{.31}{.38} = .82\)). As the sample size increases, the spuriousness decreases. To account for the number of predictor variables in the model, we can use a modified version of \(R^2\) called adjusted \(R^2\) (\(R^2_{adj}\)), described next.
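
We can verify Equation 11.7 by simulation: regressing a pure-noise outcome on pure-noise predictors yields an average \(R^2\) close to \(\frac{K}{n-1}\):

Code
# simulation check of Equation 11.7 with K = 13 predictors and n = 43
set.seed(52242)

n <- 43
K <- 13

r2 <- replicate(1000, {
  y <- rnorm(n)                       # pure-noise outcome
  X <- matrix(rnorm(n * K), nrow = n) # pure-noise predictors
  summary(lm(y ~ X))$r.squared
})

mean(r2)    # simulated expected R^2
K / (n - 1) # Equation 11.7: ~.31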

11.6.2 Adjusted \(R^2\) (\(R^2_{adj}\))

Adjusted \(R^2\) (\(R^2_{adj}\)) accounts for the number of predictor variables in the model, penalizing the model based on how much variance would be expected to be accounted for by chance, thereby guarding against overfitting. Adjusted \(R^2\) (\(R^2_{adj}\)) reflects the proportion of variance in the outcome (dependent) variable that is explained by the model predictions over and above what would be expected to be accounted for by chance, given the number of predictor variables in the model. The formula for adjusted \(R^2\) (\(R^2_{adj}\)) is in Equation 11.8:

\[ R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1} \tag{11.8}\]

where \(p\) is the number of predictor variables in the model, and \(n\) is the sample size.
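
As a quick check of Equation 11.8 (simulated data), computing adjusted \(R^2\) by hand matches the value reported by summary():

Code
# verify Equation 11.8 against summary()$adj.r.squared (simulated data)
set.seed(52242)

n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n) # a pure-noise predictor

y <- 1 + 0.5 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)

r2 <- summary(fit)$r.squared
p <- 2 # number of predictor variables

1 - (1 - r2) * (n - 1) / (n - p - 1) # Equation 11.8
summary(fit)$adj.r.squared           # matches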

11.7 How Much Variance Each Predictor Explains

Another estimate of effect size in addition to standardized regression coefficients is the squared semi-partial correlation (\(sr^2\)). The squared semi-partial correlation reflects the proportion of variance in the outcome variable that is uniquely explained by each predictor variable. The total model \(R^2\) minus the sum of all the squared semi-partial correlations reflects the proportion of variance in the outcome variable that is explained by the common variance among the predictor variables.
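
One simple way to obtain each predictor's squared semi-partial correlation (a sketch with simulated data; dedicated tools such as the yhat package, loaded above, provide this as well) is the drop in \(R^2\) when that predictor is removed from the model:

Code
# squared semi-partial correlations as the drop in R^2 when removing
# each predictor (simulated, correlated predictors)
set.seed(52242)

n <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n) # correlated with x1

y <- 1 + 0.4 * x1 + 0.3 * x2 + rnorm(n)

r2_full <- summary(lm(y ~ x1 + x2))$r.squared

sr2_x1 <- r2_full - summary(lm(y ~ x2))$r.squared # unique to x1
sr2_x2 <- r2_full - summary(lm(y ~ x1))$r.squared # unique to x2

sr2_x1
sr2_x2
r2_full - (sr2_x1 + sr2_x2) # explained by the predictors' common variance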

11.8 Overfitting

Statistical models applied to big data (e.g., data with many predictor variables) can overfit the data, which means that the statistical model accounts for error variance that will not generalize to future samples. So, even though an overfitting statistical model appears to be accurate because it accounts for more variance, it is not actually that accurate: it will predict new data less accurately than it fits the data with which it was built. Overfitting is most likely to occur when the model is too complex relative to the sample size (e.g., many predictor variables or parameters). In the case of fantasy football analytics, this is especially relevant because there are hundreds if not thousands of variables we could consider for inclusion and many, many players when considering historical data.

Consider an example where you develop an algorithm to predict players’ fantasy performance based on 2024 data using hundreds of predictor variables. To some extent, these predictor variables will likely account for true variance (i.e., signal) and error variance (i.e., noise). If we were to apply the same algorithm based on the 2024 prediction model to 2025 data, the prediction model would likely predict less accurately than with 2024 data. The regression coefficients (and resulting accuracy) tend to become weaker when applied to new data, a phenomenon called shrinkage, which is described in Section 15.7.1. For instance, shrinkage is observed when applying multiple regression models to new data in Section 19.7 and when applying machine learning models to new data in Section 19.8.

In Figure 11.6, the blue line represents the true distribution of the data, and the red line is an overfitting model:

Code
set.seed(52242)

sampleSize <- 200
quadraticX <- rnorm(sampleSize)
quadraticY <- quadraticX ^ 2 + rnorm(sampleSize)
quadraticData <- cbind(quadraticX, quadraticY) %>%
  data.frame %>%
  arrange(quadraticX)

quadraticModel <- lm(
  quadraticY ~ quadraticX + I(quadraticX ^ 2),
  data = quadraticData)

quadraticNewData <- data.frame(
  quadraticX = seq(
    from = min(quadraticData$quadraticX),
    to = max(quadraticData$quadraticX),
    length.out = sampleSize))

quadraticNewData$quadraticY <- predict(
  quadraticModel,
  newdata = quadraticNewData)

loessFit <- loess(
  quadraticY ~ quadraticX,
  data = quadraticData,
  span = 0.01,
  degree = 1)

quadraticNewData$loessY <- predict(
  loessFit,
  newdata = quadraticNewData)

plot(
  x = quadraticData$quadraticX,
  y = quadraticData$quadraticY,
  xlab = "",
  ylab = "")

lines(
  quadraticNewData$quadraticY ~ quadraticNewData$quadraticX,
  lwd = 2,
  col = "blue")

lines(
  quadraticNewData$loessY ~ quadraticNewData$quadraticX,
  lwd = 2,
  col = "red")
Figure 11.6: Over-fitting Model in Red Relative to the True Distribution of the Data in Blue. From Petersen (2024) and Petersen (2025).

11.9 Covariates

Covariates are variables that you include in the statistical model to try to control for them so you can better isolate the unique contribution of the predictor variable(s) in relation to the outcome variable. Using covariates examines the association between the predictor variable and the outcome variable while holding people’s levels on the covariates constant. Inclusion of confounds as covariates potentially allows a more accurate estimate of the causal effect of the predictor variable on the outcome variable. Ideally, you want to include any and all confounds as covariates. As described in Section 8.5.2.1, confounds are third variables that influence both the predictor variable and the outcome variable and explain their association. Covariates are potentially (but not necessarily) confounds. For instance, you might include the player’s age as a covariate in a model that examines whether a player’s 40-yard dash time at the NFL Combine predicts their fantasy points in their rookie year, even though age may not be a confound.

However, it is worth noting that, for several reasons, including a variable as a covariate in a model does not necessarily fully control for that construct. First, your measure of the covariate has measurement error and may not fully capture the construct, which can lead to residual confounding. Second, including the covariate only accounts for the linear effect of that variable. The variable (e.g., age) could have nonlinear effects that are not captured by the linear term (and may require nonlinear terms, e.g., quadratic). Third, the construct could interact with other constructs, which would also not be captured by the linear term (and may require interaction terms). Thus, it is important to consider the reliability of the covariate, and the ways in which the covariate is related to the outcome including any nonlinearity and interactions.
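
To illustrate (as a sketch, not a recommended final model), we could let age enter the chapter's wide receiver model nonlinearly and interact with target share:

Code
# covariate with a quadratic term and an interaction, extending the WR model
covariateModel <- lm(
  fantasyPoints ~ age + I(age ^ 2) + age:target_share +
    height + weight + target_share,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

summary(covariateModel)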

In general, you should select which covariates to include based on theory and your causal diagram of the causal process (as described in Section 13.6), not based on whether the covariate has a statistically significant association with the outcome variable. Just because a predictor variable is not associated with an outcome variable does not mean that the predictor variable will not be useful in the model. For instance, the predictor variable could have a suppression effect, or a moderating effect, etc., that may not be detectable based on a bivariate association.

11.10 Example: Predicting Wide Receivers’ Fantasy Points

Let’s say we want to use a number of variables to predict a wide receiver’s fantasy performance. We want to consider several predictors, including the player’s age, height, weight, and target share. Target share is computed as the number of targets a player receives divided by the team’s total number of targets. We have only a few predictors and our sample size is large enough such that overfitting is not likely a concern.

11.10.1 Examine Descriptive Statistics

Let’s first examine descriptive statistics of the predictor and outcome variables.

Code
player_stats_seasonal %>% 
  dplyr::select(fantasyPoints, age, height, weight, target_share) %>% 
  dplyr::summarise(across(
      everything(),
      .fns = list(
        n = ~ length(na.omit(.)),
        missingness = ~ mean(is.na(.)) * 100,
        M = ~ mean(., na.rm = TRUE),
        SD = ~ sd(., na.rm = TRUE),
        min = ~ min(., na.rm = TRUE),
        max = ~ max(., na.rm = TRUE),
        range = ~ max(., na.rm = TRUE) - min(., na.rm = TRUE),
        IQR = ~ IQR(., na.rm = TRUE),
        MAD = ~ mad(., na.rm = TRUE),
        median = ~ median(., na.rm = TRUE),
        pseudomedian = ~ DescTools::HodgesLehmann(., na.rm = TRUE),
        mode = ~ petersenlab::Mode(., multipleModes = "mean"),
        skewness = ~ psych::skew(., na.rm = TRUE),
        kurtosis = ~ psych::kurtosi(., na.rm = TRUE)),
      .names = "{.col}.{.fn}")) %>%
    tidyr::pivot_longer(
      cols = everything(),
      names_to = c("variable","index"),
      names_sep = "\\.") %>% 
    tidyr::pivot_wider(
      names_from = index,
      values_from = value)

Let’s also examine the distributions of the variables using a density plot, as depicted in Figure 11.7.

Code
ggplot2::ggplot(
  data = player_stats_seasonal %>%
    filter(position_group %in% c("WR")),
  mapping = aes(
    x = fantasyPoints)
) +
  geom_histogram(
    aes(y = after_stat(density)),
    color = "#000000",
    fill = "#0099F8"
  ) +
  geom_density(
    color = "#000000",
    fill = "#F85700",
    alpha = 0.6 # add transparency
  ) +
  geom_rug() +
  labs(
    x = "Fantasy Points",
    y = "Density",
    title = "Density Plot of Fantasy Points with Histogram and Rug Plot"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title

ggplot2::ggplot(
  data = player_stats_seasonal %>%
    filter(position_group %in% c("WR")),
  mapping = aes(
    x = age)
) +
  geom_histogram(
    aes(y = after_stat(density)),
    color = "#000000",
    fill = "#0099F8"
  ) +
  geom_density(
    color = "#000000",
    fill = "#F85700",
    alpha = 0.6 # add transparency
  ) +
  geom_rug() +
  labs(
    x = "Age (years)",
    y = "Density",
    title = "Density Plot of Player Age with Histogram and Rug Plot"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title

ggplot2::ggplot(
  data = player_stats_seasonal %>%
    filter(position_group %in% c("WR")),
  mapping = aes(
    x = height)
) +
  geom_histogram(
    aes(y = after_stat(density)),
    color = "#000000",
    fill = "#0099F8"
  ) +
  geom_density(
    color = "#000000",
    fill = "#F85700",
    alpha = 0.6 # add transparency
  ) +
  geom_rug() +
  labs(
    x = "Height (inches)",
    y = "Density",
    title = "Density Plot of Player Height with Histogram and Rug Plot"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title

ggplot2::ggplot(
  data = player_stats_seasonal %>%
    filter(position_group %in% c("WR")),
  mapping = aes(
    x = weight)
) +
  geom_histogram(
    aes(y = after_stat(density)),
    color = "#000000",
    fill = "#0099F8"
  ) +
  geom_density(
    color = "#000000",
    fill = "#F85700",
    alpha = 0.6 # add transparency
  ) +
  geom_rug() +
  labs(
    x = "Weight (pounds)",
    y = "Density",
    title = "Density Plot of Player Weight with Histogram and Rug Plot"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title

ggplot2::ggplot(
  data = player_stats_seasonal %>%
    filter(position_group %in% c("WR")),
  mapping = aes(
    x = target_share)
) +
  geom_histogram(
    aes(y = after_stat(density)),
    color = "#000000",
    fill = "#0099F8"
  ) +
  geom_density(
    color = "#000000",
    fill = "#F85700",
    alpha = 0.6 # add transparency
  ) +
  geom_rug() +
  labs(
    x = "Target Share",
    y = "Density",
    title = "Density Plot of Target Share with Histogram and Rug Plot"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
Figure 11.7: Density Plot of Model Variables (i.e., Predictor and Outcome Variables). (a) Fantasy Points. (b) Age. (c) Height. (d) Weight. (e) Target Share.

11.10.2 Examine Bivariate Associations

Then, let’s examine the bivariate association of each predictor with the outcome using scatterplots to evaluate for any potential nonlinearity, as depicted in Figure 11.8.

Code
ggplot2::ggplot(
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  aes(
    x = age,
    y = fantasyPoints)) +
  geom_point(alpha = 0.05) +
  geom_smooth(
    method = "lm",
    color = "black") +
  geom_smooth() +
  coord_cartesian(
    ylim = c(0,NA),
    expand = FALSE) +
  labs(
    x = "Player Age (Years)",
    y = "Fantasy Points (Season)",
    title = "Fantasy Points (Season) by Player Age",
    subtitle = "(Among Wide Receivers)"
  ) +
  theme_classic()

ggplot2::ggplot(
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  aes(
    x = height,
    y = fantasyPoints)) +
  geom_point(alpha = 0.05) +
  geom_smooth(
    method = "lm",
    color = "black") +
  geom_smooth() +
  coord_cartesian(
    ylim = c(0,NA),
    expand = FALSE) +
  labs(
    x = "Player Height (Inches)",
    y = "Fantasy Points (Season)",
    title = "Fantasy Points (Season) by Player Height",
    subtitle = "(Among Wide Receivers)"
  ) +
  theme_classic()

ggplot2::ggplot(
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  aes(
    x = weight,
    y = fantasyPoints)) +
  geom_point(alpha = 0.05) +
  geom_smooth(
    method = "lm",
    color = "black") +
  geom_smooth() +
  coord_cartesian(
    ylim = c(0,NA),
    expand = FALSE) +
  labs(
    x = "Player Weight (Pounds)",
    y = "Fantasy Points (Season)",
    title = "Fantasy Points (Season) by Player Weight",
    subtitle = "(Among Wide Receivers)"
  ) +
  theme_classic()

ggplot2::ggplot(
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  aes(
    x = target_share,
    y = fantasyPoints)) +
  geom_point(alpha = 0.05) +
  geom_smooth(
    method = "lm",
    color = "black") +
  geom_smooth() +
  coord_cartesian(
    ylim = c(0,NA),
    expand = FALSE) +
  labs(
    x = "Target Share",
    y = "Fantasy Points (Season)",
    title = "Fantasy Points (Season) by Target Share",
    subtitle = "(Among Wide Receivers)"
  ) +
  theme_classic()
Figure 11.8: Scatterplots With Fantasy Points (Season) Among Wide Receivers. The linear best-fit line is in black. The nonlinear best-fit line is in blue. (a) Age. (b) Height. (c) Weight. (d) Target Share.

There are some suggestions of potential nonlinearity, such as an inverted-U-shaped association between height and fantasy points, suggesting that there may be an optimal range of height among Wide Receivers—being too short or too tall could be a disadvantage. In addition, the association with target share weakens as target share increases. Thus, after evaluating the linear associations between the predictors and the outcome, we will also examine the possibility of curvilinear associations.

11.10.3 Estimate Multiple Regression Model

Now that we have examined descriptive statistics and bivariate associations, let’s first estimate a multiple regression model with only linear terms:

Code
linearRegressionModel <- lm(
  fantasyPoints ~ age + height + weight + target_share,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

The model formula is in Equation 11.9:

\[ \text{fantasy points} = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{height} + \beta_3 \cdot \text{weight} + \beta_4 \cdot \text{target share} + \epsilon \tag{11.9}\]

Here are the model results:

Code
summary(linearRegressionModel)

Call:
lm(formula = fantasyPoints ~ age + height + weight + target_share, 
    data = player_stats_seasonal %>% filter(position %in% c("WR")), 
    na.action = "na.exclude")

Residuals:
    Min      1Q  Median      3Q     Max 
-746.79  -18.14  -10.12   10.34  293.12 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -27.36726   22.98098  -1.191  0.23377    
age            0.63790    0.21398   2.981  0.00289 ** 
height        -0.13754    0.41579  -0.331  0.74081    
weight         0.19270    0.06573   2.932  0.00339 ** 
target_share 750.28584    7.09218 105.791  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 44.83 on 4692 degrees of freedom
  (953 observations deleted due to missingness)
Multiple R-squared:  0.7139,    Adjusted R-squared:  0.7136 
F-statistic:  2927 on 4 and 4692 DF,  p-value: < 2.2e-16

Age, weight, and target share were significantly positively associated with fantasy performance among Wide Receivers; height was not significantly associated with fantasy points.

We can obtain the coefficient of determination (\(R^2\)) and adjusted \(R^2\) (\(R^2_{adj}\)) using the following code:

Code
summary(linearRegressionModel)$r.squared
[1] 0.7138718
Code
summary(linearRegressionModel)$adj.r.squared
[1] 0.7136278

The model explained 71% of the variability in fantasy points (i.e., \(R^2 = .71\)).
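As a reminder of the standard formula (not specific to this model), adjusted \(R^2\) penalizes the model \(R^2\) for the number of predictors; with \(n = 4697\) observations and \(p = 4\) predictors, it reproduces the value above:

\[ R^2_{adj} = 1 - (1 - R^2) \cdot \frac{n - 1}{n - p - 1} = 1 - (1 - .7139) \cdot \frac{4697 - 1}{4697 - 4 - 1} = .7136 \]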

The model formula with substituted values is in Equation 11.10:

\[ \begin{aligned} \text{fantasy points} = &\beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{height} + \beta_3 \cdot \text{weight} + \beta_4 \cdot \text{target share} + \epsilon \\ = &-27.37 + 0.64 \cdot \text{age} + -0.14 \cdot \text{height} + 0.19 \cdot \text{weight} + 750.29 \cdot \text{target share} + \epsilon \end{aligned} \tag{11.10}\]

If we want to obtain standardized regression coefficients, we can use the effectsize::standardize_parameters() function of the effectsize package (Ben-Shachar et al., 2020, 2025).

Code
print(effectsize::standardize_parameters(linearRegressionModel, method = "basic"), digits = 2)
# Standardization method: basic

Parameter    | Std. Coef. |        95% CI
-----------------------------------------
(Intercept)  |       0.00 | [ 0.00, 0.00]
age          |       0.02 | [ 0.01, 0.04]
height       |  -3.87e-03 | [-0.03, 0.02]
weight       |       0.03 | [ 0.01, 0.06]
target share |       0.84 | [ 0.82, 0.85]

Or, we can standardize the outcome variable and each predictor variable using the base::scale() function:

Code
linearRegressionModelStandardized <- lm(
  scale(fantasyPoints) ~ scale(age) + scale(height) + scale(weight) + scale(target_share),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

summary(linearRegressionModelStandardized)

Call:
lm(formula = scale(fantasyPoints) ~ scale(age) + scale(height) + 
    scale(weight) + scale(target_share), data = player_stats_seasonal %>% 
    filter(position %in% c("WR")), na.action = "na.exclude")

Residuals:
    Min      1Q  Median      3Q     Max 
-8.9170 -0.2166 -0.1209  0.1235  3.5000 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -0.002527   0.007812  -0.324  0.74631    
scale(age)           0.023916   0.008022   2.981  0.00289 ** 
scale(height)       -0.003853   0.011646  -0.331  0.74081    
scale(weight)        0.034168   0.011654   2.932  0.00339 ** 
scale(target_share)  0.837689   0.007918 105.791  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5352 on 4692 degrees of freedom
  (953 observations deleted due to missingness)
Multiple R-squared:  0.7139,    Adjusted R-squared:  0.7136 
F-statistic:  2927 on 4 and 4692 DF,  p-value: < 2.2e-16
Code
print(effectsize::standardize_parameters(linearRegressionModel, method = "refit"), digits = 2)
# Standardization method: refit

Parameter    | Std. Coef. |        95% CI
-----------------------------------------
(Intercept)  |  -7.02e-17 | [-0.02, 0.02]
age          |       0.02 | [ 0.01, 0.04]
height       |  -3.87e-03 | [-0.03, 0.02]
weight       |       0.03 | [ 0.01, 0.06]
target share |       0.84 | [ 0.82, 0.85]

Target share has a large effect size. All of the other predictors have a small effect size.

Now let’s consider whether any of the terms show curvilinear associations with fantasy points by adding quadratic terms:

Code
quadraticTermsRegressionModel <- lm(
  fantasyPoints ~ age + I(age^2) + height + I(height^2) + weight + I(weight^2) + target_share + I(target_share^2),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

The model formula is in Equation 11.11:

\[ \begin{aligned} \text{fantasy points} \; = \; &\beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{age}^2 + \beta_3 \cdot \text{height} + \beta_4 \cdot \text{height}^2 + \\ &\beta_5 \cdot \text{weight} + \beta_6 \cdot \text{weight}^2 + \beta_7 \cdot \text{target share} + \beta_8 \cdot \text{target share}^2 + \epsilon \end{aligned} \tag{11.11}\]

Here are the model results:

Code
summary(quadraticTermsRegressionModel)

Call:
lm(formula = fantasyPoints ~ age + I(age^2) + height + I(height^2) + 
    weight + I(weight^2) + target_share + I(target_share^2), 
    data = player_stats_seasonal %>% filter(position %in% c("WR")), 
    na.action = "na.exclude")

Residuals:
     Min       1Q   Median       3Q      Max 
-277.777  -15.025   -3.017    6.303  310.218 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -5.883e+01  4.063e+02  -0.145   0.8849    
age                3.973e+00  2.160e+00   1.839   0.0659 .  
I(age^2)          -6.573e-02  3.896e-02  -1.687   0.0916 .  
height             2.289e+00  1.200e+01   0.191   0.8487    
I(height^2)       -2.139e-02  8.292e-02  -0.258   0.7965    
weight            -7.438e-01  7.529e-01  -0.988   0.3233    
I(weight^2)        2.409e-03  1.873e-03   1.286   0.1984    
target_share       1.121e+03  8.733e+00 128.401   <2e-16 ***
I(target_share^2) -9.491e+02  1.725e+01 -55.011   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 34.94 on 4688 degrees of freedom
  (953 observations deleted due to missingness)
Multiple R-squared:  0.8263,    Adjusted R-squared:  0.826 
F-statistic:  2787 on 8 and 4688 DF,  p-value: < 2.2e-16

Only target share shows significant associations; both its linear and quadratic terms are statistically significant. The linear and quadratic terms for age are marginally significant, whereas the terms for height and weight are nonsignificant.

Here are the standardized coefficients:

Code
print(effectsize::standardize_parameters(quadraticTermsRegressionModel, method = "basic"), digits = 2)
# Standardization method: basic

Parameter      | Std. Coef. |         95% CI
--------------------------------------------
(Intercept)    |       0.00 | [ 0.00,  0.00]
age            |       0.15 | [-0.01,  0.30]
age^2          |      -0.13 | [-0.29,  0.02]
height         |       0.06 | [-0.60,  0.73]
height^2       |      -0.09 | [-0.75,  0.58]
weight         |      -0.13 | [-0.40,  0.13]
weight^2       |       0.17 | [-0.09,  0.44]
target share   |       1.25 | [ 1.23,  1.27]
target share^2 |      -0.53 | [-0.55, -0.51]
Code
quadraticTermsRegressionModelStandardized <- lm(
  scale(fantasyPoints) ~ scale(age) + scale(I(age^2)) + scale(height) + scale(I(height^2)) + scale(weight) + scale(I(weight^2)) + scale(target_share) + scale(I(target_share^2)),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

summary(quadraticTermsRegressionModelStandardized)

Call:
lm(formula = scale(fantasyPoints) ~ scale(age) + scale(I(age^2)) + 
    scale(height) + scale(I(height^2)) + scale(weight) + scale(I(weight^2)) + 
    scale(target_share) + scale(I(target_share^2)), data = player_stats_seasonal %>% 
    filter(position %in% c("WR")), na.action = "na.exclude")

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3168 -0.1794 -0.0360  0.0753  3.7042 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              -0.003168   0.006091  -0.520   0.6030    
scale(age)                0.148957   0.080977   1.839   0.0659 .  
scale(I(age^2))          -0.137314   0.081385  -1.687   0.0916 .  
scale(height)             0.064124   0.336177   0.191   0.8487    
scale(I(height^2))       -0.086741   0.336268  -0.258   0.7965    
scale(weight)            -0.131881   0.133499  -0.988   0.3233    
scale(I(weight^2))        0.171475   0.133305   1.286   0.1984    
scale(target_share)       1.251993   0.009751 128.401   <2e-16 ***
scale(I(target_share^2)) -0.531579   0.009663 -55.011   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4172 on 4688 degrees of freedom
  (953 observations deleted due to missingness)
Multiple R-squared:  0.8263,    Adjusted R-squared:  0.826 
F-statistic:  2787 on 8 and 4688 DF,  p-value: < 2.2e-16
Code
print(effectsize::standardize_parameters(quadraticTermsRegressionModel, method = "refit"), digits = 2)
# Standardization method: refit

Parameter      | Std. Coef. |         95% CI
--------------------------------------------
(Intercept)    |       0.10 | [ 0.08,  0.12]
age            |       0.02 | [ 0.00,  0.03]
age^2          |  -7.44e-03 | [-0.02,  0.00]
height         |      -0.02 | [-0.04,  0.00]
height^2       |  -1.42e-03 | [-0.01,  0.01]
weight         |       0.04 | [ 0.02,  0.06]
weight^2       |   6.46e-03 | [ 0.00,  0.02]
target share   |       1.07 | [ 1.05,  1.08]
target share^2 |      -0.10 | [-0.10, -0.10]

11.10.4 Squared Semi-Partial Correlation (\(sr^2\))

To estimate the squared semi-partial correlation (\(sr^2\)), we use the rockchalk::getDeltaRsquare() function from the rockchalk package (Johnson, 2022).
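Conceptually, the squared semi-partial correlation for a given predictor is the drop in the model \(R^2\) when that predictor is removed from the model:

\[ sr_i^2 = R^2_{\text{full}} - R^2_{\text{without } x_i} \]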

Code
rockchalk::getDeltaRsquare(linearRegressionModel)
The deltaR-square values: the change in the R-square
      observed when a single term is removed.
Same as the square of the 'semi-partial correlation coefficient'
             deltaRsquare
age          5.419797e-04
height       6.673050e-06
weight       5.241517e-04
target_share 6.824913e-01

11.10.5 Commonality Analysis

Commonality analysis decomposes the model’s \(R^2\) into coefficients that represent the proportion of variance accounted for uniquely by each predictor variable and the proportion of variance accounted for that is shared in each possible combination of the predictor variables. Below we conduct a commonality analysis using the yhat::calc.yhat(), yhat::commonality(), and yhat::commonalityCoefficients() functions of the yhat package (Nimon et al., 2025).
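For instance, with two predictors, the model \(R^2\) decomposes into two unique components and one common component: \(R^2 = U_{x_1} + U_{x_2} + C_{x_1 x_2}\). With four predictors, there are \(2^4 - 1 = 15\) components, which is why the output below lists 15 commonality coefficients.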

Code
yhat::calc.yhat(linearRegressionModel)
$PredictorMetrics
                   b   Beta     r    rs   rs2 Unique Common  CD:0  CD:1  CD:2
age            0.638  0.023 0.121 0.143 0.020  0.001  0.014 0.015 0.010 0.005
height        -0.138 -0.004 0.084 0.099 0.010  0.000  0.007 0.007 0.003 0.000
weight         0.193  0.034 0.130 0.154 0.024  0.001  0.016 0.017 0.009 0.004
target_share 750.286  0.838 0.844 0.999 0.998  0.682  0.030 0.712 0.700 0.690
Total             NA     NA    NA    NA 1.052  0.684  0.067 0.751 0.722 0.699
              CD:3 GenDom Pratt   RLW
age          0.001  0.008 0.003 0.008
height       0.000  0.002 0.000 0.002
weight       0.001  0.008 0.004 0.007
target_share 0.682  0.696 0.707 0.697
Total        0.684  0.714 0.714 0.714

$OrderedPredictorMetrics
             b Beta r rs rs2 Unique Common CD:0 CD:1 CD:2 CD:3 GenDom Pratt RLW
age          2    3 3  3   3      2      3    3    2    2    2      2     3   2
height       4    4 4  4   4      4      4    4    4    4    4      4     4   4
weight       3    2 2  2   2      3      2    2    3    3    3      3     2   3
target_share 1    1 1  1   1      1      1    1    1    1    1      1     1   1

$PairedDominanceMetrics
                    Comp Cond Gen
age>height           1.0  1.0   1
age>weight           0.5  0.5   0
age>target_share     0.0  0.0   0
height>weight        0.0  0.0   0
height>target_share  0.0  0.0   0
weight>target_share  0.0  0.0   0

$APSRelatedMetrics
                               Commonality  % Total    R2 age.Inc height.Inc
age                                  0.001    0.001 0.015      NA      0.007
height                               0.000    0.000 0.007   0.015         NA
weight                               0.001    0.001 0.017   0.014      0.000
target_share                         0.682    0.956 0.712   0.001      0.000
age,height                           0.000    0.000 0.022      NA         NA
age,weight                           0.000    0.000 0.031      NA      0.000
height,weight                        0.000    0.001 0.017   0.014         NA
age,target_share                     0.014    0.019 0.713      NA      0.000
height,target_share                  0.000    0.000 0.713   0.001         NA
weight,target_share                  0.009    0.013 0.713   0.001      0.000
age,height,weight                    0.000    0.000 0.031      NA         NA
age,height,target_share              0.000    0.000 0.713      NA         NA
age,weight,target_share              0.001    0.001 0.714      NA      0.000
height,weight,target_share           0.006    0.009 0.713   0.001         NA
age,height,weight,target_share       0.000    0.000 0.714      NA         NA
Total                                0.714    1.000    NA      NA         NA
                               weight.Inc target_share.Inc
age                                 0.016            0.698
height                              0.010            0.706
weight                                 NA            0.696
target_share                        0.001               NA
age,height                          0.010            0.692
age,weight                             NA            0.683
height,weight                          NA            0.696
age,target_share                    0.001               NA
height,target_share                 0.001               NA
weight,target_share                    NA               NA
age,height,weight                      NA            0.682
age,height,target_share             0.001               NA
age,weight,target_share                NA               NA
height,weight,target_share             NA               NA
age,height,weight,target_share         NA               NA
Total                                  NA               NA
Code
apsOut <- yhat::aps(
  dataMatrix = player_stats_seasonal %>% 
    filter(position %in% c("WR")) %>% 
    select(fantasyPoints, age, height, weight, target_share),
  dv = "fantasyPoints",
  ivlist = list("age", "height", "weight", "target_share"))

yhat::commonality(apsOut)
                                 Coefficient       % Total
age                             5.419797e-04  7.592116e-04
height                          6.673050e-06  9.347687e-06
weight                          5.241517e-04  7.342378e-04
target_share                    6.824913e-01  9.560419e-01
age,height                      2.866176e-06  4.014973e-06
age,weight                      1.758697e-05  2.463604e-05
height,weight                   4.590242e-04  6.430066e-04
age,target_share                1.355808e-02  1.899232e-02
height,target_share             2.743047e-04  3.842492e-04
weight,target_share             9.099877e-03  1.274721e-02
age,height,weight              -1.732795e-05 -2.427320e-05
age,height,target_share         1.000939e-04  1.402127e-04
age,weight,target_share         6.266413e-04  8.778065e-04
height,weight,target_share      6.387244e-03  8.947327e-03
age,height,weight,target_share -2.007430e-04 -2.812031e-04
Code
commonalityAnalysis <- yhat::commonalityCoefficients(
  dataMatrix = player_stats_seasonal %>% 
    filter(position %in% c("WR")) %>% 
    select(fantasyPoints, age, height, weight, target_share),
  dv = "fantasyPoints",
  ivlist = list("age", "height", "weight", "target_share"))

commonalityAnalysis
$CC
                                                Coefficient     % Total
Unique to age                                        0.0005        0.08
Unique to height                                     0.0000        0.00
Unique to weight                                     0.0005        0.07
Unique to target_share                               0.6825       95.60
Common to age, and height                            0.0000        0.00
Common to age, and weight                            0.0000        0.00
Common to height, and weight                         0.0005        0.06
Common to age, and target_share                      0.0136        1.90
Common to height, and target_share                   0.0003        0.04
Common to weight, and target_share                   0.0091        1.27
Common to age, height, and weight                    0.0000        0.00
Common to age, height, and target_share              0.0001        0.01
Common to age, weight, and target_share              0.0006        0.09
Common to height, weight, and target_share           0.0064        0.89
Common to age, height, weight, and target_share     -0.0002       -0.03
Total                                                0.7139      100.00

$CCTotalbyVar
             Unique Common  Total
age          0.0005 0.0141 0.0146
height       0.0000 0.0070 0.0070
weight       0.0005 0.0164 0.0169
target_share 0.6825 0.0298 0.7123

11.10.6 Dominance Analysis

We can perform a dominance analysis to evaluate the relative importance of the predictors in the regression model. To perform the dominance analysis, we use the parameters::dominance_analysis() function of the parameters package (Lüdecke et al., 2020; Lüdecke, Makowski, Ben-Shachar, Patil, Højsgaard, et al., 2025), which leverages the domir package (Luchman, 2024).

Code
parameters::dominance_analysis(linearRegressionModel)
# Dominance Analysis Results

Model R2 Value:  0.714 

General Dominance Statistics

Parameter    | General Dominance | Percent | Ranks |       Subset
-----------------------------------------------------------------
(Intercept)  |                   |         |       |     constant
age          |             0.008 |   0.011 |     3 |          age
height       |             0.002 |   0.003 |     4 |       height
weight       |             0.008 |   0.011 |     2 |       weight
target_share |             0.696 |   0.975 |     1 | target_share

Conditional Dominance Statistics

Subset       | IVs: 1 | IVs: 2 |    IVs: 3 |    IVs: 4
------------------------------------------------------
age          |  0.015 |  0.010 |     0.005 | 5.420e-04
height       |  0.007 |  0.003 | 2.521e-04 | 6.673e-06
weight       |  0.017 |  0.009 |     0.004 | 5.242e-04
target_share |  0.712 |  0.700 |     0.690 |     0.682

Complete Dominance Designations

Subset       | < age | < height | < weight | < target_share
-----------------------------------------------------------
age          |       |    FALSE |          |           TRUE
height       |  TRUE |          |     TRUE |           TRUE
weight       |       |    FALSE |          |           TRUE
target_share | FALSE |    FALSE |    FALSE |               

We can also conduct a dominance analysis using the yhat::dominance() function of the yhat package (Nimon et al., 2025).

Code
yhat::dominance(apsOut)
$DA
                               k          R2      age.Inc   height.Inc
age                            1 0.014629177           NA 7.127246e-03
height                         1 0.007012135 0.0147442880           NA
weight                         1 0.016896454 0.0142030198 3.839378e-04
target_share                   1 0.712336807 0.0005451049 4.512355e-04
age,height                     2 0.021756423           NA           NA
age,weight                     2 0.031099474           NA 2.809777e-04
height,weight                  2 0.017280392 0.0141000597           NA
age,target_share               2 0.712881912           NA 4.656973e-04
height,target_share            2 0.712788042 0.0005595667           NA
weight,target_share            2 0.713320242 0.0005448459 9.539226e-06
age,height,weight              3 0.031380451           NA           NA
age,height,target_share        3 0.713347609           NA           NA
age,weight,target_share        3 0.713865088           NA 6.673050e-06
height,weight,target_share     3 0.713329781 0.0005419797           NA
age,height,weight,target_share 4 0.713871761           NA           NA
                                 weight.Inc target_share.Inc
age                            0.0164702966        0.6982527
height                         0.0102682564        0.7057759
weight                                   NA        0.6964238
target_share                   0.0009834349               NA
age,height                     0.0096240282        0.6915912
age,weight                               NA        0.6827656
height,weight                            NA        0.6960494
age,target_share               0.0009831759               NA
height,target_share            0.0005417386               NA
weight,target_share                      NA               NA
age,height,weight                        NA        0.6824913
age,height,target_share        0.0005241517               NA
age,weight,target_share                  NA               NA
height,weight,target_share               NA               NA
age,height,weight,target_share           NA               NA

$CD
              age       height       weight target_share
CD:0 0.0146291771 7.012135e-03 0.0168964539    0.7123368
CD:1 0.0098308042 2.654140e-03 0.0092406626    0.7001508
CD:2 0.0050681575 2.520714e-04 0.0037163142    0.6901354
CD:3 0.0005419797 6.673050e-06 0.0005241517    0.6824913

$GD
         age       height       weight target_share 
 0.007517530  0.002481255  0.007594396  0.696278581 

11.10.7 Visualizing Regression Results

11.10.7.1 Regression Coefficients

To visualize the regression coefficients, we can use the broom::tidy() function of the broom package (Robinson et al., 2025), as in Figures 11.9 and 11.10.

Code
quadraticTermsRegressionModel_tidy <- broom::tidy(
  quadraticTermsRegressionModel,
  conf.int = TRUE)

quadraticTermsRegressionModelStandardized_tidy <- broom::tidy(
  quadraticTermsRegressionModelStandardized,
  conf.int = TRUE)
Code
ggplot2::ggplot(
  data = quadraticTermsRegressionModel_tidy,
  aes(
    x = term,
    y = estimate)) +
  geom_point() +
  geom_errorbar(
    aes(
      ymin = conf.low,
      ymax = conf.high),
    width = 0.2) +
  coord_flip() + # flip axes for readability
  labs(
    title = "Regression Coefficients with 95% CI",
    x = "Predictor",
    y = "Coefficient Estimate") +
  theme_minimal()
Figure 11.9: Regression Coefficients with 95% Confidence Interval.
Code
ggplot2::ggplot(
  data = quadraticTermsRegressionModelStandardized_tidy,
  aes(
    x = term,
    y = estimate)) +
  geom_point() +
  geom_errorbar(
    aes(
      ymin = conf.low,
      ymax = conf.high),
    width = 0.2) +
  coord_flip() + # flip axes for readability
  labs(
    title = "Standardized Regression Coefficients with 95% CI",
    x = "Predictor",
    y = "Standardized Coefficient Estimate") +
  theme_minimal()
Figure 11.10: Standardized Regression Coefficients with 95% Confidence Interval.

11.10.7.2 Model-Implied Association

If we wanted to visualize the shape of the model-implied association between target share and fantasy points, we could generate the model-implied predictions using the data range that we want to visualize.

Code
newdata <- data.frame(
  target_share = seq(
    from = min(player_stats_seasonal$target_share[which(player_stats_seasonal$position == "WR")], na.rm = TRUE),
    to = max(player_stats_seasonal$target_share[which(player_stats_seasonal$position == "WR")], na.rm = TRUE),
    length.out = 1000
  )
)

newdata$age <- mean(player_stats_seasonal$age[which(player_stats_seasonal$position == "WR")], na.rm = TRUE)
newdata$height <- mean(player_stats_seasonal$height[which(player_stats_seasonal$position == "WR")], na.rm = TRUE)
newdata$weight <- mean(player_stats_seasonal$weight[which(player_stats_seasonal$position == "WR")], na.rm = TRUE)
Code
newdata$fantasyPoints <- predict(
  quadraticTermsRegressionModel,
  newdata = newdata
)

We can depict the model-implied predictions of fantasy points as a function of target share, as shown in Figure 11.11.

Code
ggplot2::ggplot(
  data = newdata,
  aes(
    x = target_share,
    y = fantasyPoints)) +
  geom_smooth() +
  coord_cartesian(
    ylim = c(0,NA),
    expand = FALSE) +
  labs(
    x = "Target Share",
    y = "Fantasy Points (Season)",
    title = "Fantasy Points (Season) by Target Share",
    subtitle = "(Among Wide Receivers)"
  ) +
  theme_classic()
Figure 11.11: Model-Implied Predictions of A Wide Receiver’s Fantasy Points as a Function of Target Share. The model-implied predictions were estimated based on a multiple regression model.

We could also generate the model-implied prediction of fantasy points for any values of the predictor variables. For instance, here is the expected number of fantasy points for a Wide Receiver who is 23 years old, is 6'0" tall (72 inches), weighs 200 pounds, and has a target share of 50% (i.e., 0.5):

Code
predict(
  quadraticTermsRegressionModel,
  newdata = data.frame(
    age = 23,
    height = 72,
    weight = 200,
    target_share = .5
  )
)
       1 
322.7641 

The player would be expected to score 323 fantasy points.

We can calculate this by substituting in the regression coefficients and predictor values. Here is the computation:

Code
age <- 23
height <- 72
weight <- 200
target_share <- .5

beta0 <- coef(quadraticTermsRegressionModel)[["(Intercept)"]]
beta1 <- coef(quadraticTermsRegressionModel)[["age"]]
beta2 <- coef(quadraticTermsRegressionModel)[["I(age^2)"]]
beta3 <- coef(quadraticTermsRegressionModel)[["height"]]
beta4 <- coef(quadraticTermsRegressionModel)[["I(height^2)"]]
beta5 <- coef(quadraticTermsRegressionModel)[["weight"]]
beta6 <- coef(quadraticTermsRegressionModel)[["I(weight^2)"]]
beta7 <- coef(quadraticTermsRegressionModel)[["target_share"]]
beta8 <- coef(quadraticTermsRegressionModel)[["I(target_share^2)"]]

predictedFantasyPoints <- beta0 + beta1*age + beta2*(age^2) + beta3*height + beta4*(height^2) + beta5*weight + beta6*(weight^2) + beta7*target_share + beta8*(target_share^2)
predictedFantasyPoints
[1] 322.7641

The model formula with substituted values is in Equation 11.12:

\[ \begin{aligned} \text{fantasy points} \; = \; &\beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{age}^2 + \beta_3 \cdot \text{height} + \beta_4 \cdot \text{height}^2 + \\ &\beta_5 \cdot \text{weight} + \beta_6 \cdot \text{weight}^2 + \beta_7 \cdot \text{target share} + \beta_8 \cdot \text{target share}^2 + \epsilon \\ = &-58.83 + 3.97 \cdot 23 + -0.07 \cdot 23^2 + 2.29 \cdot 72 + -0.02 \cdot 72^2 + \\ &-0.74 \cdot 200 + 0.00 \cdot 200^2 + 1121.36 \cdot 0.5 + -949.12 \cdot 0.5^2 + \epsilon \\ = &322.76 \end{aligned} \tag{11.12}\]

11.10.8 Evaluating and Addressing Assumptions

The assumptions for multiple regression are described in Section 11.5. I describe ways to evaluate assumptions in Section 11.5.1.

As a reminder, here are four assumptions:

  • there is a linear association between the predictor variables and the outcome variable
  • there is homoscedasticity of the residuals; the residuals do not differ as a function of the predictor variables or as a function of the outcome variable
  • the residuals are independent; they are uncorrelated with each other
  • the residuals are normally distributed

11.10.8.1 Linear Association

We evaluated the shape of the association between the predictor variables and the outcome variable using scatterplots. We accounted for potential curvilinearity in the associations with quadratic terms. Other ways to account for nonlinearity, in addition to polynomials, include transforming predictors, using splines/piecewise regression, and fitting generalized additive models; a spline example is sketched below.
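For instance, a natural cubic spline for target share could be fit as follows (a sketch using the splines package that ships with R; the choice of df = 4 is arbitrary, for illustration):

Code
library("splines")

# Natural cubic spline for target share (df = 4 chosen arbitrarily here)
splineRegressionModel <- lm(
  fantasyPoints ~ age + height + weight + ns(target_share, df = 4),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

summary(splineRegressionModel)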

To evaluate for potential nonlinearity in the associations, we can also evaluate residual plots (Figure 11.21), marginal model plots (Figure 11.12), added-variable plots (Figure 11.13), and component-plus-residual plots (Figure 11.14) from the car package (Fox et al., 2024; Fox & Weisberg, 2019). For evaluating linearity, we would expect minimal bend/curvature in the lines.

Code
car::marginalModelPlots(
  quadraticTermsRegressionModel,
  sd = TRUE,
  id = TRUE)
Figure 11.12: Marginal Model Plots.
Code
car::avPlots(
  quadraticTermsRegressionModel,
  id = TRUE)
Figure 11.13: Added-Variable Plots.
Code
car::crPlots(
  quadraticTermsRegressionModel,
  sd = TRUE,
  id = TRUE)
Figure 11.14: Component-Plus-Residual Plots.

The marginal model plots (Figure 11.12), residual plots (Figure 11.21), and component-plus-residual plots (Figure 11.14) suggest that the nonlinearity of the association between target share and fantasy points may not be fully captured by the quadratic term. Thus, we may need to apply a different approach to handling the nonlinear association between target share and fantasy points.

One approach we can take is to transform the target_share variable to be more normally distributed.

The histogram for the raw target_share variable is in Figure 11.15.

Code
hist(
  player_stats_seasonal$target_share[which(player_stats_seasonal$position == "WR")],
  main = "Histogram of Target Share")
Figure 11.15: Histogram of Target Share.

The variable shows a strong positive skew. To address a strong positive skew, we can use a log transformation. The histogram of the log-transformed variable is in Figure 11.16.

Code
hist(
  log(player_stats_seasonal$target_share[which(player_stats_seasonal$position == "WR")] + 1),
  main = "Histogram of Target Share (Log Transformed)")
Figure 11.16: Histogram of Target Share, Transformed.

Now we can re-fit the model with the log-transformed variable.

Code
linearRegressionModel_logTargetShare <- lm(
  fantasyPoints ~ age + I(age^2) + height + I(height^2) + weight + I(weight^2) + I(log(target_share + 1)),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

summary(linearRegressionModel_logTargetShare)

Call:
lm(formula = fantasyPoints ~ age + I(age^2) + height + I(height^2) + 
    weight + I(weight^2) + I(log(target_share + 1)), data = player_stats_seasonal %>% 
    filter(position %in% c("WR")), na.action = "na.exclude")

Residuals:
    Min      1Q  Median      3Q     Max 
-614.16  -15.28   -7.63    7.29  300.02 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)              -3.208e+02  4.751e+02  -0.675   0.4996    
age                       3.227e+00  2.526e+00   1.278   0.2014    
I(age^2)                 -4.931e-02  4.556e-02  -1.082   0.2791    
height                    1.136e+01  1.403e+01   0.810   0.4181    
I(height^2)              -8.050e-02  9.696e-02  -0.830   0.4065    
weight                   -1.379e+00  8.803e-01  -1.567   0.1172    
I(weight^2)               3.904e-03  2.190e-03   1.782   0.0748 .  
I(log(target_share + 1))  9.030e+02  7.536e+00 119.827   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.87 on 4689 degrees of freedom
  (953 observations deleted due to missingness)
Multiple R-squared:  0.7623,    Adjusted R-squared:  0.762 
F-statistic:  2149 on 7 and 4689 DF,  p-value: < 2.2e-16

Target share shows a more linear association with fantasy points after log-transforming it (albeit still not perfect), as depicted in Figures 11.17, 11.18, 11.19, and 11.20.

Code
car::marginalModelPlots(
  linearRegressionModel_logTargetShare,
  sd = TRUE,
  id = TRUE)
Figure 11.17: Marginal Model Plots After Log Transformation of Target Share.
Code
car::avPlots(
  linearRegressionModel_logTargetShare,
  id = TRUE)
Figure 11.18: Added-Variable Plots After Log Transformation of Target Share.
Code
car::crPlots(
  linearRegressionModel_logTargetShare,
  id = TRUE)
Figure 11.19: Component-Plus-Residual Plots After Log Transformation of Target Share.

When creating a residual plot, the car package (Fox et al., 2024; Fox & Weisberg, 2019) also provides a test of the nonlinearity of each predictor.

Code
car::residualPlots(
  linearRegressionModel_logTargetShare,
  id = TRUE)
                         Test stat Pr(>|Test stat|)    
age                        -0.0968           0.9229    
I(age^2)                    1.4485           0.1475    
height                     -1.5212           0.1283    
I(height^2)                 1.1549           0.2482    
weight                     -0.2024           0.8396    
I(weight^2)                -0.2454           0.8062    
I(log(target_share + 1))  -35.2136           <2e-16 ***
Tukey test                -35.0759           <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Figure 11.20: Residual Plots After Log Transformation of Target Share.

11.10.8.2 Homoscedasticity

To evaluate homoscedasticity, we can evaluate a residual plot (Figure 11.21) and spread-level plot (Figure 11.22) from the car package (Fox et al., 2024; Fox & Weisberg, 2019). In a residual plot, you want a constant spread of residuals versus fitted values—you do not want the residuals to show a fan or cone shape.

Code
car::residualPlots(
  quadraticTermsRegressionModel,
  id = TRUE)
                  Test stat Pr(>|Test stat|)    
age                 -0.3552          0.72242    
I(age^2)             2.4447          0.01454 *  
height              -2.0606          0.03940 *  
I(height^2)          1.7195          0.08559 .  
weight              -0.4155          0.67779    
I(weight^2)         -0.7515          0.45237    
target_share         0.3843          0.70077    
I(target_share^2)  -15.9739          < 2e-16 ***
Tukey test          21.1776          < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Figure 11.21: Residual Plots.

In a spread-level plot, you want a flat (zero) slope—you do not want a positive or negative slope.

Code
car::spreadLevelPlot(
  quadraticTermsRegressionModel,
  id = TRUE)

Suggested power transformation:  0.4771945 
Figure 11.22: Spread-Level Plot.

In this example, the residuals appear to increase as a function of the fitted values. To handle this, we may need to transform the outcome variable to be more normally distributed.

The histogram for raw fantasy points is in Figure 11.23.

Code
hist(
  player_stats_seasonal$fantasyPoints[which(player_stats_seasonal$position == "WR")],
  main = "Histogram of Fantasy Points")
Figure 11.23: Histogram of Fantasy Points (Among Wide Receivers).

We can apply a transformation to the outcome variable to generate a more normally distributed variable using the bestNormalize::bestNormalize() function of the bestNormalize package (Peterson, 2021; Peterson, 2023; Peterson & Cavanaugh, 2020).

Code
bnTransformed <- bestNormalize::bestNormalize(player_stats_seasonal$fantasyPoints)

Here are the transformation results—in this case, the best-fitting transformation is an ordered quantile (ORQ) normalization transformation:

Code
bnTransformed
Best Normalizing transformation with 42935 Observations
 Estimated Normality Statistics (Pearson P / df, lower => more normal):
 - arcsinh(x): 15.8276
 - Center+scale: 112.7684
 - Double Reversed Log_b(x+a): 158.3888
 - Exp(x): 79.5093
 - Log_b(x+a): 17.846
 - orderNorm (ORQ): 4.5967
 - sqrt(x + a): 39.3116
 - Yeo-Johnson: 8.5004
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
 
Based off these, bestNormalize chose:
orderNorm Transformation with 42935 nonmissing obs and ties
 - 5068 unique values 
 - Original quantiles:
    0%    25%    50%    75%   100% 
 -7.28   8.50  29.00  69.50 485.10 

Then, we apply the best-fitting transformation to the variable, to create a transformed variable that is more normally distributed.

Code
player_stats_seasonal$fantasyPoints_transformed <- predict(bnTransformed)

# Can back-transform transformed values back to the original metric later if needed:
original <- predict(
  bnTransformed,
  newdata = player_stats_seasonal$fantasyPoints_transformed,
  inverse = TRUE)

The histogram of the transformed variable is in Figure 11.24.

Code
hist(
  player_stats_seasonal$fantasyPoints_transformed[which(player_stats_seasonal$position == "WR")],
  main = "Histogram of Fantasy Points (Transformed)")
Figure 11.24: Histogram of Fantasy Points (Among Wide Receivers), Transformed.

Now we can refit the model.

Code
linearRegressionModel_outcomeTransformed <- lm(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

summary(linearRegressionModel_outcomeTransformed)

Call:
lm(formula = fantasyPoints_transformed ~ age + height + weight + 
    I(log(target_share + 1)), data = player_stats_seasonal %>% 
    filter(position %in% c("WR")), na.action = "na.exclude")

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7389 -0.2298  0.0786  0.3290  3.0187 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)              -1.0335984  0.3283117  -3.148  0.00165 ** 
age                       0.0123863  0.0030574   4.051 5.18e-05 ***
height                   -0.0032316  0.0059391  -0.544  0.58639    
weight                    0.0020412  0.0009387   2.174  0.02973 *  
I(log(target_share + 1)) 11.7151947  0.1178769  99.385  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6403 on 4692 degrees of freedom
  (953 observations deleted due to missingness)
Multiple R-squared:  0.688, Adjusted R-squared:  0.6878 
F-statistic:  2587 on 4 and 4692 DF,  p-value: < 2.2e-16

The residual plot is in Figure 11.25.

Code
car::residualPlots(
  linearRegressionModel_outcomeTransformed,
  id = TRUE)
                         Test stat Pr(>|Test stat|)    
age                        -0.0665           0.9470    
height                     -0.3882           0.6979    
weight                     -0.9072           0.3643    
I(log(target_share + 1))  -47.9374           <2e-16 ***
Tukey test                -48.4582           <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Figure 11.25: Residual Plots After Transformation of Fantasy Points.

The spread-level plot is in Figure 11.26.

Code
car::spreadLevelPlot(
  linearRegressionModel_outcomeTransformed,
  id = TRUE)

Suggested power transformation:  1.068069 
Figure 11.26: Spread-Level Plot After Transformation of Fantasy Points.

The residuals show more constant variance after transforming the outcome variable.

11.10.8.3 Uncorrelated Residuals

To determine if residuals are correlated given the nested structure of the data, we can examine the proportion of variance that is attributable to the particular player. To do this, we can estimate the intraclass correlation coefficient (ICC) from a mixed model using the performance package (Lüdecke et al., 2021; Lüdecke, Makowski, Ben-Shachar, Patil, Waggoner, et al., 2025).
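The ICC from a random-intercept model is the proportion of the total variance that lies between players:

\[ \text{ICC} = \frac{\sigma^2_{\text{between players}}}{\sigma^2_{\text{between players}} + \sigma^2_{\text{within players}}} \]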

Code
mixedModel <- lmer(
  fantasyPoints_transformed ~ 1 + (1 | player_id),
  data = player_stats_seasonal)

performance::icc(mixedModel)

The ICC indicates that over half of the variance is attributable to between-player differences, so it would be important to account for the player-specific variance using a mixed model. For simplicity, we focus on multiple regression models in this chapter; mixed models are described in Chapter 12.

11.10.8.4 Normally Distributed Residuals

We can examine whether residuals are normally distributed using quantile–quantile (QQ) plots and probability–probability (PP) plots, as in Figures 11.27 and 11.28. If the residuals are normally distributed, they should stay close to the diagonal reference line.

Code
car::qqPlot(
  linearRegressionModel_outcomeTransformed,
  main = "QQ Plot",
  id = TRUE)
[1] 1400 1932
Figure 11.27: QQ Plot of Residuals After Transformation of Fantasy Points.
Code
petersenlab::ppPlot(
  linearRegressionModel_outcomeTransformed)
Figure 11.28: PP Plot of Residuals After Transformation of Fantasy Points.

11.11 Multicollinearity

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated. Highly correlated predictor variables make it challenging to estimate the regression coefficients accurately.

Multicollinearity in multiple regression is depicted conceptually in Figure 11.29.

Figure 11.29: Conceptual Depiction of Multicollinearity in Multiple Regression. From Petersen (2024) and Petersen (2025).

Consider the following example adapted from Petersen (2025) where you have two predictor variables and one outcome variable, as shown in Table 11.3.

Table 11.3: Example Data of Predictors (x1 and x2) and Outcome (y) Used for Regression Model.
y x1 x2
9 2.0 4
11 3.0 6
17 4.0 8
3 1.0 2
21 5.0 10
13 3.5 7

The second predictor variable is not very good—it is exactly twice the value of the first predictor variable; thus, the two predictor variables are perfectly correlated (i.e., \(r = 1.0\)). This means that there are different prediction equation possibilities that are equally good—see Equations in Equation 11.13:

\[ \begin{aligned} 2x_2 &= y \\ 0x_1 + 2x_2 &= y \\ 4x_1 &= y \\ 4x_1 + 0x_2 &= y \\ 2x_1 + 1x_2 &= y \\ 5x_1 - 0.5x_2 &= y \\ ... &= y \end{aligned} \tag{11.13}\]

Then, what are the regression coefficients? We cannot know the correct regression coefficients because each of the possibilities fits the data equally well. Thus, when estimating the regression model, we could obtain arbitrary estimates of the regression coefficients with an enormous standard error around each estimate. In general, multicollinearity increases the uncertainty (i.e., standard errors and confidence intervals) around the parameter estimates. Any predictor variables that correlate above ~\(r = .30\) with each other could impact the confidence intervals of the regression coefficients. As the correlations among the predictor variables increase, the chance of obtaining arbitrary estimates (sometimes called “bouncing betas”) increases. So, it is important to examine a correlation matrix of the predictor variables before putting them in the same regression model. You can also examine indices such as the variance inflation factor (VIF), where a value greater than 5 or 10 indicates multicollinearity.
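To illustrate with the example data from Table 11.3 (a minimal sketch):

Code
# Example data from Table 11.3: x2 is exactly twice x1,
# so the two predictors are perfectly correlated
exampleData <- data.frame(
  y  = c(9, 11, 17, 3, 21, 13),
  x1 = c(2, 3, 4, 1, 5, 3.5),
  x2 = c(4, 6, 8, 2, 10, 7)
)

cor(exampleData$x1, exampleData$x2) # r = 1.0

# lm() cannot estimate both coefficients; it drops the aliased
# predictor (x2), whose coefficient is reported as NA
coef(lm(y ~ x1 + x2, data = exampleData))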

Here are the VIFs from our earlier model:

Code
car::vif(linearRegressionModel_outcomeTransformed)
                     age                   height                   weight 
                1.014639                 2.247440                 2.265154 
I(log(target_share + 1)) 
                1.028635 

To address multicollinearity, you can drop a redundant predictor, or you can use principal component analysis or factor analysis to reduce the predictors to a smaller number of meaningful components; an example is sketched below. To obtain precise and confident estimates for individual predictors in a regression framework, you need a low level of intercorrelation among the predictors, unless you have a very large sample size. However, if you are merely interested in prediction—and are not interested in interpreting the regression coefficients of individual predictors—multicollinearity poses less of a problem. For instance, machine learning cares more about achieving the greatest predictive accuracy possible and less about explaining which predictors are causally related to the outcome. So, multicollinearity is less of a concern for machine learning approaches.
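For instance, height and weight are the most strongly intercorrelated predictors in our model; here is a minimal sketch of replacing them with their first principal component (the variable name bodySize is hypothetical, for illustration):

Code
# Replace correlated predictors (height, weight) with their first
# principal component; variable names here are illustrative
wrData <- player_stats_seasonal %>% 
  filter(position %in% c("WR")) %>% 
  tidyr::drop_na(fantasyPoints, age, height, weight, target_share)

sizePCA <- prcomp(
  wrData %>% select(height, weight),
  scale. = TRUE)

wrData$bodySize <- sizePCA$x[, 1] # first principal component

summary(lm(fantasyPoints ~ age + bodySize + target_share, data = wrData))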

11.12 Suppression

Suppression in a regression model occurs when including a predictor variable increases the magnitude of the regression coefficient (i.e., beta) of another predictor variable on an outcome variable (MacKinnon et al., 2000). The suppressor variable suppresses irrelevant variance in another predictor variable, allowing the predictor variable’s association with the outcome variable to emerge more clearly. Three types of suppression are depicted in Gaylord-Harden et al. (2010).
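A simulated illustration of suppression (a hypothetical sketch; all variables are made up):

Code
# Hypothetical simulation: z is correlated with the irrelevant (noise)
# variance in x but is unrelated to y; including z removes that
# irrelevant variance, so the coefficient for x increases
set.seed(52242)

n <- 1000
trueSignal <- rnorm(n)
noise <- rnorm(n)

x <- trueSignal + noise         # predictor: signal plus irrelevant variance
z <- noise + rnorm(n, sd = 0.5) # suppressor: related to the noise in x, not to y
y <- trueSignal + rnorm(n)      # outcome: depends only on the signal

coef(lm(y ~ x))     # coefficient for x without the suppressor
coef(lm(y ~ x + z)) # coefficient for x is larger with the suppressor included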

11.13 Handling of Missing Data

An important consideration in multiple regression is how missing data are handled. Multiple regression in R using the stats::lm() function applies listwise deletion. Listwise deletion (also called complete case analysis) removes any row (in the data file) from analysis that has a missing value on the outcome variable or any of the predictor variables. Removing all rows from analysis that have any missingness in the model variables can be a problem because missingness is often not completely at random—missingness often occurs systematically (i.e., for a reason). For instance, participants may be less likely to have data for all variables if they are from a lower socioeconomic status background and do not have the time to participate in all study procedures. Thus, applying listwise deletion, we might systematically exclude participants from lower socioeconomic status backgrounds (or other groups), which could lead to less generalizable inferences.
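For instance, we can count how many rows listwise deletion drops from the linear model above (a quick check using base R):

Code
# Rows with missingness on any model variable are dropped by lm()
modelVars <- player_stats_seasonal %>% 
  filter(position %in% c("WR")) %>% 
  select(fantasyPoints, age, height, weight, target_share)

nrow(modelVars) # total rows
sum(complete.cases(modelVars)) # rows retained under listwise deletion
nrow(modelVars) - sum(complete.cases(modelVars)) # rows dropped (953 above)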

It is thus important to consider approaches to handle missingness. Various approaches to handle missingness include pairwise deletion (aka available-case analysis), multiple imputation, and full information maximum likelihood (FIML).

11.13.1 Pairwise Deletion

Pairwise deletion (also called available case analysis) uses all available data for each pair of variables when estimating the correlations/covariances, and the resulting correlation/covariance matrix can be used to estimate a multiple regression model. We can estimate a regression model that uses pairwise deletion using the lavaan package (Rosseel, 2012; Rosseel et al., 2024).

Code
player_stats_seasonal$target_share_log <- log(player_stats_seasonal$target_share + 1)

modelData <- player_stats_seasonal %>% 
  filter(position %in% c("WR")) %>% 
  select(fantasyPoints_transformed, age, height, weight, target_share_log)

numObs <- sum(complete.cases(modelData))
varMeans <- colMeans(modelData, na.rm = TRUE)
varCovariances <- cov(modelData, use = "pairwise.complete.obs")

pairwiseRegression_syntax <- '
  fantasyPoints_transformed ~ age + height + weight + target_share_log
  fantasyPoints_transformed ~~ fantasyPoints_transformed
  fantasyPoints_transformed ~ 1
'

pairwiseRegression_fit <- lavaan::lavaan(
  pairwiseRegression_syntax,
  sample.mean = varMeans,
  sample.cov = varCovariances,
  sample.nobs = numObs
)

summary(
  pairwiseRegression_fit,
  standardized = TRUE,
  rsquare = TRUE)
lavaan 0.6-21 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         6

  Number of observations                          4697

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                              Estimate  Std.Err  z-value  P(>|z|)   Std.lv
  fantasyPoints_transformed ~                                             
    age                          0.022    0.003    7.267    0.000    0.022
    height                       0.001    0.006    0.147    0.883    0.001
    weight                       0.002    0.001    2.347    0.019    0.002
    target_shar_lg              11.660    0.117   99.250    0.000   11.660
  Std.all
         
    0.059
    0.002
    0.028
    0.818

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .fntsyPnts_trns   -1.598    0.327   -4.879    0.000   -1.598   -1.394

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .fntsyPnts_trns    0.407    0.008   48.461    0.000    0.407    0.310

R-Square:
                   Estimate
    fntsyPnts_trns    0.690

11.13.2 Multiple Imputation

Multiple imputation takes a data set with missingness and uses available information from other variables to estimate plausible values for each missing value. It repeats the imputation multiple times, yielding multiple imputed data sets, which convey the degree of uncertainty in each imputed value. The model is then fit to each of the multiply imputed data sets separately, and the results are combined. We can multiply impute data using the mice package (van Buuren & Groothuis-Oudshoorn, 2011, 2024).

Code
numImputations <- 5

dataToImpute <- player_stats_seasonal %>% 
  filter(position %in% c("WR")) %>% 
  select(player_id, position, where(is.numeric)) %>% 
  select(
    player_id:games, carries:wopr, fantasy_points, fantasy_points_ppr,
    rush_40_yds, rec_40_yds, fumbles, two_pts, return_yds,
    rush_100_yds:draft_round, height:target_share_log) %>% 
  select(-c(fantasy_points_ppr, ageCentered20, target_share_log)) # drop collinear variables

predictors <- c("targets","receiving_yards","receiving_air_yards","receiving_yards_after_catch","receiving_first_downs","racr")

dataToImpute$player_id_integer <- as.integer(as.factor(dataToImpute$player_id))

varsToImpute <- c("age","height","weight","target_share")
Y <- varsToImpute

Now, let’s specify the imputation method—we use the two-level predictive mean matching (2l.pmm) method from the miceadds package (Robitzsch et al., 2024) to account for the nonindependent data (owing to multiple seasons per player):

Code
meth <- mice::make.method(dataToImpute)
meth[1:length(meth)] <- ""
meth[Y] <- "2l.pmm" # specify the imputation method here; this can differ by outcome variable

Now, let’s specify the predictor matrix. A predictor matrix is a matrix of values, where:

  • rows with non-zero values are the target variables to be imputed
  • columns with non-zero values are predictors of the variable specified in the given row
  • the diagonal of the predictor matrix should be zero because a variable cannot predict itself

We specify the values as follows:

Code
pred <- mice::make.predictorMatrix(dataToImpute)
pred[1:nrow(pred), 1:ncol(pred)] <- 0
pred[Y, "player_id_integer"] <- (-2) # cluster variable
pred[Y, predictors] <- 1 # fixed effect predictors
pred[Y, "age"] <- 2 # random effect predictor
pred[Y, Y] <- 1 # fixed effect predictor

diag(pred) <- 0

Now, let’s run the imputation:

Code
imp <- mice::mice(
  as.data.frame(dataToImpute),
  method = meth,
  predictorMatrix = pred,
  m = numImputations,
  maxit = 5, # generally use 100 maximum iterations; this example uses 5 for speed
  seed = 52242)

Below are some imputation diagnostics. Trace plots are in Figure 11.30.

Code
plot(imp, c("target_share"))
Figure 11.30: Trace plots from multiple imputation.

A density plot is in Figure 11.31.

Code
densityplot(imp, ~ target_share)
Figure 11.31: Density plot from multiple imputation.

The imputed data do not match the distribution of the observed data well. Thus, it may be necessary to select a different imputation method for more accurate imputation; a sketch of swapping in an alternative method appears below.
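For instance, one could re-run the imputation with a different two-level method (a sketch; the 2l.pan method, which imputes via a linear two-level model, requires the pan package). For the remainder of the chapter, we continue with the original imputation object (imp).

Code
# A sketch of re-running the imputation with an alternative two-level
# method; 2l.pan requires the pan package
meth[Y] <- "2l.pan"

impAlt <- mice::mice(
  as.data.frame(dataToImpute),
  method = meth,
  predictorMatrix = pred,
  m = numImputations,
  maxit = 5, # 5 iterations for speed, as above
  seed = 52242)

densityplot(impAlt, ~ target_share)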

Now, let’s do some post-processing:

Code
imp_long <- mice::complete(
  imp,
  action = "long",
  include = TRUE)

imp_long$target_share_log <- log(imp_long$target_share + 1)

imp_long$fantasyPoints_transformed <- predict(
  bnTransformed,
  newdata = imp_long[["fantasyPoints"]])

imp_mids <- mice::as.mids(imp_long)

Now let’s estimate multiple regression with the multiply imputed data:

Code
imp_regression <- with(
  imp_mids,
  lm(
    fantasyPoints_transformed ~ age + height + weight + target_share_log)
  )

mice::pool(imp_regression)
Error:
! arguments imply differing number of rows: 1, 0
Code
summary(mice::pool(imp_regression))

11.13.3 Full Information Maximum Likelihood

Full information maximum likelihood (FIML) estimates model parameters using all available data for each case, without imputing missing values. FIML is commonly used for handling missingness in approaches to latent variable modeling, such as factor analysis and structural equation modeling. We can estimate a regression model that uses full information maximum likelihood using the lavaan package (Rosseel, 2012; Rosseel et al., 2024).

Code
fimlRegression_syntax <- '
  fantasyPoints_transformed ~ age + height + weight + target_share_log
  fantasyPoints_transformed ~~ fantasyPoints_transformed
  fantasyPoints_transformed ~ 1
'

fimlRegression_fit <- lavaan::sem(
  fimlRegression_syntax,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  missing = "ML",
  fixed.x = FALSE
)

summary(
  fimlRegression_fit,
  standardized = TRUE,
  rsquare = TRUE)
lavaan 0.6-21 ended normally after 48 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        20

  Number of observations                          5650
  Number of missing patterns                         2

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Observed
  Observed information based on                Hessian

Regressions:
                              Estimate  Std.Err  z-value  P(>|z|)   Std.lv
  fantasyPoints_transformed ~                                             
    age                          0.015    0.003    5.159    0.000    0.015
    height                      -0.002    0.006   -0.282    0.778   -0.002
    weight                       0.002    0.001    2.370    0.018    0.002
    target_shar_lg              11.683    0.113  102.952    0.000   11.683
  Std.all
         
    0.042
   -0.003
    0.028
    0.820

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  age ~~                                                                
    height           -0.182    0.098   -1.860    0.063   -0.182   -0.025
    weight           -0.370    0.620   -0.597    0.551   -0.370   -0.008
    target_shar_lg    0.035    0.004    9.822    0.000    0.035    0.137
  height ~~                                                             
    weight           25.687    0.576   44.616    0.000   25.687    0.738
    target_shar_lg    0.016    0.003    6.149    0.000    0.016    0.085
  weight ~~                                                             
    target_shar_lg    0.149    0.016    9.033    0.000    0.149    0.124

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .fntsyPnts_trns   -1.246    0.319   -3.909    0.000   -1.246   -1.087
    age              26.260    0.042  628.699    0.000   26.260    8.364
    height           72.558    0.031 2325.165    0.000   72.558   30.934
    weight          200.445    0.198 1014.698    0.000  200.445   13.499
    target_shar_lg    0.080    0.001   72.781    0.000    0.080    0.999

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .fntsyPnts_trns    0.408    0.008   48.987    0.000    0.408    0.311
    age               9.857    0.185   53.150    0.000    9.857    1.000
    height            5.502    0.104   53.151    0.000    5.502    1.000
    weight          220.478    4.148   53.151    0.000  220.478    1.000
    target_shar_lg    0.006    0.000   50.486    0.000    0.006    1.000

R-Square:
                   Estimate
    fntsyPnts_trns    0.689
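
If desired, we can also extract the parameter estimates as a data frame for further processing; the sketch below uses the fitted lavaan object from above.

Code
# Extract parameter estimates (including standardized solutions) as a data frame.
lavaan::parameterEstimates(
  fimlRegression_fit,
  standardized = TRUE)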

11.14 Addressing Non-Independence of Observations

Please note that the \(p\)-value for regression coefficients assumes that the observations are independent—in particular, that the residuals are uncorrelated. However, the observations are not independent in the player_stats_seasonal dataframe used above, because the same player has multiple rows—one row corresponding to each season they played. This non-independence violates the traditional assumptions underlying significance tests of regression coefficients. We could address this violation by analyzing only one season from each player or by estimating the significance of the regression coefficients using cluster-robust standard errors. For simplicity, the models above were fit to the whole dataframe. In Chapter 12, we discuss mixed model approaches that handle repeated measures and other data that violate the assumption of independence. Below, we demonstrate how to account for non-independence of observations using cluster-robust standard errors with the rms package (Harrell, Jr., 2025).

Code
player_stats_seasonal_subset <- player_stats_seasonal %>% 
  filter(!is.na(player_id)) %>% 
  filter(position %in% c("WR"))

regressionWithClusterVariable <- rms::robcov(rms::ols(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal_subset,
  x = TRUE,
  y = TRUE),
  cluster = player_stats_seasonal_subset$player_id) #account for nested data within player

regressionWithClusterVariable
Frequencies of Missing Values Due to Each Variable
fantasyPoints_transformed                       age                    height 
                        0                         0                         0 
                   weight              target_share 
                        0                       953 

Linear Regression Model

rms::ols(formula = fantasyPoints_transformed ~ age + height + 
    weight + I(log(target_share + 1)), data = player_stats_seasonal_subset, 
    x = TRUE, y = TRUE)

                                                      Model Likelihood    Discrimination    
                                                            Ratio Test           Indexes    
 Obs                                        4697    LR chi2    5471.11    R2       0.688    
 sigma                                    0.6403    d.f.             4    R2 adj   0.688    
 d.f.                                       4692    Pr(> chi2)  0.0000    g        1.014    
Cluster on player_stats_seasonal_subset$player_id
 Clusters                                   1346                                            

Residuals

    Min      1Q  Median      3Q     Max 
-7.7389 -0.2298  0.0786  0.3290  3.0187 

             Coef    S.E.   t     Pr(>|t|)
Intercept    -1.0336 0.4201 -2.46 0.0139  
age           0.0124 0.0035  3.57 0.0004  
height       -0.0032 0.0075 -0.43 0.6681  
weight        0.0020 0.0011  1.85 0.0647  
target_share 11.7152 0.4430 26.45 <0.0001 
Code
performance::r2(regressionWithClusterVariable)
# R2 for Linear Regression
  R2: 0.688
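
As an alternative sketch (not part of the rms workflow above), cluster-robust standard errors can also be computed for an ordinary lm() fit using the sandwich and lmtest packages, assuming both packages are installed:

Code
# Hypothetical alternative: cluster-robust SEs for a plain lm() fit,
# clustering on player to account for repeated seasons per player.
ols_fit <- lm(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal_subset)

lmtest::coeftest(
  ols_fit,
  vcov = sandwich::vcovCL,
  cluster = ~ player_id)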

11.15 Impact of Outliers and Restricted Range

As with correlation, multiple regression can be strongly impacted by outliers and restricted range.

11.15.1 Robust Regression

To address outliers, there are various approaches to robust regression. One approach is to use an MM-type estimator, such as the one implemented in the robustbase::lmrob() and robustbase::glmrob() functions of the robustbase package (Maechler et al., 2024; Todorov & Filzmoser, 2009).

Code
robustRegression <- robustbase::lmrob(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal %>% filter(position %in% c("WR"))
)

summary(robustRegression)

Call:
robustbase::lmrob(formula = fantasyPoints_transformed ~ age + height + weight + 
    I(log(target_share + 1)), data = player_stats_seasonal %>% filter(position %in% 
    c("WR")))
 \--> method = "MM"
Residuals:
     Min       1Q   Median       3Q      Max 
-8.53606 -0.29144  0.01306  0.26351  3.08110 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)              -0.1835731  0.2275381  -0.807    0.420    
age                      -0.0005275  0.0019637  -0.269    0.788    
height                   -0.0067733  0.0041664  -1.626    0.104    
weight                    0.0005558  0.0006472   0.859    0.391    
I(log(target_share + 1)) 12.9061459  0.1075719 119.977   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Robust residual standard error: 0.399 
  (953 observations deleted due to missingness)
Multiple R-squared:  0.8474,    Adjusted R-squared:  0.8473 
Convergence in 13 IRWLS iterations

Robustness weights: 
 69 observations c(142,174,185,232,272,284,294,365,392,398,514,515,752,753,1117,1153,1173,1190,1191,1252,1307,1387,1388,1431,1508,1520,1572,1599,1637,1796,1884,2082,2279,2308,2325,2342,2361,2367,2404,2561,2586,2629,2741,2803,2838,2919,2953,2961,3100,3128,3311,3446,3456,3630,3713,3759,3760,3880,3881,3912,4007,4131,4146,4147,4271,4290,4391,4480,4622)
     are outliers with |weight| = 0 ( < 2.1e-05); 
 392 weights are ~= 1. The remaining 4236 ones are summarized as
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0001167 0.8591000 0.9495000 0.8694000 0.9864000 0.9990000 
Algorithmic parameters: 
       tuning.chi                bb        tuning.psi        refine.tol 
        1.548e+00         5.000e-01         4.685e+00         1.000e-07 
          rel.tol         scale.tol         solve.tol          zero.tol 
        1.000e-07         1.000e-10         1.000e-07         1.000e-10 
      eps.outlier             eps.x warn.limit.reject warn.limit.meanrw 
        2.129e-05         4.820e-10         5.000e-01         5.000e-01 
     nResample         max.it         groups        n.group       best.r.s 
           500             50              5            400              2 
      k.fast.s          k.max    maxit.scale      trace.lev            mts 
             1            200            200              0           1000 
    compute.rd fast.s.large.n 
             0           2000 
                  psi           subsampling                   cov 
           "bisquare"         "nonsingular"         ".vcov.avar1" 
compute.outlier.stats 
                 "SM" 
seed : int(0) 

Another approach to handling outliers is to use bootstrapping, which involves fitting the model to many bootstrap resamples of the data. Bootstrap samples are datasets generated by repeatedly sampling observations from the original data with replacement.

We set up the bootstrap resamples using the rsample::bootstraps() function of the rsample package (Frick et al., 2025).

Code
set.seed(52242)
bootstrapSamples <- 2000

# Create bootstrap resamples (with apparent = TRUE)
boots <- rsample::bootstraps(
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  times = bootstrapSamples,
  apparent = TRUE
)

# Define recipe
rec <- recipes::recipe(
  fantasyPoints_transformed ~ age + height + weight + target_share,
  data = player_stats_seasonal %>% filter(position %in% c("WR"))
) %>%
  recipes::step_log(target_share, offset = 1) # replace I(log(target_share + 1))

# Define model spec and workflow
lm_spec <- parsnip::linear_reg() %>%
  parsnip::set_engine("lm") %>%
  parsnip::set_mode("regression")

lm_workflow <- workflows::workflow() %>%
  workflows::add_recipe(rec) %>%
  workflows::add_model(lm_spec)

# Function for fitting the models on the bootstrap samples
fit_lm <- function(split, ...) {
  analysis_data <- rsample::analysis(split)
  fit <- workflows::fit(lm_workflow, data = analysis_data)
  broom::tidy(fit)
}

# Fit bootstrap models and save results
boot_models <- boots %>%
  mutate(coef_info = purrr::map(splits, fit_lm))

# Extract bootstrapped coefficient estimates
boot_coefs <- boot_models %>%
  tidyr::unnest(coef_info)

# Percentile confidence intervals
percentile_intervals <- rsample::int_pctl(
  .data = boot_models,
  statistics = coef_info)

# Bias-corrected and accelerated confidence intervals
bca_intervals <- rsample::int_bca(
  .data = boot_models,
  statistics = coef_info,
  .fn = fit_lm
)

# View confidence intervals
percentile_intervals
Code
bca_intervals
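
We can also summarize the bootstrap distribution of each coefficient directly; the sketch below uses the boot_coefs object created above.

Code
# Bootstrap mean and standard error of each regression coefficient.
boot_coefs %>%
  dplyr::group_by(term) %>%
  dplyr::summarise(
    boot_mean = mean(estimate),
    boot_se = sd(estimate))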

The distributions of the regression coefficients across bootstrap resamples are depicted in Figure 11.32.

Code
ggplot(
  data = boot_coefs,
  aes(x = estimate)) +
  geom_histogram(
    fill = "gray80",
    color = "black") +
  facet_wrap(
    ~ term,
    scales = "free") +
  geom_vline(
    data = percentile_intervals,
    aes(
      xintercept = .lower),
    col = "blue",
    linetype = "dashed") +
  geom_vline(
    data = percentile_intervals,
    aes(
      xintercept = .upper),
    col = "blue",
    linetype = "dashed") +
  labs(
    title = "Bootstrap Distributions of Coefficients",
    x = "Coefficient Estimate",
    y = "Count") +
  theme_classic()
Histogram of Parameter Estimates Across Bootstraps.
Figure 11.32: Histogram of Parameter Estimates Across Bootstraps.

11.16 Moderated Multiple Regression

When examining moderation in multiple regression, several steps are important:

First, we mean-center the predictors. Mean-centering reduces nonessential multicollinearity between the predictors and their interaction term, and it makes the lower-order coefficients interpretable as the effect of one predictor at the mean level of the other. In this case, we center the predictors around the mean of height and weight for Wide Receivers:

Code
player_stats_seasonal$height_centered <- player_stats_seasonal$height -
  mean(
    player_stats_seasonal$height[which(player_stats_seasonal$position == "WR")],
    na.rm = TRUE)

player_stats_seasonal$weight_centered <- player_stats_seasonal$weight -
  mean(
    player_stats_seasonal$weight[which(player_stats_seasonal$position == "WR")],
    na.rm = TRUE)

Then, we compute the interaction term as the product of the two centered predictors:

Code
player_stats_seasonal$heightXweight <- player_stats_seasonal$height_centered * player_stats_seasonal$weight_centered

Then, we fit the moderated multiple regression model:

Code
moderationModel <- lm(
  fantasyPoints_transformed ~ height_centered + weight_centered + height_centered:weight_centered,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

summary(moderationModel)

Call:
lm(formula = fantasyPoints_transformed ~ height_centered + weight_centered + 
    height_centered:weight_centered, data = player_stats_seasonal %>% 
    filter(position %in% c("WR")), na.action = "na.exclude")

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9113 -0.8306  0.0552  0.9584  3.1206 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      0.4245678  0.0181654  23.372  < 2e-16 ***
height_centered                 -0.0104841  0.0095946  -1.093   0.2746    
weight_centered                  0.0109833  0.0015104   7.272 4.03e-13 ***
height_centered:weight_centered -0.0006757  0.0003916  -1.725   0.0845 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.137 on 5646 degrees of freedom
Multiple R-squared:  0.01685,   Adjusted R-squared:  0.01633 
F-statistic: 32.25 on 3 and 5646 DF,  p-value: < 2.2e-16

This model is equivalent to the model that includes the interaction term explicitly:

Code
moderationModel <- lm(
  fantasyPoints_transformed ~ height_centered + weight_centered + heightXweight,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

summary(moderationModel)

Now, we can visualize the interaction to understand it. We create an interaction plot (Figure 11.33) and Johnson-Neyman plot (Figure 11.34) using the interactions package (Long, 2024).

Code
interactions::interact_plot(
  moderationModel,
  pred = height_centered,
  modx = weight_centered)
Interaction Plot from Moderated Multiple Regression.
Figure 11.33: Interaction Plot from Moderated Multiple Regression.
Code
interactions::johnson_neyman(
  moderationModel,
  pred = height_centered,
  modx = weight_centered,
  alpha = .05)
JOHNSON-NEYMAN INTERVAL

The Johnson-Neyman interval could not be found. Is the p value for your
interaction term below the specified alpha?
Johnson-Neyman Plot from Moderated Multiple Regression.
Figure 11.34: Johnson-Neyman Plot from Moderated Multiple Regression.

Here is a simple slopes analysis:

Code
interactions::sim_slopes(
  moderationModel,
  pred = height_centered,
  modx = weight_centered,
  johnson_neyman = FALSE)
SIMPLE SLOPES ANALYSIS

Slope of height_centered when weight_centered = -1.484982e+01 (- 1 SD): 

   Est.   S.E.   t val.      p
------- ------ -------- ------
  -0.00   0.01    -0.04   0.97

Slope of height_centered when weight_centered =  1.247537e-14 (Mean): 

   Est.   S.E.   t val.      p
------- ------ -------- ------
  -0.01   0.01    -1.09   0.27

Slope of height_centered when weight_centered =  1.484982e+01 (+ 1 SD): 

   Est.   S.E.   t val.      p
------- ------ -------- ------
  -0.02   0.01    -1.75   0.08
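
As an alternative sketch, simple slopes can also be probed with the rockchalk package (loaded above); this assumes the fitted moderation model from above.

Code
# Probe simple slopes at -1 SD, the mean, and +1 SD of the moderator.
ps <- rockchalk::plotSlopes(
  moderationModel,
  plotx = "height_centered",
  modx = "weight_centered",
  modxVals = "std.dev.")

rockchalk::testSlopes(ps) # hypothesis tests for each simple slope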

11.17 Mediation

A mediation model takes the following general form in the lavaan package (Rosseel, 2012; Rosseel et al., 2024).

Code
mediationModel <- '
  # direct effect (cPrime)
  Y ~ direct*X
  
  # mediator
  M ~ a*X
  Y ~ b*M
  
  # indirect effect = a*b
  indirect := a*b
  
  # total effect (c)
  total := direct + indirect
  totalAbs := abs(direct) + abs(indirect)
  
  # proportion mediated
  Pm := abs(indirect) / totalAbs
'

Let’s substitute in our predictor, outcome, and hypothesized mediator. In this case, we predict that receiving touchdowns partially account for the association between Wide Receivers’ target share and their fantasy points. This is a silly example because fantasy points are derived, in part, from touchdowns, so of course touchdowns will partially account for almost any effect on Wide Receivers’ fantasy points. This example merely demonstrates the process of developing and examining a mediation model.

Code
mediationModel <- '
  # direct effect (cPrime)
  fantasyPoints_transformed ~ direct*target_share
  
  # mediator
  receiving_tds ~ a*target_share
  fantasyPoints_transformed ~ b*receiving_tds
  
  # indirect effect = a*b
  indirect := a*b
  
  # total effect (c), computed here as the sum of absolute direct and indirect effects
  total := abs(direct) + abs(indirect)
  
  # proportion mediated
  Pm := abs(indirect) / total
'

To get a robust estimate of the indirect effect, we obtain bootstrapped estimates from 1,000 bootstrap draws. Typically, we would obtain bootstrapped estimates from 10,000 bootstrap draws, but this example uses only 1,000 bootstrap draws for a shorter runtime.

Code
mediationFit <- lavaan::sem(
  mediationModel,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  se = "bootstrap",
  bootstrap = 1000, # generally use 10,000 bootstrap draws; this example uses 1,000 for speed
  parallel = "multicore", # parallelization for speed: use "multicore" for Mac/Linux; "snow" for PC
  iseed = 52242, # for reproducibility
  missing = "ML",
  estimator = "ML",
  fixed.x = FALSE)

Here are the model results:

Code
summary(
  mediationFit,
  fit.measures = TRUE,
  standardized = TRUE,
  rsquare = TRUE)
lavaan 0.6-21 ended normally after 19 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         9

  Number of observations                          5650
  Number of missing patterns                         2

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Model Test Baseline Model:

  Test statistic                             10176.132
  Degrees of freedom                                 3
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       1.000
                                                      
  Robust Comparative Fit Index (CFI)             1.000
  Robust Tucker-Lewis Index (TLI)                1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)             -13317.378
  Loglikelihood unrestricted model (H1)     -13317.378
                                                      
  Akaike (AIC)                               26652.756
  Bayesian (BIC)                             26712.511
  Sample-size adjusted Bayesian (SABIC)      26683.911

Root Mean Square Error of Approximation:

  RMSEA                                          0.000
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.000
  P-value H_0: RMSEA <= 0.050                       NA
  P-value H_0: RMSEA >= 0.080                       NA
                                                      
  Robust RMSEA                                   0.000
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.000
  P-value H_0: Robust RMSEA <= 0.050                NA
  P-value H_0: Robust RMSEA >= 0.080                NA

Standardized Root Mean Square Residual:

  SRMR                                           0.000

Parameter Estimates:

  Standard errors                            Bootstrap
  Number of requested bootstrap draws             1000
  Number of successful bootstrap draws             997

Regressions:
                              Estimate  Std.Err  z-value  P(>|z|)   Std.lv
  fantasyPoints_transformed ~                                             
    trgt_sh (drct)               5.913    0.642    9.212    0.000    5.913
  receiving_tds ~                                                         
    trgt_sh    (a)              22.051    1.279   17.236    0.000   22.051
  fantasyPoints_transformed ~                                             
    rcvng_t    (b)               0.172    0.015   11.832    0.000    0.172
  Std.all
         
    0.483
         
    0.705
         
    0.439

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .fntsyPnts_trns   -0.494    0.024  -20.344    0.000   -0.494   -0.431
   .receiving_tds     0.315    0.100    3.163    0.002    0.315    0.108
    target_share      0.087    0.001   69.664    0.000    0.087    0.932

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .fntsyPnts_trns    0.361    0.018   20.347    0.000    0.361    0.275
   .receiving_tds     4.302    0.289   14.874    0.000    4.302    0.502
    target_share      0.009    0.001   16.445    0.000    0.009    1.000

R-Square:
                   Estimate
    fntsyPnts_trns    0.725
    receiving_tds     0.498

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    indirect          3.792    0.141   26.943    0.000    3.792    0.310
    total             9.705    0.535   18.143    0.000    9.705    0.793
    Pm                0.391    0.032   12.088    0.000    0.391    0.391
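
Because the model was estimated with bootstrapped standard errors, we can also obtain bootstrap percentile confidence intervals for the defined parameters; the sketch below uses the fitted object from above.

Code
# Bootstrap percentile CIs; defined parameters have the ":=" operator.
lavaan::parameterEstimates(
  mediationFit,
  boot.ci.type = "perc") %>%
  dplyr::filter(op == ":=")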

We can also estimate a model with multiple hypothesized mediators:

Code
multipleMediatorModel <- '
  # direct effect (cPrime)
  fantasyPoints_transformed ~ direct*target_share
  
  # mediator
  receiving_tds ~ a1*target_share
  receiving_yards ~ a2*target_share
  
  fantasyPoints_transformed ~ b1*receiving_tds + b2*receiving_yards
  
  # indirect effect = a*b
  indirect1 := a1*b1
  indirect2 := a2*b2
  indirectTotal := indirect1 + indirect2
  
  # total effect (c)
  total := abs(direct) + abs(indirectTotal)
  
  # proportion mediated
  Pm1 := abs(indirect1) / total
  Pm2 := abs(indirect2) / total
  PmTotal := abs(indirectTotal) / total
'
Code
multipleMediatorFit <- lavaan::sem(
  multipleMediatorModel,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  se = "bootstrap",
  bootstrap = 1000, # generally use 10,000 bootstrap draws; this example uses 1,000 for speed
  parallel = "multicore", # parallelization for speed: use "multicore" for Mac/Linux; "snow" for PC
  iseed = 52242, # for reproducibility
  missing = "ML",
  estimator = "ML",
  fixed.x = FALSE)

Here are the model results:

Code
summary(
  multipleMediatorFit,
  fit.measures = TRUE,
  standardized = TRUE,
  rsquare = TRUE)
lavaan 0.6-21 ended normally after 36 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        13

  Number of observations                          5650
  Number of missing patterns                         2

Model Test User Model:
                                                      
  Test statistic                              3251.359
  Degrees of freedom                                 1
  P-value (Chi-square)                           0.000

Model Test Baseline Model:

  Test statistic                             22438.545
  Degrees of freedom                                 6
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.855
  Tucker-Lewis Index (TLI)                       0.131
                                                      
  Robust Comparative Fit Index (CFI)             0.850
  Robust Tucker-Lewis Index (TLI)                0.102

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)             -50524.268
  Loglikelihood unrestricted model (H1)     -48898.589
                                                      
  Akaike (AIC)                              101074.537
  Bayesian (BIC)                            101160.849
  Sample-size adjusted Bayesian (SABIC)     101119.539

Root Mean Square Error of Approximation:

  RMSEA                                          0.758
  90 Percent confidence interval - lower         0.737
  90 Percent confidence interval - upper         0.780
  P-value H_0: RMSEA <= 0.050                    0.000
  P-value H_0: RMSEA >= 0.080                    1.000
                                                      
  Robust RMSEA                                   0.791
  90 Percent confidence interval - lower         0.766
  90 Percent confidence interval - upper         0.817
  P-value H_0: Robust RMSEA <= 0.050             0.000
  P-value H_0: Robust RMSEA >= 0.080             1.000

Standardized Root Mean Square Residual:

  SRMR                                           0.078

Parameter Estimates:

  Standard errors                            Bootstrap
  Number of requested bootstrap draws             1000
  Number of successful bootstrap draws             998

Regressions:
                              Estimate   Std.Err  z-value  P(>|z|)   Std.lv 
  fantasyPoints_transformed ~                                               
    trgt_sh (drct)                1.612    0.330    4.887    0.000     1.612
  receiving_tds ~                                                           
    trgt_sh   (a1)               22.603    1.255   18.005    0.000    22.603
  receiving_yards ~                                                         
    trgt_sh   (a2)             3504.580  193.635   18.099    0.000  3504.580
  fantasyPoints_transformed ~                                               
    rcvng_t   (b1)                0.025    0.004    6.305    0.000     0.025
    rcvng_y   (b2)                0.002    0.000   28.600    0.000     0.002
  Std.all
         
    0.135
         
    0.731
         
    0.852
         
    0.063
    0.733

Intercepts:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
   .fntsyPnts_trns    -0.586    0.012  -49.226    0.000    -0.586   -0.516
   .receiving_tds      0.265    0.097    2.722    0.006     0.265    0.090
   .receiving_yrds    66.695   15.113    4.413    0.000    66.695    0.171
    target_share       0.087    0.001   69.583    0.000     0.087    0.923

Variances:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
   .fntsyPnts_trns     0.259    0.007   38.582    0.000     0.259    0.202
   .receiving_tds      3.986    0.265   15.039    0.000     3.986    0.466
   .receiving_yrds 41366.908 5515.504    7.500    0.000 41366.908    0.273
    target_share       0.009    0.001   15.609    0.000     0.009    1.000

R-Square:
                   Estimate 
    fntsyPnts_trns     0.798
    receiving_tds      0.534
    receiving_yrds     0.727

Defined Parameters:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
    indirect1          0.554    0.091    6.092    0.000     0.554    0.046
    indirect2          7.492    0.231   32.394    0.000     7.492    0.625
    indirectTotal      8.046    0.241   33.432    0.000     8.046    0.671
    total              9.658    0.532   18.162    0.000     9.658    0.806
    Pm1                0.057    0.009    6.601    0.000     0.057    0.057
    Pm2                0.776    0.026   29.395    0.000     0.776    0.776
    PmTotal            0.833    0.024   34.335    0.000     0.833    0.833

11.18 Bayesian Multiple Regression

We can also estimate the multiple regression model in a Bayesian framework using the brms package.

Code
bayesianMultipleRegressionModel <- brm(
  formula = fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  family = gaussian() # Gaussian likelihood, as in ordinary linear regression
)
Code
summary(bayesianMultipleRegressionModel)
 Family: gaussian 
  Links: mu = identity 
Formula: fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)) 
   Data: player_stats_seasonal %>% filter(position %in% c(" (Number of observations: 4697) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept             -1.04      0.32    -1.67    -0.41 1.00     3943     2792
age                    0.01      0.00     0.01     0.02 1.00     5491     2583
height                -0.00      0.01    -0.01     0.01 1.00     3732     2669
weight                 0.00      0.00     0.00     0.00 1.00     3899     3257
Ilogtarget_shareP1    11.72      0.12    11.49    11.95 1.00     3265     2480

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.64      0.01     0.63     0.65 1.00     3472     2927

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Code
performance::r2(bayesianMultipleRegressionModel)
# Bayesian R2 with Compatibility Interval

  Conditional R2: 0.688 (95% CI [0.680, 0.696])
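
To further evaluate the model, we might examine a posterior predictive check, which compares the observed outcome distribution to draws from the posterior predictive distribution; the sketch below uses the fitted model from above.

Code
# Posterior predictive check: overlay densities of replicated datasets on the
# observed outcome.
brms::pp_check(
  bayesianMultipleRegressionModel,
  ndraws = 100)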

11.19 Conclusion

Multiple regression allows examining the association between multiple predictor variables and one outcome variable. Including multiple predictors in the model allows for potentially greater predictive accuracy and identifies the extent to which each predictor uniquely contributes to the outcome variable. As with correlation, an association does not imply causation. However, identifying associations is important because associations are a necessary (but insufficient) condition for causality. When developing a multiple regression model, there are various assumptions that are important to evaluate. In addition, it is important to pay attention to potential multicollinearity: when predictors are highly correlated, the parameter estimates become more uncertain, which can make it difficult to detect a given predictor variable as statistically significant.

11.20 Session Info

Code
sessionInfo()
R version 4.5.3 (2026-03-11)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] future_1.70.0       knitr_1.51          lubridate_1.9.5    
 [4] forcats_1.0.1       stringr_1.6.0       readr_2.2.0        
 [7] tibble_3.3.1        tidyverse_2.0.0     yardstick_1.3.2    
[10] workflowsets_1.1.1  workflows_1.3.0     tune_2.0.1         
[13] tidyr_1.3.2         tailor_0.1.0        rsample_1.3.2      
[16] recipes_1.3.1       purrr_1.2.1         parsnip_1.4.1      
[19] modeldata_1.5.1     infer_1.1.0         ggplot2_4.0.2      
[22] dplyr_1.2.0         dials_1.4.2         scales_1.4.0       
[25] tidymodels_1.4.1    yhat_2.0-5          effectsize_1.0.2   
[28] rockchalk_1.8.157   broom_1.0.12        MASS_7.3-65        
[31] ordinal_2025.12-29  robustbase_0.99-7   parallelly_1.46.1  
[34] brms_2.23.0         Rcpp_1.1.1          interactions_1.2.0 
[37] miceadds_3.19-16    mice_3.19.0         lavaan_0.6-21      
[40] performance_0.16.0  lme4_2.0-1          Matrix_1.7-4       
[43] bestNormalize_1.9.2 car_3.1-5           carData_3.0-6      
[46] rms_8.1-1           Hmisc_5.2-5         petersenlab_1.2.3  

loaded via a namespace (and not attached):
  [1] fs_1.6.7              matrixStats_1.5.0     sparsevctrs_0.3.6    
  [4] httr_1.4.8            DiceDesign_1.10       RColorBrewer_1.1-3   
  [7] insight_1.4.6         doParallel_1.0.17     numDeriv_2016.8-1.1  
 [10] tools_4.5.3           doRNG_1.8.6.3         backports_1.5.0      
 [13] R6_2.6.1              nortest_1.0-4         mgcv_1.9-4           
 [16] jomo_2.7-6            withr_3.0.2           Brobdingnag_1.2-9    
 [19] prettyunits_1.2.0     gridExtra_2.3         quantreg_6.1         
 [22] cli_3.6.5             domir_1.2.0           mix_1.0-13           
 [25] sandwich_3.1-1        labeling_0.4.3        mvtnorm_1.3-6        
 [28] S7_0.2.1              polspline_1.1.25      proxy_0.4-29         
 [31] QuickJSR_1.9.0        pbivnorm_0.6.0        StanHeaders_2.32.10  
 [34] foreign_0.8-91        plotrix_3.8-14        readxl_1.4.5         
 [37] rstudioapi_0.18.0     generics_0.1.4        shape_1.4.6.1        
 [40] distributional_0.7.0  zip_2.3.3             inline_0.3.21        
 [43] loo_2.9.0             DescTools_0.99.60     abind_1.4-8          
 [46] lifecycle_1.0.5       multcomp_1.4-30       yaml_2.3.12          
 [49] grid_4.5.3            crayon_1.5.3          mitml_0.4-5          
 [52] butcher_0.4.0         lattice_0.22-9        haven_2.5.5          
 [55] jtools_2.3.1          pillar_1.11.1         gld_2.6.8            
 [58] boot_1.3-32           estimability_1.5.1    future.apply_1.20.2  
 [61] codetools_0.2-20      pan_1.9               glue_1.8.0           
 [64] V8_8.0.1              data.table_1.18.2.1   vctrs_0.7.1          
 [67] Rdpack_2.6.6          cellranger_1.1.0      gtable_0.3.6         
 [70] datawizard_1.3.0      gower_1.0.2           xfun_0.57            
 [73] openxlsx_4.2.8.1      rbibutils_2.4.1       prodlim_2026.03.11   
 [76] coda_0.19-4.1         reformulas_0.4.4      survival_3.8-6       
 [79] timeDate_4052.112     iterators_1.0.14      hardhat_1.4.2        
 [82] lava_1.8.2            TH.data_1.1-5         ipred_0.9-15         
 [85] nlme_3.1-168          progress_1.2.3        rstan_2.32.7         
 [88] tensorA_0.36.2.1      otel_0.2.0            rpart_4.1.24         
 [91] colorspace_2.1-2      DBI_1.3.0             nnet_7.3-20          
 [94] processx_3.8.6        Exact_3.3             mnormt_2.1.2         
 [97] tidyselect_1.2.1      emmeans_2.0.2         compiler_4.5.3       
[100] curl_7.0.0            glmnet_4.1-10         htmlTable_2.4.3      
[103] expm_1.0-0            SparseM_1.84-2        bayestestR_0.17.0    
[106] posterior_1.6.1       checkmate_2.3.4       DEoptimR_1.1-4       
[109] psych_2.6.1           quadprog_1.5-8        callr_3.7.6          
[112] digest_0.6.39         minqa_1.2.8           rmarkdown_2.30       
[115] htmltools_0.5.9       pkgconfig_2.0.3       base64enc_0.1-6      
[118] lhs_1.2.1             fastmap_1.2.0         rlang_1.1.7          
[121] htmlwidgets_1.6.4     farver_2.1.2          zoo_1.8-15           
[124] jsonlite_2.0.0        broom.mixed_0.2.9.7   magrittr_2.0.4       
[127] Formula_1.2-5         bayesplot_1.15.0      parameters_0.28.3    
[130] GPfit_1.0-9           ucminf_1.2.2          furrr_0.3.1          
[133] stringi_1.8.7         rootSolve_1.8.2.4     plyr_1.8.9           
[136] pkgbuild_1.4.8        parallel_4.5.3        listenv_0.10.1       
[139] lmom_3.2              kutils_1.73           splines_4.5.3        
[142] pander_0.6.6          hms_1.1.4             ps_1.9.1             
[145] rngtools_1.5.2        yacca_1.4-2           reshape2_1.4.5       
[148] stats4_4.5.3          rstantools_2.6.0      evaluate_1.0.5       
[151] mitools_2.4           RcppParallel_5.1.11-2 nloptr_2.2.1         
[154] tzdb_0.5.0            foreach_1.5.2         miscTools_0.6-30     
[157] MatrixModels_0.5-4    xtable_1.8-8          e1071_1.7-17         
[160] viridisLite_0.4.3     class_7.3-23          cluster_2.1.8.2      
[163] timechange_0.4.0      globals_0.19.1        bridgesampling_1.2-1 
