\[
y = \beta_0 + \beta_1x_1 + \epsilon
\tag{12.1}\]
where \(y\) is the outcome variable, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, \(x_1\) is the predictor variable, and \(\epsilon\) is the error term.
The equation for multiple regression extends this to multiple predictor variables, as in Equation 12.2:
\[
y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p + \epsilon
\tag{12.2}\]
where \(p\) is the number of predictor variables. Multiple regression is basically a weighted sum of the predictor variables plus an intercept. Under the hood, multiple regression seeks to identify the best weight for each predictor.
12.3 Components
\(B\) = unstandardized coefficient: direction and magnitude of the estimate (original scale)
\(\beta\) (beta) = standardized coefficient: direction and magnitude of the estimate (standard deviation scale)
\(SE\) = standard error: uncertainty of unstandardized estimate
The unstandardized regression coefficient (\(B\)) is interpreted such that, for every unit change in the predictor variable, there is a __ unit change in the outcome variable. For instance, when examining the association between age and fantasy points, if the unstandardized regression coefficient is 2.3, players score on average 2.3 more points for each additional year of age. (In reality, we might expect a nonlinear, inverted-U-shaped association between age and fantasy points such that players tend to reach their peak in the middle of their careers.) Unstandardized regression coefficients are tied to the metric of the raw data. Thus, unstandardized regression coefficients of the same size for two different variables may mean completely different things. Holding the strength of the association constant, you tend to see larger unstandardized regression coefficients for variables with smaller units and smaller unstandardized regression coefficients for variables with larger units.
Standardized regression coefficients can be obtained by standardizing the variables to z-scores so they all have a mean of zero and a standard deviation of one. The standardized regression coefficient (\(\beta\)) is interpreted such that, for every standard deviation change in the predictor variable, there is a __ standard deviation change in the outcome variable. For instance, when examining the association between age and fantasy points, if the standardized regression coefficient is 0.1, players score on average 0.1 standard deviation more points for each additional standard deviation of age. Standardized regression coefficients—though not in all instances—tend to fall between [−1, 1]. Thus, standardized regression coefficients tend to be more comparable across variables and models than unstandardized regression coefficients. In this way, standardized regression coefficients provide a meaningful index of effect size.
The standard error of a regression coefficient represents the uncertainty of the parameter. If we have less uncertainty (i.e., more confidence) about the parameter, the standard error will be small. If we have more uncertainty (i.e., less confidence) about the parameter, the standard error will be large. If we used the same sampling procedure repeatedly and calculated the regression coefficient each time, the true parameter in the population would fall 68% of the time within the interval of: \([\text{model parameter estimate for the regression coefficient} \pm 1 \text{ standard error}]\) .
A confidence interval represents a range of plausible values such that, with repeated sampling, the true value falls within a given interval with some confidence. Our parameter estimate for the regression coefficient, plus or minus 1 standard error, reflects the 68% confidence interval for the coefficient. The 95% confidence interval is computed as the parameter estimate plus or minus 1.96 standard errors. For instance, if the parameter estimate for the regression coefficient is 0.50, and the standard error is 0.10, the 95% confidence interval is [0.30, 0.70]: \(0.5 - (1.96 \times 0.10) = 0.3\); \(0.5 + (1.96 \times 0.10) = 0.7\). That is, if we used the same sampling procedure repeatedly, the true value of the regression coefficient would be expected to fall 95% of the time somewhere between 0.30 and 0.70.
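As a quick check, the confidence interval arithmetic from this example can be reproduced in R:

estimate <- 0.50
se <- 0.10

estimate + c(-1.96, 1.96) * se # 95% confidence interval: 0.30, 0.70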
the associations between the predictor variables and the outcome variable are linear
there is homoscedasticity of the residuals; the residuals do not differ as a function of the predictor variable or as a function of the outcome variable
the residuals are independent; they are uncorrelated with each other
the residuals are normally distributed
Homoscedasticity of the residuals means that the variance of the residuals does not differ as a function of the outcome variable or as a function of the predictor variable (i.e., the residuals show constant variance across the outcome/predictors). If the residuals differ as a function of the outcome or predictor variable, this is called heteroscedasticity.
Those are some of the key assumptions of multiple regression. However, there are additional assumptions of multiple regression, including ones discussed in the chapter on Causal Inference. For instance, the variables included should reflect a causal process such that the predictor variables influence the outcome variable, and not the other way around. That is, the outcome variable should not influence the predictor variables (i.e., there should be no reverse causation). In addition, it is important to control for any confound(s). If a confound is not controlled for, this is called omitted-variable bias, and it leads the researcher to incorrectly attribute the effects of the omitted variable to the included variables.
12.5.1 Evaluating and Addressing Assumptions of Multiple Regression
12.5.1.1 Linear Association
To evaluate the shape of the association between the predictor variables and the outcome variable, we can examine scatterplots (Figure 12.2), residual plots (Figure 12.21), marginal model plots (Figure 12.12), added-variable plots (Figure 12.13), and component-plus-residual plots (Figure 12.14). Residual plots depict the residuals (errors) on the y-axis as a function of the fitted values or a specific predictor on the x-axis. Marginal model plots are basically glorified scatterplots that depict the outcome variable (both in terms of observed values and model-fitted values) on the y-axis and the predictor variables on the x-axis. Added-variable plots depict the unique association of each predictor variable with the outcome variable when controlling for all the other predictor variables in the model. Component-plus-residual plots depict partial residuals on the y-axis as a function of each predictor variable on the x-axis, where a partial residual for a given predictor is the effect of a given predictor (thus controlling for all the other predictor variables in the model) plus the residual from the full model.
Examples of linear and nonlinear associations are depicted with scatterplots in Figure 12.2.
Figure 12.2: Example Associations Depicted With Scatterplots.
If the shape of the association is nonlinear (as indicated by any of these plots), various approaches may be necessary, such as including nonlinear terms (e.g., polynomial terms such as quadratic, cubic, quartic, or higher-degree terms), transforming the predictors (e.g., log, square root, inverse, exponential, Box-Cox, or Yeo-Johnson transformations), using splines/piecewise regression, or using generalized additive models.
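For example, several of these approaches can be specified directly in the model formula. Here is a minimal sketch with hypothetical variables y and x in a hypothetical data frame named exampleData:

library(splines)

fit_quadratic <- lm(y ~ x + I(x^2), data = exampleData) # polynomial (quadratic) term
fit_logx <- lm(y ~ log(x + 1), data = exampleData) # log-transformed predictor
fit_spline <- lm(y ~ ns(x, df = 3), data = exampleData) # natural cubic spline
fit_gam <- mgcv::gam(y ~ s(x), data = exampleData) # generalized additive model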
12.5.1.2 Homoscedasticity
To evaluate homoscedasticity, we can evaluate a residual plot (Figure 12.21) and spread-level plot (Figure 12.22). A residual plot depicts the residuals on the y-axis as a function of the model’s fitted values on the x-axis. Homoscedasticity in a residual plot is identified as a constant spread of residuals versus fitted values—the residuals do not show a fan, cone, or bow-tie shape; a fan, cone, or bow-tie shape indicates heteroscedasticity. In a residual plot, a fan or cone shape indicates increasing or decreasing variance in the residuals as a function of the fitted values; a bow-tie shape indicates that the residuals are smallest in the middle of the fitted values and greatest on the extremes of the fitted values. A spread-level plot depicts the log of the absolute value of studentized residuals on the y-axis as a function of the log of the model’s fitted values on the x-axis. Homoscedasticity in a spread-level plot is identified as a flat slope; a slope that differs from zero indicates heteroscedasticity.
Figure 12.3: Example of Homoscedasticity and Heteroscedasticity in Residual Plots.
If there is heteroscedasticity, it may be necessary to transform the outcome variable to be more normally distributed. The spread-level plot provides a suggested power transformation to transform the outcome variable so that the spread of residuals becomes more uniform across the fitted values.
12.5.1.3 Uncorrelated Residuals
To determine if residuals are correlated by a grouping level, we can examine the proportion of variance that is attributable to the grouping level using the intraclass correlation coefficient (ICC) from a mixed model. The greater the ICC value, the more variance is accounted for by the grouping level, and the more the residuals are intercorrelated. If the residuals are intercorrelated, it may be necessary to account for the grouping structure of the data using a mixed model.
12.5.1.4 Normally Distributed Residuals
To examine whether residuals are normally distributed, we can examine quantile–quantile (QQ) plots and probability–probability (PP) plots. QQ plots are particularly useful for identifying deviations from normality in the tails of the distribution; PP plots are particularly useful for identifying deviations from normality in the center of the distribution. Researchers tend to be more concerned about the tails of the distribution, because extreme values tend to have a greater impact on inferences, so researchers tend to use QQ plots more often than PP plots. Various examples of QQ plots and deviations from normality are depicted in Figure 12.4. If the residuals are normally distributed, they will stay close to the diagonal reference line of the QQ and PP plots.
Figure 12.4: Quantile–Quantile (QQ) Plots of Various Distributions (Right Side) with the Histogram of the Associated Distribution (Left Side).
If the residuals are not normally distributed (i.e., they do not stay close to the diagonal reference line of the QQ and PP plots), it may be necessary to transform the outcome variable to be more normally distributed or to use a generalized linear model (GLM) that more closely matches the distribution of the outcome variable (e.g., Poisson, binomial, gamma).
As noted in Section 10.6.6, the coefficient of determination (\(R^2\)) reflects the proportion of variance in the outcome (dependent) variable that is explained by the model predictions, as in Equation 10.25: \(R^2 = \frac{\text{variance explained in }Y}{\text{total variance in }Y}\). Various formulas for \(R^2\) are in Equation 10.19. Larger \(R^2\) values indicate greater accuracy. Multiple regression can be conceptualized with overlapping circles (similar to a Venn diagram), where the non-overlapping portions of the circles reflect nonshared variance and the overlapping portions of the circles reflect shared variance, as in Figure 12.5.
Figure 12.5: Conceptual Depiction of Proportion of Variance Explained (\(R^2\)) in an Outcome Variable (\(Y\)) by Multiple Predictors (\(X1\) and \(X2\)) in Multiple Regression. The size of each circle represents the variable’s variance. The proportion of variance in \(Y\) that is explained by the predictors is depicted by the areas in orange. The dark orange space (\(G\)) is where multiple predictors explain overlapping variance in the outcome. Overlapping variance that is explained in the outcome (\(G\)) will not be recovered in the regression coefficients when both predictors are included in the regression model. From Petersen (2024) and Petersen (2025).
One issue with \(R^2\) is that it increases as the number of predictors increases, which can lead to overfitting if using \(R^2\) as an index to compare models for purposes of selecting the “best-fitting” model. Consider the following example (adapted from Petersen (2025)) in which you have one predictor variable and one outcome variable, as shown in Table 12.1.
Table 12.1: Example Data of Predictor (x1) and Outcome (y) Used for Regression Model.
 y   x1
 7    1
13    2
29    7
10    2
Using these data, the best-fitting regression model is: \(y = 3.98 + 3.59 \cdot x_1\). In this example, the \(R^2\) is 0.98. The equation is not a perfect prediction, but with a single predictor variable, it captures the majority of the variance in the outcome.
Now consider the following example where you add a second predictor variable to the data above, as shown in Table 12.2.
Table 12.2: Example Data of Predictors (x1 and x2) and Outcome (y) Used for Regression Model.
 y   x1   x2
 7    1    3
13    2    5
29    7    1
10    2    2
With the second predictor variable, the best-fitting regression model is: \(y = 0.00 + 4.00 \cdot x_1 + 1.00 \cdot x_2\). In this example, the \(R^2\) is 1.00. The equation with the second predictor variable provides a perfect prediction of the outcome.
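These estimates can be reproduced directly in R; a minimal sketch using the data from Tables 12.1 and 12.2:

exampleData <- data.frame(
  y = c(7, 13, 29, 10),
  x1 = c(1, 2, 7, 2),
  x2 = c(3, 5, 1, 2))

summary(lm(y ~ x1, data = exampleData))$r.squared # ~0.98 with one predictor
summary(lm(y ~ x1 + x2, data = exampleData))$r.squared # 1.00 with both predictors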
Providing perfect prediction with the right set of predictor variables is the dream of multiple regression. So, using multiple regression, we often add predictor variables to incrementally improve prediction. However, it is important to know how much variance would be accounted for by random chance, which is given by Equation 12.3:
\[
E(R^2) = \frac{K}{n-1}
\tag{12.3}\]
where \(E(R^2)\) is the expected value of \(R^2\) (the proportion of variance explained), \(K\) is the number of predictor variables, and \(n\) is the sample size. The formula demonstrates that the more predictor variables in the regression model, the more variance will be accounted for by chance. With many predictor variables and a small sample, you can account for a large share of the variance merely by chance.
As an example, consider that we have 13 predictor variables to predict fantasy performance for 43 players. Assume that, with 13 predictor variables, we explain 38% of the variance (\(R^2 = .38; r = .62\)). We explained a lot of the variance in the outcome, but it is important to consider how much variance could have been explained by random chance: \(E(R^2) = \frac{K}{n-1} = \frac{13}{43 - 1} = .31\). We expect to explain 31% of the variance, by chance, in the outcome. So, 82% of the variance explained was likely spurious (i.e., \(\frac{.31}{.38} = .82\)). As the sample size increases, the spuriousness decreases. To account for the number of predictor variables in the model, we can use a modified version of \(R^2\) called adjusted \(R^2\) (\(R^2_{adj}\)), described next.
12.6.2 Adjusted \(R^2\) (\(R^2_{adj}\))
Adjusted \(R^2\) (\(R^2_{adj}\)) accounts for the number of predictor variables in the model, based on how much would be expected to be accounted for by chance to penalize overfitting. Adjusted \(R^2\) (\(R^2_{adj}\)) reflects the proportion of variance in the outcome (dependent) variable that is explained by the model predictions over and above what would be expected to be accounted for by chance, given the number of predictor variables in the model. The formula for adjusted \(R^2\) (\(R^2_{adj}\)) is in Equation 12.4:
\[
R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}
\tag{12.4}\]
where \(p\) is the number of predictor variables in the model, and \(n\) is the sample size.
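As a quick illustration, the chance-expected and adjusted values for the earlier example (13 predictors, \(n = 43\), \(R^2 = .38\)) can be computed directly; a minimal sketch using the formulas above:

k <- 13; n <- 43; r2 <- .38

k / (n - 1) # expected R^2 by chance (Equation 12.3): ~.31
1 - (1 - r2) * (n - 1) / (n - k - 1) # adjusted R^2 (Equation 12.4): ~.10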
12.7 Overfitting
Statistical models applied to big data (e.g., data with many predictor variables) can overfit the data, which means that the statistical model accounts for error variance that will not generalize to future samples. So, even though an overfitting statistical model appears to be accurate because it accounts for more variance, it is not actually that accurate—it will predict new data less accurately than it fits the data with which the model was built. In the case of fantasy football analytics, this is especially relevant because there are hundreds if not thousands of variables we could consider for inclusion and many, many players when considering historical data.
Consider an example where you develop an algorithm to predict players’ fantasy performance based on 2024 data using hundreds of predictor variables. To some extent, these predictor variables will likely account for true variance (i.e., signal) and error variance (i.e., noise). If we were to apply the same algorithm based on the 2024 prediction model to 2025 data, the prediction model would likely predict less accurately than with 2024 data. The regression coefficients tend to become weaker when applied to new data, a phenomenon called shrinkage, which is described in Section 16.7.1. The regression coefficients in the [FILL IN]
In Figure 12.6, the blue line represents the true distribution of the data, and the red line is an overfitting model:
Figure 12.6: Over-fitting Model in Red Relative to the True Distribution of the Data in Blue. From Petersen (2024) and Petersen (2025).
12.8 Covariates
Covariates are variables that you include in the statistical model to try to control for them so you can better isolate the unique contribution of the predictor variable(s) in relation to the outcome variable. Including covariates allows examining the association between the predictor variable and the outcome variable while holding people’s levels on the covariates constant. Including confounds as covariates can yield a more accurate estimate of the causal effect of the predictor variable on the outcome variable. Ideally, you want to include any and all confounds as covariates. As described in Section 9.4.2.1, confounds are third variables that influence both the predictor variable and the outcome variable and explain their association. Covariates are potentially (but not necessarily) confounds. For instance, you might include the player’s age as a covariate in a model that examines whether a player’s 40-yard dash time at the NFL Combine predicts their fantasy points in their rookie year, even though age may not be a confound.
12.9 Example: Predicting Wide Receivers’ Fantasy Points
Let’s say we want to use a number of variables to predict a wide receiver’s fantasy performance. We want to consider several predictors, including the player’s age, height, weight, and target share. We have only a few predictors and our sample size is large enough such that overfitting is not likely a concern.
On a weekly basis, target share is computed as the number of targets a player receives divided by the team’s total number of targets. It would be nice to compute it this way for seasonal data, as well; however, for calculating seasonal target share, the nflfastR (Carl & Baldwin, 2024) package currently sums up the player’s weekly target shares. Thus, seasonal target share (unlike weekly target share)—as it is currently calculated—does not range from 0–1. For example, if a player has a weekly target share of 15% in each of 17 games, that would be a seasonal (cross-week-summed) target share of 2.55 (\(0.15 \times 17 \; \text{games} = 2.55\)).
Figure 12.8: Scatterplots With Fantasy Points (Season) Among Wide Receivers. The linear best-fit line is in black. The nonlinear best-fit line is in blue.
There are some suggestions of potential nonlinearity, such as an inverted-U-shaped association between height and fantasy points, suggesting that there may be an optimal range for height among Wide Receivers—being too short or too tall could be a disadvantage. In addition, target share shows a weakening association with fantasy points as target share increases. Thus, after evaluating the linear associations between the predictors and the outcome, we will also examine the possibility of curvilinear associations.
12.9.3 Estimate Multiple Regression Model
Now that we have examined descriptive statistics and bivariate associations, let’s first estimate a multiple regression model with only linear terms:
Call:
lm(formula = fantasyPoints ~ age + height + weight + target_share,
data = player_stats_seasonal %>% filter(position %in% c("WR")),
na.action = "na.exclude")
Residuals:
Min 1Q Median 3Q Max
-739.80 -18.68 -10.50 10.74 292.78
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -34.62684 23.85029 -1.452 0.146617
age 0.75193 0.22418 3.354 0.000803 ***
height 0.05816 0.42094 0.138 0.890107
weight 0.14716 0.06686 2.201 0.027794 *
target_share 743.14867 7.38684 100.604 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 45.71 on 4432 degrees of freedom
(952 observations deleted due to missingness)
Multiple R-squared: 0.7054, Adjusted R-squared: 0.7052
F-statistic: 2654 on 4 and 4432 DF, p-value: < 2.2e-16
The terms that are significantly associated with fantasy performance among Wide Receivers are age, weight, and target share, each of which showed a positive association with fantasy points; height was not significantly associated with fantasy points.
If we want to obtain standardized regression coefficients, we can use the standardize_parameters() function of the effectsize package (Ben-Shachar et al., 2020, 2025).
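A minimal sketch, assuming the fitted model object from above is named fit (a hypothetical name):

library(effectsize)

standardize_parameters(fit) # standardized coefficients with 95% confidence intervals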
Figure 12.10: Standardized Regression Coefficients with 95% Confidence Interval.
12.9.4.2 Model-Implied Association
If we wanted to visualize the shape of the model-implied association between target share and fantasy points, we could generate the model-implied predictions using the data range that we want to visualize.
Figure 12.11: Model-Implied Predictions of A Wide Receiver’s Fantasy Points as a Function of Target Share. The model-implied predictions were estimated based on a multiple regression model.
We could also generate the model-implied prediction of fantasy points for any value of the predictor variables. For instance, we can compute the number of fantasy points expected for a Wide Receiver who is 23 years old, is 6’2” tall (72 inches), weighs 200 pounds, and has a (cross-week-summed) target share of 3 (which reflects an approximate target share of 17.6% per game; i.e., \(0.176 = \frac{3}{17 \; \text{games}}\)).
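A minimal sketch of how such a prediction might be generated with predict(), assuming the linear-terms model from above is stored in an object named fit (a hypothetical name):

newPlayer <- data.frame(
  age = 23,
  height = 72, # inches
  weight = 200, # pounds
  target_share = 3) # cross-week-summed target share

predict(fit, newdata = newPlayer)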
the associations between the predictor variables and the outcome variable are linear
there is homoscedasticity of the residuals; the residuals do not differ as a function of the predictor variables or as a function of the outcome variable
the residuals are independent; they are uncorrelated with each other
the residuals are normally distributed
12.9.5.1 Linear Association
We evaluated the shape of the association between the predictor variables and the outcome variable using scatterplots. We accounted for potential curvilinearity in the associations with quadratic terms. Other ways to account for nonlinearity, in addition to polynomials, include transforming predictors, using splines/piecewise regression, and using generalized additive models.
To evaluate for potential nonlinearity in the associations, we can also evaluate residual plots (Figure 12.21), marginal model plots (Figure 12.12), added-variable plots (Figure 12.13), and component-plus-residual plots (Figure 12.14) from the car package (Fox et al., 2024; Fox & Weisberg, 2019). For evaluating linearity, we would expect minimal bend/curvature in the lines.
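A minimal sketch of how these diagnostic plots can be generated, assuming the model with quadratic terms is stored in an object named fit_quadratic (a hypothetical name):

library(car)

residualPlots(fit_quadratic) # residuals vs. each predictor and vs. fitted values
marginalModelPlots(fit_quadratic) # observed and model-implied values for each predictor
avPlots(fit_quadratic) # added-variable plots
crPlots(fit_quadratic) # component-plus-residual plots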
The marginal model plots (Figure 12.12), residual plots (Figure 12.21), and component-plus-residual plots (Figure 12.14) suggest that the nonlinearity of the association between target share and fantasy points may not be fully captured by the quadratic term. Thus, we may need to apply a different approach to handling the nonlinear association between target share and fantasy points.
One approach we can take is to transform the target_share variable to be more normally distributed.
The histogram of the raw target_share variable is in Figure 12.15.
Code
hist(
  player_stats_seasonal$target_share[which(player_stats_seasonal$position == "WR")],
  main = "Histogram of Target Share")
Figure 12.15: Histogram of Target Share.
The variable shows a strong positive skew. To address a strong positive skew, we can use a log transformation. The histogram of the log-transformed variable is in Figure 12.16.
Code
hist(
  log(player_stats_seasonal$target_share[which(player_stats_seasonal$position == "WR")] + 1),
  main = "Histogram of Target Share (Log Transformed)")
Figure 12.16: Histogram of Target Share, Transformed.
Now we can re-fit the model with the log-transformed variable.
Call:
lm(formula = fantasyPoints ~ age + I(age^2) + height + I(height^2) +
weight + I(weight^2) + I(log(target_share + 1)), data = player_stats_seasonal %>%
filter(position %in% c("WR")), na.action = "na.exclude")
Residuals:
Min 1Q Median 3Q Max
-610.70 -15.60 -8.09 7.67 299.80
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.130e+02 4.995e+02 -0.226 0.821
age 3.243e+00 2.619e+00 1.238 0.216
I(age^2) -4.782e-02 4.718e-02 -1.014 0.311
height 4.906e+00 1.462e+01 0.336 0.737
I(height^2) -3.416e-02 1.009e-01 -0.339 0.735
weight -1.168e+00 8.826e-01 -1.324 0.186
I(weight^2) 3.247e-03 2.191e-03 1.482 0.138
I(log(target_share + 1)) 8.974e+02 7.864e+00 114.108 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 41.66 on 4429 degrees of freedom
(952 observations deleted due to missingness)
Multiple R-squared: 0.7555, Adjusted R-squared: 0.7551
F-statistic: 1955 on 7 and 4429 DF, p-value: < 2.2e-16
Target share shows a more linear association with fantasy points after log-transforming it (albeit still not perfect), as depicted in Figures 12.17, 12.18, 12.19, and 12.20.
Figure 12.20: Residual Plots After Log Transformation of Target Share.
12.9.5.2 Homoscedasticity
To evaluate homoscedasticity, we can evaluate a residual plot (Figure 12.21) and spread-level plot (Figure 12.22) from the car package (Fox et al., 2024; Fox & Weisberg, 2019). In a residual plot, you want a constant spread of residuals versus fitted values—you do not want the residuals to show a fan or cone shape.
In this example, the residuals appear to increase as a function of the fitted values. To handle this, we may need to transform the outcome variable to be more normally distributed.
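The spread-level plot’s suggested power transformation can be obtained from the car package; a minimal sketch, assuming the log-transformed model from above is stored in an object named fit_log (a hypothetical name):

car::spreadLevelPlot(fit_log) # prints a suggested power transformation for the outcome
# note: spread-level plots require positive fitted/response values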
The histogram for raw fantasy points is in Figure 12.23.
Code
hist(
  player_stats_seasonal$fantasyPoints[which(player_stats_seasonal$position == "WR")],
  main = "Histogram of Fantasy Points")
Figure 12.23: Histogram of Fantasy Points (Among Wide Receivers).
We can apply a Yeo-Johnson transformation to the outcome variable to generate a more normally distributed variable. A Yeo-Johnson transformation estimates the optimal transformation to make the variable more normally distributed. Let’s use a Yeo-Johnson transformation of fantasy points:
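A minimal sketch of one way to do this, using the Yeo-Johnson family of the car package’s powerTransform() function (the book’s exact implementation may differ):

fantasyPoints_wr <- player_stats_seasonal$fantasyPoints[
  which(player_stats_seasonal$position == "WR")]

yj <- car::powerTransform(fantasyPoints_wr, family = "yjPower") # estimate the optimal lambda
fantasyPoints_transformed <- car::yjPower(fantasyPoints_wr, coef(yj)) # apply the transformation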
Figure 12.26: Spread-Level Plot After Transformation of Fantasy Points.
The residuals show more constant variance after transforming the outcome variable.
12.9.5.3 Uncorrelated Residuals
To determine if residuals are correlated given the nested structure of the data, we can examine the proportion of variance that is attributable to the particular player. To do this, we can estimate the intraclass correlation coefficient (ICC) from a mixed model using the performance package (Lüdecke et al., 2021, 2025).
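A minimal sketch, assuming a random intercept for player; the exact model specification may differ from the one used in the book:

library(lme4)
library(performance)

icc_model <- lmer(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)) +
    (1 | player_id),
  data = subset(player_stats_seasonal, position == "WR"))

performance::icc(icc_model)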
The ICC indicates that over half of the variance is attributable to between-player variance, so it would be important to account for the player-specific variance using a mixed model. For simplicity, we focus on multiple regression models in this chapter; mixed models are described in Chapter 13.
12.9.5.4 Normally Distributed Residuals
We can examine whether residuals are normally distributed using quantile–quantile (QQ) plots and probability–probability (PP) plots, as in Figures 12.27 and 12.28. If the residuals are normally distributed, they should stay close to the diagonal reference line.
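A minimal sketch, assuming the final fitted model is stored in an object named fit_transformed (a hypothetical name):

car::qqPlot(fit_transformed) # QQ plot of studentized residuals with a confidence envelope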
Figure 12.28: Residual Plots After Transformation of Fantasy Points.
12.10 Multicollinearity
Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated. The problem of having multiple predictor variables that are highly correlated is that it makes it challenging to estimate the regression coefficients accurately.
Multicollinearity in multiple regression is depicted conceptually in Figure 12.29.
Figure 12.29: Conceptual Depiction of Multicollinearity in Multiple Regression. From Petersen (2024) and Petersen (2025).
Table 12.3: Example Data of Predictors (x1 and x2) and Outcome (y) Used for Regression Model.
 y   x1   x2
 9   2.0    4
11   3.0    6
17   4.0    8
 3   1.0    2
21   5.0   10
13   3.5    7
The second predictor variable adds no new information—it is exactly twice the value of the first predictor variable; thus, the two predictor variables are perfectly correlated (i.e., \(r = 1.0\)). This means that there are different prediction equations that are equally good—see Equation 12.9:
\[
\begin{aligned}
2x_2 &= y \\
0x_1 + 2x_2 &= y \\
4x_1 &= y \\
4x_1 + 0x_2 &= y \\
2x_1 + 1x_2 &= y \\
5x_1 - 0.5x_2 &= y \\
...
&= y
\end{aligned}
\tag{12.9}\]
Then, what are the regression coefficients? We do not know which regression coefficients are correct, because each of the possibilities fits the data equally well. Thus, when estimating the regression model, we could obtain arbitrary estimates of the regression coefficients with an enormous standard error around each estimate. In general, multicollinearity increases the uncertainty (i.e., standard errors and confidence intervals) around the parameter estimates. Any predictor variables that have a correlation above ~\(r = .30\) with each other could have an impact on the confidence interval of the regression coefficients. As the correlations among the predictor variables increase, the chance of getting an arbitrary answer increases, a phenomenon sometimes called “bouncing betas.” So, it is important to examine a correlation matrix of the predictor variables before putting them in the same regression model. You can also examine indices such as the variance inflation factor (VIF), where a value greater than 5 or 10 indicates multicollinearity.
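For example, VIF values like those shown below can be obtained with the car package; a minimal sketch, assuming the model with the log-transformed target share is stored in an object named fit_log (a hypothetical name):

car::vif(fit_log)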
age height weight
1.019535 2.095026 2.111745
I(log(target_share + 1))
1.030896
To address multicollinearity, you can drop a redundant predictor, or you can use principal component analysis or factor analysis of the predictors to reduce them to a smaller number of meaningful predictors. To obtain precise, confident estimates of individual regression coefficients, you need a low level of intercorrelation among the predictors, unless you have a very large sample size. However, if you are merely interested in prediction—and are not interested in interpreting the regression coefficients of individual predictors—multicollinearity poses less of a problem. For instance, machine learning cares more about achieving the greatest predictive accuracy possible and cares less about explaining which predictors are causally related to the outcome. So, multicollinearity is less of a concern for machine learning approaches.
12.11 Handling of Missing Data
An important consideration in multiple regression is how missing data are handled. Multiple regression in R using the lm() function applies listwise deletion. Listwise deletion removes any row (in the data file) from analysis that has a missing value on the outcome variable or any of the predictor variables. Removing all rows from analysis that have any missingness in the model variables can be a problem because missingness is often not completely at random—missingness often occurs systematically (i.e., for a reason). For instance, participants may be less likely to have data for all variables if they are from a lower socioeconomic status background and do not have the time to participate in all study procedures. Thus, applying listwise deletion, we might systematically exclude participants from lower socioeconomic status backgrounds (or other groups), which could lead to less generalizable inferences.
It is thus important to consider approaches to handle missingness. Various approaches to handle missingness include pairwise deletion (aka available-case analysis), multiple imputation, and full information maximum likelihood (FIML).
Now, let’s specify the imputation method—we use the two-level predictive mean matching (2l.pmm) method from the miceadds package (Robitzsch et al., 2024) to account for the nonindependent data (owing to multiple seasons per player):
Code
meth <- make.method(dataToImpute)
meth[1:length(meth)] <- ""
meth[Y] <- "2l.pmm" # specify the imputation method here; this can differ by outcome variable
Now, let’s specify the predictor matrix. A predictor matrix is a matrix of values, where:
columns with non-zero values are predictors of the variable specified in the given row
the diagonal of the predictor matrix should be zero because a variable cannot predict itself
The values are:
NOT a predictor of the outcome: 0
cluster variable: -2
fixed effect of predictor: 1
fixed effect and random effect of predictor: 2
include cluster mean of predictor in addition to fixed effect of predictor: 3
include cluster mean of predictor in addition to fixed effect and random effect of predictor: 4
imp <- mice::mice(
  as.data.frame(dataToImpute),
  method = meth,
  predictorMatrix = pred,
  m = numImputations,
  maxit = 5, # generally use 100 maximum iterations; this example uses 5 for speed
  seed = 52242)
Below are some imputation diagnostics. Trace plots are in Figure 12.30.
Code
plot(imp, c("target_share"))
Figure 12.30: Trace plots from multiple imputation.
Figure 12.31: Density plot from multiple imputation.
The imputed data do not match the distribution of the observed data well. Thus, it may be necessary to select a different imputation method for more accurate imputation.
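Another approach to handling missingness is full information maximum likelihood (FIML), which estimates the model using all available data rather than discarding incomplete rows. The output below is from a regression model estimated with FIML in the lavaan package (Rosseel, 2012); a minimal sketch of how such a model might be specified, assuming the log-transformed target share variable is named target_share_log (the variable names and data preparation are assumptions):

library(dplyr)
library(lavaan)

fimlModel <- '
  fantasyPoints_transformed ~ age + height + weight + target_share_log
'

fimlFit <- sem(
  fimlModel,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  missing = "ML", # full information maximum likelihood
  estimator = "ML",
  fixed.x = FALSE) # estimate means, variances, and covariances of the predictors

summary(fimlFit, standardized = TRUE, rsquare = TRUE)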
lavaan 0.6-19 ended normally after 58 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 20
Number of observations 5389
Number of missing patterns 2
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Parameter Estimates:
Standard errors Standard
Information Observed
Observed information based on Hessian
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv
fantasyPoints_transformed ~
age 0.040 0.007 5.443 0.000 0.040
height -0.003 0.014 -0.212 0.832 -0.003
weight 0.004 0.002 1.741 0.082 0.004
target_shar_lg 27.770 0.285 97.531 0.000 27.770
Std.all
0.046
-0.003
0.021
0.813
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
age ~~
height -0.297 0.101 -2.945 0.003 -0.297 -0.040
weight -0.155 0.637 -0.243 0.808 -0.155 -0.003
target_shar_lg 0.037 0.004 10.100 0.000 0.037 0.145
height ~~
weight 24.979 0.584 42.755 0.000 24.979 0.716
target_shar_lg 0.016 0.003 5.808 0.000 0.016 0.082
weight ~~
target_shar_lg 0.152 0.017 8.931 0.000 0.152 0.127
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.fntsyPnts_trns 1.704 0.796 2.142 0.032 1.704 0.616
age 26.254 0.043 611.427 0.000 26.254 8.329
height 72.606 0.032 2269.504 0.000 72.606 30.916
weight 200.480 0.202 991.395 0.000 200.480 13.505
target_shar_lg 0.081 0.001 70.753 0.000 0.081 0.997
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.fntsyPnts_trns 2.467 0.052 47.691 0.000 2.467 0.322
age 9.936 0.191 51.909 0.000 9.936 1.000
height 5.516 0.106 51.908 0.000 5.516 1.000
weight 220.372 4.245 51.908 0.000 220.372 1.000
target_shar_lg 0.007 0.000 49.109 0.000 0.007 1.000
R-Square:
Estimate
fntsyPnts_trns 0.678
12.12 Addressing Non-Independence of Observations
Please note that the \(p\)-values for regression coefficients assume that the observations are independent—in particular, that the residuals are not correlated. However, the observations are not independent in the player_stats_seasonal dataframe used above, because the same player has multiple rows—one row corresponding to each season they played. This non-independence violates the assumptions underlying the traditional significance tests of regression coefficients. We could address this by analyzing only one season from each player or by estimating the significance of the regression coefficients using cluster-robust standard errors. For simplicity, we present results above from the whole dataframe. In Chapter 13, we discuss mixed model approaches that handle repeated measures and other data that violate assumptions of independence. Below, we demonstrate how to account for non-independence of observations using cluster-robust standard errors with the rms package (Harrell, Jr., 2025).
Code
player_stats_seasonal_subset <- player_stats_seasonal %>%
  filter(!is.na(player_id)) %>%
  filter(position %in% c("WR"))

rms::robcov(
  rms::ols(
    fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
    data = player_stats_seasonal_subset,
    x = TRUE,
    y = TRUE),
  cluster = player_stats_seasonal_subset$player_id) # account for nested data within player
Frequencies of Missing Values Due to Each Variable
fantasyPoints_transformed age height
0 0 0
weight target_share
0 952
Linear Regression Model
rms::ols(formula = fantasyPoints_transformed ~ age + height +
weight + I(log(target_share + 1)), data = player_stats_seasonal_subset,
x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 4437 LR chi2 5009.98 R2 0.677
sigma 1.5744 d.f. 4 R2 adj 0.676
d.f. 4432 Pr(> chi2) 0.0000 g 2.428
Cluster on player_stats_seasonal_subset$player_id
Clusters 1292
Residuals
Min 1Q Median 3Q Max
-18.3306 -0.6618 0.2290 0.8845 6.7486
Coef S.E. t Pr(>|t|)
Intercept 2.3735 1.0482 2.26 0.0236
age 0.0337 0.0089 3.78 0.0002
height -0.0092 0.0183 -0.50 0.6171
weight 0.0036 0.0027 1.34 0.1789
target_share 27.8669 1.0949 25.45 <0.0001
12.13 Impact of Outliers
As with correlation, multiple regression can be strongly impacted by outliers.
12.13.1 Robust Regression
To address outliers, there are various approaches to robust regression. One approach is to use an MM-type estimator, such as is used in the lmrob() and glmrob() functions of the robustbase package (Maechler et al., 2024; Todorov & Filzmoser, 2009).
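A minimal sketch, assuming the same predictors as the models above (the model object name and formula are illustrative, not the book’s exact model):

library(robustbase)

fit_robust <- lmrob(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal_subset)

summary(fit_robust)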
Now, we can visualize the interaction to understand it. We create an interaction plot (Figure 12.32) and Johnson-Neyman plot (Figure 12.33) using the interactions package (Long, 2024).
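A minimal sketch, assuming a fitted moderated regression model stored in an object named fit_interaction (a hypothetical name) that includes a height_centered × weight_centered interaction term:

library(interactions)

interact_plot(fit_interaction, pred = height_centered, modx = weight_centered) # interaction plot
johnson_neyman(fit_interaction, pred = height_centered, modx = weight_centered) # Johnson-Neyman plot
sim_slopes(fit_interaction, pred = height_centered, modx = weight_centered) # simple slopes analysis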
JOHNSON-NEYMAN INTERVAL
When weight_centered is INSIDE the interval [16.16, 136.16], the slope of
height_centered is p < .05.
Note: The range of observed values of weight_centered is [-47.48, 64.52]
Figure 12.33: Johnson-Neyman Plot from Moderated Multiple Regression.
SIMPLE SLOPES ANALYSIS
Slope of height_centered when weight_centered = -1.484626e+01 (- 1 SD):
Est. S.E. t val. p
------- ------ -------- ------
-0.00 0.03 -0.12 0.91
Slope of height_centered when weight_centered = -1.002064e-16 (Mean):
Est. S.E. t val. p
------- ------ -------- ------
-0.03 0.02 -1.26 0.21
Slope of height_centered when weight_centered = 1.484626e+01 (+ 1 SD):
Est. S.E. t val. p
------- ------ -------- ------
-0.05 0.03 -1.93 0.05
mediationModel <- '
  # direct effect (cPrime)
  Y ~ direct*X

  # mediator
  M ~ a*X
  Y ~ b*M

  # indirect effect = a*b
  indirect := a*b

  # total effect (c)
  total := abs(direct) + abs(indirect)
'
Let’s substitute in our predictor, outcome, and hypothesized mediator. In this case, we predict that receiving touchdowns partially accounts for the association between Wide Receivers’ target share and their fantasy points. This is a silly example because fantasy points are derived, in part, from touchdowns, so of course touchdowns will partially account for almost any effect on Wide Receivers’ fantasy points. This example is merely for demonstrating the process of developing and examining a mediation model.
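A minimal sketch of what the substituted model might look like, assuming the receiving touchdowns variable is named receiving_tds (a hypothetical name; the book’s variable names may differ):

mediationModel <- '
  # direct effect (cPrime)
  fantasyPoints ~ direct*target_share

  # mediator
  receiving_tds ~ a*target_share
  fantasyPoints ~ b*receiving_tds

  # indirect effect = a*b
  indirect := a*b

  # total effect (c)
  total := abs(direct) + abs(indirect)
'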
To get a robust estimate of the indirect effect, we obtain bootstrapped estimates from 1,000 bootstrap draws. Typically, we would obtain bootstrapped estimates from 10,000 bootstrap draws, but this example uses only 1,000 bootstrap draws for a shorter runtime.
Code
mediationFit <- lavaan::sem(
  mediationModel,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  se = "bootstrap",
  bootstrap = 1000, # generally use 10,000 bootstrap draws; this example uses 1,000 for speed
  parallel = "multicore", # parallelization for speed: use "multicore" for Mac/Linux; "snow" for PC
  iseed = 52242, # for reproducibility
  missing = "ML",
  estimator = "ML",
  fixed.x = FALSE)
multipleMediatorFit <- lavaan::sem(
  multipleMediatorModel,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  se = "bootstrap",
  bootstrap = 1000, # generally use 10,000 bootstrap draws; this example uses 1,000 for speed
  parallel = "multicore", # parallelization for speed: use "multicore" for Mac/Linux; "snow" for PC
  iseed = 52242, # for reproducibility
  missing = "ML",
  estimator = "ML",
  fixed.x = FALSE)
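The regression model from earlier can also be estimated in a Bayesian framework. The output below is from such a model; a minimal sketch of how it might be fit with the brms package (settings beyond those shown in the output, such as priors, are left at their defaults and are assumptions):

library(dplyr)
library(brms)

bayesFit <- brm(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  seed = 52242)

summary(bayesFit)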
Family: gaussian
Links: mu = identity; sigma = identity
Formula: fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1))
Data: player_stats_seasonal %>% filter(position %in% c(" (Number of observations: 4437)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 2.37 0.82 0.76 3.95 1.00 3779 3268
age 0.03 0.01 0.02 0.05 1.00 4499 2979
height -0.01 0.01 -0.04 0.02 1.00 3472 2597
weight 0.00 0.00 -0.00 0.01 1.00 3686 3511
Ilogtarget_shareP1 27.86 0.30 27.26 28.45 1.00 3745 2698
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.57 0.02 1.54 1.61 1.00 3590 2406
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
12.17 Conclusion
Multiple regression allows examining the association between multiple predictor variables and one outcome variable. Inclusion of multiple predictors in the model allows for potentially greater predictive accuracy and identification of the extent to which each variable uniquely contributes to the outcome variable. As with correlation, an association does not imply causation. However, identifying associations is important because associations are a necessary (but insufficient) condition for causality. When developing a multiple regression model, there are various assumptions that are important to evaluate. In addition, it is important to pay attention to potential multicollinearity—it may become difficult to detect a given predictor variable as statistically significant due to the greater uncertainty around the parameter estimates.
Ben-Shachar, M. S., Lüdecke, D., & Makowski, D. (2020). effectsize: Estimation of effect size indices and standardized parameters. Journal of Open Source Software, 5(56), 2815. https://doi.org/10.21105/joss.02815
Ben-Shachar, M. S., Makowski, D., Lüdecke, D., Patil, I., Wiernik, B. M., Thériault, R., & Waggoner, P. (2025). effectsize: Indices of effect size. https://doi.org/10.32614/CRAN.package.effectsize
Iacobucci, D., Schneider, M. J., Popovich, D. L., & Bakamitsos, G. A. (2016). Mean centering helps alleviate “micro” but not “macro” multicollinearity. Behavior Research Methods, 48(4), 1308–1317. https://doi.org/10.3758/s13428-015-0624-x
Lüdecke, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., & Makowski, D. (2021). performance: An R package for assessment, comparison and testing of statistical models. Journal of Open Source Software, 6(60), 3139. https://doi.org/10.21105/joss.03139
Lüdecke, D., Makowski, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., Wiernik, B. M., & Thériault, R. (2025). performance: Assessment of regression models performance. https://doi.org/10.32614/CRAN.package.performance
Petersen, I. T. (2024). Principles of psychological assessment: With applied examples in R. Chapman and Hall/CRC. https://doi.org/10.1201/9781003357421
Petersen, I. T. (2025). Principles of psychological assessment: With applied examples in R. University of Iowa Libraries. https://doi.org/10.25820/work.007199
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. https://doi.org/10.18637/jss.v048.i02
Todorov, V., & Filzmoser, P. (2009). An object-oriented framework for robust multivariate analysis. Journal of Statistical Software, 32(3), 1–47. https://doi.org/10.18637/jss.v032.i03
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67. https://doi.org/10.18637/jss.v045.i03