I need your help!

I want your feedback to make the book better for you and other readers. If you find typos, errors, or places where the text may be improved, please let me know. The best ways to provide feedback are by GitHub or hypothes.is annotations.

You can leave a comment at the bottom of the page/chapter, or open an issue or submit a pull request on GitHub: https://github.com/isaactpetersen/Fantasy-Football-Analytics-Textbook

Alternatively, you can leave an annotation using hypothes.is. To add an annotation, select some text and then click the annotation symbol on the pop-up menu. To see the annotations of others, click the annotations symbol in the upper right-hand corner of the page.

11  Multiple Regression

11.1 Getting Started

11.1.1 Load Packages

Code
library("petersenlab")
library("rms")
library("car")
library("caret")
library("lme4")
library("performance")
library("lavaan")
library("mice")
library("miceadds")
library("interactions")
library("brms")
library("robustbase")
library("tidyverse")
library("knitr")

11.1.2 Load Data

Code
load(file = "./data/player_stats_weekly.RData")
load(file = "./data/player_stats_seasonal.RData")

We created the player_stats_weekly.RData and player_stats_seasonal.RData objects in Section 4.4.3.

11.2 Overview of Multiple Regression

Multiple regression is an extension of correlation. Correlation examines the association between one predictor variable and one outcome variable. Multiple regression examines the association between multiple predictor variables and one outcome variable. By controlling for other variables (covariates), multiple regression allows a more accurate estimate of the unique contribution of a given predictor variable.

Regression with one predictor variable takes the form of Equation 11.1:

\[ y = \beta_0 + \beta_1x_1 + \epsilon \tag{11.1}\]

where \(y\) is the outcome variable, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, \(x_1\) is the predictor variable, and \(\epsilon\) is the error term.

A regression line is depicted in Figure 11.1.

Figure 11.1: A Regression Best-Fit Line.

Regression with multiple predictors—i.e., multiple regression—takes the form of Equation 11.2:

\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p + \epsilon \tag{11.2}\]

where \(p\) is the number of predictor variables. Multiple regression is essentially a weighted sum of the predictor variables plus an intercept. Under the hood, multiple regression seeks to identify the best weight for each predictor.
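
To make Equation 11.2 concrete, here is a minimal sketch (using simulated data and hypothetical variable names) of how a multiple regression maps onto R's lm() syntax:

Code
set.seed(52242)

sampleSize <- 500
x1 <- rnorm(sampleSize)
x2 <- rnorm(sampleSize)
y <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(sampleSize) # true weights: 1.5 and -0.8

exampleModel <- lm(y ~ x1 + x2)
coef(exampleModel) # estimated intercept and weight for each predictor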

11.3 Components

  • \(B\) = unstandardized coefficient: direction and magnitude of the estimate (original scale)
  • \(\beta\) (beta) = standardized coefficient: direction and magnitude of the estimate (standard deviation scale)
  • \(SE\) = standard error: uncertainty of unstandardized estimate

The unstandardized regression coefficient (\(B\)) is interpreted such that, for every unit change in the predictor variable, there is a __ unit change in the outcome variable. For instance, when examining the association between age and fantasy points, if the unstandardized regression coefficient is 2.3, players score on average 2.3 more points for each additional year of age. (In reality, we might expect a nonlinear, inverted-U-shaped association between age and fantasy points such that players tend to reach their peak in the middle of their careers.) Unstandardized regression coefficients are tied to the metric of the raw data. Thus, a large unstandardized regression coefficient for two variables may mean completely different things. Holding the strength of the association constant, you tend to see larger unstandardized regression coefficients for variables with smaller units and smaller unstandardized regression coefficients for variables with larger units.

Standardized regression coefficients can be obtained by standardizing the variables to z-scores so they all have a mean of zero and a standard deviation of one. The standardized regression coefficient (\(\beta\)) is interpreted such that, for every standard deviation change in the predictor variable, there is a __ standard deviation change in the outcome variable. For instance, when examining the association between age and fantasy points, if the standardized regression coefficient is 0.1, players score on average 0.1 standard deviations more points for each additional standard deviation of age. Standardized regression coefficients tend (though not in all instances) to fall within [−1, 1]. Thus, standardized regression coefficients tend to be more comparable across variables and models than unstandardized regression coefficients. In this way, standardized regression coefficients provide a meaningful index of effect size.
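
As a brief sketch (with simulated data), a standardized coefficient can be obtained either by rescaling the unstandardized coefficient (\(\beta = B \cdot \frac{SD_x}{SD_y}\)) or by refitting the model with z-scored variables using the scale() function:

Code
set.seed(52242)

sampleSize <- 500
x <- rnorm(sampleSize, mean = 50, sd = 10) # predictor on an arbitrary raw metric
y <- 3 + 0.4 * x + rnorm(sampleSize, sd = 5)

unstandardizedModel <- lm(y ~ x)
B <- coef(unstandardizedModel)[["x"]] # unstandardized coefficient

B * sd(x) / sd(y) # rescaled to the standardized metric

coef(lm(scale(y) ~ scale(x)))[["scale(x)"]] # same value from z-scored variables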

The standard error of a regression coefficient represents the uncertainty of the parameter. If we have less uncertainty (i.e., more confidence) about the parameter, the standard error will be small. If we have more uncertainty (i.e., less confidence) about the parameter, the standard error will be large. If we used the same sampling procedure repeatedly and calculated the regression coefficient each time, the true parameter in the population would fall 68% of the time within the interval of: \([\text{model parameter estimate for the regression coefficient} \pm 1 \text{ standard error}]\) .

A confidence interval represents a range of plausible values such that, with repeated sampling, the true value falls within a given interval with some confidence. Our parameter estimate for the regression coefficient, plus or minus 1 standard error, reflects the 68% confidence interval for the coefficient. The 95% confidence interval is computed as the parameter estimate plus or minus 1.96 standard errors. For instance, if the parameter estimate for the regression coefficient is 0.50, and the standard error is 0.10, the 95% confidence interval is [0.30, 0.70]: \(0.5 - (1.96 \times 0.10) = 0.3\); \(0.5 + (1.96 \times 0.10) = 0.7\). That is, if we used the same sampling procedure repeatedly, the true value of the regression coefficient would be expected to fall 95% of the time somewhere between 0.30 and 0.70.
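
As a minimal sketch (with simulated data), we can compute the 95% confidence interval by hand from the coefficient and its standard error, or use confint(), which uses the t distribution rather than the 1.96 normal approximation:

Code
set.seed(52242)

x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
ciModel <- lm(y ~ x)

estimate <- coef(summary(ciModel))["x", "Estimate"]
standardError <- coef(summary(ciModel))["x", "Std. Error"]

c(estimate - 1.96 * standardError, estimate + 1.96 * standardError) # 95% CI by hand

confint(ciModel, level = 0.95) # built-in (t-based) equivalent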

11.4 Assumptions of Multiple Regression

Linear regression models make the following assumptions:

  • there is a linear association between the predictor variable and the outcome variable
  • there is homoscedasticity of the residuals; the residuals do not differ as a function of the predictor variable or as a function of the outcome variable
  • the residuals are independent; they are uncorrelated with each other
  • the residuals are normally distributed

Homoscedasticity of the residuals means that the variance of the residuals does not differ as a function of the outcome variable or as a function of the predictor variable (i.e., the residuals show constant variance as a function of outcome/predictors). If the residuals differ as a function of the outcome or predictor variable, this is called heteroscedasticity.

Those are some of the key assumptions of multiple regression. However, there are additional assumptions of multiple regression, including ones discussed in the chapter on Causal Inference. For instance, the variables included should reflect a causal process such that the predictor variables influence the outcome variable, and not the other way around. That is, the outcome variable should not influence the predictor variables (i.e., there should be no reverse causation). In addition, it is important to control for any confound(s). If a confound is not controlled for, this is called omitted-variable bias, and it leads the researcher to incorrectly attribute the effects of the omitted variable to the included variables.

11.4.1 Evaluating and Addressing Assumptions of Multiple Regression

11.4.1.1 Linear Association

To evaluate the shape of the association between the predictor variables and the outcome variable, we can examine scatterplots (Figure 11.2), residual plots (Figure 11.19), marginal model plots (Figure 11.10), added-variable plots (Figure 11.11), and component-plus-residual plots (Figure 11.12). Residual plots depict the residuals (errors) on the y-axis as a function of the fitted values or a specific predictor on the x-axis. Marginal model plots are basically glorified scatterplots that depict the outcome variable (both in terms of observed values and model-fitted values) on the y-axis and the predictor variables on the x-axis. Added-variable plots depict the unique association of each predictor variable with the outcome variable when controlling for all the other predictor variables in the model. Component-plus-residual plots depict partial residuals on the y-axis as a function of each predictor variable on the x-axis, where a partial residual for a given predictor is the effect of that predictor (thus controlling for all the other predictor variables in the model) plus the residual from the full model.

Examples of linear and nonlinear associations are depicted with scatterplots in Figure 11.2.

Code
set.seed(52242)

sampleSize <- 1000
quadraticX <- runif(
  sampleSize,
  min = -4,
  max = 4)
linearY <- quadraticX + rnorm(sampleSize, mean = 0, sd = 0.5)
quadraticY <- quadraticX ^ 2 + rnorm(sampleSize)
quadraticData <- cbind(quadraticX, quadraticY) %>%
  data.frame %>%
  arrange(quadraticX)

quadraticModel <- lm(
  quadraticY ~ quadraticX + I(quadraticX ^ 2),
  data = quadraticData)

quadraticNewData <- data.frame(
  quadraticX = seq(
    from = min(quadraticData$quadraticX),
    to = max(quadraticData$quadraticX),
    length.out = sampleSize))

quadraticNewData$quadraticY <- predict(
  quadraticModel,
  newdata = quadraticNewData)

plot(
  x = quadraticX,
  y = linearY,
  xlab = "",
  ylab = "",
  main = "Linear Association")

abline(
  lm(linearY ~ quadraticX),
  lwd = 2,
  col = "blue")

plot(
  x = quadraticData$quadraticX,
  y = quadraticData$quadraticY,
  xlab = "",
  ylab = "",
  main = "Nonlinear Association")

lines(
  quadraticNewData$quadraticY ~ quadraticNewData$quadraticX,
  lwd = 2,
  col = "blue")
(a) Example of a Linear Association
(b) Example of a Nonlinear Association
Figure 11.2: Example Associations Depicted With Scatterplots.

If the shape of the association is nonlinear (as indicated by any of these plots), various approaches may be necessary, such as including nonlinear terms (e.g., polynomial terms such as quadratic, cubic, quartic, or higher-degree terms), transforming the predictors (e.g., log, square root, inverse, exponential, Box-Cox, or Yeo-Johnson transformations), using splines/piecewise regression, or using generalized additive models.

11.4.1.2 Homoscedasticity

To evaluate homoscedasticity, we can evaluate a residual plot (Figure 11.19) and spread-level plot (Figure 11.20). A residual plot depicts the residuals on the y-axis as a function of the model’s fitted values on the x-axis. Homoscedasticity in a residual plot is identified as a constant spread of residuals versus fitted values—the residuals do not show a fan, cone, or bow-tie shape; a fan, cone, or bow-tie shape indicates heteroscedasticity. In a residual plot, a fan or cone shape indicates increasing or decreasing variance in the residuals as a function of the fitted values; a bow-tie shape indicates that the residuals are smallest in the middle of the fitted values and greatest on the extremes of the fitted values. A spread-level plot depicts the log of the absolute value of studentized residuals on the y-axis as a function of the log of the model’s fitted values on the x-axis. Homoscedasticity in a spread-level plot is identified as a flat slope; a slope that differs from zero indicates heteroscedasticity.

Code
set.seed(52242)

# 1. Homoscedasticity
sampleSize <- 1000
x1 <- runif(sampleSize, 0, 10)
y1 <- 3 + 2 * x1 + rnorm(sampleSize, mean = 0, sd = 2)
fit1 <- lm(y1 ~ x1)
res1 <- resid(fit1)
fitted1 <- fitted(fit1)

# 2. Fan-shaped heteroscedasticity
x2 <- runif(sampleSize, 0, 10)
y2 <- 3 + 2 * x2 + rnorm(sampleSize, mean = 0, sd = 0.5 * x2) # increasing variance
fit2 <- lm(y2 ~ x2)
res2 <- resid(fit2)
fitted2 <- fitted(fit2)

# 3. Bow-tie-shaped heteroscedasticity
x3 <- runif(sampleSize, 0, 10)
sd3 <- abs(x3 - 5) + 0.5  # variance smallest in middle, higher on edges
y3 <- 3 + 2 * x3 + rnorm(sampleSize, mean = 0, sd = sd3)
fit3 <- lm(y3 ~ x3)
res3 <- resid(fit3)
fitted3 <- fitted(fit3)

plot(
  fitted1,
  res1,
  main = "Homoscedasticity",
  xlab = "Fitted value",
  ylab = "Residual")

abline(
  h = 0,
  lwd = 2,
  col = "blue")

plot(
  fitted2,
  res2,
  main = "Fan-Shaped Heteroscedasticity",
  xlab = "Fitted value",
  ylab = "Residual")

abline(
  h = 0,
  lwd = 2,
  col = "blue")

plot(
  fitted3,
  res3,
  main = "Bow-Tie-Shaped Heteroscedasticity",
  xlab = "Fitted value",
  ylab = "Residual")

abline(
  h = 0,
  lwd = 2,
  col = "blue")
(a) Homoscedasticity
(b) Fan-Shaped Heteroscedasticity
(c) Bow-Tie-Shaped Heteroscedasticity
Figure 11.3: Example of Homoscedasticity and Heteroscedasticity in Residual Plots.

If there is heteroscedasticity, it may be necessary to transform the outcome variable to be more normally distributed. The spread-level plot provides a suggested power transformation to transform the outcome variable so that the spread of residuals becomes more uniform across the fitted values.

11.4.1.3 Uncorrelated Residuals

To determine if residuals are correlated by a grouping level, we can examine the proportion of variance that is attributable to the grouping level using the intraclass correlation coefficient (ICC) from a mixed model. The greater the ICC value, the more variance is accounted for by the grouping level, and the more the residuals are intercorrelated. If the residuals are intercorrelated, it may be necessary to account for the grouping structure of the data using a mixed model.
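
Here is a minimal sketch of this approach using simulated data with a hypothetical grouping factor; we apply the same approach to real player data later in this chapter:

Code
set.seed(52242)

nGroups <- 50
obsPerGroup <- 10
group <- factor(rep(1:nGroups, each = obsPerGroup))
groupEffect <- rnorm(nGroups, sd = 2)[group] # variance attributable to the grouping level
y <- 5 + groupEffect + rnorm(nGroups * obsPerGroup) # residual variance

iccModel <- lmer(y ~ 1 + (1 | group))
performance::icc(iccModel) # proportion of variance attributable to the grouping level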

11.4.1.4 Normally Distributed Residuals

To examine whether residuals are normally distributed, we can examine quantile–quantile (QQ) plots and probability–probability (PP) plots. QQ plots are particularly useful for identifying deviations from normality in the tails of the distribution; PP plots are particularly useful for identifying deviations from normality in the center of the distribution. Researchers tend to be more concerned about the tails of the distribution, because extreme values tend to have a greater impact on inferences. Various examples of QQ plots and deviations from normality are depicted in Figure 11.4. If the residuals are normally distributed, they will stay close to the diagonal reference line of the QQ and PP plots.

Code
set.seed(52242)

sampleSize <- 10000

distribution_normal <- rnorm(
  sampleSize,
  mean = 0,
  sd = 1)

distribution_bimodal <- c(
  rnorm(
    sampleSize/2,
    mean = -2,
    sd = 1),
  rnorm(
    sampleSize/2,
    mean = 2,
    sd = 1))

distribution_negativeSkew <- -rlnorm(
  sampleSize,
  meanlog = 0,
  sdlog = 1)

distribution_positiveSkew <- rlnorm(
  sampleSize,
  meanlog = 0,
  sdlog = 1)

distribution_lightTailed <- runif(
  sampleSize,
  min = -2,
  max = 2)

distribution_heavyTailed <- rt(
  sampleSize,
  df = 2)

hist(
  distribution_normal,
  main = "Normal Distribution",
  col = "#0099F8")

car::qqPlot(
  distribution_normal,
  main = "QQ Plot",
  id = FALSE)

hist(
  distribution_bimodal,
  main = "Bimodal Distribution",
  col = "#0099F8")

car::qqPlot(
  distribution_bimodal,
  main = "QQ Plot",
  id = FALSE)

hist(
  distribution_negativeSkew,
  main = "Negatively Skewed Distribution",
  col = "#0099F8")

car::qqPlot(
  distribution_negativeSkew,
  main = "QQ Plot",
  id = FALSE)

hist(
  distribution_positiveSkew,
  main = "Positively Skewed Distribution",
  col = "#0099F8")

car::qqPlot(
  distribution_positiveSkew,
  main = "QQ Plot",
  id = FALSE)

hist(
  distribution_lightTailed,
  main = "Platykurtic (Light Tailed) Distribution",
  col = "#0099F8")

car::qqPlot(
  distribution_lightTailed,
  main = "QQ Plot",
  id = FALSE)

hist(
  distribution_heavyTailed,
  main = "Leptokurtic (Heavy Tailed) Distribution",
  col = "#0099F8")

car::qqPlot(
  distribution_heavyTailed,
  main = "QQ Plot",
  id = FALSE)
(a) Normal Distribution
(b) Normal Distribution
(c) Bimodal Distribution
(d) Bimodal Distribution
(e) Negatively Skewed Distribution
(f) Negatively Skewed Distribution
(g) Positively Skewed Distribution
(h) Positively Skewed Distribution
(i) Platykurtic (Light-Tailed) Distribution
(j) Platykurtic (Light-Tailed) Distribution
(k) Leptokurtic (Heavy-Tailed) Distribution
(l) Leptokurtic (Heavy-Tailed) Distribution
Figure 11.4: Quantile–Quantile (QQ) Plots of Various Distributions (Right Side) with the Histogram of the Associated Distribution (Left Side).

If the residuals are not normally distributed (i.e., they do not stay close to the diagonal reference line of the QQ and PP plots), it may be necessary to transform the outcome variable to be more normally distributed or to use a generalized linear model (GLM) that more closely matches the distribution of the outcome variable (e.g., Poisson, binomial, gamma).
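
For instance, here is a sketch of fitting a generalized linear model with glm() to a simulated count outcome using the Poisson family; binomial and gamma outcomes follow the same pattern with a different family argument:

Code
set.seed(52242)

x <- rnorm(300)
counts <- rpois(300, lambda = exp(0.5 + 0.8 * x)) # count outcome

poissonModel <- glm(
  counts ~ x,
  family = poisson(link = "log"))

summary(poissonModel)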

11.5 How Much Variance the Model Explains

When estimating a multiple regression model, it can be useful to evaluate how much variance in the outcome variable the predictor variable(s) explain (i.e., account for). If the variables collectively explain only a small amount of variance, it suggests that the predictor variables have a small effect size and that other predictor variables will be necessary to account for the majority of variability in the outcome variable. There are two primary indices of how much variance in the outcome variable is explained by the predictor variable(s): the coefficient of determination (\(R^2\)) and adjusted \(R^2\) (\(R^2_{adj}\)), described below.

11.5.1 Coefficient of Determination (\(R^2\))

As noted in Section 9.6.6, the coefficient of determination (\(R^2\)) reflects the proportion of variance in the outcome (dependent) variable that is explained by the model predictions, as in Equation 9.25: \(R^2 = \frac{\text{variance explained in }Y}{\text{total variance in }Y}\). Various formulas for \(R^2\) are in Equation 9.19. Larger \(R^2\) values indicate greater accuracy. Multiple regression can be conceptualized with overlapping circles (similar to a Venn diagram), where the non-overlapping portions of the circles reflect nonshared variance and the overlapping portions of the circles reflect shared variance, as in Figure 11.5.

Figure 11.5: Conceptual Depiction of Proportion of Variance Explained (\(R^2\)) in an Outcome Variable (\(Y\)) by Multiple Predictors (\(X1\) and \(X2\)) in Multiple Regression. The size of each circle represents the variable’s variance. The proportion of variance in \(Y\) that is explained by the predictors is depicted by the areas in orange. The dark orange space (\(G\)) is where multiple predictors explain overlapping variance in the outcome. Overlapping variance that is explained in the outcome (\(G\)) will not be recovered in the regression coefficients when both predictors are included in the regression model. From Petersen (2024) and Petersen (2025).

One issue with \(R^2\) is that it increases as the number of predictors increases, which can lead to overfitting if using \(R^2\) as an index to compare models for purposes of selecting the “best-fitting” model. Consider the following example (adapted from Petersen (2025)) in which you have one predictor variable and one outcome variable, as shown in Table 11.1.

Table 11.1: Example Data of Predictor (x1) and Outcome (y) Used for Regression Model.
y x1
7 1
13 2
29 7
10 2

Using the data, the best fitting regression model is: \(y =\) 3.98 \(+\) 3.59 \(\cdot x_1\). In this example, the \(R^2\) is 0.98. The equation is not a perfect prediction, but with a single predictor variable, it captures the majority of the variance in the outcome.

Now consider the following example where you add a second predictor variable to the data above, as shown in Table 11.2.

Table 11.2: Example Data of Predictors (x1 and x2) and Outcome (y) Used for Regression Model.
y x1 x2
7 1 3
13 2 5
29 7 1
10 2 2

With the second predictor variable, the best fitting regression model is: \(y =\) 0.00 + 4.00 \(\cdot x_1 +\) 1.00 \(\cdot x_2\). In this example, the \(R^2\) is 1.00. The equation with the second predictor variable provides a perfect prediction of the outcome.
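
We can verify both toy models in R using the data from Tables 11.1 and 11.2:

Code
toyData <- data.frame(
  y = c(7, 13, 29, 10),
  x1 = c(1, 2, 7, 2),
  x2 = c(3, 5, 1, 2))

summary(lm(y ~ x1, data = toyData))$r.squared # ~.98 with one predictor
summary(lm(y ~ x1 + x2, data = toyData))$r.squared # 1.00: perfect prediction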

Providing perfect prediction with the right set of predictor variables is the dream of multiple regression. So, using multiple regression, we often add predictor variables to incrementally improve prediction. The amount of variance expected to be accounted for by random chance alone follows Equation 11.3:

\[ E(R^2) = \frac{K}{n-1} \tag{11.3}\]

where \(E(R^2)\) is the expected value of \(R^2\) (the proportion of variance explained), \(K\) is the number of predictor variables, and \(n\) is the sample size. The formula demonstrates that the more predictor variables in the regression model, the more variance will be accounted for by chance. With many predictor variables and a small sample, you can account for a large share of the variance merely by chance.

As an example, consider that we have 13 predictor variables to predict fantasy performance for 43 players. Assume that, with 13 predictor variables, we explain 38% of the variance (\(R^2 = .38; r = .62\)). We explained a lot of the variance in the outcome, but it is important to consider how much variance could have been explained by random chance: \(E(R^2) = \frac{K}{n-1} = \frac{13}{43 - 1} = .31\). We expect to explain 31% of the variance, by chance, in the outcome. So, 82% of the variance explained was likely spurious (i.e., \(\frac{.31}{.38} = .82\)). As the sample size increases, the spuriousness decreases. To account for the number of predictor variables in the model, we can use a modified version of \(R^2\) called adjusted \(R^2\) (\(R^2_{adj}\)), described next.
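
Here is the chance-expected \(R^2\) computation from this example (Equation 11.3) in R:

Code
K <- 13 # number of predictor variables
n <- 43 # sample size

expectedR2 <- K / (n - 1)
expectedR2 # .31: proportion of variance expected to be explained by chance alone

round(expectedR2, 2) / .38 # .82: share of the explained variance that was likely spurious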

11.5.2 Adjusted \(R^2\) (\(R^2_{adj}\))

Adjusted \(R^2\) (\(R^2_{adj}\)) accounts for the number of predictor variables in the model, based on how much variance would be expected to be accounted for by chance, in order to penalize overfitting. It reflects the proportion of variance in the outcome (dependent) variable that is explained by the model predictions over and above what would be expected to be accounted for by chance, given the number of predictor variables in the model. The formula for adjusted \(R^2\) (\(R^2_{adj}\)) is in Equation 11.4:

\[ R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1} \tag{11.4}\]

where \(p\) is the number of predictor variables in the model, and \(n\) is the sample size.
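
As a quick check (with simulated data), we can compute adjusted \(R^2\) by hand using Equation 11.4 and compare it to the value that lm() reports:

Code
adjustedR2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

set.seed(52242)
exampleData <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100))
exampleData$y <- exampleData$x1 + rnorm(100)

adjR2Model <- lm(y ~ x1 + x2, data = exampleData)

adjustedR2(summary(adjR2Model)$r.squared, n = 100, p = 2)
summary(adjR2Model)$adj.r.squared # same value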

11.6 Overfitting

Statistical models applied to big data (e.g., data with many predictor variables) can overfit the data, meaning that the statistical model accounts for error variance that will not generalize to future samples. So, even though an overfitting statistical model appears to be accurate because it accounts for more variance, it is not actually that accurate—it will predict new data less accurately than it accounts for the data with which the model was built. This is especially relevant for fantasy football analytics because there are hundreds if not thousands of variables we could consider for inclusion, and many, many players when considering historical data.

Consider an example where you develop an algorithm to predict players’ fantasy performance based on 2024 data using hundreds of predictor variables. To some extent, these predictor variables will likely account for true variance (i.e., signal) and error variance (i.e., noise). If we were to apply the same algorithm based on the 2024 prediction model to 2025 data, the prediction model would likely predict less accurately than with 2024 data. The regression coefficients tend to become weaker when applied to new data, a phenomenon called shrinkage, which is described in Section 15.7.1. The regression coefficients in the [FILL IN]
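
As a simulated demonstration of overfitting and shrinkage, consider a small sample with many noise predictors: the in-sample \(R^2\) is sizeable even though there is no true signal, and accuracy collapses in new data:

Code
set.seed(52242)

sampleSize <- 50
numPredictors <- 20
trainData <- data.frame(matrix(rnorm(sampleSize * numPredictors), ncol = numPredictors))
trainData$y <- rnorm(sampleSize) # outcome is pure noise
testData <- data.frame(matrix(rnorm(sampleSize * numPredictors), ncol = numPredictors))
testData$y <- rnorm(sampleSize)

noiseModel <- lm(y ~ ., data = trainData)
summary(noiseModel)$r.squared # sizeable in-sample, consistent with E(R^2) = 20/49

predictedY <- predict(noiseModel, newdata = testData)
cor(predictedY, testData$y)^2 # near zero in new data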

In Figure 11.6, the blue line represents the true distribution of the data, and the red line is an overfitting model:

Code
set.seed(52242)

sampleSize <- 200
quadraticX <- rnorm(sampleSize)
quadraticY <- quadraticX ^ 2 + rnorm(sampleSize)
quadraticData <- cbind(quadraticX, quadraticY) %>%
  data.frame %>%
  arrange(quadraticX)

quadraticModel <- lm(
  quadraticY ~ quadraticX + I(quadraticX ^ 2),
  data = quadraticData)

quadraticNewData <- data.frame(
  quadraticX = seq(
    from = min(quadraticData$quadraticX),
    to = max(quadraticData$quadraticX),
    length.out = sampleSize))

quadraticNewData$quadraticY <- predict(
  quadraticModel,
  newdata = quadraticNewData)

loessFit <- loess(
  quadraticY ~ quadraticX,
  data = quadraticData,
  span = 0.01,
  degree = 1)

quadraticNewData$loessY <- predict(
  loessFit,
  newdata = quadraticNewData)

plot(
  x = quadraticData$quadraticX,
  y = quadraticData$quadraticY,
  xlab = "",
  ylab = "")

lines(
  quadraticNewData$quadraticY ~ quadraticNewData$quadraticX,
  lwd = 2,
  col = "blue")

lines(
  quadraticNewData$loessY ~ quadraticNewData$quadraticX,
  lwd = 2,
  col = "red")
Figure 11.6: Over-fitting Model in Red Relative to the True Distribution of the Data in Blue. From Petersen (2024) and Petersen (2025).

11.7 Covariates

Covariates are variables that you include in the statistical model to try to control for them so you can better isolate the unique contribution of the predictor variable(s) in relation to the outcome variable. Using covariates examines the association between the predictor variable and the outcome variable when holding people's levels on the covariates constant. Including confounds as covariates allows potentially gaining a more accurate estimate of the causal effect of the predictor variable on the outcome variable. Ideally, you want to include any and all confounds as covariates. As described in Section 8.4.2.1, confounds are third variables that influence both the predictor variable and the outcome variable and explain their association. Covariates are potentially (but not necessarily) confounds. For instance, you might include the player's age as a covariate in a model that examines whether a player's 40-yard dash time at the NFL Combine predicts their fantasy points in their rookie year, even though age may not be a confound.

11.8 Example: Predicting Wide Receivers’ Fantasy Points

Let’s say we want to use a number of variables to predict a wide receiver’s fantasy performance. We want to consider several predictors, including the player’s age, height, weight, and target share. We have only a few predictors and our sample size is large enough such that overfitting is not likely a concern.

On a weekly basis, target share is computed as the number of targets a player receives divided by the team's total number of targets. It would be nice to compute it this way for seasonal data, as well; however, for calculating seasonal target share, the nflfastR (Carl & Baldwin, 2024) package currently sums up the player's weekly target share. Thus, seasonal target share (unlike weekly target share)—as it is currently calculated—does not range from 0–1. For example, if a player has a weekly target share of 15%, that would be a seasonal (cross-week-summed) target share of 2.55 (\(0.15 \times 17 \; \text{games} = 2.55\)).
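
If we wanted an approximate per-game target share from the seasonal data, we could divide the cross-week-summed value by the number of games played. Here is a minimal sketch, assuming a column (called games here) that counts the player's games in that season:

Code
player_stats_seasonal %>%
  mutate(target_share_perGame = target_share / games) %>% # games is an assumed column name
  dplyr::select(target_share, target_share_perGame)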

11.8.1 Examine Descriptive Statistics

Let’s first examine descriptive statistics of the predictor and outcome variables.

Code
player_stats_seasonal %>% 
  dplyr::select(fantasyPoints, age, height, weight, target_share) %>% 
  dplyr::summarise(across(
      everything(),
      .fns = list(
        n = ~ length(na.omit(.)),
        missingness = ~ mean(is.na(.)) * 100,
        M = ~ mean(., na.rm = TRUE),
        SD = ~ sd(., na.rm = TRUE),
        min = ~ min(., na.rm = TRUE),
        max = ~ max(., na.rm = TRUE),
        range = ~ max(., na.rm = TRUE) - min(., na.rm = TRUE),
        IQR = ~ IQR(., na.rm = TRUE),
        MAD = ~ mad(., na.rm = TRUE),
        median = ~ median(., na.rm = TRUE),
        pseudomedian = ~ DescTools::HodgesLehmann(., na.rm = TRUE),
        mode = ~ petersenlab::Mode(., multipleModes = "mean"),
        skewness = ~ psych::skew(., na.rm = TRUE),
        kurtosis = ~ psych::kurtosi(., na.rm = TRUE)),
      .names = "{.col}.{.fn}")) %>%
    tidyr::pivot_longer(
      cols = everything(),
      names_to = c("variable","index"),
      names_sep = "\\.") %>% 
    tidyr::pivot_wider(
      names_from = index,
      values_from = value)

Let’s also examine the distributions of the variables using a density plot, as depicted in Figure 11.7.

Code
ggplot2::ggplot(
  data = player_stats_seasonal %>%
    filter(position_group %in% c("WR")),
  mapping = aes(
    x = fantasyPoints)
) +
  geom_histogram(
    aes(y = after_stat(density)),
    color = "#000000",
    fill = "#0099F8"
  ) +
  geom_density(
    color = "#000000",
    fill = "#F85700",
    alpha = 0.6 # add transparency
  ) +
  geom_rug() +
  labs(
    x = "Fantasy Points",
    y = "Density",
    title = "Density Plot of Fantasy Points with Histogram and Rug Plot"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title

ggplot2::ggplot(
  data = player_stats_seasonal %>%
    filter(position_group %in% c("WR")),
  mapping = aes(
    x = age)
) +
  geom_histogram(
    aes(y = after_stat(density)),
    color = "#000000",
    fill = "#0099F8"
  ) +
  geom_density(
    color = "#000000",
    fill = "#F85700",
    alpha = 0.6 # add transparency
  ) +
  geom_rug() +
  labs(
    x = "Age (years)",
    y = "Density",
    title = "Density Plot of Player Age with Histogram and Rug Plot"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title

ggplot2::ggplot(
  data = player_stats_seasonal %>%
    filter(position_group %in% c("WR")),
  mapping = aes(
    x = height)
) +
  geom_histogram(
    aes(y = after_stat(density)),
    color = "#000000",
    fill = "#0099F8"
  ) +
  geom_density(
    color = "#000000",
    fill = "#F85700",
    alpha = 0.6 # add transparency
  ) +
  geom_rug() +
  labs(
    x = "Height (inches)",
    y = "Density",
    title = "Density Plot of Player Height with Histogram and Rug Plot"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title

ggplot2::ggplot(
  data = player_stats_seasonal %>%
    filter(position_group %in% c("WR")),
  mapping = aes(
    x = weight)
) +
  geom_histogram(
    aes(y = after_stat(density)),
    color = "#000000",
    fill = "#0099F8"
  ) +
  geom_density(
    color = "#000000",
    fill = "#F85700",
    alpha = 0.6 # add transparency
  ) +
  geom_rug() +
  labs(
    x = "Weight (pounds)",
    y = "Density",
    title = "Density Plot of Player Weight with Histogram and Rug Plot"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title

ggplot2::ggplot(
  data = player_stats_seasonal %>%
    filter(position_group %in% c("WR")),
  mapping = aes(
    x = target_share)
) +
  geom_histogram(
    aes(y = after_stat(density)),
    color = "#000000",
    fill = "#0099F8"
  ) +
  geom_density(
    color = "#000000",
    fill = "#F85700",
    alpha = 0.6 # add transparency
  ) +
  geom_rug() +
  labs(
    x = "Target Share",
    y = "Density",
    title = "Density Plot of Target Share with Histogram and Rug Plot"
  ) +
  theme_classic() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
(a) Fantasy Points
(b) Age
(c) Height
(d) Weight
(e) Target Share
Figure 11.7: Density Plot of Model Variables (i.e., Predictor and Outcome Variables).

11.8.2 Examine Bivariate Associations

Then, let's examine the bivariate association of each predictor with the outcome using scatterplots to evaluate for any potential nonlinearity, as depicted in Figure 11.8.

Code
ggplot2::ggplot(
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  aes(
    x = age,
    y = fantasyPoints)) +
  geom_point(alpha = 0.05) +
  geom_smooth(
    method = "lm",
    color = "black") +
  geom_smooth() +
  coord_cartesian(
    ylim = c(0,NA),
    expand = FALSE) +
  labs(
    x = "Player Age (Years)",
    y = "Fantasy Points (Season)",
    title = "Fantasy Points (Season) by Player Age",
    subtitle = "(Among Wide Receivers)"
  ) +
  theme_classic()

ggplot2::ggplot(
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  aes(
    x = height,
    y = fantasyPoints)) +
  geom_point(alpha = 0.05) +
  geom_smooth(
    method = "lm",
    color = "black") +
  geom_smooth() +
  coord_cartesian(
    ylim = c(0,NA),
    expand = FALSE) +
  labs(
    x = "Player Height (Inches)",
    y = "Fantasy Points (Season)",
    title = "Fantasy Points (Season) by Player Height",
    subtitle = "(Among Wide Receivers)"
  ) +
  theme_classic()

ggplot2::ggplot(
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  aes(
    x = weight,
    y = fantasyPoints)) +
  geom_point(alpha = 0.05) +
  geom_smooth(
    method = "lm",
    color = "black") +
  geom_smooth() +
  coord_cartesian(
    ylim = c(0,NA),
    expand = FALSE) +
  labs(
    x = "Player Weight (Pounds)",
    y = "Fantasy Points (Season)",
    title = "Fantasy Points (Season) by Player Weight",
    subtitle = "(Among Wide Receivers)"
  ) +
  theme_classic()

ggplot2::ggplot(
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  aes(
    x = target_share,
    y = fantasyPoints)) +
  geom_point(alpha = 0.05) +
  geom_smooth(
    method = "lm",
    color = "black") +
  geom_smooth() +
  coord_cartesian(
    ylim = c(0,NA),
    expand = FALSE) +
  labs(
    x = "Target Share",
    y = "Fantasy Points (Season)",
    title = "Fantasy Points (Season) by Target Share",
    subtitle = "(Among Wide Receivers)"
  ) +
  theme_classic()
(a) Age
(b) Height
(c) Weight
(d) Target Share
Figure 11.8: Scatterplots With Fantasy Points (Season) Among Wide Receivers. The linear best-fit line is in black. The nonlinear best-fit line is in blue.

There are some suggestions of potential nonlinearity, such as an inverted-U-shaped association between height and fantasy points, suggesting that there may be an optimal range for height among Wide Receivers—being too short or too tall could be a disadvantage. In addition, the association between target share and fantasy points weakens as target share increases. Thus, after evaluating the linear association between the predictors and outcome, we will also examine the possibility of curvilinear associations.

11.8.3 Estimate Multiple Regression Model

Now that we have examined descriptive statistics and bivariate associations, let’s first estimate a multiple regression model with only linear terms:

Code
linearRegressionModel <- lm(
  fantasyPoints ~ age + height + weight + target_share,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

The model formula is in Equation 11.5:

\[ \text{fantasy points} = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{height} + \beta_3 \cdot \text{weight} + \beta_4 \cdot \text{target share} + \epsilon \tag{11.5}\]

Here are the model results:

Code
summary(linearRegressionModel)

Call:
lm(formula = fantasyPoints ~ age + height + weight + target_share, 
    data = player_stats_seasonal %>% filter(position %in% c("WR")), 
    na.action = "na.exclude")

Residuals:
    Min      1Q  Median      3Q     Max 
-325.39  -12.92   -5.35   10.21  234.15 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.45896   19.11171   0.547  0.58423    
age           0.22273    0.18068   1.233  0.21772    
height       -0.58258    0.33739  -1.727  0.08429 .  
weight        0.17335    0.05336   3.249  0.00117 ** 
target_share 50.70291    0.37231 136.185  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 35.82 on 4246 degrees of freedom
  (1138 observations deleted due to missingness)
Multiple R-squared:  0.8194,    Adjusted R-squared:  0.8193 
F-statistic:  4817 on 4 and 4246 DF,  p-value: < 2.2e-16

The only terms that are significantly associated with fantasy performance among Wide Receivers are weight and target share, both of which showed a positive association with fantasy points.

We can obtain the coefficient of determination (\(R^2\)) and adjusted \(R^2\) (\(R^2_{adj}\)) using the following code:

Code
summary(linearRegressionModel)$r.squared
[1] 0.8194318
Code
summary(linearRegressionModel)$adj.r.squared
[1] 0.8192617

The model explained 82% of the variability in fantasy points (i.e., \(R^2 = .82\)).

The model formula with substituted values is in Equation 11.6:

\[ \begin{aligned} \text{fantasy points} = &\beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{height} + \beta_3 \cdot \text{weight} + \beta_4 \cdot \text{target share} + \epsilon \\ = &10.46 + 0.22 \cdot \text{age} + -0.58 \cdot \text{height} + 0.17 \cdot \text{weight} + 50.70 \cdot \text{target share} + \epsilon \end{aligned} \tag{11.6}\]

If we want to obtain standardized regression coefficients, we can standardize the outcome variable and each predictor variable using the scale() function:

Code
linearRegressionModelStandardized <- lm(
  scale(fantasyPoints) ~ scale(age) + scale(height) + scale(weight) + scale(target_share),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

summary(linearRegressionModelStandardized)

Call:
lm(formula = scale(fantasyPoints) ~ scale(age) + scale(height) + 
    scale(weight) + scale(target_share), data = player_stats_seasonal %>% 
    filter(position %in% c("WR")), na.action = "na.exclude")

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8685 -0.1537 -0.0636  0.1214  2.7838 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -0.003032   0.006537  -0.464  0.64280    
scale(age)           0.008348   0.006771   1.233  0.21772    
scale(height)       -0.016268   0.009421  -1.727  0.08429 .  
scale(weight)        0.030597   0.009419   3.249  0.00117 ** 
scale(target_share)  0.903075   0.006631 136.185  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4259 on 4246 degrees of freedom
  (1138 observations deleted due to missingness)
Multiple R-squared:  0.8194,    Adjusted R-squared:  0.8193 
F-statistic:  4817 on 4 and 4246 DF,  p-value: < 2.2e-16

Target share has a large effect size. All of the other predictors have small effect sizes.

Now let’s consider whether any of the terms show curvilinear associations with fantasy points by adding quadratic terms:

Code
quadraticTermsRegressionModel <- lm(
  fantasyPoints ~ age + I(age^2) + height + I(height^2) + weight + I(weight^2) + target_share + I(target_share^2),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

The model formula is in Equation 11.7:

\[ \begin{aligned} \text{fantasy points} \; = \; &\beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{age}^2 + \beta_3 \cdot \text{height} + \beta_4 \cdot \text{height}^2 + \\ &\beta_5 \cdot \text{weight} + \beta_6 \cdot \text{weight}^2 + \beta_7 \cdot \text{target share} + \beta_8 \cdot \text{target share}^2 + \epsilon \end{aligned} \tag{11.7}\]

Here are the model results:

Code
summary(quadraticTermsRegressionModel)

Call:
lm(formula = fantasyPoints ~ age + I(age^2) + height + I(height^2) + 
    weight + I(weight^2) + target_share + I(target_share^2), 
    data = player_stats_seasonal %>% filter(position %in% c("WR")), 
    na.action = "na.exclude")

Residuals:
     Min       1Q   Median       3Q      Max 
-180.329  -14.089    0.151    9.499  245.425 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        7.894e+01  4.082e+02   0.193    0.847    
age                1.914e+00  2.199e+00   0.870    0.384    
I(age^2)          -3.362e-02  3.972e-02  -0.846    0.397    
height            -4.994e-01  1.194e+01  -0.042    0.967    
I(height^2)        4.283e-05  8.239e-02   0.001    1.000    
weight            -8.862e-01  7.178e-01  -1.235    0.217    
I(weight^2)        2.613e-03  1.781e-03   1.467    0.143    
target_share       7.083e+01  8.773e-01  80.738   <2e-16 ***
I(target_share^2) -4.205e+00  1.681e-01 -25.020   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33.45 on 4242 degrees of freedom
  (1138 observations deleted due to missingness)
Multiple R-squared:  0.8427,    Adjusted R-squared:  0.8424 
F-statistic:  2841 on 8 and 4242 DF,  p-value: < 2.2e-16

Among the quadratic terms, only the one for target share shows a significant association. If we wanted to visualize the shape of the model-implied association between target share and fantasy points, we could generate the model-implied predictions using the data range that we want to visualize.

Code
newdata <- data.frame(
  target_share = seq(
    from = min(player_stats_seasonal$target_share[which(player_stats_seasonal$position == "WR")], na.rm = TRUE),
    to = max(player_stats_seasonal$target_share[which(player_stats_seasonal$position == "WR")], na.rm = TRUE),
    length.out = 1000
  )
)

newdata$age <- mean(player_stats_seasonal$age[which(player_stats_seasonal$position == "WR")], na.rm = TRUE)
newdata$height <- mean(player_stats_seasonal$height[which(player_stats_seasonal$position == "WR")], na.rm = TRUE)
newdata$weight <- mean(player_stats_seasonal$weight[which(player_stats_seasonal$position == "WR")], na.rm = TRUE)
Code
newdata$fantasyPoints <- predict(
  quadraticTermsRegressionModel,
  newdata = newdata
)

We can depict the model-implied predictions of fantasy points as a function of target share, as shown in Figure 11.9.

Code
ggplot2::ggplot(
  data = newdata,
  aes(
    x = target_share,
    y = fantasyPoints)) +
  geom_smooth() +
  coord_cartesian(
    ylim = c(0,NA),
    expand = FALSE) +
  labs(
    x = "Target Share",
    y = "Fantasy Points (Season)",
    title = "Fantasy Points (Season) by Target Share",
    subtitle = "(Among Wide Receivers)"
  ) +
  theme_classic()
Figure 11.9: Model-Implied Predictions of A Wide Receiver’s Fantasy Points as a Function of Target Share. The model-implied predictions were estimated based on a multiple regression model.

We could also generate the model-implied prediction of fantasy points for any value of the predictor variables. For instance, here are the number of fantasy points expected for a Wide Receiver who is 23 years old, is 6 feet tall (72 inches), weighs 200 pounds, and has a (cross-week-summed) target share of 3 (which reflects an approximate target share of 17.6% per game; i.e., \(0.176 = \frac{3}{17 \; \text{games}}\)):

Code
predict(
  quadraticTermsRegressionModel,
  newdata = data.frame(
    age = 23,
    height = 72,
    weight = 200,
    target_share = 3
  )
)
       1 
171.3862 

The player would be expected to score 171 fantasy points.

We can calculate this by substituting in the regression coefficients and predictor values. Here is the computation:

Code
age <- 23
height <- 72
weight <- 200
target_share <- 3

beta0 <- coef(quadraticTermsRegressionModel)[["(Intercept)"]]
beta1 <- coef(quadraticTermsRegressionModel)[["age"]]
beta2 <- coef(quadraticTermsRegressionModel)[["I(age^2)"]]
beta3 <- coef(quadraticTermsRegressionModel)[["height"]]
beta4 <- coef(quadraticTermsRegressionModel)[["I(height^2)"]]
beta5 <- coef(quadraticTermsRegressionModel)[["weight"]]
beta6 <- coef(quadraticTermsRegressionModel)[["I(weight^2)"]]
beta7 <- coef(quadraticTermsRegressionModel)[["target_share"]]
beta8 <- coef(quadraticTermsRegressionModel)[["I(target_share^2)"]]

predictedFantasyPoints <- beta0 + beta1*age + beta2*(age^2) + beta3*height + beta4*(height^2) + beta5*weight + beta6*(weight^2) + beta7*target_share + beta8*(target_share^2)
predictedFantasyPoints
[1] 171.3862

The model formula with substituted values is in Equation 11.8:

\[ \begin{aligned} \text{fantasy points} \; = \; &\beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{age}^2 + \beta_3 \cdot \text{height} + \beta_4 \cdot \text{height}^2 + \\ &\beta_5 \cdot \text{weight} + \beta_6 \cdot \text{weight}^2 + \beta_7 \cdot \text{target share} + \beta_8 \cdot \text{target share}^2 + \epsilon \\ = &78.94 + 1.91 \cdot 23 + -0.03 \cdot 23^2 + -0.50 \cdot 72 + 0.00 \cdot 72^2 + \\ &-0.89 \cdot 200 + 0.00 \cdot 200^2 + 70.83 \cdot 3 + -4.21 \cdot 3^2 + \epsilon \\ = &171.39 \end{aligned} \tag{11.8}\]

11.8.4 Evaluating and Addressing Assumptions

The assumptions for multiple regression are described in Section 11.4. I describe ways to evaluate assumptions in Section 11.4.1.

As a reminder, here are four assumptions:

  • there is a linear association between the predictor variables and the outcome variable
  • there is homoscedasticity of the residuals; the residuals do not differ as a function of the predictor variables or as a function of the outcome variable
  • the residuals are independent; they are uncorrelated with each other
  • the residuals are normally distributed

11.8.4.1 Linear Association

We evaluated the shape of the association between the predictor variables and the outcome variable using scatterplots. We accounted for potential curvilinearity in the associations with quadratic terms. Other ways to account for nonlinearity, in addition to polynomials, include transforming predictors, using splines/piecewise regression, and using generalized additive models, as sketched below.
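
For instance, here is a sketch of a spline-based alternative for the target share term, using a natural cubic spline from the splines package (the choice of df = 4 is an arbitrary value for illustration, not a recommendation):

Code
library("splines")

splineRegressionModel <- lm(
  fantasyPoints ~ age + height + weight + ns(target_share, df = 4),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

summary(splineRegressionModel)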

To evaluate for potential nonlinearity in the associations, we can also evaluate residual plots (Figure 11.19), marginal model plots (Figure 11.10), added-variable plots (Figure 11.11), and component-plus-residual plots (Figure 11.12) from the car package (Fox et al., 2024; Fox & Weisberg, 2019). For evaluating linearity, we would expect minimal bend/curvature in the lines.

Code
car::marginalModelPlots(
  quadraticTermsRegressionModel,
  sd = TRUE,
  id = TRUE)
Figure 11.10: Marginal Model Plots.
Code
car::avPlots(
  quadraticTermsRegressionModel,
  id = TRUE)
Figure 11.11: Added-Variable Plots.
Code
car::crPlots(
  quadraticTermsRegressionModel,
  sd = TRUE,
  id = TRUE)
Figure 11.12: Component-Plus-Residual Plots.

The marginal model plots (Figure 11.10), residual plots (Figure 11.19), and component-plus-residual plots (Figure 11.12) suggest that the nonlinearity of the association between target share and fantasy points may not be fully captured by the quadratic term. Thus, we may need to apply a different approach to handling the nonlinear association between target share and fantasy points.

One approach we can take is to transform the target_share variable to be more normally distributed.

The histogram for the raw target_share variable is in Figure 11.13.

Code
hist(
  player_stats_seasonal$target_share[which(player_stats_seasonal$position == "WR")],
  main = "Histogram of Target Share")
Figure 11.13: Histogram of Target Share.

The variable shows a strong positive skew. To address a strong positive skew, we can use a log transformation. The histogram of the log-transformed variable is in Figure 11.14.

Code
hist(
  log(player_stats_seasonal$target_share[which(player_stats_seasonal$position == "WR")] + 1),
  main = "Histogram of Target Share (Log Transformed)")
Figure 11.14: Histogram of Target Share, Transformed.

Now we can re-fit the model with the log-transformed variable.

Code
linearRegressionModel_logTargetShare <- lm(
  fantasyPoints ~ age + I(age^2) + height + I(height^2) + weight + I(weight^2) + I(log(target_share + 1)),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

summary(linearRegressionModel_logTargetShare)

Call:
lm(formula = fantasyPoints ~ age + I(age^2) + height + I(height^2) + 
    weight + I(weight^2) + I(log(target_share + 1)), data = player_stats_seasonal %>% 
    filter(position %in% c("WR")), na.action = "na.exclude")

Residuals:
     Min       1Q   Median       3Q      Max 
-161.438  -20.977    0.802   15.700  258.079 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)              -81.371684 442.140136  -0.184   0.8540    
age                        3.050092   2.382398   1.280   0.2005    
I(age^2)                  -0.052401   0.043019  -1.218   0.2233    
height                     5.283660  12.932511   0.409   0.6829    
I(height^2)               -0.040543   0.089233  -0.454   0.6496    
weight                    -1.676457   0.777014  -2.158   0.0310 *  
I(weight^2)                0.004671   0.001928   2.422   0.0155 *  
I(log(target_share + 1)) 133.712797   0.998403 133.927   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 36.23 on 4243 degrees of freedom
  (1138 observations deleted due to missingness)
Multiple R-squared:  0.8154,    Adjusted R-squared:  0.8151 
F-statistic:  2677 on 7 and 4243 DF,  p-value: < 2.2e-16

Target share shows a more linear association with fantasy points after log-transforming it (albeit still not perfect), as depicted in Figures 11.15, 11.16, 11.17, and 11.18.

Code
car::marginalModelPlots(
  linearRegressionModel_logTargetShare,
  sd = TRUE,
  id = TRUE)
Figure 11.15: Marginal Model Plots After Log Transformation of Target Share.
Code
car::avPlots(
  linearRegressionModel_logTargetShare,
  id = TRUE)
Figure 11.16: Added-Variable Plots After Log Transformation of Target Share.
Code
car::crPlots(
  linearRegressionModel_logTargetShare,
  id = TRUE)
Figure 11.17: Component-Plus-Residual Plots After Log Transformation of Target Share.

When creating a residual plot, the car package (Fox et al., 2024; Fox & Weisberg, 2019) also provides a test of the nonlinearity of each predictor.

Code
car::residualPlots(
  linearRegressionModel_logTargetShare,
  id = TRUE)
                         Test stat Pr(>|Test stat|)    
age                        -0.2757          0.78282    
I(age^2)                    1.7447          0.08111 .  
height                     -1.7267          0.08429 .  
I(height^2)                 0.4603          0.64534    
weight                      0.6394          0.52261    
I(weight^2)                -1.7398          0.08197 .  
I(log(target_share + 1))   24.6080          < 2e-16 ***
Tukey test                 24.8338          < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Figure 11.18: Residual Plots After Log Transformation of Target Share.

11.8.4.2 Homoscedasticity

To evaluate homoscedasticity, we can evaluate a residual plot (Figure 11.19) and spread-level plot (Figure 11.20) from the car package (Fox et al., 2024; Fox & Weisberg, 2019). In a residual plot, you want a constant spread of residuals versus fitted values—you do not want the residuals to show a fan or cone shape.

Code
car::residualPlots(
  quadraticTermsRegressionModel,
  id = TRUE)
                  Test stat Pr(>|Test stat|)    
age                 -0.3563           0.7216    
I(age^2)             1.4827           0.1382    
height              -1.3604           0.1738    
I(height^2)          0.6870           0.4921    
weight               0.5744           0.5658    
I(weight^2)         -1.5116           0.1307    
target_share         1.1683           0.2427    
I(target_share^2)   -5.7545        9.301e-09 ***
Tukey test           8.1631        3.266e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Figure 11.19: Residual Plots.

In a spread-level plot, you want a flat (zero) slope—you do not want a positive or negative slope.

Code
car::spreadLevelPlot(
  quadraticTermsRegressionModel,
  id = TRUE)

Suggested power transformation:  0.4071025 
Figure 11.20: Spread-Level Plot.

In this example, the residuals appear to increase as a function of the fitted values. To handle this, we may need to transform the outcome variable to be more normally distributed.

The histogram for raw fantasy points is in Figure 11.21.

Code
hist(
  player_stats_seasonal$fantasyPoints[which(player_stats_seasonal$position == "WR")],
  main = "Histogram of Fantasy Points")
Figure 11.21: Histogram of Fantasy Points (Among Wide Receivers).

We can apply a Yeo-Johnson transformation to the outcome variable to generate a more normally distributed variable. A Yeo-Johnson transformation estimates the optimal transformation to make the variable more normally distributed. Let’s use a Yeo-Johnson transformation of fantasy points:

Code
yjTransformed <- caret::preProcess(
  player_stats_seasonal["fantasyPoints"],
  method = c("YeoJohnson"))

yjTransformed
Created from 38458 samples and 1 variables

Pre-processing:
  - ignored (0)
  - Yeo-Johnson transformation (1)

Lambda estimates for Yeo-Johnson transformation:
0.18
Code
player_stats_seasonal$fantasyPoints_transformed <- predict(
  yjTransformed,
  newdata = player_stats_seasonal["fantasyPoints"])$fantasyPoints

The histogram of the transformed variable is in Figure 11.22.

Code
hist(
  player_stats_seasonal$fantasyPoints_transformed[which(player_stats_seasonal$position == "WR")],
  main = "Histogram of Fantasy Points (Transformed)")
Figure 11.22: Histogram of Fantasy Points (Among Wide Receivers), Transformed.

Now we can refit the model.

Code
linearRegressionModel_outcomeTransformed <- lm(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

summary(linearRegressionModel_outcomeTransformed)

Call:
lm(formula = fantasyPoints_transformed ~ age + height + weight + 
    I(log(target_share + 1)), data = player_stats_seasonal %>% 
    filter(position %in% c("WR")), na.action = "na.exclude")

Residuals:
    Min      1Q  Median      3Q     Max 
-7.8110 -0.6124  0.1003  0.7085  7.0980 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)               3.009835   0.563406   5.342 9.66e-08 ***
age                       0.009946   0.005331   1.866  0.06213 .  
height                   -0.025737   0.009950  -2.587  0.00972 ** 
weight                    0.004003   0.001573   2.544  0.01099 *  
I(log(target_share + 1))  4.493072   0.029044 154.698  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.056 on 4246 degrees of freedom
  (1138 observations deleted due to missingness)
Multiple R-squared:  0.8538,    Adjusted R-squared:  0.8536 
F-statistic:  6198 on 4 and 4246 DF,  p-value: < 2.2e-16

The residual plot is in Figure 11.23.

Code
car::residualPlots(
  linearRegressionModel_outcomeTransformed,
  id = TRUE)
                         Test stat Pr(>|Test stat|)    
age                         0.0894           0.9287    
height                      0.9811           0.3266    
weight                     -1.3368           0.1814    
I(log(target_share + 1))  -38.0852           <2e-16 ***
Tukey test                -38.2715           <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Figure 11.23: Residual Plots After Transformation of Fantasy Points.

The spread-level plot is in Figure 11.24.

Code
car::spreadLevelPlot(
  linearRegressionModel_outcomeTransformed,
  id = TRUE)

Suggested power transformation:  1.356186 
Figure 11.24: Spread-Level Plot After Transformation of Fantasy Points.

The residuals show more constant variance after transforming the outcome variable.

11.8.4.3 Uncorrelated Residuals

To determine if residuals are correlated given the nested structure of the data, we can examine the proportion of variance that is attributable to the particular player. To do this, we can estimate the intraclass correlation coefficient (ICC) from a mixed model using the performance package (Lüdecke et al., 2021, 2025).
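The ICC is the proportion of total variance that is attributable to between-player (cluster-level) variance:

\[ ICC = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}} \]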

Code
mixedModel <- lmer(
  fantasyPoints_transformed ~ 1 + (1 | player_id),
  data = player_stats_seasonal)

performance::icc(mixedModel)

The ICC indicates that over half of the variance is attributable to between-player variance, so it would be important to account for the player-specific variance using a mixed model. For simplicity, we focus on multiple regression models in this chapter; mixed models are described in Chapter 12.

11.8.4.4 Normally Distributed Residuals

We can examine whether the residuals are normally distributed using quantile–quantile (QQ) plots and probability–probability (PP) plots, as in Figures 11.25 and 11.26. If the residuals are normally distributed, the points should fall close to the diagonal reference line.

Code
car::qqPlot(
  linearRegressionModel_outcomeTransformed,
  main = "QQ Plot",
  id = TRUE)
[1] 3120 4445
Figure 11.25: QQ Plot of Residuals After Transformation of Fantasy Points.
Code
petersenlab::ppPlot(
  linearRegressionModel_outcomeTransformed)
Figure 11.26: PP Plot of Residuals After Transformation of Fantasy Points.

11.9 Multicollinearity

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated. When predictors are highly correlated, it becomes challenging to estimate the regression coefficient of each one accurately.

Multicollinearity in multiple regression is depicted conceptually in Figure 11.27.

Figure 11.27: Conceptual Depiction of Multicollinearity in Multiple Regression. From Petersen (2024) and Petersen (2025).

Consider the following example adapted from Petersen (2025) where you have two predictor variables and one outcome variable, as shown in Table 11.3.

Table 11.3: Example Data of Predictors (x1 and x2) and Outcome (y) Used for Regression Model.
 y    x1    x2
 9   2.0     4
11   3.0     6
17   4.0     8
 3   1.0     2
21   5.0    10
13   3.5     7

The second predictor variable is redundant—it is exactly twice the value of the first predictor variable; thus, the two predictor variables are perfectly correlated (i.e., \(r = 1.0\)). This means that many different prediction equations fit the data equally well—see Equation 11.9:

\[ \begin{aligned} 2x_2 &= y \\ 0x_1 + 2x_2 &= y \\ 4x_1 &= y \\ 4x_1 + 0x_2 &= y \\ 2x_1 + 1x_2 &= y \\ 5x_1 - 0.5x_2 &= y \\ ... &= y \end{aligned} \tag{11.9}\]

So, what are the correct regression coefficients? We cannot know, because each of these possibilities fits the data equally well. Thus, when estimating the regression model, we could obtain arbitrary estimates of the regression coefficients, each with an enormous standard error. In general, multicollinearity increases the uncertainty (i.e., standard errors and confidence intervals) around the parameter estimates. Any predictor variables that correlate above ~\(r = .30\) with each other could widen the confidence intervals of the regression coefficients. As the correlations among the predictor variables increase, the chance of obtaining arbitrary estimates increases—a phenomenon sometimes called “bouncing betas.” So, it is important to examine a correlation matrix of the predictor variables before putting them in the same regression model. You can also examine indices such as the variance inflation factor (VIF), where a value greater than 5 or 10 indicates problematic multicollinearity.
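To see the problem concretely, here is a minimal sketch using the data from Table 11.3. Because x2 is an exact linear function of x1, lm() cannot estimate separate coefficients and reports NA for the aliased predictor:

Code
exampleData <- data.frame(
  y  = c(9, 11, 17, 3, 21, 13),
  x1 = c(2, 3, 4, 1, 5, 3.5))
exampleData$x2 <- exampleData$x1 * 2 # x2 is exactly twice x1

cor(exampleData$x1, exampleData$x2) # perfectly correlated: r = 1.0

coef(lm(y ~ x1 + x2, data = exampleData)) # x2 coefficient is NA (dropped as aliased)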

Here are the VIFs from our earlier model:

Code
car::vif(linearRegressionModel_outcomeTransformed)
                     age                   height                   weight 
                1.020806                 2.109266                 2.121813 
I(log(target_share + 1)) 
                1.030619 

To address multicollinearity, you can drop a redundant predictor, or you can use principal component analysis or factor analysis to reduce the predictors to a smaller number of meaningful components. To obtain precise, confident estimates of the coefficients for individual predictors in a regression framework, you need a low level of intercorrelation among the predictors—unless you have a very large sample size. However, if you are merely interested in prediction—and are not interested in interpreting the regression coefficients of individual predictors—multicollinearity poses less of a problem. For instance, machine learning cares more about achieving the greatest possible predictive accuracy and less about explaining which predictors are causally related to the outcome, so multicollinearity is less of a concern for machine learning approaches.
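As a brief sketch of the principal-component idea—illustrative only, using height and weight, the most intercorrelated predictors in the model above—we could replace the two predictors with their first principal component:

Code
# Collapse two correlated predictors (height, weight) into one
# principal component ("size") before fitting the regression
wrData <- player_stats_seasonal %>% 
  filter(position %in% c("WR")) %>% 
  drop_na(fantasyPoints_transformed, age, height, weight, target_share)

sizePCA <- prcomp(wrData[, c("height", "weight")], scale. = TRUE)
wrData$size <- sizePCA$x[, 1] # scores on the first principal component

summary(lm(
  fantasyPoints_transformed ~ age + size + I(log(target_share + 1)),
  data = wrData))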

11.10 Handling of Missing Data

An important consideration in multiple regression is how missing data are handled. Multiple regression in R using the lm() function applies listwise deletion. Listwise deletion removes any row (in the data file) from analysis that has a missing value on the outcome variable or any of the predictor variables. Removing all rows from analysis that have any missingness in the model variables can be a problem because missingness is often not completely at random—missingness often occurs systematically (i.e., for a reason). For instance, participants may be less likely to have data for all variables if they are from a lower socioeconomic status background and do not have the time to participate in all study procedures. Thus, applying listwise deletion, we might systematically exclude participants from lower socioeconomic status backgrounds (or other groups), which could lead to less generalizable inferences.
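For example, the Wide Receiver models above reported 1,138 observations deleted due to missingness. A quick way to see how many rows lm() would retain under listwise deletion is to count complete cases among the model variables:

Code
# Count the rows lm() would retain under listwise deletion
modelVars <- player_stats_seasonal %>% 
  filter(position %in% c("WR")) %>% 
  select(fantasyPoints_transformed, age, height, weight, target_share)

nrow(modelVars)                # all Wide Receiver player-season rows
sum(complete.cases(modelVars)) # rows retained after listwise deletion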

It is thus important to consider approaches to handle missingness. Various approaches to handle missingness include pairwise deletion (aka available-case analysis), multiple imputation, and full information maximum likelihood (FIML).

11.10.1 Pairwise Deletion

We can estimate a regression model that uses pairwise deletion using the lavaan package (Rosseel, 2012; Rosseel et al., 2024).

Code
player_stats_seasonal$target_share_log <- log(player_stats_seasonal$target_share + 1)

modelData <- player_stats_seasonal %>% 
  filter(position %in% c("WR")) %>% 
  select(fantasyPoints_transformed, age, height, weight, target_share_log)

numObs <- sum(complete.cases(modelData))
varMeans <- colMeans(modelData, na.rm = TRUE)
varCovariances <- cov(modelData, use = "pairwise.complete.obs")

pairwiseRegression_syntax <- '
  fantasyPoints_transformed ~ age + height + weight + target_share_log
  fantasyPoints_transformed ~~ fantasyPoints_transformed
  fantasyPoints_transformed ~ 1
'

pairwiseRegression_fit <- lavaan::lavaan(
  pairwiseRegression_syntax,
  sample.mean = varMeans,
  sample.cov = varCovariances,
  sample.nobs = numObs
)

summary(
  pairwiseRegression_fit,
  standardized = TRUE,
  rsquare = TRUE)
lavaan 0.6-19 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         6

  Number of observations                          4251

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                              Estimate  Std.Err  z-value  P(>|z|)   Std.lv
  fantasyPoints_transformed ~                                             
    age                          0.040    0.005    7.879    0.000    0.040
    height                      -0.013    0.010   -1.363    0.173   -0.013
    weight                       0.004    0.002    2.862    0.004    0.004
    target_shar_lg               4.466    0.028  157.112    0.000    4.466
  Std.all
         
    0.046
   -0.011
    0.024
    0.918

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .fntsyPnts_trns    1.265    0.552    2.290    0.022    1.265    0.459

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .fntsyPnts_trns    1.069    0.023   46.103    0.000    1.069    0.141

R-Square:
                   Estimate
    fntsyPnts_trns    0.859

11.10.2 Multiple Imputation

We can multiply impute data using the mice package (van Buuren & Groothuis-Oudshoorn, 2011, 2024).

Code
numImputations <- 5

dataToImpute <- player_stats_seasonal %>% 
  filter(position %in% c("WR")) %>% 
  select(player_id, position, where(is.numeric)) %>% 
  select(
    player_id:games, carries:wopr, fantasy_points, fantasy_points_ppr,
    rush_40_yds, rec_40_yds, fumbles, two_pts, return_yds,
    rush_100_yds:draftround, height:target_share_log) %>% 
  select(-c(fantasy_points_ppr, ageCentered20, target_share_log)) # drop collinear variables

predictors <- c("targets","receiving_yards","receiving_air_yards","receiving_yards_after_catch","receiving_first_downs","racr")

dataToImpute$player_id_integer <- as.integer(as.factor(dataToImpute$player_id))

varsToImpute <- c("age","height","weight","target_share")
Y <- varsToImpute

Now, let’s specify the imputation method—we use the two-level predictive mean matching (2l.pmm) method from the miceadds package (Robitzsch et al., 2024) to account for the nonindependent data (owing to multiple seasons per player):

Code
meth <- make.method(dataToImpute)
meth[1:length(meth)] <- ""
meth[Y] <- "2l.pmm" # specify the imputation method here; this can differ by outcome variable

Now, let’s specify the predictor matrix. A predictor matrix is a matrix of values, where:

  • columns with non-zero values are predictors of the variable specified in the given row
  • the diagonal of the predictor matrix should be zero because a variable cannot predict itself

The values are:

  • NOT a predictor of the outcome: 0
  • cluster variable: -2
  • fixed effect of predictor: 1
  • fixed effect and random effect of predictor: 2
  • include cluster mean of predictor in addition to fixed effect of predictor: 3
  • include cluster mean of predictor in addition to fixed effect and random effect of predictor: 4
Code
pred <- make.predictorMatrix(dataToImpute)
pred[1:nrow(pred), 1:ncol(pred)] <- 0
pred[Y, "player_id_integer"] <- (-2) # cluster variable
pred[Y, predictors] <- 1 # fixed effect predictors
pred[Y, "age"] <- 2 # random effect predictor
pred[Y, Y] <- 1 # fixed effect predictor

diag(pred) <- 0

Now, let’s run the imputation:

Code
imp <- mice::mice(
  as.data.frame(dataToImpute),
  method = meth,
  predictorMatrix = pred,
  m = numImputations,
  maxit = 5, # generally use 100 maximum iterations; this example uses 5 for speed
  seed = 52242)

Below are some imputation diagnostics. Trace plots are in Figure 11.28.

Code
plot(imp, c("target_share"))
Figure 11.28: Trace plots from multiple imputation.

A density plot is in Figure 11.29.

Code
densityplot(imp, ~ target_share)
Figure 11.29: Density plot from multiple imputation.

The imputed data do not match the distribution of the observed data well. Thus, it may be necessary to select a different imputation method to achieve more accurate imputation.

Now, let’s do some post-processing:

Code
imp_long <- complete(
  imp,
  action = "long",
  include = TRUE)

imp_long$target_share_log <- log(imp_long$target_share + 1)

imp_long$fantasyPoints_transformed <- predict(
  yjTransformed,
  newdata = imp_long["fantasyPoints"])$fantasyPoints

imp_mids <- as.mids(imp_long)

Now let’s estimate multiple regression with the multiply imputed data:

Code
imp_regression <- with(
  imp_mids,
  lm(
    fantasyPoints_transformed ~ age + height + weight + target_share_log)
  )

summary(pool(imp_regression))

11.10.3 Full Information Maximum Likelihood

We can estimate a regression model that uses full information maximum likelihood using the lavaan package (Rosseel, 2012; Rosseel et al., 2024).

Code
fimlRegression_syntax <- '
  fantasyPoints_transformed ~ age + height + weight + target_share_log
  fantasyPoints_transformed ~~ fantasyPoints_transformed
  fantasyPoints_transformed ~ 1
'

fimlRegression_fit <- lavaan::sem(
  fimlRegression_syntax,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  missing = "ML",
  fixed.x = FALSE
)

summary(
  fimlRegression_fit,
  standardized = TRUE,
  rsquare = TRUE)
lavaan 0.6-19 ended normally after 51 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        20

  Number of observations                          5389
  Number of missing patterns                         2

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Observed
  Observed information based on                Hessian

Regressions:
                              Estimate  Std.Err  z-value  P(>|z|)   Std.lv
  fantasyPoints_transformed ~                                             
    age                          0.014    0.005    2.728    0.006    0.014
    height                      -0.024    0.010   -2.423    0.015   -0.024
    weight                       0.004    0.002    2.737    0.006    0.004
    target_shar_lg               4.482    0.028  158.556    0.000    4.482
  Std.all
         
    0.016
   -0.020
    0.023
    0.920

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  age ~~                                                                
    height           -0.297    0.101   -2.945    0.003   -0.297   -0.040
    weight           -0.155    0.637   -0.243    0.808   -0.155   -0.003
    target_shar_lg    0.284    0.025   11.294    0.000    0.284    0.160
  height ~~                                                             
    weight           24.979    0.584   42.755    0.000   24.979    0.716
    target_shar_lg    0.118    0.018    6.370    0.000    0.118    0.089
  weight ~~                                                             
    target_shar_lg    1.035    0.117    8.825    0.000    1.035    0.123

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .fntsyPnts_trns    2.709    0.553    4.894    0.000    2.709    0.983
    age              26.254    0.043  611.426    0.000   26.254    8.329
    height           72.606    0.032 2269.509    0.000   72.606   30.916
    weight          200.480    0.202  991.397    0.000  200.480   13.505
    target_shar_lg    0.745    0.008   94.885    0.000    0.745    1.318

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .fntsyPnts_trns    1.112    0.024   46.316    0.000    1.112    0.146
    age               9.936    0.191   51.909    0.000    9.936    1.000
    height            5.516    0.106   51.909    0.000    5.516    1.000
    weight          220.371    4.245   51.909    0.000  220.371    1.000
    target_shar_lg    0.320    0.006   50.129    0.000    0.320    1.000

R-Square:
                   Estimate
    fntsyPnts_trns    0.854

11.11 Addressing Non-Independence of Observations

Please note that the \(p\)-values for regression coefficients assume that the observations are independent—in particular, that the residuals are uncorrelated. However, the observations in the player_stats_seasonal dataframe used above are not independent, because the same player has multiple rows—one row for each season they played. This non-independence violates the assumptions underlying the significance tests of the regression coefficients. We could address this by analyzing only one season from each player or by estimating the significance of the regression coefficients using cluster-robust standard errors. For simplicity, the models above use the whole dataframe. In Chapter 12, we discuss mixed model approaches that handle repeated measures and other data that violate the assumption of independence. Below, we demonstrate how to account for non-independence of observations using cluster-robust standard errors with the rms package (Harrell, Jr., 2025).

Code
player_stats_seasonal_subset <- player_stats_seasonal %>% 
  filter(!is.na(player_id)) %>% 
  filter(position %in% c("WR"))

rms::robcov(rms::ols(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal_subset,
  x = TRUE,
  y = TRUE),
  cluster = player_stats_seasonal_subset$player_id) #account for nested data within player
Frequencies of Missing Values Due to Each Variable
fantasyPoints_transformed                       age                    height 
                        0                         0                         0 
                   weight              target_share 
                        0                      1138 

Linear Regression Model

rms::ols(formula = fantasyPoints_transformed ~ age + height + 
    weight + I(log(target_share + 1)), data = player_stats_seasonal_subset, 
    x = TRUE, y = TRUE)

                                                      Model Likelihood    Discrimination    
                                                            Ratio Test           Indexes    
 Obs                                        4251    LR chi2    8173.35    R2       0.854    
 sigma                                    1.0563    d.f.             4    R2 adj   0.854    
 d.f.                                       4246    Pr(> chi2)  0.0000    g        2.918    
 Cluster on player_stats_seasonal_subset$player_id                                          
 Clusters                                   1260                                            

Residuals

    Min      1Q  Median      3Q     Max 
-7.8110 -0.6124  0.1003  0.7085  7.0980 

             Coef    S.E.   t      Pr(>|t|)
Intercept     3.0098 0.6671   4.51 <0.0001 
age           0.0099 0.0058   1.73 0.0842  
height       -0.0257 0.0117  -2.20 0.0275  
weight        0.0040 0.0018   2.25 0.0247  
target_share  4.4931 0.0389 115.53 <0.0001 
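A similar result can be obtained for a plain lm() fit using cluster-robust covariance estimation. Below is a hedged sketch that assumes the sandwich and lmtest packages are installed (they are not loaded above):

Code
# Cluster-robust standard errors for an lm() fit, clustering on player_id
# (a sketch assuming the sandwich and lmtest packages)
wrComplete <- player_stats_seasonal_subset %>% 
  drop_na(fantasyPoints_transformed, age, height, weight, target_share)

fit <- lm(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = wrComplete)

lmtest::coeftest(
  fit,
  vcov. = sandwich::vcovCL(fit, cluster = wrComplete$player_id))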

11.12 Impact of Outliers

As with correlation, multiple regression can be strongly impacted by outliers.

11.12.1 Robust Regression

To address outliers, there are various approaches to robust regression. One approach is to use an MM-type estimator, such as is used in the lmrob() and glmrob() functions of the robustbase package (Maechler et al., 2024; Todorov & Filzmoser, 2009).

Code
robustRegression <- lmrob(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal %>% filter(position %in% c("WR"))
)

summary(robustRegression)

Call:
lmrob(formula = fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 
    1)), data = player_stats_seasonal %>% filter(position %in% c("WR")))
 \--> method = "MM"
Residuals:
     Min       1Q   Median       3Q      Max 
-7.87540 -0.66317  0.04762  0.66103  7.04723 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)               3.442784   0.583086   5.904 3.82e-09 ***
age                       0.008755   0.005064   1.729  0.08392 .  
height                   -0.028607   0.010445  -2.739  0.00619 ** 
weight                    0.003316   0.001642   2.019  0.04352 *  
I(log(target_share + 1))  4.481790   0.035380 126.675  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Robust residual standard error: 0.9524 
  (1138 observations deleted due to missingness)
Multiple R-squared:  0.8617,    Adjusted R-squared:  0.8616 
Convergence in 13 IRWLS iterations

Robustness weights: 
 7 observations c(1665,2461,2641,2979,3107,3513,3884)
     are outliers with |weight| = 0 ( < 2.4e-05); 
 335 weights are ~= 1. The remaining 3909 ones are summarized as
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01056 0.86100 0.94810 0.89460 0.98490 0.99900 
Algorithmic parameters: 
       tuning.chi                bb        tuning.psi        refine.tol 
        1.548e+00         5.000e-01         4.685e+00         1.000e-07 
          rel.tol         scale.tol         solve.tol          zero.tol 
        1.000e-07         1.000e-10         1.000e-07         1.000e-10 
      eps.outlier             eps.x warn.limit.reject warn.limit.meanrw 
        2.352e-05         4.820e-10         5.000e-01         5.000e-01 
     nResample         max.it         groups        n.group       best.r.s 
           500             50              5            400              2 
      k.fast.s          k.max    maxit.scale      trace.lev            mts 
             1            200            200              0           1000 
    compute.rd fast.s.large.n 
             0           2000 
                  psi           subsampling                   cov 
           "bisquare"         "nonsingular"         ".vcov.avar1" 
compute.outlier.stats 
                 "SM" 
seed : int(0) 

11.13 Moderated Multiple Regression

When examining moderation in multiple regression, several steps are important:

First, we mean-center the predictors. Centering makes the lower-order coefficients interpretable—each reflects the effect of one predictor when the other is at its mean—and reduces nonessential collinearity between the predictors and their interaction term. In this case, we center the predictors around the mean height and weight of Wide Receivers:

Code
player_stats_seasonal$height_centered <- player_stats_seasonal$height - mean(player_stats_seasonal$height[which(player_stats_seasonal$position == "WR")], na.rm = TRUE)
player_stats_seasonal$weight_centered <- player_stats_seasonal$weight - mean(player_stats_seasonal$weight[which(player_stats_seasonal$position == "WR")], na.rm = TRUE)

Then, we compute the interaction term as the multiplication of the two centered predictors:

Code
player_stats_seasonal$heightXweight <- player_stats_seasonal$height_centered * player_stats_seasonal$weight_centered

Then, we fit the moderated multiple regression model:

Code
moderationModel <- lm(
  fantasyPoints_transformed ~ height_centered + weight_centered + height_centered:weight_centered,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

summary(moderationModel)

Call:
lm(formula = fantasyPoints_transformed ~ height_centered + weight_centered + 
    height_centered:weight_centered, data = player_stats_seasonal %>% 
    filter(position %in% c("WR")), na.action = "na.exclude")

Residuals:
     Min       1Q   Median       3Q      Max 
-11.1065  -2.0822   0.3452   2.2866   5.5100 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      5.5968021  0.0443550 126.182  < 2e-16 ***
height_centered                 -0.0287872  0.0228774  -1.258   0.2083    
weight_centered                  0.0257918  0.0036066   7.151 9.75e-13 ***
height_centered:weight_centered -0.0017426  0.0009637  -1.808   0.0706 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.735 on 5385 degrees of freedom
Multiple R-squared:  0.0156,    Adjusted R-squared:  0.01505 
F-statistic: 28.45 on 3 and 5385 DF,  p-value: < 2.2e-16

This model is equivalent to the model that includes the interaction term explicitly:

Code
moderationModel <- lm(
  fantasyPoints_transformed ~ height_centered + weight_centered + heightXweight,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude"
)

summary(moderationModel)

Now, we can visualize the interaction to understand it. We create an interaction plot (Figure 11.30) and Johnson-Neyman plot (Figure 11.31) using the interactions package (Long, 2024).

Code
interactions::interact_plot(
  moderationModel,
  pred = height_centered,
  modx = weight_centered)
Figure 11.30: Interaction Plot from Moderated Multiple Regression.
Code
interactions::johnson_neyman(
  moderationModel,
  pred = height_centered,
  modx = weight_centered,
  alpha = .05)
JOHNSON-NEYMAN INTERVAL

When weight_centered is INSIDE the interval [16.17, 137.37], the slope of
height_centered is p < .05.

Note: The range of observed values of weight_centered is [-47.48, 64.52]
Figure 11.31: Johnson-Neyman Plot from Moderated Multiple Regression.

Here is a simple slopes analysis:

Code
interactions::sim_slopes(
  moderationModel,
  pred = height_centered,
  modx = weight_centered,
  johnson_neyman = FALSE)
SIMPLE SLOPES ANALYSIS

Slope of height_centered when weight_centered = -1.484626e+01 (- 1 SD): 

   Est.   S.E.   t val.      p
------- ------ -------- ------
  -0.00   0.03    -0.11   0.91

Slope of height_centered when weight_centered = -1.002064e-16 (Mean): 

   Est.   S.E.   t val.      p
------- ------ -------- ------
  -0.03   0.02    -1.26   0.21

Slope of height_centered when weight_centered =  1.484626e+01 (+ 1 SD): 

   Est.   S.E.   t val.      p
------- ------ -------- ------
  -0.05   0.03    -1.93   0.05

11.14 Mediation

A mediation model takes the following general form in the lavaan package (Rosseel, 2012; Rosseel et al., 2024).

Code
mediationModel <- '
  # direct effect (cPrime)
  Y ~ direct*X
  
  # mediator
  M ~ a*X
  Y ~ b*M
  
  # indirect effect = a*b
  indirect := a*b
  
  # total effect (c)
  total := abs(direct) + abs(indirect)
'

Let’s substitute in our predictor, outcome, and hypothesized mediator. In this case, we predict that receiving touchdowns partially account for the association between Wide Receivers’ target share and their fantasy points. This is a silly example because fantasy points are derived, in part, from touchdowns, so of course touchdowns will partially account for almost any effect on Wide Receivers’ fantasy points. The example merely demonstrates the process of developing and examining a mediation model.

Code
mediationModel <- '
  # direct effect (cPrime)
  fantasyPoints_transformed ~ direct*target_share
  
  # mediator
  receiving_tds ~ a*target_share
  fantasyPoints_transformed ~ b*receiving_tds
  
  # indirect effect = a*b
  indirect := a*b
  
  # total effect (c)
  total := abs(direct) + abs(indirect)
'

To get a robust estimate of the indirect effect, we obtain bootstrapped estimates from 1,000 bootstrap draws. Typically, we would obtain bootstrapped estimates from 10,000 bootstrap draws, but this example uses only 1,000 bootstrap draws for a shorter runtime.

Code
mediationFit <- lavaan::sem(
  mediationModel,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  se = "bootstrap",
  bootstrap = 1000, # generally use 10,000 bootstrap draws; this example uses 1,000 for speed
  parallel = "multicore", # parallelization for speed: use "multicore" for Mac/Linux; "snow" for PC
  iseed = 52242, # for reproducibility
  missing = "ML",
  estimator = "ML",
  fixed.x = FALSE)

Here are the model results:

Code
summary(
  mediationFit,
  fit.measures = TRUE,
  standardized = TRUE,
  rsquare = TRUE)
lavaan 0.6-19 ended normally after 15 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         9

  Number of observations                          5389
  Number of missing patterns                         2

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Model Test Baseline Model:

  Test statistic                             10480.736
  Degrees of freedom                                 3
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       1.000
                                                      
  Robust Comparative Fit Index (CFI)             1.000
  Robust Tucker-Lewis Index (TLI)                1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)             -29079.024
  Loglikelihood unrestricted model (H1)     -29079.024
                                                      
  Akaike (AIC)                               58176.048
  Bayesian (BIC)                             58235.377
  Sample-size adjusted Bayesian (SABIC)      58206.778

Root Mean Square Error of Approximation:

  RMSEA                                          0.000
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.000
  P-value H_0: RMSEA <= 0.050                       NA
  P-value H_0: RMSEA >= 0.080                       NA
                                                      
  Robust RMSEA                                   0.000
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.000
  P-value H_0: Robust RMSEA <= 0.050                NA
  P-value H_0: Robust RMSEA >= 0.080                NA

Standardized Root Mean Square Residual:

  SRMR                                           0.000

Parameter Estimates:

  Standard errors                            Bootstrap
  Number of requested bootstrap draws             1000
  Number of successful bootstrap draws            1000

Regressions:
                              Estimate  Std.Err  z-value  P(>|z|)   Std.lv
  fantasyPoints_transformed ~                                             
    trgt_sh (drct)               1.125    0.033   33.884    0.000    1.125
  receiving_tds ~                                                         
    trgt_sh    (a)               1.497    0.031   48.902    0.000    1.497
  fantasyPoints_transformed ~                                             
    rcvng_t    (b)               0.280    0.014   20.187    0.000    0.280
  Std.all
         
    0.612
         
    0.763
         
    0.299

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .fntsyPnts_trns    3.253    0.035   93.550    0.000    3.253    1.180
   .receiving_tds     0.029    0.034    0.865    0.387    0.029    0.010
    target_share      1.485    0.021   71.159    0.000    1.485    0.990

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .fntsyPnts_trns    1.951    0.050   39.344    0.000    1.951    0.257
   .receiving_tds     3.617    0.141   25.573    0.000    3.617    0.418
    target_share      2.248    0.061   36.638    0.000    2.248    1.000

R-Square:
                   Estimate
    fntsyPnts_trns    0.743
    receiving_tds     0.582

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    indirect          0.419    0.015   27.571    0.000    0.419    0.228
    total             1.544    0.025   61.644    0.000    1.544    0.840
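Note that the estimated indirect effect equals the product of the \(a\) and \(b\) paths: \(1.497 \times 0.280 = 0.419\).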

We can also estimate a model with multiple hypothesized mediators:

Code
multipleMediatorModel <- '
  # direct effect (cPrime)
  fantasyPoints_transformed ~ direct*target_share
  
  # mediator
  receiving_tds ~ a1*target_share
  receiving_yards ~ a2*target_share
  
  fantasyPoints_transformed ~ b1*receiving_tds + b2*receiving_yards
  
  # indirect effect = a*b
  indirect1 := a1*b1
  indirect2 := a2*b2
  indirectTotal := indirect1 + indirect2
  
  # total effect (c)
  total := abs(direct) + abs(indirectTotal)
'
Code
multipleMediatorFit <- lavaan::sem(
  multipleMediatorModel,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  se = "bootstrap",
  bootstrap = 1000, # generally use 10,000 bootstrap draws; this example uses 1,000 for speed
  parallel = "multicore", # parallelization for speed: use "multicore" for Mac/Linux; "snow" for PC
  iseed = 52242, # for reproducibility
  missing = "ML",
  estimator = "ML",
  fixed.x = FALSE)

Here are the model results:

Code
summary(
  multipleMediatorFit,
  fit.measures = TRUE,
  standardized = TRUE,
  rsquare = TRUE)
lavaan 0.6-19 ended normally after 18 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        13

  Number of observations                          5389
  Number of missing patterns                         2

Model Test User Model:
                                                      
  Test statistic                              2117.502
  Degrees of freedom                                 1
  P-value (Chi-square)                           0.000

Model Test Baseline Model:

  Test statistic                             22710.101
  Degrees of freedom                                 6
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.907
  Tucker-Lewis Index (TLI)                       0.441
                                                      
  Robust Comparative Fit Index (CFI)             0.901
  Robust Tucker-Lewis Index (TLI)                0.405

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)             -63833.389
  Loglikelihood unrestricted model (H1)     -62774.638
                                                      
  Akaike (AIC)                              127692.778
  Bayesian (BIC)                            127778.476
  Sample-size adjusted Bayesian (SABIC)     127737.166

Root Mean Square Error of Approximation:

  RMSEA                                          0.627
  90 Percent confidence interval - lower         0.604
  90 Percent confidence interval - upper         0.649
  P-value H_0: RMSEA <= 0.050                    0.000
  P-value H_0: RMSEA >= 0.080                    1.000
                                                      
  Robust RMSEA                                   0.673
  90 Percent confidence interval - lower         0.646
  90 Percent confidence interval - upper         0.701
  P-value H_0: Robust RMSEA <= 0.050             0.000
  P-value H_0: Robust RMSEA >= 0.080             1.000

Standardized Root Mean Square Residual:

  SRMR                                           0.046

Parameter Estimates:

  Standard errors                            Bootstrap
  Number of requested bootstrap draws             1000
  Number of successful bootstrap draws            1000

Regressions:
                              Estimate   Std.Err  z-value  P(>|z|)   Std.lv 
  fantasyPoints_transformed ~                                               
    trgt_sh (drct)                0.418    0.035   12.097    0.000     0.418
  receiving_tds ~                                                           
    trgt_sh   (a1)                1.529    0.030   51.056    0.000     1.529
  receiving_yards ~                                                         
    trgt_sh   (a2)              235.236    3.487   67.470    0.000   235.236
  fantasyPoints_transformed ~                                               
    rcvng_t   (b1)                0.034    0.011    3.077    0.002     0.034
    rcvng_y   (b2)                0.005    0.000   30.555    0.000     0.005
  Std.all
         
    0.230
         
    0.786
         
    0.909
         
    0.036
    0.645

Intercepts:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
   .fntsyPnts_trns     3.156    0.030  105.208    0.000     3.156    1.149
   .receiving_tds     -0.020    0.032   -0.604    0.546    -0.020   -0.007
   .receiving_yrds    25.834    3.912    6.603    0.000    25.834    0.066
    target_share       1.486    0.021   71.601    0.000     1.486    0.983

Variances:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
   .fntsyPnts_trns     1.612    0.035   45.735    0.000     1.612    0.214
   .receiving_tds      3.313    0.125   26.443    0.000     3.313    0.383
   .receiving_yrds 26421.734 1279.929   20.643    0.000 26421.734    0.173
    target_share       2.284    0.063   36.402    0.000     2.284    1.000

R-Square:
                   Estimate 
    fntsyPnts_trns     0.786
    receiving_tds      0.617
    receiving_yrds     0.827

Defined Parameters:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
    indirect1          0.052    0.016    3.147    0.002     0.052    0.028
    indirect2          1.066    0.027   39.897    0.000     1.066    0.587
    indirectTotal      1.118    0.020   56.501    0.000     1.118    0.615
    total              1.536    0.025   61.189    0.000     1.536    0.845

11.15 Bayesian Multiple Regression
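We can estimate a Bayesian multiple regression model using the brms package, which samples from the posterior distribution using Stan’s no-U-turn sampler (NUTS):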

Code
bayesianMultipleRegressionModel <- brm(
  formula = fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  family = gaussian()
)
Code
summary(bayesianMultipleRegressionModel)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)) 
   Data: player_stats_seasonal %>% filter(position %in% c(" (Number of observations: 4251) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept              3.01      0.56     1.94     4.12 1.00     4002     3148
age                    0.01      0.01    -0.00     0.02 1.00     4461     2685
height                -0.03      0.01    -0.05    -0.01 1.00     4191     2572
weight                 0.00      0.00     0.00     0.01 1.00     4564     3299
Ilogtarget_shareP1     4.49      0.03     4.44     4.55 1.00     3936     2351

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     1.06      0.01     1.04     1.08 1.00     4819     2988

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
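By default, brm() places flat (improper) priors on the regression coefficients. As a hedged sketch—the prior scales below are illustrative, not recommendations—weakly informative priors can be supplied via set_prior():

Code
bayesianModelWithPriors <- brm(
  formula = fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  family = gaussian(),
  prior = c(
    set_prior("normal(0, 5)", class = "b"),          # weakly informative slopes
    set_prior("student_t(3, 0, 5)", class = "sigma") # half-t scale for sigma
  ),
  seed = 52242) # for reproducibility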

11.16 Conclusion

Multiple regression allows examining the association between multiple predictor variables and one outcome variable. Including multiple predictors in the model allows for potentially greater predictive accuracy and identifies the extent to which each variable uniquely contributes to the outcome variable. As with correlation, an association does not imply causation. However, identifying associations is important because associations are a necessary (but insufficient) condition for causality. When developing a multiple regression model, there are various assumptions that are important to evaluate. In addition, it is important to watch for potential multicollinearity—the greater uncertainty around the parameter estimates may make it difficult to detect a given predictor variable as statistically significant.

11.17 Session Info

Code
sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] knitr_1.50          lubridate_1.9.4     forcats_1.0.0      
 [4] stringr_1.5.1       dplyr_1.1.4         purrr_1.0.4        
 [7] readr_2.1.5         tidyr_1.3.1         tibble_3.2.1       
[10] tidyverse_2.0.0     robustbase_0.99-4-1 brms_2.22.0        
[13] Rcpp_1.0.14         interactions_1.2.0  miceadds_3.17-44   
[16] mice_3.17.0         lavaan_0.6-19       performance_0.13.0 
[19] lme4_1.1-37         Matrix_1.7-3        caret_7.0-1        
[22] lattice_0.22-6      ggplot2_3.5.2       car_3.1-3          
[25] carData_3.0-5       rms_8.0-0           Hmisc_5.2-3        
[28] petersenlab_1.1.3  

loaded via a namespace (and not attached):
  [1] splines_4.5.0        polspline_1.1.25     cellranger_1.1.0    
  [4] hardhat_1.4.1        pROC_1.18.5          rpart_4.1.24        
  [7] lifecycle_1.0.4      Rdpack_2.6.4         StanHeaders_2.32.10 
 [10] processx_3.8.6       globals_0.17.0       MASS_7.3-65         
 [13] insight_1.2.0        backports_1.5.0      magrittr_2.0.3      
 [16] rmarkdown_2.29       yaml_2.3.10          pkgbuild_1.4.7      
 [19] gld_2.6.7            DBI_1.2.3            minqa_1.2.8         
 [22] RColorBrewer_1.1-3   multcomp_1.4-28      abind_1.4-8         
 [25] expm_1.0-0           quadprog_1.5-8       nnet_7.3-20         
 [28] TH.data_1.1-3        tensorA_0.36.2.1     sandwich_3.1-1      
 [31] jtools_2.3.0         ipred_0.9-15         lava_1.8.1          
 [34] inline_0.3.21        listenv_0.9.1        MatrixModels_0.5-4  
 [37] bridgesampling_1.1-2 parallelly_1.43.0    codetools_0.2-20    
 [40] tidyselect_1.2.1     shape_1.4.6.1        bayesplot_1.12.0    
 [43] farver_2.1.2         broom.mixed_0.2.9.6  matrixStats_1.5.0   
 [46] stats4_4.5.0         base64enc_0.1-3      jsonlite_2.0.0      
 [49] e1071_1.7-16         mitml_0.4-5          Formula_1.2-5       
 [52] survival_3.8-3       iterators_1.0.14     emmeans_1.11.0      
 [55] foreach_1.5.2        tools_4.5.0          DescTools_0.99.60   
 [58] glue_1.8.0           mnormt_2.1.1         prodlim_2024.06.25  
 [61] gridExtra_2.3        pan_1.9              mgcv_1.9-1          
 [64] xfun_0.52            distributional_0.5.0 loo_2.8.0           
 [67] withr_3.0.2          fastmap_1.2.0        mitools_2.4         
 [70] boot_1.3-31          SparseM_1.84-2       callr_3.7.6         
 [73] digest_0.6.37        timechange_0.3.0     R6_2.6.1            
 [76] estimability_1.5.1   colorspace_2.1-1     mix_1.0-13          
 [79] generics_0.1.3       data.table_1.17.0    recipes_1.3.0       
 [82] class_7.3-23         httr_1.4.7           htmlwidgets_1.6.4   
 [85] ModelMetrics_1.2.2.2 pkgconfig_2.0.3      Exact_3.3           
 [88] gtable_0.3.6         timeDate_4041.110    furrr_0.3.1         
 [91] htmltools_0.5.8.1    scales_1.4.0         lmom_3.2            
 [94] posterior_1.6.1      gower_1.0.2          reformulas_0.4.0    
 [97] rstudioapi_0.17.1    tzdb_0.5.0           reshape2_1.4.4      
[100] curl_6.2.2           coda_0.19-4.1        checkmate_2.3.2     
[103] nlme_3.1-168         nloptr_2.2.1         proxy_0.4-27        
[106] zoo_1.8-14           rootSolve_1.8.2.4    parallel_4.5.0      
[109] foreign_0.8-90       pillar_1.10.2        grid_4.5.0          
[112] vctrs_0.6.5          jomo_2.7-6           xtable_1.8-4        
[115] cluster_2.1.8.1      htmlTable_2.4.3      evaluate_1.0.3      
[118] pbivnorm_0.6.0       mvtnorm_1.3-3        cli_3.6.5           
[121] compiler_4.5.0       rlang_1.1.6          rstantools_2.4.0    
[124] future.apply_1.11.3  labeling_0.4.3       ps_1.9.1            
[127] fs_1.6.6             plyr_1.8.9           rstan_2.32.7        
[130] stringi_1.8.7        psych_2.5.3          pander_0.6.6        
[133] QuickJSR_1.7.0       viridisLite_0.4.2    V8_6.0.3            
[136] glmnet_4.1-8         Brobdingnag_1.2-9    quantreg_6.1        
[139] hms_1.1.3            future_1.40.0        haven_2.5.4         
[142] rbibutils_2.3        broom_1.0.8          RcppParallel_5.1.10 
[145] DEoptimR_1.1-3-1     readxl_1.4.5        
