11 Multiple Regression
11.1 Getting Started
11.1.1 Load Packages
11.1.2 Load Data
We created the player_stats_weekly.RData and player_stats_seasonal.RData objects in Section 4.4.3.
11.2 Overview of Multiple Regression
Multiple regression is an extension of correlation. Correlation examines the association between one predictor variable and one outcome variable. Multiple regression examines the association between multiple predictor variables and one outcome variable. It allows obtaining a more accurate estimate of the unique contribution of a given predictor variable by controlling for other variables (covariates).
Regression with one predictor variable takes the form of Equation 11.1:
\[ y = \beta_0 + \beta_1x_1 + \epsilon \tag{11.1}\]
where \(y\) is the outcome variable, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, \(x_1\) is the predictor variable, and \(\epsilon\) is the error term.
A regression line is depicted in Figure 11.27.
Regression with multiple predictors—i.e., multiple regression—takes the form of Equation 11.2:
\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p + \epsilon \tag{11.2}\]
where \(p\) is the number of predictor variables. Multiple regression is essentially a weighted sum of the predictor variables plus an intercept. Under the hood, multiple regression seeks to identify the best weight for each predictor.
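To make this concrete, below is a minimal sketch (using simulated data and hypothetical variable names) of estimating a multiple regression model in R with the lm() function; the estimated intercept and slopes are the weights described above.
Code
set.seed(52242)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n) # true weights: 1.5 and -0.8
exampleData <- data.frame(y, x1, x2)
exampleModel <- lm(
  y ~ x1 + x2,
  data = exampleData)
coef(exampleModel) # estimated intercept (beta_0) and slopes (beta_1, beta_2)
summary(exampleModel) # coefficients, standard errors, p-values, and R^2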
11.3 Components
- \(B\) = unstandardized coefficient: direction and magnitude of the estimate (original scale)
- \(\beta\) (beta) = standardized coefficient: direction and magnitude of the estimate (standard deviation scale)
- \(SE\) = standard error: uncertainty of unstandardized estimate
The unstandardized regression coefficient (\(B\)) is interpreted such that, for every unit change in the predictor variable, there is a __ unit change in the outcome variable. For instance, when examining the association between age and fantasy points, if the unstandardized regression coefficient is 2.3, players score on average 2.3 more points for each additional year of age. (In reality, we might expect a nonlinear, inverted-U-shaped association between age and fantasy points such that players tend to reach their peak in the middle of their careers.) Unstandardized regression coefficients are tied to the metric of the raw data. Thus, a large unstandardized regression coefficient for two variables may mean completely different things. Holding the strength of the association constant, you tend to see larger unstandardized regression coefficients for variables with smaller units and smaller unstandardized regression coefficients for variables with larger units.
Standardized regression coefficients can be obtained by standardizing the variables to z-scores so they all have a mean of zero and a standard deviation of one. The standardized regression coefficient (\(\beta\)) is interpreted such that, for every standard deviation change in the predictor variable, there is a __ standard deviation change in the outcome variable. For instance, when examining the association between age and fantasy points, if the standardized regression coefficient is 0.1, players score on average 0.1 standard deviations more points for each additional standard deviation of age. Standardized regression coefficients—though not in all instances—tend to fall between [−1, 1]. Thus, standardized regression coefficients tend to be more comparable across variables and models than unstandardized regression coefficients. In this way, standardized regression coefficients provide a meaningful index of effect size.
The standard error of a regression coefficient represents the uncertainty of the parameter. If we have less uncertainty (i.e., more confidence) about the parameter, the standard error will be small. If we have more uncertainty (i.e., less confidence) about the parameter, the standard error will be large. If we used the same sampling procedure repeatedly and calculated the regression coefficient each time, the true parameter in the population would fall 68% of the time within the interval of: \([\text{model parameter estimate for the regression coefficient} \pm 1 \text{ standard error}]\) .
A confidence interval represents a range of plausible values such that, with repeated sampling, the true value falls within a given interval with some confidence. Our parameter estimate for the regression coefficient, plus or minus 1 standard error, reflects the 68% confidence interval for the coefficient. The 95% confidence interval is computed as the parameter estimate plus or minus 1.96 standard errors. For instance, if the parameter estimate for the regression coefficient is 0.50, and the standard error is 0.10, the 95% confidence interval is [0.30, 0.70]: \(0.5 - (1.96 \times 0.10) = 0.3\); \(0.5 + (1.96 \times 0.10) = 0.7\). That is, if we used the same sampling procedure repeatedly, the true value of the regression coefficient would be expected to fall 95% of the time somewhere between 0.30 and 0.70.
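As a quick illustration of that computation (a minimal sketch using the hypothetical estimate and standard error from the example above; for a fitted model, the confint() function computes confidence intervals directly):
Code
estimate <- 0.50
se <- 0.10
ci95 <- estimate + c(-1.96, 1.96) * se
ci95 # 0.304 0.696, i.e., approximately [0.30, 0.70]
# for a fitted lm() model object (hypothetical name), use:
# confint(fittedModel, level = 0.95)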
11.4 Assumptions of Multiple Regression
Linear regression models make the following assumptions:
- there is a linear association between the predictor variable and the outcome variable
- there is homoscedasticity of the residuals; the residuals do not differ as a function of the predictor variable or as a function of the outcome variable
- the residuals are independent; they are uncorrelated with each other
- the residuals are normally distributed
Homoscedasticity of the residuals means that the variance of the residuals does not differ as a function of the outcome variable or as a function of the predictor variable (i.e., the residuals show constant variance across levels of the outcome/predictors). If the variance of the residuals differs as a function of the outcome or predictor variable, this is called heteroscedasticity.
Those are some of the key assumptions of multiple regression. However, there are additional assumptions of multiple regression, including ones discussed in the chapter on Causal Inference. For instance, the variables included should reflect a causal process such that the predictor variables influence the outcome variable, and not the other way around. That is, the outcome variable should not influence the predictor variables (i.e., there should be no reverse causation). In addition, it is important to control for any confound(s). If a confound is not controlled for, this is called omitted-variable bias, and it leads the researcher to incorrectly attribute the effects of the omitted variable to the included variables.
11.4.1 Evaluating and Addressing Assumptions of Multiple Regression
11.4.1.1 Linear Association
To evaluate the shape of the association between the predictor variables and the outcome variable, we can examine scatterplots (Figure 11.2), residual plots (Figure 11.19), marginal model plots (Figure 11.10), added-variable plots (Figure 11.11), and component-plus-residual plots (Figure 11.12). Residual plots depict the residuals (errors) on the y-axis as a function of the fitted values or a specific predictor on the x-axis. Marginal model plots are basically glorified scatterplots that depict the outcome variable (both in terms of observed values and model-fitted values) on the y-axis and the predictor variables on the x-axis. Added-variable plots depict the unique association of each predictor variable with the outcome variable when controlling for all the other predictor variables in the model. Component-plus-residual plots depict partial residuals on the y-axis as a function of each predictor variable on the x-axis, where a partial residual for a given predictor is the effect of that predictor (thus controlling for all the other predictor variables in the model) plus the residual from the full model.
Examples of linear and nonlinear associations are depicted with scatterplots in Figure 11.2.
Code
set.seed(52242)
sampleSize <- 1000
quadraticX <- runif(
sampleSize,
min = -4,
max = 4)
linearY <- quadraticX + rnorm(sampleSize, mean = 0, sd = 0.5)
quadraticY <- quadraticX ^ 2 + rnorm(sampleSize)
quadraticData <- cbind(quadraticX, quadraticY) %>%
data.frame %>%
arrange(quadraticX)
quadraticModel <- lm(
quadraticY ~ quadraticX + I(quadraticX ^ 2),
data = quadraticData)
quadraticNewData <- data.frame(
quadraticX = seq(
from = min(quadraticData$quadraticX),
to = max(quadraticData$quadraticX),
length.out = sampleSize))
quadraticNewData$quadraticY <- predict(
quadraticModel,
newdata = quadraticNewData)
plot(
x = quadraticX,
y = linearY,
xlab = "",
ylab = "",
main = "Linear Association")
abline(
lm(linearY ~ quadraticX),
lwd = 2,
col = "blue")
plot(
x = quadraticData$quadraticX,
y = quadraticData$quadraticY,
xlab = "",
ylab = "",
main = "Nonlinear Association")
lines(
quadraticNewData$quadraticY ~ quadraticNewData$quadraticX,
lwd = 2,
col = "blue")
If the shape of the association is nonlinear (as indicated by any of these plots), various approaches may be necessary, such as including nonlinear terms (e.g., polynomial terms such as quadratic, cubic, quartic, or higher-degree terms), transforming the predictors (e.g., log, square root, inverse, exponential, Box-Cox, or Yeo-Johnson transformations), using splines/piecewise regression, or using generalized additive models.
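As a brief illustration, here is a sketch of two such alternatives—polynomial terms and a natural cubic spline—fit to the simulated quadraticData object created above (the generalized additive model line assumes the mgcv package is installed):
Code
library(splines)
# polynomial (quadratic) terms
polynomialModel <- lm(
  quadraticY ~ poly(quadraticX, 2),
  data = quadraticData)
# natural cubic spline with 4 degrees of freedom
splineModel <- lm(
  quadraticY ~ ns(quadraticX, df = 4),
  data = quadraticData)
# generalized additive model with a smooth term (assumes the mgcv package)
# gamModel <- mgcv::gam(quadraticY ~ s(quadraticX), data = quadraticData)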
11.4.1.2 Homoscedasticity
To evaluate homoscedasticity, we can evaluate a residual plot (Figure 11.19) and spread-level plot (Figure 11.20). A residual plot depicts the residuals on the y-axis as a function of the model’s fitted values on the x-axis. Homoscedasticity in a residual plot is identified as a constant spread of residuals versus fitted values—the residuals do not show a fan, cone, or bow-tie shape; a fan, cone, or bow-tie shape indicates heteroscedasticity. In a residual plot, a fan or cone shape indicates increasing or decreasing variance in the residuals as a function of the fitted values; a bow-tie shape indicates that the residuals are smallest in the middle of the fitted values and greatest on the extremes of the fitted values. A spread-level plot depicts the log of the absolute value of studentized residuals on the y-axis as a function of the log of the model’s fitted values on the x-axis. Homoscedasticity in a spread-level plot is identified as a flat slope; a slope that differs from zero indicates heteroscedasticity.
Code
set.seed(52242)
# 1. Homoscedasticity
sampleSize <- 1000
x1 <- runif(sampleSize, 0, 10)
y1 <- 3 + 2 * x1 + rnorm(sampleSize, mean = 0, sd = 2)
fit1 <- lm(y1 ~ x1)
res1 <- resid(fit1)
fitted1 <- fitted(fit1)
# 2. Fan-shaped heteroscedasticity
x2 <- runif(sampleSize, 0, 10)
y2 <- 3 + 2 * x2 + rnorm(sampleSize, mean = 0, sd = 0.5 * x2) # increasing variance
fit2 <- lm(y2 ~ x2)
res2 <- resid(fit2)
fitted2 <- fitted(fit2)
# 3. Bow-tie-shaped heteroscedasticity
x3 <- runif(sampleSize, 0, 10)
sd3 <- abs(x3 - 5) + 0.5 # variance smallest in middle, higher on edges
y3 <- 3 + 2 * x3 + rnorm(sampleSize, mean = 0, sd = sd3)
fit3 <- lm(y3 ~ x3)
res3 <- resid(fit3)
fitted3 <- fitted(fit3)
plot(
fitted1,
res1,
main = "Homoscedasticity",
xlab = "Fitted value",
ylab = "Residual")
abline(
h = 0,
lwd = 2,
col = "blue")
plot(
fitted2,
res2,
main = "Fan-Shaped Heteroscedasticity",
xlab = "Fitted value",
ylab = "Residual")
abline(
h = 0,
lwd = 2,
col = "blue")
plot(
fitted3,
res3,
main = "Bow-Tie-Shaped Heteroscedasticity",
xlab = "Fitted value",
ylab = "Residual")
abline(
h = 0,
lwd = 2,
col = "blue")
If there is heteroscedasticity, it may be necessary to transform the outcome variable to be more normally distributed. The spread-level plot provides a suggested power transformation to transform the outcome variable so that the spread of residuals becomes more uniform across the fitted values.
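For instance, here is a minimal sketch of requesting a spread-level plot—and its suggested power transformation—for the fan-shaped model (fit2) simulated above, using the car package:
Code
# prints a suggested power transformation (lambda) for the outcome;
# the outcome would then be transformed as y^lambda (or log(y) if lambda is near 0)
car::spreadLevelPlot(fit2)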
11.4.1.4 Normally Distributed Residuals
To examine whether residuals are normally distributed, we can examine quantile–quantile (QQ) plots and probability–probability (PP) plots. QQ plots are particularly useful for identifying deviations from normality in the tails of the distribution; PP plots are particularly useful for identifying deviations from normality in the center of the distribution. Researchers tend to be more concerned about the tails of the distribution, because extreme values tend to have a greater impact on inferences. Various examples of QQ plots and deviations from normality are depicted in Figure 11.4. If the residuals are normally distributed, they will stay close to the diagonal reference line of the QQ and PP plots.
Code
set.seed(52242)
sampleSize <- 10000
distribution_normal <- rnorm(
sampleSize,
mean = 0,
sd = 1)
distribution_bimodal <- c(
rnorm(
sampleSize/2,
mean = -2,
sd = 1),
rnorm(
sampleSize/2,
mean = 2,
sd = 1))
distribution_negativeSkew <- -rlnorm(
sampleSize,
meanlog = 0,
sdlog = 1)
distribution_positiveSkew <- rlnorm(
sampleSize,
meanlog = 0,
sdlog = 1)
distribution_lightTailed <- runif(
sampleSize,
min = -2,
max = 2)
distribution_heavyTailed <- rt(
sampleSize,
df = 2)
hist(
distribution_normal,
main = "Normal Distribution",
col = "#0099F8")
car::qqPlot(
distribution_normal,
main = "QQ Plot",
id = FALSE)
hist(
distribution_bimodal,
main = "Bimodal Distribution",
col = "#0099F8")
car::qqPlot(
distribution_bimodal,
main = "QQ Plot",
id = FALSE)
hist(
distribution_negativeSkew,
main = "Negatively Skewed Distribution",
col = "#0099F8")
car::qqPlot(
distribution_negativeSkew,
main = "QQ Plot",
id = FALSE)
hist(
distribution_positiveSkew,
main = "Positively Skewed Distribution",
col = "#0099F8")
car::qqPlot(
distribution_positiveSkew,
main = "QQ Plot",
id = FALSE)
hist(
distribution_lightTailed,
main = "Platykurtic (Light Tailed) Distribution",
col = "#0099F8")
car::qqPlot(
distribution_lightTailed,
main = "QQ Plot",
id = FALSE)
hist(
distribution_heavyTailed,
main = "Leptokurtic (Heavy Tailed) Distribution",
col = "#0099F8")
car::qqPlot(
distribution_heavyTailed,
main = "QQ Plot",
id = FALSE)
If the residuals are not normally distributed (i.e., they do not stay close to the diagonal reference line of the QQ and PP plots), it may be necessary to transform the outcome variable to be more normally distributed or to use a generalized linear model (GLM) that more closely matches the distribution of the outcome variable (e.g., Poisson, binomial, gamma).
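As a minimal sketch of the generalized linear model approach (using simulated count data and hypothetical variable names), a Poisson regression can be fit with the glm() function:
Code
set.seed(52242)
n <- 500
x <- rnorm(n)
countY <- rpois(n, lambda = exp(0.5 + 0.3 * x)) # Poisson-distributed count outcome
poissonModel <- glm(
  countY ~ x,
  family = poisson(link = "log"))
summary(poissonModel)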
11.5 How Much Variance the Model Explains
When estimating a multiple regression model, it can be useful to evaluate how much variance in the outcome variable the predictor variable(s) explain (i.e., account for). If the variables collectively explain only a small amount of variance, it suggests that the predictor variables have a small effect size and that other predictor variables will be necessary to account for the majority of variability in the outcome variable. There are two primary indices of how much variance in the outcome variable is explained by the predictor variable(s): the coefficient of determination (\(R^2\)) and adjusted \(R^2\) (\(R^2_{adj}\)), described below.
11.5.1 Coefficient of Determination (\(R^2\))
As noted in Section 9.6.6, the coefficient of determination (\(R^2\)) reflects the proportion of variance in the outcome (dependent) variable that is explained by the model predictions, as in Equation 9.25: \(R^2 = \frac{\text{variance explained in }Y}{\text{total variance in }Y}\). Various formulas for \(R^2\) are in Equation 9.19. Larger \(R^2\) values indicate greater accuracy. Multiple regression can be conceptualized with overlapping circles (similar to a Venn diagram), where the non-overlapping portions of the circles reflect nonshared variance and the overlapping portions of the circles reflect shared variance, as in Figure 11.27.
One issue with \(R^2\) is that it increases as the number of predictors increases, which can lead to overfitting if using \(R^2\) as an index to compare models for purposes of selecting the “best-fitting” model. Consider the following example (adapted from Petersen (2025)) in which you have one predictor variable and one outcome variable, as shown in Table 11.1.
y | x1 |
---|---|
7 | 1 |
13 | 2 |
29 | 7 |
10 | 2 |
Using the data, the best fitting regression model is: \(y =\) 3.98 \(+\) 3.59 \(\cdot x_1\). In this example, the \(R^2\) is 0.98. The equation is not a perfect prediction, but with a single predictor variable, it captures the majority of the variance in the outcome.
Now consider the following example where you add a second predictor variable to the data above, as shown in Table 11.2.
y | x1 | x2 |
---|---|---|
7 | 1 | 3 |
13 | 2 | 5 |
29 | 7 | 1 |
10 | 2 | 2 |
With the second predictor variable, the best fitting regression model is: \(y =\) 0.00 + 4.00 \(\cdot x_1 +\) 1.00 \(\cdot x_2\). In this example, the \(R^2\) is 1.00. The equation with the second predictor variable provides a perfect prediction of the outcome.
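Both toy models can be reproduced directly in R (a minimal sketch using the values from Tables 11.1 and 11.2):
Code
toyData <- data.frame(
  y  = c(7, 13, 29, 10),
  x1 = c(1, 2, 7, 2),
  x2 = c(3, 5, 1, 2))
model1 <- lm(y ~ x1, data = toyData) # one predictor
model2 <- lm(y ~ x1 + x2, data = toyData) # two predictors
coef(model1) # ~3.98 (intercept) and ~3.59 (slope)
summary(model1)$r.squared # ~0.98
summary(model2)$r.squared # 1.00: perfect prediction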
Providing perfect prediction with the right set of predictor variables is the dream of multiple regression. So, using multiple regression, we often add predictor variables to incrementally improve prediction. The amount of variance that would be accounted for by random chance alone follows Equation 11.3:
\[ E(R^2) = \frac{K}{n-1} \tag{11.3}\]
where \(E(R^2)\) is the expected value of \(R^2\) (the proportion of variance explained), \(K\) is the number of predictor variables, and \(n\) is the sample size. The formula demonstrates that the more predictor variables in the regression model, the more variance will be accounted for by chance. With many predictor variables and a small sample, you can account for a large share of the variance merely by chance.
As an example, consider that we have 13 predictor variables to predict fantasy performance for 43 players. Assume that, with 13 predictor variables, we explain 38% of the variance (\(R^2 = .38; r = .62\)). We explained a lot of the variance in the outcome, but it is important to consider how much variance could have been explained by random chance: \(E(R^2) = \frac{K}{n-1} = \frac{13}{43 - 1} = .31\). We expect to explain 31% of the variance, by chance, in the outcome. So, 82% of the variance explained was likely spurious (i.e., \(\frac{.31}{.38} = .82\)). As the sample size increases, the spuriousness decreases. To account for the number of predictor variables in the model, we can use a modified version of \(R^2\) called adjusted \(R^2\) (\(R^2_{adj}\)), described next.
11.5.2 Adjusted \(R^2\) (\(R^2_{adj}\))
Adjusted \(R^2\) (\(R^2_{adj}\)) accounts for the number of predictor variables in the model, based on how much variance would be expected to be accounted for by chance, in order to penalize overfitting. Adjusted \(R^2\) (\(R^2_{adj}\)) reflects the proportion of variance in the outcome (dependent) variable that is explained by the model predictions over and above what would be expected to be accounted for by chance, given the number of predictor variables in the model. The formula for adjusted \(R^2\) (\(R^2_{adj}\)) is in Equation 11.4:
\[ R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1} \tag{11.4}\]
where \(p\) is the number of predictor variables in the model, and \(n\) is the sample size.
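As a quick check, here is a minimal sketch applying Equation 11.4 to the earlier example of 13 predictors and 43 players with \(R^2 = .38\):
Code
R2 <- 0.38
n <- 43
p <- 13
R2_adj <- 1 - (1 - R2) * (n - 1) / (n - p - 1)
R2_adj # ~0.10, much smaller than the unadjusted R^2 of .38
# for a fitted lm() model, both values are reported by summary():
# summary(model)$r.squared; summary(model)$adj.r.squared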
11.6 Overfitting
Statistical models applied to big data (e.g., data with many predictor variables) can overfit the data, meaning that the statistical model accounts for error variance that will not generalize to future samples. So, even though an overfit statistical model appears to be accurate because it accounts for more variance, it is not actually that accurate—it will predict new data less accurately than it fits the data with which the model was built. In the case of fantasy football analytics, this is especially relevant because there are hundreds if not thousands of variables we could consider for inclusion and many, many players when considering historical data.
Consider an example where you develop an algorithm to predict players’ fantasy performance based on 2024 data using hundreds of predictor variables. To some extent, these predictor variables will likely account for true variance (i.e., signal) and error variance (i.e., noise). If we were to apply the same algorithm based on the 2024 prediction model to 2025 data, the prediction model would likely predict less accurately than with 2024 data. The regression coefficients tend to become weaker when applied to new data, a phenomenon called shrinkage, which is described in Section 15.7.1. The regression coefficients in the [FILL IN]
In Figure 11.6, the blue line represents the true distribution of the data, and the red line is an overfitting model:
Code
set.seed(52242)
sampleSize <- 200
quadraticX <- rnorm(sampleSize)
quadraticY <- quadraticX ^ 2 + rnorm(sampleSize)
quadraticData <- cbind(quadraticX, quadraticY) %>%
data.frame %>%
arrange(quadraticX)
quadraticModel <- lm(
quadraticY ~ quadraticX + I(quadraticX ^ 2),
data = quadraticData)
quadraticNewData <- data.frame(
quadraticX = seq(
from = min(quadraticData$quadraticX),
to = max(quadraticData$quadraticX),
length.out = sampleSize))
quadraticNewData$quadraticY <- predict(
quadraticModel,
newdata = quadraticNewData)
loessFit <- loess(
quadraticY ~ quadraticX,
data = quadraticData,
span = 0.01,
degree = 1)
loessNewData <- data.frame(
quadraticX = seq(
from = min(quadraticData$quadraticX),
to = max(quadraticData$quadraticX),
length.out = sampleSize))
quadraticNewData$loessY <- predict(
loessFit,
newdata = quadraticNewData)
plot(
x = quadraticData$quadraticX,
y = quadraticData$quadraticY,
xlab = "",
ylab = "")
lines(
quadraticNewData$quadraticY ~ quadraticNewData$quadraticX,
lwd = 2,
col = "blue")
lines(
quadraticNewData$loessY ~ quadraticNewData$quadraticX,
lwd = 2,
col = "red")
11.7 Covariates
Covariates are variables that you include in the statistical model to try to control for them so you can better isolate the unique contribution of the predictor variable(s) in relation to the outcome variable. Use of covariates examines the association between the predictor variable and the outcome variable when holding people’s levels on the covariates constant. Including confounds as covariates allows potentially gaining a more accurate estimate of the causal effect of the predictor variable on the outcome variable. Ideally, you want to include any and all confounds as covariates. As described in Section 8.4.2.1, confounds are third variables that influence both the predictor variable and the outcome variable and explain their association. Covariates are potentially (but not necessarily) confounds. For instance, you might include the player’s age as a covariate in a model that examines whether a player’s 40-yard dash time at the NFL Combine predicts their fantasy points in their rookie year, but age may not be a confound.
11.8 Example: Predicting Wide Receivers’ Fantasy Points
Let’s say we want to use a number of variables to predict a wide receiver’s fantasy performance. We want to consider several predictors, including the player’s age, height, weight, and target share. We have only a few predictors and our sample size is large enough such that overfitting is not likely a concern.
On a weekly basis, target share is computed as the number of targets a player receives divided by the team’s total number of targets. It would be nice to compute it this way for seasonal data, as well; however, for calculating seasonal target share, the nflfastR
(Carl & Baldwin, 2024) package currently sums the player’s weekly target shares across weeks. Thus, seasonal target share (unlike weekly target share)—as it is currently calculated—does not range from 0–1. For example, if a player has a weekly target share of 15% in each of 17 games, that corresponds to a seasonal (cross-week-summed) target share of 2.55 (\(0.15 \times 17 \; \text{games} = 2.55\)).
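If a per-game (0–1) value is desired, the cross-week-summed seasonal target share can be divided by the number of games played (a minimal sketch, assuming the target_share and games columns in player_stats_seasonal):
Code
player_stats_seasonal$target_share_perGame <-
  player_stats_seasonal$target_share / player_stats_seasonal$games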
11.8.1 Examine Descriptive Statistics
Let’s first examine descriptive statistics of the predictor and outcome variables.
Code
player_stats_seasonal %>%
dplyr::select(fantasyPoints, age, height, weight, target_share) %>%
dplyr::summarise(across(
everything(),
.fns = list(
n = ~ length(na.omit(.)),
missingness = ~ mean(is.na(.)) * 100,
M = ~ mean(., na.rm = TRUE),
SD = ~ sd(., na.rm = TRUE),
min = ~ min(., na.rm = TRUE),
max = ~ max(., na.rm = TRUE),
range = ~ max(., na.rm = TRUE) - min(., na.rm = TRUE),
IQR = ~ IQR(., na.rm = TRUE),
MAD = ~ mad(., na.rm = TRUE),
median = ~ median(., na.rm = TRUE),
pseudomedian = ~ DescTools::HodgesLehmann(., na.rm = TRUE),
mode = ~ petersenlab::Mode(., multipleModes = "mean"),
skewness = ~ psych::skew(., na.rm = TRUE),
kurtosis = ~ psych::kurtosi(., na.rm = TRUE)),
.names = "{.col}.{.fn}")) %>%
tidyr::pivot_longer(
cols = everything(),
names_to = c("variable","index"),
names_sep = "\\.") %>%
tidyr::pivot_wider(
names_from = index,
values_from = value)
Let’s also examine the distributions of the variables using a density plot, as depicted in Figure 11.7.
Code
ggplot2::ggplot(
data = player_stats_seasonal %>%
filter(position_group %in% c("WR")),
mapping = aes(
x = fantasyPoints)
) +
geom_histogram(
aes(y = after_stat(density)),
color = "#000000",
fill = "#0099F8"
) +
geom_density(
color = "#000000",
fill = "#F85700",
alpha = 0.6 # add transparency
) +
geom_rug() +
labs(
x = "Fantasy Points",
y = "Density",
title = "Density Plot of Fantasy Points with Histogram and Rug Plot"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
ggplot2::ggplot(
data = player_stats_seasonal %>%
filter(position_group %in% c("WR")),
mapping = aes(
x = age)
) +
geom_histogram(
aes(y = after_stat(density)),
color = "#000000",
fill = "#0099F8"
) +
geom_density(
color = "#000000",
fill = "#F85700",
alpha = 0.6 # add transparency
) +
geom_rug() +
labs(
x = "Age (years)",
y = "Density",
title = "Density Plot of Player Age with Histogram and Rug Plot"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
ggplot2::ggplot(
data = player_stats_seasonal %>%
filter(position_group %in% c("WR")),
mapping = aes(
x = height)
) +
geom_histogram(
aes(y = after_stat(density)),
color = "#000000",
fill = "#0099F8"
) +
geom_density(
color = "#000000",
fill = "#F85700",
alpha = 0.6 # add transparency
) +
geom_rug() +
labs(
x = "Height (inches)",
y = "Density",
title = "Density Plot of Player Height with Histogram and Rug Plot"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
ggplot2::ggplot(
data = player_stats_seasonal %>%
filter(position_group %in% c("WR")),
mapping = aes(
x = weight)
) +
geom_histogram(
aes(y = after_stat(density)),
color = "#000000",
fill = "#0099F8"
) +
geom_density(
color = "#000000",
fill = "#F85700",
alpha = 0.6 # add transparency
) +
geom_rug() +
labs(
x = "Weight (pounds)",
y = "Density",
title = "Density Plot of Player Weight with Histogram and Rug Plot"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
ggplot2::ggplot(
data = player_stats_seasonal %>%
filter(position_group %in% c("WR")),
mapping = aes(
x = target_share)
) +
geom_histogram(
aes(y = after_stat(density)),
color = "#000000",
fill = "#0099F8"
) +
geom_density(
color = "#000000",
fill = "#F85700",
alpha = 0.6 # add transparency
) +
geom_rug() +
labs(
x = "Target Share",
y = "Density",
title = "Density Plot of Target Share with Histogram and Rug Plot"
) +
theme_classic() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
11.8.2 Examine Bivariate Associations
Then, let’s examine the bivariate association of each predictor with fantasy points using scatterplots to evaluate for any potential nonlinearity, as depicted in Figure 11.8.
Code
ggplot2::ggplot(
data = player_stats_seasonal %>% filter(position %in% c("WR")),
aes(
x = age,
y = fantasyPoints)) +
geom_point(alpha = 0.05) +
geom_smooth(
method = "lm",
color = "black") +
geom_smooth() +
coord_cartesian(
ylim = c(0,NA),
expand = FALSE) +
labs(
x = "Player Age (Years)",
y = "Fantasy Points (Season)",
title = "Fantasy Points (Season) by Player Age",
subtitle = "(Among Wide Receivers)"
) +
theme_classic()
ggplot2::ggplot(
data = player_stats_seasonal %>% filter(position %in% c("WR")),
aes(
x = height,
y = fantasyPoints)) +
geom_point(alpha = 0.05) +
geom_smooth(
method = "lm",
color = "black") +
geom_smooth() +
coord_cartesian(
ylim = c(0,NA),
expand = FALSE) +
labs(
x = "Player Height (Inches)",
y = "Fantasy Points (Season)",
title = "Fantasy Points (Season) by Player Height",
subtitle = "(Among Wide Receivers)"
) +
theme_classic()
ggplot2::ggplot(
data = player_stats_seasonal %>% filter(position %in% c("WR")),
aes(
x = weight,
y = fantasyPoints)) +
geom_point(alpha = 0.05) +
geom_smooth(
method = "lm",
color = "black") +
geom_smooth() +
coord_cartesian(
ylim = c(0,NA),
expand = FALSE) +
labs(
x = "Player Weight (Pounds)",
y = "Fantasy Points (Season)",
title = "Fantasy Points (Season) by Player Weight",
subtitle = "(Among Wide Receivers)"
) +
theme_classic()
ggplot2::ggplot(
data = player_stats_seasonal %>% filter(position %in% c("WR")),
aes(
x = target_share,
y = fantasyPoints)) +
geom_point(alpha = 0.05) +
geom_smooth(
method = "lm",
color = "black") +
geom_smooth() +
coord_cartesian(
ylim = c(0,NA),
expand = FALSE) +
labs(
x = "Target Share",
y = "Fantasy Points (Season)",
title = "Fantasy Points (Season) by Target Share",
subtitle = "(Among Wide Receivers)"
) +
theme_classic()
There are some suggestions of potential nonlinearity, such as an inverted-U-shaped association between height and fantasy points, suggesting that there may be an optimal range of height among Wide Receivers—being too short or too tall could be a disadvantage. In addition, target share shows a weakening association with fantasy points as target share increases. Thus, after evaluating the linear associations between the predictors and the outcome, we will also examine the possibility of curvilinear associations.
11.8.3 Estimate Multiple Regression Model
Now that we have examined descriptive statistics and bivariate associations, let’s first estimate a multiple regression model with only linear terms:
The model formula is in Equation 11.5:
\[ \text{fantasy points} = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{height} + \beta_3 \cdot \text{weight} + \beta_4 \cdot \text{target share} + \epsilon \tag{11.5}\]
Here are the model results:
Call:
lm(formula = fantasyPoints ~ age + height + weight + target_share,
data = player_stats_seasonal %>% filter(position %in% c("WR")),
na.action = "na.exclude")
Residuals:
Min 1Q Median 3Q Max
-325.39 -12.92 -5.35 10.21 234.15
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.45896 19.11171 0.547 0.58423
age 0.22273 0.18068 1.233 0.21772
height -0.58258 0.33739 -1.727 0.08429 .
weight 0.17335 0.05336 3.249 0.00117 **
target_share 50.70291 0.37231 136.185 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 35.82 on 4246 degrees of freedom
(1138 observations deleted due to missingness)
Multiple R-squared: 0.8194, Adjusted R-squared: 0.8193
F-statistic: 4817 on 4 and 4246 DF, p-value: < 2.2e-16
The only terms that are significantly associated with fantasy performance among Wide Receivers are weight and target share, both of which showed a positive association with fantasy points.
We can obtain the coefficient of determination (\(R^2\)) and adjusted \(R^2\) (\(R^2_{adj}\)) using the following code:
[1] 0.8194318
[1] 0.8192617
The model explained 82% of the variability in fantasy points (i.e., \(R^2 = .82\)).
The model formula with substituted values is in Equation 11.6:
\[ \begin{aligned} \text{fantasy points} = &\beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{height} + \beta_3 \cdot \text{weight} + \beta_4 \cdot \text{target share} + \epsilon \\ = &10.46 + 0.22 \cdot \text{age} + -0.58 \cdot \text{height} + 0.17 \cdot \text{weight} + 50.70 \cdot \text{target share} + \epsilon \end{aligned} \tag{11.6}\]
If we want to obtain standardized regression coefficients, we can standardize the outcome variable and each predictor variable using the scale()
function:
Code
Call:
lm(formula = scale(fantasyPoints) ~ scale(age) + scale(height) +
scale(weight) + scale(target_share), data = player_stats_seasonal %>%
filter(position %in% c("WR")), na.action = "na.exclude")
Residuals:
Min 1Q Median 3Q Max
-3.8685 -0.1537 -0.0636 0.1214 2.7838
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.003032 0.006537 -0.464 0.64280
scale(age) 0.008348 0.006771 1.233 0.21772
scale(height) -0.016268 0.009421 -1.727 0.08429 .
scale(weight) 0.030597 0.009419 3.249 0.00117 **
scale(target_share) 0.903075 0.006631 136.185 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4259 on 4246 degrees of freedom
(1138 observations deleted due to missingness)
Multiple R-squared: 0.8194, Adjusted R-squared: 0.8193
F-statistic: 4817 on 4 and 4246 DF, p-value: < 2.2e-16
Target share has a large effect size. All of the other predictors have small effect sizes.
Now let’s consider whether any of the terms show curvilinear associations with fantasy points by adding quadratic terms:
The model formula is in Equation 11.7:
\[ \begin{aligned} \text{fantasy points} \; = \; &\beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{age}^2 + \beta_3 \cdot \text{height} + \beta_4 \cdot \text{height}^2 + \\ &\beta_5 \cdot \text{weight} + \beta_6 \cdot \text{weight}^2 + \beta_7 \cdot \text{target share} + \beta_8 \cdot \text{target share}^2 + \epsilon \end{aligned} \tag{11.7}\]
Here are the model results:
Call:
lm(formula = fantasyPoints ~ age + I(age^2) + height + I(height^2) +
weight + I(weight^2) + target_share + I(target_share^2),
data = player_stats_seasonal %>% filter(position %in% c("WR")),
na.action = "na.exclude")
Residuals:
Min 1Q Median 3Q Max
-180.329 -14.089 0.151 9.499 245.425
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.894e+01 4.082e+02 0.193 0.847
age 1.914e+00 2.199e+00 0.870 0.384
I(age^2) -3.362e-02 3.972e-02 -0.846 0.397
height -4.994e-01 1.194e+01 -0.042 0.967
I(height^2) 4.283e-05 8.239e-02 0.001 1.000
weight -8.862e-01 7.178e-01 -1.235 0.217
I(weight^2) 2.613e-03 1.781e-03 1.467 0.143
target_share 7.083e+01 8.773e-01 80.738 <2e-16 ***
I(target_share^2) -4.205e+00 1.681e-01 -25.020 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 33.45 on 4242 degrees of freedom
(1138 observations deleted due to missingness)
Multiple R-squared: 0.8427, Adjusted R-squared: 0.8424
F-statistic: 2841 on 8 and 4242 DF, p-value: < 2.2e-16
Among the quadratic terms, only the quadratic term for target share shows a significant association. If we wanted to visualize the shape of the model-implied association between target share and fantasy points, we could generate the model-implied predictions using the data range that we want to visualize.
Code
newdata <- data.frame(
target_share = seq(
from = min(player_stats_seasonal$target_share[which(player_stats_seasonal$position == "WR")], na.rm = TRUE),
to = max(player_stats_seasonal$target_share[which(player_stats_seasonal$position == "WR")], na.rm = TRUE),
length.out = 1000
)
)
newdata$age <- mean(player_stats_seasonal$age[which(player_stats_seasonal$position == "WR")], na.rm = TRUE)
newdata$height <- mean(player_stats_seasonal$height[which(player_stats_seasonal$position == "WR")], na.rm = TRUE)
newdata$weight <- mean(player_stats_seasonal$weight[which(player_stats_seasonal$position == "WR")], na.rm = TRUE)
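The model-implied fantasy points for this grid can then be generated with predict() (a sketch using the quadratic model object, named quadraticTermsRegressionModel as in the computation later in this section):
Code
newdata$fantasyPoints <- predict(
  quadraticTermsRegressionModel,
  newdata = newdata)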
We can depict the model-implied predictions of fantasy points as a function of target share, as shown in Figure 11.9.
Code
ggplot2::ggplot(
data = newdata,
aes(
x = target_share,
y = fantasyPoints)) +
geom_smooth() +
coord_cartesian(
ylim = c(0,NA),
expand = FALSE) +
labs(
x = "Target Share",
y = "Fantasy Points (Season)",
title = "Fantasy Points (Season) by Target Share",
subtitle = "(Among Wide Receivers)"
) +
theme_classic()
We could also generate the model-implied prediction of fantasy points for any value of the predictor variables. For instance, here is the number of fantasy points expected for a Wide Receiver who is 23 years old, is 6 feet tall (72 inches), weighs 200 pounds, and has a (cross-week-summed) target share of 3 (which reflects an approximate target share of 17.6% per game; i.e., \(0.176 = \frac{3}{17 \; \text{games}}\)):
Code
1
171.3862
The player would be expected to score 171 fantasy points.
We can calculate this by substituting in the regression coefficients and predictor values. Here is the computation:
Code
age <- 23
height <- 72
weight <- 200
target_share <- 3
beta0 <- coef(quadraticTermsRegressionModel)[["(Intercept)"]]
beta1 <- coef(quadraticTermsRegressionModel)[["age"]]
beta2 <- coef(quadraticTermsRegressionModel)[["I(age^2)"]]
beta3 <- coef(quadraticTermsRegressionModel)[["height"]]
beta4 <- coef(quadraticTermsRegressionModel)[["I(height^2)"]]
beta5 <- coef(quadraticTermsRegressionModel)[["weight"]]
beta6 <- coef(quadraticTermsRegressionModel)[["I(weight^2)"]]
beta7 <- coef(quadraticTermsRegressionModel)[["target_share"]]
beta8 <- coef(quadraticTermsRegressionModel)[["I(target_share^2)"]]
predictedFantasyPoints <- beta0 + beta1*age + beta2*(age^2) + beta3*height + beta4*(height^2) + beta5*weight + beta6*(weight^2) + beta7*target_share + beta8*(target_share^2)
predictedFantasyPoints
[1] 171.3862
The model formula with substituted values is in Equation 11.8:
\[ \begin{aligned} \text{fantasy points} \; = \; &\beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{age}^2 + \beta_3 \cdot \text{height} + \beta_4 \cdot \text{height}^2 + \\ &\beta_5 \cdot \text{weight} + \beta_6 \cdot \text{weight}^2 + \beta_7 \cdot \text{target share} + \beta_8 \cdot \text{target share}^2 + \epsilon \\ = &78.94 + 1.91 \cdot 23 + -0.03 \cdot 23^2 + -0.50 \cdot 72 + 0.00 \cdot 72^2 + \\ &-0.89 \cdot 200 + 0.00 \cdot 200^2 + 70.83 \cdot 3 + -4.21 \cdot 3^2 + \epsilon \\ = &171.39 \end{aligned} \tag{11.8}\]
11.8.4 Evaluating and Addressing Assumptions
The assumptions for multiple regression are described in Section 11.4. I describe ways to evaluate assumptions in Section 11.4.1.
As a reminder, here are four assumptions:
- there is a linear association between the predictor variables and the outcome variable
- there is homoscedasticity of the residuals; the residuals do not differ as a function of the predictor variables or as a function of the outcome variable
- the residuals are independent; they are uncorrelated with each other
- the residuals are normally distributed
11.8.4.1 Linear Association
We evaluated the shape of the association between the predictor variables and the outcome variable using scatterplots. We accounted for potential curvilinearity in the associations with quadratic terms. Other ways to account for nonlinearity, in addition to polynomials, include transforming predictors, using splines/piecewise regression, and using generalized additive models.
To evaluate for potential nonlinearity in the associations, we can also evaluate residual plots (Figure 11.19), marginal model plots (Figure 11.10), added-variable plots (Figure 11.11), and component-plus-residual plots (Figure 11.12) from the car
package (Fox et al., 2024; Fox & Weisberg, 2019). For evaluating linearity, we would expect minimal bend/curvature in the lines.
The marginal model plots (Figure 11.10), residual plots (Figure 11.19), and component-plus-residual plots (Figure 11.12) suggest that the nonlinearity of the association between target share and fantasy points may not be fully captured by the quadratic term. Thus, we may need to apply a different approach to handling the nonlinear association between target share and fantasy points.
One approach we can take is to transform the target_share variable to be more normally distributed.
The histogram for the raw target_share variable is in Figure 11.13.
Code
The variable shows a strong positive skew. To address a strong positive skew, we can use a log transformation. The histogram of the log-transformed variable is in Figure 11.14.
Code
Now we can re-fit the model with the log-transformed variable.
Code
Call:
lm(formula = fantasyPoints ~ age + I(age^2) + height + I(height^2) +
weight + I(weight^2) + I(log(target_share + 1)), data = player_stats_seasonal %>%
filter(position %in% c("WR")), na.action = "na.exclude")
Residuals:
Min 1Q Median 3Q Max
-161.438 -20.977 0.802 15.700 258.079
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -81.371684 442.140136 -0.184 0.8540
age 3.050092 2.382398 1.280 0.2005
I(age^2) -0.052401 0.043019 -1.218 0.2233
height 5.283660 12.932511 0.409 0.6829
I(height^2) -0.040543 0.089233 -0.454 0.6496
weight -1.676457 0.777014 -2.158 0.0310 *
I(weight^2) 0.004671 0.001928 2.422 0.0155 *
I(log(target_share + 1)) 133.712797 0.998403 133.927 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 36.23 on 4243 degrees of freedom
(1138 observations deleted due to missingness)
Multiple R-squared: 0.8154, Adjusted R-squared: 0.8151
F-statistic: 2677 on 7 and 4243 DF, p-value: < 2.2e-16
Target share shows a more linear association with fantasy points after log-transforming it (albeit still not perfect), as depicted in Figures 11.15, 11.16, 11.17, and 11.18.
When creating a residual plot, the car
package (Fox et al., 2024; Fox & Weisberg, 2019) also provides a test of the nonlinearity of each predictor.
Test stat Pr(>|Test stat|)
age -0.2757 0.78282
I(age^2) 1.7447 0.08111 .
height -1.7267 0.08429 .
I(height^2) 0.4603 0.64534
weight 0.6394 0.52261
I(weight^2) -1.7398 0.08197 .
I(log(target_share + 1)) 24.6080 < 2e-16 ***
Tukey test 24.8338 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
11.8.4.2 Homoscedasticity
To evaluate homoscedasticity, we can evaluate a residual plot (Figure 11.19) and spread-level plot (Figure 11.20) from the car
package (Fox et al., 2024; Fox & Weisberg, 2019). In a residual plot, you want a constant spread of residuals versus fitted values—you do not want the residuals to show a fan or cone shape.
Test stat Pr(>|Test stat|)
age -0.3563 0.7216
I(age^2) 1.4827 0.1382
height -1.3604 0.1738
I(height^2) 0.6870 0.4921
weight 0.5744 0.5658
I(weight^2) -1.5116 0.1307
target_share 1.1683 0.2427
I(target_share^2) -5.7545 9.301e-09 ***
Tukey test 8.1631 3.266e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In a spread-level plot, you want a flat (zero) slope—you do not want a positive or negative slope.
Suggested power transformation: 0.4071025
In this example, the spread of the residuals appears to increase as a function of the fitted values. To handle this, we may need to transform the outcome variable to be more normally distributed.
The histogram for raw fantasy points is in Figure 12.33.
Code
We can apply a Yeo-Johnson transformation to the outcome variable to generate a more normally distributed variable. A Yeo-Johnson transformation estimates the optimal transformation to make the variable more normally distributed. Let’s use a Yeo-Johnson transformation of fantasy points:
Code
Created from 38458 samples and 1 variables
Pre-processing:
- ignored (0)
- Yeo-Johnson transformation (1)
Lambda estimates for Yeo-Johnson transformation:
0.18
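The output above is consistent with the caret package’s preProcess() function; here is a sketch of how such a transformation might be estimated and applied (assuming the caret package and the fantasyPoints column):
Code
# estimate the Yeo-Johnson transformation for fantasy points
yeoJohnson <- caret::preProcess(
  player_stats_seasonal["fantasyPoints"],
  method = "YeoJohnson")
# apply the transformation and save the transformed variable
player_stats_seasonal$fantasyPoints_transformed <- predict(
  yeoJohnson,
  player_stats_seasonal["fantasyPoints"])$fantasyPoints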
The histogram of the transformed variable is in Figure 11.22.
Code
Now we can refit the model.
Code
Call:
lm(formula = fantasyPoints_transformed ~ age + height + weight +
I(log(target_share + 1)), data = player_stats_seasonal %>%
filter(position %in% c("WR")), na.action = "na.exclude")
Residuals:
Min 1Q Median 3Q Max
-7.8110 -0.6124 0.1003 0.7085 7.0980
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.009835 0.563406 5.342 9.66e-08 ***
age 0.009946 0.005331 1.866 0.06213 .
height -0.025737 0.009950 -2.587 0.00972 **
weight 0.004003 0.001573 2.544 0.01099 *
I(log(target_share + 1)) 4.493072 0.029044 154.698 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.056 on 4246 degrees of freedom
(1138 observations deleted due to missingness)
Multiple R-squared: 0.8538, Adjusted R-squared: 0.8536
F-statistic: 6198 on 4 and 4246 DF, p-value: < 2.2e-16
The residual plot is in Figure 11.23.
Test stat Pr(>|Test stat|)
age 0.0894 0.9287
height 0.9811 0.3266
weight -1.3368 0.1814
I(log(target_share + 1)) -38.0852 <2e-16 ***
Tukey test -38.2715 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The spread-level plot is in Figure 11.24.
Suggested power transformation: 1.356186
The residuals show more constant variance after transforming the outcome variable.
11.8.4.4 Normally Distributed Residuals
We can examine whether residuals are normally distributed using quantile–quantile (QQ) plots and probability–probability (PP) plots, as in Figures 11.25 and 11.26. If the residuals are normally distributed, they should stay close to the diagonal reference line.
[1] 3120 4445
11.9 Multicollinearity
Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated. The problem with highly correlated predictor variables is that they make it challenging to estimate the regression coefficients accurately.
Multicollinearity in multiple regression is depicted conceptually in Figure 11.27.
Consider the following example adapted from Petersen (2025) where you have two predictor variables and one outcome variable, as shown in Table 11.3.
y | x1 | x2 |
---|---|---|
9 | 2.0 | 4 |
11 | 3.0 | 6 |
17 | 4.0 | 8 |
3 | 1.0 | 2 |
21 | 5.0 | 10 |
13 | 3.5 | 7 |
The second predictor variable provides no unique information—it is exactly twice the value of the first predictor variable; thus, the two predictor variables are perfectly correlated (i.e., \(r = 1.0\)). This means that there are different prediction equation possibilities that are equally good—see Equations in Equation 11.9:
\[ \begin{aligned} 2x_2 &= y \\ 0x_1 + 2x_2 &= y \\ 4x_1 &= y \\ 4x_1 + 0x_2 &= y \\ 2x_1 + 1x_2 &= y \\ 5x_1 - 0.5x_2 &= y \\ ... &= y \end{aligned} \tag{11.9}\]
Then, what are the regression coefficients? We do not know which regression coefficients are correct, because each of the possibilities fits the data equally well. Thus, when estimating the regression model, we could obtain arbitrary estimates of the regression coefficients with an enormous standard error around each estimate. In general, multicollinearity increases the uncertainty (i.e., standard errors and confidence intervals) around the parameter estimates. Any predictor variables that have a correlation above ~\(r = .30\) with each other could have an impact on the confidence interval of the regression coefficients. As the correlations among the predictor variables increase, the chance of getting an arbitrary answer increases—a phenomenon sometimes called “bouncing betas.” So, it is important to examine a correlation matrix of the predictor variables before putting them in the same regression model. You can also examine indices such as the variance inflation factor (VIF), where a value greater than 5 or 10 indicates multicollinearity.
Here are the VIFs from our earlier model:
age height weight
1.020806 2.109266 2.121813
I(log(target_share + 1))
1.030619
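The VIFs above can be obtained with the car package’s vif() function; a minimal sketch, where transformedOutcomeModel is a hypothetical name for the log-target-share model fit earlier:
Code
transformedOutcomeModel <- lm(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude")
car::vif(transformedOutcomeModel)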
To address multicollinearity, you can drop a redundant predictor, or you can use principal component analysis or factor analysis of the predictors to reduce them to a smaller number of meaningful predictors. To obtain precise and confident estimates of the regression coefficients for individual predictors, you need a low level of intercorrelation among the predictors, unless you have a very large sample size. However, if you are merely interested in prediction—and are not interested in interpreting the regression coefficients of individual predictors—multicollinearity poses less of a problem. For instance, machine learning cares more about achieving the greatest predictive accuracy possible and cares less about explaining which predictors are causally related to the outcome. So, multicollinearity is less of a concern for machine learning approaches.
11.10 Handling of Missing Data
An important consideration in multiple regression is how missing data are handled. Multiple regression in R
using the lm()
function applies listwise deletion. Listwise deletion removes any row (in the data file) from analysis that has a missing value on the outcome variable or any of the predictor variables. Removing all rows from analysis that have any missingness in the model variables can be a problem because missingness is often not completely at random—missingness often occurs systematically (i.e., for a reason). For instance, participants may be less likely to have data for all variables if they are from a lower socioeconomic status background and do not have the time to participate in all study procedures. Thus, applying listwise deletion, we might systematically exclude participants from lower socioeconomic status backgrounds (or other groups), which could lead to less generalizable inferences.
It is thus important to consider approaches to handle missingness. Various approaches to handle missingness include pairwise deletion (aka available-case analysis), multiple imputation, and full information maximum likelihood (FIML).
11.10.1 Pairwise Deletion
We can estimate a regression model that uses pairwise deletion using the lavaan
package (Rosseel, 2012; Rosseel et al., 2024).
Code
player_stats_seasonal$target_share_log <- log(player_stats_seasonal$target_share + 1)
modelData <- player_stats_seasonal %>%
filter(position %in% c("WR")) %>%
select(fantasyPoints_transformed, age, height, weight, target_share_log)
numObs <- sum(complete.cases(modelData))
varMeans <- colMeans(modelData, na.rm = TRUE)
varCovariances <- cov(modelData, use = "pairwise.complete.obs")
pairwiseRegression_syntax <- '
fantasyPoints_transformed ~ age + height + weight + target_share_log
fantasyPoints_transformed ~~ fantasyPoints_transformed
fantasyPoints_transformed ~ 1
'
pairwiseRegression_fit <- lavaan::lavaan(
pairwiseRegression_syntax,
sample.mean = varMeans,
sample.cov = varCovariances,
sample.nobs = numObs
)
summary(
pairwiseRegression_fit,
standardized = TRUE,
rsquare = TRUE)
lavaan 0.6-19 ended normally after 1 iteration
Estimator ML
Optimization method NLMINB
Number of model parameters 6
Number of observations 4251
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv
fantasyPoints_transformed ~
age 0.040 0.005 7.879 0.000 0.040
height -0.013 0.010 -1.363 0.173 -0.013
weight 0.004 0.002 2.862 0.004 0.004
target_shar_lg 4.466 0.028 157.112 0.000 4.466
Std.all
0.046
-0.011
0.024
0.918
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.fntsyPnts_trns 1.265 0.552 2.290 0.022 1.265 0.459
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.fntsyPnts_trns 1.069 0.023 46.103 0.000 1.069 0.141
R-Square:
Estimate
fntsyPnts_trns 0.859
11.10.2 Multiple Imputation
We can multiply impute data using the mice
package (van Buuren & Groothuis-Oudshoorn, 2011, 2024).
Code
numImputations <- 5
dataToImpute <- player_stats_seasonal %>%
filter(position %in% c("WR")) %>%
select(player_id, position, where(is.numeric)) %>%
select(
player_id:games, carries:wopr, fantasy_points, fantasy_points_ppr,
rush_40_yds, rec_40_yds, fumbles, two_pts, return_yds,
rush_100_yds:draftround, height:target_share_log) %>%
select(-c(fantasy_points_ppr, ageCentered20, target_share_log)) # drop collinear variables
predictors <- c("targets","receiving_yards","receiving_air_yards","receiving_yards_after_catch","receiving_first_downs","racr")
dataToImpute$player_id_integer <- as.integer(as.factor(dataToImpute$player_id))
varsToImpute <- c("age","height","weight","target_share")
Y <- varsToImpute
Now, let’s specify the imputation method—we use the two-level predictive mean matching (2l.pmm
) method from the miceadds
package (Robitzsch et al., 2024) to account for the nonindependent data (owing to multiple seasons per player):
Now, let’s specify the predictor matrix. A predictor matrix is a matrix of values, where:
- columns with non-zero values are predictors of the variable specified in the given row
- the diagonal of the predictor matrix should be zero because a variable cannot predict itself
The values are:
- NOT a predictor of the outcome: 0
- cluster variable: -2
- fixed effect of predictor: 1
- fixed effect and random effect of predictor: 2
- include cluster mean of predictor in addition to fixed effect of predictor: 3
- include cluster mean of predictor in addition to fixed effect and random effect of predictor: 4
Now, let’s run the imputation:
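Here is a sketch consolidating the steps above—the imputation method, the predictor matrix, and the mice() call—assuming the objects defined earlier (dataToImpute, numImputations, varsToImpute, and the player_id_integer cluster variable) and the miceadds package:
Code
impMethod <- mice::make.method(dataToImpute)
impMethod[varsToImpute] <- "2l.pmm" # two-level predictive mean matching (miceadds)
predMatrix <- mice::make.predictorMatrix(dataToImpute)
predMatrix[varsToImpute, "player_id_integer"] <- -2 # cluster (player) variable
predMatrix[, "player_id"] <- 0 # exclude the character ID as a predictor
imputedData <- mice::mice(
  dataToImpute,
  m = numImputations,
  method = impMethod,
  predictorMatrix = predMatrix,
  seed = 52242)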
Below are some imputation diagnostics. Trace plots are in Figure 11.28.
A density plot is in Figure 11.29.
The imputed data do not match the distribution of the observed data well. Thus, it may be necessary to select a different imputation method for more accurate imputation.
Now, let’s do some post-processing:
Now let’s estimate multiple regression with the multiply imputed data:
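A sketch of estimating the regression model in each imputed dataset and pooling the results with Rubin’s rules (assuming the imputedData object from the sketch above; the model formula here is illustrative and uses the fantasy_points variable retained in dataToImpute):
Code
imputedFits <- with(
  imputedData,
  lm(fantasy_points ~ age + height + weight + target_share))
summary(mice::pool(imputedFits))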
11.10.3 Full Information Maximum Likelihood
We can estimate a regression model that uses full information maximum likelihood using the lavaan
package (Rosseel, 2012; Rosseel et al., 2024).
Code
fimlRegression_syntax <- '
fantasyPoints_transformed ~ age + height + weight + target_share_log
fantasyPoints_transformed ~~ fantasyPoints_transformed
fantasyPoints_transformed ~ 1
'
fimlRegression_fit <- lavaan::sem(
fimlRegression_syntax,
data = player_stats_seasonal %>% filter(position %in% c("WR")),
missing = "ML",
fixed.x = FALSE
)
summary(
fimlRegression_fit,
standardized = TRUE,
rsquare = TRUE)
lavaan 0.6-19 ended normally after 51 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 20
Number of observations 5389
Number of missing patterns 2
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Parameter Estimates:
Standard errors Standard
Information Observed
Observed information based on Hessian
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv
fantasyPoints_transformed ~
age 0.014 0.005 2.728 0.006 0.014
height -0.024 0.010 -2.423 0.015 -0.024
weight 0.004 0.002 2.737 0.006 0.004
target_shar_lg 4.482 0.028 158.556 0.000 4.482
Std.all
0.016
-0.020
0.023
0.920
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
age ~~
height -0.297 0.101 -2.945 0.003 -0.297 -0.040
weight -0.155 0.637 -0.243 0.808 -0.155 -0.003
target_shar_lg 0.284 0.025 11.294 0.000 0.284 0.160
height ~~
weight 24.979 0.584 42.755 0.000 24.979 0.716
target_shar_lg 0.118 0.018 6.370 0.000 0.118 0.089
weight ~~
target_shar_lg 1.035 0.117 8.825 0.000 1.035 0.123
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.fntsyPnts_trns 2.709 0.553 4.894 0.000 2.709 0.983
age 26.254 0.043 611.426 0.000 26.254 8.329
height 72.606 0.032 2269.509 0.000 72.606 30.916
weight 200.480 0.202 991.397 0.000 200.480 13.505
target_shar_lg 0.745 0.008 94.885 0.000 0.745 1.318
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.fntsyPnts_trns 1.112 0.024 46.316 0.000 1.112 0.146
age 9.936 0.191 51.909 0.000 9.936 1.000
height 5.516 0.106 51.909 0.000 5.516 1.000
weight 220.371 4.245 51.909 0.000 220.371 1.000
target_shar_lg 0.320 0.006 50.129 0.000 0.320 1.000
R-Square:
Estimate
fntsyPnts_trns 0.854
11.11 Addressing Non-Independence of Observations
Please note that the \(p\)-value for regression coefficients assumes that the observations are independent—in particular, that the residuals are not correlated. However, the observations are not independent in the player_stats_seasonal
dataframe used above, because the same player has multiple rows—one row corresponding to each season they played. This non-independence violates the assumptions underlying the traditional significance tests of regression coefficients. We could address this by analyzing only one season from each player or by estimating the significance of the regression coefficients using cluster-robust standard errors. For simplicity, the models above present results from the whole dataframe. In Chapter 12, we discuss mixed model approaches that handle repeated measures and other data that violate the assumption of independence. Below, we demonstrate how to account for non-independence of observations using cluster-robust standard errors with the rms
package (Harrell, Jr., 2025).
Code
player_stats_seasonal_subset <- player_stats_seasonal %>%
filter(!is.na(player_id)) %>%
filter(position %in% c("WR"))
rms::robcov(rms::ols(
fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
data = player_stats_seasonal_subset,
x = TRUE,
y = TRUE),
cluster = player_stats_seasonal_subset$player_id) #account for nested data within player
Frequencies of Missing Values Due to Each Variable
fantasyPoints_transformed age height
0 0 0
weight target_share
0 1138
Linear Regression Model
rms::ols(formula = fantasyPoints_transformed ~ age + height +
weight + I(log(target_share + 1)), data = player_stats_seasonal_subset,
x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 4251 LR chi2 8173.35 R2 0.854
sigma 1.0563 d.f. 4 R2 adj 0.854
d.f. 4246 Pr(> chi2) 0.0000 g 2.918
Cluster on player_stats_seasonal_subset$player_id
Clusters 1260
Residuals
Min 1Q Median 3Q Max
-7.8110 -0.6124 0.1003 0.7085 7.0980
Coef S.E. t Pr(>|t|)
Intercept 3.0098 0.6671 4.51 <0.0001
age 0.0099 0.0058 1.73 0.0842
height -0.0257 0.0117 -2.20 0.0275
weight 0.0040 0.0018 2.25 0.0247
target_share 4.4931 0.0389 115.53 <0.0001
11.12 Impact of Outliers
As with correlation, multiple regression can be strongly impacted by outliers.
11.12.1 Robust Regression
To address outliers, there are various approaches to robust regression. One approach is to use an MM-type estimator, such as the one implemented in the lmrob()
and glmrob()
functions of the robustbase
package (Maechler et al., 2024; Todorov & Filzmoser, 2009).
Code
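# A sketch of the call that produced the output below ("robustRegressionModel" is a
# hypothetical object name): robust regression of transformed fantasy points on age,
# height, weight, and log target share using an MM-type estimator
robustRegressionModel <- robustbase::lmrob(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal %>% filter(position %in% c("WR")))

summary(robustRegressionModel)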
Call:
lmrob(formula = fantasyPoints_transformed ~ age + height + weight + I(log(target_share +
1)), data = player_stats_seasonal %>% filter(position %in% c("WR")))
\--> method = "MM"
Residuals:
Min 1Q Median 3Q Max
-7.87540 -0.66317 0.04762 0.66103 7.04723
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.442784 0.583086 5.904 3.82e-09 ***
age 0.008755 0.005064 1.729 0.08392 .
height -0.028607 0.010445 -2.739 0.00619 **
weight 0.003316 0.001642 2.019 0.04352 *
I(log(target_share + 1)) 4.481790 0.035380 126.675 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Robust residual standard error: 0.9524
(1138 observations deleted due to missingness)
Multiple R-squared: 0.8617, Adjusted R-squared: 0.8616
Convergence in 13 IRWLS iterations
Robustness weights:
7 observations c(1665,2461,2641,2979,3107,3513,3884)
are outliers with |weight| = 0 ( < 2.4e-05);
335 weights are ~= 1. The remaining 3909 ones are summarized as
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01056 0.86100 0.94810 0.89460 0.98490 0.99900
Algorithmic parameters:
tuning.chi bb tuning.psi refine.tol
1.548e+00 5.000e-01 4.685e+00 1.000e-07
rel.tol scale.tol solve.tol zero.tol
1.000e-07 1.000e-10 1.000e-07 1.000e-10
eps.outlier eps.x warn.limit.reject warn.limit.meanrw
2.352e-05 4.820e-10 5.000e-01 5.000e-01
nResample max.it groups n.group best.r.s
500 50 5 400 2
k.fast.s k.max maxit.scale trace.lev mts
1 200 200 0 1000
compute.rd fast.s.large.n
0 2000
psi subsampling cov
"bisquare" "nonsingular" ".vcov.avar1"
compute.outlier.stats
"SM"
seed : int(0)
11.13 Moderated Multiple Regression
When examining moderation in multiple regression, several steps are important:
- When computing the interaction term, first mean-center the predictor variables. Calculate the interaction term as the multiplication of the mean-centered predictor variables. Mean-centering the predictor variables when computing the interaction term is important for addressing issues regarding multicollinearity (Iacobucci et al., 2016).
- When including an interaction term in the model, make sure also to include the main effects.
First, we mean-center the predictors. In this case, we center the predictors around the mean of height and weight for Wide Receivers:
Code
player_stats_seasonal$height_centered <- player_stats_seasonal$height - mean(player_stats_seasonal$height[which(player_stats_seasonal$position == "WR")], na.rm = TRUE)
player_stats_seasonal$weight_centered <- player_stats_seasonal$weight - mean(player_stats_seasonal$weight[which(player_stats_seasonal$position == "WR")], na.rm = TRUE)
Then, we compute the interaction term as the multiplication of the two centered predictors:
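For instance, here is a minimal sketch; the name of the interaction variable is ours:
Code
# Interaction term = product of the two mean-centered predictors
# ("height_X_weight" is a hypothetical variable name)
player_stats_seasonal$height_X_weight <-
  player_stats_seasonal$height_centered * player_stats_seasonal$weight_centered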
Then, we fit the moderated multiple regression model:
Code
Call:
lm(formula = fantasyPoints_transformed ~ height_centered + weight_centered +
height_centered:weight_centered, data = player_stats_seasonal %>%
filter(position %in% c("WR")), na.action = "na.exclude")
Residuals:
Min 1Q Median 3Q Max
-11.1065 -2.0822 0.3452 2.2866 5.5100
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.5968021 0.0443550 126.182 < 2e-16 ***
height_centered -0.0287872 0.0228774 -1.258 0.2083
weight_centered 0.0257918 0.0036066 7.151 9.75e-13 ***
height_centered:weight_centered -0.0017426 0.0009637 -1.808 0.0706 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.735 on 5385 degrees of freedom
Multiple R-squared: 0.0156, Adjusted R-squared: 0.01505
F-statistic: 28.45 on 3 and 5385 DF, p-value: < 2.2e-16
This model is equivalent to the model that includes the interaction term explicitly:
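For example, here is a sketch of the equivalent model that enters the product term directly; the variable and object names are ours:
Code
# Equivalent model using the explicitly computed product term (hypothetical names)
moderatedRegression_explicit <- lm(
  fantasyPoints_transformed ~ height_centered + weight_centered + height_X_weight,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  na.action = "na.exclude")

summary(moderatedRegression_explicit)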
Now, we can visualize the interaction to understand it. We create an interaction plot (Figure 11.30) and Johnson-Neyman plot (Figure 11.31) using the interactions
package (Long, 2024).
Code
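# A sketch of the calls used to create the plots; "moderatedRegression" is a hypothetical
# name for the moderated regression model fitted above

# Interaction plot: simple slopes of height at different levels of weight
interactions::interact_plot(
  moderatedRegression,
  pred = height_centered,
  modx = weight_centered)

# Johnson-Neyman plot: values of weight for which the slope of height is statistically significant
interactions::johnson_neyman(
  moderatedRegression,
  pred = height_centered,
  modx = weight_centered)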
JOHNSON-NEYMAN INTERVAL
When weight_centered is INSIDE the interval [16.17, 137.37], the slope of
height_centered is p < .05.
Note: The range of observed values of weight_centered is [-47.48, 64.52]
Here is a simple slopes analysis:
Code
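# A sketch; "moderatedRegression" is a hypothetical name for the model fitted above
# Simple slopes of height at -1 SD, the mean, and +1 SD of weight
interactions::sim_slopes(
  moderatedRegression,
  pred = height_centered,
  modx = weight_centered)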
SIMPLE SLOPES ANALYSIS
Slope of height_centered when weight_centered = -1.484626e+01 (- 1 SD):
Est. S.E. t val. p
------- ------ -------- ------
-0.00 0.03 -0.11 0.91
Slope of height_centered when weight_centered = -1.002064e-16 (Mean):
Est. S.E. t val. p
------- ------ -------- ------
-0.03 0.02 -1.26 0.21
Slope of height_centered when weight_centered = 1.484626e+01 (+ 1 SD):
Est. S.E. t val. p
------- ------ -------- ------
-0.05 0.03 -1.93 0.05
11.14 Mediation
A mediation model takes the following general form in the lavaan
package (Rosseel, 2012; Rosseel et al., 2024).
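The general form looks something like the following sketch, where X is the predictor, M is the hypothesized mediator, and Y is the outcome (the object name is ours):
Code
mediationModel_generalForm <- '
  # direct effect (cPrime)
  Y ~ direct*X

  # mediator
  M ~ a*X
  Y ~ b*M

  # indirect effect (a*b)
  indirect := a*b

  # total effect (c)
  total := direct + indirect
'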
Let’s substitute in our predictor, outcome, and hypothesized mediator. In this case, we predict that receiving touchdowns partially account for the association between Wide Receivers’ target share and their fantasy points. This is a silly example because fantasy points are derived, in part, from touchdowns, so of course touchdowns will partially account for almost any effect on Wide Receivers’ fantasy points. This example is merely for demonstrating the process of developing and examining a mediation model.
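Substituting in target share as the predictor, receiving touchdowns as the mediator, and fantasy points as the outcome yields something like the following sketch, consistent with the parameter labels shown in the output below:
Code
mediationModel <- '
  # direct effect (cPrime)
  fantasyPoints_transformed ~ direct*target_share

  # mediator
  receiving_tds ~ a*target_share
  fantasyPoints_transformed ~ b*receiving_tds

  # indirect effect (a*b)
  indirect := a*b

  # total effect (c)
  total := direct + indirect
'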
To get a robust estimate of the indirect effect, we obtain bootstrapped estimates from 1,000 bootstrap draws. Typically, we would obtain bootstrapped estimates from 10,000 bootstrap draws, but this example uses only 1,000 bootstrap draws for a shorter runtime.
Code
mediationFit <- lavaan::sem(
mediationModel,
data = player_stats_seasonal %>% filter(position %in% c("WR")),
se = "bootstrap",
bootstrap = 1000, # generally use 10,000 bootstrap draws; this example uses 1,000 for speed
parallel = "multicore", # parallelization for speed: use "multicore" for Mac/Linux; "snow" for PC
iseed = 52242, # for reproducibility
missing = "ML",
estimator = "ML",
fixed.x = FALSE)
Here are the model results:
lavaan 0.6-19 ended normally after 15 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 9
Number of observations 5389
Number of missing patterns 2
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Model Test Baseline Model:
Test statistic 10480.736
Degrees of freedom 3
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 1.000
Tucker-Lewis Index (TLI) 1.000
Robust Comparative Fit Index (CFI) 1.000
Robust Tucker-Lewis Index (TLI) 1.000
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -29079.024
Loglikelihood unrestricted model (H1) -29079.024
Akaike (AIC) 58176.048
Bayesian (BIC) 58235.377
Sample-size adjusted Bayesian (SABIC) 58206.778
Root Mean Square Error of Approximation:
RMSEA 0.000
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.000
P-value H_0: RMSEA <= 0.050 NA
P-value H_0: RMSEA >= 0.080 NA
Robust RMSEA 0.000
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.000
P-value H_0: Robust RMSEA <= 0.050 NA
P-value H_0: Robust RMSEA >= 0.080 NA
Standardized Root Mean Square Residual:
SRMR 0.000
Parameter Estimates:
Standard errors Bootstrap
Number of requested bootstrap draws 1000
Number of successful bootstrap draws 1000
Regressions:
                                Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  fantasyPoints_transformed ~
    trgt_sh  (drct)                1.125    0.033   33.884    0.000    1.125    0.612
  receiving_tds ~
    trgt_sh     (a)                1.497    0.031   48.902    0.000    1.497    0.763
  fantasyPoints_transformed ~
    rcvng_t     (b)                0.280    0.014   20.187    0.000    0.280    0.299
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.fntsyPnts_trns 3.253 0.035 93.550 0.000 3.253 1.180
.receiving_tds 0.029 0.034 0.865 0.387 0.029 0.010
target_share 1.485 0.021 71.159 0.000 1.485 0.990
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.fntsyPnts_trns 1.951 0.050 39.344 0.000 1.951 0.257
.receiving_tds 3.617 0.141 25.573 0.000 3.617 0.418
target_share 2.248 0.061 36.638 0.000 2.248 1.000
R-Square:
Estimate
fntsyPnts_trns 0.743
receiving_tds 0.582
Defined Parameters:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
indirect 0.419 0.015 27.571 0.000 0.419 0.228
total 1.544 0.025 61.644 0.000 1.544 0.840
We can also estimate a model with multiple hypothesized mediators:
Code
multipleMediatorModel <- '
# direct effect (cPrime)
fantasyPoints_transformed ~ direct*target_share
# mediators
receiving_tds ~ a1*target_share
receiving_yards ~ a2*target_share
fantasyPoints_transformed ~ b1*receiving_tds + b2*receiving_yards
# indirect effect = a*b
indirect1 := a1*b1
indirect2 := a2*b2
indirectTotal := indirect1 + indirect2
# total effect (c)
total := abs(direct) + abs(indirectTotal)
'
Code
multipleMediatorFit <- lavaan::sem(
multipleMediatorModel,
data = player_stats_seasonal %>% filter(position %in% c("WR")),
se = "bootstrap",
bootstrap = 1000, # generally use 10,000 bootstrap draws; this example uses 1,000 for speed
parallel = "multicore", # parallelization for speed: use "multicore" for Mac/Linux; "snow" for PC
iseed = 52242, # for reproducibility
missing = "ML",
estimator = "ML",
fixed.x = FALSE)
Here are the model results:
lavaan 0.6-19 ended normally after 18 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 13
Number of observations 5389
Number of missing patterns 2
Model Test User Model:
Test statistic 2117.502
Degrees of freedom 1
P-value (Chi-square) 0.000
Model Test Baseline Model:
Test statistic 22710.101
Degrees of freedom 6
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.907
Tucker-Lewis Index (TLI) 0.441
Robust Comparative Fit Index (CFI) 0.901
Robust Tucker-Lewis Index (TLI) 0.405
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -63833.389
Loglikelihood unrestricted model (H1) -62774.638
Akaike (AIC) 127692.778
Bayesian (BIC) 127778.476
Sample-size adjusted Bayesian (SABIC) 127737.166
Root Mean Square Error of Approximation:
RMSEA 0.627
90 Percent confidence interval - lower 0.604
90 Percent confidence interval - upper 0.649
P-value H_0: RMSEA <= 0.050 0.000
P-value H_0: RMSEA >= 0.080 1.000
Robust RMSEA 0.673
90 Percent confidence interval - lower 0.646
90 Percent confidence interval - upper 0.701
P-value H_0: Robust RMSEA <= 0.050 0.000
P-value H_0: Robust RMSEA >= 0.080 1.000
Standardized Root Mean Square Residual:
SRMR 0.046
Parameter Estimates:
Standard errors Bootstrap
Number of requested bootstrap draws 1000
Number of successful bootstrap draws 1000
Regressions:
                                Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  fantasyPoints_transformed ~
    trgt_sh  (drct)                0.418    0.035   12.097    0.000    0.418    0.230
  receiving_tds ~
    trgt_sh    (a1)                1.529    0.030   51.056    0.000    1.529    0.786
  receiving_yards ~
    trgt_sh    (a2)              235.236    3.487   67.470    0.000  235.236    0.909
  fantasyPoints_transformed ~
    rcvng_t    (b1)                0.034    0.011    3.077    0.002    0.034    0.036
    rcvng_y    (b2)                0.005    0.000   30.555    0.000    0.005    0.645
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.fntsyPnts_trns 3.156 0.030 105.208 0.000 3.156 1.149
.receiving_tds -0.020 0.032 -0.604 0.546 -0.020 -0.007
.receiving_yrds 25.834 3.912 6.603 0.000 25.834 0.066
target_share 1.486 0.021 71.601 0.000 1.486 0.983
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.fntsyPnts_trns 1.612 0.035 45.735 0.000 1.612 0.214
.receiving_tds 3.313 0.125 26.443 0.000 3.313 0.383
.receiving_yrds 26421.734 1279.929 20.643 0.000 26421.734 0.173
target_share 2.284 0.063 36.402 0.000 2.284 1.000
R-Square:
Estimate
fntsyPnts_trns 0.786
receiving_tds 0.617
receiving_yrds 0.827
Defined Parameters:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
indirect1 0.052 0.016 3.147 0.002 0.052 0.028
indirect2 1.066 0.027 39.897 0.000 1.066 0.587
indirectTotal 1.118 0.020 56.501 0.000 1.118 0.615
total 1.536 0.025 61.189 0.000 1.536 0.845
11.15 Bayesian Multiple Regression
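The output below comes from a Bayesian multiple regression model estimated with the brms package. Here is a minimal sketch of the call, consistent with the formula, data, and sampler settings shown in the output; the object name and seed are ours, and default priors are assumed:
Code
# Bayesian multiple regression with default priors (a sketch; hypothetical object name and seed)
bayesianRegressionModel <- brms::brm(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  chains = 4,   # 4 chains, per the output
  iter = 2000,  # 2,000 iterations per chain (1,000 warmup), per the output
  seed = 52242) # seed for reproducibility (hypothetical)

summary(bayesianRegressionModel)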
Family: gaussian
Links: mu = identity; sigma = identity
Formula: fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1))
Data: player_stats_seasonal %>% filter(position %in% c(" (Number of observations: 4251)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 3.01 0.56 1.94 4.12 1.00 4002 3148
age 0.01 0.01 -0.00 0.02 1.00 4461 2685
height -0.03 0.01 -0.05 -0.01 1.00 4191 2572
weight 0.00 0.00 0.00 0.01 1.00 4564 3299
Ilogtarget_shareP1 4.49 0.03 4.44 4.55 1.00 3936 2351
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.06 0.01 1.04 1.08 1.00 4819 2988
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
11.16 Conclusion
Multiple regression allows examining the association between multiple predictor variables and one outcome variable. Including multiple predictors in the model allows for potentially greater predictive accuracy and for identifying the extent to which each variable uniquely contributes to the outcome variable. As with correlation, an association does not imply causation. However, identifying associations is important because associations are a necessary (but insufficient) condition for causality. When developing a multiple regression model, there are various assumptions that are important to evaluate. In addition, it is important to pay attention to potential multicollinearity; when predictor variables are highly correlated, it may become difficult to detect a given predictor variable as statistically significant due to the greater uncertainty around its parameter estimate.
11.17 Session Info
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.50 lubridate_1.9.4 forcats_1.0.0
[4] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.4
[7] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[10] tidyverse_2.0.0 robustbase_0.99-4-1 brms_2.22.0
[13] Rcpp_1.0.14 interactions_1.2.0 miceadds_3.17-44
[16] mice_3.17.0 lavaan_0.6-19 performance_0.13.0
[19] lme4_1.1-37 Matrix_1.7-3 caret_7.0-1
[22] lattice_0.22-6 ggplot2_3.5.2 car_3.1-3
[25] carData_3.0-5 rms_8.0-0 Hmisc_5.2-3
[28] petersenlab_1.1.3
loaded via a namespace (and not attached):
[1] splines_4.5.0 polspline_1.1.25 cellranger_1.1.0
[4] hardhat_1.4.1 pROC_1.18.5 rpart_4.1.24
[7] lifecycle_1.0.4 Rdpack_2.6.4 StanHeaders_2.32.10
[10] processx_3.8.6 globals_0.17.0 MASS_7.3-65
[13] insight_1.2.0 backports_1.5.0 magrittr_2.0.3
[16] rmarkdown_2.29 yaml_2.3.10 pkgbuild_1.4.7
[19] gld_2.6.7 DBI_1.2.3 minqa_1.2.8
[22] RColorBrewer_1.1-3 multcomp_1.4-28 abind_1.4-8
[25] expm_1.0-0 quadprog_1.5-8 nnet_7.3-20
[28] TH.data_1.1-3 tensorA_0.36.2.1 sandwich_3.1-1
[31] jtools_2.3.0 ipred_0.9-15 lava_1.8.1
[34] inline_0.3.21 listenv_0.9.1 MatrixModels_0.5-4
[37] bridgesampling_1.1-2 parallelly_1.43.0 codetools_0.2-20
[40] tidyselect_1.2.1 shape_1.4.6.1 bayesplot_1.12.0
[43] farver_2.1.2 broom.mixed_0.2.9.6 matrixStats_1.5.0
[46] stats4_4.5.0 base64enc_0.1-3 jsonlite_2.0.0
[49] e1071_1.7-16 mitml_0.4-5 Formula_1.2-5
[52] survival_3.8-3 iterators_1.0.14 emmeans_1.11.0
[55] foreach_1.5.2 tools_4.5.0 DescTools_0.99.60
[58] glue_1.8.0 mnormt_2.1.1 prodlim_2024.06.25
[61] gridExtra_2.3 pan_1.9 mgcv_1.9-1
[64] xfun_0.52 distributional_0.5.0 loo_2.8.0
[67] withr_3.0.2 fastmap_1.2.0 mitools_2.4
[70] boot_1.3-31 SparseM_1.84-2 callr_3.7.6
[73] digest_0.6.37 timechange_0.3.0 R6_2.6.1
[76] estimability_1.5.1 colorspace_2.1-1 mix_1.0-13
[79] generics_0.1.3 data.table_1.17.0 recipes_1.3.0
[82] class_7.3-23 httr_1.4.7 htmlwidgets_1.6.4
[85] ModelMetrics_1.2.2.2 pkgconfig_2.0.3 Exact_3.3
[88] gtable_0.3.6 timeDate_4041.110 furrr_0.3.1
[91] htmltools_0.5.8.1 scales_1.4.0 lmom_3.2
[94] posterior_1.6.1 gower_1.0.2 reformulas_0.4.0
[97] rstudioapi_0.17.1 tzdb_0.5.0 reshape2_1.4.4
[100] curl_6.2.2 coda_0.19-4.1 checkmate_2.3.2
[103] nlme_3.1-168 nloptr_2.2.1 proxy_0.4-27
[106] zoo_1.8-14 rootSolve_1.8.2.4 parallel_4.5.0
[109] foreign_0.8-90 pillar_1.10.2 grid_4.5.0
[112] vctrs_0.6.5 jomo_2.7-6 xtable_1.8-4
[115] cluster_2.1.8.1 htmlTable_2.4.3 evaluate_1.0.3
[118] pbivnorm_0.6.0 mvtnorm_1.3-3 cli_3.6.5
[121] compiler_4.5.0 rlang_1.1.6 rstantools_2.4.0
[124] future.apply_1.11.3 labeling_0.4.3 ps_1.9.1
[127] fs_1.6.6 plyr_1.8.9 rstan_2.32.7
[130] stringi_1.8.7 psych_2.5.3 pander_0.6.6
[133] QuickJSR_1.7.0 viridisLite_0.4.2 V8_6.0.3
[136] glmnet_4.1-8 Brobdingnag_1.2-9 quantreg_6.1
[139] hms_1.1.3 future_1.40.0 haven_2.5.4
[142] rbibutils_2.3 broom_1.0.8 RcppParallel_5.1.10
[145] DEoptimR_1.1-3-1 readxl_1.4.5