11 Multiple Regression
11.1 Getting Started
11.1.1 Load Packages
11.2 Overview of Multiple Regression
Multiple regression examines the association between multiple predictor variables and one outcome variable. By controlling for other variables (covariates), it can provide a more accurate estimate of the unique contribution of a given predictor variable.
Regression with one predictor variable takes the form of Equation 11.1:
\[ y = \beta_0 + \beta_1x_1 + \epsilon \tag{11.1}\]
where \(y\) is the outcome variable, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, \(x_1\) is the predictor variable, and \(\epsilon\) is the error term.
A regression line is depicted in Figure 11.1.
Regression with multiple predictors—i.e., multiple regression—takes the form of Equation 11.2:
\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p + \epsilon \tag{11.2}\]
where \(p\) is the number of predictor variables.
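As a minimal sketch of Equation 11.2 in R, the following simulates data with two predictor variables and fits the model with `lm()`. The data, the object names (`simData`, `simModel`), and the true coefficient values are hypothetical and chosen only for illustration.

```r
# Simulated (hypothetical) data; the true coefficients are assumptions
set.seed(1)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)

simData <- data.frame(y, x1, x2)

# Fit the multiple regression model from Equation 11.2 with p = 2 predictors
simModel <- lm(
  y ~ x1 + x2,
  data = simData)

# Estimated intercept (beta_0) and slopes (beta_1, beta_2)
coef(simModel)
```

The estimates should land near the values used to simulate the data (2, 1.5, and −0.5), with some sampling error.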
11.3 Components
- \(B\) = unstandardized coefficient: direction and magnitude of the estimate (original scale)
- \(\beta\) (beta) = standardized coefficient: direction and magnitude of the estimate (standard deviation scale)
- \(SE\) = standard error: uncertainty of unstandardized estimate
The unstandardized regression coefficient (\(B\)) is interpreted such that, for every one-unit change in the predictor variable, there is a __ unit change in the outcome variable. For instance, when examining the association between age and fantasy points, if the unstandardized regression coefficient is 2.3, players score on average 2.3 more points for each additional year of age. (In reality, we might expect a nonlinear, inverted-U-shaped association between age and fantasy points such that players tend to reach their peak in the middle of their careers.) Unstandardized regression coefficients are tied to the metric of the raw data. Thus, unstandardized regression coefficients of the same size may mean completely different things for two different variables. Holding the strength of the association constant, you tend to see larger unstandardized regression coefficients for predictor variables whose values are numerically smaller (e.g., age expressed in decades) and smaller unstandardized regression coefficients for predictor variables whose values are numerically larger (e.g., age expressed in months).
Standardized regression coefficients can be obtained by standardizing the variables to z-scores so they all have a mean of zero and standard deviation of one. The standardized regression coefficient (\(\beta\)) is interpreted such that, for every standard deviation change in the predictor variable, there is a __ standard deviation change in the outcome variable. For instance, when examining the association between age and fantasy points, if the standardized regression coefficient is 0.1, players score on average 0.1 standard deviation more points for each additional standard deviation of age. Standardized regression coefficients tend to fall within [−1, 1], though not in all instances. Thus, standardized regression coefficients tend to be more comparable across variables and models than unstandardized regression coefficients. In this way, standardized regression coefficients provide a meaningful index of effect size.
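The distinction between \(B\) and \(\beta\) can be sketched in R as follows. The data are simulated and hypothetical (the ages, fantasy points, and the value 2.3 are assumptions for illustration); the sketch obtains \(\beta\) by z-scoring both variables and also by rescaling \(B\) by the ratio of standard deviations.

```r
set.seed(2)
n <- 300

# Hypothetical player data
age <- rnorm(n, mean = 26, sd = 3)
fantasyPoints <- 100 + 2.3 * age + rnorm(n, sd = 40)

playerData <- data.frame(age, fantasyPoints)

# Unstandardized coefficient (B): change in points per additional year of age
unstandardizedModel <- lm(
  fantasyPoints ~ age,
  data = playerData)
B <- coef(unstandardizedModel)[["age"]]

# Standardized coefficient (beta): z-score the variables, then refit
playerData$ageZ <- as.numeric(scale(playerData$age))
playerData$fantasyPointsZ <- as.numeric(scale(playerData$fantasyPoints))

standardizedModel <- lm(
  fantasyPointsZ ~ ageZ,
  data = playerData)
beta <- coef(standardizedModel)[["ageZ"]]

# Equivalent conversion: beta = B * SD(predictor) / SD(outcome)
betaFromB <- B * sd(playerData$age) / sd(playerData$fantasyPoints)

c(B = B, beta = beta, betaFromB = betaFromB)
```

With a single predictor variable, the standardized coefficient equals the Pearson correlation between the predictor and the outcome.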
11.4 Assumptions of Multiple Regression
Linear regression models make the following assumptions:
- there is a linear association between the predictor variables and the outcome variable
- there is homoscedasticity of the residuals; the variance of the residuals does not differ as a function of the predictor variables or as a function of the outcome variable
- the residuals are independent; they are uncorrelated with each other
- the residuals are normally distributed
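One common way to inspect some of these assumptions is with residual plots. The following is a rough sketch on simulated, hypothetical data (the object name `assumptionModel` is an assumption); it does not check independence of the residuals, which depends mainly on how the data were collected.

```r
set.seed(3)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

assumptionModel <- lm(y ~ x1 + x2)

modelResiduals <- resid(assumptionModel)
modelFitted <- fitted(assumptionModel)

# Linearity and homoscedasticity: residuals should scatter evenly around zero
# with no curve or funnel shape across the fitted values
plot(
  x = modelFitted,
  y = modelResiduals,
  xlab = "Fitted values",
  ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: points should fall close to the reference line
qqnorm(modelResiduals)
qqline(modelResiduals)
```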
11.5 Coefficient of Determination (\(R^2\))
The coefficient of determination (\(R^2\)) reflects the proportion of variance in the outcome (dependent) variable that is explained by the model predictions: \(R^2 = \frac{\text{variance explained in }Y}{\text{total variance in }Y}\). Various formulas for \(R^2\) are in Equation 9.19. Larger \(R^2\) values indicate greater accuracy. Multiple regression can be conceptualized with overlapping circles (similar to a Venn diagram), where the non-overlapping portions of the circles reflect nonshared variance and the overlapping portions of the circles reflect shared variance, as in Figure 11.2.
One issue with \(R^2\) is that it increases as the number of predictors increases, which can lead to overfitting if using \(R^2\) as an index to compare models for purposes of selecting the “best-fitting” model. Consider the following example (adapted from Petersen (2024b)) in which you have one predictor variable and one outcome variable, as shown in Table 11.1.
y | x1 |
---|---|
7 | 1 |
13 | 2 |
29 | 7 |
10 | 2 |
Using the data, the best fitting regression model is: \(y =\) 3.98 \(+\) 3.59 \(\cdot x_1\). In this example, the \(R^2\) is 0.98. The equation is not a perfect prediction, but with a single predictor variable, it captures the majority of the variance in the outcome.
Now consider the following example where you add a second predictor variable to the data above, as shown in Table 11.2.
y | x1 | x2 |
---|---|---|
7 | 1 | 3 |
13 | 2 | 5 |
29 | 7 | 1 |
10 | 2 | 2 |
With the second predictor variable, the best fitting regression model is: \(y =\) 0.00 + 4.00 \(\cdot x_1 +\) 1.00 \(\cdot x_2\). In this example, the \(R^2\) is 1.00. The equation with the second predictor variable provides a perfect prediction of the outcome.
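The two examples can be reproduced in R with the data from Table 11.1 and Table 11.2. This is a minimal sketch; the object names (`exampleData`, `onePredictorModel`, `twoPredictorModel`) are chosen here for illustration, and \(R^2\) is computed as the squared correlation between the fitted and observed values.

```r
# Data from Table 11.1 (y, x1) and Table 11.2 (adds x2)
exampleData <- data.frame(
  y  = c(7, 13, 29, 10),
  x1 = c(1, 2, 7, 2),
  x2 = c(3, 5, 1, 2))

onePredictorModel <- lm(y ~ x1, data = exampleData)
twoPredictorModel <- lm(y ~ x1 + x2, data = exampleData)

# One predictor: intercept 3.98, slope 3.59
round(coef(onePredictorModel), 2)

# Two predictors: intercept 0, slopes 4 and 1
round(coef(twoPredictorModel), 2)

# R-squared as the squared correlation of fitted and observed values
r2OnePredictor <- cor(fitted(onePredictorModel), exampleData$y)^2
r2TwoPredictors <- cor(fitted(twoPredictorModel), exampleData$y)^2

round(c(r2OnePredictor, r2TwoPredictors), 2) # 0.98 and 1
```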
Providing perfect prediction with the right set of predictor variables is the dream of multiple regression. So, using multiple regression, we often add predictor variables to incrementally improve prediction. However, some of the variance will be accounted for merely by chance. The proportion of variance expected to be accounted for by chance is given by Equation 11.3:
\[ E(R^2) = \frac{K}{n-1} \tag{11.3}\]
where \(E(R^2)\) is the expected value of \(R^2\) (the proportion of variance explained), \(K\) is the number of predictor variables, and \(n\) is the sample size. The formula demonstrates that the more predictor variables in the regression model, the more variance will be accounted for by chance. With many predictor variables and a small sample, you can account for a large share of the variance merely by chance.
As an example, consider that we have 13 predictor variables to predict fantasy performance for 43 players. Assume that, with 13 predictor variables, we explain 38% of the variance (\(R^2 = .38; r = .62\)). We explained a lot of the variance in the outcome, but it is important to consider how much variance could have been explained by random chance: \(E(R^2) = \frac{K}{n-1} = \frac{13}{43 - 1} = .31\). We expect to explain 31% of the variance, by chance, in the outcome. So, 82% of the variance explained was likely spurious (i.e., \(\frac{.31}{.38} = .82\)). As the sample size increases, the spuriousness decreases.
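The expected chance-level \(R^2\) can be checked with a small simulation. In this sketch, 13 predictor variables of pure random noise are used to predict a random outcome for 43 hypothetical players, many times over; the number of replications (`nSims`) and the object names are assumptions for illustration.

```r
set.seed(4)

n <- 43       # sample size
K <- 13       # number of predictor variables
nSims <- 1000 # number of simulated datasets

# Each replication: predictors and outcome are unrelated random noise
chanceR2 <- replicate(nSims, {
  predictors <- matrix(rnorm(n * K), nrow = n)
  outcome <- rnorm(n)
  summary(lm(outcome ~ predictors))$r.squared
})

# Average R-squared obtained by chance versus the value from Equation 11.3
mean(chanceR2)
K / (n - 1)
```

The average simulated \(R^2\) should be close to \(\frac{K}{n-1}\), even though none of the predictor variables is truly related to the outcome.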
To account for the number of predictor variables in the model, we can use a modified version of \(R^2\) called adjusted \(R^2\) (\(R^2_{adj}\)). Adjusted \(R^2\) penalizes overfitting by adjusting for how much variance would be expected to be accounted for by chance, given the number of predictor variables in the model. It thus reflects the proportion of variance in the outcome (dependent) variable that is explained by the model predictions over and above what would be expected by chance. The formula for adjusted \(R^2\) is in Equation 11.4:
\[ R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1} \tag{11.4}\]
where \(p\) is the number of predictor variables in the model (denoted \(K\) in Equation 11.3), and \(n\) is the sample size.
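Equation 11.4 can be written as a small R function. The function name `adjustedR2` is hypothetical; the sketch applies it to the earlier example (\(R^2 = .38\), 43 players, 13 predictor variables) and then checks it against the adjusted \(R^2\) that `summary()` reports for a model fit to simulated data.

```r
# Adjusted R-squared from Equation 11.4
adjustedR2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Earlier example: R-squared = .38 with 13 predictors and 43 players
adjustedR2(r2 = 0.38, n = 43, p = 13) # approximately 0.10

# Check against lm() using simulated (hypothetical) data
set.seed(5)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 0.4 * x1 + rnorm(n)

checkModel <- summary(lm(y ~ x1 + x2))

manualAdjustedR2 <- adjustedR2(
  r2 = checkModel$r.squared,
  n = n,
  p = 2)

c(manual = manualAdjustedR2, lm = checkModel$adj.r.squared)
```

In the earlier example, the adjusted value is far below the unadjusted \(R^2\) of .38, consistent with most of the explained variance being attributable to chance.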
11.6 Overfitting
Statistical models applied to big data (e.g., data with many predictor variables) can overfit the data, which means that the statistical model accounts for error variance, which will not generalize to future samples. So, even though an overfitting statistical model appears to be accurate because it is accounting for more variance, it is not actually that accurate—it will predict new data less accurately than how accurately it accounts for the data with which the model was built. In the case of fantasy football analytics, this is especially relevant because there are hundreds if not thousands of variables we could consider for inclusion and many, many players when considering historical data.
Consider an example where you develop an algorithm to predict players’ fantasy performance based on 2023 data using hundreds of predictor variables. To some extent, these predictor variables will likely account for true variance (i.e., signal) and error variance (i.e., noise). If we were to apply the same algorithm based on the 2023 prediction model to 2024 data, the prediction model would likely predict less accurately than with 2023 data. The regression coefficients in the 2023 model were estimated partly from error variance that is specific to the 2023 data, so they would not be expected to apply as well to the 2024 data.
In Figure 11.3, the blue line represents the true distribution of the data, and the red line is an overfitting model:
Code
# dplyr provides %>% and arrange()
library("dplyr")

set.seed(52242)

sampleSize <- 200

# Simulate data with a true quadratic association
quadraticX <- rnorm(sampleSize)
quadraticY <- quadraticX ^ 2 + rnorm(sampleSize)

quadraticData <- cbind(quadraticX, quadraticY) %>%
  data.frame %>%
  arrange(quadraticX)

# Quadratic model (blue line): matches the data-generating process
quadraticModel <- lm(
  quadraticY ~ quadraticX + I(quadraticX ^ 2),
  data = quadraticData)

# Grid of predictor values spanning the observed range of quadraticX
quadraticNewData <- data.frame(
  quadraticX = seq(
    from = min(quadraticData$quadraticX),
    to = max(quadraticData$quadraticX),
    length.out = sampleSize))

quadraticNewData$quadraticY <- predict(
  quadraticModel,
  newdata = quadraticNewData)

# LOESS with a very small span (red line): overfits by chasing the noise
loessFit <- loess(
  quadraticY ~ quadraticX,
  data = quadraticData,
  span = 0.01,
  degree = 1)

quadraticNewData$loessY <- predict(
  loessFit,
  newdata = quadraticNewData)

# Scatterplot of the data with both fitted lines
plot(
  x = quadraticData$quadraticX,
  y = quadraticData$quadraticY,
  xlab = "",
  ylab = "")

lines(
  quadraticNewData$quadraticY ~ quadraticNewData$quadraticX,
  lwd = 2,
  col = "blue")

lines(
  quadraticNewData$loessY ~ quadraticNewData$quadraticX,
  lwd = 2,
  col = "red")
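The loss of accuracy when an overfitting model is applied to new data can also be sketched with a regression that includes many irrelevant predictor variables. Everything here is simulated and hypothetical: `simulateSeason()` is a helper written for this illustration, one predictor (`signal`) is truly related to the outcome, and 30 others are pure noise. A small sample stands in for the 2023 data and a large sample stands in for future data.

```r
set.seed(5)

# Simulate one "season": one real predictor plus pure-noise predictors
simulateSeason <- function(n, nNoise = 30) {
  noise <- matrix(
    rnorm(n * nNoise),
    nrow = n,
    dimnames = list(NULL, paste0("noise", seq_len(nNoise))))
  signal <- rnorm(n)
  data.frame(
    fantasyPoints = 2 * signal + rnorm(n, sd = 2),
    signal = signal,
    noise)
}

trainingData <- simulateSeason(n = 60) # stands in for the 2023 data
newData <- simulateSeason(n = 1000)    # stands in for future data

# Model with all 31 predictors versus a model with only the real predictor
overfitModel <- lm(fantasyPoints ~ ., data = trainingData)
simpleModel <- lm(fantasyPoints ~ signal, data = trainingData)

# Proportion of variance explained for a set of predictions
rSquared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

accuracy <- c(
  overfitInSample = rSquared(
    trainingData$fantasyPoints, fitted(overfitModel)),
  overfitNewData = rSquared(
    newData$fantasyPoints, predict(overfitModel, newdata = newData)),
  simpleInSample = rSquared(
    trainingData$fantasyPoints, fitted(simpleModel)),
  simpleNewData = rSquared(
    newData$fantasyPoints, predict(simpleModel, newdata = newData)))

round(accuracy, 2)
```

The model with many predictors should look more accurate on the data it was built with but predict the new data worse than the simpler model does.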
11.7 Covariates
Covariates are variables that you include in the statistical model to try to control for them so you can better isolate the unique contribution of the predictor variable(s) in relation to the outcome variable. Use of covariates examines the association between the predictor variable and the outcome variable when holding people’s level constant on the covariates. Inclusion of confounds as covariates allows potentially gaining a more accurate estimate of the causal effect of the predictor variable on the outcome variable. Ideally, you want to include any and all confounds as covariates. As described in Section 8.4.2.1, confounds are third variables that influence both the predictor variable and the outcome variable and explain their association. Covariates are potentially (but not necessarily) confounds. For instance, you might include the player’s age as a covariate in a model that examines whether a player’s 40-yard dash time at the NFL Combine predicts their fantasy points in their rookie year, but it may not be a confound.
11.8 Multicollinearity
Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated. The problem with having multiple predictor variables that are highly correlated is that it becomes difficult to estimate the regression coefficients accurately.
Multicollinearity in multiple regression is depicted conceptually in Figure 11.4.
Consider the following example adapted from Petersen (2024b) where you have two predictor variables and one outcome variable, as shown in Table 11.3.
y | x1 | x2 |
---|---|---|
9 | 2.0 | 4 |
11 | 3.0 | 6 |
17 | 4.0 | 8 |
3 | 1.0 | 2 |
21 | 5.0 | 10 |
13 | 3.5 | 7 |
The second predictor variable is not very useful because it is exactly twice the value of the first predictor variable; thus, the two predictor variables are perfectly correlated (i.e., \(r = 1.0\)). This means that there are many different prediction equations that are equally good, as shown in Equation 11.5:
\[ \begin{aligned} 2x_2 &= y \\ 0x_1 + 2x_2 &= y \\ 4x_1 &= y \\ 4x_1 + 0x_2 &= y \\ 2x_1 + 1x_2 &= y \\ 5x_1 - 0.5x_2 &= y \\ ... &= y \end{aligned} \tag{11.5}\]
Then, what are the regression coefficients? We do not know what the correct regression coefficients are because each of the possibilities fits the data equally well. Thus, when estimating the regression model, we could obtain arbitrary estimates of the regression coefficients with an enormous standard error around each estimate. In general, multicollinearity increases the uncertainty (i.e., standard errors and confidence intervals) around the parameter estimates. Any predictor variables that correlate with each other above approximately \(r = .30\) could affect the confidence interval of the regression coefficient. As the correlations among the predictor variables increase, the chance of getting an arbitrary answer increases, sometimes called “bouncing betas.” So, it is important to examine a correlation matrix of the predictor variables before putting them in the same regression model. You can also examine indices such as the variance inflation factor (VIF).
To address multicollinearity, you can drop a redundant predictor or you can also use principal component analysis or factor analysis of the predictors to reduce the predictors down to a smaller number of meaningful predictors. For a meaningful answer in a regression framework that is precise and confident, you need a low level of intercorrelation among predictors, unless you have a very large sample size.
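The following sketch shows what happens in R with the perfectly correlated predictors from Table 11.3, and then computes VIF by hand for a less extreme case. The helper `computeVIF()` is written for this illustration (it uses the standard definition \(VIF_j = \frac{1}{1 - R^2_j}\), where \(R^2_j\) comes from regressing predictor \(j\) on the other predictors), and the second dataset is simulated and hypothetical.

```r
# Data from Table 11.3: x2 is exactly twice x1
collinearData <- data.frame(
  y  = c(9, 11, 17, 3, 21, 13),
  x1 = c(2, 3, 4, 1, 5, 3.5),
  x2 = c(4, 6, 8, 2, 10, 7))

cor(collinearData$x1, collinearData$x2) # perfect correlation

# x2 carries no unique information, so lm() reports NA for its coefficient
perfectModel <- lm(y ~ x1 + x2, data = collinearData)
coef(perfectModel)

# VIF for each predictor: 1 / (1 - R-squared from regressing it on the others)
computeVIF <- function(predictors) {
  sapply(names(predictors), function(name) {
    others <- predictors[, setdiff(names(predictors), name), drop = FALSE]
    r2 <- summary(lm(predictors[[name]] ~ ., data = others))$r.squared
    1 / (1 - r2)
  })
}

# Simulated (hypothetical) predictors: x1 and x2 highly correlated, x3 unrelated
set.seed(6)
n <- 200
x1 <- rnorm(n)
simPredictors <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),
  x3 = rnorm(n))

round(cor(simPredictors), 2)

vifValues <- computeVIF(simPredictors)
round(vifValues, 1)
```

The two highly correlated predictors should show very large VIF values, whereas the unrelated predictor should have a VIF near 1.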
11.9 Impact of Outliers
As with correlation, multiple regression can be strongly impacted by outliers.
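As a rough illustration, the sketch below adds a single extreme observation to simulated, hypothetical data and compares the estimated slope with and without it. It also uses Cook's distance (`cooks.distance()`), one common index of how strongly each observation influences the estimates.

```r
set.seed(8)
n <- 50
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

modelWithoutOutlier <- lm(y ~ x)

# Add one hypothetical outlying observation with extreme x and y values
outlierData <- data.frame(
  x = c(x, 6),
  y = c(y, -10))

modelWithOutlier <- lm(y ~ x, data = outlierData)

# Compare the slopes with and without the outlier
slopes <- c(
  withoutOutlier = coef(modelWithoutOutlier)[["x"]],
  withOutlier = coef(modelWithOutlier)[["x"]])
slopes

# Cook's distance flags observations with a large influence on the estimates
cooksD <- cooks.distance(modelWithOutlier)
mostInfluential <- unname(which.max(cooksD))
mostInfluential # the added observation (row 51)
```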
11.10 Moderated Multiple Regression
When examining moderation in multiple regression, several steps are important:
- When computing the interaction term, first mean-center the predictor variables. Calculate the interaction term as the multiplication of the mean-centered predictor variables. Mean-centering the predictor variables when computing the interaction term is important for addressing issues regarding multicollinearity (Iacobucci et al., 2016).
- When including an interaction term in the model, make sure also to include the main effects.
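The steps above can be sketched in R as follows. The data are simulated and hypothetical (the variables `age` and `speed`, their distributions, and the true coefficients are assumptions for illustration). The sketch mean-centers the predictors, computes the interaction term as their product, includes the main effects in the model, and compares how strongly the product term correlates with a predictor before versus after centering.

```r
set.seed(7)
n <- 300

# Hypothetical predictors
age <- rnorm(n, mean = 26, sd = 3)
speed <- rnorm(n, mean = 4.5, sd = 0.2)

moderationData <- data.frame(age, speed)

# Step 1: mean-center the predictors, then multiply to form the interaction
moderationData$ageCentered <- moderationData$age - mean(moderationData$age)
moderationData$speedCentered <- moderationData$speed - mean(moderationData$speed)
moderationData$ageBySpeed <-
  moderationData$ageCentered * moderationData$speedCentered

# Hypothetical outcome with main effects and an interaction effect
moderationData$fantasyPoints <- 100 +
  2 * moderationData$ageCentered -
  20 * moderationData$speedCentered +
  5 * moderationData$ageBySpeed +
  rnorm(n, sd = 10)

# Step 2: include the main effects along with the interaction term
moderationModel <- lm(
  fantasyPoints ~ ageCentered + speedCentered + ageBySpeed,
  data = moderationData)

coef(moderationModel)

# Correlation of the product term with a predictor: raw versus mean-centered
corRaw <- cor(moderationData$age, moderationData$age * moderationData$speed)
corCentered <- cor(moderationData$ageCentered, moderationData$ageBySpeed)

round(c(raw = corRaw, centered = corCentered), 2)
```

The uncentered product term should correlate strongly with its component predictor, whereas the centered product term should correlate only weakly with it.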
11.11 Mediation
11.12 Bayesian Multiple Regression
11.13 Conclusion
Multiple regression allows examining the association between multiple predictor variables and one outcome variable. Inclusion of multiple predictors in the model allows for potentially greater predictive accuracy and identification of the extent to which each variable uniquely contributes to the outcome variable. As with correlation, an association does not imply causation. However, identifying associations is important because associations are a necessary (but insufficient) condition for causality. When developing a multiple regression model, it is important to pay attention to potential multicollinearity, because the greater uncertainty around the parameter estimates may make it difficult to detect a given predictor variable as statistically significant.
11.14 Session Info
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.49 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[5] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
[9] tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0 petersenlab_1.1.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 xfun_0.49 htmlwidgets_1.6.4 psych_2.4.6.26
[5] lattice_0.22-6 tzdb_0.4.0 quadprog_1.5-8 vctrs_0.6.5
[9] tools_4.4.2 generics_0.1.3 stats4_4.4.2 parallel_4.4.2
[13] fansi_1.0.6 cluster_2.1.6 pkgconfig_2.0.3 data.table_1.16.2
[17] checkmate_2.3.2 RColorBrewer_1.1-3 lifecycle_1.0.4 compiler_4.4.2
[21] munsell_0.5.1 mnormt_2.1.1 mitools_2.4 htmltools_0.5.8.1
[25] yaml_2.3.10 htmlTable_2.4.3 Formula_1.2-5 pillar_1.9.0
[29] Hmisc_5.2-0 rpart_4.1.23 nlme_3.1-166 lavaan_0.6-19
[33] tidyselect_1.2.1 digest_0.6.37 mvtnorm_1.3-2 stringi_1.8.4
[37] reshape2_1.4.4 fastmap_1.2.0 grid_4.4.2 colorspace_2.1-1
[41] cli_3.6.3 magrittr_2.0.3 base64enc_0.1-3 utf8_1.2.4
[45] pbivnorm_0.6.0 withr_3.0.2 foreign_0.8-87 scales_1.3.0
[49] backports_1.5.0 timechange_0.3.0 rmarkdown_2.29 nnet_7.3-19
[53] gridExtra_2.3 hms_1.1.3 evaluate_1.0.1 mix_1.0-12
[57] viridisLite_0.4.2 rlang_1.1.4 Rcpp_1.0.13-1 xtable_1.8-4
[61] glue_1.8.0 DBI_1.2.3 rstudioapi_0.17.1 jsonlite_1.8.9
[65] R6_2.5.1 plyr_1.8.9