\[
y = \beta_0 + \beta_1x_1 + \epsilon
\tag{12.1}\]
where \(y\) is the outcome variable, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, \(x_1\) is the predictor variable, and \(\epsilon\) is the error term.
The equation for multiple regression extends this to multiple predictor variables, as in Equation 12.2:
\[
y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p + \epsilon
\tag{12.2}\]
where \(p\) is the number of predictor variables. Multiple regression is basically a weighted sum of the predictor variables plus an intercept. Under the hood, multiple regression seeks to identify the best weight for each predictor.
12.3 Components
\(B\) = unstandardized coefficient: direction and magnitude of the estimate (original scale)
\(\beta\) (beta) = standardized coefficient: direction and magnitude of the estimate (standard deviation scale)
\(SE\) = standard error: uncertainty of unstandardized estimate
The unstandardized regression coefficient (\(B\)) is interpreted such that, for every unit change in the predictor variable, there is a __ unit change in the outcome variable. For instance, when examining the association between age and fantasy points, if the unstandardized regression coefficient is 2.3, players score on average 2.3 more points for each additional year of age. (In reality, we might expect a nonlinear, inverted-U-shaped association between age and fantasy points such that players tend to reach their peak in the middle of their careers.) Unstandardized regression coefficients are tied to the metric of the raw data. Thus, unstandardized regression coefficients of the same size for two different variables may mean completely different things. Holding the strength of the association constant, you tend to see larger unstandardized regression coefficients for variables with smaller units and smaller unstandardized regression coefficients for variables with larger units.
Standardized regression coefficients can be obtained by standardizing the variables to z-scores so they all have a mean of zero and a standard deviation of one. The standardized regression coefficient (\(\beta\)) is interpreted such that, for every standard deviation change in the predictor variable, there is a __ standard deviation change in the outcome variable. For instance, when examining the association between age and fantasy points, if the standardized regression coefficient is 0.1, players score on average 0.1 standard deviation more points for each additional standard deviation of age. Standardized regression coefficients—though not in all instances—tend to fall between [−1, 1]. Thus, standardized regression coefficients tend to be more comparable across variables and models than unstandardized regression coefficients. In this way, standardized regression coefficients provide a meaningful index of effect size.
The standard error of a regression coefficient represents the uncertainty of the parameter. If we have less uncertainty (i.e., more confidence) about the parameter, the standard error will be small. If we have more uncertainty (i.e., less confidence) about the parameter, the standard error will be large. If we used the same sampling procedure repeatedly and calculated the regression coefficient each time, the true parameter in the population would fall 68% of the time within the interval of: \([\text{model parameter estimate for the regression coefficient} \pm 1 \text{ standard error}]\) .
A confidence interval represents a range of plausible values such that, with repeated sampling, the true value falls within a given interval with some confidence. Our parameter estimate for the regression coefficient, plus or minus 1 standard error, reflects the 68% confidence interval for the coefficient. The 95% confidence interval is computed as the parameter estimate plus or minus 1.96 standard errors. For instance, if the parameter estimate for the regression coefficient is 0.50, and the standard error is 0.10, the 95% confidence interval is [0.30, 0.70]: \(0.5 - (1.96 \times 0.10) = 0.3\); \(0.5 + (1.96 \times 0.10) = 0.7\). That is, if we used the same sampling procedure repeatedly, the true value of the regression coefficient would be expected to fall 95% of the time somewhere between 0.30 and 0.70.
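As a quick check, the confidence interval arithmetic from this example can be reproduced in R:

estimate <- 0.50
se <- 0.10

estimate + c(-1.96, 1.96) * se # 95% confidence interval: 0.30, 0.70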
the associations between the predictor variables and the outcome variable are linear
there is homoscedasticity of the residuals; the residuals do not differ as a function of the predictor variable or as a function of the outcome variable
the residuals are independent; they are uncorrelated with each other
the residuals are normally distributed
Homoscedasticity of the residuals means that the variance of the residuals does not differ as a function of the outcome variable or as a function of the predictor variable (i.e., the residuals show constant variance across the outcome/predictors). If the residuals differ as a function of the outcome or predictor variable, this is called heteroscedasticity.
Those are some of the key assumptions of multiple regression. However, there are additional assumptions of multiple regression, including ones discussed in the chapter on Causal Inference. For instance, the variables included should reflect a causal process such that the predictor variables influence the outcome variable, and not the other way around. That is, the outcome variable should not influence the predictor variables (i.e., there should be no reverse causation). In addition, it is important to control for any confound(s). If a confound is not controlled for, this is called omitted-variable bias, and it leads the researcher to incorrectly attribute the effects of the omitted variable to the included variables.
12.5.1 Evaluating and Addressing Assumptions of Multiple Regression
12.5.1.1 Linear Association
To evaluate the shape of the association between the predictor variables and the outcome variable, we can examine scatterplots (Figure 12.2), residual plots (Figure 12.21), marginal model plots (Figure 12.12), added-variable plots (Figure 12.13), and component-plus-residual plots (Figure 12.14). Residual plots depict the residuals (errors) on the y-axis as a function of the fitted values or a specific predictor on the x-axis. Marginal model plots are basically glorified scatterplots that depict the outcome variable (both in terms of observed values and model-fitted values) on the y-axis and the predictor variables on the x-axis. Added-variable plots depict the unique association of each predictor variable with the outcome variable when controlling for all the other predictor variables in the model. Component-plus-residual plots depict partial residuals on the y-axis as a function of each predictor variable on the x-axis, where a partial residual for a given predictor is the effect of a given predictor (thus controlling for all the other predictor variables in the model) plus the residual from the full model.
Examples of linear and nonlinear associations are depicted with scatterplots in Figure 12.2.
Figure 12.2: Example Associations Depicted With Scatterplots.
If the shape of the association is nonlinear (as indicated by any of these plots), various approaches may be necessary, such as including nonlinear terms (e.g., polynomial terms such as quadratic, cubic, quartic, or higher-degree terms), transforming the predictors (e.g., log, square root, inverse, exponential, Box-Cox, or Yeo-Johnson transformations), using splines/piecewise regression, or using generalized additive models.
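For example, several of these approaches can be specified directly in the model formula. Here is a minimal sketch with hypothetical variables y and x in a hypothetical data frame named exampleData:

library(splines)

fit_quadratic <- lm(y ~ x + I(x^2), data = exampleData) # polynomial (quadratic) term
fit_logx <- lm(y ~ log(x + 1), data = exampleData) # log-transformed predictor
fit_spline <- lm(y ~ ns(x, df = 3), data = exampleData) # natural cubic spline
fit_gam <- mgcv::gam(y ~ s(x), data = exampleData) # generalized additive model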
12.5.1.2 Homoscedasticity
To evaluate homoscedasticity, we can evaluate a residual plot (Figure 12.21) and spread-level plot (Figure 12.22). A residual plot depicts the residuals on the y-axis as a function of the model’s fitted values on the x-axis. Homoscedasticity in a residual plot is identified as a constant spread of residuals versus fitted values—the residuals do not show a fan, cone, or bow-tie shape; a fan, cone, or bow-tie shape indicates heteroscedasticity. In a residual plot, a fan or cone shape indicates increasing or decreasing variance in the residuals as a function of the fitted values; a bow-tie shape indicates that the residuals are smallest in the middle of the fitted values and greatest on the extremes of the fitted values. A spread-level plot depicts the log of the absolute value of studentized residuals on the y-axis as a function of the log of the model’s fitted values on the x-axis. Homoscedasticity in a spread-level plot is identified as a flat slope; a slope that differs from zero indicates heteroscedasticity.
Figure 12.3: Example of Homoscedasticity and Heteroscedasticity in Residual Plots.
If there is heteroscedasticity, it may be necessary to transform the outcome variable to be more normally distributed. The spread-level plot provides a suggested power transformation to transform the outcome variable so that the spread of residuals becomes more uniform across the fitted values.
12.5.1.3 Uncorrelated Residuals
To determine if residuals are correlated by a grouping level, we can examine the proportion of variance that is attributable to the grouping level using the intraclass correlation coefficient (ICC) from a mixed model. The greater the ICC value, the more variance is accounted for by the grouping level, and the more the residuals are intercorrelated. If the residuals are intercorrelated, it may be necessary to account for the grouping structure of the data using a mixed model.
12.5.1.4 Normally Distributed Residuals
To examine whether residuals are normally distributed, we can examine quantile–quantile (QQ) plots and probability–probability (PP) plots. QQ plots are particularly useful for identifying deviations from normality in the tails of the distribution; PP plots are particularly useful for identifying deviations from normality in the center of the distribution. Researchers tend to be more concerned about the tails of the distribution, because extreme values tend to have a greater impact on inferences, so researchers tend to use QQ plots more often than PP plots. Various examples of QQ plots and deviations from normality are depicted in Figure 12.4. If the residuals are normally distributed, they will stay close to the diagonal reference line of the QQ and PP plots.
Figure 12.4: Quantile–Quantile (QQ) Plots of Various Distributions (Right Side) with the Histogram of the Associated Distribution (Left Side).
If the residuals are not normally distributed (i.e., they do not stay close to the diagonal reference line of the QQ and PP plots), it may be necessary to transform the outcome variable to be more normally distributed or to use a generalized linear model (GLM) that more closely matches the distribution of the outcome variable (e.g., Poisson, binomial, gamma).
As noted in Section 10.6.6, the coefficient of determination (\(R^2\)) reflects the proportion of variance in the outcome (dependent) variable that is explained by the model predictions, as in Equation 10.25: \(R^2 = \frac{\text{variance explained in }Y}{\text{total variance in }Y}\). Various formulas for \(R^2\) are in Equation 10.19. Larger \(R^2\) values indicate greater accuracy. Multiple regression can be conceptualized with overlapping circles (similar to a Venn diagram), where the non-overlapping portions of the circles reflect nonshared variance and the overlapping portions of the circles reflect shared variance, as in Figure 12.5.
Figure 12.5: Conceptual Depiction of Proportion of Variance Explained (\(R^2\)) in an Outcome Variable (\(Y\)) by Multiple Predictors (\(X1\) and \(X2\)) in Multiple Regression. The size of each circle represents the variable’s variance. The proportion of variance in \(Y\) that is explained by the predictors is depicted by the areas in orange. The dark orange space (\(G\)) is where multiple predictors explain overlapping variance in the outcome. Overlapping variance that is explained in the outcome (\(G\)) will not be recovered in the regression coefficients when both predictors are included in the regression model. From Petersen (2024) and Petersen (2025).
One issue with \(R^2\) is that it increases as the number of predictors increases, which can lead to overfitting if using \(R^2\) as an index to compare models for purposes of selecting the “best-fitting” model. Consider the following example (adapted from Petersen (2025)) in which you have one predictor variable and one outcome variable, as shown in Table 12.1.
Table 12.1: Example Data of Predictor (x1) and Outcome (y) Used for Regression Model.
 y   x1
 7    1
13    2
29    7
10    2
Using these data, the best-fitting regression model is: \(y = 3.98 + 3.59 \cdot x_1\). In this example, the \(R^2\) is 0.98. The equation is not a perfect prediction, but with a single predictor variable, it captures the majority of the variance in the outcome.
Now consider the following example where you add a second predictor variable to the data above, as shown in Table 12.2.
Table 12.2: Example Data of Predictors (x1 and x2) and Outcome (y) Used for Regression Model.
 y   x1   x2
 7    1    3
13    2    5
29    7    1
10    2    2
With the second predictor variable, the best-fitting regression model is: \(y = 0.00 + 4.00 \cdot x_1 + 1.00 \cdot x_2\). In this example, the \(R^2\) is 1.00. The equation with the second predictor variable provides a perfect prediction of the outcome.
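These estimates can be reproduced directly in R; a minimal sketch using the data from Tables 12.1 and 12.2:

exampleData <- data.frame(
  y = c(7, 13, 29, 10),
  x1 = c(1, 2, 7, 2),
  x2 = c(3, 5, 1, 2))

summary(lm(y ~ x1, data = exampleData))$r.squared # ~0.98 with one predictor
summary(lm(y ~ x1 + x2, data = exampleData))$r.squared # 1.00 with both predictors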
Providing perfect prediction with the right set of predictor variables is the dream of multiple regression. So, using multiple regression, we often add predictor variables to incrementally improve prediction. However, it is important to know how much variance would be accounted for by random chance, which is given by Equation 12.3:
\[
E(R^2) = \frac{K}{n-1}
\tag{12.3}\]
where \(E(R^2)\) is the expected value of \(R^2\) (the proportion of variance explained), \(K\) is the number of predictor variables, and \(n\) is the sample size. The formula demonstrates that the more predictor variables in the regression model, the more variance will be accounted for by chance. With many predictor variables and a small sample, you can account for a large share of the variance merely by chance.
As an example, consider that we have 13 predictor variables to predict fantasy performance for 43 players. Assume that, with 13 predictor variables, we explain 38% of the variance (\(R^2 = .38; r = .62\)). We explained a lot of the variance in the outcome, but it is important to consider how much variance could have been explained by random chance: \(E(R^2) = \frac{K}{n-1} = \frac{13}{43 - 1} = .31\). We expect to explain 31% of the variance, by chance, in the outcome. So, 82% of the variance explained was likely spurious (i.e., \(\frac{.31}{.38} = .82\)). As the sample size increases, the spuriousness decreases. To account for the number of predictor variables in the model, we can use a modified version of \(R^2\) called adjusted \(R^2\) (\(R^2_{adj}\)), described next.
12.6.2 Adjusted \(R^2\) (\(R^2_{adj}\))
Adjusted \(R^2\) (\(R^2_{adj}\)) accounts for the number of predictor variables in the model, based on how much would be expected to be accounted for by chance to penalize overfitting. Adjusted \(R^2\) (\(R^2_{adj}\)) reflects the proportion of variance in the outcome (dependent) variable that is explained by the model predictions over and above what would be expected to be accounted for by chance, given the number of predictor variables in the model. The formula for adjusted \(R^2\) (\(R^2_{adj}\)) is in Equation 12.4:
\[
R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}
\tag{12.4}\]
where \(p\) is the number of predictor variables in the model, and \(n\) is the sample size.
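As a quick illustration, the chance-expected and adjusted values for the earlier example (13 predictors, \(n = 43\), \(R^2 = .38\)) can be computed directly; a minimal sketch using the formulas above:

k <- 13; n <- 43; r2 <- .38

k / (n - 1) # expected R^2 by chance (Equation 12.3): ~.31
1 - (1 - r2) * (n - 1) / (n - k - 1) # adjusted R^2 (Equation 12.4): ~.10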
12.7 Overfitting
Statistical models applied to big data (e.g., data with many predictor variables) can overfit the data, which means that the statistical model accounts for error variance that will not generalize to future samples. So, even though an overfitting statistical model appears to be accurate because it accounts for more variance, it is not actually that accurate—it will predict new data less accurately than it fits the data with which the model was built. In the case of fantasy football analytics, this is especially relevant because there are hundreds if not thousands of variables we could consider for inclusion and many, many players when considering historical data.
Consider an example where you develop an algorithm to predict players’ fantasy performance based on 2024 data using hundreds of predictor variables. To some extent, these predictor variables will likely account for true variance (i.e., signal) and error variance (i.e., noise). If we were to apply the same algorithm based on the 2024 prediction model to 2025 data, the prediction model would likely predict less accurately than with 2024 data. The regression coefficients tend to become weaker when applied to new data, a phenomenon called shrinkage, which is described in Section 16.7.1. The regression coefficients in the [FILL IN]
In Figure 12.6, the blue line represents the true distribution of the data, and the red line is an overfitting model:
Figure 12.6: Over-fitting Model in Red Relative to the True Distribution of the Data in Blue. From Petersen (2024) and Petersen (2025).
12.8 Covariates
Covariates are variables that you include in the statistical model to try to control for them so you can better isolate the unique contribution of the predictor variable(s) in relation to the outcome variable. Including covariates allows examining the association between the predictor variable and the outcome variable while holding people’s levels on the covariates constant. Including confounds as covariates can yield a more accurate estimate of the causal effect of the predictor variable on the outcome variable. Ideally, you want to include any and all confounds as covariates. As described in Section 9.4.2.1, confounds are third variables that influence both the predictor variable and the outcome variable and explain their association. Covariates are potentially (but not necessarily) confounds. For instance, you might include the player’s age as a covariate in a model that examines whether a player’s 40-yard dash time at the NFL Combine predicts their fantasy points in their rookie year, even though age may not be a confound.
12.9 Example: Predicting Wide Receivers’ Fantasy Points
Let’s say we want to use a number of variables to predict a wide receiver’s fantasy performance. We want to consider several predictors, including the player’s age, height, weight, and target share. We have only a few predictors and our sample size is large enough such that overfitting is not likely a concern.
On a weekly basis, target share is computed as the number of targets a player receives divided by the team’s total number of targets. It would be nice to compute it this way for seasonal data, as well; however, for calculating seasonal target share, the nflfastR (Carl & Baldwin, 2024) package currently sums up the player’s weekly target shares. Thus, seasonal target share (unlike weekly target share)—as it is currently calculated—does not range from 0–1. For example, if a player has a weekly target share of 15% in each of 17 games, that would be a seasonal (cross-week-summed) target share of 2.55 (\(0.15 \times 17 \; \text{games} = 2.55\)).
Figure 12.8: Scatterplots With Fantasy Points (Season) Among Wide Receivers. The linear best-fit line is in black. The nonlinear best-fit line is in blue.
There are some suggestions of potential nonlinearity, such as an inverted-U-shaped association between height and fantasy points, suggesting that there may be an optimal range for height among Wide Receivers—being too short or too tall could be a disadvantage. In addition, target share shows a weakening association with fantasy points as target share increases. Thus, after evaluating the linear associations between the predictors and the outcome, we will also examine the possibility of curvilinear associations.
12.9.3 Estimate Multiple Regression Model
Now that we have examined descriptive statistics and bivariate associations, let’s first estimate a multiple regression model with only linear terms:
Call:
lm(formula = fantasyPoints ~ age + height + weight + target_share,
data = player_stats_seasonal %>% filter(position %in% c("WR")),
na.action = "na.exclude")
Residuals:
Min 1Q Median 3Q Max
-739.80 -18.68 -10.50 10.74 292.78
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -34.62684 23.85029 -1.452 0.146617
age 0.75193 0.22418 3.354 0.000803 ***
height 0.05816 0.42094 0.138 0.890107
weight 0.14716 0.06686 2.201 0.027794 *
target_share 743.14867 7.38684 100.604 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 45.71 on 4432 degrees of freedom
(952 observations deleted due to missingness)
Multiple R-squared: 0.7054, Adjusted R-squared: 0.7052
F-statistic: 2654 on 4 and 4432 DF, p-value: < 2.2e-16
The terms that are significantly associated with fantasy performance among Wide Receivers are age, weight, and target share, each of which showed a positive association with fantasy points; height was not significantly associated with fantasy points.
If we want to obtain standardized regression coefficients, we can use the standardize_parameters() function of the effectsize package (Ben-Shachar et al., 2020, 2025).
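A minimal sketch, assuming the fitted model object from above is named fit (a hypothetical name):

library(effectsize)

standardize_parameters(fit) # standardized coefficients with 95% confidence intervals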
Figure 12.10: Standardized Regression Coefficients with 95% Confidence Interval.
12.9.4.2 Model-Implied Association
If we wanted to visualize the shape of the model-implied association between target share and fantasy points, we could generate the model-implied predictions using the data range that we want to visualize.
Figure 12.11: Model-Implied Predictions of A Wide Receiver’s Fantasy Points as a Function of Target Share. The model-implied predictions were estimated based on a multiple regression model.
We could also generate the model-implied prediction of fantasy points for any value of the predictor variables. For instance, we can compute the number of fantasy points expected for a Wide Receiver who is 23 years old, is 6’2” tall (72 inches), weighs 200 pounds, and has a (cross-week-summed) target share of 3 (which reflects an approximate target share of 17.6% per game; i.e., \(0.176 = \frac{3}{17 \; \text{games}}\)).
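A minimal sketch of how such a prediction might be generated with predict(), assuming the linear-terms model from above is stored in an object named fit (a hypothetical name):

newPlayer <- data.frame(
  age = 23,
  height = 72, # inches
  weight = 200, # pounds
  target_share = 3) # cross-week-summed target share

predict(fit, newdata = newPlayer)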
the associations between the predictor variables and the outcome variable are linear
there is homoscedasticity of the residuals; the residuals do not differ as a function of the predictor variables or as a function of the outcome variable
the residuals are independent; they are uncorrelated with each other
the residuals are normally distributed
12.9.5.1 Linear Association
We evaluated the shape of the association between the predictor variables and the outcome variable using scatterplots. We accounted for potential curvilinearity in the associations with quadratic terms. Other ways to account for nonlinearity, in addition to polynomials, include transforming predictors, using splines/piecewise regression, and using generalized additive models.
To evaluate for potential nonlinearity in the associations, we can also evaluate residual plots (Figure 12.21), marginal model plots (Figure 12.12), added-variable plots (Figure 12.13), and component-plus-residual plots (Figure 12.14) from the car package (Fox et al., 2024; Fox & Weisberg, 2019). For evaluating linearity, we would expect minimal bend/curvature in the lines.
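A minimal sketch of how these diagnostic plots can be generated, assuming the model with quadratic terms is stored in an object named fit_quadratic (a hypothetical name):

library(car)

residualPlots(fit_quadratic) # residuals vs. each predictor and vs. fitted values
marginalModelPlots(fit_quadratic) # observed and model-implied values for each predictor
avPlots(fit_quadratic) # added-variable plots
crPlots(fit_quadratic) # component-plus-residual plots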
The marginal model plots (Figure 12.12), residual plots (Figure 12.21), and component-plus-residual plots (Figure 12.14) suggest that the nonlinearity of the association between target share and fantasy points may not be fully captured by the quadratic term. Thus, we may need to apply a different approach to handling the nonlinear association between target share and fantasy points.
One approach we can take is to transform the target_share variable to be more normally distributed.
The histogram of the raw target_share variable is in Figure 12.15.
Code
hist(
  player_stats_seasonal$target_share[which(player_stats_seasonal$position == "WR")],
  main = "Histogram of Target Share")
Figure 12.15: Histogram of Target Share.
The variable shows a strong positive skew. To address a strong positive skew, we can use a log transformation. The histogram of the log-transformed variable is in Figure 12.16.
Code
hist(
  log(player_stats_seasonal$target_share[which(player_stats_seasonal$position == "WR")] + 1),
  main = "Histogram of Target Share (Log Transformed)")
Figure 12.16: Histogram of Target Share, Transformed.
Now we can re-fit the model with the log-transformed variable.
Call:
lm(formula = fantasyPoints ~ age + I(age^2) + height + I(height^2) +
weight + I(weight^2) + I(log(target_share + 1)), data = player_stats_seasonal %>%
filter(position %in% c("WR")), na.action = "na.exclude")
Residuals:
Min 1Q Median 3Q Max
-610.70 -15.60 -8.09 7.67 299.80
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.130e+02 4.995e+02 -0.226 0.821
age 3.243e+00 2.619e+00 1.238 0.216
I(age^2) -4.782e-02 4.718e-02 -1.014 0.311
height 4.906e+00 1.462e+01 0.336 0.737
I(height^2) -3.416e-02 1.009e-01 -0.339 0.735
weight -1.168e+00 8.826e-01 -1.324 0.186
I(weight^2) 3.247e-03 2.191e-03 1.482 0.138
I(log(target_share + 1)) 8.974e+02 7.864e+00 114.108 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 41.66 on 4429 degrees of freedom
(952 observations deleted due to missingness)
Multiple R-squared: 0.7555, Adjusted R-squared: 0.7551
F-statistic: 1955 on 7 and 4429 DF, p-value: < 2.2e-16
Target share shows a more linear association with fantasy points after log-transforming it (albeit still not perfect), as depicted in Figures 12.17, 12.18, 12.19, and 12.20.
Figure 12.20: Residual Plots After Log Transformation of Target Share.
12.9.5.2 Homoscedasticity
To evaluate homoscedasticity, we can evaluate a residual plot (Figure 12.21) and spread-level plot (Figure 12.22) from the car package (Fox et al., 2024; Fox & Weisberg, 2019). In a residual plot, you want a constant spread of residuals versus fitted values—you do not want the residuals to show a fan or cone shape.
In this example, the residuals appear to increase as a function of the fitted values. To handle this, we may need to transform the outcome variable to be more normally distributed.
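The spread-level plot’s suggested power transformation can be obtained from the car package; a minimal sketch, assuming the log-transformed model from above is stored in an object named fit_log (a hypothetical name):

car::spreadLevelPlot(fit_log) # prints a suggested power transformation for the outcome
# note: spread-level plots require positive fitted/response values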
The histogram for raw fantasy points is in Figure 12.23.
Code
hist(
  player_stats_seasonal$fantasyPoints[which(player_stats_seasonal$position == "WR")],
  main = "Histogram of Fantasy Points")
Figure 12.23: Histogram of Fantasy Points (Among Wide Receivers).
We can apply a Yeo-Johnson transformation to the outcome variable to generate a more normally distributed variable. A Yeo-Johnson transformation estimates the optimal transformation to make the variable more normally distributed. Let’s use a Yeo-Johnson transformation of fantasy points:
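A minimal sketch of one way to do this, using the Yeo-Johnson family of the car package’s powerTransform() function (the book’s exact implementation may differ):

fantasyPoints_wr <- player_stats_seasonal$fantasyPoints[
  which(player_stats_seasonal$position == "WR")]

yj <- car::powerTransform(fantasyPoints_wr, family = "yjPower") # estimate the optimal lambda
fantasyPoints_transformed <- car::yjPower(fantasyPoints_wr, coef(yj)) # apply the transformation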
Figure 12.26: Spread-Level Plot After Transformation of Fantasy Points.
The residuals show more constant variance after transforming the outcome variable.
12.9.5.3 Uncorrelated Residuals
To determine if residuals are correlated given the nested structure of the data, we can examine the proportion of variance that is attributable to the particular player. To do this, we can estimate the intraclass correlation coefficient (ICC) from a mixed model using the performance package (Lüdecke et al., 2021, 2025).
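A minimal sketch, assuming a random intercept for player; the exact model specification may differ from the one used in the book:

library(lme4)
library(performance)

icc_model <- lmer(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)) +
    (1 | player_id),
  data = subset(player_stats_seasonal, position == "WR"))

performance::icc(icc_model)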
The ICC indicates that over half of the variance is attributable to between-player variance, so it would be important to account for the player-specific variance using a mixed model. For simplicity, we focus on multiple regression models in this chapter; mixed models are described in Chapter 13.
12.9.5.4 Normally Distributed Residuals
We can examine whether residuals are normally distributed using quantile–quantile (QQ) plots and probability–probability (PP) plots, as in Figures 12.27 and 12.28. If the residuals are normally distributed, they should stay close to the diagonal reference line.
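A minimal sketch, assuming the final fitted model is stored in an object named fit_transformed (a hypothetical name):

car::qqPlot(fit_transformed) # QQ plot of studentized residuals with a confidence envelope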
Figure 12.28: Residual Plots After Transformation of Fantasy Points.
12.10 Multicollinearity
Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated. The problem of having multiple predictor variables that are highly correlated is that it makes it challenging to estimate the regression coefficients accurately.
Multicollinearity in multiple regression is depicted conceptually in Figure 12.29.
Figure 12.29: Conceptual Depiction of Multicollinearity in Multiple Regression. From Petersen (2024) and Petersen (2025).
Table 12.3: Example Data of Predictors (x1 and x2) and Outcome (y) Used for Regression Model.
 y   x1   x2
 9   2.0    4
11   3.0    6
17   4.0    8
 3   1.0    2
21   5.0   10
13   3.5    7
The second predictor variable adds no new information—it is exactly twice the value of the first predictor variable; thus, the two predictor variables are perfectly correlated (i.e., \(r = 1.0\)). This means that there are different prediction equations that are equally good—see Equation 12.9:
\[
\begin{aligned}
2x_2 &= y \\
0x_1 + 2x_2 &= y \\
4x_1 &= y \\
4x_1 + 0x_2 &= y \\
2x_1 + 1x_2 &= y \\
5x_1 - 0.5x_2 &= y \\
...
&= y
\end{aligned}
\tag{12.9}\]
Then, what are the regression coefficients? We do not know which regression coefficients are correct, because each of the possibilities fits the data equally well. Thus, when estimating the regression model, we could obtain arbitrary estimates of the regression coefficients with an enormous standard error around each estimate. In general, multicollinearity increases the uncertainty (i.e., standard errors and confidence intervals) around the parameter estimates. Any predictor variables that have a correlation above ~\(r = .30\) with each other could have an impact on the confidence interval of the regression coefficients. As the correlations among the predictor variables increase, the chance of getting an arbitrary answer increases, a phenomenon sometimes called “bouncing betas.” So, it is important to examine a correlation matrix of the predictor variables before putting them in the same regression model. You can also examine indices such as the variance inflation factor (VIF), where a value greater than 5 or 10 indicates multicollinearity.
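For example, VIF values like those shown below can be obtained with the car package; a minimal sketch, assuming the model with the log-transformed target share is stored in an object named fit_log (a hypothetical name):

car::vif(fit_log)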
age height weight
1.019535 2.095026 2.111745
I(log(target_share + 1))
1.030896
To address multicollinearity, you can drop a redundant predictor, or you can use principal component analysis or factor analysis of the predictors to reduce them to a smaller number of meaningful predictors. To obtain precise, confident estimates of individual regression coefficients, you need a low level of intercorrelation among the predictors, unless you have a very large sample size. However, if you are merely interested in prediction—and are not interested in interpreting the regression coefficients of individual predictors—multicollinearity poses less of a problem. For instance, machine learning cares more about achieving the greatest predictive accuracy possible and cares less about explaining which predictors are causally related to the outcome. So, multicollinearity is less of a concern for machine learning approaches.
12.11 Handling of Missing Data
An important consideration in multiple regression is how missing data are handled. Multiple regression in R using the lm() function applies listwise deletion. Listwise deletion removes any row (in the data file) from analysis that has a missing value on the outcome variable or any of the predictor variables. Removing all rows from analysis that have any missingness in the model variables can be a problem because missingness is often not completely at random—missingness often occurs systematically (i.e., for a reason). For instance, participants may be less likely to have data for all variables if they are from a lower socioeconomic status background and do not have the time to participate in all study procedures. Thus, applying listwise deletion, we might systematically exclude participants from lower socioeconomic status backgrounds (or other groups), which could lead to less generalizable inferences.
It is thus important to consider approaches to handle missingness. Various approaches to handle missingness include pairwise deletion (aka available-case analysis), multiple imputation, and full information maximum likelihood (FIML).
Now, let’s specify the imputation method—we use the two-level predictive mean matching (2l.pmm) method from the miceadds package (Robitzsch et al., 2024) to account for the nonindependent data (owing to multiple seasons per player):
Code
meth <- make.method(dataToImpute)
meth[1:length(meth)] <- ""
meth[Y] <- "2l.pmm" # specify the imputation method here; this can differ by outcome variable
Now, let’s specify the predictor matrix. A predictor matrix is a matrix of values, where:
columns with non-zero values are predictors of the variable specified in the given row
the diagonal of the predictor matrix should be zero because a variable cannot predict itself
The values are:
NOT a predictor of the outcome: 0
cluster variable: -2
fixed effect of predictor: 1
fixed effect and random effect of predictor: 2
include cluster mean of predictor in addition to fixed effect of predictor: 3
include cluster mean of predictor in addition to fixed effect and random effect of predictor: 4
imp <- mice::mice(
  as.data.frame(dataToImpute),
  method = meth,
  predictorMatrix = pred,
  m = numImputations,
  maxit = 5, # generally use 100 maximum iterations; this example uses 5 for speed
  seed = 52242)
Below are some imputation diagnostics. Trace plots are in Figure 12.30.
Code
plot(imp, c("target_share"))
Figure 12.30: Trace plots from multiple imputation.
Figure 12.31: Density plot from multiple imputation.
The imputed data do not match the distribution of the observed data well. Thus, it may be necessary to select a different imputation method for more accurate imputation.
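Another approach to handling missingness is full information maximum likelihood (FIML), which estimates the model using all available data rather than discarding incomplete rows. The output below is from a regression model estimated with FIML in the lavaan package (Rosseel, 2012); a minimal sketch of how such a model might be specified, assuming the log-transformed target share variable is named target_share_log (the variable names and data preparation are assumptions):

library(dplyr)
library(lavaan)

fimlModel <- '
  fantasyPoints_transformed ~ age + height + weight + target_share_log
'

fimlFit <- sem(
  fimlModel,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  missing = "ML", # full information maximum likelihood
  estimator = "ML",
  fixed.x = FALSE) # estimate means, variances, and covariances of the predictors

summary(fimlFit, standardized = TRUE, rsquare = TRUE)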
lavaan 0.6-19 ended normally after 58 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 20
Number of observations 5389
Number of missing patterns 2
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Parameter Estimates:
Standard errors Standard
Information Observed
Observed information based on Hessian
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv
fantasyPoints_transformed ~
age 0.040 0.007 5.443 0.000 0.040
height -0.003 0.014 -0.212 0.832 -0.003
weight 0.004 0.002 1.741 0.082 0.004
target_shar_lg 27.770 0.285 97.531 0.000 27.770
Std.all
0.046
-0.003
0.021
0.813
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
age ~~
height -0.297 0.101 -2.945 0.003 -0.297 -0.040
weight -0.155 0.637 -0.243 0.808 -0.155 -0.003
target_shar_lg 0.037 0.004 10.100 0.000 0.037 0.145
height ~~
weight 24.979 0.584 42.755 0.000 24.979 0.716
target_shar_lg 0.016 0.003 5.808 0.000 0.016 0.082
weight ~~
target_shar_lg 0.152 0.017 8.931 0.000 0.152 0.127
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.fntsyPnts_trns 1.704 0.796 2.142 0.032 1.704 0.616
age 26.254 0.043 611.427 0.000 26.254 8.329
height 72.606 0.032 2269.504 0.000 72.606 30.916
weight 200.480 0.202 991.395 0.000 200.480 13.505
target_shar_lg 0.081 0.001 70.753 0.000 0.081 0.997
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.fntsyPnts_trns 2.467 0.052 47.691 0.000 2.467 0.322
age 9.936 0.191 51.909 0.000 9.936 1.000
height 5.516 0.106 51.908 0.000 5.516 1.000
weight 220.372 4.245 51.908 0.000 220.372 1.000
target_shar_lg 0.007 0.000 49.109 0.000 0.007 1.000
R-Square:
Estimate
fntsyPnts_trns 0.678
12.12 Addressing Non-Independence of Observations
Please note that the \(p\)-values for regression coefficients assume that the observations are independent—in particular, that the residuals are not correlated. However, the observations are not independent in the player_stats_seasonal dataframe used above, because the same player has multiple rows—one row corresponding to each season they played. This non-independence violates the assumptions underlying the traditional significance tests of regression coefficients. We could address this by analyzing only one season from each player or by estimating the significance of the regression coefficients using cluster-robust standard errors. For simplicity, we present results above from the whole dataframe. In Chapter 13, we discuss mixed model approaches that handle repeated measures and other data that violate assumptions of independence. Below, we demonstrate how to account for non-independence of observations using cluster-robust standard errors with the rms package (Harrell, Jr., 2025).
Code
player_stats_seasonal_subset <- player_stats_seasonal %>%
  filter(!is.na(player_id)) %>%
  filter(position %in% c("WR"))

rms::robcov(
  rms::ols(
    fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
    data = player_stats_seasonal_subset,
    x = TRUE,
    y = TRUE),
  cluster = player_stats_seasonal_subset$player_id) # account for nested data within player
Frequencies of Missing Values Due to Each Variable
fantasyPoints_transformed age height
0 0 0
weight target_share
0 952
Linear Regression Model
rms::ols(formula = fantasyPoints_transformed ~ age + height +
weight + I(log(target_share + 1)), data = player_stats_seasonal_subset,
x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 4437 LR chi2 5009.98 R2 0.677
sigma 1.5744 d.f. 4 R2 adj 0.676
d.f. 4432 Pr(> chi2) 0.0000 g 2.428
Cluster on player_stats_seasonal_subset$player_id
Clusters 1292
Residuals
Min 1Q Median 3Q Max
-18.3306 -0.6618 0.2290 0.8845 6.7486
Coef S.E. t Pr(>|t|)
Intercept 2.3735 1.0482 2.26 0.0236
age 0.0337 0.0089 3.78 0.0002
height -0.0092 0.0183 -0.50 0.6171
weight 0.0036 0.0027 1.34 0.1789
target_share 27.8669 1.0949 25.45 <0.0001
12.13 Impact of Outliers
As with correlation, multiple regression can be strongly impacted by outliers.
12.13.1 Robust Regression
To address outliers, there are various approaches to robust regression. One approach is to use an MM-type estimator, such as is used in the lmrob() and glmrob() functions of the robustbase package (Maechler et al., 2024; Todorov & Filzmoser, 2009).
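A minimal sketch, assuming the same predictors as the models above (the model object name and formula are illustrative, not the book’s exact model):

library(robustbase)

fit_robust <- lmrob(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal_subset)

summary(fit_robust)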
Now, we can visualize the interaction to understand it. We create an interaction plot (Figure 12.32) and Johnson-Neyman plot (Figure 12.33) using the interactions package (Long, 2024).
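A minimal sketch, assuming a fitted moderated regression model stored in an object named fit_interaction (a hypothetical name) that includes a height_centered × weight_centered interaction term:

library(interactions)

interact_plot(fit_interaction, pred = height_centered, modx = weight_centered) # interaction plot
johnson_neyman(fit_interaction, pred = height_centered, modx = weight_centered) # Johnson-Neyman plot
sim_slopes(fit_interaction, pred = height_centered, modx = weight_centered) # simple slopes analysis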
JOHNSON-NEYMAN INTERVAL
When weight_centered is INSIDE the interval [16.16, 136.16], the slope of
height_centered is p < .05.
Note: The range of observed values of weight_centered is [-47.48, 64.52]
Figure 12.33: Johnson-Neyman Plot from Moderated Multiple Regression.
SIMPLE SLOPES ANALYSIS
Slope of height_centered when weight_centered = -1.484626e+01 (- 1 SD):
Est. S.E. t val. p
------- ------ -------- ------
-0.00 0.03 -0.12 0.91
Slope of height_centered when weight_centered = -1.002064e-16 (Mean):
Est. S.E. t val. p
------- ------ -------- ------
-0.03 0.02 -1.26 0.21
Slope of height_centered when weight_centered = 1.484626e+01 (+ 1 SD):
Est. S.E. t val. p
------- ------ -------- ------
-0.05 0.03 -1.93 0.05
mediationModel <- '
  # direct effect (cPrime)
  Y ~ direct*X

  # mediator
  M ~ a*X
  Y ~ b*M

  # indirect effect = a*b
  indirect := a*b

  # total effect (c)
  total := abs(direct) + abs(indirect)
'
Let’s substitute in our predictor, outcome, and hypothesized mediator. In this case, we predict that receiving touchdowns partially accounts for the association between Wide Receivers’ target share and their fantasy points. This is a silly example because fantasy points are derived, in part, from touchdowns, so of course touchdowns will partially account for almost any effect on Wide Receivers’ fantasy points. This example is merely for demonstrating the process of developing and examining a mediation model.
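A minimal sketch of what the substituted model might look like, assuming the receiving touchdowns variable is named receiving_tds (a hypothetical name; the book’s variable names may differ):

mediationModel <- '
  # direct effect (cPrime)
  fantasyPoints ~ direct*target_share

  # mediator
  receiving_tds ~ a*target_share
  fantasyPoints ~ b*receiving_tds

  # indirect effect = a*b
  indirect := a*b

  # total effect (c)
  total := abs(direct) + abs(indirect)
'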
To get a robust estimate of the indirect effect, we obtain bootstrapped estimates from 1,000 bootstrap draws. Typically, we would obtain bootstrapped estimates from 10,000 bootstrap draws, but this example uses only 1,000 bootstrap draws for a shorter runtime.
Code
mediationFit <- lavaan::sem(
  mediationModel,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  se = "bootstrap",
  bootstrap = 1000, # generally use 10,000 bootstrap draws; this example uses 1,000 for speed
  parallel = "multicore", # parallelization for speed: use "multicore" for Mac/Linux; "snow" for PC
  iseed = 52242, # for reproducibility
  missing = "ML",
  estimator = "ML",
  fixed.x = FALSE)
multipleMediatorFit <- lavaan::sem(
  multipleMediatorModel,
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  se = "bootstrap",
  bootstrap = 1000, # generally use 10,000 bootstrap draws; this example uses 1,000 for speed
  parallel = "multicore", # parallelization for speed: use "multicore" for Mac/Linux; "snow" for PC
  iseed = 52242, # for reproducibility
  missing = "ML",
  estimator = "ML",
  fixed.x = FALSE)
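The regression model from earlier can also be estimated in a Bayesian framework. The output below is from such a model; a minimal sketch of how it might be fit with the brms package (settings beyond those shown in the output, such as priors, are left at their defaults and are assumptions):

library(dplyr)
library(brms)

bayesFit <- brm(
  fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1)),
  data = player_stats_seasonal %>% filter(position %in% c("WR")),
  seed = 52242)

summary(bayesFit)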
Family: gaussian
Links: mu = identity; sigma = identity
Formula: fantasyPoints_transformed ~ age + height + weight + I(log(target_share + 1))
Data: player_stats_seasonal %>% filter(position %in% c(" (Number of observations: 4437)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 2.37 0.82 0.76 3.95 1.00 3779 3268
age 0.03 0.01 0.02 0.05 1.00 4499 2979
height -0.01 0.01 -0.04 0.02 1.00 3472 2597
weight 0.00 0.00 -0.00 0.01 1.00 3686 3511
Ilogtarget_shareP1 27.86 0.30 27.26 28.45 1.00 3745 2698
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.57 0.02 1.54 1.61 1.00 3590 2406
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
12.17 Conclusion
Multiple regression allows examining the association between multiple predictor variables and one outcome variable. Inclusion of multiple predictors in the model allows for potentially greater predictive accuracy and identification of the extent to which each variable uniquely contributes to the outcome variable. As with correlation, an association does not imply causation. However, identifying associations is important because associations are a necessary (but insufficient) condition for causality. When developing a multiple regression model, there are various assumptions that are important to evaluate. In addition, it is important to pay attention to potential multicollinearity—it may become difficult to detect a given predictor variable as statistically significant due to the greater uncertainty around the parameter estimates.
Ben-Shachar, M. S., Lüdecke, D., & Makowski, D. (2020). effectsize: Estimation of effect size indices and standardized parameters. Journal of Open Source Software, 5(56), 2815. https://doi.org/10.21105/joss.02815
Ben-Shachar, M. S., Makowski, D., Lüdecke, D., Patil, I., Wiernik, B. M., Thériault, R., & Waggoner, P. (2025). effectsize: Indices of effect size. https://doi.org/10.32614/CRAN.package.effectsize
Iacobucci, D., Schneider, M. J., Popovich, D. L., & Bakamitsos, G. A. (2016). Mean centering helps alleviate “micro” but not “macro” multicollinearity. Behavior Research Methods, 48(4), 1308–1317. https://doi.org/10.3758/s13428-015-0624-x
Lüdecke, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., & Makowski, D. (2021). performance: An R package for assessment, comparison and testing of statistical models. Journal of Open Source Software, 6(60), 3139. https://doi.org/10.21105/joss.03139
Lüdecke, D., Makowski, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., Wiernik, B. M., & Thériault, R. (2025). performance: Assessment of regression models performance. https://doi.org/10.32614/CRAN.package.performance
Petersen, I. T. (2024). Principles of psychological assessment: With applied examples in R. Chapman and Hall/CRC. https://doi.org/10.1201/9781003357421
Petersen, I. T. (2025). Principles of psychological assessment: With applied examples in R. University of Iowa Libraries. https://doi.org/10.25820/work.007199
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. https://doi.org/10.18637/jss.v048.i02
Todorov, V., & Filzmoser, P. (2009). An object-oriented framework for robust multivariate analysis. Journal of Statistical Software, 32(3), 1–47. https://doi.org/10.18637/jss.v032.i03
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67. https://doi.org/10.18637/jss.v045.i03