I want your feedback to make the book better for you and other readers. If you find typos, errors, or places where the text may be improved, please let me know.
The best ways to provide feedback are by GitHub or hypothes.is annotations.
Alternatively, you can leave an annotation using hypothes.is.
To add an annotation, select some text and then click the
symbol on the pop-up menu.
To see the annotations of others, click the
symbol in the upper right-hand corner of the page.
12Mixed Models
This chapter provides an overview of mixed models.
We created the player_stats_weekly.RData and player_stats_seasonal.RData objects in Section 4.4.3.
12.2 Overview of Mixed Models
Mixed models are simply regression models that account for the nested or hierarchical structure of the data. Mixed models can be be used to address the assumption in multiple regression that the residuals are independent (i.e., uncorrelated). When data come from the same group, that can lead to residuals being correlated, so accounting for the grouping-level structure can address the assumption.
The modeling framework goes by many terms, including mixed models, mixed-effects models, multilevel models, hierarchical linear models. They are sometimes called multilevel models and hierarchical linear models, whose name emphasizes the hierarchical structure of the data because the data are nonindependent. When observations (i.e., data points) are collected from multiple lower-level units (e.g., people) in an upper-level unit (e.g., married couple, family, classroom, school, neighborhood, team), the data from the lower-level units are considered “nested” within the upper-level unit. In this context, nested data refers to multiple observations from the same upper-level unit. For instance, longitudinal data are nested within the same participant. Students are nested within classrooms, which are nested within schools. Players are nested within teams.
When data are nested, the data from the lower-level unit are likely to be correlated, to some degree, because they come from the same upper-level unit. For example, multiple players may come from the same team, and the players’ performance on that team is likely interrelated because they share common experiences and influence one another. Thus, data from multiple players on a given team are considered nested within that team. Longitudinal data can also be considered nested data, in which time points are nested within the person (i.e., the same player provides an observation across multiple time points). As we will discuss, it is important to account for levels of nesting when the observations are nonindependent. Mixed models provide a framework for accounting for levels of nesting, in order to account for why data from the same upper-level unit are likely to be intercorrelated.
These models are also sometimes called mixed models or mixed-effects models because the models can include a mix of fixed and random effects. Fixed effects are effects that are constant across individuals (i.e., upper-level units). Random effects are effects that vary across individuals (i.e., upper-level units). For instance, consider a longitudinal study of fantasy performance as a function of age. If we have longitudinal data for multiple players, the time points are nested within players. Examining the association between age as a fixed effect in relation to fantasy performance would examine the association between a player’s age and their fantasy performance while holding the association between age and fantasy performance constant across all players. That is, it would assume that all players show the same trajectory such as increase from ages 20 to 24 and then decrease. Examining the association between age as a random effect in relation to fantasy performance would examine the association between a player’s age and their fantasy performance while allowing the association between age and fantasy performance to vary across players. That is, it would allow the possibility that some players improve with age, whereas other players decline in performance with age.
When including random effects of a variable (e.g., age) in a mixed model, it is also important to include fixed effects of that variable in the model. This is because random effects have a mean of zero. Fixed effects allow the mean to differ from zero. Thus, inclusion of random effects without the corresponding fixed effect can lead to bias in estimation of the association between the predictor variables and the outcome variable.
Mixed modeling is a more flexible modeling framework than analysis of variance (ANOVA). Unlike ANOVA, mixed modeling can accommodate both continuous and categorical predictor variables that vary within units (Brauer & Curtin, 2018). ANOVA, cannot handle continuous predictor variables that vary within units (Brauer & Curtin, 2018). In addition, ANOVA yields biased inferences (i.e., a higher Type I error rate) when the same participants are exposed to multiple items or stimuli [i.e., responses by the same respondent tend to be correlated across items, and responses to the same item tend to be correlated across respondents; Brauer & Curtin (2018)]. Additionally, mixed models better handle missing data compared to ANOVA(Brauer & Curtin, 2018).
12.2.1 Ecological Fallacy
As described in Section 14.5.9, the ecological fallacy is the error of drawing inferences about an individual from group-level data. A type of ecological fallacy is Simpson’s paradox.
12.2.2 Simpson’s Paradox
Simpson’s paradox occurs when the association between the predictor variable and outcome variable for the subgroups differ from the association when the subgroups are combined. Examples of Simpson’s paradox are depicted in Figures 13.5 and 12.1. In the example below, there is a positive association between the predictor variable and outcome variable for each group. However, when the groups are combined, there is a negative association between predictor variable and outcome variable. That is, the sign of an association can differ at different levels of analysis (e.g., group level versus person level).
Figure 12.1: An Example of Simpson’s Paradox. In this example, there is a positive association between x and y within each group (i.e., red or blue), but there is a negative association between x and y when the groups are combined. (Figure retrieved from https://en.wikipedia.org/wiki/File:Simpson%27s_paradox_continuous.svg)
Consider that we observe a between-person association between a predictor, for example, how much sports drink a player drinks and their performance, such that players who drink more sports drink before games tend to perform better during games. However, based on this association, if we draw the inference that sports drink consumption leads to better performance, this could be a faulty inference. It is possible that, at the within-person level, there is no association or even a negative association between sports drink consumption and performance. For helping to approximate causality, a much stronger test than relying on the between-person association would be to examine the association within the individual. That is, examining the association within the individual would examine: when a player consumes more sports drink, whether they perform better than when the same player consumes less sports drink. We describe within-person analyses to approximate causal inferences in Section 13.5.2.2.
In short, we often need to account for the groups or “levels” within which people are nested. When multiple observations are nested within the same upper-level unit, they are often correlated. That is, some variance is attributable to the lower-level unit (e.g., the player) and some variance is attributable to the upper-level unit (e.g., the team). It is important to evaluate how much variance is attributable to the upper-level unit. A way of evaluating how much variance is attributable to an upper-level unit is with the intraclass correlation coefficient (ICC). The ICC indicates the proportion of variance that is attributable to an upper-level unit, and it ranges from 0–1. The ICC can also be interpreted to reflect the correlation between scores for those observations that share an upper-level unit (e.g., the correlation of fantasy points across time within players). If substantial variance is attributable to the upper-level unit (i.e., there is a substantial ICC), it is important to account for the nested structure of the data. Mixed models are a useful approach to account for data with a nested structure so you can avoid committing the ecological fallacy.
12.2.3 Data Structure
For fitting mixed models, the data should be in long form—each player (or unit of interest) should have multiple rows based on the grouping structure. The cluster variable is the variable or variables that represent the upper-level unit(s), within which the data are nested. When estimating mixed models, it is important to know the grouping structure of the data because the data structure influences the model syntax. Examples of data from nonindependent units include nested data, multiple membership data, and cross-classified data, as depicted in Figure 12.2.
Figure 12.2: Various Data Structures Involving Data from Nonindependent Units, Including Nested data, Multiple Membership Data, and Cross-Classified Data.
One example of nonindependent data is if your data are nested. When data are collected from multiple lower-level units (e.g., people) in an upper-level unit (e.g., married couple, family, classroom, school, neighborhood), and each lower-level unit belongs to one and only one higher-level unit, the data from the lower-level units are considered nested within the upper-level unit. For example, multiple participants in your sample may come from the same classroom. Data from multiple people can be nested within the same family, classroom, school, etc. Longitudinal data can also be considered nested data, in which time points are nested within the participant (i.e., the same participant provides an observation across multiple time points). And if you have multiple informants of a child’s behavior (e.g., parent-, teacher-, friend-, and self-report), the ratings could also be nested within the participant (i.e., there are ratings from multiple informants of the same child).
Another form of nonindependent data is when data involve multiple membership. Data involve multiple membership when the lower-level units belong to more than one upper-level unit. As an example of multiple membership, children may have more than one teacher, and therefore, in a modeling sense, children “belong” to more than one teacher cluster.
Another form of nonindependent data is when data are cross-classified (also called crossed). Data are cross-classified when the lower-level units are classified by two or more upper-level units, but the upper-level units are not hierarchical or nested within one another. For instance, children may be nested within the crossing of schools and neighborhoods. That is, children are nested within schools, and children are also nested within neighborhoods; however, children attending the same school may not necessarily be from the same neighborhood, and children from the same neighborhood may not necessarily attend the same school. The data would still have a two-level hierarchy, but the data would be nested in specific school-neighborhood combinations. That is, children are nested within the cross-classification of schools and neighborhoods. An example of cross-classified data in football is that players are nested within teams and within agents. Agents are not nested within teams, and teams are not nested within agents. That is, teams are nested within the cross-classification of teams and agents.
Applied to longitudinal fantasy football data, most simply, players’ longitudinal performance data would be modeled as nested data with a two-level model: timepoints nested within players, with the player’s ID as the cluster variable. However, you could get more complicated. For instance, you could estimate a three-level model of timepoints nested within players nested within teams. This would help account for players’ productivity benefitting or taking a hit from being on a given team, and it would account for players changing teams. Alternatively, consistent with Usami & Murayama (2018), you could estimate a cross-classified model of players’ individual observations/measurement occasions belonging to two upper-level units: players and time points (e.g., week 1, week 2, 2025 season, etc.). This would help account for systematic effects that are specific to a given timepoint (e.g., week or season).
When trying to determine what predictors to include and how complex of a model to fit, it can be helpful to compare a variety a models to determine how well they fit the data (i.e., how much variance they account for). To compare non-nested models, you can use fit criteria such as Akaike information criterion (AIC), which penalizes model complexity. If two models are nested, you can compare their model fit using a likelihood ratio test. Two models are considered nested if one model includes all of the terms of the original model along with additional terms. The simpler model is considered “nested within” the more complex model. In the likelihood ratio test, a significant difference indicates that the more complex model is significantly better fitting than the simpler model. If the likelihood ratio test is non-significant, it indicates that the more complex model is not significantly better fitting, and thus you should prefer the more parsimonious model.
12.4 Fantasy Points Per Season by Position, Age, and Experience
In this chapter, we use mixed models to demonstrate how to model trajectories (growth curves) of players’ fantasy points as a function of age. When estimating growth curve models, time points are nested within individuals (i.e., players). Mixed models provide a flexible framework for estimating growth curve models because they allow unequal time intervals between measurement occasions, time intervals that differ between individuals, and different numbers of observations for each individual (Brauer & Curtin, 2018).
One of the challenges when modeling players’ fantasy points as a function of age is that fantasy points are a function of both ability and opportunity. With age, the player’s ability and opportunity may both decline, but it is difficult to disentangle the extent to which a player’s decline is related to declines in ability versus opportunity. As players get older, they may be supplanted by younger, more talented players. With age, despite still having latent (unobserved) ability, players will get fewer and, eventually, no opportunity. And we do not know the counterfactual—how many fantasy points they would have scored had they been given the full opportunity each season. Thus, players may go from scoring 100+ points in a season to all of a sudden scoring way fewer points, with little in the way of intermediate steps. Thus, this should inform how you interpret the resulting trajectories. For instance, the steep declines can make it look like the model-implied fantasy points go well below zero at later ages, even though it is obviously not likely for players to obtain such a magnitude of negative fantasy points. The negative model-implied values of fantasy points at later ages are likely an artifact of the decreasing opportunity that players tend to get as they age. The crossover point, where the positive values becomes negative might indicate, for instance, that players at that age tend to have retired and thus have no opportunity. Regardless, the form/shape/steepness of the decline is still potentially meaningful even if the precise negative number at a given age is not.
Create a newdata object for generating the plots of model-implied fantasy points by age and position:
Code
pointsPerSeason_positionAge_newData <-expand.grid(positionFactor =factor(c("FB","QB","RB","TE","WR")), #,"K"age =seq(from =20, to =40, length.out =10000))pointsPerSeason_positionAge_newData$ageCentered20 <- pointsPerSeason_positionAge_newData$age -20pointsPerSeason_positionAge_newData$ageCentered20Quadratic <- pointsPerSeason_positionAge_newData$ageCentered20 ^2pointsPerSeason_positionAge_newData$years_of_experience <-floor(pointsPerSeason_positionAge_newData$age -22) # assuming that most players start at age 22 (i.e., rookie year) and thus have 1 year of experience at age 23pointsPerSeason_positionAge_newData$years_of_experience[which(pointsPerSeason_positionAge_newData$years_of_experience <0)] <-0
Create an object with complete cases for generating the plots of individuals’ model-implied fantasy points by age and position:
12.4.1 Scatterplots of Fantasy Points by Age and Position
Scatterplots are a helpful tool for quickly examining the association between two variables. However, scatterplots—as well as correlation and multiple regression—can hide meaningful associations that differ across units of analysis.
plot_scatterplotFantasyPointsByAgeQB <-ggplot(data = player_stats_seasonal_offense_subset %>%filter(position =="QB") %>%mutate(age =round(age, 2),fantasyPoints =round(fantasyPoints, 2) ),mapping =aes(x = age,y = fantasyPoints,color = player_id)) +geom_point(aes(text = player_display_name, # add player name for mouse over tooltiplabel = season # add season for mouse over tooltip )) +geom_smooth(mapping =aes(x = age,y = fantasyPoints),inherit.aes =FALSE ) +scale_color_viridis(discrete =TRUE) +labs(x ="Player Age (years)",y ="Fantasy Points (Season)",title ="Fantasy Points (Season) by Player Age: Quarterbacks" ) +theme_classic() +theme(legend.position ="none")plotly::ggplotly( plot_scatterplotFantasyPointsByAgeQB,tooltip =c("age","fantasyPoints","text","label"))
Figure 12.3: Scatterplot of Fantasy Points by Age for Quarterbacks.
Based on the scatterplot (and the bivariate association below), Quarterbacks’ fantasy points appear to (slightly) increase with age.
Code
cor.test(formula =~ age + fantasyPoints,data = player_stats_seasonal_offense_subset %>%filter(position =="QB"))
Pearson's product-moment correlation
data: age and fantasyPoints
t = 2.261, df = 2091, p-value = 0.02386
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.006553027 0.092036161
sample estimates:
cor
0.04938503
plot_scatterplotFantasyPointsByAgeTE <-ggplot(data = player_stats_seasonal_offense_subset %>%filter(position =="TE") %>%mutate(age =round(age, 2),fantasyPoints =round(fantasyPoints, 2) ),mapping =aes(x = age,y = fantasyPoints,color = player_id)) +geom_point(aes(text = player_display_name, # add player name for mouse over tooltiplabel = season # add season for mouse over tooltip )) +geom_smooth(mapping =aes(x = age,y = fantasyPoints),inherit.aes =FALSE ) +scale_color_viridis(discrete =TRUE) +labs(x ="Player Age (years)",y ="Fantasy Points (Season)",title ="Fantasy Points (Season) by Player Age: Tight Ends" ) +theme_classic() +theme(legend.position ="none")plotly::ggplotly( plot_scatterplotFantasyPointsByAgeTE,tooltip =c("age","fantasyPoints","text","label"))
Figure 12.7: Scatterplot of Fantasy Points by Age for Tight Ends.
Based on the scatterplot (and the bivariate association below), Tight Ends’ fantasy points appear to increase with age.
Code
cor.test(formula =~ age + fantasyPoints,data = player_stats_seasonal_offense_subset %>%filter(position =="TE"))
Pearson's product-moment correlation
data: age and fantasyPoints
t = 5.6517, df = 3139, p-value = 1.731e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.06562156 0.13486568
sample estimates:
cor
0.1003652
12.4.2 Plots of Raw Trajectories of Fantasy Points By Age and Player
Scatterplots can be helpful for quickly visualizing the association between two variables. However, as mentioned earlier, scatterplots can hide the association between variables at different units of analysis. For instance, consider that we are trying to predict how a player will perform based on their age. We are interested not only in what the association is between age and fantasy points between players (i.e., a between-person association). We are also interested in what the association is between age and fantasy points within a given player (and within each player; i.e., a within-individual association). Arguably, the within-individual association between age and fantasy points is more relevant to the prediction of performance than the association between age and fantasy points between players. Assuming that the between-player association between age and fantasy points is the same as the within-player association when it is not is an example of the ecological fallacy.
Below, we depict players’ raw trajectories of fantasy points as a function of age. These are known as spaghetti plots. By examining the trajectory for each player, we can get a better understanding of how performance changes (within an individual) as a function of age.
12.4.2.1 Quarterbacks
A plot of Quarterbacks’ raw fantasy points data by age is in Figure 12.8.
Code
plot_rawFantasyPointsByAgeQB <-ggplot(data = player_stats_seasonal_offense_subset %>%filter(position =="QB") %>%mutate(age =round(age, 2),fantasyPoints =round(fantasyPoints, 2) ),mapping =aes(x = age,y = fantasyPoints,color = player_id)) +geom_line(aes(text = player_display_name, # add player name for mouse over tooltiplabel = season # add season for mouse over tooltip )) +scale_color_viridis(discrete =TRUE) +labs(x ="Player Age (years)",y ="Fantasy Points (Season)",title ="Fantasy Points (Season) by Player Age: Quarterbacks" ) +theme_classic() +theme(legend.position ="none")plotly::ggplotly( plot_rawFantasyPointsByAgeQB,tooltip =c("age","fantasyPoints","text","label"))
Figure 12.8: Plot of Raw Trajectories of Fantasy Points by Age for Quarterbacks.
12.4.2.2 Fullbacks
A plot of Fullbacks’ raw fantasy points data by age is in Figure 12.9.
Code
plot_rawFantasyPointsByAgeFB <-ggplot(data = player_stats_seasonal_offense_subset %>%filter(position =="FB") %>%mutate(age =round(age, 2),fantasyPoints =round(fantasyPoints, 2) ),mapping =aes(x = age,y = fantasyPoints,color = player_id)) +geom_line(aes(text = player_display_name, # add player name for mouse over tooltiplabel = season # add season for mouse over tooltip )) +scale_color_viridis(discrete =TRUE) +labs(x ="Player Age (years)",y ="Fantasy Points (Season)",title ="Fantasy Points (Season) by Player Age: Fullbacks" ) +theme_classic() +theme(legend.position ="none")plotly::ggplotly( plot_rawFantasyPointsByAgeFB,tooltip =c("age","fantasyPoints","text","label"))
Figure 12.9: Plot of Raw Trajectories of Fantasy Points by Age for Fullbacks.
12.4.2.3 Running Backs
A plot of Running Backs’ raw fantasy points data by age is in Figure 12.10.
Code
plot_rawFantasyPointsByAgeRB <-ggplot(data = player_stats_seasonal_offense_subset %>%filter(position =="RB") %>%mutate(age =round(age, 2),fantasyPoints =round(fantasyPoints, 2) ),mapping =aes(x = age,y = fantasyPoints,color = player_id)) +geom_line(aes(text = player_display_name, # add player name for mouse over tooltiplabel = season # add season for mouse over tooltip )) +scale_color_viridis(discrete =TRUE) +labs(x ="Player Age (years)",y ="Fantasy Points (Season)",title ="Fantasy Points (Season) by Player Age: Running Backs" ) +theme_classic() +theme(legend.position ="none")plotly::ggplotly( plot_rawFantasyPointsByAgeRB,tooltip =c("age","fantasyPoints","text","label"))
Figure 12.10: Plot of Raw Trajectories of Fantasy Points by Age for Running Backs.
12.4.2.4 Wide Receivers
A plot of Wide Receivers’ raw fantasy points data by age is in Figure 12.11.
Code
plot_rawFantasyPointsByAgeWR <-ggplot(data = player_stats_seasonal_offense_subset %>%filter(position =="WR") %>%mutate(age =round(age, 2),fantasyPoints =round(fantasyPoints, 2) ),mapping =aes(x = age,y = fantasyPoints,color = player_id)) +geom_line(aes(text = player_display_name, # add player name for mouse over tooltiplabel = season # add season for mouse over tooltip )) +scale_color_viridis(discrete =TRUE) +labs(x ="Player Age (years)",y ="Fantasy Points (Season)",title ="Fantasy Points (Season) by Player Age: Wide Receivers" ) +theme_classic() +theme(legend.position ="none")plotly::ggplotly( plot_rawFantasyPointsByAgeWR,tooltip =c("age","fantasyPoints","text","label"))