I want your feedback to make the book better for you and other readers. If you find typos, errors, or places where the text may be improved, please let me know. The best ways to provide feedback are via GitHub or hypothes.is annotations.
You can leave a comment at the bottom of the page/chapter, or open an issue or submit a pull request on GitHub: https://github.com/isaactpetersen/Fantasy-Football-Analytics-Textbook
Alternatively, you can leave an annotation using hypothes.is. To add an annotation, select some text and then click the annotate symbol on the pop-up menu. To see the annotations of others, click the annotations symbol in the upper right-hand corner of the page.
10 Correlation Analysis
10.1 Getting Started
10.1.1 Load Packages
10.1.2 Load Data
We created the player_stats_weekly.RData and player_stats_seasonal.RData objects in Section 4.4.3.
10.2 Overview of Correlation
Correlation is an index of the association between variables. Covariance is also an index of the association between variables, but it is in an unstandardized metric that differs for variables with different scales. By contrast, correlation is in a standardized metric that does not differ for variables with different scales. When examining the association between variables that are at the interval or ratio level of measurement, the Pearson correlation is used. When examining the association between variables that are at the ordinal level of measurement, the Spearman correlation is used. The Pearson correlation is an index of the linear association between variables. If a nonlinear association is present, other indices such as the xi [\(\xi\); Chatterjee (2021)] and distance correlation coefficients are better suited to detect the association.
10.3 The Correlation Coefficient (\(r\))
The formula for the (Pearson) correlation coefficient is in Equation 9.22. As noted in Section 9.6.6, the correlation coefficient can be thought of as the ratio of shared variance (i.e., covariance) to total variance, as in Equation 9.24.
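As a quick check of this idea, here is a brief sketch (using arbitrary simulated data) verifying that dividing the covariance of two variables by the product of their standard deviations reproduces the Pearson correlation:
Code
set.seed(52242)
x <- rnorm(100)
y <- x + rnorm(100)

# r is the covariance standardized by the two standard deviations
cov(x, y) / (sd(x) * sd(y))

cor(x, y) # same value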
The correlation coefficient ranges from −1.0 to +1.0. The correlation coefficient (\(r\)) tells you two things: (1) the direction (sign) of the association (positive or negative) and (2) the magnitude of the association. If the correlation coefficient is positive, the association is positive; if the correlation coefficient is negative, the association is negative. If the association is positive, as \(X\) increases, \(Y\) increases (and conversely, as \(X\) decreases, \(Y\) decreases). If the association is negative, as \(X\) increases, \(Y\) decreases (and conversely, as \(X\) decreases, \(Y\) increases). The smaller the absolute value of the correlation coefficient (i.e., the closer the \(r\) value is to zero), the weaker the association and the flatter the slope of the best-fit line in a scatterplot. The larger the absolute value of the correlation coefficient (i.e., the closer the absolute value of the \(r\) value is to one), the stronger the association and the steeper the slope of the best-fit line in a scatterplot. See Figure 10.1 for a range of different correlation coefficients and what some example data may look like for each direction and strength of association.
Code
set.seed(52242)
correlations <- data.frame(criterion = rnorm(1000))

# complement() is a helper function (not part of base R) that simulates a
# variable with a specified correlation with the criterion; generate 21
# variables whose correlations with the criterion range from -1 to 1
rhos <- seq(-1, 1, by = .1)

for (i in seq_along(rhos)) {
  correlations[[paste0("v", i)]] <- complement(correlations$criterion, rhos[i])
}

par(mfrow = c(7,3), mar = c(1, 0, 1, 0))

# Plot each variable against the criterion, titling each panel with the
# observed correlation, and overlay the best-fit line
for (i in seq_along(rhos)) {
  varName <- paste0("v", i)

  plot(
    correlations$criterion, correlations[[varName]],
    xaxt = "n", yaxt = "n", xlab = "", ylab = "",
    main = substitute(
      paste(italic(r), " = ", x, sep = ""),
      list(x = round(cor.test(
        x = correlations$criterion,
        y = correlations[[varName]])$estimate, 2))))

  abline(
    lm(reformulate("criterion", response = varName), data = correlations),
    col = "black")
}

invisible(dev.off()) #par(mfrow = c(1,1))
See Figure 10.2 for the interpretation of the magnitude and direction (sign) of various correlation coefficients.
Code
library("patchwork")
set.seed(52242)
correlations2 <- data.frame(criterion = rnorm(15))
correlations2$v1 <- complement(correlations2$criterion, -1)
correlations2$v2 <- complement(correlations2$criterion, -.9)
correlations2$v3 <- complement(correlations2$criterion, -.8)
correlations2$v4 <- complement(correlations2$criterion, -.7)
correlations2$v5 <- complement(correlations2$criterion, -.6)
correlations2$v6 <- complement(correlations2$criterion, -.5)
correlations2$v7 <- complement(correlations2$criterion, -.4)
correlations2$v8 <- complement(correlations2$criterion, -.3)
correlations2$v9 <- complement(correlations2$criterion, -.2)
correlations2$v10 <-complement(correlations2$criterion, -.1)
correlations2$v11 <-complement(correlations2$criterion, 0)
correlations2$v12 <-complement(correlations2$criterion, .1)
correlations2$v13 <-complement(correlations2$criterion, .2)
correlations2$v14 <-complement(correlations2$criterion, .3)
correlations2$v15 <-complement(correlations2$criterion, .4)
correlations2$v16 <-complement(correlations2$criterion, .5)
correlations2$v17 <-complement(correlations2$criterion, .6)
correlations2$v18 <-complement(correlations2$criterion, .7)
correlations2$v19 <-complement(correlations2$criterion, .8)
correlations2$v20 <-complement(correlations2$criterion, .9)
correlations2$v21 <-complement(correlations2$criterion, 1)
# -1.0
p1 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v1
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "Perfect Negative Association",
subtitle = expression(paste(italic("r"), " = ", "−1.0"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# -0.9
p2 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v2
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "Strong Negative Association",
subtitle = expression(paste(italic("r"), " = ", "−.9"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# -0.5
p3 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v6
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "Moderate Negative Association",
subtitle = expression(paste(italic("r"), " = ", "−.5"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# -0.2
p4 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v9
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "Weak Negative Association",
subtitle = expression(paste(italic("r"), " = ", "−.2"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# 0.0
p5 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v11
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "No Association",
subtitle = expression(paste(italic("r"), " = ", ".0"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# 0.2
p6 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v13
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "Weak Positive Association",
subtitle = expression(paste(italic("r"), " = ", ".2"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# 0.5
p7 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v16
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "Moderate Positive Association",
subtitle = expression(paste(italic("r"), " = ", ".5"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# 0.9
p8 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v20
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "Strong Positive Association",
subtitle = expression(paste(italic("r"), " = ", ".9"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# 1.0
p9 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v21
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "Perfect Positive Association",
subtitle = expression(paste(italic("r"), " = ", "1.0"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
p1 + p2 + p3 + p4 + p5 + p6 + p7 + p8 + p9 +
plot_layout(
ncol = 3,
heights = 1,
widths = 1)
An interactive visualization by Magnusson (2023) on interpreting correlations is at the following link: https://rpsychologist.com/correlation/ (archived at https://perma.cc/G8YR-VCM4)
Keep in mind that the (Pearson) correlation examines the strength of the linear association between two variables. If the association between two variables is nonlinear, the (Pearson) correlation captures only the linear trend and may not provide a meaningful index of the strength of the association between the variables. For instance, Anscombe’s quartet includes four sets of data that have nearly identical descriptive statistics (see Tables 10.1 and 10.2), including the same bivariate correlation, yet have very different distributions and associations that take very different forms (see Figure 10.3).
x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 |
---|---|---|---|---|---|---|---|
10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
14 | 9.96 | 14 | 8.10 | 14 | 8.84 | 8 | 7.04 |
6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 |
4 | 4.26 | 4 | 3.10 | 4 | 5.39 | 19 | 12.50 |
12 | 10.84 | 12 | 9.13 | 12 | 8.15 | 8 | 5.56 |
7 | 4.82 | 7 | 7.26 | 7 | 6.42 | 8 | 7.91 |
5 | 5.68 | 5 | 4.74 | 5 | 5.73 | 8 | 6.89 |
Property | Value |
---|---|
Sample size | 11 |
Mean of X | 9.0 |
Mean of Y | ~7.5 |
Variance of X | 11.0 |
Variance of Y | ~4.1 |
Equation of regression line | Y = 3 + 0.5X |
Standard error of slope | 0.118 |
t-statistic for the slope | 4.24 |
Sum of squares of X | 110.0 |
Regression sum of squares | 27.50 |
Residual sum of squares of Y | 13.75 |
Correlation coefficient | .816 |
Coefficient of determination | .67 |
Code
par(mfrow = c(2,2))

# Plot each of the four Anscombe datasets with its correlation in the title
# and a dashed best-fit line
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]

  plot(
    x,
    y,
    xlab = "x",
    ylab = "y",
    xlim = c(0,20),
    ylim = c(0,15),
    main = substitute(
      paste(italic(r), " = ", v, sep = ""),
      list(v = round(cor.test(x = x, y = y)$estimate, 2))))

  abline(
    lm(y ~ x),
    col = "black",
    lty = 2)
}
10.4 Example: Player Height and Fantasy Points
Is there an association between a player’s height and the number of fantasy points they score? We can examine this possibility separately by position.
First, let’s examine descriptive statistics of height and fantasy points (height is measured in inches):
Code
player_stats_seasonal %>%
dplyr::select(height, fantasyPoints) %>%
dplyr::summarise(across(
everything(),
.fns = list(
n = ~ length(na.omit(.)),
missingness = ~ mean(is.na(.)) * 100,
M = ~ mean(., na.rm = TRUE),
SD = ~ sd(., na.rm = TRUE),
min = ~ min(., na.rm = TRUE),
max = ~ max(., na.rm = TRUE),
range = ~ max(., na.rm = TRUE) - min(., na.rm = TRUE),
IQR = ~ IQR(., na.rm = TRUE),
MAD = ~ mad(., na.rm = TRUE),
median = ~ median(., na.rm = TRUE),
pseudomedian = ~ DescTools::HodgesLehmann(., na.rm = TRUE),
mode = ~ petersenlab::Mode(., multipleModes = "mean"),
skewness = ~ psych::skew(., na.rm = TRUE),
kurtosis = ~ psych::kurtosi(., na.rm = TRUE)),
.names = "{.col}.{.fn}")) %>%
tidyr::pivot_longer(
cols = everything(),
names_to = c("variable","index"),
names_sep = "\\.") %>%
tidyr::pivot_wider(
names_from = index,
values_from = value)
10.4.1 Covariance
The following syntax generates the variance-covariance matrix of a set of variables, where the covariance of a variable with itself (on the diagonal) is the variable’s variance.
Code
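# a sketch of one way to obtain the covariance matrix shown below,
# using pairwise deletion to handle missing values
player_stats_seasonal %>%
  dplyr::select(height, weight, fantasyPoints) %>%
  cov(use = "pairwise.complete.obs")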
height weight fantasyPoints
height 6.750655 82.69151 -9.438497
weight 82.691508 2055.45910 -436.380805
fantasyPoints -9.438497 -436.38081 3842.840696
If you just want the covariance between two variables, you can use the following syntax.
Code
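# a sketch: covariance between two variables, using pairwise complete observations
cov(
  x = player_stats_seasonal$height,
  y = player_stats_seasonal$fantasyPoints,
  use = "pairwise.complete.obs")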
[1] -9.438497
Thus, it appears there is a negative covariance between height and fantasy points in the whole population of National Football League (NFL) players.
As shown below, the negative covariance between height and fantasy points holds when examining just Quarterbacks, Running Backs, Wide Receivers, and Tight Ends.
Code
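# a sketch: covariance matrix among the four fantasy-relevant positions
player_stats_seasonal %>%
  dplyr::filter(position %in% c("QB","RB","WR","TE")) %>%
  dplyr::select(height, fantasyPoints) %>%
  cov(use = "pairwise.complete.obs")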
height fantasyPoints
height 7.762595 -5.505416
fantasyPoints -5.505416 7519.932024
We can also examine the association separately by position. Here is the association for Quarterbacks:
Code
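# a sketch: the same covariance matrix for Quarterbacks only; the code for
# Running Backs, Wide Receivers, and Tight Ends swaps in "RB", "WR", or "TE"
player_stats_seasonal %>%
  dplyr::filter(position == "QB") %>%
  dplyr::select(height, fantasyPoints) %>%
  cov(use = "pairwise.complete.obs")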
height fantasyPoints
height 2.803888 10.70975
fantasyPoints 10.709751 12375.42941
Here is the association for Running Backs:
Code
height fantasyPoints
height 3.149978 5.260645
fantasyPoints 5.260645 7845.217336
Here is the association for Wide Receivers:
Code
height fantasyPoints
height 5.51653 15.38999
fantasyPoints 15.38999 7075.02075
Here is the association for Tight Ends:
Code
height fantasyPoints
height 1.884198 2.760161
fantasyPoints 2.760161 3357.695180
Interestingly, there is a positive covariance between height and fantasy points when examining each position separately. This is an example of Simpson’s paradox, in which the sign of an association differs at different levels of analysis, as described in Section 12.2.2. There is a negative covariance between height and fantasy points when examining Quarterbacks, Running Backs, Wide Receivers, and Tight Ends altogether, but a positive covariance when examining the association within each position separately. That is, if you were to examine all positions together, you might assume that being taller is disadvantageous for scoring fantasy points; however, once we examine the association within a given position, it emerges that height appears to be somewhat advantageous, if anything, for scoring fantasy points.
The covariance is an unstandardized index of the association between two variables. However, it can be helpful to use a standardized index to examine the effect size—i.e., the strength of the association. The Pearson correlation coefficient, \(r\), is an example of a standardized index of the covariance between two variables.
10.4.2 Pearson Correlation
The Pearson correlation assumes that the variables are at the interval or ratio level of measurement. If the data are ordinal, the Spearman correlation is more appropriate.
The following syntax shows the Pearson correlation matrix of a set of variables.
Code
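# a sketch: Pearson correlation matrix with pairwise deletion of missing values
player_stats_seasonal %>%
  dplyr::select(height, weight, fantasyPoints) %>%
  cor(use = "pairwise.complete.obs")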
height weight fantasyPoints
height 1.0000000 0.7021849 -0.0621341
weight 0.7021849 1.0000000 -0.1836901
fantasyPoints -0.0621341 -0.1836901 1.0000000
Notice the diagonal values are all 1.0. That is because a variable is perfectly correlated with itself. Also notice that the values above the diagonal are a mirror image of (i.e., they are the same as) the respective values below the diagonal. The association between two variables is the same regardless of the order (i.e., which is the predictor variable and which is the outcome variable); that is, the association between variables A and B is the same as the association between variables B and A.
If you just want the Pearson correlation between two variables, you can use the following syntax.
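Code
# a sketch: cor.test() gives the correlation estimate along with a significance
# test and confidence interval; the formula interface matches the output below
cor.test(
  ~ height + fantasyPoints,
  data = player_stats_seasonal)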
Pearson's product-moment correlation
data: height and fantasyPoints
t = -12.208, df = 38456, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.07208375 -0.05217209
sample estimates:
cor
-0.0621341
The \(r\) value (-.06) is slightly negative. The effect size of the association is small—for what is considered a small, medium, and large effect size for \(r\) values, see Section 9.4.2.9. The \(p\)-value is less than .05, so it is a statistically significant correlation. We could report this finding as follows: There was a statistically significant, weak negative association between height and fantasy points (\(r(38456) = -.06\), \(p < .001\)); the taller a player was, the fewer fantasy points they tended to score.
Now let’s examine the association separately by position. Here is the association for Quarterbacks:
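Code
# a sketch: the same test among Quarterbacks; the code for the other positions
# swaps in "RB", "WR", or "TE"
cor.test(
  ~ height + fantasyPoints,
  data = player_stats_seasonal %>% dplyr::filter(position == "QB"))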
Pearson's product-moment correlation
data: height and fantasyPoints
t = 2.5754, df = 2000, p-value = 0.01008
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.01371905 0.10104803
sample estimates:
cor
0.05749352
Here is the association for Running Backs:
Pearson's product-moment correlation
data: height and fantasyPoints
t = 2.0515, df = 3754, p-value = 0.04029
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.001483591 0.065376776
sample estimates:
cor
0.03346437
Here is the association for Wide Receivers:
Pearson's product-moment correlation
data: height and fantasyPoints
t = 5.735, df = 5387, p-value = 1.028e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.05130742 0.10438364
sample estimates:
cor
0.07790072
Here is the association for Tight Ends:
Pearson's product-moment correlation
data: height and fantasyPoints
t = 1.9022, df = 3001, p-value = 0.05725
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.001068264 0.070382934
sample estimates:
cor
0.03470168
The sign of the association was positive for each position, suggesting that, for a given position (among Quarterbacks, Running Backs, Wide Receivers, and Tight Ends), taller players tend to score more fantasy points. However, the effect size is small. Moreover, as described in Section 10.8, just because there is an association between variables does not mean that the association reflects a causal effect.
10.4.3 Scatterplot with Best-Fit Line
A scatterplot with best-fit line is in Figure 10.4.
Code
plot_scatterplot <- ggplot2::ggplot(
data = player_stats_seasonal %>% filter(position %in% c("QB","RB","WR","TE")),
aes(
x = height,
y = fantasyPoints)) +
geom_point(
aes(
text = player_display_name, # add player name for mouse over tooltip
label = season), # add season for mouse over tooltip
alpha = 0.05) +
geom_smooth(
method = "lm",
color = "black") +
geom_smooth() +
coord_cartesian(
ylim = c(0,NA),
expand = FALSE) +
labs(
x = "Player Height (Inches)",
y = "Fantasy Points (Season)",
title = "Fantasy Points (Season) by Player Height",
subtitle = "(Among QBs, RBs, WRs, and TEs)"
) +
theme_classic()
plotly::ggplotly(plot_scatterplot)
Examining the linear best-fit line, we see a slight negative slope, consistent with a weak negative association. However, the nonlinear best-fit line suggests an inverted-U-shaped association, which might indicate that there is an “optimal range” of height, where being too tall or too short may be a disadvantage. We now examine the association between height and fantasy points separately by position in Figure 10.5.
Code
plot_scatterplotByPosition <- ggplot2::ggplot(
data = player_stats_seasonal %>% filter(position %in% c("QB","RB","WR","TE")),
aes(
x = height,
y = fantasyPoints,
color = position,
fill = position)) +
geom_point(
aes(
text = player_display_name, # add player name for mouse over tooltip
label = season), # add season for mouse over tooltip
alpha = 0.7) +
geom_smooth(
method = "lm",
color = "black") +
geom_smooth(
method = "loess",
span = 0.5) +
coord_cartesian(
ylim = c(0,NA),
expand = FALSE) +
labs(
x = "Player Height (Inches)",
y = "Fantasy Points (Season)",
title = "Fantasy Points (Season) by Player Height and Position"
) +
guides(fill = "none") +
theme_classic()
plotly::ggplotly(plot_scatterplotByPosition)
Examining the association between height and fantasy points separately by position, it appears that there is an inverted-U-shaped association for Wide Receivers, a relatively flat association for Running Backs and Tight Ends, and a nonlinear association for Quarterbacks. Interestingly, the best-fit lines suggest that the different positions have different height ranges, which is confirmed in Figure 10.6.
Code
ggplot2::ggplot(
data = player_stats_seasonal %>%
filter(position %in% c("QB","RB","WR","TE")),
mapping = aes(
x = height,
y = position,
group = position,
fill = position)
) +
ggridges::geom_density_ridges(
bandwidth = 0.6
) +
labs(
x = "Height (inches)",
y = "Position",
title = "Ridgeline Plot of Player Height by Position"
) +
theme_classic() +
theme(
legend.position = "none",
axis.title.y = element_text(angle = 0, vjust = 0.5)) # horizontal y-axis title
According to the distributions of player height by position (among Quarterbacks, Running Backs, Wide Receivers, and Tight Ends), Running Backs tend to be the shortest, followed by Wide Receivers; Tight Ends tend to be the tallest, followed by Quarterbacks. Wide Receivers showed the greatest variability; some were quite short, whereas others were quite tall.
10.4.4 Rank Correlation: Spearman’s Rho (\(\rho\))
Spearman’s rho (\(\rho\)) is a rank correlation. It does not assume that the data are at the interval or ratio level of measurement; unlike the Pearson correlation, Spearman’s rho can estimate associations among variables that are at the ordinal level of measurement.
For instance, using Spearman’s rho, we could examine the association between a player’s draft round and their fantasy points, because draft round is an ordinal variable:
Code
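# a sketch: Spearman correlations use method = "spearman"
player_stats_seasonal %>%
  dplyr::select(draftround, fantasyPoints) %>%
  cor(method = "spearman", use = "pairwise.complete.obs")

cor.test(
  ~ draftround + fantasyPoints,
  data = player_stats_seasonal,
  method = "spearman")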
draftround fantasyPoints
draftround 1.0000000 -0.3628814
fantasyPoints -0.3628814 1.0000000
Spearman's rank correlation rho
data: draftround and fantasyPoints
S = 5.8025e+11, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.3628814
Using Spearman’s rho, there is a negative association between draft round and fantasy points (\(r_s(13668) = -.36\), \(p < .001\)); unsurprisingly, the earlier the round in which a player is drafted, the more fantasy points they tend to score.
10.4.5 Partial Correlation
A partial correlation examines the association between variables after controlling for one or more other variables. Below, we examine a partial Pearson correlation between height and fantasy points, controlling for age, using the partial.r() function from the psych package (Revelle, 2025).
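Code
# a sketch; assumes the dataframe includes an age column
psych::partial.r(
  data = player_stats_seasonal,
  x = c("height","fantasyPoints"),
  y = "age",
  use = "pairwise",
  method = "pearson")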
partial correlations
height fantasyPoints
height 1.00 -0.07
fantasyPoints -0.07 1.00
Below, we examine a partial Spearman’s rho rank correlation between height and fantasy points after controlling for age.
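Code
# a sketch: the same partial correlation using Spearman's rho
psych::partial.r(
  data = player_stats_seasonal,
  x = c("height","fantasyPoints"),
  y = "age",
  use = "pairwise",
  method = "spearman")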
10.4.6 Nonlinear Correlation
Examples of indices of nonlinear correlation include the distance correlation and the xi (\(\xi\)) coefficient. These estimates of nonlinear correlation range only from 0 to 1 (where 0 represents no association and 1 represents a strong association), so they are interpreted differently than a traditional correlation coefficient (which can be negative). We can estimate a distance correlation using the correlation() function of the correlation package (Makowski et al., 2020, 2025).
Code
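# a sketch: distance correlation via method = "distance"; distance correlation
# scales quadratically with sample size, so a subsample may be needed for large data
player_stats_seasonal %>%
  dplyr::select(height, fantasyPoints) %>%
  tidyr::drop_na() %>%
  correlation::correlation(method = "distance")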
A nonlinear correlation can also be estimated using the xi (\(\xi\)) coefficient from the XICOR package (Chatterjee, 2021; Holmes & Chatterjee, 2023):
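Code
# a sketch; complete cases are used here, and note that xi is asymmetric:
# xicor(x, y) and xicor(y, x) can differ
heightPoints <- na.omit(player_stats_seasonal[, c("height","fantasyPoints")])

XICOR::xicor(
  heightPoints$height,
  heightPoints$fantasyPoints,
  pvalue = TRUE)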
10.4.7 Correlation Matrix
The petersenlab package (Petersen, 2025a) contains the cor.table() function that generates a correlation matrix of variables, including the \(r\) value, number of observations (\(n\)), asterisks denoting statistical significance, and \(p\)-value for each association. The asterisks follow the traditional convention for statistical significance: \(^\dagger p < .1\); \(^* p < .05\); \(^{**} p < .01\); \(^{***} p < .001\). Here is a Pearson correlation matrix:
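Code
# a sketch; assumes these variables exist in the dataframe
player_stats_seasonal %>%
  dplyr::select(height, weight, fantasyPoints) %>%
  petersenlab::cor.table()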
Here is a Spearman correlation matrix:
Code
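# a sketch; the correlation argument name is an assumption (see ?petersenlab::cor.table)
player_stats_seasonal %>%
  dplyr::select(height, weight, fantasyPoints) %>%
  petersenlab::cor.table(correlation = "spearman")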
Here is a correlation matrix of variables that includes just the \(r\) value and asterisks denoting the statistical significance of the association, for greater concision.
Code
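# a sketch; the type argument and value are assumptions (see ?petersenlab::cor.table)
player_stats_seasonal %>%
  dplyr::select(height, weight, fantasyPoints) %>%
  petersenlab::cor.table(type = "manuscript")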
Here is a correlation matrix of variables that includes just the \(r\) value for even greater concision.
Code
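# a sketch; the type argument and value are assumptions (see ?petersenlab::cor.table)
player_stats_seasonal %>%
  dplyr::select(height, weight, fantasyPoints) %>%
  petersenlab::cor.table(type = "manuscriptBig")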
The petersenlab package (Petersen, 2025a) contains the partialcor.table() function that generates a partial correlation matrix. Here is a partial correlation matrix, controlling for the player’s age:
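Code
# a sketch; the argument names are assumptions (see ?petersenlab::partialcor.table)
player_stats_seasonal %>%
  dplyr::select(height, weight, fantasyPoints, age) %>%
  petersenlab::partialcor.table(z = "age")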
10.4.8 Correlogram
We can depict a correlogram using the corrplot() function of the corrplot package (Wei & Simko, 2024), as shown in Figure 10.7.
10.5 Addressing Non-Independence of Observations
Please note that the \(p\)-value for a correlation assumes that the observations are independent—in particular, that the residuals are not correlated. However, the observations are not independent in the player_stats_seasonal dataframe used above, because the same player has multiple rows—one row corresponding to each season they played. This non-independence violates the traditional assumptions of the significance test of a correlation. We could address this assumption by analyzing only one season from each player or by estimating the significance of the correlation coefficient using cluster-robust standard errors. For simplicity, we present results above from the whole dataframe. In Chapter 12, we discuss mixed model approaches that handle repeated measures and other data that violate the assumptions of non-independence shared by correlation and multiple regression; the assumptions of multiple regression are described in Section 11.4. In Section 11.11, we demonstrate how to account for non-independence of observations using cluster-robust standard errors.
10.6 Impact of Outliers
The correlation coefficient is strongly impacted by outliers—i.e., values that are extremely large or small relative to the other observations. Consider the following example:
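Code
# a sketch of data like those used for this example; the actual values are not
# shown, so these simulated values are placeholders
set.seed(52242)
v1 <- rnorm(10)
v2 <- rnorm(10)

cor.test(v1, v2)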
Pearson's product-moment correlation
data: v1 and v2
t = 0.31786, df = 8, p-value = 0.7587
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.5571223 0.6926038
sample estimates:
cor
0.1116785
The associated scatterplot and best-fit line are in Figure 10.8.
Now, let’s add one outlier.
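Code
# a sketch: append a single extreme observation to each variable
v1 <- c(v1, 10)
v2 <- c(v2, 10)

cor.test(v1, v2)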
Pearson's product-moment correlation
data: v1 and v2
t = 2.7132, df = 9, p-value = 0.02387
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1186112 0.9060613
sample estimates:
cor
0.6707604
The associated scatterplot and best-fit line for the updated data are in Figure 10.9.
Code
Note how the association was not close to being statistically significant without the outlier, but with the outlier added, the association is statistically significant. One way to combat this is to use methods for estimating the association between variables that are robust to (i.e., less impacted by) outliers. In the next section, we describe robust correlation methods.
10.6.1 Examining Robust Correlation
There are various approaches to estimating correlation in the presence of outliers—so-called robust correlation methods.
One approach is the biweight midcorrelation, which is based on the median rather than the mean and is thus less sensitive to outliers. Another approach is the percentage bend correlation, which gives less weight to observations that are farther from the median. We can estimate each using the correlation() function of the correlation package (Makowski et al., 2020, 2025).
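Code
# a sketch, using the example variables v1 and v2 from above: the biweight
# midcorrelation and the percentage bend correlation
data.frame(v1, v2) %>%
  correlation::correlation(method = "biweight")

data.frame(v1, v2) %>%
  correlation::correlation(method = "percentage")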
10.7 Impact of Restricted Range
In addition to being impacted by outliers, correlations can also be greatly impacted by restricted variability or restricted range (Petersen, 2024, 2025b). Correlation depends on variability; if there is no or limited variability, it can be difficult to detect an association with another variable. Thus, if a variable has restricted range—such as owing to a floor effect or ceiling effect—that tends to artificially weaken associations. For instance, a floor effect is one in which many of the scores are the lowest possible score. By contrast, a ceiling effect is one in which many of the scores are the highest possible score.
Consider the following association between passing attempts and expected points added (EPA) via passing. Expected points added from a given play is calculated as the difference between a team’s expected points before the play and the team’s expected points after the play. This can be summed across all passing plays to determine a player’s EPA from passing during a given season.
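Code
# the call implied by the output below
cor.test(
  x = player_stats_seasonal$attempts,
  y = player_stats_seasonal$passing_epa)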
Pearson's product-moment correlation
data: player_stats_seasonal$attempts and player_stats_seasonal$passing_epa
t = 24.822, df = 2758, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3963356 0.4573453
sample estimates:
cor
0.4273268
There is a statistically significant, moderate positive association between passing attempts and expected points added (EPA) via passing, as depicted in Figure 10.10.
Code
plot_scatterplotWithoutRangeRestriction <- ggplot2::ggplot(
data = player_stats_seasonal,
aes(
x = attempts,
y = passing_epa)) +
geom_point(
aes(
text = player_display_name, # add player name for mouse over tooltip
label = season), # add season for mouse over tooltip
alpha = 0.3) +
geom_smooth(
method = "lm") +
#geom_smooth() +
coord_cartesian(
expand = FALSE) +
labs(
x = "Player's Passing Attempts",
y = "Player's Expected Points Added (EPA) via Passing",
title = "Player's Passing Expected Points Added (EPA)\nby Passing Attempts (Season)",
#subtitle = ""
) +
theme_classic()
plotly::ggplotly(plot_scatterplotWithoutRangeRestriction)
Now, consider the same association when we restrict the range to examine only those players who had fewer than 450 pass attempts in a season (thus resulting in less variability):
Code
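# the call implied by the output below
cor.test(
  x = player_stats_seasonal %>% filter(attempts < 450) %>% select(attempts) %>% pull(),
  y = player_stats_seasonal %>% filter(attempts < 450) %>% select(passing_epa) %>% pull())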
Pearson's product-moment correlation
data: player_stats_seasonal %>% filter(attempts < 450) %>% select(attempts) %>% pull() and player_stats_seasonal %>% filter(attempts < 450) %>% select(passing_epa) %>% pull()
t = -2.4303, df = 2302, p-value = 0.01516
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.091236097 -0.009771823
sample estimates:
cor
-0.05058811
The association is no longer positive and is weak in terms of effect size, as depicted in Figure 10.11.
Code
plot_scatterplotWithRangeRestriction <- ggplot2::ggplot(
data = player_stats_seasonal %>% filter(attempts < 450),
aes(
x = attempts,
y = passing_epa)) +
geom_point(
aes(
text = player_display_name, # add player name for mouse over tooltip
label = season), # add season for mouse over tooltip
alpha = 0.3) +
geom_smooth(
method = "lm") +
#geom_smooth() +
coord_cartesian(
expand = FALSE) +
labs(
x = "Player's Passing Attempts",
y = "Player's Expected Points Added (EPA) via Passing",
title = "Player's Passing Expected Points Added (EPA)\nby Passing Attempts (Season)",
subtitle = "(Among Players with < 450 Passing Attempts)"
) +
theme_classic()
plotly::ggplotly(plot_scatterplotWithRangeRestriction)
10.8 Correlation Does Not Imply Causation
As described in Section 8.4.2.1, correlation does not imply causation. There are several reasons (described in Section 8.4.2.1) why a correlation between \(X\) and \(Y\) does not necessarily mean that \(X\) causes \(Y\). However, correlation can still be useful. In order for two processes to be causally related, they must be associated. That is, association is necessary but insufficient for causality.
10.9 Conclusion
Correlation is an index of the association between variables. The correlation coefficient (\(r\)) ranges from −1 to +1, and indicates the sign and magnitude of the association. Although correlation does not imply causation, identifying associations between variables can still be useful because association is a necessary (but insufficient) condition for causality.
10.10 Session Info
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] patchwork_1.3.0 lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1
[5] dplyr_1.1.4 purrr_1.0.4 readr_2.1.5 tidyr_1.3.1
[9] tibble_3.2.1 ggplot2_3.5.2 tidyverse_2.0.0 ggridges_0.5.6
[13] correlation_0.8.7 corrplot_0.95 DescTools_0.99.60 psych_2.5.3
[17] XICOR_0.4.1 petersenlab_1.1.3
loaded via a namespace (and not attached):
[1] DBI_1.2.3 mnormt_2.1.1 gridExtra_2.3 gld_2.6.7
[5] readxl_1.4.5 rlang_1.1.6 magrittr_2.0.3 e1071_1.7-16
[9] compiler_4.5.0 mgcv_1.9-1 vctrs_0.6.5 reshape2_1.4.4
[13] quadprog_1.5-8 pkgconfig_2.0.3 fastmap_1.2.0 backports_1.5.0
[17] labeling_0.4.3 pbivnorm_0.6.0 rmarkdown_2.29 psychTools_2.5.3
[21] tzdb_0.5.0 haven_2.5.4 xfun_0.52 jsonlite_2.0.0
[25] parallel_4.5.0 lavaan_0.6-19 cluster_2.1.8.1 R6_2.6.1
[29] stringi_1.8.7 RColorBrewer_1.1-3 boot_1.3-31 rpart_4.1.24
[33] cellranger_1.1.0 Rcpp_1.0.14 knitr_1.50 base64enc_0.1-3
[37] splines_4.5.0 Matrix_1.7-3 nnet_7.3-20 timechange_0.3.0
[41] tidyselect_1.2.1 rstudioapi_0.17.1 yaml_2.3.10 lattice_0.22-6
[45] plyr_1.8.9 withr_3.0.2 bayestestR_0.15.2 evaluate_1.0.3
[49] foreign_0.8-90 proxy_0.4-27 pillar_1.10.2 checkmate_2.3.2
[53] rtf_0.4-14.1 stats4_4.5.0 insight_1.2.0 plotly_4.10.4
[57] generics_0.1.3 mix_1.0-13 hms_1.1.3 scales_1.4.0
[61] rootSolve_1.8.2.4 xtable_1.8-4 class_7.3-23 glue_1.8.0
[65] lazyeval_0.2.2 Hmisc_5.2-3 lmom_3.2 tools_4.5.0
[69] data.table_1.17.0 Exact_3.3 fs_1.6.6 mvtnorm_1.3-3
[73] grid_4.5.0 mitools_2.4 crosstalk_1.2.1 datawizard_1.0.2
[77] colorspace_2.1-1 nlme_3.1-168 htmlTable_2.4.3 Formula_1.2-5
[81] cli_3.6.5 expm_1.0-0 viridisLite_0.4.2 gtable_0.3.6
[85] R.methodsS3_1.8.2 digest_0.6.37 htmlwidgets_1.6.4 farver_2.1.2
[89] htmltools_0.5.8.1 R.oo_1.27.0 lifecycle_1.0.4 httr_1.4.7
[93] MASS_7.3-65