I want your feedback to make the book better for you and other readers. If you find typos, errors, or places where the text may be improved, please let me know. The best ways to provide feedback are by GitHub or hypothes.is annotations.
Opening an issue or submitting a pull request on GitHub: https://github.com/isaactpetersen/Fantasy-Football-Analytics-Textbook
Adding an annotation using hypothes.is. To add an annotation, select some text and then click the symbol on the pop-up menu. To see the annotations of others, click the symbol in the upper right-hand corner of the page.
10 Correlation Analysis
10.1 Getting Started
10.1.1 Load Packages
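The code for this step is not shown in this section. Judging only from the attached packages listed in the session info (Section 10.8), the chapter appears to rely on `petersenlab`, `XICOR`, and `tidyverse`; that list is an inference, not a confirmed record of the original code. A minimal sketch that attaches whichever of those packages are installed might look like this:

```r
# Packages inferred from the session info; this list is an assumption
chapter_pkgs <- c("petersenlab", "XICOR", "tidyverse")

# Check which of the packages are installed without attaching them yet
is_installed <- vapply(
  chapter_pkgs,
  requireNamespace,
  logical(1),
  quietly = TRUE)

# Attach the packages that are available
for (pkg in chapter_pkgs[is_installed]) {
  library(pkg, character.only = TRUE)
}

# Report any packages that still need to be installed
missing_pkgs <- chapter_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```

The later code in this chapter calls `complement()` and `ggplot()`, so it will only run once `petersenlab` and `tidyverse` (or `ggplot2`) are attached.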
10.2 Overview of Correlation
Correlation is an index of the association between variables. Covariance indexes the association between variables in an unstandardized metric, so its value differs for variables with different scales. By contrast, correlation is in a standardized metric that does not differ for variables with different scales. When examining the association between variables that are interval or ratio in level of measurement, Pearson correlation is used. When examining the association between variables that are ordinal in level of measurement, Spearman correlation is used. Pearson correlation is an index of the linear association between variables. If a nonlinear association is present, other indices like xi [\(\xi\); Chatterjee (2021)] and distance correlation coefficients are better suited to detect the association.
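To make the distinction between covariance and correlation concrete, here is a minimal sketch in base R. The data are simulated and the variable names (`height_in`, `weight_lb`, and so on) are hypothetical; they are not the data used elsewhere in this book. The sketch rescales the same two variables and compares how covariance and correlation respond, and it also shows that Spearman correlation amounts to a Pearson correlation computed on ranks.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
height_in <- rnorm(100, mean = 70, sd = 3)        # height in inches
weight_lb <- 2 * height_in + rnorm(100, sd = 10)  # weight in pounds

# Re-express the same variables on different scales
height_cm <- height_in * 2.54
weight_kg <- weight_lb * 0.4536

# Covariance depends on the units of measurement
cov_original <- cov(height_in, weight_lb)
cov_rescaled <- cov(height_cm, weight_kg)

# Pearson correlation is unchanged by the rescaling
r_original <- cor(height_in, weight_lb)
r_rescaled <- cor(height_cm, weight_kg)

# Spearman correlation is the Pearson correlation of the ranks
rho_builtin <- cor(height_in, weight_lb, method = "spearman")
rho_ranks <- cor(rank(height_in), rank(weight_lb))

c(cov_original = cov_original, cov_rescaled = cov_rescaled)
c(r_original = r_original, r_rescaled = r_rescaled)
c(rho_builtin = rho_builtin, rho_ranks = rho_ranks)
```

The two covariances differ by exactly the product of the scaling factors (2.54 × 0.4536), whereas the two Pearson correlations are identical.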
10.3 The Correlation Coefficient (\(r\))
The formula for the correlation coefficient is in Equation 9.22.
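Equation 9.22 is not reproduced in this section. Assuming it is the usual sample Pearson formula (the covariance of the two variables divided by the product of their standard deviations), the following sketch computes \(r\) by hand on simulated data and compares the result with the built-in `cor()` function. The variable names are hypothetical.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(2)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

# Deviations of each observation from its variable's mean
dx <- x - mean(x)
dy <- y - mean(y)

# Sum of cross-products divided by the square root of the
# product of the sums of squares
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Equivalent form: covariance divided by the product of the
# standard deviations
r_from_cov <- cov(x, y) / (sd(x) * sd(y))

# Built-in function
r_builtin <- cor(x, y)

c(r_manual = r_manual, r_from_cov = r_from_cov, r_builtin = r_builtin)
```

All three values agree, which shows that correlation is covariance re-expressed in standard deviation units.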
The correlation coefficient ranges from −1.0 to +1.0. The correlation coefficient (\(r\)) tells you two things: (1) the direction (sign) of the association (positive or negative) and (2) the magnitude of the association. If the correlation coefficient is positive, the association is positive. If the correlation coefficient is negative, the association is negative. If the association is positive, as \(X\) increases, \(Y\) increases (or conversely, as \(X\) decreases, \(Y\) decreases). If the association is negative, as \(X\) increases, \(Y\) decreases (or conversely, as \(X\) decreases, \(Y\) increases). The smaller the absolute value of the correlation coefficient (i.e., the closer the \(r\) value is to zero), the weaker the association and, when both variables are on a standardized scale, the flatter the slope of the best-fit line in a scatterplot. The larger the absolute value of the correlation coefficient (i.e., the closer the absolute value of the \(r\) value is to one), the stronger the association and, again for standardized variables, the steeper the slope of the best-fit line in a scatterplot. See Figure 10.1 for a range of different correlation coefficients and what some example data may look like for each direction and strength of association.
Code
set.seed(52242)
correlations <- data.frame(criterion = rnorm(1000))
correlations$v1 <- complement(correlations$criterion, -1)
correlations$v2 <- complement(correlations$criterion, -.9)
correlations$v3 <- complement(correlations$criterion, -.8)
correlations$v4 <- complement(correlations$criterion, -.7)
correlations$v5 <- complement(correlations$criterion, -.6)
correlations$v6 <- complement(correlations$criterion, -.5)
correlations$v7 <- complement(correlations$criterion, -.4)
correlations$v8 <- complement(correlations$criterion, -.3)
correlations$v9 <- complement(correlations$criterion, -.2)
correlations$v10 <- complement(correlations$criterion, -.1)
correlations$v11 <- complement(correlations$criterion, 0)
correlations$v12 <- complement(correlations$criterion, .1)
correlations$v13 <- complement(correlations$criterion, .2)
correlations$v14 <- complement(correlations$criterion, .3)
correlations$v15 <- complement(correlations$criterion, .4)
correlations$v16 <- complement(correlations$criterion, .5)
correlations$v17 <- complement(correlations$criterion, .6)
correlations$v18 <- complement(correlations$criterion, .7)
correlations$v19 <- complement(correlations$criterion, .8)
correlations$v20 <- complement(correlations$criterion, .9)
correlations$v21 <- complement(correlations$criterion, 1)
par(mfrow = c(7,3), mar = c(1, 0, 1, 0))
# -1.0
plot(correlations$criterion, correlations$v1, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v1)$estimate, 2))))
abline(lm(v1 ~ criterion, data = correlations), col = "black")
# -.9
plot(correlations$criterion, correlations$v2, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v2)$estimate, 2))))
abline(lm(v2 ~ criterion, data = correlations), col = "black")
# -.8
plot(correlations$criterion, correlations$v3, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v3)$estimate, 2))))
abline(lm(v3 ~ criterion, data = correlations), col = "black")
# -.7
plot(correlations$criterion, correlations$v4, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v4)$estimate, 2))))
abline(lm(v4 ~ criterion, data = correlations), col = "black")
# -.6
plot(correlations$criterion, correlations$v5, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v5)$estimate, 2))))
abline(lm(v5 ~ criterion, data = correlations), col = "black")
# -.5
plot(correlations$criterion, correlations$v6, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v6)$estimate, 2))))
abline(lm(v6 ~ criterion, data = correlations), col = "black")
# -.4
plot(correlations$criterion, correlations$v7, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v7)$estimate, 2))))
abline(lm(v7 ~ criterion, data = correlations), col = "black")
# -.3
plot(correlations$criterion, correlations$v8, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v8)$estimate, 2))))
abline(lm(v8 ~ criterion, data = correlations), col = "black")
# -.2
plot(correlations$criterion, correlations$v9, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v9)$estimate, 2))))
abline(lm(v9 ~ criterion, data = correlations), col = "black")
# -.1
plot(correlations$criterion, correlations$v10, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v10)$estimate, 2))))
abline(lm(v10 ~ criterion, data = correlations), col = "black")
# 0.0
plot(correlations$criterion, correlations$v11, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v11)$estimate, 2))))
abline(lm(v11 ~ criterion, data = correlations), col = "black")
# 0.1
plot(correlations$criterion, correlations$v12, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v12)$estimate, 2))))
abline(lm(v12 ~ criterion, data = correlations), col = "black")
# 0.2
plot(correlations$criterion, correlations$v13, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v13)$estimate, 2))))
abline(lm(v13 ~ criterion, data = correlations), col = "black")
# 0.3
plot(correlations$criterion, correlations$v14, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v14)$estimate, 2))))
abline(lm(v14 ~ criterion, data = correlations), col = "black")
# 0.4
plot(correlations$criterion, correlations$v15, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v15)$estimate, 2))))
abline(lm(v15 ~ criterion, data = correlations), col = "black")
# 0.5
plot(correlations$criterion, correlations$v16, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v16)$estimate, 2))))
abline(lm(v16 ~ criterion, data = correlations), col = "black")
# 0.6
plot(correlations$criterion, correlations$v17, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v17)$estimate, 2))))
abline(lm(v17 ~ criterion, data = correlations), col = "black")
# 0.7
plot(correlations$criterion, correlations$v18, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v18)$estimate, 2))))
abline(lm(v18 ~ criterion, data = correlations), col = "black")
# 0.8
plot(correlations$criterion, correlations$v19, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v19)$estimate, 2))))
abline(lm(v19 ~ criterion, data = correlations), col = "black")
# 0.9
plot(correlations$criterion, correlations$v20, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v20)$estimate, 2))))
abline(lm(v20 ~ criterion, data = correlations), col = "black")
# 1.0
plot(correlations$criterion, correlations$v21, xaxt = "n", yaxt = "n", xlab = "" , ylab = "",
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = correlations$criterion, y = correlations$v21)$estimate, 2))))
abline(lm(v21 ~ criterion, data = correlations), col = "black")
invisible(dev.off()) #par(mfrow = c(1,1))
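The link between the size of \(r\) and the slope of the best-fit line is most direct when both variables are standardized. As a minimal sketch in base R with simulated data (the variable names are hypothetical), the following code checks that the slope from a regression of standardized \(Y\) on standardized \(X\) equals \(r\), whereas the slope in the original units also depends on the standard deviations of the two variables.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(3)
x <- rnorm(200)
y <- 0.6 * x + rnorm(200, sd = 0.8)

r <- cor(x, y)

# Slope of the best-fit line in the original units
slope_raw <- unname(coef(lm(y ~ x))["x"])

# Slope of the best-fit line after standardizing both variables
zx <- as.numeric(scale(x))
zy <- as.numeric(scale(y))
slope_std <- unname(coef(lm(zy ~ zx))["zx"])

# Multiplying y by 10 changes the raw slope but not the correlation
y10 <- 10 * y
r_y10 <- cor(x, y10)
slope_y10 <- unname(coef(lm(y10 ~ x))["x"])

c(r = r, slope_std = slope_std, slope_raw = slope_raw)
c(r_y10 = r_y10, slope_y10 = slope_y10)
```

In the original units, the slope equals \(r\) multiplied by the ratio of the standard deviation of \(Y\) to the standard deviation of \(X\), so a steep line in raw units does not by itself indicate a strong association.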
See Figure 10.2 for the interpretation of the magnitude and direction (sign) of various correlation coefficients.
Code
library("patchwork")
set.seed(52242)
correlations2 <- data.frame(criterion = rnorm(15))
correlations2$v1 <- complement(correlations2$criterion, -1)
correlations2$v2 <- complement(correlations2$criterion, -.9)
correlations2$v3 <- complement(correlations2$criterion, -.8)
correlations2$v4 <- complement(correlations2$criterion, -.7)
correlations2$v5 <- complement(correlations2$criterion, -.6)
correlations2$v6 <- complement(correlations2$criterion, -.5)
correlations2$v7 <- complement(correlations2$criterion, -.4)
correlations2$v8 <- complement(correlations2$criterion, -.3)
correlations2$v9 <- complement(correlations2$criterion, -.2)
correlations2$v10 <- complement(correlations2$criterion, -.1)
correlations2$v11 <- complement(correlations2$criterion, 0)
correlations2$v12 <- complement(correlations2$criterion, .1)
correlations2$v13 <- complement(correlations2$criterion, .2)
correlations2$v14 <- complement(correlations2$criterion, .3)
correlations2$v15 <- complement(correlations2$criterion, .4)
correlations2$v16 <- complement(correlations2$criterion, .5)
correlations2$v17 <- complement(correlations2$criterion, .6)
correlations2$v18 <- complement(correlations2$criterion, .7)
correlations2$v19 <- complement(correlations2$criterion, .8)
correlations2$v20 <- complement(correlations2$criterion, .9)
correlations2$v21 <- complement(correlations2$criterion, 1)
# -1.0
p1 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v1
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "Perfect Negative Association",
subtitle = expression(paste(italic("r"), " = ", "−1.0"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# -0.9
p2 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v2
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "Strong Negative Association",
subtitle = expression(paste(italic("r"), " = ", "−.9"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# -0.5
p3 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v6
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "Moderate Negative Association",
subtitle = expression(paste(italic("r"), " = ", "−.5"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# -0.2
p4 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v9
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "Weak Negative Association",
subtitle = expression(paste(italic("r"), " = ", "−.2"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# 0.0
p5 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v11
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "No Association",
subtitle = expression(paste(italic("r"), " = ", ".0"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# 0.2
p6 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v13
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "Weak Positive Association",
subtitle = expression(paste(italic("r"), " = ", ".2"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# 0.5
p7 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v16
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "Moderate Positive Association",
subtitle = expression(paste(italic("r"), " = ", ".5"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# 0.9
p8 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v20
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "Strong Positive Association",
subtitle = expression(paste(italic("r"), " = ", ".9"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# 1.0
p9 <- ggplot(
data = correlations2,
mapping = aes(
x = criterion,
y = v21
)
) +
geom_point() +
geom_smooth(
method = "lm",
se = FALSE) +
labs(
title = "Perfect Positive Association",
subtitle = expression(paste(italic("r"), " = ", "1.0"))
) +
theme_classic(
base_size = 12) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
p1 + p2 + p3 + p4 + p5 + p6 + p7 + p8 + p9 +
plot_layout(
ncol = 3,
heights = 1,
widths = 1)
Interactive visualizations by Kristoffer Magnusson on p-values and null-hypothesis significance testing are below:
10.4 Examples
10.4.1 Covariance
10.4.2 Pearson Correlation
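The chapter's own example for this subsection is not shown here. As a stand-in, the following minimal sketch uses simulated data (hypothetical variables `x` and `y`, not the book's football data) to show how a Pearson correlation and its significance test can be obtained with `cor.test()` from base R.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(5)
n <- 100
x <- rnorm(n)
y <- 0.4 * x + rnorm(n)

# Pearson correlation with a test of the null hypothesis that
# the population correlation is zero
pearson_test <- cor.test(x, y, method = "pearson")

pearson_test$estimate   # sample correlation coefficient (r)
pearson_test$conf.int   # 95% confidence interval for r
pearson_test$p.value    # p-value for the test
pearson_test$parameter  # degrees of freedom (n - 2)
```

The `estimate` element matches the value returned by `cor(x, y)`; `cor.test()` adds the confidence interval and p-value.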
10.4.3 Spearman Correlation
10.4.4 Nonlinear Correlation
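The chapter's own example for this subsection is not shown here. To illustrate why Pearson correlation can miss a nonlinear association, the following sketch simulates a noise-free U-shaped relation and compares Pearson's \(r\) with Chatterjee's \(\xi\). The helper `xi_no_ties()` is a hypothetical, simplified implementation of the coefficient for data without ties; it is not the function from the `XICOR` package, which would be the better choice in practice.

```r
# Simplified xi coefficient for data with no ties (hypothetical helper):
# sort the observations by x, rank the y values in that order, and
# sum the absolute differences between consecutive ranks
xi_no_ties <- function(x, y) {
  n <- length(x)
  y_ranks <- rank(y[order(x)])
  1 - 3 * sum(abs(diff(y_ranks))) / (n^2 - 1)
}

# Simulated U-shaped association (hypothetical, for illustration only)
set.seed(7)
x <- runif(1000, min = -1, max = 1)
y <- x^2

r_pearson <- cor(x, y)
xi_value <- xi_no_ties(x, y)

c(r_pearson = r_pearson, xi_value = xi_value)
```

Because \(Y\) is completely determined by \(X\) here, \(\xi\) is close to one, whereas Pearson's \(r\) is close to zero because the association has no linear component.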
10.4.5 Correlation Matrix
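The chapter's own example for this subsection is not shown here. As a stand-in, the following minimal sketch computes a correlation matrix with base R, using the `mtcars` dataset that ships with R rather than the book's football data. The choice of variables is arbitrary.

```r
# Select a few numeric variables from a dataset that ships with R
vars <- mtcars[, c("mpg", "hp", "wt", "qsec")]

# Pearson correlation matrix; pairwise deletion handles missing values
# (mtcars has none, so the argument is included only for illustration)
cor_matrix <- cor(
  vars,
  use = "pairwise.complete.obs",
  method = "pearson")

round(cor_matrix, 2)
```

The matrix is symmetric with ones on the diagonal, because each variable correlates perfectly with itself and the correlation of \(X\) with \(Y\) equals the correlation of \(Y\) with \(X\).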
10.4.6 Correlogram
10.5 Impact of Outliers
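The chapter's own example for this section is not shown here. As a stand-in, the following minimal sketch uses simulated data (hypothetical, not the book's data) to show how a single extreme observation can change a Pearson correlation, and how the rank-based Spearman correlation responds to the same observation.

```r
# Two variables that are unrelated by construction
set.seed(6)
x <- rnorm(30)
y <- rnorm(30)

r_before <- cor(x, y)

# Add one extreme observation that is high on both variables
x_out <- c(x, 10)
y_out <- c(y, 10)

r_after_pearson <- cor(x_out, y_out)
r_after_spearman <- cor(x_out, y_out, method = "spearman")

c(
  r_before = r_before,
  r_after_pearson = r_after_pearson,
  r_after_spearman = r_after_spearman)
```

The single added point pulls the Pearson correlation strongly upward. The Spearman correlation changes much less, because the outlier contributes only its rank and not its distance from the rest of the data.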
10.6 Correlation Does Not Imply Causation
As described in Section 8.4.2.1, correlation does not imply causation. There are several reasons (described in Section 8.4.2.1) why the fact that \(X\) is correlated with \(Y\) does not necessarily mean that \(X\) causes \(Y\). However, correlation can still be useful. In order for two processes to be causally related, they must be associated. That is, association is necessary but insufficient for causality.
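One commonly cited reason, which I assume is among those covered in Section 8.4.2.1, is a third variable that influences both \(X\) and \(Y\). The following minimal sketch simulates that situation with hypothetical variables: `z` causes both `x` and `y`, and `x` has no effect on `y`. It then removes the influence of `z` by correlating the residuals from regressing each variable on `z` (a partial correlation).

```r
# Simulated data (hypothetical, for illustration only)
set.seed(4)
n <- 5000
z <- rnorm(n)            # third variable
x <- 0.7 * z + rnorm(n)  # x depends on z only
y <- 0.7 * z + rnorm(n)  # y depends on z only, not on x

# x and y are correlated even though neither causes the other
r_xy <- cor(x, y)

# Partial correlation: correlate what is left of x and y after
# removing the part of each that is predicted by z
r_xy_given_z <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

c(r_xy = r_xy, r_xy_given_z = r_xy_given_z)
```

The raw correlation between `x` and `y` is clearly positive, yet it shrinks to about zero once `z` is accounted for. In real data the relevant third variables are often unmeasured, which is why an observed correlation alone cannot establish causation.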
10.7 Conclusion
Correlation is an index of the association between variables. The correlation coefficient (\(r\)) ranges from −1 to +1, and indicates the sign and magnitude of the association. Although correlation does not imply causation, identifying associations between variables can still be useful because association is a necessary (but insufficient) condition for causality.
10.8 Session Info
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] patchwork_1.3.0 lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1
[5] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
[9] tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0 XICOR_0.4.1
[13] petersenlab_1.1.0
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 psych_2.4.12 viridisLite_0.4.2 farver_2.1.2
[5] fastmap_1.2.0 digest_0.6.37 rpart_4.1.23 timechange_0.3.0
[9] lifecycle_1.0.4 cluster_2.1.6 magrittr_2.0.3 compiler_4.4.2
[13] rlang_1.1.4 Hmisc_5.2-1 tools_4.4.2 yaml_2.3.10
[17] data.table_1.16.4 knitr_1.49 labeling_0.4.3 htmlwidgets_1.6.4
[21] mnormt_2.1.1 plyr_1.8.9 RColorBrewer_1.1-3 foreign_0.8-87
[25] withr_3.0.2 R.oo_1.27.0 nnet_7.3-19 grid_4.4.2
[29] stats4_4.4.2 lavaan_0.6-19 xtable_1.8-4 colorspace_2.1-1
[33] scales_1.3.0 cli_3.6.3 mvtnorm_1.3-2 rmarkdown_2.29
[37] generics_0.1.3 rstudioapi_0.17.1 reshape2_1.4.4 tzdb_0.4.0
[41] DBI_1.2.3 rtf_0.4-14.1 splines_4.4.2 parallel_4.4.2
[45] base64enc_0.1-3 mitools_2.4 vctrs_0.6.5 Matrix_1.7-1
[49] jsonlite_1.8.9 hms_1.1.3 Formula_1.2-5 htmlTable_2.4.3
[53] glue_1.8.0 stringi_1.8.4 gtable_0.3.6 quadprog_1.5-8
[57] munsell_0.5.1 pillar_1.10.0 psychTools_2.4.3 htmltools_0.5.8.1
[61] R6_2.5.1 mix_1.0-13 evaluate_1.0.1 pbivnorm_0.6.0
[65] lattice_0.22-6 R.methodsS3_1.8.2 backports_1.5.0 Rcpp_1.0.13-1
[69] gridExtra_2.3 nlme_3.1-166 checkmate_2.3.2 mgcv_1.9-1
[73] xfun_0.49 pkgconfig_2.0.3