21 Cluster Analysis
This chapter provides an overview of cluster analysis.
21.1 Getting Started
21.1.1 Load Packages
21.1.2 Load Data
We created the player_stats_weekly.RData and player_stats_seasonal.RData objects in Section 4.4.3.
21.1.3 Overview
Whereas factor analysis evaluates how variables do or do not hang together (in terms of their associations and non-associations), cluster analysis evaluates how people are or are not similar (in terms of their scores on one or more variables). The goal of cluster analysis is to identify distinguishable subgroups of people. The people within a subgroup are expected to be more similar to each other than they are to people in other subgroups. For instance, we might expect that there are distinguishable subtypes of Wide Receivers: possession receivers, deep threats, and slot-type Wide Receivers. Possession Wide Receivers tend to be taller and heavier, with good hands, and they catch the ball at a high rate. Deep threat Wide Receivers tend to be fast. Slot-type Wide Receivers tend to be small, quick, and agile. In order to identify these clusters of Wide Receivers, we might conduct a cluster analysis with variables relating to the players’ height, weight, percent of (catchable) targets caught, air yards received, and various metrics from the National Football League (NFL) Combine, including their times in the 40-yard dash, 20-yard shuttle run, and three-cone drill.
There are many approaches to cluster analysis, including model-based clustering, density-based clustering, centroid-based clustering, hierarchical clustering (aka connectivity-based clustering), etc. An overview of approaches to cluster analysis in R is provided by Kassambara (2017). In this chapter, we focus on examples using model-based clustering with the R package mclust (Fraley et al., 2024; Scrucca et al., 2023), which uses Gaussian finite mixture modeling. The various types of mclust models are provided here: https://mclust-org.github.io/mclust/reference/mclustModelNames.html.
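To make "Gaussian finite mixture modeling" concrete, here is a minimal base-R sketch of the expectation-maximization (EM) idea for a two-component univariate mixture. It is an illustration under simplifying assumptions (simulated data, a fixed number of iterations, crude starting values, and made-up object names), not the algorithm as implemented in mclust.

```r
# Minimal EM sketch for a two-component univariate Gaussian mixture.
# The data are simulated; all object names are illustrative only.
set.seed(1)
x <- c(rnorm(100, mean = 50, sd = 5), rnorm(100, mean = 150, sd = 10))

# Crude starting values: split the data at the median
lower <- x[x <= median(x)]
upper <- x[x > median(x)]
prop <- c(0.5, 0.5)                    # mixing proportions
mu <- c(mean(lower), mean(upper))      # component means
sigma <- c(sd(lower), sd(upper))       # component standard deviations

for (iter in 1:100) {
  # E-step: posterior probability that each observation belongs to each component
  dens <- cbind(
    prop[1] * dnorm(x, mean = mu[1], sd = sigma[1]),
    prop[2] * dnorm(x, mean = mu[2], sd = sigma[2])
  )
  z <- dens / rowSums(dens)

  # M-step: update proportions, means, and standard deviations
  nk <- colSums(z)
  prop <- nk / length(x)
  mu <- colSums(z * x) / nk
  sigma <- sqrt(colSums(z * outer(x, mu, "-")^2) / nk)
}

# Assign each observation to its most probable component
classification <- ifelse(z[, 2] > 0.5, 2, 1)

round(rbind(prop, mu, sigma), 2)
table(classification)
```

Each observation receives a probability of belonging to each cluster (the rows of z), and the hard classification is simply the most probable cluster.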
21.1.4 Tiers of Prior Season Fantasy Points
21.1.4.1 Prepare Data
Code
[1] 2024
Code
player_stats_seasonal_offense_recent <- player_stats_seasonal %>%
  filter(season == recentSeason) %>%
  filter(position_group %in% c("QB","RB","WR","TE"))
player_stats_seasonal_offense_recentQB <- player_stats_seasonal_offense_recent %>%
  filter(position_group == "QB")
player_stats_seasonal_offense_recentRB <- player_stats_seasonal_offense_recent %>%
  filter(position_group == "RB")
player_stats_seasonal_offense_recentWR <- player_stats_seasonal_offense_recent %>%
  filter(position_group == "WR")
player_stats_seasonal_offense_recentTE <- player_stats_seasonal_offense_recent %>%
  filter(position_group == "TE")
21.1.4.2 Identify the Optimal Number of Tiers by Position
21.1.4.2.1 Quarterbacks
We can perform a cluster analysis using the mclust::mclustBIC(), mclust::mclustICL(), and mclust::mclustBootstrapLRT() functions of the mclust package (Fraley et al., 2024; Scrucca et al., 2023).
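As a rough illustration of how BIC and ICL comparisons can be produced, the sketch below runs mclust::mclustBIC() and mclust::mclustICL() on simulated point totals. The data, the range of clusters (1 to 5), and the object names are assumptions made for the example, not the settings used for the output in this chapter. In the printed tables, mclust reports both criteria on a scale where larger (less negative) values indicate better fit, which is why the top-ranked models are those with the values closest to zero.

```r
library(mclust)

set.seed(123)
# Simulated season point totals for 60 hypothetical players in three loose groups
sim_points <- c(
  rnorm(30, mean = 60, sd = 15),
  rnorm(20, mean = 180, sd = 20),
  rnorm(10, mean = 320, sd = 25)
)

# Compare equal-variance ("E") and unequal-variance ("V") models with 1 to 5 clusters
sim_bic <- mclust::mclustBIC(data = sim_points, G = 1:5, modelNames = c("E", "V"))
sim_icl <- mclust::mclustICL(data = sim_points, G = 1:5, modelNames = c("E", "V"))

summary(sim_bic) # top three models according to BIC
summary(sim_icl) # top three models according to ICL
```

Rows of each table correspond to the number of clusters and columns to the variance structure, matching the layout of the output shown in this section.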
Code
Bayesian Information Criterion (BIC):
E V
1 -982.9038 -982.9038
2 -964.3518 -927.2534
3 -973.0978 -930.5327
4 -971.2212 -912.2067
5 -971.1002 -924.2192
6 -979.8174 -928.6330
7 -974.9956 -949.0362
8 -981.8676 -955.9257
9 -990.5409 -963.9506
Top 3 models based on the BIC criterion:
V,4 V,5 V,2
-912.2067 -924.2192 -927.2534
Best BIC values:
V,4 V,5 V,2
BIC -912.2067 -924.21918 -927.25337
BIC diff 0.0000 -12.01245 -15.04664
Code
Integrated Complete-data Likelihood (ICL) criterion:
E V
1 -982.9038 -982.9038
2 -972.1069 -933.7840
3 -1039.9236 -945.9954
4 -1040.3715 -927.2426
5 -1033.2208 -945.8315
6 -1061.5988 -935.2325
7 -1056.6193 -993.1199
8 -1065.2675 -976.8222
9 -1088.2374 -986.9286
Top 3 models based on the ICL criterion:
V,4 V,2 V,6
-927.2426 -933.7840 -935.2325
Best ICL values:
V,4 V,2 V,6
ICL -927.2426 -933.78400 -935.232482
ICL diff 0.0000 -6.54137 -7.989849
Code
tiersQB_boostrap <- mclust::mclustBootstrapLRT(
  data = player_stats_seasonal_offense_recentQB$fantasyPoints,
  modelName = "V") # variable/unequal variance (for univariate data)
numTiersQB <- as.numeric(summary(tiersQB_boostrap)[,"Length"][1]) # or could specify the number of tiers manually
tiersQB_boostrap
-------------------------------------------------------------
Bootstrap sequential LRT for the number of mixture components
-------------------------------------------------------------
Model = V
Replications = 999
LRTS bootstrap p-value
1 vs 2 68.720575 0.001
2 vs 3 9.790787 0.037
3 vs 4 31.396105 0.001
4 vs 5 1.057678 0.666
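One common way to read a sequential likelihood ratio test is to keep adding clusters until the comparison is no longer significant. The base-R sketch below applies that rule to the Quarterback p-values printed above, assuming a significance level of .05; the object names are illustrative.

```r
# Bootstrap p-values copied from the Quarterback output above
lrt_pvalues <- c(
  "1 vs 2" = 0.001,
  "2 vs 3" = 0.037,
  "3 vs 4" = 0.001,
  "4 vs 5" = 0.666
)
alpha <- 0.05 # assumed significance level

# The first non-significant comparison ("k vs k + 1") suggests retaining k clusters
num_tiers <- unname(which(lrt_pvalues >= alpha)[1])
num_tiers
```

Here the "4 vs 5" comparison is the first that is not significant, so four tiers are retained, which agrees with the BIC and ICL results for Quarterbacks.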
21.1.4.2.2 Running Backs
Code
Bayesian Information Criterion (BIC):
E V
1 -1854.031 -1854.031
2 -1786.170 -1741.345
3 -1796.289 -1681.204
4 -1786.692 -1683.266
5 -1796.779 -1688.302
6 -1806.869 -1699.238
7 -1788.245 -1713.803
8 -1798.339 -1714.711
9 -1805.916 -1729.569
Top 3 models based on the BIC criterion:
V,3 V,4 V,5
-1681.204 -1683.266 -1688.302
Best BIC values:
V,3 V,4 V,5
BIC -1681.204 -1683.266361 -1688.302292
BIC diff 0.000 -2.062052 -7.097982
Code
Integrated Complete-data Likelihood (ICL) criterion:
E V
1 -1854.031 -1854.031
2 -1791.597 -1765.263
3 -1955.631 -1710.036
4 -1940.803 -1727.646
5 -2032.425 -1729.461
6 -2085.572 -1739.405
7 -2049.102 -1771.322
8 -2088.752 -1774.842
9 -2104.642 -1806.327
Top 3 models based on the ICL criterion:
V,3 V,4 V,5
-1710.036 -1727.646 -1729.461
Best ICL values:
V,3 V,4 V,5
ICL -1710.036 -1727.64580 -1729.46112
ICL diff 0.000 -17.61014 -19.42546
The model-based bootstrap clustering of Running Backs’ fantasy points is unable to run due to an error:
Thus, we cannot use the following code, which would otherwise summarize the model results, specify the number of tiers, and plot model comparisons:
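When the bootstrap test cannot be run, one possible fallback (an assumption for illustration, not necessarily the approach taken in this chapter) is to take the number of tiers from the information criteria instead. The sketch below picks the number of clusters with the best BIC among the unequal-variance models, using the Running Back values printed above.

```r
# Unequal-variance ("V") BIC values for 1 to 9 clusters,
# copied from the Running Back output above
bic_V_RB <- c(
  -1854.031, -1741.345, -1681.204, -1683.266, -1688.302,
  -1699.238, -1713.803, -1714.711, -1729.569
)

# mclust reports BIC so that larger (less negative) values are better
numTiersRB_fallback <- which.max(bic_V_RB) # hypothetical object name
numTiersRB_fallback
```

This selects three tiers, which matches the top-ranked model (V,3) in both the BIC and ICL summaries for Running Backs.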
21.1.4.2.3 Wide Receivers
Code
Bayesian Information Criterion (BIC):
E V
1 -2773.668 -2773.668
2 -2715.276 -2579.573
3 -2726.221 -2567.175
4 -2701.874 -2555.180
5 -2712.788 -2558.928
6 -2690.115 -2569.660
7 -2701.052 -2571.626
8 -2703.583 -2583.527
9 -2714.560 -2598.340
Top 3 models based on the BIC criterion:
V,4 V,5 V,3
-2555.180 -2558.928 -2567.175
Best BIC values:
V,4 V,5 V,3
BIC -2555.18 -2558.928068 -2567.17473
BIC diff 0.00 -3.748463 -11.99512
Code
Integrated Complete-data Likelihood (ICL) criterion:
E V
1 -2773.668 -2773.668
2 -2740.609 -2601.071
3 -2983.660 -2629.491
4 -2918.630 -2648.113
5 -3013.988 -2646.928
6 -3010.614 -2668.189
7 -3065.987 -2647.081
8 -3064.463 -2669.032
9 -3097.662 -2684.364
Top 3 models based on the ICL criterion:
V,2 V,3 V,5
-2601.071 -2629.491 -2646.928
Best ICL values:
V,2 V,3 V,5
ICL -2601.071 -2629.49065 -2646.92847
ICL diff 0.000 -28.41921 -45.85703
Code
tiersWR_boostrap <- mclust::mclustBootstrapLRT(
  data = player_stats_seasonal_offense_recentWR$fantasyPoints,
  modelName = "V") # variable/unequal variance (for univariate data)
numTiersWR <- as.numeric(summary(tiersWR_boostrap)[,"Length"][1]) # or could specify the number of tiers manually
tiersWR_boostrap
-------------------------------------------------------------
Bootstrap sequential LRT for the number of mixture components
-------------------------------------------------------------
Model = V
Replications = 999
LRTS bootstrap p-value
1 vs 2 210.486649 0.001
2 vs 3 28.790091 0.001
3 vs 4 28.386617 0.001
4 vs 5 12.643032 0.014
5 vs 6 5.659307 0.164
21.1.4.2.4 Tight Ends
Code
Bayesian Information Criterion (BIC):
E V
1 -1416.237 -1416.237
2 -1382.407 -1329.715
3 -1392.097 -1304.704
4 -1401.790 -1304.096
5 -1370.177 -1313.794
6 -1379.889 -1321.488
7 -1387.142 -1328.972
8 -1396.793 -1342.684
9 -1406.526 -1354.634
Top 3 models based on the BIC criterion:
V,4 V,3 V,5
-1304.096 -1304.704 -1313.794
Best BIC values:
V,4 V,3 V,5
BIC -1304.096 -1304.7036430 -1313.794382
BIC diff 0.000 -0.6074285 -9.698167
Code
Integrated Complete-data Likelihood (ICL) criterion:
E V
1 -1416.237 -1416.237
2 -1392.973 -1349.457
3 -1524.660 -1330.942
4 -1587.201 -1340.591
5 -1569.061 -1357.545
6 -1611.215 -1358.914
7 -1616.252 -1359.035
8 -1640.868 -1389.717
9 -1687.213 -1401.670
Top 3 models based on the ICL criterion:
V,3 V,4 V,2
-1330.942 -1340.591 -1349.457
Best ICL values:
V,3 V,4 V,2
ICL -1330.942 -1340.590921 -1349.45661
ICL diff 0.000 -9.648454 -18.51414
Code
tiersTE_boostrap <- mclust::mclustBootstrapLRT(
  data = player_stats_seasonal_offense_recentTE$fantasyPoints,
  modelName = "V") # variable/unequal variance (for univariate data)
numTiersTE <- as.numeric(summary(tiersTE_boostrap)[,"Length"][1]) # or could specify the number of tiers manually
tiersTE_boostrap
-------------------------------------------------------------
Bootstrap sequential LRT for the number of mixture components
-------------------------------------------------------------
Model = V
Replications = 999
LRTS bootstrap p-value
1 vs 2 101.054621 0.001
2 vs 3 39.543494 0.001
3 vs 4 15.139990 0.006
4 vs 5 4.834394 0.236
21.1.4.3 Fit the Cluster Model to the Optimal Number of Tiers
21.1.4.3.1 Quarterbacks
We can fit the cluster model to the optimal number of tiers using the mclust::Mclust() function. In our data, all of the following calls are equivalent: they arrive at the same unequal-variance model with a 4-cluster solution, but in different ways.
Code
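# Use the number of tiers identified by the bootstrap likelihood ratio test
mclust::Mclust(
  data = player_stats_seasonal_offense_recentQB$fantasyPoints,
  G = numTiersQB
)
# Specify the number of tiers manually
mclust::Mclust(
  data = player_stats_seasonal_offense_recentQB$fantasyPoints,
  G = 4
)
# Let Mclust() select the model and the number of tiers
mclust::Mclust(
  data = player_stats_seasonal_offense_recentQB$fantasyPoints
)
# Pass in the previously computed BIC results
mclust::Mclust(
  data = player_stats_seasonal_offense_recentQB$fantasyPoints,
  x = tiersQB_bic
)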
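For readers who want to try the fitting step without the chapter's data, here is a small sketch that fits a two-cluster, unequal-variance model to simulated points and tabulates cluster membership. The simulated data and object names are assumptions for illustration. The mclust package is attached with library() because Mclust() relies on other mclust functions being findable when it runs.

```r
library(mclust)

set.seed(42)
# Simulated point totals for 40 hypothetical players in two groups
sim_points <- c(
  rnorm(25, mean = 50, sd = 10),
  rnorm(15, mean = 200, sd = 20)
)

# Fit a two-cluster model with unequal variances ("V")
sim_fit <- mclust::Mclust(
  data = sim_points,
  G = 2,
  modelNames = "V",
  verbose = FALSE
)

summary(sim_fit)
table(sim_fit$classification) # number of players in each cluster
```

The classification element holds each player's most probable cluster, which is what the plotting code later in this section merges back into the data.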
Let’s fit one of these:
Here is the number of players in each of the four clusters (i.e., tiers):
21.1.4.3.2 Running Backs
Here is the number of players in each cluster (i.e., tier):
21.1.4.3.3 Wide Receivers
Here is the number of players in each cluster (i.e., tier):
21.1.4.3.4 Tight Ends
Here is the number of players in each of the four clusters (i.e., tiers):
21.1.4.4 Plot the Tiers
We can merge the player’s classification into the dataset and plot each player’s classification.
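The two transformations used in the plotting code can be seen on a toy example. The sketch below uses made-up points and cluster labels, and it assumes (as the reversal in the code implies) that higher cluster labels correspond to more fantasy points. It flips the labels so that Tier 1 is the top tier and computes a position rank in which tied players share the better rank.

```r
# Made-up values for five hypothetical players
fantasy_points <- c(310, 250, 250, 120, 40)
classification <- c(3, 3, 3, 2, 1) # assumed: higher label = more points

# Flip the labels so that Tier 1 is the highest-scoring tier
tier <- factor(max(classification, na.rm = TRUE) + 1 - classification)

# Rank from most to fewest points; ties receive the minimum (best) rank
position_rank <- rank(
  -fantasy_points,
  na.last = "keep",
  ties.method = "min")

data.frame(fantasy_points, classification, tier, position_rank)
```

The two players tied at 250 points both receive rank 2, and the next player receives rank 4.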
21.1.4.4.1 Quarterbacks
Code
player_stats_seasonal_offense_recentQB$tier <- clusterModelQBs$classification
# Flip the cluster numbering (assumes cluster labels increase with fantasy points)
player_stats_seasonal_offense_recentQB <- player_stats_seasonal_offense_recentQB %>%
  mutate(
    tier = factor(max(tier, na.rm = TRUE) + 1 - tier)
  )
player_stats_seasonal_offense_recentQB$position_rank <- rank(
  player_stats_seasonal_offense_recentQB$fantasyPoints * -1,
  na.last = "keep",
  ties.method = "min")
plot_qbTiers <- ggplot2::ggplot(
  data = player_stats_seasonal_offense_recentQB,
  mapping = aes(
    x = fantasyPoints,
    y = position_rank,
    color = tier
  )) +
  geom_point(
    aes(
      text = player_display_name # add player name for mouse over tooltip
    )) +
  scale_y_continuous(trans = "reverse") +
  coord_cartesian(clip = "off") +
  labs(
    x = "Fantasy Points",
    y = "Position Rank",
    title = "Quarterback Fantasy Points by Tier",
    color = "Tier") +
  theme_classic() +
  theme(legend.position = "top")
plotly::ggplotly(plot_qbTiers)
21.1.4.4.2 Running Backs
Code
player_stats_seasonal_offense_recentRB$tier <- clusterModelRBs$classification
# Flip the cluster numbering (assumes cluster labels increase with fantasy points)
player_stats_seasonal_offense_recentRB <- player_stats_seasonal_offense_recentRB %>%
  mutate(
    tier = factor(max(tier, na.rm = TRUE) + 1 - tier)
  )
player_stats_seasonal_offense_recentRB$position_rank <- rank(
  player_stats_seasonal_offense_recentRB$fantasyPoints * -1,
  na.last = "keep",
  ties.method = "min")
plot_rbTiers <- ggplot2::ggplot(
  data = player_stats_seasonal_offense_recentRB,
  mapping = aes(
    x = fantasyPoints,
    y = position_rank,
    color = tier
  )) +
  geom_point(
    aes(
      text = player_display_name # add player name for mouse over tooltip
    )) +
  scale_y_continuous(trans = "reverse") +
  coord_cartesian(clip = "off") +
  labs(
    x = "Fantasy Points",
    y = "Position Rank",
    title = "Running Back Fantasy Points by Tier",
    color = "Tier") +
  theme_classic() +
  theme(legend.position = "top")
plotly::ggplotly(plot_rbTiers)
21.1.4.4.3 Wide Receivers
Code
player_stats_seasonal_offense_recentWR$tier <- clusterModelWRs$classification
# Flip the cluster numbering (assumes cluster labels increase with fantasy points)
player_stats_seasonal_offense_recentWR <- player_stats_seasonal_offense_recentWR %>%
  mutate(
    tier = factor(max(tier, na.rm = TRUE) + 1 - tier)
  )
player_stats_seasonal_offense_recentWR$position_rank <- rank(
  player_stats_seasonal_offense_recentWR$fantasyPoints * -1,
  na.last = "keep",
  ties.method = "min")
plot_wrTiers <- ggplot2::ggplot(
  data = player_stats_seasonal_offense_recentWR,
  mapping = aes(
    x = fantasyPoints,
    y = position_rank,
    color = tier
  )) +
  geom_point(
    aes(
      text = player_display_name # add player name for mouse over tooltip
    )) +
  scale_y_continuous(trans = "reverse") +
  coord_cartesian(clip = "off") +
  labs(
    x = "Fantasy Points",
    y = "Position Rank",
    title = "Wide Receiver Fantasy Points by Tier",
    color = "Tier") +
  theme_classic() +
  theme(legend.position = "top")
plotly::ggplotly(plot_wrTiers)
21.1.4.4.4 Tight Ends
Code
player_stats_seasonal_offense_recentTE$tier <- clusterModelTEs$classification
# Flip the cluster numbering (assumes cluster labels increase with fantasy points)
player_stats_seasonal_offense_recentTE <- player_stats_seasonal_offense_recentTE %>%
  mutate(
    tier = factor(max(tier, na.rm = TRUE) + 1 - tier)
  )
player_stats_seasonal_offense_recentTE$position_rank <- rank(
  player_stats_seasonal_offense_recentTE$fantasyPoints * -1,
  na.last = "keep",
  ties.method = "min")
plot_teTiers <- ggplot2::ggplot(
  data = player_stats_seasonal_offense_recentTE,
  mapping = aes(
    x = fantasyPoints,
    y = position_rank,
    color = tier
  )) +
  geom_point(
    aes(
      text = player_display_name # add player name for mouse over tooltip
    )) +
  scale_y_continuous(trans = "reverse") +
  coord_cartesian(clip = "off") +
  labs(
    x = "Fantasy Points",
    y = "Position Rank",
    title = "Tight End Fantasy Points by Tier",
    color = "Tier") +
  theme_classic() +
  theme(legend.position = "top")
plotly::ggplotly(plot_teTiers)
21.1.5 Types of Wide Receivers
Code
# Compute Advanced PFR Stats by Career
pfrVars <- nfl_advancedStatsPFR_seasonal %>%
  select(pocket_time.pass:cmp_percent.def, g, gs) %>%
  names()
weightedAverageVars <- c(
  "pocket_time.pass",
  "ybc_att.rush","yac_att.rush",
  "ybc_r.rec","yac_r.rec","adot.rec","rat.rec",
  "yds_cmp.def","yds_tgt.def","dadot.def","m_tkl_percent.def","rat.def"
)
recomputeVars <- c(
  "drop_pct.pass", # drops.pass / pass_attempts.pass
  "bad_throw_pct.pass", # bad_throws.pass / pass_attempts.pass
  "on_tgt_pct.pass", # on_tgt_throws.pass / pass_attempts.pass
  "pressure_pct.pass", # times_pressured.pass / pass_attempts.pass
  "drop_percent.rec", # drop.rec / tgt.rec
  "rec_br.rec", # rec.rec / brk_tkl.rec
  "cmp_percent.def" # cmp.def / tgt.def
)
sumVars <- pfrVars[pfrVars %ni% c(
  weightedAverageVars, recomputeVars,
  "merge_name", "loaded.pass", "loaded.rush", "loaded.rec", "loaded.def")]
nfl_advancedStatsPFR_career <- nfl_advancedStatsPFR_seasonal %>%
  group_by(pfr_id, merge_name) %>%
  summarise(
    across(all_of(weightedAverageVars), ~ weighted.mean(.x, w = g, na.rm = TRUE)),
    across(all_of(sumVars), ~ sum(.x, na.rm = TRUE)),
    .groups = "drop") %>%
  mutate(
    # recompute rate variables from the career totals
    drop_pct.pass = drops.pass / pass_attempts.pass,
    bad_throw_pct.pass = bad_throws.pass / pass_attempts.pass,
    on_tgt_pct.pass = on_tgt_throws.pass / pass_attempts.pass,
    pressure_pct.pass = times_pressured.pass / pass_attempts.pass,
    drop_percent.rec = drop.rec / tgt.rec,
    rec_br.rec = rec.rec / brk_tkl.rec,
    cmp_percent.def = cmp.def / tgt.def
  )
uniqueCases <- nfl_advancedStatsPFR_seasonal %>%
  select(pfr_id, merge_name, gsis_id) %>%
  unique()
# Check for any pfr_id that appears in more than one row
uniqueCases %>%
  group_by(pfr_id) %>%
  filter(n() > 1)
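The code above recomputes rate variables (e.g., drop percentage) from career totals rather than averaging the seasonal rates. A small base-R example with made-up numbers shows why the two can differ: averaging seasonal rates gives a low-volume season the same weight as a high-volume season.

```r
# Made-up seasonal totals for one hypothetical receiver
seasons <- data.frame(
  season = c(2022, 2023),
  drops = c(2, 3),
  targets = c(100, 20)
)

# Averaging the seasonal rates weights both seasons equally
mean_of_rates <- mean(seasons$drops / seasons$targets)

# Recomputing from career totals weights each target equally
pooled_rate <- sum(seasons$drops) / sum(seasons$targets)

round(c(mean_of_rates = mean_of_rates, pooled_rate = pooled_rate), 3)
```

In this example, the mean of the seasonal rates is 0.085, whereas the career rate computed from totals is about 0.042.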
Code
nfl_advancedStatsPFR_seasonal <- nfl_advancedStatsPFR_seasonal %>%
  filter(pfr_id != "WillMa06" | merge_name != "MARCUSWILLIAMS" | !is.na(gsis_id))
nfl_advancedStatsPFR_career <- left_join(
  nfl_advancedStatsPFR_career,
  nfl_advancedStatsPFR_seasonal %>% select(pfr_id, merge_name, gsis_id) %>% unique(),
  by = c("pfr_id", "merge_name")
)
# Compute Player Stats Per Season
player_stats_seasonal_careerWRs <- player_stats_seasonal %>%
  filter(position == "WR") %>%
  group_by(player_id) %>%
  summarise(
    across(all_of(c("targets", "receptions", "receiving_air_yards")), ~ weighted.mean(.x, w = games, na.rm = TRUE)),
    .groups = "drop")
# Drop players with no receiving air yards
player_stats_seasonal_careerWRs <- player_stats_seasonal_careerWRs %>%
  filter(receiving_air_yards != 0) %>%
  rename(
    targets_per_season = targets,
    receptions_per_season = receptions,
    receiving_air_yards_per_season = receiving_air_yards
  )
# Merge
playerListToMerge <- list(
  nfl_players %>% select(gsis_id, display_name, position, height, weight),
  nfl_combine %>% select(gsis_id, vertical, forty, ht, wt),
  player_stats_seasonal_careerWRs %>% select(player_id, targets_per_season, receptions_per_season, receiving_air_yards_per_season) %>%
    rename(gsis_id = player_id),
  nfl_actualStats_career_player_inclPost %>% select(player_id, receptions, targets, receiving_air_yards, air_yards_share, target_share) %>%
    rename(gsis_id = player_id),
  nfl_advancedStatsPFR_career %>% select(gsis_id, adot.rec, rec.rec, brk_tkl.rec, drop.rec, drop_percent.rec)
)
merged_data <- playerListToMerge %>%
  reduce(
    full_join,
    by = c("gsis_id"),
    na_matches = "never")
Additional processing:
Code
merged_data <- merged_data %>%
  mutate(
    height_coalesced = coalesce(height, ht),
    weight_coalesced = coalesce(weight, wt),
    receptions_coalesced = pmax(receptions, rec.rec, na.rm = TRUE),
    receiving_air_yards_per_rec = receiving_air_yards / receptions
  )
merged_data$receiving_air_yards_per_rec[which(merged_data$receptions == 0)] <- 0
merged_dataWRs <- merged_data %>%
  filter(position == "WR")
merged_dataWRs_cluster <- merged_dataWRs %>%
  filter(receptions_coalesced >= 100) %>% # keep WRs with at least 100 receptions
  select(
    gsis_id, display_name, vertical, forty, height_coalesced, weight_coalesced,
    adot.rec, drop_percent.rec, receiving_air_yards_per_rec, brk_tkl.rec,
    receptions_per_season) %>% #targets_per_season, receiving_air_yards_per_season, air_yards_share, target_share
  na.omit()
21.1.5.1 Identify the Number of WR Types
Code
Bayesian Information Criterion (BIC):
EII VII EEI VEI EVI VVI EEE
1 -8603.643 -8603.643 -5248.673 -5248.673 -5248.673 -5248.673 -5013.525
2 -8180.746 -8145.945 -5195.004 -5179.054 -5074.935 -5188.752 -5043.922
3 -8018.775 -7958.193 -5123.311 -5123.364 -5052.693 -5045.947 -5027.830
4 -7886.718 -7809.731 -5122.158 -5084.377 -5032.702 -5041.932 -5008.398
5 -7793.774 -7768.060 -5134.493 -5098.375 -5071.128 -5069.619 -4983.409
6 -7804.335 -7710.476 -5143.886 -5064.669 -5097.344 -5086.589 -5026.058
7 -7829.608 -7750.220 -5102.629 -5089.686 -5130.281 -5148.934 -5064.913
8 -7811.437 -7690.886 -5147.698 -5107.179 NA -5152.835 -5103.672
9 -7821.009 NA -5176.384 -5135.631 -5216.412 -5210.954 -5130.004
VEE EVE VVE EEV VEV EVV VVV
1 -5013.525 -5013.525 -5013.525 -5013.525 -5013.525 -5013.525 -5013.525
2 -4931.580 -4783.578 -4798.503 -4900.518 -4910.007 -4829.636 -4917.365
3 -4918.407 -4732.160 -4741.304 -5008.543 -4979.580 -4951.217 -4935.149
4 NA NA NA -5066.316 -5069.238 NA NA
5 NA NA NA -5190.303 -5154.202 NA NA
6 NA NA NA -5302.184 -5337.304 NA NA
7 NA NA NA -5410.215 -5488.902 NA NA
8 NA NA NA -5614.172 -5598.786 NA NA
9 NA NA NA -5810.771 -5718.486 NA NA
Top 3 models based on the BIC criterion:
EVE,3 VVE,3 EVE,2
-4732.160 -4741.304 -4783.578
Best BIC values:
EVE,3 VVE,3 EVE,2
BIC -4732.16 -4741.303592 -4783.57810
BIC diff 0.00 -9.143912 -51.41842
Code
Integrated Complete-data Likelihood (ICL) criterion:
EII VII EEI VEI EVI VVI EEE
1 -8603.643 -8603.643 -5248.673 -5248.673 -5248.673 -5248.673 -5013.525
2 -8187.741 -8151.209 -5212.753 -5193.445 -5087.831 -5209.229 -5059.061
3 -8024.506 -7962.807 -5140.832 -5137.842 -5077.579 -5065.933 -5044.859
4 -7899.800 -7817.683 -5139.308 -5100.665 -5050.906 -5062.092 -5025.581
5 -7804.126 -7772.814 -5163.669 -5117.892 -5090.647 -5088.063 -4996.299
6 -7817.466 -7715.284 -5172.196 -5080.730 -5118.995 -5102.059 -5045.255
7 -7843.580 -7759.784 -5127.665 -5105.712 -5151.788 -5160.820 -5087.764
8 -7827.441 -7700.565 -5177.110 -5120.976 NA -5164.537 -5129.469
9 -7837.682 NA -5199.922 -5149.695 -5235.548 -5222.012 -5156.778
VEE EVE VVE EEV VEV EVV VVV
1 -5013.525 -5013.525 -5013.525 -5013.525 -5013.525 -5013.525 -5013.525
2 -4935.604 -4789.231 -4801.418 -4900.886 -4912.682 -4830.592 -4919.755
3 -4920.143 -4751.404 -4755.780 -5011.444 -4983.537 -4956.237 -4937.699
4 NA NA NA -5073.906 -5075.676 NA NA
5 NA NA NA -5194.380 -5156.025 NA NA
6 NA NA NA -5304.374 -5340.039 NA NA
7 NA NA NA -5410.594 -5490.630 NA NA
8 NA NA NA -5614.753 -5599.162 NA NA
9 NA NA NA -5811.260 -5718.988 NA NA
Top 3 models based on the ICL criterion:
EVE,3 VVE,3 EVE,2
-4751.404 -4755.780 -4789.231
Best ICL values:
EVE,3 VVE,3 EVE,2
ICL -4751.404 -4755.780236 -4789.23090
ICL diff 0.000 -4.376279 -37.82694
Based on both the BIC and the ICL, a three-cluster solution (the EVE model) provides the best fit to the data.
21.1.5.2 Fit the Cluster Model to the Optimal Number of WR Types
Code
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust EVE (ellipsoidal, equal volume and orientation) model with 3 components:
log-likelihood n df BIC ICL
-2147.738 128 90 -4732.16 -4751.404
Clustering table:
1 2 3
39 17 72
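For multivariate data, the three letters of an mclust model name describe the volume, shape, and orientation of the cluster covariance matrices (E = equal across clusters, V = variable across clusters, I = identity), so "EVE" denotes equal volume, variable shape, and equal orientation. As a rough sketch of a multivariate fit, the example below uses R's built-in iris measurements as a stand-in for the Wide Receiver variables; the data and object names are assumptions for illustration.

```r
library(mclust)

# Built-in iris measurements stand in for the Wide Receiver variables
X <- iris[, 1:4]

# Fit three clusters and let Mclust() choose the covariance structure
fit_iris <- mclust::Mclust(data = X, G = 3, verbose = FALSE)

fit_iris$modelName              # selected covariance structure
table(fit_iris$classification)  # number of observations in each cluster
head(round(fit_iris$z, 3))      # each row: probabilities of cluster membership
```

Because the model is probabilistic, each observation has a membership probability for every cluster (the z matrix), and the clustering table counts only the most probable assignment.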
21.1.5.3 Plots of the Cluster Model
21.1.5.4 Interpreting the Clusters
1 2 3
39 17 72
Code
[,1] [,2] [,3]
type 1.00 2.00 3.00
vertical 36.44 36.82 35.81
forty 4.47 4.45 4.47
height_coalesced 72.74 73.06 72.42
weight_coalesced 205.67 205.53 197.38
adot.rec 10.22 12.52 10.44
drop_percent.rec 0.04 0.06 0.05
receiving_air_yards_per_rec 16.13 23.28 17.36
brk_tkl.rec 25.10 0.53 7.15
receptions_per_season 74.90 39.92 42.09
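A table of cluster means like the one above can be assembled in several ways. The base-R sketch below shows one possibility using aggregate() on a tiny made-up data frame; the values and the object names are hypothetical and are not the chapter's data or code.

```r
# Made-up data for six hypothetical receivers with an assigned cluster ("type")
wr_example <- data.frame(
  type = c(1, 1, 2, 2, 3, 3),
  adot.rec = c(9, 11, 12, 14, 10, 11),
  receptions_per_season = c(80, 70, 35, 45, 40, 44)
)

# Mean of every variable within each cluster
cluster_means <- aggregate(. ~ type, data = wr_example, FUN = mean)

# Transpose so that variables are rows and clusters are columns
round(t(cluster_means), 2)
```

Comparing the columns row by row is what drives the interpretation of each cluster.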
Based on this analysis (and the variables included), there appear to be three types of Wide Receivers. We examined the following variables: the player’s vertical jump in the NFL Combine, 40-yard-dash time in the NFL Combine, height, weight, average depth of target, drop percentage, receiving air yards per reception, broken tackles, and receptions per season.
Type 1 Wide Receivers included the elite WR1s who are strong possession receivers (note: not all players in a given cluster map perfectly onto the typology; i.e., not all Type 1 Wide Receivers are elite WR1s). They tended to have the lowest drop percentage, the shortest average depth of target, and the fewest receiving air yards per reception. They tended to have the most receptions per season and to break the most tackles.
Type 2 Wide Receivers included the deep threats. They had the greatest average depth of target and the most receiving air yards per reception. However, they also had the fewest receptions per season, the highest drop percentage, and the fewest broken tackles. Thus, they may be considered the boom-or-bust Wide Receivers.
Type 3 Wide Receivers included the consistent contributor, WR2 types. They had fewer receptions per season and fewer broken tackles than Type 1 Wide Receivers. Their average depth of target was slightly longer than Type 1, and they had more receiving air yards per reception than Type 1.
The types were not particularly distinguishable based on their height, weight, vertical jump, or forty-yard dash time.
Type 1 (“Elite/WR1”) WRs:
Type 2 (“Deep Threat/Boom-or-Bust”) WRs:
Type 3 (“Consistent Contributor/WR2”) WRs:
21.2 Conclusion
The goal of cluster analysis is to identify distinguishable subgroups of people. There are many approaches to cluster analysis, including model-based clustering, density-based clustering, centroid-based clustering, hierarchical clustering (aka connectivity-based clustering), and others. The present chapter used model-based clustering to identify tiers of players based on their fantasy points from the prior season. Using various performance metrics of Wide Receivers, we identified three subtypes of Wide Receivers: elite WR1s who are strong possession receivers, consistent contributors/WR2s, and deep threats/boom-or-bust receivers. The “Elite WR1s” tended to have the lowest drop percentage, the shortest average depth of target, the fewest receiving air yards per reception, the most receptions per season, and the most broken tackles. The “Consistent Contributor/WR2s” had fewer receptions and fewer broken tackles than the Elite WR1s; their average depth of target was longer than Elite WR1s, and they had more receiving air yards per reception than Elite WR1s. The “Deep Threat/Boom-or-Bust” receivers had the greatest average depth of target and the most receiving air yards per reception; however, they also had the fewest receptions, the highest drop percentage, and the fewest broken tackles. In sum, cluster analysis can be a useful way of identifying subgroups of individuals who are more similar to one another on various characteristics.
21.3 Session Info
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[5] purrr_1.1.0 readr_2.1.5 tidyr_1.3.1 tibble_3.3.0
[9] tidyverse_2.0.0 plotly_4.11.0 ggplot2_3.5.2 mclust_6.1.1
[13] nflreadr_1.5.0 petersenlab_1.2.0
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 psych_2.5.6 viridisLite_0.4.2 farver_2.1.2
[5] fastmap_1.2.0 lazyeval_0.2.2 digest_0.6.37 rpart_4.1.24
[9] timechange_0.3.0 lifecycle_1.0.4 cluster_2.1.8.1 magrittr_2.0.3
[13] compiler_4.5.1 rlang_1.1.6 Hmisc_5.2-3 tools_4.5.1
[17] yaml_2.3.10 data.table_1.17.8 knitr_1.50 labeling_0.4.3
[21] htmlwidgets_1.6.4 mnormt_2.1.1 plyr_1.8.9 RColorBrewer_1.1-3
[25] foreign_0.8-90 withr_3.0.2 nnet_7.3-20 grid_4.5.1
[29] stats4_4.5.1 lavaan_0.6-19 xtable_1.8-4 colorspace_2.1-1
[33] scales_1.4.0 MASS_7.3-65 cli_3.6.5 mvtnorm_1.3-3
[37] rmarkdown_2.29 reformulas_0.4.1 generics_0.1.4 rstudioapi_0.17.1
[41] tzdb_0.5.0 httr_1.4.7 reshape2_1.4.4 minqa_1.2.8
[45] DBI_1.2.3 cachem_1.1.0 splines_4.5.1 parallel_4.5.1
[49] base64enc_0.1-3 mitools_2.4 vctrs_0.6.5 boot_1.3-31
[53] Matrix_1.7-3 jsonlite_2.0.0 hms_1.1.3 Formula_1.2-5
[57] htmlTable_2.4.3 crosstalk_1.2.2 glue_1.8.0 nloptr_2.2.1
[61] stringi_1.8.7 gtable_0.3.6 quadprog_1.5-8 lme4_1.1-37
[65] pillar_1.11.0 htmltools_0.5.8.1 R6_2.6.1 Rdpack_2.6.4
[69] mix_1.0-13 evaluate_1.0.5 pbivnorm_0.6.0 lattice_0.22-7
[73] rbibutils_2.3 backports_1.5.0 memoise_2.0.1 Rcpp_1.1.0
[77] gridExtra_2.3 nlme_3.1-168 checkmate_2.3.3 xfun_0.53
[81] pkgconfig_2.0.3