21 Cluster Analysis

21.1 Getting Started

21.1.1 Load Packages

Code

library("petersenlab")
library("nflreadr")
library("mclust")
library("plotly")
library("tidyverse")

21.1.2 Load Data

Code

load(file = "./data/nfl_players.RData")
load(file = "./data/nfl_combine.RData")
load(file = "./data/player_stats_weekly.RData")
load(file = "./data/player_stats_seasonal.RData")
load(file = "./data/nfl_advancedStatsPFR_seasonal.RData")
load(file = "./data/nfl_actualStats_career.RData")

21.1.3 Overview

Whereas factor analysis evaluates how variables do or do not hang together (in terms of their associations and non-associations), cluster analysis evaluates how people are or are not similar (in terms of their scores on one or more variables). The goal of cluster analysis is to identify distinguishable subgroups of people. The people within a subgroup are expected to be more similar to each other than they are to people in other subgroups. For instance, we might expect that there are distinguishable subtypes of Wide Receivers: possession receivers, deep threats, and slot-type receivers. Possession Wide Receivers tend to be taller and heavier, with good hands, and they catch the ball at a high rate. Deep threat Wide Receivers tend to be fast. Slot-type Wide Receivers tend to be small, quick, and agile. In order to identify these clusters of Wide Receivers, we might conduct a cluster analysis with variables relating to the players’ height, weight, percent of (catchable) targets caught, air yards received, and various metrics from the National Football League (NFL) Combine, including their times in the 40-yard dash, 20-yard shuttle run, and three-cone drill.
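To make the idea of similarity concrete, here is a minimal sketch in base R. The four receivers and their measurements are made up for illustration (they are not data from this chapter). The sketch standardizes each variable and computes pairwise Euclidean distances, where a smaller distance indicates a more similar pair of players.

```r
# Hypothetical measurements for four made-up receivers (not real data)
wr <- data.frame(
  height = c(76, 75, 69, 70),         # inches
  weight = c(225, 220, 185, 190),     # pounds
  forty  = c(4.60, 4.58, 4.38, 4.40)  # 40-yard dash time in seconds
)
rownames(wr) <- c("bigA", "bigB", "smallA", "smallB")

# Standardize so that no variable dominates because of its units
wr_z <- scale(wr)

# Pairwise Euclidean distances between players
d <- as.matrix(dist(wr_z))
round(d, 2)
```

In this toy example, the two larger, slower receivers are close to each other and far from the two smaller, faster receivers, which is the kind of structure a cluster analysis tries to recover.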

There are many approaches to cluster analysis, including model-based clustering, density-based clustering, centroid-based clustering, hierarchical clustering (aka connectivity-based clustering), etc. An overview of approaches to cluster analysis in R is provided by Kassambara (2017). In this chapter, we focus on examples using model-based clustering with the R package mclust (Fraley et al., 2024; Scrucca et al., 2023), which uses Gaussian finite mixture modeling. The various types of mclust models are provided here: https://mclust-org.github.io/mclust/reference/mclustModelNames.html.
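As a rough illustration of what a Gaussian finite mixture model assumes, the sketch below computes, by hand, the mixture density and each player's probability of belonging to each cluster for a two-component univariate mixture. The mixing proportions, means, standard deviations, and point totals are all made-up assumptions; in practice, mclust estimates these parameters from the data.

```r
# Hypothetical two-component mixture of fantasy-point totals (made-up parameters)
prop  <- c(0.7, 0.3)   # mixing proportions (sum to 1)
mu    <- c(80, 250)    # component means
sigma <- c(40, 50)     # component standard deviations

x <- c(60, 150, 300)   # fantasy-point totals for three hypothetical players

# Weighted component densities: one row per player, one column per component
weighted <- sapply(
  seq_along(prop),
  function(k) prop[k] * dnorm(x, mean = mu[k], sd = sigma[k])
)

# Mixture density and posterior membership probabilities
mixture_density <- rowSums(weighted)
posterior <- weighted / mixture_density
round(posterior, 3)
```

Each row of `posterior` sums to 1. A player is typically assigned to the component with the largest posterior probability.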

21.1.4 Tiers of Prior Season Fantasy Points

21.1.4.1 Prepare Data

Code

recentSeason <- max(player_stats_seasonal$season, na.rm = TRUE) # also works: nflreadr::most_recent_season()
recentSeason

[1] 2024

Code

player_stats_seasonal_offense_recent <- player_stats_seasonal %>% 
  filter(season == recentSeason) %>% 
  filter(position_group %in% c("QB","RB","WR","TE"))

player_stats_seasonal_offense_recentQB <- player_stats_seasonal_offense_recent %>% 
  filter(position_group == "QB")

player_stats_seasonal_offense_recentRB <- player_stats_seasonal_offense_recent %>% 
  filter(position_group == "RB")

player_stats_seasonal_offense_recentWR <- player_stats_seasonal_offense_recent %>% 
  filter(position_group == "WR")

player_stats_seasonal_offense_recentTE <- player_stats_seasonal_offense_recent %>% 
  filter(position_group == "TE")

21.1.4.2 Identify the Optimal Number of Tiers by Position

21.1.4.2.1 Quarterbacks

Code

tiersQB_bic <- mclust::mclustBIC(
  data = player_stats_seasonal_offense_recentQB$fantasyPoints,
  G = 1:9
)

tiersQB_bic

Bayesian Information Criterion (BIC): 
          E         V
1 -982.9038 -982.9038
2 -964.3518 -927.2534
3 -973.0978 -930.5327
4 -971.2212 -912.2067
5 -971.1002 -924.2192
6 -979.8174 -928.6330
7 -974.9956 -949.0362
8 -981.8676 -955.9257
9 -990.5409 -963.9506

Top 3 models based on the BIC criterion: 
      V,4       V,5       V,2 
-912.2067 -924.2192 -927.2534

Code

summary(tiersQB_bic)

Best BIC values:
               V,4        V,5        V,2
BIC      -912.2067 -924.21918 -927.25337
BIC diff    0.0000  -12.01245  -15.04664

Code

plot(tiersQB_bic)
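Note that mclust reports the BIC as twice the log-likelihood minus a penalty for the number of estimated parameters, so larger (less negative) values indicate better fit. The following sketch reproduces that calculation by hand for a one-component model, using simulated data (an assumption for illustration, not the Quarterback data).

```r
# Simulated fantasy-point totals (made-up data for illustration)
set.seed(1)
x <- rnorm(50, mean = 120, sd = 60)
n <- length(x)

# Maximum-likelihood estimates for a single Gaussian component
mu_hat    <- mean(x)
sigma_hat <- sqrt(mean((x - mu_hat)^2))  # divides by n, not n - 1

# Log-likelihood and BIC in the "larger is better" convention
loglik <- sum(dnorm(x, mean = mu_hat, sd = sigma_hat, log = TRUE))
k      <- 2                              # estimated parameters: mean and variance
bic    <- 2 * loglik - k * log(n)
bic
```

With one component, the equal-variance (E) and variable-variance (V) models are the same model, which is why the first row of the BIC table shows identical values in both columns.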

Code

tiersQB_icl <- mclust::mclustICL(
  data = player_stats_seasonal_offense_recentQB$fantasyPoints,
  G = 1:9
)

tiersQB_icl

Integrated Complete-data Likelihood (ICL) criterion: 
           E         V
1  -982.9038 -982.9038
2  -972.1069 -933.7840
3 -1039.9236 -945.9954
4 -1040.3715 -927.2426
5 -1033.2208 -945.8315
6 -1061.5988 -935.2325
7 -1056.6193 -993.1199
8 -1065.2675 -976.8222
9 -1088.2374 -986.9286

Top 3 models based on the ICL criterion: 
      V,4       V,2       V,6 
-927.2426 -933.7840 -935.2325

Code

summary(tiersQB_icl)

Best ICL values:
               V,4        V,2         V,6
ICL      -927.2426 -933.78400 -935.232482
ICL diff    0.0000   -6.54137   -7.989849

Code

plot(tiersQB_icl)
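The ICL can be thought of as the BIC with an extra penalty for clusters that overlap. A minimal sketch of the idea follows, assuming the definition used by mclust (BIC plus twice the sum of the log of each observation's largest posterior membership probability). The membership probabilities and the BIC value are hypothetical.

```r
# Hypothetical posterior membership probabilities: 4 players x 2 clusters
z <- rbind(
  c(0.99, 0.01),
  c(0.90, 0.10),
  c(0.55, 0.45),  # an ambiguous player
  c(0.02, 0.98)
)
bic <- -100  # hypothetical BIC for the same model

# Log of each player's largest membership probability (always <= 0)
log_max_z <- log(apply(z, 1, max))

# ICL: BIC plus twice the sum of those logs
icl <- bic + 2 * sum(log_max_z)
icl
```

Because the added term is never positive, the ICL cannot exceed the BIC. The two are equal when every player is assigned with certainty, as in the one-cluster rows of the output above.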

Code

tiersQB_boostrap <- mclust::mclustBootstrapLRT(
  data = player_stats_seasonal_offense_recentQB$fantasyPoints,
  modelName = "V") # variable/unequal variance (for univariate data)

numTiersQB <- as.numeric(summary(tiersQB_boostrap)[,"Length"][1]) # or could specify the number of tiers manually

tiersQB_boostrap

------------------------------------------------------------- 
Bootstrap sequential LRT for the number of mixture components 
------------------------------------------------------------- 
Model        = V 
Replications = 999 
              LRTS bootstrap p-value
1 vs 2   68.720575             0.001
2 vs 3    9.790787             0.045
3 vs 4   31.396105             0.001
4 vs 5    1.057678             0.656

Code

plot(
  tiersQB_boostrap,
  G = numTiersQB - 1)
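One way to read the sequential test table is to add components until a comparison is no longer statistically significant. The sketch below applies that rule to the Quarterback p-values printed above; the .05 threshold is an assumption, and the code illustrates the decision rule rather than the internals of mclust.

```r
# Bootstrap p-values copied from the Quarterback output above
p_values <- c(
  "1 vs 2" = 0.001,
  "2 vs 3" = 0.045,
  "3 vs 4" = 0.001,
  "4 vs 5" = 0.656
)
alpha <- 0.05  # assumed significance level

# Index of the first comparison in which adding a component does not help
first_nonsig <- which(p_values >= alpha)[1]

# The smaller model in that comparison has this many components
num_components <- unname(first_nonsig)
num_components
```

Here, the first non-significant comparison is 4 vs 5, so the four-component model is retained, which agrees with the BIC and ICL results for Quarterbacks.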

21.1.4.2.2 Running Backs

Code

tiersRB_bic <- mclust::mclustBIC(
  data = player_stats_seasonal_offense_recentRB$fantasyPoints,
  G = 1:9
)

tiersRB_bic

Bayesian Information Criterion (BIC): 
          E         V
1 -1888.714 -1888.714
2 -1817.804 -1769.298
3 -1827.956 -1699.724
4 -1817.083 -1701.580
5 -1827.203 -1708.617
6 -1837.331 -1719.106
7 -1817.623 -1721.044
8 -1827.752 -1735.666
9 -1834.919 -1746.427

Top 3 models based on the BIC criterion: 
      V,3       V,4       V,5 
-1699.724 -1701.580 -1708.617

Code

summary(tiersRB_bic)

Best BIC values:
               V,3          V,4          V,5
BIC      -1699.724 -1701.580264 -1708.616531
BIC diff     0.000    -1.855914    -8.892182

Code

plot(tiersRB_bic)

Code

tiersRB_icl <- mclust::mclustICL(
  data = player_stats_seasonal_offense_recentRB$fantasyPoints,
  G = 1:9
)

tiersRB_icl

Integrated Complete-data Likelihood (ICL) criterion: 
          E         V
1 -1888.714 -1888.714
2 -1823.200 -1793.185
3 -1991.232 -1728.105
4 -1974.495 -1745.695
5 -2074.939 -1750.066
6 -2123.855 -1757.956
7 -2081.524 -1765.455
8 -2133.100 -1796.801
9 -2136.424 -1795.120

Top 3 models based on the ICL criterion: 
      V,3       V,4       V,5 
-1728.105 -1745.695 -1750.066

Code

summary(tiersRB_icl)

Best ICL values:
               V,3         V,4         V,5
ICL      -1728.105 -1745.69534 -1750.06574
ICL diff     0.000   -17.58998   -21.96037

Code

plot(tiersRB_icl)

Code

numTiersRB <- 3

The bootstrap likelihood ratio test for Running Backs’ fantasy points fails with an error, so we set the number of tiers manually (above) based on the BIC and ICL:

Code

tiersRB_boostrap <- mclust::mclustBootstrapLRT(
  data = player_stats_seasonal_offense_recentRB$fantasyPoints,
  modelName = "V") # variable/unequal variance (for univariate data)

Thus, we cannot use the following code, which would otherwise summarize the model results, specify the number of tiers, and plot model comparisons:

Code

numTiersRB <- as.numeric(summary(tiersRB_boostrap)[,"Length"][1]) # or could specify the number of tiers manually

tiersRB_boostrap
plot(
  tiersRB_boostrap,
  G = numTiersRB - 1)
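When a model-fitting step may fail, one option is to wrap it in `tryCatch()` and fall back to a manually specified value so that the rest of the script can continue. The sketch below uses a stand-in function that always errors; `fit_bootstrap_lrt()` and its return value are hypothetical and are not part of mclust.

```r
# Stand-in for a model-fitting call that fails (hypothetical function)
fit_bootstrap_lrt <- function(x) {
  stop("model could not be estimated")
}

fallback_num_tiers <- 3  # e.g., the number of tiers favored by the BIC and ICL

num_tiers <- tryCatch(
  {
    fit <- fit_bootstrap_lrt(c(10, 20, 30))
    length(fit$G)  # hypothetical extraction; not reached because the call errors
  },
  error = function(e) {
    message("Bootstrap LRT failed: ", conditionMessage(e))
    fallback_num_tiers
  }
)
num_tiers
```

This pattern records why the step failed while still yielding a usable number of tiers.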

21.1.4.2.3 Wide Receivers

Code

tiersWR_bic <- mclust::mclustBIC(
  data = player_stats_seasonal_offense_recentWR$fantasyPoints,
  G = 1:9
)

tiersWR_bic

Bayesian Information Criterion (BIC): 
          E         V
1 -2761.531 -2761.531
2 -2703.730 -2574.337
3 -2714.665 -2561.183
4 -2690.946 -2551.896
5 -2701.848 -2559.810
6 -2679.348 -2566.401
7 -2690.252 -2567.887
8 -2693.451 -2579.761
9 -2704.412 -2594.502

Top 3 models based on the BIC criterion: 
      V,4       V,5       V,3 
-2551.896 -2559.810 -2561.183

Code

summary(tiersWR_bic)

Best BIC values:
               V,4          V,5          V,3
BIC      -2551.896 -2559.809568 -2561.182771
BIC diff     0.000    -7.913781    -9.286984

Code

plot(tiersWR_bic)

Code

tiersWR_icl <- mclust::mclustICL(
  data = player_stats_seasonal_offense_recentWR$fantasyPoints,
  G = 1:9
)

tiersWR_icl

Integrated Complete-data Likelihood (ICL) criterion: 
          E         V
1 -2761.531 -2761.531
2 -2728.952 -2597.147
3 -2967.945 -2623.521
4 -2909.051 -2643.926
5 -3004.434 -2652.681
6 -2995.921 -2665.160
7 -3044.355 -2642.838
8 -3043.060 -2662.966
9 -3081.954 -2680.271

Top 3 models based on the ICL criterion: 
      V,2       V,3       V,7 
-2597.147 -2623.521 -2642.838

Code

summary(tiersWR_icl)

Best ICL values:
               V,2         V,3         V,7
ICL      -2597.147 -2623.52084 -2642.83833
ICL diff     0.000   -26.37432   -45.69181

Code

plot(tiersWR_icl)

Code

tiersWR_boostrap <- mclust::mclustBootstrapLRT(
  data = player_stats_seasonal_offense_recentWR$fantasyPoints,
  modelName = "V") # variable/unequal variance (for univariate data)

numTiersWR <- as.numeric(summary(tiersWR_boostrap)[,"Length"][1]) # or could specify the number of tiers manually

tiersWR_boostrap

------------------------------------------------------------- 
Bootstrap sequential LRT for the number of mixture components 
------------------------------------------------------------- 
Model        = V 
Replications = 999 
               LRTS bootstrap p-value
1 vs 2   203.573535             0.001
2 vs 3    29.532613             0.001
3 vs 4    25.665741             0.001
4 vs 5     8.464976             0.049
5 vs 6     9.786848             0.042
6 vs 7    14.893389             0.009
7 vs 8     4.504398             0.190

Code

plot(
  tiersWR_boostrap,
  G = numTiersWR - 1)

21.1.4.2.4 Tight Ends

Code

tiersTE_bic <- mclust::mclustBIC(
  data = player_stats_seasonal_offense_recentTE$fantasyPoints,
  G = 1:9
)

tiersTE_bic

Bayesian Information Criterion (BIC): 
          E         V
1 -1416.311 -1416.311
2 -1382.530 -1330.306
3 -1392.221 -1305.417
4 -1401.914 -1304.670
5 -1370.398 -1314.375
6 -1380.110 -1322.054
7 -1387.386 -1329.543
8 -1397.037 -1343.259
9 -1406.769 -1349.787

Top 3 models based on the BIC criterion: 
      V,4       V,3       V,5 
-1304.670 -1305.417 -1314.375

Code

summary(tiersTE_bic)

Best BIC values:
              V,4           V,3          V,5
BIC      -1304.67 -1305.4171376 -1314.374518
BIC diff     0.00    -0.7472878    -9.704669

Code

plot(tiersTE_bic)

Code

tiersTE_icl <- mclust::mclustICL(
  data = player_stats_seasonal_offense_recentTE$fantasyPoints,
  G = 1:9
)

tiersTE_icl

Integrated Complete-data Likelihood (ICL) criterion: 
          E         V
1 -1416.311 -1416.311
2 -1393.104 -1350.405
3 -1524.763 -1331.375
4 -1592.916 -1341.536
5 -1569.134 -1358.678
6 -1611.364 -1360.491
7 -1616.459 -1360.443
8 -1650.436 -1392.210
9 -1687.470 -1383.417

Top 3 models based on the ICL criterion: 
      V,3       V,4       V,2 
-1331.375 -1341.536 -1350.405

Code

summary(tiersTE_icl)

Best ICL values:
               V,3         V,4         V,2
ICL      -1331.375 -1341.53615 -1350.40527
ICL diff     0.000   -10.16078   -19.02991

Code

plot(tiersTE_icl)

Code

tiersTE_boostrap <- mclust::mclustBootstrapLRT(
  data = player_stats_seasonal_offense_recentTE$fantasyPoints,
  modelName = "V") # variable/unequal variance (for univariate data)

numTiersTE <- as.numeric(summary(tiersTE_boostrap)[,"Length"][1]) # or could specify the number of tiers manually

tiersTE_boostrap

------------------------------------------------------------- 
Bootstrap sequential LRT for the number of mixture components 
------------------------------------------------------------- 
Model        = V 
Replications = 999 
               LRTS bootstrap p-value
1 vs 2   100.537455             0.001
2 vs 3    39.421427             0.001
3 vs 4    15.279849             0.006
4 vs 5     4.827893             0.201

Code

plot(
  tiersTE_boostrap,
  G = numTiersTE - 1)

21.1.4.3 Fit the Cluster Model to the Optimal Number of Tiers

21.1.4.3.1 Quarterbacks

In our data, all of the following models are equivalent—i.e., they result in the same unequal variance model with a 4-cluster solution—but they arrive there in different ways.

Code

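# Use the number of tiers identified by the bootstrap likelihood ratio test
mclust::Mclust(
  data = player_stats_seasonal_offense_recentQB$fantasyPoints,
  G = numTiersQB
)

# Specify the number of tiers manually
mclust::Mclust(
  data = player_stats_seasonal_offense_recentQB$fantasyPoints,
  G = 4
)

# Let Mclust() select the number of tiers and the variance structure based on the BIC
mclust::Mclust(
  data = player_stats_seasonal_offense_recentQB$fantasyPoints
)

# Reuse the BIC values that were already computed
mclust::Mclust(
  data = player_stats_seasonal_offense_recentQB$fantasyPoints,
  x = tiersQB_bic
)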

Let’s fit one of these:

Code

clusterModelQBs <- mclust::Mclust(
  data = player_stats_seasonal_offense_recentQB$fantasyPoints,
  G = numTiersQB,
)

Here is the number of players in each of the four clusters (i.e., tiers):

Code

table(clusterModelQBs$classification)


 1  2  3  4 
11 20 26 21

21.1.4.3.2 Running Backs

Code

clusterModelRBs <- mclust::Mclust(
  data = player_stats_seasonal_offense_recentRB$fantasyPoints,
  G = numTiersRB,
)

Here is the number of players in each of the three clusters (i.e., tiers):

Code

table(clusterModelRBs$classification)


 1  2  3 
39 61 58

21.1.4.3.3 Wide Receivers

Code

clusterModelWRs <- mclust::Mclust(
  data = player_stats_seasonal_offense_recentWR$fantasyPoints,
  G = numTiersWR,
)

Here is the number of players in each of the seven clusters (i.e., tiers):

Code

table(clusterModelWRs$classification)


 1  2  3  4  5  6  7 
36 25 30 24 39 28 53

21.1.4.3.4 Tight Ends

Code

clusterModelTEs <- mclust::Mclust(
  data = player_stats_seasonal_offense_recentTE$fantasyPoints,
  G = numTiersTE,
)

Here is the number of players in each of the four clusters (i.e., tiers):

Code

table(clusterModelTEs$classification)


 1  2  3  4 
24 32 29 42

21.1.4.4 Plot the Tiers

We can merge each player’s classification into the dataset and plot the players by tier.

21.1.4.4.1 Quarterbacks

Code

player_stats_seasonal_offense_recentQB$tier <- clusterModelQBs$classification

player_stats_seasonal_offense_recentQB <- player_stats_seasonal_offense_recentQB %>%
  mutate(
    tier = factor(max(tier, na.rm = TRUE) + 1 - tier)
  )

player_stats_seasonal_offense_recentQB$position_rank <- rank(
  player_stats_seasonal_offense_recentQB$fantasyPoints * -1,
  na.last = "keep",
  ties.method = "min")

plot_qbTiers <- ggplot2::ggplot(
  data = player_stats_seasonal_offense_recentQB,
  mapping = aes(
    x = fantasyPoints,
    y = position_rank,
    color = tier
  )) +
  geom_point(
    aes(
      text = player_display_name # add player name for mouse over tooltip
  )) +
  scale_y_continuous(trans = "reverse") +
  coord_cartesian(clip = "off") +
  labs(
    x = "Fantasy Points",
    y = "Position Rank",
    title = "Quarterback Fantasy Points by Tier",
    color = "Tier") +
  theme_classic() +
  theme(legend.position = "top")

plotly::ggplotly(plot_qbTiers)

Figure 21.1: Quarterback Fantasy Points by Tier.
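The two data-preparation steps in the plotting code (reversing the tier labels and computing the position rank) can be seen more easily on a small example. The values below are hypothetical, and the sketch assumes, as the relabeling code implies, that cluster 1 is the lowest-scoring cluster.

```r
# Hypothetical fantasy points and cluster labels for six players;
# label 1 is assumed to be the lowest-scoring cluster
fantasyPoints  <- c(310, 295, 180, 180, 60, 45)
classification <- c(3, 3, 2, 2, 1, 1)

# Reverse the labels so that Tier 1 holds the highest scorers
tier <- max(classification) + 1 - classification
tier

# Rank players from most to fewest points; tied players share the smaller rank
position_rank <- rank(-fantasyPoints, na.last = "keep", ties.method = "min")
position_rank
```

With `ties.method = "min"`, the two players tied at 180 points both receive rank 3, and the next player receives rank 5.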

21.1.4.4.2 Running Backs

Code

player_stats_seasonal_offense_recentRB$tier <- clusterModelRBs$classification

player_stats_seasonal_offense_recentRB <- player_stats_seasonal_offense_recentRB %>%
  mutate(
    tier = factor(max(tier, na.rm = TRUE) + 1 - tier)
  )

player_stats_seasonal_offense_recentRB$position_rank <- rank(
  player_stats_seasonal_offense_recentRB$fantasyPoints * -1,
  na.last = "keep",
  ties.method = "min")

plot_rbTiers <- ggplot2::ggplot(
  data = player_stats_seasonal_offense_recentRB,
  mapping = aes(
    x = fantasyPoints,
    y = position_rank,
    color = tier
  )) +
  geom_point(
    aes(
      text = player_display_name # add player name for mouse over tooltip
  )) +
  scale_y_continuous(trans = "reverse") +
  coord_cartesian(clip = "off") +
  labs(
    x = "Fantasy Points",
    y = "Position Rank",
    title = "Running Back Fantasy Points by Tier",
    color = "Tier") +
  theme_classic() +
  theme(legend.position = "top")

plotly::ggplotly(plot_rbTiers)

Figure 21.2: Running Back Fantasy Points by Tier.

21.1.4.4.3 Wide Receivers

Code

player_stats_seasonal_offense_recentWR$tier <- clusterModelWRs$classification

player_stats_seasonal_offense_recentWR <- player_stats_seasonal_offense_recentWR %>%
  mutate(
    tier = factor(max(tier, na.rm = TRUE) + 1 - tier)
  )

player_stats_seasonal_offense_recentWR$position_rank <- rank(
  player_stats_seasonal_offense_recentWR$fantasyPoints * -1,
  na.last = "keep",
  ties.method = "min")

plot_wrTiers <- ggplot2::ggplot(
  data = player_stats_seasonal_offense_recentWR,
  mapping = aes(
    x = fantasyPoints,
    y = position_rank,
    color = tier
  )) +
  geom_point(
    aes(
      text = player_display_name # add player name for mouse over tooltip
  )) +
  scale_y_continuous(trans = "reverse") +
  coord_cartesian(clip = "off") +
  labs(
    x = "Fantasy Points",
    y = "Position Rank",
    title = "Wide Receiver Fantasy Points by Tier",
    color = "Tier") +
  theme_classic() +
  theme(legend.position = "top")

plotly::ggplotly(plot_wrTiers)

Figure 21.3: Wide Receiver Fantasy Points by Tier.

21.1.4.4.4 Tight Ends

Code

player_stats_seasonal_offense_recentTE$tier <- clusterModelTEs$classification

player_stats_seasonal_offense_recentTE <- player_stats_seasonal_offense_recentTE %>%
  mutate(
    tier = factor(max(tier, na.rm = TRUE) + 1 - tier)
  )

player_stats_seasonal_offense_recentTE$position_rank <- rank(
  player_stats_seasonal_offense_recentTE$fantasyPoints * -1,
  na.last = "keep",
  ties.method = "min")

plot_teTiers <- ggplot2::ggplot(
  data = player_stats_seasonal_offense_recentTE,
  mapping = aes(
    x = fantasyPoints,
    y = position_rank,
    color = tier
  )) +
  geom_point(
    aes(
      text = player_display_name # add player name for mouse over tooltip
  )) +
  scale_y_continuous(trans = "reverse") +
  coord_cartesian(clip = "off") +
  labs(
    x = "Fantasy Points",
    y = "Position Rank",
    title = "Tight End Fantasy Points by Tier",
    color = "Tier") +
  theme_classic() +
  theme(legend.position = "top")

plotly::ggplotly(plot_teTiers)

Figure 21.4: Tight End Fantasy Points by Tier.

21.1.5 Types of Wide Receivers

Code

# Compute Advanced PFR Stats by Career
pfrVars <- nfl_advancedStatsPFR_seasonal %>% 
  select(pocket_time.pass:cmp_percent.def, g, gs) %>% 
  names()

weightedAverageVars <- c(
  "pocket_time.pass",
  "ybc_att.rush","yac_att.rush",
  "ybc_r.rec","yac_r.rec","adot.rec","rat.rec",
  "yds_cmp.def","yds_tgt.def","dadot.def","m_tkl_percent.def","rat.def"
)

recomputeVars <- c(
  "drop_pct.pass", # drops.pass / pass_attempts.pass
  "bad_throw_pct.pass", # bad_throws.pass / pass_attempts.pass
  "on_tgt_pct.pass", # on_tgt_throws.pass / pass_attempts.pass
  "pressure_pct.pass", # times_pressured.pass / pass_attempts.pass
  "drop_percent.rec", # drop.rec / tgt.rec
  "rec_br.rec", # rec.rec / brk_tkl.rec
  "cmp_percent.def" # cmp.def / tgt.def
)

sumVars <- pfrVars[pfrVars %ni% c(
  weightedAverageVars, recomputeVars,
  "merge_name", "loaded.pass", "loaded.rush", "loaded.rec", "loaded.def")]

nfl_advancedStatsPFR_career <- nfl_advancedStatsPFR_seasonal %>% 
  group_by(pfr_id, merge_name) %>% 
  summarise(
    across(all_of(weightedAverageVars), ~ weighted.mean(.x, w = g, na.rm = TRUE)),
    across(all_of(sumVars), ~ sum(.x, na.rm = TRUE)),
    .groups = "drop") %>% 
  mutate(
    drop_pct.pass = drops.pass / pass_attempts.pass,
    bad_throw_pct.pass = bad_throws.pass / pass_attempts.pass,
    on_tgt_pct.pass = on_tgt_throws.pass / pass_attempts.pass,
    pressure_pct.pass = times_pressured.pass / pass_attempts.pass,
    drop_percent.rec = drop.rec / tgt.rec,
    rec_br.rec = rec.rec / brk_tkl.rec,
    cmp_percent.def = cmp.def / tgt.def
  )

uniqueCases <- nfl_advancedStatsPFR_seasonal %>% select(pfr_id, merge_name, gsis_id) %>% unique()

uniqueCases %>%
  group_by(pfr_id) %>% 
  filter(n() > 1)
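The career aggregation above treats variables in three ways: per-play averages are weighted by games played, counts are summed, and percentages are recomputed from the summed counts. The sketch below illustrates why percentages are recomputed rather than averaged, using two made-up seasons for a single hypothetical receiver.

```r
# Two hypothetical seasons for one receiver (made-up numbers)
seasons <- data.frame(
  g        = c(4, 16),    # games played
  tgt.rec  = c(10, 120),  # targets
  drop.rec = c(3, 6),     # dropped passes
  adot.rec = c(14, 9)     # average depth of target
)

# Per-play averages: weight each season by games played
career_adot <- weighted.mean(seasons$adot.rec, w = seasons$g)

# Averaging seasonal percentages treats a 10-target season like a 120-target season
mean_of_pcts <- mean(seasons$drop.rec / seasons$tgt.rec)

# Recomputing from career totals weights every target equally
career_pct <- sum(seasons$drop.rec) / sum(seasons$tgt.rec)

c(career_adot = career_adot, mean_of_pcts = mean_of_pcts, career_pct = career_pct)
```

In this example, the simple average of the seasonal drop percentages (17.5%) is more than double the career drop percentage computed from totals (about 6.9%), because the short season is given too much weight.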

Code

nfl_advancedStatsPFR_seasonal <- nfl_advancedStatsPFR_seasonal %>% 
  filter(pfr_id != "WillMa06" | merge_name != "MARCUSWILLIAMS" | !is.na(gsis_id))


nfl_advancedStatsPFR_career <- left_join(
  nfl_advancedStatsPFR_career,
  nfl_advancedStatsPFR_seasonal %>% select(pfr_id, merge_name, gsis_id) %>% unique(),
  by = c("pfr_id", "merge_name")
)

# Compute Player Stats Per Season
player_stats_seasonal_careerWRs <- player_stats_seasonal %>% 
  filter(position == "WR") %>% 
  group_by(player_id) %>% 
  summarise(
    across(all_of(c("targets", "receptions", "receiving_air_yards")), ~ weighted.mean(.x, w = games, na.rm = TRUE)),
    .groups = "drop")

# Drop players with no receiving air yards
player_stats_seasonal_careerWRs <- player_stats_seasonal_careerWRs %>% 
  filter(receiving_air_yards != 0) %>% 
  rename(
    targets_per_season = targets,
    receptions_per_season = receptions,
    receiving_air_yards_per_season = receiving_air_yards
  )

# Merge
playerListToMerge <- list(
  nfl_players %>% select(gsis_id, display_name, position, height, weight),
  nfl_combine %>% select(gsis_id, vertical, forty, ht, wt),
  player_stats_seasonal_careerWRs %>% select(player_id, targets_per_season, receptions_per_season, receiving_air_yards_per_season) %>% 
    rename(gsis_id = player_id),
  nfl_actualStats_career_player_inclPost %>% select(player_id, receptions, targets, receiving_air_yards, air_yards_share, target_share) %>% 
    rename(gsis_id = player_id),
  nfl_advancedStatsPFR_career %>% select(gsis_id, adot.rec, rec.rec, brk_tkl.rec, drop.rec, drop_percent.rec)
)

merged_data <- playerListToMerge %>% 
  reduce(
    full_join,
    by = c("gsis_id"),
    na_matches = "never")
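The merge above folds a list of tables into one table by joining them in turn on the player identifier. A base R analogue of that pattern is sketched below with three small, made-up tables. It uses `Reduce()` with `merge(all = TRUE)` in place of `reduce()` with `full_join()`, and it does not replicate `na_matches = "never"` because the example has no missing identifiers.

```r
# Three hypothetical tables keyed by a shared player identifier
players <- data.frame(gsis_id = c("A", "B", "C"), height = c(72, 74, 70))
combine <- data.frame(gsis_id = c("A", "B"), forty = c(4.40, 4.55))
career  <- data.frame(gsis_id = c("B", "C", "D"), receptions = c(210, 95, 12))

# Fold the list of tables into one, keeping every player found in any table
merged <- Reduce(
  function(x, y) merge(x, y, by = "gsis_id", all = TRUE),
  list(players, combine, career)
)
merged
```

A full join keeps players who appear in only some of the tables (such as player "D" here) and fills the columns they lack with `NA`.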

Additional processing:

Code

merged_data <- merged_data %>% 
  mutate(
    height_coalesced = coalesce(height, ht),
    weight_coalesced = coalesce(weight, wt),
    receptions_coalesced = pmax(receptions, rec.rec, na.rm = TRUE),
    receiving_air_yards_per_rec = receiving_air_yards / receptions
  )

merged_data$receiving_air_yards_per_rec[which(merged_data$receptions == 0)] <- 0

merged_dataWRs <- merged_data %>% 
  filter(position == "WR")

merged_dataWRs_cluster <- merged_dataWRs %>% 
  filter(receptions_coalesced >= 100) %>% # keep WRs with at least 100 receptions
  select(gsis_id, display_name, vertical, forty, height_coalesced, weight_coalesced, adot.rec, drop_percent.rec, receiving_air_yards_per_rec, brk_tkl.rec, receptions_per_season) %>% #targets_per_season, receiving_air_yards_per_season, air_yards_share, target_share
  na.omit()
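The processing steps above combine values from multiple sources and guard against division by zero. The following base R sketch shows the same logic on made-up vectors; `ifelse(is.na(...))` stands in for `coalesce()` when there are two sources.

```r
# Hypothetical values from two sources for three players (NA = missing)
height <- c(72, NA, 70)  # from the player table
ht     <- c(72, 74, NA)  # from the combine table

# Base R analogue of coalesce(): take the first value if present, else the second
height_coalesced <- ifelse(is.na(height), ht, height)

receptions <- c(120, NA, 90)   # from one source
rec.rec    <- c(110, 105, NA)  # from another source

# Parallel maximum, ignoring missing values
receptions_coalesced <- pmax(receptions, rec.rec, na.rm = TRUE)

# Dividing by zero receptions yields NaN (or Inf), hence the explicit reset to 0
air_yards <- c(300, 0)
recs      <- c(20, 0)
per_rec   <- air_yards / recs
per_rec[recs == 0] <- 0

height_coalesced
receptions_coalesced
per_rec
```

Taking the larger of the two reception counts is a reasonable choice when one source covers fewer seasons than the other and therefore undercounts.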

21.1.5.1 Identify the Number of WR Types

Code

wrTypes_bic <- mclust::mclustBIC(
  data = merged_dataWRs_cluster %>% select(-gsis_id, -display_name),
  G = 1:9
)

wrTypes_bic

Bayesian Information Criterion (BIC): 
        EII       VII       EEI       VEI       EVI       VVI       EEE
1 -8521.963 -8521.963 -5185.509 -5185.509 -5185.509 -5185.509 -4992.434
2 -8104.831 -8073.802 -5145.239 -5136.649 -5018.457 -5146.317 -5017.149
3 -7957.960 -7920.971 -5080.707 -5085.739 -4994.395 -4997.246 -5011.675
4 -7836.962 -7767.059 -5070.012 -5038.031 -4991.076 -4973.775 -4980.329
5 -7743.153 -7718.246 -5064.335 -5045.676 -5010.565 -4996.813 -4958.735
6 -7753.031 -7705.521 -5078.853 -5069.783 -5011.298 -5033.772 -5040.084
7 -7764.288 -7721.187 -5081.158 -5045.356 -5062.471 -5060.530 -5014.736
8 -7756.644 -7672.340 -5072.408 -5061.930 -5098.986 -5107.732 -5064.086
9 -7774.246        NA -5077.748        NA        NA        NA -4962.765
        VEE       EVE       VVE       EEV       VEV       EVV       VVV
1 -4992.434 -4992.434 -4992.434 -4992.434 -4992.434 -4992.434 -4992.434
2 -4989.172 -4756.142 -4759.198 -4808.728 -4886.250 -4886.723 -4900.238
3 -4995.917 -4717.684 -4712.338 -4918.197 -4876.018 -4955.641 -4928.440
4 -4912.841 -4757.765 -4762.691 -5141.402 -5159.351 -5136.086 -5168.938
5 -4893.903 -4814.817 -4836.123 -5195.286 -5260.706 -5325.169 -5370.854
6 -4978.477 -4853.550 -4845.930 -5345.145 -5361.930 -5494.314 -5489.821
7 -4981.574 -4896.554 -4898.030 -5492.422 -5569.211 -5697.334 -5683.734
8 -5011.101 -4945.459 -4947.183 -5658.904 -5695.686 -5846.447 -5867.393
9        NA        NA        NA -5823.011        NA        NA        NA

Top 3 models based on the BIC criterion: 
    VVE,3     EVE,3     EVE,2 
-4712.338 -4717.684 -4756.142

Code

summary(wrTypes_bic)

Best BIC values:
             VVE,3        EVE,3       EVE,2
BIC      -4712.338 -4717.684010 -4756.14230
BIC diff     0.000    -5.345829   -43.80412

Code

plot(wrTypes_bic)

Code

wrTypes_icl <- mclust::mclustICL(
  data = merged_dataWRs_cluster %>% select(-gsis_id, -display_name),
  G = 1:9
)

wrTypes_icl

Integrated Complete-data Likelihood (ICL) criterion: 
        EII       VII       EEI       VEI       EVI       VVI       EEE
1 -8521.963 -8521.963 -5185.509 -5185.509 -5185.509 -5185.509 -4992.434
2 -8110.421 -8080.036 -5164.217 -5152.012 -5032.439 -5167.841 -5025.798
3 -7968.947 -7928.577 -5101.176 -5106.284 -5009.569 -5013.301 -5027.893
4 -7846.600 -7775.743 -5091.038 -5055.367 -5012.572 -4993.914 -4995.932
5 -7754.336 -7729.565 -5091.398 -5069.961 -5030.357 -5019.417 -4973.963
6 -7770.496 -7713.518 -5105.656 -5092.076 -5029.966 -5051.763 -5057.254
7 -7786.260 -7733.662 -5108.242 -5064.440 -5082.236 -5070.226 -5032.586
8 -7775.389 -7684.201 -5096.702 -5080.609 -5112.700 -5116.636 -5094.738
9 -7791.055        NA -5105.526        NA        NA        NA -4981.214
        VEE       EVE       VVE       EEV       VEV       EVV       VVV
1 -4992.434 -4992.434 -4992.434 -4992.434 -4992.434 -4992.434 -4992.434
2 -5004.174 -4760.848 -4765.868 -4809.428 -4888.714 -4887.183 -4902.365
3 -5006.245 -4735.419 -4725.314 -4919.292 -4876.541 -4963.354 -4930.929
4 -4922.883 -4772.627 -4775.317 -5149.339 -5166.186 -5144.031 -5174.419
5 -4904.516 -4828.564 -4852.419 -5197.371 -5263.216 -5327.647 -5374.328
6 -4995.230 -4864.618 -4862.199 -5346.234 -5365.514 -5496.064 -5492.120
7 -4994.700 -4907.920 -4907.062 -5493.562 -5570.041 -5698.361 -5685.074
8 -5024.572 -4954.210 -4954.163 -5659.382 -5696.321 -5846.660 -5868.538
9        NA        NA        NA -5823.465        NA        NA        NA

Top 3 models based on the ICL criterion: 
    VVE,3     EVE,3     EVE,2 
-4725.314 -4735.419 -4760.848

Code

summary(wrTypes_icl)

Best ICL values:
             VVE,3       EVE,3       EVE,2
ICL      -4725.314 -4735.41901 -4760.84753
ICL diff     0.000   -10.10454   -35.53306

Code

plot(wrTypes_icl)

Both the BIC and the ICL favor a three-cluster solution (the VVE model with three components), so we proceed with three clusters.

Code

numTypesWR <- 3

Code

wrTypes_boostrap <- mclust::mclustBootstrapLRT(
  data = merged_dataWRs_cluster %>% select(-gsis_id, -display_name),
  modelName = "EVE") # ellipsoidal with equal volume, variable shape, and equal orientation (for multivariate data)

wrTypes_boostrap
plot(
  wrTypes_boostrap,
  G = numTypesWR - 1)

21.1.5.2 Fit the Cluster Model to the Optimal Number of WR Types

Code

clusterModelWRtypes <- mclust::Mclust(
  data = merged_dataWRs_cluster %>% select(-gsis_id, -display_name),
  G = numTypesWR,
)

summary(clusterModelWRtypes)

---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust VVE (ellipsoidal, equal orientation) model with 3 components: 

 log-likelihood   n df       BIC       ICL
      -2133.336 127 92 -4712.338 -4725.314

Clustering table:
 1  2  3 
29 22 76
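The `df` and `BIC` values in the summary can be reconstructed from the model structure. The sketch below counts the free parameters of a three-cluster VVE model with nine clustering variables and then recomputes the BIC from the reported log-likelihood. The parameter breakdown reflects the usual volume/shape/orientation decomposition of the covariance matrices.

```r
# Values copied from the model summary above
loglik <- -2133.336
n      <- 127
G      <- 3   # number of clusters
d      <- 9   # number of clustering variables

# Free parameters in a VVE model
n_mixing      <- G - 1            # mixing proportions
n_means       <- G * d            # cluster means
n_volume      <- G                # one volume per cluster
n_shape       <- G * (d - 1)      # one shape per cluster
n_orientation <- d * (d - 1) / 2  # a single orientation shared by all clusters
df <- n_mixing + n_means + n_volume + n_shape + n_orientation
df

# BIC in the "larger is better" convention
bic <- 2 * loglik - df * log(n)
round(bic, 3)
```

The parameter count comes to 92, matching the summary, and the recomputed BIC agrees with the reported value up to rounding of the log-likelihood. With only 127 players, 92 parameters is a lot to estimate, which is one reason the more flexible models (e.g., VVV) are penalized heavily.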

21.1.5.3 Plots of the Cluster Model

Code

plot(
  clusterModelWRtypes,
  what = "BIC")

Code

plot(
  clusterModelWRtypes,
  what = "classification")

Code

plot(
  clusterModelWRtypes,
  what = "uncertainty")

Code

plot(
  clusterModelWRtypes,
  what = "density")
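The classification and uncertainty plots are both derived from each player's posterior membership probabilities. A minimal sketch with hypothetical probabilities follows; in a fitted model, these probabilities would come from the model object rather than being typed in by hand.

```r
# Hypothetical posterior membership probabilities: 4 players x 3 clusters
z <- rbind(
  c(0.97, 0.02, 0.01),
  c(0.10, 0.85, 0.05),
  c(0.40, 0.35, 0.25),  # hard to classify
  c(0.01, 0.04, 0.95)
)

# Assign each player to the most probable cluster
classification <- apply(z, 1, which.max)

# Uncertainty: one minus the probability of the assigned cluster
uncertainty <- 1 - apply(z, 1, max)

data.frame(classification, uncertainty)
```

Players with high uncertainty sit between clusters, so their type should be interpreted with more caution.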

21.1.5.4 Interpreting the Clusters

Code

table(clusterModelWRtypes$classification)


 1  2  3 
29 22 76

Code

merged_dataWRs_cluster$type <- clusterModelWRtypes$classification

merged_dataWRs_cluster %>% 
  group_by(type) %>% 
  summarise(across(
    where(is.numeric),
    ~ mean(., na.rm = TRUE)
    )) %>% 
  t() %>% 
  round(., 2)

                              [,1]   [,2]   [,3]
type                          1.00   2.00   3.00
vertical                     36.53  36.25  35.91
forty                         4.47   4.45   4.47
height_coalesced             73.21  72.73  72.53
weight_coalesced            207.21 199.95 198.24
adot.rec                     10.18  12.30  10.35
drop_percent.rec              0.04   0.06   0.05
receiving_air_yards_per_rec  15.85  22.09  17.16
brk_tkl.rec                  27.14   0.64   8.39
receptions_per_season        78.88  37.52  45.76

Based on this analysis (and the variables included), there appear to be three types of Wide Receivers. The cluster numbers are arbitrary labels; the descriptions below follow the cluster means shown above. Type 1 Wide Receivers include the elite WR1s who are strong possession receivers (note: not all players in a given cluster map perfectly onto the typology; i.e., not all Type 1 Wide Receivers are elite WR1s). They tend to have the lowest drop percentage, the shortest average depth of target, and the fewest receiving air yards per reception. They tend to have the most receptions per season and break the most tackles.

Type 2 Wide Receivers include the deep threats. They have the greatest average depth of target and the most receiving air yards per reception. However, they also have the fewest receptions per season, the highest drop percentage, and the fewest broken tackles. Thus, they may be considered the boom-or-bust Wide Receivers.

Type 3 Wide Receivers include the consistent contributor, WR2 types. They had fewer receptions per season and fewer broken tackles than Type 1 Wide Receivers. Their average depth of target was slightly longer than that of Type 1, and they had more receiving air yards per reception than Type 1.

The types were not particularly distinguishable based on their height, weight, vertical jump, or forty-yard dash time.

Type 1 (“Elite/WR1”) WRs:

Code

merged_dataWRs_cluster %>% 
  filter(type == 1) %>% 
  select(display_name)

Type 2 (“Deep Threat/Boom-or-Bust”) WRs:

Code

merged_dataWRs_cluster %>% 
  filter(type == 2) %>% 
  select(display_name)

Type 3 (“Consistent Contributor/WR2”) WRs:

Code

merged_dataWRs_cluster %>% 
  filter(type == 3) %>% 
  select(display_name)

21.2 Conclusion

21.3 Session Info

Code

sessionInfo()

R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.4   forcats_1.0.0     stringr_1.5.1     dplyr_1.1.4      
 [5] purrr_1.1.0       readr_2.1.5       tidyr_1.3.1       tibble_3.3.0     
 [9] tidyverse_2.0.0   plotly_4.11.0     ggplot2_3.5.2     mclust_6.1.1     
[13] nflreadr_1.4.1    petersenlab_1.1.7

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1   psych_2.5.6        viridisLite_0.4.2  farver_2.1.2      
 [5] fastmap_1.2.0      lazyeval_0.2.2     digest_0.6.37      rpart_4.1.24      
 [9] timechange_0.3.0   lifecycle_1.0.4    cluster_2.1.8.1    magrittr_2.0.3    
[13] compiler_4.5.1     rlang_1.1.6        Hmisc_5.2-3        tools_4.5.1       
[17] yaml_2.3.10        data.table_1.17.8  knitr_1.50         labeling_0.4.3    
[21] htmlwidgets_1.6.4  mnormt_2.1.1       plyr_1.8.9         RColorBrewer_1.1-3
[25] foreign_0.8-90     withr_3.0.2        nnet_7.3-20        grid_4.5.1        
[29] stats4_4.5.1       lavaan_0.6-19      xtable_1.8-4       colorspace_2.1-1  
[33] scales_1.4.0       MASS_7.3-65        cli_3.6.5          mvtnorm_1.3-3     
[37] rmarkdown_2.29     reformulas_0.4.1   generics_0.1.4     rstudioapi_0.17.1 
[41] tzdb_0.5.0         httr_1.4.7         reshape2_1.4.4     minqa_1.2.8       
[45] DBI_1.2.3          cachem_1.1.0       splines_4.5.1      parallel_4.5.1    
[49] base64enc_0.1-3    mitools_2.4        vctrs_0.6.5        boot_1.3-31       
[53] Matrix_1.7-3       jsonlite_2.0.0     hms_1.1.3          Formula_1.2-5     
[57] htmlTable_2.4.3    crosstalk_1.2.1    glue_1.8.0         nloptr_2.2.1      
[61] stringi_1.8.7      gtable_0.3.6       quadprog_1.5-8     lme4_1.1-37       
[65] pillar_1.11.0      htmltools_0.5.8.1  R6_2.6.1           Rdpack_2.6.4      
[69] mix_1.0-13         evaluate_1.0.4     pbivnorm_0.6.0     lattice_0.22-7    
[73] rbibutils_2.3      backports_1.5.0    memoise_2.0.1      Rcpp_1.1.0        
[77] gridExtra_2.3      nlme_3.1-168       checkmate_2.3.2    xfun_0.52         
[81] pkgconfig_2.0.3
