I need your help!

I want your feedback to make the book better for you and other readers. If you find typos, errors, or places where the text may be improved, please let me know. The best ways to provide feedback are by GitHub or hypothes.is annotations.

You can leave a comment at the bottom of the page/chapter, or open an issue or submit a pull request on GitHub: https://github.com/isaactpetersen/Fantasy-Football-Analytics-Textbook

Alternatively, you can leave an annotation using hypothes.is. To add an annotation, select some text and then click the symbol on the pop-up menu. To see the annotations of others, click the symbol in the upper right-hand corner of the page.

9 Basic Statistics

This chapter provides an overview of foundational statistical concepts.

9.1 Getting Started

9.1.1 Load Packages

Code

library("petersenlab")
library("DescTools")
library("pwr")
library("pwrss")
library("WebPower")
library("grid")
library("psych")
library("SimDesign")
library("tidyverse")

9.1.2 Load Data

Code

load(file = "./data/player_stats_seasonal.RData")

We created the player_stats_seasonal.RData object in Section 4.4.3.

9.2 Descriptive Statistics

Descriptive statistics are used to describe data. For instance, they may be used to describe the center, spread, or shape of the data. There are various indices of each.

9.2.1 Center

Indices to describe the center (central tendency) of a variable’s data include:

mean (aka “average”) (i.e., $M$ or $\mu$)
median
Hodges-Lehmann statistic (aka pseudomedian)
mode
weighted mean
weighted median

The mean of $x$ (written as: $\bar{x}$ or $\mu_x$) is calculated as in Equation 9.1:

\[ \bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + ... + x_n}{n} \tag{9.1}\]

Code

exampleValues <- c(0, 0, 10, 15, 20, 30, 1000)
exampleValues_mean <- mean(exampleValues)

That is, to compute the mean, sum all of the values and divide by the number of values ($n$). One issue with the mean is that it is sensitive to extreme (outlying) values. For instance, the mean of the values of 0, 0, 10, 15, 20, 30, and 1000 is 153.57.

Code

exampleValues_median <- median(exampleValues)

The median is determined as the value at the 50th percentile (i.e., the value that is higher than 50% of the values and is lower than the other 50% of values). Compared to the mean, the median is less influenced by outliers. The median of the values of 0, 0, 10, 15, 20, 30, and 1000 is 15.

Code

exampleValues_pseudomedian <- DescTools::HodgesLehmann(exampleValues)

The Hodges-Lehmann statistic (aka pseudomedian) is computed as the median of all pairwise means, and it is also robust to outliers. The pseudomedian can be computed using the DescTools::HodgesLehmann() function of the DescTools package (Signorell, 2025). The pseudomedian of the values of 0, 0, 10, 15, 20, 30, and 1000 is 15. The pseudomedian is an example of a robust average—an index of the center of a distribution that is less sensitive to outliers.

Code

exampleValues_mode <- petersenlab::Mode(exampleValues)

The mode is the most common/frequent value. The mode of the values of 0, 0, 10, 15, 20, 30, and 1000 is 0. The petersenlab package (Petersen, 2025a) contains the petersenlab::Mode() function for computing the mode of a set of data.

If you want to give some values more weight to others, you can calculate a weighted mean and a weighted median (or other quantile), while assigning a weight to each value. The petersenlab package (Petersen, 2025a) contains various functions for computing the weighted median (i.e., a weighted quantile at the 0.5 quantile, which is equivalent to the 50th percentile) based on Akinshin (2023). Because some projections are outliers, we use a trimmed version of the weighted Harrell-Davis quantile estimator for greater robustness.

Below is R code to estimate each:

mean (base::mean()):

Code

mean(player_stats_seasonal$fantasyPoints, na.rm = TRUE)

[1] 50.47589

median (stats::median()):

Code

median(player_stats_seasonal$fantasyPoints, na.rm = TRUE)

[1] 29.3

Hodges-Lehmann statistic (aka pseudomedian; DescTools::HodgesLehmann()):

Code

DescTools::HodgesLehmann(player_stats_seasonal$fantasyPoints, na.rm = TRUE)

[1] 39.15

mode (petersenlab::Mode()):

Code

petersenlab::Mode(player_stats_seasonal$fantasyPoints)

[1] 1

weighted mean (stats::weighted.mean()):

Code

weighted.mean(
  player_stats_seasonal$fantasyPoints,
  weights = sample( # randomly generate weights (could specify them manually)
    x = 1:3,
    size = length(player_stats_seasonal$fantasyPoints),
    replace = TRUE),
  na.rm = TRUE)

[1] 50.47589

weighted median (petersenlab::wthdquantile()):

Code

petersenlab::wthdquantile(
  player_stats_seasonal$fantasyPoints,
  w = sample( # randomly generate weights (could specify them manually)
    x = 1:3,
    size = length(player_stats_seasonal$fantasyPoints),
    replace = TRUE),
  probs = 0.5)

[1] 29.17776

9.2.2 Spread

Indices to describe the spread (variability or dispersion) of a variable’s data include:

standard deviation ($s$ or $\sigma$)
variance ($s^2$ or $\sigma^2$)
range
minimum and maximum (or some other quantiles)
weighted quantiles
interquartile range (IQR)
median absolute deviation
coefficient of variation

The (sample) variance of $x$ (written as: $s^2$) is calculated as in Equation 9.2:

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} \tag{9.2}\]

where $x_i$ is each data point, $\bar{x}$ is the mean of $x$, and $n$ is the number of data points. The population variance divides by $n$ rather than $n-1$ and is written as $\sigma^2$.

The (sample) standard deviation of $x$ (written as: $s$) is calculated as in Equation 9.3:

\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} \tag{9.3}\]

The population standard deviation divides by $n$ rather than $n-1$ and is written as $\sigma$.

The range is calculated of $x$ is calculated as in Equation 9.4:

\[ \text{range} = \text{maximum} - \text{minimum} \tag{9.4}\]

The interquartile range (IQR) is calculated as in Equation 9.5:

\[ \text{IQR} = Q_3 - Q_1 \tag{9.5}\]

where $Q_3$ is the score at the third quartile (i.e., 75th percentile), and $Q_1$ is the score at the first quartile (i.e., 25th percentile).

The median absolute deviation (MAD) is the median of all deviations from the median, and is calculated as in Equation 9.6:

\[ \text{MAD} = \text{median}(|x_i - \tilde{x}|) \tag{9.6}\]

where $\tilde{x}$ is the median of $x$. Compared to the standard deviation, the median absolute deviation is more robust to outliers.

The coefficient of variation (CV) accounts for the fact that variables with larger means also tend to have larger standard deviations; it is calculated as the standard deviation divided by the mean, as in Equation 6.1.

If you want to give some values more weight to others, you can calculate weighted quantiles, while assigning a weight to each value. The petersenlab package (Petersen, 2025a) contains various functions for computing weighted quantiles based on Akinshin (2023). Because some projections are outliers, we use a trimmed version of the weighted Harrell-Davis quantile estimator for greater robustness.

Below is R code to estimate each:

standard deviation (stats::sd()):

Code

sd(player_stats_seasonal$fantasyPoints, na.rm = TRUE)

[1] 61.26828

variance (stats::var()):

Code

var(player_stats_seasonal$fantasyPoints, na.rm = TRUE)

[1] 3753.803

range (base::range()):

Code

range(player_stats_seasonal$fantasyPoints, na.rm = TRUE)

[1]  -7.28 485.10

Code

max(player_stats_seasonal$fantasyPoints, na.rm = TRUE) - min(player_stats_seasonal$fantasyPoints, na.rm = TRUE)

[1] 492.38

minimum (base::min()) and maximum (base::max()):

Code

min(player_stats_seasonal$fantasyPoints, na.rm = TRUE)

[1] -7.28

Code

max(player_stats_seasonal$fantasyPoints, na.rm = TRUE)

[1] 485.1

quantiles (in this case, the 10th and 90th quantiles; stats::quantile()):

Code

quantile(player_stats_seasonal$fantasyPoints, c(0.10, 0.90), na.rm = TRUE)

  10%   90% 
  2.0 120.2

weighted quantiles (in this case, the 10th and 90th quantiles; petersenlab::wthdquantile()):

Code

petersenlab::wthdquantile(
  player_stats_seasonal$fantasyPoints,
  w = sample( # randomly generate weights (could specify them manually)
    x = 1:3,
    size = length(player_stats_seasonal$fantasyPoints),
    replace = TRUE),
  probs = c(0.10, 0.90))

[1]   2.0000 120.1844

interquartile range (IQR; stats::IQR()):

Code

IQR(player_stats_seasonal$fantasyPoints, na.rm = TRUE)

[1] 61.48

median absolute deviation (stats::mad()):

Code

mad(player_stats_seasonal$fantasyPoints, na.rm = TRUE)

[1] 36.76848

coefficient of variation:

Code

sd(player_stats_seasonal$fantasyPoints, na.rm = TRUE) / mean(player_stats_seasonal$fantasyPoints, na.rm = TRUE)

[1] 1.213813

9.2.3 Shape

Indices to describe the shape of a variable’s data include:

skewness
kurtosis

The skewness represents the degree of asymmetry of a distribution around its mean. Positive skewness (right-skewed) reflects a longer or heavier right-tailed distribution, whereas negative skewness (left-skewed) reflects a longer or heavier left-tailed distribution. Fantasy points tend to be positively skewed.

The kurtosis reflects the extent of extreme (outlying) values in a distribution relative to a normal distribution (or bell curve). A mesokurtic distribution (with a kurtosis value near zero) reflects a normal amount of tailedness. Positive kurtosis values reflect a leptokurtic distribution, where there are heavier tails and a sharper peak than a normal distribution, indicating more frequent extreme values. Negative kurtosis values reflect a platykurtic distribution, where there are lighter tails and a flatter peak than a normal distribution, indicating fewer extreme values. Fantasy points tend to have a leptokurtic distribution.

Below is R code to estimate each using the psych::skew() and psych::kurtosis() functions of the psych package (Revelle, 2025):

skewness (psych::skew()):

Code

psych::skew(player_stats_seasonal$fantasyPoints, na.rm = TRUE)

[1] 2.234608

kurtosis (psych::kurtosi()):

Code

psych::kurtosi(player_stats_seasonal$fantasyPoints, na.rm = TRUE)

[1] 6.068551

9.2.4 Combination

To estimate multiple indices of center, spread, and shape of the data, you can use the following code:

Code

psych::describe(player_stats_seasonal["fantasyPoints"])

Code

player_stats_seasonal %>% 
  dplyr::select(age, years_of_experience, fantasyPoints) %>% 
  dplyr::summarise(across(
      everything(),
      .fns = list(
        n = ~ length(na.omit(.)),
        missingness = ~ mean(is.na(.)) * 100,
        M = ~ mean(., na.rm = TRUE),
        SD = ~ sd(., na.rm = TRUE),
        min = ~ min(., na.rm = TRUE),
        max = ~ max(., na.rm = TRUE),
        q10 = ~ quantile(., .10, na.rm = TRUE), # 10th quantile
        q90 = ~ quantile(., .90, na.rm = TRUE), # 90th quantile
        range = ~ max(., na.rm = TRUE) - min(., na.rm = TRUE),
        IQR = ~ IQR(., na.rm = TRUE),
        MAD = ~ mad(., na.rm = TRUE),
        CV = ~ sd(., na.rm = TRUE) / mean(., na.rm = TRUE),
        median = ~ median(., na.rm = TRUE),
        pseudomedian = ~ DescTools::HodgesLehmann(., na.rm = TRUE),
        mode = ~ petersenlab::Mode(., multipleModes = "mean"),
        skewness = ~ psych::skew(., na.rm = TRUE),
        kurtosis = ~ psych::kurtosi(., na.rm = TRUE)),
      .names = "{.col}.{.fn}")) %>%
    tidyr::pivot_longer(
      cols = everything(),
      names_to = c("variable","index"),
      names_sep = "\\.") %>% 
    tidyr::pivot_wider(
      names_from = index,
      values_from = value)

9.2.5 Interactive Visualization

You can use the following interactive graph to visualize various distributions according to a specified center (mean), spread (standard deviation), and shape (skewness and kurtosis). Try varying the standard deviation, skewness, and kurtosis to see the impact of each. The interactive graph was created using the shiny (Chang et al., 2024) and PearsonDS (Becker & Klößner, 2025) packages. The source code for interactive graph is here: https://github.com/isaactpetersen/DistributionVisualizer.

9.3 Scores and Scales

There are many different types of scores and scales. This book focuses on raw scores and z-scores. For information on other scores and scales, including percentile ranks, T-scores, standard scores, scaled scores, and stanine scores, see here: https://isaactpetersen.github.io/Principles-Psychological-Assessment/scoresScales.html#scoreTransformation (Petersen, 2025b).

9.3.1 Raw Scores

Raw scores are the original data on the original metric. Thus, raw scores are considered unstandardized. For example, raw scores that represent the players’ age may range from 20 to 40. Raw scores depend on the construct and unit; thus raw scores may not be comparable across variables.

9.3.2 z Scores

z scores have a mean of zero and a standard deviation of one. z scores are frequently used to render scores across variables more comparable. Thus, z scores are considered a form of a standardized score.

z scores are calculated using Equation 9.7:

\[ z = \frac{x - \bar{x}}{s_x} \tag{9.7}\]

where $x$ is the observed score, $\bar{x}$ is the mean observed score, and $s_x$ is the standard deviation of the observed scores.

You can easily convert a variable to a z score using the base::scale() function:

Code

scale(variable)

With a standard normal curve, 68% of scores fall within one standard deviation of the mean. 95% of scores fall within two standard deviations of the mean. 99.7% of scores fall within three standard deviations of the mean.

The area under a normal curve within one standard deviation of the mean is calculated below using the stats::pnorm() function, which calculates the cumulative density function for a normal curve.

Code

stdDeviations <- 1

pnorm(stdDeviations) - pnorm(stdDeviations * -1)

[1] 0.6826895

The area under a normal curve within one standard deviation of the mean is depicted in Figure 9.1.

Code

x <- seq(-4, 4, length = 200)
y <- dnorm(x, mean = 0, sd = 1)
plot(x, y, type = "l",
     xlab = "z Score",
     ylab = "Normal Density")

x <- seq(stdDeviations * -1, stdDeviations, length = 100)
y <- dnorm(x, mean = 0, sd = 1)
polygon(c(stdDeviations * -1, x, stdDeviations),
        c(0, y, 0),
        col = "blue")

Figure 9.1: Density of Standard Normal Distribution. The blue region represents the area within one standard deviation of the mean.

The area under a normal curve within two standard deviations of the mean is calculated below:

Code

stdDeviations <- 2

pnorm(stdDeviations) - pnorm(stdDeviations * -1)

[1] 0.9544997

The area under a normal curve within two standard deviations of the mean is depicted in Figure 9.2.

Code

x <- seq(-4, 4, length = 200)
y <- dnorm(x, mean = 0, sd = 1)
plot(x, y, type = "l",
     xlab = "z Score",
     ylab = "Normal Density")

x <- seq(stdDeviations * -1, stdDeviations, length = 100)
y <- dnorm(x, mean = 0, sd = 1)
polygon(c(stdDeviations * -1, x, stdDeviations),
        c(0, y, 0),
        col = "blue")

Figure 9.2: Density of Standard Normal Distribution. The blue region represents the area within two standard deviations of the mean.

The area under a normal curve within three standard deviations of the mean is calculated below:

Code

stdDeviations <- 3

pnorm(stdDeviations) - pnorm(stdDeviations * -1)

[1] 0.9973002

The area under a normal curve within three standard deviations of the mean is depicted in Figure 9.3.

Code

x <- seq(-4, 4, length = 200)
y <- dnorm(x, mean = 0, sd = 1)
plot(x, y, type = "l",
     xlab = "z Score",
     ylab = "Normal Density")

x <- seq(stdDeviations * -1, stdDeviations, length = 100)
y <- dnorm(x, mean = 0, sd = 1)
polygon(c(stdDeviations * -1, x, stdDeviations),
        c(0, y, 0),
        col = "blue")

Figure 9.3: Density of Standard Normal Distribution. The blue region represents the area within three standard deviations of the mean.

If you want to determine the z score associated with a particular percentile in a normal distribution, you can use the stats::qnorm() function. For instance, the z score associated with the 37th percentile is:

Code

qnorm(.37)

[1] -0.3318533

9.4 Inferential Statistics

Inferential statistics are used to draw inferences regarding whether there is (a) a difference in level on variable across groups or (b) an association between variables. For instance, inferential statistics may be used to evaluate whether Quarterbacks tend to have longer careers compared to Running Backs. Or, they could be used to evaluate whether number of carries is associated with injury likelihood. To apply inferential statistics, we make use of the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$).

9.4.1 Null Hypothesis Significance Testing

To draw statistical inferences, the frequentist statistics paradigm leverages null hypothesis significance testing. Frequentist statistics is the most widely used statistical paradigm. However, frequentist statistics is not the only statistical paradigm. Other statistical paradigms exist, including Bayesian statistics, which is based on Bayes’ theorem. This chapter focuses on the frequentist approach to hypothesis testing, known as null hypothesis significance testing. We discuss Bayesian statistics in Chapter 16.

9.4.1.1 Null Hypothesis ($H_0$)

When testing whether there are differences in level across groups on a variable of interest, the null hypothesis ($H_0$) is that there is no difference in level across groups. For instance, when testing whether Quarterbacks tend to have longer careers compared to Running Backs, the null hypothesis ($H_0$) is that Quarterbacks do not systematically differ from Running Backs in the length of their career.

When testing whether there is an association between variables, the null hypothesis ($H_0$) is that there is no association between the variables. For instance, when testing whether number of carries is associated with injury likelihood, the null hypothesis ($H_0$) is that there is no association between number of carries and injury likelihood.

9.4.1.2 Alternative Hypothesis ($H_1$)

The alternative hypothesis ($H_1$) is the researcher’s hypothesis that they want to evaluate. An alternative hypothesis ($H_1$) might be directional (i.e., one-sided) or non-directional (i.e., two-sided).

Directional hypotheses specify a particular direction, such as which group will have larger scores or which direction (positive or negative) two variables will be associated. Examples of directional hypotheses include:

Quarterbacks have longer careers compared to Running Backs
Number of carries is positively associated with injury likelihood

Non-directional hypotheses do not specify a particular direction. For instance, non-directional hypotheses may state that two groups differ but do not specify which group will have larger scores. Or, non-directional hypotheses may state that two variables are associated but do not state what the sign is of the association—i.e., positive or negative. Examples of non-directional hypotheses include:

Quarterbacks differ in the length of their careers compared to Running Backs
Number of carries is associated with injury likelihood

9.4.1.3 Statistical Significance

Statistical significance is the reliability (i.e., consistency) of an effect, and it is evaluated with the p-value. The p-value does not represent the probability that you observed the result by chance. The p-value represents a conditional probability—it examines the probability of one event given another event. In particular, the p-value evaluates the likelihood that you would detect a result as at least as extreme as the one observed (in terms of the magnitude of the difference or of the association) given that the null hypothesis ($H_0$) is true.

This can be expressed in conditional probability notation, $P(A | B)$, which is the probability (likelihood) of event A occurring given that event B occurred (or given condition B).

The conditional probability notation for a left-tailed directional test (i.e., Quarterbacks have shorter careers than Running Backs; or number of carries is negatively associated with injury likelihood) is in Equation 9.8.

\[ p\text{-value} = P(T \le t | H_0) \tag{9.8}\]

where $T$ is the test statistic of interest (e.g., the distribution of $t$-, $r-$, or $F$ values, depending on the test) and $t$ is the observed test statistic (e.g., $t$-, $r-$, or $F$-coefficient, depending on the test).

The conditional probability notation for a right-tailed directional test (i.e., Quarterbacks have longer careers than Running Backs; or number of carries is positively associated with injury likelihood) is in Equation 9.9.

\[ p\text{-value} = P(T \ge t | H_0) \tag{9.9}\]

The conditional probability notation for a two-tailed non-directional test (i.e., Quarterbacks differ in the length of their careers compared to Running Backs; or number of carries is associated with injury likelihood) is in Equation 9.10.

\[ p\text{-value} = 2 \times \text{min}(P(T \le t | H_0), P(T \ge t | H_0)) \tag{9.10}\]

where min(a, b) is the smaller number of a and b.

If the distribution of the test statistic is symmetric around zero, the p-value for the two-tailed non-directional test simplifies to Equation 9.11.

\[ p\text{-value} = 2 \times P(T \ge |t| | H_0) \tag{9.11}\]

Nevertheless, to be conservative (i.e., to avoid false positive/Type I errors), many researchers use two-tailed p-values regardless whether their hypothesis is one- or two-tailed.

For a test of group differences, the p-value evaluates the likelihood that you would observe a difference as large or larger than the one you observed between the groups if there were no systematic difference between the groups in the population, as depicted in Figure 9.4. For instance, when evaluating whether Quarterbacks have longer careers than Running Backs, and you observed a mean difference of 0.03 years, the p-value evaluates the likelihood that you would observe a difference as large or larger than 0.03 years between the groups if, in truth among all Quarterbacks and Running Backs in the NFL, Quarterbacks do not differ from Running Backs in terms of the length of their career.

Code

set.seed(52242)

nObserved <- 1000
nPopulation <- 1000000

observedGroups <- data.frame(
  score = c(rnorm(nObserved, mean = 47, sd = 3), rnorm(nObserved, mean = 52, sd = 3)),
  group = as.factor(c(rep("Group 1", nObserved), rep("Group 2", nObserved)))
)

populationGroups <- data.frame(
  score = c(rnorm(nPopulation, mean = 50, sd = 3.03), rnorm(nPopulation, mean = 50, sd = 3)),
  group = as.factor(c(rep("Group 1", nPopulation), rep("Group 2", nPopulation)))
)

ggplot2::ggplot(
  data = observedGroups,
  mapping = aes(
    x = score,
    fill = group,
    color = group
  )
) +
  geom_density(alpha = 0.5) +
  scale_color_manual(values = c("red", "blue")) +
  scale_fill_manual(values = c("red","blue")) +
  geom_vline(xintercept = mean(observedGroups$score[which(observedGroups$group == "Group 1")])) +
  geom_vline(xintercept = mean(observedGroups$score[which(observedGroups$group == "Group 2")])) +
  ggplot2::labs(
    x = "Score",
    y = "Frequency",
    title = "What is the probability my data would look like this..."
  ) +
  ggplot2::theme_classic(
    base_size = 16) +
  ggplot2::theme(
    legend.title = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    #plot.title.position = "plot"
    legend.position = "inside",
    legend.margin = margin(0, 0, 0, 0),
    legend.justification.top = "left",
    legend.justification.left = "top",
    legend.justification.bottom = "right",
    legend.justification.inside = c(1, 1),
    legend.location = "plot")

ggplot2::ggplot(
  data = populationGroups,
  mapping = aes(
    x = score,
    fill = group,
    color = group
  )
) +
  geom_density(alpha = 0.5) +
  scale_color_manual(values = c("red", "blue")) +
  scale_fill_manual(values = c("red","blue")) +
  geom_vline(xintercept = mean(populationGroups$score[which(populationGroups$group == "Group 1")])) +
  geom_vline(xintercept = mean(populationGroups$score[which(populationGroups$group == "Group 2")])) +
  ggplot2::labs(
    x = "Score",
    y = "Frequency",
    title = "...if in the population, the groups were really this:"
  ) +
  ggplot2::theme_classic(
    base_size = 16) +
  ggplot2::theme(
    legend.title = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    #plot.title.position = "plot",
    legend.position = "inside",
    legend.margin = margin(0, 0, 0, 0),
    legend.justification.top = "left",
    legend.justification.left = "top",
    legend.justification.bottom = "right",
    legend.justification.inside = c(1, 1),
    legend.location = "plot")

Interpretation of *p*-Values When Examining The Differences Between Groups. The vertical black lines reflect the group means. — (a) What is the probability my data would look like this…

For a test of whether two variables are associated, the p-value evaluates the likelihood that you would observe an association as strong or stronger than the one you observed if there were no actual association between the variables in the population, as depicted in Figure 9.5. For instance, when evaluating whether number of carries is positively associated with injury likelihood, and you observed a correlation coefficient of $r = .25$ between number of carries and injury likelihood, the p-value evaluates the likelihood that you would observe a correlation as strong or stronger than $r = .25$ between the variables if, in truth among all NFL Running Backs, number of carries is not associated with injury likelihood.

Code

set.seed(52242)

observedCorrelation <- 0.9

correlations <- data.frame(criterion = rnorm(2000))
correlations$sample <- NA
correlations$sample[1:100] <- complement(correlations$criterion[1:100], observedCorrelation)
correlations$population <- complement(correlations$criterion, 0)

ggplot2::ggplot(
  data = correlations,
  mapping = aes(
    x = sample,
    y = criterion
  )
) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_continuous(
    limits = c(-3.5,3)
  ) +
  annotate(
    x = 0,
    y = 4,
    label = paste("italic(r) != ", 0, sep = ""),
    parse = TRUE,
    geom = "text",
    size = 7) + 
  labs(
    x = "Predictor Variable",
    y = "Outcome Variable",
    title = "What is the probability my data would look like this..."
  ) +
  theme_classic(
    base_size = 16) +
  theme(
    legend.title = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank())

ggplot2::ggplot(
  data = correlations,
  mapping = aes(
    x = population,
    y = criterion
  )
) +
  geom_point() +
  geom_smooth(
    method = "lm",
    se = FALSE) +
  scale_x_continuous(
    limits = c(-2.5,2.5)
  ) +
  annotate(
    x = 0,
    y = 4,
    label = paste("italic(r) == '", "0.00", "'", sep = ""),
    parse = TRUE,
    geom = "text",
    size = 7) + 
  labs(
    x = "Predictor Variable",
    y = "Outcome Variable",
    title = "...if in the population, the association was really this:"
  ) +
  theme_classic(
    base_size = 16) +
  theme(
    legend.title = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank())

Interpretation of *p*-Values When Examining The Association Between Variables. — (a) What is the probability my data would look like this…

Using what is called null-hypothesis significance testing (NHST), we consider an effect to be statistically significant if the p-value is less than some threshold, called the alpha level. In science, we typically want to be conservative because a false positive (i.e., Type I error) is considered more problematic than a false negative (i.e., Type II error). That is, we would rather say an effect does not exist when it really does than to say an effect does exist when it really does not. Thus, we typically set the alpha level to a low value, commonly .05. Then, we would consider an effect to be statistically significant if the p-value is less than .05. That is, there is a small chance (5%; or 1 in 20 times) that we would observe an effect at least as extreme as the effect observed, if the null hypothesis were true. So, you might expect around 5% of tests where the null hypothesis is true to be statistically significant just by chance. We could lower the rate of Type II (i.e., false negative) errors—i.e., we could detect more effects—if we set the alpha level to a higher value (e.g., .10); however, raising the alpha level would raise the possibility of Type I (false positive) errors.

If the p-value is less than .05, we reject the null hypothesis ($H_0$) that there was no difference or association. Thus, we conclude that there was a statistically significant (non-zero) difference or association. If the p-value is greater than .05, we fail to reject the null hypothesis; the difference/association was not statistically significant. Thus, we do not have confidence that there was a difference or association. However, we do not accept the null hypothesis; it could be there we did not observe an effect because we did not have adequate power to detect the effect—e.g., if the effect size was small, the data were noisy, and the sample size was small and/or unrepresentative.

There are four general possibilities of decision making outcomes when performing null-hypothesis significance testing:

We (correctly) reject the null hypothesis when it is in fact false ($1 - \beta$). This is a true positive. For instance, we may correctly determine that Quarterbacks have longer careers than Running Backs.
We (correctly) fail to reject the null hypothesis when it is in fact true ($1 - \alpha$). This is a true negative. For instance, we may correctly determine that Quarterbacks do not have longer careers than Running Backs.
We (incorrectly) reject the null hypothesis when it is in fact true ($\alpha$). This is a false positive. When performing null hypothesis testing, a false positive is known as a Type I error. For instance, we may incorrectly determine that Quarterbacks have longer careers than Running Backs when, in fact, Quarterbacks and Running Backs do not differ in their career length.
We (incorrectly) fail to reject the null hypothesis when it is in fact false ($\beta$). This is a false negative. When performing null hypothesis testing, a false negative is known as a Type II error. For instance, we may incorrectly determine that Quarterbacks and Running Backs do not differ in their career length when, in fact, Quarterbacks have longer careers than Running Backs.

A two-by-two confusion matrix for null-hypothesis significance testing is in Figure 9.6.

In statistics, power is the probability of detecting an effect, if, in fact, the effect exists. Otherwise said, power is the probability of rejecting the null hypothesis, if, in fact, the null hypothesis is false. Power is influenced by several variables:

the sample size (N): the larger the N, the greater the power
- for group comparisons, the power depends on the sample size of each group
the effect size: the larger the effect, the greater the power
- for group comparisons, larger effect sizes reflect:
  - larger between-group variance, and
  - smaller within-group variance (i.e., strong measurement precision, i.e., reliability)
the alpha level: the researcher specifies the alpha level (though it is typically set at .05); the higher the alpha level, the greater the power; however, the higher we set the alpha level, the higher the likelihood of Type I errors (false positives)
one- versus two-tailed tests: one-tailed tests have higher power than two-tailed tests
within-subject versus between-subject comparisons: within-subject designs tend to have greater power than between-subject designs

A plot of statistical power is in Figure 9.7, as adapted from Magnusson (2013) (archived at https://perma.cc/FG3J-85L6).

Code

m1 <- 0  # mu H0
sd1 <- 1.5 # sigma H0
m2 <- 3.5 # mu HA
sd2 <- 1.5 # sigma HA
 
z_crit <- qnorm(1-(0.05/2), m1, sd1)
 
# set length of tails
min1 <- m1-sd1*4
max1 <- m1+sd1*4
min2 <- m2-sd2*4
max2 <- m2+sd2*4
# create x sequence
x <- seq(min(min1,min2), max(max1, max2), .01)
# generate normal dist #1
y1 <- dnorm(x, m1, sd1)
# put in data frame
df1 <- data.frame("x" = x, "y" = y1)
# generate normal dist #2
y2 <- dnorm(x, m2, sd2)
# put in data frame
df2 <- data.frame("x" = x, "y" = y2)
 
# Alpha polygon
y.poly <- pmin(y1,y2)
poly1 <- data.frame(x=x, y=y.poly)
poly1 <- poly1[poly1$x >= z_crit, ]
poly1<-rbind(poly1, c(z_crit, 0))  # add lower-left corner
 
# Beta polygon
poly2 <- df2
poly2 <- poly2[poly2$x <= z_crit,]
poly2<-rbind(poly2, c(z_crit, 0))  # add lower-left corner
 
# power polygon; 1-beta
poly3 <- df2
poly3 <- poly3[poly3$x >= z_crit,]
poly3 <-rbind(poly3, c(z_crit, 0))  # add lower-left corner
 
# combine polygons
poly1$id <- 3 # alpha, give it the highest number to make it the top layer
poly2$id <- 2 # beta
poly3$id <- 1 # power; 1 - beta
poly <- rbind(poly1, poly2, poly3)
poly$id <- factor(poly$id,  labels = c("power", "beta", "alpha"))

# plot with ggplot2
ggplot(poly, aes(x,y, fill=id, group=id)) +
  geom_polygon(show.legend=F, alpha=I(8/10)) +
  # add line for treatment group
  geom_line(data=df1, aes(x, y, color = "H0", group = NULL, fill = NULL), linewidth = 1.5, show_guide = FALSE) + 
  # add line for treatment group. These lines could be combined into one dataframe.
  geom_line(data=df2, aes(color = "HA", group = NULL, fill = NULL), linewidth = 1.5, show_guide = FALSE) +
  # add vlines for z_crit
  geom_vline(xintercept = z_crit, linewidth=1, linetype="dashed") +
  # change colors
  scale_color_manual(
    "Group",
    values = c("HA" = "#981e0b", "H0" = "black")) +
  scale_fill_manual("test", values= c("alpha" = "#0d6374", "beta" = "#be805e", "power"="#7cecee")) +
  # beta arrow
  annotate("segment", x = 0.1, y = 0.045, xend = 1.3, yend = 0.01, arrow = arrow(length = unit(0.3, "cm")), linewidth = 1) +
  annotate("text", label="beta", x=0, y=0.05, parse=T, size=8) +
  # alpha arrow
  annotate("segment", x = 4, y = 0.043, xend = 3.4, yend = 0.01, arrow = arrow(length = unit(0.3, "cm")), linewidth = 1) +
  annotate("text", label="frac(alpha,2)", x=4.2, y=0.05, parse=T, size=8) +
  # power arrow
  annotate("segment", x = 6, y = 0.2, xend = 4.5, yend = 0.15, arrow = arrow(length = unit(0.3, "cm")), linewidth = 1) +
  annotate("text", label = expression(paste(1-beta, "  (\"power\")")), x = 6.1, y = 0.21, parse = TRUE, size = 8) +
  # H_0 title
  annotate("text", label = "H[0]", x = m1, y = 0.28, parse = TRUE, size = 8) +
  # H_a title
  annotate("text", label = "H[1]", x = m2, y = 0.28, parse = TRUE, size = 8) +
  ggtitle("Statistical Power") +
  # remove some elements
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major = element_blank(),
    panel.background = element_blank(),
    plot.background = element_rect(fill="white"),
    panel.border = element_blank(),
    axis.line = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    plot.title = element_text(size=22))

Statistical Power (Adapted from Kristoffer Magnusson: <https://rpsychologist.com/creating-a-typical-textbook-illustration-of-statistical-power-using-either-ggplot-or-base-graphics>; archived at <https://perma.cc/FG3J-85L6>). The dashed line represents the critical value or threshold. — Figure 9.7: Statistical Power (Adapted from Kristoffer Magnusson: https://rpsychologist.com/creating-a-typical-textbook-illustration-of-statistical-power-using-either-ggplot-or-base-graphics; archived at https://perma.cc/FG3J-85L6). The dashed line represents the critical value or threshold.

Interactive visualizations by Kristoffer Magnusson on p-values and null-hypothesis significance testing are below:

https://rpsychologist.com/pvalue [Magnusson (2021); archived at https://perma.cc/JP9F-9ZVY]
https://rpsychologist.com/d3/pdist [Magnusson (2015); archived at https://perma.cc/BE96-8LSJ]
https://rpsychologist.com/d3/nhst [Magnusson (2014); archived at https://perma.cc/ZU9A-37F3]

Twelve misconceptions about p-values (Goodman, 2008) are in Table 9.1.

Table 9.1: Twelve Misconceptions About p-Values from Goodman (2008). Goodman also provides a discussion about why each statement is false.

Number	Misconception
1	If $p = .05$, the null hypothesis has only a 5% chance of being true.
2	A nonsignificant difference (eg, $p > .05$) means there is no difference between groups.
3	A statistically significant finding is clinically important.
4	Studies with $p$-values on opposite sides of .05 are conflicting.
5	Studies with the same $p$-value provide the same evidence against the null hypothesis.
6	$p = .05$ means that we have observed data that would occur only 5% of the time under the null hypothesis.
7	$p = .05$ and $p < .05$ mean the same thing.
8	$p$-values are properly written as inequalities (e.g., “$p \le .05$” when $p = .015$).
9	$p = .05$ means that if you reject the null hypothesis, the probability of a Type I error is only 5%.
10	With a $p = .05$ threshold for significance, the chance of a Type I error will be 5%.
11	You should use a one-sided $p$-value when you don’t care about a result in one direction, or a difference in that direction is impossible.
12	A scientific conclusion or treatment policy should be based on whether or not the $p$-value is significant.

That is, the p-value is not:

the probability that the effect was due to chance
the probability that the null hypothesis is true
the size of the effect
the importance of the effect
whether the effect is true, real, or causal

Statistical significance involves the consistency of an effect/association/difference; it suggests that the association/difference is reliably non-zero. However, just because something is statistically significant does not mean that it is important. For instance, consider that we discover that players who consume sports drink before a game tend to perform better than players who do not ($p < .05$). However, what if consumption of sports drinks is associated with an average improvement of 0.002 points per game. A small effect such as this might be detectable with a large sample size. This effect would be considered to be reliable/consistent because it is statistically significant. However, such an effect is so small that it results in differences that are not practically important. Thus, in addition to statistical significance, it is also important to consider practical significance.

9.4.2 Practical Significance

Practical significance deals with how large or important the effect/association/difference is. It is based on the magnitude of the effect, called the effect size. Effect size can be quantified in various ways including:

Cohen’s $d$
Standardized regression coefficient (beta; $\beta$)
Correlation coefficient ($r$)
Cohen’s $\omega$ (omega)
Cohen’s $f$
Cohen’s $f^2$
Coefficient of determination ($R^2$)
Eta squared ($\eta^2$)
Partial eta squared ($\eta_p^2$)

9.4.2.1 Cohen’s $d$

Cohen’s $d$ is commonly used for evaluating the magnitude of group differences. Cohen’s $d$ is calculated as in Equation 9.12:

\[ \begin{aligned} d &= \frac{\text{mean difference}}{\text{pooled standard deviation}} \\ &= \frac{\bar{x_1} - \bar{x_2}}{s} \\ \end{aligned} \tag{9.12}\]

where $\bar{x_1}$ and $\bar{x_2}$ is the mean for group 1 and group 2, respectively, and:

\[ s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \tag{9.13}\]

where $n_1$ and $n_2$ is the sample size of group 1 and group 2, respectively, and $s_1$ and $s_2$ is the standard deviation of group 1 and group 2, respectively.

Cohen’s $d$ is interpreted as the mean of group 1 was _ standard deviation units [larger/smaller] than the mean of group 2.

9.4.2.2 Standardized Regression Coefficient (Beta; $\beta$)

The standardized regression coefficient (beta; $\beta$) is used in multiple regression, and is calculated as in Equation 9.14:

\[ \beta_x = B_x \times \frac{s_x}{s_y} \tag{9.14}\]

where $B_x$ is the unstandardized regression coefficient of the predictor variable $x$ in predicting the outcome variable $y$, $s_x$ is the standard deviation of $x$, and $s_y$ is the standard deviation of $y$. The standardized regression coefficient is interpreted such that, for every 1 standard deviation increase in $x$, there was a _ standard deviation [increase/decrease] in $y$.

9.4.2.3 Correlation Coefficient ($r$)

The formula for the correlation coefficient is in Equation 9.22.

9.4.2.4 Cohen’s $\omega$

Cohen’s $\omega$ is used in chi-square tests, and is calculated as in Equation 9.15:

\[ \omega = \sqrt{\frac{\chi^2}{N} - \frac{df}{N}} \tag{9.15}\]

where $\chi^2$ is the chi-square statistic from the test, $N$ is the sample size, and $df$ is the degrees of freedom.

9.4.2.5 Cohen’s $f$

Cohen’s $f$ is commonly used in ANOVA, and is calculated as in Equation 9.16:

\[ \begin{aligned} f &= \sqrt{\frac{R^2}{1 - R^2}} \\ &= \sqrt{\frac{\eta^2}{1 - \eta^2}} \end{aligned} \tag{9.16}\]

9.4.2.6 Cohen’s $f^2$

Cohen’s $f^2$ is commonly used in regression, and is calculated as in Equation 9.17:

\[ \begin{aligned} f^2 &= \frac{R^2}{1 - R^2} \\ &= \frac{\eta^2}{1 - \eta^2} \end{aligned} \tag{9.17}\]

To calculate the effect size of a particular predictor, you can calculate $\Delta f^2$ as in Equation 9.18:

\[ \begin{aligned} \Delta f^2 &= \frac{R^2_{\text{model}} - R^2_{\text{reduced}}}{1 - R^2_{\text{model}}} \\ &= \frac{\eta^2_{\text{model}} - \eta^2_{\text{reduced}}}{1 - \eta^2_{\text{model}}} \end{aligned} \tag{9.18}\]

where $R^2_{\text{model}}$ is the $R^2$ of the model with the predictor variable of interest and $R^2_{\text{reduced}}$ is the $R^2$ of the model without the predictor variable of interest.

9.4.2.7 Coefficient of Determination ($R^2$)

The coefficient of determination ($R^2$) reflects the proportion of variance in the outcome variable that is explained by the predictor variable(s). $R^2$ is commonly used in regression, and is calculated as in Equation 9.19:

\[ \begin{aligned} R^2 &= 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} \\ &= 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}} \\ &= 1 - \frac{\text{sum of squared residuals}}{\text{total sum of squares}} \\ &= \frac{f^2}{1 + f^2} \\ &= \eta^2 \\ &= \frac{\text{variance explained in }y}{\text{total variance in }y} \end{aligned} \tag{9.19}\]

where $y$ is the outcome variable, $y_i$ is the observed value of the outcome variable for the $i$th observation, $\hat{y}_i$ is the model predicted value for the $i$th observation, and $\bar{y}$ is the mean of the observed values of the outcome variable. The total sum of squares is an index of the total variation in the outcome variable.

The coefficient of determination typically ranges from 0 to 1, but it can be negative. A negative coefficient of determination indicates that the predicted values are less accurate than using the mean of the actual values as the predicted values (i.e., predicting the mean of the observed values for every case).

9.4.2.8 Eta Squared ($\eta^2$) and Partial Eta Squared ($\eta_p^2$)

Like $R^2$, eta squared ($\eta^2$) reflects the proportion of variance in the dependent variable that is explained by the independent variable(s). $\eta^2$ is commonly used in ANOVA, and is calculated as in Equation 9.20:

\[ \begin{aligned} \eta^2 &= \frac{SS_{\text{effect}}}{SS_{\text{total}}} \\ &= 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}} \\ &= 1 - \frac{\text{sum of squared residuals}}{\text{total sum of squares}} \\ &= \frac{f^2}{1 + f^2} \\ &= R^2 \end{aligned} \tag{9.20}\]

where $SS_{\text{effect}}$ is the sum of squares for the effect of interest and $SS_{\text{total}}$ is the total sum of squares.

Partial eta squared ($\eta_p^2$) reflects the proportion of variance in the dependent variable that is explained by the independent variable while controlling for the other independent variables. $\eta_p^2$ is commonly used in ANOVA, and is calculated as in Equation 9.21:

\[ \eta_p^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}} \tag{9.21}\]

where $SS_{\text{effect}}$ is the sum of squares for the effect of interest and $SS_{\text{error}}$ is the sum of squares for the residual error term.

9.4.2.9 Effect Size Thresholds

Effect size thresholds (Cohen, 1988; McGrath & Meyer, 2006) for small, medium, and large effect sizes are in Table 9.2.

Table 9.2: Effect Size Thresholds for Small, Medium, and Large Effect Sizes.

Effect Size Index	Small	Medium	Large
Cohen’s $d$	$\ge \|.20\|$	$\ge \|.50\|$	$\ge \|.80\|$
Standardized regression coefficient (beta; $\beta$)	$\ge \|.10\|$	$\ge \|.24\|$	$\ge \|.37\|$
Correlation coefficient ($r$)	$\ge \|.10\|$	$\ge \|.24\|$	$\ge \|.37\|$
Cohen’s $\omega$	$\ge .10$	$\ge .30$	$\ge .50$
Cohen’s $f$	$\ge .10$	$\ge .25$	$\ge .40$
Cohen’s $f^2$	$\ge .01$	$\ge .06$	$\ge .16$
Coefficient of determination ($R^2$)	$\ge .01$	$\ge .06$	$\ge .14$
Eta squared ($\eta^2$)	$\ge .01$	$\ge .06$	$\ge .14$
Partial eta squared ($\eta_p^2$)	$\ge .01$	$\ge .06$	$\ge .14$

9.5 Statistical Decision Tree

A statistical decision tree is a flowchart or decision tree that depicts which statistical test to use given the purpose of analysis, the type of data, etc. An example statistical decision tree is depicted in Figure 9.8.

A Statistical Decision Tree For Choosing an Appropriate Statistical Procedure. Adapted from: <https://commons.wikimedia.org/wiki/File:InferentialStatisticalDecisionMakingTrees.pdf>. The original source is: Corston, R. & Colman, A. M. (2000). *A crash course in SPSS for Windows*. Wiley-Blackwell. Changes were made to the original, including the addition of several statistical tests. — Figure 9.8: A Statistical Decision Tree For Choosing an Appropriate Statistical Procedure. Adapted from: https://commons.wikimedia.org/wiki/File:InferentialStatisticalDecisionMakingTrees.pdf. The original source is: Corston, R. & Colman, A. M. (2000). *A crash course in SPSS for Windows*. Wiley-Blackwell. Changes were made to the original, including the addition of several statistical tests. *Note*: “Interval” as a level of measurement includes data with an “interval” or higher level of measurement; thus, it also includes data with a “ratio” level of measurement.

This statistical decision tree can be generally summarized such that associations are examined with the correlation/regression family, and differences are examined with the t-test/ANOVA family, as depicted in Figure 9.9.

Figure 9.9: Summary of A Statistical Decision Tree For Choosing an Appropriate Statistical Procedure.

However, many statistical tests can be re-formulated in a regression framework, as in Figure 9.10.

A Statistical Decision Tree For Choosing an Appropriate Statistical Procedure, Re-Formulated in a Regression Framework. Adapted from: <https://commons.wikimedia.org/wiki/File:InferentialStatisticalDecisionMakingTrees.pdf>. The original source is: Corston, R. & Colman, A. M. (2000). *A crash course in SPSS for Windows*. Wiley-Blackwell. Changes were made to the original, including re-formulating the tests in a regression framework. — Figure 9.10: A Statistical Decision Tree For Choosing an Appropriate Statistical Procedure, Re-Formulated in a Regression Framework. Adapted from: https://commons.wikimedia.org/wiki/File:InferentialStatisticalDecisionMakingTrees.pdf. The original source is: Corston, R. & Colman, A. M. (2000). *A crash course in SPSS for Windows*. Wiley-Blackwell. Changes were made to the original, including re-formulating the tests in a regression framework.

Both associations and differences can be examined with the regression family, which greatly simplifies our summary of the statistical decision tree, as depicted in Figure 9.11.

Thus, in most cases, the regression framework can be used to examine most questions regarding associations between variables or differences between groups.

Jackson-Wood (2017) provides an online, interactive statistical decision tree to help you decide which statistical analysis to use: https://www.statsflowchart.co.uk

9.6 Statistical Tests

9.6.1 t-Test

There are several t-tests:

one-sample t-test
two-samples t-test
- independent samples t-test
- paired samples t-test

A one-sample t-test is used to evaluate whether a sample mean differs systematically from a particular value. The null hypothesis is that the sample mean does not differ systematically from the pre-specified value. The alternative hypothesis is that the sample mean differs systematically from the pre-specified value. For instance, let’s say you want to test out a new draft strategy. You could participate in a mock draft and draft players using the new strategy. Then, you could use a one-sample t-test to evaluate whether your new draft strategy yields players with more projected points than the average of players’ projected points for other teams.

Two-samples t-tests are used to test for differences between scores of two groups. If the two groups are independent, the independent samples t-test is used. If the two groups involve paired samples, the paired samples t-test is used. The null hypothesis is that the mean of group 1 does not differ systematically from the mean of group 2. The alternative hypothesis is that the mean of group 1 differs systematically from the mean of group 2. For instance, you could use an independent-samples t-test if you want to examine whether Quarterbacks tend to have have longer careers than Running Backs. By contrast, you could use a paired samples t-test if you want to examine whether Quarterbacks tend to score more points in the second year of their contract compared to their rookie year, because the same subjects were assessed twice (i.e., a within-subject design).

9.6.2 Analysis of Variance

Analysis of variance (ANOVA) allows examining whether groups differ systematically as a function of one or more factors. There are multiple variants of ANOVA:

one-way ANOVA
factorial ANOVA
repeated measures ANOVA (RM-ANOVA)
multivariate ANOVA (MANOVA)

Like two-samples t-tests, ANOVA allows examining whether groups differ as a function of an independent variable. However, unlike a t-test, ANOVA allows examining multiple independent variables and more than two groups. The null hypothesis is that the the groups’ mean value does not differ systematically. The alternative hypothesis is that the groups’ mean value differs systematically.

A one-way ANOVA examines whether two or more groups differ as a function of an independent variable. For instance, you could use a one-way ANOVA to evaluate if you want to evaluate whether multiple positions differ in their length of career. Factorial ANOVA examines whether two or more groups differ as a function of multiple independent variables. For instance, you could use factorial ANOVA to evaluate whether one’s length of career depends on one’s position and weight. Repeated measures ANOVA examines whether scores differ across repeated measures (e.g., across time) for the same participants. For instance, you could use repeated-measures ANOVA to evaluate whether rookies score more points as the season progresses. Multivariate ANOVA examines whether multiple dependent variables differ as a function of one or more factor(s). For instance, you could use MANOVA to evaluate whether one’s contract length and pay differ as a function of one’s position.

9.6.3 Correlation

Correlation examines the association between a predictor and outcome variable. The null hypothesis is that the the two variables are not associated. The alternative hypothesis is that the two variables are associated.

The Pearson correlation coefficient ($r$) is calculated as in Equation 9.22:

\[ \begin{aligned} r &= \frac{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2}} && \quad \text{the numerator is the covariance; the denominator is the standard deviations} \\ &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} && \quad \text{$n-1$ cancels out in numerator and denominator (typical formula for Pearson correlation)} \\ &= \frac{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{s_x \cdot s_y} && \quad \text{adding back $n-1$, replacing denominator from line 1 with standard deviation symbols} \\ &= \frac{\text{Cov}(x, y)}{s_x \cdot s_y} && \quad \text{replacing numerator with covariance notation} \end{aligned} \tag{9.22}\]

where $x$ is the predictor variable, $y$ is the outcome variable, $\text{Cov}(x, y)$ is the covariance between $x$ and $y$, $s_x$ is the standard deviation of $x$, and $s_y$ is the standard deviation of $y$.

9.6.4 (Multiple) Regression

Regression, like correlation, examines the association between a predictor and outcome variable. However, unlike correlation, regression allows multiple predictor variables.

Regression with a single predictor takes the form in Equation 11.1. A regression line is depicted in Figure 11.29. Multiple regression (i.e., regression with multiple predictors) takes the form in Equation 11.2.

The null hypothesis is that the the predictor variable(s) are not associated with the outcome variable. The alternative hypothesis is that the predictor variable(s) are associated with the outcome variable.

9.6.5 Chi-Square Test

There are two primary types of chi-square tests:

chi-square goodness-of-fit test
chi-square test for association (aka test of independence)

The chi-square goodness-of-fit test evaluates whether a set of categorical data came from a specified distribution. The null hypothesis is that the data came from the specified distribution. The alternative hypothesis is that the data did not come from the specified distribution.

The chi-square test for association evaluates whether two categorical variables are associated. The null hypothesis is that the two variables are not associated. The alternative hypothesis is that the two variables are associated.

9.6.6 Formulating Statistical Tests in Terms of Partitioned Variance

Many statistical tests can be formulated in terms of partitioned variance.

For instance, the t statistic from the independent-samples t-test and the F statistic from ANOVA can be thought of as the ratio of between-group variance to within-group variance, as in Equation 9.23:

\[ t \text{ or } F = \frac{\text{between-group variance}}{\text{within-group variance}} \tag{9.23}\]

The correlation coefficient can be thought of as the ratio of shared variance (i.e., covariance) to total variance, as in Equation 9.24:

\[ r = \frac{\text{shared variance}}{\text{total variance}} \tag{9.24}\]

The coefficient of determination ($R^2$) is the proportion of variance in the outcome variable that is explained by the predictor variables. $\eta^2$ is the proportion of variance in the dependent variable that is explained by the independent variables. The coefficient of determination and $\eta^2$ can be expressed as the ratio of variance explained in the outcome or dependent variable to the total variance in the outcome or dependent variable, as in Equation 9.25:

\[ R^2 \text{ or } \eta^2 = \frac{\text{variance explained in the outcome variable}}{\text{total variance in the outcome variable}} \tag{9.25}\]

9.6.7 Critical Value

The critical value is the test value for a given test, above which the effect is considered to be statistically significant. The critical value for statistical significance for each test can be determined based on the degrees of freedom and alpha level. The degrees of freedom (df) refer to the number of values in the calculation of a test statistic that are free to vary.

Code

alpha <- .05
N <- 200
nGroup1 <- 150
nGroup2 <- 150
numGroups <- 4
numLevelsFactorA <- 3
numLevelsFactorB <- 4
numMeasurements <- 4
numPredictors <- 5
numCategories <- 6
numRows <- 5
numColumns <- 2

9.6.7.1 One-Sample t-Test

For a one-sample t-test, the degrees of freedom is in Equation 9.26:

\[ df = N - 1 \tag{9.26}\]

where $N$ is sample size.

Code

df_oneSampleTtest <- N - 1

One-tailed test (stats::qt()):

Code

qt(1 - alpha, df_oneSampleTtest)

[1] 1.652547

Two-tailed test (stats::qt()):

Code

qt(1 - alpha/2, df_oneSampleTtest)

[1] 1.971957

9.6.7.2 Independent-Samples t-Test

For an independent-samples t-test, the degrees of freedom is in Equation 9.27:

\[ df = n_1 + n_2 - 2 \tag{9.27}\]

where $n_1$ is the sample size of group 1 and $n_2$ is the sample size of group 2.

Code

df_independentSamplesTtest <- nGroup1 + nGroup2 - 2

One-tailed test (stats::qt()):

Code

qt(1 - alpha, df_independentSamplesTtest)

[1] 1.649983

Two-tailed test (stats::qt()):

Code

qt(1 - alpha/2, df_independentSamplesTtest)

[1] 1.967957

9.6.7.3 Paired-Samples t-Test

For a paired-samples t-test, the degrees of freedom is in Equation 9.28:

\[ df = N - 1 \tag{9.28}\]

where $N$ is sample size (i.e., the number of paired observations).

Code

df_pairedSamplesTtest <- N - 1

One-tailed test (stats::qt()):

Code

qt(1 - alpha, df_pairedSamplesTtest)

[1] 1.652547

Two-tailed test (stats::qt()):

Code

qt(1 - alpha/2, df_pairedSamplesTtest)

[1] 1.971957

9.6.7.4 One-Way ANOVA

For a one-way ANOVA, the degrees of freedom is in Equation 9.29:

\[ \begin{aligned} df_\text{between} &= g - 1 \\ df_\text{within} &= N - g \end{aligned} \tag{9.29}\]

where $N$ is sample size and $g$ is the number of groups.

Code

df_betweenOneWayANOVA <- numGroups - 1
df_withinOneWayANOVA <- N - numGroups

One-tailed test (stats::qf()):

Code

qf(1 - alpha, df_betweenOneWayANOVA, df_withinOneWayANOVA)

[1] 2.650677

Two-tailed test (stats::qf()):

Code

qf(1 - alpha/2, df_betweenOneWayANOVA, df_withinOneWayANOVA)

[1] 3.183378

9.6.7.5 Factorial ANOVA

For a factorial two-way ANOVA, the degrees of freedom is in Equation 9.30:

\[ \begin{aligned} df_\text{Factor A} &= a - 1 \\ df_\text{Factor B} &= b - 1 \\ df_\text{Interaction} &= (a - 1)(b - 1) \\ df_\text{error} &= ab(N - 1) \end{aligned} \tag{9.30}\]

where $N$ is sample size, $a$ is the number of levels for factor A, and $b$ is the number of levels for factor B.

Code

df_factorA <- numLevelsFactorA - 1
df_factorB <- numLevelsFactorB - 1
df_interaction <- df_factorA * df_factorB
df_error <- numLevelsFactorA * numLevelsFactorB * (N - 1)

Factor A (one-tailed test; stats::qf()):

Code

qf(1 - alpha, df_factorA, df_error)

[1] 2.999494

Factor B (one-tailed test; stats::qf()):

Code

qf(1 - alpha, df_factorB, df_error)

[1] 2.608629

Interaction (one-tailed test; stats::qf()):

Code

qf(1 - alpha, df_interaction, df_error)

[1] 2.102376

Factor A (two-tailed test; stats::qf()):

Code

qf(1 - alpha/2, df_factorA, df_error)

[1] 3.694584

Factor B (two-tailed test; stats::qf()):

Code

qf(1 - alpha/2, df_factorB, df_error)

[1] 3.121587

Interaction (two-tailed test; stats::qf()):

Code

qf(1 - alpha/2, df_interaction, df_error)

[1] 2.413504

9.6.7.6 Repeated Measures ANOVA

For a repeated measures ANOVA, the degrees of freedom is in Equation 9.31:

\[ \begin{aligned} df_1 &= T - 1 \\ df_2 &= (T - 1)(N - 1) \end{aligned} \tag{9.31}\]

where $N$ is sample size and $T$ is the number of measurements (i.e., the number of levels of the within-person factor: e.g., timepoints or conditions).

Code

df1_RMANOVA <- numMeasurements - 1
df2_RMANOVA <- (numMeasurements - 1) * (N - 1)

One-tailed test (stats::qf()):

Code

qf(1 - alpha, df1_RMANOVA, df2_RMANOVA)

[1] 2.619828

Two-tailed test (stats::qf()):

Code

qf(1 - alpha/2, df1_RMANOVA, df2_RMANOVA)

[1] 3.138017

9.6.7.7 Correlation

For a correlation, the degrees of freedom is in Equation 9.32:

\[ df = N - 2 \tag{9.32}\]

where $N$ is sample size.

Code

df_correlation <- N - 2

One-tailed test (stats::qt()):

Code

qt(1 - alpha, df_correlation)

[1] 1.652586

Two-tailed test (stats::qt()):

Code

qt(1 - alpha/2, df_correlation)

[1] 1.972017

9.6.7.8 Multiple Regression

For multiple regression, the degrees of freedom is in Equation 9.33:

\[ \begin{aligned} df_1 &= p \\ df_2 &= N - p - 1 \end{aligned} \tag{9.33}\]

where $N$ is sample size and $p$ is the number of predictors.

Code

df1_regression <- numPredictors
df2_regression <- N - numPredictors - 1

One-tailed test (stats::qf()):

Code

qf(1 - alpha, df1_regression, df2_regression)

[1] 2.260647

Two-tailed test (stats::qf()):

Code

qf(1 - alpha/2, df1_regression, df2_regression)

[1] 2.63243

9.6.7.9 Chi-Square Goodness-of-Fit Test

For the chi-square goodness-of-fit test, the degrees of freedom is in Equation 9.34:

\[ df = c - 1 \tag{9.34}\]

where $c$ is the number of categories.

Code

df_chisquareGOF <- numCategories - 1

One-tailed test (stats::qchisq()):

Code

qchisq(1 - alpha, df_chisquareGOF)

[1] 11.0705

Two-tailed test (stats::qchisq()):

Code

qchisq(1 - alpha/2, df_chisquareGOF)

[1] 12.8325

9.6.7.10 Chi-Square Test for Association

For the chi-square test for association, the degrees of freedom is in Equation 9.35:

\[ df = (r - 1) \times (c - 1) \tag{9.35}\]

where $r$ is the number of rows in the contingency table and $c$ is the number of columns in the contingency table.

Code

df_chisquareAssociation <- (numRows - 1) * (numColumns - 1)

One-tailed test (stats::qchisq()):

Code

qchisq(1 - alpha, df_chisquareAssociation)

[1] 9.487729

Two-tailed test (stats::qchisq()):

Code

qchisq(1 - alpha/2, df_chisquareAssociation)

[1] 11.14329

9.6.8 Statistical Power

As described above, statistical power is the probability of detecting an effect, if, in fact, the effect exists. Statistical power for a given test can be calculated based on three factors:

effect size
sample size
alpha level

Knowing any three of the following, you can calculate the fourth: statistical power, effect size, sample size, and alpha level. Below is R code for calculating power for each of various statistical tests (i.e., a power analysis). We use the pwr (Champely, 2020) and pwrss (Bulus, 2023) packages to estimate statistical power. For free point-and-click software for calculating statistical power, see G*Power: https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower.html

Code

power <- .8
effectSize_d <- .5
effectSize_r <- .24
effectSize_beta <- .24
effectSize_f <- .25
effectSize_fSquared <- .06
effectSize_omega <- .3

When designing a study, it is important to consider power and the sample size needed to detect the hypothesized effect size. If your sample size is too small and you do not detect an effect (i.e., $p > .05$), you do not know whether your failure to detect the effect was because a) the effect does not exist, or b) the effect exists but you did not have enough power to detect it.

9.6.8.1 One-Sample t-Test

Solving for statistical power achieved (given effect size, sample size, and alpha level; pwr::pwr.t.test()):

Code

pwr::pwr.t.test(
  n = N,
  d = effectSize_d,
  sig.level = alpha,
  type = "one.sample",
  alternative = "two.sided")


     One-sample t test power calculation 

              n = 200
              d = 0.5
      sig.level = 0.05
          power = 0.9999998
    alternative = two.sided

Solving for sample size needed (given effect size, power, and alpha level):

Code

pwr::pwr.t.test(
  power = power,
  d = effectSize_d,
  sig.level = alpha,
  type = "one.sample",
  alternative = "two.sided")


     One-sample t test power calculation 

              n = 33.36713
              d = 0.5
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

Solving for the minimum detectable effect size (given sample size, power, and alpha level):

Code

pwr::pwr.t.test(
  power = power,
  n = N,
  sig.level = alpha,
  type = "one.sample",
  alternative = "two.sided")


     One-sample t test power calculation 

              n = 200
              d = 0.1990655
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

9.6.8.2 Independent-Samples t-Test

9.6.8.2.1 Balanced Group Sizes

Solving for statistical power achieved (given effect size, sample size per group, and alpha level; pwr::pwr.t.test()):

Code

pwr::pwr.t.test(
  n = N,
  d = effectSize_d,
  sig.level = alpha,
  type = "two.sample",
  alternative = "two.sided")


     Two-sample t test power calculation 

              n = 200
              d = 0.5
      sig.level = 0.05
          power = 0.9987689
    alternative = two.sided

NOTE: n is number in *each* group

Solving for sample size per group needed (given effect size, power, and alpha level):

Code

pwr::pwr.t.test(
  power = power,
  d = effectSize_d,
  sig.level = alpha,
  type = "two.sample",
  alternative = "two.sided")


     Two-sample t test power calculation 

              n = 63.76561
              d = 0.5
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

Solving for the minimum detectable effect size (given sample size per group, power, and alpha level):

Code

pwr::pwr.t.test(
  power = power,
  n = N,
  sig.level = alpha,
  type = "two.sample",
  alternative = "two.sided")


     Two-sample t test power calculation 

              n = 200
              d = 0.2808267
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

9.6.8.2.2 Unbalanced Group Sizes

Solving for statistical power achieved (given effect size, sample size per group, and alpha level; pwr::pwr.t2n.test()):

Code

pwr::pwr.t2n.test(
  n1 = nGroup1,
  n2 = nGroup2,
  d = effectSize_d,
  sig.level = alpha,
  alternative = "two.sided")


     t test power calculation 

             n1 = 150
             n2 = 150
              d = 0.5
      sig.level = 0.05
          power = 0.9907677
    alternative = two.sided

Solving for sample size per group needed (given effect size, power, and alpha level):

Code

pwr::pwr.t2n.test(
  power = power,
  n1 = nGroup1,
  d = effectSize_d,
  sig.level = alpha,
  alternative = "two.sided")


     t test power calculation 

             n1 = 150
             n2 = 40.22483
              d = 0.5
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

Solving for the minimum detectable effect size (given sample size per group, power, and alpha level):

Code

pwr::pwr.t2n.test(
  power = power,
  n1 = nGroup1,
  n2 = nGroup2,
  sig.level = alpha,
  alternative = "two.sided")


     t test power calculation 

             n1 = 150
             n2 = 150
              d = 0.3245459
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

9.6.8.3 Paired-Samples t-Test

Solving for statistical power achieved (given effect size, sample size per group, and alpha level; pwr::pwr.t.test()):

Code

pwr::pwr.t.test(
  n = N,
  d = effectSize_d,
  sig.level = alpha,
  type = "paired",
  alternative = "two.sided")


     Paired t test power calculation 

              n = 200
              d = 0.5
      sig.level = 0.05
          power = 0.9999998
    alternative = two.sided

NOTE: n is number of *pairs*

Solving for sample size per group needed (given effect size, power, and alpha level):

Code

pwr::pwr.t.test(
  power = power,
  d = effectSize_d,
  sig.level = alpha,
  type = "paired",
  alternative = "two.sided")


     Paired t test power calculation 

              n = 33.36713
              d = 0.5
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number of *pairs*

Solving for the minimum detectable effect size (given sample size per group, power, and alpha level):

Code

pwr::pwr.t.test(
  power = power,
  n = N,
  sig.level = alpha,
  type = "paired",
  alternative = "two.sided")


     Paired t test power calculation 

              n = 200
              d = 0.1990655
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number of *pairs*

9.6.8.4 One-Way ANOVA

Solving for statistical power achieved (given effect size, sample size per group, and alpha level; pwr::pwr.anova.test()):

Code

pwr::pwr.anova.test(
  n = N,
  f = effectSize_f,
  sig.level = alpha,
  k = numGroups)


     Balanced one-way analysis of variance power calculation 

              k = 4
              n = 200
              f = 0.25
      sig.level = 0.05
          power = 0.9999962

NOTE: n is number in each group

Solving for sample size per group needed (given effect size, power, and alpha level; pwr::pwr.anova.test()):

Code

pwr::pwr.anova.test(
  power = power,
  f = effectSize_f,
  sig.level = alpha,
  k = numGroups)


     Balanced one-way analysis of variance power calculation 

              k = 4
              n = 44.59927
              f = 0.25
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group

Solving for the minimum detectable effect size (given sample size per group, power, and alpha level; pwr::pwr.anova.test()):

Code

pwr::pwr.anova.test(
  power = power,
  n = N,
  sig.level = alpha,
  k = numGroups)


     Balanced one-way analysis of variance power calculation 

              k = 4
              n = 200
              f = 0.117038
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group

9.6.8.4.1 Simulation

The power analysis code above assumes the groups are of equal size (i.e., a balanced design). If the design is unbalanced (i.e., there are different numbers of participants in each group), it may be necessary to conduct a power analysis via a simulation. Below is an example of evaluating the statistical power for detecting an effect unbalanced designs via simulation using the SimDesign (Chalmers, 2025; Chalmers & Adkins, 2020) package. We discuss simulation more in Chapter 24.

9.6.8.4.1.1 1. Design Matrix

First, we set up the design matrix using the SimDesign::createDesign() function:

Code

Design <- SimDesign::createDesign(
  nGroup1 = c(10),
  nGroup2 = c(25, 50, 100),
  f       = c(.10, .25, .40), # small, medium, large effect sizes
  alpha   = 0.05
)

9.6.8.4.1.2 2. Generate

Second, we set up the Generate() function:

Code

Generate <- function(condition, fixed_objects = NULL) {
  with(condition, {
    # Means for each group
    mean1 <- 0
    mean2 <- f * sqrt((nGroup1 + nGroup2) / 2)
    
    # Generate data
    group1 <- rnorm(nGroup1, mean = mean1, sd = 1)
    group2 <- rnorm(nGroup2, mean = mean2, sd = 1)
    
    data.frame(
      value = c(group1, group2),
      group = factor(rep(c("Group1", "Group2"), 
                         c(nGroup1, nGroup2)))
    )
  })
}

9.6.8.4.1.3 3. Analyse

Third, we set up the Analyse() function:

Code

Analyse <- function(condition, dat, fixed_objects = NULL) {
  aov_result <- aov(value ~ group, data = dat)
  p_value <- summary(aov_result)[[1]][["Pr(>F)"]][1]
  
  data.frame(
    p_value  = p_value,
    rejectH0 = as.numeric(p_value < condition$alpha)
  )
}

9.6.8.4.1.4 4. Summarise

Fourth, we set up the Summarise() function:

Code

Summarise <- function(condition, results, fixed_objects = NULL) {
  # Estimated power = proportion of rejections
  Power <- mean(results$rejectH0)
  
  data.frame(Power = Power)
}

9.6.8.4.1.5 5. Run the Simulation

Fifth, we run the simulation using the SimDesign::runSimulation() function:

Code

monteCarloSim_results <- SimDesign::runSimulation(
  design       = Design,
  replications = 1000,
  generate     = Generate,
  analyse      = Analyse,
  summarise    = Summarise,
  seed = genSeeds(Design, iseed = 52242), # for reproducibility
  parallel = TRUE # for faster (parallel) processing
)

9.6.8.4.1.6 Simulation Results

Here are the simulation results:

Code

monteCarloSim_results

9.6.8.5 Factorial ANOVA

The power analysis code below assumes the groups are of equal size (i.e., a balanced design). If the design is unbalanced (i.e., there are different numbers of participants in each group), it may be necessary to conduct a power analysis via a simulation. See Section 9.6.8.4 for an example power analysis simulation for one-way ANOVA.

Solving for statistical power achieved (given effect size, sample size per group, and alpha level; pwr::pwr.anova.test()):

Code

pwr::pwr.anova.test(
  n = N,
  f = effectSize_f,
  sig.level = alpha,
  k = numLevelsFactorA)


     Balanced one-way analysis of variance power calculation 

              k = 3
              n = 200
              f = 0.25
      sig.level = 0.05
          power = 0.9999238

NOTE: n is number in each group

Code

pwr::pwr.anova.test(
  n = N,
  f = effectSize_f,
  sig.level = alpha,
  k = numLevelsFactorB)


     Balanced one-way analysis of variance power calculation 

              k = 4
              n = 200
              f = 0.25
      sig.level = 0.05
          power = 0.9999962

NOTE: n is number in each group

Code

pwr::pwr.anova.test(
  n = N,
  f = effectSize_f,
  sig.level = alpha,
  k = numLevelsFactorA + numLevelsFactorB)


     Balanced one-way analysis of variance power calculation 

              k = 7
              n = 200
              f = 0.25
      sig.level = 0.05
          power = 1

NOTE: n is number in each group

Solving for sample size per group needed (given effect size, power, and alpha level):

Code

pwr::pwr.anova.test(
  power = power,
  f = effectSize_f,
  sig.level = alpha,
  k = numLevelsFactorA)


     Balanced one-way analysis of variance power calculation 

              k = 3
              n = 52.3966
              f = 0.25
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group

Code

pwr::pwr.anova.test(
  power = power,
  f = effectSize_f,
  sig.level = alpha,
  k = numLevelsFactorB)


     Balanced one-way analysis of variance power calculation 

              k = 4
              n = 44.59927
              f = 0.25
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group

Code

pwr::pwr.anova.test(
  power = power,
  f = effectSize_f,
  sig.level = alpha,
  k = numLevelsFactorA + numLevelsFactorB)


     Balanced one-way analysis of variance power calculation 

              k = 7
              n = 32.05196
              f = 0.25
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group

Solving for the minimum detectable effect size (given sample size per group, power, and alpha level):

Code

pwr::pwr.anova.test(
  power = power,
  n = N,
  sig.level = alpha,
  k = numLevelsFactorA)


     Balanced one-way analysis of variance power calculation 

              k = 3
              n = 200
              f = 0.1270373
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group

Code

pwr::pwr.anova.test(
  power = power,
  n = N,
  sig.level = alpha,
  k = numLevelsFactorB)


     Balanced one-way analysis of variance power calculation 

              k = 4
              n = 200
              f = 0.117038
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group

Code

pwr::pwr.anova.test(
  power = power,
  n = N,
  sig.level = alpha,
  k = numLevelsFactorA + numLevelsFactorB)


     Balanced one-way analysis of variance power calculation 

              k = 7
              n = 200
              f = 0.09889082
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group

9.6.8.6 Repeated Measures ANOVA

Solving for statistical power achieved (given effect size, sample size per group, and alpha level; WebPower::wp.rmanova()):

Code

WebPower::wp.rmanova(
  n = N,
  ng = numGroups,
  nm = numMeasurements,
  f = effectSize_f,
  alpha = alpha,
  type = 0)

Repeated-measures ANOVA analysis

      n    f ng nm nscor alpha     power
    200 0.25  4  4     1  0.05 0.8484718

NOTE: Power analysis for between-effect test
URL: http://psychstat.org/rmanova

Code

WebPower::wp.rmanova(
  n = N,
  ng = numGroups,
  nm = numMeasurements,
  f = effectSize_f,
  alpha = alpha,
  type = 1)

Repeated-measures ANOVA analysis

      n    f ng nm nscor alpha     power
    200 0.25  4  4     1  0.05 0.8536292

NOTE: Power analysis for within-effect test
URL: http://psychstat.org/rmanova

Code

WebPower::wp.rmanova(
  n = N,
  ng = numGroups,
  nm = numMeasurements,
  f = effectSize_f,
  alpha = alpha,
  type = 2)

Repeated-measures ANOVA analysis

      n    f ng nm nscor alpha     power
    200 0.25  4  4     1  0.05 0.6756298

NOTE: Power analysis for interaction-effect test
URL: http://psychstat.org/rmanova

Solving for sample size per group needed (given effect size, power, and alpha level):

Code

WebPower::wp.rmanova(
  power = power,
  ng = numGroups,
  nm = numMeasurements,
  f = effectSize_f,
  alpha = alpha,
  type = 0)

Repeated-measures ANOVA analysis

           n    f ng nm nscor alpha power
    178.3971 0.25  4  4     1  0.05   0.8

NOTE: Power analysis for between-effect test
URL: http://psychstat.org/rmanova

Code

WebPower::wp.rmanova(
  power = power,
  ng = numGroups,
  nm = numMeasurements,
  f = effectSize_f,
  alpha = alpha,
  type = 1)

Repeated-measures ANOVA analysis

           n    f ng nm nscor alpha power
    175.7692 0.25  4  4     1  0.05   0.8

NOTE: Power analysis for within-effect test
URL: http://psychstat.org/rmanova

Code

WebPower::wp.rmanova(
  power = power,
  ng = numGroups,
  nm = numMeasurements,
  f = effectSize_f,
  alpha = alpha,
  type = 2)

Repeated-measures ANOVA analysis

           n    f ng nm nscor alpha power
    253.2369 0.25  4  4     1  0.05   0.8

NOTE: Power analysis for interaction-effect test
URL: http://psychstat.org/rmanova

Solving for the minimum detectable effect size (given sample size per group, power, and alpha level):

Code

WebPower::wp.rmanova(
  power = power,
  n = N,
  ng = numGroups,
  nm = numMeasurements,
  alpha = alpha,
  type = 0)

Repeated-measures ANOVA analysis

      n         f ng nm nscor alpha power
    200 0.2358259  4  4     1  0.05   0.8

NOTE: Power analysis for between-effect test
URL: http://psychstat.org/rmanova

Code

WebPower::wp.rmanova(
  power = power,
  n = N,
  ng = numGroups,
  nm = numMeasurements,
  alpha = alpha,
  type = 1)

Repeated-measures ANOVA analysis

      n         f ng nm nscor alpha power
    200 0.2342726  4  4     1  0.05   0.8

NOTE: Power analysis for within-effect test
URL: http://psychstat.org/rmanova

Code

WebPower::wp.rmanova(
  power = power,
  n = N,
  ng = numGroups,
  nm = numMeasurements,
  alpha = alpha,
  type = 2)

Repeated-measures ANOVA analysis

      n         f ng nm nscor alpha power
    200 0.2817486  4  4     1  0.05   0.8

NOTE: Power analysis for interaction-effect test
URL: http://psychstat.org/rmanova

9.6.8.7 Correlation

Solving for statistical power achieved (given effect size, sample size per group, and alpha level; pwr::pwr.r.test()):

Code

pwr::pwr.r.test(
  n = N,
  r = effectSize_r,
  sig.level = alpha,
  alternative = "two.sided")


     approximate correlation power calculation (arctangh transformation) 

              n = 200
              r = 0.24
      sig.level = 0.05
          power = 0.9310138
    alternative = two.sided

Solving for sample size per group needed (given effect size, power, and alpha level):

Code

pwr::pwr.r.test(
  power = power,
  r = effectSize_r,
  sig.level = alpha,
  alternative = "two.sided")


     approximate correlation power calculation (arctangh transformation) 

              n = 133.1299
              r = 0.24
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

Solving for the minimum detectable effect size (given sample size per group, power, and alpha level):

Code

pwr::pwr.r.test(
  power = power,
  n = N,
  sig.level = alpha,
  alternative = "two.sided")


     approximate correlation power calculation (arctangh transformation) 

              n = 200
              r = 0.1965767
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

9.6.8.8 Multiple Regression

Solving for statistical power achieved (given effect size, sample size, and alpha level; pwr::pwr.f2.test()):

Code

pwr::pwr.f2.test(
  f2 = effectSize_fSquared,
  sig.level = alpha,
  u = numPredictors,
  v = N - numPredictors - 1)


     Multiple regression power calculation 

              u = 5
              v = 194
             f2 = 0.06
      sig.level = 0.05
          power = 0.7548031

Code

pwrss::pwrss.t.reg(
  n = N,
  beta1 = effectSize_beta,
  k = numPredictors,
  alpha = alpha,
  alternative = "not equal")

 Linear Regression Coefficient (t Test) 
 H0: beta1 = beta0 
 HA: beta1 != beta0 
 ------------------------------ 
  Statistical power = 0.936 
  n = 200 
 ------------------------------ 
 Alternative = "not equal" 
 Degrees of freedom = 194 
 Non-centrality parameter = 3.496 
 Type I error rate = 0.05 
 Type II error rate = 0.064

Solving for sample size needed (given effect size, power, and alpha level)—$v = N - \text{numberOfPredictors} - 1$; thus, $N = v + \text{numberOfPredictors} + 1$:

Code

multipleRegressionSampleSizeModel <- pwr::pwr.f2.test(
  power = power,
  f2 = effectSize_fSquared,
  sig.level = alpha,
  u = numPredictors)

multipleRegressionSampleSizeModel


     Multiple regression power calculation 

              u = 5
              v = 213.3947
             f2 = 0.06
      sig.level = 0.05
          power = 0.8

Code

vNeeded <- multipleRegressionSampleSizeModel$v
sampleSizeNeeded <- vNeeded + numPredictors + 1
sampleSizeNeeded

[1] 219.3947

Code

pwrss::pwrss.t.reg(
  power = power,
  beta1 = effectSize_beta,
  k = numPredictors,
  alpha = alpha,
  alternative = "not equal")

 Linear Regression Coefficient (t Test) 
 H0: beta1 = beta0 
 HA: beta1 != beta0 
 ------------------------------ 
  Statistical power = 0.8 
  n = 131 
 ------------------------------ 
 Alternative = "not equal" 
 Degrees of freedom = 124.427 
 Non-centrality parameter = 2.823 
 Type I error rate = 0.05 
 Type II error rate = 0.2

Solving for the minimum detectable effect size (given sample size, power, and alpha level):

Code

pwr::pwr.f2.test(
  power = power,
  sig.level = alpha,
  u = numPredictors,
  v = N - numPredictors - 1)


     Multiple regression power calculation 

              u = 5
              v = 194
             f2 = 0.06597765
      sig.level = 0.05
          power = 0.8

9.6.8.9 Chi-Square Goodness-of-Fit Test

Solving for statistical power achieved (given effect size, sample size, and alpha level; pwr::pwr.chisq.test()):

Code

pwr::pwr.chisq.test(
  N = N,
  w = effectSize_omega,
  df = numCategories - 1,
  sig.level = alpha)


     Chi squared power calculation 

              w = 0.3
              N = 200
             df = 5
      sig.level = 0.05
          power = 0.9269225

NOTE: N is the number of observations

Solving for sample size needed (given effect size, power, and alpha level):

Code

pwr::pwr.chisq.test(
  power = power,
  w = effectSize_omega,
  df = numCategories - 1,
  sig.level = alpha)


     Chi squared power calculation 

              w = 0.3
              N = 142.529
             df = 5
      sig.level = 0.05
          power = 0.8

NOTE: N is the number of observations

Solving for the minimum detectable effect size (given sample size, power, and alpha level):

Code

pwr::pwr.chisq.test(
  power = power,
  N = N,
  df = numCategories - 1,
  sig.level = alpha)


     Chi squared power calculation 

              w = 0.2532543
              N = 200
             df = 5
      sig.level = 0.05
          power = 0.8

NOTE: N is the number of observations

9.6.8.10 Chi-Square Test for Association

Solving for statistical power achieved (given effect size, sample size, and alpha level; pwr::pwr.chisq.test()):

Code

pwr::pwr.chisq.test(
  N = N,
  w = effectSize_omega,
  df = (numRows - 1)*(numColumns - 1),
  sig.level = alpha)


     Chi squared power calculation 

              w = 0.3
              N = 200
             df = 4
      sig.level = 0.05
          power = 0.9431195

NOTE: N is the number of observations

Solving for sample size needed (given effect size, power, and alpha level):

Code

pwr::pwr.chisq.test(
  power = power,
  w = effectSize_omega,
  df = (numRows - 1)*(numColumns - 1),
  sig.level = alpha)


     Chi squared power calculation 

              w = 0.3
              N = 132.6143
             df = 4
      sig.level = 0.05
          power = 0.8

NOTE: N is the number of observations

Solving for the minimum detectable effect size (given sample size, power, and alpha level):

Code

pwr::pwr.chisq.test(
  power = power,
  N = N,
  df = (numRows - 1)*(numColumns - 1),
  sig.level = alpha)


     Chi squared power calculation 

              w = 0.2442875
              N = 200
             df = 4
      sig.level = 0.05
          power = 0.8

NOTE: N is the number of observations

9.6.8.11 Multilevel Modeling

Power analysis for multilevel modeling approaches is more complicated than it is for other statistical analyses, such as correlation, multiple regression, t-tests, ANOVA, etc.

There are free web applications for calculating power in multilevel modeling:

9.6.8.12 Path Analysis, Factor Analysis, and Structural Equation Modeling

Power analysis for latent variable modeling approaches like structural equation modeling (SEM) is more complicated than it is for other statistical analyses, such as correlation, multiple regression, t-tests, ANOVA, etc.

I provide an example of power analysis in SEM using Monte Carlo simulation in R here: https://isaactpetersen.github.io/Principles-Psychological-Assessment/sem.html#monteCarloPowerAnalysis (Petersen, 2025b).

There are also free web applications for calculating power in SEM:

9.6.8.13 Mediation and Moderation

There are free tools for calculating power for tests of mediation and moderation:

https://schoemanna.shinyapps.io/mc_power_med (Schoemann et al., 2017)
https://www.causalevaluation.org/power-analysis.html (web application: https://powerupr.shinyapps.io/index/) (Ataneka et al., 2023), based on the PowerUpR package (Bulus et al., 2021)
https://webpower.psychstat.org/wiki/models/index (Zhang & Yuan, 2018)

9.7 Ethics of Reporting Statistics

It is important to report statistics ethically and responsibly. People can use statistics to lie (Huff, 2023). It is important to present statistics fairly and accurately even if they are not convenient to your preferred narrative. In addition, it is important to consider practical significance in addition to statistical significance.

9.8 Conclusion

Descriptive statistics are used to describe the data, including the center, spread, or shape of data. Inferential statistics are used to draw inferences regarding differences between groups or associations between variables. Null hypothesis signficance testing is a framework for inferential statistics, in which there is a null hypothesis and alternative hypothesis. The null hypothesis is that there is no difference between groups or that there is no association between variables. Statistical significance is evaluated with a $p$-value, which represents the probability of obtaining a result at least as extreme as the result observed if the null hypothesis is true. Effects with p-values less than .05 are considered statistically significant. However, it is also important to consider practical significance and effect sizes. When designing a study, it is important to consider statistical power and the sample size needed to detect the hypothesized effect size. You can determine a study’s power based on the effect size, sample size, and alpha level.

9.9 Session Info

Code

sessionInfo()

R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] grid      parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] lubridate_1.9.4   forcats_1.0.0     stringr_1.5.1     dplyr_1.1.4      
 [5] purrr_1.1.0       readr_2.1.5       tidyr_1.3.1       tibble_3.3.0     
 [9] ggplot2_3.5.2     tidyverse_2.0.0   SimDesign_2.20.0  psych_2.5.6      
[13] WebPower_0.9.4    PearsonDS_1.3.2   lavaan_0.6-19     lme4_1.1-37      
[17] Matrix_1.7-3      MASS_7.3-65       pwrss_0.3.1       pwr_1.3-0        
[21] DescTools_0.99.60 petersenlab_1.2.0

loaded via a namespace (and not attached):
 [1] Rdpack_2.6.4        DBI_1.2.3           mnormt_2.1.1       
 [4] pbapply_1.7-4       gridExtra_2.3       gld_2.6.7          
 [7] testthat_3.2.3      readxl_1.4.5        rlang_1.1.6        
[10] magrittr_2.0.3      e1071_1.7-16        compiler_4.5.1     
[13] mgcv_1.9-3          vctrs_0.6.5         reshape2_1.4.4     
[16] quadprog_1.5-8      pkgconfig_2.0.3     fastmap_1.2.0      
[19] backports_1.5.0     labeling_0.4.3      pbivnorm_0.6.0     
[22] rmarkdown_2.29      sessioninfo_1.2.3   tzdb_0.5.0         
[25] haven_2.5.5         nloptr_2.2.1        xfun_0.52          
[28] jsonlite_2.0.0      cluster_2.1.8.1     R6_2.6.1           
[31] stringi_1.8.7       RColorBrewer_1.1-3  parallelly_1.45.1  
[34] boot_1.3-31         rpart_4.1.24        brio_1.1.5         
[37] cellranger_1.1.0    Rcpp_1.1.0          knitr_1.50         
[40] future.apply_1.20.0 audio_0.1-11        base64enc_0.1-3    
[43] R.utils_2.13.0      timechange_0.3.0    splines_4.5.1      
[46] nnet_7.3-20         tidyselect_1.2.1    rstudioapi_0.17.1  
[49] yaml_2.3.10         codetools_0.2-20    listenv_0.9.1      
[52] lattice_0.22-7      plyr_1.8.9          withr_3.0.2        
[55] evaluate_1.0.4      foreign_0.8-90      future_1.67.0      
[58] proxy_0.4-27        pillar_1.11.0       checkmate_2.3.2    
[61] stats4_4.5.1        reformulas_0.4.1    generics_0.1.4     
[64] mix_1.0-13          hms_1.1.3           scales_1.4.0       
[67] rootSolve_1.8.2.4   minqa_1.2.8         globals_0.18.0     
[70] xtable_1.8-4        class_7.3-23        glue_1.8.0         
[73] Hmisc_5.2-3         lmom_3.2            tools_4.5.1        
[76] data.table_1.17.8   beepr_2.0           Exact_3.3          
[79] fs_1.6.6            mvtnorm_1.3-3       mitools_2.4        
[82] rbibutils_2.3       colorspace_2.1-1    nlme_3.1-168       
[85] htmlTable_2.4.3     Formula_1.2-5       cli_3.6.5          
[88] expm_1.0-0          viridisLite_0.4.2   gtable_0.3.6       
[91] R.methodsS3_1.8.2   digest_0.6.37       progressr_0.15.1   
[94] htmlwidgets_1.6.4   farver_2.1.2        htmltools_0.5.8.1  
[97] R.oo_1.27.1         lifecycle_1.0.4     httr_1.4.7

Number	Misconception
1	If \(p = .05\), the null hypothesis has only a 5% chance of being true.
2	A nonsignificant difference (eg, \(p > .05\)) means there is no difference between groups.
3	A statistically significant finding is clinically important.
4	Studies with \(p\)-values on opposite sides of .05 are conflicting.
5	Studies with the same \(p\)-value provide the same evidence against the null hypothesis.
6	\(p = .05\) means that we have observed data that would occur only 5% of the time under the null hypothesis.
7	\(p = .05\) and \(p < .05\) mean the same thing.
8	\(p\)-values are properly written as inequalities (e.g., “\(p \le .05\)” when \(p = .015\)).
9	\(p = .05\) means that if you reject the null hypothesis, the probability of a Type I error is only 5%.
10	With a \(p = .05\) threshold for significance, the chance of a Type I error will be 5%.
11	You should use a one-sided \(p\)-value when you don’t care about a result in one direction, or a difference in that direction is impossible.
12	A scientific conclusion or treatment policy should be based on whether or not the \(p\)-value is significant.

Effect Size Index	Small	Medium	Large
Cohen’s \(d\)	\(\ge \|.20\|\)	\(\ge \|.50\|\)	\(\ge \|.80\|\)
Standardized regression coefficient (beta; \(\beta\))	\(\ge \|.10\|\)	\(\ge \|.24\|\)	\(\ge \|.37\|\)
Correlation coefficient (\(r\))	\(\ge \|.10\|\)	\(\ge \|.24\|\)	\(\ge \|.37\|\)
Cohen’s \(\omega\)	\(\ge .10\)	\(\ge .30\)	\(\ge .50\)
Cohen’s \(f\)	\(\ge .10\)	\(\ge .25\)	\(\ge .40\)
Cohen’s \(f^2\)	\(\ge .01\)	\(\ge .06\)	\(\ge .16\)
Coefficient of determination (\(R^2\))	\(\ge .01\)	\(\ge .06\)	\(\ge .14\)
Eta squared (\(\eta^2\))	\(\ge .01\)	\(\ge .06\)	\(\ge .14\)
Partial eta squared (\(\eta_p^2\))	\(\ge .01\)	\(\ge .06\)	\(\ge .14\)

9.1 Getting Started

9.1.1 Load Packages

9.1.2 Load Data

9.2 Descriptive Statistics

9.2.1 Center

9.2.2 Spread

9.2.3 Shape

9.2.4 Combination

9.2.5 Interactive Visualization

9.3 Scores and Scales

9.3.1 Raw Scores

9.3.2 z Scores

9.4 Inferential Statistics

9.4.1 Null Hypothesis Significance Testing

9.4.1.1 Null Hypothesis (\(H_0\))

9.4.1.2 Alternative Hypothesis (\(H_1\))

9.4.1.3 Statistical Significance

9.4.2 Practical Significance

9.4.2.1 Cohen’s \(d\)

9.4.2.2 Standardized Regression Coefficient (Beta; \(\beta\))

9.4.2.3 Correlation Coefficient (\(r\))

9.4.2.4 Cohen’s \(\omega\)

9.4.2.5 Cohen’s \(f\)

9.4.2.6 Cohen’s \(f^2\)

9.4.2.7 Coefficient of Determination (\(R^2\))

9.4.2.8 Eta Squared (\(\eta^2\)) and Partial Eta Squared (\(\eta_p^2\))

9.4.2.9 Effect Size Thresholds

9.5 Statistical Decision Tree

9.6 Statistical Tests

9.6.1 t-Test

9.6.2 Analysis of Variance

9.6.3 Correlation

9.6.4 (Multiple) Regression

9.6.5 Chi-Square Test

9.6.6 Formulating Statistical Tests in Terms of Partitioned Variance

9.6.7 Critical Value

9.6.7.1 One-Sample t-Test

9.6.7.2 Independent-Samples t-Test

9.6.7.3 Paired-Samples t-Test

9.6.7.4 One-Way ANOVA

9.6.7.5 Factorial ANOVA

9.6.7.6 Repeated Measures ANOVA

9.6.7.7 Correlation

9.6.7.8 Multiple Regression

9.6.7.9 Chi-Square Goodness-of-Fit Test

9.6.7.10 Chi-Square Test for Association

9.6.8 Statistical Power

9.6.8.1 One-Sample t-Test

9.6.8.2 Independent-Samples t-Test

9.6.8.2.1 Balanced Group Sizes

9.6.8.2.2 Unbalanced Group Sizes

9.6.8.3 Paired-Samples t-Test

9.6.8.4 One-Way ANOVA

9.6.8.4.1 Simulation

9.6.8.4.1.1 1. Design Matrix

9.6.8.4.1.2 2. Generate

9.6.8.4.1.3 3. Analyse

9.6.8.4.1.4 4. Summarise

9.6.8.4.1.5 5. Run the Simulation

9.6.8.4.1.6 Simulation Results

9.6.8.5 Factorial ANOVA

9.6.8.6 Repeated Measures ANOVA

9.6.8.7 Correlation

9.6.8.8 Multiple Regression

9.6.8.9 Chi-Square Goodness-of-Fit Test

9.6.8.10 Chi-Square Test for Association

9.6.8.11 Multilevel Modeling

9.6.8.12 Path Analysis, Factor Analysis, and Structural Equation Modeling

9.6.8.13 Mediation and Moderation

9.7 Ethics of Reporting Statistics

9.8 Conclusion

9.9 Session Info

Feedback

Email Notification