
22 Data Reduction: Principal Component Analysis

22.1 Getting Started

22.1.1 Load Packages
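A minimal sketch of package loading, assuming the psych package is used (an assumption; it provides principal(), scree(), and fa.parallel(), which the sketches in this chapter rely on):

Code
# Load the psych package (assumed here; it provides principal() for PCA,
# along with scree() and fa.parallel() for deciding how many components to keep)
library("psych")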

22.2 Overview of Principal Component Analysis

Principal component analysis (PCA) is used to reduce a data matrix. PCA composites represent the variance of observed measures as economically as possible, with no underlying latent variables. The goal of PCA is to identify a smaller number of components that explain as much variance in a set of variables as possible. It is an atheoretical way to decompose a matrix: PCA decomposes a data matrix into a set of eigenvectors, which are weighted transformations of the original variables.

The eigenvectors simplify the data in the matrix. PCA takes the data matrix and identifies the weighted sum of all variables that does the best job of explaining variance; these weighted sums are the principal components, also called eigenvectors. Principal components thus reflect optimally weighted sums of the variables.
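As a sketch of what this decomposition looks like in R, using the base eigen() function on a correlation matrix (the mtcars data are a stand-in for any set of numeric variables):

Code
# Eigen-decomposition of a correlation matrix (stand-in data)
R <- cor(mtcars[, 1:6])  # correlation matrix of six example variables
e <- eigen(R)
e$vectors[, 1]           # weights of the first principal component
e$values[1]              # its eigenvalue: the variance it explains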

PCA decomposes the data matrix into any number of components, up to as many components as there are variables, which will always account for all of the variance. After the PCA is performed, you can examine the results and discard the components that likely reflect error variance. Judgments about which components to retain are based on empirical criteria in conjunction with theory, to select a parsimonious number of components that account for the majority of the variance.

The eigenvalue reflects the amount of variance explained by a component (eigenvector). When using a varimax (orthogonal) rotation, the eigenvalue for a component is calculated as the sum of squared standardized loadings on that component. When using an oblique rotation, however, the components explain more variance than is attributable to their loadings alone, because the components are correlated.
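A brief sketch of this equivalence, assuming the psych package's principal() function (stand-in data again); for an unrotated solution, the sums of squared loadings match the eigenvalues of the correlation matrix:

Code
# Sums of squared loadings equal the eigenvalues (unrotated solution)
pca <- principal(cor(mtcars[, 1:6]), nfactors = 6, rotate = "none")
colSums(pca$loadings[, ]^2)       # sum of squared loadings per component
eigen(cor(mtcars[, 1:6]))$values  # the same values, as eigenvalues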

PCA pulls out the first principal component (i.e., the eigenvector that explains the most variance) and forms a new data matrix: i.e., a new correlation matrix. The PCA then pulls out the component that explains the next most variance (i.e., the eigenvector with the next-largest eigenvalue), and it continues in this way for all components, up to the number of variables. For instance, if there are six variables, it will iteratively extract additional components up to six components. You can extract as many eigenvectors as there are variables. If you extract all six components, the data matrix left over will be the same as the correlation matrix in Figure 22.1. That is, the remaining variables (as part of the leftover data matrix) will be entirely uncorrelated with each other, because six components explain 100% of the variance of six variables. In other words, you can explain six variables with six new things!

Figure 22.1: Example Correlation Matrix 2.

However, it does no good if you have to use all six components, because then there is no data reduction from the original number of variables. When the goal is data reduction (as in PCA), the hope is that the first few components will explain most of the variance, so we can explain the variability in the data with fewer components than there are variables.
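A sketch of how to check this in R with the base prcomp() function (mtcars again as stand-in data):

Code
# Proportion of variance explained by each component
pca <- prcomp(mtcars[, 1:6], scale. = TRUE)  # standardize, then extract
summary(pca)  # "Cumulative Proportion" shows how few components capture most variance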

The sum of all eigenvalues is equal to the number of variables in the analysis. PCA does not have the same assumptions as factor analysis, which assumes that measures reflect a combination of common variance and error. But if you estimate six eigenvectors and keep only two, the model is a two-component model, and whatever is left over becomes error. Therefore, PCA does not make the same assumptions as factor analysis, but it often ends up in the same place.
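This property is easy to verify (stand-in data again):

Code
sum(eigen(cor(mtcars[, 1:6]))$values)  # eigenvalues of six variables sum to 6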

22.3 Ensuring the Variables are on the Same Scale

Before performing a PCA, it is important to ensure that the variables included in the PCA are on the same scale. PCA seeks to identify components that explain variance in the data, so if the variables are not on the same scale, some variables may contribute considerably more variance than others. A common way of ensuring that variables are on the same scale is to standardize them using, for example, z-scores.
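A minimal sketch of standardization in R, using the base scale() function (stand-in data):

Code
# Convert each variable to z-scores (mean 0, SD 1) before PCA
scaled_data <- scale(mtcars[, 1:6])
round(colMeans(scaled_data), 10)  # means are 0
apply(scaled_data, 2, sd)         # SDs are 1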

22.4 Component Rotation

Similar considerations as in factor analysis can be used to determine whether and how to rotate components in PCA. The considerations for determining whether and how to rotate factors in factor analysis are described in Section 21.3.1.
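As a sketch of what these choices look like in code, assuming the psych package (the oblique rotation additionally requires the GPArotation package; mtcars is stand-in data):

Code
# Orthogonal (varimax) vs. oblique (oblimin) rotation of two components
pca_varimax <- principal(mtcars[, 1:6], nfactors = 2, rotate = "varimax")
pca_oblimin <- principal(mtcars[, 1:6], nfactors = 2, rotate = "oblimin")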

22.5 Determining the Number of Components to Retain

Similar criteria as in factor analysis can be used to determine the number of components to retain in PCA. The criteria for determining the number of factors to retain in factor analysis are described in Section 21.5.
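A sketch of two common empirical aids, assuming the psych package (stand-in data):

Code
# Scree plot of component eigenvalues
scree(mtcars[, 1:6], factors = FALSE)
# Parallel analysis: compare eigenvalues to those from random data
fa.parallel(mtcars[, 1:6], fa = "pc")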

22.6 Interpreting and Using PCA Components

The next step is interpreting the PCA components. Use theory, in combination with the pattern of loadings, to interpret and label the components.
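A sketch of inspecting loadings and obtaining component scores, assuming the psych package (stand-in data):

Code
# Examine loadings to label components, then use the component scores
pca <- principal(mtcars[, 1:6], nfactors = 2, rotate = "varimax")
print(pca$loadings, cutoff = 0.30)  # suppress small loadings to aid labeling
head(pca$scores)                    # component scores for use in later analyses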

22.7 PCA Versus Factor Analysis

Both factor analysis and PCA can be used for data reduction. The key distinction between factor analysis and PCA is depicted in Figure 22.2.

Figure 22.2: Distinction Between Factor Analysis and Principal Component Analysis.

There are several differences between factor analysis and PCA. Factor analysis is more sophisticated than PCA, but greater sophistication often brings greater assumptions. Factor analysis does not always work: the data may not fit a factor analysis model. PCA, by contrast, can decompose any data matrix; it always works. PCA uses all of the variance of the variables and assumes the measures have no measurement error, so it does not separate error variance from construct variance. PCA is fine if you are not interested in the factor structure and just want to form a linear composite for data reduction. However, if you are interested in the factor structure, use factor analysis, which estimates a latent variable that accounts for the common variance among measures and discards error variance. Factor analysis thus handles error better than PCA: it assumes that each variable is a combination of common construct variance and error, whereas PCA assumes the measures are error-free. Factor analysis is useful for identifying latent constructs, i.e., underlying dimensions or factors that explain (cause) observed scores.
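A sketch of the two approaches side by side, assuming the psych package (stand-in data):

Code
# PCA: components use all of the variables' variance
pca_fit <- principal(mtcars[, 1:6], nfactors = 2, rotate = "varimax")
# Factor analysis: factors model only the common (shared) variance
fa_fit <- fa(mtcars[, 1:6], nfactors = 2, fm = "minres", rotate = "varimax")
pca_fit$communality  # typically higher than the FA communalities below
fa_fit$communality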

22.8 Conclusion

22.9 Session Info

Code
sessionInfo()
R version 4.4.3 (2025-02-28)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.4.3    fastmap_1.2.0     cli_3.6.4        
 [5] tools_4.4.3       htmltools_0.5.8.1 rmarkdown_2.29    knitr_1.49       
 [9] jsonlite_1.9.1    xfun_0.51         digest_0.6.37     rlang_1.1.5      
[13] evaluate_1.0.3   
