R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.4.2 fastmap_1.2.0 cli_3.6.3
[5] tools_4.4.2 htmltools_0.5.8.1 rmarkdown_2.29 knitr_1.49
[9] jsonlite_1.8.9 xfun_0.49 digest_0.6.37 rlang_1.1.4
[13] evaluate_1.0.1
I want your feedback to make the book better for you and other readers. If you find typos, errors, or places where the text may be improved, please let me know. The best ways to provide feedback are by GitHub or hypothes.is annotations.
Opening an issue or submitting a pull request on GitHub: https://github.com/isaactpetersen/Fantasy-Football-Analytics-Textbook
15 Judgment Versus Actuarial Approaches to Prediction
15.1 Getting Started
15.1.1 Load Packages
15.2 Approaches to Prediction
There are two primary approaches to prediction: human judgment and the actuarial (i.e., statistical) method.
15.2.1 Human Judgment
Using the judgment method of prediction, all gathered information is collected and formulated into a prediction in the person’s mind. The person selects, measures, and combines information and produces projections solely according to their experience and judgment. For instance, a proclaimed “fantasy expert” might use their experience, expertise, and judgment to make a prediction about how each player will perform by using whatever information and data they deem to be important, aggregating all of this information in their mind to make the prediction for each player. Professional scouts and coaches use judgment when making predictions or selecting players based on their impressions of the players (Den Hartigh et al., 2018). As an example in popular media, in the movie, “Trouble with the Curve”, a professional scout makes a judgment about how well a baseball hitter will do in the major leagues from his impressions of the hitter’s ability based on the sound of the ball off the player’s bat.
15.2.2 Actuarial/Statistical Method
In the actuarial or statistical method of prediction, information is gathered and combined systematically in an evidence-based statistical prediction formula. The method is based on equations and data, so both are needed.
An example of a statistical method of prediction is the Violence Risk Appraisal Guide (Rice et al., 2013). The Violence Risk Appraisal Guide is used to predict violence and to inform parole decisions. For instance, the equation might be something like Equation 15.1:
\[ \scriptsize \text{violence risk} = \beta_1 \cdot \text{conduct disorder} + \beta_2 \cdot \text{substance use} + \beta_3 \cdot \text{suspended from school} + \beta_4 \cdot \text{childhood aggression} + \dots \tag{15.1}\]
Then, based on their score and the established cutoffs, a person is given a “low risk”, “medium risk”, or “high risk” designation.
An actuarial formula for projecting a Running Back’s rushing yards might be something like Equation 15.2:
\[ \scriptsize \text{rushing yards} = \beta_1 \cdot \text{rushing yards last season} + \beta_2 \cdot \text{age} + \beta_3 \cdot \text{injury history} + \beta_4 \cdot \text{strength of offensive line} + \dots \tag{15.2}\]
The beta weights in the actuarial model reflect the relative weight to assign each predictor. For instance, in predicting rushing yards, a player’s historical performance is likely the strongest predictor, whereas injury history might be a relatively weaker predictor. Thus, we might give historical performance a beta of 3 and injury history a beta of 1 to give a player’s historical performance three times as much weight as the player’s injury history in predicting their rushing yards. To generate the actuarial model, you could obtain the beta weights for each predictor from multiple regression, from machine learning, or from prior research on the relative importance of each predictor.
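To make Equation 15.2 concrete, here is a minimal sketch in Python of how an actuarial formula combines predictors. It is an illustration under stated assumptions, not a validated projection model: the three players and their numbers are made up, the beta weights of 3 and 1 are the illustrative values from the paragraph above, the predictors are standardized so that the weights are comparable, and injury history is reverse-scored (as "durability") so that higher values are better.

```python
from statistics import mean, pstdev

# Hypothetical data for three made-up players (not real statistics).
rushing_yards_last_season = {"Player A": 1200, "Player B": 900, "Player C": 600}
games_missed_to_injury = {"Player A": 4, "Player B": 0, "Player C": 2}

# Illustrative beta weights from the text: past performance counts three
# times as much as injury history.
BETAS = {"past_performance": 3.0, "durability": 1.0}


def standardize(values):
    """Convert raw scores to z-scores so predictors share a common scale."""
    raw = list(values.values())
    center = mean(raw)
    spread = pstdev(raw)
    return {player: (value - center) / spread for player, value in values.items()}


past_performance = standardize(rushing_yards_last_season)
# Reverse-score injuries (assumption): fewer games missed means higher durability.
durability = {
    player: -z for player, z in standardize(games_missed_to_injury).items()
}

# Weighted sum of the predictors, in the spirit of Equation 15.2.
scores = {
    player: BETAS["past_performance"] * past_performance[player]
    + BETAS["durability"] * durability[player]
    for player in past_performance
}

# Rank players from highest to lowest composite score.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because the predictors are standardized, the result is a composite score for ranking players rather than a projection in yards. Weights estimated from multiple regression on unstandardized predictors would instead yield predictions in the units of the outcome.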
As an example of using the actuarial approach, Billy Beane, who was the general manager of the Oakland Athletics at the time, wanted to find ways for his team, which had fewer financial resources than its competitors, to compete with teams that had more money to sign players. Because the team did not have the resources to sign the best players, they had to find other ways to identify the best players that they could afford. So, he used statistical formulas that weight variables, such as on-base percentage and slugging percentage, according to their value for winning games, to select players. His approach became well-known through the Michael Lewis book, “Moneyball: The Art of Winning an Unfair Game”, and the eventual movie, “Moneyball”.
15.2.3 Combining Human Judgment and Statistical Algorithms
There are numerous ways in which humans and statistical algorithms could be involved. On one extreme, humans make all judgments. On the other extreme, although humans may be involved in data collection, a statistical formula makes all decisions based on the input data, consistent with an actuarial approach. However, the human judgment and actuarial approaches can be combined in a hybrid way (Dana & Thomas, 2006). For example, to save time and money, a clinical psychologist might use an actuarial approach in all cases, but might only use a judgment approach when the actuarial approach gives a “positive” test. Or, the clinical psychologist might use both human judgment and an actuarial approach independently to see whether they agree. That is, the clinician may make a prediction based on their judgment and might also generate a prediction from an actuarial approach.
The challenge is what to do when the human and the algorithm disagree. Hypothetically, humans reviewing and adjusting the results from the statistical algorithm could lead to more accurate prediction. However, human input could also introduce or exacerbate bias in the predictions. In general, with very few exceptions, actuarial approaches are as accurate as or more accurate than “expert” judgment (Ægisdóttir et al., 2006; Baird & Wagner, 2000; Dawes et al., 1989; Grove et al., 2000; Grove & Meehl, 1996). This is also likely true with respect to predicting player performance in sports (Den Hartigh et al., 2018). Moreover, the superiority of actuarial approaches to human judgment tends to hold even when the expert is given more information than the actuarial approach (Dawes et al., 1989). In addition, actuarial predictions outperform human judgment even when the human is given the result of the actuarial prediction (Kahneman, 2011). Allowing experts to override actuarial predictions consistently leads to lower predictive accuracy (Garb & Wood, 2019).
There is sometimes a misconception that formulas cannot account for qualitative information. However, that is not true. Qualitative information can be scored or coded to be quantified so that it can be included in statistical formulas. For instance, if an expert scout is able to meaningfully assess a player’s cognitive and motivational factors (i.e., the “X factor” or “intangibles”), the scout can score this across multiple players and include these data in the actuarial prediction formula. For instance, the scout could use a rating scale (e.g., 1 = “poor”; 2 = “fair”; 3 = “good”; 4 = “very good”; 5 = “excellent”) to code (i.e., translate) their qualitative judgment into a quantifiable rating that can be integrated with other information in the actuarial formula. That said, the quality of predictions rests on the quality and relevance of the assessment information for the particular prediction decision. If the assessment data are lousy, it is unlikely that a statistical algorithm (or a human for that matter) will make an accurate prediction: “Garbage in, garbage out”. A statistical formula cannot rescue inaccurate assessment data.
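As a small illustration of coding a qualitative judgment, the following Python sketch translates a scout's verbal ratings into the 1 to 5 scale described above. The player labels and ratings are hypothetical, and the function name `code_rating` is my own; the coded values could then be entered as one predictor in an actuarial formula.

```python
# Rating scale from the text: verbal label -> numeric code.
RATING_SCALE = {"poor": 1, "fair": 2, "good": 3, "very good": 4, "excellent": 5}


def code_rating(label):
    """Translate a verbal rating into its numeric code (hypothetical helper)."""
    key = label.strip().lower()
    if key not in RATING_SCALE:
        raise ValueError(f"Unknown rating: {label!r}")
    return RATING_SCALE[key]


# Hypothetical scout ratings of players' "intangibles".
scout_notes = {"Player A": "Excellent", "Player B": "fair", "Player C": "Very good"}

coded_ratings = {player: code_rating(label) for player, label in scout_notes.items()}
print(coded_ratings)
```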
15.3 Errors in Human Judgment
Human judgment is naturally subject to errors. Common heuristics, cognitive biases, and fallacies are described in Chapter 14. Below, I describe a few errors to which human judgment seems particularly prone.
When operating freely, clinicians and medical experts (and humans more generally) tend to overestimate exceptions to the established rules (i.e., the broken leg syndrome). Meehl (1957) acknowledged that there may be some situations where it is glaringly obvious that the statistical formula would be incorrect because it fails to account for an important factor. He called these special cases “broken leg” cases, in which the human should deviate from the formula (i.e., broken leg countervailing). The example goes like this:
If a sociologist were predicting whether Professor X would go to the movies on a certain night, he might have an equation involving age, academic specialty, and introversion score. The equation might yield a probability of .90 that Professor X goes to the movie tonight. But if the family doctor announced that Professor X had just broken his leg, no sensible sociologist would stick with the equation. Why didn’t the factor of ‘broken leg’ appear in the formula? Because broken legs are very rare, and in the sociologist’s entire sample of 500 criterion cases plus 250 cross-validating cases, he did not come upon a single instance of it. He uses the broken leg datum confidently, because ‘broken leg’ is a subclass of a larger class we may crudely denote as ‘relatively immobilizing illness or injury,’ and movie-attending is a subclass of a larger class of ‘actions requiring moderate mobility.’
— Meehl (1957, pp. 269–270)
However, people too often think that cases where they disagree with the statistical algorithm are broken leg cases. People too often think their case is an exception to the rule. As a result, they too often change the result of the statistical algorithm and are more likely to be wrong than right in doing so. Because actuarial methods are based on actual population levels (i.e., base rates), unique exceptions are not overestimated.
Actuarial predictions are perfectly reliable: they will always return the same conclusion given an identical set of data. By contrast, the human judge is likely to disagree both with others and with themselves given the same set of symptoms.
The decision by an expert (and by all humans) is likely to be influenced by past experiences. Actuarial methods are based on objective algorithms, and past personal experience and personal biases do not factor into any decisions. Humans give weight to less relevant information, and often give too much weight to singular variables. Actuarial formulas do a better job of focusing on relevant variables. Computers are good at factoring in base rates. Humans tend to ignore base rates (base rate neglect).
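To illustrate why base rates matter, here is a brief Python sketch that applies Bayes' theorem. All numbers are hypothetical: suppose 10% of players have a breakout season (the base rate), and a particular scouting sign is present in 80% of players who break out but also in 20% of players who do not.

```python
def posterior_probability(base_rate, hit_rate, false_alarm_rate):
    """Probability of the outcome given that the sign is present (Bayes' theorem)."""
    true_positives = base_rate * hit_rate
    false_positives = (1 - base_rate) * false_alarm_rate
    return true_positives / (true_positives + false_positives)


# Hypothetical values chosen for illustration only.
probability = posterior_probability(base_rate=0.10, hit_rate=0.80, false_alarm_rate=0.20)
print(round(probability, 2))  # 0.31
```

Even with a fairly informative sign, the probability of a breakout is only about .31 because breakouts are uncommon to begin with. A judge who neglects the base rate might instead put the probability near .80.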
Computers are better at accurately weighing predictors and calculating unbiased risk estimates. In an actuarial formula, the relevant predictors are weighted according to their predictive power.
Humans are typically given no feedback on their judgments. To improve accuracy of judgments, it is important for feedback to be clear, consistent, and timely. Intuition is a form of recognition-based judgment (i.e., recognizing cues that provide access to information in memory). Development of strong intuition depends on the quality and speed of feedback, in addition to having adequate opportunities to practice [i.e., sufficient opportunities to learn the cues; Kahneman (2011)]. The quality and speed of the feedback tend to benefit anesthesiologists who often quickly learn the results of their actions. By contrast, radiologists tend not to receive quality feedback about the accuracy of their diagnoses, including their false-positive and false-negative decisions (Kahneman, 2011).
In general, many so-called experts are “pseudo-experts” who do not know the boundaries of their competence—that is, they do not know what they do not know; they have the illusion of validity of their predictions and are overconfident about their predictions (Kahneman, 2011). Yet, many people arrogantly proclaim to have predictive powers, including in low-validity environments such as fantasy football. Indeed, pundits are more likely to be television guests if they are opinionated, clear, and (overly) confident and make big, bold predictions, because they are more entertaining and their predictions seem more compelling [even though they tend to be less accurate than individuals whose thinking is more complex and less decisive; Kahneman (2011); Silver (2012)]. Consider sports pundits like Stephen A. Smith and Skip Bayless who make bold predictions with uber confidence. Optimism and (over)confidence are valued by society (Kahneman, 2011). Nevertheless, true experts know their limits in terms of knowledge and ability to predict.
Intuitions tend to be skilled when a) the environment is regular and predictable, and b) there is opportunity to learn the regularities, cues, and contingencies through extensive practice (Kahneman, 2011). Example domains that meet these conditions supporting intuition include activities such as chess, bridge, and poker, and occupations such as medical providers, athletes, and firefighters. By contrast, fantasy football and other domains such as stock-picking, clinical psychology, and other long-term forecasts are low-validity environments that are irregular and unpredictable. In environments that do not have stable regularities, intuition cannot be trusted (Kahneman, 2011).
15.4 Humans Versus Computers
15.4.1 Advantages of Computers
Here are some advantages of computers over humans, including “experts”:
- Computers can process lots of information simultaneously. Humans can too, but computers can do so to an even greater degree.
- Computers are faster at making calculations.
- Given the same input, a formula will give the exact same result every time. Humans’ judgment tends to be inconsistent both across raters and within raters over time, when trying to make judgments or predictions from complex information (Kahneman, 2011). As noted in Section 8.8.3, reliability sets the upper bound for validity, so unreliable judgments cannot be accurate (i.e., valid).
- Computations by computers are error-free (as long as the computations are programmed correctly).
- Computers’ judgments will not be biased by fatigue or emotional responses.
- Computers’ judgments will tend not to be biased in the way that humans’ cognitive biases are. Computers are less likely to be overconfident in their judgments.
- Computers can more accurately weight the set of predictors based on large data sets. Humans tend to give too much weight to singular predictors. Experts may attempt to be clever and to consider complex combinations of predictors, but doing so often reduces validity (Kahneman, 2011). Simple combinations of predictors often outperform more complex combinations (Kahneman, 2011).
15.4.2 Advantages of Humans
Computers are bad at some things too. Here are some advantages of humans over computers (as of now):
- Humans can be better at identifying patterns in data (but also can mistakenly identify patterns where there are none—i.e., illusory correlation).
- Humans can be flexible and take a different approach if a given approach is not working.
- Humans are better at tasks requiring creativity and imagination, such as developing theories that explain phenomena.
- Humans have the ability to reason, which is especially important when dealing with complex, abstract, or open-ended problems, or problems that have not been faced before (or for which we have insufficient data).
- Humans are better able to learn.
- Humans are better at holistic, gestalt processing, including facial and linguistic processing.
There may be situations in which a human judgment would do better than an actuarial judgment. One situation where human judgment would be important is when no actuarial method exists for the judgment or prediction. For instance, when no actuarial method exists for the diagnosis or disorder (e.g., suicide), it is up to the clinician. However, we could collect data on the outcomes or on clinicians’ judgments to develop an actuarial method that will be more reliable than the clinicians’ judgments. That is, an actuarial method developed based on clinicians’ judgments will be more accurate than clinicians’ judgments. In other words, we do not necessarily need outcome data to develop an actuarial method. We could use the client’s data as predictors of the clinicians’ judgments to develop a structured approach to prediction that weighs factors similarly to clinicians, but with more reliable predictions.
Another situation in which human judgment could outperform a statistical algorithm is in true “broken leg” cases, e.g., important and rare events (edge cases) that are not yet accounted for by the algorithm.
Another situation in which human judgment could be preferable is if advanced, complex theories exist. Computers have a difficult time adhering to complex theories, so clinicians may be better suited. However, psychology does not currently have complex theories that are accurate enough for this purpose. We would need strong theory informed by data regarding causal influences, and accurate measures to assess them. Nevertheless, predictive accuracy can be improved when considering theory (Garb & Wood, 2019; Silver, 2012).
If the prediction requires complex configural relations that a computer will have a difficult time replicating, a clinician’s judgment may be preferred. Although it is theoretically possible for a person to work through these complex relations accurately, it is highly unlikely. Humans tend to be better than computers at holistic pattern recognition (such as recognizing language and faces). But computers are getting better at holistic pattern recognition through machine learning.
In sum, the human seeks to integrate information to make a decision, but is biased.
15.4.3 Comparison of Evidence
Hundreds of studies have examined clinical versus actuarial prediction methods across many disciplines. Findings consistently show that actuarial methods are as accurate as or more accurate than human judgment/prediction methods. “There is no controversy in social science that shows such a large body of qualitatively diverse studies coming out so uniformly…as this one” (Meehl, 1986, pp. 373–374).
Actuarial methods are particularly valuable for criterion-referenced assessment tasks, in which the aim is to predict specific events or outcomes (Garb & Wood, 2019). For instance, actuarial methods have shown promise in predicting violence, criminal recidivism, psychosis onset, course of mental disorders, treatment selection, treatment failure, suicide attempts, and suicide (Garb & Wood, 2019). Actuarial methods are especially important to use in low-validity environments (like fantasy football) in which there is considerable uncertainty and unpredictability (Kahneman, 2011).
Moreover, actuarial methods are explicit; they can be transparent and lead to informed scientific criticism to improve them. By contrast, human judgment methods are not typically transparent; human judgment relies on mental processes that are often difficult to specify.
15.5 Why Judgment is More Widely Used Than Statistical Formulas
Despite actuarial methods being generally more accurate than human judgment, judgment is much more widely used by clinicians. There are several reasons why actuarial methods have not caught on; one reason is professional traditions. Experts in any field do not like to think that a computer could outperform them. Some practitioners argue that judgment/prediction is an “art form” and that using a statistical formula is treating people like a number. However, using an approach (i.e., human judgment) that systematically leads to less accurate decisions and predictions is an ethical problem.
Some clinicians do not think that group averages (e.g., in terms of which treatment is most effective) apply to an individual client. This invokes the distinction between nomothetic (group-level) inferences and idiographic (individual-level) inferences. However, the scientific evidence and probability theory strongly indicate that it is better to generalize from group-level evidence than to throw out all the evidence and take the approach of “anything goes.” Clinicians frequently commit the broken leg fallacy, i.e., thinking that their client is an exception to the algorithmic prediction. In most cases, deviating from the statistical formula will result in less accurate predictions. People tend to overestimate the probability of low base rate conditions and events.
Another reason why actuarial methods have not caught on is the belief that receiving a treatment is the only thing that matters. But it is an empirical question which treatment is most effective for whom. What if we could do better? For example, we could potentially use a formula to identify the most effective treatment for a client. Some treatments are no better than placebo; other treatments are actually harmful (Lilienfeld, 2007; Williams et al., 2021).
Another reason why judgment methods are more widely used than actuarial methods is that so-called “experts” (and people in general) show overconfidence in their predictions—clinicians, experts, and humans in general think they are more accurate than they actually are. We see this when examining their calibration; their predictions tend to be miscalibrated. For example, things they report with 80% confidence occur less than 80% of the time, an example of overprecision in their predictions. Humans will sometimes be correct by chance, and they tend to mis-attribute that to their skill; humans tend to remember the successes and forget the failures.
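One simple way to examine calibration is to group predictions by the confidence the forecaster stated and compare that confidence with how often the predicted events actually occurred. The Python sketch below does this for made-up forecasts; the function name `calibration_table` and the data are hypothetical.

```python
from collections import defaultdict


def calibration_table(forecasts):
    """Map each stated probability to the observed proportion of events.

    `forecasts` is a list of (stated_probability, outcome) pairs, where
    outcome is 1 if the predicted event occurred and 0 if it did not.
    """
    outcomes_by_confidence = defaultdict(list)
    for stated_probability, outcome in forecasts:
        outcomes_by_confidence[stated_probability].append(outcome)
    return {
        stated: sum(outcomes) / len(outcomes)
        for stated, outcomes in sorted(outcomes_by_confidence.items())
    }


# Hypothetical pundit: 10 predictions made with 80% confidence (6 correct)
# and 10 predictions made with 60% confidence (5 correct).
forecasts = (
    [(0.8, 1)] * 6 + [(0.8, 0)] * 4
    + [(0.6, 1)] * 5 + [(0.6, 0)] * 5
)

table = calibration_table(forecasts)
for stated, observed in table.items():
    print(f"stated {stated:.0%}, observed {observed:.0%}")
```

For a well-calibrated forecaster, the observed proportion would be close to the stated probability in each group. In this made-up example, the events predicted with 80% confidence occurred only 60% of the time, which is the pattern of overconfidence described above.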
Another argument against using actuarial methods is that “no methods exist”. In some cases, that is true—actuarial methods do not yet exist for some prediction problems. However, one can always create an algorithm of the experts’ judgments, even if one does not have access to the outcome information. A model of clinicians’ responses tends to be more accurate than clinicians’ judgments themselves because the model gives the same outcome with the same input data—i.e., it is perfectly reliable.
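Here is a minimal Python sketch of building a model of a judge's responses. It assumes a single cue and made-up ratings for simplicity; in practice, one would likely use multiple regression with several cues. The judge gives different ratings to cases with the same cue value, whereas the fitted line always returns the same rating for the same input.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sum_sq_x = sum((xi - mean_x) ** 2 for xi in x)
    sum_cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sum_cross / sum_sq_x
    return slope, mean_y - slope * mean_x


# Hypothetical data: a cue scored 1 to 3 and the judge's rating of each case.
# Note that the judge rates cases with the same cue value differently.
cue = [1, 1, 2, 2, 3, 3]
judge_rating = [2, 4, 5, 5, 6, 8]

slope, intercept = fit_line(cue, judge_rating)


def model_of_judge(cue_value):
    """Rating implied by the judge's own average policy."""
    return intercept + slope * cue_value


print(slope, intercept)    # 2.0 1.0
print(model_of_judge(1))   # 3.0, every time the cue equals 1
```

No outcome data were needed to build this model: it captures how the judge weighs the cue on average and removes the inconsistency in the judge's individual ratings.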
Another argument from some clinicians is that, “My job is to understand, not to predict”. But what kind of understanding does not involve predictions? Accurate predictions help in understanding. Knowing how people would perform in different conditions is the same thing as good understanding.
15.6 Steps to Conduct Actuarial Approaches
Here are several steps to conduct actuarial approaches (Den Hartigh et al., 2018; Kahneman, 2011):
- Determine a set of relevant variables to measure
- Determine how you will combine the variables
  - Do some variables have more weight than other variables?
- Determine how the variables will be scored
  - e.g., a 7-point Likert scale
- Combine the scores based on the pre-specified formula
- Use the final score to make your prediction (for selecting players)
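The steps above can be sketched in Python as follows. This is one possible illustration, not a recommended formula: the variable names, the weights, and the 7-point ratings for the three made-up players are all hypothetical.

```python
# Steps 1 and 2: choose the variables and pre-specify their weights (hypothetical).
WEIGHTS = {"past_production": 3, "offensive_line": 2, "durability": 1}

# Step 3: each variable is scored on a 7-point scale (1 = lowest, 7 = highest).
SCALE_MIN, SCALE_MAX = 1, 7


def composite_score(ratings):
    """Step 4: combine the ratings using the pre-specified weights."""
    for variable, rating in ratings.items():
        if not SCALE_MIN <= rating <= SCALE_MAX:
            raise ValueError(f"{variable} rating {rating} is outside the 1-7 scale")
    return sum(WEIGHTS[variable] * ratings[variable] for variable in WEIGHTS)


def rank_players(all_ratings):
    """Step 5: order players by composite score, highest first."""
    scores = {player: composite_score(r) for player, r in all_ratings.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ratings for three made-up players.
ratings = {
    "Player A": {"past_production": 6, "offensive_line": 5, "durability": 3},
    "Player B": {"past_production": 5, "offensive_line": 6, "durability": 7},
    "Player C": {"past_production": 7, "offensive_line": 2, "durability": 4},
}

player_ranking = rank_players(ratings)
print(player_ranking)
# [('Player B', 34), ('Player A', 31), ('Player C', 29)]
```

The key design choice is that the weights and scoring rules are fixed before any player is evaluated, so every player is judged by the same formula.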
15.7 Challenges of Data-Driven Approaches and How to Address
There are various challenges of data-driven approaches. First, they are sometimes not interpretable or consistent with theory. Second, they tend to overfit the data. Overfitting is described in Section 11.6. Third, as a result of overfitting the data, they tend to show shrinkage.
15.7.1 Shrinkage
In general, there is often shrinkage of estimates from a training data set to a test data set. Shrinkage occurs when variables with stronger predictive power in the original data set show somewhat smaller predictive power (smaller regression coefficients) when applied to new groups. Shrinkage reflects model overfitting, i.e., when the model explains error variance by capitalizing on chance. Shrinkage is especially likely when the original sample is small and/or unrepresentative and the number of variables considered for inclusion is large. To help minimize the extent of shrinkage, it is recommended to apply cross-validation.
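The following Python simulation sketches how capitalizing on chance produces shrinkage. It is a deliberately extreme, hypothetical scenario: the outcome and all 50 candidate predictors are pure random noise, so the true correlation of every predictor with the outcome is zero. We pick the predictor that looks best in a small training sample and then check it in a large new sample. The sample sizes, number of predictors, and seed are arbitrary choices.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sum_cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_sq_x = sum((a - mean_x) ** 2 for a in x)
    sum_sq_y = sum((b - mean_y) ** 2 for b in y)
    return sum_cross / sqrt(sum_sq_x * sum_sq_y)


def simulate_shrinkage(n_train=20, n_test=2000, n_predictors=50, seed=1):
    """Return (training r, test r) for the predictor that looked best in training."""
    rng = random.Random(seed)
    y_train = [rng.gauss(0, 1) for _ in range(n_train)]
    y_test = [rng.gauss(0, 1) for _ in range(n_test)]
    best_train_r, best_test_r = 0.0, 0.0
    for _ in range(n_predictors):
        # Each predictor is noise that is unrelated to the outcome.
        x_train = [rng.gauss(0, 1) for _ in range(n_train)]
        x_test = [rng.gauss(0, 1) for _ in range(n_test)]
        r_train = pearson_r(x_train, y_train)
        if abs(r_train) > abs(best_train_r):
            best_train_r = r_train
            best_test_r = pearson_r(x_test, y_test)
    return best_train_r, best_test_r


train_r, test_r = simulate_shrinkage()
print(f"training r = {train_r:.2f}; test r = {test_r:.2f}")
```

With a small sample and many candidate predictors, the best-looking predictor tends to show a sizable correlation in the training data by chance alone, and that correlation tends to shrink toward zero in the new sample.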
15.7.2 Cross-Validation
Cross-validation with large, representative samples can help evaluate the amount of shrinkage of estimates, particularly for more complex models such as machine learning models (Ursenbach et al., 2019). Ideally, cross-validation would be conducted with a separate sample (external cross-validation) to see the generalizability of estimates. However, you can also do internal cross-validation. For example, you can perform k-fold cross-validation, where you:
- split the data set into k groups
- for each unique group:
  - take the group as a hold-out data set (also called a test data set)
  - take the remaining groups as a training data set
  - fit a model on the training data set and evaluate it on the test data set
- after all k-folds have been used as the test data set, and all models have been fit, you average the estimates across the models, which presumably yields more robust, generalizable estimates
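The k-fold procedure above can be sketched in plain Python as follows. This is a minimal illustration under several assumptions: the model is a one-predictor linear regression, the estimate that is averaged across folds is the mean squared error on the hold-out data, and the data are simulated. The function names are my own.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices and split them into k roughly equal groups."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sum_sq_x = sum((xi - mean_x) ** 2 for xi in x)
    sum_cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sum_cross / sum_sq_x
    return slope, mean_y - slope * mean_x


def cross_validate(x, y, k=5, seed=0):
    """Average hold-out mean squared error across the k folds."""
    fold_errors = []
    for test_idx in k_fold_indices(len(x), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        # Fit on the training groups only.
        slope, intercept = fit_line(
            [x[i] for i in train_idx], [y[i] for i in train_idx]
        )
        # Evaluate on the hold-out group.
        squared_errors = [(y[i] - (intercept + slope * x[i])) ** 2 for i in test_idx]
        fold_errors.append(sum(squared_errors) / len(squared_errors))
    return sum(fold_errors) / len(fold_errors)


# Simulated (hypothetical) data: a linear relation plus random noise.
rng = random.Random(42)
x_values = list(range(20))
y_values = [2 * xi + 1 + rng.gauss(0, 3) for xi in x_values]

cv_error = cross_validate(x_values, y_values, k=5)
print(f"cross-validated mean squared error: {cv_error:.2f}")
```

Because each observation is used for evaluation only when it was not used for fitting, the averaged error gives a less optimistic estimate of how the model would perform on new data than the error computed on the training data.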
15.8 Best Actuarial Approaches to Prediction
The best actuarial models tend to be relatively simple (parsimonious) models that account for one or several of the most important predictors and their optimal weightings, and that account for the base rate of the phenomenon. Multiple regression and/or prior literature can be used to identify the weights of various predictors. Even unit-weighted formulas (formulas whose predictor variables are equally weighted with a weight of one) can sometimes generalize better to other samples than complex weightings (Garb & Wood, 2019; Kahneman, 2011). Differential weightings sometimes capture random variance and over-fit the model, thus leading to predictive accuracy shrinkage in cross-validation samples (Garb & Wood, 2019), as described in Section 15.7.1. The choice of predictor variables often matters more than their weighting.
An emerging technique that holds promise for increasing predictive accuracy of actuarial methods is machine learning (Garb & Wood, 2019). However, one challenge of some machine learning techniques is that they are like a “black box” and are not transparent, which raises ethical concerns (Garb & Wood, 2019). Moreover, machine learning also tends to lead to overfitting and shrinkage. Machine learning may be most valuable when the data available are complex and there are many predictor variables (Garb & Wood, 2019), and when the model is validated with cross-validation.
15.9 Conclusion
In general, it is better to develop and use structured, actuarial approaches than informal approaches that rely on human judgment or judgment by “so-called” experts. Actuarial approaches to prediction tend to be as accurate as or more accurate than expert judgment. Nevertheless, in many domains, human judgment tends to be much more widely used than actuarial approaches.