I need your help!

I want your feedback to make the book better for you and other readers. If you find typos, errors, or places where the text may be improved, please let me know. The best ways to provide feedback are by GitHub or hypothes.is annotations.

Opening an issue or submitting a pull request on GitHub: https://github.com/isaactpetersen/Principles-Psychological-Assessment

Adding an annotation using hypothes.is. To add an annotation, select some text and then click the symbol on the pop-up menu. To see the annotations of others, click the symbol in the upper right-hand corner of the page.

Chapter 10 Clinical Judgment Versus Algorithmic Prediction

10.1 Approaches to Prediction

There are two primary approaches to prediction: human (i.e., clinical) judgment and the actuarial (i.e., statistical) method.

10.1.1 Human/Clinical Judgment

Using the clinical judgment method of prediction, all gathered information is collected and formulated into a diagnosis or prediction in the clinician’s mind. The clinician selects, measures, and combines risk factors and produces risk estimate solely according to clinical experience and judgment.

10.1.2 Actuarial/Statistical Method

In the actuarial or statistical method of prediction (i.e., statistical prediction rules), information is gathered and combined systematically in an evidence-based statistical prediction formula, and established cutoffs are used. The method is based on equations and data, so both are needed.

An example of a statistical method of prediction is the Violence Risk Appraisal Guide (Rice et al., 2013). The Violence Risk Appraisal Guide is used in an attempt to predict violence and is used for parole decisions. For instance, the equation might be something like Equation (10.1):

\[\begin{equation} \scriptsize \text{Violence risk} = \beta \cdot \text{conduct disorder} + \beta \cdot \text{substance use} + \beta \cdot \text{suspended from school} + \beta \cdot \text{childhood aggression} + ... \tag{10.1} \end{equation}\]

Then, based on their score and the established cutoffs, a person is given a “low risk”, “medium risk”, or “high risk” designation.

10.1.3 Combining Human Judgment and Statistical Algorithms

There are numerous ways in which humans and statistical algorithms could be involved. On one extreme, humans make all judgments (consistent with the clinical judgment approach). On the other extreme, although humans may be involved in data collection, a statistical formula makes all decisions based on the input data, consistent with an actuarial approach. However, the clinical judgment and actuarial approaches can be combined in a hybrid way (J. Dana & Thomas, 2006). For example, to save time and money, a clinician might use an actuarial approach in all cases, but might only use a clinical approach when the actuarial approach gives a “positive” test. Or, the clinician might use both human judgment and an actuarial approach independently to see whether they agree. That is, the clinician may make a prediction based on their judgment and might also generate a prediction from an actuarial approach.

The challenge is what to do when the human and the algorithm disagree. Hypothetically, humans reviewing and adjusting the results from the statistical algorithm could lead to more accurate predictions. However, human input also could lead to the possibility or exacerbation of biased predictions. In general, with very few exceptions, actuarial approaches are as accurate or more accurate than clinical judgment (Ægisdóttir et al., 2006; Baird & Wagner, 2000; Dawes et al., 1989; Grove et al., 2000; Grove & Meehl, 1996). Moreover, the superiority of actuarial approaches to clinical judgment tends to hold even when the clinician is given more information than the actuarial approach (Dawes et al., 1989). In addition, actuarial predictions outperform human judgment even when the human is given the result of the actuarial prediction (Kahneman, 2011). Allowing clinicians to override actuarial predictions consistently leads to lower predictive accuracy (Garb & Wood, 2019).

In general, validity tends to increase with greater structure (e.g., structured or semi-structured interviews as opposed to free-flowing, unstructured interviews), in terms of administration, responses, scoring, interpretation, etc. Unstructured interviews are susceptible to confirmatory bias, such as exploring only the diagnoses that confirm the clinician’s hypotheses rather than attempting to explore diagnoses that disconfirm the clinician’s hypotheses, which leads to fewer diagnoses. Unstructured interviews are also susceptible to bias relating to race, gender, class, etc. (Garb, 1997, 2005).

There is sometimes a misconception that formulas cannot account for qualitative information. However, that is not true. Qualitative information can be scored or coded to be quantified so that it can be included in statistical formulas. That said, the quality of predictions rests on the quality and relevance of the assessment information for the particular prediction/judgment decision. If the assessment data are lousy, it is unlikely that a statistical algorithm (or a human for that matter) will make an accurate prediction: “Garbage in, garbage out”. A statistical formula cannot rescue inaccurate assessment data.

10.2 Errors in Clinical Judgment

Clinical judgment is naturally subject to errors. Below, I describe a few errors to which clinical judgment seems particularly prone.

When operating freely, clinicians tend to over-estimate exceptions to the established rules (i.e., the broken leg syndrome). Meehl (1957) acknowledged that there may be some situations where it is glaringly obvious that the statistical formula would be incorrect because it fails to account for an important factor. He called these special cases “broken leg” cases, in which the human should deviate from the formula (i.e., broken leg countervailing). The example goes like this:

“If a sociologist were predicting whether Professor X would go to the movies on a certain night, he might have an equation involving age, academic specialty, and introversion score. The equation might yield a probability of .90 that Professor X goes to the movie tonight. But if the family doctor announced that Professor X had just broken his leg, no sensible sociologist would stick with the equation. Why didn’t the factor of ‘broken leg’ appear in the formula? Because broken legs are very rare, and in the sociologist’s entire sample of 500 criterion cases plus 250 cross-validating cases, he did not come upon a single instance of it. He uses the broken leg datum confidently, because ‘broken leg’ is a subclass of a larger class we may crudely denote as ‘relatively immobilizing illness or injury,’ and movie-attending is a subclass of a larger class of ‘actions requiring moderate mobility.’” (Meehl, 1957, pp. 269–270)

However, people too often think that cases where they disagree with the statistical algorithm are broken leg cases. People too often think their case is an exception to the rule. As a result, they too often change the result of the statistical algorithm and are more likely to be wrong than right in doing so. Because actuarial methods are based on actual population levels (i.e., base rates), unique exceptions are not over-estimated.

Actuarial predictions are perfectly reliable—they will always return the same conclusion given an identical set of data. The human judge is likely to both disagree with others and with themselves given the same set of symptoms.

A clinician’s decision is likely to be influenced by past experiences, and given the sample of humanity that the clinician is exposed to, the clinician is likely (based on prior experience) to over-estimate the likelihood of occurrence of infrequent phenomena. Actuarial methods are based on objective algorithms, and past personal experience and personal biases do not factor into any decisions. Clinicians give weight to less relevant information, and often give too much weight to singular variables (e.g., Graduate Record Examination scores). Actuarial formulas do a better job of focusing on relevant variables. Computers are good at factoring in base rates, inverse conditional probabilities, etc. Humans ignore base rates (base rate neglect), and tend to show confusion of the inverse.

Computers are better at accurately weighing risk factors and calculating unbiased risk estimates. In an actuarial formula, the relevant risk factors are weighted according to their predictive power. This stands in contrast to the Diagnostic and Statistical Manual of Mental Disorders (DSM), in which each symptom is equally weighted.

Humans are typically given no feedback on their judgments. To improve accuracy of judgments, it is important for feedback to be clear, consistent, and timely. It is especially unlikely for feedback in clinical psychology to be timely because we will have to wait a long time to see the outcomes of predictions. The feedback that clinicians receive regarding their clients’ long-term outcomes tends to be sparse, delayed or nonexistent, and too ambiguous to support learning from experience (Kahneman, 2011). Moreover, it is important to distinguish short-term anticipation in a therapy session (e.g., what the client will say next) from long-terms predictions predictions. Although clinicians may have strong intuitions dealing with short-term anticipation in therapy sessions, their long-term predictions are not accurate (Kahneman, 2011).

Clinicians are susceptible to representative schema biases (Dawes, 1986). Clinicians are exposed to a skewed sample of humanity, and they make judgments based on a prototype from their (biased) experiences. This is known as the representativeness heuristic. Different clinicians may have different prototypes, leading to lower inter-rater reliability.

Intuition is a form of recognition-based judgment (i.e., recognizing cues that provide access to information in memory). Development of strong intuition depends on the quality and speed of feedback, in addition to having adequate opportunities to practice [i.e., sufficient opportunities to learn the cues; Kahneman (2011)]. The quality and speed of the feedback tend to benefit anesthesiologists who often quickly learn the results of their actions. By contrast, radiologists tend not to receive quality feedback about the accuracy of their diagnoses, including their false-positive and false-negative decisions (Kahneman, 2011).

In general, many so-called experts are “pseudo-experts” who do not know the boundaries of their competence—that is, they do not know what they do not know; they have the illusion of validity of their predictions and are overconfident about their predictions (Kahneman, 2011). Yet, many people arrogantly proclaim to have predictive powers, including in low-validity environments such as clinical psychology. True experts know their limits in terms of knowledge and ability to predict.

Intuitions tend to be skilled when a) the environment is regular and predictable, and b) there is opportunity to learn the regularities, cues, and contingencies through extensive practice (Kahneman, 2011). Example domains that meet these conditions supporting intuition include activities such as chess, bridge, and poker, and occupations such as medical providers, athletes, and firefighters. By contrast, clinical psychology and other domains such as stock-picking and other long-terms forecasts are low-validity environments that are irregular and unpredictable. In environments that do not have stable regularities, intuition cannot be trusted (Kahneman, 2011).

10.3 Humans Versus Computers

10.3.1 Advantages of Computers

Here are some advantages of computers over humans, including “experts”:

Computers can process lots of information simultaneously. So can humans. But computers can to an even greater degree.
Computers are faster at making calculations.
Given the same input, a formula will give the exact same result everytime. Humans’ judgment tends to be inconsistent both across raters and within rater across time, when trying to make judgments or predictions from complex information (Kahneman, 2011). As noted in Section 5.5, reliability sets the upper bound for validity, so unreliable judgments cannot be valid.
Computations by computers are error-free (as long as the computations are programmed correctly).
Computers’ judgments will not be biased by fatigue or emotional responses.
Computers’ judgments will tend not to be biased in the way that humans’ cognitive biases are, such as with anchoring bias, representativeness bias, confirmation bias, or recency bias. Computers are less likely to be over-confident in their judgments.
Computers can more accurately weight the set of predictors based on large data sets. Humans tend to give too much weight to singular predictors. Experts may attempt to be clever and to consider complex combinations of predictors, but doing so often reduces validity (Kahneman, 2011). Simple combinations of predictions often outperform more complex combinations (Kahneman, 2011).

10.3.2 Advantages of Humans

Computers are bad at some things too. Here are some advantages of humans over computers (as of now):

Humans can be better at identifying patterns in data (but also can mistakenly identify patterns where there are none).
Humans can be flexible and take a different approach if a given approach is not working.
Humans are better at tasks requiring creativity and imagination, such as developing theories that explain phenomena.
Humans have the ability to reason, which is especially important when dealing with complex, abstract, or open-ended problems, or problems that have not been faced before (or for which we have insufficient data).
Humans are better able to learn.
Humans are better at holistic, gestalt processing, including facial and linguistic processing.

There may be situations in which a clinical judgment would do better than an actuarial judgment. One situation where clinical judgment would be important is when no actuarial method exists for the judgment or prediction. For instance, when no actuarial method exists for the diagnosis or disorder (e.g., suicide), it is up to the clinical judge. However, we could collect data on the outcomes or on clinicians’ judgments to develop an actuarial method that will be more reliable than the clinicians’ judgments. That is, an actuarial method developed based on clinicians’ judgments will be more accurate than clinicians’ judgments. In other words, we do not necessarily need clients’ outcome data to develop an actuarial method. We could use the client’s data as predictors of the clinicians’ judgments to develop a structured approach to prediction that weighs factors similarly to clinicians, but with more reliable predictions.

Another situation in which clinical judgment could outperform a statistical algorithm is in true “broken leg” cases, e.g., important and rare events (edge cases) that are not yet accounted for by the algorithm.

Another situation in which clinical judgment could be preferable is if advanced, complex theories exist. Computers have a difficult time adhering to complex theories, so clinicians may be better suited. However, we do not have any of these complex theories in psychology that are accurate. We would need strong theory informed by data regarding causal influences, and accurate measures to assess them. If theories alone were true and could explain everything, the psychoanalytic tradition would give us all the answers, as depicted in Figure 10.1. However, no theories in psychology are that good. Nevertheless, predictive accuracy can be improved when considering theory (Garb & Wood, 2019; Silver, 2012).

Figure 10.1: Conceptual Depiction of the Psychoanalytic Tradition.

If the diagnosis/prediction requires complex configural relations that a computer will have a difficult time replicating, a clinician’s judgment may be preferred. Although the likelihood that the clinician can accurately work through these complex relations is theoretically possible, it is highly unlikely. Holistic pattern recognition (such as language and faces) tends to be better by humans than computers. But computers are getting better with holistic pattern recognition through machine learning. An example of a simple decision (as opposed to a complex decision) would be whether to provide treatment A (e.g., cognitive behavior therapy) or treatment B (e.g., medication) for someone who shows melancholy.

In sum, the clinician seeks to integrate all information to make a decision, but is biased.

10.3.3 Comparison of Evidence

Hundreds of studies have examined clinical versus actuarial prediction methods across many disciplines. Findings consistently show that actuarial methods are as accurate or more accurate than clinical prediction methods (Ægisdóttir et al., 2006; Baird & Wagner, 2000; Dawes et al., 1989; Grove et al., 2000; Grove & Meehl, 1996). “There is no controversy in social science that shows such a large body of qualitatively diverse studies coming out so uniformly…as this one” (Meehl, 1986, pp. 373–374). In general, actuarial methods tend to be about 10% more accurate than clinical judgment (Grove et al., 2000). Thus, actuarial methods have incremental validity over clinical judgment. As noted earlier, the superiority of actuarial approaches to clinical judgment tends to hold even when the clinician is given more information than the actuarial approach (Dawes et al., 1989). In addition, actuarial predictions outperform human judgment even when the human is given the result of the actuarial prediction (Kahneman, 2011). Allowing clinicians to override actuarial predictions consistently leads to lower predictive accuracy (Garb & Wood, 2019).

Actuarial methods are particularly valuable for criterion-referenced assessment tasks, in which the aim is to predict specific events or outcomes (Garb & Wood, 2019). For instance, actuarial methods have shown promise in predicting violence, criminal recidivism, psychosis onset, course of mental disorders, treatment selection, treatment failure, suicide attempts, and suicide (Garb & Wood, 2019). Actuarial methods are especially important to use in low-validity environments (like clinical psychology) in which there is considerable uncertainty and unpredictability (Kahneman, 2011). By contrast, for norm-referenced assessment tasks, different approaches may be preferred over actuarial methods and statistical prediction rules. For instance, for norm-referenced assessment tasks such as describing personality and psychopathology (Garb & Wood, 2019), where the criterion is a latent construct (derived from the internal pattern of associations among the items) rather than a specific event or outcome, psychometric methods of scale construction such as factor analysis may be preferred (Garb & Wood, 2019)—i.e., the inductive approach to scale construction.

Another benefit of actuarial methods is that they are explicit; they can be transparent and lead to informed scientific criticism to improve them. By contrast, clinical judgment methods are not typically transparent; clinical judgment relies on mental processes that are often difficult to specify.

10.4 Accuracy of Different Statistical Models

It is important to evaluate the accuracy of different types of statistical models to provide guidance on which types of models may be most accurate for a given prediction problem. Simpler statistical formulas are more likely to generalize to new samples than complex statistical models due to model over-fitting. As described in Section 5.3.1.3.3.3, over-fitting occurs when the statistical model accounts for error variance (an overly specific prediction), which will not generalize to future samples.

Youngstrom et al. (2018) discussed the benefits of the probability nomogram. The probability nomogram combines the risk ratio with the base rate (according to Bayes’ theorem) to generate a prediction. The risk ratio (diagnostic likelihood ratio) is an index of the predictive validity of an instrument: it is the ratio of the probability that a test result is correct to the probability that the test result is incorrect. There are two types of diagnostic likelihood ratios: the positive likelihood ratio and the negative likelihood ratio. As described in Equation (9.61), the positive likelihood ratio is the probability that a person with the disease tested positive for the disease (true positive rate) divided by the probability that a person without the disease tested positive for the disease (false positive rate). As described in Equation (9.62), the negative likelihood ratio is the probability that a person with the disease tested negative for the disease (false negative rate) divided by the probability that a person without the disease tested negative for the disease (true negative rate). Using a probability nomogram, you start with the base rate, then plot the diagnostic likelihood ratio corresponding to a second source of information (e.g., positive test), then connect the base rate (pretest probability) to the posttest probability through the likelihood ratio. These approaches are described in greater detail in Section 12.5.

Youngstrom et al. (2018) ordered actuarial approaches to diagnostic classification from less to more complex:

Predicting from the base rate
Take the best
The probability nomogram
Naïve Bayesian algorithms
Logistic regression with one predictor
Logistic regression with multiple predictors
Least absolute shrinkage and selection option (LASSO)

As described in Section 9.1.2.4, predicting from the base rate is selecting the most likely outcome in every case. “Take the best” refers to focusing on the single variable with the largest validity coefficient, and making a decision based on it (based on some threshold). The probability nomogram combines the base rate (prior probability) with the information offered by an assessment finding to revise the probability. Naïve Bayesian algorithms use the probability nomogram with multiple assessment findings. Basically, you continue to calculate a new posttest probability based on each assessment finding, revising the pretest probability for the next assessment finding based on the posttest probability of the last assessment finding. However, this approach assumes that all predictors are uncorrelated, which is probably not true in practice.

Logistic regression is useful for classification problems because it deals with a dichotomous dependent variable, but one could extrapolate it to a continuous dependent variable with multiple regression. The effect size from logistic regression is equivalent to receiver operating characteristic (ROC) curve analysis. Logistic regression with multiple predictors combines multiple predictors into one regression model. Including multiple predictors in the same model optimizes the weights of multiple predictors, but it is important for predictors not to be collinear.

LASSO is a form of regression that can handle multiple predictors (like multiple regression). However, it can handle more predictors than multiple regression. For instance, LASSO can accommodate situations where there are more predictors than there are participants. Moreover, unlike multiple regression, it can handle collinear predictors. LASSO uses internal cross-validation to avoid over-fitting.

Youngstrom et al. (2018) found that model complexity improves accuracy but only to a point; some of the simpler models (Naïve Bayesian and Logistic regression) did just as well and in some cases better than LASSO models in classifying bipolar disorder. Although LASSO models showed the highest discrimination accuracy in the internal sample, they showed the largest shrinkage in the external sample. Moreover, the LASSO models showed poor calibration in the internal sample; they over-diagnosed bipolar disorder. By contrast, simpler models showed better calibration in the internal sample. The probability nomogram with one or multiple assessment findings showed better calibration than LASSO models because they accounted for the original base rate, so they did not over-diagnose bipolar disorder.

In summary, the best models are those that are relatively simple (parsimonious), that can account for one or several of the most important predictors and their optimal weightings, and that account for the base rate of the phenomenon. Multiple regression and/or prior literature can be used to identify the weights of various predictors. Even unit-weighted formulas (formulas whose predictors are equally weighted with a weight of one) can sometimes generalize better to other samples than complex weightings (Garb & Wood, 2019; Wainer, 1976). Differential weightings sometimes capture random variance and over-fit the model, thus leading to predictive accuracy shrinkage in cross-validation samples (Garb & Wood, 2019), as described below. The choice of predictive variables often matters more than their weighting.

In general, there is often shrinkage of estimates from training data set to a test data set. Shrinkage is when variables with stronger predictive power in the original data set tend to show somewhat smaller predictive power (smaller validity coefficients) when applied to new groups. Shrinkage reflects a model over-fitting (i.e., fitting to error by capitalizing on chance). Shrinkage is especially likely when the original sample is small and/or unrepresentative and the number of variables considered for inclusion is large. Cross-validation with large, representative samples can help evaluate the amount of shrinkage of estimates, particularly for more complex models such as machine learning models (Ursenbach et al., 2019). Ideally, cross-validation would be conducted with a separate sample (external cross-validation) to see the generalizability of estimates. However, you can also do internal cross-validation. For example, you can perform k-fold cross-validation, where you:

split the data set into k groups
for each unique group:
- take the group as a hold-out data set (also called a test data set)
- take the remaining groups as a training data set
- fit a model on the training data set and evaluate it on the test data set
- after all k-folds have been used as the test data set, and all models have been fit, you average the estimates across the models, which presumably yields more robust, generalizable estimates

An emerging technique that holds promise for increasing predictive accuracy of actuarial methods is machine learning (Garb & Wood, 2019). However, one challenge of some machine learning techniques is that they are like a “black box” and are not transparent, which raises ethical issues (Garb & Wood, 2019). Machine learning may be most valuable when the data available are complex and there are many predictive variables (Garb & Wood, 2019).

10.5 Getting Started

10.5.1 Load Libraries

Code

library("petersenlab") #to install: install.packages("remotes"); remotes::install_github("DevPsyLab/petersenlab")
library("pROC")
library("caret")
library("MASS")
library("randomForest")
library("e1071")
library("ranger")
library("ordinalForest")
library("elasticnet")
library("LiblineaR")
library("glmnet")
library("viridis")
library("MOTE")
library("tidyverse")
library("here")
library("tinytex")
library("rmarkdown")

10.5.2 Prepare Data

10.5.2.1 Load Data

aSAH is a data set from the pROC package (Robin et al., 2021) that contains test scores (s100b) and clinical outcomes (outcome) for patients.

Code

data(aSAH)
mydataActuarial <- aSAH

mydataActuarial$disorder <- NA
mydataActuarial$disorder[which(
  mydataActuarial$outcome == "Poor")] <- 1
mydataActuarial$disorder[which(
  mydataActuarial$outcome == "Good")] <- 0

10.5.2.2 Simulate Data

For reproducibility, I set the seed below. Using the same seed will yield the same answer every time. There is nothing special about this particular seed.

Code

set.seed(52242)

mydataActuarial$ordinal <- sample(
  0:2,
  nrow(mydataActuarial),
  replace = TRUE)

mydataActuarial$ordinalFactor <- 
  factor(mydataActuarial$ordinal, ordered = TRUE)

10.6 Fitting the Statistical Models

You can find an array of statistical models for classification and prediction available in the caret package (Kuhn, 2022) at the following link: https://topepo.github.io/caret/available-models.html (archived at https://perma.cc/E45U-YE6S)

10.6.1 Regression with One Predictor

Dichotomous Outcome:

Code

summary(
  glm(disorder ~ s100b,
  data = mydataActuarial))


Call:
glm(formula = disorder ~ s100b, data = mydataActuarial)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.1796     0.0561   3.202  0.00178 ** 
s100b         0.7417     0.1530   4.848 4.09e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.1942322)

    Null deviance: 26.124  on 112  degrees of freedom
Residual deviance: 21.560  on 111  degrees of freedom
AIC: 139.49

Number of Fisher Scoring iterations: 2

Ordinal Outcome:

Code

summary(
  polr(ordinalFactor ~ s100b,
  data = mydataActuarial))

Call:
polr(formula = ordinalFactor ~ s100b, data = mydataActuarial)

Coefficients:
       Value Std. Error t value
s100b 0.8485     0.6905   1.229

Intercepts:
    Value   Std. Error t value
0|1 -0.6055  0.2570    -2.3565
1|2  0.8514  0.2621     3.2480

Residual Deviance: 246.405 
AIC: 252.405

Code

lrm(
  ordinal ~ s100b,
  data = mydataActuarial,
  x = TRUE,
  y = TRUE)

Logistic Regression Model

lrm(formula = ordinal ~ s100b, data = mydataActuarial, x = TRUE, 
    y = TRUE)

                      Model Likelihood      Discrimination    Rank Discrim.    
                            Ratio Test             Indexes          Indexes    
Obs           113    LR chi2      1.59      R2       0.016    C       0.525    
 0             35    d.f.            1      R2(1,113)0.005    Dxy     0.050    
 1             39    Pr(> chi2) 0.2067    R2(1,100.4)0.006    gamma   0.051    
 2             39                           Brier    0.213    tau-a   0.033    
max |deriv| 1e-09                                                              

      Coef    S.E.   Wald Z Pr(>|Z|)
y>=1   0.6055 0.2570  2.36  0.0184  
y>=2  -0.8514 0.2621 -3.25  0.0012  
s100b  0.8485 0.6905  1.23  0.2191

Code

ordinalModel <- train(
  ordinalFactor ~ s100b,
  data = mydataActuarial,
  method = "polr")

print(ordinalModel)

Ordered Logistic or Probit Regression 

113 samples
  1 predictor
  3 classes: '0', '1', '2' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 113, 113, 113, 113, 113, 113, ... 
Resampling results across tuning parameters:

  method    Accuracy   Kappa      
  cauchit   0.2999569  0.002188742
  cloglog   0.3092742  0.014863850
  logistic  0.3070254  0.011181887
  loglog    0.3050676  0.008781209
  probit    0.3090156  0.014176878

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was method = cloglog.

Continuous Outcome:

Code

summary(
  lm(s100b ~ ndka,
  data = mydataActuarial))


Call:
lm(formula = s100b ~ ndka, data = mydataActuarial)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.37261 -0.14000 -0.09024  0.10732  0.76057 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.1705939  0.0234559   7.273 5.28e-11 ***
ndka        0.0038861  0.0005259   7.390 2.94e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2238 on 111 degrees of freedom
Multiple R-squared:  0.3298,    Adjusted R-squared:  0.3237 
F-statistic: 54.61 on 1 and 111 DF,  p-value: 2.936e-11

10.6.2 Regression with Multiple Predictors

Dichotomous Outcome:

Code

summary(
  glm(disorder ~ s100b + ndka + gender + age + wfns,
  data = mydataActuarial))


Call:
glm(formula = disorder ~ s100b + ndka + gender + age + wfns, 
    data = mydataActuarial)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.1888383  0.1527337   1.236 0.219099    
s100b         0.1537913  0.2338580   0.658 0.512231    
ndka          0.0006123  0.0012259   0.499 0.618515    
genderFemale -0.1713339  0.0824500  -2.078 0.040168 *  
age           0.0052547  0.0028551   1.840 0.068550 .  
wfns.L        0.4182090  0.1078207   3.879 0.000184 ***
wfns.Q        0.0261311  0.1295000   0.202 0.840479    
wfns.C        0.1669874  0.0834770   2.000 0.048064 *  
wfns^4       -0.0932508  0.1550179  -0.602 0.548784    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.1549053)

    Null deviance: 26.124  on 112  degrees of freedom
Residual deviance: 16.110  on 104  degrees of freedom
AIC: 120.56

Number of Fisher Scoring iterations: 2

Ordinal Outcome:

Code

summary(
  polr(ordinalFactor ~ s100b + ndka + gender + age + wfns,
  data = mydataActuarial))

Call:
polr(formula = ordinalFactor ~ s100b + ndka + gender + age + 
    wfns, data = mydataActuarial)

Coefficients:
                  Value Std. Error  t value
s100b         1.8489542   1.164022  1.58842
ndka          0.0003385   0.006903  0.04903
genderFemale -0.6508492   0.398431 -1.63353
age           0.0015384   0.013144  0.11705
wfns.L       -0.6994631   0.530547 -1.31838
wfns.Q       -1.3813985   0.613950 -2.25002
wfns.C       -0.1869263   0.429721 -0.43499
wfns^4        0.6399225   0.731127  0.87526

Intercepts:
    Value   Std. Error t value
0|1 -0.9359  0.7340    -1.2751
1|2  0.6024  0.7295     0.8257

Residual Deviance: 238.5788 
AIC: 258.5788

Code

lrm(
  ordinal ~ s100b + ndka + gender + age + wfns,
  data = mydataActuarial,
  x = TRUE,
  y = TRUE)

Logistic Regression Model

lrm(formula = ordinal ~ s100b + ndka + gender + age + wfns, data = mydataActuarial, 
    x = TRUE, y = TRUE)

                      Model Likelihood      Discrimination    Rank Discrim.    
                            Ratio Test             Indexes          Indexes    
Obs           113    LR chi2      9.42      R2       0.090    C       0.637    
 0             35    d.f.            8      R2(8,113)0.012    Dxy     0.275    
 1             39    Pr(> chi2) 0.3080    R2(8,100.4)0.014    gamma   0.275    
 2             39                           Brier    0.205    tau-a   0.184    
max |deriv| 3e-10                                                              

              Coef    S.E.   Wald Z Pr(>|Z|)
y>=1           0.4489 0.8968  0.50  0.6167  
y>=2          -1.0894 0.9016 -1.21  0.2269  
s100b          1.8487 1.1637  1.59  0.1122  
ndka           0.0003 0.0069  0.05  0.9609  
gender=Female -0.6508 0.3984 -1.63  0.1024  
age            0.0015 0.0131  0.12  0.9068  
wfns           0.3266 0.4408  0.74  0.4587  
wfns=3         0.7044 1.1576  0.61  0.5428  
wfns=4        -0.8591 1.2771 -0.67  0.5011  
wfns=5        -2.3094 1.6651 -1.39  0.1655

Code

ordinalModelPredictors <- train(
  ordinalFactor ~ s100b + ndka + gender + age + wfns,
  data = mydataActuarial,
  method = "polr")

print(ordinalModelPredictors)

Ordered Logistic or Probit Regression 

113 samples
  5 predictor
  3 classes: '0', '1', '2' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 113, 113, 113, 113, 113, 113, ... 
Resampling results across tuning parameters:

  method    Accuracy   Kappa       
  cauchit   0.2051282  -0.194664032
  cloglog   0.3177155   0.019400327
  logistic  0.3105615   0.005977777
  loglog    0.3080244   0.007859259
  probit    0.3074942   0.005253176

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was method = cloglog.

Continuous Outcome:

Code

summary(
  lm(s100b ~ ndka + gender + age + wfns,
  data = mydataActuarial))


Call:
lm(formula = s100b ~ ndka + gender + age + wfns, data = mydataActuarial)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.45724 -0.08047 -0.00746  0.03834  0.60901 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.0870834  0.0631673   1.379   0.1709    
ndka          0.0033368  0.0003945   8.458 1.72e-13 ***
genderFemale  0.0333867  0.0342521   0.975   0.3319    
age           0.0017750  0.0011788   1.506   0.1351    
wfns.L        0.3124368  0.0330875   9.443 1.09e-15 ***
wfns.Q        0.1156324  0.0528496   2.188   0.0309 *  
wfns.C       -0.0125489  0.0348138  -0.360   0.7192    
wfns^4       -0.0779965  0.0642403  -1.214   0.2274    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1642 on 105 degrees of freedom
Multiple R-squared:  0.6586,    Adjusted R-squared:  0.6358 
F-statistic: 28.93 on 7 and 105 DF,  p-value: < 2.2e-16

10.6.3 Least Absolute Shrinkage and Selection Option (LASSO)

The least absolute shrinkage and selection option (LASSO) models were fit in the caret package (Kuhn, 2022).

Dichotomous Outcome:

Code

lassoModelDichotomous <- train(
  outcome ~ s100b + ndka + gender + age + wfns,
  data = mydataActuarial,
  method = "glmnet")

print(lassoModelDichotomous)

glmnet 

113 samples
  5 predictor
  2 classes: 'Good', 'Poor' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 113, 113, 113, 113, 113, 113, ... 
Resampling results across tuning parameters:

  alpha  lambda        Accuracy   Kappa    
  0.10   0.0005284466  0.7355782  0.4287611
  0.10   0.0052844664  0.7287405  0.4099028
  0.10   0.0528446638  0.7218622  0.3770501
  0.55   0.0005284466  0.7375306  0.4322519
  0.55   0.0052844664  0.7286740  0.4091351
  0.55   0.0528446638  0.7227115  0.3738905
  1.00   0.0005284466  0.7344793  0.4259979
  1.00   0.0052844664  0.7336155  0.4208213
  1.00   0.0528446638  0.7221166  0.3723118

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0.55 and lambda = 0.0005284466.

Continuous Outcome:

Code

lassoModelContinuous <- train(
  s100b ~ ndka + gender + age + wfns,
  data = mydataActuarial,
  method = "lasso")

print(lassoModelContinuous)

The lasso 

113 samples
  4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 113, 113, 113, 113, 113, 113, ... 
Resampling results across tuning parameters:

  fraction  RMSE       Rsquared   MAE      
  0.1       0.2429079  0.2539792  0.1738286
  0.5       0.2108372  0.3378498  0.1324253
  0.9       0.2166231  0.3055272  0.1267198

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was fraction = 0.5.

10.6.4 Ridge Regression

The ridge regression models were fit in the caret package (Kuhn, 2022).

Dichotomous Outcome:

Code

ridgeModelDichotomous <- train(
  outcome ~ s100b + ndka + gender + age + wfns,
  data = mydataActuarial,
  method = "regLogistic")

print(ridgeModelDichotomous)

Regularized Logistic Regression 

113 samples
  5 predictor
  2 classes: 'Good', 'Poor' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 113, 113, 113, 113, 113, 113, ... 
Resampling results across tuning parameters:

  cost  loss       epsilon  Accuracy   Kappa    
  0.5   L1         0.001    0.7325252  0.4103154
  0.5   L1         0.010    0.7295235  0.4051628
  0.5   L1         0.100    0.7277705  0.3960880
  0.5   L2_dual    0.001    0.7248420  0.3938821
  0.5   L2_dual    0.010    0.7326213  0.4044132
  0.5   L2_dual    0.100    0.7345136  0.4173344
  0.5   L2_primal  0.001    0.7345607  0.4114534
  0.5   L2_primal  0.010    0.7323873  0.4078716
  0.5   L2_primal  0.100    0.6439091  0.1596625
  1.0   L1         0.001    0.7355483  0.4187553
  1.0   L1         0.010    0.7374291  0.4240174
  1.0   L1         0.100    0.7358523  0.4188055
  1.0   L2_dual    0.001    0.7259651  0.4116438
  1.0   L2_dual    0.010    0.7290618  0.4028366
  1.0   L2_dual    0.100    0.7236890  0.3806801
  1.0   L2_primal  0.001    0.7436326  0.4349872
  1.0   L2_primal  0.010    0.7334694  0.4118764
  1.0   L2_primal  0.100    0.6446988  0.1637100
  2.0   L1         0.001    0.7452451  0.4415898
  2.0   L1         0.010    0.7435028  0.4363186
  2.0   L1         0.100    0.7363747  0.4247123
  2.0   L2_dual    0.001    0.7400012  0.4239179
  2.0   L2_dual    0.010    0.7279734  0.3936125
  2.0   L2_dual    0.100    0.7177376  0.3990560
  2.0   L2_primal  0.001    0.7485614  0.4486697
  2.0   L2_primal  0.010    0.7436215  0.4330285
  2.0   L2_primal  0.100    0.6466046  0.1685345

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were cost = 2, loss = L2_primal and
 epsilon = 0.001.

Continuous Outcome:

Code

ridgeModelContinuous <- train(
  s100b ~ ndka + gender + age + wfns,
  data = mydataActuarial,
  method = "ridge")

print(ridgeModelContinuous)

Ridge Regression 

113 samples
  4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 113, 113, 113, 113, 113, 113, ... 
Resampling results across tuning parameters:

  lambda  RMSE       Rsquared   MAE      
  0e+00   0.2334776  0.3049999  0.1386310
  1e-04   0.2330332  0.3068559  0.1380783
  1e-01   0.2364670  0.2965017  0.1388237

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was lambda = 1e-04.

10.6.5 Elastic Net

The elastic net models were fit in the caret package (Kuhn, 2022).

Continuous Outcome:

Code

elasticnetModelContinuous <- train(
  s100b ~ ndka + gender + age + wfns,
  data = mydataActuarial,
  method = "enet")

print(elasticnetModelContinuous)

Elasticnet 

113 samples
  4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 113, 113, 113, 113, 113, 113, ... 
Resampling results across tuning parameters:

  lambda  fraction  RMSE       Rsquared   MAE      
  0e+00   0.050     0.2731440  0.2621499  0.1863863
  0e+00   0.525     0.2336049  0.3631764  0.1389417
  0e+00   1.000     0.2385990  0.3368319  0.1364719
  1e-04   0.050     0.2731444  0.2621501  0.1863868
  1e-04   0.525     0.2336066  0.3631744  0.1389431
  1e-04   1.000     0.2386016  0.3368277  0.1364723
  1e-01   0.050     0.2734316  0.2622984  0.1867288
  1e-01   0.525     0.2348413  0.3605836  0.1397894
  1e-01   1.000     0.2412255  0.3321961  0.1369182

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were fraction = 0.525 and lambda = 0.

10.6.6 Random Forest Machine Learning

The random forest models were fit in the caret package (Kuhn, 2022).

Dichotomous Outcome:

Code

rfModelDichotomous <- train(
  outcome ~ s100b + ndka + gender + age + wfns,
  data = mydataActuarial,
  method = "rf")

print(rfModelDichotomous)

Random Forest 

113 samples
  5 predictor
  2 classes: 'Good', 'Poor' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 113, 113, 113, 113, 113, 113, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
  2     0.7369067  0.3953884
  5     0.7347013  0.4032376
  8     0.7284303  0.3899586

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

Ordinal Outcome:

Code

rfModelOrdinal <- train(
  ordinalFactor ~ s100b + ndka + gender + age + wfns,
  data = mydataActuarial,
  method = "ordinalRF")

print(rfModelOrdinal)

Random Forest 

113 samples
  5 predictor
  3 classes: '0', '1', '2' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 113, 113, 113, 113, 113, 113, ... 
Resampling results across tuning parameters:

  nsets  ntreeperdiv  ntreefinal  Accuracy   Kappa       
   50     50          200         0.3211389  -0.005466516
   50     50          400         0.3022129  -0.031700266
   50     50          600         0.3053906  -0.026771454
   50    100          200         0.3079423  -0.024898546
   50    100          400         0.3067742  -0.025865855
   50    100          600         0.3104021  -0.020939333
   50    150          200         0.3086011  -0.025645459
   50    150          400         0.3040616  -0.029628515
   50    150          600         0.3020118  -0.033196001
  100     50          200         0.3290558   0.003086972
  100     50          400         0.3042518  -0.027140626
  100     50          600         0.3138210  -0.014979737
  100    100          200         0.3070572  -0.029330252
  100    100          400         0.3142540  -0.017303057
  100    100          600         0.3048068  -0.034920067
  100    150          200         0.3163612  -0.013654494
  100    150          400         0.3058112  -0.031364175
  100    150          600         0.3091778  -0.025260742
  150     50          200         0.3080989  -0.023783202
  150     50          400         0.3097436  -0.022047402
  150     50          600         0.3088715  -0.027163376
  150    100          200         0.3076460  -0.027799395
  150    100          400         0.3090770  -0.023510219
  150    100          600         0.3130607  -0.018028479
  150    150          200         0.3058690  -0.028026053
  150    150          400         0.2998348  -0.039127889
  150    150          600         0.3062674  -0.031066605

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nsets = 100, ntreeperdiv = 50
 and ntreefinal = 200.

Continuous Outcome:

Code

rfModelContinuous <- train(
  s100b ~ ndka + gender + age + wfns,
  data = mydataActuarial,
  method = "rf")

print(rfModelContinuous)

Random Forest 

113 samples
  4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 113, 113, 113, 113, 113, 113, ... 
Resampling results across tuning parameters:

  mtry  RMSE       Rsquared   MAE      
  2     0.2217943  0.3596668  0.1278765
  4     0.2219191  0.3547089  0.1278610
  7     0.2222394  0.3476888  0.1318543

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.

10.6.7 k-Fold Cross-Validation

The k-fold cross-validation models were fit in the caret package (Kuhn, 2022).

Code

kFolds <- 10

train.control <- trainControl(
  method = "cv",
  number = kFolds)

Dichotomous Outcome:

Code

cvModelDichotomous <- train(
  outcome ~ s100b + ndka + gender + age + wfns,
  data = mydataActuarial,
  method = "glm",
  trControl = train.control)

print(cvModelDichotomous)

Generalized Linear Model 

113 samples
  5 predictor
  2 classes: 'Good', 'Poor' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 102, 102, 102, 102, 102, 102, ... 
Resampling results:

  Accuracy   Kappa    
  0.7780303  0.5046079

Continuous Outcome:

Code

cvModelContinuous <- train(
  s100b ~ ndka + gender + age + wfns,
  data = mydataActuarial,
  method = "lm",
  trControl = train.control)

print(cvModelContinuous)

Linear Regression 

113 samples
  4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 102, 103, 101, 102, 101, 102, ... 
Resampling results:

  RMSE       Rsquared   MAE      
  0.2074522  0.3819308  0.1293988

Tuning parameter 'intercept' was held constant at a value of TRUE

10.6.8 Leave-One-Out (LOO) Cross-Validation

The leave-one-out (LOO) cross-validation models were fit in the caret package (Kuhn, 2022).

Code

train.control <- trainControl(
  method = "LOOCV")

Dichotomous Outcome:

Code

loocvModelDichotomous <- train(
  outcome ~ s100b + ndka + gender + age + wfns,
  data = mydataActuarial,
  method = "glm",
  trControl = train.control)

print(loocvModelDichotomous)

Generalized Linear Model 

113 samples
  5 predictor
  2 classes: 'Good', 'Poor' 

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 112, 112, 112, 112, 112, 112, ... 
Resampling results:

  Accuracy   Kappa    
  0.7699115  0.4747944

Continuous Outcome:

Code

loocvModelContinuous <- train(
  s100b ~ ndka + gender + age + wfns,
  data = mydataActuarial,
  method = "lm",
  trControl = train.control)

print(loocvModelContinuous)

Linear Regression 

113 samples
  4 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 112, 112, 112, 112, 112, 112, ... 
Resampling results:

  RMSE       Rsquared   MAE      
  0.2313417  0.2764679  0.1274925

Tuning parameter 'intercept' was held constant at a value of TRUE

10.7 Examples of Using Actuarial Methods

There are many examples of actuarial methods used in research and practice. As just a few examples, actuarial methods have been used to predict violence risk (Rice et al., 2013), to predict treatment outcome [for the treatment of depression; DeRubeis et al. (2014); Z. D. Cohen & DeRubeis (2018)], and to screen for attention-deficit/hyperactivity disorder (Lindhiem et al., 2015). Machine learning is becoming more widely used as another form of approaches to actuarial methods (Dwyer et al., 2018). For an empirical example, we have examined actuarial methods to predict arrests, illegal drug use, and drunk driving (Petersen et al., 2015).

10.8 Why Clinical Judgment is More Widely Used Than Statistical Formulas

Despite actuarial methods being generally more accurate than clinical judgment, clinical judgment is much more widely used. There are several reasons why actuarial methods have not caught on; one reason is professional traditions. Experts in any field do not like to think that a computer could outperform them. Some practitioners argue that judgment/prediction is an “art form” and that using a statistical formula is treating people like a number. However, using an approach (i.e., clinical judgment) that systematically leads to less accurate decisions and predictions is an ethical problem.

Some clinicians do not think that group averages (e.g., in terms of which treatment is most effective) apply to an individual client. This invokes the distinction between nomothetic (group-level) inferences and idiographic (individual-level) inferences. However, the scientific evidence and probability theory strongly favor the nomothetic approach to clinical prediction—it is better to generalize from group-level evidence than throwing out all the evidence and taking the approach of “anything goes.” Clinicians frequently believe the broken leg fallacy, i.e., thinking that your client is an exception to the algorithmic prediction. In most cases, deviating from the statistical formula will result in less accurate predictions. People tend to over-estimate the probability of low base rate conditions and events.

Another reason why actuarial methods have not caught on is the belief that receiving a treatment is the only thing that matters. But it is an empirical question which treatment is most effective for whom. What if we could do better? For example, we could potentially use a formula to identify the most effective treatment for a client. Some treatments are no better than placebo; other treatments are actually harmful (Lilienfeld, 2007; Williams et al., 2021).

Another reason why clinical methods are more widely used than actuarial methods is the over-confidence in clinicians’ predictions—clinicians think they are more accurate than they actually are. We see this when examining their calibration; their predictions tend to be mis-calibrated. For example, things they report with 80% confidence occur less than 80% of the time. Clinicians will sometimes be correct by chance, and they tend to mis-attribute that to their assessment; clinicians tend to remember the successes and forget the failures. Note, however, that it is not just clinicians who are over-confident; humans in general tend to be over-confident in their predictions.

Another argument against using actuarial methods is that “no methods exist”. In some cases, that is true—actuarial methods do not yet exist for some prediction problems. However, one can always create an algorithm of the clinicians’ judgments, even if one does not have access to the clients’ outcome information. A model of clinicians’ responses tends to be more accurate than clinicians’ judgments themselves because the model gives the same outcome with the same input data—i.e., it is perfectly reliable.

Another argument from some clinicians is that, “My job is to understand, not to predict”. But what kind of understanding does not involve predictions? Accurate predictions help in understanding. Knowing how people would perform in different conditions is the same thing as good understanding.

Another potential reason clinicians may not use actuarial judgment is time. This can be a legitimate challenge to using actuarial models in some contexts. Actuarial methods can take time to apply—they require assessing the relevant criteria, putting the assessment scores into an actuarial formula, and interpreting the output score. In some clinical situations, such as risk assessments in terms of whether someone is imminently suicidal or homicidal and how to proceed, clinicians must make immediate in-session decisions (Jacinto et al., 2018). Thus, it can be challenging or impractical in some situations to conduct all of the relevant assessments needed to apply actuarial approaches. Moreover, in some workplace settings, such as health care settings, clinicians may be pushed beyond their limits in terms of productivity expectations. In these situations, one possible approach is to exercise one’s best possible judgment in the moment based on the available evidence [i.e., intuitive decision-making processes; Jacinto et al. (2018)], and to check it against an actuarial model after the fact. This is like comparing your decisions in a chess match, after the game ends, to how a computer would have moved in each of the situations. It is valuable to receive feedback on the accuracy of one’s judgments and, when time permits, to evaluate one’s intuitive decision-making against analytical decision-making. However, there are structured approaches to assessing suicide risk that are quick to administer (e.g., Posner et al., 2011).

10.9 Conclusion

In general, it is better to develop and use structured, actuarial approaches than informal approaches that rely on human or clinical judgment. Actuarial approaches to prediction tend to be as accurate or more accurate than clinical judgment. Nevertheless, clinical judgment tends to be much more widely used than actuarial approaches, which is a major ethical problem.

10.10 Suggested Readings

Dawes et al. (1989); Garb & Wood (2019); Grove et al. (2000)

References

Ægisdóttir, S., White, M. J., Spengler, P. M., Maugherman, A. S., Anderson, L. A., Cook, R. S., Nichols, C. N., Lampropoulos, G. K., Walker, B. S., Cohen, G., & Rush, J. D. (2006). The meta-analysis of clinical judgment project: Fifty-six years of accumulated research on clinical versus statistical prediction. The Counseling Psychologist, 34(3), 341–382. https://doi.org/10.1177/0011000005285875

Baird, C., & Wagner, D. (2000). The relative validity of actuarial- and consensus-based risk assessment systems. Children and Youth Services Review, 22(11), 839–871. https://doi.org/10.1016/S0190-7409(00)00122-5

Cohen, Z. D., & DeRubeis, R. J. (2018). Treatment selection in depression. Annual Review of Clinical Psychology, 14(1), 209–236. https://doi.org/10.1146/annurev-clinpsy-050817-084746

Dana, J., & Thomas, R. (2006). In defense of clinical judgment … and mechanical prediction. Journal of Behavioral Decision Making, 19(5), 413–428. https://doi.org/10.1002/bdm.537

Dawes, R. M. (1986). Representative thinking in clinical judgment. Clinical Psychology Review, 6, 425–441. https://doi.org/10.1016/0272-7358(86)90030-9

Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment. Science, 243(4899), 1668–1674. https://doi.org/10.1126/science.2648573

DeRubeis, R. J., Cohen, Z. D., Forand, N. R., Fournier, J. C., Gelfand, L. A., & Lorenzo-Luaces, L. (2014). The personalized advantage index: Translating research on prediction into individualized treatment recommendations. A demonstration. PLoS ONE, 9(1), e83875. https://doi.org/10.1371/journal.pone.0083875

Dwyer, D. B., Falkai, P., & Koutsouleris, N. (2018). Machine learning approaches for clinical psychology and psychiatry. Annual Review of Clinical Psychology, 14(1), 91–118. https://doi.org/10.1146/annurev-clinpsy-032816-045037

Garb, H. N. (1997). Race bias, social class bias, and gender bias in clinical judgment. Clinical Psychology: Science and Practice, 4(2), 99–120. https://doi.org/10.1111/j.1468-2850.1997.tb00104.x

Garb, H. N. (2005). Clinical judgment and decision making. Annual Review of Clinical Psychology, 1, 67–89. https://doi.org/10.1146/annurev.clinpsy.1.102803.143810

Garb, H. N., & Wood, J. M. (2019). Methodological advances in statistical prediction. Psychological Assessment, 31(12), 1456–1466. https://doi.org/10.1037/pas0000673

Grove, W. M., & Meehl, P. E. (1996). Comparative efficiency of informal (subjective, impressionistic) and formal (mechanical, algorithmic) prediction procedures: The clinical–statistical controversy. Psychology, Public Policy, and Law, 2(2), 293–323. https://doi.org/10.1037/1076-8971.2.2.293

Grove, W. M., Zald, D. H., Lebow, B. S., Snitz, B. E., & Nelson, C. (2000). Clinical versus mechanical prediction: A meta-analysis. Psychological Assessment, 12(1), 19–30. https://doi.org/10.1037/1040-3590.12.1.19

Jacinto, S. B., Lewis, C. C., Braga, J. N., & Scott, K. (2018). A conceptual model for generating and validating in-session clinical judgments. Psychotherapy Research, 28(1), 91–105. https://doi.org/10.1080/10503307.2016.1169329

Kahneman, D. (2011). Thinking, fast and slow. Farrar, Straus, and Giroux.

Kuhn, M. (2022). caret: Classification and regression training. https://github.com/topepo/caret/

Lilienfeld, S. O. (2007). Psychological treatments that cause harm. Perspectives on Psychological Science, 2(1), 53–70. https://doi.org/10.1111/j.1745-6916.2007.00029.x

Lindhiem, O., Yu, L., Grasso, D. J., Kolko, D. J., & Youngstrom, E. A. (2015). Adapting the posterior probability of diagnosis index to enhance evidence-based screening: An application to ADHD in primary care. Assessment, 22(2), 198–207. https://doi.org/10.1177/1073191114540748

Meehl, P. E. (1957). When shall we use our heads instead of the formula? Journal of Counseling Psychology, 4(4), 268–273. https://doi.org/10.1037/h0047554

Meehl, P. E. (1986). Causes and effects of my disturbing little book. Journal of Personality Assessment, 50(3), 370–375. https://doi.org/10.1207/s15327752jpa5003_6

Petersen, I. T., Bates, J. E., Dodge, K. A., Lansford, J. E., & Pettit, G. S. (2015). Describing and predicting developmental profiles of externalizing problems from childhood to adulthood. Development and Psychopathology, 27(3), 791–818. https://doi.org/10.1017/S0954579414000789

Posner, K., Brown, G. K., Stanley, B., Brent, D. A., Yershova, K. V., Oquendo, M. A., Currier, G. W., Melvin, G. A., Greenhill, L., Shen, S., & Mann, J. J. (2011). The Columbia–Suicide Severity Rating Scale: Initial validity and internal consistency findings from three multisite studies with adolescents and adults. American Journal of Psychiatry, 168(12), 1266–1277. https://doi.org/10.1176/appi.ajp.2011.10111704

Rice, M. E., Harris, G. T., & Lang, C. (2013). Validation of and revision to the VRAG and SORAG: The Violence Risk Appraisal Guide—Revised (VRAG-R). Psychological Assessment, 25(3), 951–965. https://doi.org/10.1037/a0032878

Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2021). pROC: Display and analyze ROC curves. http://expasy.org/tools/pROC/

Silver, N. (2012). The signal and the noise: Why so many predictions fail–but some don’t. Penguin.

Ursenbach, J., O’Connell, M. E., Neiser, J., Tierney, M. C., Morgan, D., Kosteniuk, J., & Spiteri, R. J. (2019). Scoring algorithms for a computer-based cognitive screening tool: An illustrative example of overfitting machine learning approaches and the impact on estimates of classification accuracy. Psychological Assessment, 31(11), 1377–1382. https://doi.org/10.1037/pas0000764

Wainer, H. (1976). Estimating coefficients in linear models: It don’t make no nevermind. Psychological Bulletin, 83(2), 213–217. https://doi.org/10.1037/0033-2909.83.2.213

Williams, A. J., Botanov, Y., Kilshaw, R. E., Wong, R. E., & Sakaluk, J. K. (2021). Potentially harmful therapies: A meta-scientific review of evidential value. Clinical Psychology: Science and Practice, 28(1), 5–18. https://doi.org/10.1111/cpsp.12331

Youngstrom, E. A., Halverson, T. F., Youngstrom, J. K., Lindhiem, O., & Findling, R. L. (2018). Evidence-based assessment from simple clinical judgments to statistical learning: Evaluating a range of options using pediatric bipolar disorder as a diagnostic challenge. Clinical Psychological Science, 6(2), 243–265. https://doi.org/10.1177/2167702617741845

Feedback

Please consider providing feedback about this textbook, so that I can make it as helpful as possible. You can provide feedback at the following link: https://forms.gle/95iW4p47cuaphTek6