I need your help!

I want your feedback to make the book better for you and other readers. If you find typos, errors, or places where the text may be improved, please let me know. The best ways to provide feedback are by GitHub or hypothes.is annotations.

Opening an issue or submitting a pull request on GitHub: https://github.com/isaactpetersen/Principles-Psychological-Assessment

Adding an annotation using hypothes.is. To add an annotation, select some text and then click the symbol on the pop-up menu. To see the annotations of others, click the symbol in the upper right-hand corner of the page.

Chapter 5 Validity

“What we know depends on how we know it.”

5.1 Overview of Validity

According to the Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014, p. 11), measurement validity is “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests.” We summarized reliability with three words: repeatability, consistency, and precision. A summary of validity in three words is accuracy, utility, and meaningfulness. As with reliability, validity is not a property of the measure itself—the validity of a measure’s scores reflects an interaction of the properties of the measure with the population for whom it is designed and the sample and context in which it is administered. Moreover, validity is tied to the interpretation of a measure’s scores for the proposed uses, not (just) to the measure itself. The same set of scores can have different degrees of validity for different purposes. For instance, a measure’s scores may have stronger validity for making a diagnostic decision than for making a prediction about future behavior. Thus, as the Standards indicate, it is incorrect to use the unqualified phrase “the validity of the measure” or “the measure is (in)valid,” because these phrases do not specify which scores were used from the test, what the use is (e.g., predicting whether a person will succeed in a given job), and what interpretation was made of the test scores for this purpose.¹ For these reasons, it is insufficient to cite evidence for the validity of the measure in other work; it is also important to evaluate the validity of the measure’s scores for the intended use in your particular sample.

Below, we prepare the data to provide some validity-related examples throughout the rest of the chapter.

5.2 Getting Started

5.2.1 Load Libraries

Code

library("petersenlab") #to install: install.packages("remotes"); remotes::install_github("DevPsyLab/petersenlab")
library("lavaan")
library("semPlot")
library("rockchalk")
library("semTools")
library("semPlot")
library("kableExtra")
library("MASS")
library("psych")
library("simstandard")
library("MOTE")
library("tidyverse")
library("tinytex")
library("knitr")
library("rmarkdown")
library("bookdown")
library("here")
library("DT")

5.2.2 Prepare Data

5.2.2.1 Simulate Data

For reproducibility, we set the seed below. Using the same seed will yield the same answer every time. There is nothing special about this particular seed.

Code

sampleSize <- 1000

set.seed(52242)

means <- c(50, 100)
standardDeviations <- c(10, 15)

correlationMatrix <- matrix(.7, nrow = 2, ncol = 2)
diag(correlationMatrix) <- 1
rownames(correlationMatrix) <- colnames(correlationMatrix) <- 
  c("predictor","criterion")

covarianceMatrix <- psych::cor2cov(
  correlationMatrix,
  sigma = standardDeviations)

mydataValidity <- as.data.frame(mvrnorm(
  n = sampleSize,
  mu = means,
  Sigma = covarianceMatrix,
  empirical = TRUE))

errorToAddToPredictor <- 3.20
errorToAddToCriterion <- 6.15

mydataValidity$predictorWithMeasurementErrorT1 <- 
  mydataValidity$predictor + 
  rnorm(n = sampleSize, mean = 0, sd = errorToAddToPredictor)

mydataValidity$predictorWithMeasurementErrorT2 <- 
  mydataValidity$predictor + 
  rnorm(n = sampleSize, mean = 0, sd = errorToAddToPredictor)

mydataValidity$criterionWithMeasurementErrorT1 <- 
  mydataValidity$criterion + 
  rnorm(n = sampleSize, mean = 0, sd = errorToAddToCriterion)

mydataValidity$criterionWithMeasurementErrorT2 <- 
  mydataValidity$criterion + 
  rnorm(n = sampleSize, mean = 0, sd = errorToAddToCriterion)

mydataValidity$oldpredictor <- mydataValidity$criterion + 
  rnorm(n = sampleSize, mean = 0, sd = 7.5)

latentCorrelation <- .8
reliabilityPredictor <- .9
reliabilityCriterion <- .85

mydataValidity$predictorLatentSEM <- rnorm(sampleSize, 0 , 1)

mydataValidity$criterionLatentSEM <- latentCorrelation * 
  mydataValidity$predictorLatentSEM + rnorm(
    sampleSize,
    0,
    sqrt(1 - latentCorrelation ^ 2))

mydataValidity$predictorObservedSEM <- reliabilityPredictor * 
  mydataValidity$predictorLatentSEM + rnorm(
    sampleSize,
    0,
    sqrt(1 - reliabilityPredictor ^ 2))

mydataValidity$criterionObservedSEM <- reliabilityCriterion * 
  mydataValidity$criterionLatentSEM + rnorm(
    sampleSize,
    0,
    sqrt(1 - reliabilityCriterion ^ 2))

5.2.2.2 Add Missing Data

Adding missing data to dataframes helps make examples more realistic to real-life data and helps you get in the habit of programming to account for missing data.

Code

missingValuesPredictor <- sample(
  1:sampleSize,
  size = 50,
  replace = FALSE)
missingValuesCriterion <- sample(
  1:sampleSize,
  size = 50,
  replace = FALSE)

mydataValidity$predictor[
  missingValuesPredictor] <- NA
mydataValidity$predictorWithMeasurementErrorT1[
  missingValuesPredictor] <- NA
mydataValidity$predictorWithMeasurementErrorT2[
  missingValuesPredictor] <- NA
mydataValidity$predictorObservedSEM[
  missingValuesPredictor] <- NA

mydataValidity$criterion[
  missingValuesCriterion] <- NA
mydataValidity$criterionWithMeasurementErrorT1[
  missingValuesCriterion] <- NA
mydataValidity$criterionWithMeasurementErrorT2[
  missingValuesCriterion] <- NA
mydataValidity$criterionObservedSEM[
  missingValuesCriterion] <- NA

mydataValidity$oldpredictor[
  missingValuesPredictor] <- NA

5.3 Types of Validity

Like reliability, validity is not one thing. There are many types of validity. In this book, we discuss the following types of validity:

We arrange these types of validity into two broader categories: measurement validity and research design validity.

5.3.1 Measurement Validity

Aspects of measurement validity involve the validity of a particular measure, or more specifically, the validity of interpretations of scores from that measure for the proposed uses. Aspects of measurement validity include:

5.3.1.1 Face Validity

The interpretation of a measure’s scores has face validity (for a given construct and a given use) if a typical person—a nonexpert—who looks at the content of each item will believe that the item belongs in the scale for this construct and for this use. The measure, and each item, looks “on its face” like it assesses the target construct. There are several advantages of a measure having face validity. First, outside groups will be less likely to be critical of the measure because it is intuitive. Second, use of a face valid measure is rarely objected to on ethical and bias charges for selling the measure to the public or clinicians. Third, face validity can be helpful for dissemination because more people may be receptive to it.

However, face validity also has important disadvantages. First, judgments of face validity are not based on theory. Second, face validity is based on subjective judgment, which can be inaccurate. Third, these subjective judgments are made by laypeople whose judgments may be inaccurate because of biases and lack of awareness of scientific knowledge. Fourth, a face valid measure may be too simple because anybody can understand the questions and what the questions are intended to assess, so (presumably) respondents can easily fake responses to achieve their goals. Faking of responses may be more of a concern in situations when there is an incentive for the respondent to achieve a particular outcome (e.g., be deemed competent to stand trial, be judged competent to obtain a job or custody of child, be judged to have a disorder to receive accommodations or disability benefits).

It is disputed whether having face validity is good or bad. Whether face validity is important to a given measure depends on the construct that is intended to be assessed, the context in which the assessment will occur, who will be paying for and/or administering the assessment, whether the respondents have incentives to achieve particular scores, and the goals of the assessment (i.e., how the assessment will be used). There is also controversy about whether face validity is a true form of validity; many researchers have argued that it is not a true psychometric form of validity, because the appearance of validity is not validity (Royal, 2016).

5.3.1.2 Content Validity

Content validity involves a judgment about whether or not the content (items) of the measure theoretically matches the construct that is intended to be assessed—that is, whether the operationalization accurately reflects the construct. Content validity is developed based on items generated and selected by experts of the construct and based on the subjective determination that the measure adequately assesses and covers the construct of interest. Content validity differs from face validity in that, for face validity, a layperson determines whether or not the measure seems to assess the construct of interest. By contrast, for content validity, an expert determines whether or not the measure adheres to the construct of interest.

For a measure to have content validity, its items should span the breadth of the construct. For instance, the construct of depression has many facets, such as sleep disturbances, weight/appetite changes, low mood, suicidality, etc., as depicted in Figure 5.1.

Figure 5.1: Content Facets of the Construct of Depression.

For a measure to have content validity, there should be no gaps—facets of the construct that are not assessed by the measure—and there should be no intrusions—facets of different constructs that are assessed by the measure. Consider the construct of depression. If theory states the construct includes various facets such as sadness, loss of interest in activities, sleep disturbances, lack of energy, weight/appetite change, and suicidal thoughts, then a content-valid measure should assess all of these facets. If the measure does not assess sleep disturbances (a gap), the measure would lack content validity. If the measure assessed facets of other constructs, such as impulsivity (an intrusion), the measure would lack content validity.

With content validity, it is important to consider the population of interest. The same construct may look different in different populations and may require different content to assess it. For instance, it is important to consider the cultural relativity of constructs. The content of a construct may depend on the culture, such as in the case of culture-bound syndromes. Culture-bound syndromes are syndromes that are limited to particular cultures. An example of a culture-bound syndrome among Korean women is hwa-byung, which is the feeling of an uncomfortable abdominal mass in response to emotional distress. Another important dimension to consider is development. Constructs can manifest differently at different points in development, known as heterotypic continuity, which is discussed in Section 23.9 of Chapter 23 on repeated assessments across time. When considering the different dimensions of your population, it can be helpful to remember the acronym ADDRESSING, which is described in Section 25.2.1 of Chapter 25 on cultural and individual diversity.

However, like face validity, content validity is based on subjective judgment, which can be inaccurate.

5.3.1.3 Criterion-Related Validity

Criterion-related validity examines whether a measure behaves the way it should given your theory of the construct. This is quantified by the correlation between a measure’s scores and some (hopefully universally accepted) criterion we select. For instance, a criterion could be a diagnosis, a child’s achievement in school, an employee’s performance in a job, etc.

Below, we provide an example of criterion-related validity by examining the Pearson correlation between a predictor and a criterion.

Code

cor.test(x = mydataValidity$predictor, y = mydataValidity$criterion)


    Pearson's product-moment correlation

data:  mydataValidity$predictor and mydataValidity$criterion
t = 30.07, df = 900, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6737883 0.7390515
sample estimates:
      cor 
0.7079278

In this case, the estimate of criterion-related validity is \(r = .71\). There are two types of criterion-related validity: concurrent validity and predictive validity.

5.3.1.3.1 Concurrent Validity

Concurrent validity considers the concurrent association between the chosen measure and the criterion. That is, both the measure and the criterion are assessed at the same point in time. An example of concurrent validity would be examining self-report of court involvement in relation to current court records.

5.3.1.3.1.1 Sensitivity to Change (Responsiveness)

An important aspect of validity that is related to concurrent validity is the degree to which a measure’s scores show sensitivity to change [also known as responsiveness; Fok & Henry (2015)]. If a researcher conducts an intervention that improves people’s mood, it is important to have measures that are sensitive enough to detect the improvements in people’s mood.

5.3.1.3.1.2 Nonreactivity

As noted in Section 22.8.1, the reactivity of a measure is how a measure’s scores change in response to being measured. In an intervention, reactivity can be valuable because having the person self-monitor their behavior can lead to behavior improvement, as described in Section 22.11. However, when the goal is to measure change over time, reactivity can be a problem because it can lead to changes in the measure’s scores that are not due to people’s changes in the construct. Thus, when merely monitoring changes over time, it is important to use measures that show nonreactivity (Myers & Winters, 2002).

5.3.1.3.2 Predictive Validity

Predictive validity considers the association between the chosen measure and the criterion at a later time point. An example of predictive validity would be examining the predictive association between children’s scores on an academic achievement test in first grade and their eventual academic outcomes five years later.

5.3.1.3.3 Empiricism and Theory

Criterion-related validity arose out of a movement known as radical operationalism. Radical operationalism was a pushback against psychoanalysis. Psychoanalysis focused on grand theoretical accounts for how constructs relate. The goal of radical operationalism was to clarify concepts from a behavioristic perspective to allow predicting and changing behavior more successfully. An “operation” in radical operationalism refers to a fully described measurement.

Proponents of radical operationalism argued that all constructs in psychology that could not be operationally defined should be excluded from the field as “nonscientific.” They asserted that operations should be well-defined enough to be able to replicate the findings. So, constructs had to be defined precisely according to this perspective, but how precisely? You could go on forever trying to more precisely describe a behavior in terms of its form, frequency, duration, intensity, situation, antecedents, consequences, biological substrates, etc. So, radical operationalists asserted that we should use theory of the construct to determine what is essential and what is not.

Radical operationalism was also related to radical behavioralism, which was espoused by B.F. Skinner. Skinner famously used a box (the “Skinner Box”) to more directly control, describe, and assess behaviors. Skinner noted the major role that the environment played in influencing behavior. Skinner proposed a theory of implicit learning about a behavior or stimulus based on its consequences, known as operant conditioning. According to operant conditioning, something that increases the frequency of a given behavior is called a reinforcer (e.g., praise), and something that decreases the frequency of a behavior is called a punisher (e.g., loss of a privilege). Through this work, Skinner came to view everything an organism does (e.g., action, thought, feeling) as a behavior.

Related to these historical perspectives was a perspective known as dustbowl empiricism. Dustbowl empiricism focused on the empirical connections between things—how things were associated using data. It was a completely atheoretical perspective in which interpretation was entirely data driven. An example of dustbowl empiricism is the approach that was used to develop the first version of the Minnesota Multiphasic Personality Inventory (MMPI). The MMPI was developed using an approach known as empirical-criterion keying, where items were selected for the scale for no reason other than the items demonstrate an association with the criterion. That is, an item was selected if it showed a strong ability to discriminate (differentiate) between clinical and control groups. Using this method with hundreds of items (and thousands of inter-item correlations), the MMPI developed 10 clinical scales, which involved operational rules based on previously collected empirical evidence.

But what do you know with this abundance of correlations? You can use data reduction methods to reduce the many variables, based on their inter-correlations, down to a more parsimonious set of factors. But how do you name each factor, which is composed of many items? The developers originally numbered the MMPI clinical scales from 1 to 10. But numbered scales are not useful for other people, so the factors were eventually given labels (e.g., Paranoia). And if a client received an elevated score on a factor, many people would label the clients as _____ [the name of the factor], such as “paranoid.” The MMPI is discussed in further detail in Chapter 18 on objective personality testing.

The idea of dustbowl empiricism was to develop a strong empirical base that would provide a strong foundation to help build up to a broad understanding that was integrated, coherent, and systematic. However, this process was unclear when there was only a table of correlations. Radical operationalists were opposed to content validity because it allows intrusion of our flawed thinking. According to operationalists, there are no experts. According to this perspective, the content does not matter; we just need enough data to bootstrap ourselves to a better understanding of the constructs.

Although the atheoretical approach can perform reasonably well, it can be improved by making better use of theory. An empirical result (e.g., a correlation) might not necessarily have a lot of meaning associated with it. As the maxim goes, correlation does not imply causation.

5.3.1.3.3.1 Correlation Does Not Imply Causation

Just because \(X\) is associated with \(Y\) does not mean that \(X\) causes \(Y\). Consider that you find an association between variables \(X\) and \(Y\), consistent with your hypothesis, as depicted in Figure 5.2.

Figure 5.2: Hypothesized Causal Effect Based on an Observed Association Between \(X\) and \(Y\), Such That \(X\) Causes \(Y\).

There are three primary reasons that an observed association between \(X\) and \(Y\) does not necessarily mean that \(X\) causes \(Y\).

First, the association could reflect the opposite direction of effect, where \(Y\) actually causes \(X\), as depicted in Figure 5.3.

Figure 5.3: Reverse (Opposite) Direction of Effect From the Hypothesized Effect, Where \(Y\) Causes \(X\).

Second, the association could reflect the influence of a third variable. If a third variable is a common cause of each and accounts for their association, it is a confound. An observed association between \(X\) and \(Y\) could reflect a confound—i.e., a cause (\(Z\)) that influences both \(X\) and \(Y\), which explains why \(X\) and \(Y\) are correlated even though they are not causally related. A third variable confound that is a common cause of both \(X\) and \(Y\) is depicted in Figure 5.4.

Figure 5.4: Confounded Association Between \(X\) and \(Y\) due to a Common Cause, \(Z\).

Third, the association might be spurious. It might just reflect random variation (i.e., chance), and that when tested on an independent sample, what appeared as an association may not hold when testing whether the association generalizes.

However, even if the association between \(X\) and \(Y\) reflects a causal effect and that \(X\) causes \(Y\), it does not necessarily mean that the effect is clinically actionable or useful. An association may reflect a static or unmodifiable predictor that is not practically useful as a treatment target.

5.3.1.3.3.2 Understanding the Causal System

As Silver (2012) notes, “The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning.” (p. 9). If we understand the variables in the system and how they influence each other, we can predict things more accurately than predicting for the sake of predicting. For instance, we have made great strides in the last decades when it comes to more accurate weather forecasts (see here; archived at https://perma.cc/PF8P-BT3D and here; archived at https://perma.cc/248N-9CW5), including extreme weather events like hurricanes. These great strides have more to do with a better causal understanding of the weather system and the ability to conduct simulations of the atmosphere than merely because of big data (Silver, 2012). By contrast, other events are still incredibly difficult to predict, including earthquakes, in large part because we do not have a strong understanding of the system (and because we do not have ways of precisely measuring those causes because they occur at a depth below which we are realistically able to drill) (Silver, 2012).

5.3.1.3.3.3 Model Over-Fitting

Statistical models applied to big data (i.e., lots of variables and lots of samples) can over-fit the data, which means that the statistical model accounts for error variance (an overly specific prediction), which will not generalize to future samples. So, even though an over-fitting statistical model appears to be accurate, it is not actually that accurate—it will predict new data less accurately than how accurately it accounts for the data with which the model was built.

In Figure 5.5, the blue line represents the true distribution of the data, and the red line is an over-fitting model:

Figure 5.5: Over-fitting Model in Red Relative to the True Distribution of the Data in Blue.

5.3.1.3.3.4 Criterion Contamination

An important issue in predictive validity is the criterion problem—finding the right criterion. It is important to avoid criterion contamination, which is artificial commonality between the measure and the criterion. The criterion is not always a well-measured clear criterion to predict (like predicting death in the medical field). And you may not have access to a predictive criterion until a long time from now. So, what researchers often do is adopt intermediate assessments, which are not actually what they are interested in, but it is related to the criterion of interest, and it is in a window of time that allows for some meaningful prediction. For instance, intermediate graduate school markers of whether a graduate student will go on to have a successful career could include their grades in graduate school, whether they completed the dissertation, their performance in comprehensive/qualifying exams, etc. However, these intermediate assessments do not always indicate whether or not a student will go on to have a successful career (i.e., they are not always correlated with the real criterion of interest).

5.3.1.3.3.5 Using Theory as a Guide

So, empiricism is often not enough. It is important to use theory to guide selection of an intermediate criterion that will relate to the real criterion of interest. In psychology, even our long-term criteria are not well defined relative to other sciences. In clinical psychology, for example, we are often predicting a diagnosis, which is not that much more valid compared to our measure/predictor.

At the same time, in psychology, our theories of the causal processes that influence outcomes are not yet very strong. Indeed, I have misgivings calling them theories because they do not meet the traditional scientific standard for a theory. A scientific theory is an explanation of the natural world that is testable and falsifiable, and that has withstood rigorous scientific testing and scrutiny. In psychology, our “theories” are more like conceptual frameworks. And these conceptual frameworks are often vague, do not make specific predictions of effects and noneffects, and do not hold up consistently when rigorously tested. As described by Meehl (1978):

I consider it unnecessary to persuade you that most so-called “theories” in the soft areas of psychology (clinical, counseling, social, personality, community, and school psychology) are scientifically unimpressive and technologically worthless … Perhaps the easiest way to convince yourself is by scanning the literature of soft psychology over the last 30 years and noticing what happens to theories. Most of them suffer the fate that General MacArthur ascribed to old generals—They never die, they just slowly fade away. In the developed sciences, theories tend either to become widely accepted and built into the larger edifice of well-tested human knowledge or else they suffer destruction in the face of recalcitrant facts and are abandoned, perhaps regretfully as a “nice try.” But in fields like personology and social psychology, this seems not to happen. There is a period of enthusiasm about a new theory, a period of attempted application to several fact domains, a period of disillusionment as the negative data come in, a growing bafflement about inconsistent and unreplicable empirical results, multiple resort to ad hoc excuses, and then finally people just sort of lose interest in the thing and pursue other endeavors. (pp. 806–807).

Even if we had strong theoretical understanding of the causal system that influences behavior, we would likely still have difficulty making accurate predictions because the field has largely relied on relatively crude instruments. According to one philosophical perspective known as LaPlace’s demon, if we were able to know the exact conditions of everything in the universe, we would be able to know how the conditions would be in the future. This is an example of scientific determinism, where if you know the initial conditions, you also know the future. Other perspectives, such as quantum mechanics and chaos theory, would say that, even if we knew the initial conditions with 100% certainty, there would still be uncertainty in our understanding of the future. But assume, for a moment, that LaPlace’s demon is true. The challenge in psychology is that we have a relatively poor understanding of the initial conditions of the universe. Thus, our predictions would necessarily be probabilistic, similar to weather forecasts. Despite having a strong understanding of how weather systems behave, we have imperfect understanding of the initial conditions (e.g., the position and movement of all molecules) (Silver, 2012).

5.3.1.3.3.6 Psychoanalysis Versus Empiricism

We can consider the difference between psychoanalysts and empiricists in cultural references.

Here is an excerpt from Douglas Adams’ The Restaurant at the End of the Universe (from the series, The Hitchhiker’s Guide to the Galaxy):

To explain—since every piece of matter in the Universe is in some way affected by every other piece of matter in the Universe, it is in theory possible to extrapolate the whole of creation—every sun, every planet, their orbits, their composition and their economic and social history from, say, one small piece of fairy cake.

— Douglas Adams (1980, p. 76)

If we consider this idea that someone might be able to extrapolate a simulation of the universe from a single piece of cake, we find similarities to how psychoanalysts connect everything to everything else through grand theories. Psychoanalysts try to reconstruct the entire universe from a sparse bit of data with supposedly strong theoretical understanding (when, in reality, our theories are not so strong). Their “theories” make grand conceptual claims.

Let us contrast psychoanalysts with empiricists/radical operationalism. Figure 5.6 presents a depiction of empiricism. Empiricists evaluate how an observed predictor relates to an observed outcome. The rectangles in the figure represent entities that can be observed. For instance, an empiricist might examine the extent to which a person’s blood pressure is related to the number of words they speak per minute.

Figure 5.6: Conceptual Depiction of Empiricism.

Contrast the empirical approach with psychoanalysis, as depicted in Figure 5.7.

Figure 5.7: Conceptual Depiction of Psychoanalysis.

Circles represent unobserved, latent entities. For instance a psychoanalyst may make a conceptual claim that one observed variable influences another observed variable through a complex chain of intervening processes that are unobservable.

There is a classic and hilarious Monty Python video (see below) that is an absurd example akin to radical operationalism taken to the extreme.

Monty Python video on penguin research

In the video, the researchers make lots of measurements and predictions. The researchers identify interspecies differences in intelligence where humans show better performance on the English-based intelligence test (who got an IQ score of 100) than the penguins, who got an IQ score of 2. But the penguins did not speak English and were unable to provide answers to the English-based intelligence test. So, the researchers also assessed a group of non-English speaking humans as an attempt to control for language ability. They found that the penguins’ scores were equal to the scores of the non-English speaking humans. They argued that, based on their smaller brain and equal IQ when controlling for language ability, that penguins are smarter than humans. However, the researchers clearly made mistakes about confounding variables. And inclusion of a control group of non-English speaking humans does not solve the problem of validity or bias; it just obscures the problem. In summary, radical operationalism provides rich lower-level information, but lacks the broader picture. So, it seems, that we need both theory and empiricism. Theory and empiricism can—and should—inform each other.

5.3.1.4 Construct Validity

Construct validity is the extent to which a measure accurately assesses a target construct (Cronbach & Meehl, 1955). Construct validity is not quantified by a single index but rather consists of a “web of evidence” (the totality of evidence), which reflect the sum of inferences from multiple aspects of validity. That is, construct validity is the extent to which a measure assesses a target construct. Construct validity deals with the association between a measure and an unobservable criterion, i.e., a latent construct. By contrast, criterion-related validity deals with an observable criterion.

Construct validity encompasses all other types of measurement validity (e.g., content and criterion-related validity), in addition to

scores on the measure show homogeneity—i.e., scores on the measure assess a single construct
scores on the measure show theoretically expected developmental changes (i.e., criterion-related validity with respect to age or development)
scores on the measure show theoretically expected group differences (i.e., criterion-related validity with respect to some group status)—this is also called known groups validity (when the groups are known to differ in their level on the construct)
scores on the measure show theoretically expected intervention effects (i.e., criterion-related validity with respect to intervention)
establishing the nomological network of a construct

Threats to construct validity are described by Shadish et al. (2002) (see Table 3.1).

5.3.1.4.1 Nomological Network

A nomological network is the interlocking system of laws that constitute a theory. It describes how the concepts (constructs) of interest are causally linked, including their observable manifestations and the causal relations among and between them. An example of a nomological network is depicted in Figure 5.8.

Figure 5.8: Example of a Nomological Network. O.M. = Observable Manifestation.

With construct validity, we can judge the quality of a measure by how well or how sensibly it fits in a nomological network. Latent constructs and observed measures improve each other step by step. But there is no established way to evaluate the process. Approaches such as network analysis may be useful toward this aim.

Historically, construct validity became a way for some researchers to skip out on other types of validity. People found a correlation between a measure’s scores and some group membership and argued, therefore, that the measure has construct validity because there was a theoretically expected correlation between some measure and ______ (insert whatever measure). People started finding a correlation of a measure with other measures, asserting that it provides evidence of construct validity, and saying “that’s my nomological network.” But a theoretically expected association is not enough! For example, consider a measure of how quickly someone can count backward by seven. Performance is impaired in those with schizophrenia, anxiety (due to greater distractibility), and depression (due to concentration difficulties and cognitive slowing). Therefore, it is a low-quality claim that counting backward is part of the phenomenology of these disorders because it lacks differential deficit or discriminant validity (D. T. Campbell & Fiske, 1959). This is related to the “glop problem,” which asserts that every bad thing is associated with every other bad thing—there is high comorbidity. Therefore, researchers needed some way to distinguish method variance from construct variance. This led to the development of the multitrait-multimethod matrix.

5.3.1.4.2 Multitrait-Multimethod Matrix (MTMM)

The multitrait-multimethod matrix (MTMM), as proposed by Campbell and Fiske (1959), is a concrete way to evaluate the validity of a measure. The MTMM allows you to split the variance of measures’ scores into variance that is due to the method (i.e., method variance or method bias) and variance that is due to the construct (trait) of interest (i.e., construct variance). To create an MTMM, you need at least two methods and at least two constructs. For example, an MTMM could include self-report and observation of depression and introversion. You would then examine the correlations across combinations of construct and method.

For an example of an MTMM, see Figure 5.9. Several aspects of psychometrics can be evaluated with an MTMM, including reliability, convergent validity, and discriminant validity. The reliability diagonal of an MTMM is the correlation of a variable with itself, i.e., the test–retest reliability or monotrait-monomethod correlations. The reliability coefficients should be the highest values in the matrix because each measure should be more correlated with itself than with anything else.

Multitrait-Multimethod Matrix. (Figure reprinted from Campbell and Fiske (1959), Table 1, p. 82. Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105. https://doi.org/10.1037/h0046016)

Figure 5.9: Multitrait-Multimethod Matrix. (Figure reprinted from Campbell and Fiske (1959), Table 1, p. 82. Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105. https://doi.org/10.1037/h0046016)

A multitrait-multimethod matrix can be organized by method and then by construct or vice versa. An MTMM organized by method then by construct is depicted in Figure 5.10.

Figure 5.10: Multitrait-Multimethod Matrix Organized by Method Then by Construct.

An MTMM organized by construct then by method is depicted in Figure 5.11.

Figure 5.11: Multitrait-Multimethod Matrix Organized by Construct Then by Method.

5.3.1.4.2.1 Convergent Validity

Convergent validity is the extent to which a measure is associated with other measures of the same target construct. In an MTMM, convergent validity evaluates whether measures targeting the same construct, but using different methods, converge upon the same construct. These are observed in the validity diagonals, also known as the convergent correlations or the monotrait-heteromethod correlations. For strong convergent validity, we would expect the values in the validity diagonals to be significant and high-ish.

5.3.1.4.2.2 Discriminant (Divergent) Validity

You can also evaluate the discriminant validity, also called divergent validity, of a measure in the context of an MTMM. Discriminant validity is the extent to which a measure is not associated (or less strongly associated) with measures of different constructs that are not theoretically expected to be related (compared to associations with measures of the same construct). In an MTMM, discriminant validity determines the extent to which a measure does not correlate with measures that share a method but assess different constructs. For strong discriminant validity, we would expect the discriminant correlations (heterotrait monomethod) to be low.

According to Campbell and Fiske (1959), discriminant validity of measures can be established when three criteria are met:

The convergent (monotrait-heteromethod) correlations are stronger than heterotrait-heteromethod correlations. This provides the weakest evidence for discriminant validity. That is, the convergent correlations are higher than the values in the same column or row in the same heteromethod block. This can be evaluated using the heterotrait-monotrait ratio (described below).
The convergent (monotrait-heteromethod) correlations are stronger than discriminant correlations (monomethod-heterotrait). This provides stronger evidence of discriminant validity.
The patterns of intercorrelations between constructs are the same, regardless of which measurement method is used. That is, the pattern of inter-trait associations is the same in all triangles. For example, if extraversion and anxiety are moderately correlated with each other but uncorrelated with achievement, we would expect this pattern of interrelations between constructs would hold, regardless of which method was used to assess the construct.

You can estimate a measure’s degree of discriminant validity based on the heterotrait-monotrait ratio [HTMT; Henseler et al. (2015); Roemer et al. (2021)]. HTMT is the average of the heterotrait-heteromethod correlations (i.e., the correlations of measures from different measurement methods that assess different constructs), relative to the average of the monotrait-heteromethod correlations (i.e., the correlations of measures from different measurement methods that assess the same construct). As described here (https://www.henseler.com/htmt.html) (archived at https://perma.cc/A9DU-WZWQ) based on evidence from Voorhees et al. (2016),

If the HTMT is clearly smaller than one, discriminant validity can be regarded as established. In many practical situations, a threshold of 0.85 reliably distinguishes between those pairs of latent variables that are discriminant valid and those that are not.

The authors updated the HTMT to use the geometric mean of the correlations rather than the arithmetic mean of the correlations to relax the assumption of tau-equivalence (Roemer et al., 2021). They called the updated index HTMT2. Unlike HTMT, HTMT2 can be used only if all inter-variable correlations are positive. HTMT2 was calculated below using the semTools package (Jorgensen et al., 2021). In this case, the HTMT values are less than .85, providing evidence of discriminant validity.

Code

modelCFA <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'

htmt(
  modelCFA,
  data = HolzingerSwineford1939,
  missing = "ml")

        visual textul speed
visual   1.000             
textual  0.384  1.000      
speed    0.387  0.280 1.000

Other indexes of discriminant validity include the \(\text{CI}_\text{CFA}\text{(sys)}\) (Rönkkö & Cho, 2020), \(\chi^2\text{(sys)}\) (Rönkkö & Cho, 2020), and the Fornell-Larcker Ratio [FLR; Fornell & Larcker (1981)]. The FLR is based on the idea that a construct should be more strongly associated with its indicators than with other constructs, and is computed as the ratio of (a) the highest squared inter-correlation the construct has with other constructs to (b) the average variance extracted [AVE] by the construct. A criterion of FLR < 1 evaluates whether the AVE of a factor is greater than the highest squared inter-correlation between the factor and other factors in the model (Fornell & Larcker, 1981). Another index estimates the degree of correspondence of convergent and discriminant validity correlations with a priori hypotheses as an index of construct validity (Furr & Heuckeroth, 2019).

Using an MTMM, a researcher can learn a lot about the quality of a measure and build a nomological network. For instance, we can estimate the extent of method variance, or variance that is attributable to the measurement method rather than the construct of interest. Method variance is estimated by the difference between monomethod versus heteromethod blocks.

A poor example of an MTMM would be having two constructs such as height and depression and two methods such as Likert and true/false. This would be a poor MTMM for two reasons: (1) there are trivial differences in the methods, which would lead to obvious convergence, and (2) the differences between the constructs are not important—they are obviously discriminant. Using such an MTMM would find strong convergent associations because the methods are maximally similar and weak discriminant associations because the constructs are maximally different. It would be better to use maximally different measurement methods (e.g., self-report and performance-based measures) and to use constructs that are important to distinguish (e.g., depression and anxiey).

The paper by Campbell and Fiske (1959) that introduced the MTMM is one of the most widely cited papers in psychology of all time. Psychological Bulletin published a more recent paper by Fiske and Campbell (1992), entitled “Citations Do Not Solve Problems.” They noted how their original paper was the most widely cited paper in the history of Psychological Bulletin, but they argued that nothing came of their paper. The MTMM matrices published today show little-to-no improvement compared to the ones they published in 1959. In part, this may be because we need better measures (i.e., a higher ratio of construct to method variance).

5.3.1.4.2.3 Multitrait-Multimethod Correlation Matrix

This example is courtesy of W. Joel Schneider. First, we simulate data for an MTMM model using a fixed model with population parameters using the simstandard package (Schneider, 2021):

Code

model_fixed <- '
 Verbal =~ .5*VO1 + .6*VO2 + .7*VO3 +
           .7*VW1 + .6*VW2 + .5*VW3 + 
           .6*VM1 + .7*VM2 + .5*VM3 
 Spatial =~ .7*SO1 + .7*SO2 + .6*SO3 + 
            .6*SW1 + .7*SW2 + .5*SW3 + 
            .7*SM1 + .5*SM2 + .7*SM3 
 Quant =~ .5*QO1 + .7*QO2 + .5*QO3 + 
          .5*QW1 + .6*QW2 + .7*QW3 + 
          .5*QM1 + .6*QM2 + .7*QM3 
 Oral =~ .4*VO1 + .5*VO2 + .3*VO3 + 
         .3*SO1 + .3*SO2 + .5*SO3 + 
         .6*QO1 + .3*QO2 + .4*QO3
 Written =~ .6*VW1 + .4*VW2 + .3*VW3 + 
            .6*SW1 + .5*SW2 + .5*SW3 + 
            .4*QW1 + .4*QW2 + .5*QW3
 Manipulative =~ .5*VM1 + .5*VM2 + .3*VM3 + 
                 .5*SM1 + .5*SM2 + .6*SM3 + 
                 .4*QM1 + .3*QM2 + .3*QM3
 Verbal ~~ .7*Spatial + .6*Quant
 Spatial ~~ .5*Quant
'

MTMM_data <- sim_standardized(
  model_fixed,
  n = 10000,
  observed = TRUE,
  latent = FALSE,
  errors = FALSE)

A multitrait-multimethod matrix correlation matrix is below.

Code

round(cor(MTMM_data), 2)

5.3.1.4.2.4 Construct Validity Beyond Campbell and Fiske

There are a number of other approaches that can be helpful for establishing construct validity in ways that go beyond the approaches proposed by Campbell and Fiske (1959). One way is known as triangulation. Triangulation is conceptually depicted in Figure 5.12. Triangulation involves testing a hypothesis with multiple measures and/or methods to see if the findings are consistent—that is, whether the findings triangulate.

Figure 5.12: Using Triangulation to Arrive at a Closer Estimate of the Construct Using Multiple Measures and/or Methods.

A contemporary approach to multitrait-multimethod modeling uses confirmatory factor analysis, as described below.

5.3.1.4.2.5 MTMM in Confirmatory Factor Analysis (CFA)

Using modern modeling approaches, there are even more advanced ways of examining an MTMM. For instance, you can use structural equation modeling (SEM) or confirmatory factor analysis (CFA) to derive a latent variable of a construct from multiple methods to be free from method-related error variance to generate purer assessments of the construct to see how it relates to other constructs. For an example of an MTMM in confirmatory factor analysis, see Figure 5.13 and Section 14.4.2.13 in Chapter 14 on factor analysis.

Multitrait-Multimethod Model in Confirmatory Factor Analysis With Three Constructs (Internalizing, Externalizing, and Thought-Disordered Problems) and Three Methods (Mother-, Father-, and Teacher-report).

Figure 5.13: Multitrait-Multimethod Model in Confirmatory Factor Analysis With Three Constructs (Internalizing, Externalizing, and Thought-Disordered Problems) and Three Methods (Mother-, Father-, and Teacher-report).

The confirmatory factor analysis (CFA) model was fit in the lavaan package (Rosseel et al., 2022).

Code

modelMTMM <- '
 g =~ Verbal + Spatial + Quant
 Verbal =~ 
   VO1 + VO2 + VO3 + 
   VW1 + VW2 + VW3 + 
   VM1 + VM2 + VM3
 Spatial =~ 
   SO1 + SO2 + SO3 + 
   SW1 + SW2 + SW3 + 
   SM1 + SM2 + SM3
 Quant =~ 
   QO1 + QO2 + QO3 + 
   QW1 + QW2 + QW3 + 
   QM1 + QM2 + QM3
 Oral =~ 
   VO1 + VO2 + VO3 + 
   SO1 + SO2 + SO3 + 
   QO1 + QO2 + QO3
 Written =~ 
   VW1 + VW2 + VW3 + 
   SW1 + SW2 + SW3 + 
   QW1 + QW2 + QW3
 Manipulative =~ 
   VM1 + VM2 + VM3 + 
   SM1 + SM2 + SM3 + 
   QM1 + QM2 + QM3
'

MTMM.fit <- cfa(
  modelMTMM,
  data = MTMM_data,
  orthogonal = TRUE,
  missing = "ml")

summary(
  MTMM.fit,
  fit.measures = TRUE,
  standardized = TRUE,
  rsquare = TRUE)

lavaan 0.6-20 ended normally after 72 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                       111

  Number of observations                         10000
  Number of missing patterns                         1

Model Test User Model:
                                                      
  Test statistic                               254.449
  Degrees of freedom                               294
  P-value (Chi-square)                           0.954

Model Test Baseline Model:

  Test statistic                            146391.496
  Degrees of freedom                               351
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       1.000
                                                      
  Robust Comparative Fit Index (CFI)             1.000
  Robust Tucker-Lewis Index (TLI)                1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)            -310030.970
  Loglikelihood unrestricted model (H1)    -309903.745
                                                      
  Akaike (AIC)                              620283.939
  Bayesian (BIC)                            621084.287
  Sample-size adjusted Bayesian (SABIC)     620731.545

Root Mean Square Error of Approximation:

  RMSEA                                          0.000
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.000
  P-value H_0: RMSEA <= 0.050                    1.000
  P-value H_0: RMSEA >= 0.080                    0.000
                                                      
  Robust RMSEA                                   0.000
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.000
  P-value H_0: Robust RMSEA <= 0.050             1.000
  P-value H_0: Robust RMSEA >= 0.080             0.000

Standardized Root Mean Square Residual:

  SRMR                                           0.006

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Observed
  Observed information based on                Hessian

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  g =~                                                                  
    Verbal            1.000                               0.919    0.919
    Spatial           1.159    0.032   36.586    0.000    0.767    0.767
    Quant             0.711    0.022   31.970    0.000    0.651    0.651
  Verbal =~                                                             
    VO1               1.000                               0.499    0.498
    VO2               1.217    0.026   47.525    0.000    0.607    0.605
    VO3               1.424    0.029   48.471    0.000    0.710    0.710
    VW1               1.434    0.030   48.380    0.000    0.715    0.709
    VW2               1.230    0.029   42.960    0.000    0.613    0.607
    VW3               1.073    0.028   38.834    0.000    0.535    0.529
    VM1               1.231    0.028   43.741    0.000    0.614    0.610
    VM2               1.408    0.030   47.141    0.000    0.702    0.703
    VM3               0.991    0.027   37.122    0.000    0.494    0.496
  Spatial =~                                                            
    SO1               1.000                               0.693    0.699
    SO2               1.011    0.015   65.536    0.000    0.700    0.700
    SO3               0.861    0.014   61.111    0.000    0.597    0.596
    SW1               0.863    0.015   58.554    0.000    0.598    0.597
    SW2               1.011    0.015   65.716    0.000    0.700    0.701
    SW3               0.725    0.015   47.749    0.000    0.502    0.505
    SM1               1.024    0.016   64.837    0.000    0.709    0.713
    SM2               0.712    0.016   45.721    0.000    0.493    0.495
    SM3               1.024    0.015   66.814    0.000    0.709    0.712
  Quant =~                                                              
    QO1               1.000                               0.501    0.504
    QO2               1.376    0.026   52.340    0.000    0.689    0.694
    QO3               1.015    0.023   44.104    0.000    0.508    0.504
    QW1               1.017    0.025   40.585    0.000    0.509    0.506
    QW2               1.176    0.026   44.977    0.000    0.588    0.595
    QW3               1.388    0.027   50.622    0.000    0.695    0.696
    QM1               1.003    0.025   40.236    0.000    0.502    0.504
    QM2               1.185    0.027   43.897    0.000    0.593    0.590
    QM3               1.408    0.029   48.103    0.000    0.705    0.702
  Oral =~                                                               
    VO1               1.000                               0.405    0.404
    VO2               1.219    0.037   32.726    0.000    0.494    0.492
    VO3               0.732    0.028   26.016    0.000    0.297    0.297
    SO1               0.722    0.029   24.751    0.000    0.293    0.295
    SO2               0.732    0.030   24.777    0.000    0.297    0.296
    SO3               1.264    0.039   32.392    0.000    0.512    0.512
    QO1               1.440    0.045   32.186    0.000    0.583    0.587
    QO2               0.765    0.030   25.496    0.000    0.310    0.312
    QO3               0.993    0.036   27.328    0.000    0.402    0.399
  Written =~                                                            
    VW1               1.000                               0.596    0.591
    VW2               0.674    0.015   43.950    0.000    0.402    0.397
    VW3               0.484    0.017   28.450    0.000    0.289    0.286
    SW1               1.005    0.016   63.516    0.000    0.599    0.598
    SW2               0.843    0.014   59.466    0.000    0.503    0.503
    SW3               0.820    0.017   48.050    0.000    0.489    0.491
    QW1               0.670    0.018   38.290    0.000    0.400    0.397
    QW2               0.661    0.016   40.755    0.000    0.394    0.398
    QW3               0.844    0.015   57.187    0.000    0.503    0.504
  Manipulative =~                                                       
    VM1               1.000                               0.495    0.493
    VM2               0.998    0.021   47.750    0.000    0.495    0.495
    VM3               0.588    0.023   26.003    0.000    0.291    0.293
    SM1               0.959    0.022   43.879    0.000    0.475    0.478
    SM2               1.004    0.026   39.327    0.000    0.498    0.499
    SM3               1.176    0.024   49.284    0.000    0.583    0.585
    QM1               0.811    0.024   33.472    0.000    0.402    0.403
    QM2               0.627    0.022   27.982    0.000    0.311    0.309
    QM3               0.606    0.021   29.284    0.000    0.300    0.299

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  g ~~                                                                  
    Oral              0.000                               0.000    0.000
    Written           0.000                               0.000    0.000
    Manipulative      0.000                               0.000    0.000
  Oral ~~                                                               
    Written           0.000                               0.000    0.000
    Manipulative      0.000                               0.000    0.000
  Written ~~                                                            
    Manipulative      0.000                               0.000    0.000

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .VO1              -0.007    0.010   -0.717    0.474   -0.007   -0.007
   .VO2              -0.013    0.010   -1.253    0.210   -0.013   -0.013
   .VO3              -0.008    0.010   -0.814    0.415   -0.008   -0.008
   .VW1               0.015    0.010    1.463    0.143    0.015    0.015
   .VW2               0.011    0.010    1.043    0.297    0.011    0.010
   .VW3              -0.006    0.010   -0.572    0.567   -0.006   -0.006
   .VM1              -0.011    0.010   -1.089    0.276   -0.011   -0.011
   .VM2              -0.010    0.010   -0.994    0.320   -0.010   -0.010
   .VM3              -0.012    0.010   -1.196    0.232   -0.012   -0.012
   .SO1              -0.006    0.010   -0.587    0.557   -0.006   -0.006
   .SO2              -0.014    0.010   -1.402    0.161   -0.014   -0.014
   .SO3              -0.008    0.010   -0.825    0.409   -0.008   -0.008
   .SW1               0.004    0.010    0.406    0.685    0.004    0.004
   .SW2               0.007    0.010    0.674    0.500    0.007    0.007
   .SW3               0.005    0.010    0.473    0.637    0.005    0.005
   .SM1              -0.009    0.010   -0.914    0.361   -0.009   -0.009
   .SM2              -0.008    0.010   -0.755    0.450   -0.008   -0.008
   .SM3              -0.012    0.010   -1.187    0.235   -0.012   -0.012
   .QO1               0.005    0.010    0.477    0.633    0.005    0.005
   .QO2               0.005    0.010    0.457    0.648    0.005    0.005
   .QO3              -0.001    0.010   -0.143    0.886   -0.001   -0.001
   .QW1               0.010    0.010    0.950    0.342    0.010    0.009
   .QW2               0.013    0.010    1.268    0.205    0.013    0.013
   .QW3               0.025    0.010    2.554    0.011    0.025    0.026
   .QM1              -0.000    0.010   -0.046    0.963   -0.000   -0.000
   .QM2               0.009    0.010    0.876    0.381    0.009    0.009
   .QM3               0.004    0.010    0.380    0.704    0.004    0.004

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .VO1               0.591    0.009   63.508    0.000    0.591    0.589
   .VO2               0.395    0.007   54.567    0.000    0.395    0.392
   .VO3               0.407    0.007   57.442    0.000    0.407    0.408
   .VW1               0.151    0.004   36.640    0.000    0.151    0.148
   .VW2               0.485    0.008   64.170    0.000    0.485    0.474
   .VW3               0.652    0.010   67.068    0.000    0.652    0.638
   .VM1               0.389    0.007   58.677    0.000    0.389    0.385
   .VM2               0.260    0.005   50.781    0.000    0.260    0.261
   .VM3               0.662    0.010   67.306    0.000    0.662    0.668
   .SO1               0.417    0.007   58.772    0.000    0.417    0.424
   .SO2               0.423    0.007   58.685    0.000    0.423    0.423
   .SO3               0.383    0.007   53.146    0.000    0.383    0.383
   .SW1               0.287    0.005   52.740    0.000    0.287    0.286
   .SW2               0.255    0.005   53.088    0.000    0.255    0.255
   .SW3               0.499    0.008   63.923    0.000    0.499    0.504
   .SM1               0.261    0.005   55.403    0.000    0.261    0.264
   .SM2               0.503    0.008   63.437    0.000    0.503    0.506
   .SM3               0.150    0.004   37.860    0.000    0.150    0.151
   .QO1               0.396    0.008   46.709    0.000    0.396    0.402
   .QO2               0.415    0.007   57.018    0.000    0.415    0.421
   .QO3               0.595    0.009   62.959    0.000    0.595    0.586
   .QW1               0.592    0.009   65.191    0.000    0.592    0.586
   .QW2               0.475    0.008   62.430    0.000    0.475    0.487
   .QW3               0.260    0.005   48.515    0.000    0.260    0.261
   .QM1               0.581    0.009   63.593    0.000    0.581    0.584
   .QM2               0.561    0.009   63.074    0.000    0.561    0.556
   .QM3               0.420    0.007   56.238    0.000    0.420    0.417
    g                 0.210    0.009   22.548    0.000    1.000    1.000
   .Verbal            0.039    0.004   10.917    0.000    0.155    0.155
   .Spatial           0.198    0.007   28.581    0.000    0.412    0.412
   .Quant             0.144    0.006   25.995    0.000    0.576    0.576
    Oral              0.164    0.009   18.420    0.000    1.000    1.000
    Written           0.355    0.009   38.114    0.000    1.000    1.000
    Manipulative      0.245    0.009   26.149    0.000    1.000    1.000

R-Square:
                   Estimate
    VO1               0.411
    VO2               0.608
    VO3               0.592
    VW1               0.852
    VW2               0.526
    VW3               0.362
    VM1               0.615
    VM2               0.739
    VM3               0.332
    SO1               0.576
    SO2               0.577
    SO3               0.617
    SW1               0.714
    SW2               0.745
    SW3               0.496
    SM1               0.736
    SM2               0.494
    SM3               0.849
    QO1               0.598
    QO2               0.579
    QO3               0.414
    QW1               0.414
    QW2               0.513
    QW3               0.739
    QM1               0.416
    QM2               0.444
    QM3               0.583
    Verbal            0.845
    Spatial           0.588
    Quant             0.424

A path diagram of the model is depicted in Figure 5.14 using the semPlot package (Epskamp, 2022).

Code

semPaths(
  MTMM.fit,
  what = "std",
  layout = "tree3",
  bifactor = c("Verbal","Spatial","Quant","g"))

Figure 5.14: Multitrait-Multimethod Model in Confirmatory Factor Analysis.

5.3.1.5 Incremental Validity

Accuracy is not enough for a measure to be useful. Proposed by Sechrest (1963), measures should also be judged by the extent to which they provide an increment in predictive efficiency over the information otherwise easily and cheaply available. Incremental validity deals with the incremental value or utility over a measure beyond other sources of information. It must be demonstrated that the addition of a measure will produce better predictions than are made on the basis of information ordinarily available. It is not enough to show that the measure is better than chance, and the measure should not just be capitalizing on shared method variance with the criterion or on increased reliability of the measure. That is, the measure should explain truly unique variance—variance that was not explained before. Incremental validity demonstrates added value, unless the other measure is cheaper or less time-consuming. Incremental validity is a specific kind of criterion-related validity: significantly increased \(R^2\) in hierarchical regression. The incremental validity of a measure can be evaluated by examining whether the measure explains significant unique variance in the criterion when accounting for other information, such as easily accessible information, traditionally available measures, or the current gold-standard measure. The extent of incremental validity of a measure can be quantified with the change in the coefficient of determination (\(\Delta R^2\)) that compares (a) the model that includes the old predictor(s) to (b) the model that includes old predictors and the new predictor (measure).

Code

model1 <- lm(
  criterion ~ oldpredictor,
  data = na.omit(mydataValidity))

model2 <- lm(
  criterion ~ oldpredictor + predictor,
  data = na.omit(mydataValidity))

summary(model1)


Call:
lm(formula = criterion ~ oldpredictor, data = na.omit(mydataValidity))

Residuals:
     Min       1Q   Median       3Q      Max 
-21.9998  -4.6383   0.0737   4.4824  24.3993 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  20.42618    1.41547   14.43   <2e-16 ***
oldpredictor  0.79479    0.01394   57.01   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.015 on 900 degrees of freedom
Multiple R-squared:  0.7831,    Adjusted R-squared:  0.7829 
F-statistic:  3250 on 1 and 900 DF,  p-value: < 2.2e-16

Code

summary(model2)


Call:
lm(formula = criterion ~ oldpredictor + predictor, data = na.omit(mydataValidity))

Residuals:
     Min       1Q   Median       3Q      Max 
-20.7095  -4.2435  -0.1041   4.0814  23.1936 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  16.04252    1.33293   12.04   <2e-16 ***
oldpredictor  0.65476    0.01645   39.80   <2e-16 ***
predictor     0.36889    0.02745   13.44   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.405 on 899 degrees of freedom
Multiple R-squared:  0.8194,    Adjusted R-squared:  0.819 
F-statistic:  2040 on 2 and 899 DF,  p-value: < 2.2e-16

Code

model1Rsquare <- summary(model1)$r.squared
model2Rsquare <- summary(model2)$r.squared

model1RsquareAdj <- summary(model1)$adj.r.squared
model2RsquareAdj <- summary(model2)$adj.r.squared

deltaRsquare <- getDeltaRsquare(model2)[["predictor"]]

The deltaR-square values: the change in the R-square
      observed when a single term is removed.
Same as the square of the 'semi-partial correlation coefficient'
             deltaRsquare
oldpredictor   0.31824800
predictor      0.03627593

Code

deltaRsquareAdj <- model2RsquareAdj - model1RsquareAdj

deltaRsquareAdj

[1] 0.03611514

Code

anova(model1, model2)

Code

getDeltaRsquare(model2)

The deltaR-square values: the change in the R-square
      observed when a single term is removed.
Same as the square of the 'semi-partial correlation coefficient'
             deltaRsquare
oldpredictor   0.31824800
predictor      0.03627593

\(\Delta R^2\) was calculated using the rockchalk package (P. E. Johnson, 2022). The predictor shows significant incremental validity above the old predictor in predicting the criterion. Model 1 explains \(78.31\)% of the variance \((R^2 = 0.78)\), and Model 2 explains \(81.94\)% of the variance \((R^2 = 0.82)\). Thus, the predictor explains only \(3.63\)% additional variance in the criterion above the variance explained by the old predictor \((\Delta R^2)\). Based on adjusted \(R^2\) (\(R^2_{adj}\)), the predictor explains only \(3.61\)% additional variance in the criterion above the variance explained by the old predictor (\(\Delta R^2_{adj}\)).

5.3.1.6 Treatment Utility of Assessment

S. C. Hayes et al. (1987) argued that it is not enough for measures to be reliable and valid. The authors raised another important consideration for the validity of a measure: its treatment utility or usefulness. They asked the question: What goal do assessments accomplish in clinical psychology? In clinical psychology, the goal of assessment is to lead to better treatment outcomes. Therefore, the treatment utility of a measure is the extent to which a measure is shown to contribute to beneficial treatment outcomes. That is, if a clinician has the information/results from having administered this measure, do the clients have better outcomes? The treatment utility of assessment is a specific kind of criterion-related validity, with the idea that “all criteria are not created equal.” And the criterion that is most important to optimize, from this perspective, when developing and selecting measures is a client’s treatment outcomes.

S. C. Hayes et al. (1987) described different approaches to evaluating the extent to which a measure shows treatment utility. These are a priori group comparison approaches that examine whether a specific difference in the assessment approach relates to treatment outcome. They distinguished between (a) manipulated assessment and (b) manipulated use. Manipulated assessment and manipulated use are depicted in Figure 5.15.

Figure 5.15: Research Designs That Evaluate the Treatment Utility of Assessment.

In manipulated assessment, a single group of subjects is randomly divided into two subgroups, and either the collection or availability of assessment data is varied systematically. Therapists then design or implement treatment in accord with the data available. As an example, the measure of interest may be administered in one condition, and the standard measure may be administered in the other condition. Then the treatment outcomes would be compared across the two conditions. An advantage of manipulated assessment is that this type of design is more realistic than manipulated use. Making the assessment data available but not assigning a certain treatment based on the assessment outcomes better simulates a realistic clinical environment. Also, because the control group has no access to the data, it might serve as a stronger control. A disadvantage of manipulated assessment is that it depends on whether and how clinicians use the measure, which depends on how positively the measure was received by the clinicians.

In manipulated use, the same assessment information is available for all subjects, but the researcher manipulates the way in which the assessment information is used. For example, one group gets a treatment matched to assessment outcomes, and the other group gets a cross-matched treatment that does not target the problem identified by the assessment. So, in one group, the assessment information is used to match the treatment to the client based on their assessment results, whereas the other group receives a standard treatment regardless of their assessment results. An advantage of this design is that you can be certain that the treatment decisions are explicitly informed by the assessment results because the researcher ensures this, whereas the decision of how the assessment information is used is up to the clinician in manipulated assessment.

Relatively few measures show evidence of treatment utility of assessment. For instance, a clinical trial using the manipulated assessment approach determined that the Development and Well-Being Assessment (DAWBA) did not show treatment utility as a diagnostic assessment for children and adolescents (Sayal et al., in press). A review on the treatment utility of assessment is provided by Nelson-Gray (2003).

5.3.1.7 Discriminative Validity

Discriminative validity is the degree to which a measure accurately identifies persons placed into groups on the basis of another measure. Discriminative validity is not to be confused with discriminant (divergent) validity and discriminant analysis. A measure shows discriminant validity if it does not correlate with things that it would not be theoretically expected to correlate with. A measure shows discriminative validity, by contrast, if the measure is able to accurately differentiate things (e.g., two groups such as men and women). Discriminant analysis is a model that combines predictors to differentiate between multiple groups.

5.3.1.8 Elaborative Validity

Elaborative validity involves the extent to which a measure increases our theoretical understanding of the target construct or of neighboring constructs. It deals with a measure’s meaningfulness. Elaborative validity is a type of incremental theoretical validity. It is a combination of criterion-related validity and construct validity that examines how much a given measure increases our understanding of a construct’s nomological network. However, I am unaware of strong examples of measures that show strong elaborative validity in psychology.

5.3.1.9 Consequential Validity

Consequential validity is a form of validity that differs from evidential validity, or a test’s potential to provide accurate, useful information based on research. Consequential validity takes a more macro-view and deals with the community, sociological, and public policy perspective. Consequential validity evaluates the consequences of our measures beyond the circumstances of their development, based on their impact. Measures can have positive, negative, or mixed effects on society. An example of consequential validity would be asking what the impacts of aptitude testing are on society—how aptitude testing affects society in the long run. Some would argue that, even in cases where the aptitude tests have some degree of evidential validity (i.e., they accurately assess to some degree what they tend to assess), they are consequentially invalid—that is, they have had a net negative effect on society and, due to their low value to society, are invalid. The tests themselves may be fine in terms of their accuracy, but consequential validity says that their validity depends on what we do with the test, i.e., how the test is used.

Another example of consequential validity is when the validity of a measure changes over time due to changes in people’s or society’s response, as depicted in Figure 5.16. Judgments and predictions can change the way society reacts and the way people behave so that the predictions become either more or less accurate. A prediction that becomes more accurate as a result of people’s response to a prediction is a self-fulfilling prediction or prophecy. For instance, if school staff make a prediction that a child will be held back a grade, the teacher may not provide adequate opportunities for the child to learn, which may lead to the child being more likely to be held back.

Figure 5.16: Invalidation of a Measure Due to Society’s Response to the Use of the Measure.

A prediction that becomes less accurate as a result of people’s response to a prediction is a self-canceling prediction. The most effective prediction about a flu outbreak might be one that leads people to safer behavior; therefore lowering flu rates, which does not correspond to the initial prediction (Silver, 2012). Society’s response to a measure can invalidate the measure. For example, consider that an organization rates cities based on quality-of-life measures. If quality-of-life indices include the percent of solved cases by the city’s police (i.e., the clearance rate), cities may try to improve their ratings of “quality-of-life” by increasing their clearance rate either by increasing the number of cases they mark as solved (such as by marking cases falsely as resolved) or by decreasing the number of cases (such as by investigating fewer cases). That is, cities may “game” the system based on the quality-of-life indices used by various organizations. In this case, the clearance rate becomes a less accurate measure of a city’s quality of life.

Another example of a measure that becomes less valid due to society’s response is the use of alumni donations as an indicator of the strength of a university that is used for generating university rankings. Such an indicator could lead schools to admit wealthier students who give the most donations, and students whose parents were alumni and provide lavish donations to the university. Yet another example could be standardized testing, where instructors may “teach to the test” such that better performance might not necessarily reflect better underlying competence.

5.3.1.10 Representational Validity

Representational validity examines the extent to which the items or content of a measure “flesh out” and mirror the true nature and mechanisms of the construct. There are many different types of validity, but many of them are overlapping and can be considered types of other forms of validity. For instance, representational validity is a type of elaborative validity and content validity.

5.3.1.11 Factorial (Structural) Validity

Factorial validity is examined in Section 14.1.4.2.2 on factor analysis. Factorial validity considers whether the factor structure of the measure(s) is consistent with the construct. According to factorial validity, if you claim that the measure is unidimensional with four items, factor analysis should support the unidimensionality. Thus, it involves testing empirically whether the measure has the same structure as would be expected based on theory. It is a type of construct validity.

5.3.1.12 Ecological Validity

Ecological validity examines the extent to which a measure provides scores that are indicative of the behavior of a person in the natural environment.

5.3.1.13 Process-Focused Validity

Process-focused validity attempts to get closer to the underlying mechanisms. Process-focused validity examines the degree to which respondents engage in a predictable set of psychological processes (which are specified a priori) when completing the measure (Bornstein, 2011; Furr, 2017). These psychological processes include effects of the instrument (e.g., observer versus third-party report versus projective test) as well as effects of the context (e.g., assessment setting, assessment instructions, affect state of the participant, etc.). To determine whether a test is valid in the process-focused technique, one experimentally manipulates variables that moderate the test score–criterion association—to better understand the underlying processes. The ideal outcome is a measure that both (1) has an adequate outcome (correlations where expected) as well as (2) adequate process validity—the psychological processes that one engages in are well hypothesized.

The idea of process-focused validity is that if a measure does not inform us about process or mechanisms, it is not worth doing. Process-focused validity is a type of elaborative validity and construct validity. For instance, consider the common finding that low socioeconomic status is associated with poorer outcomes. To make an impact, process-focused validity would argue that we need to know the mechanisms that underlie this association, and it involves how a measure helps us understand process.

5.3.1.14 Diagnostic Validity

Diagnostic validity is the extent to which the diagnostic category accurately captures the abnormal phenomenon of interest. It is a form of construct validity for diagnoses.

5.3.1.15 Social Validity

Social validity involves the extent to which the proposed procedures (assessment, intervention, etc.) will be well-liked and acceptable by those who receive and implement them.

5.3.1.16 Cultural Validity

Cultural validity refers to “the effectiveness of a measure or the accuracy of a clinical diagnosis to address the existence and importance of essential cultural factors” (Leong & Kalibatseva, 2016, p. 58). Essential cultural factors may include values, beliefs, experiences, communication patterns, and approaches to knowledge (epistemologies).

5.3.2 Research Design (Experimental) Validity

Aspects of research design validity, also called experimental validity, involve the validity of a research design for making various interpretations. Research design validity includes internal validity, external validity, and conclusion validity.

5.3.2.1 Internal Validity

Internal validity is the extent to which causal inference is justified from the research design. This encompasses multiple features including

temporal precedence—does the (purported) cause occur before the (purported) effect?
covariation of cause and effect—correlation is necessary (even if insufficient) for causality
no plausible alternative explanations—such as third variables that could influence both variables and explain their covariation

There are number of potential threats to internal validity, which are important to consider when designing studies and interpreting findings. Examples of potential threats to internal validity include history, maturation, testing, instrumentation, regression, selection, attrition, and an interaction of threats (Shadish et al., 2002, Table 2.4; Slack & Draugalis, 2001).

Research designs differ in the extent to which they show internal validity versus external validity. An experiment is a research design in which one or more variables (independent variables) are manipulated to observe how the manipulation influences the dependent variable. In an experiment, the researcher has greater control over the variables and attempts to hold everything else constant (e.g., by standardization and random assignment). In correlational designs, however, the researcher has less control over the variables. They may be able to statistically account for potential confounds using covariates or for the reverse direction of effect using longitudinal designs. Nevertheless, we can have greater confidence about whether a variable influences another variable in an experiment. Thus, experiments tend to have higher internal validity than correlational designs.

5.3.2.2 External Validity

External validity is the extent to which the findings can be generalized across person, place, and time—e.g., to the broader population and the real world. External validity is crucial to consider for studies that intend to make inferences to people outside of those who were assessed. For instance, norm-referenced assessments attempt to identify the distribution of scores for a given population from a sample within that population. The validity of the norms of a norm-referenced assessment are important to consider (Achenbach, 2001). The validity of norms and external validity, more broadly, can depend highly on how representative the sample is of the target population and how appropriate this population is to a given participant. Some measures are known to have poor norms, including the Exner Comprehensive System (Exner, 1974; Exner & Erdberg, 2005) for administering and scoring the Rorschach Inkblot Test, which has been known to over-pathologize (Wood, Teresa, et al., 2001; Wood, Nezworski, et al., 2001). The Rorschach is discussed in greater detail in Chapter 19. Threats to external validity are described by Shadish et al. (2002) (see Table 3.2).

5.3.2.2.1 Tradeoffs of Internal Validity and External Validity

It is important to note that there is a tradeoff between internal and external validity—a single research design cannot have both high internal and high external validity. Some research designs are better suited for making causal inferences, whereas other designs tend to be better suited for making inferences that generalize to the real world. The research design that is best suited to making causal inferences is an experiment, where the researcher manipulates one variable (the independent variable) and holds all other variables constant to see how a change in the independent variable influences the outcome (dependent variable). Thus, experiments tend to have higher internal validity than other research designs. However, by manipulating one variable and holding everything else constant, the research takes place in a very standardized fashion that can become like studying a process in a vacuum. So, even if a process is theoretically causal in a vacuum, it may act very differently in the real world when it interacts with other processes.

Observational designs have greater capacity for external validity than experimental designs because people can be observed in their natural environments to see how variables are related in the real world. However, the greater external validity comes at a cost of lower internal validity. Observational designs are not well-positioned to make causal inferences because they have multiple threats to internal validity, including issues of temporal precedence in cross-sectional observational designs, and there are numerous potential third variables (i.e., confounds) that could act as a common cause of both the predictor and outcome variables. Thus, just because two variables are associated does not necessarily mean that they are causally related.

As the internal validity of a study’s design increases, its external validity tends to decrease. The greater control we have over variables (and, therefore, have greater confidence about causal inferences), the lower the likelihood that the findings reflect what happens in the real world because it is studying things in a metaphorical vacuum. Because no single research design can have both high internal and external validity, scientific inquiry needs a combination of many different research designs so we can be more confident in our inferences—experimental designs for making causal inferences and observational designs for making inferences that are more likely to reflect the real world.

Case studies, because they have smaller sample sizes, tend to have lower external validity than both experimental and observational studies. Case studies also tend to have lower internal validity because they have less potential to control for threats to internal validity, such as potential confounds or temporal precedence. Nevertheless, case studies can still be useful for generating hypotheses that can then be tested empirically with a larger sample in experimental or observational studies.

5.3.2.3 (Statistical) Conclusion Validity

Conclusion validity, also called statistical conclusion validity, considers the extent to which conclusions are reasonable about the association among variables based on the data. That is, were the correct statistical analyses performed, and are the interpretations of the findings from those analyses correct? Threats to conclusion validity are described by Shadish et al. (2002) (see Table 2.2). Did you have adequate statistical power, adequate reliability of measures, accurate model estimation, and proper treatment of missing data? Did you not violate assumptions of the statistical tests used? Did any characteristics of the data distort the results (e.g., outliers, missingness, nonnormality, restriction of range, etc.)? Did you avoid fishing for significant results (i.e., p-hacking) and inflating the Type I error rate by performing multiple tests? Did you avoid making causal inferences from correlational data? Were effect sizes estimated accurately? To what extent are participants heterogeneous? If a treatment was administered, was it implemented with fidelity for all participants? To what extent was there extraneous variance in the research setting that may have contributed to error variance? All of these considerations are important for ensuring that the conclusions drawn from a study are valid.

5.3.3 Putting It All Together: An Organizational Framework

There are many types of measurement validity, but the central psychometric aspect of measurement validity is construct validity. That is, whether the measure accurately assesses the target construct is the most important consideration of measurement validity. As discussed earlier, construct validity includes the nomological network of the construct. Construct validity also subsumes other key types of measurement validity, including

The organization of types of measurement validity that are subsumed by construct validity is depicted in Figure 5.17.

Figure 5.17: Organization of Types of Measurement Validity That Are Subsumed by Construct Validity.

Moreover, many different types of reliability and validity can be viewed through the lens of construct validity:

Internal consistency, which can be estimated as the coefficient of internal consistency, where the criterion for criterion-related validity is other items on the same measure
Test–retest reliability, which can be estimated as the coefficient of stability, where the criterion for criterion-related validity is the same measure at another time point
Parallel-forms reliability or convergent validity, which can be estimated as the coefficient of equivalence, where the criterion for criterion-related validity is the parallel form

5.4 Validity Is a Process, Not an Outcome

Validity (and validation) is a continual process, not an outcome. Validation is never complete. When evaluating the validity of a measure, we must ask: Validity for what and to what degree? We would not just say that a measure is or is not valid. We would express the strength of evidence on a measure’s validity across the different types of validity for a particular purpose, with a particular population, in a particular context (consistent with generalizability theory).

5.5 Reliability Versus Validity

Reliability and validity are not the same thing. Reliability deals with consistency, whereas validity deals with accuracy. A measure can be consistent but not accurate (see Figure 5.18). That is, a measure can be reliable but not valid. However, a measure cannot be accurate if it is inconsistent; that is, a measure cannot be valid and unreliable.

Figure 5.18: Traditional Depiction of Reliability (Consistency) Versus Validity (Accuracy).

The typical way of depicting the distinction between reliability and validity is in Figure 5.18, in which a measure can either have (a) low reliability and low validity, (b) high reliability and low validity, or (c) high reliability and high validity. However, it can be worth thinking about validity in terms of accuracy at the person level versus group level. When we distinguish between person- versus group-level accuracy, we can distinguish four general combinations of reliability and validity, as depicted in Figure 5.19: (a) low reliability, low accuracy at the person level, and low accuracy at the group level, (b) low reliability, low accuracy at the person level, and high accuracy at the group level, (c) high reliability, low accuracy at the person level, and low accuracy at the group level, and (d) high reliability, high accuracy at the person level, and high accuracy at the group level. However, as discussed before, reliability and validity are not binary states of low versus high—they exist to varying degrees.

Figure 5.19: Depiction of Reliability Versus Validity, While Distinguishing Between Validity (Accuracy) at the Person Versus Group Level.

Even though reliability and validity are not the same thing, there is a relation between reliability and validity. Validity depends on reliability. Reliability is necessary but insufficient for validity. However, measurement error (unreliability) can be systematic or random. If measurement error is systematic, it reduces the validity of the measure. If measurement error is random, it reduces the precision of an individual’s score on a measure, but the measure could still be a valid measure of the construct at the group level. However, random error would make it more difficult to make an accurate prediction for an individual person.

Reliability places the upper bound on validity because a measure can be no more valid than it is reliable. In other words, a measure should not correlate more highly with another variable than it correlates with itself. Therefore, the maximum validity coefficient is the square root of the product of the reliability of each measure, as in Equation (5.1):

\[ \begin{aligned} r_{xy_{\text{max}}} &= \sqrt{r_{xx}r_{yy}} \\ \text{maximum association between } x \text{ and } y &= \sqrt{\text{reliability of } x \text{ and } y} \end{aligned} \tag{5.1} \]

So, the maximum validity coefficient is based on the reliability of each measure. To the extent that one of the measures is unreliable, the validity coefficient will be attenuated relative to the true validity (i.e., the true strength of association of the constructs), as we describe below.

5.6 Effect of Measurement Error on Associations

Figure 5.20 depicts the classical test theory approach to understanding the validity of a measure, i.e., its association with another measure, which is the validity coefficient (\(r_{xy}\)).

Figure 5.20: The Criterion-Related Validity of a Measure, i.e., Its Association With Another Measure, as Depicted in a Path Diagram.

As described above, (random) measurement error weakens (or attenuates) the association between variables (Goodwin & Leech, 2006; Schmidt & Hunter, 1996). The greater the random measurement error, the weaker the association. Thus, the correlation between \(x\) and \(y\) depends on both the true correlation of \(x\) and \(y\) (\(r_{x_{t}y_{t}}\)) and the reliabilities of \(x\) (\(r_{xx}\)) and \(y\) (\(r_{yy}\)). So, measurement error in \(x\) and \(y\) can reduce the observed correlation below the true correlation. This is known as the attenuation formula (Equation (5.2)):

\[ \small \begin{aligned} r_{xy} &= r_{x_{t}y_{t}} \sqrt{r_{xx}r_{yy}} \\ \text{observed association between } x \text{ and } y &= (\text{true association of constructs}) \times \sqrt{\text{reliability of } x \text{ and } y} \end{aligned} \tag{5.2} \]

The lower the reliability, the greater the attenuation of the validity coefficient relative to the true association between the constructs.

All of these \(r\) values (excluding the true correlation) are just estimates unless the sample size is infinite, so the observed association is an imperfect estimate. Hence, we need a correction for this attenuation (Schmidt & Hunter, 1996). This correction for the attenuation of an association due to measurement error (unreliability) is known as the disattenuation of a correlation, i.e., correction for the attenuation of an association due to measurement error to get a more accurate estimate of the true association between constructs. Rearranging the terms from the attenuation formula, the formula for disattenuation of a correlation (i.e., the disattenuation formula) is in Equation (5.3):

\[ \begin{aligned} r_{x_{t}y_{t}} &= \frac{r_{xy}}{\sqrt{r_{xx}r_{yy}}} \\ \text{true association of constructs} &= \frac{\text{observed association between } x \text{ and } y}{\sqrt{\text{reliability of } x \text{ and } y}} \end{aligned} \tag{5.3} \]

All of this is implied in the path diagram (see Figure 5.20). The attenuation and disattenuation formulas are based on classical test theory, and therefore assume that all measurement error is random, that errors are uncorrelated, etc. Nevertheless, the attenuation formula can be informative for understanding how imperiled your research is when your measures have low reliability (i.e., when there is instability in the measure). Unreliability is a strong threat to statistical power—i.e., being able to detect an effect if it exists. Many people do not account for reliability of the measures when calculating statistical power, but you should if you want your study to be adequately powered! For instance, one could determine what the expected observed association would be given the expected true association and the expected reliability of the measures, and then use that expected observed association to calculate statistical power. Researchers recommend accounting for measurement reliability, to better estimate the association between constructs, either with the disattenuation formula (Schmidt & Hunter, 1996) or with structural equation modeling, as described in the next chapter.

5.6.1 Test with Simulated Data

5.6.1.1 Reliability of Predictor

Code

cor.test(
  x = mydataValidity$predictorWithMeasurementErrorT1,
  y = mydataValidity$predictorWithMeasurementErrorT2)


    Pearson's product-moment correlation

data:  mydataValidity$predictorWithMeasurementErrorT1 and mydataValidity$predictorWithMeasurementErrorT2
t = 63.6, df = 948, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8872651 0.9114970
sample estimates:
      cor 
0.9000747

5.6.1.2 Reliability of Criterion

Code

cor.test(
  x = mydataValidity$criterionWithMeasurementErrorT1,
  y = mydataValidity$criterionWithMeasurementErrorT2)


    Pearson's product-moment correlation

data:  mydataValidity$criterionWithMeasurementErrorT1 and mydataValidity$criterionWithMeasurementErrorT2
t = 49.587, df = 948, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8308430 0.8663438
sample estimates:
      cor 
0.8495525

5.6.1.3 True Association

Code

cor.test(
  x = mydataValidity$predictor,
  y = mydataValidity$criterion)


    Pearson's product-moment correlation

data:  mydataValidity$predictor and mydataValidity$criterion
t = 30.07, df = 900, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6737883 0.7390515
sample estimates:
      cor 
0.7079278

5.6.1.4 Observed Association (After Adding Measurement Error)

Code

cor.test(
  x = mydataValidity$predictorWithMeasurementErrorT1,
  y = mydataValidity$criterionWithMeasurementErrorT1)


    Pearson's product-moment correlation

data:  mydataValidity$predictorWithMeasurementErrorT1 and mydataValidity$criterionWithMeasurementErrorT1
t = 23.73, df = 900, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5785273 0.6589657
sample estimates:
      cor 
0.6203752

Using simulated data, when the reliability of the predictor is .90, the reliability of the criterion is .85, and the true association between the predictor and criterion is \(r = .71\), the observed association is attenuated to \(r = .62\).

5.6.2 Attenuation of True Correlation Due to Measurement Error

The attenuation formula is presented in Equation (5.2). We extend it to a specific example in Equation (5.4):

\[ \small \begin{aligned} r_{xy} &= r_{x_ty_t} \sqrt{r_{xx} r_{yy}} \\ \text{observed correlation between }x \text{ and } y &= \text{(true association between construct } A \text{ and construct } B) \times \\ & \;\;\; \sqrt{\text{reliability of } x \times \text{reliability of } y} \end{aligned} \tag{5.4} \]

where \(x = \text{measure of construct} \ A\); \(y = \text{measure of construct} \ B\).

Below is an example of how to find the observed correlation between the predictor and criterion if the true association between the constructs (i.e., correlation between true scores of constructs) is .70, the reliability of the predictor is .90, and the reliability of the criterion is .85:

Code

trueAssociation <- .7
reliabilityOfPredictor <- 0.9
reliabilityOfCriterion <- 0.85

trueAssociation * sqrt(reliabilityOfPredictor * reliabilityOfCriterion)

[1] 0.6122499

The petersenlab package (Petersen, 2025) contains the attenuationCorrelation() function that estimates the observed association given the true association and the reliability of the predictor and criterion:

Code

attenuationCorrelation(
  trueAssociation = 0.7, 
  reliabilityOfPredictor = 0.9,
  reliabilityOfCriterion = 0.85)

[1] 0.6122499

The observed association (\(r = .61\)) is attenuated relative to the true association (\(r = .70\)).

5.6.3 Disattenuation of Observed Correlation Due to Measurement Error

The disattenuation formula is presented in Equation (5.3). We extend it to a specific example in Equation (5.5):

\[ \small \begin{aligned} r_{x_ty_t} &= \frac{r_{xy}}{\sqrt{r_{xx} r_{yy}}} \\ \text{true association between construct } A \text{ and construct } B &= \frac{\text{observed correlation between } x \text{ and } y}{\sqrt{\text{reliability of } x \times \text{reliability of } y}} \end{aligned} \tag{5.5} \]

where \(x = \text{measure of construct} \ A\); \(y = \text{measure of construct} \ B\)

Find the true association between the construct assessed by the predictor and the construct assessed by the criterion given an observed association if the reliability of the predictor is .9, and the reliability of the criterion is .85:

Code

reliabilityOfPredictor <- 0.9
reliabilityOfCriterion <- 0.85

The observed (attenuated) association is as follows:

Code

observedAssociation <- cor.test(
  x = mydataValidity$predictor,
  y = mydataValidity$criterion)$estimate

observedAssociation

      cor 
0.7079278

The true (disattenuated) association is as follows:

Code

observedAssociation / sqrt(reliabilityOfPredictor * reliabilityOfCriterion)

      cor 
0.8093908

The petersenlab package (Petersen, 2025) contains the disattenuationCorrelation() function that estimates the observed association given the true association and the reliability of the predictor and criterion:

Code

disattenuationCorrelation(
  observedAssociation = observedAssociation, 
  reliabilityOfPredictor = 0.9,
  reliabilityOfCriterion = 0.85)

      cor 
0.8093908

The disattenuation of an observed correlation due to measurement error can be demonstrated using structural equation modeling. For instance, consider the following observed association:

Code

cor.test(
  x = mydataValidity$predictorObservedSEM,
  y = mydataValidity$criterionObservedSEM)$estimate

      cor 
0.6118049

The observed association can be estimated in structural equation modeling in the lavaan package (Rosseel et al., 2022) using the following syntax:

Code

observedSEM_syntax <- '
  criterionObservedSEM ~ predictorObservedSEM
  
  # Specify residual errors (measurement error)
  predictorObservedSEM ~~ predictorObservedSEM
  criterionObservedSEM ~~ criterionObservedSEM
'

observedSEM_fit <- sem(
  observedSEM_syntax,
  data = mydataValidity,
  missing = "ML")

summary(
  observedSEM_fit,
  standardized = TRUE)

lavaan 0.6-20 ended normally after 9 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5

                                                  Used       Total
  Number of observations                           998        1000
  Number of missing patterns                         3            

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Observed
  Observed information based on                Hessian

Regressions:
                         Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  criterionObservedSEM ~                                                      
    prdctrObsrvSEM          0.633    0.027   23.366    0.000    0.633    0.611

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .crtrnObsrvdSEM   -0.012    0.027   -0.459    0.646   -0.012   -0.012
    prdctrObsrvSEM    0.012    0.032    0.373    0.709    0.012    0.012

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    prdctrObsrvSEM    0.971    0.044   21.932    0.000    0.971    1.000
   .crtrnObsrvdSEM    0.654    0.030   21.518    0.000    0.654    0.627

Code

lavInspect(observedSEM_fit, "rsquare")

criterionObservedSEM 
               0.373

A path diagram of the observed association is depicted in Figure 5.21.

Code

semPaths(
  observedSEM_fit,
  what = "Std.all",
  layout = "tree2",
  edge.label.cex = 2)

Figure 5.21: Observed Association, as Depicted in a Structural Equation Model.

Now consider when we account for the degree of unreliability of each measure. We can account for the (un)reliability of each measure by specifying the residual errors as \(1 - \text{reliability}\), as below:

Code

# Syntax that specifies the reliability programmatically
disattenuationSEM_syntax <- 
  paste(
    '
    # Factor loadings
    predictorLatent =~ 1*predictorObservedSEM
    criterionLatent =~ 1*criterionObservedSEM
    
    # Factor correlation
    criterionLatent ~ predictorLatent
    
    # Specify residual errors (measurement error)
    predictorObservedSEM ~~ (1 - ', reliabilityOfPredictor, ')*predictorObservedSEM
    criterionObservedSEM ~~ (1 - ', reliabilityOfCriterion, ')*criterionObservedSEM
    ',
    sep = "")

# Syntax that substitutes in the reliability values
disattenuationTraditionalSEM_syntax <- '
  # Factor loadings
  predictorLatent =~ 1*predictorObservedSEM
  criterionLatent =~ 1*criterionObservedSEM
  
  # Factor correlation
  criterionLatent ~ predictorLatent
  
  # Specify residual errors (measurement error)
  predictorObservedSEM ~~ (1 - .9)*predictorObservedSEM # where .9 is the reliability of the predictor
  criterionObservedSEM ~~ (1 - .85)*criterionObservedSEM # where .85 is the reliability of the criterion
'

disattenuationSEM_fit <- sem(
  disattenuationSEM_syntax,
  data = mydataValidity,
  missing = "ML")

summary(
  disattenuationSEM_fit,
  standardized = TRUE,
  rsquare = TRUE)

lavaan 0.6-20 ended normally after 15 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5

                                                  Used       Total
  Number of observations                           998        1000
  Number of missing patterns                         3            

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Observed
  Observed information based on                Hessian

Latent Variables:
                     Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  predictorLatent =~                                                      
    prdctrObsrvSEM      1.000                               0.933    0.947
  criterionLatent =~                                                      
    crtrnObsrvdSEM      1.000                               0.946    0.925

Regressions:
                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  criterionLatent ~                                                      
    predictorLatnt     0.706    0.030   23.152    0.000    0.697    0.697

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .prdctrObsrvSEM    0.012    0.032    0.373    0.709    0.012    0.012
   .crtrnObsrvdSEM   -0.005    0.033   -0.143    0.887   -0.005   -0.005

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .prdctrObsrvSEM    0.100                               0.100    0.103
   .crtrnObsrvdSEM    0.150                               0.150    0.144
    predictorLatnt    0.871    0.044   19.674    0.000    1.000    1.000
   .criterionLatnt    0.460    0.031   14.960    0.000    0.514    0.514

R-Square:
                   Estimate
    prdctrObsrvSEM    0.897
    crtrnObsrvdSEM    0.856
    criterionLatnt    0.486

A path diagram of the true association is depicted in Figures 5.22 and 5.23.

Code

semPaths(
  disattenuationSEM_fit,
  what = "Std.all",
  layout = "tree2",
  edge.label.cex = 2)

Figure 5.22: Disattenuation of an Observed Association Due to Measurement Error, as Depicted in a Structural Equation Model.

Figure 5.23: Disattenuation of an Observed Association Due to Measurement Error, as Depicted in a Structural Equation Model.

The observed association (\(\beta = 0.61\)) becomes \(\beta = 0.70\) when it is disattenuated for measurement error.

5.7 Generalizability Theory (G-Theory)

Generalizability theory (G-theory) is discussed in greater detail in the chapter on reliability in Section 4.11 and in the chapter on generalizability theory. As a brief reminder, G-theory is a measurement theory that, unlike classical test theory, does not treat all measurement differences across time, rater, or situation as “error” but rather as a phenomenon of interest. G-theory can simultaneously consider multiple aspects of reliability and validity in the same model, something that classical test theory cannot achieve.

5.8 Ways to Increase Validity

Here are potential ways to increase the validity of the interpretation of a measure’s scores for a particular purpose:

Make sure the measure’s scores are reliable. For potential ways to increase the reliability of measurement, see Section 4.14. But do not switch to a less valid measure or to items that are less valid merely because they are more reliable.
Use multiple measures and multiple methods to remedy the effects of method bias.
Design the measure with a particular population and purpose in mind. When describing the measure in papers or in public spheres, make it clear to others what the population and intended purposes are and what they are not.
Make sure each item’s scores are valid, based on theory and empiricism, for the particular population and purpose. For instance, the items should show content validity—the items should assess facets of the target construct for the target population as defined by experts, without item intrusions from other constructs. The items’ scores should show convergent validity—the items’ scores should be related to other measures of the construct, within the population of interest. The items’ scores should show discriminant validity—the items’ scores should be more strongly related to measures that are intended to assess the same construct than they are to measures that are intended to assess other constructs.
Combine the items with latent variable modeling, such as with item response theory (IRT), structural equation modeling (SEM), or factor analysis, to reduce the contribution of random error and to disattenuate associations for measurement error.
Obtain samples that are as representative of the population as possible, paying attention to including people who are traditionally under-represented in research (if such groups are part of the target population).
Make sure that people in the population can understand, interpret, and respond to each item in a meaningful and comparable way.
Make sure the measure and its items are not biased against any subgroup within the population of interest. Test bias is discussed in Chapter 16.
Be careful to administer the measure to the population of interest under the conditions in which it is designed. If the measure must be administered to people from a different population or under different conditions from which it was designed, be careful to (a) note that the measure was not designed to be administered for these other populations or conditions, and (b) note that individuals’ scores may not accurately reflect their level on the construct. If interpretations are made based on these scores, make them cautiously and say how the differences in population or condition may have influenced the scores.
Continue to monitor the validity of the measure’s scores for the given population and purpose. The validity of measures’ scores can change over time for a number of reasons. Cohort effects can lead items to become obsolete over time. If people or organizations change their behavior in response to a measure, this can invalidate a measure’s scores for the intended purpose, as described in Section 5.3.1.9 when discussing consequential validity.

5.9 Conclusion

Validity is how much accuracy, utility, and meaningfulness the interpretation of a measure’s scores have for a particular purpose. Like reliability, validity is not one thing. There are multiple aspects of validity. Validity is also not a characteristic that resides in a test. The validity of a measure’s scores reflect an interaction of the properties of the test with the population for whom it is designed and the sample and context in which it is administered. Thus, when reporting validity in papers, it is important to adequately describe the aspects of validity that have been considered and the population, sample, and context in which the measure is assessed.

5.10 Suggested Readings

Cronbach & Meehl (1955); L. A. Clark & Watson (2019)

5.11 Exercises

5.11.1 Questions

Note: Several of the following questions use data from the Children of the National Longitudinal Survey of Youth (CNLSY). The CNLSY is a publicly available longitudinal data set provided by the Bureau of Labor Statistics (https://www.bls.gov/nls/nlsy79-children.htm#topical-guide; archived at https://perma.cc/EH38-HDRN). The CNLSY data file for these exercises is located on this book’s page of the Open Science Framework (https://osf.io/3pwza). Children’s behavior problems were rated in 1988 (time 1: T1) and then again in 1990 (time 2: T2) on the Behavior Problems Index (BPI).

What is the criterion-related validity of the Antisocial Behavior subscale of the BPI in relation to the Hyperactive subscale of the BPI?
Assume the true correlation between two constructs, antisocial behavior and hyperactivity, is \(r = .8\). And assume the reliability of the measure of antisocial behavior is \(r = .7\), and the reliability of the measure of hyperactivity is \(r = .7090234\). According to the attenuation formula (that attenuates the association due to measurement error), what would be the correlation between measures of antisocial behavior and hyperactivity that we would actually observe?
Assume the true correlation between two constructs, antisocial behavior and hyperactivity, is \(r = .8\), the reliability of the Antisocial Behavior subscale of the BPI is \(r = .7\), and the reliability of the measure of Hyperactive subscale of the BPI is \(r = .8\). According to the disattenuation formula (to correct for attenuation of the association due to measurement error), what would be the true association between the constructs of antisocial behavior and hyperactivity?
You are interested in whether a child’s levels of anxiety/depression can explain unique variance in children’s hyperactivity above and beyond their level of antisocial behavior. Is the child’s level of anxiety/depression (as rated on the Anxiety/Depression subscale of the BPI) significantly associated with the child’s level of hyperactivity (as rated on the Hyperactive subscale of the BPI) above and beyond the variance accounted for by the child’s level of antisocial behavior (as rated on the Antisocial Behavior subscale of the BPI). How much unique variance in hyperactivity is accounted for by their anxiety/depression?
In Section 5.3.1.4.2.3, we simulated data for a multitrait-multimethod matrix. The simulated data includes data on participants’ verbal, spatial, and quantitative abilities, each assessed in three subtests in each of three methods: oral, written, and manipulative.
1. Provide a multitrait-multimethod matrix of the data from the first subtest of each trait-by-method combination (VO1, SO1, QO1, VW1, SW1, QW1, VM1, SM1, QM1). Assume the reliability of the variables assessed orally is .7, the reliability of the variables assessed using the written method is .8, and the reliability of variables assessed using the manipulative method is .9.
2. Interpret the multitrait-multimethod matrix you just created.
3. What is the Heterotrait-Monotrait (HTMT) ratio for the measures of the verbal (VO1, VW1, VM1) and spatial (SO1, SW1, SM1) constructs? What does this indicate?

5.11.2 Answers

The criterion-related validity is \(r = .56\).
The observed correlation would be \(r = .56\).
The true association would be \(r = .75\).
Yes, the child’s level of anxiety/depression is significantly associated with their hyperactivity above and beyond their antisocial behavior \((F[df = 1] = 217.25, p < .001)\). The child’s level of anxiety/depression accounted for \(6.02\%\) unique variance in hyperactivity above and beyond their antisocial behavior.
1. The multitrait-multimethod matrix is below:

Figure 5.24: Multitrait-Multimethod Matrix.

1. The convergent correlations (green cells) are statistically significant (\(p\text{s} < .05\)) and moderate in magnitude (\(.45 < r\text{s} < .53\)), supporting the convergent validity of the measures. [The reliabilities of these measures are not provided, so we are not able to compare the magnitude of the convergent validities to the magnitude of reliability to see the extent to which the convergent validities may be attenuated due to measurement unreliability]. Evidence of discriminant validity is supported by three findings: First, the convergent correlations (green cells: \(r\text{s} =\) .46–.53) are stronger than the heteromethod-heterotrait correlations (pink cells: \(r\text{s} =\) .24–.34). Second, the convergent correlations (green cells: \(r\text{s} =\) .46–.53) are stronger than the discriminant correlations (orange cells: \(r\text{s} =\) .30–.43). Third, the patterns of intercorrelations between traits are the same, regardless of which measurement method is used. Verbal, spatial, and quantitative skills are intercorrelated for every measurement method used.
2. The HTMT ratio for the measures of the verbal and spatial constructs is \(0.853\). The HTMT ratio is the average of the heterotrait-heteromethod correlations, relative to the average of the monotrait-heteromethod correlations. Given that the HTMT ratio is considerably less than 1 (and less than the common cutoff of .85), it indicates that the monotrait-heteromethod correlations are considerably larger than the heterotrait-heteromethod correlations. Thus, the HTMT provides evidence that the measures of verbal and spatial constructs show discriminant validity.

References

Achenbach, T. M. (2001). What are norms and why do we need valid ones? Clinical Psychology: Science and Practice, 8(4), 446–450. https://doi.org/10.1093/clipsy.8.4.446

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.

Bornstein, R. F. (2011). Toward a process-focused model of test score validity: Improving psychological assessment in science and practice. Psychological Assessment, 23(2), 532–544. https://doi.org/10.1037/a0022402

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105. https://doi.org/10.1037/h0046016

Clark, L. A., & Watson, D. (2019). Constructing validity: New developments in creating objective measuring instruments. Psychological Assessment, 31(12), 1412–1427. https://doi.org/10.1037/pas0000626

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302. https://doi.org/10.1037/h0040957

Epskamp, S. (2022). semPlot: Path diagrams and visual analysis of various SEM packages’ output. https://github.com/SachaEpskamp/semPlot

Exner, J. E. (1974). The Rorschach: A comprehensive system. John Wiley & Sons.

Exner, J. E., & Erdberg, S. P. (2005). The Rorschach, a comprehensive system: Advanced interpretation (3rd ed., Vol. 2). John Wiley & Sons, Inc.

Fiske, D. W., & Campbell, D. T. (1992). Citations do not solve problems. Psychological Bulletin, 112(3), 393–395. https://doi.org/10.1037/0033-2909.112.3.393

Fok, C. C. T., & Henry, D. (2015). Increasing the sensitivity of measures to change. Prevention Science, 16(7), 978–986. https://doi.org/10.1007/s11121-015-0545-z

Fornell, C., & Larcker, D. F. (1981). Evaluating structural equation models with unobservable variables and measurement error. Journal of Marketing Research, 18(1), 39–50. https://doi.org/10.2307/3151312

Furr, R. M. (2017). Psychometrics: An introduction. SAGE publications.

Furr, R. M., & Heuckeroth, S. (2019). The “quantifying construct validity” procedure: Its role, value, interpretations, and computation. Assessment, 26(4), 555–566. https://doi.org/10.1177/1073191118820638

Goodwin, L. D., & Leech, N. L. (2006). Understanding correlation: Factors that affect the size of r. The Journal of Experimental Education, 74(3), 249–266. https://doi.org/10.3200/JEXE.74.3.249-266

Hayes, S. C., Nelson, R. O., & Jarrett, R. B. (1987). The treatment utility of assessment: A functional approach to evaluating assessment quality. American Psychologist, 42, 963–974. https://doi.org/10.1037/0003-066X.42.11.963

Henseler, J., Ringle, C. M., & Sarstedt, M. (2015). A new criterion for assessing discriminant validity in variance-based structural equation modeling. Journal of the Academy of Marketing Science, 43(1), 115–135. https://doi.org/10.1007/s11747-014-0403-8

Johnson, P. E. (2022). rockchalk: Regression estimation and presentation. https://CRAN.R-project.org/package=rockchalk

Jorgensen, T. D., Pornprasertmanit, S., Schoemann, A. M., & Rosseel, Y. (2021). semTools: Useful tools for structural equation modeling. https://github.com/simsem/semTools/wiki

Leong, F. T. L., & Kalibatseva, Z. (2016). Threats to cultural validity in clinical diagnosis and assessment: Illustrated with the case of Asian Americans. In N. Zane, G. Bernal, & F. T. L. Leong (Eds.), Evidence-based psychological practice with ethnic minorities: Culturally informed research and clinical strategies (pp. 57–74). American Psychological Association.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806–834. https://doi.org/10.1037/0022-006x.46.4.806

Myers, K., & Winters, N. C. (2002). Ten-year review of rating scales. I: Overview of scale functioning, psychometric properties, and selection. Journal of the American Academy of Child & Adolescent Psychiatry, 41(2), 114–122. https://doi.org/10.1097/00004583-200202000-00004

Nelson-Gray, R. O. (2003). Treatment utility of psychological assessment. Psychological Assessment, 15(4), 521–531. https://doi.org/10.1037/1040-3590.15.4.521

Petersen, I. T. (2025). petersenlab: A collection of R functions by the Petersen Lab. https://doi.org/10.32614/CRAN.package.petersenlab

Roemer, E., Schuberth, F., & Henseler, J. (2021). HTMT2–an improved criterion for assessing discriminant validity in structural equation modeling. Industrial Management & Data Systems, 121(12), 2637–2650. https://doi.org/10.1108/IMDS-02-2021-0082

Rönkkö, M., & Cho, E. (2020). An updated guideline for assessing discriminant validity. Organizational Research Methods, 1094428120968614. https://doi.org/10.1177/1094428120968614

Rosseel, Y., Jorgensen, T. D., & Rockwood, N. (2022). lavaan: Latent variable analysis. https://lavaan.ugent.be

Royal, K. (2016). “Face validity” is not a legitimate type of validity evidence! The American Journal of Surgery, 212(5), 1026–1027. https://doi.org/10.1016/j.amjsurg.2016.02.018

Sayal, K., Wyatt, L., Partlett, C., Ewart, C., Bhardwaj, A., Dubicka, B., Marshall, T., Gledhill, J., Lang, A., Sprange, K., Thomson, L., Moody, S., Holt, G., Bould, H., Upton, C., Keane, M., Cox, E., James, M., & Montgomery, A. (in press). The clinical and cost effectiveness of a STAndardised DIagnostic Assessment for children and adolescents with emotional difficulties: The STADIA multi-centre randomised controlled trial. Journal of Child Psychology and Psychiatry. https://doi.org/10.1111/jcpp.14090

Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1(2), 199–223. https://doi.org/10.1037/1082-989X.1.2.199

Schneider, W. J. (2021). simstandard: Generate standardized data. https://github.com/wjschne/simstandard

Sechrest, L. (1963). Incremental validity: A recommendation. Educational and Psychological Measurement, 23, 153–158. https://doi.org/10.1177/001316446302300113

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin.

Silver, N. (2012). The signal and the noise: Why so many predictions fail–but some don’t. Penguin.

Slack, M. K., & Draugalis, J., Jolaine R. (2001). Establishing the internal and external validity of experimental studies. American Journal of Health-System Pharmacy, 58(22), 2173–2181. https://doi.org/10.1093/ajhp/58.22.2173

Voorhees, C. M., Brady, M. K., Calantone, R., & Ramirez, E. (2016). Discriminant validity testing in marketing: An analysis, causes for concern, and proposed remedies. Journal of the Academy of Marketing Science, 44(1), 119–134. https://doi.org/10.1007/s11747-015-0455-4

Wood, J. M., Nezworski, M. T., Garb, H. N., & Lilienfeld, S. O. (2001). Problems with the norms of the Comprehensive System for the Rorschach: Methodological and conceptual considerations. Clinical Psychology: Science and Practice, 8(3), 397–402. https://doi.org/10.1093/clipsy.8.3.397

Wood, J. M., Teresa, P. M., Garb, H. N., & Lilienfeld, S. O. (2001). The misperception of psychopathology: Problems with the norms of the Comprehensive System for the Rorschach. Clinical Psychology: Science and Practice, 8(3), 350–373. https://doi.org/10.1093/clipsy.8.3.350

For shorthand and to avoid repetition, we sometimes refer to the validity of the measure, but in such instances, we are actually referring to the validity of the interpretation of a measure’s scores for a given use.↩︎

Feedback

Please consider providing feedback about this textbook, so that I can make it as helpful as possible. You can provide feedback at the following link: https://forms.gle/95iW4p47cuaphTek6